Finding loopholes with machine learning techniques
Dataconomy (https://dataconomy.ru) – Mon, 10 Oct 2022

One of the most popular applications of machine learning is anomaly detection. Outliers can be found and identified to help stop fraud, adversary assaults, and network intrusions that could jeopardize the future of your business.

This article will discuss how anomaly detection functions, the machine learning techniques that can be used, and the advantages of using machine learning for anomaly detection.

Anomaly detection in machine learning

An anomaly, also known as a variation or an exception, is typically something that deviates from the norm. In the context of software engineering, an anomaly is an unusual occurrence or event that deviates from expected behavior and raises suspicion.

A software program should function smoothly and predictably, so any anomaly poses a potential risk to its robustness and security; ideally, you want to catch them all. The process of detecting anomalies is called anomaly or outlier detection.


The identification of uncommon items, events, or observations that raise questions by deviating noticeably from the rest of the data is known as anomaly detection (also called outlier detection). Atypical data is usually associated with some issue or unusual occurrence, such as financial fraud, health problems, structural defects, or broken equipment. Because recognizing these events is frequently extremely valuable from a business standpoint, it is very useful to be able to identify which data points should be labeled anomalies.

Why do we need anomaly detection in machine learning?

Whether the deviation is positive or negative, anomaly detection is crucial because it helps you gain a deeper understanding of changes in business performance. Even if the causes of anomalous data are unfavorable, it is still worthwhile to look into them. For instance, business users can identify unauthorized transactions or security breaches by putting intrusion detection systems in place.

What does anomaly detection do in machine learning?

Anomaly detection is frequently a significant component of deployed machine learning. Whether identifying fraudulent behavior in the financial sector or monitoring product quality, anomaly detection is a crucial part of machine learning systems across many industries. Anomaly detection with machine learning typically covers a much wider variety of data than is achievable manually. Models can perform anomaly detection that takes complex features and behaviors into account, and can then be taught to look for unusual behavior or trends.


Depending on the type of data, there are many model construction methodologies for anomaly detection in machine learning. Either labeled data or, more frequently, unlabeled raw data sets will be used to train models. Models that have been trained on labeled data will keep an eye out for outliers that go beyond the specified threshold for normal data. A model will classify the raw data into categories after being trained on unlabeled data, and it will also identify outliers that exist outside the clusters. In both situations, the model recognizes what falls inside a range of acceptable behavior and will spot unusual behavior or data.

Machine learning and anomaly detection: Types of outliers

Let’s explore the different types of anomalies in machine learning:

  • Global outliers
  • Contextual outliers
  • Collective outliers

Global outliers

A data point can be deemed a global anomaly if its value falls outside the bounds of all the other data points in the collection. It is, in other words, an unusual occurrence.

The analytics staff at a bank would be alarmed if, for instance, you consistently deposit an average American salary into your bank account but one month receive a million dollars.

Contextual outliers

A contextual outlier is a data point whose value doesn’t match what we would expect for a comparable data point in the same context. Because contexts are often temporal, the same value can be normal in one context and anomalous in another.


For instance, it’s common for shops to have an uptick in consumers around the holidays. On the other hand, a rapid uptick that occurs outside of holidays or sales can be viewed as a contextual outlier.

Collective outliers

A subset of data points that depart from the norm indicates collective outliers.

Tech firms often continue to expand in size. While not a widespread trend, some businesses may decline. However, we can spot a collective outlier if a number of businesses simultaneously show a decline in sales over the same time period.

What are the characteristics of anomaly detection?

Let’s quickly go through some of the characteristics of the anomaly detection issue:

  • Processing type
  • Data type
  • Modes by data labels
  • Application domain

Processing type

Both offline and online processing types exist. The offline type can yield the best answer because the complete collection of data is fixed in advance. The online type is chosen when data points arrive in batches (a subset of points) or one at a time (real time), and anomaly onsets (changepoints) must be located as soon as they happen.


Data type

Although data is frequently divided into structured, semi-structured, and unstructured kinds, it is more convenient here to think of it as already pre-processed into a machine-learning-ready form. In this situation, classifying data by modality is more useful, since anomaly detection methods for distinct modalities often differ greatly.

  • Tabular data: This data is organized into rows, each of which includes details on a distinct item. Each row contains the same number of columns (some values may be missing), and each column represents a property value of the object the row describes.
  • Image data: This is commonly a tensor or multidimensional array, where two dimensions (rows and columns) stand for the x and y axes of space, and a third represents the brightness or color channels of a pixel.
  • Video data: This typically combines audio with a time series of images, each of which is an instance of the image type.
  • Time series data: This is a sequential observation of univariate or multivariate data through time. In the common special case, data is observed at predetermined, evenly spaced intervals (such as yearly, monthly, quarterly, or hourly). Time series data is thus a specific type of tabular data that frequently has an index in timestamp format.
  • Text data: This is broken down into words, phrases, sentences, and documents, or combined from them.
  • Audio data: When sound is gathered sequentially, this is a special case of time series data.

Modes by data labels

Modes can be classified as supervised, semisupervised, and unsupervised according to the labels assigned to the data. Each data point is assigned a normal or anomalous class by the data labels (or one of the anomaly classes). Labeled data points are required for both the normal and anomalous classes in supervised training mode.


Semi-supervised tasks are enabled by having labels for (and knowledge of) only the normal, anomaly-free class. The most popular techniques are unsupervised ones because they don’t need training labels. These techniques frequently start from the premise that abnormal events are considerably rarer than typical ones.

Anomaly detection algorithm outputs: There are mainly two types of anomaly detection algorithm outputs:

  • Scores: The AD algorithm returns a level of abnormality for each data instance, which enables a flexible definition of the abnormality boundary at the post-processing step.
  • Labels: The AD algorithm assigns a label or class (normal/anomalous) to each data instance.
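As an illustration of the two output types, scikit-learn's `IsolationForest` exposes both a continuous abnormality score and a hard label. The following is a minimal sketch on synthetic data; the planted anomalies and seed values are illustrative, not from the article:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))   # mostly normal points
X[:3] += 8                            # plant three obvious anomalies

model = IsolationForest(random_state=0).fit(X)
scores = model.decision_function(X)   # score output: lower means more anomalous
labels = model.predict(X)             # label output: -1 = anomalous, +1 = normal
```

With scores, the cutoff between normal and anomalous can still be tuned after the fact; with labels, the boundary decision has already been made by the algorithm.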

Application domain

Anomalies can be categorized into many sorts depending on the particular sector or application. Typically, different categories allude to the different sorts of anomaly occurrences and imply that different AD approaches and domain heuristics should be utilized. Anomalies from different domains can, however, be handled relatively similarly when it comes to mathematical difficulties. For instance, when dealing with sensor network anomalies, cyber-intrusions, and industrial defects, the same AD techniques are applied when the data is of the time series type.




What are the difficulties in anomaly detection?

We are aware that accurate anomaly identification requires a combination of ongoing statistical analysis and historical data. Importantly, the quality of the data and sample sizes employed in these models have a significant impact on the alerting as a whole. These are the biggest difficulties in anomaly detection:

  • Data quality
  • Training sample sizes
  • False alerting
  • Imbalanced distributions

Data quality

One important question you may have is, “Which algorithm should I use when constructing an anomaly detection model?” The type of problem you’re trying to solve will obviously have a big impact, but one key thing to consider is the underlying data.

The most important factor in developing an accurate, useful model is going to be data quality or the caliber of the underlying dataset.

Training sample sizes

For various reasons, having a big training set is crucial. If the training set is too small, the algorithm won’t have enough exposure to past examples to build an accurate model of the expected value at a given moment. Anomalies will then distort the baseline, impacting the model’s overall accuracy.

Another frequent issue with limited sample sets is seasonality. Having a big enough sample dataset is crucial because no day or week is the same. Depending on the industry, customer traffic volumes over the Christmas season may increase or dramatically decrease. For the model to effectively generate and monitor the baseline throughout common holidays, it is crucial to observe data samples from numerous years.


False alerting

In a dynamic context, spotting anomalies is a great tool since it can use historical data to distinguish between expected behavior and unusual occurrences. What transpires, though, if your model frequently produces false alarms and is incorrect?

It can be challenging to win over hesitant users’ trust and just as simple to lose it, so it’s crucial to strike a balance.

Imbalanced distributions

Using a classification technique to create a supervised model is an additional approach to developing an anomaly detection model. To understand what is good or bad, this supervised model needs data that have been labeled.

A prevalent issue is the imbalanced distribution of labeled data. Since the good condition is the normal state, 99% of the labeled data may be biased in favor of “good.” Because of this natural imbalance, the training set may not contain enough examples to learn and associate with the negative condition.

Anomaly detection approaches in machine learning

This section gives a general overview of anomaly detection methods based on the types of data that are accessible, how to assess an anomaly detection model, how each method creates a model of typical behavior, and the advantages of deep learning models. We finish by discussing potential difficulties in using these models.

Based on the kind of data required to train the model, anomaly detection techniques can be divided into different categories. In the majority of use cases, a very tiny portion of the whole dataset is anticipated to be made up of anomalous samples. Therefore, normal data samples are easier to find than aberrant ones, even when labeled data is available. For the majority of applications today, this presumption is crucial. These are the anomaly detection approaches in machine learning:

  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Local outlier factor (LOF)
  • K-nearest neighbors
  • DBSCAN
  • Autoencoders
  • Bayesian networks
  • Support vector machines (SVMs)

Supervised learning

Under supervision, machines learn a function that maps input features to outputs from sample input-output pairs. The aim of supervised anomaly detection algorithms is to incorporate application-specific knowledge into the anomaly detection process.




The challenge of anomaly detection can be reframed as a classification task with enough normal and anomalous instances so that computers can learn to correctly anticipate whether a particular example is an abnormality or not. However, for many anomaly detection use cases, the ratio of normal to abnormal instances is severely skewed; even while there may be several classes of anomalies, each one may be significantly underrepresented.


This method implies that the user can accurately classify all possible anomalies and has labeled examples for each kind. As abnormalities can manifest in a variety of ways and new anomalies can arise during testing, this is typically not the case in practice. Therefore, methods that generalize well and are better at spotting abnormalities that haven’t been seen before are preferred.

Unsupervised learning

Machines cannot learn a function that translates input features to outputs using unsupervised machine learning because they lack examples of input-output pairings. Instead, they discover structure within the input features and use that to learn. Unsupervised methods are more widely used in the field of anomaly identification than supervised ones because, as was already said, labeled anomalous data is comparatively uncommon. However, the type of anomalies one expects to find is frequently very particular. As a result, many of the abnormalities discovered in an entirely unsupervised approach may simply be noise and may not be relevant to the task at hand.

Semi-supervised learning

Semi-supervised machine learning strategies use a variety of techniques that can benefit from both huge volumes of unlabeled data and sparsely labeled data, acting as a type of middle ground. Due to the abundance of normal instances from which to learn and the dearth of examples of the more unusual or abnormal classes of interest, many real-world anomaly detection use cases are well suited to semi-supervised machine learning. One can train a reliable model on an unlabeled dataset and assess its performance using a small quantity of labeled data on the presumption that the majority of the data points in an unlabeled dataset are normal.

Applications like network intrusion detection, where there may be several examples of the normal class and a few examples of intrusion classes, but new types of intrusions may develop over time, are ideally suited for this hybrid technique.

Consider X-ray screening for border or airport security as another illustration. Unusual products that pose a security danger are uncommon and can take many different shapes. Additionally, any anomaly that poses a potential hazard may change in nature as a result of a variety of outside events. Therefore, it may be challenging to get sufficient quantities of useful examples of anomaly data.


These circumstances might call for the identification of novel classes as well as anomalous classes, for which there may be few or no labeled data. A semi-supervised classification strategy that permits the detection of both known and previously unidentified abnormalities is the optimal response in these circumstances.

Local outlier factor (LOF)

The most popular method for anomaly identification is likely the local outlier factor. The idea of local density serves as the foundation for this method. It contrasts an object’s local density with the densities of the nearby data points. A data point is deemed an outlier if its density is lower than that of its neighbors.
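The local-density comparison described above can be sketched with scikit-learn's `LocalOutlierFactor`. The dataset here is synthetic and the neighbor count is an illustrative choice:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),  # one dense cluster
               [[5.0, 5.0]]])                      # a point far less dense than its neighbors

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # -1 marks points with unusually low local density
factors = lof.negative_outlier_factor_  # far below -1 for outliers, near -1 for inliers
```

The isolated point's local density is much lower than that of its cluster neighbors, so it receives a strongly negative factor and the label -1.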

K-nearest neighbors

A popular supervised machine learning approach for classification is kNN. It is also a helpful tool for anomaly detection problems because it makes it simple to visualize the data points on a scatterplot, which makes anomaly identification much more interpretable. An additional advantage is that kNN performs well on both small and large datasets.

When used this way, kNN doesn’t actually learn labeled ‘normal’ and ‘abnormal’ values; it works from the distances between points, so it functions as an unsupervised machine learning method for anomaly detection. A machine learning expert then explicitly defines the distance threshold separating normal from abnormal, and the algorithm divides the data accordingly.
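A common distance-based variant scores each point by its mean distance to its k nearest neighbors; points with unusually large scores are candidates for outliers. A sketch using scikit-learn's `NearestNeighbors` (the dataset and the choice of k are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(150, 2)), [[10.0, 10.0]]])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own nearest neighbor
dists, _ = nn.kneighbors(X)
score = dists[:, 1:].mean(axis=1)                # mean distance to the k nearest neighbors
```

The isolated point gets a score an order of magnitude larger than the cluster points, so the cutoff between normal and abnormal can be set on this score.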

DBSCAN

This approach uses unsupervised machine learning and is based on the density principle. By examining the local density of data points, DBSCAN can find clusters in sizable spatial datasets and generally produces good results for anomaly detection. Points that are not part of any cluster are assigned their own class, -1, making them simple to spot. This technique handles outliers successfully when the data forms dense regions rather than isolated discrete points.
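The -1 labeling can be seen directly with scikit-learn's `DBSCAN`. The data, `eps`, and `min_samples` below are illustrative choices for a small synthetic example:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
cluster = rng.normal(0, 0.3, size=(80, 2))         # one dense cluster
noise = np.array([[4.0, 4.0], [-4.0, 3.0]])        # far from any dense region
X = np.vstack([cluster, noise])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
outliers = np.where(labels == -1)[0]               # label -1 = belongs to no cluster
```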


Autoencoders

This approach uses artificial neural networks to encode the data by compressing it into a lower-dimensional representation; the network then decodes that representation to reconstruct the original input. Because the regularities of normal data are captured in the compressed representation, lowering the dimensionality doesn’t lose the necessary information, while anomalous inputs that don’t follow those regularities reconstruct poorly.
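A full autoencoder needs a neural-network framework, but a linear autoencoder learns the same subspace as PCA, so the reconstruction-error idea can be sketched with scikit-learn's `PCA` as a stand-in. The synthetic 2-of-10-dimensional structure below is an assumption made for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
latent = rng.normal(size=(300, 2))                 # normal data lives near a 2-D subspace
W = rng.normal(size=(2, 10))
X = latent @ W + rng.normal(0, 0.05, size=(300, 10))
X[0] += 3.0                                        # push one sample off the subspace

pca = PCA(n_components=2).fit(X)                   # "encode" to 2 dims ...
X_hat = pca.inverse_transform(pca.transform(X))    # ... then "decode" back to 10
recon_error = ((X - X_hat) ** 2).sum(axis=1)       # anomalies reconstruct poorly
```

The corrupted sample has by far the largest reconstruction error, which is exactly the signal an autoencoder-based detector thresholds on.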

Bayesian networks

Machine learning engineers may find anomalies even in high dimensional data thanks to Bayesian networks. When the anomalies we’re seeking are more subtle and challenging to spot and visualizing them on the plot might not yield the expected results, we employ this strategy.

Support vector machines (SVMs)

Another supervised machine learning approach that is frequently used for classification is the support vector machine (SVM). SVMs categorize data points using hyperplanes in multidimensional space. The hyperparameter nu is a manually selected upper bound on the fraction of training points treated as outliers.

When there are multiple classes involved in the issue, SVM is typically used. However, it is also applied to single-class issues in anomaly detection. The model can determine whether unfamiliar data belongs to this class or is an anomaly because it has been trained to understand the “norm”.
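The one-class variant is available as scikit-learn's `OneClassSVM`; a minimal sketch in which the training data and the 5% value for nu are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(5)
X_train = rng.normal(0, 1, size=(200, 2))    # training data assumed to be "normal"

# nu upper-bounds the fraction of training points treated as outliers (5% here)
oc_svm = OneClassSVM(kernel="rbf", nu=0.05).fit(X_train)

X_new = np.array([[0.1, -0.2],               # close to the learned norm
                  [6.0, 6.0]])               # far outside it
pred = oc_svm.predict(X_new)                 # +1 = fits the norm, -1 = anomaly
```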

Is anomaly detection supervised or unsupervised?

The unsupervised approach is the most frequently used kind of anomaly detection. There, using an unlabeled dataset, we fit a machine learning model to the typical behavior. We make the crucial assumption that the bulk of the training set’s data are typical examples, although there can be some odd data points among them (a small proportion). Any data point that considerably deviates from the expected behavior is then marked as an anomaly. In supervised anomaly detection, by contrast, a classifier is trained on a dataset whose points have been labeled “normal” and “abnormal”.

Classifying a new data point then becomes a typical classification task. Both approaches have advantages and disadvantages. The supervised procedure needs a vast number of both positive and negative examples, and due to the rarity of anomalous occurrences, obtaining such a dataset is quite challenging. Even if you acquired one, the model would only learn the anomalous patterns already present in it.


But there are many different kinds of anomalies in every field, and future anomalies might not resemble the ones already observed at all. Any algorithm will have a very difficult time learning what anomalies look like from a handful of anomalous samples. The unsupervised method is popular for this reason: it is far simpler to characterize regular behavior than to enumerate the many varieties of anomalies.

Which algorithm will you use for anomaly detection?

One of the best algorithms for detecting anomalies is a support vector machine. A supervised machine learning method called SVM is frequently applied to classification issues.

Is SVM used for anomaly detection?

Yes. SVMs use hyperplanes in multidimensional spaces to distinguish between different classes of observations. Naturally, SVM is used to address issues with multi-class classification.

SVM is, however, also being used more frequently in one-class problems when all of the data are from a single class. In this instance, the algorithm is trained to understand what is “normal” so that it can determine whether fresh data should be included in the group or not when it is presented. If not, the new data is classified as anomalous or out of the ordinary.

What is the difference between an anomaly and an outlier?

Outliers are observations that deviate significantly from the mean or center of the distribution. They may or may not signal abnormal behavior or be brought about by a different process. Anomalies, on the other hand, are data patterns that are produced by a different process.




How do you identify outliers?

Extreme data points can be transformed into z scores that indicate how far they deviate from the mean.

A value can be categorized as an outlier if its z-score is sufficiently high or low. Generally speaking, values with a z-score greater than 3 or lower than -3 are regarded as outliers.
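The z-score rule is a few lines of numpy; the deposit figures below are made up for illustration, echoing the earlier bank example:

```python
import numpy as np

rng = np.random.default_rng(6)
deposits = np.append(rng.normal(5000, 300, size=100), 1_000_000)  # one huge deposit

z = (deposits - deposits.mean()) / deposits.std()
outlier_idx = np.where(np.abs(z) > 3)[0]     # |z| > 3 flags only the extreme value
```

Note that with very few observations a single extreme point inflates the standard deviation so much that its own z-score stays small, so the rule works best on reasonably large samples.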

Which algorithm is best for outliers?

The interquartile range (IQR) may be a good approach because it indicates the range of the middle half of your dataset. Outliers are any values that fall outside the “fences” drawn around your data at 1.5 × IQR beyond the quartiles.
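The fences can be computed directly; the small dataset here is made up for illustration:

```python
import numpy as np

data = np.array([10, 12, 11, 14, 13, 12, 15, 11, 13, 98])  # one extreme value

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the "fences" around the data
outliers = data[(data < lower) | (data > upper)]
```

Unlike the z-score, the IQR is based on quartiles, so a single extreme value barely moves the fences.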


Is anomaly detection a classification problem?

By now, it should be clear that classification and supervised anomaly detection are two different machine learning problems. Whether you have labeled classes and whether the dataset is heavily imbalanced are the two main ways to distinguish between them.

Can we use a classification algorithm to detect outliers?

A classification or regression dataset with outliers can have a poor fit and perform less well in predictive modeling. Given the enormous number of input variables in the majority of machine learning datasets, it is difficult to detect and eliminate outliers using straightforward statistical methods.

Is anomaly detection classification or regression?

When a subset of these anomalous patterns is known in some application-specific situations, the OD problem can be transformed into a supervised one and frequently treated as a classification problem; this is the condition that is typically referred to as supervised OD. But even in this scenario, OD is frequently the first stage of a data modeling procedure that ends with a supervised classifier or regressor.




Conclusion

The speed of anomaly detection can be increased by using machine learning to learn a system’s properties from observed data. In addition to learning from the data, machine learning algorithms are also able to forecast the future based on that data. These algorithms can then refine their initial predictions by “learning” from how the events actually turn out.

Machine learning for anomaly detection includes techniques that let you efficiently find and categorize anomalies in huge, intricate data sets. Sequential hypothesis tests, such as cumulative sum (CUSUM) charts and sequential probability ratio tests, are examples of anomaly detection methods; they can identify changes in the distributions of real-time data and set alarm thresholds.
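The cumulative sum chart mentioned above can be sketched in a few lines. This is a one-sided CUSUM with illustrative drift and threshold values on synthetic data, not a production alerting system:

```python
import numpy as np

def cusum_alarms(x, target, drift=0.5, threshold=8.0):
    """One-sided CUSUM: alarm when the cumulative positive deviation from
    `target`, less an allowed `drift` per step, exceeds `threshold`."""
    s, alarms = 0.0, []
    for i, xi in enumerate(x):
        s = max(0.0, s + (xi - target) - drift)
        if s > threshold:
            alarms.append(i)
            s = 0.0                    # restart accumulation after an alarm
    return alarms

rng = np.random.default_rng(7)
series = np.concatenate([rng.normal(0, 1, 50),   # in-control regime
                         rng.normal(3, 1, 50)])  # mean shifts upward at t = 50
alarms = cusum_alarms(series, target=0.0)
```

The first alarm fires shortly after the distribution shift at t = 50, while the in-control portion accumulates nothing because the drift term absorbs ordinary noise.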

Machines are haunted by the curse of dimensionality
Dataconomy (https://dataconomy.ru) – Wed, 15 Jun 2022

The curse of dimensionality comes into play when we deal with a lot of data having many dimensions or features. The dimension of the data is the number of characteristics or columns in a dataset.

High-dimensional data has several challenges, the most notable of which is that it becomes extremely difficult to find meaningful correlations while processing and visualizing it. In addition, as the number of dimensions increases, training the model becomes much slower. More dimensions invite more chances for multicollinearity as well. Multicollinearity is a condition in which two or more variables are found to be highly correlated with one another.

What is the curse of dimensionality?

The curse of dimensionality is a term used to describe the issues when classifying, organizing, and analyzing high-dimensional data, particularly data sparsity and “closeness” of data.

Why is it a curse?

Data sparsity is an issue that arises when you move to higher dimensions: the volume of the space grows so quickly that the data can’t keep up, so it becomes sparse. Sparsity is a big problem for statistical significance. As the data space grows from one dimension to two and then three, the fraction of the space the data fills shrinks, so the amount of data needed for analysis grows dramatically.


Consider a data set with four points in one dimension (only one feature in the data set). It can be simply represented using a line, and the dimension space is 4, since there are only four data points. Adding another feature expands the space to 4² = 16, and adding one more expands it to 4³ = 64. The dimension space grows exponentially as the number of dimensions rises.


The second issue is how to sort or classify the data. Data points may appear similar in low-dimensional spaces, but as the dimension increases, these same points may seem more distant: two points that look close together in two dimensions can look far apart when viewed in three. The curse of dimensionality has the same effect on data.

As the number of dimensions increases, the calculated distances between observations become increasingly less meaningful, and every algorithm that relies on distance or correlation faces an uphill struggle.
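This loss of distance contrast can be demonstrated numerically. The small experiment below on uniform random points is illustrative; the dimensions and sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(8)

def distance_contrast(dim, n=500):
    """Relative spread of distances from the unit cube's center to random points."""
    points = rng.uniform(size=(n, dim))
    d = np.linalg.norm(points - 0.5, axis=1)
    return (d.max() - d.min()) / d.min()

contrast_2d = distance_contrast(2)
contrast_1000d = distance_contrast(1000)   # near and far points become indistinguishable
```

In 2 dimensions the nearest point is far closer than the farthest, so the contrast is large; in 1000 dimensions all distances concentrate around the same value and the contrast collapses.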

More dimensions require more training 

Neural networks are instantiated with a certain number of features (dimensions). Each data point has its own set of characteristics, each falling somewhere along a dimension. We may want one feature to handle color, for example, while another handles weight. Each feature adds information, and if we could capture every feature conceivable, we could convey exactly which fruit we are thinking of. However, an infinite number of features necessitates infinite training instances, rendering the network’s real-world usefulness doubtful.

The amount of training data required grows drastically with each new feature. Even with only 15 features, each a ‘yes’ or ‘no’ question, covering every combination would require 2¹⁵ ≈ 32,000 training samples.


When does the curse of dimensionality take effect?

The following are just a few domains where the direct consequences of the curse of dimensionality can be observed; machine learning takes the worst hit.

Machine Learning

In machine learning, a marginal increase in dimensionality necessitates a substantial expansion in the amount of data to maintain comparable results. The curse of dimensionality is a by-product of phenomena that only arise in high-dimensional data.

Anomaly Detection

Anomaly detection is finding unusual items or events in the data. In high-dimensional data, anomalies frequently have many irrelevant attributes, and some points appear in the neighbor lists of many other points far more often than the rest, which distorts neighbor-based detection.

Combinatorics

When there are more possibilities for input combinations, the complexity grows quickly, and the curse of dimensionality strikes.

Mitigating the curse of dimensionality

To deal with the curse of dimensionality caused by high-dimensional data, a collection of methods known as “Dimensionality Reduction Techniques” is employed. Dimensionality reduction procedures are divided into “Feature selection” and “Feature extraction.”

Feature selection Techniques

In feature selection methods, features are evaluated for usefulness and then kept or eliminated. The following are some of the most popular feature selection procedures.

  • Low Variance filter: The variance in the distribution of each variable in the dataset is computed, and variables with very low variance are removed. Attributes with little variance are nearly constant, so they add nothing to the model’s predictability.
  • High Correlation filter: The correlation between attributes is evaluated in this method. From each highly correlated pair of features, one is deleted and the other is kept, since the retained feature captures most of the information carried by the eliminated one.
  • Multicollinearity: Pairwise correlations may not reveal all redundancy; when each attribute is regressed as a function of all the others, we may observe that some attributes’ variance is almost completely explained by the rest. This multicollinearity problem is corrected by removing attributes with high Variance Inflation Factor (VIF) values, generally those greater than 10. High VIF values indicate heavy redundancy between related variables and can cause instability in a regression model.
  • Feature Ranking: Decision trees and similar models, such as CART, can rank attributes based on their importance or contribution to the model’s predictability. In high-dimensional data, some of the lower-ranking variables can be removed to reduce the number of dimensions.
  • Forward selection: When building multi-linear regression models on high-dimensional data, one can start with a single attribute and add the remaining attributes one at a time, checking the model’s Adjusted-R² value at each step. If the Adjusted-R² value increases significantly, the variable is kept; otherwise, it is eliminated.
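As a rough illustration of the first two filters above, here is a NumPy-only sketch; the thresholds and the toy dataset are illustrative assumptions, not prescriptions:

```python
# Sketch: low-variance and high-correlation filters on a toy dataset.
import numpy as np

def low_variance_filter(X, threshold=1e-3):
    """Indices of columns whose variance exceeds `threshold`."""
    return np.where(X.var(axis=0) > threshold)[0]

def high_correlation_filter(X, threshold=0.95):
    """Greedily keep columns not correlated above `threshold`
    with any already-kept column."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
X[:, 2] = 2.0 * X[:, 1] + 0.01 * rng.normal(size=200)  # near-duplicate column
X[:, 3] = 5.0                                          # constant column

X = X[:, low_variance_filter(X)]   # drops the constant column
kept = high_correlation_filter(X)  # drops the near-duplicate column
print("columns kept:", kept)
```

Running the variance filter before the correlation filter also avoids the undefined correlations a constant column would produce.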

Feature Extraction Techniques 

The high-dimensional features are either combined into a smaller set of components (PCA, ICA) or factored into a smaller set of latent factors (FA).

  • Principal Component Analysis (PCA): Principal Component Analysis is a dimensionality reduction technique in which high-dimensional, correlated data is converted to a lower-dimensional set of uncorrelated components known as principal components. The reduced-dimension principal components account for most of the information in the original high-dimensional dataset. An ‘n’-dimensional dataset is transformed into ‘n’ principal components, and a subset of them is chosen based on the percentage of the data’s variance that is to be captured through the components.
  • Factor Analysis (FA): Factor analysis assumes that all of the observed characteristics in a dataset can be represented as a weighted linear combination of latent factors. The thinking behind this approach is that ‘n’-dimensional data might be modeled with ‘m’ factors (m<n). The primary distinction between PCA and FA is that while PCA combines the observed attributes into uncorrelated components, FA explains the observed attributes in terms of underlying latent factors.
  • Independent Component Analysis (ICA): The foundation of the ICA assumption is that all attributes are made up of statistically independent components, and the observed variables are decomposed into a mix of these components. ICA is more robust than PCA and is most often utilized when PCA and FA fail.
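To make PCA concrete, here is a minimal NumPy sketch that computes principal components via the SVD of the centered data (one standard way to do it); the toy three-feature dataset is an assumption for demonstration:

```python
# Sketch: PCA via SVD — project correlated 3-D data onto its top-2
# principal components and report the variance each component explains.
import numpy as np

rng = np.random.default_rng(7)
n = 500
signal = rng.normal(size=n)
# Three features driven by one underlying signal, plus small noise
X = np.column_stack([
    signal,
    0.5 * signal + 0.1 * rng.normal(size=n),
    -signal + 0.1 * rng.normal(size=n),
])

Xc = X - X.mean(axis=0)              # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / (S**2).sum()      # variance ratio per component

components = Xc @ Vt[:2].T           # keep the top-2 principal components
print("explained variance ratios:", np.round(explained, 3))
print("reduced shape:", components.shape)
```

Because all three features are driven by one underlying signal, the first component alone captures almost all of the variance, which is exactly the redundancy PCA is designed to squeeze out.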
]]>
https://dataconomy.ru/2022/06/15/curse-of-dimensionality-machine-learning/feed/ 0
“I often warn data analysts not to underestimate the power of small data” – Interview with Data Mining Consultant Rosaria Silipo https://dataconomy.ru/2015/12/09/i-often-warn-data-analysts-not-to-underestimate-the-power-of-small-data-interview-with-data-mining-consultant-rosaria-silipo/ https://dataconomy.ru/2015/12/09/i-often-warn-data-analysts-not-to-underestimate-the-power-of-small-data-interview-with-data-mining-consultant-rosaria-silipo/#comments Wed, 09 Dec 2015 08:30:46 +0000 https://dataconomy.ru/?p=14548 Rosaria has been a researcher in applications of Data Mining and Machine Learning for over a decade. Application fields include biomedical systems and data analysis, financial time series (including risk analysis), and automatic speech processing. She is currently based in Zurich, Switzerland.   What project have you worked on do you wish you could go back […]]]>


Rosaria has been a researcher in applications of Data Mining and Machine Learning for over a decade. Application fields include biomedical systems and data analysis, financial time series (including risk analysis), and automatic speech processing. She is currently based in Zurich, Switzerland.

 


What project that you have worked on do you wish you could go back to and do better?

There is no such thing as a perfect project! However close you get to perfection, at some point you need to stop, either because the time is over, the money is over, or because you simply need a production solution. I am sure I can go back to all my past projects and find something to improve in each of them!

This is actually one of the biggest issues in a data analytics project: when do we stop? Of course, you need to identify some basic deliverables in the project’s initial phase, without which the project is not satisfactorily completed. But once you have passed these deliverable milestones, when do you stop? What is the right compromise between perfection and resource investment?

In addition, every few years some new technology becomes available that could help re-engineer your old projects, for speed, accuracy, or both. So even the most perfect project solution can surely be improved after a few years thanks to new technologies. This is, for example, the case with the new big data platforms. Most of my old projects would now benefit from a big-data-based speed-up. This could help speed up the training and deployment of old models, create more complex data analytics models, and better optimize model parameters.

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Use your time to learn! Data science is a relatively new discipline that combines old knowledge, such as statistics and machine learning, with newer wisdom, like big data platforms and parallel computation. Not many people know everything here, really! So take your time to learn what you do not know yet from the experts in that area. Combining a few different pieces of data science knowledge probably already makes you unique in the data science landscape. The more pieces of different knowledge, the bigger the advantage for you in the data science ecosystem!

One way to get easy hands-on experience with a range of application fields is to explore the Kaggle challenges. Kaggle has a number of interesting challenges up every month and, who knows, you might also win some money!

What do you wish you knew earlier about being a data scientist?

This answer is related to the previous one, since my advice to young data scientists sprouts from my earlier experience and failures. My early background is in machine learning. So, when I took my first steps in the data science world many years ago, I thought that knowledge of machine learning algorithms was all I needed. I wish!

I had to learn that data science is the sum of many different skills, including data collection, data cleaning, and data transformation. The latter, for example, is highly underestimated! In all the data science projects I have seen (not only mine), the data processing part takes well over 50% of the resources used! Data visualization and data presentation matter as well: an ingenious solution is worth nothing if the executives and stakeholders cannot understand the results through a clear and compact representation. And so on. I guess I wish I had taken more time early on to learn from colleagues with a different set of skills than mine.

How do you respond when you hear the phrase ‘big data’?

Do you really need big data? Sometimes customers ask for a big data platform just because. Then, when you investigate deeper, you realize that they do not really have, and do not want to have, such a big amount of data to take care of every day. A nice traditional DWH solution is definitely enough for them.

Sometimes, though, a big data solution is really needed, or at least it will be needed at some point to keep all of a company’s internal and external data up to date. In these cases, we can work together on a big data solution for their project. Even when this is the case, I often warn data analysts not to underestimate the power of small data! A nice, clean, general statistical sample might produce higher accuracy in terms of prediction, classification, and clustering than a large, messy, noisy data lake (really a data swamp)! In some projects, a few data dimensionality selection algorithms were run a posteriori, just to see whether all those input dimensions contained useful, unique pieces of information or just obnoxious noise. You would be surprised! In some cases we easily went from more than 200 input variables to fewer than 10 while keeping the same accuracy performance.

There is another phrase, though, that triggers my inner warning signals: “We absolutely need real time execution”. When I hear this phrase, I usually wonder: “Do you need effective real time responses, or would perceived real time responses be enough?”. Perceived real time for me is a few seconds: something a user can wait through without getting impatient. A few seconds, however, is NOT real time! Any data analytics tool, any deployment workflow can produce a response in a few seconds or even less. Real time means a much faster response, usually meant to trigger some kind of consequent action. In most cases, “perceived real time” is good enough for the human reaction time.

What is the most exciting thing about your field?

Probably, the variety of applications. The whole knowledge of data collection, data warehousing, data analytics, data visualization, results inspection and presentation is transferable to a number of fields. You would be surprised at how many different applications can be designed using a variation of the same data science technique! Once you have the data science knowledge and a particular application request, all you need is imagination to make the two match and find the best solution.

How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

I always propose a first pilot/investigation mini-project at the very beginning. This lets me get a better idea of the application specs, of the data set, and, yes, also of the customer. This is a crucial phase, though a short one. During this part, in fact, I can take the measure of the project in terms of needed time and resources, and the customer and I can study each other and adjust our expectations about input data and final results. This initial phase usually involves a sample of the data, an understanding of the data update strategy, some visual investigation, and a first tentative analysis to produce the requested results. Once this part is successful and expectations have been adjusted on both sides, the real project can start.

You spent some time as a consultant in data analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

Ah … I am really not a very good example of dealing with stakeholders and executives and successfully managing cultural challenges! Usually, I rely on external collaborators to handle this part for me, also because of time constraints. I see myself as a technical professional, with little time for talking and convincing. Unfortunately, this is a big part of each data analytics project. However, when I have to deal with it myself, I let the facts speak for me: final or intermediate results of current and past projects. This is the easiest way to convince stakeholders that the project is worth the time and the money. Just in case, though, I always have at hand a set of slides with previous accomplishments to present to executives if and when needed.

Tell us about something cool you’ve been doing in Data Science lately.

My latest project was about anomaly detection. I found it a very interesting problem to solve, one where skills and expertise have to meet creativity. In anomaly detection you have no historical records of anomalies, either because they rarely happen or because they are too expensive to be allowed to happen.

What you have is a data set of records of normal functioning of the machine, transactions, system, or whatever it is you are observing. The challenge then is to predict anomalies before they happen and without previous historical examples. That is where the creativity comes in. Traditional machine learning algorithms need a twist in application to provide an adequate solution for this problem.
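The interview does not name a specific technique, but one common “twist” of this kind is one-class modeling: fit a model of normal behavior only, then flag whatever that model finds improbable. Below is a minimal illustrative sketch using a Mahalanobis-distance detector on assumed sensor-style data (all names and numbers are hypothetical):

```python
# Sketch: anomaly detection with no anomaly labels — learn the "normal"
# distribution and score new points by their Mahalanobis distance.
import numpy as np

rng = np.random.default_rng(1)

# Records of normal functioning only (e.g., temperature, pressure)
normal = rng.normal(loc=[20.0, 50.0], scale=[1.0, 2.0], size=(1000, 2))

mean = normal.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))

def anomaly_score(x):
    """Mahalanobis distance from the learned 'normal' distribution."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

threshold = 3.5  # illustrative cut-off, in standard-deviation units

print(anomaly_score(np.array([20.5, 51.0])))  # typical reading: low score
print(anomaly_score(np.array([28.0, 40.0])))  # far from normal: high score
```

In practice the threshold is tuned to an acceptable false-alarm rate, since with no labeled anomalies there is no direct way to measure recall.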

]]>
https://dataconomy.ru/2015/12/09/i-often-warn-data-analysts-not-to-underestimate-the-power-of-small-data-interview-with-data-mining-consultant-rosaria-silipo/feed/ 1
Visual Cues in Big Data for Analytics and Discovery https://dataconomy.ru/2014/11/12/visual-cues-in-big-data-for-analytics-and-discovery/ https://dataconomy.ru/2014/11/12/visual-cues-in-big-data-for-analytics-and-discovery/#respond Wed, 12 Nov 2014 08:23:50 +0000 https://dataconomy.ru/?p=10374 One of the most fun outcomes that you can achieve with your data is to discover new and interesting things.  Sometimes, the most interesting thing is the detection of a novel, unexpected, surprising object, event, or behavior – i.e., the outlier, the thing that falls outside the bounds of your original expectations, the thing that signals something […]]]>

One of the most fun outcomes that you can achieve with your data is to discover new and interesting things.  Sometimes, the most interesting thing is the detection of a novel, unexpected, surprising object, event, or behavior – i.e., the outlier, the thing that falls outside the bounds of your original expectations, the thing that signals something new about your data domain (a new class of behavior, an anomaly in the data processing pipeline, or an error in the data collection activity).  The more quickly you can find the interesting features and characteristics within your data collection, the more likely you are to improve decision-making and responsiveness in your data-driven workflows.

Tapping into the natural human cognitive ability to see patterns quickly and to detect anomalies readily is powerful medicine for big data analytics headaches.  That’s where data visualization shines most brightly in the big data firmament!  One could even say that visualization is an efficiency-amplification methodology for discovery from data.  But visualization contributes to more than just discovery – it is an analytics ally.

The phrase “A picture is worth a thousand words” is a very common expression. In the modern computing era (where one word is equal to 4 bytes), we should say that a picture is worth 4 Kbytes. However, the rich complexity (high dimensionality and variety) of big data calls for a richer visual experience – perhaps encoding Megabytes (not Kbytes) of information in a single display.  With such capabilities, we can exploit human visual cognitive abilities more effectively.  In particular, visualization is especially useful and powerful for seeing patterns (trends) in your data and for seeing things that break the pattern (outliers). When used in this way (for description, discovery, prediction, and insights) within large datasets, visualization moves into the scientific realm of visual analytics.  The significance of this potential analytics application explains the recent rapid growth in research and development of visualization tools for visual storytelling with data.

One of the best new tools in the visualization universe comes from VisualCue Technologies. They were recently named a winner of the 2014 Ventana Research Business Intelligence Innovation award.

VisualCue uses semantic cues in the form of glyphs (icons, symbols) within the visualization. This is doubly powerful in that it not only exploits visual cognition, but also employs semantics in the presentation and display of the information content within a big data stream. Semantic technologies in general are the future of big data discovery and analytics – going beyond the bits and the bytes, and delivering more than content, the semantics of the data reveal what the data means and what its context is: not only the “what”, but the “why”.  One of the first examples of glyph-enabled data visualization was a NASA project called ViSBARD (Visual System for Browsing, Analysis, and Retrieval of Data) – this ground-breaking system was specifically designed for use with space physics data from spacecraft in the interplanetary environment. ViSBARD displayed six or more dimensions of scientific data simultaneously, enabling discovery of patterns and correlations across multiple parameters at once, but it did not provide any semantic elements.  VisualCue is now making major advancements and significant contributions in the direction of visual data semantics.

The semantic visual cue contains an iconic representation of the meaning of a particular data element in the visualization. For example, if you are displaying international shipping manifests, the icons would include an iconic image of a ship, which can be color-coded according to some metric (e.g., time to delivery, load, or country of origin).  The visual cue tile can also include smaller embedded icons representing key performance indicators (KPIs) related to the business intelligence questions that the analyst needs to track and monitor for insights and data-driven business decisions based on a dynamic, evolving, complex (high-dimensional) database.

Visual Cues in Big Data for Analytics and Discovery

VisualCue allows a single display of multiple visual cues (arranged in tiles on an end-user’s dashboard) to simultaneously present and track numerous KPIs, processes, assets, events, clients, suppliers, customers, transactions, etc., using color-coded alerts (green, yellow, red) to signal the spots in the data stream that need the most immediate attention, or where something novel (interesting) is occurring.  The arrangement, grouping, and dashboard layout of the tiles is completely user-configurable.  In fact, if a dashboard is not your thing, you can display the tiles on a map – oh yes, geospatial location-based business intelligence is hot stuff, and now becoming even hotter!  You can also display the tiles in a diagram of your own choice: a floorplan, blueprints, a schematic, a workflow, or whatever graphic display empowers your business decision-making. The sky is the limit! In fact, I hope to try this out on the sky – i.e., putting astronomical event data on a VisualCue-enabled sky map.

As is the case with the most powerful data exploration and visualization tools, VisualCue’s tiles are just the icing on the big data cake. In other words, a user can click on a tile and drill down deeper into the data, for further discovery and analytics. In this way, the analyst can more effectively and efficiently determine the root cause of specific visual cues that were displayed in a particular tile.

The VisualCue Technologies solution is easily configurable and extensible to many use cases and business domains. It has a drag-and-drop capability, including a growing library of tiles that you can use “out of the box”.  You can also channel your inner “data artist” and use their tile builder tool in order to design and create your own personalized cues and tiles.  Some of the application domains that are already supported in their tile library include IT system administration, call center, agent performance, health, education, logistics, vehicle fleet management, supply chain, delivery routing, business process management, and more.

Business intelligence is clearly a driver in the development of the new visual languages that enable efficient and effective big data discovery and analytics.  VisualCue Technologies is an emerging leader in this field. Take a cue from me – check them out. You will be glad you did.



Kirk is a data scientist, top big data influencer, and professor of astrophysics and computational science at George Mason University. He spent nearly 20 years supporting NASA projects, including NASA’s Hubble Space Telescope as Data Archive Project Scientist, NASA’s Astronomy Data Center, and NASA’s Space Science Data Operations Office. He has extensive experience in large scientific databases and information systems, including expertise in scientific data mining. He is currently working on the design and development of the proposed Large Synoptic Survey Telescope (LSST), for which he is contributing in the areas of science data management, informatics and statistical science research, galaxies research, and education and public outreach. His writing and data reflections can be found at Rocket-Powered Data Science.


(Image Credit: VisualCue)

]]>
https://dataconomy.ru/2014/11/12/visual-cues-in-big-data-for-analytics-and-discovery/feed/ 0