## Time Series Data and Machine Learning (August 15, 2017)

Posted by OromianEconomist in 10 best Youtube videos, 25 killer Websites that make you cleverer, Data Science, Econometrics, Economics, Uncategorized.

# Anomaly Detection of Time Series Data Using Machine Learning & Deep Learning

## Introduction to Time Series Data

By Jagreet, XenonStack, June 23, 2017

A time series is defined as a sequence of observations taken at successive, regularly spaced points in time. For example, a record of each user's login details captured at regular intervals can be treated as a time series. On the other hand, when the data is collected only once or at irregular intervals, it is not treated as time series data.

Time series data can be classified into two types –

• Stock Series – A measure of attributes at a particular point in time, taken as a snapshot, like a stock-taking or inventory count.

• Flow Series – A measure of activity over a specific interval of time. It contains effects related to the calendar.

A time series is a sequence of observations taken successively at equally spaced points in time. It appears naturally in many application areas such as economics, science, the environment and medicine. There are many practical real-life problems where data points are observed sequentially at equal intervals of time and are correlated with each other. This is to be expected: if we repeatedly observe the same process at regular intervals, successive observations will usually be correlated.

With the use of time series, it becomes possible to anticipate what will happen in the future, as future events depend upon the current situation. It is useful to divide the time series into a historical period and a validation period. A model is built to make predictions on the basis of the historical data, and then this model is applied to the validation set of observations. This process gives an idea of how the model will perform in forecasting.
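
The historical/validation split described above can be sketched in a few lines. A minimal NumPy sketch with made-up numbers (the toy series and the 75/25 split ratio are illustrative assumptions, not from the article):

```python
import numpy as np

# A toy series of 24 "monthly" observations (values are illustrative).
series = np.arange(100.0, 124.0)

# For time series the split must preserve temporal order: the historical
# (training) period comes first and the validation period follows it --
# never a random shuffle, which would leak the future into the past.
split = int(len(series) * 0.75)
historical, validation = series[:split], series[split:]

print(len(historical), len(validation))  # 18 6
```

A model would be fitted on `historical` and its forecasts compared against `validation` to estimate out-of-sample accuracy.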

A time series can also be viewed as a stochastic process, as it represents a vector of random variables observed at regular intervals of time.

## Components of Time Series Data

In order to analyze time series data, there is a need to understand the underlying pattern of the data ordered in time. This pattern is composed of different components which collectively yield the set of observations of the time series.

The Components of time series data are given below –

• Trend

• Cyclical

• Seasonal

• Irregular

Trend – A long-term pattern present in the time series. It can be positive or negative, linear or nonlinear. It represents the low-frequency variation in the series; the high- and medium-frequency variation is filtered out.

If the time series does not contain any increasing or decreasing pattern, then the series is said to be stationary in the mean.

There are two types of the trend –

1. Deterministic – In this case, the effects of shocks present in the time series are eliminated over time, i.e. the series reverts to the trend in the long run.

2. Stochastic – In this case, the effects of shocks are never eliminated, as they permanently change the level of the time series.

A process that is stationary around a deterministic trend is known as a trend-stationary process.
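
The distinction between the two trend types can be illustrated on synthetic data. A minimal NumPy sketch (slopes, noise levels and series length are arbitrary assumptions): a deterministic trend is removed by fitting and subtracting a line, while a stochastic trend (a random walk) is removed by first differencing.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)

# Deterministic trend: a line plus stationary noise. Subtracting the fitted
# line (i.e. keeping the residuals) restores stationarity.
det = 0.5 * t + rng.normal(0, 1, 200)
slope, intercept = np.polyfit(t, det, 1)
det_resid = det - (slope * t + intercept)

# Stochastic trend: a random walk, where every shock changes the level
# permanently. First differencing recovers the (stationary) shocks.
walk = np.cumsum(rng.normal(0, 1, 200))
walk_diff = np.diff(walk)

print(abs(det_resid.mean()) < 0.01, abs(walk_diff.mean()) < 0.5)  # True True
```

Differencing a trend-stationary series, or detrending a random walk, would each leave non-stationary artefacts behind, which is why diagnosing the trend type matters.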

Cyclic – A pattern that exhibits up-and-down movements around a given trend is known as a cyclic pattern. It is a kind of oscillation present in the time series. The duration of a cyclic pattern depends upon the industry and the business problem being analysed, because the oscillations follow the business cycle.

Cyclic movements are larger variations that are repeated in a systematic way over time. The period is not fixed and is usually at least two months in duration. The cyclic pattern is represented by a wave-shaped curve and shows contraction and expansion of the data.

Seasonal – A pattern that reflects regular fluctuations. These short-term movements occur due to seasonal factors and the customs of people. The data shows regular, predictable changes that occur at fixed intervals of the calendar; the period is always fixed and known.

The main sources of seasonality are given below –

• Climate

• Institutions

• Social habits and practices

• Calendar

How is the seasonal component estimated?

If a deterministic analysis is performed, the seasonality is assumed to remain the same over similar intervals of time, so it can easily be modelled by dummy variables. This assumption does not hold under a stochastic analysis: dummy variables are not appropriate because the seasonal component changes throughout the time series.

Different models to create a seasonal component in time series are given below –

• Additive Model – The model in which the seasonal component is added to the trend component.

• Multiplicative Model – In this model the seasonal component is multiplied by the intercept if no trend component is present in the time series. But if the time series has a trend component, the sum of the intercept and trend is multiplied by the seasonal component.
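
The two seasonal models can be contrasted on synthetic data. A minimal sketch (the trend, period and amplitudes are arbitrary assumptions): in the additive series the seasonal swings have constant size, while in the multiplicative series they scale with the level.

```python
import numpy as np

t = np.arange(48)                       # four "years" of monthly observations
trend = 10 + 0.5 * t
seasonal = np.sin(2 * np.pi * t / 12)   # period-12 seasonal component

additive = trend + seasonal                    # constant seasonal amplitude
multiplicative = trend * (1 + 0.1 * seasonal)  # amplitude grows with the level

# Seasonal deviation from trend in year 1 vs year 4:
dev_add_1 = (additive[:12] - trend[:12]).max()
dev_add_4 = (additive[36:] - trend[36:]).max()
dev_mul_1 = (multiplicative[:12] - trend[:12]).max()
dev_mul_4 = (multiplicative[36:] - trend[36:]).max()

print(round(dev_add_1, 2) == round(dev_add_4, 2), dev_mul_4 > dev_mul_1)  # True True
```

This is the practical diagnostic: if the seasonal swings grow with the level of the series, a multiplicative model (or a log transform followed by an additive model) is the better fit.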

Irregular – The unpredictable component of a time series. It cannot be explained by any other component, and its fluctuations are therefore known as the random component. What remains after the trend-cycle and seasonal components are removed is this residual time series: short-term fluctuations that are not systematic in nature and have no clear pattern.

## Difference between Time Series Data and Cross-Section Data

Time series data is a collection of observations of one specific variable at successive intervals of time. Cross-section data, on the other hand, is a collection of observations on multiple variables from different sources at a single point in time.

A company's stock market data collected at regular yearly intervals is an example of time series data. But when a company's sales revenue and sales volume are collected as a snapshot covering the past three months, that is taken as an example of cross-section data.

Time series data is mainly used for obtaining results over an extended period of time, whereas cross-section data focuses on information received, for example from surveys, at a particular time.

## What is Time Series Analysis?

Performing analysis of time series data is known as time series analysis. The analysis is performed in order to understand the structure and the mechanism that produced the time series. By understanding that mechanism, a mathematical model can be developed so that prediction, monitoring and control can be performed.

Two approaches are used for analyzing time series data:

• In the time domain

• In the frequency domain

Time series analysis is mainly used for –

• Decomposing the time series

• Identifying and modeling the time-based dependencies

• Forecasting

• Identifying and modeling the system variation

## Need of Time Series Analysis

Modeling the time series successfully is important in machine learning and deep learning. Time series analysis is used to understand the internal structure and the mechanisms that produce the observations. Time series analysis is used for –

• Description – Patterns are identified in correlated data; in other words, variations due to trend and seasonality in the time series are identified.

• Explanation – Understanding and modeling of the data is performed.

• Forecasting – Prediction of short-term trends from previous observations is performed.

• Intervention Analysis – The effect of an event on the time series data is analyzed.

• Quality Control – An alert is raised when a specific metric deviates from its expected behaviour.

## Applications of Time Series Analysis

### Time Series Databases and their types

A time series database is software designed for handling time series data. Highly complex data, such as high-volume transactional data, is often not feasible for a relational database management system, and many relational systems do not work well with time series data. Time series databases are therefore optimised for this workload. Various time series databases are given below –

• CrateDB

• Graphite

• InfluxDB

• Informix TimeSeries

• Kx kdb+

• Riak-TS

• RRDtool

• OpenTSDB

## What is an Anomaly?

An anomaly is defined as something that deviates from normal behaviour or from what is expected. For more clarity, let's take the example of a bank transaction. Suppose you have a savings bank account and you usually withdraw about Rs 10,000, but one day Rs 6,00,000 is withdrawn from your account. This is unusual activity for the bank, as normally around Rs 10,000 is deducted from the account; the transaction is an anomaly for the bank's employees.

An anomaly is a kind of contradictory observation in the data. It provides evidence that a certain model or assumption does not fit the problem statement.

### Different Types of Anomalies

Different types of anomalies are given below –

• Point Anomalies – If a specific value within the dataset is anomalous with respect to the complete data, it is known as a point anomaly. The bank transaction mentioned above is an example of a point anomaly.

• Contextual Anomalies – If an occurrence of data is anomalous only in a specific context, it is known as a contextual anomaly; for example, a value that is anomalous only during a specific interval of the period.

• Collective Anomalies – If a collection of data points is anomalous with respect to the rest of the dataset, it is known as a collective anomaly; for example, a break in the regular pattern observed in an ECG.

## Models of Time Series Data

ARIMA Model – ARIMA stands for Autoregressive Integrated Moving Average. The autoregressive (AR) part refers to lags of the (differenced) series, the moving average (MA) part refers to lags of the forecast errors, and I represents the number of differences used to make the time series stationary.

Assumptions followed while implementing ARIMA Model are as under –

• The time series data should possess the stationarity property: its statistical properties should not depend on time. A series consisting only of cyclic behaviour and white noise is also taken as stationary.

• The ARIMA model is used for a single variable; the process amounts to regression on the series' own past values.

In order to remove non-stationarity from time series data, the steps given below are followed –

• Take the difference between consecutive observations.

• To stabilize the variance, compute the log or square root of the time series.

• If the time series contains a trend, fit a curve to the trend and model the residuals from the fitted curve.

The ARIMA model predicts future values as a linear combination of past values and past errors. ARIMA models can handle time series with random-walk behaviour as well as trend, seasonal and nonseasonal characteristics.
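
A full ARIMA fit needs a statistics library, but the autoregressive building block can be sketched directly. A minimal NumPy sketch of an AR(1) model (the coefficient 0.7 and the series length are arbitrary assumptions), with the coefficient estimated by regressing the series on its own lag:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate an AR(1) process: y_t = 0.7 * y_{t-1} + e_t.
n, phi = 500, 0.7
y = np.zeros(n)
for i in range(1, n):
    y[i] = phi * y[i - 1] + rng.normal()

# Estimate the AR coefficient by least squares of y_t on y_{t-1}.
lagged, current = y[:-1], y[1:]
phi_hat = (lagged @ current) / (lagged @ lagged)

# One-step-ahead forecast: a linear combination of the past value.
forecast = phi_hat * y[-1]
print(round(phi_hat, 2))  # close to the true value 0.7
```

A full ARIMA(p, d, q) model extends this with more lags (p), differencing (d) and lagged error terms (q).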

Holt-Winters – A model used for short-term forecasting. It applies exponential smoothing, in additive and multiplicative variants, to series with an increasing or decreasing trend and seasonality. Smoothing is controlled by the beta and gamma parameters in Holt's method.

• When the beta parameter is set to FALSE, the function performs simple exponential smoothing.

• The gamma parameter controls the seasonal component. If the gamma parameter is set to FALSE, a non-seasonal model is fitted.
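
The role of the smoothing parameters can be seen in a hand-rolled version of Holt's linear-trend method. A minimal sketch (the alpha and beta values and the toy series are assumptions; the FALSE flags above refer to R's HoltWinters function, which is the real implementation):

```python
import numpy as np

def holt(series, alpha=0.5, beta=0.3):
    """Holt's linear-trend exponential smoothing (minimal sketch).

    alpha smooths the level and beta smooths the trend; with beta = 0 the
    trend estimate stays fixed and the method degenerates toward simple
    exponential smoothing.
    """
    level, trend = series[0], series[1] - series[0]
    fitted = [level]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        fitted.append(level + trend)  # one-step-ahead forecast
    return np.array(fitted)

series = np.array([10.0, 12.0, 13.0, 15.0, 16.0, 18.0])
print(np.round(holt(series), 2))
```

The full Holt-Winters method adds a third recursion, weighted by gamma, for the seasonal component.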

## How to find Anomalies in Time Series Data

AnomalyDetection R package –

It is a robust open source package used to find anomalies in the presence of seasonality and trend. The package is built on the Generalized ESD test and uses the Seasonal Hybrid ESD (S-H-ESD) algorithm, which finds both local and global anomalies. The package can also detect anomalies in a plain vector of numerical values, and it provides visualization in which the user can specify the direction of anomalies.

Principal Component Analysis –

It is a statistical technique used to reduce higher-dimensional data to a lower-dimensional representation with minimal loss of information. This technique can be used to build an anomaly detection model, and it is useful when sufficient anomalous samples are difficult to obtain: a model is trained on the available features of the normal class only, and a distance metric on the reduced representation is then used to determine anomalies.
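
PCA-based anomaly detection can be sketched with plain NumPy. A minimal example on assumed toy data (the 2-D data and the max-error threshold are illustrative choices): the model learns the subspace of the normal class and scores new points by reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "normal" data lying near a 1-D subspace of 2-D space.
normal = rng.normal(0, 1, (200, 1)) @ np.array([[1.0, 0.5]])
normal += rng.normal(0, 0.05, normal.shape)

# Fit PCA on the normal class only: centre, then keep the top component.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
pc = vt[:1]

def reconstruction_error(x):
    centred = x - mean
    return np.linalg.norm(centred - centred @ pc.T @ pc, axis=1)

# Distance metric: points the subspace cannot reconstruct are anomalies.
threshold = reconstruction_error(normal).max()
anomaly = np.array([[3.0, -3.0]])  # far from the learned subspace
print(reconstruction_error(anomaly)[0] > threshold)  # True
```

Note that only normal data is needed for training, which matches the scenario where anomalous samples are scarce.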

Chi-squared distribution –

It is a statistical distribution with a minimum value of 0 and no upper bound. The chi-squared test can be used for detecting outliers in univariate data. It detects both the lowest and the highest values, since outliers can be present on both sides of the data.
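
The chi-squared outlier test can be sketched as follows. A minimal example (the data and the 95% critical value 3.84 for one degree of freedom are standard but illustrative choices): under normality the squared z-score of each point follows a chi-squared distribution with one degree of freedom, so large deviations on either side are flagged.

```python
import numpy as np

data = np.array([10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 25.0])  # 25.0 is the outlier

# Under normality, the squared z-score of each point follows a chi-squared
# distribution with 1 degree of freedom; 3.84 is its 95% critical value.
mean, std = data.mean(), data.std(ddof=1)
chi_sq = ((data - mean) / std) ** 2
outliers = data[chi_sq > 3.84]
print(outliers)  # [25.]
```

Because the score is squared, both unusually low and unusually high values exceed the critical value, matching the two-sided behaviour described above.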

## What are Breakouts in Time Series Data?

Breakouts are significant changes observed in time series data. They take two characteristic forms, given below –

• Mean shift – A sudden jump in the level of the time series; for example, CPU usage that increases from 35% to 70%. A mean shift occurs when the time series moves abruptly from one steady state to another.

• Ramp up – A gradual increase in the value of the metric from one steady state to another. Compared with a mean shift, it is a slow transition from one stable state to another.

In a time series, often more than one breakout is observed.
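
A mean shift can be detected naively by comparing the means of adjacent windows. A minimal sketch (the window size, threshold and simulated CPU series are assumptions, and this is a simplified stand-in for robust methods such as E-Divisive with Medians):

```python
import numpy as np

def mean_shift_points(series, window=10, threshold=2.0):
    """Flag indices where the mean of the next `window` points differs from
    the mean of the previous `window` points by more than `threshold`."""
    hits = []
    for t in range(window, len(series) - window):
        before = series[t - window:t].mean()
        after = series[t:t + window].mean()
        if abs(after - before) > threshold:
            hits.append(t)
    return hits

# CPU usage jumping from ~35% to ~70%, as in the mean-shift example above.
cpu = np.array([35.0] * 30 + [70.0] * 30)
hits = mean_shift_points(cpu)
print(hits[0] <= 30 <= hits[-1])  # the shift at index 30 is flagged: True
```

Robust methods replace the means with medians and add a statistical test, which makes them far less sensitive to outliers within each window.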

## How to detect Breakouts in Time Series Data?

In order to detect breakouts in time series, Twitter has introduced a package known as BreakoutDetection. It is an open source package for detecting breakouts quickly. The package uses the E-Divisive with Medians (EDM) algorithm to detect shifts in the mean, and it can also be used to detect changes in distribution within the time series.

## Need of Machine Learning and Deep Learning in Time Series Data

Machine learning techniques can be more effective than classical statistical techniques because they bring two important capabilities: feature engineering and prediction. The feature engineering aspect is used to address the trend and seasonality issues of time series data, and issues of fitting a model to time series data can also be resolved by it.

Deep Learning is used to combine the feature extraction of time series with the non-linear autoregressive model for higher level prediction. It is used to extract the useful information from the features automatically without using any human effort or complex statistical techniques.

## Anomaly Detection using Machine Learning

The two main families of machine learning techniques are supervised and unsupervised learning.

In supervised learning, a model is trained to classify data points as anomalous or non-anomalous. For supervised learning, labeled anomalous data points must be available.

Another approach to detecting anomalies is unsupervised learning. One can train a model such as CART to predict the next data point in the series, together with a confidence interval or prediction error. A generalised ESD test is then applied to check which data points fall inside or outside the confidence interval, and points outside it are flagged as anomalous.
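
The unsupervised scheme above (predict the next point, then flag points whose prediction error leaves the confidence band) can be sketched with a deliberately simple predictor. In this sketch a moving average stands in for the CART model and a 3-sigma band for the generalised ESD test; the data are simulated:

```python
import numpy as np

rng = np.random.default_rng(3)

# "Normal" training history for the series.
history = rng.normal(50, 2, 500)

# Toy predictor: forecast the next point as the mean of the last k points.
k = 20
preds = np.convolve(history, np.ones(k) / k, mode="valid")[:-1]
errors = history[k:] - preds           # one-step-ahead prediction errors
band = 3 * errors.std()                # ~99.7% band under normality

def is_anomalous(recent, new_point):
    prediction = np.mean(recent[-k:])
    return abs(new_point - prediction) > band

print(is_anomalous(history, 51.0), is_anomalous(history, 70.0))  # False True
```

Swapping the moving average for a stronger learned predictor tightens the band and makes subtler anomalies detectable.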

The most common supervised learning algorithms are supervised neural networks, support vector machine learning, k-nearest neighbors, Bayesian networks and Decision trees.

In the case of k-nearest neighbors, the approximate distance between data points is calculated, and unlabeled data points are then assigned the class of their k nearest neighbors.
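
The same distance idea also yields a simple nearest-neighbour anomaly score. A minimal sketch on simulated 2-D data (scoring a point by its mean distance to the k nearest neighbours is one common convention among several):

```python
import numpy as np

rng = np.random.default_rng(4)

# Normal points cluster around the origin; an anomaly sits far away.
points = rng.normal(0, 1, (100, 2))

def knn_score(x, data, k=5):
    """Anomaly score: mean distance to the k nearest neighbours.
    For a point taken from `data` itself, its nearest neighbour is
    the point itself (distance 0), which only lowers its score."""
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    return dists[:k].mean()

normal_scores = np.array([knn_score(p, points) for p in points])
anomaly_score = knn_score(np.array([6.0, 6.0]), points)
print(anomaly_score > normal_scores.max())  # True
```

Points in dense regions get small scores; isolated points get large ones, so a threshold on the score separates anomalies from the bulk of the data.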

On the other hand, Bayesian networks can encode the probabilistic relationships between the variables. This algorithm is mostly used with the combination of statistical techniques.

The most common unsupervised algorithms are self-organizing maps (SOM), K-means, C-means, expectation-maximization meta-algorithm (EM), adaptive resonance theory (ART), and one-class support vector machine.

## Anomaly Detection using Deep Learning

The recurrent neural network (RNN) is one of the deep learning algorithms for detecting anomalous data points within a time series. It consists of an input layer, hidden layers and an output layer. The nodes within the hidden layers are responsible for handling the internal state and memory, both of which are updated as each new input is fed into the network. The internal state of the RNN is used to process the sequence of inputs, and the memory allows it to learn time-dependent features automatically.

The process followed by RNN is described below –

First, a series of data is fed into the RNN model, and the model is trained on it to learn what normal behaviour looks like. After training, whenever a new input is fed into the trained network, the network can classify the input as normal and expected, or as anomalous.

Training is performed on normal data because the quantity of abnormal data is small compared with normal data; the trained model then raises an alert whenever abnormal activity is observed in the future.

## Time Series Data Visualization

Data visualization is an important and quick way of picturing time series data and forecasts. The different types of graphs are given below:

• Line Plots.

• Histograms and Density Plots.

• Box and Whisker Plots.

• Heat Maps.

• Lag Plots or Scatter Plots.

• Autocorrelation Plots.

The above techniques are used for plotting univariate time series data, but they can also be used for multivariate time series, where more than one time-dependent observation is involved.

They are used to represent time series data so as to identify trends, cycles and seasonality, and to observe how these influence the choice of model.

## Summary

A time series is a sequence of data points ordered in time. Its components are the key to understanding the patterns in the data. A time series may also contain anomalous data points, and there is therefore a need to detect them.

Various statistical techniques for this are mentioned in this blog, but machine learning and deep learning approaches are essential as well.

In machine learning, supervised and unsupervised learning are used for detecting anomalous data; in deep learning, the recurrent neural network is used.

Related:

Coursera: Data Science Courses free

## Data Science: 10+2 Data Science Methods that Every Data Scientist Should Know (March 15, 2017)

Posted by OromianEconomist in 10 best Youtube videos.

# 12 Statistical and Machine Learning Methods that Every Data Scientist Should Know

1. Statistical Hypothesis Testing (t-test, chi-squared test & ANOVA)
2. Multiple Regression (Linear Models)
3. General Linear Models (GLM: Logistic Regression, Poisson Regression)
4. Random Forest
5. Xgboost (eXtreme Gradient Boosted Trees)
6. Deep Learning
7. Bayesian Modeling with MCMC
8. word2vec
9. K-means Clustering
10. Graph Theory & Network Analysis
• (A1) Latent Dirichlet Allocation & Topic Modeling
• (A2) Factorization (SVD, NMF)

## Data Science: Avoiding a common mistake with time series (July 14, 2015)

Posted by OromianEconomist in 10 best Youtube videos, 25 killer Websites that make you cleverer, Data Science.

Data Science Central

Avoiding a common mistake with time series

By Tom Fawcett*

A basic mantra in statistics and data science is correlation is not causation, meaning that just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning. If you work with data, throughout your career you’ll probably have to re-learn it several times. But you often see the principle demonstrated with a graph like this:

One line is something like a stock market index, and the other is an (almost certainly) unrelated time series like “Number of times Jennifer Lawrence is mentioned in the media.” The lines look amusingly similar. There is usually a statement like: “Correlation = 0.86”.  Recall that a correlation coefficient is between +1 (a perfect linear relationship) and -1 (perfectly inversely related), with zero meaning no linear relationship at all.  0.86 is a high value, demonstrating that the statistical relationship of the two time series is strong. The correlation passes a statistical test. This is a great example of mistaking correlation for causality, right? Well, no, not really: it’s actually a time series problem analyzed poorly, and a mistake that could have been avoided. You never should have seen this correlation in the first place. The more basic problem is that the author is comparing two trended time series. The rest of this post will explain what that means, why it’s bad, and how you can avoid it fairly simply. If any of your data involves samples taken over time, and you’re exploring relationships between the series, you’ll want to read on.

## Two random series

There are several ways of explaining what’s going wrong. Instead of going into the math right away, let’s look at a more intuitive visual explanation. To begin with, we’ll create two completely random time series. Each is simply a list of 100 random numbers between -1 and +1, treated as a time series. The first time is 0, then 1, etc., on up to 99. We’ll call one series Y1 (the Dow-Jones average over time) and the other Y2 (the number of Jennifer Lawrence mentions). Here they are graphed:

There is no point staring at these carefully. They are random. The graphs and your intuition should tell you they are unrelated and uncorrelated. But as a test, the correlation (Pearson’s R) between Y1 and Y2 is -0.02, which is very close to zero. There is no significant relationship between them. As a second test, we do a linear regression of Y1 on Y2 to see how well Y2 can predict Y1. We get a Coefficient of Determination (R² value) of 0.08 — also extremely low. Given these tests, anyone should conclude there is no relationship between them.

Now let’s tweak the time series by adding a slight rise to each. Specifically, to each series we simply add points from a slightly sloping line from (0,-3) to (99,+3). This is a rise of 6 across a span of 100. The sloping line looks like this:

Now we’ll add each point of the sloping line to the corresponding point of Y1 to get a slightly sloping series like this:

We’ll add the same sloping line to Y2:

Now let’s repeat the same tests on these new series. We get surprising results: the correlation coefficient is 0.96 — a very strong unmistakable correlation. If we regress Y on X we get a very strong R² value of 0.92. The probability that this is due to chance is extremely low, about 1.3×10⁻⁵⁴. These results would be enough to convince anyone that Y1 and Y2 are very strongly correlated!

What’s going on? The two time series are no more related than before; we simply added a sloping line (what statisticians call trend). One trended time series regressed against another will often reveal a strong, but spurious, relationship. Put another way, we’ve introduced a mutual dependency. By introducing a trend, we’ve made Y1 dependent on X, and Y2 dependent on X as well. In a time series, X is time. Correlating Y1 and Y2 will uncover their mutual dependence — but the correlation is really just the fact that they’re both dependent on X. In many cases, as with Jennifer Lawrence’s popularity and the stock market index, what you’re really seeing is that they both increased over time in the period you’re looking at. This is sometimes called secular trend.

The amount of trend determines the effect on correlation. In the example above, we needed to add only a little trend (a slope of 6/100) to change the correlation result from insignificant to highly significant. But relative to the changes in the time series itself (-1 to +1), the trend was large.

A trended time series is not, of course, a bad thing. When dealing with a time series, you generally want to know whether it’s increasing or decreasing, exhibits significant periodicities or seasonalities, and so on. But in exploring relationships between two time series, you really want to know whether variations in one series are correlated with variations in another. Trend muddies these waters and should be removed.
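
The demonstration above is easy to reproduce. A minimal NumPy sketch (the random seed is arbitrary, and loose thresholds are asserted in place of the exact 0.96 reported, since the random draws differ):

```python
import numpy as np

rng = np.random.default_rng(5)

# Two unrelated random series of 100 points in [-1, +1].
y1 = rng.uniform(-1, 1, 100)
y2 = rng.uniform(-1, 1, 100)
r_raw = np.corrcoef(y1, y2)[0, 1]

# Add the same slight trend to each: a line from -3 to +3.
slope_line = np.linspace(-3, 3, 100)
r_trended = np.corrcoef(y1 + slope_line, y2 + slope_line)[0, 1]

# Raw series: essentially uncorrelated. Trended series: strongly correlated.
print(abs(r_raw) < 0.5, r_trended > 0.8)  # True True
```

The jump in correlation comes entirely from the shared trend: the trend's variance (about 3) dominates the noise variance (about 1/3), so most of each series' variation is the common line.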

## Dealing with trend

There are many tests for detecting trend. What can you do about trend once you find it? One approach is to model the trend in each time series and use that model to remove it. So if we expected Y1 had a linear trend, we could do linear regression on it and subtract the line (in other words, replace Y1 with its residuals). Then we’d do that for Y2, then regress them against each other.

There are alternative, non-parametric methods that do not require modeling. One such method for removing trend is called first differences. With first differences, you subtract from each point the point that came before it:

y'(t) = y(t) – y(t-1)

Another approach is called link relatives. Link relatives are similar, but they divide each point by the point that came before it:

y'(t) = y(t) / y(t-1)
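
First differencing can be checked to kill such a spurious correlation. A minimal sketch continuing the same kind of example (the seed and the thresholds are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

# Two series that share nothing but a common upward trend.
trend_line = np.linspace(0, 6, 100)
y1 = trend_line + rng.uniform(-1, 1, 100)
y2 = trend_line + rng.uniform(-1, 1, 100)
r_trended = np.corrcoef(y1, y2)[0, 1]

# First differences: y'(t) = y(t) - y(t-1). (Link relatives would instead
# divide each point by its predecessor, for strictly positive series.)
d1, d2 = np.diff(y1), np.diff(y2)
r_detrended = np.corrcoef(d1, d2)[0, 1]

print(r_trended > 0.6, abs(r_detrended) < 0.5)  # True True
```

After differencing, only the unrelated noise increments remain, and the apparent relationship collapses.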

## More examples

Once you’re aware of this effect, you’ll be surprised how often two trended time series are compared, either informally or statistically. Tyler Vigen created a web page devoted to spurious correlations, with over a dozen different graphs. Each graph shows two time series that have similar shapes but are unrelated (even comically irrelevant). The correlation coefficient is given at the bottom, and it’s usually high.

How many of these relationships survive de-trending? Fortunately, Vigen provides the raw data so we can perform the tests. Some of the correlations drop considerably after de-trending. For example, here is a graph of US Crude Oil Imports from Venezuela vs Consumption of High Fructose Corn Syrup. The correlation of these series is 0.88. Now here are the time series after first-differences de-trending:

These time series look much less related, and indeed the correlation drops to 0.24.

A recent blog post from Alex Jones, more tongue-in-cheek, attempts to link his company’s stock price with the number of days he worked at the company. Of course, the number of days worked is simply the time series: 1, 2, 3, 4, etc. It is a steadily rising line — pure trend! Since his company’s stock price also increased over time, of course he found correlation. In fact, every manipulation of the two variables he performed was simply another way of quantifying the trend in company price.

## Final words

I was first introduced to this problem long ago in a job where I was investigating equipment failures as a function of weather. The data I had were taken over six months, winter into summer. The equipment failures rose over this period (that’s why I was investigating). Of course, the temperature rose as well. With two trended time series, I found strong correlation. I thought I was onto something until I started reading more about time series analysis.

Trends occur in many time series. Before exploring relationships between two series, you should attempt to measure and control for trend. But de-trending is not a panacea, because not all spurious correlation is caused by trends. Even after de-trending, two time series can be spuriously correlated. There can remain patterns such as seasonality, periodicity, and autocorrelation. Also, you may not want to de-trend naively with a method such as first differences if you expect lagged effects. Any good book on time series analysis should discuss these issues. My go-to text for statistical time series analysis is Quantitative Forecasting Methods by Farnum and Stanton (PWS-KENT, 1989). Chapter 4 of their book discusses regression over time series, including this issue.

*Tom Fawcett is Principal Data Scientist at Silicon Valley Data Science. Co-author of the popular book Data Science for Business, Tom has over 20 years of experience applying machine learning and data mining in practical applications. He is a veteran of companies such as Verizon and HP Labs, and an editor of the Machine Learning Journal.

Related:-

https://oromianeconomist.wordpress.com/2015/06/22/what-is-calculus-used-for-tedx-talks/

## Statistics: The Sexiest Job of the Decade (July 7, 2015)

Posted by OromianEconomist in 10 best Youtube videos, 25 killer Websites that make you cleverer, Economics, Uncategorized.

Anyone who’s got a formal education in economics knows who Hal Varian is. He’s most popularly known for his textbook Intermediate Microeconomics. He’s also the Chief Economist at Google. He is known to have famously stated, more or less, that statistician and data analyst would be the sexiest jobs of the next decade.

That has come true, to a great extent, and we’ll be seeing more.

Great places to learn more about data science and statistical learning:
1] Statistical Learning (Stanford)
2] The Analytics Edge (MIT)

In a paper called ‘Big Data: New Tricks for Econometrics‘, Varian goes on to say that:

In fact, my standard advice to graduate students these days is “go to the computer science department and take a class in machine learning.” There have been very fruitful collaborations between computer scientists and statisticians in the last decade or so, and I…
