
Data Science: Avoiding a common mistake with time series July 14, 2015

Posted by OromianEconomist in 10 best Youtube videos, 25 killer Websites that make you cleverer, Data Science.

Data Science Central

Avoiding a common mistake with time series

By Tom Fawcett*

A basic mantra in statistics and data science is that correlation is not causation, meaning that just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning. If you work with data, throughout your career you’ll probably have to re-learn it several times. But you often see the principle demonstrated with a graph like this:

[Figure: Dow Jones vs. Jennifer Lawrence]

One line is something like a stock market index, and the other is an (almost certainly) unrelated time series like “Number of times Jennifer Lawrence is mentioned in the media.” The lines look amusingly similar. There is usually a statement like: “Correlation = 0.86”. Recall that a correlation coefficient is between +1 (a perfect linear relationship) and -1 (perfectly inversely related), with zero meaning no linear relationship at all. 0.86 is a high value, demonstrating that the statistical relationship of the two time series is strong. The correlation passes a statistical test.

This is a great example of mistaking correlation for causality, right? Well, no, not really: it’s actually a time series problem analyzed poorly, and a mistake that could have been avoided. You never should have seen this correlation in the first place. The more basic problem is that the author is comparing two trended time series.

The rest of this post will explain what that means, why it’s bad, and how you can avoid it fairly simply. If any of your data involves samples taken over time, and you’re exploring relationships between the series, you’ll want to read on.

Two random series

There are several ways of explaining what’s going wrong. Instead of going into the math right away, let’s look at a more intuitive visual explanation. To begin with, we’ll create two completely random time series. Each is simply a list of 100 random numbers between -1 and +1, treated as a time series. The first time is 0, then 1, etc., on up to 99. We’ll call one series Y1 (the Dow-Jones average over time) and the other Y2 (the number of Jennifer Lawrence mentions). Here they are graphed:

[Figure: Series Y1]

[Figure: Series Y2]

There is no point staring at these carefully. They are random. The graphs and your intuition should tell you they are unrelated and uncorrelated. But as a test, the correlation (Pearson’s R) between Y1 and Y2 is -0.02, which is very close to zero. There is no significant relationship between them. As a second test, we do a linear regression of Y1 on Y2 to see how well Y2 can predict Y1. We get a coefficient of determination (R² value) of 0.08, also extremely low. Given these tests, anyone should conclude there is no relationship between them.
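For readers who want to try this themselves, here is a minimal sketch of the experiment in plain Python. The seed, the variable names and the hand-written pearson_r helper are my own assumptions, not the author's code, so the exact numbers will differ from the ones quoted above.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Assumption: an arbitrary fixed seed, used only so reruns give the same series.
rng = random.Random(0)

# Two independent series of 100 random numbers between -1 and +1.
y1 = [rng.uniform(-1, 1) for _ in range(100)]
y2 = [rng.uniform(-1, 1) for _ in range(100)]

r = pearson_r(y1, y2)
print(f"Pearson r between Y1 and Y2: {r:.3f}")
```

With independent series like these, the printed value should sit close to zero for almost any seed.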

Adding trend

Now let’s tweak the time series by adding a slight rise to each. Specifically, to each series we simply add points from a slightly sloping line from (0,-3) to (99,+3). This is a rise of 6 across a span of 100. The sloping line looks like this:

[Figure: Trend line]

Now we’ll add each point of the sloping line to the corresponding point of Y1 to get a slightly sloping series like this:

[Figure: Series Y1 Prime]

We’ll add the same sloping line to Y2:

[Figure: Series Y2 Prime]

Now let’s repeat the same tests on these new series. We get surprising results: the correlation coefficient is 0.96, a very strong, unmistakable correlation. If we regress Y1 on Y2 we get a very strong R² value of 0.92. The probability that this is due to chance is extremely low, about 1.3×10^-54. These results would be enough to convince anyone that Y1 and Y2 are very strongly correlated!

What’s going on? The two time series are no more related than before; we simply added a sloping line (what statisticians call trend). One trended time series regressed against another will often reveal a strong, but spurious, relationship.

Put another way, we’ve introduced a mutual dependency. By introducing a trend, we’ve made Y1 dependent on X, and Y2 dependent on X as well. In a time series, X is time. Correlating Y1 and Y2 will uncover their mutual dependence, but the correlation is really just the fact that they’re both dependent on X. In many cases, as with Jennifer Lawrence’s popularity and the stock market index, what you’re really seeing is that they both increased over time in the period you’re looking at. This is sometimes called secular trend.

The amount of trend determines the effect on correlation. In the example above, we needed to add only a little trend (a slope of 6/100) to change the correlation result from insignificant to highly significant. But relative to the changes in the time series itself (-1 to +1), the trend was large.

A trended time series is not, of course, a bad thing. When dealing with a time series, you generally want to know whether it’s increasing or decreasing, exhibits significant periodicities or seasonalities, and so on. But in exploring relationships between two time series, you really want to know whether variations in one series are correlated with variations in another. Trend muddies these waters and should be removed.
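The effect is easy to reproduce. The sketch below adds the same line, running from (0,-3) to (99,+3), to two independent random series and compares the correlation before and after. The seed and helper names are assumptions made for illustration; the exact figures depend on the random draw and will not match the article's.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # assumption: arbitrary seed for repeatability
y1 = [rng.uniform(-1, 1) for _ in range(100)]
y2 = [rng.uniform(-1, 1) for _ in range(100)]

# Points on the straight line from (0, -3) to (99, +3).
trend = [-3 + 6 * t / 99 for t in range(100)]

# Add the same trend to each series.
y1_trended = [y + s for y, s in zip(y1, trend)]
y2_trended = [y + s for y, s in zip(y2, trend)]

r_raw = pearson_r(y1, y2)
r_trended = pearson_r(y1_trended, y2_trended)
print(f"without trend: r = {r_raw:.3f}")
print(f"with trend:    r = {r_trended:.3f}")
```

The underlying random numbers are untouched, yet the second correlation comes out high simply because both series now depend on time.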

Dealing with trend

There are many tests for detecting trend. What can you do about trend once you find it?

One approach is to model the trend in each time series and use that model to remove it. So if we expected Y1 had a linear trend, we could do linear regression on it and subtract the line (in other words, replace Y1 with its residuals). Then we’d do that for Y2, then regress them against each other.

There are alternative, non-parametric methods that do not require modeling. One such method for removing trend is called first differences. With first differences, you subtract from each point the point that came before it:

    y'(t) = y(t) - y(t-1)

Another approach is called link relatives. Link relatives are similar, but they divide each point by the point that came before it:

    y'(t) = y(t) / y(t-1)
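As a rough illustration, the three de-trending approaches described above can be sketched in a few lines of Python. The function names are hypothetical, and the residual method assumes the time index is simply 0, 1, 2, and so on. Link relatives are only defined when no point in the series is zero, and they are usually applied to strictly positive data, so the demo below applies only the first two methods to the synthetic series.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def linear_residuals(ys):
    """Fit y = a + b*t by least squares (t = 0, 1, 2, ...) and return y minus the fit."""
    n = len(ys)
    mean_t = (n - 1) / 2
    mean_y = sum(ys) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in enumerate(ys))
             / sum((t - mean_t) ** 2 for t in range(n)))
    intercept = mean_y - slope * mean_t
    return [y - (intercept + slope * t) for t, y in enumerate(ys)]


def first_differences(ys):
    """y'(t) = y(t) - y(t-1); the result is one point shorter than the input."""
    return [curr - prev for prev, curr in zip(ys, ys[1:])]


def link_relatives(ys):
    """y'(t) = y(t) / y(t-1); raises ZeroDivisionError if a previous point is zero."""
    return [curr / prev for prev, curr in zip(ys, ys[1:])]


rng = random.Random(0)  # assumption: arbitrary seed for repeatability
trend = [-3 + 6 * t / 99 for t in range(100)]
y1 = [rng.uniform(-1, 1) + s for s in trend]
y2 = [rng.uniform(-1, 1) + s for s in trend]

r_trended = pearson_r(y1, y2)
r_residuals = pearson_r(linear_residuals(y1), linear_residuals(y2))
r_differences = pearson_r(first_differences(y1), first_differences(y2))
print(f"trended:           r = {r_trended:.3f}")
print(f"residuals:         r = {r_residuals:.3f}")
print(f"first differences: r = {r_differences:.3f}")
```

On the synthetic trended series, both the residual and first-difference versions should bring the correlation back toward zero.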

More examples

Once you’re aware of this effect, you’ll be surprised how often two trended time series are compared, either informally or statistically. Tyler Vigen created a web page devoted to spurious correlations, with over a dozen different graphs. Each graph shows two time series that have similar shapes but are unrelated (even comically irrelevant). The correlation coefficient is given at the bottom, and it’s usually high.

How many of these relationships survive de-trending? Fortunately, Vigen provides the raw data so we can perform the tests. Some of the correlations drop considerably after de-trending. For example, here is a graph of US Crude Oil Imports from Venezuela vs Consumption of High Fructose Corn Syrup:

[Figure: US Crude Oil Imports vs. HFCS]

The correlation of these series is 0.88. Now here are the time series after first-differences de-trending:

[Figure: US Crude Oil Imports vs. HFCS de-trended]

These time series look much less related, and indeed the correlation drops to 0.24.

A recent blog post from Alex Jones, more tongue-in-cheek, attempts to link his company’s stock price with the number of days he worked at the company. Of course, the number of days worked is simply the time series: 1, 2, 3, 4, etc. It is a steadily rising line: pure trend! Since his company’s stock price also increased over time, of course he found correlation. In fact, every manipulation of the two variables he performed was simply another way of quantifying the trend in company price.

Final words

I was first introduced to this problem long ago in a job where I was investigating equipment failures as a function of weather. The data I had were taken over six months, winter into summer. The equipment failures rose over this period (that’s why I was investigating). Of course, the temperature rose as well. With two trended time series, I found strong correlation. I thought I was onto something until I started reading more about time series analysis.

Trends occur in many time series. Before exploring relationships between two series, you should attempt to measure and control for trend. But de-trending is not a panacea because not all spurious correlations are caused by trends. Even after de-trending, two time series can be spuriously correlated. There can remain patterns such as seasonality, periodicity, and autocorrelation. Also, you may not want to de-trend naively with a method such as first differences if you expect lagged effects.

Any good book on time series analysis should discuss these issues. My go-to text for statistical time series analysis is Quantitative Forecasting Methods by Farnum and Stanton (PWS-KENT, 1989). Chapter 4 of their book discusses regression over time series, including this issue.

*Tom Fawcett is Principal Data Scientist at Silicon Valley Data Science. Co-author of the popular book Data Science for Business, Tom has over 20 years of experience applying machine learning and data mining in practical applications. He is a veteran of companies such as Verizon and HP Labs, and an editor of the Machine Learning Journal.

Related:-

https://oromianeconomist.wordpress.com/2015/07/07/statistics-the-sexiest-job-of-the-decade/

https://oromianeconomist.wordpress.com/2015/06/22/what-is-calculus-used-for-tedx-talks/


Ancient Africa: Khemetic Mathematics: Herega Dur Durii June 28, 2015

Posted by OromianEconomist in Ancient Egyptian, Ancient Rock paintings in Oromia, Meroetic Oromo.

EGYPTIAN MATHEMATICS

[Figure: Ancient Egyptian (Khemetic) hieroglyphic numerals]

The early Egyptians settled along the fertile Nile valley as early as about 6000 BC, and they began to record the patterns of lunar phases and the seasons, both for agricultural and religious reasons. The Pharaoh’s surveyors used measurements based on body parts (a palm was the width of the hand, a cubit the measurement from elbow to fingertips) to measure land and buildings very early in Egyptian history, and a decimal numeric system was developed based on our ten fingers. The oldest mathematical text from ancient Egypt discovered so far, though, is the Moscow Papyrus, which dates from the Egyptian Middle Kingdom around 2000 – 1800 BC.

It is thought that the Egyptians introduced the earliest fully-developed base 10 numeration system at least as early as 2700 BC (and probably much earlier). Written numbers used a stroke for units, a heel-bone symbol for tens, a coil of rope for hundreds and a lotus plant for thousands, as well as other hieroglyphic symbols for higher powers of ten up to a million. However, there was no concept of place value, so larger numbers were rather unwieldy (although a million required just one character, a million minus one required fifty-four characters).
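The character counts quoted above follow directly from the additive scheme: each power of ten has its own symbol, repeated as many times as the corresponding decimal digit. A small sketch, with a hypothetical function name and an assumed upper limit based on the largest symbol (a million) mentioned in the text:

```python
def hieroglyph_count(n):
    """Number of symbols needed to write n in an additive base-10 system.

    Each power of ten has its own symbol, repeated once per unit of the
    matching decimal digit, so the count is the sum of the digits.
    Assumption: the largest symbol stands for one million, so this sketch
    accepts values from 1 up to 9,999,999.
    """
    if not 1 <= n <= 9_999_999:
        raise ValueError("n must be between 1 and 9,999,999")
    return sum(int(digit) for digit in str(n))


print(hieroglyph_count(1_000_000))  # a single symbol
print(hieroglyph_count(999_999))    # nine copies of each of six symbols
```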

[Figure: Ancient Egyptian method of multiplication]

The Rhind Papyrus, dating from around 1650 BC, is a kind of instruction manual in arithmetic and geometry, and it gives us explicit demonstrations of how multiplication and division were carried out at that time. It also contains evidence of other mathematical knowledge, including unit fractions, composite and prime numbers, arithmetic, geometric and harmonic means, and how to solve first order linear equations as well as arithmetic and geometric series. The Berlin Papyrus, which dates from around 1300 BC, shows that ancient Egyptians could solve second-order algebraic (quadratic) equations.

Multiplication, for example, was achieved by a process of repeated doubling of the number to be multiplied on one side and of one on the other, essentially a kind of multiplication of binary factors similar to that used by modern computers (see the example at right). These corresponding blocks of counters could then be used as a kind of multiplication reference table: first, the combination of powers of two which add up to the number to be multiplied by was isolated, and then the corresponding blocks of counters on the other side yielded the answer. This effectively made use of the concept of binary numbers, over 3,000 years before Leibniz introduced it into the west, and many more years before the development of the computer was to fully explore its potential.
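The doubling procedure described above can be sketched in Python as follows. This is a modern illustration of the idea, not a transcription of any papyrus, and the function name is hypothetical.

```python
def egyptian_multiply(a, b):
    """Multiply non-negative integers a and b using only doubling and addition."""
    if a < 0 or b < 0:
        raise ValueError("both numbers must be non-negative")

    # Build the doubling table: powers of two on one side,
    # the matching multiples of b on the other.
    rows = []
    power, multiple = 1, b
    while power <= a:
        rows.append((power, multiple))
        power += power        # double the power of two
        multiple += multiple  # double the multiple of b

    # Pick the powers of two that add up to a, largest first,
    # and add the corresponding multiples of b.
    total = 0
    remaining = a
    for power, multiple in reversed(rows):
        if power <= remaining:
            remaining -= power
            total += multiple
    return total


print(egyptian_multiply(13, 17))  # 13 = 8 + 4 + 1, so 136 + 68 + 17
```

The rows selected in the second loop correspond exactly to the 1 bits in the binary representation of the first number.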

Practical problems of trade and the market led to the development of a notation for fractions. The papyri which have come down to us demonstrate the use of unit fractions based on the symbol of the Eye of Horus, where each part of the eye represented a different fraction, each half of the previous one (i.e. half, quarter, eighth, sixteenth, thirty-second, sixty-fourth), so that the total was one-sixty-fourth short of a whole, the first known example of a geometric series.
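The shortfall of one sixty-fourth is easy to check with exact fractions. A minimal sketch using Python's standard fractions module (the variable names are my own):

```python
from fractions import Fraction

# The six Eye of Horus parts: 1/2, 1/4, 1/8, 1/16, 1/32, 1/64.
parts = [Fraction(1, 2 ** k) for k in range(1, 7)]

total = sum(parts)
print(total)      # sum of the six parts
print(1 - total)  # how far the sum falls short of a whole
```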

[Figure: Ancient Egyptian method of division]

Unit fractions could also be used for simple division sums. For example, if they needed to divide 3 loaves among 5 people, they would first divide two of the loaves into thirds and the third loaf into fifths, then they would divide the left over third from the second loaf into five pieces. Thus, each person would receive one-third plus one-fifth plus one-fifteenth (which totals three-fifths, as we would expect).
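The loaf example can be followed step by step with exact fractions. This sketch simply mirrors the description above; the variable names are assumptions.

```python
from fractions import Fraction

people = 5

# Two loaves cut into thirds give six pieces; five are handed out, one is left over.
third = Fraction(1, 3)
leftover = 2 - people * third

# The third loaf is cut into fifths, one piece per person.
fifth = Fraction(1, 5)

# The leftover third is itself cut into five equal pieces.
small_piece = leftover / people

share = third + fifth + small_piece
print(small_piece)  # size of the smallest piece
print(share)        # each person's total share
```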

The Egyptians approximated the area of a circle by using shapes whose area they did know. They observed that the area of a circle of diameter 9 units, for example, was very close to the area of a square with sides of 8 units, so that the area of circles of other diameters could be obtained by multiplying the diameter by 8/9 and then squaring the result. This gives an effective approximation of π accurate to within less than one percent.
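To see how good the rule is, compare it with the modern formula. Since the true area is π times the diameter squared over four, the rule amounts to using 4 × (8/9)² = 256/81 in place of π. A small sketch (the function name is hypothetical):

```python
import math


def egyptian_circle_area(diameter):
    """Area of a circle by the 8/9 rule: take 8/9 of the diameter and square it."""
    return (8 * diameter / 9) ** 2


# The rule is equivalent to replacing pi with 4 * (8/9)**2 = 256/81.
implied_pi = 4 * (8 / 9) ** 2
relative_error = abs(implied_pi - math.pi) / math.pi

print(f"area for diameter 9: {egyptian_circle_area(9)}")
print(f"implied value of pi: {implied_pi:.5f}")
print(f"relative error:      {relative_error:.2%}")
```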

The pyramids themselves are another indication of the sophistication of Egyptian mathematics. Setting aside claims that the pyramids are the first known structures to observe the golden ratio of 1 : 1.618 (which may have occurred for purely aesthetic, and not mathematical, reasons), there is certainly evidence that they knew the formula for the volume of a pyramid (1/3 times the height times the length times the width) as well as of a truncated or clipped pyramid. They were also aware, long before Pythagoras, of the rule that a triangle with sides 3, 4 and 5 units yields a perfect right angle, and Egyptian builders used ropes knotted at intervals of 3, 4 and 5 units in order to ensure exact right angles for their stonework (in fact, the 3-4-5 right triangle is often called “Egyptian”).
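Both facts are simple to verify. The sketch below states the pyramid volume formula as given above and checks the 3-4-5 triangle against the Pythagorean relation; the function names and the sample dimensions are illustrative assumptions.

```python
from fractions import Fraction


def pyramid_volume(height, length, width):
    """Volume of a pyramid on a rectangular base: 1/3 * height * length * width."""
    return Fraction(1, 3) * height * length * width


def is_right_triangle(a, b, c):
    """True if the two shorter sides squared add up to the longest side squared."""
    x, y, z = sorted((a, b, c))
    return x * x + y * y == z * z


print(pyramid_volume(3, 2, 2))     # one third of the enclosing 3 x 2 x 2 box
print(is_right_triangle(3, 4, 5))  # the knotted-rope triangle
```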

See more at: http://www.storyofmathematics.com/egyptian.html

Related:-

Scientific System, Mathematics and Ancient Kemetic Traditions

https://oromianeconomist.wordpress.com/2013/11/17/kemetic-numerology/