Central Tendency

13.12.2019

Episode #8 of the course Data fundamentals by Colby Schrauth and Serge LeBlanc

Welcome back!

Yesterday, we shared the third and final area for practically working with and understanding data.

Today, we’ll change gears and move from practical work to data measurement techniques. We believe there are two that everyone interested in working with data should master, and the first is central tendency.

Let’s dive in with a trick question!

Imagine that you have \$1,000,000 in an investment account. Over the course of two years, you generate the following returns:

• start of Year 1 ➡︎ end of Year 1: +100%

• start of Year 2 ➡︎ end of Year 2: -50%

What is your average annual return? Let’s do the math: (100% – 50%) ÷ 2 = +25%.

Now, let’s look at the actual movement of money for each period:

• start of Year 1: \$1M ➡︎ +100% ➡︎ end of Year 1: \$2M

• start of Year 2: \$2M ➡︎ -50% ➡︎ end of Year 2: \$1M

So, you started Year 1 with \$1,000,000 and ended Year 2 with \$1,000,000 (i.e., \$0 of growth), yet your average annual return is +25%?
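To make the mismatch concrete, here is a minimal Python sketch of the arithmetic above. The variable names are ours, and the "compound average" at the end is an addition that the lesson does not name: it is the constant yearly return that would carry the starting balance to the ending balance, shown here only as one possible way to summarize what actually happened.

```python
# Yearly returns from the example, written as fractions (+100% -> 1.00, -50% -> -0.50).
yearly_returns = [1.00, -0.50]
starting_balance = 1_000_000

# The simple (arithmetic) mean of the two returns: the "+25%" figure.
arithmetic_mean = sum(yearly_returns) / len(yearly_returns)

# The actual movement of money: apply each year's return in sequence.
balance = starting_balance
for yearly_return in yearly_returns:
    balance *= 1 + yearly_return

# Assumption (not from the lesson): a compound average, i.e. the constant
# yearly return that would turn the starting balance into the ending balance.
growth_factor = balance / starting_balance
compound_average = growth_factor ** (1 / len(yearly_returns)) - 1

print(f"Arithmetic mean return:  {arithmetic_mean:+.0%}")
print(f"Ending balance:          ${balance:,.0f}")
print(f"Compound average return: {compound_average:+.0%}")
```

The arithmetic mean comes out at +25%, while the ending balance equals the starting balance and the compound average is 0%, which matches the \$0 of growth in the example.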

There are two takeaways from our trick question:

First, always make sure to look at the underlying data that’s producing a descriptive statistic. Sometimes, average-oriented metrics don’t align with what’s actually occurring in the real world. If access to the data isn’t an option (either the dataset is too large or simply inaccessible), it’s helpful to at least inquire about a statistic’s composition (more on this below) and the data source.

Second, we highly recommend thinking of “the average” as a way of articulating the central tendency of a dataset rather than as one literal mathematical computation, which is most often associated with the mean. In fact, there are two other ways to calculate “the average” that we recommend: median and mode.

Let’s line up another example to reiterate why the typical “average” (i.e., mean) is normally a fine way to define the center, but not always. Below we have three tables, each having the exact same data. The last row in each table calculates a measure of central tendency:

• Mean: Sum all values in a dataset, then divide by the number of values.

• Median: After a dataset is sorted, this is the middle number (i.e., the 50th percentile). When the dataset has an even number of values, it is the average of the two middle numbers.

• Mode: This is the number that repeats most often in a dataset.

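| Name | Weight |
|---|---|
| Timber | 500 |
| Kygo | 75 |
| Rooby | 50 |
| Aslan | 50 |
| Rome | 65 |
| Burton | 70 |
| Mean | 135 |

| Name | Weight |
|---|---|
| Timber | 500 |
| Kygo | 75 |
| Rooby | 50 |
| Aslan | 50 |
| Rome | 65 |
| Burton | 70 |
| Median | 68 |

| Name | Weight |
|---|---|
| Timber | 500 |
| Kygo | 75 |
| Rooby | 50 |
| Aslan | 50 |
| Rome | 65 |
| Burton | 70 |
| Mode | 50 |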
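As a quick check, the three figures can be reproduced with Python's standard `statistics` module. This is a sketch using the six weights from the tables; the dictionary and variable names are ours.

```python
from statistics import mean, median, mode

# Weights copied from the tables above (the lesson does not state a unit).
weights = {
    "Timber": 500,
    "Kygo": 75,
    "Rooby": 50,
    "Aslan": 50,
    "Rome": 65,
    "Burton": 70,
}
values = list(weights.values())

mean_weight = mean(values)      # sum of all values divided by the count
median_weight = median(values)  # average of the two middle values after sorting
mode_weight = mode(values)      # the value that repeats most often

print("Mean:  ", mean_weight)
print("Median:", median_weight)
print("Mode:  ", mode_weight)
```

With six values, the median is the average of the two middle numbers (65 and 70), so the exact result is 67.5; the 68 in the table appears to be that figure rounded.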

Here’s the takeaway: The mean is sensitive to outliers. Notice in our first table that the mean (135) is being pulled up by the outlier weight of Timber (500). No other value in our dataset is even close to 135. Outliers can distort or skew a statistic so that it no longer gives a true representation of what is actually happening.
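One way to see this sensitivity is to compute the mean and the median with and without Timber's weight. The sketch below does this purely as an illustration; dropping a value is our own device for the comparison, not a data-cleaning step the lesson recommends.

```python
from statistics import mean, median

# The six weights from the tables; 500 is Timber's outlier weight.
weights = [500, 75, 50, 50, 65, 70]

# Illustration only: the same data with the outlier left out.
without_outlier = [w for w in weights if w != 500]

mean_all = mean(weights)
mean_without = mean(without_outlier)
median_all = median(weights)
median_without = median(without_outlier)

print(f"Mean:   {mean_all} with Timber, {mean_without} without")
print(f"Median: {median_all} with Timber, {median_without} without")
```

A single value moves the mean from 62 to 135, while the median only shifts from 65 to 67.5.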

To summarize our recommendation: when working to define the central point in a dataset, always pull all three measures of central tendency (mean, median, and mode). Not only will this make your analysis look fancy and trustworthy, but it will also give you and others legitimate insight into what “the average” actually is and whether there’s any distortion/skew in the underlying data.

Tomorrow, we’ll dive into our second data measurement principle to master: growth and normalization!

—Colby and Serge

Recommended book

Naked Statistics: Stripping the Dread from the Data by Charles Wheelan
