“Statistics are like a bikini – what is revealed is interesting but what is hidden is crucial.” – Aaron Levenstein, Associate Prof. emeritus of Business, Baruch College.
Statistics tell us about general trends and properties of our data, but they often hide information that would give key insights.
The goal of every analyst then is to make statistics skimpy – but at the same time, make them as revealing as possible while not overwhelming the decision maker with unnecessary numbers.
In this post, I will focus on working with the average and median statistics. While widely used to describe trends and data, they have a few limitations that can give you the wrong impression of what is going on.
Consider the revenue from three campaigns shown below:
The following patterns are readily apparent:
- The average revenue for Campaign 1 is close to $1000 a day.
- For Campaign 2, the average revenue appears to be increasing, and it has also had a few exceptional days.
- Campaign 3’s performance appears to fluctuate a lot.
Taking the average and median revenue in the time period for all three campaigns, we get the following graph:
At first glance, the average revenue of the three campaigns is in the same range, but Campaign 2 appears to have a significantly higher average than the other two.
However, when we look at the median revenue (i.e. the middle value of the revenue data), we see that Campaign 3 has the best performance, while Campaign 2 has the worst.
Since we have daily data, we know that Campaign 2’s apparently superior average is due to two primary reasons:
- Performance has gradually improved with time
- It has had a few exceptional days of performance.
Moreover, Campaign 3’s apparently superior median is due to its performance volatility. Clearly, the mean and the median each provide only a partial picture of performance and can lead you to mistaken conclusions.
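To make the gap between mean and median concrete, here is a minimal Python sketch. The revenue figures are invented for illustration and are not the campaign data discussed in this post; the names `steady` and `spiky` are hypothetical.

```python
from statistics import mean, median

# Hypothetical daily revenue in dollars, invented for illustration only.
steady = [980, 1010, 995, 1005, 990, 1020, 1000]
# A weaker campaign on most days, but with two exceptional days.
spiky = [700, 720, 710, 730, 715, 3000, 3200]

for name, revenue in [("steady", steady), ("spiky", spiky)]:
    # The mean is pulled up by the exceptional days; the median is not.
    print(f"{name}: mean={mean(revenue):.0f}, median={median(revenue):.0f}")
```

In this made-up example the spiky series has the higher mean but the lower median, which is the same kind of disagreement described above for Campaign 2.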
The simplest solution that comes to mind is to not provide statistics but provide time series data like the first graph.
Unfortunately, this defeats the purpose of providing statistics, which is to compress large volumes of data into simple, digestible metrics that give clear and concise information about the nature of your data.
One of the simplest ways to overcome the problem with the mean statistic is to use boxplots. I have drawn a boxplot of the 3 campaigns below.
In a box plot:
- The left end of the left line represents the minimum value
- The left line of the darker box represents the 25th percentile
- The middle line separating the 2 shaded boxes represents the median
- The right end of the lighter box represents the 75th percentile
- The right end of the right line represents the maximum value
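The five values listed above can be computed directly. The following is a sketch using Python's standard `statistics` module; the function name `five_number_summary` and the sample data are assumptions made for illustration. Note that tools differ slightly in how they interpolate percentiles, so Excel or another package may report marginally different quartiles.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, 25th percentile, median, 75th percentile, max)."""
    # quantiles(n=4) returns the three quartile cut points. The
    # "inclusive" method treats the data as the whole population.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical daily revenue in dollars, invented for illustration only.
revenue = [700, 720, 710, 730, 715, 3000, 3200]
print(five_number_summary(revenue))
```

Here the whiskers run to the minimum and maximum, matching the description above. Some charting tools instead stop the whiskers earlier and draw extreme values as separate points.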
This box plot shows:
(a) While the median value of Campaign 2 is the lowest, there is a high degree of variance as indicated by the large box size between the 25th and the 75th percentile. The high max value hints at the days of exceptional performance (outliers).
(b) Campaign 1 has the steadiest performance, since its box is the narrowest of the three campaigns.
(c) Campaign 3 has moderate volatility but has a lot of days of poor performance as the dark shaded box is larger than the lighter shaded one.
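The readings in (a) to (c) come down to comparing box widths and the two halves of each box. As a rough sketch, assuming hypothetical data and a helper I have named `box_shape`, the same comparisons can be made numerically:

```python
from statistics import quantiles

def box_shape(values):
    """Measure the box of a box plot: total width and each half."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "box_width": q3 - q1,    # 25th to 75th percentile (interquartile range)
        "lower_half": med - q1,  # the darker box: 25th percentile to median
        "upper_half": q3 - med,  # the lighter box: median to 75th percentile
    }

# Hypothetical daily revenue in dollars, invented for illustration only.
steady = [980, 1010, 995, 1005, 990, 1020, 1000]
spiky = [700, 720, 710, 730, 715, 3000, 3200]

for name, revenue in [("steady", steady), ("spiky", spiky)]:
    print(name, box_shape(revenue))
```

A narrow box suggests steady performance, while a lower half that is wider than the upper half suggests that the below-median days are spread further from the median than the above-median days.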
Thus, all key performance trends have been captured in this chart. An analyst reading the chart will draw the right inferences and analyze the data to confirm her hypothesis. Also note that the boxplot occupies the same real estate as a bar graph, so it also achieves the goal of conciseness.
While making boxplots in Excel is not straightforward, there are various resources online that teach you how to make them relatively easily. These plots complement the mean statistics of your data and help ensure that you seldom miss key trends and insights in your data.
Opinions expressed in the article are those of the guest author and not necessarily those of Search Engine Land.