Basics of Standard Deviation

Standard Deviation is probably one of the most important calculations you will need to understand as a data analyst. Using the calculation has great potential in reporting to show historical events, but is used better on dashboards to actively monitor systems.

In today's example I will be looking at a server, and the number of messages processed in an hour. My server has a database and we deliver a messaging service to the local university. The service we deliver is a basic messaging service to allow students to chat with other students or tutors.

Firstly lets look at Standard Deviation and some visual aids:

The Symbol used for Standard Deviation is the lower case Greek letter "Sigma" or lower case "s"


Sigma Symbol

Standard Deviation is a way to determinate how much a value is different from the average or mean. this means we need to have the mean value for a particular range of values. In our example we are looking at the total amount of messages sent per day, for a week.

Standard Deviation is calculated from the mean value, this means that we will need allow the values to increase or decrease based on all the values supplied. This is why we look at Standard Deviation as a positive and negative value range. The values can move away from the mean in both directions of our mean:



The Standard Deviation is calculated and then added to the Deviation + and the Deviation - directions of our chart. As its a single value the deviation in both directions is an equal distance apart from the mean value.



The Standard Deviation spread is the area (in Green) where the most of the figures normally appear 

The curve represents the number of items that will fall closer to the mean and will have more values in the standard deviation spread.

Now lets look at our example for more clarity. Our messages recorded in the database have different values each day, but to determine if the numbers are in a normal range for a day, we need to do a Standard Deviation Calculation. 

The formula looks like this:



To calculate this is much simpler and can be explained in a few easy steps. I will also show some figures to allow you to follow with the example of the messages sent.

1. Get the Average or Mean value 

Every day the number of messages are different, and to determine what the normal amount is we will need to get the mean value. To get the mean value, we need to divide the sum of the values with the count of the numbers.


(2568+2635+2687+3541+2421+2502+1235 = 17139) / 7 = 2448.43



2. Get the difference between each value and the mean value 

The difference or "Variance" is calculated by subtracting the mean from each value.

Monday's difference: 2568 - 2448.43 = 119.57 




 The variance must be a positive value to allow the calculation of the squared root: (Consider: (4 * 4 = 16) and in negative form (-4 * -4 = 16)). To get a positive number form a calculated field, you will need to multiply your negative values with -1 




3. Get the squared value of each difference

We will now need to get the squared value for each difference or variance

Monday: 119.57 * 119.57 = 14296.9 (Rounded to 14297)



The values have been rounded for this example, but in real values you might not want to do this. Un-rounded values will give you greater accuracy on the deviation values. 

4. Get the Mean of the squared values


Now we will need to get the average of the squared values

(14297+24809+56919+1193712+752+157156+1472409 = 2930051) / 7 = 418578.71 (418578.82 due to rounding)



5. Get the squared root of the mean value

We now have a mean number based on the all the differences hat was squared. We will now need to get the squared root of this mean number to get our deviation value.



The mean squared value is the deviation from the original mean value. So we now need to apply it to both the positive and negative directions of a chart as shown below.


Our values are calculated, but to display it in a chart we need to format a table to hold the values. 

The table that holds the chart values takes the mean and subtracts the deviation, or adds the deviation in order to get the deviation point. 

Our Mean was 2448.43 and to get the Mean - Deviation we just subtract the number 647 (Deviation) from the Mean. Similarly we do the same for the Mean + Deviation as we add the 647 to 2448.43.



For our messages chart we have the Mean for each day, and the Deviation + and Deviation - values. The chart shows that majority of the days the messages are within the deviation range except for Thursday and Sunday. On Thursday there is a massive increase of messages and on Sunday there is a dramatic short fall of messages.

Even though we now have an idea to get a Standard Deviation form a single Mean Value, it still does not answer for the two exceptions in our chart. The Thursday and Sunday values are not in my deviation spread, but is it normal?

To answer this we will need to look at the trend of the messages over several weeks. Then we can calculate the deviation on the daily values to show a much more accurate deviation based on daily values.

In my Next tutorial we will be looking at the Standard Deviation on trends and how we will be able to spot an issue if our current values breaks the trend.

No comments:

Post a Comment