Statistical variability parameters

In statistics, a measure of dispersion quantifies the variability of the values in a data set. It is never negative, equals zero only when all the values are identical, and increases as the data become more spread out. Commonly used measures of dispersion include the range, interquartile range, variance, standard deviation and dispersion.

Range

The range is the most intuitive measure of the variability of a data set. Often denoted by the letter R, it is the difference between the maximum and minimum values in the set. In other words, the range is the total width of the interval covered by the observed values.

To calculate the range of a data set, follow these steps:

  1. Find the maximum value (X_{max}) in the data set.
  2. Find the minimum value (X_{min}) in the data set.
  3. Calculate the range: R = X_{max} - X_{min}
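
The three steps above can be sketched in a few lines of Python. This is only an illustration: the function name `data_range` and the measurements are invented for the example and are not part of the original text.

```python
def data_range(values):
    """Return the range R = X_max - X_min of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("the range is undefined for an empty data set")
    x_max = max(values)   # step 1: maximum value
    x_min = min(values)   # step 2: minimum value
    return x_max - x_min  # step 3: R = X_max - X_min


# Hypothetical measurements, invented for illustration.
measurements = [12, 7, 3, 15, 9]
r = data_range(measurements)
print(r)  # 15 - 3 = 12
```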

However, the range depends only on the two most extreme values, so a single outlier can change it drastically and distort its interpretation. For this reason, other measures of dispersion, such as the interquartile range or standard deviation, are often used alongside the range to obtain a more reliable picture of the variation in the data.

Interquartile range 

The interquartile range is a measure of dispersion in statistics. It is the difference between the third quartile (Q3) and the first quartile (Q1) of a data set sorted in ascending order. The quartiles divide the sorted data into four parts, each containing 25% of the observations. The box plot (box-and-whisker plot) below illustrates the positions of Q1 and Q3.

To calculate the interquartile range:

  1. Find the first quartile (Q1): the value below which the lowest 25% of the data lie.
  2. Find the third quartile (Q3): the value above which the highest 25% of the data lie.
  3. Calculate the interquartile range: IQR = Q3 - Q1
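
As a rough illustration of these steps, the sketch below uses `statistics.quantiles` from the Python standard library (Python 3.8 or later). Several conventions exist for computing quartiles and they can give slightly different values on small data sets; the choice of `method="inclusive"` here is an assumption made for the example, as are the data and the helper name `interquartile_range`.

```python
import statistics


def interquartile_range(values):
    """Return (Q1, Q3, IQR), interpolating linearly between data points."""
    # quantiles() sorts the data itself and returns the three cut points
    # Q1, median, Q3 when n=4.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3, q3 - q1


# Hypothetical data, invented for illustration (deliberately unsorted).
data = [7, 1, 9, 4, 3, 8, 2, 6, 5]
q1, q3, iqr = interquartile_range(data)
print(q1, q3, iqr)  # 3.0 7.0 4.0
```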

A large interquartile range indicates that the median (the value in the middle of the sorted data set) is surrounded by widely dispersed values, while a small interquartile range indicates that the values around the median are closely grouped. Because it ignores the lowest 25% and the highest 25% of the data, the interquartile range is less sensitive to extreme values than the range and gives a better indication of the dispersion of the central half of the data.
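
To make the difference in sensitivity concrete, the sketch below compares the range and the interquartile range on two hypothetical data sets that differ only in their largest value. The data and helper names are invented for the example, and the quartile convention (`method="inclusive"`) is an assumption.

```python
import statistics


def data_range(values):
    """Range: maximum minus minimum."""
    return max(values) - min(values)


def iqr(values):
    """Interquartile range Q3 - Q1 (inclusive quartile convention)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical data: the second set replaces the value 9 with an outlier, 90.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]

print(data_range(clean), data_range(with_outlier))  # 8 89
print(iqr(clean), iqr(with_outlier))                # 4.0 4.0
```

In this example the single outlier multiplies the range by more than ten, while the interquartile range does not change at all.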

Variance

In statistics, variance is a measure of dispersion defined as the average of the squared deviations of the values in a data set from the mean of that set. It indicates the extent to which the values are dispersed around the mean. A high variance means that the values are widely dispersed, while a low variance indicates that the values are closely grouped around the mean.

The equation for calculating the variance of a data set is as follows: 

If x_1, x_2, x_3, ..., x_N are the individual values of a population, and µ is the population mean, then the population variance Var(X) = σ² is calculated as follows:

\sigma^{2}=\frac{\sum_{i=1}^{N}(x_i-\mu)^{2}}{N}

With:
N: population size

µ: population mean

x_i: the i-th population value
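
The population formula can be checked on a small example. The sketch below treats a hypothetical list of eight values as the whole population, applies the formula term by term, and compares the result with `statistics.pvariance` from the Python standard library. The data and the function name `population_variance` are invented for the example.

```python
import math
import statistics


def population_variance(values):
    """Population variance: mean of the squared deviations from mu."""
    n_pop = len(values)
    mu = sum(values) / n_pop
    return sum((x - mu) ** 2 for x in values) / n_pop


# Hypothetical population, invented for illustration. Its mean is 5 and the
# squared deviations are 9, 1, 1, 1, 0, 0, 4, 16, which sum to 32.
population = [2, 4, 4, 4, 5, 5, 7, 9]
sigma2 = population_variance(population)
print(sigma2)  # 32 / 8 = 4.0
```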

In the case of a continuous probability distribution, the variance (σ²) can also be calculated using the formula:

\sigma^{2}=\int (x-\mu)^{2}\, f(x)\, dx

With:

x: a value taken by the random variable,

μ: the mean of the distribution,

f(x): the probability density function of the distribution. The integral is taken over the entire range of possible values of the random variable.
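
As an illustration of the integral form, the sketch below approximates the variance of a uniform distribution on [0, 1], whose exact variance is 1/12. The choice of distribution, the midpoint-rule integrator and all names are assumptions made for the example; a numerical library would normally be used for real work.

```python
def integrate(g, a, b, steps=10_000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(g(a + (k + 0.5) * width) for k in range(steps)) * width


def uniform_pdf(x):
    """Density f(x) of the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


# The density is zero outside [0, 1], so integrating over [0, 1] is enough.
mu = integrate(lambda x: x * uniform_pdf(x), 0.0, 1.0)
sigma2 = integrate(lambda x: (x - mu) ** 2 * uniform_pdf(x), 0.0, 1.0)

print(mu)      # close to 0.5
print(sigma2)  # close to 1/12
```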

In practice, however, we usually have only a series of n values (a sample), and the population mean is unknown. We therefore calculate an estimate of the variance, often denoted S², using the following formula:

S^{2}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}{n-1}

With:

  • n: sample size
  • x̄: sample mean
  • x_i: the i-th sample value
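
The sample formula differs from the population formula only in the divisor. The sketch below applies it to a hypothetical sample and compares the result with `statistics.variance`, which also divides by n - 1. The data and the function name `sample_variance` are invented for the example.

```python
import math
import statistics


def sample_variance(values):
    """Sample variance S^2: squared deviations from x_bar, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


# Hypothetical sample, invented for illustration. The squared deviations
# from the sample mean (5) sum to 32, so S^2 = 32 / 7.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
s2 = sample_variance(sample)
print(s2)
```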

Why divide by n-1 when it's a sample? 

This correction is called Bessel's correction. The deviations are measured from the sample mean, which is itself computed from the same data and therefore lies closer to the sample values, on average, than the true population mean does. As a result, dividing the sum of squared deviations by n systematically underestimates the population variance. Dividing by n-1 compensates for this and gives an unbiased estimate. The correction matters most for small samples, where the difference between n and n-1 is proportionally largest.
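
The bias can be checked exactly on a tiny example. The sketch below enumerates every possible sample of size 2 drawn with replacement from a hypothetical four-value population, then averages the two candidate estimators over all samples. The population, the sample size and the helper names are assumptions chosen to keep the enumeration small.

```python
import math
from itertools import product


def mean(values):
    return sum(values) / len(values)


def sum_sq_dev(values):
    """Sum of squared deviations from the mean of `values`."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values)


# Hypothetical population, invented for illustration.
population = [1.0, 2.0, 3.0, 4.0]
n = 2

# True population variance (divide by the population size).
pop_var = sum_sq_dev(population) / len(population)

# All 16 ordered samples of size n drawn with replacement.
samples = list(product(population, repeat=n))

avg_div_n = mean([sum_sq_dev(s) / n for s in samples])
avg_div_n_minus_1 = mean([sum_sq_dev(s) / (n - 1) for s in samples])

print(pop_var)            # 1.25
print(avg_div_n)          # 0.625, too small on average
print(avg_div_n_minus_1)  # 1.25, matches the population variance
```

Averaged over all possible samples, the estimator that divides by n - 1 recovers the population variance, while the one that divides by n gives only half of it for n = 2.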

Why isn't variance widely used to interpret variability? 

Variance is an important measure of data dispersion, but tends to be less widely used than standard deviation for a number of practical and interpretative reasons: 

  • Firstly, the variance is expressed in the square of the units of the original data, which makes it difficult to interpret directly. (If the individual data are in meters, the variance is in square meters.)
  • Moreover, variance is sensitive to extreme values: because each deviation from the mean is squared, a few outliers can dominate the result and distort its representation of the overall dispersion.
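
Both points can be seen on a small numerical example. The sketch below uses hypothetical lengths in millimeters and the standard-library functions `statistics.variance` and `statistics.stdev`; the data are invented for illustration.

```python
import math
import statistics

# Hypothetical lengths in mm, invented for illustration.
lengths_mm = [10, 12, 11, 9, 13]
# Same data with the last value replaced by an outlier.
lengths_with_outlier_mm = [10, 12, 11, 9, 33]

var_clean = statistics.variance(lengths_mm)                 # in mm squared
var_outlier = statistics.variance(lengths_with_outlier_mm)  # in mm squared
std_clean = statistics.stdev(lengths_mm)                    # back in mm

print(var_clean)    # 2.5 (mm squared)
print(var_outlier)  # 102.5 (mm squared)
print(std_clean)    # square root of 2.5, in mm
```

A single outlying value raises the variance from 2.5 mm² to 102.5 mm², and only the standard deviation is expressed in the unit of the measurements.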

Although the variance is statistically fundamental, the standard deviation is usually preferred for interpretation because it is expressed in the same units as the data.

Standard deviation

The standard deviation measures how widely the values of a distribution are spread around the mean, in the same units as the data. The greater the standard deviation, the wider the dispersion. To calculate it, simply take the square root of the variance. For a sample:

S=\sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}{n-1}}

  • n: sample size
  • x̄: sample mean
  • x_i: the i-th sample value

[Figure: Machine M1, standard deviation S = 1 mm; Machine M2, standard deviation S = 4 mm]
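
The sketch below computes the sample standard deviation for two hypothetical machines. The part dimensions are invented and were chosen only so that the results come out at 1 mm and 4 mm; they are not the data behind the original figure. The function name `sample_std` is also an assumption.

```python
import math
import statistics


def sample_std(values):
    """Sample standard deviation S: square root of the sample variance."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


# Hypothetical part dimensions in mm, invented for illustration.
machine_m1 = [9.0, 10.0, 11.0]  # tightly grouped around 10 mm
machine_m2 = [6.0, 10.0, 14.0]  # same mean, four times more spread

s_m1 = sample_std(machine_m1)
s_m2 = sample_std(machine_m2)
print(s_m1, s_m2)  # 1.0 4.0
```

Both machines have the same mean, so only the standard deviation reveals that the second one produces far more variable parts.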

Dispersion 

Standard deviation is a sound mathematical characteristic of the normal distribution, but it has no direct intuitive equivalent. We therefore prefer to use the term dispersion, defined as:

Dispersion = width of the interval in which 99.73% of the values are observed.

In the case of a normal distribution, the dispersion is simply calculated as:

Dispersion = 6\sigma
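
The link between the factor 6 and the figure of 99.73% can be verified numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / √2), a standard result. The sketch below evaluates it for k = 3, that is, for an interval of total width 6σ. The function name `normal_coverage` is invented for the example.

```python
import math


def normal_coverage(k):
    """Probability that a normal variable lies within k sigma of its mean."""
    return math.erf(k / math.sqrt(2))


sigma = 1.0             # standard deviation, chosen for illustration
dispersion = 6 * sigma  # interval from mu - 3*sigma to mu + 3*sigma

coverage = normal_coverage(3)
print(dispersion)
print(round(coverage * 100, 2))  # 99.73 (percent)
```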

The notion of dispersion is more intuitive than the standard deviation. Take the following example: 1000 observations drawn from a normal distribution with mean 0 and standard deviation 1.

If we wanted to characterize the dispersion of these values intuitively, we'd be tempted to say that the dispersion of values is around 6, as the observed values are contained between -3 and +3. 

Our intuitive definition of dispersion actually corresponds to the range of a sample (Range = Max - Min). However, the range cannot be used to characterize the distribution itself: the normal distribution extends from -∞ to +∞, so its range would be infinite.
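
This can be illustrated by simulation: the range of a sample keeps growing as more values are drawn, whereas 6σ stays fixed. The sketch below draws values from a standard normal distribution with Python's `random` module and reports the range of increasingly long prefixes of the same sequence. The seed and the sample sizes are arbitrary choices made for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
sigma = 1.0
draws = [rng.gauss(0.0, sigma) for _ in range(100_000)]

sizes = (10, 1_000, 100_000)
ranges = []
for size in sizes:
    subset = draws[:size]  # the first `size` observations
    ranges.append(max(subset) - min(subset))

for size, observed_range in zip(sizes, ranges):
    print(size, round(observed_range, 2), "vs 6*sigma =", 6 * sigma)
```

Because each larger sample contains the smaller ones, the observed range can only stay the same or grow, and for very large samples it exceeds 6σ.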

Dispersion corresponds to the intuitive definition of variability, but also has a statistical meaning. It is the interval within which we will observe practically all the values: for a normal distribution, the interval from μ - 3σ to μ + 3σ, which contains 99.73% of the values.