How to find variance using Python

This article was originally published on Built In by Eric Kleppen.

Variance is a powerful statistic used in data analysis and machine learning. It is one of the four main measures of variability along with range, interquartile range (IQR) and standard deviation. Understanding variance is important because it gives you insight into the spread of your data and can be used to compare differences in sample groups or identify important modeling features. Variance is also used in machine learning to understand changes in model performance due to using different samples of training data.

Calculating variance is easy using Python. Before diving into the Python code, I’ll first explain what variance is and how you can calculate it. By the end of this tutorial you’ll have a better understanding of why variance is an important statistic, along with several methods for calculating it using Python.

What is variance?

Variance is a statistic that measures dispersion. Low variance indicates that values are generally similar and do not vary widely from the mean while high variance indicates values are more widely dispersed from the mean. You can use variance on either a sample set or the entire population as the calculation takes in all data points in the given set. Although the calculation differs slightly when you’re looking at a sample versus population, you can calculate the variance as the average of the squared differences from the mean.

Since the variance is a squared value, it can be difficult to interpret compared to other measures of variability like standard deviation. Regardless, reviewing variance can be helpful; doing so can make it easier for you to decide which statistical tests to use with your data. Depending on the statistical tests, uneven variance between samples could skew or bias results.

One of the popular statistical tests that applies variance is called the analysis of variance (ANOVA) test. An ANOVA test is used to gauge whether any of the group means are significantly different from one another when analyzing a categorical independent variable and a quantitative dependent variable. For example, say you want to analyze whether social media use impacts hours of sleep. You could break social media use into different categories like low use, medium use and high use, then run an ANOVA test to gauge whether there are statistical differences between the group means. The test can show whether results are explained by group differences or individual differences.

How do you find the variance?

Calculating the variance for a data set can differ based on whether the set is the entire population or a sample of the population.

The formula for calculating the variance of an entire population looks like this:

σ² = ∑ (Xᵢ— μ)² / N

An explanation of the formula:

σ² = population variance
Σ = sum of…
Χᵢ = each value
μ = population mean
Ν = number of values in the population
Using an example range of numbers, let’s walk through the calculation step by step.

Example range of numbers: 8, 6, 12, 3, 13, 9

Find the population mean (μ):

Calculation for finding variance in Python

Calculate deviations from the mean by subtracting the mean from each value.

Calculation for finding variance in Python

Square each deviation to get a positive number.

Square each deviation with a positive number

Sum the squared values.

Sum the squared values

Divide the sum of squares by N or n-1.

Since we’re working with the entire population, we’ll divide by N. If we were working with a sample of the population, we would divide by n-1.

69.5/6 = 11.583

There we have it! The variance of our population is 11.583.

Why use n-1 when calculating the sample variance?

Applying n-1 to the formula is called Bessel’s correction, named after Friedrich Bessel. When using samples, we need to calculate the estimated variance for the population. If we used N instead of n-1 for the sample, the estimate would be biased, potentially underestimating the population variance. Using n-1 will make the variance estimate larger, overestimating variability in samples, thus reducing biases.

Let’s recalculate the variance pretending the values are from a sample:

recalculate the variance pretending the values are from a sample

As we can see, the variance is larger!

Calculating variance using Python

Now that we’ve done the calculation by hand, we can see that completing it for a large set of values would be very tedious. Luckily, Python can easily handle the calculation for very large data. We will explore two methods using Python:

Write our own variance calculation function
Use Pandas’ built-in function

Writing a variance function

As we begin to write a function to calculation variance, think back to the steps we took when calculating by hand. We want the function to take in two parameters:

population: an array of numbers
is_sample: a Boolean to alter the calculation depending on whether we’re working with a sample or population

Start by defining the function that takes in the two parameters.

Next, add logic to calculate the population mean.

After calculating the mean, find the differences from the mean for each value. You can do this in one line using a list comprehension.

find the differences from the mean for each value.

Next, square the differences and sum them.

Lastly, calculate the variance. Using an If/Else statement, we can utilize the is_sampleparameter. If is_sampleis true, calculate variance using (n-1). If it is false (the default), use N:

Calculate the variance

We can test the calculation using the range of numbers we crunched by hand:

How to find the variance in Python

Finding variance using Pandas

Although we can write a function to calculate variance in less than 10 lines of code, there is an even easier way to find variance. You can do it in one line of code using Pandas. Let’s load up some data and work through a real example of finding variance.

Loading example data

The Pandas example uses the BMW Price Challenge data set from Kaggle, which is free to download. Begin by importing the Pandas library, and then reading the CSV file into a Pandas data frame:

reading the CSV file into a Pandas data frame

We can count the number of rows in the data set and display the first five rows to make sure everything loaded correctly:

Displaying the first rows using bmw_df.head()

Finding the variance for the BMW data

Since the BMW data set is 4843 rows, calculating that by hand would…not be fun. Instead we can simply plug in the column from the data frame into our calculate_variance function and return the variance. Let’s find the variance for the numeric columns mileage, engine_power and price.

Variance for numeric columns in the BMW data frame

Using Pandas var() function

In case we forget the calculation for variance and cannot write our own function, Pandas has a built-in function to calculate variance named var(). By default, it assumes a sample population and uses n-1 in the calculation; however, you can adjust the calculation by passing in the ddof=0 argument.

As we can see the Var() function matches the values produced by our calculate_variance function, and it’s only one line of code. Reviewing the results, we can see mileage has a high variance meaning the values tend to vary from the mean by a lot. That makes sense because many factors play into the distance a person needs to drive. By comparison, engine_power has a low variance which indicates the values don’t vary widely from the mean.

The takeaway

Understanding variance can be an important part of data analysis and machine learning because you can use it to assess group differences. Variance also impacts which statistical tests can help us make data driven decisions. High variance means values are greatly dispersed from the mean, while low variance means numbers are not widely dispersed from the mean. If we have a small set of values, it’s possible to calculate the variance by hand in only five steps. For large data sets, we saw how simple it is to calculate variance using Python and Pandas. The Var() function in Pandas calculates the variance for the numerical columns in a data frame in only one line of code, which is pretty handy!