How to compute covariance and correlation coefficients
Description
Covariance is a measure of how much two variables “change together.” It is positive when the variables tend to increase or decrease together, and negative when an increase in one variable tends to accompany a decrease in the other. Correlation normalizes covariance to the interval $[-1,1]$.
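For reference, the description above corresponds to the standard sample definitions below, written for $n$ paired observations $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$. The symbols $s_x$, $s_y$, and $r_{xy}$ are notation introduced here for illustration and do not appear elsewhere on this page. The $n-1$ denominator is the sample (unbiased) convention, which is what NumPy, pandas, and R use by default.

$$
\operatorname{cov}(x,y)=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}),
\qquad
s_x=\sqrt{\operatorname{cov}(x,x)},
\qquad
r_{xy}=\frac{\operatorname{cov}(x,y)}{s_x\,s_y}
$$

Because $|\operatorname{cov}(x,y)|\le s_x s_y$, the Pearson correlation $r_{xy}$ always lies in $[-1,1]$.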
Using pandas and NumPy, in Python
We will construct some random data here, but when applying this, you would use your own data, of course.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 5))
df.columns = ['col1', 'col2', 'col3', 'col4', 'col5']
df.head()
       col1      col2      col3      col4      col5
0  0.488293  0.151749  0.485939  0.278562  0.998647
1  0.405459  0.766983  0.915349  0.099784  0.518523
2  0.312085  0.498104  0.526030  0.745883  0.292882
3  0.313217  0.826840  0.254793  0.942009  0.456271
4  0.657147  0.024847  0.769884  0.140779  0.427270
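If you have two pandas Series, you can compute the covariance of just those two variables. Note that every column in a DataFrame is a pandas Series. The result is a 2×2 covariance matrix: the diagonal entries are the variances of the two variables, and the off-diagonal entries are their covariance.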
np.cov(df['col1'], df['col2'])
array([[ 0.04524431, -0.02545402],
       [-0.02545402,  0.12901528]])
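To make the entries of that 2×2 matrix concrete, here is a minimal from-scratch sketch of the same calculation in plain Python. The function name `sample_covariance` and the small data lists are made up for illustration. It assumes the sample convention (dividing by n - 1), which is NumPy's default for `np.cov`.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means, divided by n - 1.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical illustrative data, not taken from the DataFrame above.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 9.0]

var_x = sample_covariance(x, x)   # variance of x (a diagonal entry)
var_y = sample_covariance(y, y)   # variance of y (a diagonal entry)
cov_xy = sample_covariance(x, y)  # covariance (the off-diagonal entries)

# Same layout as the matrix returned for two variables.
cov_matrix = [[var_x, cov_xy], [cov_xy, var_y]]
print(cov_matrix)
```

The variance of a variable is just its covariance with itself, which is why the variances appear on the diagonal.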
You can also compute the covariance of every pair of columns in a DataFrame at once, treating each column as a separate variable, by calling the DataFrame's cov() method (that is, df.cov()).
          col1      col2      col3      col4      col5
col1  0.045244 -0.025454  0.005095 -0.015552 -0.006827
col2 -0.025454  0.129015  0.009857  0.062661 -0.013753
col3  0.005095  0.009857  0.084701 -0.048114  0.014510
col4 -0.015552  0.062661 -0.048114  0.087198 -0.023934
col5 -0.006827 -0.013753  0.014510 -0.023934  0.057866
The Pearson correlation coefficient can be computed with np.corrcoef in place of np.cov.
np.corrcoef(df['col1'], df['col2'])
array([[ 1.        , -0.33316075],
       [-0.33316075,  1.        ]])
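The relationship between the two quantities can be sketched in plain Python: the Pearson coefficient is the covariance divided by the product of the two standard deviations. The helper name `pearson_r` and the data are hypothetical, chosen only for illustration. The n - 1 factors cancel between numerator and denominator, so the sketch works directly with sums of deviations.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical illustrative data.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 9.0]

r = pearson_r(x, y)
r_up = pearson_r(x, [2 * v + 1 for v in x])  # exact increasing linear relation
r_down = pearson_r(x, [-v for v in x])       # exact decreasing linear relation
print(r, r_up, r_down)
```

An exact increasing linear relationship gives a coefficient of 1 and an exact decreasing one gives -1 (up to floating-point rounding), which is why the diagonal of a correlation matrix is always 1.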
And pandas DataFrames have a built-in method, corr(), that does this for all numeric columns (that is, df.corr()).
          col1      col2      col3      col4      col5
col1  1.000000 -0.333161  0.082300 -0.247604 -0.133423
col2 -0.333161  1.000000  0.094296  0.590780 -0.159177
col3  0.082300  0.094296  1.000000 -0.559850  0.207259
col4 -0.247604  0.590780 -0.559850  1.000000 -0.336937
col5 -0.133423 -0.159177  0.207259 -0.336937  1.000000
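The whole-DataFrame tables in this section can be reproduced in spirit with the sketch below, which calls the pandas methods `DataFrame.cov` and `DataFrame.corr` and checks one entry of each against the pairwise NumPy functions. It assumes a seeded generator (`np.random.default_rng(0)`, with the seed chosen arbitrarily) so that reruns give the same numbers. The values will therefore differ from the unseeded tables shown above.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible; the seed is arbitrary.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((10, 5)),
    columns=["col1", "col2", "col3", "col4", "col5"],
)

# Pairwise covariance and Pearson correlation of all columns.
cov_matrix = df.cov()
corr_matrix = df.corr()

# The pairwise NumPy results should agree with the matching table entries.
pair_cov = np.cov(df["col1"], df["col2"])[0, 1]
pair_corr = np.corrcoef(df["col1"], df["col2"])[0, 1]
cov_agrees = bool(np.isclose(cov_matrix.loc["col1", "col2"], pair_cov))
corr_agrees = bool(np.isclose(corr_matrix.loc["col1", "col2"], pair_corr))

print(cov_matrix)
print(corr_matrix)
print(cov_agrees, corr_agrees)
```

Both approaches use the n - 1 denominator by default, so the entries agree. The DataFrame methods are usually more convenient because they keep the column labels on the result.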
Content last modified on 24 July 2023.
Solution, in R
We will construct some random data here, but when applying this, you would use your own data, of course.
# Create a dataframe with random values between 0 and 1
set.seed(1)
df <- as.data.frame(matrix(runif(n = 50, min = 0, max = 1), nrow = 10))
names(df) <- c('col1', 'col2', 'col3', 'col4', 'col5')
head(df)
       col1      col2      col3      col4      col5
1 0.2655087 0.2059746 0.9347052 0.4820801 0.8209463
2 0.3721239 0.1765568 0.2121425 0.5995658 0.6470602
3 0.5728534 0.6870228 0.6516738 0.4935413 0.7829328
4 0.9082078 0.3841037 0.1255551 0.1862176 0.5530363
5 0.2016819 0.7698414 0.2672207 0.8273733 0.5297196
6 0.8983897 0.4976992 0.3861141 0.6684667 0.7893562
In R, we can use the cov() function to calculate the covariance between two variables. The default method is Pearson.
cov(df$col1, df$col2)
You can also compute the covariance of every pair of columns in a data frame at once, treating each column as a separate variable, by passing the whole data frame to cov() (that is, cov(df)).
              col1          col2          col3          col4         col5
col1  0.0996382947  0.0004115864 -0.0287090091 -0.0052485522 -0.029944309
col2  0.0004115864  0.0731549057 -0.0255386673 -0.0112688616 -0.026535785
col3 -0.0287090091 -0.0255386673  0.0942522913  0.0009465216  0.050640298
col4 -0.0052485522 -0.0112688616  0.0009465216  0.0593140088 -0.008714775
col5 -0.0299443088 -0.0265357850  0.0506402980 -0.0087147752  0.055665077
The Pearson correlation coefficient can be computed with cor() in place of cov().
And you can compute correlation coefficients for every pair of columns in a numeric data frame by passing the whole data frame to cor() (that is, cor(df)).
             col1         col2        col3        col4       col5
col1  1.000000000  0.004820878 -0.29625051 -0.06827280 -0.4020775
col2  0.004820878  1.000000000 -0.30756049 -0.17107229 -0.4158329
col3 -0.296250506 -0.307560491  1.00000000  0.01265919  0.6991315
col4 -0.068272803 -0.171072293  0.01265919  1.00000000 -0.1516653
col5 -0.402077472 -0.415832858  0.69913152 -0.15166527  1.0000000