How to compute covariance and correlation coefficients (in R)

Task

Covariance is a measure of how much two variables “change together.” It is positive when the variables tend to increase or decrease together, and negative when an increase in one variable tends to accompany a decrease in the other. Correlation normalizes covariance to the interval $[-1,1]$ by dividing it by the product of the two variables' standard deviations.
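
For reference, the sample versions of these two quantities can be written as below. The notation is chosen here for illustration and is not taken from the original text: $x_i$ and $y_i$ are the $n$ paired observations, $\bar{x}$ and $\bar{y}$ are their means, and $s_x$ and $s_y$ are their sample standard deviations.

$$
\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
$$

The $n-1$ denominator matches what R's cov() uses for the default Pearson method.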

Solution

We will construct some random data here, but when applying this, you would use your own data, of course.

# Create a data frame of random values between 0 and 1
set.seed(1)
df <- as.data.frame(matrix(runif(n = 50, min = 0, max = 1), nrow = 10))
names(df) <- c('col1', 'col2', 'col3', 'col4', 'col5')
head(df)

  col1      col2      col3      col4      col5     
1 0.2655087 0.2059746 0.9347052 0.4820801 0.8209463
2 0.3721239 0.1765568 0.2121425 0.5995658 0.6470602
3 0.5728534 0.6870228 0.6516738 0.4935413 0.7829328
4 0.9082078 0.3841037 0.1255551 0.1862176 0.5530363
5 0.2016819 0.7698414 0.2672207 0.8273733 0.5297196
6 0.8983897 0.4976992 0.3861141 0.6684667 0.7893562

In R, we can use the cov() function to calculate the covariance between two variables. The default method is Pearson.

cov(df$col1, df$col2)

[1] 0.0004115864
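
To make the calculation behind cov() concrete, here is a minimal sketch that computes the sample covariance by hand and compares it with the built-in function. The vectors x and y are small made-up examples, not the data frame used above.

```r
# Hypothetical example vectors, chosen so the arithmetic is easy to follow
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

n <- length(x)

# Sample covariance: sum of the products of deviations from the mean,
# divided by n - 1
manual_cov <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

manual_cov                        # 4
cov(x, y)                         # 4
all.equal(manual_cov, cov(x, y))  # TRUE
```

The deviation products are 8, 0, 0, 4, and 4, which sum to 16; dividing by n - 1 = 4 gives a covariance of 4.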

You can also pass the whole data frame to cov() to get the covariance of every pair of columns, treating each column as a separate variable.

cov(df)

     col1          col2          col3          col4          col5        
col1  0.0996382947  0.0004115864 -0.0287090091 -0.0052485522 -0.029944309
col2  0.0004115864  0.0731549057 -0.0255386673 -0.0112688616 -0.026535785
col3 -0.0287090091 -0.0255386673  0.0942522913  0.0009465216  0.050640298
col4 -0.0052485522 -0.0112688616  0.0009465216  0.0593140088 -0.008714775
col5 -0.0299443088 -0.0265357850  0.0506402980 -0.0087147752  0.055665077
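
Two properties of a covariance matrix help when reading it: the diagonal holds each column's variance, and the matrix is symmetric. The sketch below checks both, rebuilding the same random data frame so that it runs on its own; the name cov_matrix is introduced here only for illustration.

```r
# Rebuild the example data so this block is self-contained
set.seed(1)
df <- as.data.frame(matrix(runif(n = 50, min = 0, max = 1), nrow = 10))
names(df) <- c('col1', 'col2', 'col3', 'col4', 'col5')

cov_matrix <- cov(df)

# The diagonal entries are the variances of the individual columns
diag_matches_var <- all.equal(unname(diag(cov_matrix)), unname(sapply(df, var)))
diag_matches_var        # TRUE

# cov(a, b) equals cov(b, a), so the matrix is symmetric
isSymmetric(cov_matrix) # TRUE
```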

The Pearson correlation coefficient can be computed with cor() in place of cov().

cor(df$col1, df$col2)

[1] 0.004820878
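
The relationship between the two functions can be sketched directly: dividing the covariance by the product of the standard deviations gives the Pearson correlation. The example also shows why the normalization is useful, since rescaling a variable changes the covariance but not the correlation. The vectors are made-up illustrations, not the data frame above.

```r
# Hypothetical example vectors
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Pearson correlation as normalized covariance
manual_cor <- cov(x, y) / (sd(x) * sd(y))

manual_cor                        # 0.8
all.equal(manual_cor, cor(x, y))  # TRUE

# Changing the units of x scales the covariance...
cov(100 * x, y)                   # 400
# ...but leaves the correlation unchanged
cor(100 * x, y)                   # 0.8
```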

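You can also compute correlation coefficients for every pair of columns in a data frame. Note that cor() expects every column to be numeric, so drop or convert any other columns first.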

cor(df)

     col1         col2         col3        col4        col5      
col1  1.000000000  0.004820878 -0.29625051 -0.06827280 -0.4020775
col2  0.004820878  1.000000000 -0.30756049 -0.17107229 -0.4158329
col3 -0.296250506 -0.307560491  1.00000000  0.01265919  0.6991315
col4 -0.068272803 -0.171072293  0.01265919  1.00000000 -0.1516653
col5 -0.402077472 -0.415832858  0.69913152 -0.15166527  1.0000000
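
As a possible follow-up, the sketch below shows two related base R features: cov2cor(), which rescales an existing covariance matrix into a correlation matrix, and the method argument of cor(), which selects a rank-based coefficient instead of the default Pearson. The data frame is rebuilt so the block runs on its own, and the variable names are introduced here only for illustration.

```r
# Rebuild the example data so this block is self-contained
set.seed(1)
df <- as.data.frame(matrix(runif(n = 50, min = 0, max = 1), nrow = 10))
names(df) <- c('col1', 'col2', 'col3', 'col4', 'col5')

# Convert a covariance matrix into the matching correlation matrix
cor_from_cov <- cov2cor(cov(df))
all.equal(cor_from_cov, cor(df))  # TRUE

# Rank-based alternatives to the default Pearson coefficient
spearman_cor <- cor(df, method = "spearman")
kendall_cor  <- cor(df, method = "kendall")
spearman_cor
```

Spearman's coefficient is the Pearson correlation of the ranked data, so it measures monotonic rather than strictly linear association.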

Content last modified on 24 July 2023.

Contributed by Ni Shi (shi_ni@bentley.edu)