6.2 Correlation
In R, Pearson’s product-moment correlation coefficient between two continuous variables can be estimated using the cor() function. Using the trees data set again, we can determine the correlation coefficient of the association between tree Height and Volume.
data(trees)
str(trees)
## 'data.frame': 31 obs. of 3 variables:
## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
cor(trees$Height, trees$Volume)
## [1] 0.5982497
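To make clear what cor() is estimating, the sketch below recomputes the same coefficient by hand as the covariance of the two variables divided by the product of their standard deviations. The object names x, y, r_manual and r_builtin are illustrative only.

```r
data(trees)

x <- trees$Height
y <- trees$Volume

# Pearson's r is the covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# the built-in estimate for comparison
r_builtin <- cor(x, y)

# TRUE if the two values agree to within numerical tolerance
all.equal(r_manual, r_builtin)
```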
Alternatively, we can produce a matrix of correlation coefficients for all variables in a data frame:
cor(trees)
##            Girth    Height    Volume
## Girth  1.0000000 0.5192801 0.9671194
## Height 0.5192801 1.0000000 0.5982497
## Volume 0.9671194 0.5982497 1.0000000
Note that the matrix is symmetric, so the correlation coefficients are identical in each half. Also, be aware that, although a matrix of coefficients can be useful, a little common sense should be used when applying cor() to data frames with numerous variables. It is not good practice to trawl through these types of matrices in the hope of finding large coefficients without having an a priori reason for doing so. Remember too that the correlation coefficient assumes that associations are linear.
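The linearity assumption is easy to underestimate, so here is a small made-up illustration (the vectors x and y are invented for this sketch and are not part of the trees data). Although y is completely determined by x, the relationship is a symmetric curve rather than a line, and Pearson's coefficient is effectively zero.

```r
# a perfect but non-linear (quadratic) relationship
x <- -10:10
y <- x^2

# Pearson's r only measures linear association,
# so this value is essentially zero despite the exact dependence
r_curve <- cor(x, y)
r_curve
```

Plotting the data with plot(x, y) before calculating a coefficient is a simple way to spot this kind of pattern.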
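If you have missing values in the variables you are trying to correlate, cor() will, by default, return NA rather than a coefficient (many functions in R behave in a similar way). You will either have to remove these observations (be very careful if you do this) or tell R what to do when an observation is missing. A useful argument you can use with the cor() function is use = "complete.obs", which drops any row containing a missing value before the coefficients are calculated. The trees data set has no missing values, so here the result is unchanged.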
cor(trees, use = "complete.obs")
##            Girth    Height    Volume
## Girth  1.0000000 0.5192801 0.9671194
## Height 0.5192801 1.0000000 0.5982497
## Volume 0.9671194 0.5982497 1.0000000
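Because trees is complete, the effect of the use argument is easier to see on a copy with a few values deliberately set to NA. The sketch below does this; the data frame trees_na and the choice of rows 2 and 5 are arbitrary and purely for illustration.

```r
data(trees)

# copy the data and introduce two missing Height values (arbitrary rows)
trees_na <- trees
trees_na$Height[c(2, 5)] <- NA

# default behaviour: the missing values propagate and the result is NA
r_default <- cor(trees_na$Height, trees_na$Volume)
r_default

# complete.obs: rows with a missing value are dropped first
r_complete <- cor(trees_na$Height, trees_na$Volume, use = "complete.obs")

# the same as removing those rows from the original data by hand
r_check <- cor(trees$Height[-c(2, 5)], trees$Volume[-c(2, 5)])
all.equal(r_complete, r_check)

# for a whole data frame, every row containing an NA is removed
cor(trees_na, use = "complete.obs")
```

With a data frame, use = "pairwise.complete.obs" is an alternative that only drops rows missing for the particular pair of variables being correlated, so different coefficients in the matrix may be based on different numbers of observations.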
The function cor() will return the correlation coefficient of two variables, but gives no indication of whether the coefficient is significantly different from zero. To do this you need to use the function cor.test().
cor.test(trees$Height, trees$Volume)
##
## Pearson's product-moment correlation
##
## data: trees$Height and trees$Volume
## t = 4.0205, df = 29, p-value = 0.0003784
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3095235 0.7859756
## sample estimates:
## cor
## 0.5982497
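The printed summary is convenient to read, but the individual values can also be extracted from the object that cor.test() returns. The sketch below stores the result (the name res is arbitrary) and pulls out the estimate, p-value and confidence interval. It also recomputes the t statistic from r and the degrees of freedom as a check on where the reported value comes from.

```r
data(trees)

# store the test result instead of just printing it
res <- cor.test(trees$Height, trees$Volume)

r  <- unname(res$estimate)   # correlation coefficient
df <- unname(res$parameter)  # degrees of freedom (n - 2)
p  <- res$p.value            # p-value
ci <- res$conf.int           # 95 percent confidence interval

# t statistic for a Pearson correlation: r * sqrt(df) / sqrt(1 - r^2)
t_manual <- r * sqrt(df) / sqrt(1 - r^2)
all.equal(t_manual, unname(res$statistic))
```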
Two non-parametric equivalents to Pearson correlation are available within the cor.test() function: Spearman’s rank correlation coefficient and Kendall’s tau coefficient. To use either of these, simply include the argument method = "spearman" or method = "kendall", depending on the test you wish to use. For example:
cor.test(trees$Height, trees$Volume, method = "spearman")
## Warning in cor.test.default(trees$Height, trees$Volume, method = "spearman"):
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: trees$Height and trees$Volume
## S = 2089.6, p-value = 0.0006484
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.5787101
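As a brief sketch of what the rank-based methods are doing, Spearman's rho is simply Pearson's coefficient calculated on the ranks of the data (tied values receive their average rank). The code below checks this and then runs the Kendall version of the test. Setting exact = FALSE asks for the approximate p-value directly, which is what R falls back to when the data contain ties, and so avoids the warning shown above. The object names rho, rho_ranks and kt are illustrative.

```r
data(trees)

# Spearman's rho from cor()
rho <- cor(trees$Height, trees$Volume, method = "spearman")

# the same value from Pearson's r applied to the ranks
rho_ranks <- cor(rank(trees$Height), rank(trees$Volume))
all.equal(rho, rho_ranks)

# Kendall's tau, requesting the approximate p-value explicitly
kt <- cor.test(trees$Height, trees$Volume, method = "kendall", exact = FALSE)
kt$estimate
kt$p.value
```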