Analytical Paleobiology Tutorials

There are multiple ways to carry out Principal Component Analysis in R. However, different functions do not necessarily return identical outputs. We will first examine PCA using a very tiny dataset, so it is easier to understand all objects and their relations.

Compute PCA using base functions

We can compute the three key outputs of PCA (eigenvectors, eigenvalues, and PC scores) using base functions. We will use a small multivariate dataset with 4 observations and 3 variables. Note that when the number of observations exceeds the number of variables, the data are overdetermined. This means that we should be able, with a few exceptions, to find a unique solution when performing eigenvalue decomposition.

Y<- cbind(c(4.2, 3.9, 2.1, 1.8), c(21.2, 19.4, 9.1, 2.4),
          c(8.1, 3.2, 5.1, 2.4)) # a small dataset with numerical values stored as matrix
colnames(Y)<- c("var1","var2", "var3")
rownames(Y)<- c("sam1","sam2","sam3","sam4")
Y # looks like this

##      var1 var2 var3
## sam1  4.2 21.2  8.1
## sam2  3.9 19.4  3.2
## sam3  2.1  9.1  5.1
## sam4  1.8  2.4  2.4

This dataset is suitable for PCA because all variables are numerical and continuous. Also, there are no missing values. Prior to PCA anaysis, we need to center the data. We may even choose to fully standardize our data to ensure that all variables have comparable variances by converting raw data to z-scores. Note that in this specific data example variables vary in dispersion with ‘var2’ having a much greater variance than the other two variables.

apply(Y, 2, var)

##  var1  var2  var3 
##  1.50 78.59  6.42

Covariance-based versus Correlation-based PCA

Covariance PCA - If we simply center the data (i.e., the means of all variables = 0), we will still retain information about variance. In such a case, the analysis is based on covariance and the importance of individual variables will vary as a function of magnitude of their variances.

Correlation PCA - If we z-standardize the values (i.e., center variables and scale their variances the same way), we will conduct correlation PCA, with variance information removed. This is particularly appropriate when variables differ in measurement units (e.g., Kelvins, grams, and kilometers), and consequently, differences in variance are an arbitrary reflection of units of measurement.

(Xcentered <- scale(Y, scale=F))     # center data without scaling (covariance)

##      var1   var2 var3
## sam1  1.2   8.18  3.4
## sam2  0.9   6.38 -1.5
## sam3 -0.9  -3.92  0.4
## sam4 -1.2 -10.62 -2.3
## attr(,"scaled:center")
## var1 var2 var3 
##  3.0 13.0  4.7

(Xscaled <- scale(Y, scale=T))       # center and scale (correlation)

##        var1   var2   var3
## sam1  0.980  0.922  1.342
## sam2  0.735  0.719 -0.592
## sam3 -0.735 -0.443  0.158
## sam4 -0.980 -1.199 -0.908
## attr(,"scaled:center")
## var1 var2 var3 
##  3.0 13.0  4.7 
## attr(,"scaled:scale")
## var1 var2 var3 
## 1.22 8.87 2.53

# let's work with centered but not scaled data (X represents matrix of deviations)
X <- Xcentered                                 
(Xcov <- cov(X))                                 # covariance matrix

##       var1 var2  var3
## var1  1.50 10.6  1.71
## var2 10.61 78.6 13.70
## var3  1.71 13.7  6.42

# covariance matrix can be also computed as an inner product of X scaled by sample size
(Xcov2 <- t(X)%*%X/(nrow(X)-1))

##       var1 var2  var3
## var1  1.50 10.6  1.71
## var2 10.61 78.6 13.70
## var3  1.71 13.7  6.42

We will now use {eigen} function to compute covariance PCA and output eigenvalues, eigenvectors, and PC scores.

# compute PCA using function {eigen}
# (for PCA, input data should be either correlation of covariance matrix)
epca <- eigen(Xcov)         # perfom eigenvalue decomposition of covariance matrix
evectors <- epca$vectors            # output eigenvectors
evalues <- epca$values          # output eigenvalues
evalues                 # eigenvalues

## [1] 82.5271  3.9206  0.0615

Eigenvalues

Eigenvalues provide information about amount of variance accounted for by each new axis or vector (or “principal component” or “PC”). Standard deviations of eigenvalues represent lengths of those vectors. It is a common practice to report proportions of variance represented by each PC. Note that evalues decrease montonically, with each subsequent eigen value being smaller. In this case, the first eigenvalue is huge comparing to all subsequent eigenvalues, thus indicating that nearly all variation occurs along a single axis.

Note also that the sum eigenvalues equals the sum of the variances of the three original variables. That is, no information is lost when translating our data into a new coordinate system defined by principal components.

paste('total variance of eigenvalues =', round(sum(evalues), 10))

## [1] "total variance of eigenvalues = 86.5091666667"

paste('total variance of original variables =', round(sum(diag(Xcov)), 10))

## [1] "total variance of original variables = 86.5091666667"

Eigenvectors

evectors  # eigenvectors

##        [,1]    [,2]    [,3]
## [1,] -0.131 -0.0592  0.9896
## [2,] -0.975 -0.1720 -0.1398
## [3,] -0.178  0.9833  0.0352

Eigenvectors can be used to compute PC scores. Principal Components (1,2,…N) are linear combinations of variables (x₁, x₂, …, x_N) multiplied by coeffcients (a₁, a₂, …, a_N). Eigenvectors store values of those coefficients. In other words, eigenvectors are weights used to translate original variable into PC scores. Those coefficients vary from -1 to 1. When coefficient = 0 then a given variable is uncorrelated with (or orthogonal to) a given Principal Component. If coefficient is 1 (or -1) a perfect correlation of a given variable with a given PC is indicated. PC scores are new coordinates for our variables and they should fulfill a set of specific requirements (see below). In this example, var2 has a very high coefficient value for PC1 approaching -1 (-0.975) indicating that PC1 is mostly reflecting var2, which also happens to be the variable with a much greater variance than any other variable. That is, the resulting PCA ordination is primarily driven by var2 because of its huge variance. This may be desirable if differences in variance across variables are meaningful and important, but in many cases this outcome is not desirable because we wish all variables to inform the ordination.

If we decide to nullify the effect of variance, we can use z-standardized data and carry out Correlation PCA.

epca2 <- eigen(cov(Xscaled))            # perfom eigenvalue decomposition of correlation matrix
evectors2 <- epca2$vectors          # output eigenvectors
evalues2 <- epca2$values            # output eigenvalues
evectors2         # eigenvectors (correlation PCA)

##       [,1]   [,2]    [,3]
## [1,] 0.609 -0.399  0.6860
## [2,] 0.621 -0.299 -0.7247
## [3,] 0.494  0.867  0.0653

evalues2                    # eigenvalues (correlation PCA)

## [1] 2.4441 0.5358 0.0201

In this case, and in contrasts to the above results for Covariance PCA, all variables have much more comparable loading for PC1 (1st column of evectors2). Also, the eigenvalue of PC1 is accounting for much less variance in the data comparing to Covariance PCA. All variables are now playing a comparable role in the ordination. An alternative way for making all variables influential is to employ some type of mathematical transformation to reduce (but not eliminate) differences in variance across variables. For example, Covariance PCA can be run on log-transformed data, an approach used in the Traditional Multivariate Morphometrics approach.

Because Correlation PCA is based on z-standardized variables (i.e., for all variables, variance = 1), the sum of eigenvalues will NOT reflect the total variance of original data, but instead will reflect the number of variables. In this case, because data include three variables, sum of eigenvalues should equal 3, which is indeed the case.

sum(evalues2) # for correlation PCA eigenvalues sum up to the total number of variables in the data.

## [1] 3

Eigenvalue decomposition (exploring the rules of matrix algebra behind PCA)

For those more inclined to explore the rules of matrix algebra that underlie PCA, please note that the eigenvalue decomposition, which relies on matrix inversion (a multivariate equivalent of division), is designed to ensure that a set of certain specific conditions are met.

# S%*%U = U%*%Lambda is the aim of eigenvalue decomposition
# NOTE %*% denotes matrix multiplication
# "S" is "Xcov" (covariance matrix) and "U" is "evectors" (eigenvectors)
# Lambda "eval" is a square matrix the diagonal of which represents eigenvalues
(eval <- diag(evalues))     # create eigenvalue matrix

##      [,1] [,2]   [,3]
## [1,] 82.5 0.00 0.0000
## [2,]  0.0 3.92 0.0000
## [3,]  0.0 0.00 0.0615

(Su <- Xcov%*%evectors)     # Su is S%*%U (Xcov%*%evectors here)

##       [,1]   [,2]     [,3]
## var1 -10.8 -0.232  0.06081
## var2 -80.5 -0.674 -0.00859
## var3 -14.7  3.855  0.00216

(uL <- evectors%*%eval)     # uL is U%*%Lambda (evectors%*%eval here)

##       [,1]   [,2]     [,3]
## [1,] -10.8 -0.232  0.06081
## [2,] -80.5 -0.674 -0.00859
## [3,] -14.7  3.855  0.00216

# Note that uL = Su (i.e., S%*%U = U%*%Lambda) 
# Also, S=u%*%Lambda%*%u'
(S <- evectors%*%eval%*%t(evectors))

##       [,1] [,2]  [,3]
## [1,]  1.50 10.6  1.71
## [2,] 10.61 78.6 13.70
## [3,]  1.71 13.7  6.42

Xcov                    # note that S equals Xcov (i.e., covariance matrix)

##       var1 var2  var3
## var1  1.50 10.6  1.71
## var2 10.61 78.6 13.70
## var3  1.71 13.7  6.42

# compute PC scores of rows (samples) using eigenvectors
# (note that centered data 'X' rather than raw data 'Y' should be used)
(escores<- X%*%evectors)

##       [,1]   [,2]    [,3]
## sam1 -8.74  1.866  0.1639
## sam2 -6.07 -2.625 -0.0536
## sam3  3.87  1.122 -0.3277
## sam4 10.93 -0.363  0.2174

# There variance should be PC1 > PC2 > PC3 (indeed, they are)
round(cov(escores),3)

##      [,1] [,2]  [,3]
## [1,] 82.5 0.00 0.000
## [2,]  0.0 3.92 0.000
## [3,]  0.0 0.00 0.061

# PC1, PC2, PC3 scores should be uncorrelated
# In other words, the correlation matrix of PC scores should be an identity matrix (indeed, it is)
round(cor(escores),3)

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

#sum of variances of PC scores should equal sum of variances of orignal variables
round(apply(X,2,var),3)

##  var1  var2  var3 
##  1.50 78.59  6.42

round(apply(escores,2,var),3)

## [1] 82.527  3.921  0.061

sum(apply(X,2,var))

## [1] 86.5

sum(apply(escores,2,var))

## [1] 86.5

As demonstrated above, thus, all requirements are met here: (1) total variance of eigenvalues equals total variance of data (2) PC scores are uncorrelated (3) Eigenvalues of PC vectors decrease monotonically (PC1 > PC2 > PC3)

Oridnation plots (data visualization is the primary goal of PCA)

Finally, we now can plot the data to explore them in the new rotated space.

# plot the resulting ordination
pcvar <- round(evalues/sum(evalues),2) # let's first compute the proportion of variance accounted by each PC
par(mar=c(5,5,1,1))
plot(escores[,1], escores[,2], xlab=paste("PC1 [", pcvar[1],"]", sep = ''), ylim=c(-3,3),
     ylab=paste("PC2 [", pcvar[2], "]",sep=''), pch = 16, col = "darkblue")
     text(escores[,1], escores[,2], row.names(escores), cex = 0.6, pos = 3, col = "blue")

The figure organizes our samples in the PCA ordination space and thus allows us to assess visually how similar or different these samples are from each other. How to interpret (or not interpret) those results will be discussed in a subsequent tutorial, in which real data are used to illustrate the utility and pitfalls of PCA ordinations.

Various functions for computing PCA are not exactly the same

There are multiple functions that output PCA in a single step. They are much simpler an option than computing it from scratch as done above. Some of the common functions are exemplified below.

# compute PCA using function {prcomp}, this function uses svd (singular value decomposition) rather than {eigen}
rpca <- prcomp(X, center=T, scale=F) # scale=T returns PCA based on correlation matrix
rvalues<- rpca$sdev^2           # output squared eigenvalues (need to be squared to convert to variances)
rvectors <- rpca$rotation       # output eigenvectors
rscores <- rpca$x               # output scores

# compute PCA using function {princomp}, this function uses {eigen}.
# Note: {princomp} computes population variance (SSQ/n), so the eigenvalues ("r2values")
# are slightly smaller than those returned by other variants illustrated here (SSQ/(n-1))
# PCA using{princomp} (cor=T returns PCA based on correlation matrix)
r2pca <- princomp(x=X, cor=F, scores=T) 
r2vectors <- r2pca$loadings[,1:3]

round(r2pca$loadings, 2)

## 
## Loadings:
##      Comp.1 Comp.2 Comp.3
## var1  0.13          0.99 
## var2  0.98   0.17  -0.14 
## var3  0.18  -0.98        
## 
##                Comp.1 Comp.2 Comp.3
## SS loadings     1.010  0.993  1.001
## Proportion Var  0.337  0.331  0.334
## Cumulative Var  0.337  0.668  1.001

r2values <- r2pca$sdev^2
r2scores <- r2pca$scores

### COMPARE OUTPUTS
evectors # eigenvectors using eigen function

##        [,1]    [,2]    [,3]
## [1,] -0.131 -0.0592  0.9896
## [2,] -0.975 -0.1720 -0.1398
## [3,] -0.178  0.9833  0.0352

rvectors # eigenvectors using prcomp" function

##        PC1     PC2     PC3
## var1 0.131 -0.0592 -0.9896
## var2 0.975 -0.1720  0.1398
## var3 0.178  0.9833 -0.0352

r2vectors # eigenvectors using princomp" function

##      Comp.1  Comp.2  Comp.3
## var1  0.131  0.0592  0.9896
## var2  0.975  0.1720 -0.1398
## var3  0.178 -0.9833  0.0352

evalues # eigenvalues using eigen function

## [1] 82.5271  3.9206  0.0615

rvalues # eigen values using "prcomp"

## [1] 82.5271  3.9206  0.0615

r2values # eigen values using "princomp" (smaller values - see comment above)

##  Comp.1  Comp.2  Comp.3 
## 61.8953  2.9405  0.0461

round(evalues/sum(evalues),3) # proportion of variance accounted by PCs (eigen)

## [1] 0.954 0.045 0.001

round(rvalues/sum(rvalues),3) # proportion of variance accounted by PCs (prcomp)

## [1] 0.954 0.045 0.001

round(r2values/sum(r2values),3) # proportion of variance accounted by PCs (princomp)

## Comp.1 Comp.2 Comp.3 
##  0.954  0.045  0.001

escores # pc-scores of observations (eigen)

##       [,1]   [,2]    [,3]
## sam1 -8.74  1.866  0.1639
## sam2 -6.07 -2.625 -0.0536
## sam3  3.87  1.122 -0.3277
## sam4 10.93 -0.363  0.2174

rscores # pc-scores of observations (prcomp)

##         PC1    PC2     PC3
## sam1   8.74  1.866 -0.1639
## sam2   6.07 -2.625  0.0536
## sam3  -3.87  1.122  0.3277
## sam4 -10.93 -0.363 -0.2174

### PLOTS
PCscores=scale(escores)

# assign axis labels
xlab=paste("PC1 [",100*round(evalues[1]/sum(evalues),3),"%]",sep="")
ylab=paste("PC2 [",100*round(evalues[2]/sum(evalues),3),"%]",sep="")
zlab=paste("PC3 [",100*round(evalues[3]/sum(evalues),3),"%]",sep="")

# plot a scatter plot using "sscores"
par(mfrow=c(1,3), mar=c(5,5,1,1), pty='square')
plot(PCscores[,1],PCscores[,2], xlab=xlab, ylab=ylab, pch=16, col="red")
plot(PCscores[,1],PCscores[,3], xlab=xlab, ylab=zlab, pch=16, col="red")
plot(PCscores[,2],PCscores[,3], xlab=ylab, ylab=zlab, pch=16, col="red")

Polarity of PC scores

It is important to realize that the sign of PC scores is arbitrary. For example, PC1 is a vector that aligns with the direction of maximum variability in the data. However, there is no objective way to decide upfront the polarity of this vector (i.e., which direction is negative and which is positive). For this reason, different algorithms may produce ordinations with reversed polarities (this is also a common problem for other ordination methods, and not just PCA). To illustrate this, we can compare the scores for PC1 produced by functions “eigen” and “princomp”:

cbind('eigen'=escores[,1], 'PC1-PRINCOMP'=r2scores[,1])

##      eigen PC1-PRINCOMP
## sam1 -8.74         8.74
## sam2 -6.07         6.07
## sam3  3.87        -3.87
## sam4 10.93       -10.93

The two algorithms produced scores that are mirror images of each other, but effectively capture the same information. If you need to compare multiple ordination outputs and polarity happens to vary, you can simply change the sign of scores to make all vectors polarized the same way.

What’s next…

This tutorial highlights some of the basic principles of PCA in the context of various PCA-related functions available in R, but does not fully explore strategies for interpreting PCA results (please see related tutorials).

Comments/Questions/Corrections: Michal Kowalewski (kowalewski@ufl.edu)

Peer-review: This document has NOT been peer-reviewed.

Our Sponsors: National Science Foundation (Sedimentary Geology and Paleobiology Program), National Science Foundation (Earth Rates Initiative), Paleontological Society, Society of Vertebrate Paleontology

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.