01 Feb 2021
Principal Component Analysis part 2
In a previous instalment of this blog, we covered the “ABC” of PCA (Principal Component Analysis) by examining how this method would be applied to a very simple example: five points on a plane:
As described in that article, principal components (PCs) are a new set of variables obtained from a combination of the original variables. The steps involved in PCA are the following:
 Calculate the covariance matrix, C, of the data set
 Extract the eigenvalues (λ) and eigenvectors () of the covariance matrix
 Normalise eigenvectors to calculate the loadings (coefficients) of the original variables onto each principal component
 Calculate the scores (coordinates) of the original data points in the new PC space
A multivariate data set containing n variables can be thought of as a scatter of points in an ndimensional space. Unfortunately, if the number of variables is greater than three, there is no easy way to visualise these points. The matrix C is an (n x n) matrix (i.e. a matrix of rank n) of the covariances measured between each pair of variables. Any matrix of rank n can be described by an arbitrary set of n orthogonal vectors, called basis vectors. One particular set of basis vectors of the C matrix, called ‘eigenvectors’ of C, describe the directions in which the data set points are spread or scattered.
Each eigenvector has an associated eigenvalue, a number which corresponds to the data variance measured along the direction of the eigenvector. The eigenvector with the largest eigenvalue, therefore, points in the direction of maximum spread. The eigenvector with the second largest eigenvalue indicates the direction of second highest scatter, etc. These orthogonal eigenvectors define the new axes PC1, PC2, …, PCn. Consequently, the first PC axis has been constructed so that it describes as much of the variance in the data set as possible, and successive PCs describe increasingly smaller proportions of variance.
The advantage of representing the data points in this new set of axes is that, in many cases, it is possible to simplify an abstract multidimensional space down to two or three easy to visualise dimensions. By creating a scatter plot of PC1, PC2 and PC3 – the directions of maximum variance – the simplified graph will still be able to convey a large amount of the information contained in the multidimensional space.
A Slightly More Complex Data Set
AAindex is an impressive repository of amino acid data compiled from hundreds of peerreviewed papers, and it is a good source of data to learn multivariate analysis. To describe more advanced aspects of PCA, I decided to use the following set of variables from AAindex:
Table 1:Selected numerical descriptors of proteinogenic amino acids.
Name  Abbr.  Symbol  MW  Residue MW (H_{2}O) 
pKa  pKb  pKx  pl  logD (pI)  Polarisability  Normalised Volume 
Alanine  Ala  A  89.10  71.08  2.34  9.69  6.00  2.841  8.478  1.00  
Arginine  Arg  R  174.20  156.19  1.82  8.99  12.48  10.76  3.156  18.035  6.13 
Asparagine  Asn  N  132.12  114.11  2.02  8.80  5.41  4.288  11.810  2.95  
Aspartic acid  Asp  D  133.11  115.09  1.88  9.60  3.65  2.77  3.504  11.418  2.78 
Cysteine  Cys  C  121.16  103.15  1.92  8.35  8.18  5.05  2.792  11.482  2.43 
Glutamic acid  Glu  E  147.13  129.12  2.10  9.67  4.25  3.22  3.241  13.488  3.78 
Glutamine  Gln  Q  146.15  128.13  2.17  9.13  5.65  4.001  13.866  3.95  
Glycine  Gly  G  75.07  57.05  2.35  9.78  5.97  3.409  6.636  0.00  
Histidine  His  H  155.16  137.14  1.82  9.17  6.00  7.59  3.293  14.926  4.66 
Isoleucine  Ile  I  131.18  113.16  2.36  9.68  6.02  1.508  14.268  4.00  
Leucine  Leu  L  131.18  113.16  2.36  9.60  5.98  1.586  14.325  4.00  
Lysine  Lys  K  146.19  128.18  2.16  9.18  10.53  9.74  3.215  16.009  4.77 
Methionine  Met  M  149.21  131.20  2.28  9.21  5.74  2.189  15.529  4.43  
Phenylalanine  Phe  F  165.19  147.18  2.16  9.18  5.48  1.184  17.197  5.89  
Proline  Pro  P  115.13  97.12  1.95  10.64  6.30  2.569  11.469  2.72  
Serine  Ser  S  105.09  87.08  2.19  9.21  5.68  3.888  9.448  1.60  
Threonine  Thr  T  119.12  101.11  2.09  9.10  5.66  3.471  11.008  2.60  
Tryptophan  Trp  W  204.23  186.22  2.43  9.44  5.89  1.085  21.197  8.08  
Tyrosine  Tyr  Y  181.19  163.18  2.20  9.11  10.07  5.66  1.489  18.204  6.47 
Valine  Val  V  117.15  99.13  2.32  9.62  5.96  1.953  12.238  3.00 
In this table,
 MW is the molecular weight of the amino acid
 Residue MW is the molecular weight after peptide condensation (after loss of H_{2}O)
 pKa, pKb and pKx refer to the acidic (Cterminus), basic (Nterminus) and side chain (where applicable)
 pI is the isoelectric point of the amino acid
 logD (pI) is the distribution coefficient measured at the isoelectric point
 Polarisability is related to the dipole moment of the amino acid
 Normalised Volume refers to the bulkiness of the side chain, with glycine (side chain: H) having an arbitrary value of zero
Dealing with Missing Data
If we try to calculate PCA on this data using a statistics package, we are likely to get an error. This is caused by the missing values in the pKx column, which prevent us from calculating the covariance matrix:
Where:
If any of the x _{i,k} values are missing, the programme will not know how to calculate the covariance matrix. To resolve the problem, we can simply eliminate the variable form the database. Alternatively, if we believe that the information is critical to the analysis, we can try to “pad” the missing values. A quick look at a pKa table shows that pKa values for alcohols are typically estimated to be above 15, above 20 for amides, and above 40 for hydrocarbons. We could therefore fill the blanks with a constant value (e.g. 20) to indicate that these species have neutral side chains.
After padding the missing data points, the resulting covariance matrix is:
Table 2: Covariance matrix of nine selected variables, corrected for missing values.
MW  Residue MW (H_{2}O) 
pKa  pKb  pKx  pl  logD (pI)  Polarisability  Normalised Volume 

MW  952.538  952.599  0.706  4.285  51.409  11.138  12.881  106.491  58.566 
Residue MW (H_{2}O) 
952.599  952.661  0.706  4.286  51.411  11.139  12.881  106.497  58.569 
pKa  0.706  0.706  0.037  0.024  0.742  0.048  0.094  0.010  0.003 
pKb  4.285  4.286  0.024  0.216  0.660  0.121  0.096  0.384  0.199 
pKx  51.409  51.411  0.742  0.660  38.739  1.051  1.291  3.267  1.858 
pI  11.138  11.139  0.048  0.121  1.051  3.130  0.015  2.087  1.027 
logD (pI)  12.881  12.881  0.094  0.096  1.291  0.015  0.967  1.979  1.053 
Polarisability  106.491  106.497  0.010  0.384  3.267  2.087  1.979  12.702  6.880 
Normalised Volume 
58.566  58.569  0.003  0.199  1.858  1.027  1.053  6.880  3.758 
Highly Correlated Variables
Remember from the first instalment that the elements in the diagonal are the variances of the variables:
It is very clear that the first two variables (MW and Residue MW) have much larger variances than the rest. We can predict that the first principal component will have very large loadings on these two variables, whereas the rest will mostly contribute to higher order PCs.
Notice also the very high covariance (952.599) between MW and Residue MW. What does this mean? Calculating the correlation between MW and Residue MW results in an : from an information perspective, the two variables are essentially identical and one of them could be safely removed without any loss in data quality.
We confirm these predictions after carrying out PCA and observing the eigenvalues and loadings (only the first four PCs are shown for brevity):
Table 3: Eigenvalues of PCA on nine selected variables.
Eigenvalue  λ  % of Total 
1  1923.84  97.92% 
2  36.38  1.85% 
3  3.16  0.16% 
4  1.11  0.06% 
Table 4: Loadings of first four principal components.
Variable  Loading PC1  Loading PC2  Loading PC3  Loading PC4 
MW  0.704  0.021  0.025  0.044 
Residue MW  0.704  0.021  0.025  0.045 
pKa  0.001  0.020  0.021  0.078 
pKb  0.003  0.012  0.025  0.172 
pKx  0.039  0.993  0.068  0.089 
pI  0.008  0.051  0.954  0.137 
logD (pI)  0.010  0.057  0.055  0.757 
Polarisability  0.079  0.074  0.264  0.550 
Norm Volume  0.043  0.039  0.101  0.241 
Recall from the first instalment that each eigenvalue indicates the variance over the corresponding principal component. Table 3 shows that PC1 accounts for nearly 98% of the variation, whereas the rest of PCs only make marginal contributions to the description of the data set.
Table 4, on the other hand, shows the loadings of the original variables on the first few PCs. It is clear that PC1 receives the greatest contributions from MW and Residual MW, whereas the effects of the other variables on this PC are at least an order of magnitude smaller. On the other hand, it is interesting to note that the loadings of MW and Residual MW are practically identical on all principal components: the two variables contribute in the same way to the new dimensions. Since the main objective of PCA is the reduction of data dimensionality, it is clear that we can afford to make one of these variables redundant, simplifying the data set without loss of information. Further analysis will be carried out on just eight of the variables:
Table 5: Eight selected variables for further PCA.
Name  Abbr.  Symbol  MW  pKa  pKb  pKx  pl  logD  Polar  NV 
Alanine  Ala  A  89.10  2.34  9.69  20.00  6.00  2.841  8.478  1.00 
Arginine  Arg  R  174.20  1.82  8.99  12.48  10.76  3.156  18.035  6.13 
Asparagine  Asn  N  132.12  2.02  8.80  20.00  5.41  4.288  11.810  2.95 
Aspartic acid  Asp  D  133.11  1.88  9.60  3.65  2.77  3.504  11.418  2.78 
Cysteine  Cys  C  121.16  1.92  8.35  8.18  5.05  2.792  11.482  2.43 
Glutamic acid  Glu  E  147.13  2.10  9.67  4.25  3.22  3.241  13.488  3.78 
Glutamine  Gln  Q  146.15  2.17  9.13  20.00  5.65  4.001  13.866  3.95 
Glycine  Gly  G  75.07  2.35  9.78  20.00  5.97  3.409  6.636  0.00 
Histidine  His  H  155.16  1.82  9.17  6.00  7.59  3.293  14.926  4.66 
Isoleucine  Ile  I  131.18  2.36  9.68  20.00  6.02  1.508  14.268  4.00 
Leucine  Leu  L  131.18  2.36  9.60  20.00  5.98  1.586  14.325  4.00 
Lysine  Lys  K  146.19  2.16  9.18  10.53  9.74  3.215  16.009  4.77 
Methionine  Met  M  149.21  2.28  9.21  20.00  5.74  2.189  15.529  4.43 
Phenylalanine  Phe  F  165.19  2.16  9.18  20.00  5.48  1.184  17.197  5.89 
Proline  Pro  P  115.13  1.95  10.64  20.00  6.30  2.569  11.469  2.72 
Serine  Ser  S  105.09  2.19  9.21  20.00  5.68  3.888  9.448  1.60 
Threonine  Thr  T  119.12  2.09  9.10  20.00  5.66  3.471  11.008  2.60 
Tryptophan  Trp  W  204.23  2.43  9.44  20.00  5.89  1.085  21.197  8.08 
Tyrosine  Tyr  Y  181.19  2.20  9.11  10.07  5.66  1.489  18.204  6.47 
Valine  Val  V  117.15  2.32  9.62  20.00  5.96  1.953  12.238  3.00 
Step by Step Principal Component Analysis
In my previous instalment, I started PCA by subtracting the averages of the original variables from each data point, to obtain a table of centred values. This transformation is convenient when doing manual calculations, but it is not universally performed by statistical software. For the remainder of this article I will use Minitab 16, which does not centre the variables. If I wished to do so, I should select the command ‘Standardize’ from the ‘Calc’ menu, and select the option ‘Subtract mean’. Alternatively, the data could be transformed in a spreadsheet and copied into the software.
Covariance Matrix
The covariance calculated on the eight variables in Table 5 is displayed below. After dispensing with Residue MW, the rest of covariances remain the same as Table 2. MW still has a very high variance and we expect this variable to weigh heavily on PC1.
Table 6: Covariance matrix of the eight selected variables.
MW  pKa  pKb  pKx  pl  logD  Polar  NV  
MW  952.538  0.706  4.285  51.409  11.138  12.881  106.491  58.566 
pKa  0.706  0.037  0.024  0.742  0.048  0.094  0.010  0.003 
pKb  4.285  0.024  0.216  0.661  0.121  0.096  0.384  0.199 
pKx  51.409  0.742  0.660  38.739  1.051  1.291  3.267  1.858 
pl  11.138  0.048  0.121  1.051  3.130  0.015  2.087  1.027 
logD  12.881  0.094  0.096  1.291  0.015  0.967  1.979  1.053 
Polar  106.491  0.010  0.384  3.268  2.087  1.979  12.702  6.880 
NV  58.566  0.003  0.199  1.858  1.027  1.053  6.880  3.758 
Eigenvectors of the Covariance Matrix and Scree Test
The eight eigenvectors of Table 6 are displayed below:
Eigenvalue  λ  % 
1  971.225  95.96% 
2  36.344  3.59% 
3  3.159  0.31% 
4  1.105  0.11% 
5  0.164  0.02% 
6  0.071  0.01% 
7  0.013  0.00% 
8  0.007  0.00% 
The variance ascribed to the first PC has dropped somewhat, but PC1 still explains more than 95% of data variability, with the rest being almost completely accounted for by PC2. It seems plausible that retaining these two principal components and discarding the rest would be a very efficient way to describe the original data set. It could also be argued that PC1 alone is sufficient, and the analysis could be reduced to a single dimension. The decision to retain one, two, or more PCs is somewhat subjective. It should be based on a deeper analysis of loadings and scores, but it is often simply decided on what percentage of variance one wishes to preserve. A graphical tool called a “scree plot” is often used to aid in this decision:
Figure 1: Scree plot of the 8 eigenvalues.
A scree plot displays the eigenvalues from highest to lowest value (therefore, in decreasing order of variance). Very often scree plots – like the one on Figure 1 – have an “elbow” beyond which variance levels off. The principal components “above” the elbow are generally retained, as they account for the greatest amount of variability. In the example above, only PC1 would be retained.
Loadings of the Principal Components
Each eigenvalue has an associated eigenvector, which forms the basis of a principal component. The resulting principal components are unique linear combinations of the eight original variables, and the coefficients of the variables in the linear combinations are called the “loadings” of the PCs. These are shown on Table 7.
Table 7: Loadings of eight selected variables on PC1 to PC8.
Variable  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
MW  0.990  0.043  0.050  0.089  0.019  0.072  0.019  0.036 
pKa  0.001  0.020  0.021  0.079  0.034  0.160  0.983  0.010 
pKb  0.004  0.012  0.025  0.173  0.977  0.065  0.010  0.107 
pKx  0.055  0.992  0.068  0.089  0.001  0.010  0.012  0.009 
pl  0.012  0.052  0.954  0.134  0.055  0.237  0.071  0.064 
logD  0.013  0.057  0.057  0.758  0.114  0.615  0.035  0.163 
Polar  0.111  0.077  0.261  0.547  0.157  0.728  0.162  0.185 
NV  0.061  0.040  0.099  0.239  0.064  0.022  0.029  0.960 
PC1 has a very strong loading from MW, and much weaker contributions from Polarity, normalised Volume, and pKx, the pK of the amino acid side chain. pKx is the major contributor to PC2, which has much lower coefficients for logD, pI and Normalised Value. pI is the major contributor to PC3, which also has a significant coefficient for Polarity.
These contributions are often visualised on “loading plots”, such as Figure 2 for PC1 and PC2, which simplify the interpretation of the principal components being plotted in terms of the original variables:
Figure 2: Loading plot for PC1 versus PC2.
In this graph, the horizontal axis indicates the contribution of each variable to PC1, and the vertical axis represents the contribution to PC2. Those variables close to zero (lower left corner of the plot) have small loadings on both principal components. MW has a strong loading on PC1 (close to one), but very weak on PC2 (close to zero). The opposite is true for pKx and PC2.
In a very real sense, PC1 is equal to MW, “complemented” with Polarisability and Nominal Volume. If we plot the amino acids along PC1 (the “scores” of the amino acids on PC1, which we will plot in the next section) we should see small, light, nonpolar amino acids on the lower values; whereas bulkier and more polarisable amino acids should have higher PC1 scores.
PC2 can be described as pKx “decorated” with pI and Polarisability. This axis should allow us to differentiate between neutral and polar or ionisable amino acids, as it encodes information related to the chemical properties of the side chains. PC3 should further enhance this classification and allow us to separate neutral, acidic, and basic amino acids.
Scores
After computing the loadings of the original variables on the principal components, the twenty amino acids can be “mapped” onto the new system of coordinates, by applying the linear combinations that define each PC to the values of each amino acid. The new coordinates are called “scores”.
For example, PC1 is defined as:
The score of Alanine on PC1 is therefore:
The complete table of scores is shown below:
Name  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
Alanine  88.12  24.66  2.13  5.88  10.49  0.92  2.74  4.96 
Arginine  174.23  21.99  6.01  7.34  10.55  0.89  2.53  5.01 
Asparagine  131.19  26.70  0.61  8.60  10.18  1.22  2.53  4.90 
Aspartic acid  132.99  10.40  1.03  6.42  10.78  1.20  2.53  4.93 
Cysteine  120.93  14.52  1.40  5.79  9.35  1.00  2.46  5.01 
Glutamic acid  147.14  11.84  0.72  6.18  10.84  1.49  2.68  4.94 
Glutamine  145.37  27.54  0.75  8.22  10.49  1.54  2.61  4.93 
Glycine  73.95  23.84  2.25  6.30  10.60  0.93  2.79  4.98 
Histidine  155.26  14.28  3.42  6.78  10.59  0.88  2.57  4.89 
Isoleucine  130.63  27.10  1.80  4.72  10.41  1.36  2.57  4.90 
Leucine  130.64  27.10  1.78  4.75  10.33  1.45  2.55  4.88 
Lysine  146.28  18.60  5.89  5.97  10.38  1.76  2.66  4.85 
Methionine  148.64  27.93  1.06  6.09  10.19  1.43  2.56  5.20 
Phenylalanine  164.75  28.84  0.54  5.46  10.17  0.95  2.44  4.83 
Proline  114.34  26.11  2.05  5.84  11.55  1.03  2.34  4.98 
Serine  104.08  25.36  1.42  7.47  10.32  1.16  2.65  4.90 
Threonine  118.22  26.14  1.19  7.33  10.25  1.03  2.54  4.80 
Tryptophan  203.99  30.94  0.24  6.12  10.67  1.01  2.76  4.95 
Tyrosine  181.29  19.77  0.93  5.57  10.31  0.80  2.73  4.90 
Valine  116.45  26.27  1.83  5.17  10.39  1.14  2.60  4.89 
As expected, the scores on PC1 for the amino acids are not very different from their molecular weights. The small differences are the combined effects of the other variables on the calculation of PC1. However, we cannot give PC1 units of “g/mol”, since it is a combination of eight distinct physical magnitudes.
The scree plot and eigenvalues table suggested that PC1 would suffice to describe 96% of the information contained in the original data set. The graphical representation of the twenty amino acids on PC1 is displayed on the following figure:
Figure 3: PC1 scores of twenty amino acids.
We predicted that small, neutral amino acids would be placed on the lower end of PC1, and increasingly bulkier and more polarisable species would appear at higher PC1 values. Whilst this is correct, PC1 alone is not able to separate the different chemical categories of amino acids, represented by the coloured icons:
The reason is clear: Molecular Weight values are so much bigger than the rest of variables, that they dominate the computation of the first principal components. Although MW accounts for a very large amount of variance, it does not encode the chemical information reflected in the other variables. This subtle information, which could arguably be of high value in many situations, is “displaced” to the lower rank principal components (PC2, PC3, etc.)
When deciding which principal components should be retained and which discarded, the scree plot rule is somewhat simplistic. It should always be supplemented with a brief exploration of loadings and scores to better understand how the information is being represented in the new PC space.
Standardised Variables and the Correlation Matrix
Most statistical packages offer the option to standardise the original variables as the first step prior to PCA. Standardisation, or normalisation, consists in calculating the mean and standard deviation of each variable, and use the to transform the data points:
Where and are the mean and standard deviation of variable i, and x_{i,j} is the value of variable i for object j. For example, the variable MW would be standardised as follows:
Standardisation is generally used when there are big differences in absolute values and (or) ranges between the variables. Our data set is a good example: MW values are one order of magnitude greater than those of NV; and they range between 75.07 g/mol and 204.23 g/mol, whereas the range of NV is much smaller, 0 – 8.08 units. Consequently, as we have seen in previous sections, MW will naturally have much more weight on the calculation of the first few PCs than NV – which may or may not be a desirable effect. By subtracting the mean and dividing by the standard deviation, these large differences are negated: all standardised variables will have an average of zero, and a variance of 1, which ensures they have equal weight on the calculation of PCs.
After standardising the data, the covariance matrix becomes a correlation matrix: diagonal elements are 1 and element above and below the diagonal indicate the pairwise correlation between the variables:
Table 9: Correlation matrix for eight selected variables (standardised).
MW  pKa  pKb  pKx  pI  logD  Polar  NV  
MW  1.000  0.119  0.298  0.268  0.204  0.425  0.968  0.979 
pKa  0.119  1.000  0.268  0.621  0.141  0.500  0.015  0.007 
pKb  0.298  0.268  1.000  0.228  0.147  0.209  0.232  0.221 
pKx  0.268  0.621  0.228  1.000  0.095  0.211  0.147  0.154 
pI  0.204  0.141  0.147  0.095  1.000  0.008  0.331  0.299 
logD  0.425  0.500  0.209  0.211  0.008  1.000  0.565  0.552 
Polar  0.968  0.015  0.232  0.147  0.331  0.565  1.000  0.996 
NV  0.979  0.007  0.221  0.154  0.299  0.552  0.996  1.000 
Scree Plot
The eigenvalue table and scree plot show a remarkable departure from the previous analysis:
Eigenvalue  λ  % Variance  % Additive 
1  3.469  43.36%  43.36% 
2  2.089  26.11%  69.47% 
3  1.057  13.22%  82.68% 
4  0.782  9.77%  92.46% 
5  0.351  4.39%  96.84% 
6  0.247  3.08%  99.93% 
7  0.004  0.05%  99.98% 
8  0.001  0.02%  100.00% 
Figure 4: Scree plot of eight principal components (standardised variables)
Calculated on standardised variables, the first principal component only contains 43% of the variance. 26% is ascribed to PC2, 13% to PC3, etc. it is clear that we will need to retain a larger number of PCs to represent the system. However, the scree plot does not have the sharp “elbow” we saw in the previous calculation. How is a decision made in these cases? Although a deeper analysis of loadings and scores is always a sound idea, it is common in practice to apply a “Pareto” criterion, and retain those principal components that accumulate 80% or more of the variance. In this example, PC1, PC2 and PC3 could be considered satisfactory.
Loadings
The loadings chart for PC1 and PC2, and the loadings table for the first three principal components are shown below:
Figure 5: Loading chart for PC1 and PC2
Table 10: Loadings for PC1, PC2, PC3 (standardised variables)
Variable  PC1  PC2  PC3 
MW  0.519  0.072  0.111 
pKa  0.016  0.614  0.030 
pKb  0.157  0.364  0.254 
pKx  0.107  0.517  0.455 
pI  0.184  0.081  0.816 
logD  0.304  0.458  0.222 
Polar  0.534  0.037  0.000 
NV  0.532  0.029  0.029 
PC1 is almost equally shared between MW, Polarisability, and Normalised Volume: the factors that were secondary in our original analysis have now become much more prominent. PC2 is composed of pKa, followed closely by pKx and logD. pKb has a smaller, but significant, contribution. PC3 is still strongly dominated by pI, but it also receives a strong contribution from pKx.
It should be noted that these variables have negative loadings: as the variables increase in value, PC2 and PC3 would decrease. It is the magnitude of the loading, regardless of its sign, that determines which variables are most strongly associated with each PC.
These combinations the variables seem almost natural. It is often said that PCA reveals the hidden structure of a data set, and can uncover latent variables that may have not been initially obvious. For example, it seems appropriate to assimilate PC1 to the “bulk”; PC2 to the “polarity”; and PC3 to the “acidbase character” of the amino acids. We might even prefer these more descriptive names, instead of PC1, PC2 and PC3.
Scores
Table 11: Scores of twenty amino acids on PC1, PC2, PC3 (standardised variables)
Name  PC1  PC2  PC3 
Alanine  2.559  1.201  0.385 
Arginine  2.518  2.021  2.018 
Asparagine  0.983  1.203  0.691 
Aspartic acid  1.082  1.904  2.357 
Cysteine  0.570  2.128  0.398 
Glutamic acid  0.186  0.990  2.235 
Glutamine  0.175  0.339  0.514 
Glycine  3.556  1.039  0.518 
Histidine  1.007  2.347  0.107 
Isoleucine  0.255  1.885  0.096 
Leucine  0.262  1.788  0.053 
Lysine  1.187  0.904  1.489 
Methionine  0.791  0.934  0.102 
Phenylalanine  2.014  1.009  0.327 
Proline  1.375  0.806  0.238 
Serine  2.163  0.146  0.646 
Threonine  1.246  0.358  0.521 
Tryptophan  3.833  2.087  0.435 
Tyrosine  2.708  0.089  0.923 
Valine  0.680  1.502  0.070 
This table of scores allows us to visualise the twenty amino acids in a threedimensional space, as seen in the figure below (created with Plotly Chart Studio) where each amino acid has been labelled according to the colour scheme seen earlier:
Figure 6: 3D scatter plot of PC1, PC2, PC3
The chart clearly shows that the combination of the three principal components is not only able to sort the objects in increasing molecular weight (along the PC1 axis), but it also reveals distinct groups and trends, that correspond to the seven categories of amino acids.
Displaying the scores on twodimensional scatter plots clearly shows the ability of PCA to show patterns in the data set and to simplify further analysis:
Figure 7: 2D scatter plot of "Polarity" (PC2) v "Bulk" (PC1)
Figure 8: 2D scatter plot of "AcidBase" (PC3) v "Bulk" (PC1)
Figure 9: 2D scatter plot of "AcidBase" (PC3) v "Polarity" (PC2)
These graphs illustrate a potential application of PCA: Figure 6 shows that amino acids with a positive PC2 are polar or ionisable in nature, whereas amino acids with a negative PC2 are aliphatic or aromatic. Figure 7 shows that acidic amino acids have a positive PC3, whereas basic amino acids have a negative PC3. Those with a PC3 close to zero are nonionisable.
Working with raw or standardised variables are not the only possibilities: one could apply judicious transformations to the data if they are deemed to be of benefit. For example, Snyder et al. developed in the early 2000s a method for HPLC column characterisation, widely known as the “Hydrophobic Subtraction Model” (J. Chrom. A, 961, 2002, 171193). The aim of the model was to quantify the different retention mechanisms involved in a chromatographic separation: hydrophobic interaction, ion exchange, shape selectivity, etc. They realised early on that hydrophobic interaction had an overwhelming effect on the results, and they decided to “subtract” it by dividing all the other factors by the hydrophobic contribution. Dividing pKa, pKb, pI… by the molecular weight of the amino acid may not be appropriate in our data set (I am not sure “pI per mass unit” makes any physical sense) but other strategies could well be interesting.
Outliers
Outliers are observations that lie outside of the general pattern of a distribution or mathematical model. In linear regression, for example, outliers can be detected by calculating “standardised residuals”:
Where the residual, r_{i}, is defined as the distance between the experimental point y_{i} and the regressed value, . After computing all the residuals, the average and standard deviation s_{r} can be calculated, and the residuals can be standardised thus:
Any points with standardised residuals beyond a critical value are potential residuals and should be investigated further.
The multivariate equivalent to standardised residuals is the Mahalanobis Distance (MD), named after its creator, Prasandra Mahalanobis:
Here, is the object under investigation (e.g. each one of our amino acids) in the principal component space. is therefore the vector of eight scores for a particular amino acid. For example, the vector “alanine” is:
represents the centroid of the objects in PC space: a vector whose coefficients are the average of each PC. For standardised data, the calculation is simplified because the average of each PCs is zero:
Finally, C ^{1} is the inverse of the Covariance matrix. For standardised data, this is the inverse of the Correlation matrix:
286.938  7.249  8.660  9.306  37.263  44.122  9.106  303.938 
7.249  2.826  0.050  1.194  1.646  0.220  7.588  0.307 
8.660  0.050  1.630  0.329  0.921  0.637  4.154  12.830 
9.306  1.194  0.329  2.354  0.437  1.257  5.404  14.889 
37.263  1.646  0.921  0.437  6.595  6.286  10.671  31.013 
44.122  0.220  0.637  1.257  6.286  9.537  7.009  43.025 
9.106  7.588  4.154  5.404  10.671  7.009  189.262  170.797 
303.938  0.307  12.830  14.889  31.013  43.025  170.797  496.522 
The Mahalanobis distances for all twenty amino acids are:
Name  MD  MD critical  p Value 
Alanine  2.299  15.507  0.970 
Arginine  3.277  15.507  0.916 
Asparagine  2.543  15.507  0.960 
Aspartic acid  2.787  15.507  0.947 
Cysteine  3.514  15.507  0.898 
Glutamic acid  2.903  15.507  0.940 
Glutamine  2.531  15.507  0.960 
Glycine  2.922  15.507  0.939 
Histidine  2.269  15.507  0.972 
Isoleucine  1.885  15.507  0.984 
Leucine  2.100  15.507  0.978 
Lysine  3.591  15.507  0.892 
Methionine  3.527  15.507  0.897 
Phenylalanine  2.673  15.507  0.953 
Proline  3.722  15.507  0.881 
Serine  1.719  15.507  0.988 
Threonine  2.202  15.507  0.974 
Tryptophan  3.180  15.507  0.923 
Tyrosine  2.550  15.507  0.959 
Valine  1.470  15.507  0.993 
The critical Mahalanobis distances and p values are given by the Chisquared distribution. In Excel, these are calculated as:
Where α is significance (set to 0.05 in this example) and v is degrees of freedom, which is equal to the number of PCs, 8. Our calculations indicate that none of the amino acids is an outlier. In other words, they all fit well within the method.
Conclusions
After a detailed introduction to the algebra behind Principal Components Analysis, this instalment is focused on some of the practical aspects of the technique, such as dealing with missing data, or highly correlated – and therefore potentially redundant – variables.
Most statistical packages enable the user to apply basic transformations to the data set prior to PCA. One of the most common options is to standardise the variables, which is commonly carried out when there are large disparities in absolute values and variances. We have demonstrated how this technique modifies the resulting principal components, and given general guidelines on how to use it.
Finally, the most important aspect of PCA is to interpret the results. We have discussed at length some graphical and statistical tools, including Mahalanobis distances to evaluate the presence of outlier objects.
As ever, I hope this article has been enlightening and useful. Questions and comments are welcome, and please do not hesitate to contact us for further information.