In a previous instalment of this blog, we covered the “ABC” of PCA (Principal Component Analysis) by examining how this method would be applied to a very simple example: five points on a plane.


As described in that article, principal components (PCs) are a new set of variables obtained from a combination of the original variables. The steps involved in PCA are the following:

  1. Calculate the covariance matrix, C, of the data set
  2. Extract the eigenvalues (λ) and eigenvectors (v) of the covariance matrix
  3. Normalise eigenvectors to calculate the loadings (coefficients) of the original variables onto each principal component
  4. Calculate the scores (coordinates) of the original data points in the new PC space

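For readers who prefer to see the recipe in code, the four steps translate into a few lines of numpy. This is a minimal sketch, using five made-up points standing in for the example of the first article:

```python
import numpy as np

# Hypothetical five points on a plane, standing in for the first article's example
X = np.array([[1.0, 1.0], [2.0, 3.0], [3.0, 2.0], [4.0, 4.0], [5.0, 5.0]])

# 1. Covariance matrix of the data set
C = np.cov(X, rowvar=False)

# 2. Eigenvalues and eigenvectors of the (symmetric) covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Sort from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]

# 3. eigh already returns unit-length eigenvectors, so the sorted
#    columns are the loadings of the original variables on each PC
loadings = eigenvectors[:, order]

# 4. Scores: coordinates of the (centred) data points in the new PC space
scores = (X - X.mean(axis=0)) @ loadings
```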
A multivariate data set containing n variables can be thought of as a scatter of points in an n-dimensional space. Unfortunately, if the number of variables is greater than three, there is no easy way to visualise these points. The matrix C is an (n × n) matrix of the covariances measured between each pair of variables. Any such symmetric matrix can be described by a set of n orthogonal vectors, called basis vectors. One particular set of basis vectors of the C matrix, called the ‘eigenvectors’ of C, describes the directions in which the data set points are spread or scattered.

Each eigenvector has an associated eigenvalue, a number which corresponds to the data variance measured along the direction of the eigenvector. The eigenvector with the largest eigenvalue, therefore, points in the direction of maximum spread. The eigenvector with the second largest eigenvalue indicates the direction of second highest scatter, etc. These orthogonal eigenvectors define the new axes PC1, PC2, …, PCn. Consequently, the first PC axis has been constructed so that it describes as much of the variance in the data set as possible, and successive PCs describe increasingly smaller proportions of variance.

The advantage of representing the data points in this new set of axes is that, in many cases, it is possible to simplify an abstract multidimensional space down to two or three easy-to-visualise dimensions. A scatter plot of PC1, PC2 and PC3 – the directions of maximum variance – will still convey a large amount of the information contained in the multidimensional space.

A Slightly More Complex Data Set

AAindex is an impressive repository of amino acid data compiled from hundreds of peer-reviewed papers, and it is a good source of data to learn multivariate analysis. To describe more advanced aspects of PCA, I decided to use the following set of variables from AAindex:

Table 1: Selected numerical descriptors of proteinogenic amino acids.

Name  Abbr.  Symbol  MW  Residue MW (-H2O)  pKa  pKb  pKx  pI  logD (pI)  Polarisability  Normalised Volume
Alanine Ala A 89.10 71.08 2.34 9.69 – 6.00 -2.841 8.478 1.00
Arginine Arg R 174.20 156.19 1.82 8.99 12.48 10.76 -3.156 18.035 6.13
Asparagine Asn N 132.12 114.11 2.02 8.80 – 5.41 -4.288 11.810 2.95
Aspartic acid Asp D 133.11 115.09 1.88 9.60 3.65 2.77 -3.504 11.418 2.78
Cysteine Cys C 121.16 103.15 1.92 8.35 8.18 5.05 -2.792 11.482 2.43
Glutamic acid Glu E 147.13 129.12 2.10 9.67 4.25 3.22 -3.241 13.488 3.78
Glutamine Gln Q 146.15 128.13 2.17 9.13 – 5.65 -4.001 13.866 3.95
Glycine Gly G 75.07 57.05 2.35 9.78 – 5.97 -3.409 6.636 0.00
Histidine His H 155.16 137.14 1.82 9.17 6.00 7.59 -3.293 14.926 4.66
Isoleucine Ile I 131.18 113.16 2.36 9.68 – 6.02 -1.508 14.268 4.00
Leucine Leu L 131.18 113.16 2.36 9.60 – 5.98 -1.586 14.325 4.00
Lysine Lys K 146.19 128.18 2.16 9.18 10.53 9.74 -3.215 16.009 4.77
Methionine Met M 149.21 131.20 2.28 9.21 – 5.74 -2.189 15.529 4.43
Phenylalanine Phe F 165.19 147.18 2.16 9.18 – 5.48 -1.184 17.197 5.89
Proline Pro P 115.13 97.12 1.95 10.64 – 6.30 -2.569 11.469 2.72
Serine Ser S 105.09 87.08 2.19 9.21 – 5.68 -3.888 9.448 1.60
Threonine Thr T 119.12 101.11 2.09 9.10 – 5.66 -3.471 11.008 2.60
Tryptophan Trp W 204.23 186.22 2.43 9.44 – 5.89 -1.085 21.197 8.08
Tyrosine Tyr Y 181.19 163.18 2.20 9.11 10.07 5.66 -1.489 18.204 6.47
Valine Val V 117.15 99.13 2.32 9.62 – 5.96 -1.953 12.238 3.00

In this table,

  • MW is the molecular weight of the amino acid
  • Residue MW is the molecular weight after peptide condensation (after loss of H2O)
  • pKa, pKb and pKx refer to the dissociation constants of the acidic (C-terminus), basic (N-terminus) and side-chain (where applicable) groups
  • pI is the isoelectric point of the amino acid
  • logD (pI) is the distribution coefficient measured at the isoelectric point
  • Polarisability is related to the dipole moment of the amino acid
  • Normalised Volume refers to the bulkiness of the side chain, with glycine (side chain: -H) having an arbitrary value of zero

Dealing with Missing Data

If we try to perform PCA on this data set using a statistics package, we are likely to get an error. This is caused by the missing values in the pKx column, which prevent us from calculating the covariance matrix:

$$C_{i,j} = \frac{1}{N-1}\sum_{k=1}^{N}\left(x_{i,k} - \bar{x}_i\right)\left(x_{j,k} - \bar{x}_j\right)$$

where $\bar{x}_i$ and $\bar{x}_j$ are the means of variables $i$ and $j$, and $N$ is the number of observations.

If any of the $x_{i,k}$ values are missing, the programme will not know how to calculate the covariance matrix. To resolve the problem, we can simply eliminate the variable from the data set. Alternatively, if we believe that the information is critical to the analysis, we can try to “pad” the missing values. A quick look at a pKa table shows that pKa values for alcohols are typically estimated to be above 15, above 20 for amides, and above 40 for hydrocarbons. We could therefore fill the blanks with a constant value (e.g. 20) to indicate that these species have neutral side chains.
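In code, the padding step might look like this: a minimal numpy sketch, assuming the pKx column is held in an array with NaN marking the blanks (values shown are the first five rows of Table 1):

```python
import numpy as np

# pKx for Ala, Arg, Asn, Asp, Cys; NaN marks a missing value
pkx = np.array([np.nan, 12.48, np.nan, 3.65, 8.18])

# Pad the blanks with a constant (20) to flag neutral side chains
pkx_padded = np.where(np.isnan(pkx), 20.0, pkx)
print(pkx_padded)  # [20.   12.48 20.    3.65  8.18]
```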

After padding the missing data points, the resulting covariance matrix is:

Table 2: Covariance matrix of nine selected variables, corrected for missing values.

                  MW       Res. MW   pKa     pKb     pKx      pI      logD(pI)  Polar.   Norm.Vol.
MW                952.538  952.599   -0.706  -4.285  -51.409  11.138  12.881    106.491  58.566
Residue MW        952.599  952.661   -0.706  -4.286  -51.411  11.139  12.881    106.497  58.569
pKa               -0.706   -0.706    0.037   0.024   0.742    -0.048  0.094     0.010    -0.003
pKb               -4.285   -4.286    0.024   0.216   0.660    -0.121  0.096     -0.384   -0.199
pKx               -51.409  -51.411   0.742   0.660   38.739   1.051   1.291     -3.267   -1.858
pI                11.138   11.139    -0.048  -0.121  1.051    3.130   -0.015    2.087    1.027
logD (pI)         12.881   12.881    0.094   0.096   1.291    -0.015  0.967     1.979    1.053
Polarisability    106.491  106.497   0.010   -0.384  -3.267   2.087   1.979     12.702   6.880
Norm. Volume      58.566   58.569    -0.003  -0.199  -1.858   1.027   1.053     6.880    3.758

Highly Correlated Variables

Remember from the first instalment that the elements on the diagonal are the variances of the variables:

$$C_{i,i} = \mathrm{var}(X_i) = \frac{1}{N-1}\sum_{k=1}^{N}\left(x_{i,k} - \bar{x}_i\right)^2$$

It is very clear that the first two variables (MW and Residue MW) have much larger variances than the rest. We can predict that the first principal component will have very large loadings on these two variables, whereas the rest will mostly contribute to higher order PCs.

Notice also the very high covariance (952.599) between MW and Residue MW. What does this mean? Calculating the correlation between MW and Residue MW gives $r = 952.599/\sqrt{952.538 \times 952.661} \approx 1.000$: from an information perspective, the two variables are essentially identical and one of them could be safely removed without any loss in data quality.
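As a quick check, the correlation can be reproduced from the entries of Table 2; a minimal sketch:

```python
import numpy as np

var_mw, var_res = 952.538, 952.661  # diagonal elements of Table 2
cov_mw_res = 952.599                # covariance between MW and Residue MW

r = cov_mw_res / np.sqrt(var_mw * var_res)
print(round(r, 6))  # 1.0: the two variables carry the same information
```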

We confirm these predictions after carrying out PCA and observing the eigenvalues and loadings (only the first four PCs are shown for brevity):

Table 3: Eigenvalues of PCA on nine selected variables.

PC  Eigenvalue (λ)  % of Total
1 1923.84 97.92%
2 36.38 1.85%
3 3.16 0.16%
4 1.11 0.06%

Table 4: Loadings of first four principal components.

Variable Loading PC1 Loading PC2 Loading PC3 Loading PC4
MW 0.704 0.021 -0.025 0.044
Residue MW 0.704 0.021 -0.025 0.045
pKa -0.001 0.020 -0.021 -0.078
pKb -0.003 0.012 -0.025 -0.172
pKx -0.039 0.993 -0.068 0.089
pI 0.008 0.051 0.954 0.137
logD (pI) 0.010 0.057 -0.055 -0.757
Polarisability 0.079 0.074 0.264 -0.550
Norm Volume 0.043 0.039 0.101 -0.241

Recall from the first instalment that each eigenvalue indicates the variance over the corresponding principal component. Table 3 shows that PC1 accounts for nearly 98% of the variation, whereas the rest of PCs only make marginal contributions to the description of the data set.

Table 4, on the other hand, shows the loadings of the original variables on the first few PCs. It is clear that PC1 receives the greatest contributions from MW and Residue MW, whereas the effects of the other variables on this PC are at least an order of magnitude smaller. It is also interesting to note that the loadings of MW and Residue MW are practically identical on all principal components: the two variables contribute in the same way to the new dimensions. Since the main objective of PCA is the reduction of data dimensionality, it is clear that we can treat one of these variables as redundant and remove it, simplifying the data set without loss of information. Further analysis will be carried out on just eight of the variables:

Table 5: Eight selected variables for further PCA.

Name Abbr. Symbol MW pKa pKb pKx pI logD Polar NV
Alanine Ala A 89.10 2.34 9.69 20.00 6.00 -2.841 8.478 1.00
Arginine Arg R 174.20 1.82 8.99 12.48 10.76 -3.156 18.035 6.13
Asparagine Asn N 132.12 2.02 8.80 20.00 5.41 -4.288 11.810 2.95
Aspartic acid Asp D 133.11 1.88 9.60 3.65 2.77 -3.504 11.418 2.78
Cysteine Cys C 121.16 1.92 8.35 8.18 5.05 -2.792 11.482 2.43
Glutamic acid Glu E 147.13 2.10 9.67 4.25 3.22 -3.241 13.488 3.78
Glutamine Gln Q 146.15 2.17 9.13 20.00 5.65 -4.001 13.866 3.95
Glycine Gly G 75.07 2.35 9.78 20.00 5.97 -3.409 6.636 0.00
Histidine His H 155.16 1.82 9.17 6.00 7.59 -3.293 14.926 4.66
Isoleucine Ile I 131.18 2.36 9.68 20.00 6.02 -1.508 14.268 4.00
Leucine Leu L 131.18 2.36 9.60 20.00 5.98 -1.586 14.325 4.00
Lysine Lys K 146.19 2.16 9.18 10.53 9.74 -3.215 16.009 4.77
Methionine Met M 149.21 2.28 9.21 20.00 5.74 -2.189 15.529 4.43
Phenylalanine Phe F 165.19 2.16 9.18 20.00 5.48 -1.184 17.197 5.89
Proline Pro P 115.13 1.95 10.64 20.00 6.30 -2.569 11.469 2.72
Serine Ser S 105.09 2.19 9.21 20.00 5.68 -3.888 9.448 1.60
Threonine Thr T 119.12 2.09 9.10 20.00 5.66 -3.471 11.008 2.60
Tryptophan Trp W 204.23 2.43 9.44 20.00 5.89 -1.085 21.197 8.08
Tyrosine Tyr Y 181.19 2.20 9.11 10.07 5.66 -1.489 18.204 6.47
Valine Val V 117.15 2.32 9.62 20.00 5.96 -1.953 12.238 3.00

Step by Step Principal Component Analysis

In my previous instalment, I started PCA by subtracting the averages of the original variables from each data point, to obtain a table of centred values. This transformation is convenient when doing manual calculations, but it is not universally performed by statistical software. For the remainder of this article I will use Minitab 16, which does not centre the variables. If I wished to do so, I should select the command ‘Standardize’ from the ‘Calc’ menu, and select the option ‘Subtract mean’. Alternatively, the data could be transformed in a spreadsheet and copied into the software.
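For anyone following along outside Minitab, mean-centring is a one-line transformation. A minimal numpy sketch, typing in only the first two rows of Table 5 for brevity:

```python
import numpy as np

# Ala and Arg rows of Table 5: MW, pKa, pKb, pKx, pI, logD, Polar, NV
X = np.array([
    [ 89.10, 2.34, 9.69, 20.00,  6.00, -2.841,  8.478, 1.00],
    [174.20, 1.82, 8.99, 12.48, 10.76, -3.156, 18.035, 6.13],
])

# Subtract each column's mean to centre the variables
X_centred = X - X.mean(axis=0)
```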

Covariance Matrix

The covariance matrix calculated on the eight variables in Table 5 is displayed below. After dispensing with Residue MW, the remaining covariances are the same as in Table 2. MW still has a very high variance, and we expect this variable to weigh heavily on PC1.

Table 6: Covariance matrix of the eight selected variables.

        MW       pKa     pKb     pKx      pI      logD    Polar    NV
MW      952.538  -0.706  -4.285  -51.409  11.138  12.881  106.491  58.566
pKa     -0.706   0.037   0.024   0.742    -0.048  0.094   0.010    -0.003
pKb     -4.285   0.024   0.216   0.660    -0.121  0.096   -0.384   -0.199
pKx     -51.409  0.742   0.660   38.739   1.051   1.291   -3.267   -1.858
pI      11.138   -0.048  -0.121  1.051    3.130   -0.015  2.087    1.027
logD    12.881   0.094   0.096   1.291    -0.015  0.967   1.979    1.053
Polar   106.491  0.010   -0.384  -3.267   2.087   1.979   12.702   6.880
NV      58.566   -0.003  -0.199  -1.858   1.027   1.053   6.880    3.758

Eigenvectors of the Covariance Matrix and Scree Test

The eight eigenvalues of the covariance matrix in Table 6 are displayed below:

PC  Eigenvalue (λ)  % of Total
1 971.225 95.96%
2 36.344 3.59%
3 3.159 0.31%
4 1.105 0.11%
5 0.164 0.02%
6 0.071 0.01%
7 0.013 0.00%
8 0.007 0.00%

The variance ascribed to the first PC has dropped somewhat, but PC1 still explains more than 95% of data variability, with the rest being almost completely accounted for by PC2. It seems plausible that retaining these two principal components and discarding the rest would be a very efficient way to describe the original data set. It could also be argued that PC1 alone is sufficient, and the analysis could be reduced to a single dimension. The decision to retain one, two, or more PCs is somewhat subjective. Ideally it should be based on a deeper analysis of loadings and scores, but in practice it is often decided simply by the percentage of variance one wishes to preserve. A graphical tool called a “scree plot” is often used to aid in this decision:

Figure 1: Scree plot of the 8 eigenvalues.

A scree plot displays the eigenvalues from highest to lowest value (therefore, in decreasing order of variance). Very often scree plots – like the one in Figure 1 – have an “elbow” beyond which the variance levels off. The principal components “above” the elbow are generally retained, as they account for the greatest amount of variability. In the example above, only PC1 would be retained.
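A scree plot like Figure 1 is also easy to draw by hand; a short matplotlib sketch, using the eigenvalues tabulated above:

```python
import matplotlib.pyplot as plt

eigenvalues = [971.225, 36.344, 3.159, 1.105, 0.164, 0.071, 0.013, 0.007]

plt.plot(range(1, 9), eigenvalues, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Eigenvalue (variance)')
plt.title('Scree plot')
plt.show()
```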

Loadings of the Principal Components

Each eigenvalue has an associated eigenvector, which forms the basis of a principal component. The resulting principal components are unique linear combinations of the eight original variables, and the coefficients of the variables in the linear combinations are called the “loadings” of the PCs. These are shown on Table 7.

Table 7: Loadings of eight selected variables on PC1 to PC8.

Variable PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
MW 0.990 0.043 -0.050 0.089 0.019 0.072 0.019 0.036
pKa -0.001 0.020 -0.021 -0.079 -0.034 -0.160 0.983 -0.010
pKb -0.004 0.012 -0.025 -0.173 0.977 -0.065 0.010 0.107
pKx -0.055 0.992 -0.068 0.089 0.001 0.010 -0.012 0.009
pI 0.012 0.052 0.954 0.134 0.055 0.237 0.071 0.064
logD 0.013 0.057 -0.057 -0.758 -0.114 0.615 0.035 0.163
Polar 0.111 0.077 0.261 -0.547 -0.157 -0.728 -0.162 0.185
NV 0.061 0.040 0.099 -0.239 0.064 -0.022 -0.029 -0.960

PC1 has a very strong loading from MW, and much weaker contributions from Polarisability, Normalised Volume, and pKx, the pK of the amino acid side chain. pKx is the major contributor to PC2, which has much lower coefficients for logD, pI and Normalised Volume. pI is the major contributor to PC3, which also has a significant coefficient for Polarisability.

These contributions are often visualised on “loading plots”, such as Figure 2 for PC1 and PC2, which simplify the interpretation of the principal components being plotted in terms of the original variables:

Figure 2: Loading plot for PC1 versus PC2.

In this graph, the horizontal axis indicates the contribution of each variable to PC1, and the vertical axis represents the contribution to PC2. Variables close to zero (lower left corner of the plot) have small loadings on both principal components. MW has a strong loading on PC1 (close to one), but a very weak one on PC2 (close to zero). The opposite is true for pKx, which loads strongly on PC2 but weakly on PC1.

In a very real sense, PC1 is essentially MW, “complemented” with Polarisability and Normalised Volume. If we plot the amino acids along PC1 (using the “scores” of the amino acids on PC1, which we will compute in the next section) we should see small, light, non-polar amino acids at the lower values, whereas bulkier and more polarisable amino acids should have higher PC1 scores.

PC2 can be described as pKx “decorated” with pI and Polarisability. This axis should allow us to differentiate between neutral and polar or ionisable amino acids, as it encodes information related to the chemical properties of the side chains. PC3 should further enhance this classification and allow us to separate neutral, acidic, and basic amino acids.

Scores

After computing the loadings of the original variables on the principal components, the twenty amino acids can be “mapped” onto the new system of coordinates, by applying the linear combinations that define each PC to the values of each amino acid. The new coordinates are called “scores”.

For example, PC1 is defined as:

$$\mathrm{PC1} = 0.990\,\mathrm{MW} - 0.001\,\mathrm{pK_a} - 0.004\,\mathrm{pK_b} - 0.055\,\mathrm{pK_x} + 0.012\,\mathrm{pI} + 0.013\,\mathrm{logD} + 0.111\,\mathrm{Polar} + 0.061\,\mathrm{NV}$$

The score of Alanine on PC1 is therefore:

$$0.990(89.10) - 0.001(2.34) - 0.004(9.69) - 0.055(20.00) + 0.012(6.00) + 0.013(-2.841) + 0.111(8.478) + 0.061(1.00) \approx 88.12$$

The complete table of scores is shown below:

Table 8: Scores of twenty amino acids on PC1 to PC8.

Name PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Alanine 88.12 24.66 2.13 5.88 10.49 -0.92 2.74 4.96
Arginine 174.23 21.99 6.01 7.34 10.55 -0.89 2.53 5.01
Asparagine 131.19 26.70 0.61 8.60 10.18 -1.22 2.53 4.90
Aspartic acid 132.99 10.40 -1.03 6.42 10.78 -1.20 2.53 4.93
Cysteine 120.93 14.52 1.40 5.79 9.35 -1.00 2.46 5.01
Glutamic acid 147.14 11.84 -0.72 6.18 10.84 -1.49 2.68 4.94
Glutamine 145.37 27.54 0.75 8.22 10.49 -1.54 2.61 4.93
Glycine 73.95 23.84 2.25 6.30 10.60 -0.93 2.79 4.98
Histidine 155.26 14.28 3.42 6.78 10.59 -0.88 2.57 4.89
Isoleucine 130.63 27.10 1.80 4.72 10.41 -1.36 2.57 4.90
Leucine 130.64 27.10 1.78 4.75 10.33 -1.45 2.55 4.88
Lysine 146.28 18.60 5.89 5.97 10.38 -1.76 2.66 4.85
Methionine 148.64 27.93 1.06 6.09 10.19 -1.43 2.56 5.20
Phenylalanine 164.75 28.84 0.54 5.46 10.17 -0.95 2.44 4.83
Proline 114.34 26.11 2.05 5.84 11.55 -1.03 2.34 4.98
Serine 104.08 25.36 1.42 7.47 10.32 -1.16 2.65 4.90
Threonine 118.22 26.14 1.19 7.33 10.25 -1.03 2.54 4.80
Tryptophan 203.99 30.94 0.24 6.12 10.67 -1.01 2.76 4.95
Tyrosine 181.29 19.77 0.93 5.57 10.31 -0.80 2.73 4.90
Valine 116.45 26.27 1.83 5.17 10.39 -1.14 2.60 4.89

As expected, the scores on PC1 for the amino acids are not very different from their molecular weights. The small differences are the combined effects of the other variables on the calculation of PC1. However, we cannot give PC1 units of “g/mol”, since it is a combination of eight distinct physical magnitudes.
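The mapping itself is a single matrix product. A sketch for PC1 only, using the rounded loadings of Table 7 and the first three rows of Table 5 (the small discrepancies with the table of scores come from rounding the loadings to three decimal places):

```python
import numpy as np

# Ala, Arg and Asn rows of Table 5; the full analysis uses all twenty
X = np.array([
    [ 89.10, 2.34, 9.69, 20.00,  6.00, -2.841,  8.478, 1.00],
    [174.20, 1.82, 8.99, 12.48, 10.76, -3.156, 18.035, 6.13],
    [132.12, 2.02, 8.80, 20.00,  5.41, -4.288, 11.810, 2.95],
])

# PC1 loadings from Table 7 (rounded)
pc1 = np.array([0.990, -0.001, -0.004, -0.055, 0.012, 0.013, 0.111, 0.061])

# Scores: raw (uncentred) data projected onto PC1, as Minitab reports here
print(X @ pc1)  # approx. [88.1, 174.2, 131.2]; cf. 88.12, 174.23, 131.19
```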

The scree plot and eigenvalues table suggested that PC1 would suffice to describe 96% of the information contained in the original data set. The graphical representation of the twenty amino acids on PC1 is displayed on the following figure:

Figure 3: PC1 scores of twenty amino acids.

We predicted that small, neutral amino acids would be placed at the lower end of PC1, and increasingly bulkier and more polarisable species would appear at higher PC1 values. Whilst this is correct, PC1 alone is not able to separate the different chemical categories of amino acids, represented by the coloured icons.

The reason is clear: molecular weight values are so much larger than those of the other variables that they dominate the computation of the first principal component. Although MW accounts for a very large amount of variance, it does not encode the chemical information reflected in the other variables. This subtle information, which could arguably be of high value in many situations, is “displaced” to the lower-rank principal components (PC2, PC3, etc.).

When deciding which principal components should be retained and which discarded, the scree plot rule is somewhat simplistic. It should always be supplemented with a brief exploration of loadings and scores to better understand how the information is being represented in the new PC space.

Standardised Variables and the Correlation Matrix

Most statistical packages offer the option to standardise the original variables as the first step prior to PCA. Standardisation, or normalisation, consists in calculating the mean and standard deviation of each variable, and using them to transform the data points:

$$z_{i,j} = \frac{x_{i,j} - \bar{x}_i}{s_i}$$

where $\bar{x}_i$ and $s_i$ are the mean and standard deviation of variable $i$, and $x_{i,j}$ is the value of variable $i$ for object $j$. For example, the variable MW (mean 136.90, standard deviation 30.86) would be standardised as follows:

$$z_{\mathrm{MW},j} = \frac{\mathrm{MW}_j - 136.90}{30.86}$$

Standardisation is generally used when there are large differences in absolute values and/or ranges between the variables. Our data set is a good example: MW values are one order of magnitude greater than those of NV; and they range between 75.07 g/mol and 204.23 g/mol, whereas the range of NV is much smaller, 0 – 8.08 units. Consequently, as we have seen in previous sections, MW will naturally have much more weight on the calculation of the first few PCs than NV – which may or may not be a desirable effect. By subtracting the mean and dividing by the standard deviation, these large differences are removed: all standardised variables will have an average of zero and a variance of 1, which ensures they have equal weight on the calculation of the PCs.
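In numpy, standardisation is again a one-liner. The sketch below (reusing the first three rows of Table 5 as a small illustration) also verifies the point made next: the covariance matrix of the standardised data is exactly the correlation matrix of the raw data:

```python
import numpy as np

# Ala, Arg and Asn rows of Table 5
X = np.array([
    [ 89.10, 2.34, 9.69, 20.00,  6.00, -2.841,  8.478, 1.00],
    [174.20, 1.82, 8.99, 12.48, 10.76, -3.156, 18.035, 6.13],
    [132.12, 2.02, 8.80, 20.00,  5.41, -4.288, 11.810, 2.95],
])

# Standardise: subtract the mean, divide by the sample standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance of Z equals the correlation matrix of X
print(np.allclose(np.cov(Z, rowvar=False), np.corrcoef(X, rowvar=False)))  # True
```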

After standardising the data, the covariance matrix becomes a correlation matrix: the diagonal elements are 1, and the elements above and below the diagonal indicate the pairwise correlations between the variables:

Table 9: Correlation matrix for eight selected variables (standardised).

MW pKa pKb pKx pI logD Polar NV
MW 1.000 -0.119 -0.298 -0.268 0.204 0.425 0.968 0.979
pKa -0.119 1.000 0.268 0.621 -0.141 0.500 0.015 -0.007
pKb -0.298 0.268 1.000 0.228 -0.147 0.209 -0.232 -0.221
pKx -0.268 0.621 0.228 1.000 0.095 0.211 -0.147 -0.154
pI 0.204 -0.141 -0.147 0.095 1.000 -0.008 0.331 0.299
logD 0.425 0.500 0.209 0.211 -0.008 1.000 0.565 0.552
Polar 0.968 0.015 -0.232 -0.147 0.331 0.565 1.000 0.996
NV 0.979 -0.007 -0.221 -0.154 0.299 0.552 0.996 1.000

Scree Plot

The eigenvalue table and scree plot show a remarkable departure from the previous analysis:

PC  Eigenvalue (λ)  % Variance  Cumulative %
1 3.469 43.36% 43.36%
2 2.089 26.11% 69.47%
3 1.057 13.22% 82.68%
4 0.782 9.77% 92.46%
5 0.351 4.39% 96.84%
6 0.247 3.08% 99.93%
7 0.004 0.05% 99.98%
8 0.001 0.02% 100.00%

Figure 4: Scree plot of eight principal components (standardised variables)

Calculated on standardised variables, the first principal component only contains 43% of the variance; 26% is ascribed to PC2, 13% to PC3, etc. It is clear that we will need to retain a larger number of PCs to represent the system. However, the scree plot does not have the sharp “elbow” we saw in the previous calculation. How is a decision made in these cases? Although a deeper analysis of loadings and scores is always a sound idea, it is common in practice to apply a “Pareto” criterion, and retain those principal components that accumulate 80% or more of the variance. In this example, PC1, PC2 and PC3 could be considered satisfactory.
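The “Pareto” criterion is easy to automate. A minimal sketch using the eigenvalues from the table above:

```python
import numpy as np

eigenvalues = np.array([3.469, 2.089, 1.057, 0.782, 0.351, 0.247, 0.004, 0.001])

# Cumulative fraction of variance explained
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()

# Smallest number of PCs accumulating at least 80% of the variance
n_retained = int(np.searchsorted(cumulative, 0.80)) + 1
print(n_retained)  # 3: PC1-PC3 accumulate just over 80% of the variance
```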

Loadings

The loadings chart for PC1 and PC2, and the loadings table for the first three principal components are shown below:

Figure 5: Loading chart for PC1 and PC2

Table 10: Loadings for PC1, PC2, PC3 (standardised variables)

Variable PC1 PC2 PC3
MW 0.519 0.072 0.111
pKa -0.016 -0.614 -0.030
pKb -0.157 -0.364 0.254
pKx -0.107 -0.517 -0.455
pI 0.184 0.081 -0.816
logD 0.304 -0.458 0.222
Polar 0.534 -0.037 0.000
NV 0.532 -0.029 0.029

PC1 is almost equally shared between MW, Polarisability, and Normalised Volume: the factors that were secondary in our original analysis have now become much more prominent. PC2 is composed of pKa, followed closely by pKx and logD. pKb has a smaller, but significant, contribution. PC3 is still strongly dominated by pI, but it also receives a strong contribution from pKx.

It should be noted that these variables have negative loadings: as the variables increase in value, PC2 and PC3 would decrease. It is the magnitude of the loading, regardless of its sign, that determines which variables are most strongly associated with each PC.

These combinations of the variables seem almost natural. It is often said that PCA reveals the hidden structure of a data set, and can uncover latent variables that may not have been initially obvious. For example, it seems appropriate to identify PC1 with the “bulk”; PC2 with the “polarity”; and PC3 with the “acid-base character” of the amino acids. We might even prefer these more descriptive names, instead of PC1, PC2 and PC3.

Scores

Table 11: Scores of twenty amino acids on PC1, PC2, PC3 (standardised variables)

Name PC1 PC2 PC3
Alanine -2.559 -1.201 -0.385
Arginine 2.518 2.021 -2.018
Asparagine -0.983 1.203 -0.691
Aspartic acid -1.082 1.904 2.357
Cysteine -0.570 2.128 0.398
Glutamic acid -0.186 0.990 2.235
Glutamine -0.175 0.339 -0.514
Glycine -3.556 -1.039 -0.518
Histidine 1.007 2.347 -0.107
Isoleucine 0.255 -1.885 0.096
Leucine 0.262 -1.788 0.053
Lysine 1.187 0.904 -1.489
Methionine 0.791 -0.934 -0.102
Phenylalanine 2.014 -1.009 0.327
Proline -1.375 -0.806 0.238
Serine -2.163 0.146 -0.646
Threonine -1.246 0.358 -0.521
Tryptophan 3.833 -2.087 0.435
Tyrosine 2.708 -0.089 0.923
Valine -0.680 -1.502 -0.070

This table of scores allows us to visualise the twenty amino acids in a three-dimensional space, as seen in the figure below (created with Plotly Chart Studio) where each amino acid has been labelled according to the colour scheme seen earlier:

Figure 6: 3D scatter plot of PC1, PC2, PC3

The chart clearly shows that the combination of the three principal components is not only able to sort the objects by increasing molecular weight (along the PC1 axis), but also reveals distinct groups and trends that correspond to the seven categories of amino acids.

Displaying the scores on two-dimensional scatter plots clearly shows the ability of PCA to show patterns in the data set and to simplify further analysis:

Figure 7: 2D scatter plot of "Polarity" (PC2) v "Bulk" (PC1)

Figure 8: 2D scatter plot of "Acid-Base" (PC3) v "Bulk" (PC1)

Figure 9: 2D scatter plot of "Acid-Base" (PC3) v "Polarity" (PC2)

These graphs illustrate a potential application of PCA: Figure 7 shows that amino acids with a positive PC2 are polar or ionisable in nature, whereas amino acids with a negative PC2 are aliphatic or aromatic. Figure 8 shows that acidic amino acids have a positive PC3, whereas basic amino acids have a negative PC3. Those with a PC3 close to zero are non-ionisable.

Working with raw or standardised variables is not the only possibility: one could apply judicious transformations to the data if they are deemed to be of benefit. For example, in the early 2000s Snyder et al. developed a method for HPLC column characterisation, widely known as the “Hydrophobic Subtraction Model” (J. Chrom. A, 961, 2002, 171-193). The aim of the model was to quantify the different retention mechanisms involved in a chromatographic separation: hydrophobic interaction, ion exchange, shape selectivity, etc. They realised early on that hydrophobic interaction had an overwhelming effect on the results, and they decided to “subtract” it by dividing all the other factors by the hydrophobic contribution. Dividing pKa, pKb, pI… by the molecular weight of the amino acid may not be appropriate in our data set (I am not sure “pI per mass unit” makes any physical sense) but other strategies could well be interesting.

Outliers

Outliers are observations that lie outside the general pattern of a distribution or mathematical model. In linear regression, for example, outliers can be detected by calculating “standardised residuals”:

$$r_i = y_i - \hat{y}_i$$

where the residual, $r_i$, is defined as the distance between the experimental point $y_i$ and the regressed value, $\hat{y}_i$. After computing all the residuals, the average $\bar{r}$ and standard deviation $s_r$ can be calculated, and the residuals can be standardised thus:

$$z_i = \frac{r_i - \bar{r}}{s_r}$$

Any points with standardised residuals beyond a critical value are potential outliers and should be investigated further.

The multivariate equivalent of standardised residuals is the Mahalanobis Distance (MD), named after its creator, Prasanta Chandra Mahalanobis:

$$\mathrm{MD}_i = \sqrt{\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)^{\mathsf{T}}\,\mathbf{C}^{-1}\,\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)}$$

Here, $\mathbf{x}_i$ is the object under investigation (e.g. each one of our amino acids) in the principal component space: $\mathbf{x}_i$ is therefore the vector of eight scores for a particular amino acid (for alanine, for example, its row of eight PC scores).

$\bar{\mathbf{x}}$ represents the centroid of the objects in PC space: a vector whose coefficients are the averages of each PC. For standardised data, the calculation is simplified because the average of each PC is zero, so $\bar{\mathbf{x}} = (0, 0, \ldots, 0)$.

Finally, $\mathbf{C}^{-1}$ is the inverse of the covariance matrix. For standardised data, this is the inverse of the correlation matrix:

 286.938    7.249    8.660    9.306   37.263   44.122   -9.106  -303.938
   7.249    2.826    0.050   -1.194    1.646    0.220   -7.588    -0.307
   8.660    0.050    1.630    0.329    0.921    0.637    4.154   -12.830
   9.306   -1.194    0.329    2.354    0.437    1.257    5.404   -14.889
  37.263    1.646    0.921    0.437    6.595    6.286  -10.671   -31.013
  44.122    0.220    0.637    1.257    6.286    9.537   -7.009   -43.025
  -9.106   -7.588    4.154    5.404  -10.671   -7.009  189.262  -170.797
-303.938   -0.307  -12.830  -14.889  -31.013  -43.025 -170.797   496.522

The Mahalanobis distances for all twenty amino acids are:

Name  MD  Critical MD  p value
Alanine 2.299 15.507 0.970
Arginine 3.277 15.507 0.916
Asparagine 2.543 15.507 0.960
Aspartic acid 2.787 15.507 0.947
Cysteine 3.514 15.507 0.898
Glutamic acid 2.903 15.507 0.940
Glutamine 2.531 15.507 0.960
Glycine 2.922 15.507 0.939
Histidine 2.269 15.507 0.972
Isoleucine 1.885 15.507 0.984
Leucine 2.100 15.507 0.978
Lysine 3.591 15.507 0.892
Methionine 3.527 15.507 0.897
Phenylalanine 2.673 15.507 0.953
Proline 3.722 15.507 0.881
Serine 1.719 15.507 0.988
Threonine 2.202 15.507 0.974
Tryptophan 3.180 15.507 0.923
Tyrosine 2.550 15.507 0.959
Valine 1.470 15.507 0.993

The critical Mahalanobis distances and p values are given by the Chi-squared distribution. In Excel, these are calculated as:

=CHISQ.INV.RT(α, v) for the critical value, and =CHISQ.DIST.RT(MD, v) for the p value

where α is the significance (set to 0.05 in this example) and v is the number of degrees of freedom, which is equal to the number of PCs, 8. Our calculations indicate that none of the amino acids is an outlier. In other words, they all fit well within the general pattern of the data set.
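The whole recipe can also be sketched in Python, using scipy for the chi-squared distribution. The score matrix below is a reproducible random stand-in so the snippet runs on its own; in practice, paste in the real 20 × 8 matrix of PC scores. The critical value and p values follow the conventions of the table above:

```python
import numpy as np
from scipy.stats import chi2

# Stand-in for the 20 x 8 matrix of PC scores from the standardised analysis
rng = np.random.default_rng(0)
scores = rng.normal(size=(20, 8))

centroid = scores.mean(axis=0)                    # ~0 for standardised data
C_inv = np.linalg.inv(np.cov(scores, rowvar=False))

# Mahalanobis distance of each object from the centroid
diff = scores - centroid
md = np.sqrt(np.einsum('ij,jk,ik->i', diff, C_inv, diff))

# Critical value and p values from the chi-squared distribution (v = 8 PCs)
md_critical = chi2.ppf(1 - 0.05, df=8)            # 15.507
p_values = chi2.sf(md, df=8)
outliers = md > md_critical
print(md_critical, outliers.sum())
```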

Conclusions

After a detailed introduction to the algebra behind Principal Component Analysis, this instalment has focused on some of the practical aspects of the technique, such as dealing with missing data, or with highly correlated – and therefore potentially redundant – variables.

Most statistical packages enable the user to apply basic transformations to the data set prior to PCA. One of the most common options is to standardise the variables, which is commonly carried out when there are large disparities in absolute values and variances. We have demonstrated how this technique modifies the resulting principal components, and given general guidelines on how to use it.

Finally, the most important aspect of PCA is the interpretation of its results. We have discussed at length several graphical and statistical tools, including the Mahalanobis distance, to evaluate the presence of outliers.

As ever, I hope this article has been enlightening and useful. Questions and comments are welcome, and please do not hesitate to contact us for further information.