01 Feb 2021
Principal Component Analysis part 2
In a previous instalment of this blog, we covered the “ABC” of PCA (Principal Component Analysis) by examining how this method would be applied to a very simple example: five points on a plane.
As described in that article, principal components (PCs) are a new set of variables obtained from a combination of the original variables. The steps involved in PCA are the following (a short code sketch follows the list):
- Calculate the covariance matrix, C, of the data set
- Extract the eigenvalues (λ) and eigenvectors (v) of the covariance matrix
- Normalise eigenvectors to calculate the loadings (coefficients) of the original variables onto each principal component
- Calculate the scores (coordinates) of the original data points in the new PC space
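For readers who prefer to see the algorithm rather than read it, here is a minimal sketch of these four steps in Python/NumPy. The five data points are made up for illustration, and the listing is my own assumption – the calculations in this blog were produced with a statistics package, not with this code:

```python
import numpy as np

# Toy data set: 5 observations (rows) of 2 variables (columns),
# echoing the "five points on a plane" example.
X = np.array([[2.0, 1.0],
              [3.0, 4.0],
              [5.0, 3.0],
              [6.0, 6.0],
              [8.0, 7.0]])

C = np.cov(X, rowvar=False)                   # 1. covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)          # 2. eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
eigvals = eigvals[order]
loadings = eigvecs[:, order]                  # 3. loadings (unit-length columns)
scores = (X - X.mean(axis=0)) @ loadings      # 4. scores in the new PC space

print(eigvals / eigvals.sum())                # proportion of variance per PC
```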
A multivariate data set containing n variables can be thought of as a scatter of points in an n-dimensional space. Unfortunately, if the number of variables is greater than three, there is no easy way to visualise these points. The matrix C is an (n x n) square matrix of the covariances measured between each pair of variables. Any n-dimensional space can be described by a set of n orthogonal vectors, called basis vectors. One particular set of basis vectors, the ‘eigenvectors’ of C, describes the directions in which the data set points are spread or scattered.
Each eigenvector has an associated eigenvalue, a number which corresponds to the data variance measured along the direction of the eigenvector. The eigenvector with the largest eigenvalue, therefore, points in the direction of maximum spread. The eigenvector with the second largest eigenvalue indicates the direction of second highest scatter, etc. These orthogonal eigenvectors define the new axes PC1, PC2, …, PCn. Consequently, the first PC axis has been constructed so that it describes as much of the variance in the data set as possible, and successive PCs describe increasingly smaller proportions of variance.
The advantage of representing the data points in this new set of axes is that, in many cases, it is possible to simplify an abstract multidimensional space down to two or three easy-to-visualise dimensions. A scatter plot of PC1, PC2 and PC3 – the directions of maximum variance – will still convey a large amount of the information contained in the multidimensional space.
A Slightly More Complex Data Set
AAindex is an impressive repository of amino acid data compiled from hundreds of peer-reviewed papers, and it is a good source of data to learn multivariate analysis. To describe more advanced aspects of PCA, I decided to use the following set of variables from AAindex:
Table 1: Selected numerical descriptors of proteinogenic amino acids. A dash (—) indicates that the descriptor does not apply.
Name | Abbr. | Symbol | MW | Residue MW (-H2O) | pKa | pKb | pKx | pI | logD (pI) | Polarisability | Normalised Volume
Alanine | Ala | A | 89.10 | 71.08 | 2.34 | 9.69 | — | 6.00 | -2.841 | 8.478 | 1.00
Arginine | Arg | R | 174.20 | 156.19 | 1.82 | 8.99 | 12.48 | 10.76 | -3.156 | 18.035 | 6.13
Asparagine | Asn | N | 132.12 | 114.11 | 2.02 | 8.80 | — | 5.41 | -4.288 | 11.810 | 2.95
Aspartic acid | Asp | D | 133.11 | 115.09 | 1.88 | 9.60 | 3.65 | 2.77 | -3.504 | 11.418 | 2.78
Cysteine | Cys | C | 121.16 | 103.15 | 1.92 | 8.35 | 8.18 | 5.05 | -2.792 | 11.482 | 2.43
Glutamic acid | Glu | E | 147.13 | 129.12 | 2.10 | 9.67 | 4.25 | 3.22 | -3.241 | 13.488 | 3.78
Glutamine | Gln | Q | 146.15 | 128.13 | 2.17 | 9.13 | — | 5.65 | -4.001 | 13.866 | 3.95
Glycine | Gly | G | 75.07 | 57.05 | 2.35 | 9.78 | — | 5.97 | -3.409 | 6.636 | 0.00
Histidine | His | H | 155.16 | 137.14 | 1.82 | 9.17 | 6.00 | 7.59 | -3.293 | 14.926 | 4.66
Isoleucine | Ile | I | 131.18 | 113.16 | 2.36 | 9.68 | — | 6.02 | -1.508 | 14.268 | 4.00
Leucine | Leu | L | 131.18 | 113.16 | 2.36 | 9.60 | — | 5.98 | -1.586 | 14.325 | 4.00
Lysine | Lys | K | 146.19 | 128.18 | 2.16 | 9.18 | 10.53 | 9.74 | -3.215 | 16.009 | 4.77
Methionine | Met | M | 149.21 | 131.20 | 2.28 | 9.21 | — | 5.74 | -2.189 | 15.529 | 4.43
Phenylalanine | Phe | F | 165.19 | 147.18 | 2.16 | 9.18 | — | 5.48 | -1.184 | 17.197 | 5.89
Proline | Pro | P | 115.13 | 97.12 | 1.95 | 10.64 | — | 6.30 | -2.569 | 11.469 | 2.72
Serine | Ser | S | 105.09 | 87.08 | 2.19 | 9.21 | — | 5.68 | -3.888 | 9.448 | 1.60
Threonine | Thr | T | 119.12 | 101.11 | 2.09 | 9.10 | — | 5.66 | -3.471 | 11.008 | 2.60
Tryptophan | Trp | W | 204.23 | 186.22 | 2.43 | 9.44 | — | 5.89 | -1.085 | 21.197 | 8.08
Tyrosine | Tyr | Y | 181.19 | 163.18 | 2.20 | 9.11 | 10.07 | 5.66 | -1.489 | 18.204 | 6.47
Valine | Val | V | 117.15 | 99.13 | 2.32 | 9.62 | — | 5.96 | -1.953 | 12.238 | 3.00
In this table,
- MW is the molecular weight of the amino acid
- Residue MW is the molecular weight after peptide condensation (after loss of H2O)
- pKa, pKb and pKx refer to the dissociation constants of the acidic group (C-terminus), the basic group (N-terminus) and the side chain (where applicable)
- pI is the isoelectric point of the amino acid
- logD (pI) is the distribution coefficient measured at the isoelectric point
- Polarisability is related to the dipole moment of the amino acid
- Normalised Volume refers to the bulkiness of the side chain, with glycine (side chain: -H) having an arbitrary value of zero
Dealing with Missing Data
If we try to calculate PCA on this data using a statistics package, we are likely to get an error. This is caused by the missing values in the pKx column, which prevent us from calculating the covariance matrix:
$$C_{i,j} = \frac{1}{N-1}\sum_{k=1}^{N}\left(x_{i,k}-\bar{x}_i\right)\left(x_{j,k}-\bar{x}_j\right)$$

Where $x_{i,k}$ is the value of variable i for object k, $\bar{x}_i$ is the mean of variable i, and N is the number of objects.
If any of the $x_{i,k}$ values are missing, the programme will not know how to calculate the covariance matrix. To resolve the problem, we can simply eliminate the variable from the data set. Alternatively, if we believe that the information is critical to the analysis, we can try to “pad” the missing values. A quick look at a pKa table shows that pKa values for alcohols are typically estimated to be above 15, above 20 for amides, and above 40 for hydrocarbons. We could therefore fill the blanks with a constant value (e.g. 20) to indicate that these species have neutral side chains.
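In a scripting environment, this padding step is a one-liner. Here is a minimal sketch with pandas; the file and column names are hypothetical, and the blanks in the pKx column are assumed to be read in as NaN:

```python
import pandas as pd

# Hypothetical CSV holding Table 1; amino acids without an ionisable
# side chain have an empty pKx cell, which pandas reads as NaN.
df = pd.read_csv("aaindex_selection.csv")

print(df["pKx"].isna().sum())        # how many values are missing
df["pKx"] = df["pKx"].fillna(20.0)   # pad neutral side chains with 20
```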
After padding the missing data points, the resulting covariance matrix is:
Table 2: Covariance matrix of nine selected variables, corrected for missing values.
 | MW | Residue MW (-H2O) | pKa | pKb | pKx | pI | logD (pI) | Polarisability | Normalised Volume
MW | 952.538 | 952.599 | -0.706 | -4.285 | -51.409 | 11.138 | 12.881 | 106.491 | 58.566
Residue MW (-H2O) | 952.599 | 952.661 | -0.706 | -4.286 | -51.411 | 11.139 | 12.881 | 106.497 | 58.569
pKa | -0.706 | -0.706 | 0.037 | 0.024 | 0.742 | -0.048 | 0.094 | 0.010 | -0.003
pKb | -4.285 | -4.286 | 0.024 | 0.216 | 0.660 | -0.121 | 0.096 | -0.384 | -0.199
pKx | -51.409 | -51.411 | 0.742 | 0.660 | 38.739 | 1.051 | 1.291 | -3.267 | -1.858
pI | 11.138 | 11.139 | -0.048 | -0.121 | 1.051 | 3.130 | -0.015 | 2.087 | 1.027
logD (pI) | 12.881 | 12.881 | 0.094 | 0.096 | 1.291 | -0.015 | 0.967 | 1.979 | 1.053
Polarisability | 106.491 | 106.497 | 0.010 | -0.384 | -3.267 | 2.087 | 1.979 | 12.702 | 6.880
Normalised Volume | 58.566 | 58.569 | -0.003 | -0.199 | -1.858 | 1.027 | 1.053 | 6.880 | 3.758
Highly Correlated Variables
Remember from the first instalment that the elements on the diagonal are the variances of the variables: $C_{i,i} = \mathrm{var}(x_i) = s_i^2$.
It is very clear that the first two variables (MW and Residue MW) have much larger variances than the rest. We can predict that the first principal component will have very large loadings on these two variables, whereas the rest will mostly contribute to higher order PCs.
Notice also the very high covariance (952.599) between MW and Residue MW. What does this mean? Calculating the correlation between MW and Residue MW results in an r of essentially 1:

$$r = \frac{C_{1,2}}{s_1 s_2} = \frac{952.599}{\sqrt{952.538 \times 952.661}} \approx 1.000$$

From an information perspective, the two variables are essentially identical and one of them could be safely removed without any loss in data quality.
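Continuing the pandas sketch from earlier (again, the column names are assumptions), this redundancy check takes one line:

```python
# Pairwise correlation between the two suspect columns.
r = df["MW"].corr(df["Residue MW"])
print(f"r = {r:.4f}")   # ≈ 1.0000: the columns carry the same information
```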
We confirm these predictions after carrying out PCA and observing the eigenvalues and loadings (only the first four PCs are shown for brevity):
Table 3: Eigenvalues of PCA on nine selected variables.
PC | Eigenvalue (λ) | % of Total |
1 | 1923.84 | 97.92% |
2 | 36.38 | 1.85% |
3 | 3.16 | 0.16% |
4 | 1.11 | 0.06% |
Table 4: Loadings of first four principal components.
Variable | Loading PC1 | Loading PC2 | Loading PC3 | Loading PC4 |
MW | 0.704 | 0.021 | -0.025 | 0.044 |
Residue MW | 0.704 | 0.021 | -0.025 | 0.045 |
pKa | -0.001 | 0.020 | -0.021 | -0.078 |
pKb | -0.003 | 0.012 | -0.025 | -0.172 |
pKx | -0.039 | 0.993 | -0.068 | 0.089 |
pI | 0.008 | 0.051 | 0.954 | 0.137 |
logD (pI) | 0.010 | 0.057 | -0.055 | -0.757 |
Polarisability | 0.079 | 0.074 | 0.264 | -0.550 |
Norm Volume | 0.043 | 0.039 | 0.101 | -0.241 |
Recall from the first instalment that each eigenvalue indicates the variance over the corresponding principal component. Table 3 shows that PC1 accounts for nearly 98% of the variation, whereas the rest of PCs only make marginal contributions to the description of the data set.
Table 4, on the other hand, shows the loadings of the original variables on the first few PCs. It is clear that PC1 receives the greatest contributions from MW and Residue MW, whereas the effects of the other variables on this PC are at least an order of magnitude smaller. It is also interesting to note that the loadings of MW and Residue MW are practically identical on all principal components: the two variables contribute in the same way to the new dimensions. Since the main objective of PCA is the reduction of data dimensionality, it is clear that we can discard one of these redundant variables, simplifying the data set without loss of information. Further analysis will be carried out on just eight of the variables:
Table 5: Eight selected variables for further PCA.
Name | Abbr. | Symbol | MW | pKa | pKb | pKx | pI | logD | Polar | NV |
Alanine | Ala | A | 89.10 | 2.34 | 9.69 | 20.00 | 6.00 | -2.841 | 8.478 | 1.00 |
Arginine | Arg | R | 174.20 | 1.82 | 8.99 | 12.48 | 10.76 | -3.156 | 18.035 | 6.13 |
Asparagine | Asn | N | 132.12 | 2.02 | 8.80 | 20.00 | 5.41 | -4.288 | 11.810 | 2.95 |
Aspartic acid | Asp | D | 133.11 | 1.88 | 9.60 | 3.65 | 2.77 | -3.504 | 11.418 | 2.78 |
Cysteine | Cys | C | 121.16 | 1.92 | 8.35 | 8.18 | 5.05 | -2.792 | 11.482 | 2.43 |
Glutamic acid | Glu | E | 147.13 | 2.10 | 9.67 | 4.25 | 3.22 | -3.241 | 13.488 | 3.78 |
Glutamine | Gln | Q | 146.15 | 2.17 | 9.13 | 20.00 | 5.65 | -4.001 | 13.866 | 3.95 |
Glycine | Gly | G | 75.07 | 2.35 | 9.78 | 20.00 | 5.97 | -3.409 | 6.636 | 0.00 |
Histidine | His | H | 155.16 | 1.82 | 9.17 | 6.00 | 7.59 | -3.293 | 14.926 | 4.66 |
Isoleucine | Ile | I | 131.18 | 2.36 | 9.68 | 20.00 | 6.02 | -1.508 | 14.268 | 4.00 |
Leucine | Leu | L | 131.18 | 2.36 | 9.60 | 20.00 | 5.98 | -1.586 | 14.325 | 4.00 |
Lysine | Lys | K | 146.19 | 2.16 | 9.18 | 10.53 | 9.74 | -3.215 | 16.009 | 4.77 |
Methionine | Met | M | 149.21 | 2.28 | 9.21 | 20.00 | 5.74 | -2.189 | 15.529 | 4.43 |
Phenylalanine | Phe | F | 165.19 | 2.16 | 9.18 | 20.00 | 5.48 | -1.184 | 17.197 | 5.89 |
Proline | Pro | P | 115.13 | 1.95 | 10.64 | 20.00 | 6.30 | -2.569 | 11.469 | 2.72 |
Serine | Ser | S | 105.09 | 2.19 | 9.21 | 20.00 | 5.68 | -3.888 | 9.448 | 1.60 |
Threonine | Thr | T | 119.12 | 2.09 | 9.10 | 20.00 | 5.66 | -3.471 | 11.008 | 2.60 |
Tryptophan | Trp | W | 204.23 | 2.43 | 9.44 | 20.00 | 5.89 | -1.085 | 21.197 | 8.08 |
Tyrosine | Tyr | Y | 181.19 | 2.20 | 9.11 | 10.07 | 5.66 | -1.489 | 18.204 | 6.47 |
Valine | Val | V | 117.15 | 2.32 | 9.62 | 20.00 | 5.96 | -1.953 | 12.238 | 3.00 |
Step by Step Principal Component Analysis
In my previous instalment, I started PCA by subtracting the averages of the original variables from each data point, to obtain a table of centred values. This transformation is convenient when doing manual calculations, but it is not universally performed by statistical software. For the remainder of this article I will use Minitab 16, which does not centre the variables. If I wished to centre them, I would select the command ‘Standardize’ from the ‘Calc’ menu and tick the option ‘Subtract mean’. Alternatively, the data could be transformed in a spreadsheet and copied into the software.
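For those working outside Minitab, the equivalent transformation is trivial; a sketch with pandas, assuming the Table 5 numerical columns sit in a data frame df:

```python
# Column-wise mean centring, equivalent to Minitab's
# Calc > Standardize > 'Subtract mean' option.
centred = df - df.mean()
```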
Covariance Matrix
The covariance matrix calculated on the eight variables in Table 5 is displayed below. After dispensing with Residue MW, the remaining covariances are the same as in Table 2. MW still has a very high variance, and we expect this variable to weigh heavily on PC1.
Table 6: Covariance matrix of the eight selected variables.
 | MW | pKa | pKb | pKx | pI | logD | Polar | NV |
MW | 952.538 | -0.706 | -4.285 | -51.409 | 11.138 | 12.881 | 106.491 | 58.566
pKa | -0.706 | 0.037 | 0.024 | 0.742 | -0.048 | 0.094 | 0.010 | -0.003
pKb | -4.285 | 0.024 | 0.216 | 0.660 | -0.121 | 0.096 | -0.384 | -0.199
pKx | -51.409 | 0.742 | 0.660 | 38.739 | 1.051 | 1.291 | -3.267 | -1.858
pI | 11.138 | -0.048 | -0.121 | 1.051 | 3.130 | -0.015 | 2.087 | 1.027
logD | 12.881 | 0.094 | 0.096 | 1.291 | -0.015 | 0.967 | 1.979 | 1.053
Polar | 106.491 | 0.010 | -0.384 | -3.267 | 2.087 | 1.979 | 12.702 | 6.880
NV | 58.566 | -0.003 | -0.199 | -1.858 | 1.027 | 1.053 | 6.880 | 3.758
Eigenvectors of the Covariance Matrix and Scree Test
The eight eigenvalues of the covariance matrix in Table 6 are displayed below:
PC | Eigenvalue (λ) | % of Total |
1 | 971.225 | 95.96% |
2 | 36.344 | 3.59% |
3 | 3.159 | 0.31% |
4 | 1.105 | 0.11% |
5 | 0.164 | 0.02% |
6 | 0.071 | 0.01% |
7 | 0.013 | 0.00% |
8 | 0.007 | 0.00% |
The variance ascribed to the first PC has dropped somewhat, but PC1 still explains more than 95% of data variability, with the rest being almost completely accounted for by PC2. It seems plausible that retaining these two principal components and discarding the rest would be a very efficient way to describe the original data set. It could also be argued that PC1 alone is sufficient, and the analysis could be reduced to a single dimension. The decision to retain one, two or more PCs is somewhat subjective. Ideally, it should be based on a deeper analysis of loadings and scores, but in practice it is often decided simply by the percentage of variance one wishes to preserve. A graphical tool called a “scree plot” is often used to aid in this decision:
Figure 1: Scree plot of the 8 eigenvalues.
A scree plot displays the eigenvalues from highest to lowest value (therefore, in decreasing order of variance). Very often scree plots – like the one on Figure 1 – have an “elbow” beyond which variance levels off. The principal components “above” the elbow are generally retained, as they account for the greatest amount of variability. In the example above, only PC1 would be retained.
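A scree plot is straightforward to reproduce; a minimal sketch with matplotlib, using the eight eigenvalues tabulated above (the original figures in this article were produced with Minitab, not with this code):

```python
import matplotlib.pyplot as plt

# Eigenvalues of the 8-variable covariance matrix, from the table above.
eigvals = [971.225, 36.344, 3.159, 1.105, 0.164, 0.071, 0.013, 0.007]

plt.plot(range(1, 9), eigvals, "o-")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue (variance)")
plt.title("Scree plot")
plt.show()
```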
Loadings of the Principal Components
Each eigenvalue has an associated eigenvector, which forms the basis of a principal component. The resulting principal components are unique linear combinations of the eight original variables, and the coefficients of the variables in the linear combinations are called the “loadings” of the PCs. These are shown on Table 7.
Table 7: Loadings of eight selected variables on PC1 to PC8.
Variable | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
MW | 0.990 | 0.043 | -0.050 | 0.089 | 0.019 | 0.072 | 0.019 | 0.036 |
pKa | -0.001 | 0.020 | -0.021 | -0.079 | -0.034 | -0.160 | 0.983 | -0.010 |
pKb | -0.004 | 0.012 | -0.025 | -0.173 | 0.977 | -0.065 | 0.010 | 0.107 |
pKx | -0.055 | 0.992 | -0.068 | 0.089 | 0.001 | 0.010 | -0.012 | 0.009 |
pI | 0.012 | 0.052 | 0.954 | 0.134 | 0.055 | 0.237 | 0.071 | 0.064 |
logD | 0.013 | 0.057 | -0.057 | -0.758 | -0.114 | 0.615 | 0.035 | 0.163 |
Polar | 0.111 | 0.077 | 0.261 | -0.547 | -0.157 | -0.728 | -0.162 | 0.185 |
NV | 0.061 | 0.040 | 0.099 | -0.239 | 0.064 | -0.022 | -0.029 | -0.960 |
PC1 has a very strong loading from MW, and much weaker contributions from Polarisability, Normalised Volume, and pKx, the pK of the amino acid side chain. pKx is the major contributor to PC2, which has much lower coefficients for logD, pI and Normalised Volume. pI is the major contributor to PC3, which also has a significant coefficient for Polarisability.
These contributions are often visualised on “loading plots”, such as Figure 2 for PC1 and PC2, which simplify the interpretation of the principal components being plotted in terms of the original variables:
Figure 2: Loading plot for PC1 versus PC2.
In this graph, the horizontal axis indicates the contribution of each variable to PC1, and the vertical axis represents the contribution to PC2. Those variables close to zero (lower left corner of the plot) have small loadings on both principal components. MW has a strong loading on PC1 (close to one), but very weak on PC2 (close to zero). The opposite is true for pKx and PC2.
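A loading plot like Figure 2 can be sketched directly from the values in Table 7; the listing below is an illustration with matplotlib, not the way the original figure was produced:

```python
import matplotlib.pyplot as plt

names = ["MW", "pKa", "pKb", "pKx", "pI", "logD", "Polar", "NV"]
pc1 = [0.990, -0.001, -0.004, -0.055, 0.012, 0.013, 0.111, 0.061]  # Table 7
pc2 = [0.043, 0.020, 0.012, 0.992, 0.052, 0.057, 0.077, 0.040]     # Table 7

fig, ax = plt.subplots()
ax.scatter(pc1, pc2)
for name, x, y in zip(names, pc1, pc2):
    ax.annotate(name, (x, y))          # label each variable
ax.axhline(0, linewidth=0.5)
ax.axvline(0, linewidth=0.5)
ax.set_xlabel("Loading on PC1")
ax.set_ylabel("Loading on PC2")
plt.show()
```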
In a very real sense, PC1 is essentially MW, “complemented” with Polarisability and Normalised Volume. If we plot the amino acids along PC1 (the “scores” of the amino acids on PC1, which we will plot in the next section) we should see small, light, non-polar amino acids at the lower values, whereas bulkier and more polarisable amino acids should have higher PC1 scores.
PC2 can be described as pKx “decorated” with pI and Polarisability. This axis should allow us to differentiate between neutral and polar or ionisable amino acids, as it encodes information related to the chemical properties of the side chains. PC3 should further enhance this classification and allow us to separate neutral, acidic, and basic amino acids.
Scores
After computing the loadings of the original variables on the principal components, the twenty amino acids can be “mapped” onto the new system of coordinates, by applying the linear combinations that define each PC to the values of each amino acid. The new coordinates are called “scores”.
For example, PC1 is defined as:

$$\mathrm{PC1} = 0.990\,\mathrm{MW} - 0.001\,\mathrm{pK_a} - 0.004\,\mathrm{pK_b} - 0.055\,\mathrm{pK_x} + 0.012\,\mathrm{pI} + 0.013\,\log D + 0.111\,\mathrm{Polar} + 0.061\,\mathrm{NV}$$

The score of Alanine on PC1 is therefore:

$$\mathrm{PC1_{Ala}} = 0.990 \times 89.10 - 0.001 \times 2.34 - 0.004 \times 9.69 - 0.055 \times 20.00 + 0.012 \times 6.00 + 0.013 \times (-2.841) + 0.111 \times 8.478 + 0.061 \times 1.00 \approx 88.12$$

(The hand calculation gives 88.11 because the loadings are rounded to three decimal places here.)
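The same arithmetic expressed as a dot product, using the figures from Tables 5 and 7:

```python
import numpy as np

# PC1 loadings from Table 7 and alanine's row from Table 5.
pc1_loadings = np.array([0.990, -0.001, -0.004, -0.055,
                         0.012, 0.013, 0.111, 0.061])
alanine = np.array([89.10, 2.34, 9.69, 20.00, 6.00, -2.841, 8.478, 1.00])

print(alanine @ pc1_loadings)   # ≈ 88.1 (88.12 with unrounded loadings)
```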
The complete table of scores is shown below:
Table 8: Scores of twenty amino acids on PC1 to PC8.
Name | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
Alanine | 88.12 | 24.66 | 2.13 | 5.88 | 10.49 | -0.92 | 2.74 | 4.96 |
Arginine | 174.23 | 21.99 | 6.01 | 7.34 | 10.55 | -0.89 | 2.53 | 5.01 |
Asparagine | 131.19 | 26.70 | 0.61 | 8.60 | 10.18 | -1.22 | 2.53 | 4.90 |
Aspartic acid | 132.99 | 10.40 | -1.03 | 6.42 | 10.78 | -1.20 | 2.53 | 4.93 |
Cysteine | 120.93 | 14.52 | 1.40 | 5.79 | 9.35 | -1.00 | 2.46 | 5.01 |
Glutamic acid | 147.14 | 11.84 | -0.72 | 6.18 | 10.84 | -1.49 | 2.68 | 4.94 |
Glutamine | 145.37 | 27.54 | 0.75 | 8.22 | 10.49 | -1.54 | 2.61 | 4.93 |
Glycine | 73.95 | 23.84 | 2.25 | 6.30 | 10.60 | -0.93 | 2.79 | 4.98 |
Histidine | 155.26 | 14.28 | 3.42 | 6.78 | 10.59 | -0.88 | 2.57 | 4.89 |
Isoleucine | 130.63 | 27.10 | 1.80 | 4.72 | 10.41 | -1.36 | 2.57 | 4.90 |
Leucine | 130.64 | 27.10 | 1.78 | 4.75 | 10.33 | -1.45 | 2.55 | 4.88 |
Lysine | 146.28 | 18.60 | 5.89 | 5.97 | 10.38 | -1.76 | 2.66 | 4.85 |
Methionine | 148.64 | 27.93 | 1.06 | 6.09 | 10.19 | -1.43 | 2.56 | 5.20 |
Phenylalanine | 164.75 | 28.84 | 0.54 | 5.46 | 10.17 | -0.95 | 2.44 | 4.83 |
Proline | 114.34 | 26.11 | 2.05 | 5.84 | 11.55 | -1.03 | 2.34 | 4.98 |
Serine | 104.08 | 25.36 | 1.42 | 7.47 | 10.32 | -1.16 | 2.65 | 4.90 |
Threonine | 118.22 | 26.14 | 1.19 | 7.33 | 10.25 | -1.03 | 2.54 | 4.80 |
Tryptophan | 203.99 | 30.94 | 0.24 | 6.12 | 10.67 | -1.01 | 2.76 | 4.95 |
Tyrosine | 181.29 | 19.77 | 0.93 | 5.57 | 10.31 | -0.80 | 2.73 | 4.90 |
Valine | 116.45 | 26.27 | 1.83 | 5.17 | 10.39 | -1.14 | 2.60 | 4.89 |
As expected, the scores on PC1 for the amino acids are not very different from their molecular weights. The small differences are the combined effects of the other variables on the calculation of PC1. However, we cannot give PC1 units of “g/mol”, since it is a combination of eight distinct physical magnitudes.
The scree plot and eigenvalues table suggested that PC1 would suffice to describe 96% of the information contained in the original data set. The graphical representation of the twenty amino acids on PC1 is displayed on the following figure:
Figure 3: PC1 scores of twenty amino acids.
We predicted that small, neutral amino acids would be placed on the lower end of PC1, and increasingly bulkier and more polarisable species would appear at higher PC1 values. Whilst this is correct, PC1 alone is not able to separate the different chemical categories of amino acids, represented by the coloured icons in the figure.
The reason is clear: the Molecular Weight values are so much bigger than those of the other variables that they dominate the computation of the first principal component. Although MW accounts for a very large amount of variance, it does not encode the chemical information reflected in the other variables. This subtle information, which could arguably be of high value in many situations, is “displaced” to the lower rank principal components (PC2, PC3, etc.)
When deciding which principal components should be retained and which discarded, the scree plot rule is somewhat simplistic. It should always be supplemented with a brief exploration of loadings and scores to better understand how the information is being represented in the new PC space.
Standardised Variables and the Correlation Matrix
Most statistical packages offer the option to standardise the original variables as the first step prior to PCA. Standardisation, or normalisation, consists in calculating the mean and standard deviation of each variable, and using them to transform the data points:

$$z_{i,j} = \frac{x_{i,j} - \bar{x}_i}{s_i}$$

Where $\bar{x}_i$ and $s_i$ are the mean and standard deviation of variable i, and $x_{i,j}$ is the value of variable i for object j. For example, the variable MW (mean 136.90 g/mol, standard deviation 30.86 g/mol) would be standardised as follows:

$$z_{\mathrm{MW},j} = \frac{x_{\mathrm{MW},j} - 136.90}{30.86}$$
Standardisation is generally used when there are big differences in absolute values and (or) ranges between the variables. Our data set is a good example: MW values are one order of magnitude greater than those of NV; and they range between 75.07 g/mol and 204.23 g/mol, whereas the range of NV is much smaller, 0 – 8.08 units. Consequently, as we have seen in previous sections, MW will naturally have much more weight on the calculation of the first few PCs than NV – which may or may not be a desirable effect. By subtracting the mean and dividing by the standard deviation, these large differences are negated: all standardised variables will have an average of zero, and a variance of 1, which ensures they have equal weight on the calculation of PCs.
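A sketch of the transformation in NumPy, assuming X is the 20 × 8 data matrix of Table 5:

```python
import numpy as np

# Z-scores: subtract each column's mean, divide by its standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The covariance matrix of Z is the correlation matrix of X.
R = np.cov(Z, rowvar=False)
assert np.allclose(R, np.corrcoef(X, rowvar=False))
```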
After standardising the data, the covariance matrix becomes a correlation matrix: the diagonal elements are 1, and the elements above and below the diagonal indicate the pairwise correlations between the variables:
Table 9: Correlation matrix for eight selected variables (standardised).
MW | pKa | pKb | pKx | pI | logD | Polar | NV | |
MW | 1.000 | -0.119 | -0.298 | -0.268 | 0.204 | 0.425 | 0.968 | 0.979 |
pKa | -0.119 | 1.000 | 0.268 | 0.621 | -0.141 | 0.500 | 0.015 | -0.007 |
pKb | -0.298 | 0.268 | 1.000 | 0.228 | -0.147 | 0.209 | -0.232 | -0.221 |
pKx | -0.268 | 0.621 | 0.228 | 1.000 | 0.095 | 0.211 | -0.147 | -0.154 |
pI | 0.204 | -0.141 | -0.147 | 0.095 | 1.000 | -0.008 | 0.331 | 0.299 |
logD | 0.425 | 0.500 | 0.209 | 0.211 | -0.008 | 1.000 | 0.565 | 0.552 |
Polar | 0.968 | 0.015 | -0.232 | -0.147 | 0.331 | 0.565 | 1.000 | 0.996 |
NV | 0.979 | -0.007 | -0.221 | -0.154 | 0.299 | 0.552 | 0.996 | 1.000 |
Scree Plot
The eigenvalue table and scree plot show a remarkable departure from the previous analysis:
PC | Eigenvalue (λ) | % Variance | % Cumulative |
1 | 3.469 | 43.36% | 43.36% |
2 | 2.089 | 26.11% | 69.47% |
3 | 1.057 | 13.22% | 82.68% |
4 | 0.782 | 9.77% | 92.46% |
5 | 0.351 | 4.39% | 96.84% |
6 | 0.247 | 3.08% | 99.93% |
7 | 0.004 | 0.05% | 99.98% |
8 | 0.001 | 0.02% | 100.00% |
Figure 4: Scree plot of eight principal components (standardised variables)
Calculated on standardised variables, the first principal component only contains 43% of the variance; 26% is ascribed to PC2, 13% to PC3, etc. It is clear that we will need to retain a larger number of PCs to represent the system. However, the scree plot does not have the sharp “elbow” we saw in the previous calculation. How is a decision made in these cases? Although a deeper analysis of loadings and scores is always a sound idea, it is common in practice to apply a “Pareto” criterion and retain those principal components that accumulate 80% or more of the variance. In this example, retaining PC1, PC2 and PC3 could be considered satisfactory.
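The Pareto criterion is easy to automate; a minimal sketch using the eigenvalues tabulated above:

```python
import numpy as np

# Eigenvalues of the correlation matrix, from the table above.
eigvals = np.array([3.469, 2.089, 1.057, 0.782, 0.351, 0.247, 0.004, 0.001])

cumulative = np.cumsum(eigvals / eigvals.sum())
n_keep = int(np.searchsorted(cumulative, 0.80)) + 1
print(n_keep)   # 3: PC1-PC3 accumulate 82.7% of the variance
```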
Loadings
The loadings chart for PC1 and PC2, and the loadings table for the first three principal components are shown below:
Figure 5: Loading chart for PC1 and PC2
Table 10: Loadings for PC1, PC2, PC3 (standardised variables)
Variable | PC1 | PC2 | PC3 |
MW | 0.519 | 0.072 | 0.111 |
pKa | -0.016 | -0.614 | -0.030 |
pKb | -0.157 | -0.364 | 0.254 |
pKx | -0.107 | -0.517 | -0.455 |
pI | 0.184 | 0.081 | -0.816 |
logD | 0.304 | -0.458 | 0.222 |
Polar | 0.534 | -0.037 | 0.000 |
NV | 0.532 | -0.029 | 0.029 |
PC1 is almost equally shared between MW, Polarisability, and Normalised Volume: the factors that were secondary in our original analysis have now become much more prominent. PC2 is composed of pKa, followed closely by pKx and logD. pKb has a smaller, but significant, contribution. PC3 is still strongly dominated by pI, but it also receives a strong contribution from pKx.
It should be noted that many of these loadings are negative: as the corresponding variables increase in value, PC2 and PC3 decrease. It is the magnitude of the loading, regardless of its sign, that determines which variables are most strongly associated with each PC.
These combinations of the variables seem almost natural. It is often said that PCA reveals the hidden structure of a data set, and can uncover latent variables that may not have been initially obvious. For example, it seems appropriate to identify PC1 with the “bulk”, PC2 with the “polarity”, and PC3 with the “acid-base character” of the amino acids. We might even prefer these more descriptive names to PC1, PC2 and PC3.
Scores
Table 11: Scores of twenty amino acids on PC1, PC2, PC3 (standardised variables)
Name | PC1 | PC2 | PC3 |
Alanine | -2.559 | -1.201 | -0.385 |
Arginine | 2.518 | 2.021 | -2.018 |
Asparagine | -0.983 | 1.203 | -0.691 |
Aspartic acid | -1.082 | 1.904 | 2.357 |
Cysteine | -0.570 | 2.128 | 0.398 |
Glutamic acid | -0.186 | 0.990 | 2.235 |
Glutamine | -0.175 | 0.339 | -0.514 |
Glycine | -3.556 | -1.039 | -0.518 |
Histidine | 1.007 | 2.347 | -0.107 |
Isoleucine | 0.255 | -1.885 | 0.096 |
Leucine | 0.262 | -1.788 | 0.053 |
Lysine | 1.187 | 0.904 | -1.489 |
Methionine | 0.791 | -0.934 | -0.102 |
Phenylalanine | 2.014 | -1.009 | 0.327 |
Proline | -1.375 | -0.806 | 0.238 |
Serine | -2.163 | 0.146 | -0.646 |
Threonine | -1.246 | 0.358 | -0.521 |
Tryptophan | 3.833 | -2.087 | 0.435 |
Tyrosine | 2.708 | -0.089 | 0.923 |
Valine | -0.680 | -1.502 | -0.070 |
This table of scores allows us to visualise the twenty amino acids in a three-dimensional space, as seen in the figure below (created with Plotly Chart Studio) where each amino acid has been labelled according to the colour scheme seen earlier:
Figure 6: 3D scatter plot of PC1, PC2, PC3
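For readers who wish to reproduce a chart of this kind, a hypothetical sketch with plotly.express is given below; scores3 is assumed to be a data frame holding the Table 11 scores, indexed by amino acid name, with an added ‘category’ column for the chemical classes:

```python
import plotly.express as px

# scores3: hypothetical data frame with columns PC1, PC2, PC3 (Table 11)
# plus a 'category' column holding each amino acid's chemical class.
fig = px.scatter_3d(scores3, x="PC1", y="PC2", z="PC3",
                    color="category", text=scores3.index)
fig.show()
```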
The chart clearly shows that the combination of the three principal components is not only able to sort the objects by increasing molecular weight (along the PC1 axis), but also reveals distinct groups and trends that correspond to the seven categories of amino acids.
Displaying the scores on two-dimensional scatter plots clearly shows the ability of PCA to show patterns in the data set and to simplify further analysis:
Figure 7: 2D scatter plot of "Polarity" (PC2) v "Bulk" (PC1)
Figure 8: 2D scatter plot of "Acid-Base" (PC3) v "Bulk" (PC1)
Figure 9: 2D scatter plot of "Acid-Base" (PC3) v "Polarity" (PC2)
These graphs illustrate a potential application of PCA: Figure 7 shows that amino acids with a positive PC2 are polar or ionisable in nature, whereas amino acids with a negative PC2 are aliphatic or aromatic. Figure 8 shows that acidic amino acids have a positive PC3, whereas basic amino acids have a negative PC3. Those with a PC3 close to zero are non-ionisable.
Raw or standardised variables are not the only possibilities: one could apply judicious transformations to the data if they are deemed to be of benefit. For example, in the early 2000s Snyder et al. developed a method for HPLC column characterisation, widely known as the “Hydrophobic Subtraction Model” (J. Chrom. A, 961, 2002, 171-193). The aim of the model was to quantify the different retention mechanisms involved in a chromatographic separation: hydrophobic interaction, ion exchange, shape selectivity, etc. They realised early on that hydrophobic interaction had an overwhelming effect on the results, and decided to “subtract” it by dividing all the other factors by the hydrophobic contribution. Dividing pKa, pKb, pI… by the molecular weight of the amino acid may not be appropriate in our data set (I am not sure “pI per mass unit” makes any physical sense) but other strategies could well be interesting.
Outliers
Outliers are observations that lie outside the general pattern of a distribution or mathematical model. In linear regression, for example, outliers can be detected by calculating “standardised residuals”:

$$r_i = y_i - \hat{y}_i$$

Where the residual, $r_i$, is defined as the distance between the experimental point $y_i$ and the regressed value, $\hat{y}_i$. After computing all the residuals, their average $\bar{r}$ and standard deviation $s_r$ can be calculated, and the residuals can be standardised thus:

$$z_i = \frac{r_i - \bar{r}}{s_r}$$

Any points with standardised residuals beyond a critical value are potential outliers and should be investigated further.
The multivariate equivalent of standardised residuals is the Mahalanobis Distance (MD), named after its creator, Prasanta Chandra Mahalanobis. It is used here in its squared (quadratic form) version, which is the statistic compared against the Chi-squared distribution below:

$$MD_i = (\mathbf{x}_i - \bar{\mathbf{x}})^\mathsf{T}\,\mathbf{C}^{-1}\,(\mathbf{x}_i - \bar{\mathbf{x}})$$

Here, $\mathbf{x}_i$ is the object under investigation (e.g. each one of our amino acids) in the principal component space. $\mathbf{x}_i$ is therefore the vector of eight scores for a particular amino acid; for alanine, the first three coefficients are (-2.559, -1.201, -0.385, …), as listed in Table 11.
$\bar{\mathbf{x}}$ represents the centroid of the objects in PC space: a vector whose coefficients are the average of each PC. For standardised data, the calculation is simplified because the average of each PC is zero:

$$\bar{\mathbf{x}} = (0, 0, \ldots, 0)$$

Finally, $\mathbf{C}^{-1}$ is the inverse of the covariance matrix. For standardised data, this is the inverse of the correlation matrix:
286.938 | 7.249 | 8.660 | 9.306 | 37.263 | 44.122 | -9.106 | -303.938 |
7.249 | 2.826 | 0.050 | -1.194 | 1.646 | 0.220 | -7.588 | -0.307 |
8.660 | 0.050 | 1.630 | 0.329 | 0.921 | 0.637 | 4.154 | -12.830 |
9.306 | -1.194 | 0.329 | 2.354 | 0.437 | 1.257 | 5.404 | -14.889 |
37.263 | 1.646 | 0.921 | 0.437 | 6.595 | 6.286 | -10.671 | -31.013 |
44.122 | 0.220 | 0.637 | 1.257 | 6.286 | 9.537 | -7.009 | -43.025 |
-9.106 | -7.588 | 4.154 | 5.404 | -10.671 | -7.009 | 189.262 | -170.797 |
-303.938 | -0.307 | -12.830 | -14.889 | -31.013 | -43.025 | -170.797 | 496.522 |
The Mahalanobis distances for all twenty amino acids are:
Name | MD | MD critical | p Value |
Alanine | 2.299 | 15.507 | 0.970 |
Arginine | 3.277 | 15.507 | 0.916 |
Asparagine | 2.543 | 15.507 | 0.960 |
Aspartic acid | 2.787 | 15.507 | 0.947 |
Cysteine | 3.514 | 15.507 | 0.898 |
Glutamic acid | 2.903 | 15.507 | 0.940 |
Glutamine | 2.531 | 15.507 | 0.960 |
Glycine | 2.922 | 15.507 | 0.939 |
Histidine | 2.269 | 15.507 | 0.972 |
Isoleucine | 1.885 | 15.507 | 0.984 |
Leucine | 2.100 | 15.507 | 0.978 |
Lysine | 3.591 | 15.507 | 0.892 |
Methionine | 3.527 | 15.507 | 0.897 |
Phenylalanine | 2.673 | 15.507 | 0.953 |
Proline | 3.722 | 15.507 | 0.881 |
Serine | 1.719 | 15.507 | 0.988 |
Threonine | 2.202 | 15.507 | 0.974 |
Tryptophan | 3.180 | 15.507 | 0.923 |
Tyrosine | 2.550 | 15.507 | 0.959 |
Valine | 1.470 | 15.507 | 0.993 |
The critical Mahalanobis distances and p values are given by the Chi-squared distribution. In Excel, these can be calculated as:

MD critical = CHISQ.INV.RT(α, v)

p Value = CHISQ.DIST.RT(MD, v)

Where α is the significance level (set to 0.05 in this example) and v is the number of degrees of freedom, which is equal to the number of PCs, 8. Our calculations indicate that none of the amino acids is an outlier. In other words, they all fit well within the model.
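The same test outside Excel; a minimal sketch with NumPy and SciPy, where X is assumed to be the 20 × 8 matrix of scores (standardised data give the same distances). Following the convention used in the table above, the MD statistic is the quadratic form itself, compared directly against the Chi-squared critical value:

```python
import numpy as np
from scipy import stats

def mahalanobis(X):
    """Quadratic-form Mahalanobis distance of each row of X to the centroid."""
    d = X - X.mean(axis=0)
    C_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", d, C_inv, d)

md = mahalanobis(X)                       # X: 20 x 8 matrix of scores
v = X.shape[1]                            # degrees of freedom = 8
md_critical = stats.chi2.ppf(0.95, df=v)  # 15.507
p_values = stats.chi2.sf(md, df=v)
print(np.any(md > md_critical))           # False: no outliers
```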
Conclusions
After a detailed introduction to the algebra behind Principal Component Analysis, this instalment has focused on some of the practical aspects of the technique, such as dealing with missing data or with highly correlated – and therefore potentially redundant – variables.
Most statistical packages enable the user to apply basic transformations to the data set prior to PCA. One of the most common options is to standardise the variables, which is commonly carried out when there are large disparities in absolute values and variances. We have demonstrated how this technique modifies the resulting principal components, and given general guidelines on how to use it.
Finally, the most important aspect of PCA is to interpret the results. We have discussed at length some graphical and statistical tools, including Mahalanobis distances to evaluate the presence of outlier objects.
As ever, I hope this article has been enlightening and useful. Questions and comments are welcome, and please do not hesitate to contact us for further information.