Tony Taylor BSc CChem FRSC Cert. Ed.
Tony has been doing, researching, teaching and training in analytical chemistry for the past 28 years. He comes from a pharmaceutical and polymer analysis background and continues to work with both liquid and gas phase techniques at Crawford Scientific (UK).
His main interests are the use of LC-MS and GC-MS for structural characterisation and the quantitation of trace components in complex matrices. He is a professionally qualified trainer and is Technical Director of the CHROMacademy.

Calibration – done (badly?) every day
Many of our instrument techniques rely on a calibration in order to relate the detector response to the amount of analyte within our sample.
Where we have a wide expected analyte concentration range within our samples, it’s usual to achieve this using a range of standards of varying but known concentration to build a calibration curve of instrument response against known analyte amount. We then take the area count generated by the analyte from our sample (unknown) and interpolate the analyte amount or concentration using the calibration curve or, more precisely, using the regression equation which we derive from the calibration standard responses. Simple!
I’ve written about error estimation from the calibration curve, and many others have written similar articles. However, I continue to see mistakes being made in some of the crucial decisions associated with calibration curves: the treatment of the origin, the weighting of the regression line, the rejection of outliers and the estimation of ‘goodness of fit’, where we typically, and often incorrectly, rely upon the coefficient of determination (R^{2}) to accept or reject the calibration function.
I’m not going to consider good practice for designing a calibration experiment, such as the spacing of sample concentrations, the way in which the standards are prepared, the number of standards or the concentration range over which the calibrants are made. Further, I’m basing my treatment below on an experiment which uses absolute instrument response, rather than the response ratio obtained when using an internal standard to correct for variability in sample preparation or instrument response. There are excellent references which will guide you on these topics, listed at the end of this article. ^{[1–3]}
For the first part of this discussion, I’m assuming that we are using a method that has already been validated and we are working on a multipoint calibration with a single replicate at each calibration level – which is typical of many validated methods in routine use where sample analyte concentration is expected to vary widely (in bioanalytical measurements, reaction monitoring or environmental analysis for example). I’ll show later that during validation, it’s essential to properly characterise the calibration function and test the validity of the regression model that we adopt.
It is typical for analysts to construct a linear calibration curve of the form:

y = mx + b    (Equation 1)

where y is the detector response (area count for example), m is the slope of the regression line, x is the concentration and b is the y-intercept of the regression line.
Typically, one might construct the curve, examine the R^{2} (coefficient of determination of the regression model) for its closeness to 1 and then carry on with the daily business.
I’m often asked ‘how meaningful is the R^{2} value and what value should be considered as a rejection threshold for R^{2}?’, i.e. when is R^{2} so low that it indicates that the instrument response or the preparation of the calibration standards is ‘non-linear’, and that the curve should therefore not be used for the interpolation of analyte amounts from samples?

Well, you can start by asking yourself when you last rejected a calibration function based on the R^{2} value, or whether the method specification or company general operating guidelines you’re using contain a limit or rejection criterion for R^{2}.
In reality, R^{2} values are of limited use: they tell us only the percentage of variability in the instrument response or standard preparation that can be explained using the regression model that we have built (i.e. Equation 1). If the regression model can’t explain any of the response variability around its mean, then R^{2} will be 0 (or 0%), and if it can explain all of the response variability about the mean then R^{2} will be 1 (or 100%). Understand? Well – not many people do, and further, this value will not give you an indication of bias (systematic error) in your calibration. The number is merely an indication of how well the regression equation fits your data (which you have assumed to be linear) and really shouldn’t be taken in isolation as a measure of goodness of fit, or indeed linearity, of the data you have generated.
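For reference, the usual textbook definition of the coefficient of determination for a straight-line least squares fit can be written as below. The symbols SS_res, SS_tot, ŷ_i and ȳ are shorthand introduced here (they are not labels from any Excel output): ŷ_i is the response predicted by Equation 1 for standard i and ȳ is the mean of the measured responses.

```latex
R^{2} = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}
      = 1 - \frac{\sum_{i}\left(y_{i} - \hat{y}_{i}\right)^{2}}
                 {\sum_{i}\left(y_{i} - \bar{y}\right)^{2}},
\qquad \hat{y}_{i} = m x_{i} + b
```

Because the responses at the top of a wide calibration range dominate both sums, a value very close to 1 can coexist with large relative errors at the lowest standards.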
However, don’t worry, there are some simple measures that can really help you to increase your confidence (or confirm your lack of confidence) in your calibration function and provide a more helpful guide on whether a calibration curve is fit for purpose.
Firstly – the residuals plot, which is easily generated in MS Excel using the Regression function of the Data Analysis ToolPak. ^{[4]}
Residuals plots check the stochastic (or random) nature of the errors in your data. For a linear regression model to be valid, one should not be able to predict the error in any of the measurements. The residuals plot shows the error associated with each calibration ‘point’, so we are looking for a random scatter of the residuals around the 0 (no error) value. If one can discern a pattern, or can predict that the error will be either positive or negative for a particular concentration value, then you have a problem.
Figure 1 shows the residuals from two real-life analytical determinations – the top figure (Experiment 1) is from the calibration of 2,3,7,8-tetrachlorodibenzodioxin in a test for dioxins in river water and the other (Experiment 2) is from an LC-MS method to determine gemfibrozil in human plasma.

Experiment 1

| Conc. (ng/L) | Response |
| --- | --- |
| 0.25 | 62 |
| 0.5 | 143 |
| 0.75 | 245 |
| 1 | 451 |
| 1.25 | 516 |
| 1.5 | 666 |
| 1.75 | 833 |

y = 518.571x + (-102)
R^{2} = 0.9872

Experiment 2

| Conc. (μg/mL) | Response |
| --- | --- |
| 50 | 334.7933428 |
| 25 | 162.453282 |
| 10 | 65.30416852 |
| 5 | 32.1265819 |
| 2.5 | 16.02708877 |
| 1 | 6.47004995 |
| 0.5 | 3.792724692 |

y = 6.6782x + (-0.9688)
R^{2} = 0.9998

Figure 1: Data and residuals plots from two calibration experiments to illustrate the usefulness of residuals plots in determining the validity of a linear regression model.

Whilst the regression data of Experiment 1 look a little ‘scruffy’, with a low R^{2} value, a large negative intercept (indicating bias or constant systematic error – of which more later) and large residuals, the residuals plot does show a good deal of random scatter – which is good. Whilst the regression data of Experiment 2 look much ‘better’ (high R^{2} value, small negative intercept and low absolute residuals), the residuals plot shows a somewhat typical ‘U’ shape that is common in analytical techniques where there is evidence of non-linear behaviour. One may predict that any data below 2.5 μg/mL will show a positive error (residual) and, arguably, that any data over around 40–50 μg/mL will also show a positive error. Similarly, we could predict that sample concentrations between 10 and 40 μg/mL will show a negative residual. We are not supposed to be able to predict the errors when applying ordinary least squares (OLS) regression, so we would need to question the data. This goes some way to highlight that whilst data may show a high R^{2} value, this does not necessarily mean a valid linear regression model is in play!
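For readers who would rather script this check than use a spreadsheet, the following is a minimal sketch in plain Python (standard library only) that fits an unweighted straight line to the two data sets in Figure 1 and prints the sign of each residual from the lowest to the highest concentration. The function and variable names are illustrative choices, not part of any chromatography data system.

```python
# Calibration data copied from Figure 1.
EXP1_CONC = [0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
EXP1_RESP = [62, 143, 245, 451, 516, 666, 833]
EXP2_CONC = [50, 25, 10, 5, 2.5, 1, 0.5]
EXP2_RESP = [334.7933428, 162.453282, 65.30416852, 32.1265819,
             16.02708877, 6.47004995, 3.792724692]


def fit_line(conc, resp):
    """Unweighted least squares fit of resp = m * conc + b; returns (m, b)."""
    n = len(conc)
    mean_x = sum(conc) / n
    mean_y = sum(resp) / n
    sxx = sum((x - mean_x) ** 2 for x in conc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(conc, resp))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(conc, resp, slope, intercept):
    """Observed response minus the response predicted by the fitted line."""
    return [y - (slope * x + intercept) for x, y in zip(conc, resp)]


def r_squared(resp, res):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(resp) / len(resp)
    ss_total = sum((y - mean_y) ** 2 for y in resp)
    ss_residual = sum(r ** 2 for r in res)
    return 1 - ss_residual / ss_total


def sign_pattern(conc, res):
    """Residual signs ordered from lowest to highest concentration."""
    ordered = sorted(zip(conc, res))
    return "".join("+" if r > 0 else "-" for _, r in ordered)


results = {}
for name, conc, resp in (("Experiment 1", EXP1_CONC, EXP1_RESP),
                         ("Experiment 2", EXP2_CONC, EXP2_RESP)):
    m, b = fit_line(conc, resp)
    res = residuals(conc, resp, m, b)
    results[name] = (m, b, r_squared(resp, res), sign_pattern(conc, res))
    print(f"{name}: y = {m:.4f}x + ({b:.4f}), R^2 = {results[name][2]:.4f}, "
          f"residual signs (low to high conc.): {results[name][3]}")
```

A run of like signs at one end of the range, as Experiment 2 gives, is the numerical counterpart of the ‘U’ shape seen on the plot; it is a prompt to look at the plot, not a formal test.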
So we need to examine our data for heteroscedasticity (wow – scary statistical terms), which may tell us something more about any statistical weighting that we may want to apply to the data. Which is another frequently asked question!
If weighting is to be applied, you probably want to have this justified using an F-test, which examines the data for homo- or heteroscedasticity – which is just a statistical term to describe whether the data variances are equal across the concentration range (homo) or change towards the higher or lower concentrations (hetero).
This justification is often required, particularly when validating a method, by some regulatory authorities, such as the Food and Drug Administration (FDA), who state ‘Standard curve fitting is determined by applying the simplest model that adequately describes the concentration–response relationship using appropriate weighting and statistical tests for goodness of fit’. ^{[5]}
Fortunately, I had access to the gemfibrozil assay validation data: 6 replicate standard curves which were generated in order to assess various validation criteria. The back calculated values from the regression lines derived for each separate curve are shown in Table 1. We can use these data to get a better picture of the homo- or heteroscedasticity of the data. Whilst I appreciate that one might not go to these lengths when simply building a calibration curve using a previously validated assay, non-random residuals in a simple unweighted linear calibration do point towards a lack of attention to the proper calibration model during method development and validation.
| Nominal Conc. (μg/mL) | Curve 1 | Curve 2 | Curve 3 | Curve 4 | Curve 5 | Curve 6 |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 50.4894 | 50.4191 | 50.0246 | 50.4007 | 50.1891 | 50.1703 |
| 25 | 24.0717 | 24.2661 | 24.9604 | 24.2236 | 24.5211 | 24.7353 |
| 10 | 9.7325 | 9.7421 | 10.0221 | 9.8953 | 10.3546 | 9.7718 |
| 5 | 5.1392 | 4.9177 | 4.9313 | 4.9968 | 4.7813 | 4.9674 |
| 2.5 | 2.6767 | 2.4591 | 2.4162 | 2.5520 | 2.4879 | 2.6782 |
| 1 | 1.1574 | 1.2698 | 1.0397 | 1.1804 | 1.0283 | 1.0246 |
| 0.5 | 0.7330 | 0.9261 | 0.6058 | 0.7511 | 0.6377 | 0.6523 |

Table 1: Back calculated gemfibrozil concentrations for 6 separately prepared standard curves. 

By combining all responses and concentrations into two contiguous columns with MS Excel and rerunning the regression statistics described above we can get a fuller picture of the residuals spread for this particular analysis. The residuals plot shown in Figure 2 displays this data. 

Figure 2: Residuals Plot for the back calculated gemfibrozil validation data in Table 1. 

As can be seen, the error in the data seems to form a ‘fan’ shape (blue dotted lines), which suggests that the variability in the data at low concentrations is lower than that at high concentrations, rather than being randomly distributed with equal variance – for instance between the two red dotted lines on the figure. However, it is difficult to be absolutely sure due to the distribution of the residuals which, as pointed out much earlier in the discussion, seems to take a U-shape. We can employ an F-test to check whether the data really are heteroscedastic.
The null hypothesis (again just a statistical term to mean the assumption that we are testing) for this F-test is that the variance of the data at the low end of the curve and the high end of the curve are equal. This needs to be the case for unweighted linear regression to be the most appropriate model for our calibration.
An F-test can be easily carried out in MS Excel (F-Test Two-Sample for Variances) and works on the basis that if the F statistic in the output is greater than the one-tail critical F value (taken from tables or from the Excel output) then we need to reject the null hypothesis and state that the variances at the high and low end of the curve are unequal and that some form of weighting will produce more valid answers. The results of this test, performed on the instrument response rather than the interpolated values, are shown in Figure 3. As the F-test compares the variances of two samples (or populations if one has large amounts of data), it is typical to compare the data from the highest and lowest concentrations on the calibration curve. One needs at least two determinations at each concentration, although to increase the validity of the test one should aim to have at least six replicates.


| | Curve 1 | Curve 2 | Curve 3 | Curve 4 | Curve 5 | Curve 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Response (50 μg/mL) | 388.0412 | 379.7061261 | 362.9748 | 379.5124 | 400.0833 | 383.2355 |
| Response (0.5 μg/mL) | 4.13933 | 4.676531518 | 3.950976 | 4.215466 | 4.493892 | 4.572878 |

F-Test Two-Sample for Variances

| | Variable 1 | Variable 2 |
| --- | --- | --- |
| Mean | 382.2589 | 4.34151214 |
| Variance | 147.6065 | 0.07965423 |
| Observations | 6 | 6 |
| df | 5 | 5 |
| F | 1853.09 | |
| P(F<=f) one-tail | 3.67E-08 | |
| F Critical one-tail | 5.050329 | |

Figure 3: F test to investigate heteroscedasticity in calibration curve response data and justify the use of regression line weighting. 

One very important point to note is that the data should be arranged such that the variance of variable 1 is larger than the variance of variable 2. If this is not the case, simply switch your data in their rows or columns prior to re-performing the F-test.
As F > F Critical one-tail (1853.09 > 5.050329), we reject the null hypothesis (the original assumption that the variance of the data at the highest and lowest concentrations is equal) and conclude that the variances are unequal and therefore weighting of the calibration line may be appropriate.
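The same comparison can be sketched in a few lines of Python using only the standard library. The response values are those listed in Figure 3; the critical value is simply copied from the Excel output there (5 and 5 degrees of freedom) rather than computed, since the standard library has no F distribution. The function name `f_ratio` is an illustrative choice.

```python
from statistics import variance

# Replicate responses at the highest and lowest calibration levels (Figure 3).
RESP_HIGH = [388.0412, 379.7061261, 362.9748, 379.5124, 400.0833, 383.2355]
RESP_LOW = [4.13933, 4.676531518, 3.950976, 4.215466, 4.493892, 4.572878]

# One-tail critical F for 5 and 5 degrees of freedom, copied from Figure 3.
F_CRITICAL = 5.050329


def f_ratio(sample_a, sample_b):
    """Ratio of sample variances with the larger variance as the numerator.

    Returns (F, numerator df, denominator df).
    """
    var_a, var_b = variance(sample_a), variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1


f_value, df_num, df_den = f_ratio(RESP_HIGH, RESP_LOW)
heteroscedastic = f_value > F_CRITICAL

print(f"F = {f_value:.2f} on ({df_num}, {df_den}) degrees of freedom")
print("Reject equal variances" if heteroscedastic else "No evidence of unequal variances")
```

Putting the larger variance in the numerator inside the function removes the need to rearrange the columns by hand. For a different number of replicates the critical value must be looked up again for the new degrees of freedom.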
This leads us nicely to the next frequently asked question of ‘what weighting do I apply to my calibration curve’.
We now start to need help from more advanced statistical techniques, although an MS Excel spreadsheet with some simple embedded functions (calculations) will do the job very nicely. One very nice version that I regularly use can be found here
Download MS Excel spreadsheet of simple embedded functions here »
Whilst the effects of regression weighting are a little beyond the standard Excel Data Analysis ToolPak, you will see that this modified sheet is very easy to use.
In the spreadsheet, enter your data (nominal concentration and average instrument response from Table 1) into the concentration and readings columns (B and C) and copy the relevant weighting from columns Z and AA into column A – these represent the weighting factors based on your data for 1/x and 1/x^{2}. Enter a value of 1 into column A for each concentration to get the unweighted line (1/x^{0}). Copy the same average instrument response data as you placed in column C into column J, and column K will give you the interpolated (back calculated) value for the sample concentration. Then simply calculate the % recovery, (Interpolated Conc. / Nominal Conc.) x 100, and for each value calculate the relative error (i.e. the difference between the calculated recovery and 100%).
I’ve shown these results in Table 2 for a calibration function based on the mean values of instrument response for the data in Table 1. 
| Nominal Conc. (μg/mL) | Average Response |
| --- | --- |
| 50 | 334.7933 |
| 25 | 162.4533 |
| 10 | 65.30417 |
| 5 | 32.12658 |
| 2.5 | 16.02709 |
| 1 | 6.47005 |
| 0.5 | 3.792725 |

Interpolated results from a single calibration curve

| μg/mL | 1/x^{0} | 1/x | 1/x^{2} |
| --- | --- | --- | --- |
| 50 | 50.2775 | 50.4997 | 49.4896 |
| 25 | 24.4710 | 24.4997 | 23.9977 |
| 10 | 9.9238 | 9.8434 | 9.6278 |
| 5 | 4.9558 | 4.8381 | 4.7203 |
| 2.5 | 2.5450 | 2.4092 | 2.3390 |
| 1 | 1.1139 | 0.9674 | 0.9253 |
| 0.5 | 0.7130 | 0.5635 | 0.5293 |

% recovery

| μg/mL | 1/x^{0} | 1/x | 1/x^{2} |
| --- | --- | --- | --- |
| 50 | 100.5550 | 100.4419 | 97.9998 |
| 25 | 97.8842 | 100.1171 | 97.9511 |
| 10 | 99.2381 | 99.1894 | 97.8102 |
| 5 | 99.1151 | 97.6251 | 97.5665 |
| 2.5 | 101.7998 | 94.6650 | 97.0834 |
| 1 | 111.3910 | 86.8475 | 95.6491 |
| 0.5 | 142.6007 | 79.0304 | 93.9313 |

% relative error and sum of % relative errors

| μg/mL | 1/x^{0} | 1/x | 1/x^{2} |
| --- | --- | --- | --- |
| 50 | 0.555 | 0.442 | 2.000 |
| 25 | 2.116 | 0.117 | 2.049 |
| 10 | 0.762 | 0.811 | 2.190 |
| 5 | 0.885 | 2.375 | 2.434 |
| 2.5 | 1.800 | 5.335 | 2.917 |
| 1 | 11.391 | 13.153 | 4.351 |
| 0.5 | 42.601 | 20.970 | 6.069 |
| S_{%RE} | 60.1091 | 43.2017 | 22.0087 |
| R^{2} | 0.9998 | 0.9987 | 0.9887 |

Table 2: Evaluation of weighting factors on the quality of interpolated and back calculated data for an LC-MS assay of gemfibrozil in human plasma.

The value of S_{%RE} is the sum of the relative errors and gives a good indication of the most appropriate weighting to use on the data. In this case a weighting of 1/x^{2} will result in the smallest errors in the interpolated values of unknowns across the whole calibration range. If we refer back to the FDA guidance mentioned above, it calls for the ‘simplest model’ (1/x^{0} in preference to 1/x, and then 1/x^{2}, etc.) that ensures the data meet the criteria on maximum allowable error, so of course we should examine the data carefully to see which of the weightings would fall into this category by referring to the allowable error in the particular guidelines chosen.
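The comparison of weightings can also be sketched without the spreadsheet. The Python below is a minimal illustration, assuming the weights are applied in the usual weighted least squares sense (minimising the sum of w·(y − mx − b)^{2} with w = 1/x^{p}); the spreadsheet referred to above may define or normalise its weights differently, so the figures for 1/x and 1/x^{2} are not guaranteed to reproduce Table 2 digit for digit. All names are illustrative.

```python
# Mean response data for the gemfibrozil calibration (nominal conc. in ug/mL).
CONC = [50, 25, 10, 5, 2.5, 1, 0.5]
RESP = [334.7933428, 162.453282, 65.30416852, 32.1265819,
        16.02708877, 6.47004995, 3.792724692]


def weighted_fit(conc, resp, weights):
    """Weighted least squares fit of resp = m * conc + b; returns (m, b)."""
    sw = sum(weights)
    swx = sum(w * x for w, x in zip(weights, conc))
    swy = sum(w * y for w, y in zip(weights, resp))
    swxx = sum(w * x * x for w, x in zip(weights, conc))
    swxy = sum(w * x * y for w, x, y in zip(weights, conc, resp))
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    intercept = (swy - slope * swx) / sw
    return slope, intercept


def sum_pct_relative_error(conc, resp, power):
    """Sum over all standards of |% recovery - 100| for weights 1/x**power."""
    weights = [1.0 / x ** power for x in conc]
    slope, intercept = weighted_fit(conc, resp, weights)
    total = 0.0
    for x, y in zip(conc, resp):
        back_calculated = (y - intercept) / slope
        total += abs(100.0 * back_calculated / x - 100.0)
    return total


# power 0 is the unweighted line, 1 is 1/x and 2 is 1/x^2.
sums = {power: sum_pct_relative_error(CONC, RESP, power) for power in (0, 1, 2)}
for power, value in sums.items():
    print(f"1/x^{power}: sum of % relative errors = {value:.2f}")
```

With these data the unweighted sum agrees with the 60.1 in Table 2 and the ranking of the three weightings is the same as in the table, which is the point of the exercise: the weighting is chosen on the back calculated errors, not on R^{2}.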
On visual inspection of the residuals data in Figure 2, it would appear that dropping the data associated with the 25 μg/mL calibrant may lead to a better overall result.
So the next frequently asked question is – ‘can I drop outliers from my data in order to improve the coefficient of determination and obtain more accurate results?’
I have to say this question is much more contentious, and dropping a point should only be considered after a thorough visual analysis of the calibration line and residuals. Where one point lies way off the regression line and is causing ‘leverage’, i.e. a skew of the data which obviously affects the slope of the calibration line (and often the accuracy of the interpolated data at the lowest concentrations), then this process may be considered, but for no more than one point on the calibration line.
Grubbs’ test can be used to determine whether or not a single outlying value within a set of measurements varies sufficiently from the mean value that it can be statistically classified as not belonging to the same population, and can therefore be omitted from subsequent calculations. As such, it is applied to either the highest or lowest residual value in the set; only one value may be omitted from the set on the basis of Grubbs’ test.

G = |x_{suspect} − x̄| / s

where x_{suspect} is the value with the largest or smallest ranked residual, x̄ is the mean and s is the standard deviation of the residuals.
In practice, first obtain a regression of the calibration line, then rank the residuals for each calibration level (concentration) from highest to lowest prior to calculating the Grubbs factor for each. Again for this exercise I’ve used the mean response data which gave rise to the interpolated values in Table 1.
I’ve shown the data and results in Figure 4. 
| Std Conc. (μg/mL) | Response | Residual |
| --- | --- | --- |
| 50 | 334.7933 | 1.853065 |
| 25 | 162.4533 | -3.53243 |
| 10 | 65.3042 | -0.50881 |
| 5 | 32.1266 | -0.29549 |
| 2.5 | 16.0271 | 0.300477 |
| 1 | 6.4700 | 0.760712 |
| 0.5 | 3.7927 | 1.4225 |

| Y residual (ranked) | ABS(Y residual - Mean) | Grubbs value | n=6 95%CI |
| --- | --- | --- | --- |
| 1.853065302 | 1.853065 | 1.042690335 | 1.8871 |
| 1.422477816 | 1.422478 | 0.800405614 | 1.8871 |
| 0.760711856 | 0.760712 | 0.428040447 | 1.8871 |
| 0.300477017 | 0.300477 | 0.169073633 | 1.8871 |
| -0.29548594 | 0.295486 | 0.166265232 | 1.8871 |
| -0.50881151 | 0.508812 | 0.286300132 | 1.8871 |
| -3.53243455 | 3.532435 | 1.987644664 | 1.8871 |

| Mean | 1.57334E-14 |
| --- | --- |
| SD | 1.777196201 |

Figure 4: Grubbs’ test for outliers on the gemfibrozil in human plasma calibration data.

Note that the Grubbs critical value is the two-sided statistic at the 95% level of confidence.
In this case we note that because the Grubbs factor for the standard with residual -3.5324 (the 25 μg/mL point) is greater than the critical factor (1.887), this point may be rejected from the calibration line and the quality of the interpolated data re-evaluated using the new regression equation.
Note again that this approach can only ever be applied to one point within the calibration curve and is used to prevent one outlier point from skewing the interpolated data. It goes without saying that the underlying cause of the outlying point should be investigated in terms of integration of the chromatogram, instrument response, partial injection, sample preparation etc. in order that the causal issue can be avoided in future.
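As a rough sketch of the calculation behind Figure 4, the Python below computes the Grubbs factor for each residual using only the standard library. The residuals are the values listed in Figure 4 (signs as obtained from the unweighted fit); the critical value is passed in as a constant taken from the figure rather than computed, and the names are illustrative.

```python
from statistics import mean, stdev

CONC = [50, 25, 10, 5, 2.5, 1, 0.5]
# Residuals from the unweighted regression of the mean response data (Figure 4).
RESIDUALS = [1.853065, -3.532435, -0.508812, -0.295486,
             0.300477, 0.760712, 1.422478]

# Critical value as listed in Figure 4 (labelled there as n=6, 95%).
G_CRITICAL = 1.8871


def grubbs_factors(values):
    """|value - mean| / sample standard deviation for every value."""
    centre = mean(values)
    spread = stdev(values)
    return [abs(v - centre) / spread for v in values]


factors = grubbs_factors(RESIDUALS)
g_max = max(factors)
suspect_conc = CONC[factors.index(g_max)]

print(f"Largest Grubbs factor: {g_max:.4f} at {suspect_conc} ug/mL")
print("Exceeds critical value" if g_max > G_CRITICAL else "Within critical value")
```

Critical values depend on the number of residuals tested: commonly tabulated two-sided 95% values are about 1.887 for n = 6 and about 2.020 for n = 7, so it is worth confirming which row of the table applies to a seven-level curve such as this one before rejecting a point.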
Now for our final frequently asked question – ‘do I include the origin in the calibration line?’
This is also a fairly controversial topic – however hopefully it’s one which is relatively easy to resolve.
Figure 5 shows a portion of the Regression Analysis carried out on the gemfibrozil analysis – the data is also shown in the Figure. 
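| μg/mL | Response |
| --- | --- |
| 50 | 334.7933 |
| 25 | 162.4533 |
| 10 | 65.30417 |
| 5 | 32.12658 |
| 2.5 | 16.02709 |
| 1 | 6.47005 |
| 0.5 | 3.792725 |

| | Coefficients | Standard Error |
| --- | --- | --- |
| Intercept | -0.968844343 | 0.940209031 |
| X Variable 1 | 6.678182437 | 0.043584416 |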

Figure 5: Assessment of the Standard Error of the Intercept, used to inform the decision on how the origin should be treated when generating the calibration model. 

From Figure 5, the Intercept coefficient is the value of the intercept to be used in the regression equation and the Standard Error (SE) is the uncertainty associated with this value. If the magnitude (either positive or negative) of the intercept is greater than the standard error, this indicates that there is a bias in the data which cannot be explained by the regression model and the origin should not be included in the regression (i.e. the origin should not be forced).
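The two numbers in Figure 5 can be reproduced outside Excel. The sketch below uses the standard textbook expressions for the standard errors of the slope and intercept of an unweighted straight-line fit (residual standard deviation on n − 2 degrees of freedom); the function name and return order are illustrative choices.

```python
from math import sqrt

CONC = [50, 25, 10, 5, 2.5, 1, 0.5]
RESP = [334.7933428, 162.453282, 65.30416852, 32.1265819,
        16.02708877, 6.47004995, 3.792724692]


def fit_with_standard_errors(conc, resp):
    """Unweighted fit of resp = m * conc + b.

    Returns (slope, intercept, SE of slope, SE of intercept).
    """
    n = len(conc)
    mean_x = sum(conc) / n
    mean_y = sum(resp) / n
    sxx = sum((x - mean_x) ** 2 for x in conc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(conc, resp))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(conc, resp))
    s_resid = sqrt(ss_res / (n - 2))          # residual standard deviation
    se_slope = s_resid / sqrt(sxx)
    se_intercept = s_resid * sqrt(sum(x * x for x in conc) / (n * sxx))
    return slope, intercept, se_slope, se_intercept


slope, intercept, se_slope, se_intercept = fit_with_standard_errors(CONC, RESP)
intercept_exceeds_se = abs(intercept) > se_intercept

print(f"Intercept = {intercept:.4f}, standard error = {se_intercept:.4f}")
print("Do not force the origin" if intercept_exceeds_se
      else "Forcing the origin may be acceptable")
```

The final comparison is only the simple rule of thumb described above; it is not a formal significance test on the intercept.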
In our case above, the magnitudes of the intercept and the standard error are very close (0.9688 and 0.9402), which gives us a dilemma. Fortunately, in the majority of cases the decision to include the origin or not is much less ambiguous. Our rule tells us that the origin should not be forced on this occasion; however, let’s take a look at some data which quantify the errors in the interpolation of data from regression lines created when ignoring, forcing or including the origin.
Don't force origin: y = 6.6781x + (-0.9688)

| μg/mL | Response | Interpolated | %Error |
| --- | --- | --- | --- |
| 50 | 334.793 | 50.2775 | 0.552 |
| 25 | 162.453 | 24.4710 | 2.162 |
| 10 | 65.304 | 9.9238 | 0.768 |
| 5 | 32.127 | 4.9558 | 0.893 |
| 2.5 | 16.027 | 2.5450 | 1.768 |
| 1 | 6.470 | 1.1139 | 10.226 |
| 0.5 | 3.793 | 0.7130 | 29.874 |
| S%Error | | | 46.24 |

Force origin: y = 6.6502x + 0

| μg/mL | Response | Interpolated | %Error |
| --- | --- | --- | --- |
| 50 | 334.793 | 50.343 | 0.682 |
| 25 | 162.453 | 24.428 | 2.341 |
| 10 | 65.304 | 9.820 | 1.835 |
| 5 | 32.127 | 4.831 | 3.500 |
| 2.5 | 16.027 | 2.410 | 3.734 |
| 1 | 6.470 | 0.973 | 2.785 |
| 0.5 | 3.793 | 0.570 | 12.329 |
| S%Error | | | 27.21 |

Include origin (0,0) as a point in the regression data: y = 6.67289x + (-0.7856)

| μg/mL | Response | Interpolated | %Error |
| --- | --- | --- | --- |
| 50 | 334.793 | 50.290 | 0.576 |
| 25 | 162.453 | 24.463 | 2.195 |
| 10 | 65.304 | 9.904 | 0.967 |
| 5 | 32.127 | 4.932 | 1.374 |
| 2.5 | 16.027 | 2.520 | 0.776 |
| 1 | 6.470 | 1.087 | 8.032 |
| 0.5 | 3.793 | 0.686 | 27.125 |
| S%Error | | | 41.05 |

Figure 6: Assessment of various origin treatments on the % error of interpolated data when using simple linear regression. 

As you can see from Figure 6, in this case forcing the origin (non-weighted linear regression was used) gives rise to a somewhat increased error for the standards at 2.5 μg/mL and above but a much lower error for the two lowest concentration standards. One would need to take a view on the validity of the approach, as judged by the overall error (S%Error) the interpolated data across the whole range would be more accurate (S%Error of 27.21 vs 46.24). Some software systems allow the origin to be ‘included’ as a point in the calibration table and will therefore use (0,0) as a datum when calculating the regression coefficients. As can be seen, this makes only a marginal difference in this case but should be considered when designing the calibration model. This should go alongside the injection of a ‘blank’ solution containing no analyte, which can be used to assess the contribution of co-eluting species or system noise to the calibration.
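The three origin treatments in Figure 6 can be compared with a short script. This is a minimal sketch in standard-library Python; note that the %Error values in Figure 6 appear to be expressed relative to the interpolated concentration (rather than the nominal one used for Table 2), so the sketch follows that convention in order to be comparable with the figure. The names are illustrative.

```python
CONC = [50, 25, 10, 5, 2.5, 1, 0.5]
RESP = [334.7933428, 162.453282, 65.30416852, 32.1265819,
        16.02708877, 6.47004995, 3.792724692]


def fit_free(conc, resp):
    """Unweighted fit of resp = m * conc + b with a free intercept."""
    n = len(conc)
    mean_x = sum(conc) / n
    mean_y = sum(resp) / n
    sxx = sum((x - mean_x) ** 2 for x in conc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(conc, resp))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_forced(conc, resp):
    """Unweighted fit of resp = m * conc, i.e. the line is forced through (0, 0)."""
    slope = sum(x * y for x, y in zip(conc, resp)) / sum(x * x for x in conc)
    return slope, 0.0


def sum_pct_error(conc, resp, slope, intercept):
    """Sum of |interpolated - nominal| / interpolated * 100 over the standards."""
    total = 0.0
    for x, y in zip(conc, resp):
        interpolated = (y - intercept) / slope
        total += abs(interpolated - x) / interpolated * 100.0
    return total


fits = {
    "free": fit_free(CONC, RESP),
    "forced": fit_forced(CONC, RESP),
    # (0, 0) is added as an extra datum; the intercept is still fitted.
    "origin_as_point": fit_free(CONC + [0.0], RESP + [0.0]),
}
# Errors are always evaluated on the seven real standards only.
sums = {name: sum_pct_error(CONC, RESP, m, b) for name, (m, b) in fits.items()}

for name, (m, b) in fits.items():
    print(f"{name}: y = {m:.4f}x + ({b:.4f}), S%Error = {sums[name]:.2f}")
```

Whichever treatment is chosen, the per-level errors are worth inspecting as well as the sum, since a lower total can hide a worse result at the concentrations that matter most for a given assay.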
One may want to consider using a combination of 1/x^{2} weighting with forcing the origin to assess the impact on the quality of the interpolated data.
In conclusion to this very long entry – I have to state that I’m not a statistician, but that all of the information and data above uses approaches that I have ‘collected’ over the course of many years in analytical science and some very simple spreadsheet work – no professional statistics packages need to be used.
As usual, when I publish this sort of material, there will be sharp-eyed chromatographers or statisticians who spot errors in the approaches or data. I welcome your feedback in this case and will follow up and cite all information that is received so that hopefully we can all do a little bit better with our calibration!
Happy New Year. 

References
1. Preparation of calibration curves: A guide to best practice
http://www.lgcgroup.com/ourscience/nationalmeasurementinstitute/publicationsandresources/goodpracticeguides/preparationofcalibrationcurvesaguidetobest/#.WH4AY9yvqwn
2. Calibration Curves, Part I: To b or Not to b? Mar 01, 2009, John W. Dolan, LCGC North America, Volume 27, Issue 3, pg 224–230
http://www.chromatographyonline.com/calibrationcurvespartibornotb
Part of a series of excellent articles on various aspects of calibration
3. Chromatography and Linear Regression: An Inseparable Pair: Lynn Vanatta, American Laboratory
http://www.americanlaboratory.com/913TechnicalArticles/38733ChromatographyandLinearRegressionAnInseparablePair/
Again part of an excellent series of articles on regression for analytical calibration
4. How to Estimate Error in Calibrated Instrument Methods - And Why We Have Stopped Doing It!: LCGC Blog
http://www.chromatographyonline.com/howestimateerrorcalibratedinstrumentmethodsandwhywehavestoppeddoingit
5. "Guidance for Industry: Bioanalytical Method Validation" http://www.fda.gov/downloads/Drugs/Guidance/ucm070107.pdf (May 2001).


