14 Jun 2023
Artificial Intelligence for the Chromatography Community
Recently, we pitted our engaged experts against ChatGPT across a number of frequently asked technical questions. The conclusion drawn was that we could see decent results from combining the strengths of artificial intelligence and human expertise in analytical chemistry, with a human "sense-checking" the output. But, overall, the job of the analytical scientist is entirely safe.
Our Chief Scientific Officer, Tony Taylor, took a further look, instead focusing his research on what the LLM is capable of delivering to the chromatography community and what needs to be done to get there.
How can ChatGPT aid chromatography?
There’s an awful lot of chat about ChatGPT – the Large Language Model (LLM) generative artificial intelligence engine from OpenAI. Based on 45TB (terabytes) of text-based training data and being the first LLM to surpass 100 billion parameters (learned variables from the training process), ChatGPT-3 (2020) is a powerful and high-profile tool that is changing the way that we interact with digital technology for learning and information retrieval. Or so we are told.
For the purposes of this exercise, I accessed ChatGPT-4 (March 2023) in the period between 25th May 2023 and 3rd June 2023. The training dataset size and the number of parameters generated are not publicly available for this version.
But what have LLMs ever done for chromatographers, and moreover, can a digital refugee like myself use the tool effectively to advance my work in separation science?
I’m really a newcomer to this technology, recently having my first foray under the guidance of my son who ‘simply couldn’t function without it’ as a Physics undergraduate. Apparently, it’s amazing for helping to write new computer code! Anyway, after a few lessons in the basics and coming to the realisation that interactions with ChatGPT need to be more conversations than ‘one and done’ questions, I started my investigations. Unlike other articles on the subject which show the questions, the responses from ChatGPT and a commentary on the suitability of the answers, I’ve focussed my research on what the LLM is capable of delivering to the chromatography community and what needs to be done to get there. Of course – if you want to see the answers to each question, you can simply enter the questions/conversations as I’ve documented them. A further qualification is that I deliberately didn’t make this a ‘narrow and deep’, quantitative type search. Rather, I wanted to explore various types of interaction that may result in useful information for chromatographers, but this should become obvious as we progress.
My first searches related to troubleshooting issues and the questioning was as follows:
Q What causes peak tailing of late eluting compounds in gas chromatography
Q But specifically what causes peak tailing for only late eluting compounds
Q What if the compound is a polyaromatic hydrocarbon
Q Is there any manufacturers advice about dealing with peak tailing when analysing acidic compounds?
Q Recommend analysis conditions for 2-ethyl hexanoic acid
The responses were very general, with nothing specific that I could action if I had this issue in the laboratory. As you can see from the questioning, I had to get more specific with each question in an attempt to elicit something practical. There was nothing here that I couldn’t have obtained from similar Google searches.
Q How often should I change the liner in GC system?
Q Any specific advice on this subject for food analysis?
Q What liner would you recommend for pesticide screening of vegetable extracts?
Q Can you give me a specific part number?
Whilst there was some good general information returned, again there was nothing actionable. The search for a specific liner part number yielded only the names of major GC manufacturers and a suggestion to contact them.
Q Can you guide me on how to check my GC inlet for leaks?
Q Will there be gas flowing even if the GC is turned off?
Q I was asking the question with reference to your answer of leak checking
Q But how can the leak detection fluid detect a gas leak if there is no gas flowing?
Here it was obvious that the LLM was agglomerating information from various elements of its dataset (which is current to September 2021) and ‘building the response’ on the fly. Essentially, it was asking me to use leak detection fluid around the inlet with the GC (and therefore the gas flows) turned off. The final suggestion would not have been usable in terms of a practical process for leak checking a GC inlet and in fact I would worry about the safety of some of the recommendations.
Q What extractables might I see when a cellulose syringe filter is used to filter 0.1M hydrochloric acid
Q What extractables might I see when a cellulose syringe filter is used to filter hexane
I have to say that whilst the answers to these two questions were still fairly general, there were helpful suggestions for the types of extractables that may be seen when each of the different solvents is used. I appreciate that this is a fairly niche question; however, it’s the type of information that is perhaps less accessible from a general web search.
I needed to establish trust
So, in terms of providing anything useful for troubleshooting problems in the lab, I’m not seeing any benefit to the use of AI. ChatGPT-4 is telling me that it doesn’t have access to specific tables of information or specific literature that may help give more specific responses. Whilst the responses may be good to obtain ‘directions of further investigation’ for the less experienced chromatographer, they tend to be wide reaching and sometimes contradictory, so the ‘trust’ level that needs to be established between the user and technology is not being established in these early exchanges. As a colleague recently described it to me, you tend to put more weight behind someone’s response when you know what they have told you in the past has been correct or useful. I’m not feeling this with ChatGPT to this point. The other issue that is emerging is that the LLM is very quick to ‘fold’ when challenged. Even with answers that are more or less accurate, when challenged it comes back with an apology and an alternative suggestion, which is often less accurate than the previous one!
So let’s see what happens when I search for information which may be of help when planning a separation or analysing data.
Q What is the SMILES string for loratadine
Q Are you sure this is correct - ChemSpider tells me something different?
Q How can that be correct when the molecular formula is C22H23N2O2Cl
Q I still believe this is incorrect - can you recalculate please
What I got back was pretty much nonsense and didn’t get any better with some ‘coaching’. Let’s try something more straightforward.
Q What it the smiles string for octadecanoic acid
Bingo – correct first time. So further research is required on the limits of the model’s ability to produce useful SMILES information for more complex molecules.
Q What is this compound CN(C)C1=CC=C(C=C1)O
(For the benefit of readers, this is the SMILES string for 4-dimethylaminophenol.)
Q Are you sure this is correct
Q Are you sure that the SMILES string corresponds to norepinephrine?
Norepinephrine is (R)-4-(2-amino-1-hydroxyethyl)benzene-1,2-diol (noradrenaline) and really nothing like the compound I was interested in. Again, I’m losing trust in the model’s ability.
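A quick way to sense-check an answer like this is to count the heavy (non-hydrogen) atoms in the SMILES string and compare them with the molecular formula of the compound the model names. The sketch below is only a rough illustration: the function names are hypothetical, it handles just the common organic-subset atoms plus simple bracket atoms, and it says nothing about connectivity, so a match is necessary but not sufficient. The formulae used (C8H11NO for 4-dimethylaminophenol, C8H11NO3 for norepinephrine, C18H36O2 for octadecanoic acid) are standard literature values rather than anything ChatGPT returned.

```python
import re
from collections import Counter

# One token per atom: a bracket atom, a two-letter halogen, or a single-letter
# organic-subset atom (upper case = aliphatic, lower case = aromatic).
_SMILES_ATOM = re.compile(
    r"\[\d*([A-Z][a-z]?|[bcnops])[^\]]*\]|(Cl|Br)|([BCNOPSFI])|([bcnops])"
)
_FORMULA_TERM = re.compile(r"([A-Z][a-z]?)(\d*)")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms in a SMILES string (simplified tokeniser)."""
    counts = Counter()
    for match in _SMILES_ATOM.finditer(smiles):
        symbol = next(group for group in match.groups() if group)
        if symbol != "H":
            counts[symbol.capitalize()] += 1
    return counts


def formula_heavy_atoms(formula: str) -> Counter:
    """Count non-hydrogen atoms in a plain formula such as 'C8H11NO3'."""
    counts = Counter()
    for element, number in _FORMULA_TERM.findall(formula):
        if element != "H":
            counts[element] += int(number) if number else 1
    return counts


if __name__ == "__main__":
    query = "CN(C)C1=CC=C(C=C1)O"
    print(dict(heavy_atom_counts(query)))
    print("matches 4-dimethylaminophenol formula:",
          heavy_atom_counts(query) == formula_heavy_atoms("C8H11NO"))
    print("matches norepinephrine formula:",
          heavy_atom_counts(query) == formula_heavy_atoms("C8H11NO3"))
```

The string contains only one oxygen, whereas norepinephrine has three, so the suggestion can be ruled out without any specialist software. A cheminformatics toolkit would be the better choice for anything beyond this kind of first-pass check.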
Q What compound has the CAS number 375-22-4
Q Are you sure that is correct?
Q I thought that CAS number was for heptafluorobutyric acid
It turns out the model is not that great at CAS number look-ups either – eventually it agreed with me after suggesting Dipropylene glycol and Pyrroloquinoline quinone disodium salt in the previous two responses. Will it fare any better with physicochemical data?
Q What is the pKa of dodecanoic acid
This is close to the literature value that I found (Barratt MD; Toxicol In Vitro 10: 85-94 (1996))
Q What is the LogP value
This is close to the ACD Labs predicted value that I found via ChemSpider (5.03)
Q What is the LogD value at pH 7.4
ACD Labs predicted value (ChemSpider) (2.09) / Chemaxon (2.06)
Q Are you sure about the last answer?
Response indicates that it is not possible to calculate the correct value
Q Can you not calculate the LogD value at pH 7.4 given that the pKa value is 4.88
Q So what would the LogD value be at pH 5.5
ACD Labs predicted value (ChemSpider) (3.89) / Chemaxon (3.82)
Here, I decided to include the full response from ChatGPT-4:
To estimate the LogD value at pH 5.5 for dodecanoic acid (lauric acid), we need to consider its acid dissociation constant (pKa) and the ionization behavior at that pH.
The pKa of dodecanoic acid is approximately 4.88. At pH 5.5, which is lower than the pKa, the majority of dodecanoic acid will be in its protonated form (RCOOH) rather than the deprotonated form (RCOO-).
Since LogD represents the distribution coefficient between octanol and water, we need to consider the relative solubilities of the protonated and deprotonated forms in those phases.
As dodecanoic acid is more soluble in the aqueous phase when protonated, the LogD value at pH 5.5 is expected to be lower than the LogP value. While I don't have access to specific data or software to provide an exact LogD value at pH 5.5, it is likely to be lower than the LogP value of 4.92, reflecting the increased solubility of the protonated form in the aqueous phase at lower pH.
As you can see, the logic is very flawed: pH 5.5 is higher, not lower, than the pKa of 4.88, so most of the acid is deprotonated, and it is the deprotonated form, not the protonated one, that is more soluble in the aqueous phase. At this point I decided to turn my attention to other search types. I appreciate that all the above information could have been found from a Google search, but I wanted to see if there was merit in starting a conversation to explore how deep we could go on optimisation of eluent pH. We never got out of the shallow end.
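For what it’s worth, the estimate the model said it couldn’t make is a one-line calculation. A minimal sketch follows, assuming the usual textbook approximation for a monoprotic acid in which only the neutral form partitions into octanol, so that LogD = LogP − log10(1 + 10^(pH − pKa)). The function names are mine, and the inputs are the pKa of 4.88 and the ACD Labs predicted LogP of 5.03 quoted above.

```python
import math


def fraction_ionised_acid(ph: float, pka: float) -> float:
    """Fraction of a monoprotic acid present as the deprotonated form."""
    return 1.0 / (1.0 + 10 ** (pka - ph))


def log_d_monoprotic_acid(log_p: float, pka: float, ph: float) -> float:
    """LogD of a monoprotic acid, assuming only the neutral form partitions."""
    return log_p - math.log10(1.0 + 10 ** (ph - pka))


if __name__ == "__main__":
    LOG_P = 5.03  # ACD Labs predicted value quoted in the article
    PKA = 4.88    # pKa of dodecanoic acid used in the conversation

    for ph in (5.5, 7.4):
        ionised = fraction_ionised_acid(ph, PKA)
        log_d = log_d_monoprotic_acid(LOG_P, PKA, ph)
        print(f"pH {ph}: {ionised:.0%} ionised, estimated LogD {log_d:.2f}")
```

With these inputs the acid is already mostly ionised at pH 5.5, and the estimated LogD drops further at pH 7.4. The numbers will not reproduce the ACD Labs or Chemaxon predictions exactly, presumably because those packages use their own pKa estimates and models, but the direction and rough size of the effect are there.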
Let’s try something more specific on method development for HPLC.
Q Can you suggest some starting conditions to separate dodecanoic acid from it's closely eluting homologues in reversed phase HPLC?
Q This answer is very generic, can you suggest something with dodecanoic acid and teteradecanoic acid specifically.
The responses were very general and spectacularly misleading, including the following recommendation:
Adjust the composition of the mobile phase to optimize the separation. For example, you can start with a gradient program ranging from 70% to 90% water and 30% to 10% acetonitrile over a suitable time period (e.g., 20-30 minutes).
Q Is there any literature you can cite with these compounds being separated by HPLC?
Here I was pointed in the direction of PubChem and Google Scholar. Nothing trustworthy or useful resulted from this conversation.
Ok, I’m giving this thing one last try. How about a buffer calculation:
Q Calculate the pH of a solution prepared from adding 0.0175 mol of citric acid and 0.0025 mol of dihydrogen citrate to 1.000 L of water
Q Can you repeat the calculation using the Henderson−Hasselbalch equation?
Following some reasonably sensible logic, the final part of the response was;
Substituting the given values:
pH = 3.128 + log((0.0025 + [H+])/0.0175)
To solve this equation, we need the concentration of H+ ions. However, without additional information about the solution or the pH of the initial solution, it is not possible to provide an exact pH value using the Henderson-Hasselbalch equation alone.
Q If I said the correct answer is pH 2.63, can you figure out where you went wrong?
My head hurts.
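For anyone who wants to check the arithmetic, the equation the model gave up on can in fact be solved, because pH is itself −log[H+]. The sketch below is my own working rather than anything ChatGPT produced. It assumes only the first dissociation of citric acid matters (pKa 3.128, as in the response), treats activities as concentrations, and writes the equilibrium as Ka = x(0.0025 + x)/(0.0175 − x), where x is [H+]; note the − x on the acid term, which the quoted response left out. Function names are hypothetical.

```python
import math

PKA1 = 3.128         # first pKa of citric acid, as used in the quoted response
ACID_MOL_L = 0.0175  # citric acid, mol in 1.000 L
BASE_MOL_L = 0.0025  # dihydrogen citrate, mol in 1.000 L


def ph_henderson_hasselbalch(pka: float, acid: float, base: float) -> float:
    """Plain Henderson-Hasselbalch, ignoring further dissociation of the acid."""
    return pka + math.log10(base / acid)


def ph_single_equilibrium(pka: float, acid: float, base: float) -> float:
    """Solve Ka = x(base + x)/(acid - x) for x = [H+] and return the pH."""
    ka = 10 ** (-pka)
    # Rearranged to the quadratic x^2 + (base + Ka) x - Ka * acid = 0.
    b = base + ka
    c = -ka * acid
    x = (-b + math.sqrt(b * b - 4.0 * c)) / 2.0  # positive root only
    return -math.log10(x)


if __name__ == "__main__":
    simple = ph_henderson_hasselbalch(PKA1, ACID_MOL_L, BASE_MOL_L)
    solved = ph_single_equilibrium(PKA1, ACID_MOL_L, BASE_MOL_L)
    print(f"Henderson-Hasselbalch only: pH {simple:.2f}")
    print(f"Allowing for dissociation:  pH {solved:.2f}")
```

The simple Henderson-Hasselbalch figure comes out well below the expected answer because the solution is dilute and the acid is fairly strong, so the dissociation term cannot be ignored. Solving the quadratic recovers pH 2.63.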
I have many more types of search to try, but in terms of my confidence in the responses of my correspondent, I would have more trust in the responses of P.T. Barnum or Walter Mitty.
I have to admit that I’m very much a newbie to so-called ‘AI’ and it’s possible that I just haven’t stumbled upon what LLMs are good for, in terms of helping out the aged chromatographer. If you’ve had better experiences with ‘AI’ for analytical sciences, please write and let me know. I do feel that perhaps I’m not focussing in on the areas where the model can shine. I dearly wanted this exercise to be a rabbit hole down which I disappeared for a conversation with a learned digital colleague. Either I got hung up in the test questions which establish early trust, or I’m just a digital refugee looking for a digital native to hold my hand and introduce me to the wonders of the new age. One thing is for sure, I’m going to tell my son to check his code very carefully – who knows what might happen when he executes his programs!!!!