I spent some time during this last winter break writing an app that automates some tasks that scientists do today: data analysis, manuscript review, paper reading, and literature research (Here is a Demo; you can try out the app here). This idea had been on my mind for a while, but the demanding schedule of a postdoctoral researcher, coupled with the urgency of academic publishing ("you need to write that next paper fast!"), often leaves little room for pursuing hobbyist projects that may seem tangential to one's primary research focus.
Anyways, it got me thinking: if a Biochemistry PhD can build this in such a limited amount of time, there must be others engaging in even more complex and innovative projects. This essay aims to explore a few of these ventures and discuss related topics.
First, I should start by emphasizing the word semi in “semi-autonomous AI”. What we are looking at here is an AI system with a degree of independence but not full autonomy. And why is this important?
As I have argued elsewhere:
“Anyone who has used apps like ChatGPT and other similar genAI apps must have experienced AI hallucination. Although improvements can be made with fine-tuning, iterative querying, so-called retrieval-augmented generation, and what-not, hallucination remains a problem that some experts have argued cannot be completely eliminated.
Wittgenstein's ruler thought experiment on circular validation connects neatly to this problem of AI hallucinations. The Wittgenstein ruler: ‘Unless you have confidence in the ruler’s reliability, if you use a ruler to measure a table you may also be using the table to measure the ruler.’
The AI generated output cannot by itself prove the model's competence, just as an imperfect ruler cannot be verified by using it to measure something.
In order to verify the competence of an LLM, we need a separate, independent (preferably liable) source of information, say a flesh-and-blood human expert. As such, the notion of fully autonomous AI performing jobs (especially zero-margin-for-error jobs), given today's state of the art, is fairly ludicrous.”
In other words, there is a need for a liable validator.
I am sure you didn’t sign up for a philosophy lesson when you clicked that link, so without further ado, let’s move on to our main subject.
Data Analysis and Data Science
It goes without saying that data analysis and data science are an integral aspect of scientific research. And the technologies available today are more than capable of automating some aspects of the scientific enterprise.
Short story: I ventured into computational biology late in grad school, partly because my biology experiments weren't progressing as quickly as I would have liked (I'll admit to that), and partly because I realized that machine learning and AI, as general-purpose technologies, would revolutionize biology and science. So I started reading books, taking online classes, and attending in-person classes. Simultaneously, I threw myself at various projects (I have detailed this in my PhD memoir).
Yes, AI/ML is revolutionizing many fields. However, what I couldn't have predicted then was that within a few years, something called an LLM would be able to automate some of the tasks I was learning in my machine learning class. Interestingly, the famous Transformer paper had already been published at that time.
Given the rise of tools like ChatGPT’s code interpreter (now Advanced Data Analysis), Tu et al. in their paper posed the question [1]: what should data science education do with large language models? They effectively argued that a transition somewhat akin to that from a software engineer to a product manager is inevitable for data scientists. For science researchers, this development means enhanced efficiency.
So, when I was working on my hobby project, one of the features I was eager to build was a Data Science Agent, an agent that automates data analysis. It was the easiest part of the project to build, and it automates some of the data analysis I conduct in the lab.
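To make the idea concrete, here is a minimal sketch of what such an agent can look like: the model is asked for pandas code that answers a question about a DataFrame, and that code is then executed locally. This is an illustration using the OpenAI Python SDK, not my app's actual implementation; the prompt wording, model name, and file name are assumptions.

```python
# Minimal "data science agent" sketch: ask the model for pandas code,
# then execute it. Illustrative only; assumes OPENAI_API_KEY is set.
import pandas as pd
from openai import OpenAI

client = OpenAI()

def analyze(df: pd.DataFrame, question: str) -> str:
    prompt = (
        "You are a data analyst. Given a pandas DataFrame `df` with columns "
        f"{list(df.columns)}, write Python code that answers: {question}. "
        "Store the answer in a variable named `result`. Return only code."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    code = resp.choices[0].message.content
    namespace = {"df": df, "pd": pd}
    exec(code, namespace)  # in practice, run model-generated code in a sandbox
    return str(namespace.get("result"))

# Usage: df = pd.read_csv("assay_results.csv")
#        print(analyze(df, "mean yield per condition"))
```

A production version would need to strip markdown fences from the model's reply and sandbox the execution, but the core loop really is this simple.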
But how do we get the data to analyze in the first place? Let’s turn to the design and execution of scientific experiments.
(Semi)-Autonomous Design and Execution of Scientific Experiments
In this recently published work by Boiko et al. [2], the researchers built an AI scientist system called Coscientist, which integrates LLMs to autonomously design, plan, and execute scientific experiments. In the work, they leverage a GPT-4 chat completion instance called ‘Planner’, which processes user inputs and command outputs. The Planner operates through four main commands: 'GOOGLE' for web searches, 'PYTHON' for computations, 'DOCUMENTATION' for information retrieval, and 'EXPERIMENT' for executing automation processes via APIs.
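A rough sketch of that dispatch loop, with all four commands stubbed out (my own paraphrase of the architecture as described, not the authors' code):

```python
# Sketch of Coscientist's four-command dispatch, paraphrased; the handlers
# are stubs standing in for real search, execution, and lab-automation backends.
def google_search(query: str) -> str:
    return f"search results for: {query}"   # stub: web search

def run_python(source: str) -> str:
    return "computation output"             # stub: sandboxed code execution

def search_docs(query: str) -> str:
    return f"docs matching: {query}"        # stub: documentation retrieval

def run_experiment(protocol: str) -> str:
    return "experiment queued"              # stub: liquid-handler API call

COMMANDS = {
    "GOOGLE": google_search,
    "PYTHON": run_python,
    "DOCUMENTATION": search_docs,
    "EXPERIMENT": run_experiment,
}

def planner_step(llm_output: str) -> str:
    """Parse one Planner message: first line is the command, rest is payload."""
    command, _, payload = llm_output.partition("\n")
    handler = COMMANDS.get(command.strip())
    if handler is None:
        return f"Unknown command: {command!r}"  # fed back so the model can retry
    return handler(payload)                     # result returned to the Planner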
In demonstrating its capabilities, Coscientist was used to plan and conduct Suzuki–Miyaura and Sonogashira coupling reactions. It autonomously gathered reaction data from the internet, selected correct reagents, calculated volumes, and generated Python protocols for a liquid handler.
The system initially made errors in protocol generation but was able to self-correct by consulting documentation. The successful completion of these experiments was validated through gas chromatography–mass spectrometry analysis, indicating the system’s efficacy in autonomous experimental design and execution. Brave new world. No?
Another interesting work in this domain is ChemCrow [3]. ChemCrow leverages a suite of chemistry-focused tools, using GPT-4 as its brain. These tools, ranging from molecule analysis and safety assessment to reaction planning, facilitate a wide array of chemistry tasks (you can try it here).
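To give a flavor of what a "chemistry-focused tool" means in practice, here is a hedged sketch of how one such tool could be exposed to a LangChain agent, with RDKit doing the chemistry. This particular tool is my own illustrative example, not one of ChemCrow's.

```python
# A sketch of exposing a chemistry tool to an LLM agent, in the spirit of
# ChemCrow; this specific tool is illustrative, not from the paper.
from langchain_core.tools import tool
from rdkit import Chem
from rdkit.Chem import Descriptors

@tool
def molecular_weight(smiles: str) -> str:
    """Return the molecular weight of a molecule given its SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return "Invalid SMILES string."
    return f"{Descriptors.MolWt(mol):.2f} g/mol"

# An agent equipped with this tool could answer "what is the MW of aspirin?"
# by calling molecular_weight("CC(=O)OC1=CC=CC=C1C(=O)O")  ->  "180.16 g/mol"
```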
Still staying on this same theme, BioPlanner was recently published by O’Donoghue and co-workers [4]. The paper reminds me of my very early days as a scientist. My undergraduate and a big part of my graduate school career were spent at the bench: making agar, growing micro-organisms, culturing C. elegans, running toxicity experiments, etc. All these tasks involve adhering to strict protocols. Add 30 ml of that buffer into this sample, spin it down, take out the supernatant, incubate it for 30 minutes, do this, do that, and so on.
And, of course, before that you have to plan your protocols and experiments.
This is where BioPlanner comes in: a systematized approach for evaluating and enhancing the capacity of LLMs in the domain of biological protocol planning. It translates biology protocols into a form of pseudocode, using a collection of pseudofunctions derived from GPT-4.
The framework operates through a dual-stage mechanism: initially, a 'teacher' model generates both pseudofunctions and accurate pseudocode. Subsequently, a 'student' model endeavors to reconstruct the procedure using a high-level description and the provided pseudofunctions. This innovative method breaks down the intricate process of composing scientific protocols into simpler, multiple-choice queries.
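As an illustration (my own toy example, not drawn from the paper or its dataset), the bench steps I described earlier might be expressed with pseudofunctions like these:

```python
# Toy pseudofunctions in the spirit of BioPlanner's pseudocode; the names
# and signatures are illustrative, not from the BIOPROT dataset.
def add(reagent: str, volume_ml: float, target: str): ...
def centrifuge(sample: str, speed_rpm: int, minutes: int): ...
def incubate(sample: str, temp_c: float, minutes: int): ...

# "Add 30 ml of that buffer into this sample, spin it down,
#  incubate it for 30 minutes" becomes:
add("buffer", 30.0, "sample")
centrifuge("sample", 4000, 5)
incubate("sample", 37.0, 30)
```

The student model's task is then to produce a call sequence like this given only a high-level description and the available pseudofunctions, which is far easier to grade automatically than free-form text.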
Moreover, the introduction of their BIOPROT dataset is a key feature of the system. This dataset offers a range of expert-validated biology lab protocols in both textual and pseudocode formats, aiding in different assessment activities such as predicting the next steps and generating complete protocols.
A primary contribution of this paper is the introduction of a novel approach for evaluating LLMs in the field of scientific protocol creation.
Semi-Automating Scientific Writing, Text Research, and Peer Review
Scientists design experiments, conduct these experiments, and engage in data analysis. In addition to these tasks, scientists also perform extensive research, dedicate time to writing scientific papers, and participate in the peer review process. Recently, we have begun to observe the increasing involvement of AI tools in these areas. Let's examine each of these aspects one by one.
While writing my app, another challenge I set was to build an AI research assistant system capable of conducting research on specific topics and presenting draft reports complete with references. Utilizing the GPT-researcher methodology, I learned the LangChain Expression Language (LCEL), which made the feature much easier to build, and had it working within a few days. While not perfect, it functions quite well. Consequently, I began exploring what others are accomplishing in this domain.
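For readers curious about LCEL, the core of such a chain is just a few composed runnables. Here is a minimal sketch, not my app's actual code; the prompt wording is illustrative, and a real research agent would add an earlier search step to fill in the sources.

```python
# A minimal LCEL chain of the kind behind a research-assistant feature.
# Assumes the langchain-openai package and an OPENAI_API_KEY in the environment.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Write a short, referenced research report on {topic}, "
    "using only the sources below:\n\n{sources}"
)
llm = ChatOpenAI(model="gpt-4")

# LCEL composes runnables with the `|` operator; invoke() runs the pipeline.
chain = prompt | llm | StrOutputParser()

report = chain.invoke({
    "topic": "LLM agents in chemistry",
    "sources": "(text gathered by an earlier web-search step)",
})
```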
Enter PaperQA [5], a Retrieval-Augmented Generative (RAG) agent for scientific research.
The system consists of three core operations: (1) finding relevant papers from online databases, (2) gathering text from these papers, and (3) synthesizing information into a final answer.
It employs four different LLM instances: an agent LLM, a summary LLM, an ask LLM, and an answer LLM. Each of these plays a distinct role in processing the query, searching for relevant literature, summarizing findings, and formulating the final answer. The agent LLM orchestrates these tools and iteratively refines the query based on evidence collected.
The output is an answer with cited sources, formatted in a scholarly tone. The innovation lies in decomposing RAG into modular pieces, allowing iterative adjustments and reasoning over text to provide numerical relevance scores.
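Put together, the loop might be schematized like this. This is a paraphrase of the design with every backend stubbed out; the relevance threshold is an arbitrary example.

```python
# Schematic of PaperQA's search -> gather -> synthesize loop, paraphrased;
# all backends are stubs, and the threshold of 7 is an arbitrary example.
def search_papers(query: str) -> list[str]:
    return ["paper-1", "paper-2"]                 # (1) stub: database search

def gather_text(paper: str, query: str) -> str:
    return f"passages from {paper}"               # (2) stub: chunk retrieval

def summarize(passage: str, query: str) -> str:
    return f"summary of {passage}"                # stub: the "summary LLM"

def score_relevance(summary: str, query: str) -> int:
    return 8                                      # stub: numerical relevance score

def answer(evidence: list[str], query: str) -> str:
    return f"Answer synthesized from {len(evidence)} sources."  # "answer LLM"

def paperqa_step(query: str) -> str:
    evidence = []
    for paper in search_papers(query):
        summary = summarize(gather_text(paper, query), query)
        if score_relevance(summary, query) >= 7:  # keep only relevant evidence
            evidence.append(summary)
    # (3) In the real system, an agent LLM decides whether to refine the
    # query and search again before answering; a single pass is shown here.
    return answer(evidence, query)
```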
This system advances the field by bridging the gap between static LLMs and dynamic, retrieval-based question answering, making it more aligned with the human research process. In other words, a more sophisticated and efficient method than GPT-researcher’s.
Importantly, they noted that as of the time of testing, “PaperQA outperforms GPT-4, Perplexity, and other LLMs, as well as commercial RAG systems on several benchmarks.”
And there is more: they built something called WikiCrow on top of PaperQA. WikiCrow specializes in creating Wikipedia-like summaries for complex scientific topics by analyzing scientific literature. Its primary application has been generating draft articles for numerous human protein-coding genes, particularly those not covered or only briefly mentioned on Wikipedia.
WikiCrow drafts each article in about 8 minutes, with consistent citation of sources and the potential for increased accuracy over time. Tools like these will make scientific research more accessible by streamlining the process of compiling human scientific knowledge.
Finally, let’s turn to the peer review process for scientific paper publication.
Another feature I built into my little app was an “AI reviewer”, using GPT-4 as its brain. I was motivated by this large-scale empirical analysis from Liang and coworkers [6], where they asked the question, “can large language models provide useful feedback on research papers?”
Here is an excerpt from the paper:
“…We then conducted a prospective user study with 308 researchers from 110 US institutions in the field of AI and computational biology to understand how researchers perceive feedback generated by our GPT-4 system on their own papers. Overall, more than half (57.4%) of the users found GPT-4 generated feedback helpful/very helpful and 82.4% found it more beneficial than feedback from at least some human reviewers.”
But there are limitations, which I noticed in my implementation too.
At times, it falls short in providing thorough critiques of methodologies, often defaulting to generic feedback such as 'add more samples' or 'conduct more experiments'. I believe these issues can be effectively addressed by adopting retrieval methods like those used in PaperQA.
The AI reviewer could be enhanced to 'consult' outstanding papers in the relevant field for more nuanced feedback. Additionally, fine-tuning an LLM on a high-quality dataset of paper reviews specific to a field of study could significantly improve its performance.
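A sketch of what that 'consulting' could look like: retrieve methods excerpts from strong papers in the field and prepend them to the review prompt. Note that `retrieve_exemplars` is a hypothetical helper standing in for a proper RAG step.

```python
# Sketch of an AI reviewer that 'consults' exemplar papers before critiquing;
# retrieve_exemplars() is a hypothetical stand-in for a retrieval step.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def retrieve_exemplars(field: str) -> str:
    """Hypothetical: fetch methods excerpts from strong papers in `field`,
    e.g. from a vector store built over well-reviewed publications."""
    return "(retrieved exemplar methods sections)"

def review(manuscript: str, field: str) -> str:
    prompt = (
        f"Exemplary methods sections from {field}:\n"
        f"{retrieve_exemplars(field)}\n\n"
        "Review the manuscript below. Critique the methodology concretely "
        "against the exemplars; avoid generic advice like 'add more samples'.\n\n"
        f"{manuscript}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```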
Conclusion and Safety
In this essay, I have explored the concept of semi-autonomous AI in scientific research, deliberately avoiding an overly domain-specific approach and not delving into the scientific discoveries facilitated by AI.
To be sure, there are plenty: immunology, enzymology, neuroscience, olfactory perception, drug discovery, mass spectrometry, protein structure prediction, RNA biology, genetic discovery, and many more (perhaps this is a topic I will reserve for another essay).
Anyways, suppose we were hypothetically asked whether an AI scientist is feasible, or more precisely, whether we could build one within the next decade. First, we need to define what constitutes an AI scientist. By that I mean a semi-autonomous AI system capable of undertaking research tasks, from hypothesis generation to conducting experiments, iterating through the process, and drafting findings, all while incorporating human oversight.
If you've read this far, I suspect your answer to both questions would be yes.
However, before I conclude this essay, I need to talk about safety. It would be remiss of me to write an essay of this length, on a topic such as this, and not say at least a word on the subject. In a world where anyone can prompt an AI system to synthesize various kinds of chemicals or power synthetic biology with AI, the potential for misuse is evident.
C.S. Lewis, in 'The Abolition of Man', gives us a friendly reminder:
“…There neither is nor can be any simple increase of power on man’s side. Each new power won by man is a power over man as well. Each advance leaves him weaker as well as stronger.”
Clearly there is an urgent need for clear and effective governance in deploying these innovations—a topic that requires further discussion and one I am not fully prepared to tackle at present.
In the meantime, I recommend watching the short film 'The A.I. Dilemma' by the Center for Humane Technology. Although it's not exclusively focused on science and is about a year old (meaning the technology has advanced even further since), it offers a poignant insight into potential pitfalls.
In conclusion, let's remain cautiously optimistic and vigilant, embracing the potential of AI in scientific research while actively engaging in discussions about its safe and ethical deployment.
References
[1]: Tu, X., Zou, J., Su, W. J., & Zhang, L. (2023). What Should Data Science Education Do with Large Language Models? Retrieved from https://arxiv.org/abs/2307.02792
[2]: Boiko, D. A., MacKnight, R., Kline, B., ... & Gomes, G. (2023). Autonomous chemical research with large language models. Nature, 624, 570-578. https://doi.org/10.1038/s41586-023-06792-0
[3]: Bran, A. M., et al. (2023). ChemCrow: Augmenting large-language models with chemistry tools. Retrieved from https://arxiv.org/abs/2304.05376
[4]: O’Donoghue, O., et al. (2023). BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology. Retrieved from https://arxiv.org/abs/2310.10632
[5]: Lála, J., O’Donoghue, O., Shtedritski, A., Cox, S., Rodriques, S. G., & White, A. D. (2023). PaperQA: Retrieval-Augmented Generative Agent for Scientific Research. Retrieved from https://arxiv.org/abs/2312.07559
[6]: Liang, W. et al. (2023). Can large language models provide useful feedback on research papers? A large-scale empirical analysis. Retrieved from https://arxiv.org/abs/2310.01783