Large Science Models in 2024: Hype & Hope (A Subjective Outlook)
Science needs large models beyond LLMs
In 2023, the world of artificial intelligence saw a major leap with the rise of large language models (LLMs) like GPT, sparking excitement among scientists keen to tap into their potential. Yet, as discussions unfold, there's a realization that equating large models solely with LLMs is a bit limiting.
Language, being textual, is one-dimensional, but science is awash with data that defy such simplicity. Take chemical structures: bonds are often depicted as 2D graphs, while the molecules themselves exist in 3D. Then there's the even more complex n-dimensional wavefunction, critical in determining material properties like bandgap or thermodynamic behavior.
After all, the essence of a 'large' model lies in its ‘large’ number of parameters and the ‘large’ amount of unlabeled data used in pre-training, principles that are by no means exclusive to language. So what are the use-cases of large models in science that might bring about real value? Here is a subjective list of what I believe could emerge in 2024.
I. Expert LLM for Scientific Documents
Current Status
While tools like ChatPDF defined the user experience of interacting with scientific literature through chat, there's a world of difference between a tool that's "barely functional" and one that's "actually useful." The current crop of LLM PDF readers falls short: they
either ignore all images in a document,
or employ rudimentary image-to-text preprocessing techniques (which lose much of the crucial information).
However, for an LLM to truly serve the scientific community, it needs to go beyond this simplistic approach.
Challenge
The images in scientific papers and patent files are not just filler; they often contain essential information that complements the text. This is especially true for fields where visual data, such as Markush molecule graphs in chemistry, play a central role in conveying complex ideas. The challenge lies not just in recognizing these images but in aligning the information they contain with the textual content of the documents. Representing scientific images in a way that an LLM can understand or retrieve and integrate with textual data is a puzzle that researchers are just beginning to piece together.
Moreover, the urgency to enhance how LLMs handle scientific images stems from an ambitious vision: we aim to evolve LLMs into advanced search engines. Such engines would enable us to search for specific entities like a Markush structure and retrieve all mentions of it across scientific papers and patents. This capability would drastically improve the efficiency and precision of scientific research and intellectual property management.
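As a rough illustration of what 'aligning' figures with text could look like, here is a minimal CLIP-style contrastive sketch: figure and caption embeddings are projected into a shared space where matched pairs score highest, which is exactly the property a retrieval-oriented search engine needs. The linear projections stand in for real vision and text backbones, and all dimensions and data below are assumptions for illustration, not any existing system's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FigureTextAligner(nn.Module):
    """CLIP-style contrastive alignment of figure and caption embeddings.

    The two projection heads are stand-ins for real backbones (e.g. a
    vision transformer over figure crops and a text encoder); the shared
    space is what a retrieval system would index.
    """
    def __init__(self, img_dim=512, txt_dim=768, shared_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, img_feats, txt_feats):
        # Project both modalities into the shared space and L2-normalize,
        # so cosine similarity becomes a plain dot product.
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        # Matched (figure, caption) pairs sit on the diagonal.
        labels = torch.arange(len(img))
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2

model = FigureTextAligner()
# Dummy batch: 8 figure embeddings paired with their caption embeddings.
loss = model(torch.randn(8, 512), torch.randn(8, 768))
loss.backward()
```

Once trained, the same shared space supports the search-engine use case: embed a query (a Markush structure rendering, say) and rank every indexed figure or passage by similarity.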
2024 Outlook
Expert LLMs that can process and preserve scientific images, and align image information with text
LLM systems that can aggregate the vast scientific literature and ‘search’ it for precise information
II. Large Models with Lots of Labeled Data
Current Status
The development of large models like AlphaFold2 has been a huge milestone for science. Yet despite its impressive capabilities, AlphaFold2 still cannot replace cryo-EM. For example, its utility in directly facilitating processes like molecular docking (a method used to predict the preferred orientation of one molecule relative to another when they bind to form a stable complex) remains limited. Various review studies have highlighted these limitations, and translating such predictions into actionable insights for drug discovery and other applications still faces hurdles.
Challenge
Despite the leaps in computational methods, the nuanced complexities of biological systems often require experimental confirmation. This reality underscores a broader challenge in the field: high-quality, experimentally derived data is not only expensive to obtain but also difficult to scale. While computational models like AlphaFold2 provide invaluable tools for hypothesis generation and preliminary analysis, the lab bench remains irreplaceable for the foreseeable future. The future, however, holds promise for a more integrated approach, where computational predictions and laboratory experiments converge more seamlessly. Robotics labs, equipped with automated and autonomous systems capable of designing and conducting experiments, could validate computational predictions at a scale and speed previously unimaginable. This integration could extend to employing Bayesian optimization and other advanced statistical methods to refine computational models based on experimental outcomes, creating a dynamic feedback loop that continually improves the accuracy and applicability of predictions.
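To make that feedback loop concrete, here is a minimal sketch using Bayesian optimization in an ask/tell style via scikit-optimize. The two synthesis parameters and the toy objective are invented for illustration; in a real autonomous lab, run_experiment would dispatch to robotic hardware or a simulation rather than a formula.

```python
from skopt import Optimizer

# A Gaussian-process surrogate proposes the next experiment; the "lab"
# (here a stand-in function) runs it and reports the result back.
# The two dimensions are hypothetical synthesis parameters.
opt = Optimizer(dimensions=[(300.0, 900.0),   # annealing temperature (K)
                            (0.1, 5.0)],      # dopant concentration (%)
                base_estimator="GP",
                acq_func="EI")                # expected improvement

def run_experiment(params):
    """Placeholder for a robotic-lab measurement (or a DFT calculation)."""
    temp, dopant = params
    return (temp - 600.0) ** 2 / 1e4 + (dopant - 2.0) ** 2  # toy objective

for _ in range(20):
    x = opt.ask()          # surrogate suggests the most informative experiment
    y = run_experiment(x)  # autonomous lab executes and evaluates it
    opt.tell(x, y)         # the result refines the surrogate model

print("best parameters found:", opt.get_result().x)
```

The ask/tell split is the point: the optimizer never needs to know whether the answer came from a simulation or a pipetting robot, which is what lets computation and experiment share one loop.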
2024 Outlook
More research institutes and companies adopting the DeepMind / Berkeley Lab approach of ‘computation-guided experiment design’ plus ‘autonomous experiment design, execution, and evaluation’
More open-source scientific databases becoming available beyond PDB, OC20, etc.
Building on the previous point, the team that can ‘clean’ the largest share of these open databases will gain a visible advantage. This is dirty work, but highly necessary.
III. Large Models with Few Labeled Data
Current Status
In the vast majority of scientific fields, such as drug discovery and materials science, the reality is starkly different from the ideal scenario of abundant, labeled data. For instance, in the realm of ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) in drug discovery, obtaining even a few hundred data points can be a rarity. Similarly, in materials science, exploring a new formula necessitates the actual creation and testing of a material in the lab, a process that is not only costly but often impractical. Given this scarcity of data, traditional direct learning models generally underperform due to their reliance on large amounts of high-quality labeled data.
This is where pre-training shines: first pre-train on unlabeled molecular structure data, then fine-tune on sparse labeled data for specific downstream tasks. This approach has begun to show promising results in areas like QSAR, indicating a potential pathway to overcoming the data-scarcity challenge.
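As a minimal sketch of the pre-train-then-fine-tune recipe: the encoder below is pre-trained with a denoising objective over hypothetical 1024-bit molecular fingerprints, then fine-tuned with a small head on a few hundred labeled points (an ADMET endpoint, say). The pretext task, dimensions, and random tensors are placeholders; real systems use masked-atom or 3D-geometry objectives over millions of molecules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stage 1: self-supervised pre-training on abundant unlabeled molecules.
encoder = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 128))
decoder = nn.Linear(128, 1024)
pretrain_opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()])

unlabeled = torch.rand(10_000, 1024)  # stand-in for millions of molecules
for _ in range(5):
    # Pretext task: reconstruct the full fingerprint from a masked version.
    noisy = unlabeled * (torch.rand_like(unlabeled) > 0.15)  # drop ~15% of bits
    loss = F.mse_loss(decoder(encoder(noisy)), unlabeled)
    pretrain_opt.zero_grad()
    loss.backward()
    pretrain_opt.step()

# Stage 2: fine-tune encoder + small head on a few hundred labeled points.
head = nn.Linear(128, 1)
finetune_opt = torch.optim.Adam([*encoder.parameters(), *head.parameters()],
                                lr=1e-4)  # small lr to preserve pre-training

x_small, y_small = torch.rand(300, 1024), torch.rand(300, 1)
for _ in range(50):
    loss = F.mse_loss(head(encoder(x_small)), y_small)
    finetune_opt.zero_grad()
    loss.backward()
    finetune_opt.step()
```

The economics are the whole argument: the expensive labeled set stays at a few hundred points, while the representation is paid for by cheap unlabeled structures.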
Challenge
The challenge of pre-training in the virtually unlimited chemical space is daunting. Identifying sources of high-quality data, deciding on the most effective way to represent molecules (be it as graphs via Graph Neural Networks (GNN) or as 3D conformers), and figuring out how to efficiently conduct multi-task training are all significant hurdles.
Multi-task training, in particular, is complex in this context due to the diversity of data sources and formats. For example, using Density Functional Theory (DFT) results as training data introduces variability, because different DFT methods can yield different results depending on the choice of functionals and basis sets. This variability complicates the model's ability to generalize across tasks and datasets, making the development of versatile and effective models a challenging endeavor. Despite these challenges, early work in multi-task learning is paving the way for exciting developments.
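One common pattern for absorbing such heterogeneous labels is a shared backbone with per-task heads: bandgaps computed at two different levels of theory train separate heads while sharing one representation, so inconsistent label conventions never collide. The sketch below assumes that setup; the task names, dimensions, and data are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared backbone learns one representation across all data sources.
backbone = nn.Sequential(nn.Linear(1024, 256), nn.ReLU())
# One head per task/label convention, e.g. per DFT level of theory.
heads = nn.ModuleDict({
    "bandgap_pbe":      nn.Linear(256, 1),  # one functional/basis-set choice
    "bandgap_hse":      nn.Linear(256, 1),  # same property, different theory
    "formation_energy": nn.Linear(256, 1),
})
opt = torch.optim.Adam([*backbone.parameters(), *heads.parameters()])

def training_step(x, y, task):
    # Each batch carries a task tag; only the matching head (plus the shared
    # backbone) receives gradients, so tasks with incompatible label
    # conventions can still share representations.
    loss = F.mse_loss(heads[task](backbone(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

training_step(torch.rand(32, 1024), torch.rand(32, 1), "bandgap_pbe")
```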
As we look toward 2024, the potential emergence of large pre-trained science models, particularly in materials science, holds great promise. Such models could revolutionize fields like catalysis, battery development, and polymer science by enabling more efficient exploration of new materials and processes.
2024 Outlook
In materials science, researchers will figure out how to run multi-task learning so that a unified large model can process data from various experimental tools (X-ray, STEM, AFM, etc.) and different kinds of computation.
Drug design and batteries are likely to benefit first from this new technology, given their fierce competition and thirst for innovation.
Summary
In short, the exploration of large models in science reveals a future far beyond their initial application in language processing, promising a transformative impact across numerous fields.
Looking ahead to 2024, we might witness advancements that will showcase the full potential of large models in scientific discovery. The narrative of large models is expanding, and the coming year promises to deliver compelling evidence of their broader significance in the scientific community.
#LargeScienceModel #AI4Science