What data challenges do large language models pose for scientific R&D?

Key challenges include data quality and transparency, truth dilution, and complexity management and training sufficiency. Teams must ensure models are trained on accurate, sufficient data and that questions aren’t too complex for broadly trained algorithms, or results may be unreliable and require expert scrutiny.

Why does data origin and context matter when using AI apps in research?

Growing use of AI apps feeds developers more training data, but there is often insufficient assessment of whether those data are inaccurate or proprietary. In scientific research, data origin is of key importance because both the quality of results and ethical collection must be ensured.

What are the risks of relying on NLP and AI-generated content for literature mining?

These tools can parse large volumes of content quickly, but they are not foolproof, and manual review is often necessary for final assessments. AI-generated content may sound correct even when it isn’t, amplifying quality issues and making the truth increasingly difficult to decipher.

How can LLMs produce plausible but wrong code or algorithms in bioinformatics tasks?

Requesting an 'alignment algorithm' without further specification may return a peptide sequence aligner rather than a DNA sequence aligner. Because the letters for DNA bases overlap with amino acid codes, the output may run without error yet still be the wrong tool, leaving experts to fix it.

Why do generalized AI models struggle in specialized scientific domains?

Generalized models trained on huge datasets with no specificity will struggle in specialist areas; in drug discovery, the trustworthiness of binding predictions depends on how structurally similar inputs are to the training set.

What makes an AI model trustworthy for scientific R&D?

An uncertainty metric can improve transparency by indicating model limitations, and insight into training-data sourcing and quality is essential because models are only as good as their training data.

Should AI replace human experts in scientific R&D?

AI should augment people’s expertise, not replace it; users need interpretable metrics to gauge confidence and retain knowledge and control.

Four Key Risks of Using Large Language Models in Scientific R&D

The article discusses four major risks of using large language models (LLMs) in scientific research and development, emphasizing challenges related to data quality, transparency, truth dilution, training sufficiency, and the ethical and contextual origins of data. Examples from AI-based applications such as Craiyon and Midjourney illustrate the need for careful scrutiny to ensure reliable, ethical, and high-quality AI integration in the life sciences.

ChatGPT and other generative large language models (LLMs) are becoming increasingly pervasive in our personal and professional lives. In the life sciences, the use of AI is nothing new, but it is certainly growing. According to McKinsey, “The AI-driven drug discovery industry has grown significantly over the past decade, fueled by new entrants in the market, significant capital investment, and technology maturation.”

Data Challenges in Large Language Models

As the use of LLMs grows, it has become clear that any potential benefits come along with numerous challenges. We must consider factors such as:

Data quality and transparency – Is the quality of data going into, and coming out of, generative models sufficient for its intended purpose?
Truth dilution – How can models and algorithms avoid perpetuating quality issues and diluting the truth?
Complexity management and training sufficiency – Have the models been properly trained using accurate and sufficient data? Are the questions being asked too complex or specific for general algorithms that have been built with broad training datasets? Will results be unreliable or in need of expert scrutiny?

Below, we explore some broad examples that illuminate key considerations to keep in mind as we increase our adoption of AI in scientific R&D and integrate it into our primary workflows.

1. Data Origin and Context Concerns (As Illustrated through Novel AI-based Apps)

From the text-prompt-to-image app Craiyon to the photo-remixing tool Midjourney, AI-based apps have become increasingly popular. Growing use of such apps feeds developers more and more training data; however, there is generally insufficient assessment of whether such data are inaccurate or proprietary, as evidenced by disputes over Midjourney’s use of output images that had artists’ signatures visible. Similarly, in scientific research, data origin is of key importance because both the quality of results and ethical collection must be ensured.

A fun example to illustrate context concerns, specifically in using LLMs, is mixology, which in many ways is analogous to product formulations. A prominent YouTube mixologist used ChatGPT to create cocktail recipes from a preset list of ingredients. Not surprisingly, some results were unpalatable because crafting a cocktail recipe isn’t just a matter of following a defined format, but rather an art that relies heavily upon contextual application of both knowledge and sensory inputs. The mixologist’s assessment was that ChatGPT might best be used as an assistive tool, not a primary recipe generator. The role of LLMs in research must be similarly augmentative, helping to fuel scientists’ creativity, not replace it.

2. Data Accuracy Challenges (As Illustrated by AI-based News Articles)

AI-written articles have become more prominent than most of us realize and are a great example of data accuracy challenges. Earlier this year, BuzzFeed News reported that technology news outlet CNET had generated 70+ articles using AI without initially disclosing this prominently. As a follow-on, BuzzFeed then used ChatGPT to generate its own article on the matter, noting that the process was error-prone and that the prompt had to be rewritten several times to avoid basic factual errors. In the scientific realm, teams go to great lengths to ensure their data are trustworthy. Increased use of chat-based AI will present new challenges for doing so.

3. Error Perpetuation Potential (As Illustrated by Natural Language Processing and AI-based Content Generation)

Natural language processing (NLP), including lexical analysis, has been around for years. For example, there are a number of solutions for scanning papers and building semantic models. In drug discovery, researchers might use such tools to scan publications and quickly uncover potential binding targets for small molecules. While these tools can help parse large volumes of content rapidly, they are certainly not foolproof, and manual review is often necessary to make final assessments. This is partly due to the inherent challenges of conveying complex information in written publications. What constitutes a “good” paper is a discussion far beyond the scope of this piece, but most of us have read papers that left us wondering whether we were missing some assumed knowledge or whether the paper was just poorly written. Training models using such papers is bound to be challenging.
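To make the "not foolproof" point concrete, here is a minimal sketch of a naive literature-mining step. It is not how any particular product works; the function name `candidate_sentences`, the compound and target names, and the sample abstract are all made up for illustration. The approach is simple co-occurrence: a sentence is flagged if it mentions both the compound and the target.

```python
import re


def candidate_sentences(text, compound, target):
    """Return sentences that mention both terms (case-insensitive co-occurrence)."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        if compound.lower() in lowered and target.lower() in lowered:
            hits.append(sentence)
    return hits


# Hypothetical abstract text, invented for this example.
abstract = (
    "Compound X binds kinase Y with high affinity. "
    "In a follow-up assay, compound X did not bind kinase Y. "
    "Kinase Z was unaffected."
)

hits = candidate_sentences(abstract, "compound X", "kinase Y")
for hit in hits:
    print(hit)
```

Both the positive and the negated sentence are flagged, because co-occurrence alone carries no notion of negation or experimental context. A reader still has to decide which statement the evidence supports.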

Complicating matters further is the growing popularity of using AI algorithms to generate new content from source materials of varying quality. The output may sound factually correct even when it isn’t, or it may become too complex and confusing to interpret. This can amplify quality issues and will likely skew content toward poor quality. In turn, readers may feel that they need algorithms to interpret information. If those algorithms are themselves lacking, the quality problem self-perpetuates, further diluting the content and making the truth increasingly difficult to decipher.

4. Complexity, Specificity, and Training Limitations (As Illustrated by AI-based Code Writing)

ChatGPT is also being hailed for its ability to write code, but like written language, creating code is an art form in its own right, and the more complex the code is, the greater the chance of error. Say, for example, you ask for the creation of an “alignment algorithm” without further specification. You may be given an algorithm that can align peptide sequences, but not DNA sequences. Because the letters representing DNA bases (A, C, G, and T) are also used to denote amino acids, the code may run without error on your data and still not be what you’re looking for. This leaves highly skilled people to clean up after the algorithm. Their skills, which have been acquired through years of computational life sciences work, might be better applied to actually writing and refining the algorithms themselves.
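The alphabet overlap described above can be shown in a few lines. This is a minimal sketch, assuming the 20 standard one-letter amino acid codes and an unambiguous four-letter DNA alphabet; the helper names `is_valid` and `guess_sequence_type` are hypothetical and not taken from any library.

```python
DNA_ALPHABET = frozenset("ACGT")
# The 20 standard amino acid one-letter codes (assumption: no ambiguity codes).
PROTEIN_ALPHABET = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_valid(sequence, alphabet):
    """Return True if every character of the sequence belongs to the alphabet."""
    return set(sequence.upper()) <= alphabet


def guess_sequence_type(sequence):
    """Guess 'dna', 'protein', or 'unknown' from the letters alone."""
    if not sequence:
        return "unknown"
    if is_valid(sequence, DNA_ALPHABET):
        # Every DNA string is also a valid protein string, so this is only a guess.
        return "dna"
    if is_valid(sequence, PROTEIN_ALPHABET):
        return "protein"
    return "unknown"


dna_read = "GATTACA"
# A peptide aligner that validates its input against the protein alphabet
# would accept this DNA read without complaint.
print(is_valid(dna_read, PROTEIN_ALPHABET))
print(guess_sequence_type(dna_read))
```

Because a DNA read passes protein-alphabet validation, a peptide aligner has no reason to reject it, and nothing in the output signals that the wrong scoring scheme was applied. Stating the sequence type explicitly in the request, and checking it in the code, is the safer design.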

As the example above illustrates, lack of specificity is a fundamental obstacle that must be kept in mind when employing any AI tool. Generalized models that have been trained on huge datasets with no specificity will undoubtedly struggle in specialist areas. For example, in drug discovery, if a predictive algorithm has been trained on a set of small molecules to predict protein-drug binding, the trustworthiness of its binding predictions depends on how structurally similar the input molecules are to the molecules in the training set. In such cases, an uncertainty metric can help improve transparency, letting users know the limitations of the model. This notion of trustworthiness is of key importance. Models, after all, are only as good as their training data. Without transparency, how are we to know if models were trained using insufficient, inaccurate, or improperly sourced data? While definite, confident answers like those given by ChatGPT may be attractive, those answers mean little without a trustworthiness score or insight into training-data sourcing and quality.
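One simple way to express the similarity-to-training-set idea is a nearest-neighbor similarity score. The sketch below is an illustration under stated assumptions, not a description of any specific model: molecules are represented as sets of fingerprint bit indices (made-up values here), similarity is the Tanimoto coefficient, and the 0.5 cutoff is an arbitrary placeholder that would need calibration in practice.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_training_similarity(query, training_set):
    """Highest Tanimoto similarity between the query and any training molecule."""
    return max((tanimoto(query, t) for t in training_set), default=0.0)


def domain_label(similarity, threshold=0.5):
    """Label a prediction by how close the input is to the training data.

    The threshold is a hypothetical value chosen only for this example.
    """
    return "in-domain" if similarity >= threshold else "out-of-domain"


# Made-up fingerprints: each set holds the indices of the bits that are "on".
training = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5, 8})]
close_query = frozenset({1, 2, 3, 5})
far_query = frozenset({10, 11, 12})

for name, query in [("close", close_query), ("far", far_query)]:
    score = nearest_training_similarity(query, training)
    print(name, round(score, 2), domain_label(score))
```

Reporting a score like this alongside each binding prediction gives users a rough, interpretable signal of when the model is being asked about molecules unlike anything it was trained on.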

Not All Models Are Created Equal - AI in Scientific R&D

Ask any scientist and they’ll likely agree that the use of machine learning and artificial intelligence in R&D is nothing new. For more than a decade, researchers have used computational techniques for many purposes, such as finding hits, predicting binding sites, modeling drug-protein interactions, and predicting reaction rates. Most scientists will also likely agree that models, like data, are not all created equal. In many cases, AI- and ML-based tools have been used as a supplement to other methods rather than on their own, but as they become more of a mainstay in our standard workflows, we must keep in mind the concerns illuminated by the examples above.

Developers of AI tools should aim to build semantic relationships into neatly organized training data and provide interpretable metrics that allow users to gauge confidence and reliability; users should not be expected to blindly take predictions at face value. It’s akin to providing a satellite navigation system that empowers drivers to see where they are and identify the best route to get where they need to be, rather than forcing upon them a self-driving vehicle that requires them to relinquish all knowledge and control. It’s about using AI to augment people’s expertise, not replace it (or them).

It all boils down to this: AI holds incredible potential to help speed up work, save costs, inspire innovations, and expand the scope of possibility, but clean data, trustworthy models, and human insight remain imperative.

Use AI in Your Scientific R&D Workflows

The Dotmatics Platform facilitates easy capture of clean data and enables the integration of AI into more extensive R&D workflows.
