The Blog
MindWalk is a biointelligence company uniting AI, multi-omics data, and advanced lab research into a customizable ecosystem for biologics discovery and development.
There is a compelling case underlying the tremendous interest in generative AI and LLMs as the next big technological inflection point in computational drug discovery and development. For starters, LLMs help expand the data universe of in silico drug discovery, especially by opening up access to the huge volumes of valuable information locked away in unstructured textual sources, including scientific literature, public databases, clinical trial notes, and patient records. LLMs provide the much-needed capability to analyze these sources, identify patterns and connections, and extract novel insights about disease mechanisms and potential therapeutic targets. Their ability to interpret complex scientific concepts and elucidate connections between diseases, genes, and biological processes can help accelerate disease hypothesis generation and the identification of potential drug targets and biomarkers.

When integrated with biomedical knowledge graphs, LLMs help create a unique synergistic model that enables bidirectional data- and knowledge-based reasoning. The explicit structured knowledge of knowledge graphs enhances the knowledge of LLMs, while the power of language models streamlines graph construction and conversational interaction with complex knowledge bases. However, several challenges still have to be addressed before LLMs can be reliably integrated into in silico drug discovery pipelines and workflows. One of these is hallucination.

Why Do LLMs Hallucinate?

At a time of some speculation about laziness and seasonal depression in LLMs, a hallucination leaderboard of 11 public LLMs revealed hallucination rates that ranged from 3% at the top end to 27% at the bottom of the barrel. Another comparative study of two versions of a popular LLM tasked with generating ophthalmic scientific abstracts found very high rates of fabricated references (33% and 29%). This tendency of LLMs to hallucinate, that is, to present incorrect or unverifiable information as accurate, can have serious consequences in critical drug discovery applications, even at a rate of 3%.

There are several reasons for LLM hallucinations. At the core of this behavior is the fact that generative AI models have no actual intelligence; they rely instead on a probability-based approach to predict the output most likely to occur given the patterns and contexts "learned" from their training data. Apart from this inherent lack of contextual understanding, other potential causes include noise, errors, biases, and inconsistencies in training data, training and generation methods, and even prompting techniques. For some, hallucination is all LLMs do, and others see it as inevitable for any prompt-based large language model. In the context of life sciences research, however, mitigating LLM hallucinations remains one of the biggest obstacles to the large-scale and strategic integration of this potentially transformative technology.

How to Mitigate LLM Hallucinations?

There are three broad and complementary approaches to mitigating hallucinations in large language models: prompt engineering, fine-tuning, and grounding plus prompt augmentation.

Prompt Engineering

Prompt engineering is the process of strategically designing user inputs, or prompts, in order to guide model behavior and obtain optimal responses. There are three major approaches to prompt engineering: zero-shot, few-shot, and chain-of-thought prompting.

In zero-shot prompting, language models are given inputs that were not part of their training data and are still expected to generate reliable results. Few-shot prompting involves providing examples to the LLM before presenting the actual query. Chain-of-thought (CoT) prompting is based on the finding that a series of intermediate reasoning steps provided as examples during prompting can significantly improve the reasoning capabilities of large language models. The chain-of-thought concept has been extended with newer techniques such as chain-of-verification (CoVe), a self-verification process that enables LLMs to check the accuracy and reliability of their output, and chain of density (CoD), a process that focuses on summarization rather than reasoning to control the density of information in the generated text.

Prompt engineering, however, has its own limitations, including prompt constraints that may cramp the ability to query complex domains and the lack of objective metrics to quantify prompt effectiveness. A minimal sketch of few-shot and chain-of-thought prompt construction follows.
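The sketch below illustrates how few-shot examples and a chain-of-thought instruction can be assembled into a single prompt. It is a hedged illustration only: the `call_llm` function, the example question, and the few-shot pairs are hypothetical placeholders, not part of any specific framework discussed above.

```python
# Minimal sketch: assembling a few-shot, chain-of-thought prompt.
# `call_llm` is a hypothetical stand-in for whichever LLM client you use.

FEW_SHOT_EXAMPLES = [
    {
        "question": "Which gene encodes the target of imatinib in CML?",
        "reasoning": "Imatinib inhibits the BCR-ABL fusion kinase; the kinase domain comes from ABL1.",
        "answer": "ABL1",
    },
    {
        "question": "Which receptor does trastuzumab bind?",
        "reasoning": "Trastuzumab is a monoclonal antibody directed against the HER2/ERBB2 receptor.",
        "answer": "HER2 (ERBB2)",
    },
]

def build_prompt(query: str) -> str:
    """Combine few-shot examples with a chain-of-thought instruction."""
    parts = ["Answer the biomedical question. Think step by step before answering.\n"]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Q: {ex['question']}\nReasoning: {ex['reasoning']}\nA: {ex['answer']}\n")
    parts.append(f"Q: {query}\nReasoning:")
    return "\n".join(parts)

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for an LLM API call."""
    raise NotImplementedError("Plug in your preferred LLM client here.")

if __name__ == "__main__":
    print(build_prompt("Which enzyme does aspirin irreversibly acetylate?"))
```

The same pattern extends naturally to chain-of-verification: the model's first answer can be fed back with a prompt asking it to list and check the factual claims it made.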
Fine-Tuning

Where the focus of prompt engineering is on the skill required to elicit better LLM output, fine-tuning emphasizes task-specific training in order to enhance the performance of pre-trained models in specific topics or domain areas. The conventional approach is full fine-tuning, which involves the additional training of a pre-trained model on labeled, domain- or task-specific data so that it generates more contextually relevant responses. This is a time-, resource-, and expertise-intensive process.

An alternative approach is parameter-efficient fine-tuning (PEFT), which trains a small set of extra parameters without adjusting the entire model. The modular nature of PEFT means that training can prioritize select portions or components of the original parameters so that the pre-trained model can be adapted for multiple tasks. LoRA (low-rank adaptation of large language models), a popular PEFT technique, can significantly reduce the resource intensity of fine-tuning while matching the performance of full fine-tuning.

There are, however, challenges to fine-tuning, including domain shift issues, the potential for bias amplification and catastrophic forgetting, and the complexities involved in choosing the right hyperparameters to ensure optimal performance.
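As a rough illustration of parameter-efficient fine-tuning, the sketch below wraps a pre-trained causal language model with LoRA adapters using the Hugging Face peft library. The base model name and the LoRA hyperparameters are illustrative assumptions, not a recommended configuration for any particular biomedical task.

```python
# Minimal LoRA sketch with Hugging Face transformers + peft.
# Assumptions: the base model name and LoRA hyperparameters below are
# illustrative placeholders only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

BASE_MODEL = "gpt2"  # placeholder; swap in a biomedical LLM of your choice

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# LoRA injects small low-rank update matrices into selected weight matrices,
# so only a tiny fraction of parameters is trained.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                # rank of the low-rank update
    lora_alpha=16,      # scaling factor
    lora_dropout=0.05,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters

# From here, the wrapped model can be passed to a standard training loop
# on labeled, domain-specific text.
```

The design point of the example is the parameter count printed at the end: because only the adapter weights are trainable, fine-tuning cost drops sharply compared with full fine-tuning.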
Grounding & Augmentation

LLM hallucinations are often the result of language models attempting to generate knowledge based on information that they have not explicitly memorized or seen. The logical solution, therefore, is to provide LLMs with access to a curated knowledge base of high-quality contextual information that enables them to generate more accurate responses. Advanced grounding and prompt augmentation techniques can help address many of the accuracy and reliability challenges associated with LLM performance. Both techniques rely on external knowledge sources to dynamically generate context. Grounding ensures that LLMs have access to up-to-date and use-case-specific information sources that provide relevant context not available solely from the training data. Similarly, prompt augmentation enhances a prompt with contextually relevant information that enables the LLM to generate more accurate and pertinent output.

Factual grounding is typically applied in the pre-training phase to ensure that LLM output across a variety of tasks is consistent with a knowledge base of factual statements. Post-training grounding relies on a range of external knowledge bases, including documents, code repositories, and public and proprietary databases, to improve the accuracy and relevance of LLMs on specific tasks. Retrieval-augmented generation (RAG) is a distinct framework for the post-training grounding of LLMs based on the most accurate, up-to-date information retrieved from external knowledge bases.

The RAG framework enables the optimization of biomedical LLM output along three key dimensions. One, access to targeted external knowledge sources ensures that the LLM's internal representation of information is dynamically refreshed with the most current and contextually relevant data. Two, access to an LLM's information sources ensures that responses can be validated for relevance and accuracy. And three, there is the emerging potential to extend the RAG framework beyond text to multimodal knowledge retrieval, spanning images, audio, tables, etc., that can further boost the factuality, interpretability, and sophistication of LLMs.

Also read: How Retrieval-Augmented Generation (RAG) Can Transform Drug Discovery

Some of the key challenges of retrieval-augmented generation include the high initial cost of implementation as compared to standalone generative AI. In the long run, however, the RAG-LLM combination is less expensive than frequently fine-tuning LLMs and provides the most comprehensive approach to mitigating LLM hallucinations. But even with better grounding and retrieval, scientific applications demand another layer of rigor: validation and reproducibility. Here is how teams can build confidence in LLM outputs before trusting them in high-stakes discovery workflows.

How to Validate LLM Outputs in Drug Discovery Pipelines

In scientific settings like drug discovery, ensuring the validity of large language model (LLM) outputs is critical, especially when those outputs may inform downstream experimental decisions. Here are key validation strategies used to assess LLM-generated content in biomedical pipelines.

Validation checklist:
- Compare outputs to curated benchmarks: use structured, peer-reviewed datasets such as DrugBank, ChEMBL, or internal gold standards to benchmark LLM predictions.
- Cross-reference with experimental data: validate AI-generated hypotheses against published experimental results, or integrate with in-house wet lab data for verification.
- Establish feedback loops from in vitro validations: create iterative pipelines where lab-tested results refine future model prompts, improving accuracy over time.

Advancing Reproducibility in AI-Augmented Science

For LLM-assisted workflows to be trustworthy and audit-ready, they must be reproducible, particularly when used in regulated environments.

Reproducibility practices:
- Dataset versioning: track changes in source datasets, ensuring that each model run references a consistent data snapshot.
- Prompt logging: store full prompts (including context and input structure) to reproduce specific generations and analyze outputs over time.
- Controlled inference environments: standardize model versions, hyperparameters, and APIs to eliminate variation in inference across different systems.

A minimal sketch of prompt logging and dataset versioning follows.
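The sketch below shows one possible way to implement prompt logging and dataset versioning with the Python standard library. The file layout, field names, and hashing scheme are assumptions for illustration, not a prescribed standard.

```python
# Minimal sketch: prompt logging and dataset versioning for reproducibility.
# All file paths and field names are illustrative assumptions.
import hashlib
import json
import time

def dataset_fingerprint(path: str) -> str:
    """Hash a dataset file so each run can reference a consistent snapshot."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def log_generation(log_file: str, prompt: str, response: str,
                   model_version: str, dataset_hash: str) -> None:
    """Append one JSON record per generation: prompt, output, and provenance."""
    record = {
        "timestamp": time.time(),
        "model_version": model_version,
        "dataset_sha256": dataset_hash,
        "prompt": prompt,
        "response": response,
    }
    with open(log_file, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    # Hypothetical usage: hash the retrieval corpus, then log a generation.
    corpus_hash = dataset_fingerprint("knowledge_base.jsonl")  # placeholder path
    log_generation("llm_runs.jsonl", "What pathways involve EGFR?",
                   "<model output>", model_version="biollm-v1 (placeholder)",
                   dataset_hash=corpus_hash)
```

Because every record carries the dataset hash and model version, any individual generation can later be traced back to the exact snapshot and configuration that produced it.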
Integrated Intelligence with LENSai™

Holistic life sciences research requires the sophisticated orchestration of several innovative technologies and frameworks. LENSai Integrated Intelligence, our next-generation data-centric AI platform, blends some of the most advanced proprietary technologies into one seamless solution that empowers end-to-end drug discovery and development. LENSai integrates RAG-enhanced bioLLMs with an ontology-driven NLP framework, combining neuro-symbolic logic techniques to connect and correlate syntax (multi-modal sequential and structural data) and semantics (biological functions). A comprehensive and continuously expanding knowledge graph, mapping a remarkable 25 billion relationships across 660 million data objects, links sequence, structure, function, and literature information from the entire biosphere to provide a comprehensive overview of the relationships between genes, proteins, structures, and biological pathways. Our next-generation, unified, knowledge-driven approach to the integration, exploration, and analysis of heterogeneous biomedical data gives life sciences researchers the capabilities they need to explore novel opportunities in drug discovery and development.
In a recent article on knowledge graphs and large language models (LLMs) in drug discovery, we noted that despite the transformative potential of LLMs in drug discovery, several critical challenges have to be addressed to ensure that these technologies conform to the rigorous standards demanded by life sciences research. Synergizing knowledge graphs with LLMs into one bidirectional data- and knowledge-based reasoning framework addresses several concerns related to hallucinations and lack of interpretability. However, that still leaves the challenge of giving LLMs access to external data sources that address their limitations with respect to factual accuracy and up-to-date knowledge recall. Retrieval-augmented generation (RAG), together with knowledge graphs and LLMs, is the third critical node in the trifecta of techniques required for the robust and reliable integration of the transformative potential of language models into drug discovery pipelines.

Why Retrieval-Augmented Generation?

One of the key limitations of general-purpose LLMs is their training data cutoff, which essentially means that their responses to queries are typically out of step with the rapidly evolving nature of information. This is a serious drawback, especially in fast-paced domains like life sciences research. Retrieval-augmented generation enables biomedical research pipelines to optimize LLM output by:

- Grounding the language model on external sources of targeted and up-to-date knowledge to constantly refresh the LLM's internal representation of information without having to completely retrain the model. This ensures that responses are based on the most current data and are more contextually relevant.
- Providing access to the model's information sources so that responses can be validated, ensuring that its claims can be checked for relevance and accuracy.

In short, retrieval-augmented generation provides the framework necessary to augment the recency, accuracy, and interpretability of LLM-generated information.

How Does Retrieval-Augmented Generation Work?

Retrieval-augmented generation is a natural language processing (NLP) approach that combines elements of both information retrieval and text generation models to enhance performance on knowledge-intensive tasks. The retrieval component aggregates information relevant to a specific query from a predefined set of documents or knowledge sources, which then serves as the context for the generation model. Once the information has been retrieved, it is combined with the input context to create an integrated context containing both the original query and the relevant retrieved information. This integrated context is then fed into a generation model to produce an accurate, coherent, and contextually appropriate response based on both pre-trained knowledge and retrieved query-specific information.

The RAG approach gives life sciences research teams more control over the grounding data used by a biomedical LLM by honing it on enterprise- and domain-specific knowledge sources. It also enables the integration of a range of external data sources, such as document repositories, databases, or APIs, that are most relevant to enhancing model responses to a query.
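To make the retrieve-then-generate flow concrete, here is a minimal sketch of a RAG loop using sentence-transformers for dense retrieval. The document snippets, the embedding model choice, and the `generate_answer` placeholder are assumptions for illustration; a production pipeline would use a curated biomedical corpus and an LLM client of your choice.

```python
# Minimal RAG sketch: embed a small corpus, retrieve top-k passages for a
# query, and assemble an augmented prompt for a generator.
# Assumptions: the corpus snippets and model name are illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

CORPUS = [
    "EGFR mutations are associated with response to tyrosine kinase inhibitors.",
    "BRCA1 loss-of-function variants impair homologous recombination repair.",
    "PD-L1 expression is used as a biomarker for checkpoint inhibitor therapy.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # general-purpose placeholder model
corpus_embeddings = encoder.encode(CORPUS, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages with highest cosine similarity to the query."""
    query_embedding = encoder.encode([query], normalize_embeddings=True)[0]
    scores = corpus_embeddings @ query_embedding
    top_idx = np.argsort(scores)[::-1][:k]
    return [CORPUS[i] for i in top_idx]

def build_augmented_prompt(query: str) -> str:
    """Combine the retrieved context and the original query."""
    context = "\n".join(f"- {passage}" for passage in retrieve(query))
    return (
        "Answer the question using only the context below and cite it.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

def generate_answer(prompt: str) -> str:
    """Hypothetical placeholder for the generation model."""
    raise NotImplementedError("Plug in your preferred LLM client here.")

if __name__ == "__main__":
    print(build_augmented_prompt("Which biomarker guides checkpoint inhibitor use?"))
```

The key design choice is that the generator is constrained to the retrieved context, which is what makes responses traceable back to their sources.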
The Value of RAG in Biomedical Research

Conceptually, the retrieve-plus-generate model's ability to deal with dynamic external information sources, minimize hallucinations, and enhance interpretability makes it a natural and complementary fit to augment the performance of bioLLMs. To quantify this augmentation in performance, a recent research effort evaluated a retrieval-augmented generative agent in biomedical question-answering against LLMs (GPT-3.5/4), state-of-the-art commercial tools (Elicit, scite, and Perplexity), and humans (biomedical researchers). The RAG agent, PaperQA, was first evaluated against a standard multiple-choice LLM-evaluation dataset, PubMedQA, with the provided context removed to test the agent's ability to retrieve information. In this case, the RAG agent beat GPT-4 by nearly 30 points (86.3% versus 57.9%). Next, the researchers constructed a more complex and more contemporary dataset (LitQA), based on recent full-text research papers outside the bounds of the LLMs' pre-training data, to compare the integrated abilities of PaperQA, LLMs, and human researchers to retrieve the right information and generate an accurate answer based on it. Again, the RAG agent outperformed both pre-trained LLMs and commercial tools, with overall accuracy (69.5%) and precision (87.9%) scores that were on par with biomedical researchers. More importantly, the RAG model produced zero hallucinated citations, compared with rates of 40-60% for the LLMs.

Despite being a narrow evaluation of the retrieval-plus-generation approach in biomedical QA, the above research does demonstrate the significantly enhanced value that RAG + bioLLM can deliver compared to purely generative AI. The combined sophistication of retrieval and generation models can be harnessed to enhance the accuracy and efficiency of a range of processes across the drug discovery and development pipeline.

Retrieval-Augmented Generation in Drug Discovery

In the context of drug discovery, RAG can be applied to a range of tasks, from literature reviews to biomolecule design. Generative models have demonstrated potential for de novo molecular design but are still hampered by their inability to integrate multimodal information or provide interpretability. The RAG framework can facilitate the retrieval of multimodal information from a range of sources, such as chemical databases, biological data, clinical trials, and images, that can significantly augment generative molecular design. The same expanded retrieval-plus-augmented-generation template applies to a whole range of applications in drug discovery, for example: compound design (retrieve compounds and their properties, generate improvements or new properties), drug-target interaction prediction (retrieve known drug-target interactions, generate potential interactions between new compounds and specific targets), and adverse effects prediction (retrieve known adverse effects, generate modifications to eliminate them). The template even applies to several sub-processes and sub-tasks within drug discovery, leveraging a broader swathe of existing knowledge to generate novel, reliable, and actionable insights.
In target validation, for example, retrieval-augmented generation can enable a comprehensive generative analysis of a target of interest based on an extensive review of all existing knowledge about it: expression patterns and functional roles, known binding sites, pertinent biological pathways and networks, potential biomarkers, and more. In short, the more efficient and scalable retrieval of timely information ensures that generative models are grounded in factual, sourceable knowledge, a combination with limitless potential to transform drug discovery.

An Integrated Approach to Retrieval-Augmented Generation

Retrieval-augmented generation addresses several critical limitations of bioLLMs and augments their generative capabilities. However, there are additional design rules and multiple technological profiles that have to come together to successfully address the specific requirements and challenges of life sciences research. Our LENSai™ Integrated Intelligence platform seamlessly unifies the semantic proficiency of knowledge graphs, the versatile information retrieval capabilities of retrieval-augmented generation, and the reasoning capabilities of large language models to reinvent the understand-retrieve-generate cycle in biomedical research. Our unified approach empowers researchers to query a harmonized life science knowledge layer that integrates unstructured information and ontologies into a knowledge graph. A semantic-first approach enables a more accurate understanding of research queries, which in turn results in the retrieval of content that is most pertinent to the query. The platform also integrates retrieval-augmented generation with structured biomedical data from our HYFT technology to enhance the accuracy of generated responses. And finally, LENSai combines deep learning LLMs with neuro-symbolic logic techniques to deliver comprehensive and interpretable outcomes. To experience this unified solution in action, please contact us here.
Natural language understanding (NLU) is an AI-powered technology that allows machines to understand the structure and meaning of human language. NLU, like natural language generation (NLG), is a subset of natural language processing (NLP); it focuses on assigning structure, rules, and logic to human language so machines can understand the intended meaning of words, phrases, and sentences in text. NLG, on the other hand, deals with generating realistic written or spoken, human-understandable information from structured and unstructured data. Since the development of NLU is rooted in theoretical linguistics, the process can be explained in terms of the following linguistic levels of language comprehension.

Linguistic Levels in NLU

Phonology is the study of sound patterns in different languages and dialects; in NLU it refers to the analysis of how sounds are organized, and of their purpose and behavior.

Lexical or morphological analysis is the study of morphemes, the indivisible basic units of language that carry meaning, one at a time. Indivisible words with their own meaning, or lexical morphemes (e.g., work), can be combined with plural morphemes (e.g., works) or grammatical morphemes (e.g., worked/working) to create word forms. Lexical analysis identifies relationships between morphemes and converts words into their root form.

Syntactic analysis, or syntax analysis, is the process of applying grammatical rules to word clusters and organizing them on the basis of their syntactic relationships in order to determine meaning. This also involves detecting grammatical errors in sentences.

While syntactic analysis extracts meaning from the grammatical syntax of a sentence, semantic analysis looks at the context and purpose of the text. It helps capture the true meaning of a piece of text by identifying text elements as well as their grammatical roles.

Discourse analysis expands the focus from sentence-length units to the relationships between sentences and their impact on overall meaning. Discourse refers to coherent groups of sentences that contribute to the topic under discussion.

Pragmatic analysis deals with aspects of meaning not reflected in syntactic or semantic relationships. Here the focus is on identifying the intended meaning by analyzing literal and non-literal components against the context of background knowledge.

Common Tasks/Techniques in NLU

There are several techniques used in the processing and understanding of human language. Here is a quick run-through of some of the key techniques used in NLU and NLP.

Tokenization is the process of breaking down a string of text into smaller units called tokens. For instance, a text document could be tokenized into sentences, phrases, words, subwords, and characters. This is a critical preprocessing step that converts unstructured text into units that can be mapped to numerical data for further analysis.

Stemming and lemmatization are two different approaches with the same objective: to reduce a particular word to its root. In stemming, characters are removed from the end of a word to arrive at its "stem"; the algorithm determines how many characters to strip from different words even though it does not explicitly know their meaning. Lemmatization is a more sophisticated approach that uses morphological analysis to arrive at the root word, or lemma.

Parsing is the process of extracting the syntactic information of a sentence based on the rules of a formal grammar.
Based on the type of grammar applied, parsing can be classified broadly into constituency and dependency parsing. Constituency parsing, based on context-free grammar, involves dividing a sentence into sub-phrases, or constituents, that belong to specific grammatical categories, such as noun phrases or verb phrases. Dependency parsing defines the syntax of a sentence not in terms of constituents but in terms of the dependencies between the words in a sentence. The relationships between words are depicted as a dependency tree, where words are represented as nodes and the dependencies between them as edges.

Part-of-speech (POS) tagging, or grammatical tagging, is the process of assigning a grammatical classification, such as noun, verb, or adjective, to each word in a sentence. Automatic tagging can be broadly classified as rule-based, transformation-based, or stochastic POS tagging. Rule-based tagging uses a dictionary, as well as a small set of rules derived from the formal syntax of the language, to assign parts of speech. Transformation-based tagging, or Brill tagging, leverages transformation-based learning for automatic tagging. Stochastic tagging refers to any model that uses frequency or probability, e.g., word frequency or tag sequence probability, for automatic POS tagging.

Named entity recognition (NER) is an NLP subtask used to detect, extract, and categorize named entities, including names, organizations, locations, themes, topics, monetary values, etc., from large volumes of unstructured data. There are several approaches to NER, including rule-based systems, statistical models, dictionary-based systems, ML-based systems, and hybrid models.

These are just a few examples of the most common techniques used in NLU; a short illustrative sketch of several of them appears below. There are several others, for instance word sense disambiguation, semantic role labeling, and semantic parsing, that focus on different levels of semantic abstraction.

NLP/NLU in Biomedical Research

NLP/NLU technologies are a strategic fit for biomedical research, with its vast volumes of unstructured data: 3,000-5,000 papers published each day, clinical text data from EHRs, diagnostic reports, medical notes, lab data, etc., and non-standardized digital real-world data. NLP-enabled text mining has emerged as an effective and scalable solution for extracting biomedical entity relations from vast volumes of scientific literature. Techniques like named entity recognition (NER) are widely used in relation extraction tasks in biomedical research, with conventional named entities, such as names, organizations, and locations, substituted with gene sequences, proteins, biological processes and pathways, drug targets, etc. The unique vocabulary of biomedical research has necessitated the development of specialized, domain-specific bioNLP frameworks.

At the same time, the capabilities of NLU algorithms have been extended to the language of proteins and to that of chemistry and biology itself. A 2021 article detailed the conceptual similarities between proteins and language that make them ideal for NLP analysis. More recently, an NLP model was trained to correlate amino acid sequences from the UniProt database with the English words, phrases, and sentences used to describe protein function, in order to annotate over 40 million proteins. Researchers have also developed an interpretable and generalizable drug-target interaction model, inspired by sentence classification techniques, to extract relational information from drug-target biochemical sentences.
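As a rough illustration of several of the techniques described above (tokenization, lemmatization, POS tagging, dependency parsing, and NER), the sketch below uses the open-source spaCy library with its small general-purpose English model. The model choice and example sentence are assumptions; a biomedical pipeline would typically swap in a domain-specific model.

```python
# Minimal NLU sketch with spaCy: tokenization, lemmatization, POS tagging,
# dependency parsing, and named entity recognition on one example sentence.
# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # general-purpose model, not biomedical
doc = nlp("Imatinib was approved by the FDA in 2001 for chronic myeloid leukemia.")

# Tokenization, lemmatization, POS tagging, and dependency parsing per token.
for token in doc:
    print(f"{token.text:12} lemma={token.lemma_:12} pos={token.pos_:6} "
          f"dep={token.dep_:10} head={token.head.text}")

# Named entity recognition: surface spans and their predicted labels.
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```

A general-purpose model will only recognize generic entity types here; the point of domain-specific bioNLP frameworks is precisely to replace these with gene, protein, drug, and disease labels.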
Large neural language models and transformer-based language models are opening up transformative opportunities for biomedical NLP applications across a range of bioinformatics fields, including sequence analysis, genome analysis, multi-omics, spatial transcriptomics, and drug discovery. Most importantly, NLP technologies have helped unlock the latent value in huge volumes of unstructured data to enable more integrative, systems-level biomedical research. Read more about NLP's critical role in facilitating systems biology and AI-powered, data-driven drug discovery. If you want more information on seamlessly integrating advanced bioNLP frameworks into your research pipeline, please drop us a line here.
Over the past year, we have looked at drug discovery and development from several different perspectives. For instance, we looked at the big data frenzy in biopharma, as zettabytes of sequencing, real-world (RWD), and textual data pile up and stress the data integration and analytics capabilities of conventional solutions. We also discussed how the time-consuming, cost-intensive, low-productivity characteristics of the prevalent ROI-focused model of development have an adverse impact not just on commercial viability in the pharma industry but on the entire healthcare ecosystem. Then we saw how antibody drug discovery processes continued to be cited as the biggest challenge in therapeutic R&D even as the industry pivoted to biologics and mAbs. No matter the context or frame of reference, the focus inevitably turns to how AI technologies can transform the entire drug discovery and development process, from research to clinical trials. Biopharma companies have traditionally been slow to adopt innovative technologies like AI and the cloud. Today, however, digital innovation has become an industry-wide priority, with drug development expected to be the area most impacted by smart technologies.

From Application-Centric to Data-Centric

AI technologies have a range of applications across the drug discovery and development pipeline, from opening up new insights into biological systems and diseases to streamlining drug design and optimizing clinical trials. Despite the wide-ranging potential of AI-driven transformation in biopharma, the process does entail some complex challenges. The most fundamental will be to make the shift from an application-centric to a data-centric culture, where data and metadata are operationalized at scale and across the entire drug design and development value chain.

However, creating a data-centric culture in drug development comes with its own set of data-related challenges. To start with, there is the sheer scale of data, which requires a scalable architecture to be efficient and cost-effective. Most of this data is distributed across disparate silos with their own storage practices, quality procedures, and naming and labeling conventions. Then there is the issue of different data modalities, from MR or CT scans to unstructured clinical notes, that have to be extracted, transformed, and curated at scale for unified analysis. And finally, the level of regulatory scrutiny on sensitive biomedical data means that there is a constant tension between enabling collaboration and ensuring compliance. Creating a strong data foundation that accounts for all these complexities in biopharma data management and analysis will therefore be critical to the successful adoption of AI in drug development.

Three Key Requisites for an AI-Ready Data Foundation

Successful AI adoption in drug development will depend on the creation of a data foundation that addresses three key requirements.

Accessibility

Data accessibility is a key characteristic of AI leaders irrespective of sector. To ensure effective and productive data democratization, organizations need to enable access to data distributed across complex technology environments spanning multiple internal and external stakeholders and partners. A key caveat of accessibility is that the data provided should be contextual to the analytical needs of specific data users and consumers.
A modern, cloud-based, connected enterprise data and AI platform, designed as a one-stop shop for all drug design and development-related data products with ready-to-use analytical models, will be critical to ensuring broader and deeper data accessibility for all users.

Data Management and Governance

The quality of any data ecosystem is determined by the data management and governance frameworks that ensure relevant information is accessible to the right people at the right time. At the same time, these frameworks must also be capable of protecting confidential information, ensuring regulatory compliance, and facilitating the ethical and responsible use of AI. The key focus of data management and governance, therefore, is to consistently ensure the highest quality of data across all systems and platforms, as well as full transparency and traceability in the acquisition and application of data.

UX and Usability

Successful AI adoption will require a data foundation that streamlines accessibility and prioritizes UX and usability. Apart from democratizing access, the emphasis should also be on ensuring that even non-technical users are able to use data effectively and efficiently. Different users often consume the same datasets from completely different perspectives. The key, therefore, is to provide a range of tools and features that help every user customize the experience to their specific roles and interests.

Apart from creating the right data foundation, technology partnerships can also help accelerate the shift from an application-centric to a data-centric approach to AI adoption. In fact, a 2018 Gartner report advised organizations to explore vendor offerings as a foundational approach to jump-start their efforts to make productive use of AI. More recently, pharma-technology partnerships have emerged as the fastest-moving model for externalizing innovation in AI-enabled drug discovery. According to a recent Roots Analysis report on the AI-based drug discovery market, partnership activity in the pharmaceutical industry grew at a CAGR of 50% between 2015 and 2021, with a majority of the deals focused on research and development. With that trend as background, here is a quick look at how a data-centric, full-service biotherapeutic platform can accelerate biopharma's shift to an AI-first drug discovery model.

The LENSai™ Approach to Data-Centric Drug Development

Our approach to biotherapeutic research places data at the very core of a dynamic network of biological and artificial intelligence technologies. With our LENSai platform, we have created a Google-like solution for the entire biosphere, organizing it into a multidimensional network of 660 million data objects with multiple layers of information about sequence, syntax, and protein structure. This one-stop-shop model enables researchers to seamlessly access all raw sequence data. In addition, HYFTs®, our universal framework for organizing all biological data, allows easy, one-click integration of all other research-relevant data from public and proprietary data repositories. Researchers can then leverage the power of the LENSai Integrated Intelligence platform to integrate unstructured data from text-based knowledge sources such as scientific journals, EHRs, and clinical notes.
Here again, researchers have the ability to expand the core knowledge base, containing over 33 million abstracts from the PubMed biomedical literature database, by integrating data from multiple sources and knowledge domains, including proprietary databases. Around this multi-source, multi-domain, data-centric core, we have designed next-generation AI technologies that can instantly and concurrently convert these vast volumes of text, sequence, and protein structure data into meaningful knowledge that can transform drug discovery and development.
Artificial intelligence (AI) technologies are currently the most disruptive trend in the pharmaceutical industry. Over the past year, we have quite extensively covered the impact that these intelligent technologies can have on conventional drug discovery and development processes. We charted how AI and machine learning (ML) technologies came to be a core component of drug discovery and development, their potential to exponentially scale and autonomize drug discovery and development, their ability to expand the scope of drug research even in data-scarce specialties like rare diseases, and the power of knowledge graph-based drug discovery to transform a range of drug discovery and development tasks. AI/ML technologies can radically remake every stage of the drug discovery and development process, from research to clinical trials. Today, we will dive deeper into the transformational possibilities of these technologies in two foundational stages of the drug development process: early drug discovery and preclinical development.

Early Drug Discovery and Preclinical Development

[Figure: overview of the early drug discovery and preclinical development process. Source: ScienceDirect]

Early drug discovery and preclinical development is a complex process that essentially determines the productivity and value of downstream development programs. Therefore, even incremental improvements in accuracy and efficiency during these early stages could dramatically improve the entire drug development value chain.

AI/ML in Early Drug Discovery

The early small molecule drug discovery process flows broadly across target identification, hit identification, lead identification, lead optimization, and finally, preclinical development. Currently, this time-consuming and resource-intensive process relies heavily on translational approaches and assumptions. Incorporating assumptions, especially those that cannot be validated due to lack of data, raises the risk of late-stage failure by advancing NMEs into drug development without accurate evidence of human response. Even the drastically different process of large-molecule, or biologics, development starts with an accurate definition of the most promising target. AI/ML methods can therefore play a critical role in accelerating the development process, and investigating drug-target interactions (DTIs) is a critical step to enhancing the success rate of new drug discovery.

Predicting Drug-Target Interactions

Despite the successful identification of the biochemical functions of a myriad of proteins and compounds with conventional biomedical techniques, the limitations of these approaches come into play when scaling across the volume and complexity of available data. This is what makes ML methods ideal for drug-target interaction (DTI) prediction at scale. There are currently several state-of-the-art ML models available for DTI prediction. However, many conventional ML approaches regard DTI prediction as either a classification or a regression task, both of which can lead to bias and variance errors. Novel multi-DTI models that balance bias and variance through a multi-task learning framework have been able to deliver superior performance and accuracy over even state-of-the-art methods. These DTI prediction models combine a deep learning framework with a co-attention mechanism to model interactions from drug and protein modalities and improve the accuracy of drug-target annotation.
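The sketch below shows, in heavily simplified form, how a classification-style DTI model can be framed: a compound fingerprint is concatenated with a simple protein sequence feature and fed to a standard classifier. The SMILES strings, protein fragments, labels, and model choice are all made-up illustrative assumptions; real DTI models use far richer representations and, as noted above, multi-task or co-attention architectures.

```python
# Simplified DTI-as-classification sketch: Morgan fingerprint (drug) +
# amino acid composition (protein) -> scikit-learn classifier.
# All data below are tiny, made-up placeholders for illustration only.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def drug_features(smiles: str, n_bits: int = 512) -> np.ndarray:
    """Morgan (circular) fingerprint of the compound."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array(fp)

def protein_features(sequence: str) -> np.ndarray:
    """Amino acid composition: a crude fixed-length protein descriptor."""
    counts = np.array([sequence.count(aa) for aa in AMINO_ACIDS], dtype=float)
    return counts / max(len(sequence), 1)

# Hypothetical (drug SMILES, protein fragment, interacts?) triples.
pairs = [
    ("CC(=O)Oc1ccccc1C(=O)O", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 1),
    ("CCO",                    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 0),
    ("c1ccccc1",               "MSEQNNTEMTFQIQRIYTKDISFEAPNAPHVFQ", 0),
    ("CC(=O)Nc1ccc(O)cc1",     "MSEQNNTEMTFQIQRIYTKDISFEAPNAPHVFQ", 1),
]

X = np.array([np.concatenate([drug_features(s), protein_features(p)])
              for s, p, _ in pairs])
y = np.array([label for _, _, label in pairs])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(model.predict_proba(X[:1]))  # toy prediction on training data
```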
Deep learning models perform significantly better at high-throughput DTI prediction than conventional approaches, and they continue to evolve, from identifying simple interactions to revealing unknown mechanisms of drug action.

Lead Identification & Optimization

This stage focuses on identifying and optimizing drug-like small molecules that exhibit therapeutic activity. The challenge in this hit-to-lead phase is twofold. First, the search space for extracting hit molecules from compound libraries extends to millions of molecules; a single database like the ZINC database comprises 230 million purchasable compounds, and the universe of make-on-demand synthesis compounds can reach 10 billion. Second, the hit rate of conventional high-throughput screening (HTS) approaches in yielding an eligible, viable compound is just around 0.1%. Over the years, there have been several initiatives to improve the productivity and efficiency of hit-to-lead generation, including the use of high-content screening (HCS) techniques to complement HTS and computer-aided drug design (CADD) virtual screening methodologies to reduce the number of compounds to be tested.

[Figure: how AI adds value to small-molecule drug discovery. Source: BCG]

The availability of huge volumes of high-quality data, combined with the ability of AI to parse and learn from these data, has the potential to take the computational screening process to a new level. There are at least four ways in which AI can add new value to small-molecule drug discovery: access to new biology, improved or novel chemistry, better success rates, and quicker and cheaper discovery processes. AI technologies can be applied to a variety of discovery contexts and biological targets and can play a critical role in redefining long-standing workflows and resolving many of the challenges of conventional techniques.

AI/ML in Preclinical Development

Preclinical development addresses several critical issues relevant to the success of new drug candidates. Preclinical studies are a regulatory prerequisite for generating the toxicology data that validate the safety of a drug for humans prior to clinical trials. These studies inform trial design and provide the pharmacokinetic, pharmacodynamic, tolerability, and safety information, such as in vitro off-target and tissue cross-reactivity (TCR) data, that defines optimal dosage. Preclinical data also provide the chemistry, manufacturing, and controls information that will be crucial for clinical production. Finally, they help pharma companies identify candidates with the broadest potential benefits and the greatest chance of success.

It is estimated that just 10 out of 10,000 small molecule drug candidates in preclinical studies make it to clinical trials. One reason for this extremely high attrition is the imperfect nature of preclinical in vivo research models, which, unlike in vitro studies that can typically confirm efficacy, mechanism of action, and so on, make it difficult to accurately predict clinical outcomes. However, AI/ML technologies are increasingly being used to bridge the translational gap between preclinical discoveries and new therapeutics. For instance, a key approach to de-risking clinical development has been the use of translational biomarkers that demonstrate target modulation and target engagement and confirm proof of mechanism. In this context, AI techniques have been deployed to learn from large volumes of heterogeneous, high-dimensional omics data and provide valuable insights that streamline translational biomarker discovery.
Similarly, ML algorithms that learn from problem-specific training data have been used to accurately predict bioactivity, absorption, distribution, metabolism, excretion, and toxicity (ADMET)-related endpoints, and physicochemical properties (a minimal sketch of this kind of property prediction appears at the end of this section). These technologies also play a critical role in the preclinical development of biologics, including the identification of candidate molecules with a higher probability of providing species-agnostic reactive outcomes in animal and human testing, ortholog analysis, and off-target binding analysis. They have also been used to successfully predict drug interactions, including drug-target and drug-drug interactions, during preclinical testing.

The Age of Data-Driven Drug Discovery & Development

Network-based approaches that enable a systems-level view of the mechanisms underlying disease pathophysiology are increasingly becoming the norm in drug discovery and development. This in turn has opened up a new era of data-driven drug development where the focus is on the integration of heterogeneous types and sources of data, including molecular, clinical trial, and drug label data. The preclinical space is being transformed by AI technologies like natural language processing (NLP) that enable the identification of novel targets and previously undiscovered drug-disease associations based on insights extracted from unstructured data sources like biomedical literature and electronic medical records (EMRs). Sophisticated and powerful ML/AI algorithms now enable the unified analysis of huge volumes of diverse datasets to autonomously reveal complex non-linear relationships that streamline and accelerate drug discovery and development. Ultimately, the efficiency and productivity of early drug discovery and preclinical development will determine the value of the entire pharma R&D value chain, and that is where AI/ML technologies have been gaining the most traction in recent years.
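As a rough illustration of the ADMET- and bioactivity-style property prediction mentioned above, the sketch below trains a regression model on simple RDKit molecular descriptors. The molecules, target values, and descriptor/model choices are illustrative assumptions; real ADMET models are trained on large curated datasets with richer featurization.

```python
# Simplified property-prediction sketch: RDKit descriptors -> random forest
# regression on a made-up endpoint (e.g., a solubility-like value).
# All molecules and target values below are illustrative placeholders.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor

def featurize(smiles: str) -> np.ndarray:
    """A handful of classic physicochemical descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array([
        Descriptors.MolWt(mol),         # molecular weight
        Descriptors.MolLogP(mol),       # lipophilicity estimate
        Descriptors.TPSA(mol),          # topological polar surface area
        Descriptors.NumHDonors(mol),    # hydrogen bond donors
        Descriptors.NumHAcceptors(mol), # hydrogen bond acceptors
    ])

# Hypothetical (SMILES, measured endpoint) pairs.
training_data = [
    ("CCO", -0.2), ("CC(=O)Oc1ccccc1C(=O)O", -1.7),
    ("c1ccccc1", -2.1), ("CC(=O)Nc1ccc(O)cc1", -1.0),
    ("CCCCCC", -3.5), ("OCC(O)CO", 0.8),
]

X = np.array([featurize(s) for s, _ in training_data])
y = np.array([v for _, v in training_data])

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([featurize("CCN")]))  # toy prediction for a new molecule
```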
Today, artificial intelligence (AI), machine learning (ML), deep learning (DL), and natural language processing (NLP) are all technologies that have become part of the fabric of enterprise IT. However, solution providers and end-users often use these terms interchangeably. Even though there are significant conceptual overlaps, there are also important distinctions between these key technologies. Increasingly, the value of AI in drug discovery is determined not by model complexity alone, but by how well biological context is preserved across data, computation, and experimentation. Platforms such as MindWalk reflect this shift, prioritizing biological fidelity, traceability, and integration with experimental workflows so that computational insight remains actionable as discovery programs scale. Here is a quick overview of the definition and scope of each of these terms.

Artificial Intelligence (AI)

The term AI has been around since the 1950s and broadly refers to the simulation of human intelligence by machines. It encompasses several areas beyond computer science, including psychology, philosophy, linguistics, and others. AI can be classified into four types, from simplest to most advanced: reactive machines, limited memory, theory of mind, and self-awareness.

Reactive machines: Purely reactive machines are trained to perform a basic set of tasks based on certain inputs. This type of AI cannot function beyond a specific context and is not capable of learning or evolving over time. Examples: IBM's Deep Blue chess AI and Google's AlphaGo.

Limited memory systems: As the nomenclature suggests, these AI systems have limited memory to store and analyze data. This memory is what enables "learning" and gives them the capability to improve over time. In practical terms, these are the most advanced AI systems we currently have. Examples: self-driving vehicles, virtual voice assistants, chatbots.

Theory of mind: At this level, we are into theoretical concepts that have not yet been achieved. With their ability to understand human thoughts and emotions, these advanced AI systems could facilitate more complex two-way interactions with users.

Self-awareness: Self-aware AIs with human-level desires, emotions, and consciousness are the aspirational end state for AI and, as yet, remain pure science fiction.

Another broad approach to distinguishing between AI systems is in terms of narrow or weak AI (specialized intelligence trained to perform specific tasks better than humans), artificial general intelligence (AGI) or strong AI (a theoretical system that could be applied to any task or problem), and artificial superintelligence (ASI), AI that comprehensively surpasses human cognition. The concept of AI is continuously evolving based on the emergence of technologies that enable ever more accurate simulation of human intelligence. Some of those technologies include ML, DL, and artificial neural networks (ANNs), or simply neural networks (NNs).

ML, DL, RL, and DRL

Here is the tl;dr before we get into each of these concepts in a bit more detail: if AI's objective is to endow machines with human intelligence, ML refers to methods for implementing AI by using algorithms for data-driven learning and decision-making. DL is a technology for realizing ML and expanding the scope of AI. Reinforcement learning (RL), or evaluation learning, is an ML technique. And deep reinforcement learning (DRL) combines DL and RL to realize optimization objectives and advance toward general AI.
[Figure: the relationship between AI, ML, and DL. Source: ResearchGate]

Machine Learning (ML)

ML is a subset of AI that involves the implementation of algorithms and neural networks to give machines the ability to learn from experience and act automatically. ML algorithms can be broadly classified into three categories (a minimal supervised-versus-unsupervised sketch appears at the end of this overview).

Supervised learning: ML algorithms use a labelled input dataset and known responses to develop a regression or classification model that can then be applied to new datasets to generate predictions or draw conclusions. The limitation of this approach is that it is not viable for datasets beyond a certain context.

Unsupervised learning: Algorithms are given "unknown" data that has yet to be categorized or labelled. In this case, the ML system itself learns to classify and process unlabelled data based on its inherent structure. There is also an intermediate approach between supervised and unsupervised learning, called semi-supervised learning, where the system is trained on a small amount of labelled data to determine correlations between data points.

Reinforcement learning (RL): An ML paradigm where algorithms learn through ongoing interactions between an AI system and its environment. Algorithms receive numerical scores as rewards for the decisions and outcomes they generate, so that positive interactions and behaviours are reinforced over time.

Deep Learning (DL)

DL is a subset of ML in which models built on deep neural networks work with unlabelled data to detect patterns with minimal human involvement. DL technologies are based on the theory-of-mind type of AI, where the idea is to simulate the human brain by using neural networks to teach models to perceive, classify, and analyze information and continuously learn from these interactions. DL techniques can be classified into three major categories: deep networks for supervised or discriminative learning, deep networks for unsupervised or generative learning, and deep networks for hybrid learning, which integrate supervised and unsupervised models. Deep reinforcement learning (DRL) combines RL with DL techniques to solve challenging sequential decision-making problems. Because of its ability to learn different levels of abstraction from data, DRL is capable of addressing more complicated tasks.

Natural Language Processing (NLP)

What is natural language processing? NLP is the branch of AI that deals with training machines to understand, process, and generate language. By enabling machines to process human languages, NLP helps streamline information exchange between human beings and machines and opens up new avenues by which AI algorithms can receive data. NLP functionality is derived from cross-disciplinary theories in linguistics, AI, and computer science. There are two main types of NLP algorithms: rules-based and ML-based. Rules-based systems use carefully designed linguistic rules, whereas ML-based systems use statistical methods. NLP also consists of two core subsets: natural language understanding (NLU) and natural language generation (NLG). NLU enables computers to comprehend human languages and communicate back to humans in their own languages. NLG is the use of AI programming to mine large quantities of numerical data, identify patterns, and share that information as written or spoken narratives that are easier for humans to understand.
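The sketch below contrasts the supervised and unsupervised paradigms described above using scikit-learn: a classifier fit on labelled points versus a clustering algorithm that must discover structure on its own. The toy data and model choices are assumptions for illustration.

```python
# Minimal sketch: supervised classification vs. unsupervised clustering.
# The toy 2-D points and labels below are arbitrary placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Two loose groups of 2-D points.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
              [2.0, 2.1], [2.2, 1.9], [1.9, 2.2]])

# Supervised learning: labels are provided, and the model learns the mapping.
y = np.array([0, 0, 0, 1, 1, 1])
classifier = LogisticRegression().fit(X, y)
print("supervised prediction:", classifier.predict([[0.25, 0.2], [2.1, 2.0]]))

# Unsupervised learning: no labels; the algorithm infers structure itself.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("unsupervised cluster assignments:", clusterer.labels_)
```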
Comparing Rules-Based and Deep Learning NLP Approaches

Natural language processing (NLP) systems generally fall into two broad categories: rules-based and deep learning-based. Rules-based systems rely on expert-defined heuristics and pattern matching, offering transparency and interpretability; however, they tend to be brittle and to scale poorly across biomedical domains. In contrast, deep learning models, including transformers like BioBERT and libraries like scispaCy, automatically learn contextual relationships from large biomedical corpora. These models serve as powerful biomedical text mining tools, offering greater flexibility and accuracy in processing the complex, ambiguous language found in clinical narratives, scientific publications, and electronic health records (EHRs). Many life sciences applications now favor hybrid pipelines that combine the precision of rule-based systems with the adaptability of deep learning, balancing interpretability and performance in production settings.

Conclusion

This overview outlines the key technological acronyms shaping today's discussions around AI-driven drug discovery. You can also explore how AI/ML technologies are advancing intelligent bioinformatics and autonomous drug discovery, and the importance and challenges of NLP in biomedical research. Curious about NLP? Dive deeper into our article for further exploration.