At the beginning of 2023, ChatGPT reached the significant milestone of 100 million users. Generative AI defined the year, with prominent large language models such as GPT-4 captivating the world with their remarkable mastery of natural language. Interestingly, OpenAI's latest upgrade to ChatGPT introduced powerful multimodal capabilities, enabling the model to handle inputs beyond text and to process images, audio, and video. This showcases the future potential of generative AI for hyper-personalization and diverse applications. What if these models progress to the point of mastering the language of life? Imagine protein-level LLMs learning the "semantics" and "grammar" of proteins, not just as static structures but as dynamic multimodal entities, enabling us to unravel the intricacies of their functions and behaviors at a level of detail previously unimaginable.

The Need for Multi-Modality in Protein Engineering Workflows

In protein engineering workflows, too, multi-modal models that integrate multiple sources of data should be introduced. Going beyond exclusively sequence-based data could help solve a vast array of known problems such as protein classification, mutational effect prediction, and structure prediction. In antibody discovery, an interesting problem is shaped by functional clonotyping, i.e., the grouping of antibody clonal groups that target the same antigen and epitope. Typically, the heavy-chain CDR3 (HCDR3) is used as a unique identifier, and clustering is therefore frequently performed by requiring a high percentage of HCDR3 sequence similarity and identical V-J assignments (sketched in the code below). However, it has been shown that many different HCDR3s can be identified within a target-specific antibody population [1]. Moreover, the same HCDR3 can be generated by many different rearrangements, and specific target binding is the outcome of unique rearrangements and VL pairing: "the HCDR3 is necessary, albeit insufficient for specific antibody binding" [1]. In addition, it has been demonstrated that antibodies within the same cluster, targeting the same epitope, encompass highly divergent HCDR sequences [2]. This underscores the necessity of incorporating additional "layers" of information in pursuit of the clustering objective. For instance, SPACE2 excels at clustering antibodies that bind shared epitopes, and these clusters, characterized by functional coherence and structural similarity, embrace diversity in terms of sequence, genetic lineage, and species origin [3]. Nevertheless, the most significant advances may reside in the transformative capacities of LLMs, not only because of their substantial scaling advantages but also because of the extensive array of possibilities they present.
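To make that baseline concrete, here is a minimal sketch of the conventional clonotyping heuristic described above: antibodies are bucketed by identical V and J assignments and HCDR3 length, then clustered at an illustrative 80% HCDR3 identity threshold. The data, threshold, and field names are hypothetical, not a prescribed pipeline.

```python
# Minimal sketch of sequence-only clonotyping: group by V/J genes and HCDR3
# length, then single-linkage cluster on HCDR3 identity. Data are toy examples.
from collections import defaultdict

def hamming_identity(a: str, b: str) -> float:
    """Fraction of identical positions for two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def clonotype(antibodies, threshold=0.8):
    """antibodies: list of dicts with 'id', 'v', 'j', and 'cdr3' keys."""
    # Bucket by V gene, J gene, and HCDR3 length (the usual prerequisites).
    buckets = defaultdict(list)
    for ab in antibodies:
        buckets[(ab["v"], ab["j"], len(ab["cdr3"]))].append(ab)
    clusters = []
    for members in buckets.values():
        # Greedy single-linkage clustering on HCDR3 identity within a bucket.
        groups = []
        for ab in members:
            for g in groups:
                if any(hamming_identity(ab["cdr3"], m["cdr3"]) >= threshold
                       for m in g):
                    g.append(ab)
                    break
            else:
                groups.append([ab])
        clusters.extend(groups)
    return clusters

abs_ = [
    {"id": "mAb1", "v": "IGHV3-23", "j": "IGHJ4", "cdr3": "ARDYYGSGSYFDY"},
    {"id": "mAb2", "v": "IGHV3-23", "j": "IGHJ4", "cdr3": "ARDYYGSGSYFDV"},
    {"id": "mAb3", "v": "IGHV1-69", "j": "IGHJ6", "cdr3": "ARGGLLRFLEWLLYYGMDV"},
]
for cluster in clonotype(abs_):
    print([ab["id"] for ab in cluster])
```

As references [1] and [2] indicate, a sequence-only heuristic like this can both split and merge functional groups incorrectly, which is precisely the motivation for richer, multi-modal representations.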
LLMs Grasping the Language of Life

While natural-language large language models (LLMs) excel at grasping context, protein language models (PLMs) are advancing our understanding of the meanings, contexts, and intricate relationships between the fundamental building blocks of proteins: amino acids. Much like the word "apple" assumes different meanings based on context, different amino acid patterns may carry different nuances within protein sequences. The process begins with the tokenization of protein sequence data, transforming proteins into linear strings of amino acids. Some amino acids may "impact" other, more distant amino acids in such a way that a different function is revealed (semantics). Again, compare this to two phrases: "apple, pear and banana" versus "I bought an Apple phone"; the semantics change with context.

To unravel the workings of the models behind LLMs, the so-called transformer models, attention layers yield valuable information. Which contextual information is important to classify "apple" as a fruit or a tech company? Now ask a similar question for classifying proteins: which context residues or residue patterns influence another residue or pattern to take part in a different function? Does the model learn residue-residue interactions (reflected in attention weights) that overlap with structural interactions? By overlaying protein-domain knowledge on the model's learnt embedding representations, we can uncover underlying protein intricacies. Moreover, we believe that using these lower-layer embeddings as predictive features, instead of or on top of the final-layer embeddings, could make the model more understandable and transparent. This fits squarely within the idea of strategically combining multi-modal data.

The potential for improving predictive performance, e.g., for the functional clonotyping of antibodies, lies in the strategic concatenation of embeddings from different layers across various protein language models. Indeed, PLMs are trained for different purposes: AbLang [4], for example, is trained to restore missing amino acids in antibody sequences, while AntiBERTy [5] is an antibody-specific language model whose learnt representations have been linked to paratope-binding residues. Each model's embeddings could therefore encompass distinct, perhaps non-overlapping, angles of protein-relevant information, whether structural, functional, physicochemical, immunogenicity-related, or some combination thereof.

Delving deeper into functional clonotyping, where epitope binning gains importance, relying solely on antigen-agnostic models may prove insufficient. Our curiosity lies in understanding how residues on the paratope interact with those on the epitope, a two-fold perspective that has been addressed through cross-modal attention. This method, akin to a graph attention network applied to a bipartite antibody-antigen graph, emerges as a compelling approach for modelling multimodality in antibody-antigen interactions and, more broadly, in protein-protein interactions [6]. In general, we should build comprehensive representations that go beyond individual layers to open up new avenues for understanding protein language.
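As a concrete illustration of the multi-layer idea, here is a minimal sketch of extracting and concatenating per-residue embeddings from several layers of a pretrained PLM. The small public ESM-2 checkpoint on Hugging Face serves as a stand-in, and the sequence is a toy fragment; any PLM that exposes hidden states (AbLang, AntiBERTy, etc.) could be slotted in analogously.

```python
# Minimal sketch: per-residue embeddings from several layers of a PLM,
# concatenated into one feature matrix, plus the attention maps that can be
# overlaid with structural contact maps.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "facebook/esm2_t6_8M_UR50D"  # small public ESM-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sequence = "QVQLVQSGAEVKKPGASVKVSCKAS"  # toy antibody VH fragment

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True, output_attentions=True)

# hidden_states is a tuple of (num_layers + 1) tensors of shape
# (1, seq_len, dim). Concatenate an early, a middle, and the final layer.
layers = [out.hidden_states[i] for i in (1, 3, -1)]
per_residue = torch.cat(layers, dim=-1).squeeze(0)  # (seq_len, 3 * dim)
print(per_residue.shape)

# Attention maps (per layer: 1 x heads x seq x seq) can be compared with
# structural contacts to ask which residue-residue dependencies were learnt.
print(out.attentions[0].shape)
```

The same pattern extends naturally to stacking embeddings from different models, not just different layers, before feeding them to a downstream clustering or classification step.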
Protein Words to Capture Semantics

Language models for natural language learn how words are used in context: words with similar contexts have similar meanings. This allows a model to derive meaning from distributional patterns alone. In natural language, symbols like spaces and punctuation help identify meaningful words, making explicit linguistic knowledge less necessary. Applying this idea to proteins is less straightforward, however, because there is no clear definition of meaningful protein units, i.e., "protein words." We need a more analytical, expertise-driven approach to identify meaningful parts in protein sequences. This is where BioStrand's HYFT technology comes into play. Amino acid patterns offer a more refined approach to embeddings than full-sequence embeddings, analogous to the way semantic embeddings capture "logical" word groups or phrases to improve understanding in textual language.

While full-sequence embeddings encapsulate the entire protein sequence in a holistic manner, amino acid patterns focus on specific meaningful blocks within the sequence. BioStrand's proprietary HYFTs, which serve as protein building blocks with well-defined boundaries, enhance robustness to sequence variability by emphasizing critical regions and downplaying non-critical or less relevant areas of the full protein sequence. Moreover, HYFTs serve as a central, unifying connector element, laying the foundation for a holistic data management system. These connector elements can traverse omics databases and external datasets such as IEDB, serving as starting points for NLP searches. In this way, a bridge is established between genetic information and the relevant literature.

LENSai as a Holistic Integrator

Taking all this information together, an integrated data management system becomes necessary to build generalized foundation models for biology, rather than siloing each step independently. This integration extends beyond protein sequence, structure, and function data to encompass flat metadata, vector embedding data, and textual enrichment data extracted from the literature. The antibody discovery process thus undergoes a transformative shift, becoming a more informed journey in which the flow of information is rooted in genetic building blocks. At each step, a comprehensive understanding is cultivated by synthesizing insights from the amalgamation of genetic, textual, and structural dimensions, including diverse embeddings from different layers of LLMs that capture different sources of information. This is where LENSai comes into play. By leveraging a vast knowledge graph interconnecting syntax (multi-modal sequence and structure data) and semantics (biological function), combined with insights captured at the residue, region, or HYFT level and harnessed by the power of LLM embeddings, LENSai paves the way to improving drug-discovery-relevant tasks such as functional clustering, developability prediction, and immunogenicity risk prediction. LENSai's advanced capabilities empower researchers to explore innovative protein structures and functionalities, unlocking new opportunities in antibody design and engineering.

Sources
[1] https://www.frontiersin.org/articles/10.3389/fimmu.2018.00395/full
[2] https://www.nature.com/articles/s41598-023-45538-w#sec10
[3] https://www.frontiersin.org/articles/10.3389/fmolb.2023.1237621/full
[4] https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac046/6609807
[5] https://arxiv.org/abs/2112.07782
[6] https://arxiv.org/abs/1806.04398
Knowledge graphs play a crucial role in the organization, integration, and interpretation of vast volumes of heterogeneous life sciences data. They are key to the effective integration of disparate data sources, helping map the semantic or functional relationships between millions of data points. They enable information from diverse datasets to be mapped to a common ontology, creating a unified, comprehensive, and interconnected view of complex biological data that supports a more contextual approach to exploration and interpretation. Though ontologies and knowledge graphs are both concerned with the contextual organization and representation of knowledge, their approach and purpose can vary. So here is a closer look at these concepts, their similarities, individual strengths, and synergies.

What Is an Ontology?

An ontology is a "formal, explicit specification of a shared conceptualization" that helps define, capture, and standardize information within a particular knowledge domain. The three critical requirements in this definition can be codified as follows. 'Shared conceptualization' emphasizes the importance of a consensual definition (shared) of domain concepts and their interrelationships (conceptualization) among the users of a specific knowledge domain. 'Explicit' requires the unambiguous characterization and representation of domain concepts to create a common understanding. And finally, 'formal' refers to the capability of the specified conceptualization to be machine-interpretable and to support algorithmic reasoning.

What Is a Knowledge Graph?

A knowledge graph, also known as a semantic network, is a graphical representation of the foundational entities in a domain, connected by semantic, contextual relationships. A knowledge graph uses formal semantics to interlink descriptions of different concepts, entities, and relationships, enabling efficient data processing by both people and machines. Knowledge graphs, therefore, are a type of graph database with an embedded semantic model that unifies all domain data into one knowledge base. Semantics is thus an essential capability for any knowledge base to qualify as a knowledge graph. Though an ontology is often used to define the formal semantics of a knowledge domain, the terms 'semantic knowledge graph' and 'ontology' refer to different aspects of organizing and representing knowledge.

What's the Difference Between an Ontology and a Semantic Knowledge Graph?

In broad terms, the key difference is that semantics focuses predominantly on the interpretation and understanding of data relationships within a knowledge graph, whereas an ontology is a formal definition of the vocabulary and structure unique to the knowledge domain. Both play a distinct and critical role in defining the utility and performance of a knowledge graph. An ontology provides the structured framework, formal definitions, and common vocabulary required to organize domain-specific knowledge in a way that creates a shared understanding. Semantics focuses on the meaning, context, interrelationships, and interpretation of different pieces of information in a given domain. Ontologies provide a formal representation, using languages like RDF (Resource Description Framework) and OWL (Web Ontology Language) to standardize the annotation, organization, and expression of domain-specific knowledge.
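To ground the terminology, here is a minimal sketch of an ontology fragment expressed in RDF/OWL via the Python rdflib library. The classes and properties (Antibody, Antigen, binds) are illustrative placeholders rather than terms from any standard biomedical ontology.

```python
# Minimal sketch of a "formal, explicit specification": a tiny ontology of
# classes, a property with domain/range constraints, and instance data
# annotated against it.
from rdflib import Graph, Literal, Namespace, RDF, RDFS
from rdflib.namespace import OWL

EX = Namespace("http://example.org/bio#")
g = Graph()
g.bind("ex", EX)

# Ontology layer: classes, a subclass axiom, and an object property.
g.add((EX.Antibody, RDF.type, OWL.Class))
g.add((EX.Antigen, RDF.type, OWL.Class))
g.add((EX.Protein, RDF.type, OWL.Class))
g.add((EX.Antibody, RDFS.subClassOf, EX.Protein))  # every antibody is a protein
g.add((EX.binds, RDF.type, OWL.ObjectProperty))
g.add((EX.binds, RDFS.domain, EX.Antibody))
g.add((EX.binds, RDFS.range, EX.Antigen))

# Instance layer: data typed against the ontology's shared vocabulary.
g.add((EX.mAb1, RDF.type, EX.Antibody))
g.add((EX.HER2, RDF.type, EX.Antigen))
g.add((EX.mAb1, EX.binds, EX.HER2))
g.add((EX.mAb1, RDFS.label, Literal("trastuzumab-like antibody")))

print(g.serialize(format="turtle"))
```

Because the vocabulary is formal and machine-interpretable, any system that shares this ontology can integrate or reason over the instance data without ambiguity.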
A semantic data layer is a more flexible approach to extracting implicit meaning and interrelationships between entities, often relying on a combination of semantic technologies and natural language processing (NLP) / large language model (LLM) frameworks to contextually integrate and organize structured and unstructured data. Semantic layers are often built on top of an ontology to create a more enriched and context-aware representation of knowledge graph entities.

What Are the Key Functions of Ontologies in Knowledge Graphs?

Ontologies are essential to structuring and enhancing the capabilities of knowledge graphs, enabling several key functions related to the organization and interpretability of domain knowledge. The standardized, formal representation provided by ontologies serves as a universal foundation for integrating, mapping, and aligning data from heterogeneous sources into one unified view of knowledge. Ontologies provide the structure, rules, and definitions that enable logical reasoning and inference, and the deduction of new knowledge from existing information. By establishing a shared, standardized vocabulary, ontologies enhance semantic interoperability between different knowledge graphs, databases, and systems, creating a comprehensive and meaningful understanding of a given domain. They also contribute to the semantic layer of knowledge graphs, enabling a richer and deeper understanding of data relationships that drives advanced analytics and decision-making. Ontologies help formalize data validation rules, thereby ensuring consistency and enhancing data quality. Finally, ontologies enhance the search and discovery capabilities of knowledge graphs: a structured, semantically rich knowledge representation enables more flexible and intelligent querying as well as more contextually relevant and accurate results.

The Importance of Ontologies in Biomedical Knowledge Graphs

Knowledge graphs have emerged as a critical tool for addressing the challenges posed by rapidly expanding and increasingly dispersed volumes of heterogeneous, multimodal, and complex biomedical information. Biomedical ontologies are foundational to creating ontology-based biomedical knowledge graphs capable of structuring all existing biological knowledge as a panorama of semantic biomedical data. For example, the Scalable Precision Medicine Open Knowledge Engine (SPOKE), a biomedical knowledge graph connecting millions of concepts across 41 biomedical databases, uses 11 different ontologies as a framework to semantically organize and connect data. This massive knowledge engine integrates a wide variety of information, such as proteins, pathways, molecular functions, and biological processes, and has been used for a range of biomedical applications including drug repurposing, disease prediction, and the interpretation of transcriptomic data. Ontology-based knowledge graphs will also be key to the development of precision medicine, given their capability to standardize and harmonize data resources across different organizational scales, including multi-omics data, molecular functions, intra- and inter-cellular pathways, phenotypes, therapeutics, and environmental effects, into one holistic network. The use of ontologies for the semantic enrichment of biomedical knowledge graphs will also help accelerate the FAIRification of biomedical data and enable researchers to use ontology-based queries to answer more complex questions with greater accuracy and precision (see the sketch below).
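Here is a minimal sketch of what such an ontology-based query looks like in practice: instance data typed against a small vocabulary, queried with SPARQL through its semantic relationships. All entities, relations, and the pathway are hypothetical placeholders.

```python
# Minimal sketch of an ontology-based query with rdflib and SPARQL:
# "Which drugs target a protein that participates in MAPK signalling?"
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/bio#")
g = Graph()
g.bind("ex", EX)

triples = [
    (EX.DrugA, RDF.type, EX.Drug),
    (EX.DrugA, EX.targets, EX.EGFR),
    (EX.EGFR, EX.participatesIn, EX.MAPKsignalling),
    (EX.DrugB, RDF.type, EX.Drug),
    (EX.DrugB, EX.targets, EX.BRAF),
    (EX.BRAF, EX.participatesIn, EX.MAPKsignalling),
]
for t in triples:
    g.add(t)

query = """
PREFIX ex: <http://example.org/bio#>
SELECT ?drug ?protein WHERE {
    ?drug a ex:Drug ;
          ex:targets ?protein .
    ?protein ex:participatesIn ex:MAPKsignalling .
}
"""
# The query traverses semantic relationships rather than matching keywords.
for row in g.query(query):
    print(row.drug, row.protein)
```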
However, there are still several challenges to the more widespread use of ontologies in biomedical research. Biomedical ontologies will play an increasingly strategic role in the representation and standardization of biomedical knowledge, but given their rapid proliferation, the emphasis going forward will have to be on developing biomedical ontologies that adhere to mathematically precise, shared standards and good-practice design principles, ensuring that they are more interoperable, exchangeable, and examinable.
There is a compelling case underlying the tremendous interest in generative AI and LLMs as the next big technological inflection point in computational drug discovery and development. For starters, LLMs help expand the data universe of in-silico drug discovery, especially by opening up access to the huge volumes of valuable information locked away in unstructured textual data sources, including scientific literature, public databases, clinical trial notes, and patient records. LLMs provide the much-needed capability to analyze this information, identify patterns and connections, and extract novel insights about disease mechanisms and potential therapeutic targets. Their ability to interpret complex scientific concepts and elucidate connections between diseases, genes, and biological processes can help accelerate disease hypothesis generation and the identification of potential drug targets and biomarkers. When integrated with biomedical knowledge graphs, LLMs help create a unique synergistic model that enables bidirectional data- and knowledge-based reasoning: the explicit structured knowledge of knowledge graphs enhances the knowledge of LLMs, while the power of language models streamlines graph construction and conversational interactions with complex knowledge bases. However, several challenges must be addressed before LLMs can be reliably integrated into in-silico drug discovery pipelines and workflows. One of these is hallucination.

Why Do LLMs Hallucinate?

At a time of some speculation about laziness and seasonal depression in LLMs, a hallucination leaderboard of 11 public LLMs revealed hallucination rates ranging from 3% at the top end to 27% at the bottom of the barrel. Another comparative study of two versions of a popular LLM generating ophthalmic scientific abstracts revealed very high rates (33% and 29%) of fake references. This tendency of LLMs to hallucinate, that is, to present incorrect or unverifiable information as accurate, can have serious consequences in critical drug discovery applications, even at 3%. There are several reasons for LLM hallucinations. At the core of this behavior is the fact that generative AI models have no actual intelligence; they rely instead on a probability-based approach, predicting the data most likely to occur based on patterns and contexts 'learned' from their training data. Apart from this inherent lack of contextual understanding, other potential causes include exposure to noise, errors, biases, and inconsistencies in training data, the training and generation methods themselves, and even prompting techniques. For some, hallucination is all LLMs do; others see it as inevitable in any prompt-based large language model. In the context of life sciences research, however, mitigating LLM hallucinations remains one of the biggest obstacles to the large-scale, strategic integration of this potentially transformative technology.

How to Mitigate LLM Hallucinations

There are three broad and complementary approaches to mitigating hallucinations in large language models: prompt engineering, fine-tuning, and grounding plus prompt augmentation.

Prompt Engineering

Prompt engineering is the process of strategically designing user inputs, or prompts, to guide model behavior and obtain optimal responses. There are three major approaches to prompt engineering: zero-shot, few-shot, and chain-of-thought prompting.
In zero-shot prompting, language models are given inputs that were not part of their training data yet are still able to generate reliable results. Few-shot prompting involves providing examples to the LLM before presenting the actual query. Chain-of-thought (CoT) prompting is based on the finding that a series of intermediate reasoning steps, provided as examples in the prompt, can significantly improve the reasoning capabilities of large language models. The chain-of-thought concept has since been extended with techniques such as chain-of-verification (CoVe), a self-verification process that enables LLMs to check the accuracy and reliability of their output, and chain-of-density (CoD), a process that focuses on summarization rather than reasoning to control the density of information in the generated text. Prompt engineering, however, has its own limitations, including prompt constraints that may cramp the ability to query complex domains and the lack of objective metrics to quantify prompt effectiveness.

Fine-Tuning

Where the focus of prompt engineering is on the skill required to elicit better LLM output, fine-tuning emphasizes task-specific training to enhance the performance of pre-trained models in specific topics or domain areas. The conventional approach is full fine-tuning, which involves the additional training of pre-trained models on labeled, domain- or task-specific data in order to generate more contextually relevant responses; this is a time-, resource-, and expertise-intensive process. An alternative is parameter-efficient fine-tuning (PEFT), conducted on a small set of extra parameters without adjusting the entire model. The modular nature of PEFT means that training can prioritize select portions or components of the original parameters, so that the pre-trained model can be adapted for multiple tasks. LoRA (low-rank adaptation of large language models), a popular PEFT technique, can significantly reduce the resource intensity of fine-tuning while matching the performance of full fine-tuning (see the sketch below). There are, however, challenges to fine-tuning, including domain shift issues, the potential for bias amplification and catastrophic forgetting, and the complexity of choosing the right hyperparameters to ensure optimal performance.
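As an illustration, here is a minimal LoRA sketch using the Hugging Face peft library. The base checkpoint (gpt2) is a stand-in, and the target module names depend on the model architecture; a domain-specific biomedical checkpoint would be substituted in practice.

```python
# Minimal LoRA sketch: attach small low-rank adapter matrices to the attention
# projections of a frozen base model, so only the adapters are trained.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)

# The base weights stay frozen; only a small fraction of parameters train.
model.print_trainable_parameters()
```

The printed summary makes the efficiency argument concrete: the trainable adapter parameters are typically well under 1% of the full model.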
Grounding & Augmentation

LLM hallucinations are often the result of language models attempting to generate knowledge from information they have never explicitly memorized or seen. The logical solution, therefore, is to give LLMs access to a curated knowledge base of high-quality contextual information that enables them to generate more accurate responses. Advanced grounding and prompt augmentation techniques can address many of the accuracy and reliability challenges associated with LLM performance. Both techniques rely on external knowledge sources to dynamically generate context. Grounding ensures that LLMs have access to up-to-date, use-case-specific information sources that provide relevant context which may not be available in the training data alone. Similarly, prompt augmentation enhances a prompt with contextually relevant information that enables the LLM to generate a more accurate and pertinent output. Factual grounding is typically applied in the pre-training phase to ensure that LLM output across a variety of tasks is consistent with a knowledge base of factual statements. Post-training grounding relies on a range of external knowledge bases, including documents, code repositories, and public and proprietary databases, to improve the accuracy and relevance of LLMs on specific tasks.

Retrieval-augmented generation (RAG) is a distinct framework for the post-training grounding of LLMs based on the most accurate, up-to-date information retrieved from external knowledge bases. The RAG framework enables the optimization of biomedical LLM output along three key dimensions. One, access to targeted external knowledge sources ensures that an LLM's internal representation of information is dynamically refreshed with the most current and contextually relevant data. Two, access to an LLM's information sources means that responses can be validated for relevance and accuracy. And three, there is the emerging potential to extend the RAG framework beyond text to multimodal knowledge retrieval, spanning images, audio, tables, etc., which can further boost the factuality, interpretability, and sophistication of LLMs.

Also read: How retrieval-augmented generation (RAG) can transform drug discovery

Key challenges of retrieval-augmented generation include the high initial cost of implementation compared to standalone generative AI. In the long run, however, the RAG-LLM combination is less expensive than frequently fine-tuning LLMs and provides the most comprehensive approach to mitigating LLM hallucinations. But even with better grounding and retrieval, scientific applications demand another layer of rigor: validation and reproducibility. Here is how teams can build confidence in LLM outputs before trusting them in high-stakes discovery workflows.

How to Validate LLM Outputs in Drug Discovery Pipelines

In scientific settings like drug discovery, ensuring the validity of large language model (LLM) outputs is critical, especially when those outputs may inform downstream experimental decisions. Key validation strategies for assessing LLM-generated content in biomedical pipelines include:

- Compare outputs to curated benchmarks: use structured, peer-reviewed datasets such as DrugBank, ChEMBL, or internal gold standards to benchmark LLM predictions.
- Cross-reference with experimental data: validate AI-generated hypotheses against published experimental results, or integrate with in-house wet-lab data for verification.
- Establish feedback loops from in vitro validations: create iterative pipelines in which lab-tested results refine future model prompts, improving accuracy over time.

Advancing Reproducibility in AI-Augmented Science

For LLM-assisted workflows to be trustworthy and audit-ready, they must be reproducible, particularly when used in regulated environments. Core reproducibility practices (sketched in code below) include:

- Dataset versioning: track changes in source datasets, ensuring that each model run references a consistent data snapshot.
- Prompt logging: store full prompts (including context and input structure) to reproduce specific generations and analyze outputs over time.
- Controlled inference environments: standardize model versions, hyperparameters, and APIs to eliminate variation in inference across different systems.
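Here is a minimal sketch of those reproducibility practices in code: pin the model version, fingerprint the dataset snapshot, and log the full prompt with every generation. The file names, fields, and helper names are illustrative, not a standard.

```python
# Minimal sketch of audit-ready LLM logging: every generation is recorded
# with a pinned model version, a dataset snapshot hash, and the full prompt.
import hashlib
import json
import time

def dataset_fingerprint(path: str) -> str:
    """Content hash so each run can reference an exact data snapshot."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def log_generation(prompt: str, output: str, *, model: str,
                   params: dict, dataset_hash: str,
                   logfile: str = "llm_audit_log.jsonl") -> None:
    """Append one fully reproducible generation record as a JSON line."""
    record = {
        "timestamp": time.time(),
        "model": model,               # pinned model version, not "latest"
        "params": params,             # temperature, max_tokens, ...
        "dataset_sha256": dataset_hash,
        "prompt": prompt,             # full prompt, incl. retrieved context
        "output": output,
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")
```

With records like these, any generation can be re-run against the exact prompt, model version, and dataset snapshot that produced it.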
Integrated Intelligence with LENSai™

Holistic life sciences research requires the sophisticated orchestration of several innovative technologies and frameworks. LENSai Integrated Intelligence, our next-generation data-centric AI platform, fluently blends some of the most advanced proprietary technologies into one seamless solution that empowers end-to-end drug discovery and development. LENSai integrates RAG-enhanced bioLLMs with an ontology-driven NLP framework, combining neuro-symbolic logic techniques to connect and correlate syntax (multi-modal sequence and structure data) and semantics (biological functions). A comprehensive and continuously expanding knowledge graph, mapping a remarkable 25 billion relationships across 660 million data objects, links sequence, structure, function, and literature information from the entire biosphere to provide a comprehensive overview of the relationships between genes, proteins, structures, and biological pathways. Our next-generation, unified, knowledge-driven approach to the integration, exploration, and analysis of heterogeneous biomedical data empowers life sciences researchers with the high-tech capabilities needed to explore novel opportunities in drug discovery and development.
In a recent article on knowledge graphs and large language models (LLMs) in drug discovery, we noted that despite the transformative potential of LLMs in drug discovery, several critical challenges have to be addressed to ensure that these technologies conform to the rigorous standards demanded by life sciences research. Synergizing knowledge graphs with LLMs into one bidirectional data- and knowledge-based reasoning framework addresses several concerns related to hallucinations and lack of interpretability. However, that still leaves the challenge of giving LLMs access to external data sources that address their limitations with respect to factual accuracy and up-to-date knowledge recall. Retrieval-augmented generation (RAG), together with knowledge graphs and LLMs, is the third critical node in the trifecta of techniques required for the robust and reliable integration of language models into drug discovery pipelines.

Why Retrieval-Augmented Generation?

One of the key limitations of general-purpose LLMs is their training data cutoff, which means their responses to queries are typically out of step with the rapidly evolving nature of information. This is a serious drawback, especially in fast-paced domains like life sciences research. Retrieval-augmented generation enables biomedical research pipelines to optimize LLM output by:

- Grounding the language model on external sources of targeted, up-to-date knowledge to constantly refresh the LLM's internal representation of information without having to completely retrain the model. This ensures that responses are based on the most current data and are more contextually relevant.
- Providing access to the model's information sources so that responses can be validated, ensuring that its claims can be checked for relevance and accuracy.

In short, retrieval-augmented generation provides the framework needed to augment the recency, accuracy, and interpretability of LLM-generated information.

How Does Retrieval-Augmented Generation Work?

Retrieval-augmented generation is a natural language processing (NLP) approach that combines elements of both information retrieval and text generation models to enhance performance on knowledge-intensive tasks. The retrieval component aggregates information relevant to a specific query from a predefined set of documents or knowledge sources, which then serves as context for the generation model. Once the information has been retrieved, it is combined with the input context to create an integrated context containing both the original query and the relevant retrieved information. This integrated context is then fed into a generation model to produce an accurate, coherent, and contextually appropriate response based on both pre-trained knowledge and retrieved, query-specific information. The RAG approach gives life sciences research teams more control over the grounding data used by a biomedical LLM by honing it on enterprise- and domain-specific knowledge sources. It also enables the integration of a range of external data sources, such as document repositories, databases, or APIs, that are most relevant to enhancing the model's response to a query.
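Here is a minimal, self-contained sketch of that retrieve-then-generate loop, using TF-IDF retrieval over a toy three-document store. A production system would use dense embeddings, a vector index, and a real LLM call where indicated; the documents and query are illustrative.

```python
# Minimal sketch of the RAG loop: retrieve the most relevant documents for a
# query, then build an integrated context for the generation model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "EGFR inhibitors such as gefitinib are used in non-small cell lung cancer.",
    "Trastuzumab targets the HER2 receptor in breast cancer.",
    "BRAF V600E mutations are common in melanoma and respond to vemurafenib.",
]

def retrieve(query: str, k: int = 2) -> list:
    """Rank the document store against the query and return the top k."""
    vec = TfidfVectorizer().fit(documents + [query])
    doc_matrix = vec.transform(documents)
    query_vec = vec.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]

query = "Which drugs target HER2?"
context = "\n".join(retrieve(query))

# The integrated context: original query plus retrieved evidence.
prompt = (
    "Answer using only the sources below and cite them.\n\n"
    f"Sources:\n{context}\n\nQuestion: {query}"
)
print(prompt)  # this prompt would be sent to the generation model
```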
The Value of RAG in Biomedical Research

Conceptually, the retrieve-and-generate model's ability to deal with dynamic external information sources, minimize hallucinations, and enhance interpretability makes it a natural, complementary fit for augmenting the performance of bioLLMs. To quantify this augmentation, a recent research effort evaluated a retrieval-augmented generative agent on biomedical question-answering against LLMs (GPT-3.5/4), state-of-the-art commercial tools (Elicit, Scite, and Perplexity), and humans (biomedical researchers). The RAG agent, PaperQA, was first evaluated on a standard multiple-choice LLM-evaluation dataset, PubMedQA, with the provided context removed to test the agent's ability to retrieve information. Here, the RAG agent beat GPT-4 by nearly 30 points (86.3% vs. 57.9%). Next, the researchers constructed a more complex and more contemporary dataset (LitQA), based on recent full-text research papers outside the bounds of the LLMs' pre-training data, to compare the integrated abilities of PaperQA, LLMs, and human researchers to retrieve the right information and generate an accurate answer from it. Again, the RAG agent outperformed both pre-trained LLMs and commercial tools, with overall accuracy (69.5%) and precision (87.9%) scores on par with biomedical researchers. More importantly, the RAG model produced zero hallucinated citations, compared with rates of 40-60% for the LLMs. Though only a narrow evaluation of the retrieval-plus-generation approach in biomedical QA, this research demonstrates the significantly enhanced value that RAG plus bioLLMs can deliver compared to purely generative AI. The combined sophistication of retrieval and generation models can be harnessed to enhance the accuracy and efficiency of a range of processes across the drug discovery and development pipeline.

Retrieval-Augmented Generation in Drug Discovery

In the context of drug discovery, RAG can be applied to a range of tasks, from literature reviews to biomolecule design. Generative models have demonstrated potential for de novo molecular design but are still hampered by their inability to integrate multimodal information or provide interpretability. The RAG framework can facilitate the retrieval of multimodal information from a range of sources, such as chemical databases, biological data, clinical trials, and images, that can significantly augment generative molecular design. The same retrieve-plus-generate template applies to a whole range of applications in drug discovery: for example, compound design (retrieve compounds and their properties, generate improvements or new properties), drug-target interaction prediction (retrieve known drug-target interactions, generate potential interactions between new compounds and specific targets), and adverse effect prediction (retrieve known adverse effects, generate modifications to eliminate them). The template even applies to several sub-processes and sub-tasks within drug discovery, leveraging a broader swathe of existing knowledge to generate novel, reliable, and actionable insights.
In target validation, for example, retrieval-augmented generation can enable a comprehensive generative analysis of a target of interest based on an extensive review of all existing knowledge about it: its expression patterns and functional roles, known binding sites, pertinent biological pathways and networks, potential biomarkers, and more. In short, the more efficient and scalable retrieval of timely information ensures that generative models are grounded in factual, sourceable knowledge, a combination with limitless potential to transform drug discovery.

An Integrated Approach to Retrieval-Augmented Generation

Retrieval-augmented generation addresses several critical limitations of bioLLMs and augments their generative capabilities. However, additional design rules and multiple technological profiles have to come together to successfully address the specific requirements and challenges of life sciences research. Our LENSai™ Integrated Intelligence platform seamlessly unifies the semantic proficiency of knowledge graphs, the versatile information retrieval capabilities of retrieval-augmented generation, and the reasoning capabilities of large language models to reinvent the understand-retrieve-generate cycle in biomedical research. Our unified approach empowers researchers to query a harmonized life science knowledge layer that integrates unstructured information and ontologies into a knowledge graph. A semantics-first approach enables a more accurate understanding of research queries, which in turn results in the retrieval of the content most pertinent to the query. The platform also integrates retrieval-augmented generation with structured biomedical data from our HYFT technology to enhance the accuracy of generated responses. And finally, LENSai combines deep learning LLMs with neuro-symbolic logic techniques to deliver comprehensive and interpretable outcomes. To experience this unified solution in action, please contact us here.
Across several previous blogs, we have explored the importance of knowledge graphs, large language models (LLMs), and semantic analysis in biomedical research. Today, we focus on integrating these distinct concepts into a unified model that can help advance drug discovery and development. But before we get to that, here is a quick synopsis of the knowledge graph, LLM, and semantic analysis narrative so far.

LLMs, Knowledge Graphs & Semantics in Biomedical Research

It has been established that biomedical LLMs, i.e., domain-specific models pre-trained exclusively on domain-specific vocabulary, outperform conventional tools in many biological data-based tasks. It is therefore considered inevitable that these models will quickly expand across the broader biomedical domain. However, several challenges, such as hallucinations and interpretability, have to be addressed before biomedical LLMs can be taken mainstream. A key domain-specific challenge is LLMs' lack of semantic intelligence. LLMs have, debatably, been described as 'stochastic parrots' that comprehend none of the language, relying instead on 'learning' meaning through the large-scale extraction of statistical correlations. This has led to the question of whether modern LLMs really possess any inductive, deductive, or abductive reasoning abilities. Statistically extrapolated meaning may well be adequate for general-language LLM applications; however, the unique complexities and nuances of the biochemical, biomedical, and biological vocabulary require a more semantic approach to converting words and sentences into meaning, and ultimately knowledge. Biomedical knowledge graphs address this key capability gap by going beyond statistical correlations to bring the power of context to biomedical language models. Knowledge graphs capture the inherent graph structure of biomedical data, such as drug-disease and protein-protein interactions, and model the complex relationships between disparate data elements in one unified structure that is both human-readable and computationally accessible. They accomplish this by emphasizing the definitions of, and the semantic relationships between, different entities, using domain-specific ontologies that formally define concepts and relations to enrich and interlink data based on context. A combination of semantic knowledge graphs and biomedical LLMs will therefore be most effective for life sciences applications.

Semantic Knowledge Graphs and LLMs in Drug Discovery

There are three general frameworks for unifying the power of LLMs and knowledge graphs. The first, knowledge graph-enhanced LLMs, focuses on using the explicit, structured knowledge of knowledge graphs to enhance the knowledge of LLMs at different stages, including pre-training, inference, and interpretability. This approach offers three distinct advantages: it improves the knowledge expression of LLMs, provides LLMs with continuous access to the most up-to-date knowledge, and affords more transparency into the reasoning process of black-box language models. Structured data from knowledge graphs, covering genes, proteins, diseases, pathways, and chemical compounds, combined with unstructured data from scientific literature, clinical trial reports, and patents, can help augment drug discovery by providing a more holistic domain view.
The second, LLM-augmented knowledge graphs, leverages the power of language models to streamline graph construction, enhance knowledge graph tasks such as graph-to-text generation and question answering, and augment the reasoning capabilities and performance of knowledge graph applications. LLM-augmented knowledge graphs combine the natural-language capabilities of LLMs with the rich semantic relationships represented in knowledge graphs to empower pharmaceutical researchers with faster and more precise answers to complex questions and to extract insights based on patterns and correlations. LLMs can also enhance the utility of knowledge graphs in drug discovery by continually extracting knowledge and enriching pharmaceutical knowledge graphs. The third approach is to create a synergistic biomedical LLM plus biomedical knowledge graph (BKG) model that enables bidirectional data- and knowledge-based reasoning. Currently, the process of combining generative and reasoning capabilities into one symbiotic model is focused on specific tasks, but it is poised to expand to diverse downstream applications in the near future.

Even as research continues to focus on the symbiotic possibilities of a unified knowledge graph-LLM framework, these concepts are already having a transformative impact on several drug discovery and development processes. Take target identification, a critical step in drug discovery with consequential implications for downstream development processes. AI-powered language models have been shown to outperform state-of-the-art approaches in key tasks such as biomedical named entity recognition (BioNER) and biomedical relation extraction. Transformer-based LLMs are being used in chemoinformatics to advance drug-target relationship prediction and to generate novel, valid, and unique molecules. LLMs are also evolving beyond basic text-to-text frameworks to multi-modal large language models (MLLMs) that bring the combined power of image-plus-text adaptive learning to target identification and validation. Meanwhile, the semantic capabilities of knowledge graphs enhance the efficiency of target identification by enabling the harmonization and enrichment of heterogeneous data into one connected framework for more holistic exploration and analysis. AI-enabled LLMs are increasingly being used across the drug discovery and development pipeline to predict drug-target interactions (DTIs) and drug-drug interactions, molecular properties such as pharmacodynamics, pharmacokinetics, and toxicity, and even likely drug withdrawals from the market due to safety concerns. In the drug discovery domain, biomedical knowledge graphs are being used across a range of tasks including polypharmacy prediction, DTI prediction, adverse drug reaction (ADR) prediction, gene-disease prioritization, and drug repurposing. The next significant point of inflection will be the integration of these powerful technologies into one synergized model to drive a stepped increase in performance and efficiency.

Optimizing LLMs for Biomedical Research

Three key challenges (knowledge cutoff, hallucinations, and interpretability) must be addressed before LLMs can be reliably integrated into biomedical research. There are currently two complementary approaches to mitigating these challenges and optimizing biomedical LLM performance.
The first approach is to leverage the structured, factual, domain-specific knowledge contained in biomedical knowledge graphs to enhance the factual accuracy, consistency, and transparency of LLMs. Using graph-based query languages, the pre-structured data embedded in knowledge graph frameworks can be directly queried and integrated into LLM prompts (see the sketch at the end of this section). Another key capability for biomedical LLMs is to retrieve information from external sources, on a per-query basis, in order to generate the most up-to-date and contextually relevant responses. There are two broad reasons why this is critical in biomedical research: first, it ensures that the LLM's internal knowledge is supplemented with the most current and reliable information from domain-specific, high-quality, updateable knowledge sources; second, access to the data sources means that responses can be checked for accuracy and provenance. The retrieval-augmented generation (RAG) approach combines the power of LLMs with external knowledge retrieval mechanisms to enhance the reasoning, accuracy, and knowledge recall of biomedical LLMs. Combining the knowledge graph- and RAG-based approaches will lead to significant improvements in LLM performance in terms of factual accuracy, context-awareness, and continuous knowledge enrichment.

What Is Retrieval-Augmented Generation (RAG) in Drug Discovery?

Retrieval-augmented generation (RAG) is an approach that combines large language models with access to internal and external trusted data sources. In the context of drug discovery, it helps generate scientifically grounded responses by drawing on biomedical datasets or proprietary silos. When integrated with a knowledge graph, RAG can support context-aware candidate suggestions, summarize literature, or even generate hypotheses based on experimental inputs. This is especially useful in fragmented biomedical data landscapes, where RAG helps surface meaningful cross-modal relationships across omics layers, pathways, phenotypes, and more.

What's the Difference Between LLMs and PLMs in Drug Discovery?

Large language models (LLMs) are general-purpose models trained on vast textual corpora, capable of understanding and generating human-like language. Protein language models (PLMs), on the other hand, are trained on biological sequences, such as amino acid sequences, to capture structural and functional insights. While LLMs can assist in literature mining or clinical trial design, PLMs power structure prediction, function annotation, and rational protein engineering. Combining both enables cross-modal reasoning for smarter discovery.

LENSai: The Next-Generation RAG-KG-LLM Platform

These components (LLMs, PLMs, knowledge graphs, and RAG) are increasingly being combined into unified frameworks for smarter drug discovery. Imagine a system where a protein structure predicted by a PLM is linked to pathway insights from a biomedical knowledge graph, and an LLM then interprets these connections to suggest possible disease associations or therapeutic hypotheses, supported by citations retrieved via RAG. This kind of multi-layered integration mirrors how expert scientists reason, helping teams surface and prioritize meaningful leads much faster than traditional workflows.
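As promised above, here is a minimal sketch of knowledge-graph-grounded prompting: facts are pulled from a toy, in-memory triple store and injected into the LLM prompt so that the model reasons over explicit, traceable relationships. The entities and relations are illustrative; a real system would query a graph database with SPARQL or Cypher instead.

```python
# Minimal sketch of KG-grounded prompting: retrieve an entity's facts from a
# toy triple store and embed them in the prompt as traceable context.
TRIPLES = [
    ("TP53", "associated_with", "Li-Fraumeni syndrome"),
    ("TP53", "regulates", "cell cycle arrest"),
    ("MDM2", "inhibits", "TP53"),
    ("Nutlin-3", "inhibits", "MDM2"),
]

def neighbourhood(entity: str) -> list:
    """All facts in which the entity appears as subject or object."""
    return [f"{s} {p} {o}" for s, p, o in TRIPLES if entity in (s, o)]

def grounded_prompt(question: str, entity: str) -> str:
    """Build a prompt whose answer must be grounded in the graph facts."""
    facts = "\n".join(f"- {f}" for f in neighbourhood(entity))
    return (
        "Use only the knowledge-graph facts below to answer, and say so "
        "if they are insufficient.\n\n"
        f"Facts:\n{facts}\n\nQuestion: {question}"
    )

print(grounded_prompt("How might MDM2 inhibition affect TP53 function?", "TP53"))
```

Because every fact in the prompt traces back to a graph edge, the model's claims inherit the provenance of the knowledge graph rather than relying on memorized training data.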
At BioStrand, we have successfully actualized a next-generation unified knowledge graph-large language model framework for holistic life sciences research. At the core of our LENSai platform is a comprehensive and continuously expanding knowledge graph that maps 25 billion relationships across 660 million data objects, linking sequence, structure, function, and literature information from the entire biosphere. Our first-in-class technology provides a holistic understanding of the relationships between genes, proteins, and biological pathways, thereby opening up powerful new opportunities for drug discovery and development. The platform leverages the latest advances in ontology-driven NLP and AI-driven LLMs to connect and correlate syntax (multi-modal sequence and structure data) and semantics (functions). Our unified approach to biomedical knowledge graphs, retrieval-augmented generation models, and large language models combines the reasoning capabilities of LLMs, the semantic proficiency of knowledge graphs, and the versatile information retrieval capabilities of RAG to streamline the integration, exploration, and analysis of all biomedical data.
What are the limitations of large language models (LLMs) in biological research? ChatGPT responds to this query with quite a comprehensive list that includes a lack of domain-specific knowledge, contextual understanding, access to up-to-date information, and interpretability and explainability. Nevertheless, it has to be acknowledged that LLMs can have a transformative impact on biological and biomedical research. After all, these models have already been applied successfully to biological sequence-based tasks like protein structure prediction and could possibly be extended to the broader language of biochemistry. Specialized LLMs like chemical language models (CLMs) have the potential to outperform conventional drug discovery processes for traditional small-molecule drugs as well as antibodies. More broadly, there is a huge opportunity to use large-scale pre-trained language models to extract value from vast volumes of unannotated biomedical data. Pre-training, of course, will be key to the development of biological domain-specific LLMs. Research shows that domains such as biomedicine, with large volumes of unlabeled text, benefit most from domain-specific pre-training, as opposed to starting from general-domain language models. Biomedical language models pre-trained solely on domain-specific vocabulary cover a much wider range of applications and, more importantly, substantially outperform currently available biomedical NLP tools. However, there is a larger issue of interpretability and explainability when it comes to transformer-based LLMs.

The LLM Black Box

The development of natural language processing (NLP) models has traditionally been rooted in white-box techniques that were inherently interpretable. Since then, however, the evolution has been towards more sophisticated black-box techniques that have undoubtedly enabled state-of-the-art performance but have also obfuscated interpretability. To understand the sheer scale of the interpretability challenge in LLMs, consider OpenAI's "Language models can explain neurons in language models" paper from earlier this year, which opens with the sentence: "Language models have become more capable and more widely deployed, but we do not understand how they work." Millions of neurons need to be analyzed in order to fully understand LLMs, and the paper proposes an approach to automating interpretability so that it can be scaled to all neurons in a language model. The catch, however, is that "neurons may not be explainable." So, even as work continues on interpretable LLMs, the life sciences industry needs a more immediate solution: one that integrates the potential of LLMs while mitigating issues such as interpretability and explainability. And knowledge graphs could be that solution.

Augmenting BioNLP Interpretability with Knowledge Graphs

One criticism of LLMs is that the predictions they generate, based on 'statistically likely continuations of word sequences,' fail to capture the relational functionings that are central to scientific knowledge creation. These relational functionings, as it were, are critical to effective life sciences research. Biomedical data is derived from different levels of biological organization, with disparate technologies and modalities, and scattered across multiple non-standardized data repositories.
Researchers need to connect all these dots, across diverse data types, formats, and sources, and understand the relationships and dynamics between them in order to derive meaningful insights. Knowledge graphs (KGs) have become a critical component of life sciences' technology infrastructure because they help map the semantic or functional relationships between millions of data points. They use NLP to create a semantic network that visualises all objects in the system in terms of the relationships between them. Semantic data integration, based on ontology matching, helps organize and link disparate structured and unstructured information into a unified, human-readable, computationally accessible, and traceable knowledge graph that can then be queried for novel relationships and deeper insights.

Unifying LLMs and KGs

Combining these distinct ontology-driven and natural language-driven systems creates a synergistic technique that enhances the advantages of each while addressing the limitations of both. KGs can provide LLMs with the traceable factual knowledge required to address interpretability concerns. One roadmap for the unification of LLMs and KGs proposes three frameworks:

- KG-enhanced LLMs, where the structured, traceable knowledge from KGs enhances the knowledge awareness and interpretability of LLMs. Incorporating KGs in the pre-training stage helps with the transfer of knowledge, whereas in the inference stage it enhances LLM performance in accessing domain-specific knowledge.
- LLM-augmented KGs, where LLMs are used in two contexts: to process the original corpus and extract the relations and entities that inform KG construction (see the sketch after this list), and to process the textual corpus in the KG to enrich its representations.
- Synergized LLMs + KGs, where both systems are unified into one general framework containing four layers: one, a data layer that processes textual and structural data and can be expanded to incorporate multi-modal data such as video, audio, and images; two, a synergized model layer, where the features of both systems are combined to enhance capabilities and performance; three, a technique layer to integrate related LLM and KG techniques into the framework; and four, an application layer for addressing different real-world applications.

The KG-LLM Advantage

A unified KG-LLM approach to bioNLP provides an immediate solution to the black-box concerns that impede large-scale deployment in the life sciences. Combining domain-specific KGs, ontologies, and dictionaries can significantly enhance LLM performance in terms of semantic understanding and interpretability. At the same time, LLMs can help enrich KGs with real-world data from EHRs, scientific publications, and other sources, thereby expanding the scope and scale of semantic networks and enhancing biomedical research.
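Here is a minimal sketch of the LLM-augmented KG construction idea referenced above: an LLM is prompted to emit (subject, relation, object) triples from free text, which are then added to a graph. The llm() function is a placeholder for any chat-completion API, and the parsing assumes the model follows the requested JSON format.

```python
# Minimal sketch of LLM-driven triple extraction for KG enrichment. The LLM
# call is stubbed with a canned response; swap in a real API call in practice.
import json

EXTRACTION_PROMPT = """Extract biomedical (subject, relation, object) triples
from the text below. Respond with a JSON list of 3-element lists only.

Text: {text}"""

def llm(prompt: str) -> str:
    """Placeholder: replace with a real chat-completion call."""
    return '[["imatinib", "inhibits", "BCR-ABL"], ["BCR-ABL", "drives", "CML"]]'

def extract_triples(text: str) -> list:
    """Ask the LLM for triples and parse its JSON response."""
    raw = llm(EXTRACTION_PROMPT.format(text=text))
    return [tuple(t) for t in json.loads(raw)]

graph = set()
passage = ("Imatinib inhibits the BCR-ABL fusion kinase that drives "
           "chronic myeloid leukaemia (CML).")
graph.update(extract_triples(passage))
print(graph)
```

In a production pipeline, the extracted triples would be validated against the domain ontology before being merged into the knowledge graph, closing the loop between language models and semantic networks.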
At MindWalk, we have already created a comprehensive knowledge graph that integrates over 660 million objects, linked by more than 25 billion relationships, drawn from the biosphere and from other data sources such as scientific literature. Our LENSai platform, powered by HYFT technology, leverages the latest advancements in LLMs to bridge the gap between syntax (multi-modal sequence and structure data) and semantics (functions). By integrating retrieval-augmented generation (RAG) models, we have been able to harness the reasoning capabilities of LLMs while simultaneously addressing several associated limitations such as knowledge cutoff, hallucinations, and lack of interpretability. Compared to closed-loop language modelling, this enhanced approach yields multiple benefits, including clear provenance and attribution, and up-to-date contextual reference as our knowledge base updates and expands. If you would like to integrate the power of a unified KG-LLM framework into your research, please drop us a line here.