The Blog
MindWalk is a biointelligence company uniting AI, multi-omics data, and advanced lab research into a customizable ecosystem for biologics discovery and development.
Knowledge graphs play a crucial role in the organization, integration, and interpretation of vast volumes of heterogeneous life sciences data. They are key to the effective integration of disparate data sources, helping map the semantic and functional relationships between millions of data points. By mapping information from diverse datasets to a common ontology, they create a unified, comprehensive, and interconnected view of complex biological data that enables a more contextual approach to exploration and interpretation.

Though ontologies and knowledge graphs are both concerned with the contextual organization and representation of knowledge, they differ in approach and purpose. So here's a closer look at these concepts, their similarities, individual strengths, and synergies.

What is an ontology?

An ontology is a "formal, explicit specification of a shared conceptualization" that helps define, capture, and standardize information within a particular knowledge domain. The three critical requirements of an ontology can be codified as follows: 'shared conceptualization' emphasizes the importance of a consensual definition (shared) of domain concepts and their interrelationships (conceptualization) among users of a specific knowledge domain; 'explicit' requires the unambiguous characterization and representation of domain concepts to create a common understanding; and 'formal' refers to the capability of the specified conceptualization to be machine-interpretable and to support algorithmic reasoning.

What is a knowledge graph?

A knowledge graph, also known as a semantic network, is a graphical representation of the foundational entities in a domain, connected by semantic, contextual relationships. The knowledge model uses formal semantics to interlink descriptions of different concepts, entities, and relationships, enabling efficient data processing by both people and machines.
Knowledge graphs, therefore, are a type of graph database with an embedded semantic model that unifies all domain data into one knowledge base. Semantics is thus an essential capability for any knowledge base to qualify as a knowledge graph. Though an ontology is often used to define the formal semantics of a knowledge domain, the terms 'semantic knowledge graph' and 'ontology' refer to different aspects of organizing and representing knowledge.

What's the difference between an ontology and a semantic knowledge graph?

In broad terms, the key difference is that semantics focuses predominantly on the interpretation and understanding of data relationships within a knowledge graph, whereas an ontology is a formal definition of the vocabulary and structure unique to the knowledge domain. Both ontologies and semantics play a distinct and critical role in defining the utility and performance of a knowledge graph. An ontology provides the structured framework, formal definitions, and common vocabulary required to organize domain-specific knowledge in a way that creates a shared understanding. Semantics focuses on the meaning, context, interrelationships, and interpretation of different pieces of information in a given domain.

Ontologies provide a formal representation, using languages like RDF (Resource Description Framework) and OWL (Web Ontology Language), to standardize the annotation, organization, and expression of domain-specific knowledge. A semantic data layer is a more flexible approach to extracting implicit meaning and interrelationships between entities, often relying on a combination of semantic technologies and natural language processing (NLP) / large language model (LLM) frameworks to contextually integrate and organize structured and unstructured data. Semantic layers are often built on top of an ontology to create a more enriched and context-aware representation of knowledge graph entities.
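To make these ideas concrete, here is a minimal, dependency-free sketch of a knowledge graph as a set of subject-predicate-object triples, with a tiny "ontology" supplying the allowed entity types and relations. The entity names, types, and relations below are illustrative placeholders, not drawn from any real biomedical ontology:

```python
# A toy knowledge graph as subject-predicate-object triples.
# Entity names, types, and relations are illustrative placeholders.

# A tiny "ontology": entity types and the relation signatures they permit.
ENTITY_TYPES = {
    "TP53": "Gene", "p53": "Protein", "apoptosis": "BiologicalProcess",
}
ALLOWED_RELATIONS = {
    ("Gene", "encodes", "Protein"),
    ("Protein", "participates_in", "BiologicalProcess"),
}

def add_triple(graph, subject, relation, obj):
    """Add a triple only if the ontology permits its type signature."""
    signature = (ENTITY_TYPES[subject], relation, ENTITY_TYPES[obj])
    if signature not in ALLOWED_RELATIONS:
        raise ValueError(f"Ontology violation: {signature}")
    graph.add((subject, relation, obj))

graph = set()
add_triple(graph, "TP53", "encodes", "p53")
add_triple(graph, "p53", "participates_in", "apoptosis")

# Query: what does TP53 encode?
encoded = [o for (s, r, o) in graph if s == "TP53" and r == "encodes"]
print(encoded)  # ['p53']
```

Note how the ontology plays the validating, structural role described above: a triple whose type signature is not in the shared vocabulary (e.g., a protein "encoding" a gene) is rejected at insertion time.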
What are the key functions of ontologies in knowledge graphs?

Ontologies are essential to structuring and enhancing the capabilities of knowledge graphs, enabling several key functions related to the organization and interpretability of domain knowledge:

- The standardized, formal representation provided by ontologies serves as a universal foundation for integrating, mapping, and aligning data from heterogeneous sources into one unified view of knowledge.
- Ontologies provide the structure, rules, and definitions that enable logical reasoning and inference, and the deduction of new knowledge from existing information.
- By establishing a shared, standardized vocabulary, ontologies enhance semantic interoperability between different knowledge graphs, databases, and systems, creating a comprehensive and meaningful understanding of a given domain.
- They contribute to the semantic layer of knowledge graphs, enabling a richer and deeper understanding of data relationships that drives advanced analytics and decision-making.
- Ontologies help formalize data validation rules, thereby ensuring consistency and enhancing data quality.
- Ontologies enhance the search and discovery capabilities of knowledge graphs with a structured, semantically rich knowledge representation that enables more flexible and intelligent querying, as well as more contextually relevant and accurate results.

The importance of ontologies in biomedical knowledge graphs

Knowledge graphs have emerged as a critical tool in addressing the challenges posed by rapidly expanding and increasingly dispersed volumes of heterogeneous, multimodal, and complex biomedical information. Biomedical ontologies are foundational to creating ontology-based biomedical knowledge graphs capable of structuring all existing biological knowledge into a panorama of semantic biomedical data.
For example, the Scalable Precision Medicine Open Knowledge Engine (SPOKE), a biomedical knowledge graph connecting millions of concepts across 41 biomedical databases, uses 11 different ontologies as a framework to semantically organize and connect data. This massive knowledge engine integrates a wide variety of information, such as proteins, pathways, molecular functions, and biological processes, and has been used for a range of biomedical applications, including drug repurposing, disease prediction, and the interpretation of transcriptomic data.

Ontology-based knowledge graphs will also be key to the development of precision medicine, given their capability to standardize and harmonize data resources across different organizational scales, including multi-omics data, molecular functions, intra- and inter-cellular pathways, phenotypes, therapeutics, and environmental effects, into one holistic network. The use of ontologies for the semantic enrichment of biomedical knowledge graphs will also help accelerate the FAIRification of biomedical data and enable researchers to use ontology-based queries to answer more complex questions with greater accuracy and precision.

However, there are still several challenges to the more widespread use of ontologies in biomedical research. Biomedical ontologies will play an increasingly strategic role in the representation and standardization of biomedical knowledge. Given their rapid proliferation, however, the emphasis going forward will have to be on developing biomedical ontologies that adhere to mathematically precise shared standards and good-practice design principles, to ensure that they are more interoperable, exchangeable, and examinable.
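The reasoning-and-inference function that ontologies enable can be sketched in a few lines. The toy 'is_a' hierarchy and entities below are hypothetical; real systems use OWL reasoners over far richer axioms, but the principle of deducing unstated facts from a class hierarchy is the same:

```python
# Ontology: a toy 'is_a' hierarchy (child class -> parent class).
IS_A = {
    "kinase_inhibitor": "enzyme_inhibitor",
    "enzyme_inhibitor": "small_molecule",
}

# Knowledge graph facts: entity -> its asserted class.
FACTS = {"imatinib": "kinase_inhibitor"}

def ancestors(cls):
    """Walk the is_a chain upward, yielding every superclass."""
    while cls in IS_A:
        cls = IS_A[cls]
        yield cls

def infer_classes(entity):
    """Deduce every class an entity belongs to: asserted plus inferred."""
    asserted = FACTS[entity]
    return [asserted] + list(ancestors(asserted))

print(infer_classes("imatinib"))
# ['kinase_inhibitor', 'enzyme_inhibitor', 'small_molecule']
```

The graph only stores one fact about the entity, yet the ontology lets a query such as "find all small molecules" return it, which is exactly the deduction-of-new-knowledge function described above.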
There is a compelling case underlying the tremendous interest in generative AI and LLMs as the next big technological inflection point in computational drug discovery and development. For starters, LLMs help expand the data universe of in silico drug discovery, especially by opening up access to the huge volumes of valuable information locked away in unstructured textual sources, including scientific literature, public databases, clinical trial notes, and patient records. LLMs provide the much-needed capability to analyze this information, identify patterns and connections, and extract novel insights about disease mechanisms and potential therapeutic targets. Their ability to interpret complex scientific concepts and elucidate connections between diseases, genes, and biological processes can help accelerate disease hypothesis generation and the identification of potential drug targets and biomarkers.

When integrated with biomedical knowledge graphs, LLMs help create a unique synergistic model that enables bidirectional data- and knowledge-based reasoning: the explicit, structured knowledge of knowledge graphs enhances the knowledge of LLMs, while the power of language models streamlines graph construction and conversational interaction with complex knowledge bases. However, several challenges have to be addressed before LLMs can be reliably integrated into in silico drug discovery pipelines and workflows. One of these is hallucination.

Why do LLMs hallucinate?

At a time of some speculation about laziness and seasonal depression in LLMs, a hallucination leaderboard of 11 public LLMs revealed hallucination rates ranging from 3% at the top end to 27% at the bottom of the barrel. Another comparative study of two versions of a popular LLM tasked with generating ophthalmic scientific abstracts revealed very high rates (33% and 29%) of fake generated references.
This tendency of LLMs to hallucinate, i.e., to present incorrect or unverifiable information as accurate, can have serious consequences in critical drug discovery applications, even at a 3% rate. There are several reasons for LLM hallucinations. At the core of this behavior is the fact that generative AI models have no actual intelligence; they rely instead on a probability-based approach to predict the output most likely to occur given patterns and contexts 'learned' from their training data. Apart from this inherent lack of contextual understanding, other potential causes include noise, errors, biases, and inconsistencies in training data, the training and generation methods themselves, and even prompting techniques. For some, hallucination is all LLMs do; others see it as inevitable for any prompt-based large language model. In the context of life sciences research, however, mitigating LLM hallucinations remains one of the biggest obstacles to the large-scale, strategic integration of this potentially transformative technology.

How to mitigate LLM hallucinations

There are three broad and complementary approaches to mitigating hallucinations in large language models: prompt engineering, fine-tuning, and grounding plus prompt augmentation.

Prompt engineering

Prompt engineering is the process of strategically designing user inputs, or prompts, in order to guide model behavior and obtain optimal responses. There are three major approaches: zero-shot, few-shot, and chain-of-thought prompting. In zero-shot prompting, language models are given inputs that were not part of their training data and are still expected to generate reliable results. Few-shot prompting involves providing examples to the LLM before presenting the actual query. Chain-of-thought (CoT) prompting is based on the finding that a series of intermediate reasoning steps, provided as examples during prompting, can significantly improve the reasoning capabilities of large language models.
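As a simple illustration of the few-shot pattern just described, a prompt can be assembled from worked examples followed by the actual query. The example Q/A pairs and the template below are hypothetical and not tied to any particular model API:

```python
# Assemble a few-shot prompt: worked examples first, then the real query.
# The example questions and answers are illustrative placeholders.
EXAMPLES = [
    ("Does aspirin inhibit COX-1?", "Yes"),
    ("Is TP53 a lipid?", "No"),
]

def build_few_shot_prompt(query):
    """Format each example as a Q/A pair, then append the unanswered query."""
    lines = []
    for question, answer in EXAMPLES:
        lines.append(f"Q: {question}\nA: {answer}")
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

prompt = build_few_shot_prompt("Does imatinib target BCR-ABL?")
print(prompt)
```

The trailing "A:" invites the model to continue the established pattern; chain-of-thought prompting extends the same template by including intermediate reasoning steps in each example answer.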
The chain-of-thought concept has since been expanded with new techniques such as chain-of-verification (CoVe), a self-verification process that enables LLMs to check the accuracy and reliability of their output, and chain-of-density (CoD), a process that focuses on summarization rather than reasoning in order to control the density of information in the generated text. Prompt engineering, however, has its own limitations, including prompt constraints that can cramp the ability to query complex domains and the lack of objective metrics for quantifying prompt effectiveness.

Fine-tuning

Where prompt engineering focuses on the skill required to elicit better LLM output, fine-tuning emphasizes task-specific training to enhance the performance of pre-trained models in specific topics or domains. The conventional approach is full fine-tuning, which involves additionally training a pre-trained model on labeled, domain- or task-specific data so that it generates more contextually relevant responses. This is a time-, resource-, and expertise-intensive process. An alternative is parameter-efficient fine-tuning (PEFT), conducted on a small set of extra parameters without adjusting the entire model. The modular nature of PEFT means that training can prioritize select portions or components of the original parameters, so the pre-trained model can be adapted for multiple tasks. LoRA (Low-Rank Adaptation of large language models), a popular PEFT technique, can significantly reduce the resource intensity of fine-tuning while matching the performance of full fine-tuning. There are, however, challenges to fine-tuning, including domain shift, the potential for bias amplification and catastrophic forgetting, and the complexity of choosing the right hyperparameters to ensure optimal performance.
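The core idea behind LoRA can be sketched in a few lines: rather than updating a full weight matrix W, training adjusts two small low-rank matrices B and A, and the effective weight becomes W plus their product. The toy dimensions and values below are purely illustrative (real implementations operate on large tensors inside attention layers):

```python
# LoRA sketch: the frozen weight W gets a trainable low-rank update B @ A.
# With W of shape (d, d), B of shape (d, r), and A of shape (r, d), r << d,
# the trainable parameter count drops from d*d to 2*d*r.

def matmul(X, Y):
    """Plain nested-loop matrix multiply (no external dependencies)."""
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def lora_update(W, B, A, scale=1.0):
    """Effective weight W' = W + scale * (B @ A); W itself stays frozen."""
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# d = 4, r = 1: 2*d*r = 8 trainable numbers instead of d*d = 16.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]          # frozen pre-trained weight
B = [[1.0], [0.0], [0.0], [0.0]]    # (d, r), trainable
A = [[0.0, 2.0, 0.0, 0.0]]          # (r, d), trainable

W_eff = lora_update(W, B, A)
print(W_eff[0])  # [1.0, 2.0, 0.0, 0.0]
```

Because only B and A are trained, the adapter can be stored separately from the base model, and multiple task-specific adapters can share one frozen pre-trained weight, which is the modularity the PEFT discussion above refers to.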
Grounding & augmentation

LLM hallucinations are often the result of language models attempting to generate knowledge from information they have never explicitly seen or memorized. The logical solution, therefore, is to give LLMs access to a curated knowledge base of high-quality contextual information that enables them to generate more accurate responses. Advanced grounding and prompt augmentation techniques can address many of the accuracy and reliability challenges associated with LLM performance. Both techniques rely on external knowledge sources to dynamically generate context. Grounding ensures that LLMs have access to up-to-date, use-case-specific information sources that provide relevant context not available in the training data alone. Similarly, prompt augmentation enriches a prompt with contextually relevant information that enables the LLM to generate more accurate and pertinent output.

Factual grounding is typically applied in the pre-training phase to ensure that LLM output across a variety of tasks is consistent with a knowledge base of factual statements. Post-training grounding relies on a range of external knowledge bases, including documents, code repositories, and public and proprietary databases, to improve the accuracy and relevance of LLMs on specific tasks. Retrieval-augmented generation (RAG) is a distinct framework for the post-training grounding of LLMs in the most accurate, up-to-date information retrieved from external knowledge bases. The RAG framework enables the optimization of biomedical LLM output along three key dimensions. One, access to targeted external knowledge sources ensures that the LLM's internal representation of information is dynamically refreshed with the most current and contextually relevant data. Two, access to an LLM's information sources means that responses can be validated for relevance and accuracy.
And three, there is the emerging potential to extend the RAG framework beyond text to multimodal knowledge retrieval, spanning images, audio, tables, and more, which can further boost the factuality, interpretability, and sophistication of LLMs.

Also read: How Retrieval-Augmented Generation (RAG) Can Transform Drug Discovery

Key challenges of retrieval-augmented generation include the high initial cost of implementation compared to standalone generative AI. In the long run, however, the RAG-LLM combination is less expensive than frequently fine-tuning LLMs and provides the most comprehensive approach to mitigating hallucinations. But even with better grounding and retrieval, scientific applications demand another layer of rigor: validation and reproducibility. Here's how teams can build confidence in LLM outputs before trusting them in high-stakes discovery workflows.

How to validate LLM outputs in drug discovery pipelines

In scientific settings like drug discovery, ensuring the validity of LLM outputs is critical, especially when those outputs may inform downstream experimental decisions. Here are key validation strategies used to assess LLM-generated content in biomedical pipelines:

Validation checklist:
- Compare outputs to curated benchmarks: use structured, peer-reviewed datasets such as DrugBank, ChEMBL, or internal gold standards to benchmark LLM predictions.
- Cross-reference with experimental data: validate AI-generated hypotheses against published experimental results, or integrate with in-house wet lab data for verification.
- Establish feedback loops from in vitro validations: create iterative pipelines where lab-tested results refine future model prompts, improving accuracy over time.

Advancing reproducibility in AI-augmented science

For LLM-assisted workflows to be trustworthy and audit-ready, they must be reproducible, particularly when used in regulated environments.
Reproducibility practices:
- Dataset versioning: track changes in source datasets, ensuring that each model run references a consistent data snapshot.
- Prompt logging: store full prompts (including context and input structure) to reproduce specific generations and analyze outputs over time.
- Controlled inference environments: standardize model versions, hyperparameters, and APIs to eliminate variation in inference across different systems.

Integrated intelligence with lensAI™

Holistic life sciences research requires the sophisticated orchestration of several innovative technologies and frameworks. lensAI Integrated Intelligence, our next-generation, data-centric AI platform, fluently blends some of the most advanced proprietary technologies into one seamless solution that empowers end-to-end drug discovery and development. lensAI integrates RAG-enhanced bioLLMs with an ontology-driven NLP framework, combining neuro-symbolic logic techniques to connect and correlate syntax (multi-modal sequential and structural data) and semantics (biological functions). A comprehensive and continuously expanding knowledge graph, mapping a remarkable 25 billion relationships across 660 million data objects, links sequence, structure, function, and literature information from the entire biosphere to provide a comprehensive overview of the relationships between genes, proteins, structures, and biological pathways. Our next-generation, unified, knowledge-driven approach to the integration, exploration, and analysis of heterogeneous biomedical data empowers life sciences researchers with the high-tech capabilities needed to explore novel opportunities in drug discovery and development.
In a recent article on knowledge graphs and large language models (LLMs) in drug discovery, we noted that despite the transformative potential of LLMs, several critical challenges have to be addressed to ensure these technologies conform to the rigorous standards demanded by life sciences research. Synergizing knowledge graphs with LLMs into one bidirectional data- and knowledge-based reasoning framework addresses several concerns related to hallucination and lack of interpretability. However, that still leaves the challenge of giving LLMs access to external data sources that address their limitations in factual accuracy and up-to-date knowledge recall. Retrieval-augmented generation (RAG), together with knowledge graphs and LLMs, is the third node in the trifecta of techniques required for the robust and reliable integration of language models into drug discovery pipelines.

Why retrieval-augmented generation?

One of the key limitations of general-purpose LLMs is their training data cutoff, which means their responses are typically out of step with the rapidly evolving nature of information. This is a serious drawback, especially in fast-paced domains like life sciences research. Retrieval-augmented generation enables biomedical research pipelines to optimize LLM output by:
- Grounding the language model in external sources of targeted, up-to-date knowledge, constantly refreshing the LLM's internal representation of information without completely retraining the model. This ensures that responses are based on the most current data and are more contextually relevant.
- Providing access to the model's information sources, so that responses can be checked for relevance, accuracy, and provenance.
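The retrieve-then-generate loop behind RAG can be sketched with a minimal example. The toy corpus, the keyword-overlap scoring, and the prompt template below are illustrative stand-ins: production systems use embedding models and a vector database for retrieval, and pass the augmented prompt to an actual LLM:

```python
# Toy RAG loop: retrieve the most relevant snippet, then build an
# augmented prompt for the generator. Corpus and scoring are illustrative.
CORPUS = [
    "Imatinib is a tyrosine kinase inhibitor used in CML.",
    "TP53 encodes the tumor suppressor protein p53.",
    "Aspirin irreversibly inhibits COX-1.",
]

def retrieve(query, corpus, k=1):
    """Rank documents by keyword overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

def augment_prompt(query, corpus):
    """Combine retrieved context with the original query."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = augment_prompt("What does TP53 encode?", CORPUS)
print(prompt)
```

Because the retrieved snippet travels with the prompt, the generated answer can be traced back to its source, which is exactly the provenance benefit listed above; refreshing the corpus updates the model's effective knowledge without retraining.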
In short, retrieval-augmented generation provides the framework needed to augment the recency, accuracy, and interpretability of LLM-generated information.

How does retrieval-augmented generation work?

Retrieval-augmented generation is a natural language processing (NLP) approach that combines elements of information retrieval and text generation models to enhance performance on knowledge-intensive tasks. The retrieval component aggregates information relevant to a specific query from a predefined set of documents or knowledge sources, which then serves as context for the generation model. Once the information has been retrieved, it is combined with the input to create an integrated context containing both the original query and the relevant retrieved information. This integrated context is fed into a generation model, which produces an accurate, coherent, and contextually appropriate response based on both pre-trained knowledge and retrieved, query-specific information.

The RAG approach gives life sciences research teams more control over the grounding data used by a biomedical LLM by honing it on enterprise- and domain-specific knowledge sources. It also enables the integration of a range of external data sources, such as document repositories, databases, or APIs, that are most relevant to enhancing the model's response to a query.

The value of RAG in biomedical research

Conceptually, the retrieve-and-generate model's ability to handle dynamic external information sources, minimize hallucinations, and enhance interpretability makes it a natural, complementary fit for augmenting the performance of bioLLMs. To quantify this augmentation, a recent research effort evaluated a retrieval-augmented generative agent on biomedical question-answering against LLMs (GPT-3.5/4), state-of-the-art commercial tools (Elicit, Scite, and Perplexity), and human biomedical researchers.
The RAG agent, PaperQA, was first evaluated against a standard multiple-choice LLM-evaluation dataset, PubMedQA, with the provided context removed to test the agent's ability to retrieve information. In this setting, the RAG agent beat GPT-4 by about 30 points (86.3% vs. 57.9%). Next, the researchers constructed a more complex and more contemporary dataset (LitQA), based on recent full-text research papers outside the bounds of the LLMs' pre-training data, to compare the integrated abilities of PaperQA, LLMs, and human researchers to retrieve the right information and generate an accurate answer from it. Again, the RAG agent outperformed both pre-trained LLMs and commercial tools, with overall accuracy (69.5%) and precision (87.9%) scores on par with biomedical researchers. More importantly, the RAG model produced zero hallucinated citations, compared to rates of 40-60% for the LLMs.

Though only a narrow evaluation of the retrieval-plus-generation approach in biomedical QA, this research does demonstrate the significantly enhanced value that RAG plus bioLLMs can deliver compared to purely generative AI. The combined sophistication of retrieval and generation models can be harnessed to enhance the accuracy and efficiency of a range of processes across the drug discovery and development pipeline.

Retrieval-augmented generation in drug discovery

In the context of drug discovery, RAG can be applied to a range of tasks, from literature reviews to biomolecule design. Generative models have demonstrated potential for de novo molecular design but are still hampered by their inability to integrate multimodal information or provide interpretability. The RAG framework can facilitate the retrieval of multimodal information from a range of sources, such as chemical databases, biological data, clinical trials, and images, that can significantly augment generative molecular design.
The same expanded retrieval-plus-augmented-generation template applies to a whole range of drug discovery applications: compound design (retrieve compounds and their properties, generate improvements or new properties), drug-target interaction prediction (retrieve known drug-target interactions, generate potential interactions between new compounds and specific targets), adverse effects prediction (retrieve known adverse effects, generate modifications to eliminate them), and so on. The template even applies to sub-processes and sub-tasks within drug discovery, leveraging a broader swathe of existing knowledge to generate novel, reliable, and actionable insights. In target validation, for example, retrieval-augmented generation can enable the comprehensive generative analysis of a target of interest based on an extensive review of all existing knowledge about it: expression patterns and functional roles, known binding sites, pertinent biological pathways and networks, potential biomarkers, and more. In short, the more efficient and scalable retrieval of timely information ensures that generative models are grounded in factual, sourceable knowledge, a combination with limitless potential to transform drug discovery.

An integrated approach to retrieval-augmented generation

Retrieval-augmented generation addresses several critical limitations of bioLLMs and augments their generative capabilities. However, additional design rules and multiple technological profiles have to come together to successfully address the specific requirements and challenges of life sciences research. Our lensAI™ Integrated Intelligence platform seamlessly unifies the semantic proficiency of knowledge graphs, the versatile information retrieval capabilities of retrieval-augmented generation, and the reasoning capabilities of large language models to reinvent the understand-retrieve-generate cycle in biomedical research.
Our unified approach empowers researchers to query a harmonized life sciences knowledge layer that integrates unstructured information and ontologies into a knowledge graph. A semantics-first approach enables a more accurate understanding of research queries, which in turn results in the retrieval of the content most pertinent to each query. The platform also integrates retrieval-augmented generation with structured biomedical data from our HYFT technology to enhance the accuracy of generated responses. And finally, lensAI combines deep learning LLMs with neuro-symbolic logic techniques to deliver comprehensive and interpretable outcomes. To experience this unified solution in action, please contact us here.
Across several previous blogs, we have explored the importance of knowledge graphs, large language models (LLMs), and semantic analysis in biomedical research. Today, we focus on integrating these distinct concepts into a unified model that can help advance drug discovery and development. But before we get to that, here's a quick synopsis of the knowledge graph, LLM, and semantic analysis narrative so far.

LLMs, knowledge graphs & semantics in biomedical research

It has been established that biomedical LLMs, domain-specific models pre-trained exclusively on domain-specific vocabulary, outperform conventional tools in many biological data-based tasks. It is therefore considered inevitable that these models will quickly expand across the broader biomedical domain. However, several challenges, such as hallucination and interpretability, have to be addressed before biomedical LLMs can go mainstream. A key domain-specific challenge is LLMs' lack of semantic intelligence. LLMs have, debatably, been described as 'stochastic parrots' that comprehend none of the language they process, relying instead on 'learning' meaning through the large-scale extraction of statistical correlations. This has led to the question of whether modern LLMs really possess any inductive, deductive, or abductive reasoning abilities. Statistically extrapolated meaning may well be adequate for general language applications. However, the unique complexities and nuances of biochemical, biomedical, and biological vocabulary require a more semantic approach to converting words and sentences into meaning, and ultimately knowledge. Biomedical knowledge graphs address this key capability gap by going beyond statistical correlations to bring the power of context to biomedical language models.
Knowledge graphs help capture the inherent graph structure of biomedical data, such as drug-disease and protein-protein interactions, and model complex relationships between disparate data elements in one unified structure that is both human-readable and computationally accessible. They accomplish this by emphasizing the definitions of, and the semantic relationships between, different entities, using domain-specific ontologies that formally define concepts and relations to enrich and interlink data based on context. A combination of semantic knowledge graphs and biomedical LLMs will therefore be most effective for life sciences applications.

Semantic knowledge graphs and LLMs in drug discovery

There are three general frameworks for unifying the power of LLMs and knowledge graphs. The first, knowledge graph-enhanced LLMs, focuses on using the explicit, structured knowledge of knowledge graphs to enhance LLMs at different stages, including pre-training, inference, and interpretability. This approach offers three distinct advantages: it improves the knowledge expression of LLMs, provides them with continuous access to the most up-to-date knowledge, and affords more transparency into the reasoning process of black-box language models. Structured data from knowledge graphs, covering genes, proteins, diseases, pathways, and chemical compounds, combined with unstructured data from scientific literature, clinical trial reports, and patents, can help augment drug discovery by providing a more holistic domain view.

The second, LLM-augmented knowledge graphs, leverages the power of language models to streamline graph construction, enhance knowledge graph tasks such as graph-to-text generation and question answering, and augment the reasoning capabilities and performance of knowledge graph applications.
LLM-augmented knowledge graphs combine the natural language capabilities of LLMs with the rich semantic relationships represented in knowledge graphs to empower pharmaceutical researchers with faster and more precise answers to complex questions and with insights extracted from patterns and correlations. LLMs can also enhance the utility of knowledge graphs in drug discovery by continually extracting and enriching pharmaceutical knowledge.

The third approach aims to create a synergistic biomedical LLM plus biomedical knowledge graph (BKG) model that enables bidirectional data- and knowledge-based reasoning. Currently, the process of combining generative and reasoning capabilities into one symbiotic model is focused on specific tasks; however, this is poised to expand to diverse downstream applications in the near future.

Even as research continues to explore the symbiotic possibilities of a unified knowledge graph-LLM framework, these concepts are already having a transformative impact on several drug discovery and development processes. Take target identification, a critical step in drug discovery with consequential implications for downstream development. AI-powered language models have been shown to outperform state-of-the-art approaches in key tasks such as biomedical named entity recognition (BioNER) and biomedical relation extraction. Transformer-based LLMs are being used in chemoinformatics to advance drug-target relationship prediction and to generate novel, valid, and unique molecules. LLMs are also evolving beyond basic text-to-text frameworks into multi-modal large language models (MLLMs) that bring the combined power of image-plus-text adaptive learning to target identification and validation.
meanwhile, the semantic capabilities of knowledge graphs enhance the efficiency of target identification by enabling the harmonization and enrichment of heterogeneous data into one connected framework for more holistic exploration and analysis. llms are increasingly being used across the drug discovery and development pipeline to predict drug-target interactions (dtis) and drug-drug interactions, molecular properties such as pharmacodynamics, pharmacokinetics, and toxicity, and even likely drug withdrawals from the market due to safety concerns. in the drug discovery domain, biomedical knowledge graphs are being used across a range of tasks including polypharmacy prediction, dti prediction, adverse drug reaction (adr) prediction, gene-disease prioritization, and drug repurposing. the next significant point of inflection will be the integration of these powerful technologies into one synergized model to drive a step change in performance and efficiency.

optimizing llms for biomedical research

there are three key challenges (knowledge cut-off, hallucinations, and interpretability) that must be addressed before llms can be reliably integrated into biomedical research. there are currently two complementary approaches to mitigating these challenges and optimizing biomedical llm performance. the first is to leverage the structured, factual, domain-specific knowledge contained in biomedical knowledge graphs to enhance the factual accuracy, consistency, and transparency of llms. using graph-based query languages, the pre-structured data embedded in knowledge graph frameworks can be directly queried and integrated into llms. another key capability for biomedical llms is to retrieve information from external sources, on a per-query basis, in order to generate the most up-to-date and contextually relevant responses.
there are two broad reasons why this is a critical capability in biomedical research: first, it ensures that llms' internal knowledge is supplemented by access to the most current and reliable information from domain-specific, high-quality, and updateable knowledge sources. second, access to the data sources means that responses can be checked for accuracy and provenance. the retrieval-augmented generation (rag) approach combines the power of llms with external knowledge retrieval mechanisms to enhance the reasoning, accuracy, and knowledge recall of biomedical llms. combining the knowledge graph- and rag-based approaches will lead to significant improvements in llm performance in terms of factual accuracy, context-awareness, and continuous knowledge enrichment.

what is retrieval-augmented generation (rag) in drug discovery?

retrieval-augmented generation (rag) is an approach that combines large language models with access to internal and external trusted data sources. in the context of drug discovery, it helps generate scientifically grounded responses by drawing on biomedical datasets or proprietary data silos. when integrated with a knowledge graph, rag can support context-aware candidate suggestions, summarize literature, or even generate hypotheses based on experimental inputs. this is especially useful in fragmented biomedical data landscapes, where rag helps surface meaningful cross-modal relationships across omics layers, pathways, phenotypes, and more.

what's the difference between llms and plms in drug discovery?

large language models (llms) are general-purpose models trained on vast textual corpora, capable of understanding and generating human-like language. protein language models (plms), on the other hand, are trained on biological sequences, like amino acids, to capture structural and functional insights.
while llms can assist in literature mining or clinical trial design, plms power structure prediction, function annotation, and rational protein engineering. combining both enables cross-modal reasoning for smarter discovery.

lensai: the next-generation rag-kg-llm platform

these components (llms, plms, knowledge graphs, and rag) are increasingly being combined into unified frameworks for smarter drug discovery. imagine a system where a protein structure predicted by a plm is linked to pathway insights from a biomedical knowledge graph. an llm then interprets these connections to suggest possible disease associations or therapeutic hypotheses, supported by citations retrieved via rag. this kind of multi-layered integration mirrors how expert scientists reason, helping teams surface and prioritize meaningful leads much faster than traditional workflows. at biostrand, we have successfully realized a next-generation unified knowledge graph-large language model framework for holistic life sciences research. at the core of our lensai platform is a comprehensive and continuously expanding knowledge graph that maps 25 billion relationships across 660 million data objects, linking sequence, structure, function, and literature information from the entire biosphere. our first-in-class technology provides a holistic understanding of the relationships between genes, proteins, and biological pathways, thereby opening up powerful new opportunities for drug discovery and development. the platform leverages the latest advances in ontology-driven nlp and llms to connect and correlate syntax (multi-modal sequential and structural data) and semantics (functions).
our unified approach to biomedical knowledge graphs, retrieval-augmented generation models, and large language models combines the reasoning capabilities of llms, the semantic proficiency of knowledge graphs, and the versatile information retrieval capabilities of rag to streamline the integration, exploration, and analysis of all biomedical data.
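the rag loop described in this section can be sketched in a few lines. the passages, source identifiers, and term-overlap scoring below are toy stand-ins (a production system would use a vector store and a learned retriever), but the shape is the same: retrieve, attach provenance, then generate.

```python
# Hedged sketch of retrieval-augmented generation: rank a small document
# store by term overlap with the query, then assemble the top passages,
# with source identifiers for provenance, into a prompt for a generator.
# The corpus contents and ids are invented for illustration.

CORPUS = {
    "trial-042": "imatinib showed efficacy in chronic myeloid leukemia trials",
    "review-17": "BCR-ABL kinase inhibitors block proliferation of leukemic cells",
    "patent-9": "a formulation of aspirin for cardiovascular prophylaxis",
}

def score(query: str, doc: str) -> int:
    """Count query terms that appear in the document (toy relevance score)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k highest-scoring (source_id, passage) pairs."""
    ranked = sorted(CORPUS.items(), key=lambda kv: score(query, kv[1]),
                    reverse=True)
    return ranked[:k]

def grounded_prompt(query: str) -> str:
    """Build a prompt whose evidence lines carry provenance identifiers."""
    evidence = "\n".join(f"[{sid}] {text}" for sid, text in retrieve(query))
    return f"Evidence:\n{evidence}\n\nAnswer with citations: {query}"

print(grounded_prompt("imatinib chronic myeloid leukemia"))
```

because every evidence line carries its source id, the generated answer can be checked for accuracy and provenance, which is the second of the two reasons given above for making retrieval a first-class capability.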
knowledge graphs (kgs) have become a must-know innovation that will drive transformational benefits in data-centric ai applications across industries. kgs, big data, and ai are complementary concepts that together address the challenges of integrating, unifying, analyzing, and querying vast volumes of diverse and complex data. there are several inherent advantages to the kg approach to organizing and representing information. unlike traditional flat data structures, for instance, a kg framework is designed to model multilevel hierarchical, associative, and causal relationships that more accurately represent real-world data. the application of a semantic layer to data also makes it easier for both humans and machines to understand the context and significance of information. here, then, are some of the key features and benefits of knowledge graphs.

efficient data integration: integrate disparate data sources and break down information silos

ai-specific data management, including automated data and metadata integration, is a critical component of successful data-centric ai. however, factors such as data complexity, quality, and accessibility pose integration challenges that are barriers to ai adoption. data-centric ai requires a modern approach that integrates all organizational data entities into one unified semantic representation based on context (ontologies, metadata, domain knowledge, etc.) and time (temporal relationships). knowledge graphs (kgs) have become the ideal platform for the contextual integration and representation of complex data ecosystems. they enable the integration of information from multiple data sources and map them to a common ontology in order to create a comprehensive, consistent, and connected representation of all organizational data entities.
the scalability of this approach, across large volumes of heterogeneous structured, semi-structured, and multimodal unstructured data from diverse sources and silos, makes kgs ideal for automated data acquisition, transformation, and integration. knowledge extraction methods can be used to classify entities and relations, identify matching entities (entity linking, entity resolution), combine entities into a single representation (entity fusion), and match and merge ontology concepts to create a kg data model. there are several advantages to kg data models. they have the flexibility to scale across complex heterogeneous data structures. when integrated with natural language technologies (nlt), kgs can help train language models on domain-specific knowledge, while natural language technologies can streamline the construction of knowledge models. they allow for more intuitive querying of complex data, even by users without specialized data science knowledge. they can evolve to assimilate new data, sources, definitions, and use cases without losing manageability or accessibility. and they provide consistent and unified access to all organizational knowledge that is typically distributed across different data silos and systems.

rich contextualization: capture relationships and provide a holistic view of data

context is a critical component of learning, for both humans and machines. contextual information will be key to the development of next-generation ai systems that adopt a human approach to transforming data into knowledge and enable more human-like decision-making. kgs leverage the powers of context and relations to embed data with intelligence. by organizing data based on factual interconnections and interrelations, they add real-world meaning to data that makes it easier for ai systems to extract knowledge from vast volumes of data.
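the knowledge-extraction steps described under data integration above (entity linking, entity resolution, and entity fusion) can be sketched as follows. the two toy sources and the case-insensitive matching rule are illustrative assumptions; real pipelines match on ontology identifiers and synonym tables rather than lowercased names.

```python
# Illustrative sketch of entity resolution and fusion: records from two
# sources are linked by a normalized key and merged into a single graph
# node, with attributes fused and per-source provenance retained.
# The source records and the matching rule are toy assumptions.

SOURCE_A = [{"name": "TP53", "type": "gene", "organism": "human"}]
SOURCE_B = [{"name": "tp53", "function": "tumor suppression"}]

def normalize(name: str) -> str:
    """Toy entity-linking key: case-insensitive exact match."""
    return name.strip().lower()

def fuse(*sources: list) -> dict:
    """Merge records that resolve to the same entity into one node."""
    nodes: dict = {}
    for i, source in enumerate(sources):
        for record in source:
            key = normalize(record["name"])
            node = nodes.setdefault(key, {"sources": []})
            # entity fusion: fold all non-key attributes into one node
            node.update({k: v for k, v in record.items() if k != "name"})
            node["sources"].append(f"source_{i}")
    return nodes

graph_nodes = fuse(SOURCE_A, SOURCE_B)
print(graph_nodes["tp53"])  # both records collapse into a single node
```

the fused node keeps a `sources` list, which is the small-scale analogue of the lineage and provenance tracking that makes kg data models auditable.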
a key organizing principle of kgs is the provision of an additional metadata layer that organizes data based on context to support logical reasoning and knowledge discovery. the organizing principle can take many forms, including controlled vocabularies such as taxonomies and ontologies, entity resolution and analysis, and tagging, categorization, and classification. with kgs, smart behavior is encoded directly into the data, so the graph itself can dynamically understand connections and associations between entities, eliminating the need to manually program every new piece of information. knowledge graphs provide context for decision support and can be further classified by use case as actioning kgs (data management) and decisioning kgs (analytics), and as context-rich kgs (internal knowledge management), external-sensing kgs (external data mapping), and natural language processing kgs.

enhanced search and discovery: enable precise and context-aware search results

the first step towards understanding how kgs transform search and discovery is to understand the distinction between data search and data discovery. data search broadly refers to a scenario in which users are looking for specific information that they know or assume to exist; it is a framework that allows users to seek and extract relevant information from volumes of non-relevant data. data discovery is focused more on proactively enabling users to surface and explore new information and ideas that are potentially related to the actual search string. discovery, essentially, is search powered by context. kgs contextually integrate all entities and relationships across different data silos and systems into a unified semantic layer. this enables them to deliver more accurate and comprehensive search results and to provide context-relevant connections and relationships that promote knowledge discovery.
users can then follow the contextual links most pertinent to their interests to delve deeper into the data, thereby boosting data utilization and value. and, perhaps equally importantly, the intuitive and flexible querying capabilities of kgs allow even non-technical users to explore data and discover new insights. it is estimated that graph-based models can help organizations improve their ability to find, access, and reuse information by as much as 30%, and to do so up to 75% faster.

knowledge graphs in life sciences

knowledge graphs are transformative frameworks that enable a structured, connected, and semantically enhanced approach to organizing and interpreting data holistically. they provide the foundations for companies to create a uniform data fabric across different environments and technologies and to operationalize ai at scale. for the life sciences industry, knowledge graphs represent a powerful tool for integrating, harmonizing, and governing heterogeneous and siloed data while ensuring data quality, lineage, and compliance. they enable the creation of a centralized, shared, and holistic repository of knowledge that can be continually updated and enriched with new entities, relationships, and attributes. according to gartner, graph technologies will drive 80% of data and analytics innovations by 2025. if you are interested in integrating the innovative potential of kgs and ai/ml into your research pipeline, please drop us a line.
what are the limitations of large language models (llms) in biological research? chatgpt responds to this query with quite a comprehensive list that includes a lack of domain-specific knowledge, contextual understanding, access to up-to-date information, and interpretability and explainability. nevertheless, it has to be acknowledged that llms can have a transformative impact on biological and biomedical research. after all, these models have already been applied successfully to biological sequence-based tasks like protein structure prediction and could possibly be extended to the broader language of biochemistry. specialized llms like chemical language models (clms) have the potential to outperform conventional drug discovery processes for traditional small-molecule drugs as well as antibodies. more broadly, there is a huge opportunity to use large-scale pre-trained language models to extract value from vast volumes of unannotated biomedical data. pre-training, of course, will be key to the development of biological domain-specific llms. research shows that domains such as biomedicine, with large volumes of unlabeled text, benefit most from domain-specific pretraining, as opposed to starting from general-domain language models. biomedical language models pre-trained solely on domain-specific vocabulary cover a much wider range of applications and, more importantly, substantially outperform currently available biomedical nlp tools. however, there is a larger issue of interpretability and explainability when it comes to transformer-based llms.

the llm black box

the development of natural language processing (nlp) models has traditionally been rooted in white-box techniques that were inherently interpretable. since then, however, the evolution has been towards more sophisticated and advanced black-box techniques that have undoubtedly facilitated state-of-the-art performance but have also obfuscated interpretability.
to understand the sheer scale of the interpretability challenge in llms, we turn to openai's 'language models can explain neurons in language models' paper from earlier this year, which opens with the sentence "language models have become more capable and more widely deployed, but we do not understand how they work." millions of neurons need to be analyzed in order to fully understand llms, and the paper proposes an approach to automating interpretability so that it can be scaled to all neurons in a language model. the catch, however, is that "neurons may not be explainable." so, even as work continues on interpretable llms, the life sciences industry needs a more immediate solution to harness the power of llms while mitigating issues such as interpretability and explainability. and knowledge graphs could be that solution.

augmenting bionlp interpretability with knowledge graphs

one criticism of llms is that the predictions they generate, based on 'statistically likely continuations of word sequences', fail to capture the relational structures that are central to scientific knowledge creation. these relational structures are critical to effective life sciences research. biomedical data is derived from different levels of biological organization, with disparate technologies and modalities, and scattered across multiple non-standardized data repositories. researchers need to connect all these dots, across diverse data types, formats, and sources, and understand the relationships and dynamics between them in order to derive meaningful insights. knowledge graphs (kgs) have become a critical component of the life sciences technology infrastructure because they help map the semantic or functional relationships between millions of data points.
they use nlp to create a semantic network that visualises all objects in the system in terms of the relationships between them. semantic data integration, based on ontology matching, helps organize and link disparate structured and unstructured information into a unified human-readable, computationally accessible, and traceable knowledge graph that can be further queried for novel relationships and deeper insights.

unifying llms and kgs

combining these distinct ontology-driven and natural language-driven systems creates a synergistic technique that enhances the advantages of each while addressing the limitations of both. kgs can provide llms with the traceable factual knowledge required to address interpretability concerns. one roadmap for the unification of llms and kgs proposes three different frameworks. in kg-enhanced llms, the structured, traceable knowledge from kgs enhances the knowledge awareness and interpretability of llms: incorporating kgs in the pre-training stage helps with the transfer of knowledge, while in the inference stage it enhances llm performance in accessing domain-specific knowledge. in llm-augmented kgs, llms are used in two different contexts: to process the original corpus and extract relations and entities that inform kg construction, and to process the textual corpus in kgs to enrich representations. in synergized llms + kgs, both systems are unified into one general framework containing four layers: one, a data layer that processes textual and structural data and can be expanded to incorporate multi-modal data such as video, audio, and images; two, a synergized model layer, where both systems' features are combined to enhance capabilities and performance; three, a technique layer to integrate related llm and kg techniques into the framework; and four, an application layer for addressing different real-world applications.
the kg-llm advantage

a unified kg-llm approach to bionlp provides an immediate solution to the black-box concerns that impede large-scale llm deployment in the life sciences. combining domain-specific kgs, ontologies, and dictionaries can significantly enhance llm performance in terms of semantic understanding and interpretability. at the same time, llms can help enrich kgs with real-world data, from ehrs, scientific publications, etc., thereby expanding the scope and scale of semantic networks and enhancing biomedical research. at mindwalk, we have already created a comprehensive knowledge graph that integrates over 660 million objects, linked by more than 25 billion relationships, from the biosphere and from other data sources such as scientific literature. plus, our lensai platform, powered by hyft technology, leverages the latest advancements in llms to bridge the gap between syntax (multi-modal sequential and structural data) and semantics (functions). by integrating retrieval-augmented generation (rag) models, we have been able to harness the reasoning capabilities of llms while simultaneously addressing several associated limitations such as knowledge cut-off, hallucinations, and lack of interpretability. compared to closed-loop language modelling, this enhanced approach yields multiple benefits, including clear provenance and attribution and up-to-date contextual reference as our knowledge base updates and expands. if you would like to integrate the power of a unified kg-llm framework into your research, please drop us a line here.
data overload is becoming a real challenge for businesses of all stripes, even as a majority continue to gather data faster than they can analyse and harness its business value. and it's not just about volume. much of modern big data, as much as 93%, comes in the form of unstructured data, and most if not all of it ends up as dark data, i.e. collected but not analysed. unlocking knowledge at scale from troves of unstructured organisational data is rapidly becoming one of the most pressing needs for businesses today. concurrent themes in this regard include the importance of connected data, the value of applying knowledge in context, and the benefits of using ai to contextualize data and create knowledge. the need for connected, contextualised data, together with continuing developments in ai, has resulted in increasing interest in knowledge graphs as a means to generate context-based insights. in fact, gartner believes that graph technologies are the foundation of modern data and analytics, noting that most client inquiries on the topic of ai typically involve a discussion on graph technology.

a brief history of knowledge graphs

in 1735, in königsberg, swiss mathematician leonhard euler used a concept of nodes/objects and links/relationships to prove that there was no route across the city's four districts that would involve crossing each of its seven interconnecting bridges exactly once, thereby laying the foundations for graph theory. cut to more modern times, and 1956 witnessed the development of a semantic network, a well-known ancestor of knowledge graphs, for the machine translation of natural languages. fast forward to the early aughts, and sir timothy john berners-lee proposed a semantic web that would use structured and standardized metadata about webpages and their interlinks to make the knowledge stored in these relationships machine-readable.
unfortunately, the concept did not exactly scale, but search and social companies were quick to latch on to the value of extremely large graphs and the potential in extracting knowledge from them. google is often credited with rebranding the semantic web and popularising knowledge graphs with the introduction of the google knowledge graph in 2012. most of the first big knowledge graphs, from companies such as google, ibm, amazon, samsung, ebay, bloomberg, and the ny times, compiled non-proprietary information into a single graph that served a wide range of interests. enterprise knowledge graphs emerged as the second wave and used ontologies to elucidate the various conceptual models (schemas, taxonomies, vocabularies, etc.) used across different enterprise systems. back in 2019, gartner predicted that an annualised 100% growth in the application of graph processing and graph databases would help accelerate data preparation and enable more complex and adaptive data science. today, graphs are considered one of the fastest-growing database niches, having surpassed the growth rate of classical relational databases, and graph db + ai may well be the future of data management.

defining knowledge graphs

a knowledge graph is quite simply any graph of data that accumulates and conveys knowledge of the real world. data graphs can conform to different graph-based data models, such as a directed edge-labelled graph, a heterogeneous graph, or a property graph. for instance, a directed edge-labelled knowledge graph consists of nodes representing entities of interest, edges that connect nodes and reference potential relationships between various entities, and labels that capture the nature of each relationship. so, knowledge graphs use a graph-based data model to integrate, manage, and extract knowledge from diverse sources of data at scale.
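the directed edge-labelled model just described can be made concrete with a small sketch: nodes are entities, each edge is a (subject, label, object) triple, and a query is a pattern match over those triples. the entities and labels below are invented for illustration.

```python
# Minimal sketch of a directed edge-labelled graph: each edge is a
# (subject, label, object) triple, and querying means matching a partial
# pattern against the triple store. The entities here are invented.

EDGES = [
    ("aspirin", "treats", "inflammation"),
    ("aspirin", "interacts_with", "warfarin"),
    ("warfarin", "treats", "thrombosis"),
]

def match(subject=None, label=None, obj=None):
    """Return every edge matching the non-None parts of the pattern."""
    return [e for e in EDGES
            if (subject is None or e[0] == subject)
            and (label is None or e[1] == label)
            and (obj is None or e[2] == obj)]

# "what does aspirin treat?" -> fix the subject and label, leave obj free
print(match(subject="aspirin", label="treats"))
```

this triple-plus-pattern shape is essentially what graph query languages such as sparql and cypher formalize at scale, with indexing and joins over patterns instead of a linear scan.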
knowledge graph databases enable ai systems to deal with huge volumes of complex data by storing information as a network of data points correlated by the nature of their relationships.

key characteristics of knowledge graphs

by connecting multiple data points around relevant and contextually related attributes, graph technologies enable the creation of rich knowledge databases that enhance augmented analytics. some of the most defining characteristics of this approach include the following. knowledge graphs work across structured and unstructured datasets and represent the most credible means of aggregating all enterprise data regardless of variation in structure, type, or format. compared to knowledge bases with flat structures and static content, knowledge graphs integrate adjacent information on how different data points are correlated, enabling a human brain-like approach to deriving new knowledge. knowledge graphs are dynamic and can be programmed to automatically identify attribute-based associations across new incoming data. the ability to create connected clusters of data based on levels of influence, frequency of interaction, and probability opens up the possibility of developing and training highly complex models. knowledge graphs simplify the process of integrating and analysing complicated data by establishing a semantic layer of business definitions. the use of intelligent metadata enables users to find insights that might otherwise have been beyond the scope of analytics.

applications of knowledge graphs

today, knowledge graphs are everywhere. every consumer-facing digital brand, such as google, amazon, facebook, and spotify, has invested significantly in building knowledge graphs, and the concept has evolved to underpin everything from critical infrastructure to supply chains and policing. here's a quick look at how this technology can transform certain key sectors and functions.
healthcare

in the healthcare sector, it is especially critical that classification models are reliable and accurate. but this continues to be a challenge given the volume, quality, and complexity of data within the sector. despite the application of advanced classification methodologies, including deep learning, the outcomes do not demonstrate adequate superiority over previous techniques. much of this boils down to the fact that conventional techniques disregard correlations between data instances. however, it has been demonstrated that knowledge graph algorithms, with their inherent focus on correlations, could significantly advance capabilities for the discovery of knowledge and insights from connected data.

finance

knowledge graphs, and their ability to uncover new dimensions of data-driven knowledge, are expected to be adopted by as much as 80% of financial services firms in the near future. in fact, a 2020 report from business and technology management consultancy capco provided a veritable laundry list of knowledge graph applications across the financial services value chain. for instance, graphs can be used across compliance, kyc, and fraud detection to build a 'deep client insight' capability that can transform compliance from a cost into a revenue-driving function. the adoption of graph data models could also drive product innovations, given the inflexibility of current tabular data structures to reflect real-world needs.

pharma

machine learning approaches that use knowledge graphs have the potential to transform a range of drug discovery and development tasks, including drug repurposing, drug toxicity prediction, and target gene-disease prioritisation. in a knowledge graph for drug discovery, genes, diseases, drugs, etc. are represented as entities, with edges indicating their relationships and interactions. as a result, an edge between a disease entity and a drug entity could indicate a successful clinical trial.
similarly, an edge between two drug entities could reference either a potentially harmful interaction or compatibility. the pharma sector is also emerging as the ideal target for text-enhanced knowledge graph representation models that utilise textual information to augment knowledge representations.

knowledge graphs and ai/ml

ai/ml technologies are playing an increasingly critical role in driving data-driven decision-making in the digital enterprise. knowledge graphs will play a significant role in sustaining and growing this trend by providing the context required for more intelligent decision-making. there are two distinct reasons for knowledge graphs being at the epicentre of ai and machine learning. on the one hand, they are a manifestation of ai, given their ability to derive a connected and contextualised understanding of diverse data points. on the other, they represent a new approach to integrating all the data, structured and unstructured, required to build the ml models that drive decision-making. the combination of knowledge graphs and ai technologies will therefore be critical not only for integrating all enterprise data but also for adding the power of context to augment ai/ml approaches.
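the drug discovery graph described in the pharma section above can be sketched with a naive repurposing heuristic that ranks drugs by how many disease-linked genes they target. the entities are invented, and real systems use learned link prediction over much larger graphs; this only illustrates how typed edges make such reasoning possible.

```python
# Sketch of a toy drug-discovery graph: drugs, genes and diseases as
# nodes, typed edges as relationships. A naive repurposing heuristic
# ranks drugs for a disease by shared gene neighbours. All entities
# here are invented; production systems use learned link prediction.

GRAPH_EDGES = [
    ("drug_a", "targets", "gene_1"),
    ("drug_b", "targets", "gene_2"),
    ("disease_x", "involves", "gene_1"),
    ("disease_x", "involves", "gene_2"),
    ("drug_a", "treats", "disease_y"),
]

def neighbours(node: str, label: str) -> set:
    """Nodes reachable from `node` over edges with the given label."""
    return {o for s, lab, o in GRAPH_EDGES if s == node and lab == label}

def repurposing_candidates(disease: str) -> list:
    """Rank drugs by how many disease-linked genes they target."""
    genes = neighbours(disease, "involves")
    drugs = {s for s, lab, _ in GRAPH_EDGES if lab == "targets"}
    scored = [(len(neighbours(d, "targets") & genes), d) for d in drugs]
    return [d for hits, d in sorted(scored, reverse=True) if hits > 0]

print(repurposing_candidates("disease_x"))  # both drugs share a gene link
```

even in this toy form, the edge labels carry the semantics: swapping "targets" for "interacts_with" would turn the same traversal into a polypharmacy check, which is why typed relationships are central to kg-based drug discovery.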