In a recent article on knowledge graphs and large language models (LLMs) in drug discovery, we noted that despite the transformative potential of LLMs in drug discovery, several critical challenges have to be addressed to ensure that these technologies conform to the rigorous standards demanded by life sciences research. Synergizing knowledge graphs with LLMs in one bidirectional data- and knowledge-based reasoning framework addresses several concerns related to hallucinations and lack of interpretability. However, that still leaves the challenge of giving LLMs access to external data sources that address their limitations with respect to factual accuracy and up-to-date knowledge recall. Retrieval-augmented generation (RAG), together with knowledge graphs and LLMs, is the third critical node in the trifecta of techniques required to robustly and reliably integrate the transformative potential of language models into drug discovery pipelines.

Why retrieval-augmented generation?

One of the key limitations of general-purpose LLMs is their training data cutoff, which means that their responses to queries are typically out of step with the rapidly evolving state of information. This is a serious drawback, especially in fast-paced domains like life sciences research. Retrieval-augmented generation enables biomedical research pipelines to optimize LLM output by:

- Grounding the language model on external sources of targeted and up-to-date knowledge, constantly refreshing the LLM's internal representation of information without having to completely retrain the model. This ensures that responses are based on the most current data and are more contextually relevant.
- Providing access to the model's sources, so that responses can be validated and claims can be checked for relevance and accuracy.

In short, retrieval-augmented generation provides the framework necessary to augment the recency, accuracy, and interpretability of LLM-generated information.

How does retrieval-augmented generation work?

Retrieval-augmented generation is a natural language processing (NLP) approach that combines elements of information retrieval and text generation models to enhance performance on knowledge-intensive tasks. The retrieval component aggregates information relevant to a specific query from a predefined set of documents or knowledge sources, which then serves as the context for the generation model. Once the information has been retrieved, it is combined with the input to create an integrated context containing both the original query and the relevant retrieved information. This integrated context is then fed into a generation model to produce an accurate, coherent, and contextually appropriate response based on both pre-trained knowledge and retrieved, query-specific information.

The RAG approach gives life sciences research teams more control over the grounding data used by a biomedical LLM by honing it on enterprise- and domain-specific knowledge sources. It also enables the integration of a range of external data sources, such as document repositories, databases, or APIs, that are most relevant to enhancing the model's response to a query.
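To make the retrieve-then-generate loop concrete, here is a minimal sketch in Python. The retrieval step uses TF-IDF similarity over a tiny in-memory document store, and the generate function is a placeholder standing in for whatever LLM the pipeline actually calls; the document snippets, function names, and prompt wording are illustrative assumptions, not part of any specific product.

```python
# Minimal retrieval-augmented generation (RAG) sketch.
# Retrieval: TF-IDF + cosine similarity over a small document store.
# Generation: a stub standing in for any LLM call (hypothetical).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The SARS-CoV-2 spike protein mediates cell entry and is a key vaccine target.",
    "Monoclonal antibodies are engineered to bind specific epitopes on target antigens.",
    "Retrieval-augmented generation grounds model output in external knowledge sources.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]

def generate(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., a hosted or local model)."""
    return f"[LLM response conditioned on prompt of {len(prompt)} characters]"

def answer_with_rag(query: str) -> str:
    """Build an integrated context (query + retrieved passages) and generate."""
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the context below, citing the passages used.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)

print(answer_with_rag("What role does the spike protein play in vaccine design?"))
```

In a production biomedical pipeline the toy document store would be replaced by curated literature, databases, or APIs, and the stub by an actual model call, but the retrieve-integrate-generate structure stays the same.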
The value of RAG in biomedical research

Conceptually, the retrieve-and-generate model's ability to handle dynamic external information sources, minimize hallucinations, and enhance interpretability makes it a natural and complementary fit for augmenting the performance of BioLLMs. To quantify this augmentation, a recent research effort evaluated a retrieval-augmented generative agent on biomedical question answering against LLMs (GPT-3.5/4), state-of-the-art commercial tools (Elicit, scite, and Perplexity), and humans (biomedical researchers).

The RAG agent, PaperQA, was first evaluated against a standard multiple-choice LLM-evaluation dataset, PubMedQA, with the provided context removed to test the agent's ability to retrieve information. In this setting, the RAG agent beat GPT-4 by nearly 30 points (86.3% vs. 57.9%). Next, the researchers constructed a more complex and more contemporary dataset (LitQA), based on recent full-text research papers outside the bounds of the LLMs' pre-training data, to compare the integrated abilities of PaperQA, LLMs, and human researchers to retrieve the right information and generate an accurate answer from it. Again, the RAG agent outperformed both pre-trained LLMs and commercial tools, with overall accuracy (69.5%) and precision (87.9%) scores on par with biomedical researchers. More importantly, the RAG model produced zero hallucinated citations, compared with 40-60% for the LLMs.

Although this is only a narrow evaluation of the retrieval-plus-generation approach in biomedical QA, the research does demonstrate the significantly enhanced value that RAG combined with BioLLMs can deliver compared with purely generative AI. The combined sophistication of retrieval and generation models can be harnessed to enhance the accuracy and efficiency of a range of processes across the drug discovery and development pipeline.

Retrieval-augmented generation in drug discovery

In the context of drug discovery, RAG can be applied to a range of tasks, from literature reviews to biomolecule design. Generative models have demonstrated potential for de novo molecular design but are still hampered by their inability to integrate multimodal information or provide interpretability. The RAG framework can facilitate the retrieval of multimodal information from a range of sources, such as chemical databases, biological data, clinical trials, and images, that can significantly augment generative molecular design.

The same retrieve-plus-generate template applies to a whole range of applications in drug discovery, for example:

- Compound design: retrieve compounds and their properties, and generate improvements or new properties.
- Drug-target interaction prediction: retrieve known drug-target interactions, and generate potential interactions between new compounds and specific targets.
- Adverse effects prediction: retrieve known adverse effects, and generate modifications to eliminate them.

The template even applies to several sub-processes and sub-tasks within drug discovery, leveraging a broader swathe of existing knowledge to generate novel, reliable, and actionable insights.
In target validation, for example, retrieval-augmented generation can enable a comprehensive generative analysis of a target of interest based on an extensive review of all existing knowledge about the target: its expression patterns and functional roles, known binding sites, pertinent biological pathways and networks, potential biomarkers, and more. In short, the more efficient and scalable retrieval of timely information ensures that generative models are grounded in factual, sourceable knowledge, a combination with limitless potential to transform drug discovery.

An integrated approach to retrieval-augmented generation

Retrieval-augmented generation addresses several critical limitations of BioLLMs and augments their generative capabilities. However, additional design rules and multiple technological profiles have to come together to successfully address the specific requirements and challenges of life sciences research.

Our LENSai™ integrated intelligence platform seamlessly unifies the semantic proficiency of knowledge graphs, the versatile information retrieval capabilities of retrieval-augmented generation, and the reasoning capabilities of large language models to reinvent the understand-retrieve-generate cycle in biomedical research. Our unified approach empowers researchers to query a harmonized life science knowledge layer that integrates unstructured information and ontologies into a knowledge graph. A semantic-first approach enables a more accurate understanding of research queries, which in turn results in the retrieval of the content most pertinent to the query. The platform also integrates retrieval-augmented generation with structured biomedical data from our HYFT technology to enhance the accuracy of generated responses. Finally, LENSai combines deep learning LLMs with neuro-symbolic logic techniques to deliver comprehensive and interpretable outcomes.

To experience this unified solution in action, please contact us here.
Reproducibility, the ability to get the same results using the original data and analysis strategy, is fundamental to valid, credible, and actionable scientific research. Without reproducibility, replicability, the ability to confirm research results within different data contexts, becomes moot. A 2016 survey of researchers revealed a consensus that there is a crisis of reproducibility, with most researchers reporting that they had failed to reproduce not only the experiments of other scientists (70%) but even their own (more than 50%). In biomedical research, reproducibility testing is still extremely limited, and some attempts to do so have failed to comprehensively or conclusively validate reproducibility and replicability.

Over the years, there have been several efforts to assess and improve reproducibility in biomedical research. However, a new front is opening in the reproducibility crisis, this time in ML-based science. According to one study, the increasing adoption of complex ML models is creating widespread data leakage, resulting in "severe reproducibility failures," "wildly overoptimistic conclusions," and the inability to validate the superior performance of ML models over conventional statistical models (a minimal illustration of leakage-free evaluation appears later in this article).

Pharmaceutical companies have generally been cautious about accepting published results for a number of reasons, including the lack of scientifically reproducible data. An inability to reproduce and replicate preclinical studies can adversely impact drug development and has also been linked to drug and clinical trial failures. As drug development enters its latest innovation cycle, powered by computational in silico approaches and advanced AI-CADD integrations, reproducibility represents a significant obstacle to converting biomedical research into real-world results.

Reproducibility in in silico drug discovery

The increasingly computational nature of modern scientific research has already prompted a significant shift, with some journals incentivizing authors and providing badges for reproducible research papers. Many scientific publications also mandate the publication of all relevant research resources, including code and data. In 2020, eLife launched Executable Research Articles (ERAs), which allow authors to add live code blocks and computed outputs to create computationally reproducible publications. However, creating a robust reproducibility framework to sustain in silico drug discovery will require more transformative developments across three key dimensions: infrastructure and incentives for reproducibility in computational biology, reproducible ecosystems in research, and reproducible data management.

Reproducible computational biology

This approach to industry-wide transformation envisions a fundamental cultural shift, with reproducibility as the fulcrum for all decision-making in biomedical research. The focus is on four key domains. First, creating courses and workshops that expose biomedical students to specific computational skills and real-world biological data analysis problems and impart the skills required to produce reproducible research. Second, promoting truly open data sharing, along with all relevant metadata, to encourage larger-scale data reuse. Third, leveraging platforms, workflows, and tools that support the open data and open code model of reproducible research. And fourth, promoting, incentivizing, and enforcing reproducibility by adopting FAIR principles and mandating source code availability.
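Returning to the data-leakage problem flagged earlier in this article, the following sketch contrasts a leaky evaluation (feature selection fitted on the full dataset before cross-validation) with a leakage-free one (all preprocessing kept inside a scikit-learn Pipeline). The synthetic dataset and estimator choices are arbitrary placeholders, not a reproduction of the study cited above.

```python
# Illustrative sketch: how data leakage inflates ML performance estimates,
# and how to avoid it with a Pipeline. Dataset and estimator are placeholders.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A small, noisy dataset with many uninformative features (common in omics-style data).
X, y = make_classification(n_samples=100, n_features=2000, n_informative=10, random_state=0)

# LEAKY: feature selection is fit on ALL samples (including future test folds)
# before cross-validation, so information from the test data leaks into training.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# LEAKAGE-FREE: scaling and feature selection live inside the pipeline,
# so they are re-fit on the training portion of every fold.
pipe = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(pipe, X, y, cv=5)

print(f"Leaky estimate:        {leaky_scores.mean():.2f}")   # typically optimistic
print(f"Leakage-free estimate: {clean_scores.mean():.2f}")   # closer to reality
```

Keeping every data-dependent step inside the pipeline is what allows the reported score to be reproduced on genuinely unseen data.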
Computational reproducibility ecosystem

A reproducible ecosystem should enable data and code to be seamlessly archived, shared, and used across multiple projects. Computational biologists today have access to a broad range of open-source and commercial resources to ensure their ecosystem generates reproducible research. For instance, data can now be shared across several recognized, domain- and discipline-specific public data repositories such as PubChem and CDD Vault. Public and private code repositories, such as GitHub and GitLab, allow researchers to submit and share code with researchers around the world. And then there are computational reproducibility platforms, such as Code Ocean, that enable researchers to share, discover, and run code.

Reproducible data management

As per a recent data management and sharing (DMS) policy issued by the NIH, all applications for funding have to be accompanied by a DMS plan detailing the strategy and budget to manage and share research data. Sharing scientific data, the NIH points out, accelerates biomedical research discovery by validating research, increasing data access, and promoting data reuse.

Effective data management is critical to reproducibility, and creating a formal data management plan prior to the commencement of a research project helps clarify two key facets of the research: one, key information about experiments, workflows, and the types and volumes of data generated; and two, research output formats, metadata, storage, and access and sharing policies.

The next critical step towards reproducibility is having the right systems to document the process, including data and metadata, methods and code, and version control. For instance, reproducibility in in silico analyses relies extensively on metadata to define scientific concepts as well as the computing environment. In addition, metadata plays a major role in making data FAIR. It is therefore important to document experimental and data analysis metadata in an established standard and store it alongside research data. Similarly, the ability to track and document datasets as they adapt, reorganize, extend, and evolve across the research lifecycle will be crucial to reproducibility. Data should therefore be version-controlled so that results can be traced back to the precise subset and version used (a minimal sketch of this practice appears at the end of this article).

Of course, the end game for all of this has to be the sharing of data and code, which is increasingly becoming a prerequisite as well as a voluntarily accepted practice in computational biology. One survey of 188 researchers in computational biology found that paper authors were largely satisfied with their ability to carry out key code-sharing tasks, such as ensuring good documentation and that the code runs in the correct environment. The average researcher, however, would not commit any more time, effort, or expenditure to sharing code. And there are still certain perceived barriers that need to be addressed before the public archival of biomedical research data and code becomes prevalent.

The future of reproducibility in drug discovery

A 2014 report from the American Association for the Advancement of Science (AAAS) estimated that the U.S. alone spent approximately $28 billion yearly on irreproducible preclinical research. In the future, blockchain-based frameworks may well enable the automated verification of the entire research process. Meanwhile, in silico drug discovery has emerged as one of the maturing innovation areas in the pharmaceutical industry.
The alliance between pharmaceutical companies and research-intensive universities has been a key component in de-risking drug discovery and enhancing its clinical and commercial success. Reproducibility-related improvements and innovations will help move this alliance to a data-driven, AI/ML-based, in silico model of drug discovery.
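As a small complement to the data-versioning practice described in the data management section above, here is a minimal, illustrative sketch of content-addressed dataset versioning: each data file is fingerprinted by a hash of its contents and recorded with basic provenance metadata. It is a toy stand-in for dedicated tools such as DVC or Git LFS, and the file names used are hypothetical.

```python
# Toy sketch of dataset versioning: fingerprint a data file by its content hash
# and record it with minimal provenance metadata, so any result can be traced
# back to the exact data version used. File paths here are hypothetical examples.

import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def register_version(data_path: Path, registry_path: Path, description: str) -> dict:
    """Append a version record (hash + metadata) to a JSON-lines registry."""
    record = {
        "file": str(data_path),
        "sha256": file_sha256(data_path),
        "size_bytes": data_path.stat().st_size,
        "registered_at": datetime.now(timezone.utc).isoformat(),
        "description": description,
    }
    with registry_path.open("a") as registry:
        registry.write(json.dumps(record) + "\n")
    return record

# Example usage (hypothetical files):
# record = register_version(Path("counts_matrix.tsv"), Path("data_versions.jsonl"),
#                           "RNA-seq counts, batch 3, post-QC")
# print(record["sha256"][:12])
```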
Conventional vaccine development, still based predominantly on systems developed in the last century, is a complex process that takes between 10 and 15 years on average. Until the COVID-19 pandemic, when two mRNA vaccines went from development to deployment in less than a year, the record for the fastest development of a new vaccine, four years, had gone unchallenged for over half a century.

This revolutionary boost to the vaccine development cycle stemmed from two uniquely 21st-century developments. First, access to cost-effective next-generation sequencing (NGS) technologies with significantly enhanced speed, coverage, and accuracy enabled the rapid sequencing of the SARS-CoV-2 virus. And second, innovative state-of-the-art bioinformatics technologies were available to convert raw data into actionable insights, without which NGS would have just resulted in huge stockpiles of dormant or dark data. In the case of COVID-19, cutting-edge bioinformatics approaches played a critical role in enabling researchers to quickly home in on the spike protein gene as the vaccine candidate.

NGS technologies and advanced bioinformatics solutions have been pivotal in mitigating the global impact of COVID-19, providing the tools required for detection, tracking, containment, and treatment, the identification of biomarkers, the discovery of potential drug targets, drug repurposing, and the exploration of other therapeutic opportunities. The combination of gene engineering and information technologies is already creating the foundation for the fourth generation of sequencing technologies, promising faster and more cost-effective whole-genome sequencing and disease diagnosis. As a result, continuous innovation has become an evolutionary imperative for modern bioinformatics: it has to keep up with the developmental pace of NGS technologies and accelerate the transformation of an exponentially increasing trove of data into knowledge.

The raw volume and velocity of sequence data, however, is just one facet of big data genomics. Today, bioinformatics solutions have to cope with a variety of complex data, in heterogeneous formats, from diverse data sources and different sequencing methods connected to different -omes, and relating to different characteristics of genomes. More importantly, the critical focus of next-generation bioinformatics technologies has to be on catalysing new pathways and dimensions in biological research that can drive transformative change in precision medicine and public health. In the following sections, we look at the current evolutionary trajectory of bioinformatics in the context of three key omics analysis milestones.

Three key milestones in the evolution of bioinformatics

The steady evolution of bioinformatics over the past two decades into a cross-disciplinary and advanced computational practice has enabled several noteworthy milestones in omics analysis. The following three are significant because they best showcase the growth and expansion of omics research across multiple biological layers and dimensions, all made possible by a new breed of bioinformatics solutions.

Integrated multi-omics

For years, omics data has provided the requisite basis for the molecular characterisation of various diseases.
However, genomic studies of diseases such as cancer invariably include data from heterogeneous sources, and understanding cross-data associations and interactions can reveal deep molecular insights into complex biological processes that are simply not accessible through single-source analysis. Combining data across metabolomics, genomics, transcriptomics, and proteomics can reveal hidden associations and interactions between omics variables, elucidate the complex relationships between molecular layers, and enable a holistic, pathway-oriented view of biology.

An integrated and unified approach to multi-omics analysis has a range of novel applications in the prediction, detection, and prevention of various diseases, in drug discovery, and in designing personalised treatments. And, thanks to the development of next-generation bioinformatics platforms, it is now possible to integrate not just omics data but all types of relevant medical, clinical, and biological data, both structured and unstructured, under a unified analytical framework for a truly integrated approach to multi-omics analysis.

Single-cell multi-omics

Where multi-omics approaches focus on the interactions between omics layers to clarify complex biological processes, single-cell multi-omics enables the simultaneous and comprehensive analysis of the unique genotypic and phenotypic characteristics of single cells, as well as the regulatory mechanisms that are evident only at single-cell resolution. Earlier approaches to single-cell analysis involved the synthesis of data from individual cells, followed by computationally linking different modalities across cells. With next-generation multi-omics technologies, it is now possible to look at each cell directly in multiple ways and perform multiple analyses at the single-cell level.

Today, advanced single-cell multi-omics technologies can measure a wide range of modalities, including genomics, transcriptomics, epigenomics, and proteomics, to provide ground-breaking insights into cellular phenotypes and biological processes. Best-in-class solutions provide the framework required to seamlessly integrate huge volumes of granular data across multiple experiments, measurements, cell types, and organisms, and facilitate the integrative and comprehensive analysis of single-cell data.

Spatial transcriptomics

Single-cell RNA sequencing enabled a more fine-grained assessment of each cell's transcriptome. However, single-cell sequencing techniques are limited to tissue-dissociated cells that have lost all spatial information. Delineating the positional context of cell types within a tissue is important for several reasons, including the need to understand the chain of information between cells in a tissue, to correlate cell groups and cellular functions, and to identify differences in cell distribution between normal and diseased cells.

Spatial single-cell transcriptomics, or spatialomics, considered the next wave after single-cell analysis, combines imaging and single-cell sequencing to map the position of particular transcripts in a tissue, thereby revealing where particular genes are expressed and indicating the functional context of individual cells. Even though many bioinformatics capabilities for the analysis of single-cell RNA-seq data are shared with spatially resolved data, analysis pipelines diverge at the level of the quantification matrix, requiring specialised tools to extract knowledge from spatial data.
However, there are advanced analytics platforms that use a unique single-data framework to ingest all types of data, including spatial coordinates, for integrated analysis.

Quo vadis, bioinformatics?

Bioinformatics will continue to evolve alongside, if not ahead of, emerging needs and opportunities in biological research. But if there is one key takeaway from the examples cited here, it is that a reductionist approach, one that is limited to a single omics modality, discipline, or even dimension, yields limited and often suboptimal results. If bioinformatics is to continue driving cutting-edge biological research that tackles some of the most complex questions of our times, then the focus needs to be on developing a more holistic, systems bioinformatics approach to analysis.

Systems bioinformatics is not an entirely novel concept, though its application is not particularly commonplace. It applies a well-defined systems-approach framework to the entire spectrum of omics data, with the emphasis on defining the level of resolution and the boundary of the system of interest in order to study the system as a whole, rather than as a sum of its components. The focus is on combining the bottom-up approach of systems biology with the data-driven, top-down approach of classical bioinformatics to integrate different levels of information.

The advent of multi-omics has, quite paradoxically, only served to accentuate the inherently siloed nature of omics approaches. Even though the pace of bioinformatics innovation has picked up over the past couple of decades, the broader practice is still mired in a fragmented multiplicity of domain-, project-, or data-specific solutions and pipelines. There is still a dearth of integrated end-to-end solutions with the capabilities to integrate multi-modal datasets, scale effortlessly from the study of specific molecular mechanisms to system-wide analysis of biological systems, and empower collaboration across disciplines and research communities. Integration at scale and across disciplines, datasets, sources, and computational methodologies is now the grand challenge for bioinformatics, and it represents the first step towards a future of systems bioinformatics.
The exponential generation of data by modern high-throughput, low-cost next-generation sequencing (NGS) technologies is set to revolutionise genomics and molecular biology and enable a deeper and richer understanding of biological systems. And it is not just about greater volumes of highly accurate, multi-layered data. It is also about more types of omics datasets, such as glycomics, lipidomics, microbiomics, and phenomics.

The increasing availability of large-scale, multidimensional, and heterogeneous datasets has the potential to open up new insights into biological systems and processes, improve and increase diagnostic yield, and pave the way for a shift from reductionist biology to a more holistic, systems biology approach to decoding the complexities of biological entities. It has already been established that multi-dimensional analysis, as opposed to single-layer analysis, yields better results from both a statistical and a biological point of view, and can have a transformative impact on a range of research areas, such as genotype-phenotype interactions, disease biology, systems microbiology, and microbiome analysis.

However, applying systems-thinking principles to biological data requires the development of radically new integrative techniques and processes that can enable the multi-scale characterisation of biological systems. Combining and integrating diverse types of omics data from different layers of biological regulation is the first computational challenge, and the next big opportunity, on the way to enabling a unified end-to-end workflow that is truly multi-omics. The challenge is quite colossal: a 2019 article in the Journal of Molecular Endocrinology describes the successful integration of more than two datasets as very rare.

Data integration challenges in multi-omics

Analysing omics datasets at just one level of biological complexity is challenging enough. Multi-omics analysis amplifies those challenges and introduces new complications around data integration and fusion, clustering, visualisation, and functional characterisation. For instance, accommodating the inherent complexity of biological systems, the sheer number of biological variables, and the relatively low number of biological samples can on its own turn out to be a particularly difficult assignment. Over and above this, there is a litany of other issues, including process variations in data cleaning and normalisation, data dimensionality reduction, biological contextualisation, biomolecule identification, and statistical validation.

Data heterogeneity, arguably the raison d'être for integrated omics, is often the primary hurdle in multi-omics data management. Omics data is typically distributed across multiple silos defined by domain, type, and access (public or proprietary), to name just a few variables. More often than not, there are significant variations between datasets in terms of the technologies and platforms used to generate them, nomenclature, data modalities, assay types, and so on. Data harmonisation therefore becomes a standard pre-integration process. But the processes for data scaling, normalisation, and transformation used to harmonise data can vary across dataset types and sources. For example, the normalisation and scaling techniques appropriate for RNA-seq datasets differ from those for small RNA-seq datasets.
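To make the harmonisation step concrete, the sketch below shows one common pre-integration transformation for bulk RNA-seq counts: counts-per-million scaling followed by a log transform. It is only an example of the kind of normalisation discussed above; real pipelines select methods appropriate to each assay type (small RNA-seq, for instance, typically needs different handling), and the gene and sample names are made up.

```python
# Minimal sketch: CPM (counts-per-million) normalisation plus log transform
# for a bulk RNA-seq count matrix, one common harmonisation step before
# multi-omics integration. Sample and gene names are hypothetical.

import numpy as np
import pandas as pd

# Rows = genes, columns = samples (raw read counts).
counts = pd.DataFrame(
    {"sample_1": [500, 1500, 30], "sample_2": [800, 2400, 10]},
    index=["gene_A", "gene_B", "gene_C"],
)

def cpm_log_normalise(counts: pd.DataFrame) -> pd.DataFrame:
    """Scale each sample to counts-per-million, then apply log2(x + 1)."""
    library_sizes = counts.sum(axis=0)              # total reads per sample
    cpm = counts.divide(library_sizes, axis=1) * 1e6
    return np.log2(cpm + 1)

normalised = cpm_log_normalise(counts)
print(normalised.round(2))
```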
Multi-omics data integration has its own set of challenges, including a lack of reliability in parameter estimation, difficulty preserving accuracy in statistical inference, and the prevalence of large standard errors. There are several tools currently available for multi-omics data integration, though they come with their own limitations. For example, there are web-based tools that require no computational experience, but the lack of visibility into their underlying processes makes it a challenge to deploy them for large-scale scientific research. At the other end of the spectrum, there are more sophisticated tools that afford more customisation and control, but these require considerable expertise in computational techniques.

In this context, the development of a universal standard or unified framework for pre-analysis, let alone an integrated end-to-end pipeline for multi-omics analysis, seems rather daunting. However, if multi-omics analysis is to yield diagnostic value at scale, it is imperative that it quickly evolves from a dispersed syndicate of tools, techniques, and processes into a new integrated multi-omics paradigm that is versatile, computationally feasible, and user-friendly.

A platform approach to multi-omics analysis

The data integration challenge in multi-omics essentially boils down to this: either there has to be a technological innovation designed specifically to handle the fine-grained and multidimensional heterogeneity of biological data, or there has to be a biological discovery that unifies all omics data and makes it instantly computable even for conventional technologies. At Mindwalk, we took the latter route and came up with HYFTs™, a biological discovery that can instantly make all omics data computable.

Normalising and integrating data with HYFTs™

We started with a new technique for indexing cellular blueprints and building blocks and used it to identify and catalogue unique signature sequences, or biological fingerprints, in DNA, RNA, and amino acid sequences, which we call HYFT™ patterns. Each HYFT™ comprises multiple layers of information, relating to function, structure, position, and more, which together create a multilevel information network. We then designed a Mindwalk parser to identify, collate, and index HYFTs™ from over 450 million sequences available across 11 popular public databases. This helped us create a proprietary pangenomic knowledge database of over 660 million HYFT™ patterns containing information about variation, mutation, structure, and more. Based on our biological discovery, we were able to normalise and integrate all publicly available omics data, including patent data, at scale, and render it multi-omics analysis-ready. The same HYFT™ IP can also be applied to normalise and integrate proprietary omics data.

Making 660 million data points accessible

That is a lot of data points, so we made them searchable. With Google-like advanced indexing and exact-matching technologies, only exact matches to search inputs are returned. Through a simple search interface, using plain text or a FASTA file, researchers can now accurately retrieve all relevant information about sequence alignments, similarities, and differences from a centralised knowledge base covering millions of organisms in just three seconds.
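The sketch below illustrates the general idea of indexing short signature subsequences for exact-match retrieval, using a plain k-mer inverted index. It is a generic, simplified illustration of index-based search, not a description of the proprietary HYFT technology, and the sequences are invented for the example.

```python
# Toy sketch of signature-based, exact-match retrieval: build an inverted index
# that maps fixed-length subsequences (k-mers) to the sequences containing them,
# then answer queries by index lookup instead of scanning every sequence.
# Generic illustration only; not the proprietary HYFT technology.

from collections import defaultdict

def build_index(sequences: dict[str, str], k: int = 5) -> dict[str, set[str]]:
    """Map every k-mer to the set of sequence IDs in which it occurs."""
    index: dict[str, set[str]] = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i : i + k]].add(seq_id)
    return index

def search(index: dict[str, set[str]], query: str, k: int = 5) -> set[str]:
    """Return IDs of sequences containing every k-mer of the query
    (candidates for an exact match)."""
    kmers = [query[i : i + k] for i in range(len(query) - k + 1)]
    if not kmers:
        return set()
    hits = index.get(kmers[0], set()).copy()
    for kmer in kmers[1:]:
        hits &= index.get(kmer, set())
    return hits

sequences = {
    "seq_1": "ATGGCGTACGTTAGC",
    "seq_2": "TTGGCGTACGTAACC",
    "seq_3": "ATGAAACCCGGGTTT",
}
index = build_index(sequences)
print(search(index, "GCGTACG"))   # {'seq_1', 'seq_2'}
```

Because the index is built once and queried by lookup, search time stays roughly constant as the database grows, which is what makes this family of methods attractive at the scale described above.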
Synthesising knowledge with our AI-powered SaaS platform

Around these core capabilities, we built the Mindwalk SaaS platform with state-of-the-art AI tools to expand data management capabilities, mitigate data complexity, and empower researchers to intuitively synthesise knowledge out of petabytes of biological data. With our platform, researchers can easily add different types of structured and unstructured data, leverage advanced graph-based data mining features to extract insights from huge volumes of data, and use built-in genomic analysis tools for annotation and variation analysis.

Multi-omics as a platform

As omics datasets become more multi-layered and multidimensional, only a truly sequence-integrated multi-omics analysis solution can enable the discovery of novel and practically beneficial biological insights. With the Mindwalk platform, delivered as a SaaS, we believe we have created an integrated platform that enables a user-friendly, automated, intelligent, data-ingestion-to-insight approach to multi-omics analysis. It eliminates the data management challenges associated with conventional multi-omics analysis solutions and offers a cloud-based, platform-centric approach to multi-omics analysis that is key to usability and productivity.
Case study: finding robust domains in the variable region of immunoglobulins

Searching for similarity in biological databases is easy to grasp but hard to master. DNA, RNA, and protein sequence databases are often large, complex, and multi-dimensional. Conceptually simple approaches such as dynamic programming perform poorly when the alignment of multiple sequences is desired, and heuristic algorithms cut corners to gain speed. A new method, based on advances in computer science, may combine the best of both worlds and provide great performance without sacrificing accuracy.

Searching for similarity in biological sequences is challenging

Finding patterns in biological data is one of the most important parts of many data analysis workflows in the life sciences, such as omics analysis. To distinguish similarity from variance is to find meaning. Whether scientists are building evolutionary trees, identifying conserved domains in proteins of interest, or studying structure-function relationships, from DNA to RNA to amino acids, they all rely on a handful of methods for finding similarity and dissimilarity in biological sequences.

Searching and aligning sequences is, in essence, a problem of matching letters on a grid and assigning regions of high similarity versus regions of high variation. But nature has done a great deal to make this a challenging task. First, there is the sheer scope of the data: the human genome contains three billion base pairs, and sequence similarity searches are rarely limited to a simple one-on-one query. Aligning genomic sequences across large patient databases means that queries become n-on-n. The simple task of matching letters on a grid of this size is computationally intensive, and clever optimization is necessary but also dangerous: cutting corners can lead to the obfuscation of meaningful data.

Apart from its size, there is another reason why biological sequence data is notoriously difficult to work with when performing alignment searches: biological data is not static. Whenever DNA is replicated, mistakes are made. Whenever a gene is transcribed or a transcript is translated, the possibility for error arises as well. This propensity for error is at the very heart of biology, as it is believed to be the molecular driving force behind the ability of living organisms to adapt to their environment. This elegant system of iterative adaptation, however, makes biological data even more complex. Random mutations and other irregularities in biological data (SNVs, CNVs, inversions, etc.) make it difficult to differentiate between "natural noise" and meaningful differences.

All of these properties make biological datasets challenging on a conceptual and mathematical level. Even the simplest case of finding a DNA pattern in a biological database is, in a mathematical sense, not a well-posed problem, which means that no single static solution may exist.

Sequence alignment: dynamic programming is slow but reliable

Many solutions to the sequence similarity-searching problem have been found. In essence, they all try to do one thing: given a set of query sequences (of any nature), find the arrangement in which the largest number of similar or identical units (typically amino acids or bases) align with each other. Dynamic programming is the earliest method developed for aligning sequences and remains a gold standard in terms of quality; a minimal sketch follows below.
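The following is a bare-bones Needleman-Wunsch global alignment in pure Python, intended only to illustrate the dynamic-programming idea of filling a scoring grid and tracing back one optimal alignment; the match, mismatch, and gap scores are arbitrary and the code is not optimized for real workloads.

```python
# Minimal Needleman-Wunsch global alignment (dynamic programming).
# Illustrative only: simple match/mismatch/gap scores, no optimizations.

def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    n, m = len(a), len(b)
    # Score grid: score[i][j] = best score aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Traceback: rebuild one optimal alignment from the filled grid.
    aln_a, aln_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            aln_a.append(a[i - 1])
            aln_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aln_a.append(a[i - 1])
            aln_b.append("-")
            i -= 1
        else:
            aln_a.append("-")
            aln_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(aln_a)), "".join(reversed(aln_b)), score[n][m]

top, bottom, total = needleman_wunsch("GATTACA", "GCATGCU")
print(top, bottom, total, sep="\n")
```

The quadratic scoring grid is exactly why this approach stops scaling once many or very long sequences are involved, which motivates the heuristic and index-based methods discussed next.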
Computationally, however, dynamic programming is sub-optimal, and it is the recommended method of choice only when alignments involve two, three, or four sequences. These methods are, in other words, not scalable. Commonly used dynamic programming algorithms for sequence alignment are the Needleman-Wunsch algorithm and the Smith-Waterman algorithm, developed in the 1970s and '80s. A standard dynamic programming approach first constructs alignment spaces for all pairs of input sequences, creating a collection of one-on-one alignments that are merged into an n-level alignment grid, where n is the number of query sequences.

Although laborious, dynamic programming has the advantage of always leading to an optimal solution. In contrast to heuristics, discussed below, dynamic programming methods do not cut corners, which makes them the method of choice when a low number of sequences need to be aligned. Another advantage of dynamic programming is that it can be easily applied through open-source Python tool collections such as Biopython (https://biopython.org/), which contains the Bio.pairwise2 module for simple pairwise sequence alignment, and other modules for more complex alignments.

Sequence alignment: heuristics are fast but cut corners

Heuristics are defined as practical approaches to solving a data problem that do not guarantee an optimal outcome. In other words, the alignment produced by a heuristic algorithm may not be the one representing the most sequence similarity. While this sounds like a serious caveat (after all, who wants a sub-optimal solution?), their practical nature makes heuristic algorithms much less computationally intensive than dynamic programming methods. In fact, when solving complex multiple-alignment problems, heuristics offer the only workable solution, because a classical dynamic programming approach to the same problem would take days or weeks of computation time.

The first popular heuristic method for sequence alignment was FASTA, developed in 1985, and it was soon followed by BLAST in the 1990s. The prime achievement of these heuristics is the use of word methods, or k-tuples. "Words", in these methods, are short subsequences taken from the query and matched against a database of other sequences. By performing an initial alignment with these words, sequences can be offset, creating a relative position alignment that greatly speeds up the rest of the alignment procedure. Note that the FASTA method should not be confused with the FASTA file format, which is the default input format for FASTA alignment software but has also become the industry standard in bioinformatics for DNA, RNA, and protein sequences over the years.

Many heuristics for sequence alignment are progressive methods, meaning that they build up an alignment grid by first aligning the two most similar sequences and iteratively adding less and less similar sequences to the grid until all sequences are incorporated. One pitfall of this method is that the initial choice of the "most related sequences" carries a lot of weight: if the initial estimate of which sequences are most related is incorrect, the accuracy of the final alignment suffers. Common progressive heuristic methods are Clustal and T-Coffee.

A new gold standard: the need for optimization and indexing

Neither of the two categories discussed above, dynamic programming and heuristics-based approaches, is perfect.
One lacks the computational efficiency of a truly scalable tool, while the other may miss vital information. The need for a tool that combines the strengths of dynamic programming and heuristic methods while avoiding their pitfalls is high, because databases are becoming increasingly complex and data analysis is becoming a bottleneck in many pipelines. One way to tackle this problem is by using techniques inspired by modern computer science, such as optimization and indexing.

Optimization algorithms such as hidden Markov models are especially good at aligning remotely related sequences, but still regularly fall short of more traditional methods such as dynamic programming and heuristic approaches. Indexing, on the other hand, adopts a Google-like approach, using algorithms from natural language processing to discover short, informative patterns in biological sequence data, which can then be abstracted and indexed for fast retrieval across all molecular layers. Using this method, no pre-selected search window has to be specified, and thus bias is avoided. Below, a short case study describes the search for robust domains in the variable region of immunoglobulins using HYFT™ patterns, which allow for ultra-fast, ultra-precise sequence alignment.

Case study: finding robust domains in the variable region of immunoglobulins

Immunoglobulins, or antibodies, are versatile, clinically relevant proteins with a wide range of applications in disease diagnosis and therapy. Complex diseases, including many types of cancer, are increasingly treated with monoclonal antibody therapies. Key to developing these therapies is the characterization of sequence similarity in immunoglobulin variable regions. While this challenge can be approached using classical dynamic programming or heuristics, the performance of the former is poor and the latter may miss binding sites because of its limited search window. Indexing methods based on HYFT patterns, by contrast, search the complete sequence with optimal speed.

Immunoglobulin protein sequences from the PDB are decomposed into HYFT patterns, which form fast and searchable abstractions of the sequences. Next, all sequences are aligned based on their HYFT patterns; the outcome is shown below (Figure 1). The algorithm returns 900 non-equivalent sequences, which are aligned in the constant region of the immunoglobulin (red) and show more variation in the variable region (blue). However, the variable region is not completely random, and a quick search already reveals many conserved domains in what is classically thought of as a highly variable region. This search, which took less than one second and did not need any preprocessing or parameter tweaking, shows that index-based methods of sequence alignment hold great promise for bioinformatics and may become the industry standard in the coming years. For a video demonstration of this case study, see Prof. Dirk Valkenborgh's (UHasselt) talk at GSK Meets Universities (https://info.biostrand.be/en/gskmeetsuniversities).

Figure 1: Alignment of immunoglobulins based on HYFT patterns.

Conclusion

While many solutions already exist for the sequence alignment problem, the most commonly used dynamic programming and heuristic approaches still suffer from pitfalls inherent in their design. New methods emerging from computer science, relying on optimization and indexing, will likely provide a leap forward in the performance and accuracy of sequence alignment methods.

Image source: Adobe Stock © siarhei 335010335