In a recent article on knowledge graphs and large language models (LLMs) in drug discovery, we noted that despite the transformative potential of LLMs in drug discovery, several critical challenges have to be addressed to ensure that these technologies conform to the rigorous standards demanded by life sciences research. Synergizing knowledge graphs with LLMs in one bidirectional data- and knowledge-based reasoning framework addresses several concerns related to hallucinations and lack of interpretability. However, that still leaves the challenge of giving LLMs access to external data sources that address their limitations with respect to factual accuracy and up-to-date knowledge recall. Retrieval-augmented generation (RAG), together with knowledge graphs and LLMs, is the third critical node in the trifecta of techniques required to robustly and reliably integrate the transformative potential of language models into drug discovery pipelines.

Why retrieval-augmented generation?

One of the key limitations of general-purpose LLMs is their training data cutoff, which means that their responses to queries are typically out of step with the rapidly evolving state of information. This is a serious drawback, especially in fast-paced domains like life sciences research. Retrieval-augmented generation enables biomedical research pipelines to optimize LLM output by:

- Grounding the language model on external sources of targeted and up-to-date knowledge, constantly refreshing the LLM's internal representation of information without having to completely retrain the model. This ensures that responses are based on the most current data and are more contextually relevant.
- Providing access to the model's sources, so that responses can be validated and claims can be checked for relevance and accuracy.

In short, retrieval-augmented generation provides the framework necessary to augment the recency, accuracy, and interpretability of LLM-generated information.

How does retrieval-augmented generation work?

Retrieval-augmented generation is a natural language processing (NLP) approach that combines elements of information retrieval and text generation models to enhance performance on knowledge-intensive tasks. The retrieval component aggregates information relevant to a specific query from a predefined set of documents or knowledge sources, which then serves as the context for the generation model. Once the information has been retrieved, it is combined with the input to create an integrated context containing both the original query and the relevant retrieved information. This integrated context is then fed into a generation model to produce an accurate, coherent, and contextually appropriate response based on both pre-trained knowledge and retrieved, query-specific information.

The RAG approach gives life sciences research teams more control over the grounding data used by a biomedical LLM by honing it on enterprise- and domain-specific knowledge sources. It also enables the integration of a range of external data sources, such as document repositories, databases, or APIs, that are most relevant to enhancing the model's response to a query.
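To make the retrieve-then-generate loop concrete, here is a minimal sketch in Python. The retrieval step uses TF-IDF similarity over a tiny in-memory document store, and the generate function is a placeholder standing in for whatever LLM the pipeline actually calls; the document snippets, function names, and prompt wording are illustrative assumptions, not part of any specific product.

```python
# Minimal retrieval-augmented generation (RAG) sketch.
# Retrieval: TF-IDF + cosine similarity over a small document store.
# Generation: a stub standing in for any LLM call (hypothetical).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The SARS-CoV-2 spike protein mediates cell entry and is a key vaccine target.",
    "Monoclonal antibodies are engineered to bind specific epitopes on target antigens.",
    "Retrieval-augmented generation grounds model output in external knowledge sources.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]

def generate(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., a hosted or local model)."""
    return f"[LLM response conditioned on prompt of {len(prompt)} characters]"

def answer_with_rag(query: str) -> str:
    """Build an integrated context (query + retrieved passages) and generate."""
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the context below, citing the passages used.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)

print(answer_with_rag("What role does the spike protein play in vaccine design?"))
```

In a production biomedical pipeline the toy document store would be replaced by curated literature, databases, or APIs, and the stub by an actual model call, but the retrieve-integrate-generate structure stays the same.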
The value of RAG in biomedical research

Conceptually, the retrieve-and-generate model's ability to handle dynamic external information sources, minimize hallucinations, and enhance interpretability makes it a natural and complementary fit for augmenting the performance of BioLLMs. To quantify this augmentation, a recent research effort evaluated a retrieval-augmented generative agent on biomedical question answering against LLMs (GPT-3.5/4), state-of-the-art commercial tools (Elicit, scite, and Perplexity), and humans (biomedical researchers).

The RAG agent, PaperQA, was first evaluated against a standard multiple-choice LLM-evaluation dataset, PubMedQA, with the provided context removed to test the agent's ability to retrieve information. In this setting, the RAG agent beat GPT-4 by nearly 30 points (86.3% vs. 57.9%). Next, the researchers constructed a more complex and more contemporary dataset (LitQA), based on recent full-text research papers outside the bounds of the LLMs' pre-training data, to compare the integrated abilities of PaperQA, LLMs, and human researchers to retrieve the right information and generate an accurate answer from it. Again, the RAG agent outperformed both pre-trained LLMs and commercial tools, with overall accuracy (69.5%) and precision (87.9%) scores on par with biomedical researchers. More importantly, the RAG model produced zero hallucinated citations, compared with 40-60% for the LLMs.

Although this is only a narrow evaluation of the retrieval-plus-generation approach in biomedical QA, the research does demonstrate the significantly enhanced value that RAG combined with BioLLMs can deliver compared with purely generative AI. The combined sophistication of retrieval and generation models can be harnessed to enhance the accuracy and efficiency of a range of processes across the drug discovery and development pipeline.

Retrieval-augmented generation in drug discovery

In the context of drug discovery, RAG can be applied to a range of tasks, from literature reviews to biomolecule design. Generative models have demonstrated potential for de novo molecular design but are still hampered by their inability to integrate multimodal information or provide interpretability. The RAG framework can facilitate the retrieval of multimodal information from a range of sources, such as chemical databases, biological data, clinical trials, and images, that can significantly augment generative molecular design.

The same retrieve-plus-generate template applies to a whole range of applications in drug discovery, for example:

- Compound design: retrieve compounds and their properties, and generate improvements or new properties.
- Drug-target interaction prediction: retrieve known drug-target interactions, and generate potential interactions between new compounds and specific targets.
- Adverse effects prediction: retrieve known adverse effects, and generate modifications to eliminate them.

The template even applies to several sub-processes and sub-tasks within drug discovery, leveraging a broader swathe of existing knowledge to generate novel, reliable, and actionable insights.
In target validation, for example, retrieval-augmented generation can enable a comprehensive generative analysis of a target of interest based on an extensive review of all existing knowledge about the target: its expression patterns and functional roles, known binding sites, pertinent biological pathways and networks, potential biomarkers, and more. In short, the more efficient and scalable retrieval of timely information ensures that generative models are grounded in factual, sourceable knowledge, a combination with limitless potential to transform drug discovery.

An integrated approach to retrieval-augmented generation

Retrieval-augmented generation addresses several critical limitations of BioLLMs and augments their generative capabilities. However, additional design rules and multiple technological profiles have to come together to successfully address the specific requirements and challenges of life sciences research.

Our LENSai™ integrated intelligence platform seamlessly unifies the semantic proficiency of knowledge graphs, the versatile information retrieval capabilities of retrieval-augmented generation, and the reasoning capabilities of large language models to reinvent the understand-retrieve-generate cycle in biomedical research. Our unified approach empowers researchers to query a harmonized life science knowledge layer that integrates unstructured information and ontologies into a knowledge graph. A semantic-first approach enables a more accurate understanding of research queries, which in turn results in the retrieval of the content most pertinent to the query. The platform also integrates retrieval-augmented generation with structured biomedical data from our HYFT technology to enhance the accuracy of generated responses. Finally, LENSai combines deep learning LLMs with neuro-symbolic logic techniques to deliver comprehensive and interpretable outcomes.

To experience this unified solution in action, please contact us here.
Reproducibility, the ability to get the same results using the original data and analysis strategy, is fundamental to valid, credible, and actionable scientific research. Without reproducibility, replicability, the ability to confirm research results within different data contexts, becomes moot. A 2016 survey of researchers revealed a consensus that there is a crisis of reproducibility, with most researchers reporting that they had failed to reproduce not only the experiments of other scientists (70%) but even their own (more than 50%). In biomedical research, reproducibility testing is still extremely limited, and some attempts to do so have failed to comprehensively or conclusively validate reproducibility and replicability.

Over the years, there have been several efforts to assess and improve reproducibility in biomedical research. However, a new front is opening in the reproducibility crisis, this time in ML-based science. According to one study, the increasing adoption of complex ML models is creating widespread data leakage, resulting in "severe reproducibility failures," "wildly overoptimistic conclusions," and the inability to validate the superior performance of ML models over conventional statistical models (a minimal illustration of leakage-free evaluation appears later in this article).

Pharmaceutical companies have generally been cautious about accepting published results for a number of reasons, including the lack of scientifically reproducible data. An inability to reproduce and replicate preclinical studies can adversely impact drug development and has also been linked to drug and clinical trial failures. As drug development enters its latest innovation cycle, powered by computational in silico approaches and advanced AI-CADD integrations, reproducibility represents a significant obstacle to converting biomedical research into real-world results.

Reproducibility in in silico drug discovery

The increasingly computational nature of modern scientific research has already prompted a significant shift, with some journals incentivizing authors and providing badges for reproducible research papers. Many scientific publications also mandate the publication of all relevant research resources, including code and data. In 2020, eLife launched Executable Research Articles (ERAs), which allow authors to add live code blocks and computed outputs to create computationally reproducible publications. However, creating a robust reproducibility framework to sustain in silico drug discovery will require more transformative developments across three key dimensions: infrastructure and incentives for reproducibility in computational biology, reproducible ecosystems in research, and reproducible data management.

Reproducible computational biology

This approach to industry-wide transformation envisions a fundamental cultural shift, with reproducibility as the fulcrum for all decision-making in biomedical research. The focus is on four key domains. First, creating courses and workshops that expose biomedical students to specific computational skills and real-world biological data analysis problems and impart the skills required to produce reproducible research. Second, promoting truly open data sharing, along with all relevant metadata, to encourage larger-scale data reuse. Third, leveraging platforms, workflows, and tools that support the open data and open code model of reproducible research. And fourth, promoting, incentivizing, and enforcing reproducibility by adopting FAIR principles and mandating source code availability.
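Returning to the data-leakage problem flagged earlier in this article, the following sketch contrasts a leaky evaluation (feature selection fitted on the full dataset before cross-validation) with a leakage-free one (all preprocessing kept inside a scikit-learn Pipeline). The synthetic dataset and estimator choices are arbitrary placeholders, not a reproduction of the study cited above.

```python
# Illustrative sketch: how data leakage inflates ML performance estimates,
# and how to avoid it with a Pipeline. Dataset and estimator are placeholders.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A small, noisy dataset with many uninformative features (common in omics-style data).
X, y = make_classification(n_samples=100, n_features=2000, n_informative=10, random_state=0)

# LEAKY: feature selection is fit on ALL samples (including future test folds)
# before cross-validation, so information from the test data leaks into training.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# LEAKAGE-FREE: scaling and feature selection live inside the pipeline,
# so they are re-fit on the training portion of every fold.
pipe = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(pipe, X, y, cv=5)

print(f"Leaky estimate:        {leaky_scores.mean():.2f}")   # typically optimistic
print(f"Leakage-free estimate: {clean_scores.mean():.2f}")   # closer to reality
```

Keeping every data-dependent step inside the pipeline is what allows the reported score to be reproduced on genuinely unseen data.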
Computational reproducibility ecosystem

A reproducible ecosystem should enable data and code to be seamlessly archived, shared, and used across multiple projects. Computational biologists today have access to a broad range of open-source and commercial resources to ensure their ecosystem generates reproducible research. For instance, data can now be shared across several recognized, domain- and discipline-specific public data repositories such as PubChem and CDD Vault. Public and private code repositories, such as GitHub and GitLab, allow researchers to submit and share code with researchers around the world. And then there are computational reproducibility platforms, such as Code Ocean, that enable researchers to share, discover, and run code.

Reproducible data management

As per a recent data management and sharing (DMS) policy issued by the NIH, all applications for funding have to be accompanied by a DMS plan detailing the strategy and budget to manage and share research data. Sharing scientific data, the NIH points out, accelerates biomedical research discovery by validating research, increasing data access, and promoting data reuse.

Effective data management is critical to reproducibility, and creating a formal data management plan prior to the commencement of a research project helps clarify two key facets of the research: one, key information about experiments, workflows, and the types and volumes of data generated; and two, research output formats, metadata, storage, and access and sharing policies.

The next critical step towards reproducibility is having the right systems to document the process, including data and metadata, methods and code, and version control. For instance, reproducibility in in silico analyses relies extensively on metadata to define scientific concepts as well as the computing environment. In addition, metadata plays a major role in making data FAIR. It is therefore important to document experimental and data analysis metadata in an established standard and store it alongside research data. Similarly, the ability to track and document datasets as they adapt, reorganize, extend, and evolve across the research lifecycle will be crucial to reproducibility. Data should therefore be version-controlled so that results can be traced back to the precise subset and version used (a minimal sketch of this practice appears at the end of this article).

Of course, the end game for all of this has to be the sharing of data and code, which is increasingly becoming a prerequisite as well as a voluntarily accepted practice in computational biology. One survey of 188 researchers in computational biology found that paper authors were largely satisfied with their ability to carry out key code-sharing tasks, such as ensuring good documentation and that the code runs in the correct environment. The average researcher, however, would not commit any more time, effort, or expenditure to sharing code. And there are still certain perceived barriers that need to be addressed before the public archival of biomedical research data and code becomes prevalent.

The future of reproducibility in drug discovery

A 2014 report from the American Association for the Advancement of Science (AAAS) estimated that the U.S. alone spent approximately $28 billion yearly on irreproducible preclinical research. In the future, blockchain-based frameworks may well enable the automated verification of the entire research process. Meanwhile, in silico drug discovery has emerged as one of the maturing innovation areas in the pharmaceutical industry.
The alliance between pharmaceutical companies and research-intensive universities has been a key component in de-risking drug discovery and enhancing its clinical and commercial success. Reproducibility-related improvements and innovations will help move this alliance to a data-driven, AI/ML-based, in silico model of drug discovery.
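As a small complement to the data-versioning practice described in the data management section above, here is a minimal, illustrative sketch of content-addressed dataset versioning: each data file is fingerprinted by a hash of its contents and recorded with basic provenance metadata. It is a toy stand-in for dedicated tools such as DVC or Git LFS, and the file names used are hypothetical.

```python
# Toy sketch of dataset versioning: fingerprint a data file by its content hash
# and record it with minimal provenance metadata, so any result can be traced
# back to the exact data version used. File paths here are hypothetical examples.

import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def register_version(data_path: Path, registry_path: Path, description: str) -> dict:
    """Append a version record (hash + metadata) to a JSON-lines registry."""
    record = {
        "file": str(data_path),
        "sha256": file_sha256(data_path),
        "size_bytes": data_path.stat().st_size,
        "registered_at": datetime.now(timezone.utc).isoformat(),
        "description": description,
    }
    with registry_path.open("a") as registry:
        registry.write(json.dumps(record) + "\n")
    return record

# Example usage (hypothetical files):
# record = register_version(Path("counts_matrix.tsv"), Path("data_versions.jsonl"),
#                           "RNA-seq counts, batch 3, post-QC")
# print(record["sha256"][:12])
```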
Conventional vaccine development, still based predominantly on systems developed in the last century, is a complex process that takes between 10 and 15 years on average. Until the COVID-19 pandemic, when two mRNA vaccines went from development to deployment in less than a year, the record for the fastest development of a new vaccine, four years, had gone unchallenged for over half a century.

This revolutionary boost to the vaccine development cycle stemmed from two uniquely 21st-century developments. First, access to cost-effective next-generation sequencing (NGS) technologies with significantly enhanced speed, coverage, and accuracy enabled the rapid sequencing of the SARS-CoV-2 virus. And second, innovative state-of-the-art bioinformatics technologies were available to convert raw data into actionable insights, without which NGS would have just resulted in huge stockpiles of dormant or dark data. In the case of COVID-19, cutting-edge bioinformatics approaches played a critical role in enabling researchers to quickly home in on the spike protein gene as the vaccine candidate.

NGS technologies and advanced bioinformatics solutions have been pivotal in mitigating the global impact of COVID-19, providing the tools required for detection, tracking, containment, and treatment, the identification of biomarkers, the discovery of potential drug targets, drug repurposing, and the exploration of other therapeutic opportunities. The combination of gene engineering and information technologies is already creating the foundation for the fourth generation of sequencing technologies, promising faster and more cost-effective whole-genome sequencing and disease diagnosis. As a result, continuous innovation has become an evolutionary imperative for modern bioinformatics: it has to keep up with the developmental pace of NGS technologies and accelerate the transformation of an exponentially increasing trove of data into knowledge.

The raw volume and velocity of sequence data, however, is just one facet of big data genomics. Today, bioinformatics solutions have to cope with a variety of complex data, in heterogeneous formats, from diverse data sources and different sequencing methods connected to different -omes, and relating to different characteristics of genomes. More importantly, the critical focus of next-generation bioinformatics technologies has to be on catalysing new pathways and dimensions in biological research that can drive transformative change in precision medicine and public health. In the following sections, we look at the current evolutionary trajectory of bioinformatics in the context of three key omics analysis milestones.

Three key milestones in the evolution of bioinformatics

The steady evolution of bioinformatics over the past two decades into a cross-disciplinary and advanced computational practice has enabled several noteworthy milestones in omics analysis. The following three are significant because they best showcase the growth and expansion of omics research across multiple biological layers and dimensions, all made possible by a new breed of bioinformatics solutions.

Integrated multi-omics

For years, omics data has provided the requisite basis for the molecular characterisation of various diseases.
However, genomic studies of diseases such as cancer invariably include data from heterogeneous sources, and understanding cross-data associations and interactions can reveal deep molecular insights into complex biological processes that are simply not accessible through single-source analysis. Combining data across metabolomics, genomics, transcriptomics, and proteomics can reveal hidden associations and interactions between omics variables, elucidate the complex relationships between molecular layers, and enable a holistic, pathway-oriented view of biology.

An integrated and unified approach to multi-omics analysis has a range of novel applications in the prediction, detection, and prevention of various diseases, in drug discovery, and in designing personalised treatments. And, thanks to the development of next-generation bioinformatics platforms, it is now possible to integrate not just omics data but all types of relevant medical, clinical, and biological data, both structured and unstructured, under a unified analytical framework for a truly integrated approach to multi-omics analysis.

Single-cell multi-omics

Where multi-omics approaches focus on the interactions between omics layers to clarify complex biological processes, single-cell multi-omics enables the simultaneous and comprehensive analysis of the unique genotypic and phenotypic characteristics of single cells, as well as the regulatory mechanisms that are evident only at single-cell resolution. Earlier approaches to single-cell analysis involved the synthesis of data from individual cells, followed by computationally linking different modalities across cells. With next-generation multi-omics technologies, it is now possible to look at each cell directly in multiple ways and perform multiple analyses at the single-cell level.

Today, advanced single-cell multi-omics technologies can measure a wide range of modalities, including genomics, transcriptomics, epigenomics, and proteomics, to provide ground-breaking insights into cellular phenotypes and biological processes. Best-in-class solutions provide the framework required to seamlessly integrate huge volumes of granular data across multiple experiments, measurements, cell types, and organisms, and facilitate the integrative and comprehensive analysis of single-cell data.

Spatial transcriptomics

Single-cell RNA sequencing enabled a more fine-grained assessment of each cell's transcriptome. However, single-cell sequencing techniques are limited to tissue-dissociated cells that have lost all spatial information. Delineating the positional context of cell types within a tissue is important for several reasons, including the need to understand the chain of information between cells in a tissue, to correlate cell groups and cellular functions, and to identify differences in cell distribution between normal and diseased cells.

Spatial single-cell transcriptomics, or spatialomics, considered the next wave after single-cell analysis, combines imaging and single-cell sequencing to map the position of particular transcripts in a tissue, thereby revealing where particular genes are expressed and indicating the functional context of individual cells. Even though many bioinformatics capabilities for the analysis of single-cell RNA-seq data are shared with spatially resolved data, analysis pipelines diverge at the level of the quantification matrix, requiring specialised tools to extract knowledge from spatial data.
However, there are advanced analytics platforms that use a unique single-data framework to ingest all types of data, including spatial coordinates, for integrated analysis.

Quo vadis, bioinformatics?

Bioinformatics will continue to evolve alongside, if not ahead of, emerging needs and opportunities in biological research. But if there is one key takeaway from the examples cited here, it is that a reductionist approach, one that is limited to a single omics modality, discipline, or even dimension, yields limited and often suboptimal results. If bioinformatics is to continue driving cutting-edge biological research that tackles some of the most complex questions of our times, then the focus needs to be on developing a more holistic, systems bioinformatics approach to analysis.

Systems bioinformatics is not an entirely novel concept, though its application is not particularly commonplace. It applies a well-defined systems-approach framework to the entire spectrum of omics data, with the emphasis on defining the level of resolution and the boundary of the system of interest in order to study the system as a whole, rather than as a sum of its components. The focus is on combining the bottom-up approach of systems biology with the data-driven, top-down approach of classical bioinformatics to integrate different levels of information.

The advent of multi-omics has, quite paradoxically, only served to accentuate the inherently siloed nature of omics approaches. Even though the pace of bioinformatics innovation has picked up over the past couple of decades, the broader practice is still mired in a fragmented multiplicity of domain-, project-, or data-specific solutions and pipelines. There is still a dearth of integrated end-to-end solutions with the capabilities to integrate multi-modal datasets, scale effortlessly from the study of specific molecular mechanisms to system-wide analysis of biological systems, and empower collaboration across disciplines and research communities. Integration at scale and across disciplines, datasets, sources, and computational methodologies is now the grand challenge for bioinformatics, and it represents the first step towards a future of systems bioinformatics.
The exponential generation of data by modern high-throughput, low-cost next-generation sequencing (NGS) technologies is set to revolutionise genomics and molecular biology and enable a deeper and richer understanding of biological systems. And it is not just about greater volumes of highly accurate, multi-layered data. It is also about more types of omics datasets, such as glycomics, lipidomics, microbiomics, and phenomics.

The increasing availability of large-scale, multidimensional, and heterogeneous datasets has the potential to open up new insights into biological systems and processes, improve and increase diagnostic yield, and pave the way for a shift from reductionist biology to a more holistic, systems biology approach to decoding the complexities of biological entities. It has already been established that multi-dimensional analysis, as opposed to single-layer analysis, yields better results from both a statistical and a biological point of view, and can have a transformative impact on a range of research areas, such as genotype-phenotype interactions, disease biology, systems microbiology, and microbiome analysis.

However, applying systems-thinking principles to biological data requires the development of radically new integrative techniques and processes that can enable the multi-scale characterisation of biological systems. Combining and integrating diverse types of omics data from different layers of biological regulation is the first computational challenge, and the next big opportunity, on the way to enabling a unified end-to-end workflow that is truly multi-omics. The challenge is quite colossal: a 2019 article in the Journal of Molecular Endocrinology describes the successful integration of more than two datasets as very rare.

Data integration challenges in multi-omics

Analysing omics datasets at just one level of biological complexity is challenging enough. Multi-omics analysis amplifies those challenges and introduces new complications around data integration and fusion, clustering, visualisation, and functional characterisation. For instance, accommodating the inherent complexity of biological systems, the sheer number of biological variables, and the relatively low number of biological samples can on its own turn out to be a particularly difficult assignment. Over and above this, there is a litany of other issues, including process variations in data cleaning and normalisation, data dimensionality reduction, biological contextualisation, biomolecule identification, and statistical validation.

Data heterogeneity, arguably the raison d'être for integrated omics, is often the primary hurdle in multi-omics data management. Omics data is typically distributed across multiple silos defined by domain, type, and access (public or proprietary), to name just a few variables. More often than not, there are significant variations between datasets in terms of the technologies and platforms used to generate them, nomenclature, data modalities, assay types, and so on. Data harmonisation therefore becomes a standard pre-integration process. But the processes for data scaling, normalisation, and transformation used to harmonise data can vary across dataset types and sources. For example, the normalisation and scaling techniques appropriate for RNA-seq datasets differ from those for small RNA-seq datasets.
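To make the harmonisation step concrete, the sketch below shows one common pre-integration transformation for bulk RNA-seq counts: counts-per-million scaling followed by a log transform. It is only an example of the kind of normalisation discussed above; real pipelines select methods appropriate to each assay type (small RNA-seq, for instance, typically needs different handling), and the gene and sample names are made up.

```python
# Minimal sketch: CPM (counts-per-million) normalisation plus log transform
# for a bulk RNA-seq count matrix, one common harmonisation step before
# multi-omics integration. Sample and gene names are hypothetical.

import numpy as np
import pandas as pd

# Rows = genes, columns = samples (raw read counts).
counts = pd.DataFrame(
    {"sample_1": [500, 1500, 30], "sample_2": [800, 2400, 10]},
    index=["gene_A", "gene_B", "gene_C"],
)

def cpm_log_normalise(counts: pd.DataFrame) -> pd.DataFrame:
    """Scale each sample to counts-per-million, then apply log2(x + 1)."""
    library_sizes = counts.sum(axis=0)              # total reads per sample
    cpm = counts.divide(library_sizes, axis=1) * 1e6
    return np.log2(cpm + 1)

normalised = cpm_log_normalise(counts)
print(normalised.round(2))
```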
Multi-omics data integration has its own set of challenges, including a lack of reliability in parameter estimation, difficulty preserving accuracy in statistical inference, and the prevalence of large standard errors. There are several tools currently available for multi-omics data integration, though they come with their own limitations. For example, there are web-based tools that require no computational experience, but the lack of visibility into their underlying processes makes it a challenge to deploy them for large-scale scientific research. At the other end of the spectrum, there are more sophisticated tools that afford more customisation and control, but these require considerable expertise in computational techniques.

In this context, the development of a universal standard or unified framework for pre-analysis, let alone an integrated end-to-end pipeline for multi-omics analysis, seems rather daunting. However, if multi-omics analysis is to yield diagnostic value at scale, it is imperative that it quickly evolves from a dispersed syndicate of tools, techniques, and processes into a new integrated multi-omics paradigm that is versatile, computationally feasible, and user-friendly.

A platform approach to multi-omics analysis

The data integration challenge in multi-omics essentially boils down to this: either there has to be a technological innovation designed specifically to handle the fine-grained and multidimensional heterogeneity of biological data, or there has to be a biological discovery that unifies all omics data and makes it instantly computable even for conventional technologies. At Mindwalk, we took the latter route and came up with HYFTs™, a biological discovery that can instantly make all omics data computable.

Normalising and integrating data with HYFTs™

We started with a new technique for indexing cellular blueprints and building blocks and used it to identify and catalogue unique signature sequences, or biological fingerprints, in DNA, RNA, and amino acid sequences, which we call HYFT™ patterns. Each HYFT™ comprises multiple layers of information, relating to function, structure, position, and more, which together create a multilevel information network. We then designed a Mindwalk parser to identify, collate, and index HYFTs™ from over 450 million sequences available across 11 popular public databases. This helped us create a proprietary pangenomic knowledge database of over 660 million HYFT™ patterns containing information about variation, mutation, structure, and more. Based on our biological discovery, we were able to normalise and integrate all publicly available omics data, including patent data, at scale, and render it multi-omics analysis-ready. The same HYFT™ IP can also be applied to normalise and integrate proprietary omics data.

Making 660 million data points accessible

That is a lot of data points, so we made them searchable. With Google-like advanced indexing and exact-matching technologies, only exact matches to search inputs are returned. Through a simple search interface, using plain text or a FASTA file, researchers can now accurately retrieve all relevant information about sequence alignments, similarities, and differences from a centralised knowledge base covering millions of organisms in just three seconds.
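The sketch below illustrates the general idea of indexing short signature subsequences for exact-match retrieval, using a plain k-mer inverted index. It is a generic, simplified illustration of index-based search, not a description of the proprietary HYFT technology, and the sequences are invented for the example.

```python
# Toy sketch of signature-based, exact-match retrieval: build an inverted index
# that maps fixed-length subsequences (k-mers) to the sequences containing them,
# then answer queries by index lookup instead of scanning every sequence.
# Generic illustration only; not the proprietary HYFT technology.

from collections import defaultdict

def build_index(sequences: dict[str, str], k: int = 5) -> dict[str, set[str]]:
    """Map every k-mer to the set of sequence IDs in which it occurs."""
    index: dict[str, set[str]] = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i : i + k]].add(seq_id)
    return index

def search(index: dict[str, set[str]], query: str, k: int = 5) -> set[str]:
    """Return IDs of sequences containing every k-mer of the query
    (candidates for an exact match)."""
    kmers = [query[i : i + k] for i in range(len(query) - k + 1)]
    if not kmers:
        return set()
    hits = index.get(kmers[0], set()).copy()
    for kmer in kmers[1:]:
        hits &= index.get(kmer, set())
    return hits

sequences = {
    "seq_1": "ATGGCGTACGTTAGC",
    "seq_2": "TTGGCGTACGTAACC",
    "seq_3": "ATGAAACCCGGGTTT",
}
index = build_index(sequences)
print(search(index, "GCGTACG"))   # {'seq_1', 'seq_2'}
```

Because the index is built once and queried by lookup, search time stays roughly constant as the database grows, which is what makes this family of methods attractive at the scale described above.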
Synthesising knowledge with our AI-powered SaaS platform

Around these core capabilities, we built the Mindwalk SaaS platform with state-of-the-art AI tools to expand data management capabilities, mitigate data complexity, and empower researchers to intuitively synthesise knowledge out of petabytes of biological data. With our platform, researchers can easily add different types of structured and unstructured data, leverage advanced graph-based data mining features to extract insights from huge volumes of data, and use built-in genomic analysis tools for annotation and variation analysis.

Multi-omics as a platform

As omics datasets become more multi-layered and multidimensional, only a truly sequence-integrated multi-omics analysis solution can enable the discovery of novel and practically beneficial biological insights. With the Mindwalk platform, delivered as a SaaS, we believe we have created an integrated platform that enables a user-friendly, automated, intelligent, data-ingestion-to-insight approach to multi-omics analysis. It eliminates the data management challenges associated with conventional multi-omics analysis solutions and offers a cloud-based, platform-centric approach to multi-omics analysis that is key to usability and productivity.
Case study: finding robust domains in the variable region of immunoglobulins

Searching for similarity in biological databases is easy to grasp but hard to master. DNA, RNA, and protein sequence databases are often large, complex, and multi-dimensional. Conceptually simple approaches such as dynamic programming perform poorly when the alignment of multiple sequences is desired, and heuristic algorithms cut corners to gain speed. A new method, based on advances in computer science, may combine the best of both worlds and provide great performance without sacrificing accuracy.

Searching for similarity in biological sequences is challenging

Finding patterns in biological data is one of the most important parts of many data analysis workflows in the life sciences, such as omics analysis. To distinguish similarity from variance is to find meaning. Whether scientists are building evolutionary trees, identifying conserved domains in proteins of interest, or studying structure-function relationships, from DNA to RNA to amino acids, they all rely on a handful of methods for finding similarity and dissimilarity in biological sequences.

Searching and aligning sequences is, in essence, a problem of matching letters on a grid and assigning regions of high similarity versus regions of high variation. But nature has done a great deal to make this a challenging task. First, there is the sheer scope of the data: the human genome contains three billion base pairs, and sequence similarity searches are rarely limited to a simple one-on-one query. Aligning genomic sequences across large patient databases means that queries become n-on-n. The simple task of matching letters on a grid of this size is computationally intensive, and clever optimization is necessary but also dangerous: cutting corners can lead to the obfuscation of meaningful data.

Apart from its size, there is another reason why biological sequence data is notoriously difficult to work with when performing alignment searches: biological data is not static. Whenever DNA is replicated, mistakes are made. Whenever a gene is transcribed or a transcript is translated, the possibility for error arises as well. This propensity for error is at the very heart of biology, as it is believed to be the molecular driving force behind the ability of living organisms to adapt to their environment. This elegant system of iterative adaptation, however, makes biological data even more complex. Random mutations and other irregularities in biological data (SNVs, CNVs, inversions, etc.) make it difficult to differentiate between "natural noise" and meaningful differences.

All of these properties make biological datasets challenging on a conceptual and mathematical level. Even the simplest case of finding a DNA pattern in a biological database is, in a mathematical sense, not a well-posed problem, which means that no single static solution may exist.

Sequence alignment: dynamic programming is slow but reliable

Many solutions to the sequence similarity-searching problem have been found. In essence, they all try to do one thing: given a set of query sequences (of any nature), find the arrangement in which the largest number of similar or identical units (typically amino acids or bases) align with each other. Dynamic programming is the earliest method developed for aligning sequences and remains a gold standard in terms of quality; a minimal sketch follows below.
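The following is a bare-bones Needleman-Wunsch global alignment in pure Python, intended only to illustrate the dynamic-programming idea of filling a scoring grid and tracing back one optimal alignment; the match, mismatch, and gap scores are arbitrary and the code is not optimized for real workloads.

```python
# Minimal Needleman-Wunsch global alignment (dynamic programming).
# Illustrative only: simple match/mismatch/gap scores, no optimizations.

def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    n, m = len(a), len(b)
    # Score grid: score[i][j] = best score aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Traceback: rebuild one optimal alignment from the filled grid.
    aln_a, aln_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            aln_a.append(a[i - 1])
            aln_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aln_a.append(a[i - 1])
            aln_b.append("-")
            i -= 1
        else:
            aln_a.append("-")
            aln_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(aln_a)), "".join(reversed(aln_b)), score[n][m]

top, bottom, total = needleman_wunsch("GATTACA", "GCATGCU")
print(top, bottom, total, sep="\n")
```

The quadratic scoring grid is exactly why this approach stops scaling once many or very long sequences are involved, which motivates the heuristic and index-based methods discussed next.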
Computationally, however, dynamic programming is sub-optimal, and it is the recommended method of choice only when alignments involve two, three, or four sequences. These methods are, in other words, not scalable. Commonly used dynamic programming algorithms for sequence alignment are the Needleman-Wunsch algorithm and the Smith-Waterman algorithm, developed in the 1970s and '80s. A standard dynamic programming approach first constructs alignment spaces for all pairs of input sequences, creating a collection of one-on-one alignments that are merged into an n-level alignment grid, where n is the number of query sequences.

Although laborious, dynamic programming has the advantage of always leading to an optimal solution. In contrast to heuristics, discussed below, dynamic programming methods do not cut corners, which makes them the method of choice when a low number of sequences need to be aligned. Another advantage of dynamic programming is that it can be easily applied through open-source Python tool collections such as Biopython (https://biopython.org/), which contains the Bio.pairwise2 module for simple pairwise sequence alignment, and other modules for more complex alignments.

Sequence alignment: heuristics are fast but cut corners

Heuristics are defined as practical approaches to solving a data problem that do not guarantee an optimal outcome. In other words, the alignment produced by a heuristic algorithm may not be the one representing the most sequence similarity. While this sounds like a serious caveat (after all, who wants a sub-optimal solution?), their practical nature makes heuristic algorithms much less computationally intensive than dynamic programming methods. In fact, when solving complex multiple-alignment problems, heuristics offer the only workable solution, because a classical dynamic programming approach to the same problem would take days or weeks of computation time.

The first popular heuristic method for sequence alignment was FASTA, developed in 1985, and it was soon followed by BLAST in the 1990s. The prime achievement of these heuristics is the use of word methods, or k-tuples. "Words", in these methods, are short subsequences taken from the query and matched against a database of other sequences. By performing an initial alignment with these words, sequences can be offset, creating a relative position alignment that greatly speeds up the rest of the alignment procedure. Note that the FASTA method should not be confused with the FASTA file format, which is the default input format for FASTA alignment software but has also become the industry standard in bioinformatics for DNA, RNA, and protein sequences over the years.

Many heuristics for sequence alignment are progressive methods, meaning that they build up an alignment grid by first aligning the two most similar sequences and iteratively adding less and less similar sequences to the grid until all sequences are incorporated. One pitfall of this method is that the initial choice of the "most related sequences" carries a lot of weight: if the initial estimate of which sequences are most related is incorrect, the accuracy of the final alignment suffers. Common progressive heuristic methods are Clustal and T-Coffee.

A new gold standard: the need for optimization and indexing

Neither of the two categories discussed above, dynamic programming and heuristics-based approaches, is perfect.
One lacks the computational efficiency of a truly scalable tool, while the other may miss vital information. The need for a tool that combines the strengths of dynamic programming and heuristic methods while avoiding their pitfalls is high, because databases are becoming increasingly complex and data analysis is becoming a bottleneck in many pipelines. One way to tackle this problem is by using techniques inspired by modern computer science, such as optimization and indexing.

Optimization algorithms such as hidden Markov models are especially good at aligning remotely related sequences, but still regularly fall short of more traditional methods such as dynamic programming and heuristic approaches. Indexing, on the other hand, adopts a Google-like approach, using algorithms from natural language processing to discover short, informative patterns in biological sequence data, which can then be abstracted and indexed for fast retrieval across all molecular layers. Using this method, no pre-selected search window has to be specified, and thus bias is avoided. Below, a short case study describes the search for robust domains in the variable region of immunoglobulins using HYFT™ patterns, which allow for ultra-fast, ultra-precise sequence alignment.

Case study: finding robust domains in the variable region of immunoglobulins

Immunoglobulins, or antibodies, are versatile, clinically relevant proteins with a wide range of applications in disease diagnosis and therapy. Complex diseases, including many types of cancer, are increasingly treated with monoclonal antibody therapies. Key to developing these therapies is the characterization of sequence similarity in immunoglobulin variable regions. While this challenge can be approached using classical dynamic programming or heuristics, the performance of the former is poor and the latter may miss binding sites because of its limited search window. Indexing methods based on HYFT patterns, by contrast, search the complete sequence with optimal speed.

Immunoglobulin protein sequences from the PDB are decomposed into HYFT patterns, which form fast and searchable abstractions of the sequences. Next, all sequences are aligned based on their HYFT patterns; the outcome is shown below (Figure 1). The algorithm returns 900 non-equivalent sequences, which are aligned in the constant region of the immunoglobulin (red) and show more variation in the variable region (blue). However, the variable region is not completely random, and a quick search already reveals many conserved domains in what is classically thought of as a highly variable region. This search, which took less than one second and did not need any preprocessing or parameter tweaking, shows that index-based methods of sequence alignment hold great promise for bioinformatics and may become the industry standard in the coming years. For a video demonstration of this case study, see Prof. Dirk Valkenborgh's (UHasselt) talk at GSK Meets Universities (https://info.biostrand.be/en/gskmeetsuniversities).

Figure 1: Alignment of immunoglobulins based on HYFT patterns.

Conclusion

While many solutions already exist for the sequence alignment problem, the most commonly used dynamic programming and heuristic approaches still suffer from pitfalls inherent in their design. New methods emerging from computer science, relying on optimization and indexing, will likely provide a leap forward in the performance and accuracy of sequence alignment methods.

Image source: Adobe Stock © siarhei 335010335