The Blog

MindWalk is a biointelligence company uniting AI, multi-omics data, and advanced lab research into a customizable ecosystem for biologics discovery and development.

Data-driven biomedical research

data-driven biomedical research is the foundation of effective precision medicine (pm). of course, over the long term, the success of the pm model will be determined as much by the productivity of data-driven research as it will be by our ability to translate research‐based theories and discoveries into clinical precision medicine. the expanding universe of biomedical data this paradigm shift from a symptom-specific to a patient-specific approach to biomedical research and medicine is powered by exponential volumes of multimodal data from a wide range of domains including molecular data from primary/secondary biomedical research, clinical data from laboratory studies and ehrs, and non-clinical data pertaining to environmental, lifestyle and socio-economic factors. since the advent of the first omics discipline of genomics, the scope of omics technologies has constantly been expanding across multiple biological layers. advances in ngs and mass spectrometry technologies have significantly expanded our knowledge of the biomolecular milieu and ushered in a new era of multi-omics analysis. however, the explosion of data from primary and secondary research only created new challenges in multi-omics data integration. it has been estimated that genome sequence data alone will be the biggest big data domain of all by 2025, the same year that healthcare data is expected to start doubling every 73 days. all this meant that biomedical research needed new computational methods that were better at dealing with such huge volumes of data than traditional statistical approaches. ai/ml in biomedical research as omics data sets become more multi-layered and multidimensional, advanced computational methods powered by intelligent ai/ml technologies became the key to integrating and transforming complex multi-omics data sets into actionable knowledge. these technologies soon expanded beyond the omics domain into new fields, such as radiogenomics, a complementary field in precision medicine. ml/dl models have helped automate the assessment of radiological images for diagnosis, staging, and tumor segmentation while improving accuracy and reducing the time required. deep learning networks have also demonstrated high performance across low-level and high-level tasks in the rather complex field of histopathology image analysis. similar algorithmic approaches are now being adopted across a range of image analysis applications covering ct scans, mammographies, mri, etc. ai-based technologies like nlp have helped unlock access to unstructured textual data, i.e. documents, journal articles, blogs, emails, electronic health records, social media posts, etc., and bring the knowledge embedded in these data sets within the purview of integrated biomedical research. notwithstanding the immense potential of ai/ml technologies in biomedical research and, therefore, precision medicine, there are still several technical and ethical challenges that need to be addressed. the pm model is built on the principle of the integrated analysis of vast volumes of data, including proprietary, personal, and often sensitive information. the efficiency and accuracy of data-intensive ai/ml models depend on troves of representative data. however, this immediately raises a host of compliance issues around security, privacy, ownership, and consent. then there are the ethical questions of reliability, explainability, opacity, bias, trustworthiness, traceability, fairness, and moral responsibility. all these technical and ethical factors will have a collective influence on the evolution of data-driven research into a comprehensive and practical precision medicine model. however, as mentioned at the outset, successful implementation of the pm model will also depend on the critical ability to translate research‐based theories and discoveries into clinical practice. translational research & precision medicine according to a frequently cited statistic, there is a 17-year gap between the discovery of scientific evidence and its implementation in clinical practice. this gap between laboratory research and clinical practice is what the evolving scientific discipline of translational research seeks to address. translational research enables the transformation of theoretical and experimental knowledge into innovations at the point of care while also ensuring that there is a reverse flow of clinical information and insights to biomedical research. a discovery-to-practice precision medicine pipeline will require the integration of humongous volumes of biomedical data from high-throughput multi-omics technologies, multi-modal imaging and clinical data, secondary research, ehrs, scientific publications, real-world data and non-clinical digital devices and health apps, amongst others. the real-time integration and analysis of all this data will require the deployment of advanced ai tools and technologies in the clinical setting. however, the deployment of ai in clinical care, according to a 2022 report from the council of europe’s steering committee for human rights in the fields of biomedicine and health (cdbio), remains nascent. some of the report’s key observations include a huge lag between the scale of research activity vis-a-vis demonstrating clinical efficacy, inadequate generalization of performance from trials to clinical practice, and inability to translate research, development, and testing into broader clinical deployment. a multiparametric practice like personalized medicine requires advanced ai/ml capabilities that span the spectrum of translational medicine from research to testing, clinical practice, and patient management. despite increasing ambitions and investments in ml for health (ml4h) in routine clinical care, the deployment of these models is currently limited to isolated workflows, like radiology, for example. this gap between research ai and clinical ai is now being addressed by the new field of translational ml or translational ai. the futuristic vision for this approach is to combine discrete ai agents from across the spectrum of translational medicine into one collaborative translational ai that will enable real-time lab-to-clinic analysis. translational omics and precision medicine a more immediate approach to expanding the use of ai/ml in clinical practice could be to use omics as an ai strategy. ai-driven translational omics, the clinical utilization of molecular data derived from multiple biological domains, can create the framework for precision medicine initiatives. combining state-of-the-art multi-omics technologies with ai-based integration and analysis strategies has had a significant impact on cancer precision medicine, in terms of early screening, diagnosis, response assessment, and prognosis prediction. ai technologies have the proven potential to streamline the integration of multi-modal omics with imaging, phenotypic, ehr, and patient-specific data to generate more precise insights into disease biology and enhance routine clinical decision-making. therefore, an ai-enabled multi-omics strategy may be the first evolutionary step required toward realizing the vision of an end-to-end research-lab-to-clinical-care model of precision medicine.

Making sense of multi-omics data

we love multi-omics analysis. it is data-driven. it is continuously evolving and expanding across new modalities, techniques, and technologies. integrated multi-omics analysis is essential for a holistic understanding of complex biological systems and a foundational step on the road to a systems biology approach to innovation. and it is the key to innovation in biomedical and life sciences research, underpinning antibody discovery, biomarker discovery, and precision medicine, to name just a few. in fact, if you love multi-omics as much as we do, we have an extensive library of multi-perspective omics-related content just for you. however, today we will take a closer look at some of the biggest data-related challenges — data integration, data quality, and data fairness — currently facing integrative multi-omics analysis. data integration over the years, multiomics analysis has evolved beyond basic multi-staged integration, i.e combining just two data features at a time. nowadays, true multi-level data integration, which transforms all data of research interest from across diverse datasets into a single matrix for concurrent analysis, is the norm. and yet, multi-omics data integration techniques still span multiple categories based on diverse methodologies with different objectives. for instance, there are two distinct approaches to multi-level data integration: horizontal and vertical integration. the horizontal model is used to integrate omics data of the same type derived from different studies whereas the vertical model integrates different types of omics data from different experiments on the same cohort of samples. single-cell data integration further expands this classification to include diagonal integration, which further expands the scope of integration beyond the previous two methods, and mosaic integration, which includes features shared across datasets as well as features exclusive to a single experiment. the increasing use of ai/ml technologies has helped address many previous challenges inherent in multiomics data integration but has only added to the complexity of classification. for instance, vertical data integration strategies for ml analysis are further subdivided into 5 groups based on a variety of factors. even the classification of supervised and unsupervised techniques covers several distinct approaches and categories. as a result, researchers today can choose from various applications and analytical frameworks for handling diverse omics data types, and yet not many standardized workflows for integrative data analyses. the biggest challenge, therefore, in multi-omics data integration is the lack of a universal framework that can unify all omics data. data quality the success of integrative multi-omics depends as much on an efficient and scalable data integration strategy as it does on the quality of omics data. and when it comes to multi-omics research, it is rarely prudent to assume that data values are precise representations of true biological value. there are several factors, between the actual sampling to the measurement, that affect the quality of a sample. this applies equally to data generated from manual small-scale experiments and from sophisticated high-throughput technologies. for instance, there can be intra-experimental quality heterogeneity where there is variation in data quality even when the same omics procedure is used to conduct a large number of single experiments simultaneously. similarly, there can also be inter-experimental heterogeneity in which the quality of data from one experimental procedure is affected by factors shared by other procedures. in addition, data quality also depends on the computational methods used to process raw experimental data into quantitative data tables. an effective multi-omics analysis solution must have first-line data quality assessment capabilities to guarantee high-quality datasets and ensure accurate biological inferences. however, there are currently few classification or prediction algorithms that can compensate for the quality of input data. however, in recent years there have been efforts to harmonize quality control vocabulary across different omics and high-throughput methods in order to develop a unified framework for quality control in multi-omics experiments. data fairness the ability to reuse life sciences data is critical for validating existing hypotheses, exploring novel hypotheses, and gaining new knowledge that can significantly advance interdisciplinary research. quality, for instance, is a key factor affecting the reusability of multi-omics and clinical data due to the lack of common quality control frameworks that can harmonize data across different studies, pipelines, and laboratories. the publishing of the fair principles in 2016 represented one of the first concerted efforts to focus on improving the quality, standardization, and reusability of scientific data. the fair data principles, designed by a representative set of stakeholders, defined measurable guidelines for “those wishing to enhance the reusability of their data holdings” both for individuals and for machines to automatically find and use the data. the four foundational principles — findability, accessibility, interoperability, and reusability — were applicable to data as well as to the algorithms, tools, and workflows that contributed to data generation. since then there have been several collaborative initiatives, such as the eatris-plus project and the global alliance for genomics and health (ga4gh) for example, that have championed data fairness and advanced standards and frameworks to enhance data quality, harmonization, reproducibility, and reusability. despite these efforts, the use of specific and non-standard formats continues to be quite common in the life sciences. integrative multi-omics - the mindwalk model our approach to truly integrated and scalable multi-omics analysis is defined by three key principles. one, we have created a universal and automated framework, based on a proprietary transversal language called hyfts®, that has pre-indexed and organized all publicly available biological data into a multilayered multidimensional knowledge graph of 660 million data objects that are currently linked by over 25 billion relations. we then further augmented this vast and continuously expanding knowledge network, using our unique lensai integrated intelligence platform, to provide instant access to over 33 million abstracts from the pubmed biomedical literature database. most importantly, our solution enables researchers to easily integrate proprietary datasets, both sequence- and text-based data. with our unique data-centric model, researchers can integrate all research-relevant data into one distinct analysis-ready data matrix mosaic. two, we combined a simple user interface with a universal workflow that allows even non-data scientists to quickly explore, interrogate, and correlate all existing and incoming life sciences data. and three, we built a scalable platform with proven big data technologies and an intelligent, unified analytical framework that enables integrative multi-omics research. in conclusion, if you share our passion for integrated multi-omics analysis, then please do get in touch with us. we’d love to compare notes on how best to realize the full potential of truly data-driven multi-omics analysis.

Why we love multi-omics analysis

the key challenge to understanding complex biological systems is that they cannot be simply decoded as a sum of their parts. biomedical research, therefore, is transitioning from this reductionist approach to a more holistic and integrated systems biology model to understand the bigger picture. the first step in the transition to this holistic model is to catalog a complete parts list of biological systems and decode how they connect, interact, and individually and collectively correlate to the function and behavior of that specific system. omics is the science of analyzing the structure and functions of all the parts of a specific biological function, across different levels, including the gene, the protein, and metabolites. today, we’ll take an objective look at why we believe multi-omics is central to modern biomedical and life sciences research. the importance of multi-omics in four points it delivers a holistic, dynamic, high-resolution view omics experiments have evolved considerably since the days of single-omics data. nowadays, it is fairly commonplace for researchers to combine multiple assays to generate multi-omics datasets. multi-omics is central to obtaining a detailed picture of molecular-level dynamics. the integration of multidimensional molecular datasets provides deeper insight into biological mechanisms and networks. more importantly, multi-omics can provide a dynamic view of different cell and tissue types over time which can be vital to understand the progressive effect of different environmental and genetic factors. combining data from different modalities enables a more holistic view of biological systems and a more comprehensive understanding of the underlying dynamics. the development of massively parallel genomic technologies is constantly broadening the scope and scale of biological modalities that can be integrated into research. at the same time, a new wave of multi-omics approaches is enabling researchers to simultaneously explore different layers of omics information to gain unparalleled insights into the internal dynamics of specific cells and tissues. emerging technologies such as single-cell sequencing and spatial analysis are opening up new layers of biological information to deliver a comprehensive, high-resolution view at the molecular level. it is constantly expanding & evolving genomics was the first omics discipline. since then the omics sciences have been constantly expanding beyond genomics, transcriptomics, proteomics, and metabolomics which were derived from the central dogma. however, the increasing sophistication of modern high-throughput technologies means that today we have a continuously expanding variety of omics datasets focusing on multiple diverse yet complementary biological layers. in fact, the ‘omics’ suffix seems to have developed its own unique cachet that it has even crossed over into emerging scientific fields, such as polymeromics, humeomics, etc., that deal with huge volumes of data but are not related to the life sciences. omics technologies can be broadly classified into two categories. the first, technology-based omics, is itself further subdivided into sequencing-based omics, focusing on the genome, transcriptome, their epitomes, and interactomes, and mass spectrometry-based omics that interrogate proteome, metabolome, and interactomes not involving dna/rna. the second category, comprising knowledge-based omics such as immunomics and microbiomics, develops organically from the integration of multiple omics data from different computational approaches and molecular layers for specific research applications. the consistent development of techniques to cover new omics modalities has also contributed to the trend of combining multiple techniques to simultaneously collect information from different layers. next-generation multi-omics approaches, spearheaded by new single-cell and spatial sequencing technologies, enable researchers to concurrently explore multiple omics profiles of a sample and gain novel insights into cell systems. and mechanisms operating within specific cells and tissues, providing a greater understanding of cell biology. it is data-driven the omics revolution ushered in the era of big data in biological research. the exponential generation of high-throughput data following the hgp triggered the shift from traditional hypothesis-driven approaches to data-driven methodologies that opened up new perspectives and accelerated biological research and innovation. it was not just about data volumes though. with the continuous evolution of high-throughput omics technologies came the ability to measure a wider array of biological data. the rapid development of novel omics technologies in the post-genomic era produced a wealth of multilayered biological information across transcriptomics, proteomics, epigenomics, metabolomics, spatial omics, single-cell omics, etc. the increasing availability of large-scale, multidimensional, and heterogeneous datasets created unprecedented opportunities for biological research to gain deeper and holistic insights into the inner workings of biological systems and processes. the shift from single-layer to multi-dimensional analysis also yielded better results that would have a transformative impact on a range of research areas including biomarker identification, microbiome analysis, and systems microbiology. researchers have already taken on the much more complex challenge of referencing the human multi-ome and describing normal epigenetic conditions and levels of mrna, proteins, and metabolites in each of the 200 cell types in an adult human. when completed, this effort will deliver even more powerful datasets than those that emerged following the sequencing of the genome. it is key to innovation in recent years, multi-omics analysis has become a key component across several areas of biomedical and life sciences research. take precision medicine, for example, a practice that promotes the integration of collective and individualized clinical data with patient-specific multi-omics data to accurately diagnose health states and determine personalized therapeutic options at an individual level. modern ai/ml-powered bioinformatics platforms enable researchers to seamlessly integrate all relevant omics and clinical data, including unstructured textual data in order to develop predictive models that are able to identify risks much before they become clinically apparent and thereby facilitate preemptive interventions. in the case of complex diseases, multi-omics data provide molecular profiles of disease-relevant cell types that when integrated with gwas insights help translate genetic findings into clinical applications. in drug discovery, multi-omics data is used to create multidimensional models that help identify and validate new drug targets, predict toxicity and develop biomarkers for downstream diagnostics in the field. modern biomarker development relies on the effective integration of a range of omics datasets in order to obtain a more holistic understanding of diseases and to augment the accuracy and speed of identifying novel drug targets. the future of multi-omics integrated multi-omics analysis has revolutionized biology and opened up new horizons for basic biology and disease research. however, the complexity of managing and integrating multi-dimensional data that drives such analyses continues to be a challenge. modern bioinformatics platforms are designed for multi-dimensional data. for instance, our integrated data-ingestion-to-insight platform eliminates all multi-omics data management challenges while prioritizing user experience, automation, and productivity. with unified access to all relevant data, researchers can focus on leveraging the ai-powered features of our solution to maximize the potential of multi-omics analysis.

Challenges in multi-omics data integration

today, the integrative computational analysis of multi-omics data has become a central tenet of the big data-driven approach to biological research. and yet, there is still a lack of gold standards when it comes to evaluating and classifying integration methodologies that can be broadly applied across multi-omics analysis. more importantly, the lack of a cohesive or universal approach to big data integration is also creating new challenges in the development of novel computational approaches for multi-omics analysis. one aspect of sequence search and comparison, however, has not changed much at all – a biological sequence in a predefined and acceptable data format is still the primary input in most research. this approach is probably and arguably valid in many if not most real-world research scenarios. take machine learning (ml) models, for instance, which are increasingly playing a central role in the analysis of genomic big data. biological data presents several unique challenges, such as missing values and precision variations across omics modalities, that simply expand the gamut of integration strategies required to address each specific challenge. for example, omics datasets often contain missing values, which can hamper downstream integrative bioinformatics analyses. this requires an additional imputation process to infer the missing values in these incomplete datasets before statistical analyses can be applied. then there is the high-dimension low sample size (hdlss) problem, where the variables significantly outnumber samples, leading ml algorithms to overfit these datasets, thereby decreasing their generalisability on new data. in addition, there are multiple challenges inherent to all biological data irrespective of analytical methodology or framework. to start with there is the sheer heterogeneity of omics data comprising a variety of datasets originating from a range of data modalities and comprising completely different data distributions and types that have to be handled appropriately. integrating heterogeneous multi-omics data presents a cascade of challenges involving the unique data scaling, normalisation, and transformation requirements of each individual dataset. any effective integration strategy will also have to account for the regulatory relationships between datasets from different omics layers in order to accurately and holistically reflect the nature of this multidimensional data. furthermore, there is the issue of integrating omics and non-omics (ono) data, like clinical, epidemiological or imaging data, for example, in order to enhance analytical productivity and to access richer insights into biological events and processes. currently, the large-scale integration of non-omics data with high-throughput omics data is extremely limited due to a range of factors, including heterogeneity and the presence of subphenotypes, for instance. the crux of the matter is that without effective and efficient data integration, multi-omics analysis will only tend to become more complex and resource-intensive without any proportional or even significant augmentation in productivity, performance, or insight generation. an overview of multi-omics data integration early approaches to multi-omics analysis involved the independent analysis of different data modalities and combining results for a quasi-integrated view of molecular interactions. but the field has evolved significantly since then into a broad range of novel, predominantly algorithmic meta-analysis frameworks and methodologies for the integrated analysis of multi-dimensional multi-omics data. however, the topic of data integration and the challenges involved is often overshadowed by the ground-breaking developments in integrated, multi-omics analysis. it is therefore essential to understand the fundamental conceptual principles, rather than any specific method or framework, that define multi-omics data integration. horizontal vs vertical data integration multi-omics datasets are broadly organized as horizontal or vertical, corresponding to the complexity and heterogeneity of multi-omics data. horizontal datasets are typically generated from one or two technologies, for a specific research question and from a diverse population, and represent a high degree of real-world biological and technical heterogeneity. horizontal or homogeneous data integration, therefore, involves combining data from across different studies, cohorts or labs that measure the same omics entities. vertical data refers to data generated using multiple technologies, probing different aspects of the research question, and traversing the possible range of omics variables including the genome, metabolome, transcriptome, epigenome, proteome, microbiome, etc. vertical, or heterogeneous, data integration involves multi-cohort datasets from different omics levels, measured using different technologies and platforms. the fact that vertical integration techniques cannot be applied for horizontal integrative analysis and vice-versa opens up an opportunity for conceptual innovation in multi-omics for data integration techniques that can enable an integrative analysis of both horizontal and vertical multi-omics datasets. of course, each of these broad data heads can further be broken down into a range of approaches based on utility and efficiency. 5 integration strategies for vertical data a 2021 mini-review of general approaches to vertical data integration for ml analysis defined five distinct integration strategies – early, mixed, intermediate, late and hierarchical – based not just on the underlying mathematics but on a variety of factors including how they were applied. here’s a quick rundown of each approach. early integration is a simple and easy-to-implement approach that concatenates all omics datasets into a single large matrix. this increases the number of variables, without altering the number of observations, which results in a complex, noisy and high dimensional matrix that discounts dataset size difference and data distribution. mixed integration addresses the limitations of the early model by separately transforming each omics dataset into a new representation and then combining them for analysis. this approach reduces noise, dimensionality, and dataset heterogeneities. intermediate integration simultaneously integrates multi-omics datasets to output multiple representations, one common and some omics-specific. however, this approach often requires robust pre-processing due to potential problems arising from data heterogeneity. late integration circumvents the challenges of assembling different types of omics datasets by analysing each omics separately and combining the final predictions. this multiple single-omics approach does not capture inter-omics interactions. hierarchical integration focuses on the inclusion of prior regulatory relationships between different omics layers so that analysis can reveal the interactions across layers. though this strategy truly embodies the intent of trans-omics analysis, this is still a nascent field with many hierarchical methods focusing on specific omics types, thereby making them less generalisable. the availability of an unenviable choice of conceptual approaches – each with its own scope and limitations in terms of throughput, performance, and accuracy – to multi-omics data integration represents one of the biggest bottlenecks to downstream analysis and biological innovation. researchers often spend more time mired in the tedium of data munging and wrangling than they do extracting knowledge and novel insights. most conventional approaches to data integration, moreover, seem to involve some form of compromise involving either the integrity of high-throughput multi-omics data or achieving true trans-omics analysis. there has to be a new approach to multi-omics data integration that can 1), enable the one-click integration of all omics and non-omics data, and 2), preserve the biological consistency, in terms of correlations and associations across different regulatory datasets, for integrative multi-omics analysis in the process. the mindwalk hyft model for data integration at mindwalk, we took a lateral approach to the challenge of biological data integration. rather than start with a technological framework that could be customised for the complexity and heterogeneity of multi-omics data, we set out to decode the atomic units of all biological information that we call hyfts™. hyfts are essentially the building blocks of biological information, which means that they enable the tokenisation of all biological data, irrespective of species, structure, or function, to a common omics data language. we then built the framework to identify, collate, and index hyfts from sequence data. this enabled us to create a proprietary pangenomic knowledge database of over 660 million hyfts, each containing comprehensive information about variation, mutation, structure, etc., from over 450 million sequences available across 12 popular public databases. with the mindwalk platform, researchers and bioinformaticians have instant access to all the data from some of the most widely used omics data sources. plus, our unique hyfts framework allows researchers the convenience of one-click normalization and integration of all their proprietary omics data and metadata. based on our biological discovery, we were able to normalise and integrate all publicly available omics data, including patent data, at scale, and render them multi-omics analysis-ready. the same hyft ip can also be applied to normalise and integrate proprietary omics data. the transversal language of hyfts enables the instant normalisation and integration of multi-omics research-relevant data and metadata into one single source of truth. with the mindwalk approach to multi-omic data integration, it is no longer about whether research data is horizontal or vertical, homogeneous or heterogeneous, text or sequence, omics or non-omics. if it is data that is relevant to your research, mindwalk enables you to integrate it with just one click.

The evolution of bioinformatics

conventional vaccine development, still based predominantly on systems developed in the last century, is a complex process that takes between 10-15 years on average. until the covid-19 pandemic, when two mrna vaccines went from development to deployment in less than a year, the record for the fastest development of a new vaccine, in just four years, had gone unchallenged for over half a century. this revolutionary boost to the vaccine development cycle stemmed from two uniquely 21st century developments: first, the access to cost-effective next-generation sequencing technologies with significantly enhanced speed, coverage and accuracy that enabled the rapid sequencing of the sars-cov-2 virus. and second, the availability of innovative state-of-the-art bioinformatics technologies to convert raw data into actionable insights, without which ngs would have just resulted in huge stockpiles of dormant or dark data. in the case of covid-19, cutting edge bioinformatics approaches played a critical role in enabling researchers to quickly hone in on the spike protein gene as the vaccine candidate. ngs technologies and advanced bioinformatics solutions have been pivotal to mitigate the global impact of covid-19, providing the tools required for detection, tracking, containment and treatment, the identification of biomarkers, the discovery of potential drug targets, drug repurposing, and exploring other therapeutic opportunities. however, the combination of gene engineering and information technologies is already creating the foundation for the fourth generation of sequencing technologies for faster and more cost-effective whole-genome sequencing and disease diagnosis. as a result, continuous innovation has become an evolutionary imperative for modern bioinformatics as it has to keep up with the developmental pace of ngs technologies and accelerate the transformation of an exponentially increasing trove of data into knowledge. however, the raw volume and velocity of data sequences is just one facet of big data genomics. today, bioinformatics solutions have to cope with a variety of complex data, in heterogeneous formats, from diverse data sources, from different sequencing methods connected to different -omes, and relating to different characteristics of genomes. more importantly, the critical focus of next-generation bioinformatics technologies has to be on catalysing new pathways and dimensions in biological research that can drive transformative change in precision medicine and public health. in the following section, we look at the current evolutionary trajectory of bioinformatics in the context of three key omics analysis milestones. three key milestones in the evolution of bioinformatics the steady evolution of bioinformatics over the past two decades into a cross-disciplinary and advanced computational practice has enabled several noteworthy milestones in omics analysis. the following, however, are significant as they best showcase the growth and expansion of omics research across multiple biological layers and dimensions, all made possible by a new breed of bioinformatics solutions. searching and aligning sequences are in its essence a problem of matching letters on a grid and assigning regions of high similarity versus regions of high variation. but nature has done a great deal to make this a challenging task. integrated multi-omics for years, omics data has provided the requisite basis for the molecular characterisation of various diseases. however, genomic studies of diseases, like cancer for example, invariably include data from heterogeneous data sources and understanding cross-data associations and interactions can reveal deep molecular insights into complex biological processes that may simply not be possible with single-source analysis. combining data across metabolomics, genomics, transcriptomics, and proteomics can reveal hidden associations and interactions between omics variables, elucidate the complex relationships between molecular layers and enable a holistic, pathway-oriented view of biology. an integrated and unified approach to multiple omics analysis has a range of novel applications in the prediction, detection, and prevention of various diseases, in drug discovery, and in designing personalised treatments. and, thanks to the development of next-generation bioinformatics platforms, it is now possible to integrate not just omics data but all types of relevant medical, clinical, and biological data, both structured and unstructured, under a unified analytical framework for a truly integrated approach to multi-omics analysis. single-cell multi-omics where multi-omics approaches focus on the interactions between omics layers to clarify complex biological processes, single-cell multi-omics enable the simultaneous and comprehensive analysis of the unique genotypic and phenotypic characteristics of single cells as well as the regulatory mechanisms that are evident only at single-cell resolution. earlier approaches to single-cell analysis involved the synthesis of data from individual cells and then computationally linking different modalities across cells. but with next-generation multi-omics technologies, it is now possible to directly look at each cell in multiple ways and perform multiple analyses at the single-cell level. today, advanced single-cell multi-omics technologies can measure a wide range of modalities, including genomics, transcriptomics, epigenomics, and proteomics, to provide ground-breaking insights into cellular phenotypes and biological processes. best-in-class solutions provide the framework required to seamlessly integrate huge volumes of granular data across multiple experiments, measurements, cell types, and organisms, and facilitate the integrative and comprehensive analysis of single-cell data. spatial transcriptomics single-cell rna sequencing enabled a more fine-grained assessment of each cell’s transcriptome. however, single-cell sequencing techniques are limited to tissue-dissociated cells that have lost all spatial information. delineating the positional context of cell types within a tissue is important for several reasons, including the need to understand the chain of information between cells in a tissue, to correlate cell groups and cellular functions, and to identify cell distribution differences between normal and diseased cells. spatial single-cell transcriptomics, or spatialomics, considered to be the next wave after single-cell analysis, combines imaging and single-cell sequencing to map the position of particular transcripts on a tissue, thereby revealing where particular genes are expressed and indicating the functional context of individual cells. even though many bioinformatics capabilities for the analysis of single-cell rna-seq data are shared with spatially resolved data, analysis pipelines diverge at the level of the quantification matrix, requiring specialised tools to extract knowledge from spatial data. however, there are advanced analytics platforms that use a unique single data framework to ingest all types of data, including spatial coordinates, for integrated analysis. quo vadis, bioinformatics? bioinformatics will continue to evolve alongside, if not ahead of, emerging needs and opportunities in biological research. but if there is one key takeaway from the examples cited here, it is that a reductionist approach – one that is limited to a single omics modality or discipline or even dimension – yields limited and often suboptimal results. if bioinformatics is to continue driving cutting edge biological research to tackle some of the most complex questions of our times, then the focus needs to be on developing a more holistic, systems bioinformatics approach to analysis. bioinformatics systems biology analysis is not an entirely novel concept, though its application is not particularly commonplace. but systems bioinformatics applies a well-defined systems approach framework to the entire spectrum of omics data with the emphasis on defining the level of resolution and the boundary of the system of interest in order to study the system as a whole, rather than as a sum of its components. the focus is on combining the bottom-up approach of systems biology with the data-driven top-down approach of classical bioinformatics to integrate different levels of information. the advent of multi-omics has, quite paradoxically, only served to accentuate the inherently siloed nature of omics approaches. even though the pace of bioinformatics innovations has picked up over the past couple of decades, the broader practice itself is still mired in a fragmented multiplicity of domain, project, or data specific solutions and pipelines. there is still a dearth of integrated end-to-end solutions with the capabilities to integrate multi-modal datasets, scale effortlessly from the study of specific molecular mechanisms to system-wide analysis of biological systems, and empower collaboration across disciplines research communities. integration at scale and across disciplines, datasets, sources, and computational methodologies is now the grand challenge for bioinformatics and represents the first step towards a future of systems bioinformatics.

The metadata challenge in biological research

one of the key highlights of the global response to covid-19 has been the importance of effective, ethical, and equitable data sharing and how it can exponentially accelerate outbreak research. however, data sharing cannot just be an incident response strategy. with the volume of biological data increasing exponentially year on year, the public sharing of experimental multi-omics data needs to become part of the culture of biological and life sciences research. the ability to assemble data from across domains and disciplines will pave the way for more integrated multi-omics and cross-disciplinary research and expand the potential for more sophisticated insights into biological systems. there are already concerted efforts within the industry to evolve towards an open data paradigm in biological research. however, the public availability of data will not automatically translate into enhanced value in terms of insights and knowledge. it has to be accompanied by a radical rethink of how data is generated, stored, shared and accessed. and, equally importantly, data management frameworks will have to account for the value of metadata. the challenges of metadata metadata – essentially data that describes data – provide context and provenance to raw data and is crucial to both data discovery and validation. for instance, metadata can describe a sample – in terms of the biomaterial it was derived from, how the sample was handled, the details of processes used for sample purification, profiling, and quantification, and provide detailed information into the experimental set-up and procedure. integrating data from multiple analyses and experiments enables high-level research that can address more complex questions in the life sciences. however, several case studies indicate that even current efforts to prioritise open data have not been able to catalyse open analysis at a proportionate scale. there are several reasons for this lag. a key challenge in this context is the fact that many research groups have adopted community-specific conventions that are not easily scalable for multidisciplinary research. many researchers still use no formal reporting conventions or completely exclude the metadata critical to the interpretation and reuse of data. often, the metadata included with open datasets is incomplete and/or poorly annotated. when it comes to data integration, conventional methods fall into one of two categories – multi-staged or meta-dimensional analysis. though meta-dimensional analysis is capable of incorporating all data into a comprehensive metadata matrix, combining data from different datasets still remains a significant challenge. data integration is further complicated by the lack of user-friendly tools for researchers with limited bioinformatics, biostatistics, and programming expertise. open data without metadata adds no discernible new value to the research process. though the biomedical community is making a concerted effort to share omics data, there is still a lack of consistency among researchers in ensuring that this data is backed by complete, annotated and usable metadata. so, if open analysis has to catch up with open data, there needs to be a universal standard for the reporting and sharing of valuable data. standardising biological metadata though metadata has been acknowledged as a key component of research infrastructure design, there is still no universal standard for reporting and sharing metadata. instead, there have been numerous initiatives for the development of hundreds of metadata standards with diverse characteristics. however, there has been a conceptual consensus on the three types of metadata standards – descriptive, administrative, and structural. there is also general agreement that metadata is key to supporting fair principles in order to overcome obstacles to data discovery and reuse for both humans and machines. as a result, there has been some change in the profile of public data with many repositories showing some degrees of “fairness” and several new projects emerging with fair as a central objective. in addition, many scientific journals are now urging researchers to make data shareable and public even as they endorse public data repositories that implement fair principles. notwithstanding that, much of the data in public repositories is far from being perfectly fair. one limited study of engineered nanomaterial databases found that even though a majority met fair criteria, one of the potential areas of improvement was the use of standard schema for metadata. another study to evaluate the completeness of metadata, referenced to nine clinical phenotypes, in public omics data reported a large variability in both the number and consistency of reported clinical phenotypes. even coordinated efforts, like miame, designed to encourage metadata sharing have had a limited impact given the fact that they define the content but not the format for this information. the creation of a unified framework for metadata continues to be a significant challenge with the public data landscape still characterized by diverse databases and standards that still require users to devise and manage compatibility. it is therefore going to require a monumental and orchestrated effort to ensure data and metadata quality adherence across the universe of public data repositories. the primary reason for this is that genomic data organization is essentially a fraught endeavour. for instance, files come in multiple formats with widely different semantics to fit neatly into a predefined universal framework. more importantly, there is no commonly accepted standard for a general yet basic data unit that can represent the heterogeneous and multi-dimensional data assets that are central to biological research. hyfts – the atomic units of biological data at mindwalk, we apply advanced nlp techniques to protein and dna sequences to transcribe the universal language of all omics data. in doing so, we were able to decode the atomic units of information, called hyfts™, that are the building blocks of biological information. with hyfts, all biological data, irrespective of species, structure, or function, can be tokenised to a common omics data language. in addition, these atomic data units are also extremely efficient carriers of biological information. each hyft pattern represents a unique signature sequence in dna, rna, and aa and integrates data and metadata across all omics layers. the transversal language of hyfts enables the unification, standardisation, and normalisation of all data, across species and domains, to create a single source of truth. the default integration of omics layers and associated metadata combined with the lensai platform’s hyper-scalable technology and unified analytical framework enables truly integrated multi-omics research. standardised metadata is the key to the usability and reproducibility of public data. with mindwalk, all public data is usable, and all biological research is reproducible.

Multi-omics data in biomarker discovery

current diagnostic alternatives for neurodegenerative diseases like alzheimer’s, parkinson’s, down’s syndrome, dementia, and motor neuron disease, are either invasive lumbar punctures, expensive brain imaging scans, pen-and-paper cognitive tests, or a simple blood test in a primary care setting to check for nfl (neurofilament light chain) concentration. similarly, despite increasing evidence that exercise could delay or even prevent alzheimer’s, there are currently no cost-effective or scalable procedures to validate or measure that correlation. however, research has now revealed that post-exercise increases in levels of plasma ctsb, a protease positively associated with learning and memory, could help evaluate how training influences cognitive change. nfl and plasma ctsb are two prime examples of biomarkers, biological or characteristics found in body fluids and tissues that can be objectively measured and evaluated to differentiate between normal biological processes and pathogenic processes, or pharmacologic responses to therapeutic interventions. the growing promise of biomarkers in the seven decades since the term was first introduced, biomarkers have evolved from simple indicators of health and disease to transformative instruments in clinical care and precision medicine. today, biomarkers have a wide variety of applications – diagnostic, prognostic, predictive, disease screening and detection, treatment response, risk stratification, etc., – across a broad range of therapeutic areas (cancer, cardiovascular, hepatic, renal, respiratory, neuroscience, gastrointestinal, etc.). in keeping with the times, we now also have digital biomarkers – objective, quantifiable physiological and behavioural data collected and measured by digital devices. biomarkers are at the heart of ground-breaking medical research to, for instance, reveal the underlying mechanism in acute myelogenous leukemia, improve prognosis of gastric cancer, establish a new prognostic gene profile for ovarian cancer, and provide novel etiological insights into obesity that facilitate patient stratification and precision prevention. biomarkers are also playing an increasingly critical role in the drug discovery, development and approval process. they enable a better understanding of the mechanism of action of a drug, help reduce the risk of failure and discovery costs, and allow for more precise patient stratification. between 2015 and 2019, more than half of the drugs approved by ema and fda were supported by biomarker data during the development stage. it is, therefore, hardly surprising that there is currently a lot of focus on biomarker discovery. however, this inherently complex process is only getting more complex, data-driven, and time-consuming – and that introduces some significant new challenges along the way. the increasing complexity of biomarker discovery initially, a biomarker was a simple one-dimensional molecule whose presence, or absence, indicated a binary outcome. however, single biomarkers lack the sensitivity and specificity required for disease classification and outcome prediction in a clinical setting. soon, biomarker discovery included panels – a set of biomarkers working together to enhance diagnostic or prognostic performance. then the field shifted again toward spatially resolved biomarkers that reflected the complexity of the underlying diseases. rather than just provide aggregated information, these higher-order biomarkers incorporated the spatial data of cells expressing relevant molecular markers. at the same time, biomarker developers are also integrating a whole range of omics data sets, such as genomics, proteomics, metabolomics, epigenetics, etc., in order to get a more holistic view that could augment our ability to understand diseases and identify novel drug targets. the scope of biomarker discovery just keeps getting wider with the emergence of new data-gathering technologies like single-cell next-generation sequencing, liquid biopsy (blood sample) for circulating tumour dna, microbiomics, radiomics, and with high-throughput technologies generating enormous volumes of data at a relatively low cost. the big challenge, therefore, will be in the integration and analysis of these huge volumes of multimodal data. plus, biomarker data comes with some challenges of its own. biomarker data challenges data scarcity: despite their widespread currency, there are still very few biomarker databases available for developers. in addition, there could also be a lack of systemic omics studies and biological data relevant to biomarker research. for instance, metabolomics data, critical to biomarker research into radiation resistance in cancer therapy, is not part of large multi-omics initiatives such as the cancer genome atlas. therefore, it will require a network-centric approach to analytics that enable data enrichment and modelling with other available datasets. data fragmentation: biomarker data is typically distributed across subscription-based, commercial databases with no provision for cross-database interconnectivity, and a few open-access databases, each with its own therapeutic or molecular specialization. so, a truly multi-omics approach to analysis will depend entirely on the efficiency of data integration. lack of data standardization: many sources do not follow fair database principles and practices. moreover, different datasets are also generated using heterogeneous profiling technologies, pre-processed using diverse normalization procedures, and annotated in non-standard ways. intelligent, automated normalization should be a priority. how mindwalk can help at biostrand, we understand that a systems biology approach is crucial to the success of biomarker discovery. our unique hyft™ ip was born out of the acknowledgement that the only way to accelerate biological research was by unifying all biological data with a common computational language. access all biological data with hyft™: on the biostrand platform, multi-omics data integration is as simple as logging in. using hyft™, we have already normalized, integrated, and indexed 450 million sequences available across 11 popular omics databases. that’s instant access to an extensive omics knowledge base with over a billion hyfts™, with information about variation, mutation, structure, etc. what’s more, integrating your own biomarker research is just a click away. add structured databases (icd codes, lab tests, etc.) and unstructured datasets (patient record data, scientific literature, clinical trial data, chemical data, etc.) our technology will seamlessly normalize and standardize all your data and make it computable to enable a truly integrative multi-omics approach to biomarker discovery. accurate annotation and analysis: the mindwalk genomic analysis tools provide unmatched accuracy in annotation and variation analysis, such as in the large-scale whole-genome data of patients with a specific disease. use our platform’s advanced annotation capabilities to extract insights from genomic datasets and fill in the gaps in biomarker datasets. comprehensive data mining: combine the power of our hyft™ database with the graph-based data mining capabilities of our ai-powered platform to discover knowledge that can accelerate the development process. from single biomarkers to systems biology biomarkers have evolved considerably since their days as simple single-molecule indicators of biological processes. today, biomarker discovery is a sophisticated systems biology practice to unravel complex molecular interactions and expand the boundaries of clinical medicine and drug development. as the practice gets more multifaceted, it will also require more advanced data integration, management, and analysis tools. the mindwalk platform provides an integrated solution for normalization, integration, and analysis of high-volume high-dimensional data.

Intelligent multi-omics analysis: The power of AI/ML

in our previous blog on integrated multi-omics, we overviewed the need for this increase in scale, dimensionality and heterogeneity in genomic data to be matched by a shift from reductionist biology to a holistic systems biology approach to omics analysis. an integrated multi-dimensional and multivariate model for analysis is absolutely imperative for us to create a more comprehensive multi-scale characterization and understanding of biological systems. however, there are several limitations, like scalability and reproducibility, for example, in applying conventional integration and interpretation techniques to multi-omics data. and conventional single-omics analyses are wholly ineffective in interpreting complex cellular mechanisms or identifying the underlying causes of multifaceted diseases. as a result, multi-omics analyses increasingly rely on advanced computational methods and intelligent ai-powered technologies like ml, deep learning and nlp to optimize data management and transform multi-omics data into clinically actionable knowledge. ai/ml in multi-omics analysis since then, ml and ai capabilities have evolved exponentially and have been applied consistently if not extensively for decision support in several healthcare scenarios including the management of several communicable and noncommunicable diseases. image source: biomolecules a recent special issue of the journal biomolecules focusing on the integrated analysis of omics data using ai, ml and deep learning (dl) listed three areas in medical research – medical image analysis, omics analysis, and natural language processing – where ai was currently being implemented. here, then, is an overview of the transformative potential of ai in these specific areas. medical image analysis intelligent technologies are a critical component of radiogenomics, which focuses on studying the relationship between imaging phenotypes and genomic phenotypes of specific diseases. technologies such as ai/ml and dl have demonstrated their ability to extract meaningful information from medical imaging data, sometimes with greater precision than humans themselves and helped bring automated, accurate, and ultra-fast medical image analysis into the mainstream. for example, researchers in japan have used ai to successfully detect recurrent prostate cancer by analysing pathology images to identify features relevant to cancer prognosis that were not previously noted by pathologists. ai techniques like deep learning have even been used to predict neurological diseases like alzheimer's and amyotrophic lateral sclerosis (als) before patients become symptomatic. by training convolutional neural networks on images of motor neurons prepared from ips cells of 15 healthy donors and 15 als patients, researchers were able to predict with over 0.97 of area under curve (auc) whether donors were healthy or als patients. omics analysis ai, ml and dl have been used extensively across a range of applications including the improvement of disease predictions, establishing the moas of compounds, studying gene regulation, and molecular profiling of various diseases, to name a few. take the case of cancer, for instance, where treatments based solely on pathological features can yield very different outcomes. it is therefore important to be able to break down broad symptomatic classifications into more finely defined subtypes in order to administer more focused clinical care. by applying supervised and unsupervised learning techniques to rna, mirna, and dna data from hepatocellular carcinoma patients, researchers were able to identify two subgroups with significant survival differences, isolate consensus driver genes associated with survival, and establish that consensus driver mutations were associated more with mrna transcriptomes than with mirna transcriptomes. ai has also been used to focus cancer prognosis and therapy by detecting subtypes among cancer patients, identifying biomarkers that determine recurrence of cancer, and for drug response modelling to predict drug response behaviour. in drug development, techniques like gene signatures analysis and high-throughput screening have made it quicker and easier to identify compounds that affect a specific target or phenotype. even so, these compounds may still induce complex downstream functional consequences that could adversely affect their chances of approval. understanding the moas (modes of action) of compounds is crucial to increasing the success rate of clinical trials and drug approvals. researchers have demonstrated that it is possible to determine unknown moas, even in the absence of a comparable reference, by combining multi-omics with an interpretable machine learning model of transcriptomics, epigenomics, metabolomics, and proteomics. natural language processing (nlp) as omics data continues to pile up in the petabytes, it only amplifies the urgency to convert this data into meaningful biological and clinical insights. however, not all omics researchers and bioinformaticians have the statistical expertise to do so. it is in this context that ai techniques like nlp can help accelerate the pace of multi-omics research by making analytics accessible to a wider audience. for example, one open-access, natural language-oriented, ai-driven platform for analyzing and visualizing cancer omics data empowers users to interact with the system using a natural language interface. researchers ask biological questions, and the system responds by identifying relevant genomics datasets, performing various analyses, returning appropriate results and even learning from user feedback. one ai model uses nlp to analyse short speech samples from a clinically administered cognitive test to predict the eventual onset of alzheimer’s disease within healthy people with an auc of 0.74. nlp can also help researchers identify, relate, and analyse datasets from public repositories with a solution that combines nlp techniques, biomedical ontologies, and the r statistical framework to simplify the association of samples from large-scale repositories to their ontology-based annotations. lensai platform - the ai-centric multi-omics platform ai/ml technologies have become the defining component of multi-omics analysis as they provide the speed, accuracy, and sophistication to deal with voluminous, diverse, heterogeneous, and high dimensional genomic data. at a higher level, they also introduce innovative new capabilities into bioinformatics by augmenting data-driven decision-making and enabling a new era of predictive multi-omics. the lensai saas platform was designed from the ground up to fully leverage the potential of these versatile and intelligent technologies. take hyfts™, a biological discovery that enabled us to translate 440 million sequences of different species, types, and formats from across 11 popular public databases into one homogeneous pan-genomic multivariate knowledge database. with hyfts™, we have decoded the language of omics, which means that we now have the capability to translate any data that researchers might bring to our platform and make it instantly computable. defining a universal framework to accommodate all omics data, public and proprietary, opens up whole new possibilities for innovation. for instance, with our platform, researchers can seamlessly integrate all kinds of structured and unstructured non-biological data including patient record data, scientific literature, clinical trial data, chemical data, icd codes, lab tests, and more. however, developing a framework that unifies all omics data and metadata, both public and proprietary, into one integrated model is only half the battle. truly intelligent and integrated multiomics analysis will only become possible with the seamless integration of years of valuable experimental research insights scattered across volumes of scientific literature. these are insights with the potential to amplify, accelerate and transform bioinformatics that is often left on the table just because of the lack of easy-to-use textual data integration frameworks. and that’s where the lensai platform comes into play, enabling the efficient integration of textual research data together with multiomics data for holistic analysis using sophisticated ai/ml technologies. though there are several biomedical nlp solutions available to the research community, they are currently rather confined in terms of application. the primary issue is that, as mentioned earlier, most of these solutions require significant computational and statistical proficiency, limiting their access to a select few. secondly, many of these solutions use a top-down approach that is effective in extracting information related to a specific query. however, this approach is not very accurate or efficient at extracting all the information, at scale, that is pertinent to research. lensai platform is designed around a bottom-up approach that can reveal all research-relevant information about novel concepts and relationships available in scientific literature. and since our platform is domain agnostic and is, therefore, capable of identifying all meaningful relationships without the need for predefined knowledge. this means that researchers can completely bypass the training stage and get straight to their research. integrated, intelligent multi-omics with lensai platform mindwalk’s unique sequence + text approach enables researchers to instantly unify all omics, non-omics and unstructured data into one multidimensional dataset that serves as the single source of truth for their research. combine this with our platform’s constantly evolving advanced ai-powered analytics capabilities and researchers have a seamless, user-friendly and integrated data-ingestion-to-insight multi-omics platform that enables holistic biological research. to explore what lensai platform can do for your research, contact us here.

Integrated multi-omics analysis: From data normalisation to biological insight

the exponential generation of data by modern high-throughput, low-cost next generation sequencing (ngs) technologies is set to revolutionise genomics and molecular biology and enable a deeper and richer understanding of biological systems. and it is not just about more volumes of highly accurate, multi-layered data. it’s also about more types of omics datasets, such as glycomics, lipidomics, microbiomics, and phenomics. the increasing availability of large-scale, multidimensional and heterogeneous datasets has the potential to open up new insights into biological systems and processes, improve and increase diagnostic yield, and pave the way to shift from reductionist biology to a more holistic systems biology approach to decoding the complexities of biological entities. it has already been established that multi-dimensional analysis – as opposed to single layer analyses – yields better results from a statistical and a biological point of view, and can have a transformative impact on a range of research areas, such as genotype-phenotype interactions, disease biology, systems microbiology, and microbiome analysis. however, applying systems thinking principles to biological data requires the development of radically new integrative techniques and processes that can enable the multi-scale characterisation of biological systems. combining and integrating diverse types of omics data from different layers of biological regulation is the first computational challenge – and the next big opportunity – on the way to enabling a unified end-to-end workflow that is truly multi-omics. the challenge is quite colossal – indeed, a 2019 article in the journal of molecular endocrinology refers to the successful implementation of more than two datasets as very rare. data integration challenges in multi-omics analysing omics datasets at just one level of biological complexity is challenging enough. multi-omics analysis amplifies those challenges and introduces some unfamiliar new complications around data integration/fusion, clustering, visualisation, and functional characterisation. for instance, accommodating for the inherent complexity of biological systems, the sheer number of biological variables and the relatively low number of biological samples can on its own turn out to be a particularly difficult assignment. over and above this, there is a litany of other issues including process variations in data cleaning and normalisation, data dimensionality reduction, biological contextualisation, biomolecule identification, statistical validation, etc., etc., etc. data heterogeneity, arguably the raison d'être for integrated omics, is often the primary hurdle in multi-omics data management. omics data is typically distributed across multiple silos defined by domain, type, and access type (public/proprietary), to name just a few variables. more often than not, there are significant variations between datasets in terms of the technologies/platforms that were used to generate these datasets, nomenclature, data modalities, assay types, etc. data harmonisation, therefore, becomes a standard pre-integration process. but the process for data scaling, data normalisation, and data transformation to harmonise data can vary across different dataset types and sources. for example, there is a difference between normalisation and scaling techniques between rna-seq datasets and small rna-seq datasets. multi-omics data integration has its own set of challenges, including lack of reliability in parameter estimation, preserving accuracy in statistical inference, and/or the prevalence of large standard errors. there are, however, several tools currently available for multi-omics data integration, though they come with their own limitations. for example, there are web-based tools that require no computational experience – but the lack of visibility into their underlying processes makes it a challenge to deploy them for large-scale scientific research. on the other end of the spectrum, there are more sophisticated tools that afford more customisation and control – but also require considerable expertise in computational techniques. in this context, the development of a universal standard or unified framework for pre-analysis, let alone an integrated end-to-end pipeline for multi-omics analysis, seems rather daunting. however, if multi-omics analysis is to yield diagnostic value at scale, it is imperative that it quickly evolves from being a dispersed syndicate of tools, techniques and processes to a new integrated multi-omics paradigm that is versatile, computationally feasible and user-friendly. a platform approach to multi-omics analysis the data integration challenge in multi-omics essentially boils down to this. there either has to be a technological innovation designed specifically to handle the fine-grained and multidimensional heterogeneity of biological data. or, there has to be a biological discovery that unifies all omics data and makes them instantly computable even for conventional technologies. at mindwalk, we took the latter route and came up with hyfts™, a biological discovery that can instantly make all omics data computable. normalising/integrating data with hyfts™ we started with a new technique for indexing cellular blueprints and building blocks and used it to identify and catalogue unique signature sequences, or biological fingerprints, in dna, rna, and aa that we call hyft™ patterns. each hyft comprises multiple layers of information, relating to function, structure, position, etc., that together create a multilevel information network. we then designed a mindwalk parser to identify, collate and index hyfts from over 450 million sequences available across 11 popular public databases. this helped us create a proprietary pangenomic knowledge database using over 660 million hyft patterns containing information about variation, mutation, structure, and more. based on our biological discovery, we were able to normalise and integrate all publicly available omics data, including patent data, at scale, and render them multi-omics analysis-ready. the same hyft ip can also be applied to normalise and integrate proprietary omics data. making 660 million data points accessible that’s a lot of data points. so, we made it searchable. with google-like advanced indexing and exact matching technologies, only exact matches to search inputs are returned. through a simple search interface – use plain text or a fasta file – researchers can now accurately retrieve all relevant information about sequence alignments, similarities, and differences from a centralised knowledge base with information on millions of organisms in just 3 seconds. synthesising knowledge with our ai-powered saas platform around these core capabilities, we built the mindwalk saas platform with state-of-the-art ai tools to expand data management capabilities, mitigate data complexity, and to empower researchers to intuitively synthesise knowledge out of petabytes of biological data. with our platform, researchers can easily add different types of structured and unstructured data, leverage its advanced graph-based data mining features to extract insights from huge volumes of data, and use built-in genomic analysis tools for annotation and variation analysis. multi-omics as a platform as omics data sets become more multi-layered and multidimensional, only a truly sequence integrated multi-omics analysis solution can enable the discovery of novel and practically beneficial biological insights. with mindwalk platform, delivered as a saas, we believe we have created an integrated platform that enables a user-friendly, automated, intelligent, and data-ingestion-to-insight approach to multi-omics analysis. it eliminates all the data management challenges associated with conventional multi-omics analysis solutions and offers a cloud-based platform-centric approach to multi-omics analysis that is paramount to usability and productivity.

The SaaSification of Bioinformatics

in our previous blog post – ‘the imperative for bioinformatics-as-a-service’ – we addressed the issue of the profusion of choice in computational solutions in the fields of bioinformatics research. traditionally, there has been a systemic, acute, and documented dearth of off-the-shelf technological solutions designed specifically for the scientific research community. in bioinformatics and omics research, this has translated into the necessity for users to invent their own system configurations, data pipelines, and workflows that best suit their research objectives. the output of this years-long diy movement has now generated a rich corpus of specialised bioinformatics tools and databases that are now available to the next generation of bioinformaticians to broker, adapt, and chain into a sequence of point solutions. on the one hand, next-generation high throughput sequencing technologies are churning out genomics data more quickly, accurately, and cost-effectively than ever before. on the other, the pronounced lack of next-generation high throughput sequence analysis technologies still requires researchers to build or broker their own computational solutions that are capable of coping with the volume and complexity of digital age genomics big data. as a result, bioinformatics workflows are becoming longer, toolchains have grown more complex, and the number of software tools, programming interfaces, and libraries that have to be integrated has multiplied. even as cloud-based frameworks like saas become the default software delivery model across every industry, bioinformatics and omics research remain stranded in this diy status. the industry urgently needs to shift to a cloud-based as-a-service paradigm that will enable more focused, efficient, and productive use of research talents for data-driven omics innovation and insights, instead of grappling with improvisation and implementation. how saas transforms bioinformatics analytics for the augmented bioinformatician even as the cloud has evolved into the de-facto platform for advanced analytics, the long-running theme of enabling self-service analytics for non-technical users and citizen data scientists has undergone a radical reinterpretation. for instance, predefined dashboards that support intuitive data manipulation and exploration have become a key differentiating factor for solutions in the marketplace. however, according to gartner’s top ten data and analytics technology trends for 2021, dashboards will have to be supplemented with more intelligent capabilities in order to extend analytical power – that thus far was only available to specialist data scientists and analysts –to non-technical augmented consumers. these augmented analytics solutions enable ai/ml-powered automation across the entire data science process – from data preparation to insight generation – and feature natural language interfaces for nlp/nlg technologies to simplify how augmented consumers query and consume their insights and democratize the development, management, and deployment of ai/ml models. specialized bioinformatics-as-a-service platforms need to adopt a similar development trajectory. the focus has to be on completely eliminating the tedium of wrangling with disparate technologies, tools, and interfaces, and empowering a new generation of augmented bioinformaticians to focus on their core research. enhanced scalability and accessibility a single human genome sequence contains about 200 gigabytes of data. as genome sequencing becomes more affordable, data from the human genome alone is expected to add up to over 40 exabytes by 2025. this is not a scale that a motley assortment of technologies and tools can accommodate. in comparison, bioinformatics-as-a-solution platforms are designed with these data volumes in mind. a robust and scalable saas platform is built to effortlessly handle the normalization, storage, analysis, cross-comparison, and presentation of petabytes of genomics data. for instance, our mindwalk platform utilises a container-based architecture to auto-scale seamlessly to handle over 200 petabytes of data with zero on-ramping issues. and scalability is not just about capacity. saas platforms also offer high vertical scalability in terms of services and features that researchers need to access. all mindwalk platform users have a simple “google-style” search bar access to 350 million sequences spanning 11 of the most popular publicly available databases, as well as to in-built tools for sequence analysis, multiple sequence alignment, and protein domain analysis. over and above all this, saas solutions no longer restrict research to the lab environment. researchers can now access powerful and comprehensive bioinformatics-as-a-service via laptops – or even their smartphones if mobile-first turns out to be the next big saas trend – in the comfort of their own homes or their favourite coffee shop. increased speed and accuracy bioinformatics has typically involved a trade-off between speed and accuracy. in some cases, methodologies make reductive assumptions about the data to deliver quicker results, while in others the error rate may increase proportionally to the complexity of a query. in multi-tool research environments, the end result is a discrete sum of the results received from each module in the sequence. this means that errors generated in one process are neither flagged nor addressed in subsequent stages, leading to an accumulation of errors in the final analysis. a truly integrated multi-level solution consolidates disparate stages of conventional bioinformatics and omics data analysis into one seamlessly integrated platform that facilitates in-depth data exploration, maximizes researchers’ view of their data, and accelerates time-to-insight without compromising on speed or accuracy. access to continuous innovation with a saas solution, end-users no longer need to worry about updates, patch management, and upgrades. with vertical saas solutions, such as bioinformatics-as-a-service, continuous innovation becomes a priority to sustain vertical growth in a narrow market. for users, this translates into more frequent rollouts of new features and capabilities based on user feedback to address real pain points in the industry. for instance, in just a few months since the official launch of our platform, we have added new capabilities for sdk/api-based integrations for proprietary data and infrastructure, expanded our tools and expertise to assay design, drug development, gene therapy, crop protection products, and biomarkers, and we are building out an ai platform with state-of-the-art graph-based data mining to discover and synthesise knowledge out of a multitude of information sources. the imperative to saasify bioinformatics saas is currently the largest segment in the public cloud services market – and yet the segment’s footprint in bioinformatics is virtually non-existent. today, there are a few cloud-based technologies targeted at genomic applications that focus on specific workflows like sequence alignment, short read mapping, snp identification, etc. however, what the industry really needs is a cloud-based end-to-end bioinformatics-as-a-service solution that abstracts all the technological complexity to deliver simple yet powerful tools for bioinformaticians and omics researchers.

The problem of fast and reliable sequence similarity searching

case study: finding robust domains in the variable region of immunoglobulins. searching for similarity in biological databases is easy to grasp but hard to master. dna, rna and protein sequence databases are often large, complex and multi-dimensional. conceptually simple approaches such as dynamic programming perform poorly when the alignment of multiple sequences is desired, and heuristic algorithms cut corners to gain speed. a new method, based on advances in computer science, may combine the best of both worlds and provide great performance without sacrificing accuracy. searching for similarity in biological sequences is challenging finding patterns in biological data is one of the most important parts of many data analysis workflows in life sciences, like omics analysis. to distinguish similarity from variance is to find meaning. whether scientists are building evolutionary trees, identifying conserved domains in proteins of interest, or studying structure-function relationships, from dna to rna to amino acids, they all rely on a handful of methods for finding similarity and dissimilarity in biological sequences. searching and aligning sequences are in its essence a problem of matching letters on a grid and assigning regions of high similarity versus regions of high variation. but nature has done a great deal to make this a challenging task. first, there is the sheer scope of the data: the human genome contains three billion base pairs, and rarely are sequence similarity searches limited to a simple one-on-one query. aligning genomic sequences of large patient databases means that queries become n-on-n. the simple task of matching letters on a grid of this size is computationally intensive and clever optimization is necessary, but also dangerous: cutting corners can lead to the obfuscation of meaningful data. apart from its size, there is another reason why biological sequence data is notoriously difficult to work with when performing alignment searches. biological data is not static. whenever dna is replicated, mistakes are made. whenever a gene is transcribed or a transcript is translated, the possibility for error arises a well. this propensity for error is at the very heart of biology, as it is believed to be the molecular driving force behind the ability of living organisms to adapt to their environment. this elegant system of iterative adaptation however, makes biological data even more complex. random mutations and other irregularities in biological data (snvs, cnvs, inversions, etc.) make it difficult to differentiate between “natural noise” and meaningful differences. all of these properties make biological datasets challenging on a conceptual and mathematical level. even the simplest case of finding a dna pattern in a biological database is, in a mathematical sense, not a well-posed problem. this means that possibly, no single static solution exists. sequence alignment: dynamic programming is slow but reliable many solutions to the sequence similarity-searching problem have been found. in essence, they all try to do one thing: given a set of query sequences (of any nature), find the way in which the largest number of similar or identical units (typically amino acids or bases) align with each other. dynamic programming is the earliest developed method for aligning sequences and remains a gold standard in terms of quality. computationally, however, dynamic programming is sub-optimal, only being the recommended method of choice when alignments involve two, three, or four sequences. these methods are, in other words, not scalable in the least. commonly used dynamic programming algorithms for sequence alignment are the needleman-wunsch algorithm and the smith-waterman algorithm, developed in the 1970s and '80s. a standard dynamic programming approach will first construct alignment spaces of all pairs of input sequences, creating a collection of one-on-one alignments that are merged together into an n-level alignment grid, where n is the number of query sequences. although laborious, dynamic programming has the advantage of always leading to an optimal solution. in contrast to heuristics, discussed below, dynamic programming methods do not “cut corners”, causing them to be the method of choice when a low number of sequences need to be aligned. another advantage of dynamic programming is that it can be easily applied from open-source python tool collections such as biopython (https://biopython.org/), which contains the bio.pairwise2 module for simple pairwise sequence alignment, and others for more complex alignments. sequence alignment: heuristics are fast but cut corners heuristics are defined as practical approaches to solving a data problem that do not guarantee an optimal outcome. in other words: the alignment produced by a heuristic algorithm may not be the one representing the most sequence similarity. while this sounds like a serious caveat – after all, who wants a sub-optimal solution? – their practical nature makes heuristic algorithms much less computationally intensive when compared to dynamic programming methods. in fact, when solving complex multiple alignment problems, heuristics offer the only workable solution, because a classical dynamic programming approach to the same problem would take days or weeks of computation time. the first popular heuristic method for sequence alignment was fasta, developed in 1985, and it was soon followed by blast in the 1990s. the prime achievement of these heuristics is the use of word methods or k-tuples. “words”, in these methods, are short sequences taken from the sequence query and matched to a database of other sequences. by performing an initial alignment with the use of these words, sequences can be offset, create a relative position alignment that greatly speeds up the rest of the sequence alignment method. note that the fasta method should not to be confused with the fasta file format, which is the default input format for fasta alignment software, but has also become the industry standard in bioinformatics for dna, rna, and protein sequences throughout the years. many heuristics for sequence alignment are progressive methods, meaning that they build up an alignment grid by first aligning the two most similar sequences, and iteratively add less and less similar sequences to the grid until all sequences are incorporated. one pitfall to this method is that the initial choice of the “most related sequences” carries a lot of weight. if the initial estimate of which sequences are most related is incorrect, the accuracy of the final alignment suffers. common progressive heuristics methods are clustal and t-coffee. a new gold standard: the need for optimization and indexing neither of the two categories discussed above, dynamic programming and heuristics-based approaches, is perfect. one lacks the computational efficiency of a truly scalable tool, while the other may miss vital information. the need for a tool that combines the strengths of dynamic programming and heuristic methods, while avoiding their pitfalls, is high because databases are becoming increasingly complex and data analysis is becoming a bottleneck in many pipelines. one way to tackle this problem is by using techniques inspired by modern computer science, such as optimization and indexing. optimization algorithms such as hidden markov models are especially good at aligning remotely related sequences, but still regularly fall short of more traditional methods such as dynamic programming and heuristic approaches. indexing, on the other hand, adopts a google-like approach using algorithms from natural language programming to discover short informative patterns in biological sequence data, which can then be abstracted and indexed for fast retrieval on all molecular layers. using this method, no pre-selected search window has to be specified and thus, bias is avoided. below, a short case study is laid out, describing the search for robust domains in the variable region of immunoglobulins using hyftstm patterns, which allow for ultra-fast ultra-precise sequence alignment. case study: finding robust domains in the variable region of immunoglobulins immunoglobulins or antibodies are versatile, clinically relevant proteins with a wide range of applications in disease diagnosis and therapy. complex diseases, including many types of cancer, are increasingly treated with monoclonal antibody therapies. key in developing these therapies is the characterization of sequence similarity in immunoglobulin variable regions. while this challenge can be approached using classical dynamic programming or heuristics, the performance of the first is poor and the latter may lead to missing out on binding sites because of the limited search window. using indexing methods with hyfts patterns searches the complete sequence with optimal speed. immunoglobulin protein sequences from pdb are decomposed into hyfts patterns, which form fast and searchable abstractions of the sequences. next, all sequences are aligned based on their hyfts patterns, the outcome of which is shown below (figure 1). the algorithm returns 900 non-equivalent sequences, which are aligned in the constant region of the immunoglobulin (red), and show more variation in the variable region (blue). however, the variable region is not completely random, and a quick search already reveals many conserved domains in what is classically thought of as very variable domains. this search, which took less than one second and did not need any preprocessing or parameter tweaking, shows that index-based methods of sequencing alignment hold a great promise for bioinformatics, and may become the industry standard in the coming years. for a video demonstration of this case study, see prof. dirk valkenborgh’s (uhasselt) talk at gsk meets universities (https://info.biostrand.be/en/gskmeetsuniversities). figure 1: alignment of immunoglobulins based on hyfts patterns. conclusion while many solutions already exist for the sequence alignment problem, the most commonly used dynamic programming and heuristic approaches still suffer from pitfalls inherent in their design. new methods emerging from computer science, relying on optimization and indexing, will likely provide a leap forward in the performance and accuracy of sequence alignment methods. image source: adobestock © siarhei 335010335