The Blog

MindWalk is a biointelligence company uniting AI, multi-omics data, and advanced lab research into a customizable ecosystem for biologics discovery and development.

How retrieval-augmented generation (RAG) can transform drug discovery

in a recent article on knowledge graphs and large language models (llms) in drug discovery, we noted that despite the transformative potential of llms in drug discovery, there were several critical challenges that have to be addressed in order to ensure that these technologies conform to the rigorous standards demanded by life sciences research. synergizing knowledge graphs with llms into one bidirectional data- and knowledge-based reasoning framework addresses several concerns related to hallucinations and lack of interpretability. however, that still leaves the challenge of enabling llms access to external data sources that address their limitation with respect to factual accuracy and up-to-date knowledge recall. retrieval-augmented generation (rag), together with knowledge graphs and llms, is the third critical node on the trifecta of techniques required for the robust and reliable integration of the transformative potential of language models into drug discovery pipelines. why retrieval-augmented generation? one of the key limitations of general-purpose llms is their training data cutoff, which essentially means that their responses to queries are typically out of step with the rapidly evolving nature of information. this is a serious drawback, especially in fast-paced domains like life sciences research. retrieval-augmented generation enables biomedical research pipelines to optimize llm output by: grounding the language model on external sources of targeted and up-to-date knowledge to constantly refresh llms' internal representation of information without having to completely retrain the model. this ensures that responses are based on the most current data and are more contextually relevant. providing access to the model's information so that responses can be validated for accuracy and sources, ensuring that its claims can be checked for relevance and accuracy. in short, retrieval-augmented generation provides the framework necessary to augment the recency, accuracy, and interpretability of llm-generated information. how does retrieval-augmented generation work? retrieval augmented generation is a natural language processing (nlp) approach that combines elements of both information retrieval and text generation models to enhance the performance of knowledge-intensive tasks. the retrieval component aggregates information relevant to specific queries from a predefined set of documents or knowledge sources which then serves as the context for the generation model. once the information has been retrieved, it is combined with the input context to create an integrated context containing both the original query and the relevant retrieved information. this integrated context is then fed into a generation model to generate an accurate, coherent, and contextually appropriate response based on both pre-trained knowledge and retrieved query-specific information. the rag approach gives life sciences research teams more control over grounding data used by a biomedical llm by honing it on enterprise- and domain-specific knowledge sources. it also enables the integration of a range of external data sources, such as document repositories, databases, or apis, that are most relevant to enhancing model response to a query. the value of rag in biomedical research conceptually, the retrieve+generate model’s capabilities in terms of dealing with dynamic external information sources, minimizing hallucinations, and enhancing interpretability make it a natural and complementary fit to augment the performance of biollms. in order to quantify this augmentation in performance, a recent research effort evaluated the ability of a retrieval-augmented generative agent in biomedical question-answering vis-a-vis llms (gpt-3.5/4), state-of-the-art commercial tools (elicit, scite, and perplexity) and humans (biomedical researchers). the rag agent, paperqa, was first evaluated against a standard multiple-choice llm-evaluation dataset, pubmedqa, with the provided context removed to test the agents’ ability to retrieve information. in this case, the rag agent beats gpt-4 by 30 points (57.9% to 86.3%). next, the researchers constructed a more complex and more contemporary dataset (litqa), based on more recent full-text research papers outside the bounds of llm’s pre-training data, to compare the integrated abilities of paperqa, llms and human researchers to retrieve the right information and to generate an accurate answer based on that information. again, the rag agent outperformed both pre-trained llms and commercial tools with overall accuracy (69.5%) and precision (87.9%) scores that were on par with biomedical researchers. more importantly, the rag model produced zero hallucinated citations compared to llms (40-60%). despite being just a narrow evaluation of the performance of the retrieval+generation approach in biomedical qa, the above research does demonstrate the significantly enhanced value that rag+biollm can deliver compared to purely generative ai. the combined sophistication of retrieval and generation models can be harnessed to enhance the accuracy and efficiency of a range of processes across the drug discovery and development pipeline. retrieval-augmented generation in drug discovery in the context of drug discovery, rag can be applied to a range of tasks, from literature reviews to biomolecule design. currently, generative models have demonstrated potential for de novo molecular design but are still hampered by their inability to integrate multimodal information or provide interpretability. the rag framework can facilitate the retrieval of multimodal information, from a range of sources, such as chemical databases, biological data, clinical trials, images, etc., that can significantly augment generative molecular design. the same expanded retrieval + augmented generation template applies to a whole range of applications in drug discovery like, for example, compound design (retrieve compounds/ properties and generate improvements/ new properties), drug-target interaction prediction (retrieve known drug-target interactions and generate potential interactions between new compounds and specific targets. adverse effects prediction (retrieve known adverse and generate modifications to eliminate effects). etc. the template even applies to several sub-processes/-tasks within drug discovery to leverage a broader swathe of existing knowledge to generate novel, reliable, and actionable insights. in target validation, for example, retrieval-augmented generation can enable the comprehensive generative analysis of a target of interest based on an extensive review of all existing knowledge about the target, expression patterns and functional roles of the target, known binding sites, pertinent biological pathways and networks, potential biomarkers, etc. in short, the more efficient and scalable retrieval of timely information ensures that generative models are grounded in factual, sourceable knowledge, a combination with limitless potential to transform drug discovery. an integrated approach to retrieval-augmented generation retrieval-augmented generation addresses several of the critical limitations and augments the generative capabilities of biollms. however, there are additional design rules and multiple technological profiles that have to come together to successfully address the specific requirements and challenges of life sciences research. our lensai™ integrated intelligence platform seamlessly unifies the semantic proficiency of knowledge graphs, the versatile information retrieval capabilities of retrieval-augmented generation, and the reasoning capabilities of large language models to reinvent the understanding-retrieve-generate cycle in biomedical research. our unified approach empowers researchers to query a harmonized life science knowledge layer that integrates unstructured information & ontologies into a knowledge graph. a semantic-first approach enables a more accurate understanding of research queries, which in turn results in the retrieval of content that is most pertinent to the query. the platform also integrates retrieval-augmented generation with structured biomedical data from our hyft technology to enhance the accuracy of generated responses. and finally, lensai combines deep learning llms with neuro-symbolic logic techniques to deliver comprehensive and interpretable outcomes for inquiries. to experience this unified solution in action, please contact us here.

Creating a unified data + information architecture for scalable AI

the first blog in our series on data, information and knowledge management in the life sciences, provided an overview of some of the most commonly used data and information frameworks today. in this second blog, we will take a quick look at the data-information-knowledge continuum and the importance of creating a unified data + information architecture that can support scalable ai deployments. in 2000, a seminal knowledge management article, excerpted from the book working knowledge: how organizations manage what they know, noted that despite the distinction between the terms data, information, and knowledge being just a matter of degree, understanding that distinction could be key to organizational success and failure. the distinction itself is quite straightforward, data refers to a set of discrete, objective facts with little intrinsic relevance or purpose and provide no sustainable basis for action. data endowed with relevance and purpose becomes information that can influence judgment and behavior. and knowledge, which includes higher-order concepts such as wisdom and insight, is derived from information and enables decisions and actions. today, in the age of big data, ai (artificial intelligence), and the data-driven enterprise, the exponential increase in data volume and complexity has resulted in a rise in information gaps due to the inability to turn raw data into actionable information at scale. and the bigger the pile of data, the more the prevalence of valuable but not yet useful data. the information gap in life sciences the overwhelming nature of life sciences data typically expressed in exabase-scales, exabytes, zettabytes, or even yottabytes, and the imperative to convert this data deluge into information has resulted in the industry channeling nearly half of its technology investments into three analytics-related technologies — applied ai, industrialized ml (machine learning), and cloud and edge computing. at the same time, the key challenges in scaling analytics, according to life sciences leaders, were the lack of high-quality data sources and data integration. data integration is a key component of a successful enterprise information management (eim) strategy. however, data professionals spend an estimated 80 percent of their time on data preparation, thereby significantly slowing down the data-insight-action journey. creating the right data and information infrastructure (ia), therefore, will be critical to implementing, operationalizing, and scaling ai. or as it’s commonly articulated, no ai without ia. the right ia for ai information and data architectures share a symbiotic relationship in that the former accounts for organization structure, business strategy, and user information requirements while the latter provides the framework required to process data into information. together, they are the blueprints for an enterprise’s approach to designing, implementing, and managing a data strategy. the fundamental reasoning of the no ai without ia theorem is that ai requires machine learning, machine learning requires analytics, and analytics requires the right ia. not accidental ia, a patchwork of piecemeal efforts to architect information or traditional ia, a framework designed for legacy technology, but a modern and open ia that creates a trusted, enterprise-level foundation to deploy and operationalize sustainable ai/ml across the organization. ai information architecture can be defined in terms of six layers: data sources, source data access, data preparation and quality, analytics and ai, deployment and operationalization, and information governance and information catalog. some of the key capabilities of this architecture include support for the exchange of insights between ai models across it platforms, business systems, and traditional reporting tools. empowering users to develop and manage new ai artifacts, managing cataloging and governance of these artifacts, and promoting collaboration. and ensuring model accuracy and precision across the ai lifecycle. an ia-first approach to operationalizing ai at scale the ia-first approach to ai starts with creating a solid data foundation that facilitates the collection and storage of raw data from different perspectives and paradigms including batch collection and streaming data, structured and unstructured data, transactional and analytical data, etc. for life sciences companies, a modern ia infrastructure will address the top hurdle in scaling ai, i.e. the lack of high-quality data sources, time wasted on data preparation, and data integration. creating a unified architectural foundation to delay with life sciences big data will have a transformative impact on all downstream analytics. the next step is to make all this data business-ready and data governance plays a critical role in building the trust and transparency required to operationalize ai. in the life sciences, this includes ensuring that all data is properly protected and stored from acquisition to archival, ensuring the quality of data and metadata, engineering data for consumption, and creating standards and policies for data access and sharing. a unified data catalog that conforms to the information architecture will be key to enabling data management, data governance, and query optimization at scale. now that the data is business-ready, organizations can turn their focus to executing the full ai lifecycle. the availability of trusted data opens up additional opportunities for prediction, automation, and optimization plus prediction capabilities. in addition, people, processes, tools, and culture will also play a key role in scaling ai. the first step is to streamline ai processes with mlops to standardize and streamline the ml lifecycle and create a unified framework for ai development and operationalization. organizations must then choose the right tools and platforms, from a highly fragmented ecosystem, to build robust, repeatable workflows, with an emphasis on collaboration, speed, and safety. scaling ai will then require the creation of multidisciplinary teams organized as a center of excellence (coe) with management and governance oversight, as decentralized product, function or business unit teams with domain experts, or as a hybrid. and finally, culture is often the biggest impediment to ai adoption at scale and therefore needs the right investments in ai-ready cultural characteristics. however, deployment activity alone is not a guarantee for results with deloitte reporting that despite accelerating full-scale deployments outcomes are still lagging. the key to successfully scaling ai is to correlate technical performance with business kpis and outcomes. successful at-scale ai deployments are more likely to have adopted leading practices, like enterprise-wide platforms for ai model and application development, documented data governance and mlops procedures, and roi metrics for deployed models and applications. such deployments also deliver the strongest ai outcomes measured in revenue-generating results such as expansion into new segments and markets, creation of new products/services, and implementation of new business/service models. the success of ai depends on ia one contemporary interpretation of conway's law argues that the outcomes delivered by ai/ml deployments can only be as good as their underlying enterprise information architecture. the characteristics and limitations of, say, fragmented or legacy ia will inevitably be reflected in the performance and value of enterprise ai. a modern, open, and flexible enterprise information architecture is therefore crucial for the successful deployment of scalable, high-outcome, future-proof ai. and this architecture will be defined by a solid data foundation to transform and integrate all data, an information architecture that ensures data quality and data governance and a unified framework to standardize and streamline the ai/ml lifecycle and enable ai development and operationalization at scale. in the next blog in this series, we will look at how data architectures have evolved over time, discuss different approaches, such as etl, elt, lambda, kappa, data mesh, etc., define some hyped concepts like ‘big data’ and ‘data lakes’ and correlate all this to the context of drug discovery and development. read part 1 of our data management series: from fair principles to holistic data management in life sciences read part 3 of our data management series: ai-powered data integration and management with data fabric

From FAIR principles to holistic data management in life sciences

in january this year, the u.s. national institutes of health (nih), the largest public funder of biomedical research in the world, implemented an updated data management and sharing (dms) policy that will require all grant applications to be supported by, and comply with, a detailed dms plan. the key requirements of the dms plan include data/metadata volume and types along with their applied standards, tools, software, and code required for data access/analysis, and storage, access, distribution, and reuse protocols. a pre-implementation nih-sponsored multidisciplinary workshop to discuss this cultural shift in data management and sharing acknowledged that the policy was perceived more as a tax and an obligation than valued work within the research community. researchers also feel overwhelmed by the lack of resources and expertise that will be required to comply with these new data management norms. biomedical research also has a unique set of data management challenges that have to be addressed in order to maximize the value of vast volumes of complex, heterogeneous, siloed, interdisciplinary, and highly regulated data. plus, data management in life sciences is often seen as being a slow and costly process that is disruptive to conventional r&d workflows and with no direct roi. however, good data management can deliver cascading benefits for life sciences research – both for individual teams and the community. for research teams, effective data management standardizes data, code, and documentation. in turn, this enhances data quality, enables ai-driven workflows, and increases research efficiency. across the broader community, it lays the foundation for open science, enhances reusability and reproducibility, and ensures research integrity. most importantly, it is possible to create a strong and collaborative data management foundation by implementing established methodologies that do not disrupt current research processes or require reinventing the wheel. data management frameworks for life sciences the key dms requirements defined by the nih encompass the remit of data management. over the years, there has been a range of data management that have emerged from various disciplines that are broadly applicable to life sciences research. however, the fair principles — findability, accessibility, interoperability, and reusability — published in 2016 were the first to codify the discipline of data management for scientific research. these principles focus on the unique challenges of scientific data from the perspective of the scientist rather than that of it, and are applied to all components of the research process, including data, algorithms, tools, workflows, and pipelines. it is widely acknowledged that implementing the fair principles can improve the productivity of biopharma and other life sciences r&d. the challenge, however, is that these high-level principles provide no specific technology, standard, or implementation recommendations. instead, they merely provide the benchmark to evaluate the fairness of implementation choices. currently, frameworks are in development to coordinate fairification among all stakeholders in order to maximize interoperability, promote the reuse of existing resources and accelerate the convergence of standards and technologies in fair implementations. although a move in the good direction, note that fair principles only cover a small part of data management. it is necessary to complement the fair principles with best practices form other, more holistic frameworks (data management or related), so here are a few worth looking at. data management body of knowledge (dmbok) dmbok from dama (data management association) international, widely considered the gold standard of data management frameworks, was first published in 2009. it includes three distinct components – the dama wheel, the environmental factors hexagon, and the knowledge area context diagram. image source: dama international the dama wheel defines 11 key knowledge areas that together constitute a mature data management strategy, the environmental factors hexagon model provides the foundation for describing a knowledge area, with the context diagram framing its scope. dmbok2, released in 2017, includes some key changes. data governance, for instance, is no longer just a grand unifying theory, but delivers contextual relevance by defining specific governance activities and environmental factors relevant to each knowledge area. more broadly though, the dama-dmbok guide continues to serve as a comprehensive compilation of widely accepted principles and concepts that can help standardize activities, processes, and best practices, roles and responsibilities, deliverables and metrics, and maturity assessment. notwithstanding its widespread popularity, the dmbok framework does have certain challenges. for instance, it has been pointed out that the framework’s emphasis on providing “...the context for work carried out by data management professionals…” overlooks all the non-data professionals working with data analytics today. moreover, even though the model defines all the interrelated knowledge areas of data management, an integrated implementation of the entire framework from scratch would still require reinventing the wheel. the open group architecture framework (togaf) the togaf standard is a widely-used framework for enterprise architecture developed and maintained by members of the open group. the framework classifies enterprise architecture into four primary domains — business, data, application, and technology — spanned by other domains such as security, governance, etc. data architecture is just one component in the framework’s approach to designing and implementing enterprise architecture, and togaf offers data models, architectures, techniques, best practices, and governance principles. source: the open group at the core of the togaf standard is the togaf architecture development method (adm) which describes the approach to developing and managing the lifecycle of an enterprise architecture. this includes togaf standard elements as well as other architectural assets that are available to meet enterprise requirements. there are two more key parts to the togaf – the enterprise continuum and architecture repository. the enterprise continuum supports adm execution by providing the framework and context to leverage relevant architecture assets, including architecture descriptions, models, and patterns sourced from enterprise repositories and other available industry models and standards. the architecture repository provides reference architectures, models, and patterns that have previously been used within the organization along with architecture development work-in-progress. a key philosophy of the togaf framework is to provide a fully-featured core enterprise architecture metamodel that is broad enough to ensure out-of-the-box applicability across different contexts. at the same time, the open architecture standards enable users to apply a number of optional extension modules, for data, services, governance etc., to customize the metamodel to specific organizational needs. source: the open group this emphasis on providing a universal scaffolding that is uniquely customizable started with version 9 of the standard. the 10th edition, launched in 2022, is designed to embrace this dichotomy of universal concepts and granular configuration with a refreshed modular structure to streamline implementation across architecture styles and expanded guidance and “how-to” materials to simplify the adoption of best practices across a broad range of use cases. it infrastructure library (itil) the itil framework was developed in the 1980s to address the lack of quality in it services procured by the british government as a methodology to achieve better quality at a lower cost. the framework, currently administered and updated by axelos, defines a 5-stage service lifecycle comprising service strategy, service design, service transition, service operations, and continuous service improvement. itil 4 continues to build on the core guidance of previous versions to deliver an adaptable framework that supports traditional service management activities, aligns to transformative cloud, automation, and ai technologies, and works seamlessly with devops, lean, and agile methodologies. although this is a framework for service management, it contains a number of interesting concepts, processes and metrics that relate to data management. one example is the use of configuration management databases (cmdbs). cmdbs are a fundamental component of itil. these databases are used to store, manage and track information about individual configuration items (cis), i.e., any asset or component involved in the delivery of it services. the information recorded about each ci’s attributes, dependencies, and configuration changes allows it teams to understand how components are connected across the infrastructure and focus on managing these connections to create more efficient processes. control objectives for information and related technology (cobit) cobit is a comprehensive framework, created by isaca (information systems audit and control association) and first released in 1996. it defines the components and design factors required for implementing a management and governance system for enterprise it. governance framework principles source: isaca governance system design workflow source: isaca the latest release has shifted from iso/iec 33000 to the cmmi performance management scheme and a new governance system design workflow has been adopted to streamline implementation. cobit, being a governance framework, contains interesting data governance related metrics, key performance indicators (kpis) and processes on how an organization can follow up on quality and compliance, which is an essential part of good data management. these are just a few examples of the different approaches to information and data management that are currently in use across industries. in fact, there are quite a lot more to choose from including the gartner enterprise information management framework, zachman framework, eckerson, pwc enterprise data governance framework, dcam, sas data governance framework, dgi data governance framework, and the list goes on. taking steps towards a better data management the framework around fair principles certainly provided a good starting place for getting the basics of your data management right. in this blog we’ve shown that this alone is not enough, and we demonstrated the plethora of frameworks out there that are widely used and proven. they are priceless in avoiding reinventing the wheel and can accelerate your road to improving your data management. at biostrand, we have taken useful elements out of all these standards and frameworks to arrive at a mature data management strategy that guides the implementation of all our services. in the following blog in this series, we'll look at how a modern information architecture could set the foundation for enterprise-centric ai deployments. stay tuned. read part 2 of our data management series: creating a unified data + information architecture for scalable ai read part 3 of our data management series: ai-powered data integration and management with data fabric

The importance of reproducibility in in-silico drug discovery

reproducibility, getting the same results using the original data and analysis strategy, and replicability, is fundamental to valid, credible, and actionable scientific research. without reproducibility, replicability, the ability to confirm research results within different data contexts, becomes moot. a 2016 survey of researchers revealed a consensus that there was a crisis of reproducibility, with most researchers reporting that they failed to reproduce not only the experiments of other scientists (70%) but even their own (>50%). in biomedical research, reproducibility testing is still extremely limited, with some attempts to do so failing to comprehensively or conclusively validate reproducibility and replicability. over the years, there have been several efforts to assess and improve reproducibility in biomedical research. however, there is a new front opening in the reproducibility crisis, this time in ml-based science. according to this study, the increasing adoption of complex ml models is creating widespread data leakage resulting in “severe reproducibility failures,” “wildly overoptimistic conclusions,” and the inability to validate the superior performance of ml models over conventional statistical models. pharmaceutical companies have generally been cautious about accepting published results for a number of reasons, including the lack of scientifically reproducible data. an inability to reproduce and replicate preclinical studies can adversely impact drug development and has also been linked to drug and clinical trial failures. as drug development enters its latest innovation cycle, powered by computational in silico approaches and advanced ai-cadd integrations, reproducibility represents a significant obstacle to converting biomedical research into real-world results. reproducibility in in silico drug discovery the increasing computation of modern scientific research has already resulted in a significant shift with some journals incentivizing authors and providing badges for reproducible research papers. many scientific publications also mandate the publication of all relevant research resources, including code and data. in 2020, elife launched executable research articles (eras) that allowed authors to add live code blocks and computed outputs to create computationally reproducible publications. however, creating a robust reproducibility framework to sustain in silico drug discovery would require more transformative developments across three key dimensions: infrastructure/incentives for reproducibility in computational biology, reproducible ecosystems in research, and reproducible data management. reproducible computational biology this approach to industry-wide transformation envisions a fundamental cultural shift with reproducibility as the fulcrum for all decision-making in biomedical research. the focus is on four key domains. first, creating courses and workshops to expose biomedical students to specific computational skills and real-world biological data analysis problems and impart the skills required to produce reproducible research. second, promoting truly open data sharing, along with all relevant metadata, to encourage larger-scale data reuse. three, leveraging platforms, workflows, and tools that support the open data/code model of reproducible research. and four, promoting, incentivizing, and enforcing reproducibility by adopting fair principles and mandating source code availability. computational reproducibility ecosystem a reproducible ecosystem should enable data and code to be seamlessly archived, shared, and used across multiple projects. computational biologists today have access to a broad range of open-source and commercial resources to ensure their ecosystem generates reproducible research. for instance, data can now be shared across several recognized, domain and discipline-specific public data depositories such as pubchem, cdd vault, etc. public and private code repositories, such as github and gitlab, allow researchers to submit and share code with researchers around the world. and then there are computational reproducibility platforms like code ocean that enable researchers to share, discover, and run code. reproducible data management as per a recent data management and sharing (dms) policy issued by the nih, all applications for funding will have to be accompanied by a dms plan detailing the strategy and budget to manage and share research data. sharing scientific data, the nih points out, accelerates biomedical research discovery through validating research, increasing data access, and promoting data reuse. effective data management is critical to reproducibility and creating a formal data management plan prior to the commencement of a research project helps clarify two key facets of the research: one, key information about experiments, workflows, types, and volumes of data generated, and two, research output format, metadata, storage, and access and sharing policies. the next critical step towards reproducibility is having the right systems to document the process, including data/metadata, methods and code, and version control. for instance, reproducibility in in silico analyses relies extensively on metadata to define scientific concepts as well as the computing environment. in addition, metadata also plays a major role in making data fair. it is therefore important to document experimental and data analysis metadata in an established standard and store it alongside research data. similarly, the ability to track and document datasets as they adapt, reorganize, extend, and evolve across the research lifecycle will be crucial to reproducibility. it is therefore important to version control data so that results can be traced back to the precise subset and version of data. of course, the end game for all of that has to be the sharing of data and code, which is increasingly becoming a prerequisite as well as a voluntarily accepted practice in computational biology. one survey of 188 researchers in computational biology found that those who authored papers were largely satisfied with their ability to carry out key code-sharing tasks such as ensuring good documentation and that the code was running in the correct environment. the average researcher, however, would not commit any more time, effort, or expenditure to share code. plus, there still are certain perceived barriers that need to be addressed before the public archival of biomedical research data and code becomes prevalent. the future of reproducibility in drug discovery a 2014 report from the american association for the advancement of science (aaas) estimated that the u.s. alone spent approximately $28 billion yearly on irreproducible preclinical research. in the future, a set of blockchain-based frameworks may well enable the automated verification of the entire research process. meanwhile, in silico drug discovery has emerged as one of the maturing innovation areas in the pharmaceutical industry. the alliance between pharmaceutical companies and research-intensive universities has been a key component in de-risking drug discovery and enhancing its clinical and commercial success. reproducibility-related improvements and innovations will help move this alliance to a data-driven, ai/ml-based, in silico model of drug discovery.

Creating an AI-ready data foundation for successful AI-enabled drug discovery

over the past year, we have looked at drug discovery and development from several different perspectives. for instance, we looked at the big data frenzy in biopharma, as zettabytes of sequencing, real-world and textual data (rwd) pile up and stress the data integration and analytic capabilities of conventional solutions. we also discussed how the time-consuming, cost-intensive, low productivity characteristics of the prevalent roi-focused model of development have an adverse impact not just on commercial viability in the pharma industry but on the entire healthcare ecosystem. then we saw how antibody drug discovery processes continued to be cited as the biggest challenge in therapeutic r&d even as the industry was pivoting to biologics and mabs. no matter the context or frame of reference, the focus inevitably turns to how ai technologies can transform the entire drug discovery and development process, from research to clinical trials. biopharma companies have traditionally been slow to adopt innovative technologies like ai and the cloud. today, however, digital innovation has become an industry-wide priority with drug development expected to be the most impacted by smart technologies. from application-centric to data-centric ai technologies have a range of applications across the drug discovery and development pipeline, from opening up new insights into biological systems and diseases to streamlining drug design to optimizing clinical trials. despite the wide-ranging potential of ai-driven transformation in biopharma, the process does entail some complex challenges. the most fundamental challenge will be to make the transformative shift from an application-centric to a data-centric culture, where data and metadata are operationalized at scale and across the entire drug design and development value chain. however, creating a data-centric culture in drug development comes with its unique set of data-related challenges. to start with there is the sheer scale of data that will require a scalable architecture in order to be efficient and cost-effective. most of this data is often distributed across disparate silos with unique storage practices, quality procedures, and naming and labeling conventions. then there is the issue of different data modalities, from mr or ct scans to unstructured clinical notes, that have to be extracted, transformed, and curated at scale for unified analysis. and finally, the level of regulatory scrutiny on sensitive biomedical data means that there is this constant tension between enabling collaboration and ensuring compliance. therefore, creating a strong data foundation that accounts for all these complexities in biopharma data management and analysis will be critical to ensuring the successful adoption of ai in drug development. three key requisites for an ai-ready data foundation successful ai adoption in drug development will depend on the creation of a data foundation that addresses these three key requirements. accessibility data accessibility is a key characteristic of ai leaders irrespective of sector. in order to ensure effective and productive data democratization, organizations need to enable access to data distributed across complex technology environments spanning multiple internal and external stakeholders and partners. a key caveat of accessibility is that the data provided should be contextual to the analytical needs of specific data users and consumers. a modern cloud-based and connected enterprise data and ai platform designed as a “one-stop-shop” for all drug design and development-related data products with ready-to-use analytical models will be critical to ensuring broader and deeper data accessibility for all users. data management and governance the quality of any data ecosystem is determined by the data management and governance frameworks that ensure that relevant information is accessible to the right people at the right time. at the same time, these frameworks must also be capable of protecting confidential information, ensuring regulatory compliance, and facilitating the ethical and responsible use of ai. therefore, the key focus of data management and governance will be to consistently ensure the highest quality of data across all systems and platforms as well as full transparency and traceability in the acquisition and application of data. ux and usability successful ai adoption will require a data foundation that streamlines accessibility and prioritizes ux and usability. apart from democratizing access, the emphasis should also be on ensuring that even non-technical users are able to use data effectively and efficiently. different users often consume the same datasets from completely different perspectives. the key, therefore, is to provide a range of tools and features that help every user customize the experience to their specific roles and interests. apart from creating the right data foundation, technology partnerships can also help accelerate the shift from an application-centric to a data-centric approach to ai adoption. in fact, a 2018 gartner report advised organizations to explore vendor offerings as a foundational approach to jump-start their efforts to make productive use of ai. more recently, pharma-technology partnerships have emerged as the fastest-moving model for externalizing innovation in ai-enabled drug discovery. according to a recent roots analysis report on the ai-based drug discovery market, partnership activity in the pharmaceutical industry has grown at a cagr of 50%, between 2015 and 2021, with a majority of the deals focused on research and development. so with that trend as background, here’s a quick look at how a data-centric, full-service biotherapeutic platform can accelerate biopharma’s shift to an ai-first drug discovery model. the lensai™ approach to data-centric drug development our approach to biotherapeutic research places data at the very core of a dynamic network of biological and artificial intelligence technologies. with our lensai platform, we have created a google-like solution for the entire biosphere, organizing it into a multidimensional network of 660 million data objects with multiple layers of information about sequence, syntax, and protein structure. this “one-stop-shop” model enables researchers to seamlessly access all raw sequence data. in addition, hyfts®, our universal framework for organizing all biological data, allows easy, one-click integration of all other research-relevant data from across public and proprietary data repositories. researchers can then leverage the power of the lensai integrated intelligence platform to integrate unstructured data from text-based knowledge sources such as scientific journals, ehrs, clinical notes, etc. here again, researchers have the ability to expand the core knowledge base, containing over 33 million abstracts from the pubmed biomedical literature database, by integrating data from multiple sources and knowledge domains, including proprietary databases. around this multi-source, multi-domain, data-centric core, we have designed next-generation ai technologies that can instantly and concurrently convert these vast volumes of text, sequence, and protein structure data into meaningful knowledge that can transform drug discovery and development.

Challenges in multi-omics data integration

today, the integrative computational analysis of multi-omics data has become a central tenet of the big data-driven approach to biological research. and yet, there is still a lack of gold standards when it comes to evaluating and classifying integration methodologies that can be broadly applied across multi-omics analysis. more importantly, the lack of a cohesive or universal approach to big data integration is also creating new challenges in the development of novel computational approaches for multi-omics analysis. one aspect of sequence search and comparison, however, has not changed much at all – a biological sequence in a predefined and acceptable data format is still the primary input in most research. this approach is probably and arguably valid in many if not most real-world research scenarios. take machine learning (ml) models, for instance, which are increasingly playing a central role in the analysis of genomic big data. biological data presents several unique challenges, such as missing values and precision variations across omics modalities, that simply expand the gamut of integration strategies required to address each specific challenge. for example, omics datasets often contain missing values, which can hamper downstream integrative bioinformatics analyses. this requires an additional imputation process to infer the missing values in these incomplete datasets before statistical analyses can be applied. then there is the high-dimension low sample size (hdlss) problem, where the variables significantly outnumber samples, leading ml algorithms to overfit these datasets, thereby decreasing their generalisability on new data. in addition, there are multiple challenges inherent to all biological data irrespective of analytical methodology or framework. to start with there is the sheer heterogeneity of omics data comprising a variety of datasets originating from a range of data modalities and comprising completely different data distributions and types that have to be handled appropriately. integrating heterogeneous multi-omics data presents a cascade of challenges involving the unique data scaling, normalisation, and transformation requirements of each individual dataset. any effective integration strategy will also have to account for the regulatory relationships between datasets from different omics layers in order to accurately and holistically reflect the nature of this multidimensional data. furthermore, there is the issue of integrating omics and non-omics (ono) data, like clinical, epidemiological or imaging data, for example, in order to enhance analytical productivity and to access richer insights into biological events and processes. currently, the large-scale integration of non-omics data with high-throughput omics data is extremely limited due to a range of factors, including heterogeneity and the presence of subphenotypes, for instance. the crux of the matter is that without effective and efficient data integration, multi-omics analysis will only tend to become more complex and resource-intensive without any proportional or even significant augmentation in productivity, performance, or insight generation. an overview of multi-omics data integration early approaches to multi-omics analysis involved the independent analysis of different data modalities and combining results for a quasi-integrated view of molecular interactions. but the field has evolved significantly since then into a broad range of novel, predominantly algorithmic meta-analysis frameworks and methodologies for the integrated analysis of multi-dimensional multi-omics data. however, the topic of data integration and the challenges involved is often overshadowed by the ground-breaking developments in integrated, multi-omics analysis. it is therefore essential to understand the fundamental conceptual principles, rather than any specific method or framework, that define multi-omics data integration. horizontal vs vertical data integration multi-omics datasets are broadly organized as horizontal or vertical, corresponding to the complexity and heterogeneity of multi-omics data. horizontal datasets are typically generated from one or two technologies, for a specific research question and from a diverse population, and represent a high degree of real-world biological and technical heterogeneity. horizontal or homogeneous data integration, therefore, involves combining data from across different studies, cohorts or labs that measure the same omics entities. vertical data refers to data generated using multiple technologies, probing different aspects of the research question, and traversing the possible range of omics variables including the genome, metabolome, transcriptome, epigenome, proteome, microbiome, etc. vertical, or heterogeneous, data integration involves multi-cohort datasets from different omics levels, measured using different technologies and platforms. the fact that vertical integration techniques cannot be applied for horizontal integrative analysis and vice-versa opens up an opportunity for conceptual innovation in multi-omics for data integration techniques that can enable an integrative analysis of both horizontal and vertical multi-omics datasets. of course, each of these broad data heads can further be broken down into a range of approaches based on utility and efficiency. 5 integration strategies for vertical data a 2021 mini-review of general approaches to vertical data integration for ml analysis defined five distinct integration strategies – early, mixed, intermediate, late and hierarchical – based not just on the underlying mathematics but on a variety of factors including how they were applied. here’s a quick rundown of each approach. early integration is a simple and easy-to-implement approach that concatenates all omics datasets into a single large matrix. this increases the number of variables, without altering the number of observations, which results in a complex, noisy and high dimensional matrix that discounts dataset size difference and data distribution. mixed integration addresses the limitations of the early model by separately transforming each omics dataset into a new representation and then combining them for analysis. this approach reduces noise, dimensionality, and dataset heterogeneities. intermediate integration simultaneously integrates multi-omics datasets to output multiple representations, one common and some omics-specific. however, this approach often requires robust pre-processing due to potential problems arising from data heterogeneity. late integration circumvents the challenges of assembling different types of omics datasets by analysing each omics separately and combining the final predictions. this multiple single-omics approach does not capture inter-omics interactions. hierarchical integration focuses on the inclusion of prior regulatory relationships between different omics layers so that analysis can reveal the interactions across layers. though this strategy truly embodies the intent of trans-omics analysis, this is still a nascent field with many hierarchical methods focusing on specific omics types, thereby making them less generalisable. the availability of an unenviable choice of conceptual approaches – each with its own scope and limitations in terms of throughput, performance, and accuracy – to multi-omics data integration represents one of the biggest bottlenecks to downstream analysis and biological innovation. researchers often spend more time mired in the tedium of data munging and wrangling than they do extracting knowledge and novel insights. most conventional approaches to data integration, moreover, seem to involve some form of compromise involving either the integrity of high-throughput multi-omics data or achieving true trans-omics analysis. there has to be a new approach to multi-omics data integration that can 1), enable the one-click integration of all omics and non-omics data, and 2), preserve the biological consistency, in terms of correlations and associations across different regulatory datasets, for integrative multi-omics analysis in the process. the mindwalk hyft model for data integration at mindwalk, we took a lateral approach to the challenge of biological data integration. rather than start with a technological framework that could be customised for the complexity and heterogeneity of multi-omics data, we set out to decode the atomic units of all biological information that we call hyfts™. hyfts are essentially the building blocks of biological information, which means that they enable the tokenisation of all biological data, irrespective of species, structure, or function, to a common omics data language. we then built the framework to identify, collate, and index hyfts from sequence data. this enabled us to create a proprietary pangenomic knowledge database of over 660 million hyfts, each containing comprehensive information about variation, mutation, structure, etc., from over 450 million sequences available across 12 popular public databases. with the mindwalk platform, researchers and bioinformaticians have instant access to all the data from some of the most widely used omics data sources. plus, our unique hyfts framework allows researchers the convenience of one-click normalization and integration of all their proprietary omics data and metadata. based on our biological discovery, we were able to normalise and integrate all publicly available omics data, including patent data, at scale, and render them multi-omics analysis-ready. the same hyft ip can also be applied to normalise and integrate proprietary omics data. the transversal language of hyfts enables the instant normalisation and integration of multi-omics research-relevant data and metadata into one single source of truth. with the mindwalk approach to multi-omic data integration, it is no longer about whether research data is horizontal or vertical, homogeneous or heterogeneous, text or sequence, omics or non-omics. if it is data that is relevant to your research, mindwalk enables you to integrate it with just one click.