The Blog
MindWalk is a biointelligence company uniting AI, multi-omics data, and advanced lab research into a customizable ecosystem for biologics discovery and development.
There is a compelling case underlying the tremendous interest in generative AI and LLMs as the next big technological inflection point in computational drug discovery and development. For starters, LLMs help expand the data universe of in silico drug discovery, especially by opening up access to huge volumes of valuable information locked away in unstructured textual data sources, including scientific literature, public databases, clinical trial notes, and patient records. LLMs provide the much-needed capability to analyze this information, identify patterns and connections, and extract novel insights about disease mechanisms and potential therapeutic targets. Their ability to interpret complex scientific concepts and elucidate connections between diseases, genes, and biological processes can help accelerate disease hypothesis generation and the identification of potential drug targets and biomarkers.

When integrated with biomedical knowledge graphs, LLMs help create a unique synergistic model that enables bidirectional data- and knowledge-based reasoning. The explicit structured knowledge of knowledge graphs enhances the knowledge of LLMs, while the power of language models streamlines graph construction and conversational user interactions with complex knowledge bases.

However, there are still several challenges that have to be addressed before LLMs can be reliably integrated into in silico drug discovery pipelines and workflows. One of these is hallucinations.

Why do LLMs hallucinate?

At a time of some speculation about laziness and seasonal depression in LLMs, a hallucination leaderboard of 11 public LLMs revealed hallucination rates that ranged from 3% at the top end to 27% at the bottom of the barrel. Another comparative study of two versions of a popular LLM generating ophthalmic scientific abstracts revealed very high rates of fabricated references (33% and 29%).
This tendency of LLMs to hallucinate, that is, to present incorrect or unverifiable knowledge as accurate, can have serious consequences in critical drug discovery applications even at a rate of 3%.

There are several reasons for LLM hallucinations. At the core of this behavior is the fact that generative AI models have no actual intelligence, relying instead on a probability-based approach to predict the data that is most likely to occur based on patterns and contexts ‘learned’ from their training data. Apart from this inherent lack of contextual understanding, other potential causes include exposure to noise, errors, biases, and inconsistencies in training data, training and generation methods, or even prompting techniques.

For some, hallucination is all LLMs do, and others see it as inevitable for any prompt-based large language model. In the context of life sciences research, however, mitigating LLM hallucinations remains one of the biggest obstacles to the large-scale and strategic integration of this potentially transformative technology.

How to mitigate LLM hallucinations?

There are three broad and complementary approaches to mitigating hallucinations in large language models: prompt engineering, fine-tuning, and grounding + prompt augmentation.

Prompt engineering

Prompt engineering is the process of strategically designing user inputs, or prompts, in order to guide model behavior and obtain optimal responses. There are three major approaches to prompt engineering: zero-shot, few-shot, and chain-of-thought prompts. In zero-shot prompting, language models are given a task without any worked examples in the prompt and are still expected to generate reliable results. Few-shot prompting involves providing examples to LLMs before presenting the actual query. Chain-of-thought (CoT) prompting is based on the finding that a series of intermediate reasoning steps provided as examples during prompting can significantly improve the reasoning capabilities of large language models.
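The three prompting styles described above differ only in what is placed in front of the actual query. The sketch below is a minimal illustration under that reading: the function names and the prompt wording are hypothetical, and the code only assembles prompt text without calling any model.

```python
# Hypothetical helpers that only build prompt strings; no LLM is called here.


def zero_shot(question: str) -> str:
    """Zero-shot: the task is stated with no worked examples."""
    return f"Answer the question.\n\nQ: {question}\nA:"


def few_shot(question: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: solved (question, answer) pairs precede the actual query."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"


def chain_of_thought(question: str, examples: list[tuple[str, str, str]]) -> str:
    """CoT: each example shows intermediate reasoning before its answer."""
    shots = "\n\n".join(
        f"Q: {q}\nReasoning: {r}\nA: {a}" for q, r, a in examples
    )
    # Ending on "Reasoning:" invites the model to write its steps first.
    return f"{shots}\n\nQ: {question}\nReasoning:"


if __name__ == "__main__":
    print(zero_shot("Which gene encodes the protein p53?"))
    print(few_shot(
        "Which gene encodes the protein p53?",
        [("Which gene encodes the BRCA1 protein?", "BRCA1")],
    ))
    print(chain_of_thought(
        "Is imatinib a kinase inhibitor?",
        [(
            "Is aspirin a cyclooxygenase inhibitor?",
            "Aspirin irreversibly inhibits cyclooxygenase enzymes.",
            "Yes",
        )],
    ))
```

Keeping prompt construction in small functions like these also makes it easier to log and compare prompt variants, since the lack of objective metrics for prompt effectiveness is one of the limitations of this approach.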
The chain-of-thought concept has been expanded to include new techniques such as chain-of-verification (CoVe), a self-verification process that enables LLMs to check the accuracy and reliability of their output, and chain of density (CoD), a process that focuses on summarization rather than reasoning to control the density of information in the generated text.

Prompt engineering, however, has its own set of limitations, including prompt constraints that may cramp the ability to query complex domains and the lack of objective metrics to quantify prompt effectiveness.

Fine-tuning

Where the focus of prompt engineering is on the skill required to elicit better LLM output, fine-tuning emphasizes task-specific training in order to enhance the performance of pre-trained models in specific topics or domain areas. A conventional approach to LLM fine-tuning is full fine-tuning, which involves the additional training of pre-trained models on labeled, domain- or task-specific data in order to generate more contextually relevant responses. This is a time-, resource-, and expertise-intensive process.

An alternative approach is parameter-efficient fine-tuning (PEFT), conducted on a small set of extra parameters without adjusting the entire model. The modular nature of PEFT means that the training can prioritize select portions or components of the original parameters so that the pre-trained model can be adapted for multiple tasks. LoRA (low-rank adaptation of large language models), a popular PEFT technique, can significantly reduce the resource intensity of fine-tuning while matching the performance of full fine-tuning.

There are, however, challenges to fine-tuning, including domain shift issues, the potential for bias amplification and catastrophic forgetting, and the complexities involved in choosing the right hyperparameters for fine-tuning in order to ensure optimal performance.
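To give a feel for why LoRA reduces the resource intensity of fine-tuning, the sketch below counts trainable parameters for a single weight matrix and applies the low-rank update W = W0 + (alpha / r) * B * A from the LoRA formulation. It is a toy illustration in plain Python: the function names are hypothetical, the matrices are tiny, and no training takes place.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Trainable parameters for one d x k weight: full update vs rank-r LoRA."""
    full = d * k
    lora = r * (d + k)  # B is d x r and A is r x k
    return full, lora


def matmul(x: list[list[float]], y: list[list[float]]) -> list[list[float]]:
    """Plain-Python matrix product, adequate for tiny illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w0, b, a, alpha: float, r: int):
    """Return W0 + (alpha / r) * B @ A; W0 stays frozen, only A and B would be trained."""
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


if __name__ == "__main__":
    full, lora = lora_param_counts(d=4096, k=4096, r=8)
    print(full, lora, full // lora)  # 16777216 65536 256

    w0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen pre-trained weight
    b = [[1.0], [2.0]]              # d x r, with r = 1
    a = [[3.0, 4.0]]                # r x k
    print(lora_effective_weight(w0, b, a, alpha=1.0, r=1))
```

For the 4096 x 4096 example, a rank-8 adapter trains 256 times fewer parameters than a full update of the same matrix, which is where most of the savings come from.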
Grounding & augmentation

LLM hallucinations are often the result of language models attempting to generate knowledge based on information that they have not explicitly memorized or seen. The logical solution, therefore, would be to provide LLMs with access to a curated knowledge base of high-quality contextual information that enables them to generate more accurate responses.

Advanced grounding and prompt augmentation techniques can help address many of the accuracy and reliability challenges associated with LLM performance. Both techniques rely on external knowledge sources to dynamically generate context. Grounding ensures that LLMs have access to up-to-date and use-case-specific information sources to provide the relevant context that may not be available solely from the training data. Similarly, prompt augmentation enhances a prompt with contextually relevant information that enables LLMs to generate a more accurate and pertinent output.

Factual grounding is a technique typically used in the pre-training phase to ensure that LLM output across a variety of tasks is consistent with a knowledge base of factual statements. Post-training grounding relies on a range of external knowledge bases, including documents, code repositories, and public and proprietary databases, to improve the accuracy and relevance of LLMs on specific tasks.

Retrieval-augmented generation (RAG) is a distinct framework for the post-training grounding of LLMs based on the most accurate, up-to-date information retrieved from external knowledge bases. The RAG framework enables the optimization of biomedical LLM output along three key dimensions. One, access to targeted external knowledge sources ensures that LLMs' internal representation of information is dynamically refreshed with the most current and contextually relevant data. Two, access to an LLM’s information sources ensures that responses can be validated for relevance and accuracy.
And three, there is the emerging potential to extend the RAG framework beyond just text to multimodal knowledge retrieval, spanning images, audio, tables, etc., that can further boost the factuality, interpretability, and sophistication of LLMs.

Also read: How retrieval-augmented generation (RAG) can transform drug discovery

One of the key challenges of retrieval-augmented generation is the high initial cost of implementation as compared to standalone generative AI. However, in the long run, the RAG-LLM combination will be less expensive than frequently fine-tuning LLMs and provides the most comprehensive approach to mitigating LLM hallucinations.

But even with better grounding and retrieval, scientific applications demand another layer of rigor: validation and reproducibility. Here’s how teams can build confidence in LLM outputs before trusting them in high-stakes discovery workflows.

How to validate LLM outputs in drug discovery pipelines

In scientific settings like drug discovery, ensuring the validity of large language model (LLM) outputs is critical, especially when such outputs may inform downstream experimental decisions. Here are key validation strategies used to assess LLM-generated content in biomedical pipelines:

Validation checklist:

Compare outputs to curated benchmarks: use structured, peer-reviewed datasets such as DrugBank, ChEMBL, or internal gold standards to benchmark LLM predictions.
Cross-reference with experimental data: validate AI-generated hypotheses against published experimental results, or integrate with in-house wet lab data for verification.
Establish feedback loops from in vitro validations: create iterative pipelines where lab-tested results refine future model prompts, improving accuracy over time.

Advancing reproducibility in AI-augmented science

For LLM-assisted workflows to be trustworthy and audit-ready, they must be reproducible, particularly when used in regulated environments.
Reproducibility practices:

Dataset versioning: track changes in source datasets, ensuring that each model run references a consistent data snapshot.
Prompt logging: store full prompts (including context and input structure) to reproduce specific generations and analyze outputs over time.
Controlled inference environments: standardize model versions, hyperparameters, and APIs to eliminate variation in inference across different systems.

Integrated intelligence with LENSai™

Holistic life sciences research requires the sophisticated orchestration of several innovative technologies and frameworks. LENSai Integrated Intelligence, our next-generation data-centric AI platform, fluently blends some of the most advanced proprietary technologies into one seamless solution that empowers end-to-end drug discovery and development. LENSai integrates RAG-enhanced BioLLMs with an ontology-driven NLP framework, combining neuro-symbolic logic techniques to connect and correlate syntax (multi-modal sequential and structural data) and semantics (biological functions). A comprehensive and continuously expanding knowledge graph, mapping a remarkable 25 billion relationships across 660 million data objects, links sequence, structure, function, and literature information from the entire biosphere to provide a comprehensive overview of the relationships between genes, proteins, structures, and biological pathways. Our next-generation, unified, knowledge-driven approach to the integration, exploration, and analysis of heterogeneous biomedical data empowers life sciences researchers with the high-tech capabilities needed to explore novel opportunities in drug discovery and development.
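As a rough illustration of the prompt logging and controlled inference practices listed earlier in this post, the sketch below stores one record per generation with the full prompt, the retrieved context, the model version, the inference parameters, and the dataset snapshot it ran against. All field names and the example values are assumptions made for illustration, not part of any specific platform.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_run_record(prompt, context, model_version, params, dataset_snapshot):
    """Bundle everything needed to reproduce one generation (hypothetical schema)."""
    record = {
        "prompt": prompt,
        "context": context,
        "model_version": model_version,
        "params": params,
        "dataset_snapshot": dataset_snapshot,
    }
    # Fingerprint the inputs before adding the timestamp, so two runs with
    # identical inputs share the same run_id and can be compared directly.
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    record["run_id"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    record["logged_at"] = datetime.now(timezone.utc).isoformat()
    return record


def append_to_log(record, path):
    """Append the record as one JSON line to an append-only log file."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    rec = make_run_record(
        prompt="Summarize known binding sites for the target.",
        context=["passage 1", "passage 2"],
        model_version="example-model-2024-01",  # placeholder identifier
        params={"temperature": 0.0, "max_tokens": 256},
        dataset_snapshot="snapshot-001",        # placeholder identifier
    )
    append_to_log(rec, "llm_runs.jsonl")
    print(rec["run_id"])
```

Because the run identifier is derived only from the inputs, a changed output under an unchanged identifier points to variation in the inference environment rather than in the prompt or data.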
In a recent article on knowledge graphs and large language models (LLMs) in drug discovery, we noted that despite the transformative potential of LLMs in drug discovery, there are several critical challenges that have to be addressed in order to ensure that these technologies conform to the rigorous standards demanded by life sciences research.

Synergizing knowledge graphs with LLMs into one bidirectional data- and knowledge-based reasoning framework addresses several concerns related to hallucinations and lack of interpretability. However, that still leaves the challenge of enabling LLMs to access external data sources that address their limitations with respect to factual accuracy and up-to-date knowledge recall. Retrieval-augmented generation (RAG), together with knowledge graphs and LLMs, is the third critical node in the trifecta of techniques required for the robust and reliable integration of the transformative potential of language models into drug discovery pipelines.

Why retrieval-augmented generation?

One of the key limitations of general-purpose LLMs is their training data cutoff, which essentially means that their responses to queries are typically out of step with the rapidly evolving nature of information. This is a serious drawback, especially in fast-paced domains like life sciences research.

Retrieval-augmented generation enables biomedical research pipelines to optimize LLM output by:

Grounding the language model on external sources of targeted and up-to-date knowledge to constantly refresh LLMs' internal representation of information without having to completely retrain the model. This ensures that responses are based on the most current data and are more contextually relevant.
Providing access to the model's information sources so that its responses and claims can be checked for relevance and accuracy.
In short, retrieval-augmented generation provides the framework necessary to augment the recency, accuracy, and interpretability of LLM-generated information.

How does retrieval-augmented generation work?

Retrieval-augmented generation is a natural language processing (NLP) approach that combines elements of both information retrieval and text generation models to enhance performance on knowledge-intensive tasks. The retrieval component aggregates information relevant to specific queries from a predefined set of documents or knowledge sources, which then serves as the context for the generation model. Once the information has been retrieved, it is combined with the input to create an integrated context containing both the original query and the relevant retrieved information. This integrated context is then fed into a generation model to generate an accurate, coherent, and contextually appropriate response based on both pre-trained knowledge and retrieved query-specific information.

The RAG approach gives life sciences research teams more control over the grounding data used by a biomedical LLM by honing it on enterprise- and domain-specific knowledge sources. It also enables the integration of a range of external data sources, such as document repositories, databases, or APIs, that are most relevant to enhancing model response to a query.

The value of RAG in biomedical research

Conceptually, the retrieve+generate model’s capabilities in terms of dealing with dynamic external information sources, minimizing hallucinations, and enhancing interpretability make it a natural and complementary fit to augment the performance of BioLLMs. In order to quantify this augmentation in performance, a recent research effort evaluated the ability of a retrieval-augmented generative agent in biomedical question-answering vis-a-vis LLMs (GPT-3.5/4), state-of-the-art commercial tools (Elicit, Scite, and Perplexity), and humans (biomedical researchers).
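The retrieve, combine, and generate steps described above can be sketched in a few lines. This is a deliberately simple illustration, not how any particular RAG system is built: retrieval here is a bag-of-words cosine similarity swapped in for a real retriever, the prompt wording is an assumption, and `generate` is a placeholder for whatever language model call a team would actually use.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Step 1: rank documents against the query and keep the best matches."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(query: str, passages: list[str]) -> str:
    """Step 2: combine the query and retrieved passages into one context."""
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the numbered sources and cite them.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}\nAnswer:"
    )


def rag_answer(query, documents, generate, top_k=2):
    """Step 3: pass the integrated context to a generation model (placeholder)."""
    passages = retrieve(query, documents, top_k)
    return generate(build_prompt(query, passages)), passages


if __name__ == "__main__":
    docs = [
        "Imatinib inhibits the BCR-ABL tyrosine kinase.",
        "TP53 encodes the tumor suppressor protein p53.",
        "Aspirin irreversibly inhibits cyclooxygenase enzymes.",
    ]
    # Echo the prompt instead of calling a model, to show what it would receive.
    answer, used = rag_answer("Which kinase does imatinib inhibit?", docs, lambda p: p)
    print(answer)
```

Returning the retrieved passages alongside the answer is what makes the response checkable against its sources, which is the interpretability benefit the post attributes to RAG.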
The RAG agent, PaperQA, was first evaluated against a standard multiple-choice LLM-evaluation dataset, PubMedQA, with the provided context removed to test the agents’ ability to retrieve information. In this case, the RAG agent beat GPT-4 by nearly 30 points (86.3% versus 57.9%). Next, the researchers constructed a more complex and more contemporary dataset (LitQA), based on more recent full-text research papers outside the bounds of the LLMs' pre-training data, to compare the integrated abilities of PaperQA, LLMs, and human researchers to retrieve the right information and to generate an accurate answer based on that information. Again, the RAG agent outperformed both pre-trained LLMs and commercial tools, with overall accuracy (69.5%) and precision (87.9%) scores that were on par with biomedical researchers. More importantly, the RAG model produced zero hallucinated citations compared to LLMs (40-60%).

Despite being just a narrow evaluation of the performance of the retrieval+generation approach in biomedical QA, the above research does demonstrate the significantly enhanced value that RAG+BioLLM can deliver compared to purely generative AI. The combined sophistication of retrieval and generation models can be harnessed to enhance the accuracy and efficiency of a range of processes across the drug discovery and development pipeline.

Retrieval-augmented generation in drug discovery

In the context of drug discovery, RAG can be applied to a range of tasks, from literature reviews to biomolecule design. Currently, generative models have demonstrated potential for de novo molecular design but are still hampered by their inability to integrate multimodal information or provide interpretability. The RAG framework can facilitate the retrieval of multimodal information from a range of sources, such as chemical databases, biological data, clinical trials, images, etc., that can significantly augment generative molecular design.
The same expanded retrieval + augmented generation template applies to a whole range of applications in drug discovery, for example, compound design (retrieve compounds/properties and generate improvements/new properties), drug-target interaction prediction (retrieve known drug-target interactions and generate potential interactions between new compounds and specific targets), and adverse effects prediction (retrieve known adverse effects and generate modifications to eliminate them). The template even applies to several sub-processes and sub-tasks within drug discovery to leverage a broader swathe of existing knowledge to generate novel, reliable, and actionable insights. In target validation, for example, retrieval-augmented generation can enable the comprehensive generative analysis of a target of interest based on an extensive review of all existing knowledge about the target: expression patterns and functional roles of the target, known binding sites, pertinent biological pathways and networks, potential biomarkers, etc. In short, the more efficient and scalable retrieval of timely information ensures that generative models are grounded in factual, sourceable knowledge, a combination with limitless potential to transform drug discovery.

An integrated approach to retrieval-augmented generation

Retrieval-augmented generation addresses several of the critical limitations and augments the generative capabilities of BioLLMs. However, there are additional design rules and multiple technological profiles that have to come together to successfully address the specific requirements and challenges of life sciences research. Our LENSai™ Integrated Intelligence platform seamlessly unifies the semantic proficiency of knowledge graphs, the versatile information retrieval capabilities of retrieval-augmented generation, and the reasoning capabilities of large language models to reinvent the understand-retrieve-generate cycle in biomedical research.
Our unified approach empowers researchers to query a harmonized life science knowledge layer that integrates unstructured information & ontologies into a knowledge graph. A semantic-first approach enables a more accurate understanding of research queries, which in turn results in the retrieval of content that is most pertinent to the query. The platform also integrates retrieval-augmented generation with structured biomedical data from our HYFT technology to enhance the accuracy of generated responses. And finally, LENSai combines deep learning LLMs with neuro-symbolic logic techniques to deliver comprehensive and interpretable outcomes for inquiries. To experience this unified solution in action, please contact us here.
Natural language understanding (NLU) is an AI-powered technology that allows machines to understand the structure and meaning of human languages. NLU, like natural language generation (NLG), is a subset of natural language processing (NLP) that focuses on assigning structure, rules, and logic to human language so machines can understand the intended meaning of words, phrases, and sentences in text. NLG, on the other hand, deals with generating realistic written/spoken human-understandable information from structured and unstructured data.

Since the development of NLU is based on theoretical linguistics, the process can be explained in terms of the following linguistic levels of language comprehension.

Linguistic levels in NLU

Phonology is the study of sound patterns in different languages/dialects, and in NLU it refers to the analysis of how sounds are organized, and their purpose and behavior.

Lexical or morphological analysis is the study of morphemes, indivisible basic units of language with their own meaning, one at a time. Indivisible words with their own meaning, or lexical morphemes (e.g., work), can be combined with plural morphemes (e.g., works) or grammatical morphemes (e.g., worked/working) to create word forms. Lexical analysis identifies relationships between morphemes and converts words into their root form.

Syntactic analysis, or syntax analysis, is the process of applying grammatical rules to word clusters and organizing them on the basis of their syntactic relationships in order to determine meaning. This also involves detecting grammatical errors in sentences.

While syntactic analysis involves extracting meaning from the grammatical syntax of a sentence, semantic analysis looks at the context and purpose of the text. It helps capture the true meaning of a piece of text by identifying text elements as well as their grammatical role.
Discourse analysis expands the focus from sentence-length units to look at the relationships between sentences and their impact on overall meaning. Discourse refers to coherent groups of sentences that contribute to the topic under discussion.

Pragmatic analysis deals with aspects of meaning not reflected in syntactic or semantic relationships. Here the focus is on identifying the intended meaning for readers by analyzing literal and non-literal components against the context of background knowledge.

Common tasks/techniques in NLU

There are several techniques that are used in the processing and understanding of human language. Here’s a quick run-through of some of the key techniques used in NLU and NLP.

Tokenization is the process of breaking down a string of text into smaller units called tokens. For instance, a text document could be tokenized into sentences, phrases, words, subwords, and characters. This is a critical preprocessing step toward converting unstructured text into numerical data for further analysis.

Stemming and lemmatization are two different approaches with the same objective: to reduce a particular word to its root word. In stemming, characters are removed from the end of a word to arrive at the “stem” of that word. Algorithms determine the number of characters to be eliminated for different words even though they do not explicitly know the meaning of those words. Lemmatization is a more sophisticated approach that uses complex morphological analysis to arrive at the root word, or lemma.

Parsing is the process of extracting the syntactic information of a sentence based on the rules of formal grammar. Based on the type of grammar applied, the process can be classified broadly into constituency and dependency parsing. Constituency parsing, based on context-free grammar, involves dividing a sentence into sub-phrases, or constituents, that belong to a specific grammar category, such as noun phrases or verb phrases.
Dependency parsing defines the syntax of a sentence not in terms of constituents but in terms of the dependencies between the words in a sentence. The relationship between words is depicted as a dependency tree where words are represented as nodes and the dependencies between them as edges.

Part-of-speech (POS) tagging, or grammatical tagging, is the process of assigning a grammatical classification, like noun, verb, adjective, etc., to words in a sentence. Automatic tagging can be broadly classified as rule-based, transformation-based, and stochastic POS tagging. Rule-based tagging uses a dictionary, as well as a small set of rules derived from the formal syntax of the language, to assign POS. Transformation-based tagging, or Brill tagging, leverages transformation-based learning for automatic tagging. Stochastic refers to any model that uses frequency or probability, e.g. word frequency or tag sequence probability, for automatic POS tagging.

Named entity recognition (NER) is an NLP subtask that is used to detect, extract, and categorize named entities, including names, organizations, locations, themes, topics, monetary values, etc., from large volumes of unstructured data. There are several approaches to NER, including rule-based systems, statistical models, dictionary-based systems, ML-based systems, and hybrid models.

These are just a few examples of some of the most common techniques used in NLU. There are several other techniques, for instance, word sense disambiguation, semantic role labeling, and semantic parsing, that focus on different levels of semantic abstraction.

NLP/NLU in biomedical research

NLP/NLU technologies represent a strategic fit for biomedical research with its vast volumes of unstructured data: 3,000-5,000 papers published each day, clinical text data from EHRs, diagnostic reports, medical notes, lab data, etc., and non-standardized digital real-world data.
NLP-enabled text mining has emerged as an effective and scalable solution for extracting biomedical entity relations from vast volumes of scientific literature. Techniques like named entity recognition (NER) are widely used in relation extraction tasks in biomedical research, with conventional named entities, such as names, organizations, locations, etc., substituted with gene sequences, proteins, biological processes and pathways, drug targets, etc.

The unique vocabulary of biomedical research has necessitated the development of specialized, domain-specific BioNLP frameworks. At the same time, the capabilities of NLU algorithms have been extended to the language of proteins and that of chemistry and biology itself. A 2021 article detailed the conceptual similarities between proteins and language that make them ideal for NLP analysis. More recently, an NLP model was trained to correlate amino acid sequences from the UniProt database with English language words, phrases, and sentences used to describe protein function to annotate over 40 million proteins. Researchers have also developed an interpretable and generalizable drug-target interaction model inspired by sentence classification techniques to extract relational information from drug-target biochemical sentences.

Large neural language models and transformer-based language models are opening up transformative opportunities for biomedical NLP applications across a range of bioinformatics fields, including sequence analysis, genome analysis, multi-omics, spatial transcriptomics, and drug discovery. Most importantly, NLP technologies have helped unlock the latent value in huge volumes of unstructured data to enable more integrative, systems-level biomedical research.

Read more about NLP’s critical role in facilitating systems biology and AI-powered data-driven drug discovery. If you want more information on seamlessly integrating advanced BioNLP frameworks into your research pipeline, please drop us a line here.
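As a closing illustration of three techniques covered in this post, the sketch below shows word-level tokenization, a crude suffix-stripping stemmer, and a dictionary-based NER pass in which conventional entity types are replaced with biomedical ones. The lexicon, labels, and suffix list are toy assumptions; production systems rely on curated ontologies and far more careful morphology.

```python
import re

# Hypothetical toy lexicon mapping lowercase terms to biomedical entity types.
LEXICON = {
    "tp53": "GENE",
    "brca1": "GENE",
    "imatinib": "DRUG",
    "apoptosis": "PROCESS",
}

# Suffixes tried in order; a real stemmer (e.g. Porter) uses many more rules.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text: str) -> list[str]:
    """Word-level tokenization: keep runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def stem(word: str) -> str:
    """Strip one known suffix if at least three characters remain."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def tag_entities(text: str) -> list[tuple[str, str]]:
    """Dictionary-based NER: label tokens that appear in the lexicon."""
    return [
        (tok, LEXICON[tok.lower()])
        for tok in tokenize(text)
        if tok.lower() in LEXICON
    ]


if __name__ == "__main__":
    print(tokenize("TP53 regulates apoptosis."))
    print([stem(w) for w in ("work", "works", "worked", "working")])
    print(tag_entities("Imatinib and TP53 mutations affect apoptosis."))
```

The stemmer has no knowledge of meaning, so it would also reduce "apoptosis" to "apoptosi". That is the kind of error lemmatization, with its morphological analysis, is meant to avoid.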
Reproducibility, getting the same results using the original data and analysis strategy, and replicability, the ability to confirm research results within different data contexts, are fundamental to valid, credible, and actionable scientific research. Without reproducibility, replicability becomes moot.

A 2016 survey of researchers revealed a consensus that there was a crisis of reproducibility, with most researchers reporting that they had failed to reproduce not only the experiments of other scientists (70%) but even their own (>50%). In biomedical research, reproducibility testing is still extremely limited, with some attempts to do so failing to comprehensively or conclusively validate reproducibility and replicability.

Over the years, there have been several efforts to assess and improve reproducibility in biomedical research. However, there is a new front opening in the reproducibility crisis, this time in ML-based science. According to one study, the increasing adoption of complex ML models is creating widespread data leakage resulting in “severe reproducibility failures,” “wildly overoptimistic conclusions,” and the inability to validate the superior performance of ML models over conventional statistical models.

Pharmaceutical companies have generally been cautious about accepting published results for a number of reasons, including the lack of scientifically reproducible data. An inability to reproduce and replicate preclinical studies can adversely impact drug development and has also been linked to drug and clinical trial failures. As drug development enters its latest innovation cycle, powered by computational in silico approaches and advanced AI-CADD integrations, reproducibility represents a significant obstacle to converting biomedical research into real-world results.
Reproducibility in in silico drug discovery

The increasingly computational nature of modern scientific research has already resulted in a significant shift, with some journals incentivizing authors and providing badges for reproducible research papers. Many scientific publications also mandate the publication of all relevant research resources, including code and data. In 2020, eLife launched executable research articles (ERAs) that allowed authors to add live code blocks and computed outputs to create computationally reproducible publications.

However, creating a robust reproducibility framework to sustain in silico drug discovery would require more transformative developments across three key dimensions: infrastructure/incentives for reproducibility in computational biology, reproducible ecosystems in research, and reproducible data management.

Reproducible computational biology

This approach to industry-wide transformation envisions a fundamental cultural shift with reproducibility as the fulcrum for all decision-making in biomedical research. The focus is on four key domains. First, creating courses and workshops to expose biomedical students to specific computational skills and real-world biological data analysis problems and impart the skills required to produce reproducible research. Second, promoting truly open data sharing, along with all relevant metadata, to encourage larger-scale data reuse. Third, leveraging platforms, workflows, and tools that support the open data/code model of reproducible research. And fourth, promoting, incentivizing, and enforcing reproducibility by adopting FAIR principles and mandating source code availability.

Computational reproducibility ecosystem

A reproducible ecosystem should enable data and code to be seamlessly archived, shared, and used across multiple projects. Computational biologists today have access to a broad range of open-source and commercial resources to ensure their ecosystem generates reproducible research.
For instance, data can now be shared across several recognized, domain- and discipline-specific public data depositories such as PubChem, CDD Vault, etc. Public and private code repositories, such as GitHub and GitLab, allow researchers to submit and share code with researchers around the world. And then there are computational reproducibility platforms like Code Ocean that enable researchers to share, discover, and run code.

Reproducible data management

As per a recent data management and sharing (DMS) policy issued by the NIH, all applications for funding will have to be accompanied by a DMS plan detailing the strategy and budget to manage and share research data. Sharing scientific data, the NIH points out, accelerates biomedical research discovery through validating research, increasing data access, and promoting data reuse.

Effective data management is critical to reproducibility, and creating a formal data management plan prior to the commencement of a research project helps clarify two key facets of the research: one, key information about experiments, workflows, types, and volumes of data generated, and two, research output format, metadata, storage, and access and sharing policies.

The next critical step towards reproducibility is having the right systems to document the process, including data/metadata, methods and code, and version control. For instance, reproducibility in in silico analyses relies extensively on metadata to define scientific concepts as well as the computing environment. In addition, metadata also plays a major role in making data FAIR. It is therefore important to document experimental and data analysis metadata in an established standard and store it alongside research data. Similarly, the ability to track and document datasets as they adapt, reorganize, extend, and evolve across the research lifecycle will be crucial to reproducibility.
it is therefore important to version control data so that results can be traced back to the precise subset and version of data. of course, the end game for all of that has to be the sharing of data and code, which is increasingly becoming a prerequisite as well as a voluntarily accepted practice in computational biology. one survey of 188 researchers in computational biology found that those who authored papers were largely satisfied with their ability to carry out key code-sharing tasks such as ensuring good documentation and that the code was running in the correct environment. the average researcher, however, would not commit any more time, effort, or expenditure to share code. plus, there still are certain perceived barriers that need to be addressed before the public archival of biomedical research data and code becomes prevalent.

the future of reproducibility in drug discovery

a 2014 report from the american association for the advancement of science (aaas) estimated that the u.s. alone spent approximately $28 billion yearly on irreproducible preclinical research. in the future, a set of blockchain-based frameworks may well enable the automated verification of the entire research process. meanwhile, in silico drug discovery has emerged as one of the maturing innovation areas in the pharmaceutical industry. the alliance between pharmaceutical companies and research-intensive universities has been a key component in de-risking drug discovery and enhancing its clinical and commercial success. reproducibility-related improvements and innovations will help move this alliance to a data-driven, ai/ml-based, in silico model of drug discovery.
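the data versioning and metadata practices described above can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not a prescribed tool or an established metadata standard: the function names (`sha256_of_file`, `build_manifest`, `dataset_version`) and the manifest layout are hypothetical, and a real project would more likely rely on a dedicated data-versioning system. the sketch records a content hash for every file in a dataset alongside basic computing-environment metadata, so that a result can later be traced back to the exact version of the data it was computed from.

```python
import hashlib
import json
import platform
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict:
    """Describe every file under data_dir plus the computing environment.

    The layout of this dictionary is an illustrative assumption,
    not an established metadata standard.
    """
    files = {
        path.relative_to(data_dir).as_posix(): sha256_of_file(path)
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }
    return {
        "environment": {
            "python": platform.python_version(),
            "platform": platform.platform(),
        },
        "files": files,
    }


def dataset_version(manifest: dict) -> str:
    """Derive one identifier from the file hashes only.

    The identifier changes whenever any file is added, removed, or edited,
    but not when the same data is processed on a different machine.
    """
    canonical = json.dumps(manifest["files"], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

in use, the manifest would be written with `json.dumps(build_manifest(data_dir), indent=2)` to a file stored next to (not inside) the data directory, and the value of `dataset_version` would be quoted in the analysis output. keeping the environment block out of the version identifier is a design choice: it lets the same data be recognized as the same version across machines while still documenting where each run took place.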
over the past year, we have looked at drug discovery and development from several different perspectives. for instance, we looked at the big data frenzy in biopharma, as zettabytes of sequencing, real-world (rwd), and textual data pile up and stress the data integration and analytic capabilities of conventional solutions. we also discussed how the time-consuming, cost-intensive, low-productivity characteristics of the prevalent roi-focused model of development have an adverse impact not just on commercial viability in the pharma industry but on the entire healthcare ecosystem. then we saw how antibody drug discovery processes continued to be cited as the biggest challenge in therapeutic r&d even as the industry was pivoting to biologics and mabs. no matter the context or frame of reference, the focus inevitably turns to how ai technologies can transform the entire drug discovery and development process, from research to clinical trials. biopharma companies have traditionally been slow to adopt innovative technologies like ai and the cloud. today, however, digital innovation has become an industry-wide priority, with drug development expected to be the most impacted by smart technologies.

from application-centric to data-centric

ai technologies have a range of applications across the drug discovery and development pipeline, from opening up new insights into biological systems and diseases to streamlining drug design to optimizing clinical trials. despite the wide-ranging potential of ai-driven transformation in biopharma, the process does entail some complex challenges. the most fundamental challenge will be to make the transformative shift from an application-centric to a data-centric culture, where data and metadata are operationalized at scale and across the entire drug design and development value chain. however, creating a data-centric culture in drug development comes with its unique set of data-related challenges.
to start with, there is the sheer scale of data that will require a scalable architecture in order to be efficient and cost-effective. most of this data is often distributed across disparate silos with unique storage practices, quality procedures, and naming and labeling conventions. then there is the issue of different data modalities, from mr or ct scans to unstructured clinical notes, that have to be extracted, transformed, and curated at scale for unified analysis. and finally, the level of regulatory scrutiny on sensitive biomedical data means that there is a constant tension between enabling collaboration and ensuring compliance. therefore, creating a strong data foundation that accounts for all these complexities in biopharma data management and analysis will be critical to ensuring the successful adoption of ai in drug development.

three key requisites for an ai-ready data foundation

successful ai adoption in drug development will depend on the creation of a data foundation that addresses these three key requirements.

accessibility

data accessibility is a key characteristic of ai leaders irrespective of sector. in order to ensure effective and productive data democratization, organizations need to enable access to data distributed across complex technology environments spanning multiple internal and external stakeholders and partners. a key caveat of accessibility is that the data provided should be contextual to the analytical needs of specific data users and consumers. a modern cloud-based and connected enterprise data and ai platform designed as a “one-stop-shop” for all drug design and development-related data products with ready-to-use analytical models will be critical to ensuring broader and deeper data accessibility for all users.

data management and governance

the quality of any data ecosystem is determined by the data management and governance frameworks that ensure that relevant information is accessible to the right people at the right time.
at the same time, these frameworks must also be capable of protecting confidential information, ensuring regulatory compliance, and facilitating the ethical and responsible use of ai. therefore, the key focus of data management and governance will be to consistently ensure the highest quality of data across all systems and platforms as well as full transparency and traceability in the acquisition and application of data.

ux and usability

successful ai adoption will require a data foundation that streamlines accessibility and prioritizes ux and usability. apart from democratizing access, the emphasis should also be on ensuring that even non-technical users are able to use data effectively and efficiently. different users often consume the same datasets from completely different perspectives. the key, therefore, is to provide a range of tools and features that help every user customize the experience to their specific roles and interests.

apart from creating the right data foundation, technology partnerships can also help accelerate the shift from an application-centric to a data-centric approach to ai adoption. in fact, a 2018 gartner report advised organizations to explore vendor offerings as a foundational approach to jump-start their efforts to make productive use of ai. more recently, pharma-technology partnerships have emerged as the fastest-moving model for externalizing innovation in ai-enabled drug discovery. according to a recent roots analysis report on the ai-based drug discovery market, partnership activity in the pharmaceutical industry grew at a cagr of 50% between 2015 and 2021, with a majority of the deals focused on research and development. so with that trend as background, here’s a quick look at how a data-centric, full-service biotherapeutic platform can accelerate biopharma’s shift to an ai-first drug discovery model.
the lensai™ approach to data-centric drug development

our approach to biotherapeutic research places data at the very core of a dynamic network of biological and artificial intelligence technologies. with our lensai platform, we have created a google-like solution for the entire biosphere, organizing it into a multidimensional network of 660 million data objects with multiple layers of information about sequence, syntax, and protein structure. this “one-stop-shop” model enables researchers to seamlessly access all raw sequence data. in addition, hyfts®, our universal framework for organizing all biological data, allows easy, one-click integration of all other research-relevant data from across public and proprietary data repositories. researchers can then leverage the power of the lensai integrated intelligence platform to integrate unstructured data from text-based knowledge sources such as scientific journals, ehrs, clinical notes, etc. here again, researchers have the ability to expand the core knowledge base, containing over 33 million abstracts from the pubmed biomedical literature database, by integrating data from multiple sources and knowledge domains, including proprietary databases. around this multi-source, multi-domain, data-centric core, we have designed next-generation ai technologies that can instantly and concurrently convert these vast volumes of text, sequence, and protein structure data into meaningful knowledge that can transform drug discovery and development.
artificial intelligence (ai) technologies are currently the most disruptive trend in the pharmaceutical industry. over the past year, we have quite extensively covered the impact that these intelligent technologies can have on conventional drug discovery and development processes. we charted how ai and machine learning (ml) technologies came to be a core component of drug discovery and development, their potential to exponentially scale and autonomize drug discovery and development, their ability to expand the scope of drug research even in data-scarce specialties like rare diseases, and the power of knowledge graph-based drug discovery to transform a range of drug discovery and development tasks. ai/ml technologies can radically remake every stage of the drug discovery and development process, from research to clinical trials. today, we will dive deeper into the transformational possibilities of these technologies in two foundational stages of the drug development process: early drug discovery and preclinical development.

early drug discovery and preclinical development

[figure; source: sciencedirect]

early drug discovery and preclinical development together form a complex process that essentially determines the productivity and value of downstream development programs. therefore, even incremental improvements in accuracy and efficiency during these early stages could dramatically improve the entire drug development value chain.

ai/ml in early drug discovery

the early small molecule drug discovery process flows broadly across target identification, hit identification, lead identification, lead optimization, and finally, on to preclinical development. currently, this time-consuming and resource-intensive process relies heavily on translational approaches and assumptions. incorporating assumptions, especially those that cannot be validated due to lack of data, raises the risk of late-stage failure by advancing nmes without accurate evidence of human response into drug development.
even the drastically different process of large-molecule, or biologicals, development starts with an accurate definition of the most promising target. ai/ml methods can play a critical role in accelerating the development process. investigating drug-target interactions (dtis), therefore, is a critical step to enhancing the success rate of new drug discovery.

predicting drug-target interactions

despite the successful identification of the biochemical functions of a myriad of proteins and compounds with conventional biomedical techniques, the limitations of these approaches come into play when scaling across the volume and complexity of data. this is what makes ml methods ideal for drug-target interaction (dti) prediction at scale. there are currently several state-of-the-art ml models available for dti prediction. however, many conventional ml approaches regard dti prediction either as a classification or a regression task, both of which can lead to bias and variance errors. novel multi-dti models that balance bias and variance through a multi-task learning framework have been able to deliver superior performance and accuracy over even state-of-the-art methods. these dti prediction models combine a deep learning framework with a co-attention mechanism to model interactions from drug and protein modalities and improve the accuracy of drug target annotation. deep learning models perform significantly better at high-throughput dti prediction than conventional approaches and continue to evolve, from identifying simple interactions to revealing unknown mechanisms of drug action.

lead identification & optimization

this stage focuses on identifying and optimizing drug-like small molecules that exhibit therapeutic activity. the challenge in this hit-to-lead generation phase is twofold. firstly, the search space to extract hit molecules from compound libraries extends to millions of molecules.
for instance, a single database like the zinc database comprises 230 million purchasable compounds, and the universe of make-on-demand synthesis compounds can be 10 billion. secondly, the hit rate of conventional high-throughput screening (hts) approaches in yielding a viable compound is just around 0.1%. over the years, there have been several initiatives to improve the productivity and efficiency of hit-to-lead generation, including the use of high-content screening (hcs) techniques to complement hts and improve efficiency, and computer-aided drug design (cadd) virtual screening methodologies to reduce the number of compounds to be tested.

[figure; source: bcg]

the availability of huge volumes of high-quality data combined with the ability of ai to parse and learn from these data has the potential to take the computational screening process to a new level. there are at least four ways in which ai can add new value to small-molecule drug discovery: access to new biology, improved or novel chemistry, better success rates, and quicker and cheaper discovery processes. ai technologies can be applied to a variety of discovery contexts and biological targets and can play a critical role in redefining long-standing workflows and addressing many of the challenges of conventional techniques.

ai/ml in preclinical development

preclinical development addresses several critical issues relevant to the success of new drug candidates. preclinical studies are a regulatory prerequisite to generating toxicology data that validate the safety of a drug for humans prior to clinical trials. these studies inform trial design and provide the pharmacokinetic, pharmacodynamic, tolerability, and safety information, such as in vitro off-target and tissue cross-reactivity (tcr), that defines optimal dosage. preclinical data also provide chemical, manufacturing, and control information that will be crucial for clinical production.
finally, they help pharma companies to identify candidates with the broadest potential benefits and the greatest chance of success. it is estimated that just 10 out of 10,000 small molecule drug candidates in preclinical studies make it to clinical trials. one reason for this extremely high attrition is the imperfect nature of preclinical in vivo research models, as compared to in vitro studies which can typically confirm efficacy, moa, etc., which makes it difficult to accurately predict clinical outcomes. however, ai/ml technologies are increasingly being used to bridge the translational gap between preclinical discoveries and new therapeutics. for instance, a key approach to de-risking clinical development has been the use of translational biomarkers that demonstrate target modulation and target engagement and confirm proof of mechanism. in this context, ai techniques have been deployed to learn from large volumes of heterogeneous and high-dimensional omics data and provide valuable insights that streamline translational biomarker discovery. similarly, ml algorithms that learn from problem-specific training data have been successfully used to accurately predict bioactivity, absorption, distribution, metabolism, excretion, and toxicity (admet)-related endpoints, and physicochemical properties. these technologies also play a critical role in the preclinical development of biologicals, including in the identification of candidate molecules with a higher probability of providing species-agnostic reactive outcomes in animal/human testing, ortholog analysis, and off-target binding analysis. these technologies have also been used to successfully predict drug interactions, including drug-target and drug-drug interactions, during preclinical testing.
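to put the screening and attrition figures quoted above into perspective, the arithmetic can be worked through in a short python sketch. the numbers are the ones cited in the text (a hit rate of around 0.1%, 230 million purchasable compounds, 10 out of 10,000 candidates reaching clinical trials); the function names are hypothetical, and applying a single flat hit rate to an entire library is a simplifying assumption, since real hit rates vary by target and assay.

```python
import math
from fractions import Fraction

# Figures quoted in the text; exact fractions avoid floating-point rounding.
HTS_HIT_RATE = Fraction(1, 1000)                 # "around 0.1%"
ZINC_PURCHASABLE = 230_000_000                   # purchasable compounds
PRECLINICAL_TO_CLINICAL = Fraction(10, 10_000)   # 10 out of 10,000


def expected_hits(library_size: int, hit_rate: Fraction) -> Fraction:
    """Expected number of hits if every compound were screened at a flat rate."""
    return library_size * hit_rate


def compounds_needed(target_hits: int, hit_rate: Fraction) -> int:
    """Smallest library size whose expected hit count reaches target_hits."""
    return math.ceil(target_hits / hit_rate)


if __name__ == "__main__":
    hits = expected_hits(ZINC_PURCHASABLE, HTS_HIT_RATE)
    print(f"expected hits across the library: {int(hits):,}")
    print(f"compounds to screen for 50 expected hits: "
          f"{compounds_needed(50, HTS_HIT_RATE):,}")
    print(f"preclinical-to-clinical rate: "
          f"{float(PRECLINICAL_TO_CLINICAL):.1%}")
```

under these assumptions, roughly a thousand compounds have to be physically screened for every expected hit, which is the gap that virtual screening and ai-based prioritization aim to narrow by shrinking the set of compounds that need to be tested.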
the age of data-driven drug discovery & development

network-based approaches that enable a systems-level view of the mechanisms underlying disease pathophysiology are increasingly becoming the norm in drug discovery and development. this in turn has opened up a new era of data-driven drug development where the focus is on the integration of heterogeneous types and sources of data, including molecular, clinical trial, and drug label data. the preclinical space is being transformed by ai technologies like natural language processing (nlp) that are enabling the identification of novel targets and previously undiscovered drug-disease associations based on insights extracted from unstructured data sources like biomedical literature, electronic medical records (emrs), etc. sophisticated and powerful ml/ai algorithms now enable the unified analysis of huge volumes of diverse datasets to autonomously reveal complex non-linear relationships that streamline and accelerate drug discovery and development. ultimately, the efficiency and productivity of early drug discovery and preclinical development processes will determine the value of the entire pharma r&d value chain. and that’s where ai/ml technologies have been gaining the most traction in recent years.