The Blog
Immunogenicity is a major cause of biologics failure, often identified too late in development. This blog explains how in silico screening helps detect anti-drug antibody (ADA) risks early, before costly setbacks. Learn how tools like LensAI™ enable faster, more informed decision-making by supporting early candidate evaluation, risk mitigation, and regulatory alignment.

The impact of immunogenicity in early biologics discovery

Immunogenicity remains one of the most important and often underappreciated factors in biologics development. For researchers and drug development teams working with monoclonal antibodies or therapeutic proteins, the risk of an unwanted immune response can derail even the most promising candidates.

The presence of anti-drug antibodies (ADAs) doesn't always show up immediately. In many cases, the problem becomes evident only after significant investment of time and resources, often in later-stage trials. ADAs can reduce a drug's effectiveness, alter its pharmacokinetics, or introduce safety risks that make regulatory approval unlikely. Some programs have even been discontinued because of immunogenicity-related findings that might have been identified much earlier.

To avoid these setbacks, teams are increasingly integrating predictive immunogenicity screening earlier in development. In silico tools now make it possible to evaluate ADA risk during the discovery stage, before resources are committed to high-risk candidates. This proactive approach supports smarter design decisions, reduces development delays, and helps safeguard against late-stage failure.

In this blog, we'll explore how in silico immunogenicity screening offers a proactive way to detect potential ADA risks earlier in the pipeline. We'll also look at how tools like MindWalk's LensAI platform help simplify and scale these assessments, making immunogenicity screening a practical part of modern biologics development.

Why early ADA risk assessment is critical

Immune responses to therapeutic proteins can derail even the most carefully designed drug candidates. When the immune system identifies a treatment as foreign, it may trigger the production of anti-drug antibodies (ADAs). These responses can alter how a drug is distributed in the body, reduce its therapeutic effect, or create safety concerns that weren't apparent during earlier studies. The consequences are often serious: delays, added costs, program redesigns, or even full discontinuation.

This isn't something to be considered only when a drug is close to clinical testing. It's a risk that needs to be addressed from the beginning. Regulatory agencies increasingly expect sponsors to demonstrate that immunogenicity has been evaluated in early discovery, not just as a final check before filing. This shift reflects lessons learned from earlier products that failed late because they hadn't been properly screened.

Early-stage risk assessment allows developers to ask the right questions at the right time. Are there T-cell epitopes likely to trigger immune recognition? Is the candidate similar enough to self-proteins to escape detection? Could minor sequence changes reduce the chances of immunogenicity without compromising function?

Immunogenicity screening provides actionable insights that can guide sequence optimization well before preclinical testing. For example, identifying epitope clustering or T-cell activation hotspots during discovery enables teams to make targeted modifications in regions such as the variable domain (see the sketch below).
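To make the idea of hotspot scanning concrete, here is a minimal, purely illustrative sketch: it slides a 15-mer window along a variable-domain sequence and counts how many alleles in a small HLA panel are predicted to bind each peptide. The predictor, panel, window size, and cutoff are hypothetical placeholders and do not represent LensAI's scoring method.

```python
# Hypothetical sketch: flag potential T-cell epitope "hotspots" in a candidate sequence.
# The predictor below is a dummy stand-in (hash-based pseudo-scores), not LensAI's
# scoring method; the HLA panel, cutoff, and window size are illustrative choices.
HLA_PANEL = ["DRB1*01:01", "DRB1*03:01", "DRB1*04:01", "DRB1*07:01", "DRB1*15:01"]

def predict_mhc2_binding(peptide: str, allele: str) -> float:
    """Dummy MHC class II binding score in [0, 1]; replace with a real predictor."""
    return (hash((peptide, allele)) % 1000) / 1000.0

def epitope_hotspots(sequence: str, window: int = 15,
                     cutoff: float = 0.7, min_alleles: int = 2):
    """Return (position, peptide, n_binding_alleles) for predicted hotspot windows."""
    hotspots = []
    for start in range(len(sequence) - window + 1):
        peptide = sequence[start:start + window]
        binders = sum(predict_mhc2_binding(peptide, a) >= cutoff for a in HLA_PANEL)
        if binders >= min_alleles:
            hotspots.append((start + 1, peptide, binders))
    return hotspots

# Runs of adjacent hotspot windows highlight regions (e.g. in the variable domain)
# worth re-engineering before any material is synthesized.
variable_domain = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVS"
print(epitope_hotspots(variable_domain))
```

In practice the dummy scorer would be replaced by a real MHC class II binding predictor, and clusters of flagged windows would be the regions considered for targeted modification.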
These adjustments can reduce immunogenicity risk without compromising target binding, helping streamline development and avoid costly rework later in the process.

Beyond candidate selection, immunogenicity screening improves resource allocation. If a molecule looks risky, there is no need to invest heavily in downstream testing until it has been optimized. It's a smarter, more strategic way to manage timelines and reduce unnecessary costs.

The tools now available make this kind of assessment more accessible than ever. In silico screening platforms, powered by AI and machine learning, can run detailed analyses in a matter of hours. These insights help move projects forward without waiting for expensive and time-consuming lab work.

In short, assessing immunogenicity is not just about risk avoidance. It's about building a better, faster path to clinical success.

In silico immunogenicity screening: how it works

In silico immunogenicity screening refers to the use of computational models to evaluate the immune risk profile of a biologic candidate. These methods allow development teams to simulate how the immune system might respond to a therapeutic protein, particularly by predicting T-cell epitopes that could trigger anti-drug antibody (ADA) formation.

The primary focus is often on identifying MHC class II binding peptides. These are the sequences most likely to be presented by antigen-presenting cells and recognized by helper T cells. If the immune system interprets these peptides as foreign, it can initiate a response that leads to ADA generation.

Unlike traditional in vitro methods, which may require weeks of experimental setup, in silico tools deliver results quickly and at scale. Developers can screen entire libraries of protein variants, comparing their immunogenicity profiles before any physical synthesis is done. This flexibility makes in silico screening particularly valuable in the discovery and preclinical stages, where multiple versions of a candidate might still be on the table.

The strength of this approach lies in its ability to deliver both breadth and depth. Algorithms trained on curated immunology datasets can evaluate binding affinity across a wide panel of human leukocyte antigen (HLA) alleles. They can also flag peptide clusters, overlapping epitopes, and areas where modifications may reduce risk. The result is a clearer picture of how a candidate will interact with immune pathways long before preclinical and clinical studies are initiated.

For teams juggling tight timelines and complex portfolios, these insights help drive smarter decision-making. High-risk sequences can be deprioritized or redesigned, while low-risk candidates can be advanced with greater confidence.

How LensAI supports predictive immunogenicity analysis

One platform leading the charge in this space is LensAI. Designed for early-stage R&D, it offers high-throughput analysis with a user-friendly interface, allowing computational biologists, immunologists, and drug developers to assess risks rapidly. Here's how LensAI supports smarter decision-making:

- Multi-faceted risk scoring: Rather than relying on a single predictor, LensAI integrates several immunogenicity markers into one unified score. This includes predicted MHC class II binding affinity across diverse HLA alleles, epitope clustering patterns, and peptide uniqueness compared to self-proteins based on proprietary HYFT technology. By combining these distinct factors, the platform provides insight into potential immune activation risk, supporting better-informed candidate selection.

- Reliable risk prediction: LensAI's composite score reliably classifies candidates by ADA risk, using two thresholds to define low risk: <10% and <30% ADA risk. This distinction enables more confident go/no-go decisions in early development stages. By combining multiple features into a single score, the platform supports reproducible, interpretable risk assessment that is grounded in immunological relevance.

- Early-stage design support: LensAI is accessible from the earliest stages of drug design, without requiring lab inputs or complex configurations, and is designed for high-throughput screening of whole libraries of sequences in a few hours. Researchers can quickly assess sequence variants, compare immunogenicity profiles, and prioritize low-risk candidates before investing in downstream studies. This flexibility supports more efficient resource use and helps reduce the likelihood of late-stage surprises.

In a field where speed and accuracy both matter, this kind of screening helps bridge the gap between concept and clinic. It gives researchers the chance to make informed adjustments, rather than discovering late-stage liabilities when there is little room left to maneuver.

Case study: validating ADA risk prediction with LensAI

In our recent case study, we applied LensAI's immunogenicity composite score to 217 therapeutic antibodies to evaluate predictive accuracy. For predicting ADA incidence >10%, the model achieves an AUC of 0.79, indicating strong discriminative capability (an AUC of 0.8 is considered excellent). For predicting ADA incidence >30%, which is considered more suitable for early-stage risk assessment than the 10% cut-off, the AUC rises to 0.92, confirming LensAI's value for ADA risk classification. Read the full case study or contact us to discuss how this applies to your pipeline.

Regulatory perspectives: immunogenicity is now a front-end issue

It wasn't long ago that immunogenicity testing was seen as something to be done late in development. But regulators have since made it clear that immunogenicity risk must be considered much earlier. Agencies like the FDA and EMA now expect developers to proactively assess and mitigate immune responses well before clinical trials begin.

This shift came after a series of high-profile biologic failures where ADA responses were only discovered after significant time and money had already been spent. In some cases, the immune response not only reduced drug efficacy but also introduced safety concerns that delayed approval or halted development entirely.

Today, guidance documents explicitly encourage preclinical immunogenicity assessment. Sponsors are expected to show that they have evaluated candidate sequences, made risk-informed design choices, and taken steps to reduce immunogenic potential. In silico screening, particularly when combined with in vitro and in vivo data, provides a valuable layer of evidence in this process.

Early screening also supports a culture of quality by design. It enables teams to treat immunogenicity not as a regulatory hurdle, but as a standard consideration during candidate selection and development.

The regulatory landscape is shifting to support in silico innovation. In April 2025, the FDA took a major step by starting to phase out some animal testing requirements for antibody and drug development.
Instead, developers are encouraged to use new approach methodologies (NAMs), such as AI models, to improve safety assessments and speed up time to clinic.

The role of in silico methods in modern biologics development

With the increasing complexity of therapeutic proteins and the diversity of patient populations, traditional testing methods are no longer enough. Drug development teams need scalable, predictive tools that can keep up with the speed of discovery and the demand for precision.

In silico immunogenicity screening is one of those tools. It has moved from being a theoretical exercise to a standard best practice in many organizations. By reducing dependence on reactive testing and allowing early optimization, these methods help companies move forward with greater efficiency and lower risk.

When development teams have access to robust computational tools from the outset, the entire process tends to run more efficiently. These tools enable design flexibility, support earlier decision-making, and allow researchers to explore multiple design paths while maintaining alignment with regulatory expectations.

For companies managing multiple candidates across different therapeutic areas, this kind of foresight can translate to faster development, fewer setbacks, and ultimately, better outcomes for patients.

Final thoughts: from screening to smarter development

The promise of in silico immunogenicity screening lies in moving risk assessment to the earliest stages of development, where it can have the greatest impact. By identifying high-risk sequences before synthesis, it helps researchers reduce late-stage failures, shorten timelines, lower overall project costs, and improve the likelihood of clinical success.

In silico tools such as LensAI support the early prediction of ADA risk by flagging potential immunogenic regions and highlighting risk patterns across diverse protein candidates, enabling earlier, more informed design decisions.

See how early ADA screening could strengthen your next candidate. Learn more.
Epitope mapping is a fundamental process for identifying and characterizing the binding sites of antibodies on their target antigens [2]. Understanding these interactions is pivotal in developing diagnostics, vaccines, and therapeutic antibodies [3–5]. Antibody-based therapeutics – which have taken the world by storm over the past decade – all rely on epitope mapping for their discovery, development, and protection. This includes drugs like Humira, which reigned as the world's best-selling drug for six years straight [6], and rituximab, the first monoclonal antibody therapy approved by the FDA for the treatment of cancer [7].

Aside from its important role in basic research and drug discovery and development, epitope mapping is an important aspect of patent filings; it provides binding site data for therapeutic antibodies and vaccines that can help companies strengthen IP claims and compliance [8]. A key example is the Amgen vs. Sanofi case, which highlighted the importance of supporting broad claims like 'antibodies binding epitope X' with epitope residue identification at single amino acid resolution, along with sufficient examples of epitope binding [8].

While traditional epitope mapping approaches have been instrumental in characterizing key antigen-antibody interactions, scientists frequently struggle with time-consuming, costly processes that are limited in scalability and throughput and can cause frustration in even the most seasoned researchers [9].

The challenge of wet lab-based epitope mapping approaches

Traditional experimental approaches to epitope mapping include X-ray crystallography and hydrogen-deuterium exchange mass spectrometry (HDX-MS). While these processes have been invaluable in characterizing important antibodies, their broader application is limited, particularly in high-throughput antibody discovery and development pipelines.

X-ray crystallography has long been considered the gold standard of epitope mapping due to its ability to provide atomic-level resolution [10]. However, this labor-intensive process requires a full lab of equipment, several scientists with specialized skill sets, months of time, and vast amounts of material just to crystallize a single antibody-antigen complex. Structural biology researchers will understand the frustration when, after all this, the crystallization is unsuccessful (yet again), for no other reason than that not all antibody-antigen complexes form crystals [11]. Additionally, even if the crystallization process is successful, this technique doesn't always reliably capture dynamic interactions, limiting its applicability to certain epitopes [12]. The static snapshots provided by X-ray crystallography mean that it can't resolve allosteric binding effects, transient interactions, or large, dynamic complexes, and other technical challenges mean that resolving membrane proteins, heterogeneous samples, and glycosylated antigens can also be difficult.

HDX-MS, on the other hand, can be a powerful technique for screening epitope regions involved in binding, with one study demonstrating an accelerated workflow with a success rate of >80% [13]. Yet it requires highly complex data analysis and specialized expertise and equipment, making it resource-intensive, time-consuming (lasting several weeks), and less accessible for routine use – often leading to further frustration among researchers.
As the demand for therapeutic antibodies, vaccines, and diagnostic tools grows, researchers urgently need efficient, reliable, and scalable approaches to accelerate the drug discovery process. In silico epitope mapping is a promising alternative that allows researchers to accurately predict antibody-antigen interactions by integrating multiple computational techniques [14].

Advantages of in silico epitope mapping

In silico epitope mapping has several key advantages over traditional approaches, making it a valuable tool for researchers, particularly at the early stage of antibody development.

- Speed – Computational epitope mapping methods can rapidly analyze antigen-antibody interactions, reducing prediction time from months to days [11]. This not only accelerates project timelines but also helps reduce the time and resources spent on unsuccessful experiments.

- Accuracy – By applying advanced algorithms, in silico methods are designed to provide precise and accurate predictions [11]. Continuous improvements in 3D modeling of protein complexes that can be used to support mapping also mean that predictions are becoming more and more accurate, enhancing reliability and success rates [9].

- Versatility – In silico approaches are highly flexible and can be applied to a broad range of targets that may otherwise be challenging to characterize, ranging from soluble proteins and multimers to transmembrane proteins. Certain in silico approaches can also overcome the limitations of X-ray crystallography, as they can reliably study dynamic and transient interactions [12].

- Cost-effectiveness – By reducing the need for expensive reagents, specialized equipment, and labor-intensive experiments, and by cutting timelines down significantly, computational epitope mapping approaches lower the cost of epitope mapping considerably [11,15]. This makes epitope mapping accessible to more researchers and organizations with limited resources.

- Scalability – In silico platforms can handle huge datasets and screen large numbers of candidates simultaneously, unlike traditional wet-lab methods that are limited by throughput constraints, enabling multi-target epitope mapping [9]. This is especially advantageous in high-throughput settings, such as immune profiling and drug discovery, and relieves researchers of the burden of processing large volumes of samples daily.

AI-powered in silico epitope mapping in action

Meet LensAI: your cloud-based epitope mapping lab

Imagine a single platform hosting analytical solutions for end-to-end target-discovery-to-leads analysis, including epitope mapping in hours. Now, this is all possible. Meet LensAI – an integrated intelligence platform hosting innovative analytical solutions for complete target-discovery-to-leads analysis and advanced data harmonization and integration.

LensAI Epitope Mapping is one of the platform's applications and enables researchers to identify the amino acids on the target that are part of the epitope [11]. By simply inputting the amino acid sequences of antibodies and targets, the machine learning (ML) algorithm, combined with molecular modeling techniques, enables the tool to make a prediction. The outputs are: a sequence-based visualization containing a confidence score for each amino acid of the target, indicating whether that amino acid may be part of the epitope, and a 3D visualization with an indication of the predicted epitope region.
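To make the shape of that first output concrete, the following sketch thresholds hypothetical per-residue confidence scores into a predicted epitope residue set. The sequence, scores, cutoff, and layout are assumptions for illustration, not LensAI's actual output format.

```python
# Illustrative only: turning hypothetical per-residue epitope confidence scores
# into a predicted epitope residue set. The values, cutoff, and layout are
# assumptions for the sake of example, not LensAI's actual output format.

antigen_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"          # toy antigen sequence
confidence  = [0.02, 0.05, 0.11, 0.72, 0.81, 0.77, 0.30,   # one score per residue
               0.12, 0.08, 0.65, 0.70, 0.68, 0.15, 0.04] + [0.05] * 19

CUTOFF = 0.6  # residues scoring above this are called part of the predicted epitope

predicted_epitope = [
    (pos + 1, aa)                                  # 1-based position and residue
    for pos, (aa, score) in enumerate(zip(antigen_seq, confidence))
    if score >= CUTOFF
]

print(predicted_epitope)
# e.g. [(4, 'A'), (5, 'Y'), (6, 'I'), (10, 'R'), (11, 'Q'), (12, 'I')]
```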
LensAI: comparable to X-ray crystallography, in a fraction of the time and cost

To evaluate the accuracy of LensAI Epitope Mapping, its predictions were compared to the data from a well-known study by Dang et al., in which epitope mapping results obtained with six different well-known wet-lab techniques were compared, using X-ray crystallography as the gold standard [11]. By comparing LensAI to the epitope structures obtained by X-ray crystallography in this study, it was determined that LensAI closely matches X-ray crystallography.

The area under the curve (AUC) from the receiver operating characteristic (ROC) curve was used as the key performance metric to compare the two techniques. The ROC curve plots the true positive rate against the false positive rate, providing a robust measure of the prediction's ability to distinguish between epitope and non-epitope residues. The results demonstrated that LensAI achieves consistently high AUC values of approximately 0.8 and above, closely matching the precision of X-ray crystallography (Figure 1). An AUC of 1 would represent a perfect prediction, an AUC of 0.8 or above is considered excellent, and 0.5 is no better than random. Although the precision of LensAI is comparable to that of X-ray crystallography, the time and cost burdens are not; LensAI achieves this precision in a fraction of the time and with far fewer resources than those required for successful X-ray crystallography.

Figure 1. Benchmark comparison with X-ray crystallography and six other methods (peptide array, alanine scan, domain exchange, hydrogen-deuterium exchange, chemical cross-linking, and hydroxyl radical footprinting) for epitope identification in five antibody-antigen combinations.

The accuracy of LensAI was further compared against the epitope mapping data from other widely used wet-lab approaches, obtained from the Dang et al. study. In this study, peptide array, alanine scan, domain exchange, HDX, chemical cross-linking, and hydroxyl radical footprinting techniques were assessed. To compare LensAI with Dang's data, the epitopes identified by X-ray crystallography (obtained from the same study) were used as the ground truth. Alongside showing near X-ray precision, LensAI outperformed all wet-lab methods, accurately identifying the true epitope residues (high recall combined with high precision and a low false positive rate).

In addition to the high precision and accuracy shown here, LensAI enables users to detect the amino acids in the target that are part of the epitope solely through in silico analysis. LensAI is therefore designed to give users reliable and precise results, usually within hours to a maximum of one day, with the aim of enabling fast epitope mapping and significantly reducing the burden of technically challenging experimental approaches. This means there is no need to produce physical material through lengthy and unpredictable processes, thereby saving time and money and helping to improve the success rate. LensAI also works for various target types, including typically challenging targets such as transmembrane proteins and multimers.

LensAI performs on unseen complexes with high accuracy

A new benchmark validation demonstrates that LensAI Epitope Mapping maintains high accuracy even when applied to entirely new antibody-antigen complexes it has never seen before. In this study, the platform accurately predicted binding sites across 17 unseen pairs without prior exposure to the antibodies, antigens, or complexes.
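Both the original benchmark and the unseen-complex validation come down to the same per-residue comparison: predicted epitope confidence scores versus X-ray-derived epitope labels. A minimal sketch of that AUC calculation, assuming scikit-learn and purely hypothetical arrays, looks like this:

```python
# Minimal sketch of the per-residue ROC AUC evaluation described above.
# The label and score arrays are hypothetical stand-ins, not data from the benchmark.
from sklearn.metrics import roc_auc_score

# 1 = residue is part of the epitope according to X-ray crystallography, 0 = not
xray_labels      = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0]

# Per-residue epitope confidence scores from the in silico prediction
predicted_scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.2, 0.6, 0.75, 0.65, 0.1, 0.05]

auc = roc_auc_score(xray_labels, predicted_scores)
print(f"Per-residue ROC AUC: {auc:.2f}")  # 1.0 = perfect, 0.5 = random
```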
The ability to generalize beyond training data shows the robustness of the LensAI predictive model. These findings not only support broader applicability but also help reduce lab burden and timelines. You can explore both the new "unseen" case study and the original benchmark on a "seen" target for a side-by-side comparison.

New case study: LensAI epitope mapping on an "unseen" target [link]
Previous case study: head-to-head benchmark on a "seen" target [link]

Conclusion

As many of us researchers know all too well, traditional wet-lab epitope mapping techniques tend to be slow, costly, and not often successful, limiting their applicability and scalability in antibody discovery workflows. However, it doesn't have to be this way: in silico antibody discovery approaches like LensAI offer a faster, cost-effective, and highly scalable alternative. This supports researchers in integrating epitope mapping earlier in the development cycle to gain fine-grained insights, make more informed decisions, and optimize candidates more efficiently.

Are you ready to accelerate your timelines and improve success rates in antibody discovery? Get in touch today to learn more about how LensAI can streamline your antibody research.

References

1. Market report: therapeutic monoclonal antibodies in Europe. Labmate Online. Accessed March 18, 2025. https://www.labmate-online.com/news/news-and-views/5/frost-sullivan/market-report-therapeutic-monoclonal-antibodies-in-europe/22346
2. Mole SE. Epitope mapping. Mol Biotechnol. 1994;1(3):277-287. doi:10.1007/BF02921695
3. Ahmad TA, Eweida AE, Sheweita SA. B-cell epitope mapping for the design of vaccines and effective diagnostics. Trials Vaccinol. 2016;5:71-83. doi:10.1016/j.trivac.2016.04.003
4. Agnihotri P, Mishra AK, Agarwal P, et al. Epitope mapping of therapeutic antibodies targeting human LAG3. J Immunol. 2022;209(8):1586-1594. doi:10.4049/jimmunol.2200309
5. Gershoni JM, Roitburd-Berman A, Siman-Tov DD, Tarnovitski Freund N, Weiss Y. Epitope mapping: the first step in developing epitope-based vaccines. BioDrugs. 2007;21(3):145-156. doi:10.2165/00063030-200721030-00002
6. From bench to blockbuster: the story of Humira® – best-selling drug in the world. MRC Laboratory of Molecular Biology. Accessed March 18, 2025. https://www2.mrc-lmb.cam.ac.uk/news-and-events/lmb-exhibitions/from-bench-to-blockbuster-the-story-of-humira-best-selling-drug-in-the-world/
7. Milestones in cancer research and discovery. NCI. January 21, 2015. Accessed March 18, 2025. https://www.cancer.gov/research/progress/250-years-milestones
8. Deng X, Storz U, Doranz BJ. Enhancing antibody patent protection using epitope mapping information. mAbs. 2018;10(2):204-209. doi:10.1080/19420862.2017.1402998
9. Grewal S, Hegde N, Yanow SK. Integrating machine learning to advance epitope mapping. Front Immunol. 2024;15:1463931. doi:10.3389/fimmu.2024.1463931
10. Toride King M, Brooks CL. Epitope mapping of antibody-antigen interactions with X-ray crystallography. In: Rockberg J, Nilvebrant J, eds. Epitope Mapping Protocols. Vol 1785. Methods in Molecular Biology. Springer New York; 2018:13-27. doi:10.1007/978-1-4939-7841-0_2
11. Dang X, Guelen L, Lutje Hulsik D, et al. Epitope mapping of monoclonal antibodies: a comprehensive comparison of different technologies. mAbs. 2023;15(1):2285285. doi:10.1080/19420862.2023.2285285
12. Srivastava A, Nagai T, Srivastava A, Miyashita O, Tama F. Role of computational methods in going beyond X-ray crystallography to explore protein structure and dynamics. Int J Mol Sci. 2018;19(11):3401. doi:10.3390/ijms19113401
13. Zhu S, Liuni P, Chen T, Houy C, Wilson DJ, James DA. Epitope screening using hydrogen/deuterium exchange mass spectrometry (HDX-MS): an accelerated workflow for evaluation of lead monoclonal antibodies. Biotechnol J. 2022;17(2):2100358. doi:10.1002/biot.202100358
14. Potocnakova L, Bhide M, Pulzova LB. An introduction to B-cell epitope mapping and in silico epitope prediction. J Immunol Res. 2016;2016:1-11. doi:10.1155/2016/6760830
15. Parvizpour S, Pourseif MM, Razmara J, Rafi MA, Omidi Y. Epitope-based vaccine design: a comprehensive overview of bioinformatics approaches. Drug Discov Today. 2020;25(6):1034-1042. doi:10.1016/j.drudis.2020.03.006
The importance of an integrated end-to-end antibody discovery process

Drug discovery processes are typically organized in a step-by-step manner, going from target identification to lead optimization. This means data is siloed at every step, leading to a compounding loss of quantitative and qualitative insights across the different processes. To realize the full potential of drug discovery, data integration within a data-driven automation platform is essential.

The LensAI™ foundation AI model, powered by HYFT technology, is designed to solve the challenges behind AI-driven rational drug design, harnessing advanced AI and ML capabilities to navigate the complexities of drug discovery with high precision. By integrating predictive modeling, data analysis, and lead optimization functionalities, LensAI accelerates the end-to-end discovery and development of promising drug candidates.

The LensAI system uniquely integrates both structured and unstructured data, serving as a centralized graph for storing, querying, and analyzing diverse datasets, including different omics layers, chemical, and pharmacological information. With LensAI, data from every phase of the drug discovery process is no longer siloed but represented as subgraphs within an interconnected graph that summarizes data across all processes.

This interconnected approach enables bidirectional and cyclical information flow, allowing for flexibility and iterative refinement. For example, during in silico lead optimization, challenges may arise regarding pharmacokinetic properties or off-target effects of lead compounds. By leveraging the integrated knowledge graph, we can navigate back to earlier phases to reassess decisions and explore alternative strategies. This holistic view ensures that insights and adjustments can be continuously incorporated throughout the drug discovery process.

Navigation through integrated knowledge graphs of complex biological data is made possible by the patented HYFT technology. HYFTs, which are amino acid patterns mined across the biosphere, serve as critical connectors within the knowledge graph by capturing diverse layers of information at both the subsequence and sequence levels. HYFTs encapsulate information about 'syntax' (the arrangement of amino acids), as well as 'structure' and 'function', and connect this data to textual information at the sentence and concept levels. This HYFT-based multi-modal integration ensures that we move beyond mere 'syntax' to incorporate 'biological semantics', representing the connection between structure and function. Within this single framework, detailed structural information is aligned with relevant textual metadata, providing a comprehensive understanding of biological sequences.

Exploring textual metadata can be very useful in the target identification stage, for example to gather detailed information on the target epitopes: "In which species are these epitopes represented?" "Can we extract additional information and insights on the epitopes from the literature?" This information can be obtained by querying the knowledge graph and harnessing the benefits of the fine-grained HYFT-based approach, which captures information at the subsequence level. Indeed, at the HYFT level, relevant textual concepts (at the sub-sentence level) are captured, which allows us to identify whether a specific HYFT represented in the target might reveal relevant epitopes.
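To give a feel for what such a query might look like, here is a toy sketch using networkx; the node types, edge labels, and graph contents are invented for illustration and do not reflect the actual LensAI graph schema.

```python
# Toy illustration of querying a sequence-to-HYFT-to-concept knowledge graph.
# The graph content, node types, and edge labels are made up for illustration;
# they do not reflect the actual LensAI graph schema.
import networkx as nx

kg = nx.MultiDiGraph()
kg.add_edge("target:TNF", "hyft:VKKPGAS", key="contains")
kg.add_edge("hyft:VKKPGAS", "concept:neutralizing epitope", key="mentioned_with")
kg.add_edge("hyft:VKKPGAS", "species:Mus musculus", key="observed_in")
kg.add_edge("target:TNF", "hyft:LQMNSLR", key="contains")

def epitope_related_hyfts(graph, target):
    """Return HYFTs of a target that are linked to epitope-related textual concepts."""
    hits = []
    for _, hyft, rel in graph.out_edges(target, keys=True):
        if rel != "contains":
            continue
        concepts = [n for _, n, r in graph.out_edges(hyft, keys=True)
                    if r == "mentioned_with" and "epitope" in n]
        if concepts:
            hits.append((hyft, concepts))
    return hits

print(epitope_related_hyfts(kg, "target:TNF"))
# [('hyft:VKKPGAS', ['concept:neutralizing epitope'])]
```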
Apart from textual metadata, there is 'flat' metadata such as immunogenicity information, germline information, pharmacological data, developability data, and sequence liability presence. At each of the previously mentioned information layers, additional 'vector' data is obtained from various protein large language models (pLLMs). This means that an embedding is associated with each (sub)sequence or concept. This allows for vector searches, which, based on the embeddings, can be used to identify similar sequences, enhancing tasks like protein structure prediction and functional annotation. For a deep dive into vector search, see our vector search in text analysis blog here. This capability allows for the extraction of a wider range of features and the uncovering of hidden patterns across all these dimensions.

LensAI: the importance of embeddings at the sub-sequence level

MindWalk LensAI's comprehensive approach to protein analytics is similar to text-based analytics. In text analysis, we refine semantic boundaries by intelligently grouping words to capture textual meaning. Similarly, in protein analytics, we strategically group residue tokens (amino acids) to form sequential HYFTs. Just as words are clustered into synonyms in text analytics, "protein words" are identified and clustered based on their biological function in protein analytics. These "protein words", when present in different sequences, reveal a conserved function. By leveraging this method, we gain a deeper understanding of the functional conservation across various protein sequences.

Thus, the LensAI platform, based on HYFT technology, analyzes proteins at the sub-sequence level, focusing on HYFT patterns, as well as at the full-sequence level. As in natural language, where some words contribute little to the overall meaning, some residues contribute little to what meaning corresponds to in proteins: function. Therefore, by focusing on HYFTs, we obtain a more condensed information representation and reduce noise by excluding the information captured in non-critical regions.

In text analysis, we can almost immediately recognize semantic similarity. We recognize sentences similar in meaning, although composed of different words, because of our natural understanding of synonyms. In protein language, to identify 'functional similarity', in other words, to distinguish whether two different amino acid patterns (HYFTs) might yield the same function, we use a mathematical method: pLLMs. pLLMs are transformer-based models that generate an embedding starting from single amino acid residues. Depending on the data the pLLM is trained on (typically millions of protein sequences), it tries to discover hidden properties by examining residue-residue connections (neighboring residues at both short and longer distances).

Figure 1: MindWalk's method of chunking tokens.

The dataset and task a pLLM was trained on determine the properties it represents, which can vary from one pLLM to another. By stacking the embeddings from different large language models (LLMs), a more complete view of the protein data is generated. Furthermore, we can use clustering and vector search algorithms to group sequences that are similar across a broad range of dimensions.

Protein embeddings are typically generated at the single amino acid level. In contrast, the HYFT-based model obtains embeddings from LLMs at a pattern level, by concatenating residue-level embeddings (see the sketch below).
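A minimal sketch of that idea, assuming a matrix of per-residue embeddings is already available from some pLLM (the shapes, random stand-in data, and example HYFT span are illustrative assumptions):

```python
# Sketch: build a HYFT-level embedding from per-residue embeddings.
# The residue embedding matrix is random stand-in data; in practice it would come
# from a pLLM (e.g. an antibody-specific or generic protein language model).
import numpy as np

rng = np.random.default_rng(0)

seq_len, dim = 120, 64
residue_embeddings = rng.normal(size=(seq_len, dim))   # one vector per residue

def hyft_embedding(residue_emb: np.ndarray, start: int, end: int) -> np.ndarray:
    """Concatenate the residue-level vectors spanning a HYFT (end exclusive)."""
    return residue_emb[start:end].reshape(-1)           # shape: (length * dim,)

# Hypothetical HYFT spanning residues 30..37 (e.g. a 7-residue pattern like VKKPGAS)
hyft_vec = hyft_embedding(residue_embeddings, 30, 37)
print(hyft_vec.shape)                                   # (448,)

# Stacking embeddings from several pLLMs is then just another concatenation, and
# cosine similarity between HYFT vectors can be used to find 'synonym' HYFTs.
def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Whether to concatenate or pool, and which pLLMs to stack, are design choices; the point is simply that the unit of embedding becomes the HYFT rather than the single residue.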
These 'protein word' (HYFT-level) embeddings can be obtained from several pre-trained LLMs, varying from antibody-specific LLMs to more generic pLLMs. This HYFT-based embedding model offers several benefits. First, this approach captures richer and more informative embeddings compared to single residue-level embeddings. Second, the concatenation of residue-level embeddings preserves sequence-specific patterns, enhancing the ability to identify functional and structural motifs within proteins. Lastly, integrating different LLMs ensures that these embeddings leverage vast amounts of learned biological knowledge, improving the accuracy and robustness of downstream tasks such as protein function prediction and annotation.

So, if we want to identify which HYFTs are 'synonyms', we deploy the HYFT-level embeddings. Returning to the language analogy, where 'apple' will take a similar place in the embedding space as 'orange' or 'banana' – because they are all fruits – in protein analytics we are interested in the HYFTs that take similar places in the embedding space, because they all perform the same function in a certain context.

Figure 2: Embeddings at the sub-sequence level: concepts versus HYFTs.

As the figure above (Fig. 2) illustrates, just as the word "apple" can have different meanings depending on the context (referring to a phone or a fruit), the sequence HYFT 'VKKPGAS' can also appear in various contexts, representing different protein annotations and classifications. For instance, a specific HYFT is found in organisms ranging from bacteria and fungi to human immunoglobulins. Consequently, the embeddings for HYFT VKKPGAS might occupy different positions in the embedding space, reflecting these distinct functional contexts.

Use case: transforming antibody discovery with integrated vector search in hit expansion analysis

In the LensAI hit expansion analysis pipeline, outputs from phage-display, B-cell, or hybridoma technologies are combined with a large-scale enriched antibody sequence dataset sequenced by NGS. The primary goal is to expand the number and diversity of potential binders – functional antibodies from the NGS dataset that are closely related to a set of known binders. The data from the NGS repertoire set and the known binders are represented in a multi-modal knowledge graph, incorporating various modalities such as sequence, structure, function, text, and embeddings. This comprehensive representation allows the NGS repertoire set to be queried to identify a diverse set of additional hits by simultaneously exploiting different information levels, such as structural, physicochemical, and pharmacological properties like immunogenicity.

A vital component of this multi-modal knowledge graph is the use of vector embeddings, where antibody sequences are represented in multi-dimensional space, enabling sophisticated analysis. These vector embeddings can be derived from different LLMs. For instance, in the example below, clinical antibodies obtain sequence-level embeddings from an antibody-specific LLM, represented in 2D space and colored by their immunogenicity score. This immunogenicity score can be used to filter some of the antibodies, demonstrating how metadata can be utilized to select embedding-based clusters (a minimal sketch of this kind of query follows below). Furthermore, using vector embeddings allows for continuous data enrichment and the integration of the latest information into the knowledge graph at every step of the antibody discovery and development cycle, enhancing the overall process.
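The sketch below illustrates the kind of query just described: a nearest-neighbor search over antibody embeddings combined with a metadata filter on a predicted immunogenicity score. The arrays, similarity cutoff, and risk threshold are made-up stand-ins, not the pipeline's actual data or thresholds.

```python
# Sketch: expand a hit set by embedding similarity, then filter on metadata.
# Embeddings, immunogenicity scores, and the cutoffs are made-up stand-ins.
import numpy as np

rng = np.random.default_rng(1)
n_repertoire, dim = 10_000, 128

repertoire_emb = rng.normal(size=(n_repertoire, dim))        # NGS repertoire embeddings
immunogenicity = rng.uniform(0, 1, size=n_repertoire)        # 1 = highly immunogenic
known_binder_emb = rng.normal(size=(5, dim))                 # embeddings of known binders

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rep_n, bind_n = normalize(repertoire_emb), normalize(known_binder_emb)
similarity = rep_n @ bind_n.T                                # cosine similarity matrix
best_sim = similarity.max(axis=1)                            # closest known binder per sequence

# Keep repertoire sequences that are both close to a known binder and low-risk.
candidates = np.where((best_sim > 0.3) & (immunogenicity < 0.4))[0]
print(f"{len(candidates)} candidate hits for follow-up")
```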
In protein engineering, this continuous data enrichment proves advantageous in various respects, such as introducing specific mutations aimed at enhancing binding affinity, humanizing proteins, and reducing immunogenicity. This new data is dynamically added to the knowledge graph, ensuring a fully integrated view of all the data throughout the antibody design cycle. These modifications are pivotal in tailoring proteins for therapeutics, ensuring they interact more effectively with their targets while minimizing unwanted immune responses.

Figure 3. Clinical antibodies obtain sequence-level embeddings from an antibody-specific LLM, represented in 2D space and colored by their immunogenicity score (1 indicating highly immunogenic).

Conclusion

The LensAI platform provides a robust multi-modal approach to optimize antibody discovery and development processes. By solving the integration of sequence, structure, function, textual insights, and vector embeddings, LensAI bridges gaps between disparate data sources. The platform enhances feature extraction by leveraging embedding data from various LLMs, capturing a wide array of biologically relevant 'hidden properties' at the sub-sequence level. This capability ensures a comprehensive exploration of nuanced biological insights, facilitating an integrated data view.

By utilizing vector search, the platform can efficiently query and analyze these embeddings, enabling the identification of similar sequences and functional motifs across large and complex datasets. This approach not only captures the 'syntax' and 'structure' of amino acid patterns but also integrates 'biological semantics', thereby providing a holistic understanding of protein functions and interactions. Consequently, LensAI improves the efficiency of antibody discovery and development, from identifying novel targets to optimizations in therapeutic development, such as hit expansion analysis, affinity maturation, humanization, and immunogenicity screening.

Furthermore, LensAI enables cyclical enrichment of antibody discovery and development processes by adding and integrating information into a knowledge graph at every step of the development cycle. This continuous enrichment sets a new benchmark for integrated, data-driven approaches in biotechnology, ensuring ongoing improvements and innovations.

References:

Many routes to an antibody heavy-chain CDR3: necessary, yet insufficient, for specific binding. Frontiers.
Benchmarking antibody clustering methods using sequence, structural, and machine learning similarity measures for antibody discovery applications.
Vitale R, Bugnon LA, Fenoy EL, Milone DH, Stegmayer G. Evaluating large language models for annotating proteins. Brief Bioinform. 2024;25(3):bbae177.
Understanding immunogenicity

At its core, immunogenicity refers to the ability of a substance, typically a drug or vaccine, to provoke an immune response within the body. It's the biological equivalent of setting off alarm bells: the stronger the response, the louder these alarms ring. In the case of vaccines, it is required for proper functioning of the vaccine: inducing an immune response and creating immunological memory. However, in the context of therapeutics, and particularly biotherapeutics, an unwanted immune response can potentially reduce the drug's efficacy or even lead to adverse effects.

In pharma, the watchful eyes of agencies such as the FDA and EMA ensure that only the safest and most effective drugs make their way to patients; they require immunogenicity testing data before approving clinical trials and market access. These bodies necessitate stringent immunogenicity testing, especially for biosimilars, where it's essential to demonstrate that the biosimilar product has no increased immunogenicity risk compared to the reference product [1,2].

The interaction between the body's immune system and biologic drugs, such as monoclonal antibodies, can result in unexpected and adverse outcomes. Cases have been reported where anti-drug antibodies (ADAs) led to lower drug levels and therapeutic failures, such as in the use of anti-TNF therapies, where patient immune responses occasionally reduced drug efficacy [3]. Beyond monoclonal antibodies, other biologic drugs, like enzyme replacement therapies and fusion proteins, also demonstrate variability in patient responses due to immunogenicity. In some instances, enzyme replacement therapies have been less effective because of immune responses that neutralize the therapeutic enzymes. Similarly, fusion proteins used in treatments have shown varied efficacy, potentially linked to the formation of ADAs. The critical nature of immunogenicity testing is underscored by these examples, highlighting its role in ensuring drug safety and efficacy across a broader range of biologic treatments. The challenge is to know beforehand whether an immune response will develop, i.e., the immunogenicity of a compound.

A deep dive into immunogenicity assessment of therapeutic antibodies

Researchers rely on empirical analyses to comprehend the immune system's intricate interactions with external agents. Immunogenicity testing is the lens that magnifies this interaction, revealing the nuances that can determine a drug's success or failure.

Empirical analyses in immunogenicity assessments are informative but come with notable limitations. These analyses are often time-consuming, posing challenges to rapid drug development. Early-phase clinical testing usually involves small sample sizes, which restricts the broad applicability of the results. Pre-clinical tests, typically performed on animals, have limited relevance to human responses, primarily due to small sample sizes and interspecies differences. Additionally, in vitro tests using human materials do not fully encompass the diversity and complexity of the human immune system. Moreover, they often require substantial time, resources, and materials. These issues highlight the need for more sophisticated methodologies that integrate human genetic variation for better prediction of drug candidates' efficacy.
Furthermore, the ability to evaluate the outputs from phage libraries during the discovery stage, as well as optimization strategies like humanization, developability, and affinity maturation, can add significant value. Being able to analyze these strategies' impact on immunogenicity with novel tools may enhance the precision of these high-throughput methods.

The emergence of in silico methods in immunogenicity screening

With the dawn of the digital age, computational methods have become integral to immunogenicity testing. In silico testing, grounded in computer simulations, introduces an innovative and less resource-intensive approach. However, it's important to understand that despite their advancements, in silico methods are not entirely predictive. There remains a grey area of uncertainty that can only be fully understood through experimental and clinical testing with actual patients. This underscores the importance of a multifaceted approach that combines computational predictions with empirical experimental and clinical data to comprehensively assess a drug's immunogenicity.

Predictive role

Immunogenicity testing is integral to drug development, serving both retrospective and predictive purposes. In silico analyses, utilizing artificial intelligence and computational models to forecast a drug's behavior within the body, can be used in both early and late stages of drug development. These predictions can also guide subsequent in vitro analyses, where the drug's cellular interactions are studied in a controlled laboratory environment. As a final step, immunogenicity monitoring in patients has traditionally been crucial for regulatory approval.

The future of drug development envisions an expanded role for in silico testing through its combination with experimental and clinical data, to enhance the accuracy of immunogenicity prediction. This approach aims to refine predictions about a drug's safety and effectiveness before clinical trials, potentially streamlining the drug approval process. By understanding how a drug interacts with the immune system, researchers can anticipate possible reactions, optimize treatment strategies, and monitor patients throughout the process. Understanding a drug's potential immunogenicity can inform dosing strategies, patient monitoring, and risk management. For instance, dose adjustments or alternative therapies might be considered if a particular population is likely to develop ADAs against a drug early on.

Traditional vs. in silico methods: a comparative analysis

Traditional in vitro methods, despite being time-intensive, offer direct insights from real-world biological interactions. However, it's important to recognize the limitations in the reliability of these methods, especially concerning in vitro wet-lab tests used to determine a molecule's immunogenicity in humans. These tests often fall into a grey area in terms of their predictive accuracy for human responses. Given this, the potential benefits of in silico analyses become more pronounced. In silico methods can complement traditional approaches by providing additional predictive insights, particularly in the early stages of drug development where empirical data might be limited. This integration of computational analyses can help identify potential immunogenic issues earlier in the drug development process, aiding in the efficient design of subsequent empirical studies. In silico methods, with their rapid processing and efficiency, are ideal for initial screenings, large datasets, and iterative testing.
Large numbers of hits can already be screened in the discovery stage, and screening can be repeated when lead candidates are chosen and further engineered. The advantage of in silico methodologies lies in their capacity for high-throughput analysis and quick turnaround times. Traditional testing methods, while necessary for regulatory approval, present challenges for high-throughput analysis due to their reliance on specialized reagents, materials, and equipment. These requirements not only incur substantial costs but also necessitate significant human expertise and logistical arrangements for sample storage. On the other hand, in silico testing, grounded in digital prowess, sees the majority of its costs stemming from software and hardware acquisition, personnel, and maintenance.

By employing in silico techniques, it becomes feasible to rapidly screen and eliminate unsuitable drug candidates early in the discovery and development process. This early-stage screening significantly enhances the efficiency of the drug development pipeline by focusing resources and efforts on the most promising candidates. Consequently, the real cost-saving potential of in silico analysis emerges from its ability to streamline the candidate selection process, ensuring that only the most viable leads progress to costly traditional testing and clinical trials.

Advantages of in silico immunogenicity screening

In silico immunogenicity testing is transforming drug development by offering rapid insights and early triaging, which is instrumental in de-risking the pipeline and reducing attrition costs. These methodologies can convert extensive research timelines into days or hours, vastly accelerating the early stages of drug discovery and validation. As in silico testing minimizes the need for extensive in vitro testing of large numbers of candidates, its true value lies in its ability to facilitate early-stage decision-making. This early triaging helps identify potential failures before significant investment, thereby lowering the financial risks associated with drug development.

In silico immunogenicity screening in decision-making

Employing an in silico platform enables researchers to thoroughly investigate the molecular structure, function, and potential interactions of proteins at an early stage. This process aids in the early triaging of drug candidates by identifying subtle variations that could affect therapeutic efficacy or safety. Additionally, the insights gleaned from in silico analyses can inform our understanding of how these molecular characteristics may relate to clinical outcomes, enriching the knowledge base from which we draw predictions about a drug's performance in the real world.

De-risking with informed lead nomination

The earliest stages of therapeutic development hinge on selecting the right lead candidates – molecules or compounds that exhibit the potential for longevity. Making an informed choice at this stage can be the difference between success and failure. In-depth analysis, such as immunogenicity analysis, aims to validate that selected leads are effective and exhibit a high safety profile.

To benefit from the potential and efficiency of in silico methods in drug discovery, it's crucial to choose the right platform to realize these advantages. This is where LensAI integrated intelligence technology comes into play. Introducing the future of protein analysis and immunogenicity screening: LensAI.
Powered by the revolutionary HYFT technology, LensAI is not just another tool; it's a game-changer designed for unmatched throughput, lightning-fast speeds, and accuracy. Streamline your workflow, achieve better results, and stay ahead in the ever-evolving world of drug discovery. Experience the unmatched potency of LensAI integrated intelligence technology.

Learn more: LensAI in silico immunogenicity screening

Understanding immunogenicity and its intricacies is fundamental for any researcher in the field. Traditional methods, while not entirely predictive, have been the cornerstone of immunogenicity testing. However, the integration of in silico techniques is enhancing the landscape, offering speed and efficiency that complement existing methods. At MindWalk, we foresee the future of immunogenicity testing in a synergistic approach that strategically combines in silico with in vitro methods. In silico immunogenicity prediction can be applied in a high-throughput way during the early discovery stages, but also later in the development cycle, when engineering lead candidates, to provide deeper insights and optimize outcomes. For the modern researcher, employing both traditional and in silico methods is the key to unlocking the next frontier in drug discovery and development. Looking ahead, in silico screening is geared towards becoming a cornerstone of future drug development, paving the way for better therapies.

References:

1. EMA guideline on immunogenicity assessment of therapeutic proteins.
2. FDA guidance for industry: immunogenicity assessment for therapeutic protein products.
3. Anti-TNF therapy and immunogenicity in inflammatory bowel diseases: a translational approach.
AI-driven rational drug design

AI-driven rational drug design is central to MindWalk's mission to power the intersection of biotech discovery, biotherapeutics, and AI. 'AI-driven' signifies the application of artificial intelligence (AI), including machine learning (ML) and natural language processing (NLP). 'Rational' alludes to the process of designing drugs based on an understanding of biological targets. This approach leverages computational models and algorithms to predict how drug molecules interact with their target biological molecules, such as proteins or enzymes, involved in disease processes. The goal is to create more effective and safer drugs by precisely targeting specific mechanisms within the body.

Integration of complex biological data

The LensAI™ integrated intelligence platform, powered by patented HYFT technology, is unique in its integration of structured and unstructured data, including genomic sequences, protein structures, scientific literature, and clinical notes, facilitating a comprehensive understanding of biological systems.

Advanced computational techniques

MindWalk's approach to drug discovery combines AI for rapid compound screening and predictive modeling with text analysis to retrieve information from research articles. This helps to identify promising drug candidates and optimize their properties for better efficacy and safety, which significantly reduces R&D timelines. The use of AI, ML, and NLP technologies within the LensAI platform, together with different protein large language model (LLM) embeddings, facilitates the discovery of novel drug targets. These technologies allow for the identification of patterns, relationships, and insights within large datasets.

The combination of MindWalk's technologies with the InterSystems IRIS data platform introduces a powerful vector search mechanism that facilitates semantic analysis. This approach transforms the search for relevant biological and chemical information by enabling searches based on conceptual similarity rather than just keywords. As a result, researchers can uncover deeper insights into disease mechanisms and potential therapeutic targets. We wrote about vector search in an earlier blog. Here we illustrate vector search for text; in a future post, we will dive into the application of vector search for protein analytics.

Utilizing vector search in text analysis

The primary challenge for text search is locating specific and accurate information within a vast amount of unstructured data. It is like finding a needle in a haystack. Conducting a simple keyword search in PubMed can yield thousands of results. While generative models can provide concise answers to questions within seconds, their accuracy is not always guaranteed. We implemented retrieval-augmented generation (RAG) to combat the hallucinations that generative chat systems may experience. Moreover, RAG systems deliver up-to-date results and are able to refer to their sources, which makes their responses traceable. However, like all generative systems, they struggle to handle large input prompts at once. This is where vector search becomes essential: it is a valuable tool to guide you to the precise area within your data haystack.

Representing meaning in vector space

Search terms often have various meanings in different contexts.
For instance, the abbreviation 'ADA' could refer to anti-drug antibodies, the American Dental Association, the American Diabetes Association, the Americans with Disabilities Act, adenosine deaminase, adalimumab, and other entities. By encoding text data with embeddings, one can narrow down the focus to the meaning that aligns with the search query. The figure below illustrates a two-dimensional UMAP visualization of the embeddings for PubMed abstracts containing 'ADA'. While the visual representation emphasizes similarity and does not provide a scalable measure for actual distance in the multidimensional vector space, it does demonstrate the presence of semantic ambiguity in the vector-based embeddings. Thus, encoding the input allows for data clustering and focusing on the most relevant clusters.

[Figure: UMAP of PubMed abstracts containing 'ADA']

The embeddings used here are dense vectors. Dense vectors are compact numerical representations of data points, typically generated by large language models, in this case PubMedBERT. Dense vectors capture the semantic meaning of text, allowing the system to retrieve relevant information even if the exact keywords are not present. This nuanced and context-aware retrieval offers advantages over traditional keyword-based methods.

On the other hand, sparse vectors are typically high-dimensional but contain many zero values. As an example, a bag-of-words vector for a short English sentence would contain a one for every word that is present in the sentence and a zero for every English word that is not present in the sentence. The result is a very sparse vector with many zero values and a couple of ones. Sparse vectors are often generated using traditional methods like TF-IDF or BM25, which focus on the presence or frequency of specific terms in the text. These vectors require fewer resources and offer faster retrieval speeds.

Searching in vector space

When generating embeddings, there are multiple levels to consider. Chunking is the process of breaking down large pieces of text into smaller segments that will be considered as units of relevant context. From tokens to documents, each level offers a different way to understand and analyze text data.

Starting at the most granular level, tokens represent individual words or parts of individual words within a text. Large language models often calculate embeddings based on single-word tokens, which may dilute the semantic richness of the text. The LensAI integrated intelligence platform uses concepts: words or word groups that form a unit of meaning. Concepts are more specific than tokens for keyword search. Moreover, dense embeddings of concepts within a sentence are particularly well-suited for detecting synonyms.

Token embeddings vs. 'concept' embeddings

The following UMAP visualization of concept embeddings shows similar embeddings for the semantically related instances of 'ADA treatment' and 'ADA therapy', and also for instances of 'ADA inhibition' and 'ADA treatment', whereas the embeddings for 'ADA professional practice committee', 'ADA activity', 'ADA SCID', and 'ADA formation' form separate, non-overlapping clusters.

[Figure: UMAP of 'ADA' concept embeddings]

CRC constructs (concept-relation-concept patterns) effectively capture the intricate boundaries of semantic meaning. Focusing on CRCs enhances the semantic similarity search while filtering out non-relevant sentence parts, yielding a more condensed representation of meaning.
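As a concrete illustration of the dense-versus-sparse distinction, the sketch below scores the same toy documents with TF-IDF vectors and with transformer embeddings. The model name, documents, and query are stand-in assumptions (a biomedical encoder such as PubMedBERT would be used in practice); this is not the production pipeline described here.

```python
# Dense vs. sparse retrieval in miniature. The model name and sentences are
# illustrative stand-ins, not the production setup described in this post.
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Anti-drug antibodies (ADA) reduced serum adalimumab levels.",
    "The American Diabetes Association (ADA) updated its guidelines.",
    "Adenosine deaminase (ADA) deficiency causes severe immunodeficiency.",
]
query = "immunogenicity and anti-drug antibody formation"

# Sparse: TF-IDF vectors score documents by overlapping terms only.
tfidf = TfidfVectorizer()
sparse_scores = cosine_similarity(tfidf.fit_transform(docs + [query])[-1:],
                                  tfidf.transform(docs))[0]

# Dense: transformer embeddings can match on meaning without shared keywords.
model = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in for a biomedical model
dense_scores = cosine_similarity(model.encode([query]), model.encode(docs))[0]

for doc, s, d in zip(docs, sparse_scores, dense_scores):
    print(f"sparse={s:.2f}  dense={d:.2f}  {doc}")
```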
moving up to the levels of sentence and document embeddings can be useful for obtaining a more general idea of the context rather than focusing on a particular search term in a query. the level of embeddings that is most relevant will depend on the specific use case at hand.

in conclusion, vector search presents numerous opportunities to optimize search results by guiding users to their most relevant data. dense and sparse vectors, as well as embeddings at various levels, can be combined to create a hybrid system tailored to specific use cases. in the field of ai-driven rational drug design, vector search is an additional computational technique that fits into a multidisciplinary approach, supporting more than text data alone, as will become clear in our future blog post about vector search for protein analysis. combining the lensai integrated intelligence platform with the intersystems iris data platform creates a robust vector search mechanism, enhancing rational drug discovery and personalized medicine. additionally, lensai is designed to support hallucination-free, traceable, and up-to-date retrieval-augmented generation, helping researchers access accurate and reliable data.
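to make the hybrid idea concrete, here is one possible way to blend the two signals. it is a sketch under the assumption that sparse and dense similarity scores have already been computed for the same documents; the min-max normalization and the alpha weight are illustrative choices, not the iris or lensai implementation.

```python
# sketch of a hybrid retrieval score: a weighted blend of normalized
# sparse (keyword) and dense (semantic) similarities.
import numpy as np

def minmax(x):
    """rescale scores to [0, 1] so the two signals are comparable."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def hybrid_rank(sparse_scores, dense_scores, alpha=0.3):
    """blend sparse and dense scores; a higher alpha favours exact keyword matches."""
    blended = alpha * minmax(sparse_scores) + (1 - alpha) * minmax(dense_scores)
    return np.argsort(blended)[::-1], blended

# toy scores for three documents (e.g. from the snippet above)
order, scores = hybrid_rank([0.10, 0.55, 0.05], [0.82, 0.40, 0.15])
print("ranking (best first):", order, "blended scores:", scores)
```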
in our new lensai blog series, we explore how data itself often becomes the bottleneck in data-driven biological and biomedical research. we dive into the data-related challenges that affect the development and advancement of different research concepts and domains, such as drug discovery, as well as the importance of integrating wet lab and in silico research. we start with systems biology, a holistic model that represents a radical departure from the conventional reductionist approach to understanding complex biological systems.

biological and biomedical research in the 20th century was driven predominantly by reductionism, a 'pieces of life' approach that seeks to understand complex biological systems as the sum of the functionalities of their individual components. there is certainly value in building a systems-level perspective from an aggregation of component-level functionality; after all, reductionism has played a key role in elucidating the central dogma and core concepts of biology. however, the limitations of this approach are hard to ignore: a complex biological system, unlike, say, a bicycle, clearly has to be more than the sum of its parts.

systems biology is the paradigm that defines an integrative and holistic strategy to decipher complex, hierarchical, adaptive, and dynamic biological systems across multiple components and levels of organization. complex biological systems, like those within living organisms, are not just the sum of their parts but have unique properties that emerge when all of the parts work together. systems biology helps scientists study and understand these systems by looking at the big picture: by considering how all the different parts of the system interact, scientists get a better understanding of how the entire system functions as a whole, instead of only looking at individual components in isolation. inspired by ideas from the santa fe institute, systems thinking plays a crucial role in the systems biology approach. it helps researchers recognize the importance of the connections between different parts of a biological system, the influence of its surroundings, and how the system changes over time. this way, scientists can better understand health, disease, and potential treatments, leading to more effective medical therapies and diagnostic tools.

the modern form of systems biology emerged in the late 1960s, and it quickly became evident that mathematics and computation would play a critical role in realizing the potential of this holistic approach. mathematical and computational modeling based on large volumes of genome-scale data would be the key to unraveling the systems-level complexity of biological phenomena. today, the availability of sophisticated computational techniques and the exponential generation of high-throughput biomedical data provide the perfect foundation for a systems approach to tackling biological complexity. but here's where things get complicated: complex biological phenomena and systems are defined by complex biological data. a data-driven systems approach requires the integrated analysis of all available complex biological data in order to identify relevant interactions and patterns of a biosystem.
however, the sheer complexity of biological data poses a major challenge for the efficient data integration and curation required to generate a holistic view of complex biological systems.

a quick overview of biological data complexity

the james webb space telescope generates up to 57 gigabytes of data each day. by comparison, one of the world's largest genome sequencing facilities sequences dna at a rate equivalent to one human genome, roughly 140 gigabytes in size, every 3.2 minutes. and that is just genomic data, which is expected to reach exabase scale within a decade, from just one sequencing facility. despite the continuing exponential increase of publicly available biological data, data volume is perhaps one of the more manageable complexities of biological big data. then there's the expanding landscape of biological data, from single-cell omics data to genome-scale metabolic models (gems), which reflects the inherent complexity and heterogeneity of biological systems and varies in format and scale. data formats can also vary based on the technologies and protocols used to characterize different levels of biological organization. from a data integration perspective, there also has to be due consideration for organizing structured and unstructured data as well as multi-format data from numerous databases that specialize in specific modalities, layers, organisms, etc. over and above all this, novel complexities continue to emerge as technological advancements open up new frontiers for biological research. moving on from simple static models derived from static data, the scope of research is now expanding to characterize biological complexity along the dynamic fourth dimension of time. for instance, rather than merely integrating single-time-point omics sequence data across biological levels, the emerging framework of temporal omics compares sequence data across time in order to evaluate the temporal dynamics of biological processes. so the big question is how to integrate, standardize, and curate all this complexity into one comprehensive, contextual, scalable data matrix that solves the information integration dilemma in systems biology.

the lensai integrated intelligence platform for systems biology

the information integration dilemma (iid) refers to how the challenges of integrating, standardizing, and analyzing complex biological data have created a bottleneck in the holistic, systems-level analysis of biological complexity. currently, integrating data across diverse modalities, formats, platforms, standards, ontologies, etc. for systems biology analysis is not a trivial task. the process requires multiple tools and techniques for different tasks such as harmonizing and standardizing data formats, preprocessing, integration, and fusion. moreover, there is no single analytical framework that scales across the complex heterogeneity and diversity of biological data. the lensai integrated intelligence platform addresses these shortcomings of conventional solutions by incorporating the key organizing principles of intelligent data management and smart big data systems. one, the platform leverages ai-powered intelligent automation to organize and index all biological data, both structured and unstructured. hyft®, a proprietary framework that leverages advanced machine learning (ml) and natural language processing (nlp) technologies, seamlessly integrates and organizes all biological and textual data into a unified multidimensional network of data objects.
the network currently comprises over 660 million data objects with multiple layers of information about sequence, syntax, and protein structure. in addition, hyft® enables researchers to integrate proprietary research into the existing data network. this network is continuously updated with new data, metadata, relationships, and links, ensuring that the lensai data biosphere is always current. two, smart big data is not just about the number of data objects but also about the latent relationships between those data sets. the lensai data biosphere is further augmented by a knowledge graph that currently maps over 25 billion cross-data relationships and makes it easier to visualize the interrelatedness of different entities. this visual relationship map is continuously updated with contextual biological information to create a constantly expanding knowledge resource. now that we have an organized, high-quality, contextualized data catalog, the next step is to provide comprehensive search and access capabilities that empower users to curate, customize and organize data sets to specific research requirements. for instance, the computational modeling of biological systems could follow two broad research directions: one, bottom-up, theory-driven modeling, based on contextual links between model terms and known mechanisms of a biological system; or two, top-down, data-driven modeling, where relationships between different variables in biological systems are extracted from large volumes of data without prior knowledge of the underlying mechanisms. so an intelligent data catalog must enable even non-technical users to organize and manipulate data in a way that best serves their research interests.

multiscale data integration with the lensai platform

biological systems operate across multiple and diverse spatiotemporal scales, each represented by datasets with very different modalities. the systems biology approach requires the concurrent integration of all of these multimodal datasets into one unified analytical framework in order to obtain an accurate, systems-level simulation of biological complexity. however, there are currently no bioinformatics frameworks that facilitate the multiscale integration of vast volumes of complex, heterogeneous, system-wide biological data. mindwalk's patented hyft® technology and lensai platform enable true multiscale data unification, including syntactical (sequence) data, 3d structural data, and unstructured scientific information (e.g. scientific literature), into one integrated, ai-powered analytical framework. by removing the friction in the integration of complex biological data, lensai shifts the paradigm in data-driven biological research.
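as a toy illustration of the kind of cross-data relationships a knowledge graph captures, here is a small sketch using networkx. the entities, relations and placeholder identifiers are invented for demonstration and do not reflect the hyft data model or the lensai knowledge graph.

```python
# illustrative sketch: cross-data relationships as a small knowledge graph.
import networkx as nx

kg = nx.MultiDiGraph()
kg.add_edge("TNF", "rheumatoid arthritis", relation="associated_with")
kg.add_edge("adalimumab", "TNF", relation="inhibits")
kg.add_edge("PMID:0000000", "adalimumab", relation="mentions")      # placeholder identifier
kg.add_edge("adalimumab", "anti-drug antibodies", relation="can_induce")

# traverse the graph: which entities are reachable from a drug of interest?
print("linked to adalimumab:", list(nx.descendants(kg, "adalimumab")))
for u, v, data in kg.edges(data=True):
    print(f"{u} --{data['relation']}--> {v}")
```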
identifying and validating optimal biological targets is a critical first step in drug discovery, with a cascading downstream impact on late-stage trials, efficacy, safety, and clinical performance. traditionally, this process required the manual investigation of biomedical data to establish target-disease associations and to assess efficacy, safety, and clinical/commercial potential. however, the exponential growth in high-throughput data on a range of putative targets, including proteins, metabolites, dna, rna, etc., has led to the increasing use of in silico, or computer-aided drug design (cadd), methods to identify bioactive compounds and predict binding affinities at scale. today, in silico techniques are evolving at the same pace as in vitro technologies, such as dna-encoded libraries, and have proven to be critical in dealing with the scale, diversity, and complexity of modern chemical libraries. cadd techniques encompass structure-based drug design (sbdd) and ligand-based drug design (lbdd) strategies, depending on the availability of the three-dimensional structure of the biological target of interest. some of the most common applications for these techniques include in silico structure prediction, refinement, modelling and target validation. they are widely utilised across four phases: identifying hits with virtual screening (vs), investigating the specificity of selected hits through molecular docking, predicting admet properties, and further molecular optimisation of hits/leads. as drug discovery becomes increasingly computational and data-driven, it is becoming common practice to combine cadd with advanced technologies like artificial intelligence (ai), machine learning (ml), and deep learning (dl) to cost- and time-efficiently convert biological big data into pharmaceutical value. in this article, we'll take a closer look at how ai/ml/dl technologies are transforming three of the most widely used in silico techniques in drug discovery: virtual screening (vs), molecular docking and molecular dynamics (md) simulation.

virtual screening

virtual screening (vs) is a computational approach to screening large libraries for hits that, when integrated with an experimental approach such as high-throughput screening, can significantly enhance the speed, accuracy and productivity of drug discovery. in silico screening techniques are classified as ligand-based vs (lbvs) and structure-based vs (sbvs). these distinct approaches can be combined, for instance, to identify active compounds using ligand-based techniques and follow through with structure-based methods to find favourable candidates. however, there are some shortcomings to cadd-based vs technologies, with biochemical assays typically confirming desired bioactivity in only 12% of the top-scoring compounds derived from standard vs applications. over the past two decades, the application of ai/ml tools to virtual screening has evolved considerably, with techniques like multi-objective optimization and ensemble-based virtual screening being used to enhance the efficiency, accuracy and speed of conventional sbvs and lbvs methodologies. studies show that deep learning (dl) techniques perform significantly better than classical ml algorithms across a range of tasks, including target prediction, admet property prediction and virtual screening. dl-based vs frameworks have proven to be more effective at extracting high-order molecular structure representations, accurately classifying active and inactive compounds, and enabling ultra-high-throughput screening.
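for readers less familiar with ligand-based screening, here is a minimal sketch of the idea: rank library compounds by fingerprint similarity to a known active. the smiles strings are toy examples (aspirin as the reference, a three-compound "library"), and real campaigns screen millions of molecules with far richer models than a single tanimoto score.

```python
# minimal ligand-based virtual screening sketch using rdkit morgan fingerprints.
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

active = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")          # aspirin as a toy reference
library = {
    "salicylic acid": "O=C(O)c1ccccc1O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
    "ibuprofen": "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
}

ref_fp = AllChem.GetMorganFingerprintAsBitVect(active, 2, nBits=2048)
scores = {}
for name, smiles in library.items():
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    scores[name] = DataStructs.TanimotoSimilarity(ref_fp, fp)

# highest similarity first: these would be the "hits" passed on for docking
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: tanimoto = {score:.2f}")
```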
the integration of quantum computing is expected to be the next inflexion point for vs, with studies demonstrating that quantum classifiers can significantly outperform classical ml/dl-based vs.

molecular docking

molecular docking, a widely used method in sbvs for retrieving active compounds from large databases, typically relies on a scoring function to estimate binding affinities between receptors and ligands. this docking-scoring approach is an efficient way to quickly evaluate protein-ligand interactions (plis) based on a ranking of putative ligand binding poses that is indicative of binding affinity. the development of scoring functions (sfs) for binding affinity prediction has been evolving since the 1990s and today includes classical sfs, such as physics-, regression-, and knowledge-based methods, and data-driven models, such as ml- and dl-based sfs. however, accuracy is a key challenge with high-throughput approaches, as binding affinity predictions are derived from a static snapshot of the protein-ligand binding state rather than the complex dynamics of the ensemble. ml-based sfs perform significantly better than classical sfs on comparative assessment of scoring functions (casf) benchmarks, thanks to their ability to learn from pli data and deal with non-linear relationships. but these predictions are based on approximations and data set biases rather than the interatomic dynamics that guide binding. the performance of ml-based sfs also depends on the similarity between targets in the training set and the test set, which makes generalisation a challenge. dl-based sfs have demonstrated significant advantages over traditional ml methods, including automated feature generation and the ability to capture complex binding interactions. recently, a team of mit researchers took the novel approach of framing molecular docking as a generative modelling problem to develop diffdock, a molecular docking model that delivers a much higher success rate (38%) than state-of-the-art traditional docking (23%) and deep learning (20%) methods.

molecular dynamics simulations

since molecular docking methods only provide an initial, static protein-ligand complex, molecular dynamics (md) simulations have become the go-to approach for information on the dynamics of the target. md simulations capture changes at the molecular and atomistic levels and play a critical role in elucidating intermolecular interactions that are essential to assess the stability of a protein-ligand complex. there are, however, still several issues with this approach, including accuracy-versus-efficiency trade-offs, computational complexity, large timescale requirements and errors due to the underlying force fields. ml techniques have helped address many of these challenges and have proven vital to the development of md simulations for three reasons: objectivity in model selection, enhanced interpretability due to the statistically coherent representation of structure-function relationships, and the capability to generate quantitative, empirically verifiable models for biological processes. deep learning methods are now emerging as an effective solution for dealing with the terabytes of dynamic biomolecular big data generated by md simulations, with other applications including the prediction of quantum-mechanical energies and forces, the extraction of free energy surfaces and kinetics, and coarse-grained molecular dynamics.
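to make the idea of an ml-based scoring function concrete, here is a toy sketch: a random forest regressor trained to map simple protein-ligand interaction features to a binding-affinity value. the features, labels and their relationship below are synthetic placeholders; real sfs are trained on curated complexes (for example pdbbind-style data) and evaluated on benchmarks such as casf.

```python
# toy sketch of an ml-based scoring function on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# pretend features: h-bond count, hydrophobic contacts, buried surface area, rotatable bonds
X = rng.normal(size=(n, 4))
# pretend affinity (pKd-like) with a non-linear dependence plus noise
y = 5 + 1.5 * X[:, 0] + np.sin(X[:, 1]) + 0.5 * X[:, 2] ** 2 + rng.normal(0, 0.3, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
sf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("held-out r^2:", round(sf.score(X_test, y_test), 3))
```

note how the model never sees the physics of binding: it learns a statistical mapping from features to affinity, which is exactly why data set biases and train/test target similarity matter so much for generalisation.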
shifting the in silico paradigm with ai

a combination of in silico models and experimental approaches has become a central component of early-stage drug discovery, facilitating the faster generation of lead compounds at lower costs and with higher efficiency and accuracy. advanced ai technologies are a key driver of disruption in in silico drug discovery and have helped address some of the limitations and challenges of conventional in silico approaches. at the same time, they are also shifting the paradigm with their capability to auto-generate novel drug-like molecules from scratch. by one estimate, ai/ml in early-stage drug development could result in an additional 50 novel therapies, a $50 billion market, over a 10-year period.
reproducibility, getting the same results using the original data and analysis strategy, and replicability, the ability to confirm research results within different data contexts, are fundamental to valid, credible, and actionable scientific research. without reproducibility, replicability becomes moot. a 2016 survey of researchers revealed a consensus that there was a crisis of reproducibility, with most researchers reporting that they had failed to reproduce not only the experiments of other scientists (70%) but even their own (>50%). in biomedical research, reproducibility testing is still extremely limited, with some attempts to do so failing to comprehensively or conclusively validate reproducibility and replicability. over the years, there have been several efforts to assess and improve reproducibility in biomedical research. however, there is a new front opening in the reproducibility crisis, this time in ml-based science. according to this study, the increasing adoption of complex ml models is creating widespread data leakage, resulting in "severe reproducibility failures," "wildly overoptimistic conclusions," and the inability to validate the superior performance of ml models over conventional statistical models. pharmaceutical companies have generally been cautious about accepting published results for a number of reasons, including the lack of scientifically reproducible data. an inability to reproduce and replicate preclinical studies can adversely impact drug development and has also been linked to drug and clinical trial failures. as drug development enters its latest innovation cycle, powered by computational in silico approaches and advanced ai-cadd integrations, reproducibility represents a significant obstacle to converting biomedical research into real-world results.

reproducibility in in silico drug discovery

the increasingly computational nature of modern scientific research has already resulted in a significant shift, with some journals incentivizing authors and providing badges for reproducible research papers. many scientific publications also mandate the publication of all relevant research resources, including code and data. in 2020, elife launched executable research articles (eras) that allowed authors to add live code blocks and computed outputs to create computationally reproducible publications. however, creating a robust reproducibility framework to sustain in silico drug discovery would require more transformative developments across three key dimensions: infrastructure and incentives for reproducibility in computational biology, reproducible research ecosystems, and reproducible data management.

reproducible computational biology

this approach to industry-wide transformation envisions a fundamental cultural shift, with reproducibility as the fulcrum for all decision-making in biomedical research. the focus is on four key domains. first, creating courses and workshops that expose biomedical students to specific computational skills and real-world biological data analysis problems and impart the skills required to produce reproducible research. second, promoting truly open data sharing, along with all relevant metadata, to encourage larger-scale data reuse. third, leveraging platforms, workflows, and tools that support the open data/code model of reproducible research. and fourth, promoting, incentivizing, and enforcing reproducibility by adopting fair principles and mandating source code availability.
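to illustrate the data-leakage failure mode mentioned above, here is a short, self-contained sketch: selecting features on the full dataset before cross-validation leaks information from the held-out folds and produces overoptimistic scores even on pure noise, whereas keeping the selection step inside a pipeline removes the leak. the data here is random by construction.

```python
# sketch: data leakage from preprocessing outside cross-validation.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))          # random features, no real signal
y = rng.integers(0, 2, size=100)          # random labels

# leaky: feature selection sees the whole dataset, including future test folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# correct: selection is refit inside each training fold only
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
clean_acc = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky accuracy: {leaky_acc:.2f} (looks impressive, but the data is noise)")
print(f"clean accuracy: {clean_acc:.2f} (close to the 0.5 chance level)")
```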
computational reproducibility ecosystem

a reproducible ecosystem should enable data and code to be seamlessly archived, shared, and used across multiple projects. computational biologists today have access to a broad range of open-source and commercial resources to ensure their ecosystem generates reproducible research. for instance, data can now be shared across several recognized, domain- and discipline-specific public data repositories such as pubchem, cdd vault, etc. public and private code repositories, such as github and gitlab, allow researchers to submit and share code with researchers around the world. and then there are computational reproducibility platforms, like code ocean, that enable researchers to share, discover, and run code.

reproducible data management

as per a recent data management and sharing (dms) policy issued by the nih, all applications for funding will have to be accompanied by a dms plan detailing the strategy and budget to manage and share research data. sharing scientific data, the nih points out, accelerates biomedical research discovery by validating research, increasing data access, and promoting data reuse. effective data management is critical to reproducibility, and creating a formal data management plan prior to the commencement of a research project helps clarify two key facets of the research: one, key information about experiments, workflows, and the types and volumes of data generated; and two, research output formats, metadata, storage, and access and sharing policies. the next critical step towards reproducibility is having the right systems to document the process, including data/metadata, methods and code, and version control. for instance, reproducibility in in silico analyses relies extensively on metadata to define scientific concepts as well as the computing environment. in addition, metadata also plays a major role in making data fair. it is therefore important to document experimental and data analysis metadata in an established standard and store it alongside research data. similarly, the ability to track and document datasets as they adapt, reorganize, extend, and evolve across the research lifecycle will be crucial to reproducibility. it is therefore important to version-control data so that results can be traced back to the precise subset and version of the data. of course, the end game for all of this has to be the sharing of data and code, which is increasingly becoming a prerequisite as well as a voluntarily accepted practice in computational biology. one survey of 188 researchers in computational biology found that those who authored papers were largely satisfied with their ability to carry out key code-sharing tasks, such as ensuring good documentation and that the code runs in the correct environment. the average researcher, however, would not commit any more time, effort, or expenditure to sharing code. and there are still certain perceived barriers that need to be addressed before the public archival of biomedical research data and code becomes prevalent.

the future of reproducibility in drug discovery

a 2014 report from the american association for the advancement of science (aaas) estimated that the u.s. alone spent approximately $28 billion yearly on irreproducible preclinical research. in the future, blockchain-based frameworks may well enable the automated verification of the entire research process. meanwhile, in silico drug discovery has emerged as one of the maturing innovation areas in the pharmaceutical industry.
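returning to the data-versioning practice described above, here is a minimal sketch of the idea: record a content hash and some basic metadata for each dataset version so that downstream results can be traced back to the exact file they were computed from. the file name, registry name and metadata fields are illustrative assumptions, not a prescribed standard or a specific tool.

```python
# minimal data-versioning sketch: fingerprint a dataset and log it to a registry.
import hashlib, json, pathlib
from datetime import datetime, timezone

def register_dataset(path, registry="data_versions.json", note=""):
    """append a sha-256 fingerprint and metadata for `path` to a json registry."""
    digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
    record = {
        "file": str(path),
        "sha256": digest,
        "registered_at": datetime.now(timezone.utc).isoformat(),
        "note": note,
    }
    registry_path = pathlib.Path(registry)
    history = json.loads(registry_path.read_text()) if registry_path.exists() else []
    history.append(record)
    registry_path.write_text(json.dumps(history, indent=2))
    return digest

# example usage with a hypothetical file name:
# register_dataset("expression_matrix.tsv", note="batch 3, normalized")
```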
the alliance between pharmaceutical companies and research-intensive universities has been a key component in de-risking drug discovery and enhancing its clinical and commercial success. reproducibility-related improvements and innovations will help move this alliance to a data-driven, ai/ml-based, in silico model of drug discovery.
in 2020, seventeen pharmaceutical companies came together in an alliance called qupharm to explore the potential of quantum computing (qc) technology in addressing real-world life science problems. the simple reason for this early enthusiasm, especially in a sector widely seen as being too slow to embrace technology, is qc's promise to solve otherwise unsolvable problems. the combination of high-performance computing (hpc) and advanced ai more or less represents the cutting edge of drug discovery today. however, the sheer scale of the drug discovery space can overwhelm even the most advanced hpc resources available today. there are an estimated 10^63 potential drug-like molecules in the universe. meanwhile, caffeine, a molecule with just 24 atoms, is roughly the upper limit of what conventional hpcs can simulate exactly. qc can help bridge this great divide between chemical diversity and conventional computing. in theory, a 300-qubit quantum computer can simultaneously represent more states than there are atoms in the visible universe (10^78 to 10^82). and qc is not all theory, though much of it is still proof-of-concept. just last year, ibm launched a new 433-qubit processor, more than tripling the qubit count in just a year. this march witnessed the deployment of the first quantum computer in the world to be dedicated to healthcare, though the high-profile cafeteria installation was largely intended to position the technology front and center for biomedical researchers and physicians. most pharmaceutical majors, including biogen, boehringer ingelheim, roche, pfizer, merck, and janssen, have also launched their own partnerships to explore quantum-inspired applications. if qc is the next digital frontier in pharma r&d, the combination of ai and hpc is currently the principal engine accelerating drug discovery, with in silico drug discovery emerging as a key ai innovation area. computational in silico approaches are increasingly used alongside conventional in vivo and in vitro models to address issues related to the scale, time, and cost of drug discovery.

ai, hpc & in silico drug discovery

according to gartner, ai is one of the top workloads driving infrastructure decisions. cloud computing provides businesses with cost-effective access to analytics, compute, and storage facilities and enables them to operationalize ai faster and with lower complexity. when it comes to hpc, data-intensive ai workloads are increasingly being run in the cloud, a market that is growing twice as fast as on-premise hpc. from a purely economic perspective, the cloud can be more expensive than on-premise solutions for workloads that require a large hpc cluster. for some pharma majors, this alone is reason enough to avoid a purely cloud-based hpc approach and instead augment on-premise hpc platforms with the cloud for high-performance workloads. in fact, a hybrid approach seems to be the preferred option for many users, with the cloud being used mainly for workload surges rather than for critical production. however, there are several ways in which running ai/ml workloads on cloud hpc systems can streamline in silico drug discovery.

in silico drug discovery in the cloud

the presence of multiple data silos, the proliferation of proprietary data, and the abundance of redundant/replicated data are some of the biggest challenges currently undermining drug development. at the same time, incoming data volumes are not only growing exponentially but also becoming more heterogeneous as information is generated across different modalities and biological layers.
the success of computational drug discovery will depend on the industry's ability to generate solutions that can scale across an integrated view of all this data. leveraging a unified data cloud as a common foundation for all data and analytics infrastructure can help streamline every stage of the data lifecycle and improve data usage, accessibility, and governance. as ai adoption in the life sciences approaches the tipping point, organizations can no longer afford to have discrete strategies for managing their data clouds and ai clouds. most companies today choose their data cloud platform based on the support available for ai/ml model execution. drug development is a constantly changing process, and ai/ml-powered in silico discovery represents a transformative new opportunity in computer-aided drug discovery. meanwhile, ai-driven drug discovery is itself evolving dramatically with the emergence of computationally intensive deep learning models and methodologies that are redefining the boundaries of state-of-the-art computation. in this shifting landscape, a cloud-based platform enables life sciences companies to continuously adapt and upgrade to the latest technologies and capabilities. most importantly, a cloud-first model can help streamline the ai/ml life cycle in drug discovery. data collection for in silico drug discovery covers an extremely wide range, from sequence data to clinical data to real-world data (rwd) to unstructured data from scientific texts. the diverse, distributed nature of pharmaceutical big data often poses significant challenges to data acquisition and integration. the elasticity and scalability of cloud-based data management solutions help streamline access and integrate data more efficiently. in the data preprocessing phase, a cloud-based solution can simplify the development and deployment of end-to-end pipelines/workflows and enhance transparency, reproducibility, and scalability. in addition, several public cloud services offer big data preprocessing and analysis as a service. on-premise solutions are a common approach to model training and validation in ai-driven drug discovery. apart from the up-front capital expenditure and ongoing maintenance costs, this approach can also limit the scalability of the solution across an organization's entire research team, leading to long wait times and lost productivity. a cloud platform, on the other hand, instantly provides users with just the right amount of resources needed to run their workloads. and finally, ensuring that end users have access to the ai models that have been developed is the most critical phase of the ml lifecycle. apart from the validation and versioning of models, model management and serving has to address several broader requirements, such as resilience and scalability, as well as specific factors, such as access control, privacy, auditability, and governance. most cloud services offer production-grade solutions for serving and publishing ml models.

the rise of drug discovery as a service

according to a 2022 market report, the increasing usage of cloud-based technologies in the global in silico drug discovery sector is expected to drive growth at a cagr of nearly 11% between 2021 and 2030, with the saas segment forecast to grow the fastest. as per another report, the increasing adoption of cloud-based applications and services by pharmaceutical companies is expected to propel the ai in drug discovery market at a cagr of 30% to $2.99 billion by 2026.
cloud-based ai-driven drug discovery has well and truly emerged as the current state-of-the-art in pharma r&d. at least until quantum computing and quantum ai are ready for prime time.
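to put the 300-qubit comparison earlier in this post in perspective, here is a quick back-of-the-envelope check. it is only the counting argument behind the claim, not a statement about practical quantum advantage.

```python
# back-of-the-envelope: 300 qubits span 2**300 basis states, which exceeds
# the estimated number of atoms in the visible universe (about 10**78 to 10**82).
states = 2 ** 300
print(f"2^300 is about {states:.3e}")        # roughly 2.0e+90
print("exceeds 10^82 atoms:", states > 10 ** 82)
```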
in our previous blog, we noted how the increasing utilization of ai across different phases of the drug discovery process has proven its strategic value in addressing some of the core efficiency and productivity challenges involved. as a result, ai in drug discovery and development has finally cut through the hype and become an industry-wide reality. a key milestone in this process has been the launch of clinical trials for the first drug developed completely using ai. currently, the rapid evolution of ai-powered protein folding algorithms, such as alphafold, rosettafold, and raptorx, promises to dramatically accelerate structural biology, protein engineering, and drug discovery. in fact, ai is expected to underpin a million-x drug discovery future, wherein the ability of these technologies to exponentially scale up protein structure prediction and chemical compound generation will increase the opportunity for drug discovery by a million times. ai-driven drug development also facilitates several other strategic outcomes, such as access to larger datasets, reduced drug discovery costs, optimized drug designs, accelerated drug repurposing or repositioning, the discovery of new and hidden drug targets, and turning previously undruggable targets into druggable ones.

[figure: ai applications in drug design (source: springer)]

there are a range of applications for ai across different phases of drug development, from target discovery to clinical studies. here's a quick overview of how ai can transform some of the key stages of drug design.

ai in virtual screening

drug discovery typically begins with the identification of targets for a disease of interest, followed by high-throughput screening (hts) of large chemical libraries to identify bioactive compounds. though hts has its advantages, it may not always be appropriate or even adequate, especially in the big data era, when chemical libraries have expanded beyond a billion molecules. this is where ai-powered virtual screening (vs) methods are being used to complement hts and accelerate the exploratory research process in the discovery of potential drug components. ai-based vs can rapidly screen millions of compounds at a fraction of the cost associated with hts and with a prediction accuracy as high as 85%.

ai in lead optimization

lead optimization (lo) is an essential yet expensive and time-consuming phase in preclinical drug discovery. the fundamental utility of the lo process is to enhance the desirable properties of a compound while eliminating structural deficiencies and the potential for adverse side effects. however, this is a complex multiparameter optimization problem in which several competing objectives have to be precisely balanced in order to arrive at optimal drug candidates. done right, lo can significantly reduce the chances of attrition in the preclinical as well as clinical stages of drug development. and reducing the iterations required for optimization in the design-make-test-analyze (dmta) cycle can help accelerate the drug development process. deep learning generative models are now being used successfully to accelerate the generation of lead compounds while simultaneously ensuring that these compounds also conform to the requisite biological objectives.
generative modeling platforms, with integrated predictive models for calculating various absorption, distribution, metabolism, excretion, and toxicity (admet) endpoints, can now significantly shorten the dmta cycle required to select and design compounds that satisfy all defined lo criteria.

ai in computer-aided drug synthesis

the integration of ai and drug synthesis has accelerated over the last few years, significantly improving the design and synthesis of drug molecules. ai-driven computer-aided synthesis tools are being widely used in retrosynthetic analysis, reaction prediction, and automated synthesis. for instance, these tools can be applied to the retrosynthetic analysis of target compounds to identify feasible synthetic routes, predict reaction products and yields, and optimize hit compounds. ai in computer-aided synthesis planning (casp) is enabling chemists to objectively identify the most efficient and cost-effective synthetic route for a target molecule, thereby accelerating the 'make' phase of the dmta cycle. the emergence of intelligent and automated technologies for continuous-flow chemical synthesis promises a future of fully autonomous synthesis. these are just a few examples of the potential for ai in drug discovery and development. in fact, companies are using ai to address key challenges across the r&d pipeline and the life sciences value chain.

the future of ai in drug development

according to a research paper, the future of drug discovery will entail a centralized, closed-loop, ml-controlled workflow that autonomously generates hypotheses, synthesizes lead candidates, tests them, and stores the data. according to the paper, the human interface between conventional discovery processes, such as data analysis, computational prediction, and experimentation, results in bottlenecks and biased hypothesis generation, which could be eliminated by a completely automated closed-loop system. fully autonomous drug discovery may well be the future, but in the near term the human component will remain essential in the drug discovery and development process. in the current humans-in-the-loop approach to ai in drug design, ai algorithms augment human intelligence by independently extracting and learning from patterns in vast volumes of complex big data. ai technologies like natural language processing (nlp) are helping to extract insights from unstructured data sources like scientific literature, clinical trials, electronic health records (ehrs), and social media posts that have thus far remained largely underutilized. most importantly though, ai in drug discovery has grown far beyond hype and hypothesis. as we mentioned in our 'ai in drug development - from hype to reality' blog, today the ai-driven drug discovery space is rife with activity as big pharma, big tech, and big-vc-funded scrappy startups jostle for position in the next big innovation cycle in drug discovery and development. ai-driven innovation is already delivering measurable value across the biopharma research value chain. and companies continue to scale ai across their r&d systems, bringing the industry closer to a potential future of fully autonomous drug discovery.
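as a concrete illustration of the multiparameter balancing act in lead optimization, here is a small sketch that combines a drug-likeness estimate (qed), lipophilicity (clogp) and molecular weight into one desirability score and ranks candidates. the weights, thresholds and toy smiles are assumptions chosen for demonstration, not a validated lo protocol or any platform's scoring scheme.

```python
# illustrative multiparameter scoring for lead optimization with rdkit.
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

candidates = {
    "cand-1": "CC(C)Cc1ccc(cc1)C(C)C(=O)O",   # toy structures only
    "cand-2": "CCCCCCCCCCCCCCCC(=O)O",
    "cand-3": "CC(=O)Nc1ccc(O)cc1",
}

def desirability(mol, w_qed=0.6, w_logp=0.2, w_mw=0.2):
    """higher is better: reward drug-likeness, penalize high logp and high weight."""
    qed_score = QED.qed(mol)
    logp_pen = max(0.0, Descriptors.MolLogP(mol) - 3.0) / 5.0   # soft penalty above clogp 3
    mw_pen = max(0.0, Descriptors.MolWt(mol) - 500.0) / 500.0   # soft penalty above 500 da
    return w_qed * qed_score - w_logp * logp_pen - w_mw * mw_pen

ranked = sorted(
    ((name, desirability(Chem.MolFromSmiles(smi))) for name, smi in candidates.items()),
    key=lambda kv: kv[1], reverse=True,
)
for name, score in ranked:
    print(f"{name}: desirability = {score:.3f}")
```

in practice, generative models optimize against many more objectives (predicted potency, admet endpoints, synthesizability), but the ranking logic is the same: every candidate is scored against all criteria at once rather than one property at a time.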