MindWalk
Immunogenicity is a major cause of biologics failure, often identified too late in development. This blog explains how in silico screening helps detect anti-drug antibody (ADA) risks early, before costly setbacks. Learn how tools like LENSai™ enable faster, more informed decision-making by supporting early candidate evaluation, risk mitigation, and regulatory alignment.

The impact of immunogenicity in early biologics discovery

Immunogenicity remains one of the most important and often underappreciated factors in biologics development. For researchers and drug development teams working with monoclonal antibodies or therapeutic proteins, the risk of an unwanted immune response can derail even the most promising candidates.

The presence of anti-drug antibodies (ADAs) doesn’t always show up immediately. In many cases, the problem becomes evident only after significant investment of time and resources, often in later-stage trials. ADAs can reduce a drug’s effectiveness, alter its pharmacokinetics, or introduce safety risks that make regulatory approval unlikely. Some programs have even been discontinued because of immunogenicity-related findings that might have been identified much earlier.

To avoid these setbacks, teams are increasingly integrating predictive immunogenicity screening earlier in development. In silico tools now make it possible to evaluate ADA risk during the discovery stage, before resources are committed to high-risk candidates. This proactive approach supports smarter design decisions, reduces development delays, and helps safeguard against late-stage failure.

In this blog, we’ll explore how in silico immunogenicity screening offers a proactive way to detect potential ADA risks earlier in the pipeline. We’ll also look at how tools like MindWalk’s LENSai platform are helping to simplify and scale these assessments, making immunogenicity screening a practical part of modern biologics development.
Why early ADA risk assessment is critical

Immune responses to therapeutic proteins can derail even the most carefully designed drug candidates. When the immune system identifies a treatment as foreign, it may trigger the production of anti-drug antibodies (ADAs). These responses can alter how a drug is distributed in the body, reduce its therapeutic effect, or create safety concerns that weren’t apparent during earlier studies. The consequences are often serious: delays, added costs, program redesigns, or even full discontinuation.

This isn’t something to be considered only when a drug is close to clinical testing; it’s a risk that needs to be addressed from the beginning. Regulatory agencies increasingly expect sponsors to demonstrate that immunogenicity has been evaluated in early discovery, not just as a final check before filing. This shift reflects lessons learned from earlier products that failed late because they hadn’t been properly screened.

Early-stage risk assessment allows developers to ask the right questions at the right time. Are there T-cell epitopes likely to trigger immune recognition? Is the candidate similar enough to self-proteins to escape detection? Could minor sequence changes reduce the chances of immunogenicity without compromising function?

Immunogenicity screening provides actionable insights that can guide sequence optimization well before preclinical testing. For example, identifying epitope clustering or T-cell activation hotspots during discovery enables teams to make targeted modifications in regions such as the variable domain. These adjustments can reduce immunogenicity risk without compromising target binding, helping streamline development and avoid costly rework later in the process.

Beyond candidate selection, immunogenicity screening improves resource allocation. If a molecule looks risky, there is no need to invest heavily in downstream testing until it has been optimized.
It’s a smarter, more strategic way to manage timelines and reduce unnecessary costs. The tools now available make this kind of assessment more accessible than ever. In silico screening platforms, powered by AI and machine learning, can run detailed analyses in a matter of hours. These insights help move projects forward without waiting for expensive and time-consuming lab work. In short, assessing immunogenicity is not just about risk avoidance; it’s about building a better, faster path to clinical success.

In silico immunogenicity screening: how it works

In silico immunogenicity screening refers to the use of computational models to evaluate the immune risk profile of a biologic candidate. These methods allow development teams to simulate how the immune system might respond to a therapeutic protein, particularly by predicting T-cell epitopes that could trigger anti-drug antibody (ADA) formation.

The primary focus is often on identifying MHC class II-binding peptides. These are the sequences most likely to be presented by antigen-presenting cells and recognized by helper T cells. If the immune system interprets these peptides as foreign, it can initiate a response that leads to ADA generation.

Unlike traditional in vitro methods, which may require weeks of experimental setup, in silico tools deliver results quickly and at scale. Developers can screen entire libraries of protein variants, comparing their immunogenicity profiles before any physical synthesis is done. This flexibility makes in silico screening particularly valuable in the discovery and preclinical stages, where multiple versions of a candidate might still be on the table.

The strength of this approach lies in its ability to deliver both breadth and depth. Algorithms trained on curated immunology datasets can evaluate binding affinity across a wide panel of human leukocyte antigen (HLA) alleles. They can also flag peptide clusters, overlapping epitopes, and areas where modifications may reduce risk.
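To make the sliding-window idea behind this kind of screen concrete, here is a minimal generic sketch. It assumes a hypothetical `predict_affinity` model (any function returning a binding score for a peptide and an HLA allele); it is an illustration, not LENSai code.

```python
# Illustrative sketch of MHC class II epitope scanning: enumerate
# overlapping ~15-mer peptides (the length typically presented by
# MHC class II) and flag those a model scores above a threshold
# for any allele in a panel. `predict_affinity` is a hypothetical
# stand-in for a real trained binding-prediction model.

def candidate_peptides(sequence: str, window: int = 15) -> list[str]:
    """Enumerate overlapping peptides of the given window length."""
    return [sequence[i:i + window] for i in range(len(sequence) - window + 1)]

def screen(sequence: str, predict_affinity, alleles: list[str],
           threshold: float = 0.5) -> list[tuple[int, str, str]]:
    """Return (position, peptide, allele) triples whose predicted
    binding score meets or exceeds the threshold."""
    hits = []
    for i, pep in enumerate(candidate_peptides(sequence)):
        for allele in alleles:
            if predict_affinity(pep, allele) >= threshold:
                hits.append((i, pep, allele))
    return hits
```

A real screen would aggregate these per-allele hits across a broad HLA panel and weight them by allele frequency; the point here is only the breadth of the scan: every overlapping window, every allele, before anything is synthesized.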
The result is a clearer picture of how a candidate will interact with immune pathways long before preclinical and clinical studies are initiated. For teams juggling tight timelines and complex portfolios, these insights help drive smarter decision-making. High-risk sequences can be deprioritized or redesigned, while low-risk candidates can be advanced with greater confidence.

How LENSai supports predictive immunogenicity analysis

One platform leading the charge in this space is LENSai. Designed for early-stage R&D, it offers high-throughput analysis with a user-friendly interface, allowing computational biologists, immunologists, and drug developers to assess risks rapidly. Here’s how LENSai supports smarter decision-making:

Multi-faceted risk scoring: Rather than relying on a single predictor, LENSai integrates several immunogenicity markers into one unified score. This includes predicted MHC class II binding affinity across diverse HLA alleles, epitope clustering patterns, and peptide uniqueness compared to self-proteins based on proprietary HYFT technology. By combining these distinct factors, the platform provides insight into potential immune activation risk, supporting better-informed candidate selection.

Reliable risk prediction: The LENSai composite score reliably classifies candidates by ADA risk, using two thresholds to define low risk: <10% and <30% ADA incidence. This distinction enables more confident go/no-go decisions in early development stages. By combining multiple features into a single score, the platform supports reproducible, interpretable risk assessment that is grounded in immunological relevance.

Early-stage design support: LENSai is accessible from the earliest stages of drug design, without requiring lab inputs or complex configurations, and is designed for high-throughput screening of whole libraries of sequences in a few hours.
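The two-threshold idea described above amounts to a simple triage rule. The sketch below is purely illustrative (the category labels are hypothetical, not LENSai outputs), but it shows how the <10% and <30% ADA-incidence cut-offs translate into go/no-go buckets.

```python
# Illustrative only: go/no-go triage using the two ADA-incidence
# thresholds described above. Labels are hypothetical, not LENSai output.

def triage(predicted_ada_incidence: float) -> str:
    """Bucket a candidate by predicted ADA incidence (fraction, 0-1)."""
    if predicted_ada_incidence < 0.10:
        return "low risk (strict)"       # below the 10% cut-off
    if predicted_ada_incidence < 0.30:
        return "low risk (early-stage)"  # below the 30% cut-off
    return "high risk: redesign or deprioritize"
```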
Researchers can quickly assess sequence variants, compare immunogenicity profiles, and prioritize low-risk candidates before investing in downstream studies. This flexibility supports more efficient resource use and helps reduce the likelihood of late-stage surprises. In a field where speed and accuracy both matter, this kind of screening helps bridge the gap between concept and clinic. It gives researchers the chance to make informed adjustments, rather than discovering late-stage liabilities when there is little room left to maneuver.

Case study: validating ADA risk prediction with LENSai

In our recent case study, we applied LENSai’s immunogenicity composite score to 217 therapeutic antibodies to evaluate predictive accuracy. For predicting ADA incidence >10%, the model achieves an AUC of 0.79, indicating strong discriminative capability (an AUC of 0.8 or above is considered excellent). For predicting ADA incidence >30%, which is considered more suitable for early-stage risk assessment than the 10% cut-off, the AUC rises to 0.92, confirming LENSai’s value for ADA risk classification. Read the full case study or contact us to discuss how this applies to your pipeline.

Regulatory perspectives: immunogenicity is now a front-end issue

It wasn’t long ago that immunogenicity testing was seen as something to be done late in development. But regulators have since made it clear that immunogenicity risk must be considered much earlier. Agencies like the FDA and EMA now expect developers to proactively assess and mitigate immune responses well before clinical trials begin.

This shift came after a series of high-profile biologic failures where ADA responses were only discovered after significant time and money had already been spent. In some cases, the immune response not only reduced drug efficacy but also introduced safety concerns that delayed approval or halted development entirely. Today, guidance documents explicitly encourage preclinical immunogenicity assessment.
Sponsors are expected to show that they have evaluated candidate sequences, made risk-informed design choices, and taken steps to reduce immunogenic potential. In silico screening, particularly when combined with in vitro and in vivo data, provides a valuable layer of evidence in this process.

Early screening also supports a culture of quality by design. It enables teams to treat immunogenicity not as a regulatory hurdle, but as a standard consideration during candidate selection and development.

The regulatory landscape is shifting to support in silico innovation. In April 2025, the FDA took a major step by starting to phase out some animal testing requirements for antibody and drug development. Instead, developers are encouraged to use new approach methodologies (NAMs), such as AI models, to improve safety assessments and speed up time to clinic.

The role of in silico methods in modern biologics development

With the increasing complexity of therapeutic proteins and the diversity of patient populations, traditional testing methods are no longer enough. Drug development teams need scalable, predictive tools that can keep up with the speed of discovery and the demand for precision. In silico immunogenicity screening is one of those tools. It has moved from being a theoretical exercise to a standard best practice in many organizations. By reducing dependence on reactive testing and allowing early optimization, these methods help companies move forward with greater efficiency and lower risk.

When development teams have access to robust computational tools from the outset, the entire process tends to run more efficiently. These tools enable design flexibility, support earlier decision-making, and allow researchers to explore multiple design paths while maintaining alignment with regulatory expectations.
For companies managing multiple candidates across different therapeutic areas, this kind of foresight can translate to faster development, fewer setbacks, and, ultimately, better outcomes for patients.

Final thoughts: from screening to smarter development

The promise of in silico immunogenicity screening lies in moving risk assessment to the earliest stages of development, where it can have the greatest impact. By identifying high-risk sequences before synthesis, it helps researchers reduce late-stage failures, shorten timelines, lower overall project costs, and improve the likelihood of clinical success.

In silico tools such as LENSai support the early prediction of ADA risk by flagging potential immunogenic regions and highlighting risk patterns across diverse protein candidates, enabling earlier, more informed design decisions.

See how early ADA screening could strengthen your next candidate. Learn more.
Epitope mapping is a fundamental process to identify and characterize the binding sites of antibodies on their target antigens [2]. Understanding these interactions is pivotal in developing diagnostics, vaccines, and therapeutic antibodies [3–5]. Antibody-based therapeutics, which have taken the world by storm over the past decade, all rely on epitope mapping for their discovery, development, and protection. This includes drugs like Humira, which reigned as the world’s best-selling drug for six years straight [6], and rituximab, the first monoclonal antibody therapy approved by the FDA for the treatment of cancer [7].

Aside from its important role in basic research and drug discovery and development, epitope mapping is an important aspect of patent filings; it provides binding-site data for therapeutic antibodies and vaccines that can help companies strengthen IP claims and compliance [8]. A key example is the Amgen vs. Sanofi case, which highlighted the importance of supporting broad claims like ‘antibodies binding epitope X’ with epitope residue identification at single-amino-acid resolution, along with sufficient examples of epitope binding [8].

While traditional epitope mapping approaches have been instrumental in characterizing key antigen-antibody interactions, scientists frequently struggle with time-consuming, costly processes that are limited in scalability and throughput and can cause frustration in even the most seasoned researchers [9].

The challenge of wet lab-based epitope mapping approaches

Traditional experimental approaches to epitope mapping include X-ray crystallography and hydrogen-deuterium exchange mass spectrometry (HDX-MS). While these processes have been invaluable in characterizing important antibodies, their broader application is limited, particularly in high-throughput antibody discovery and development pipelines. X-ray crystallography has long been considered the gold standard of epitope mapping due to its ability to provide atomic-level resolution [10].
However, this labor-intensive process requires a full lab of equipment, several scientists with specialized skill sets, months of time, and vast amounts of material just to crystallize a single antibody-antigen complex. Structural biology researchers will understand the frustration when, after all this, the crystallization is unsuccessful (yet again), for no other reason than that not all antibody-antigen complexes form crystals [11]. Additionally, even if the crystallization process is successful, this technique doesn’t always reliably capture dynamic interactions, limiting its applicability to certain epitopes [12]. The static snapshots provided by X-ray crystallography mean that it can’t resolve allosteric binding effects, transient interactions, or large, dynamic complexes, and other technical challenges mean that resolving membrane proteins, heterogeneous samples, and glycosylated antigens can also be difficult.

HDX-MS, on the other hand, can be a powerful technique for screening epitope regions involved in binding, with one study demonstrating an accelerated workflow with a success rate of >80% [13]. Yet it requires highly complex data analysis and specialized expertise and equipment, making it resource-intensive, time-consuming (lasting several weeks), and less accessible for routine use, often leading to further frustration among researchers.

As the demand for therapeutic antibodies, vaccines, and diagnostic tools grows, researchers urgently need efficient, reliable, and scalable approaches to accelerate the drug discovery process. In silico epitope mapping is a promising alternative that allows researchers to accurately predict antibody-antigen interactions by integrating multiple computational techniques [14].

Advantages of in silico epitope mapping

In silico epitope mapping has several key advantages over traditional approaches, making it a beneficial tool for researchers, particularly at the early stage of antibody development.
Speed – Computational epitope mapping methods can rapidly analyze antigen-antibody interactions, reducing prediction time from months to days [11]. This not only accelerates project timelines but also helps reduce the time and resources spent on unsuccessful experiments.

Accuracy – By applying advanced algorithms, in silico methods are designed to provide precise and accurate predictions [11]. Continuous improvements in 3D modeling of protein complexes that can be used to support mapping also mean that predictions are becoming more and more accurate, enhancing reliability and success rates [9].

Versatility – In silico approaches are highly flexible and can be applied to a broad range of targets that may otherwise be challenging to characterize, ranging from soluble proteins and multimers to transmembrane proteins. Certain in silico approaches can also overcome the limitations of X-ray crystallography, as they can reliably study dynamic and transient interactions [12].

Cost-effectiveness – By reducing the need for expensive reagents, specialized equipment, and labor-intensive experiments, and by cutting timelines down significantly, computational approaches lower the cost of epitope mapping considerably [11,15]. This makes epitope mapping accessible to more researchers and organizations with limited resources.

Scalability – In silico platforms can handle huge datasets and screen large numbers of candidates simultaneously, unlike traditional wet-lab methods that are limited by throughput constraints, enabling multi-target epitope mapping [9]. This is especially advantageous in high-throughput settings, such as immune profiling and drug discovery, and relieves researchers of the burden of processing large volumes of samples daily.

AI-powered in silico epitope mapping in action

Meet LENSai: your cloud-based epitope mapping lab

Imagine a single platform hosting analytical solutions for end-to-end target-discovery-to-leads analysis, including epitope mapping in hours.
Now, this is all possible. Meet LENSai, an integrated intelligence platform hosting innovative analytical solutions for complete target-discovery-to-leads analysis and advanced data harmonization and integration. LENSai Epitope Mapping is one of the platform’s applications; it enables researchers to identify the amino acids on the target that are part of the epitope [11]. By simply inputting the amino acid sequences of antibodies and targets, the machine learning (ML) algorithm, combined with molecular modeling techniques, enables the tool to make a prediction. The outputs are: a sequence-based visualization containing a confidence score for each amino acid of the target, indicating whether that amino acid may be part of the epitope, and a 3D visualization with an indication of the predicted epitope region.

LENSai: comparable to X-ray crystallography, in a fraction of the time and cost

To evaluate the accuracy of LENSai Epitope Mapping, its predictions were compared to the data from a well-known study by Dang et al. In this study, six different well-known wet-lab techniques for epitope mapping were compared, using X-ray crystallography as the gold standard [11]. By comparing LENSai to the epitope structures obtained by X-ray crystallography in this study, it was determined that LENSai closely matches X-ray crystallography.

The area under the curve (AUC) from the receiver operating characteristic (ROC) curve was used as a key performance metric to compare the two techniques. The ROC curve plots the true positive rate against the false positive rate, providing a robust measure of the prediction’s ability to distinguish between epitope and non-epitope residues. The results demonstrated that LENSai achieves consistently high AUC values of approximately 0.8 and above, closely matching the precision of X-ray crystallography (Figure 1).
An AUC of 1 would represent a perfect prediction, an AUC of 0.8 or above is considered excellent, and an AUC of 0.5 is no better than random. Although the precision of LENSai is comparable to that of X-ray crystallography, the time and cost burdens are not; LENSai achieves this precision in a fraction of the time and with far fewer resources than those required for successful X-ray crystallography.

Figure 1. Benchmark comparison with X-ray crystallography and six other methods (peptide array, alanine scan, domain exchange, hydrogen-deuterium exchange, chemical cross-linking, and hydroxyl radical footprinting) for epitope identification in five antibody-antigen combinations.

The accuracy of LENSai was further compared against the epitope mapping data from the other widely used wet-lab approaches reported by Dang et al. In this study, peptide array, alanine scan, domain exchange, HDX, chemical cross-linking, and hydroxyl radical footprinting techniques were assessed. To compare LENSai with Dang’s data, the epitopes identified by X-ray crystallography (obtained from the same study) were used as the ground truth. Alongside showing near X-ray precision, LENSai outperformed all wet-lab methods, accurately identifying the true epitope residues (high recall combined with high precision and a low false positive rate).

In addition to the high precision and accuracy shown here, LENSai enables users to detect the amino acids in the target that are part of the epitope solely through in silico analysis. LENSai is, therefore, designed to allow users to gain reliable and precise results, usually within hours to a maximum of one day, with the aim of enabling fast epitope mapping and significantly reducing the burden of technically challenging experimental approaches. This means there is no need to produce physical material through lengthy and unpredictable processes, thereby saving time and money and helping to improve the success rate.
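To make the benchmark metric concrete: per-residue ROC AUC can be computed from predicted confidence scores and ground-truth epitope labels. A minimal, generic sketch in pure Python (using the Mann-Whitney formulation of AUC; this is an illustration of the metric, not LENSai code):

```python
# ROC AUC via the Mann-Whitney formulation: the probability that a
# randomly chosen epitope residue receives a higher score than a
# randomly chosen non-epitope residue, counting ties as 0.5.

def roc_auc(labels: list[int], scores: list[float]) -> float:
    """labels: 1 for epitope residues, 0 otherwise; scores: predicted
    per-residue confidence values (higher = more likely epitope)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both epitope and non-epitope residues")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking yields 1.0 and a random one about 0.5, matching the interpretation of the 0.8-and-above values reported above.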
LENSai also works for various target types, including typically challenging targets such as transmembrane proteins and multimers.

LENSai performs on unseen complexes with high accuracy

A new benchmark validation demonstrates that LENSai Epitope Mapping maintains high accuracy even when applied to entirely new antibody-antigen complexes it has never seen before. In this study, the platform accurately predicted binding sites across 17 unseen pairs without prior exposure to the antibodies, antigens, or complexes. The ability to generalize beyond training data shows the robustness of the LENSai predictive model. These findings not only support broader applicability but also help reduce lab burden and timelines. You can explore both the new “unseen” case study and the original benchmark on a “seen” target for a side-by-side comparison.

New case study: LENSai Epitope Mapping on an “unseen” target [link]
Previous case study: head-to-head benchmark on a “seen” target [link]

Conclusion

As many of us researchers know all too well, traditional wet-lab epitope mapping techniques tend to be slow, costly, and often unsuccessful, limiting their applicability and scalability in antibody discovery workflows. However, it doesn’t have to be this way: in silico antibody discovery approaches like LENSai offer a faster, cost-effective, and highly scalable alternative. This supports researchers in integrating epitope mapping earlier in the development cycle to gain fine-grained insights, make more informed decisions, and optimize candidates more efficiently.

Are you ready to accelerate your timelines and improve success rates in antibody discovery? Get in touch today to learn more about how LENSai can streamline your antibody research.

References

1. Labmate International. Market report: therapeutic monoclonal antibodies in Europe. Labmate Online. Accessed March 18, 2025.
https://www.labmate-online.com/news/news-and-views/5/frost-sullivan/market-report-therapeutic-monoclonal-antibodies-in-europe/22346

2. Mole SE. Epitope mapping. Mol Biotechnol. 1994;1(3):277-287. doi:10.1007/BF02921695

3. Ahmad TA, Eweida AE, Sheweita SA. B-cell epitope mapping for the design of vaccines and effective diagnostics. Trials Vaccinol. 2016;5:71-83. doi:10.1016/j.trivac.2016.04.003

4. Agnihotri P, Mishra AK, Agarwal P, et al. Epitope mapping of therapeutic antibodies targeting human LAG3. J Immunol. 2022;209(8):1586-1594. doi:10.4049/jimmunol.2200309

5. Gershoni JM, Roitburd-Berman A, Siman-Tov DD, Tarnovitski Freund N, Weiss Y. Epitope mapping: the first step in developing epitope-based vaccines. BioDrugs. 2007;21(3):145-156. doi:10.2165/00063030-200721030-00002

6. MRC Laboratory of Molecular Biology. From bench to blockbuster: the story of Humira® – best-selling drug in the world. Accessed March 18, 2025. https://www2.mrc-lmb.cam.ac.uk/news-and-events/lmb-exhibitions/from-bench-to-blockbuster-the-story-of-humira-best-selling-drug-in-the-world/

7. Milestones in cancer research and discovery. National Cancer Institute. January 21, 2015. Accessed March 18, 2025. https://www.cancer.gov/research/progress/250-years-milestones

8. Deng X, Storz U, Doranz BJ. Enhancing antibody patent protection using epitope mapping information. MAbs. 2018;10(2):204-209. doi:10.1080/19420862.2017.1402998

9. Grewal S, Hegde N, Yanow SK. Integrating machine learning to advance epitope mapping. Front Immunol. 2024;15:1463931. doi:10.3389/fimmu.2024.1463931

10. Toride King M, Brooks CL. Epitope mapping of antibody-antigen interactions with X-ray crystallography. In: Rockberg J, Nilvebrant J, eds. Epitope Mapping Protocols. Vol 1785. Methods in Molecular Biology. Springer New York; 2018:13-27. doi:10.1007/978-1-4939-7841-0_2

11. Dang X, Guelen L, Lutje Hulsik D, et al.
Epitope mapping of monoclonal antibodies: a comprehensive comparison of different technologies. MAbs. 2023;15(1):2285285. doi:10.1080/19420862.2023.2285285

12. Srivastava A, Nagai T, Srivastava A, Miyashita O, Tama F. Role of computational methods in going beyond X-ray crystallography to explore protein structure and dynamics. Int J Mol Sci. 2018;19(11):3401. doi:10.3390/ijms19113401

13. Zhu S, Liuni P, Chen T, Houy C, Wilson DJ, James DA. Epitope screening using hydrogen/deuterium exchange mass spectrometry (HDX-MS): an accelerated workflow for evaluation of lead monoclonal antibodies. Biotechnol J. 2022;17(2):2100358. doi:10.1002/biot.202100358

14. Potocnakova L, Bhide M, Pulzova LB. An introduction to B-cell epitope mapping and in silico epitope prediction. J Immunol Res. 2016;2016:1-11. doi:10.1155/2016/6760830

15. Parvizpour S, Pourseif MM, Razmara J, Rafi MA, Omidi Y. Epitope-based vaccine design: a comprehensive overview of bioinformatics approaches. Drug Discov Today. 2020;25(6):1034-1042. doi:10.1016/j.drudis.2020.03.006
PMWC 2025 brought together a diverse mix of experts (data scientists, platform companies, researchers tackling rare diseases, investors, and non-profit organizations), all focused on advancing precision medicine. Arnout Van Hyfte, Head of Products & Platform at MindWalk, and Dr. Shuji Sato, VP of Innovative Solutions at IPA, represented our team at PMWC 2025, diving into engaging discussions with researchers, industry leaders, and innovators. Arnout took the stage at the AI & Data Sciences Showcase, sharing practical insights on how blending AI with in vivo, in vitro, and in silico workflows is reshaping drug discovery, making it more efficient and data-driven.

What everyone was talking about

One of the hottest topics at PMWC 2025 was the importance of accurate and rapid diagnostic assays, where antibodies could deliver the required specificity and sensitivity. There’s a growing need for high-quality antibodies to detect disease biomarkers, generating richer datasets that provide deeper insight into disease progression. But as the complexity of data increases, managing and integrating it efficiently becomes just as critical as generating it.

Arnout Van Hyfte from MindWalk, presenting “Accelerating drug discovery: integrating in vivo, in vitro, and in silico workflows”

The shift to single-cell techniques

We’re seeing a clear shift in how researchers are characterizing patients. DNA and RNA sequencing have become standard tools, and the next big step is single-cell analysis. By examining patients at the cellular level, researchers can better stratify diseases and develop more precise treatments. But working with this level of detail comes with challenges: more data means more complexity. This is where smarter data integration becomes crucial. Making sense of diverse datasets and identifying meaningful connections can lead to faster, more effective decision-making in drug development.
At MindWalk and IPA, we’re helping researchers turn raw data into actionable insights by linking diverse biological data layers seamlessly.

Making sense of complex data and targets

As drug discovery advances, researchers are dealing with increasingly complex human targets that don’t have straightforward animal model counterparts. This is where making sense of vast amounts of biological data becomes even more crucial. BioStrand’s HYFT™ technology plays a key role here, linking sequence data to structural and functional information to map complex relationships across life science data layers. By integrating HYFT with AI models, researchers can explore deeper biological insights that support target identification and validation.

In silico techniques enable the construction of surrogate models that represent intricate disease pathways, aiding preclinical development while optimizing time and resources. Combined with HYFT-driven insights, this approach helps refine drug discovery strategies.

Precision is also essential in antibody discovery. The demand for highly specific and sensitive antibodies continues to rise, not just for diagnostics but also for reagents that keep pace with technological advancements in screening and disease characterization. Engineering these antibodies to work effectively in a single iteration helps ensure they keep up with the latest screening technologies and research needs.

Arnout Van Hyfte, Head of Products & Platform at MindWalk, and Dr. Shuji Sato, VP of Innovative Solutions at IPA

A future built on collaboration

PMWC 2025 wasn’t just about the science; it highlighted the shift toward end-to-end models in the industry. Platform companies are seeking collaboration, researchers need more integrated solutions, and the focus is increasingly on seamless, end-to-end approaches. At MindWalk and IPA, we’re bridging the gaps in drug discovery by combining AI, in silico modeling, and deep biological expertise.
The key takeaway from this year’s conference? Precision medicine isn’t just about data; it’s about making that data work smarter for better, faster discoveries. Let’s talk about how we can support your research. Reach out and let’s explore new possibilities together.
At IPA 2024 TechDay, some of the brightest minds in antibody development came together to explore the breakthroughs that are redefining the field. Together with IPA, we showcased how our expertise and the innovative LENSai platform are tackling some of the toughest challenges in drug discovery. Here’s a look back at the event, the insights shared, and the technology driving the future of antibody development.

What is LENSai?

Dr. Dirk Van Hyfte, co-founder of BioStrand, introduced the LENSai platform by explaining how it’s built on first principles. This isn’t just another incremental improvement; it’s a rethink of how we approach antibody discovery. The platform breaks down traditional assumptions, combining advanced AI with proprietary HYFT patterns. The result? A system designed to make therapeutic antibody development faster, safer, and more precise.

Tackling the biggest challenges in antibody discovery

Fragmented data: Antibody development often involves piecing together data from multiple sources: clinical notes, patents, omics data, and more. LENSai simplifies this by bringing it all together in one framework.

AI transparency: Many AI tools are “black boxes,” leaving users unsure how decisions are made. LENSai puts results into clear context, allowing researchers to trace outcomes back to their inputs.

Speed and scalability: Processing millions of sequences can take weeks. LENSai does it in minutes, offering real-time insights that keep projects moving forward.

Fig. 1. Core challenges in drug discovery

How LENSai is transforming the antibody development process

Identifying targets: LENSai combines data from clinical reports, unstructured texts, and experimental findings to help researchers zero in on the right disease targets. Tools like AlphaFold enhance this with 3D structure predictions.
expanding hits: when you have a handful of promising antibody candidates, lensai takes it further—finding additional functional variants that might have otherwise been missed. this reduces timelines dramatically, often by as much as a factor of three. mapping epitopes and screening for immunogenicity: by clustering antibodies based on where they bind and screening for immunogenic hotspots, lensai provides clarity early in the process. this ensures candidates are not only effective but also safe for clinical trials. fig. 2. lensai powered by patented hyft® technology the secret sauce: integrating in silico and wet lab approaches one of the biggest takeaways from techday was how lensai complements traditional wet lab workflows. ipa has a wealth of expertise in the use of rabbits in antibody development. rabbits might not be the first animal you think of for antibody research, but they offer some incredible benefits. dr. shuji sato walked us through their unique biology: higher diversity: rabbits have a broader antibody repertoire than rodents, which is essential for producing high-affinity, highly specific antibodies. proven success: rabbit antibodies have already been used to develop therapeutic and diagnostic antibodies, including treatments for macular degeneration and migraines. fig. 3. source: https://www.abcam.co.jp/primary-antibodies/kd-value-a-quantitive-measurement-of-antibody-affinity by combining in silico tools with advanced wet lab techniques, researchers can: quickly identify promising candidates. deepen the analysis with structural, functional, and sequence-level insights. streamline processes like humanization and immunogenicity assessment to save time and reduce costs. this hybrid approach is changing the game for drug discovery. fig. 4. rabbit b cell select program the bigger picture: data-driven decisions in precision medicine during the day’s discussions, one theme came up repeatedly: the importance of better data. as dr. 
van hyfte put it, “if you want better drugs, you need better data integration.” lensai does just that by harmonizing clinical, genomic, and proteomic data. this helps accelerate drug development while aiming to improve precision and minimize side effects, particularly in areas like oncology and personalized medicine. fig. 5. fully-integrated therapeutic end-to-end lead generation workflow what’s next? the momentum around lensai and our integrated approach to antibody development is only growing. over the next few months, we’ll be rolling out new applications and use cases to support researchers and organizations pushing the boundaries of discovery. if you missed techday, don’t worry! we’ve prepared an interactive demo that walks you through the power of lensai. check it out here. watch all the sessions here. conclusion a huge thank you to everyone who joined us at techday and contributed to the discussions. it’s clear that we’re at a turning point in antibody development—and we’re excited to see what the future holds. if you’re interested in learning more or exploring how lensai can help your research, don’t hesitate to reach out.
introduction overview & significance of epitope mapping in targeted drug development therapeutic antibodies are currently the fastest-growing class of biological drugs and have significant potential in the treatment of a broad range of autoimmune conditions and cancers, amongst others. the increasing emphasis on the development of therapeutic antibodies is based on their multiple functions, including neutralization, ability to interfere with signaling pathways, opsonization, activation of the complement pathway, antibody-dependent cell-mediated cytotoxicity, etc., as well as their high antigenic specificity, bioactivity, and safety profile. epitope mapping is important in gaining knowledge about the potential therapeutic window and the engagement of the proposed mechanisms of action. thus, deeper insights into the paratope/epitope interface play a critical role in the development of more potent and effective treatments based on a better understanding of specificity, mechanisms of action, etc. understanding epitope mapping what is epitope mapping? antibodies bind to antigens via their paratopes, which interact with specific binding sites, called epitopes, on the antigen. epitope mapping is used to gain insights into which residues on the target are involved in antibody binding. for certain technologies, insights into the antibody's paratope are concurrently obtained. insights into which residues are part of the paratope-epitope interface are valuable in guiding antibody engineering and fine-tuning, thereby increasing the efficiency of optimizing an antibody's affinity, specificity, and mechanisms of action. why use epitope mapping? epitope mapping plays several critical roles, some of which are detailed below, in the development of vaccines and therapeutic antibodies, and in diagnostic testing. ● understanding the role of epitopes in vaccine design, combined with knowledge of adjuvant mechanisms, can guide the selection of adjuvants that optimize immune responses against target pathogens. 
● understanding epitopes allows for the rational design of antibody cocktails that target different epitopes on the same antigen, potentially improving efficacy, ensuring protection against mutational evolution, and reducing resistance. ● epitope mapping helps determine target epitope similarity, which is critical for ensuring similar binding properties and efficacy in biosimilar development and evaluation. ● detailed epitope information can strengthen patent claims either as a basis to claim a position or to differentiate from prior art and as such enhance patent protection for novel antibody therapeutics and vaccines. ● unique epitopes identified by epitope mapping allow diagnostic tests to be designed to target highly specific regions of an antigen thereby reducing false positives, improving overall test accuracy, and thus increasing the specificity of diagnostics. the importance of accurate and high-throughput epitope mapping in developing therapeutic antibodies epitope specificity is a unique intrinsic characteristic distinguishing each monoclonal antibody. one of the factors determining the success of an antibody discovery campaign is the ability to select large sets of antibodies that show high epitope diversity. alongside high-throughput epitope binning, high-throughput techniques for epitope mapping play an essential role in the optimization of diversity-driven discovery and potentially subsequent triaging of leads. the earlier in the discovery process these types of characterization can be executed at scale, the more informed and efficient further downstream selections can be made. high-throughput epitope mapping can be achieved by certain lab techniques or via in silico predictions. 
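the idea of grouping antibodies by epitope at scale can be made concrete with a small sketch. the snippet below bins a hypothetical antibody panel from pairwise competition data using union-find: antibodies that compete for the same binding site end up in the same bin. the antibody names and competition pairs are invented for illustration; real binning works from measured competition matrices.

```python
# toy epitope binning: antibodies that compete for the same binding site
# are grouped into one "bin" using union-find over pairwise competition
# data. the antibody names and competition pairs below are hypothetical.

def epitope_bins(antibodies, competing_pairs):
    parent = {ab: ab for ab in antibodies}

    def find(x):  # path-halving find
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in competing_pairs:  # union competing antibodies
        parent[find(a)] = find(b)

    bins = {}
    for ab in antibodies:
        bins.setdefault(find(ab), set()).add(ab)
    return list(bins.values())

panel = ["mAb1", "mAb2", "mAb3", "mAb4", "mAb5"]
competes = [("mAb1", "mAb2"), ("mAb2", "mAb3"), ("mAb4", "mAb5")]
print(epitope_bins(panel, competes))  # two bins: {mAb1..3} and {mAb4, mAb5}
```

note that binning is transitive here: mAb1 and mAb3 never competed directly, yet they share a bin through mAb2, which is exactly how overlapping epitopes chain together in practice.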
in general, lab-based epitope mapping methods still tend to be costly and time-consuming, and there continue to be challenges associated with high-throughput fine specificity determination and detailed epitope mapping, for instance in the case of conformational epitopes on structurally complex proteins. in silico epitope mapping is better suited to high throughput and can handle structurally complex proteins, without the need to produce physical material, saving time and costs. techniques used in epitope mapping traditional methods: there are several traditional techniques used in epitope mapping, each with its strengths and limitations. often, a combination of methods is used for comprehensive epitope mapping. peptide scanning peptide scanning is a widely used technique for epitope mapping. it involves synthesizing a series of overlapping peptides that span the entire sequence of the antigen of interest and testing each peptide for antibody binding. it is a simple and accessible technique that is effective for identifying linear epitopes. however, this approach is not effective for conformational epitopes, does not provide paratope mapping information, and can also be labor- and cost-intensive for large proteins. alanine scanning alanine scanning is a protein engineering method that involves systematically selecting and substituting residues in the antigen with alanine. this systematic approach allows for the methodical examination of each residue's importance with minimal structural disruption. however, this approach can be expensive and time-consuming, is limited to single residue effects, and could produce potential false negatives for crucial residues with context-dependent roles. this technique also does not provide information on the paratope. chemical cross-linking mass-spectrometry (xl-ms) chemical cross-linking is a mass spectrometry (ms)-based technique that can simultaneously determine both protein structures and protein-protein interactions. 
it is applicable to both linear and discontinuous epitopes but requires specialized equipment and expertise in mass spectrometry. recent developments in this area include photo-crosslinking for more precise spatial control, integrating xl-ms with hydrogen-deuterium exchange (hdx-ms) for improved resolution, and the development of ms-cleavable crosslinkers for easier data analysis. x-ray crystallography x-ray crystallography is considered the gold standard in structural epitope mapping, but advancements in in silico methods are driving a shift toward computational approaches given their improving accuracy and high-throughput nature. x-ray crystallography provides a near-atomic resolution model of antibody-antigen interactions for both linear and complex conformational epitopes. it is valued for its accuracy and the ability to provide structural context as well as insights into binding mechanisms. however, it is time-consuming and resource-intensive and may not capture dynamic aspects of binding. a key challenge is that this technique requires a lot of physical material (protein), and not all protein complexes crystallize. nuclear magnetic resonance (nmr) spectroscopy nmr spectroscopy is another epitope mapping technique that provides more detailed information than peptide mapping and at a faster pace than x-ray crystallography, but it is expensive. it enables the examination of proteins in near-physiological conditions and can also identify secondary binding sites. the limitations include reduced efficacy for very large protein complexes and lower resolution compared to x-ray crystallography and cryo-em. cryo-electron microscopy (cryo-em) cryo-electron microscopy (cryo-em) allows scientists to observe biomolecules in a near-native state, achieving atomic-level resolution without the need for crystallization. while cryo-em is excellent for large complexes, it typically struggles to achieve high resolution for small proteins. 
the procedure is also time-consuming and expensive. in silico epitope mapping the convergence of computational in silico methods and artificial intelligence (ai) technologies is revolutionizing epitope mapping with the capability to rapidly analyze vast protein sequences, account for multiple factors such as amino acid properties, structural information, and evolutionary conservation, and pinpoint potential epitopes with remarkable precision. epitope mapping should not be confused with epitope prediction, as they are fundamentally different tasks. epitope prediction only requires information about the antigen (sequence or structure), and the goal is to pinpoint which residue/amino acid at the surface is likely to be part of an epitope and might interact with the paratope of an antibody. epitope prediction is typically target-focused and antibody-unaware. there may be more than one epitope on a given antigen. epitope mapping, on the other hand, requires information about both the antibody and the antigen, and the goal is to predict where a given antibody will specifically bind on the antigen. thus, with epitope mapping, it is possible to resolve the specific antibody-antigen binding spot. for instance, two antibodies can share the same epitope, or they can bind to different epitopes but still compete with each other for target binding, having their respective epitopes very close to each other. lensai in silico epitope mapping lensai's in silico epitope mapping offers an efficient high-throughput approach to identify the epitope on a target for a pool of antibodies. in a recent case study, we compared lensai's method with traditional x-ray crystallography using the crystal complex 6rps. check out our case study here. lensai provides epitope identification in a streamlined high-throughput fashion with unmatched scalability. large quantities of antibody-antigen complexes can be analyzed in parallel and results are delivered within a few hours to one day. 
there is no need for production of physical material. the method is applicable to various target types, including transmembrane proteins. the ability to analyze at scale enables a paradigm shift: hidden insights can be uncovered earlier in the research process, providing actionable insights to support diversity-driven discovery workflows. lensai helps optimize r&d by reducing overall timelines and costs, streamlining decision-making, improving efficiency, and accelerating the journey to clinical success. lensai offers additional workflows that also provide information on the paratope, detailing the interacting residues on the corresponding antibodies. this information provides valuable insights for further in silico engineering if desired. future trends in epitope mapping the field of epitope mapping is evolving rapidly, driven by advances in technology and computational methods. some of the key trends that could transform the future of epitope mapping include improvements in 3d structural modeling of proteins and antibodies. advancements in the prediction of protein-antibody interactions, in particular, will contribute to further advancing in silico epitope mapping. the increasing sophistication of deep learning models (such as alphafold-multimer and alphafold 3) for the prediction of multimers will drive significant performance and accuracy gains. the power of in silico epitope mapping lies in seamless integration with other advanced ai-driven technologies and in silico methods, allowing for parallel multi-parametric analyses and continuous feedback loops, ultimately reshaping and revolutionizing the drug discovery process.
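to make the mapping-versus-prediction distinction discussed above concrete, here is a minimal, purely illustrative sketch: it calls an antigen residue part of a given antibody's epitope when it lies within a distance cutoff of that antibody's paratope, so the answer changes with the antibody. all coordinates, residue ids, and the 5 å cutoff are invented; real epitope mapping works from full structural models of the complex.

```python
import math

# toy epitope *mapping*: given 3d positions for antigen residues and for one
# specific antibody's paratope, call any antigen residue within a distance
# cutoff part of that antibody's epitope. unlike antibody-unaware epitope
# *prediction*, the result depends on the antibody. coordinates are invented.

def map_epitope(antigen_residues, paratope_points, cutoff=5.0):
    """antigen_residues: {residue_id: (x, y, z)}; returns ids in contact."""
    epitope = []
    for res_id, coord in antigen_residues.items():
        if any(math.dist(coord, p) <= cutoff for p in paratope_points):
            epitope.append(res_id)
    return epitope

antigen = {"A45": (0.0, 0.0, 0.0), "A46": (3.0, 0.0, 0.0), "A90": (20.0, 0.0, 0.0)}
paratope_ab1 = [(1.0, 1.0, 0.0)]   # hypothetical antibody 1
paratope_ab2 = [(21.0, 1.0, 0.0)]  # hypothetical antibody 2, distinct epitope
print(map_epitope(antigen, paratope_ab1))  # ['A45', 'A46']
print(map_epitope(antigen, paratope_ab2))  # ['A90']
```

the same antigen yields different epitopes for the two antibodies, which is precisely what antibody-aware mapping resolves and antigen-only prediction cannot.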
understanding immunogenicity at its core, immunogenicity refers to the ability of a substance, typically a drug or vaccine, to provoke an immune response within the body. it's the biological equivalent of setting off alarm bells. the stronger the response, the louder these alarms ring. in the case of vaccines, immunogenicity is required for the vaccine to function properly: it induces an immune response and creates immunological memory. however, in the context of therapeutics, and particularly biotherapeutics, an unwanted immune response can potentially reduce the drug's efficacy or even lead to adverse effects. in pharma, the watchful eyes of agencies such as the fda and ema ensure that only the safest and most effective drugs make their way to patients; they require immunogenicity testing data before approving clinical trials and market access. these bodies necessitate stringent immunogenicity testing, especially for biosimilars, where it's essential to demonstrate that the biosimilar product has no increased immunogenicity risk compared to the reference product (1), (2). the interaction between the body's immune system and biologic drugs, such as monoclonal antibodies, can result in unexpected and adverse outcomes. cases have been reported where anti-drug antibodies (ada) led to lower drug levels and therapeutic failures, such as in the use of anti-tnf therapies, where patient immune responses occasionally reduced drug efficacy (3). beyond monoclonal antibodies, other biologic drugs, like enzyme replacement therapies and fusion proteins, also demonstrate variability in patient responses due to immunogenicity. in some instances, enzyme replacement therapies have been less effective because of immune responses that neutralize the therapeutic enzymes. similarly, fusion proteins used in treatments have shown varied efficacy, potentially linked to the formation of adas. 
the critical nature of immunogenicity testing is underscored by these examples, highlighting its role in ensuring drug safety and efficacy across a broader range of biologic treatments. the challenge is to know beforehand whether an immune response will develop, i.e., the immunogenicity of a compound. a deep dive into immunogenicity assessment of therapeutic antibodies researchers rely on empirical analyses to comprehend the immune system's intricate interactions with external agents. immunogenicity testing is the lens that magnifies this interaction, revealing the nuances that can determine a drug's success or failure. empirical analyses in immunogenicity assessments are informative but come with notable limitations. these analyses are often time-consuming, posing challenges to rapid drug development. early-phase clinical testing usually involves small sample sizes, which restricts the broad applicability of the results. pre-clinical tests, typically performed on animals, have limited relevance to human responses, primarily due to small sample sizes and interspecies differences. additionally, in vitro tests using human materials do not fully encompass the diversity and complexity of the human immune system. moreover, they often require substantial time, resources, and materials. these issues highlight the need for more sophisticated methodologies that integrate human genetic variation for better prediction of drug candidates' efficacy. furthermore, the ability to evaluate the outputs from phage libraries during the discovery stage and optimization strategies like humanization, developability engineering, and affinity maturation can add significant value. being able to analyze the impact of these strategies on immunogenicity with novel tools may enhance the precision of these high-throughput methods. the emergence of in silico in immunogenicity screening with the dawn of the digital age, computational methods have become integral to immunogenicity testing. 
in silico testing, grounded in computer simulations, introduces an innovative and less resource-intensive approach. however, it's important to understand that despite their advancements, in silico methods are not entirely predictive. there remains a grey area of uncertainty that can only be fully understood through experimental and clinical testing with actual patients. this underscores the importance of a multifaceted approach that combines computational predictions with empirical experimental and clinical data to comprehensively assess a drug's immunogenicity. predictive role immunogenicity testing is integral to drug development, serving both retrospective and predictive purposes. in silico analyses utilizing artificial intelligence and computational models to forecast a drug's behavior within the body can be used in both early and late stages of drug development. these predictions can also guide subsequent in vitro analyses, where the drug's cellular interactions are studied in a controlled laboratory environment. as a final step, immunogenicity monitoring in patients has traditionally been crucial for regulatory approval. the future of drug development envisions an expanded role for in silico testing in combination with experimental and clinical data to enhance the accuracy of predictive immunogenicity. this approach aims to refine predictions about a drug's safety and effectiveness before clinical trials, potentially streamlining the drug approval process. by understanding how a drug interacts with the immune system, researchers can anticipate possible reactions, optimize treatment strategies, and monitor patients throughout the process. understanding a drug's potential immunogenicity can inform dosing strategies, patient monitoring, and risk management. for instance, dose adjustments or alternative therapies might be considered if a particular population is likely to develop adas against a drug early on. traditional vs. 
in silico methods: a comparative analysis traditional in vitro methods, despite being time-intensive, offer direct insights from real-world biological interactions. however, it's important to recognize the limitations in the reliability of these methods, especially concerning in vitro wet lab tests used to determine a molecule's immunogenicity in humans. these tests often fall into a grey area in terms of their predictive accuracy for human responses. given this, the potential benefits of in silico analyses become more pronounced. in silico methods can complement traditional approaches by providing additional predictive insights, particularly in the early stages of drug development where empirical data might be limited. this integration of computational analyses can help identify potential immunogenic issues earlier in the drug development process, aiding in the efficient design of subsequent empirical studies. in silico methods, with their rapid processing and efficiency, are ideal for initial screenings, large datasets, and iterative testing. large numbers of hits can already be screened in the discovery stage, and screening can be repeated when lead candidates are chosen and further engineered. the advantage of in silico methodologies lies in their capacity for high-throughput analysis and quick turnaround times. traditional testing methods, while necessary for regulatory approval, present challenges in high-throughput analysis due to their reliance on specialized reagents, materials, and equipment. these requirements not only incur substantial costs but also necessitate significant human expertise and logistical arrangements for sample storage. on the other hand, in silico testing, grounded in digital prowess, sees the majority of its costs stemming from software and hardware acquisition, personnel, and maintenance. by employing in silico techniques, it becomes feasible to rapidly screen and eliminate unsuitable drug candidates early in the discovery and development process. 
this early-stage screening significantly enhances the efficiency of the drug development pipeline by focusing resources and efforts on the most promising candidates. consequently, the real cost-saving potential of in silico analysis emerges from its ability to streamline the candidate selection process, ensuring that only the most viable leads progress to costly traditional testing and clinical trials. advantages of in silico in immunogenicity screening in silico immunogenicity testing is transforming drug development by offering rapid insights and early triaging, which is instrumental in de-risking the pipeline and reducing attrition costs. these methodologies can convert extensive research timelines into days or hours, vastly accelerating the early stages of drug discovery and validation. as in silico testing minimizes the need for extensive testing of a high number of candidates in vitro, its true value lies in its ability to facilitate early-stage decision-making. this early triaging helps identify potential failures before significant investment, thereby lowering the financial risks associated with drug development. in silico immunogenicity screening in decision-making employing an in silico platform enables researchers to thoroughly investigate the molecular structure, function, and potential interactions of proteins at an early stage. this process aids in the early triaging of drug candidates by identifying subtle variations that could affect therapeutic efficacy or safety. additionally, the insights gleaned from in silico analyses can inform our understanding of how these molecular characteristics may relate to clinical outcomes, enriching the knowledge base from which we draw predictions about a drug's performance in the real world. de-risking with informed lead nomination the earliest stages of therapeutic development hinge on selecting the right lead candidates—molecules or compounds that show the potential to survive the full development process. 
making an informed choice at this stage can be the difference between success and failure. in-depth analysis such as immunogenicity analysis aims to validate that selected leads are effective and exhibit a high safety profile. to benefit from the potential and efficiency of in silico methods in drug discovery, it's crucial to choose the right platform to realize these advantages. this is where lensai integrated intelligence technology comes into play. introducing the future of protein analysis and immunogenicity screening: lensai. powered by the revolutionary hyft technology, lensai is not just another tool; it's a game-changer designed for unmatched throughput, lightning-fast speeds, and accuracy. streamline your workflow, achieve better results, and stay ahead in the ever-evolving world of drug discovery. experience the unmatched potency of lensai integrated intelligence technology. learn more: lensai in silico immunogenicity screening understanding immunogenicity and its intricacies is fundamental for any researcher in the field. traditional methods, while not entirely predictive, have been the cornerstone of immunogenicity testing. however, the integration of in silico techniques is enhancing the landscape, offering speed and efficiency that complement existing methods. at mindwalk we foresee the future of immunogenicity testing in a synergistic approach that strategically combines in silico with in vitro methods. in silico immunogenicity prediction can be applied in a high throughput way during the early discovery stages but also later in the development cycle when engineering lead candidates to provide deeper insights and optimize outcomes. for the modern researcher, employing both traditional and in silico methods is the key to unlocking the next frontier in drug discovery and development. looking ahead, in silico is geared towards becoming a cornerstone for future drug development, paving the way for better therapies. 
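as a minimal illustration of how high-throughput sequence-based screening can work, the sketch below slides a 9-residue window along a candidate sequence and flags windows containing risk motifs. the motif list and candidate sequence are hypothetical placeholders; real in silico screening relies on trained t-cell epitope and mhc-binding predictors rather than simple lookups.

```python
# toy sequence-based immunogenicity screen: slide a 9-residue window along a
# candidate sequence and flag windows containing known risk motifs. the motif
# set below is an invented placeholder; production screening uses trained
# mhc-binding / t-cell epitope predictors, not substring matching.

RISK_MOTIFS = {"FLKEKGGL", "WYQQKPGK"}  # hypothetical risk motifs

def scan_9mers(sequence, window=9):
    hits = []
    for i in range(len(sequence) - window + 1):
        peptide = sequence[i:i + window]
        if any(motif in peptide for motif in RISK_MOTIFS):
            hits.append((i, peptide))
    return hits

candidate = "MAFLKEKGGLVTT"  # invented candidate fragment
print(scan_9mers(candidate))  # flags the windows starting at positions 1 and 2
```

flagged positions can then guide targeted substitutions in the variable domain, the kind of sequence optimization discussed above, before any material is produced.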
references:
1. ema guideline on immunogenicity assessment of therapeutic proteins
2. fda guidance for industry: immunogenicity assessment for therapeutic protein products
3. anti-tnf therapy and immunogenicity in inflammatory bowel diseases: a translational approach
generative ai is emerging as a strategic force in drug discovery, opening new possibilities across molecule generation, antibody design, de novo drug and vaccine development, and drug repurposing. as life sciences organizations work to accelerate innovation and reduce development costs, generative models offer a way to design more precise, effective, and personalized therapies. this blog explores how these technologies are being applied across the r&d pipeline, the deep learning techniques powering them, and the key challenges, like data quality, bias, and explainability, that must be addressed to fully realize their impact. generative ai in biopharma following a breakout year of rapid growth, generative ai has been widely, and justifiably, described as an undisputed game-changer for almost every industry. a recent mckinsey global survey lists the healthcare, pharma, and medical products sectors as among the top regular users of generative ai. the report also highlights that organizations that have successfully maximized the value derived from their traditional ai capabilities tend to be more ardent adopters of generative ai tools. the ai revolution in the life sciences industry continues at an accelerated pace, reflected partly in the increasing number of partnerships, mergers, and acquisitions centered around the transformative potential of ai. for the life sciences industry, therefore, generative ai represents the logical next step to transcend conventional predictive ai methods and explore new horizons in computational drug discovery. here then, is a quick overview of generative ai and its potential and challenges vis-a-vis in silico drug discovery and development. what is generative ai? where traditional ai systems make predictions based on large volumes of data, generative ai refers to a class of ai models that are capable of generating entirely new output based on a variety of inputs including text, images, audio, video, 3d models, and more. 
based solely on the input-output modality, generative ai models can be categorized as text to text (chatgpt-4, bard), to speech (vertex ai), to video (emu video), to audio (voicebox), to image (adobe firefly); image to text (pix2struct), to image (sincode ai), to video (leiapix); video to video (runway ai) and much more. currently, the most prominent types of generative ai models include generative adversarial networks (gans), variational autoencoders (vaes), recurrent neural networks (rnns), diffusion models, flow-based models, autoregressive models, transformer-based models, and style transfer models. what is the role of generative ai in drug discovery? it is estimated that generative ai technologies could yield as much as $110 billion a year in economic value for the life sciences industry. these technologies can play a transformative role across the drug discovery pipeline. generative ai can boost the precision, productivity, and efficiency of target identification and help accelerate the drug discovery process. these technologies will provide drug discovery teams with the capabilities to generate or design novel molecules with the desired properties and curate a set of drug candidates with the highest probability of success. this in turn would free up valuable r&d resources to focus on orphan, rare, and untreatable diseases. these technologies will enable life sciences r&d to cope with the explosion in digital data, in diverse formats such as unstructured text, images, patient records, pdfs, and emails, and ingest and process multimodal data at scale. the ability to extract patterns from vast volumes of patient data can empower more personalized treatments and improved patient outcomes. ai systems also played an instrumental role in accelerating the development of an effective mrna vaccine for covid-19, where they were put in place to accelerate the research process. 
generative ai technologies are now being leveraged to address some of the challenges associated with designing rna therapeutics and to design mrna medicines with optimal safety and performance. as with traditional ai systems, generative ai will help complement experimental drug discovery processes to further enhance the speed and accuracy of drug discovery and development while reducing the time and costs involved. how do different generative models compare for molecule design? generative models like vaes (variational autoencoders) and gans (generative adversarial networks) are increasingly applied to de novo drug design. vaes are particularly effective for exploring latent chemical space, offering structured representations that capture chemical relationships. gans, on the other hand, excel at generating structurally novel molecules, often producing higher diversity in candidate structures. combining both models in a generative pipeline helps balance molecular novelty with drug-like properties. model comparison:
● vae: strengths are exploring latent space and capturing structure–property relationships; main weakness is lower novelty; typical use case is scaffold hopping.
● gan: strengths are high novelty and structurally diverse outputs; main weakness is training instability; typical use case is de novo design.
● combined use: strength is the balance between control and diversity; main weakness is added complexity; typical use case is balanced candidate profiles.
why deep learning matters in generative drug discovery behind many of the advances in generative ai lies deep learning. it’s what allows these models to go beyond pattern recognition—to actually learn chemical behavior, understand biological targets, and propose entirely new drug candidates that make sense in context. deep learning models don’t just process data; they learn from it across multiple formats—molecular structures, protein sequences, even scientific text—and help connect the dots. that’s what makes them so powerful in applications like molecule generation, antibody design, and precision medicine. 
by pairing deep learning with other tools—like alphafold2 or biomedical knowledge graphs—researchers can sharpen predictions, improve interpretability, and ultimately design better drug candidates, faster. how is generative ai used for compound screening in drug discovery? pharma and biotech companies are increasingly turning to generative ai for in silico screening of novel compounds. these models are trained on molecular datasets (e.g., smiles strings or 3d conformers) and validated using drug-likeness metrics like qed scores, docking simulations, and admet predictions. to build a generative ai model for molecules, most researchers:

- use a curated smiles-based dataset
- train a vae or gan on molecular representations
- validate outputs using metrics such as qed, synthesizability, and binding affinity predictions

these workflows can be combined with retrieval-augmented generation (rag) pipelines to further refine candidate selection using up-to-date biomedical literature. what are the key generative ai applications in drug discovery? overall, generative ai offers a transformative approach to drug discovery, significantly accelerating the identification and optimization of promising drug candidates while reducing costs and experimental uncertainty. molecule generation generative ai models represent a more efficient approach to navigating the vast chemical space and creating novel molecular structures with desired properties. currently, a range of techniques, such as vaes, gans, rnns, genetic algorithms, and reinforcement learning, are being used to generate molecules with desirable admet properties. one approach synergistically combines generative ai, predictive modeling, and reinforcement learning to generate valid molecules with desired properties.
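the dataset → generative model → validation workflow described above can be sketched end to end with a deliberately tiny stand-in for the generative model — a character-level bigram sampler trained on a handful of smiles strings — plus a toy plausibility filter. a real pipeline would use a vae or gan and proper cheminformatics metrics (parseability, qed, docking scores); everything below is illustrative.

```python
import random
from collections import defaultdict

# step 1: a (toy) curated SMILES dataset
dataset = ["CCO", "CCN", "CCCO", "CCCN", "CCOC"]

# step 2: "train" a character-level bigram model on the molecular strings
# ("^" marks start-of-string, "$" end-of-string)
transitions = defaultdict(list)
for smiles in dataset:
    chars = ["^"] + list(smiles) + ["$"]
    for a, b in zip(chars, chars[1:]):
        transitions[a].append(b)

def sample_molecule(rng, max_len=10):
    """Autoregressively sample a new string from the bigram model."""
    out, ch = [], "^"
    while len(out) < max_len:
        ch = rng.choice(transitions[ch])
        if ch == "$":
            break
        out.append(ch)
    return "".join(out)

# step 3: screen outputs with a (toy) plausibility filter --
# a stand-in for real validation metrics such as QED or binding affinity
def is_plausible(s):
    return 2 <= len(s) <= 8 and set(s) <= set("CNO")

rng = random.Random(0)
candidates = [sample_molecule(rng) for _ in range(20)]
screened = [c for c in candidates if is_plausible(c)]
```

the point of the sketch is the shape of the loop, not the model: generate in bulk from a model fitted to curated data, then keep only candidates that pass validation before committing further resources.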
with their ability to simultaneously optimize multiple properties of a molecule, generative ai systems can help identify candidates with the most balanced profile in terms of efficacy, safety, and other pharmacological parameters. antibody design & development the continuing evolution of artificial intelligence (ai), machine learning (ml), and deep learning (dl) techniques has helped significantly advance computational antibody discovery as a complement to traditional lab-based processes. the advent of protein language models (plms), generative ai models trained on protein sequences, has the potential to unlock further innovations in in silico antibody design and development. generative antibody design can significantly enhance the speed, quality, and efficiency of antibody design, help create more targeted and potent treatment modalities, and generate novel target-specific antibodies beyond the scope of conventional design techniques. recent developments in this field have demonstrated the ability of zero-shot generative ai (models that design antibodies without any target-specific training data) to generate novel antibody designs that were tested and functionally validated in the wet lab without the need for any further optimization. de novo drug design the power of generative ai models is also being harnessed to create entirely new drug candidates by predicting molecular structures that interact favorably with biological targets. the increasing popularity of generative techniques has created a new approach to generative chemistry that has been successfully applied across atom-based, fragment-based, and reaction-based approaches for generating novel structures. generative models have helped extend the capabilities of rule-based de novo molecule generation with recent research highlighting the potential of “rule-free” generative deep learning for de novo molecular design.
the continuing evolution of generative ai towards multimodality will help further advance de novo design using complementary insights derived from diverse data modalities. drug repurposing generative ai can expedite the discovery of new uses for approved drugs, thereby circumventing the development time and costs associated with traditional drug discovery. one study demonstrated the power of generative ai technologies like chatgpt models to accelerate the review of existing scientific knowledge in an extensive internet-based search space to prioritize drug repurposing candidates. new research also demonstrates how generative ai can rapidly model clinical trials to identify new uses for existing drugs and therapeutics. these technologies are already being applied successfully to the critical task of repurposing existing medicines for the treatment of rare diseases. precision drug discovery by analyzing large-scale multimodal datasets, including multiomics data, genome-wide association studies (gwas), disease-specific repositories, biobank-scale studies, patient data, genetic evidence, clinical data, imaging data, etc., generative ai models can help design drug candidates with the highest likelihood of efficacy and minimal side effects for specific patient populations. what are the generative ai challenges in drug discovery? despite their immense potential, there are still several challenges that need to be addressed before generative ai technologies can be successfully integrated into drug discovery workflows. limited and noisy training data: generative models require large, high-quality, diverse datasets for training. in drug discovery, experimental data is often sparse and noisy, with errors and outliers. the availability of large volumes of high-quality data, especially for rare diseases or novel drug targets, remains a challenge.
bias, generalizability, and ethical risks: generative models trained on biased or limited datasets may produce biased or unrealistic outputs. it is therefore crucial to ensure that these models are trained on unbiased, diverse datasets and can generalize across the vast chemical space and biological targets. these technologies raise significant ethical and regulatory considerations, including concerns about patient safety, data privacy, and intellectual property rights. black-box models and lack of explainability: finally, and most importantly, generative models are inherently black boxes, raising further questions about interpretability and explainability. these challenges notwithstanding, generative ai has the potential to usher in the next generation of ai-driven drug discovery. ready to explore how generative ai can support your drug discovery programs? talk to our team or explore more use cases in our platform.
knowledge graphs play a crucial role in the organization, integration, and interpretation of vast volumes of heterogeneous life sciences data. they are key to the effective integration of disparate data sources. they help map the semantic or functional relationships between millions of data points. they enable information from diverse datasets to be mapped to a common ontology to create a unified, comprehensive, and interconnected view of complex biological data that enables a more contextual approach to exploration and interpretation. though ontologies and knowledge graphs are concepts related to the contextual organization and representation of knowledge, their approach and purpose can vary. so here’s a closer look at these concepts, their similarities, individual strengths, and synergies. what is an ontology? an ontology is a “formal, explicit specification of a shared conceptualization” that helps define, capture, and standardize information within a particular knowledge domain. the three critical requirements in this definition can be further codified as follows: ‘shared conceptualization’ emphasizes the importance of a consensual definition (shared) of domain concepts and their interrelationships (conceptualization) among users of a specific knowledge domain. the term ‘explicit’ requires the unambiguous characterization and representation of domain concepts to create a common understanding. and finally, ‘formal’ refers to the capability of the specified conceptualization to be machine-interpretable and support algorithmic reasoning. what is a knowledge graph? a knowledge graph, aka a semantic network, is a graphical representation of the foundational entities in a domain connected by semantic, contextual relationships. a knowledge graph uses formal semantics to interlink descriptions of different concepts, entities, relationships, etc. and enables efficient data processing by both people and machines.
knowledge graphs, therefore, are a type of graph database with an embedded semantic model that unifies all domain data into one knowledge base. semantics, therefore, is an essential capability for any knowledge base to qualify as a knowledge graph. though an ontology is often used to define the formal semantics of a knowledge domain, the terms ‘semantic knowledge graph’ and ‘ontology’ refer to different aspects of organizing and representing knowledge. what’s the difference between an ontology and a semantic knowledge graph? in broad terms, the key difference between a semantic knowledge graph and an ontology is that semantics focuses predominantly on the interpretation and understanding of data relationships within a knowledge graph, whereas an ontology is a formal definition of the vocabulary and structure unique to the knowledge domain. both ontologies and semantics play a distinct and critical role in defining the utility and performance of a knowledge graph. an ontology provides the structured framework, formal definitions, and common vocabulary required to organize domain-specific knowledge in a way that creates a shared understanding. semantics focuses on the meaning, context, interrelationships, and interpretation of different pieces of information in a given domain. ontologies provide a formal representation, using languages like rdf (resource description framework) and owl (web ontology language) to standardize the annotation, organization, and expression of domain-specific knowledge. a semantic data layer is a more flexible approach to extracting implicit meaning and interrelationships between entities, often relying on a combination of semantic technologies and natural language processing (nlp) / large language models (llms) frameworks to contextually integrate and organize structured and unstructured data. semantic layers are often built on top of an ontology to create a more enriched and context-aware representation of knowledge graph entities.
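the "machine-interpretable, supports algorithmic reasoning" requirement described above can be made concrete with a minimal sketch: once is_a relations are formally declared, a program can deduce facts that were never explicitly stated. the tiny hierarchy below is illustrative, not drawn from any real biomedical ontology.

```python
# explicitly asserted is_a relations (child -> parent),
# as a formal ontology would declare them
is_a = {
    "monoclonal antibody": "antibody",
    "antibody": "protein",
    "protein": "biomolecule",
}

def infer_ancestors(term, relations):
    """Deduce all superclasses of a term by following is_a transitively."""
    ancestors = []
    while term in relations:
        term = relations[term]
        ancestors.append(term)
    return ancestors

# "monoclonal antibody is_a biomolecule" was never asserted directly --
# it is inferred from the formal structure
inferred = infer_ancestors("monoclonal antibody", is_a)
```

this transitive closure is the simplest instance of the ontology-driven inference that production reasoners (over rdf/owl representations) perform at scale.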
what are the key functions of ontology in knowledge graphs? ontologies are essential to structuring and enhancing the capabilities of knowledge graphs, thereby enabling several key functions related to the organization and interpretability of domain knowledge. the standardized and formal representation provided by ontologies serves as a universal foundation for integrating, mapping and aligning data from heterogeneous sources into one unified view of knowledge. ontologies provide the structure, rules, and definitions that enable logical reasoning and inference and the deduction of new knowledge based on existing information. by establishing a shared and standardized vocabulary, ontologies enhance semantic interoperability between different knowledge graphs, databases, and systems and create a comprehensive and meaningful understanding of a given domain. they also contribute to the semantic layer of knowledge graphs, enabling a richer and deeper understanding of data relationships that drive advanced analytics and decision-making. ontologies help formalize data validation rules, thereby ensuring consistency and enhancing data quality. ontologies enhance the search and discovery capabilities of knowledge graphs with a structured and semantically rich knowledge representation that enables more flexible and intelligent querying as well as more contextually relevant and accurate results. the importance of ontologies in biomedical knowledge graphs knowledge graphs have emerged as a critical tool in addressing the challenges posed by rapidly expanding and increasingly dispersed volumes of heterogeneous, multimodal, and complex biomedical information. biomedical ontologies are foundational to creating ontology-based biomedical knowledge graphs that are capable of structuring all existing biological knowledge as a panorama of semantic biomedical data. 
for example, scalable precision medicine open knowledge engine (spoke), a biomedical knowledge graph connecting millions of concepts across 41 biomedical databases, uses 11 different ontologies as a framework to semantically organize and connect data. this massive knowledge engine integrates a wide variety of information, such as proteins, pathways, molecular functions, biological processes, etc., and has been used for a range of biomedical applications, including drug repurposing, disease prediction, and interpretation of transcriptomic data. ontology-based knowledge graphs will also be key to the development of precision medicine given their capability to standardize and harmonize data resources across different organizational scales, including multi-omics data, molecular functions, intra- and inter-cellular pathways, phenotypes, therapeutics, environmental effects, etc., into one holistic network. the use of ontologies for semantic enrichment of biomedical knowledge graphs will also help accelerate the fairification of biomedical data and enable researchers to use ontology-based queries to answer more complex questions with greater accuracy and precision. however, there are still several challenges to the more widespread use of ontologies in biomedical research. biomedical ontologies will play an increasingly strategic role in the representation and standardization of biomedical knowledge. however, given their rapid proliferation, the emphasis going forward will have to be on the development of biomedical ontologies that adhere to mathematically precise shared standards and good practice design principles to ensure that they are more interoperable, exchangeable, and examinable.
there is a compelling case underlying the tremendous interest in generative ai and llms as the next big technological inflection point in computational drug discovery and development. for starters, llms help expand the data universe of in silico drug discovery, especially in terms of opening up access to huge volumes of valuable information locked away in unstructured textual data sources including scientific literature, public databases, clinical trial notes, patient records, etc. llms provide the much-needed capability to analyze, identify patterns and connections, and extract novel insights about disease mechanisms and potential therapeutic targets. their ability to interpret complex scientific concepts and elucidate connections between diseases, genes, and biological processes can help accelerate disease hypothesis generation and the identification of potential drug targets and biomarkers. when integrated with biomedical knowledge graphs, llms help create a unique synergistic model that enables bidirectional data- and knowledge-based reasoning. the explicit structured knowledge of knowledge graphs enhances the knowledge of llms while the power of language models streamlines graph construction and user conversational interactions with complex knowledge bases. however, there are still several challenges that have to be addressed before llms can be reliably integrated into in silico drug discovery pipelines and workflows. one of these is hallucinations. why do llms hallucinate? at a time of some speculation about laziness and seasonal depression in llms, a hallucination leaderboard of 11 public llms revealed hallucination rates that ranged from 3% at the top end to 27% at the bottom of the barrel. another comparative study of two versions of a popular llm in generating ophthalmic scientific abstracts revealed high rates (33% and 29%) of fabricated references.
this tendency of llms to hallucinate, that is, to present incorrect or unverifiable information as fact, can have serious consequences in critical drug discovery applications even at a 3% rate. there are several reasons for llm hallucinations. at the core of this behavior is the fact that generative ai models have no actual intelligence, relying instead on a probability-based approach to predict data that is most likely to occur based on patterns and contexts ‘learned’ from their training data. apart from this inherent lack of contextual understanding, other potential causes include exposure to noise, errors, biases, and inconsistencies in training data, training and generation methods, or even prompting techniques. some argue that hallucination is all llms do, while others see it as inevitable for any prompt-based large language model. in the context of life sciences research, however, mitigating llm hallucinations remains one of the biggest obstacles to the large-scale and strategic integration of this potentially transformative technology. how to mitigate llm hallucinations? there are three broad and complementary approaches to mitigating hallucinations in large language models: prompt engineering, fine-tuning, and grounding + prompt augmentation. prompt engineering prompt engineering is the process of strategically designing user inputs, or prompts, in order to guide model behavior and obtain optimal responses. there are three major approaches to prompt engineering: zero-shot, few-shot, and chain-of-thought prompts. in zero-shot prompting, language models are given a task without any worked examples and must generalize from their training alone. few-shot prompting involves providing examples to llms before presenting the actual query. chain-of-thought (cot) is based on the finding that a series of intermediate reasoning steps provided as examples during prompting can significantly improve the reasoning capabilities of large language models.
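the three prompting styles just described differ only in how the input string is assembled, which a short sketch can make concrete. the example task and wording below are illustrative.

```python
def build_prompt(question, examples=None, chain_of_thought=False):
    """Assemble a zero-shot, few-shot, or chain-of-thought prompt.

    examples: optional list of (question, answer) pairs; for CoT the
    answer strings would themselves contain worked reasoning steps.
    """
    parts = []
    # few-shot: worked examples precede the real query
    for q, a in (examples or []):
        parts.append(f"Q: {q}\nA: {a}")
    # chain-of-thought: elicit intermediate reasoning steps
    suffix = "A: Let's think step by step." if chain_of_thought else "A:"
    parts.append(f"Q: {question}\n{suffix}")
    return "\n\n".join(parts)

# zero-shot: the bare question, no examples
zero_shot = build_prompt("Is target X druggable?")

# few-shot: prepend a worked example before the real query
shots = [("Is target Y druggable?", "Yes, because it has a defined binding pocket.")]
few_shot = build_prompt("Is target X druggable?", examples=shots)

# chain-of-thought: nudge the model to reason before answering
cot = build_prompt("Is target X druggable?", chain_of_thought=True)
```

note that only the input changes across the three variants; the model and its weights are untouched, which is exactly what distinguishes prompt engineering from fine-tuning.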
the chain-of-thought concept has been expanded to include new techniques such as chain-of-verification (cove), a self-verification process that enables llms to check the accuracy and reliability of their output, and chain of density (cod), a process that focuses on summarization rather than reasoning to control the density of information in the generated text. prompt engineering, however, has its own set of limitations including prompt constraints that may cramp the ability to query complex domains and the lack of objective metrics to quantify prompt effectiveness. fine-tuning where the focus of prompt engineering is on the skill required to elicit better llm output, fine-tuning emphasizes task-specific training in order to enhance the performance of pre-trained models in specific topics or domain areas. a conventional approach to llm fine-tuning is full fine-tuning, which involves the additional training of pre-trained models on labeled, domain- or task-specific data in order to generate more contextually relevant responses. this is a time-, resource-, and expertise-intensive process. an alternative approach is parameter-efficient fine-tuning (peft), conducted on a small set of extra parameters without adjusting the entire model. the modular nature of peft means that the training can prioritize select portions or components of the original parameters so that the pre-trained model can be adapted for multiple tasks. lora (low-rank adaptation of large language models), a popular peft technique, can significantly reduce the resource intensity of fine-tuning while matching the performance of full fine-tuning. there are, however, challenges to fine-tuning including domain shift issues, the potential for bias amplification and catastrophic forgetting, and the complexities involved in choosing the right hyperparameters for fine-tuning in order to ensure optimal performance.
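the resource savings of lora come from simple arithmetic: instead of updating a full d × k weight matrix w, lora freezes w and trains two low-rank factors b (d × r) and a (r × k), with the adapted weight w' = w + b·a. a quick sketch with illustrative dimensions (a transformer-like 4096 × 4096 projection and a typical small rank):

```python
def lora_param_counts(d, k, r):
    """Trainable parameters: full fine-tuning vs a rank-r LoRA update.

    Full fine-tuning updates the entire d x k matrix W; LoRA trains
    only B (d x r) and A (r x k), with W' = W + B @ A.
    """
    full = d * k
    lora = d * r + r * k
    return full, lora

# illustrative sizes -- not tied to any specific model
full, lora = lora_param_counts(d=4096, k=4096, r=8)
reduction = lora / full  # fraction of parameters that remain trainable
```

for these dimensions, lora trains roughly 0.4% of the parameters that full fine-tuning would, which is why it can run on far more modest hardware.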
grounding & augmentation llm hallucinations are often the result of language models attempting to generate knowledge based on information that they have not explicitly memorized or seen. the logical solution, therefore, would be to provide llms with access to a curated knowledge base of high-quality contextual information that enables them to generate more accurate responses. advanced grounding and prompt augmentation techniques can help address many of the accuracy and reliability challenges associated with llm performance. both techniques rely on external knowledge sources to dynamically generate context. grounding ensures that llms have access to up-to-date and use-case-specific information sources to provide the relevant context that may not be available solely from the training data. similarly, prompt augmentation enhances a prompt with contextually relevant information that enables llms to generate a more accurate and pertinent output. factual grounding is a technique typically used in the pre-training phase to ensure that llm output across a variety of tasks is consistent with a knowledge base of factual statements. post-training grounding relies on a range of external knowledge bases, including documents, code repositories, and public and proprietary databases, to improve the accuracy and relevance of llms on specific tasks. retrieval-augmented generation (rag) is a distinct framework for the post-training grounding of llms based on the most accurate, up-to-date information retrieved from external knowledge bases. the rag framework enables the optimization of biomedical llms output along three key dimensions. one, access to targeted external knowledge sources ensures llms' internal representation of information is dynamically refreshed with the most current and contextually relevant data. two, access to an llm’s information sources ensures that responses can be validated for relevance and accuracy.
and three, there is the emerging potential to extend the rag framework beyond just text to multimodal knowledge retrieval, spanning images, audio, tables, etc., that can further boost the factuality, interpretability, and sophistication of llms. also read: how retrieval-augmented generation (rag) can transform drug discovery some of the key challenges of retrieval-augmented generation include the high initial cost of implementation as compared to standalone generative ai. however, in the long run, the rag-llm combination will be less expensive than frequently fine-tuning llms and provides the most comprehensive approach to mitigating llm hallucinations. but even with better grounding and retrieval, scientific applications demand another layer of rigor — validation and reproducibility. here’s how teams can build confidence in llm outputs before trusting them in high-stakes discovery workflows. how to validate llm outputs in drug discovery pipelines in scientific settings like drug discovery, ensuring the validity of large language model (llm) outputs is critical — especially when such outputs may inform downstream experimental decisions. here are key validation strategies used to assess llm-generated content in biomedical pipelines:

validation checklist:
- compare outputs to curated benchmarks: use structured, peer-reviewed datasets such as drugbank, chembl, or internal gold standards to benchmark llm predictions.
- cross-reference with experimental data: validate ai-generated hypotheses against published experimental results, or integrate with in-house wet lab data for verification.
- establish feedback loops from in vitro validations: create iterative pipelines where lab-tested results refine future model prompts, improving accuracy over time.

advancing reproducibility in ai-augmented science for llm-assisted workflows to be trustworthy and audit-ready, they must be reproducible — particularly when used in regulated environments.
reproducibility practices:
- dataset versioning: track changes in source datasets, ensuring that each model run references a consistent data snapshot.
- prompt logging: store full prompts (including context and input structure) to reproduce specific generations and analyze outputs over time.
- controlled inference environments: standardize model versions, hyperparameters, and apis to eliminate variation in inference across different systems.

integrated intelligence with lensai™ holistic life sciences research requires the sophisticated orchestration of several innovative technologies and frameworks. lensai integrated intelligence, our next-generation data-centric ai platform, fluently blends some of the most advanced proprietary technologies into one seamless solution that empowers end-to-end drug discovery and development. lensai integrates rag-enhanced biollms with an ontology-driven nlp framework, combining neuro-symbolic logic techniques to connect and correlate syntax (multi-modal sequential and structural data) and semantics (biological functions). a comprehensive and continuously expanding knowledge graph, mapping a remarkable 25 billion relationships across 660 million data objects, links sequence, structure, function, and literature information from the entire biosphere to provide a comprehensive overview of the relationships between genes, proteins, structures, and biological pathways. our next-generation, unified, knowledge-driven approach to the integration, exploration, and analysis of heterogeneous biomedical data empowers life sciences researchers with the high-tech capabilities needed to explore novel opportunities in drug discovery and development.
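the reproducibility practices described earlier in this post (dataset versioning, prompt logging, controlled inference settings) can be sketched as a minimal logging utility. all field values below are invented placeholders, and a production system would persist these records to an audit store rather than return a string.

```python
import hashlib
import json

def log_generation(dataset_snapshot, prompt, model_version, params):
    """Record everything needed to reproduce one LLM generation:
    a dataset fingerprint, the full prompt, and the inference settings."""
    record = {
        # dataset versioning: fingerprint the exact data snapshot used
        "dataset_sha256": hashlib.sha256(dataset_snapshot.encode()).hexdigest(),
        # prompt logging: store the full prompt, context included
        "prompt": prompt,
        # controlled inference: pin model version and hyperparameters
        "model_version": model_version,
        "params": params,
    }
    return json.dumps(record, sort_keys=True)

# illustrative inputs -- not from any real pipeline
entry = log_generation(
    dataset_snapshot="target,affinity\nEGFR,8.2\n",
    prompt="Summarize known EGFR inhibitors.",
    model_version="biollm-v1.3",
    params={"temperature": 0.0, "max_tokens": 256},
)
```

hashing the data snapshot rather than storing it keeps the log compact while still detecting any drift: if a later run produces a different digest, the runs are not comparable.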
across several previous blogs, we have explored the importance of knowledge graphs, large language models (llms), and semantic analysis in biomedical research. today, we focus on integrating these distinct concepts into a unified model that can help advance drug discovery and development. but before we get to that, here’s a quick synopsis of the knowledge graph, llm & semantic analysis narrative so far. llms, knowledge graphs & semantics in biomedical research it has been established that biomedical llms — domain-specific models pre-trained exclusively on domain-specific vocabulary — outperform conventional tools in many biological data-based tasks. it is therefore considered inevitable that these models will quickly expand across the broader biomedical domain. however, there are still several challenges, such as hallucinations and interpretability, that have to be addressed before biomedical llms can be taken mainstream. a key biomedical domain-specific challenge is llms’ lack of semantic intelligence. llms have, debatably, been described as ‘stochastic parrots’ that comprehend none of the language, relying instead on ‘learning’ meaning based on the large-scale extraction of statistical correlations. this has led to the question of whether modern llms really possess any inductive, deductive, or abductive reasoning abilities. statistically extrapolated meaning may well be adequate for general language llm applications. however, the unique complexities and nuances of the biochemical, biomedical, and biological vocabulary require a more semantic approach to convert words/sentences into meaning, and ultimately knowledge. biomedical knowledge graphs address this key capability gap in llms by going beyond statistical correlations to bring the power of context to biomedical language models.
knowledge graphs help capture the inherent graph structure of biomedical data, such as drug-disease interactions and protein-protein interactions, and model complex relationships between disparate data elements into one unified structure that is both human-readable and computationally accessible. knowledge graphs accomplish this by emphasizing the definitions of, and the semantic relationships between, different entities. they use domain-specific ontologies that formally define various concepts and relations to enrich and interlink data based on context. a combination, therefore, of semantic knowledge graphs and biomedical llms will be most effective for life sciences applications. semantic knowledge graphs and llms in drug discovery there are three general frameworks for unifying the power of llms and knowledge graphs. the first, knowledge graph-enhanced llms, focuses on using the explicit, structured knowledge of knowledge graphs to enhance the knowledge of llms at different stages including pre-training, inference, and interpretability. this approach offers three distinct advantages: it improves the knowledge expression of llms, provides llms with continuous access to the most up-to-date knowledge, and affords more transparency into the reasoning process of black-box language models. structured data from knowledge graphs, related to genes, proteins, diseases, pathways, chemical compounds, etc., combined with unstructured data from scientific literature, clinical trial reports, patents, etc., can help augment drug discovery by providing a more holistic domain view. the second, llm-augmented knowledge graphs, leverages the power of language models to streamline graph construction, enhance knowledge graph tasks such as graph-to-text generation and question answering, and augment the reasoning capabilities and performance of knowledge graph applications.
llm-augmented knowledge graphs combine the natural language capabilities of llms with the rich semantic relationships represented in knowledge graphs to empower pharmaceutical researchers with faster and more precise answers to complex questions and to extract insights based on patterns and correlations. llms can also enhance the utility of knowledge graphs in drug discovery by constantly extracting and enriching pharmaceutical knowledge graphs. the third approach is towards creating a synergistic biomedical llm plus biomedical knowledge graph (bkg) model that enables bidirectional data- and knowledge-based reasoning. currently, the process of combining generative and reasoning capabilities into one symbiotic model is focused on specific tasks. however, this is poised to expand to diverse downstream applications in the near future. even as research continues to focus on the symbiotic possibilities of a unified knowledge graph-llm framework, these concepts are already having a transformative impact on several drug discovery and development processes. take target identification, for instance, a critical step in drug discovery with consequential implications for downstream development processes. ai-powered language models have been shown to outperform state-of-the-art approaches in key tasks such as biomedical named entity recognition (bioner) and biomedical relation extraction. transformer-based llms are being used in chemoinformatics to advance drug–target relationship prediction and to effectively generate novel, valid, and unique molecules. llms are also evolving beyond basic text-to-text frameworks to multi-modal large language models (mllms) that bring the combined power of image plus text adaptive learning to target identification and validation. 
meanwhile, the semantic capabilities of knowledge graphs enhance the efficiencies of target identification by enabling the harmonization and enrichment of heterogeneous data into one connected framework for more holistic exploration and analysis. llms are increasingly being used across the drug discovery and development pipeline to predict drug-target interactions (dtis) and drug-drug interactions, molecular properties, such as pharmacodynamics, pharmacokinetics, and toxicity, and even likely drug withdrawals from the market due to safety concerns. in the drug discovery domain, biomedical knowledge graphs are being used across a range of tasks including polypharmacy prediction, dti prediction, adverse drug reaction (adr) prediction, gene-disease prioritization, and drug repurposing. the next significant point of inflection will be the integration of these powerful technologies into one synergized model to drive a step change in performance and efficiency. optimizing llms for biomedical research there are three key challenges — knowledge cut-off, hallucinations, and interpretability — that must be addressed before llms can be reliably integrated into biomedical research. there are currently two complementary approaches to mitigate these challenges and optimize biomedical llm performance. the first approach is to leverage the structured, factual, domain-specific knowledge contained in biomedical knowledge graphs to enhance the factual accuracy, consistency, and transparency of llms. using graph-based query languages, the pre-structured data embedded in knowledge graph frameworks can be directly queried and integrated into llms. another key capability for biomedical llms is to retrieve information from external sources, on a per-query basis, in order to generate the most up-to-date and contextually relevant responses.
there are two broad reasons why this is a critical capability in biomedical research: first, it ensures that llms' internal knowledge is supplemented by access to the most current and reliable information from domain-specific, high-quality, and updateable knowledge sources. second, access to the underlying data sources means that responses can be checked for accuracy and provenance. the retrieval augmented generation (rag) approach combines the power of llms with external knowledge retrieval mechanisms to enhance the reasoning, accuracy, and knowledge recall of biomedical llms. combining the knowledge graph- and rag-based approaches will lead to significant improvements in llm performance in terms of factual accuracy, context-awareness, and continuous knowledge enrichment.

what is retrieval-augmented generation (rag) in drug discovery?

retrieval-augmented generation (rag) is an approach that combines large language models with access to internal and external trusted data sources. in the context of drug discovery, it helps generate scientifically grounded responses by drawing on biomedical datasets or proprietary data silos. when integrated with a knowledge graph, rag can support context-aware candidate suggestions, summarize literature, or even generate hypotheses based on experimental inputs. this is especially useful in fragmented biomedical data landscapes, where rag helps surface meaningful cross-modal relationships—across omics layers, pathways, phenotypes, and more.

what’s the difference between llms and plms in drug discovery?

large language models (llms) are general-purpose models trained on vast textual corpora, capable of understanding and generating human-like language. protein language models (plms), on the other hand, are trained on biological sequences, like amino acids, to capture structural and functional insights. 
while llms can assist in literature mining or clinical trial design, plms power structure prediction, function annotation, and rational protein engineering. combining both enables cross-modal reasoning for smarter discovery.

lensai: the next-generation rag-kg-llm platform

these components—llms, plms, knowledge graphs, and rag—are increasingly being combined into unified frameworks for smarter drug discovery. imagine a system where a protein structure predicted by a plm is linked to pathway insights from a biomedical knowledge graph. an llm then interprets these connections to suggest possible disease associations or therapeutic hypotheses—supported by citations retrieved via rag. this kind of multi-layered integration mirrors how expert scientists reason, helping teams surface and prioritize meaningful leads much faster than traditional workflows. at biostrand, we have successfully actualized a next-generation unified knowledge graph-large language model framework for holistic life sciences research. at the core of our lensai platform is a comprehensive and continuously expanding knowledge graph that maps 25 billion relationships across 660 million data objects, linking sequence, structure, function, and literature information from the entire biosphere. our first-in-class technology provides a holistic understanding of the relationships between genes, proteins, and biological pathways, thereby opening up powerful new opportunities for drug discovery and development. the platform leverages the latest advances in ontology-driven nlp and ai-driven llms to connect and correlate syntax (multi-modal sequential and structural data) and semantics (functions). 
our unified approach to biomedical knowledge graphs, retrieval-augmented generation models, and large language models combines the reasoning capabilities of llms, the semantic proficiency of knowledge graphs, and the versatile information retrieval capabilities of rag to streamline the integration, exploration, and analysis of all biomedical data.
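the retrieve-then-generate pattern at the heart of rag can be sketched in a few lines of python. everything here is illustrative: the toy corpus, the keyword-overlap scorer standing in for a real embedding model, and the prompt template. a production system would retrieve with dense vectors and pass the assembled prompt to an llm.

```python
# minimal retrieval-augmented generation (rag) sketch: retrieve the most
# relevant passages from a trusted corpus, then assemble them into a
# grounded prompt with provenance. the scoring here is a toy keyword-overlap
# measure standing in for a real embedding model.

CORPUS = {
    "doc1": "pembrolizumab is a monoclonal antibody that blocks pd-1 signaling",
    "doc2": "knowledge graphs map relationships between genes and diseases",
    "doc3": "checkpoint blockade restores t-cell activity in tumors",
}

def score(query: str, text: str) -> int:
    """toy relevance score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve(query: str, k: int = 2) -> list[str]:
    """return the ids of the k highest-scoring documents."""
    ranked = sorted(CORPUS, key=lambda d: score(query, CORPUS[d]), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """assemble retrieved evidence into a grounded prompt, citing source ids."""
    context = "\n".join(f"[{d}] {CORPUS[d]}" for d in retrieve(query))
    return f"answer using only the sources below, citing ids.\n{context}\n\nquestion: {query}"

prompt = build_prompt("which drugs inhibit pd-1 signaling")
```

because the prompt carries the retrieved passages and their ids, the generated answer can be checked for accuracy and provenance, which is exactly the property that distinguishes rag from closed-book generation.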
there’s more biomedical data than ever, but making sense of it is still tough. in this blog, we look at how semantic analysis—an essential part of natural language processing (nlp)—helps researchers turn free text into structured insights. from identifying key biomedical terms to mapping relationships between them, we explore how these techniques support everything from literature mining to optimizing clinical trials.

what is semantic analysis in linguistics?

semantic analysis is an important subfield of linguistics, the systematic scientific investigation of the properties and characteristics of natural human language. as the study of the meaning of words and sentences, semantic analysis complements other linguistic subbranches that study phonetics (the study of sounds), morphology (the study of word units), syntax (the study of how words form sentences), and pragmatics (the study of how context impacts meaning), to name just a few. there are three broad subcategories of semantics:

formal semantics: the study of the meaning of linguistic expressions by applying mathematical-logical formalizations, such as first-order predicate logic or the lambda calculus, to natural languages.

conceptual semantics: the study of words, phrases, and sentences based not just on a set of strict semantic criteria but on schematic and prototypical structures in the minds of language users.

lexical semantics: the study of word meanings, not just in terms of the basic meaning of a lexical unit but in terms of the semantic relations that integrate these units into a broader linguistic system.

semantic analysis in natural language processing (nlp)

in nlp, semantic analysis is the process of automatically extracting meaning from natural languages in order to enable human-like comprehension in machines. 
there are two broad methods for using semantic analysis to comprehend meaning in natural languages: one, training machine learning models on vast volumes of text to uncover connections, relationships, and patterns that can be used to predict meaning (e.g. chatgpt). and two, using structured ontologies and databases that pre-define linguistic concepts and relationships, enabling semantic analysis algorithms to quickly locate useful information in natural language text. though generalized large language model (llm) based applications are capable of handling broad and common tasks, specialized models based on a domain-specific taxonomy, ontology, and knowledge base design will be essential to power intelligent applications.

how does semantic analysis work?

there are two key components to semantic analysis in nlp. the first is lexical semantics, the study of the meaning of individual words and their relationships. this stage entails obtaining the dictionary definition of the words in the text, parsing each word/element to determine individual functions and properties, and designating a grammatical role for each. key aspects of lexical semantics include identifying word senses, synonyms, antonyms, hyponyms, hypernyms, and morphology. in the next step, individual words can be combined into a sentence and parsed to establish relationships, understand syntactic structure, and provide meaning. there are several different approaches within semantic analysis to decode the meaning of a text. popular approaches include:

semantic feature analysis (sfa): this approach involves the extraction and representation of shared features across different words in order to highlight word relationships and help determine the importance of individual factors within a text. 
key subtasks include feature selection, to highlight attributes associated with each word, feature weighting, to distinguish the importance of different attributes, and feature vectors and similarity measurement, for insights into relationships and similarities between words, phrases, and concepts.

latent semantic analysis (lsa): this technique extracts meaning by capturing the underlying semantic relationships and context of words in a large corpus. by recognizing the latent associations between words and concepts, lsa enhances machines’ capability to interpret natural languages like humans. the lsa process includes creating a term-document matrix, applying singular value decomposition (svd) to the matrix, dimension reduction, concept representation, indexing, and retrieval. probabilistic latent semantic analysis (plsa) is a variation on lsa with a statistical and probabilistic approach to finding latent relationships.

semantic content analysis (sca): this methodology goes beyond simple feature extraction and distribution analysis to consider word usage context and text structure to identify relationships and impute meaning to natural language text. the process broadly involves dependency parsing, to determine grammatical relationships, identifying thematic and case roles to reveal relationships between actions, participants, and objects, and semantic frame identification, for a more refined understanding of contextual associations.

semantic analysis techniques

here’s a quick overview of some of the key semantic analysis techniques used in nlp:

word embeddings

these refer to techniques that represent words as vectors in a continuous vector space and capture semantic relationships based on co-occurrence patterns. 
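the co-occurrence idea behind count-based word vectors can be made concrete with a toy sketch in plain python. the corpus and window size are illustrative, and this is the frequency-based end of the embedding spectrum, not a trained neural model.

```python
# toy count-based word embeddings: represent each word by its co-occurrence
# counts with every other word within a small window, then compare words
# with cosine similarity (a standard similarity measure over vectors).
import math
from collections import defaultdict

corpus = [
    "the gene encodes a protein",
    "the mutated gene encodes a truncated protein",
    "patients received the drug daily",
    "the drug dose was reduced for some patients",
]

WINDOW = 2  # how many neighbors on each side count as "context"
cooc = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if i != j:
                cooc[w][tokens[j]] += 1

vocab = sorted({w for s in corpus for w in s.split()})

def vector(word):
    """the word's co-occurrence counts, ordered by vocabulary."""
    return [cooc[word][c] for c in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# words appearing in similar contexts end up with similar vectors
sim = cosine(vector("gene"), vector("protein"))
```

real embedding models (word2vec, glove, contextual transformers) learn dense, low-dimensional versions of this same intuition instead of raw counts.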
word-to-vector representation techniques are categorized as: conventional count-based/frequency-based models; distributional, static word-embedding models, which include latent semantic analysis (lsa), word-to-vector (word2vec), global vectors (glove), and fasttext; and contextual models, which include embeddings from language models (elmo), generative pre-training (gpt), and bidirectional encoder representations from transformers (bert).

semantic role labeling

this is a technique that seeks to answer a central question — who did what to whom, how, when, and where — in many nlp tasks. semantic role labeling identifies the roles that different words play by recognizing the predicate-argument structure of a sentence. it is traditionally broken down into four subtasks: predicate identification, predicate sense disambiguation, argument identification, and argument role labeling. given its ability to generate more realistic linguistic representations, semantic role labeling today plays a crucial role in several nlp tasks including question answering, information extraction, and machine translation.

named entity recognition (ner)

ner is a key information extraction task in nlp for detecting and categorizing named entities, such as names, organizations, locations, events, etc. ner uses machine learning algorithms trained on data sets with predefined entities to automatically analyze and extract entity-related information from new unstructured text. ner methods are classified as rule-based, statistical, machine learning, deep learning, and hybrid models. biomedical named entity recognition (bioner) is a foundational step in biomedical nlp systems with a direct impact on critical downstream applications involving biomedical relation extraction, drug-drug interactions, and knowledge base construction. however, the linguistic complexity of biomedical vocabulary makes the detection and prediction of biomedical entities such as diseases, genes, species, chemicals, etc. 
even more challenging than general-domain ner. the challenge is often compounded by a shortage of large-scale labeled training data and domain knowledge for sequence labeling. deep learning bioner methods, such as bidirectional long short-term memory with a crf layer (bilstm-crf), embeddings from language models (elmo), and bidirectional encoder representations from transformers (bert), have been successful in addressing several of these challenges. currently, there are several variations of the bert pre-trained language model, including bluebert, biobert, and pubmedbert, that have been applied to bioner tasks. an associated and equally critical task in bionlp is biomedical relation extraction (biore), the process of automatically extracting and classifying relationships between complex biomedical entities. in recent years, the integration of attention mechanisms and the availability of pre-trained biomedical language models have helped augment the accuracy and efficiency of biore tasks in biomedical applications. other semantic analysis techniques involved in extracting meaning and intent from unstructured text include coreference resolution, semantic similarity, semantic parsing, and frame semantics.

the importance of semantic analysis in nlp

semantic analysis is key to the foundational task of extracting context, intent, and meaning from natural human language and making them machine-readable. this fundamental capability is critical to various nlp applications, from sentiment analysis and information retrieval to machine translation and question-answering systems. the continual refinement of semantic analysis techniques will therefore play a pivotal role in the evolution and advancement of nlp technologies.

how llms improve semantic search in biomedical nlp

semantic search in biomedical literature has evolved far beyond simple keyword matching. 
today, large language models (llms) enable researchers to retrieve contextually relevant insights from complex, unstructured datasets—such as pubmed—by understanding meaning, not just matching words. unlike traditional search, which depends heavily on exact term overlap, llm-based systems leverage embeddings—dense vector representations of words and phrases—to capture nuanced relationships between biomedical entities. this is especially valuable when mining literature for drug-disease associations, extracting drug-gene relations, predicting mode of action, or identifying multi-sentence relationships between proteins and genes. by embedding both queries and biomedical documents in the same high-dimensional space, llms support more relevant and context-aware retrieval. for instance, a query such as "inhibitors of pd-1 signaling" can retrieve relevant articles even if they don’t explicitly use the phrase "pd-1 inhibitors." this approach has transformed pubmed mining with nlp by enabling deeper and more intuitive exploration of biomedical text. llm-powered semantic search is already being used in pubmed mining tools, clinical trial data extraction, and knowledge graph construction.

looking ahead: nlp trends in drug discovery

as semantic search continues to evolve, it’s becoming central to biomedical research workflows, enabling faster, deeper insights from unstructured text. the shift from keyword matching to meaning-based retrieval marks a key turning point in nlp-driven drug discovery. these llm-powered approaches are especially effective for use cases like:

- extracting drug-gene interactions
- identifying biomarkers from literature
- linking unstructured data across sources

they also help address key challenges in biomedical nlp, such as ambiguity, synonymy, and entity disambiguation across documents.
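the gap between exact keyword matching and meaning-based matching can be shown with a small sketch. a real system would use llm embeddings; here a tiny, hand-written concept map stands in for the semantic layer, just to show why a query about "inhibitors of pd-1 signaling" should match an article that never uses the literal phrase "pd-1 inhibitors". the articles and concept table are illustrative.

```python
# contrast exact keyword matching with a crude meaning-aware matcher.
# the CONCEPTS table is a stand-in for what embedding similarity provides
# automatically in an llm-based semantic search system.

ARTICLES = {
    "a1": "nivolumab blocks pd-1 signaling to restore t-cell responses",
    "a2": "statins lower ldl cholesterol levels",
}

CONCEPTS = {  # illustrative synonym/concept expansion
    "inhibitors": {"inhibitors", "blocks", "inhibits", "antagonists"},
    "pd-1": {"pd-1"},
    "signaling": {"signaling", "pathway"},
}

def keyword_match(query, text):
    """exact term overlap: every query token must appear literally."""
    return all(tok in text.split() for tok in query.split())

def semantic_match(query, text):
    """meaning-aware: any token from a query term's concept set may appear."""
    words = set(text.split())
    return all(CONCEPTS.get(tok, {tok}) & words for tok in query.split())

q = "inhibitors pd-1 signaling"
exact = [a for a in ARTICLES if keyword_match(q, ARTICLES[a])]      # []
semantic = [a for a in ARTICLES if semantic_match(q, ARTICLES[a])]  # ["a1"]
```

embedding-based retrieval generalizes this: instead of a hand-built table, query and document vectors land close together in the same space whenever their meanings align.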
knowledge graphs (kgs) have become a must-know innovation that will drive transformational benefits in data-centric ai applications across industries. kgs, big data, and ai are complementary concepts that together address the challenges of integrating, unifying, analyzing, and querying vast volumes of diverse and complex data. there are several inherent advantages to the kg approach to organizing and representing information. unlike traditional flat data structures, for instance, a kg framework is designed to model multilevel hierarchical, associative, and causal relationships that more accurately represent real-world data. the application of a semantic layer to data also makes it easier for both humans and machines to understand the context and significance of information. here, then, are some of the key features and benefits of knowledge graphs.

efficient data integration: integrate disparate data sources and break down information silos

ai-specific data management, including automated data and metadata integration, is a critical component of successful data-centric ai. however, factors such as data complexity, quality, and accessibility pose integration challenges that are barriers to ai adoption. data-centric ai requires a modern approach to data integration that integrates all organizational data entities into one unified semantic representation based on context (ontologies, metadata, domain knowledge, etc.) and time (temporal relationships). knowledge graphs (kgs) have become the ideal platform for the contextual integration and representation of complex data ecosystems. they enable the integration of information from multiple data sources and map them to a common ontology in order to create a comprehensive, consistent, and connected representation of all organizational data entities. 
the scalability of this approach, across large volumes of heterogeneous, structured, semi-structured, and multimodal unstructured data from diverse data sources and silos, makes kgs ideal for automated data acquisition, transformation, and integration. knowledge extraction methods can be used to classify entities and relations, identify matching entities (entity linking, entity resolution), combine entities into a single representation (entity fusion), and match and merge ontology concepts to create a kg graph data model. there are several advantages to kg data models. they have the flexibility to scale across complex heterogeneous data structures. when integrated with natural language technologies (nlt), kgs can help train language models on domain-specific knowledge, and natural language technologies can streamline the construction of knowledge models. they allow for more intuitive querying of complex data, even by users without specialized data science knowledge. they can evolve to assimilate new data, sources, definitions, and use cases without loss of manageability or accessibility. and they provide consistent, unified access to all organizational knowledge that is typically distributed across different data silos and systems.

rich contextualization: capture relationships and provide a holistic view of data

context is a critical component of learning, for both humans and machines. contextual information will be key to the development of next-generation ai systems that adopt a human approach to transform data into knowledge that enables more human-like decision-making. kgs leverage the powers of context and relations to embed data with intelligence. by organizing data based on factual interconnections and interrelations, they add real-world meaning to data that makes it easier for ai systems to extract knowledge from vast volumes of data. 
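the integration steps above, a common triple model, pattern-based querying, and entity fusion, can be sketched minimally in python. the entities, relations, and alias table are illustrative; real systems use graph databases and ontology-backed resolution.

```python
# a minimal knowledge graph as a set of (subject, predicate, object) triples,
# with wildcard pattern querying and a simple entity-fusion step that
# rewrites aliases to one canonical identifier.

TRIPLES = {
    ("TP53", "encodes", "p53"),
    ("p53", "regulates", "apoptosis"),
    ("tumour protein 53", "associated_with", "li-fraumeni syndrome"),
}

ALIASES = {"tumour protein 53": "TP53"}  # toy entity-resolution table

def fuse(triples, aliases):
    """entity fusion: merge aliases into a single canonical representation."""
    return {(aliases.get(s, s), p, aliases.get(o, o)) for s, p, o in triples}

def query(triples, s=None, p=None, o=None):
    """match triples against a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in triples
            if (s is None or ts == s) and (p is None or tp == p)
            and (o is None or to == o)]

kg = fuse(TRIPLES, ALIASES)
# after fusion, facts recorded under "tumour protein 53" attach to TP53
tp53_facts = query(kg, s="TP53")
```

the payoff of fusion is visible in the query: before it, facts about the same gene are split across two disconnected nodes; after it, one pattern query retrieves them all.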
a key organizing principle of kgs is the provision of an additional metadata layer that organizes data based on context to support logical reasoning and knowledge discovery. the organizing principle could take many forms, including controlled vocabularies such as taxonomies and ontologies, entity resolution and analysis, and tagging, categorization, and classification. with kgs, smart behavior is encoded directly into the data so that the graph itself can dynamically understand connections and associations between entities, eliminating the need to manually program every new piece of information. knowledge graphs provide context for decision support and can be further classified based on use cases as actioning kgs (data management) and decisioning kgs (analytics), and as context-rich kgs (internal knowledge management), external-sensing kgs (external data mapping), and natural language processing kgs.

enhanced search and discovery: enable precise and context-aware search results

the first step towards understanding how kgs transform the data search and discovery function is to understand the distinction between data search and data discovery. data search broadly refers to a scenario in which users are looking for specific information that they know or assume to exist. this is a framework that allows users to seek and extract relevant information from volumes of non-relevant data. data discovery is focused more on proactively enabling users to surface and explore new information and ideas that are potentially related to the actual search string. discovery essentially is search powered by context. kgs contextually integrate all entities and relationships across different data silos and systems into a unified semantic layer. this enables them to deliver more accurate and comprehensive search results and to provide context-relevant connections and relationships that promote knowledge discovery. 
users can then follow the contextual links that are most pertinent to their interest to delve deeper into the data, thereby boosting data utilization and value. and perhaps equally importantly, the intuitive and flexible querying capabilities of kgs allow even non-technical users to explore data and discover new insights. it is estimated that graph-based models can help organizations enhance their ability to find, access, and reuse information by as much as 30%, and up to 75% faster.

knowledge graphs in life sciences

knowledge graphs are transformative frameworks that enable a structured, connected, and semantically-enhanced approach to organizing and interpreting data holistically. they provide the foundations for companies to create a uniform data fabric across different environments and technologies and to operationalize ai at scale. for the life sciences industry, knowledge graphs represent a powerful tool for integrating, harmonizing, and governing heterogeneous and siloed data while ensuring data quality, lineage, and compliance. they enable the creation of a centralized, shared, and holistic repository of knowledge that can be continually updated and enriched with new entities, relationships, and attributes. according to gartner, graph technologies will drive 80% of data and analytics innovations by 2025. if you are interested in integrating the innovative potential of kgs and ai/ml into your research pipeline, please drop us a line.
what are the limitations of large language models (llms) in biological research? chatgpt responds to this query with quite a comprehensive list that includes a lack of domain-specific knowledge, contextual understanding, access to up-to-date information, and interpretability and explainability. nevertheless, it has to be acknowledged that llms can have a transformative impact on biological and biomedical research. after all, these models have already been applied successfully in biological sequential data-based tasks like protein structure predictions and could possibly be extended to the broader language of biochemistry. specialized llms like chemical language models (clms) have the potential to outperform conventional drug discovery processes for traditional small-molecule drugs as well as antibodies. more broadly, there is a huge opportunity to use large-scale pre-trained language models to extract value from vast volumes of unannotated biomedical data. pre-training, of course, will be key to the development of biological domain-specific llms. research shows that domains, such as biomedicine, with large volumes of unlabeled text benefit most from domain-specific pretraining, as opposed to starting from general-domain language models. biomedical language models, pre-trained solely on domain-specific vocabulary, cover a much wider range of applications and, more importantly, substantially outperform currently available biomedical nlp tools. however, there is a larger issue of interpretability and explainability when it comes to transformer-based llms.

the llm black box

the development of natural language processing (nlp) models has traditionally been rooted in white-box techniques that were inherently interpretable. since then, however, the field has evolved towards more sophisticated and advanced black-box techniques that have undoubtedly facilitated state-of-the-art performance but have also obfuscated interpretability. 
to understand the sheer scale of the interpretability challenge in llms, we turn to openai’s language models can explain neurons in language models paper from earlier this year, which opens with the sentence “language models have become more capable and more widely deployed, but we do not understand how they work.” millions of neurons need to be analyzed in order to fully understand llms, and the paper proposes an approach to automating interpretability so that it can be scaled to all neurons in a language model. the catch, however, is that “neurons may not be explainable.” so, even as work continues on interpretable llms, the life sciences industry needs a more immediate solution to harness the power of llms while mitigating issues such as interpretability and explainability. and knowledge graphs could be that solution.

augmenting bionlp interpretability with knowledge graphs

one criticism of llms is that the predictions they generate, based on ‘statistically likely continuations of word sequences’, fail to capture the relational functionings that are central to scientific knowledge creation. these relational functionings, as it were, are critical to effective life sciences research. biomedical data is derived from different levels of biological organization, with disparate technologies and modalities, and scattered across multiple non-standardized data repositories. researchers need to connect all these dots, across diverse data types, formats, and sources, and understand the relationships and dynamics between them in order to derive meaningful insights. knowledge graphs (kgs) have become a critical component of the life sciences technology infrastructure because they help map the semantic or functional relationships between a million different data points. 
they use nlp to create a semantic network that visualises all objects in the system in terms of the relationships between them. semantic data integration, based on ontology matching, helps organize and link disparate structured/unstructured information into a unified human-readable, computationally accessible, and traceable knowledge graph that can be further queried for novel relationships and deeper insights.

unifying llms and kgs

combining these distinct ontology-driven and natural language-driven systems creates a synergistic technique that enhances the advantages of each while addressing the limitations of both. kgs can provide llms with the traceable factual knowledge required to address interpretability concerns. one roadmap for the unification of llms and kgs proposes three different frameworks:

kg-enhanced llms: the structured, traceable knowledge from kgs enhances the knowledge awareness and interpretability of llms. incorporating kgs in the pre-training stage helps with the transfer of knowledge, whereas in the inference stage it enhances llm performance in accessing domain-specific knowledge.

llm-augmented kgs: llms can be used in two different contexts - to process the original corpus and extract relations and entities that inform kg construction, and to process the textual corpus in the kgs to enrich representation.

synergized llms + kgs: both systems are unified into one general framework containing four layers. one, a data layer that processes the textual and structural data and can be expanded to incorporate multi-modal data, such as video, audio, and images. two, the synergized model layer, where both systems' features are synergized to enhance capabilities and performance. three, a technique layer to integrate related llms and kgs into the framework. and four, an application layer, for addressing different real-world applications. 
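the llm-augmented-kg direction, a model extracting relations and entities from a corpus to grow a knowledge graph, can be sketched as follows. a regex stands in for the llm extraction step; in practice this would be a prompted model call returning structured output. the sentences, relation names, and pattern are all illustrative.

```python
# sketch of llm-augmented kg construction: extract (subject, relation,
# object) triples from free text and add them to a knowledge graph.
# a naive "subject verb object" regex stands in for model-based extraction.
import re

SENTENCES = [
    "imatinib inhibits bcr-abl.",
    "bcr-abl drives chronic myeloid leukemia.",
]

PATTERN = re.compile(r"^(\S+)\s+(inhibits|drives|activates)\s+(.+?)\.$")

def extract_triples(sentences):
    """return one (subject, relation, object) triple per matching sentence."""
    triples = []
    for s in sentences:
        m = PATTERN.match(s)
        if m:
            triples.append((m.group(1), m.group(2), m.group(3)))
    return triples

kg = set(extract_triples(SENTENCES))
```

the extracted triples are traceable back to their source sentences, which is exactly the provenance property that makes kg-grounded llm output easier to audit than free-form generation.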
the kg-llm advantage

a unified kg-llm approach to bionlp provides an immediate solution to the black box concerns that impede large-scale deployment in the life sciences. combining domain-specific kgs, ontologies, and dictionaries can significantly enhance llm performance in terms of semantic understanding and interpretability. at the same time, llms can also help enrich kgs with real-world data, from ehrs, scientific publications, etc., thereby expanding the scope and scale of semantic networks and enhancing biomedical research. at mindwalk, we have already created a comprehensive knowledge graph that integrates over 660 million objects, linked by more than 25 billion relationships, from the biosphere and from other data sources, such as scientific literature. plus, our lensai platform, powered by hyft technology, leverages the latest advancements in llms to bridge the gap between syntax (multi-modal sequential and structural data) and semantics (functions). by integrating retrieval-augmented generation (rag) models, we have been able to harness the reasoning capabilities of llms while simultaneously addressing several associated limitations such as knowledge cut-off, hallucinations, and lack of interpretability. compared to closed-loop language modelling, this enhanced approach yields multiple benefits, including clear provenance and attribution, and up-to-date contextual reference as our knowledge base updates and expands. if you would like to integrate the power of a unified kg-llm framework into your research, please drop us a line here.
in 2022, eliza, an early natural language processing (nlp) system developed in 1966, won a peabody award for demonstrating that software could be used to create empathy. over 50 years later, human language technologies have evolved significantly beyond the basic pattern-matching and substitution methodologies that powered eliza. as we enter the new age of chatgpt, generative ai, and large language models (llms), here’s a quick primer on the key components of nlp systems — nlp, nlu (natural language understanding), and nlg (natural language generation).

what is nlp?

nlp is an interdisciplinary field that combines multiple techniques from linguistics, computer science, ai, and statistics to enable machines to understand, interpret, and generate human language. the earliest language models were rule-based systems that were extremely limited in scalability and adaptability. the field soon shifted towards data-driven statistical models that used probability estimates to predict sequences of words. though this approach was more powerful than its predecessor, it still had limitations in terms of scaling across large sequences and capturing long-range dependencies. the advent of recurrent neural networks (rnns) helped address several of these limitations, but it would take the emergence of transformer models in 2017 to bring nlp into the age of llms. the transformer model introduced a new architecture based on attention mechanisms. unlike sequential models like rnns, transformers are capable of processing all words in an input sentence in parallel. more importantly, the concept of attention allows them to model long-term dependencies even over long sequences. transformer-based llms trained on huge volumes of data can autonomously predict the next contextually relevant token in a sentence with an exceptionally high degree of accuracy. 
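the attention mechanism described above can be written out in plain python. each position's output is a softmax-weighted mix of all value vectors, with weights given by scaled query-key dot products; the tiny 2-d vectors here are illustrative stand-ins for learned projections in a real transformer.

```python
# scaled dot-product attention, the core of the transformer architecture.
# every query attends to every key in parallel, which is what lets
# transformers capture long-range dependencies without sequential steps.
import math

def softmax(xs):
    """numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    d = len(keys[0])  # key dimension, used to scale the dot products
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # one weight per input position, sums to 1
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# three "token" positions; every position attends to all the others at once
q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = attention(q, k, v)
```

because the softmax weights sum to one, each output row is a convex combination of the value vectors, so attention can pull in information from any position in the sequence regardless of distance.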
in recent years, domain-specific biomedical language models have helped augment and expand the capabilities and scope of ontology-driven bionlp applications in biomedical research. these domain-specific models have evolved from non-contextual models, such as biowordvec, biosentvec, etc., to masked language models, such as biobert, bioelectra, etc., and to generative language models, such as biogpt and biomedlm. knowledge-enhanced biomedical language models have proven to be more effective at knowledge-intensive bionlp tasks than generic llms. in 2020, researchers created the biomedical language understanding and reasoning benchmark (blurb), a comprehensive benchmark and leaderboard to accelerate the development of biomedical nlp.

nlp = nlu + nlg + nlq

nlp is a field of artificial intelligence (ai) that focuses on the interaction between human language and machines. it employs a constantly expanding range of techniques, such as tokenization, lemmatization, syntactic parsing, semantic analysis, and machine translation, to extract meaning from unstructured natural languages and to facilitate more natural, bidirectional communication between humans and machines. modern nlp systems are powered by three distinct natural language technologies (nlt): nlp, nlu, and nlg. it takes a combination of all these technologies to convert unstructured data into actionable information that can drive insights, decisions, and actions. according to gartner’s hype cycle for nlts, there has been increasing adoption of a fourth category called natural language query (nlq). so, here’s a quick dive into nlu, nlg, and nlq.

nlu

while nlp converts unstructured language into structured machine-readable data, nlu helps bridge the gap between human language and machine comprehension by enabling machines to understand the meaning, context, sentiment, and intent behind human language. 
nlu systems process human language across three broad linguistic levels: a syntactical level to understand language based on grammar and syntax, a semantic level to extract meaning, and a pragmatic level to decipher context and intent. these systems leverage several advanced techniques, including semantic analysis, named entity recognition, relation extraction and coreference resolution, to assign structure, rules, and logic to language to enable machines to get a human-level comprehension of natural languages. the challenge is to evolve from pipeline models, where each task is performed separately, to blended models that can combine critical bionlp tasks, such as biomedical named entity recognition (bioner) and biomedical relation extraction (biore), into one unified framework. nlg where nlu focuses on transforming complex human languages into machine-understandable information, nlg, another subset of nlp, involves interpreting complex machine-readable data in natural human-like language. this typically involves a six-stage process flow that includes content analysis, data interpretation, information structuring, sentence aggregation, grammatical structuring, and language presentation. nlg systems generate understandable and relevant narratives from large volumes of structured and unstructured machine data and present them as natural language outputs, thereby simplifying and accelerating the transfer of knowledge between machines and humans. to explain the nlp-nlu-nlg synergies in extremely simple terms, nlp converts language into structured data, nlu provides the syntactic, semantic, grammatical, and contextual comprehension of that data and nlg generates natural language responses based on data. nlq the increasing sophistication of modern language technologies has renewed research interest in natural language interfaces like nlq that allow even non-technical users to search, interact, and extract insights from data using everyday language. 
most nlq systems feature both nlu and nlg modules. the nlu module extracts and classifies the utterances, keywords, and phrases in the input query in order to understand the intent behind the database search. nlg becomes part of the solution when the results pertaining to the query are generated as written or spoken natural language. nlq tools are broadly categorized as either search-based or guided nlq. the search-based approach uses a free text search bar for typing queries, which are then matched to information in different databases. a key limitation of this approach is that it requires users to have enough information about the data to frame the right questions. the guided approach to nlq addresses this limitation by adding capabilities that proactively guide users to structure their data questions using modeled questions, autocomplete suggestions, and other relevant filters and options. augmenting life sciences research with nlp at mindwalk, our mission is to enable an authentic systems biology approach to life sciences research, and natural language technologies play a central role in achieving that mission. our lensai integrated intelligence platform leverages the power of our hyft® framework to organize the entire biosphere as a multidimensional network of 660 million data objects. our proprietary bionlp framework then integrates unstructured data from text-based information sources to enrich the structured sequence data and metadata in the biosphere. the platform also leverages the latest developments in llms to bridge the gap between syntax (sequences) and semantics (functions). for instance, the use of retrieval-augmented generation (rag) models enables the platform to scale beyond the typical limitations of llms, such as knowledge cutoffs and hallucinations, and provide the up-to-date contextual references required for biomedical nlp applications.
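the retrieval-augmented pattern can be sketched in a few lines. this toy version uses word overlap in place of a real vector index and a template in place of a generator; it is purely illustrative and does not reflect lensai’s actual implementation:

```python
def retrieve(query, corpus, k=1):
    """rank documents by word overlap with the query (a stand-in for vector search)."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query, corpus):
    """prepend retrieved context so the generator is grounded in current documents."""
    context = " ".join(retrieve(query, corpus))
    return f"context: {context}\nquestion: {query}\nanswer:"

corpus = [
    "tp53 encodes a tumor suppressor protein that regulates the cell cycle.",
    "hemoglobin transports oxygen in red blood cells.",
]
print(build_prompt("what does tp53 regulate?", corpus))
```

the key idea is that the model answers from the retrieved, up-to-date context rather than from its frozen training data, which is how rag mitigates knowledge cutoffs and hallucinations.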
with lensai, researchers can now choose to launch their research by searching for a specific biological sequence, or they may search the scientific literature with a general exploratory hypothesis related to a particular biological domain, phenomenon, or function. in either case, our unique technological framework returns all connected sequence-structure-text information, ready for further in-depth exploration and ai analysis. by combining the power of hyft®, nlp, and llms, we have created a unique platform that facilitates the integrated analysis of all life sciences data. thanks to our retrieval-augmented multimodal approach, we can now overcome llm limitations such as hallucinations and limited knowledge. stay tuned for more in our next blog.
natural language understanding (nlu) is an ai-powered technology that allows machines to understand the structure and meaning of human languages. nlu, like natural language generation (nlg), is a subset of natural language processing (nlp) that focuses on assigning structure, rules, and logic to human language so machines can understand the intended meaning of words, phrases, and sentences in text. nlg, on the other hand, deals with generating realistic written/spoken human-understandable information from structured and unstructured data. since the development of nlu is based on theoretical linguistics, the process can be explained in terms of the following linguistic levels of language comprehension. linguistic levels in nlu phonology is the study of sound patterns in different languages/dialects, and in nlu it refers to the analysis of how sounds are organized, and their purpose and behavior. lexical or morphological analysis is the study of morphemes, indivisible basic units of language with their own meaning, one at a time. indivisible words with their own meaning, or lexical morphemes (e.g.: work) can be combined with plural morphemes (e.g.: works) or grammatical morphemes (e.g.: worked/working) to create word forms. lexical analysis identifies relationships between morphemes and converts words into their root form. syntactic analysis, or syntax analysis, is the process of applying grammatical rules to word clusters and organizing them on the basis of their syntactic relationships in order to determine meaning. this also involves detecting grammatical errors in sentences. while syntactic analysis involves extracting meaning from the grammatical syntax of a sentence, semantic analysis looks at the context and purpose of the text. it helps capture the true meaning of a piece of text by identifying text elements as well as their grammatical role. 
discourse analysis expands the focus from sentence-length units to the relationships between sentences and their impact on overall meaning. discourse refers to coherent groups of sentences that contribute to the topic under discussion. pragmatic analysis deals with aspects of meaning not reflected in syntactic or semantic relationships. here the focus is on identifying the meaning intended for readers by analyzing literal and non-literal components against the context of background knowledge. common tasks/techniques in nlu there are several techniques used in the processing and understanding of human language. here’s a quick run-through of some of the key techniques used in nlu and nlp. tokenization is the process of breaking down a string of text into smaller units called tokens. for instance, a text document could be tokenized into sentences, phrases, words, subwords, or characters. this is a critical preprocessing task that converts unstructured text into numerical data for further analysis. stemming and lemmatization are two different approaches with the same objective: to reduce a particular word to its root. in stemming, characters are removed from the end of a word to arrive at its “stem”. algorithms determine the number of characters to eliminate for different words even though they do not explicitly know the meanings of those words. lemmatization is a more sophisticated approach that uses complex morphological analysis to arrive at the root word, or lemma. parsing is the process of extracting the syntactic information of a sentence based on the rules of a formal grammar. based on the type of grammar applied, the process can be classified broadly into constituency and dependency parsing. constituency parsing, based on context-free grammar, involves dividing a sentence into sub-phrases, or constituents, that belong to a specific grammar category, such as noun phrases or verb phrases.
dependency parsing defines the syntax of a sentence not in terms of constituents but in terms of the dependencies between the words in a sentence. the relationship between words is depicted as a dependency tree, where words are represented as nodes and the dependencies between them as edges. part-of-speech (pos) tagging, or grammatical tagging, is the process of assigning a grammatical classification, like noun, verb, adjective, etc., to words in a sentence. automatic tagging can be broadly classified as rule-based, transformation-based, or stochastic pos tagging. rule-based tagging uses a dictionary, as well as a small set of rules derived from the formal syntax of the language, to assign pos tags. transformation-based tagging, or brill tagging, leverages transformation-based learning for automatic tagging. stochastic refers to any model that uses frequency or probability, e.g. word frequency or tag sequence probability, for automatic pos tagging. named entity recognition (ner) is an nlp subtask that is used to detect, extract, and categorize named entities, including names, organizations, locations, themes, topics, monetary values, etc., from large volumes of unstructured data. there are several approaches to ner, including rule-based systems, statistical models, dictionary-based systems, ml-based systems, and hybrid models. these are just a few examples of some of the most common techniques used in nlu. there are several other techniques, for instance word sense disambiguation, semantic role labeling, and semantic parsing, that focus on different levels of semantic abstraction. nlp/nlu in biomedical research nlp/nlu technologies represent a strategic fit for biomedical research with its vast volumes of unstructured data — 3,000-5,000 papers published each day, clinical text data from ehrs, diagnostic reports, medical notes, lab data, etc., and non-standardized digital real-world data.
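several of the techniques surveyed above can be combined in a few lines. here is a toy dictionary-based ner sketch over naively tokenized biomedical text; the entity dictionary is invented for illustration, and a real system would use curated ontologies and far more robust tokenization:

```python
# toy entity dictionary mapping surface forms to entity types
ENTITY_DICT = {
    "brca1": "gene",
    "aspirin": "drug",
    "boston": "location",
}

def dictionary_ner(text):
    """tag tokens found in a curated dictionary: minimal dictionary-based ner."""
    entities = []
    for token in text.lower().replace(",", " ").split():  # naive tokenization
        if token in ENTITY_DICT:
            entities.append((token, ENTITY_DICT[token]))
    return entities

print(dictionary_ner("BRCA1 mutations were studied in Boston"))
# [('brca1', 'gene'), ('boston', 'location')]
```

even this crude lookup shows why dictionary-based systems are attractive for biomedical text: much of the domain vocabulary (genes, drugs, diseases) is already catalogued in structured resources.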
nlp-enabled text mining has emerged as an effective and scalable solution for extracting biomedical entity relations from vast volumes of scientific literature. techniques like named entity recognition (ner) are widely used in relation extraction tasks in biomedical research, with conventional named entities, such as names, organizations, and locations, substituted with gene sequences, proteins, biological processes and pathways, drug targets, etc. the unique vocabulary of biomedical research has necessitated the development of specialized, domain-specific bionlp frameworks. at the same time, the capabilities of nlu algorithms have been extended to the language of proteins and that of chemistry and biology itself. a 2021 article detailed the conceptual similarities between proteins and language that make them ideal for nlp analysis. more recently, an nlp model was trained to correlate amino acid sequences from the uniprot database with english language words, phrases, and sentences used to describe protein function, annotating over 40 million proteins. researchers have also developed an interpretable and generalizable drug-target interaction model inspired by sentence classification techniques to extract relational information from drug-target biochemical sentences. large neural language models and transformer-based language models are opening up transformative opportunities for biomedical nlp applications across a range of bioinformatics fields including sequence analysis, genome analysis, multi-omics, spatial transcriptomics, and drug discovery. most importantly, nlp technologies have helped unlock the latent value in huge volumes of unstructured data to enable more integrative, systems-level biomedical research. read more about nlp’s critical role in facilitating systems biology and ai-powered data-driven drug discovery. if you want more information on seamlessly integrating advanced bionlp frameworks into your research pipeline, please drop us a line here.
in our new lensai blog series, we explore how data itself often becomes the bottleneck in data-driven biological and biomedical research. we dive into the data-related challenges that affect the development and advancement of different research concepts and domains, such as drug discovery, and also the importance of integrating wet lab and in silico research etc. we start with systems biology, a holistic model that represents a radical departure from the conventional reductionist approach to understanding complex biological systems. biological and biomedical research in the 20th century was driven predominantly by reductionism, a pieces of life approach that seeks to understand complex biological systems as a sum of the functionalities of their individual components. now, there is definitely value in building a systems-level perspective that is based on an aggregation of component-level functionality. after all, reductionism has played a key role in elucidating the central dogmatic principles and concepts of biology. however, the limitations of this approach are hard to ignore. after all, a complex biological system, unlike, say, a bicycle, clearly has to be more than a sum of its parts. systems biology is the paradigm that defines an integrative and holistic strategy to decipher complex, hierarchical, adaptive, and dynamic biological systems across multiple components and levels of organization. complex biological systems, like those within living organisms, are much more intricate than simple objects like bicycles. unlike bicycles, these systems are not just a sum of their parts but have unique properties that emerge when all of the parts work together. systems biology is an approach that helps scientists study and understand these complex biological systems by looking at the big picture. 
by considering how all the different parts of the system interact, scientists get a better understanding of how the entire system functions as a whole, instead of only looking at individual components in isolation. inspired by ideas from the santa fe institute, systems thinking plays a crucial role in the systems biology approach. it helps researchers recognize the importance of the connections between different parts of the biological system, the influence of its surroundings, and how the system changes over time. this way, scientists can better understand health, disease, and potential treatments, leading to more effective medical therapies and diagnostic tools. the modern form of systems biology emerged in the late 1960s, and it quickly became evident that mathematics and computation would play a critical role in realizing the potential of this holistic approach. mathematical and computational modeling based on large volumes of genome-scale data would be the key to unraveling the systems-level complexity of biological phenomena. today, the availability of sophisticated computational techniques and the exponential generation of high-throughput biomedical data provide the perfect foundation for a systems approach to tackling biological complexity. but here’s where things get a bit complicated. complex biological phenomena and systems are defined by complex biological data. a data-driven systems approach requires the integrated analysis of all available complex biological data in order to identify relevant interactions and patterns of a biosystem. however, the sheer complexity of biological data poses a major challenge for the efficient data integration and curation required to generate a holistic view of complex biological systems. a quick overview of biological data complexity the james webb space telescope generates up to 57 gigabytes each day.
by comparison, one of the world’s largest genome sequencing facilities sequences dna at a rate equivalent to a human genome, roughly 140 gigabytes in size, every 3.2 minutes. and that is just genomic data, which is expected to reach exabase-scale within a decade, from just one sequencing facility. despite the continuing exponential increase of publicly available biological data, data volume is perhaps one of the more manageable complexities of biological big data. then there’s the expanding landscape of biological data, from single-cell omics data to genome-scale metabolic models (gems), that reflect the inherent complexity and heterogeneity of biological systems and vary in format and scale. data formats can also vary based on the technologies and protocols used to characterize different levels of biological organization. from a data integration perspective, there also has to be due consideration for organizing structured and unstructured data as well as multi-format data from numerous databases that specialize in specific modalities, layers, organisms, etc. over and above all this, novel complexities continue to emerge as technological advancements open up new frontiers for biological research. moving on from simple static models derived from static data, the scope of research is now expanding to characterize biological complexity along the dynamic fourth dimension of time. for instance, rather than merely integrating single-time-point omics sequence data across biological levels, the emerging framework of temporal omics compares sequence data across time in order to evaluate the temporal dynamics of biological processes. so the big question is how to integrate, standardize, and curate all this complexity into one comprehensive, contextual, scalable data matrix that solves the information integration dilemma in systems biology.
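to put the data volumes quoted above in perspective, the arithmetic works out to roughly a thousand times the telescope’s daily output (using the figures quoted in the text):

```python
genome_gb = 140          # approximate size of one human genome, per the text
minutes_per_genome = 3.2 # sequencing rate quoted for the facility
jwst_gb_per_day = 57     # james webb space telescope daily output

daily_gb = genome_gb / minutes_per_genome * 60 * 24
print(daily_gb)                       # 63000.0 gb, i.e. about 63 tb per day
print(daily_gb / jwst_gb_per_day)     # roughly 1100x the telescope's daily volume
```
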
the lensai integrated intelligence platform for systems biology the information integration dilemma (iid) refers to how the challenges of integrating, standardizing, and analyzing complex biological data have created a bottleneck in the holistic, systems-level analysis of biological complexity. currently, integrating data across diverse modalities, formats, platforms, standards, ontologies, etc., for systems biology analysis is not a trivial task. the process requires multiple tools and techniques for different tasks such as harmonizing and standardizing data formats, preprocessing, integration, and fusion. moreover, there is no single analytical framework that scales across the complex heterogeneity and diversity of biological data. the lensai integrated intelligence platform addresses these shortcomings of conventional solutions by incorporating the key organizing principles of intelligent data management and smart big data systems. one, the platform leverages ai-powered intelligent automation to organize and index all biological data, both structured and unstructured. hyft®, a proprietary framework that leverages advanced machine learning (ml) and natural language processing (nlp) technologies, seamlessly integrates and organizes all biological and textual data into a unified multidimensional network of data objects. the network currently comprises over 660 million data objects with multiple layers of information about sequence, syntax, and protein structure. plus, hyft® enables researchers to integrate proprietary research into the existing data network. this network is continuously updated with new data, metadata, relationships, and links, ensuring that the lensai data biosphere is always current. two, smart big data is not just about the number of data objects but also about latent relationships between those data sets.
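such latent relationships are commonly represented as subject-predicate-object triples. the following minimal sketch shows the idea with a handful of invented example triples; it is a toy illustration, not lensai’s actual data model:

```python
# a tiny knowledge graph: a list of (subject, predicate, object) triples
TRIPLES = [
    ("tp53", "encodes", "p53"),
    ("p53", "regulates", "cell cycle"),
    ("p53", "associated_with", "li-fraumeni syndrome"),
]

def neighbors(entity):
    """return every entity directly linked to the given one, in either direction."""
    linked = set()
    for s, p, o in TRIPLES:
        if s == entity:
            linked.add(o)
        elif o == entity:
            linked.add(s)
    return linked

print(neighbors("p53"))
```

even at this scale, the structure makes relationships first-class data: a query walks edges rather than re-parsing documents, which is what makes billions of cross-data relationships navigable.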
the lensai data biosphere is further augmented by a knowledge graph that currently maps over 25 billion cross-data relationships and makes it easier to visualize the interrelatedness of different entities. this visual relationship map is continuously updated with contextual biological information to create a constantly expanding knowledge resource. now that we have an organized, high-quality, contextualized data catalog, the next step is to provide comprehensive search and access capabilities that empower users to curate, customize, and organize data sets to specific research requirements. for instance, the computational modeling of biological systems could follow two broad research directions: bottom-up theory-driven modeling, based on contextual links between model terms and known mechanisms of a biological system, or top-down data-driven modeling, where relationships between different variables in biological systems are extracted from large volumes of data without prior knowledge of underlying mechanisms. so, an intelligent data catalog must enable even non-technical users to organize and manipulate data in a way that best serves their research interests. multiscale data integration with the lensai platform biological systems operate across multiple and diverse spatiotemporal scales, with each represented by datasets with very diverse modalities. the systems biology approach requires the concurrent integration of all of these multimodal datasets into one unified analytical framework in order to obtain an accurate, systems-level simulation of biological complexity. however, there are currently no bioinformatics frameworks that facilitate the multiscale integration of vast volumes of complex, heterogeneous, system-wide biological data. but mindwalk’s patented hyft® technology and lensai platform enable true multiscale data unification — including syntactical (sequence) data, 3d structural data, unstructured scientific information (e.g.
scientific literature), etc. — into one integrated, ai-powered analytical framework. by completely eliminating the friction in the integration of complex biological data, lensai shifts the paradigm in data-driven biological research.
identifying and validating optimal biological targets is a critical first step in drug discovery, with a cascading downstream impact on late-stage trials, efficacy, safety, and clinical performance. traditionally, this process required the manual investigation of biomedical data to establish target-disease associations and to assess efficacy, safety, and clinical/commercial potential. however, the exponential growth in high-throughput data on a range of putative targets, including proteins, metabolites, dnas, rnas, etc., has led to the increasing use of in silico, or computer-aided drug design (cadd), methods to identify bioactive compounds and predict binding affinities at scale. today, in silico techniques are evolving at the same pace as in vitro technologies, such as dna-encoded libraries, and have proven to be critical in dealing with the scale, diversity, and complexity of modern chemical libraries. cadd techniques encompass structure-based drug design (sbdd) and ligand-based drug design (lbdd) strategies, depending on the availability of the three-dimensional biological structure of the target of interest. some of the most common applications for these techniques include in silico structure prediction, refinement, modelling, and target validation. they are widely utilised across four phases: identifying hits with virtual screening (vs), investigating the specificity of selected hits through molecular docking, predicting admet properties, and further molecular optimisation of hits/leads. as drug discovery becomes increasingly computational and data-driven, it is becoming common practice to combine cadd with advanced technologies like artificial intelligence (ai), machine learning (ml), and deep learning (dl) to cost- and time-efficiently convert biological big data into pharmaceutical value.
in this article, we’ll take a closer look at how ai/ml/dl technologies are transforming three of the most widely used in silico techniques in drug discovery, virtual screening (vs), molecular docking and molecular dynamics (md) simulation. virtual screening virtual screening (vs), a computational approach to screening large libraries for hits, when integrated with an experimental approach, such as high-throughput screening, can significantly enhance the speed, accuracy and productivity of drug discovery. in silico screening techniques are classified as ligand-based vs (lbvs) and structure-based vs (sbvs). these distinct approaches can be combined, for instance, to identify active compounds using ligand-based techniques and follow through with structure-based methods to find favourable candidates. however, there are some shortcomings to cadd-based vs technologies with biochemical assays typically confirming desired bioactivity in only 12% of the top-scoring compounds derived from standard vs applications. over the past two decades, the application of ai/ml tools to virtual screening has evolved considerably with techniques like multi-objective optimization and ensemble-based virtual screening being used to enhance the efficiency, accuracy and speed of conventional sbvs and lbvs methodologies. studies show that deep learning (dl) techniques perform significantly better than ml algorithms across a range of tasks including target prediction, admet properties prediction and virtual screening. dl-based vs frameworks have proven to be more effective at extracting high-order molecule structure representations, accurately classifying active and inactive compounds, and enabling ultra-high-throughput screening. the integration of quantum computing is expected to be the next inflexion point for vs, with studies demonstrating that quantum classifiers can significantly outperform classical ml/dl-based vs. 
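ligand-based vs often ranks library compounds by fingerprint similarity to a known active. the sketch below uses tanimoto similarity on toy fingerprints (sets of "on" bit indices); the compounds and bit patterns are invented for illustration, and real pipelines would use cheminformatics fingerprints such as those produced by rdkit:

```python
def tanimoto(a, b):
    """tanimoto similarity between two binary fingerprints given as sets of bit indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

query = {1, 4, 7, 9}                      # fingerprint of a known active compound
library = {
    "cmpd_a": {1, 4, 7, 8},
    "cmpd_b": {2, 3, 5},
    "cmpd_c": {1, 4, 7, 9, 12},
}
ranked = sorted(library, key=lambda c: tanimoto(query, library[c]), reverse=True)
print(ranked)  # cmpd_c first: it shares the most bits with the query
```

this similarity-ranking step is the conceptual core of lbvs; structure-based methods would then take the top-ranked hits forward into docking.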
molecular docking molecular docking, a widely used method in sbvs for retrieving active compounds from large databases, typically relies on a scoring function to estimate binding affinities between receptors and ligands. this docking-scoring approach is an efficient way to quickly evaluate protein–ligand interactions (plis) based on a ranking of putative ligand binding poses that is indicative of binding affinity. the development of scoring functions (sfs) for binding affinity prediction has been evolving since the 1990s and today includes classical sfs, such as physics-, regression-, and knowledge-based methods, and data-driven models, such as ml- and dl-based sfs. however, accuracy is a key challenge with high-throughput approaches, as binding affinity predictions are derived from a static snapshot of the protein-ligand binding state rather than the complex dynamics of the ensemble. ml-based sfs perform significantly better than classical sfs in terms of comparative assessment of scoring functions (casf) benchmarks and their ability to learn from pli data and deal with non-linear relationships. but the predictions are based on approximations and data set biases rather than the interatomic dynamics that guide binding. the performance of ml-based sfs also depends on the similarity of targets across the training set and the test set, which makes generalisation a challenge. dl-based sfs have demonstrated significant advantages, including feature generation automation and the ability to capture complex binding interactions, over traditional ml methods. recently, a team of mit researchers took the novel approach of framing molecular docking as a generative modelling problem to develop diffdock, a new molecular docking model that delivers a much higher success rate (38%) than state-of-the-art traditional docking (23%) and deep learning (20%) methods.
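a regression-based scoring function of the kind described above can be sketched by fitting weights that map interaction features to measured affinities. the features (hydrogen bonds, hydrophobic contacts, rotatable bonds) and affinity values below are invented for illustration; real sfs are trained on thousands of experimentally characterized complexes:

```python
import numpy as np

# toy training data: interaction features per protein-ligand complex
# columns: [hydrogen bonds, hydrophobic contacts, rotatable bonds]
X = np.array([[3, 10, 4], [1, 4, 8], [5, 14, 2], [2, 7, 6]], dtype=float)
y = np.array([-8.1, -4.2, -9.9, -6.0])  # measured affinities (more negative = tighter)

# fit regression-based scoring weights by least squares
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def score(features):
    """predict binding affinity for a docked pose as a weighted sum of feature counts."""
    return float(np.array(features) @ w)

print(score([4, 12, 3]))  # predicted affinity for a new docked pose
```

the limitation the text notes is visible here: the model scores a single static feature snapshot of each pose, with no view of the conformational ensemble that actually governs binding.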
molecular dynamics simulations since molecular docking methods only provide an initial static protein–ligand complex, molecular dynamics (md) simulations have become the go-to approach for information on the dynamics of the target. md simulations capture changes at the molecular and atomistic levels and play a critical role in elucidating intermolecular interactions that are essential to assess the stability of a protein-ligand complex. there are, however, still several issues with this approach including accuracy-versus-efficiency trade-offs, computational complexity, large timescale requirements and errors due to the underlying force fields. ml techniques have helped address many of these challenges and have proven vital to the development of md simulations for three reasons: objectivity in model selection, enhanced interpretability due to the statistically coherent representation of structure–function relationships, and the capability to generate quantitative, empirically-verifiable models for biological processes. deep learning methods are now emerging as an effective solution to dealing with the terabytes of dynamic biomolecular big data generated by md simulations with other applications including the prediction of quantum-mechanical energies and forces, extraction of free energy surfaces and kinetics, and coarse-grained molecular dynamics. shifting the in silico paradigm with ai a combination of in silico models and experimental approaches has become a central component of early-stage drug discovery, facilitating the faster generation of lead compounds at lower costs and with higher efficiency and accuracy. advanced ai technologies are a key driver of disruption in in silico drug discovery and have helped address some of the limitations and challenges of conventional in silico approaches. at the same time, they are also shifting the paradigm with their capability to auto-generate novel drug-like molecules from scratch. 
by one estimate, ai/ml in early-stage drug development could result in an additional 50 novel therapies, a $50 billion market, over a 10-year period.
the first blog in our series on data, information, and knowledge management in the life sciences provided an overview of some of the most commonly used data and information frameworks today. in this second blog, we will take a quick look at the data-information-knowledge continuum and the importance of creating a unified data + information architecture that can support scalable ai deployments. in 2000, a seminal knowledge management article, excerpted from the book working knowledge: how organizations manage what they know, noted that despite the distinction between the terms data, information, and knowledge being just a matter of degree, understanding that distinction could be key to organizational success and failure. the distinction itself is quite straightforward: data refers to a set of discrete, objective facts with little intrinsic relevance or purpose, providing no sustainable basis for action. data endowed with relevance and purpose becomes information that can influence judgment and behavior. and knowledge, which includes higher-order concepts such as wisdom and insight, is derived from information and enables decisions and actions. today, in the age of big data, ai (artificial intelligence), and the data-driven enterprise, the exponential increase in data volume and complexity has resulted in a rise in information gaps due to the inability to turn raw data into actionable information at scale. and the bigger the pile of data, the greater the prevalence of valuable but not-yet-useful data. the information gap in life sciences the overwhelming nature of life sciences data, typically expressed in exabase-scales, exabytes, zettabytes, or even yottabytes, and the imperative to convert this data deluge into information have resulted in the industry channeling nearly half of its technology investments into three analytics-related technologies — applied ai, industrialized ml (machine learning), and cloud and edge computing.
at the same time, the key challenges in scaling analytics, according to life sciences leaders, were the lack of high-quality data sources and data integration. data integration is a key component of a successful enterprise information management (eim) strategy. however, data professionals spend an estimated 80 percent of their time on data preparation, significantly slowing down the data-insight-action journey. creating the right information architecture (ia), therefore, will be critical to implementing, operationalizing, and scaling ai. or, as it’s commonly articulated: no ai without ia. the right ia for ai information and data architectures share a symbiotic relationship: the former accounts for organization structure, business strategy, and user information requirements, while the latter provides the framework required to process data into information. together, they are the blueprints for an enterprise’s approach to designing, implementing, and managing a data strategy. the fundamental reasoning of the no-ai-without-ia theorem is that ai requires machine learning, machine learning requires analytics, and analytics requires the right ia. not accidental ia (a patchwork of piecemeal efforts to architect information) or traditional ia (a framework designed for legacy technology), but a modern and open ia that creates a trusted, enterprise-level foundation to deploy and operationalize sustainable ai/ml across the organization. ai information architecture can be defined in terms of six layers: data sources, source data access, data preparation and quality, analytics and ai, deployment and operationalization, and information governance and information catalog. some of the key capabilities of this architecture include support for the exchange of insights between ai models across it platforms, business systems, and traditional reporting tools.
other key capabilities include empowering users to develop and manage new ai artifacts, managing the cataloging and governance of these artifacts, promoting collaboration, and ensuring model accuracy and precision across the ai lifecycle. an ia-first approach to operationalizing ai at scale the ia-first approach to ai starts with creating a solid data foundation that facilitates the collection and storage of raw data across different perspectives and paradigms, including batch and streaming data, structured and unstructured data, transactional and analytical data, etc. for life sciences companies, a modern ia will address the top hurdles in scaling ai, i.e., the lack of high-quality data sources, time wasted on data preparation, and data integration. creating a unified architectural foundation to deal with life sciences big data will have a transformative impact on all downstream analytics. the next step is to make all this data business-ready, and data governance plays a critical role in building the trust and transparency required to operationalize ai. in the life sciences, this includes ensuring that all data is properly protected and stored from acquisition to archival, ensuring the quality of data and metadata, engineering data for consumption, and creating standards and policies for data access and sharing. a unified data catalog that conforms to the information architecture will be key to enabling data management, data governance, and query optimization at scale. once the data is business-ready, organizations can turn their focus to executing the full ai lifecycle. the availability of trusted data opens up additional opportunities for prediction, automation, and optimization. in addition, people, processes, tools, and culture will also play a key role in scaling ai.
the first step is to adopt mlops to standardize and streamline the ml lifecycle and create a unified framework for ai development and operationalization. organizations must then choose the right tools and platforms, from a highly fragmented ecosystem, to build robust, repeatable workflows with an emphasis on collaboration, speed, and safety. scaling ai will then require the creation of multidisciplinary teams, organized either as a center of excellence (coe) with management and governance oversight, as decentralized product, function, or business unit teams with domain experts, or as a hybrid. and finally, culture is often the biggest impediment to ai adoption at scale and therefore needs the right investments in ai-ready cultural characteristics. however, deployment activity alone is no guarantee of results, with deloitte reporting that, despite accelerating full-scale deployments, outcomes are still lagging. the key to successfully scaling ai is to correlate technical performance with business kpis and outcomes. successful at-scale ai deployments are more likely to have adopted leading practices, like enterprise-wide platforms for ai model and application development, documented data governance and mlops procedures, and roi metrics for deployed models and applications. such deployments also deliver the strongest ai outcomes, measured in revenue-generating results such as expansion into new segments and markets, creation of new products/services, and implementation of new business/service models. the success of ai depends on ia one contemporary interpretation of conway's law argues that the outcomes delivered by ai/ml deployments can only be as good as their underlying enterprise information architecture. the characteristics and limitations of, say, a fragmented or legacy ia will inevitably be reflected in the performance and value of enterprise ai.
a modern, open, and flexible enterprise information architecture is therefore crucial for the successful deployment of scalable, high-outcome, future-proof ai. and this architecture will be defined by a solid data foundation to transform and integrate all data, an information architecture that ensures data quality and data governance and a unified framework to standardize and streamline the ai/ml lifecycle and enable ai development and operationalization at scale. in the next blog in this series, we will look at how data architectures have evolved over time, discuss different approaches, such as etl, elt, lambda, kappa, data mesh, etc., define some hyped concepts like ‘big data’ and ‘data lakes’ and correlate all this to the context of drug discovery and development. read part 1 of our data management series: from fair principles to holistic data management in life sciences read part 3 of our data management series: ai-powered data integration and management with data fabric
in january this year, the u.s. national institutes of health (nih), the largest public funder of biomedical research in the world, implemented an updated data management and sharing (dms) policy that will require all grant applications to be supported by, and comply with, a detailed dms plan. the key requirements of the dms plan include data/metadata volume and types along with their applied standards, tools, software, and code required for data access/analysis, and storage, access, distribution, and reuse protocols. a pre-implementation nih-sponsored multidisciplinary workshop to discuss this cultural shift in data management and sharing acknowledged that the policy was perceived more as a tax and an obligation than valued work within the research community. researchers also feel overwhelmed by the lack of resources and expertise that will be required to comply with these new data management norms. biomedical research also has a unique set of data management challenges that have to be addressed in order to maximize the value of vast volumes of complex, heterogeneous, siloed, interdisciplinary, and highly regulated data. plus, data management in life sciences is often seen as being a slow and costly process that is disruptive to conventional r&d workflows and with no direct roi. however, good data management can deliver cascading benefits for life sciences research – both for individual teams and the community. for research teams, effective data management standardizes data, code, and documentation. in turn, this enhances data quality, enables ai-driven workflows, and increases research efficiency. across the broader community, it lays the foundation for open science, enhances reusability and reproducibility, and ensures research integrity. most importantly, it is possible to create a strong and collaborative data management foundation by implementing established methodologies that do not disrupt current research processes or require reinventing the wheel. 
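the dms plan requirements listed above lend themselves to a simple checklist. here's a minimal, purely illustrative python sketch: the structure, field names, and example values (including the repository url) are our own inventions, not an nih-prescribed format.

```python
# illustrative sketch of the key elements an nih dms plan must cover,
# as listed above. the structure and field names are our own; the nih
# does not prescribe a machine-readable format like this.
dms_plan = {
    "data": {
        "types": ["genomic sequences", "imaging", "clinical measures"],
        "estimated_volume": "2 TB",
        "standards": ["fastq", "dicom"],
    },
    "tooling": {
        "software": ["bwa", "samtools"],  # tools needed for access/analysis
        "code": "https://example.org/lab/analysis-pipeline",  # hypothetical url
    },
    "sharing": {
        "storage": "institutional repository",
        "access": "controlled, via data use agreement",
        "distribution": "public release after embargo",
        "reuse": "cc-by-4.0",
    },
}

def plan_is_complete(plan: dict) -> bool:
    """check that the three top-level areas from the policy are present."""
    return {"data", "tooling", "sharing"} <= plan.keys()

assert plan_is_complete(dms_plan)
```

the value of even a toy structure like this is that completeness can be checked mechanically before submission, rather than discovered as a gap during review.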
data management frameworks for life sciences the key dms requirements defined by the nih encompass the remit of data management. over the years, a range of data management frameworks has emerged from various disciplines that are broadly applicable to life sciences research. however, the fair principles — findability, accessibility, interoperability, and reusability — published in 2016 were the first to codify the discipline of data management for scientific research. these principles focus on the unique challenges of scientific data from the perspective of the scientist rather than that of it, and are applied to all components of the research process, including data, algorithms, tools, workflows, and pipelines. it is widely acknowledged that implementing the fair principles can improve the productivity of biopharma and other life sciences r&d. the challenge, however, is that these high-level principles provide no specific technology, standard, or implementation recommendations. instead, they merely provide the benchmark against which to evaluate the fairness of implementation choices. currently, frameworks are in development to coordinate fairification among all stakeholders in order to maximize interoperability, promote the reuse of existing resources, and accelerate the convergence of standards and technologies in fair implementations. although a move in the right direction, note that the fair principles only cover a small part of data management. it is necessary to complement the fair principles with best practices from other, more holistic frameworks (data management or related), so here are a few worth looking at. data management body of knowledge (dmbok) dmbok from dama (data management association) international, widely considered the gold standard of data management frameworks, was first published in 2009. it includes three distinct components – the dama wheel, the environmental factors hexagon, and the knowledge area context diagram.
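before going deeper into dmbok, it's worth making the fair principles concrete. below is a minimal sketch of screening a dataset record against fair-style metadata requirements; the field names are illustrative and not drawn from any official fair implementation.

```python
# hypothetical sketch of a fair-style metadata check for a dataset.
# field names are illustrative, not from any specific standard.
REQUIRED_FAIR_FIELDS = {
    "identifier",       # findability: globally unique, persistent id
    "description",      # findability: rich, searchable metadata
    "access_protocol",  # accessibility: open, standardized retrieval
    "format",           # interoperability: shared vocabularies/formats
    "license",          # reusability: clear usage terms
    "provenance",       # reusability: where the data came from
}

def fair_gaps(record: dict) -> set:
    """return the fair metadata fields missing from a dataset record."""
    return REQUIRED_FAIR_FIELDS - record.keys()

record = {
    "identifier": "doi:10.1234/example",   # hypothetical doi
    "description": "rna-seq counts, 48 samples",
    "format": "csv",
    "license": "cc-by-4.0",
}
# this record is missing its access protocol and provenance fields
print(fair_gaps(record))
```

note how little the principles themselves dictate here: which fields count, and what values are acceptable, are exactly the implementation choices the high-level principles leave open.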
image source: dama international the dama wheel defines 11 key knowledge areas that together constitute a mature data management strategy, the environmental factors hexagon model provides the foundation for describing a knowledge area, with the context diagram framing its scope. dmbok2, released in 2017, includes some key changes. data governance, for instance, is no longer just a grand unifying theory, but delivers contextual relevance by defining specific governance activities and environmental factors relevant to each knowledge area. more broadly though, the dama-dmbok guide continues to serve as a comprehensive compilation of widely accepted principles and concepts that can help standardize activities, processes, and best practices, roles and responsibilities, deliverables and metrics, and maturity assessment. notwithstanding its widespread popularity, the dmbok framework does have certain challenges. for instance, it has been pointed out that the framework’s emphasis on providing “...the context for work carried out by data management professionals…” overlooks all the non-data professionals working with data analytics today. moreover, even though the model defines all the interrelated knowledge areas of data management, an integrated implementation of the entire framework from scratch would still require reinventing the wheel. the open group architecture framework (togaf) the togaf standard is a widely-used framework for enterprise architecture developed and maintained by members of the open group. the framework classifies enterprise architecture into four primary domains — business, data, application, and technology — spanned by other domains such as security, governance, etc. data architecture is just one component in the framework’s approach to designing and implementing enterprise architecture, and togaf offers data models, architectures, techniques, best practices, and governance principles. 
source: the open group at the core of the togaf standard is the togaf architecture development method (adm) which describes the approach to developing and managing the lifecycle of an enterprise architecture. this includes togaf standard elements as well as other architectural assets that are available to meet enterprise requirements. there are two more key parts to the togaf – the enterprise continuum and architecture repository. the enterprise continuum supports adm execution by providing the framework and context to leverage relevant architecture assets, including architecture descriptions, models, and patterns sourced from enterprise repositories and other available industry models and standards. the architecture repository provides reference architectures, models, and patterns that have previously been used within the organization along with architecture development work-in-progress. a key philosophy of the togaf framework is to provide a fully-featured core enterprise architecture metamodel that is broad enough to ensure out-of-the-box applicability across different contexts. at the same time, the open architecture standards enable users to apply a number of optional extension modules, for data, services, governance etc., to customize the metamodel to specific organizational needs. source: the open group this emphasis on providing a universal scaffolding that is uniquely customizable started with version 9 of the standard. the 10th edition, launched in 2022, is designed to embrace this dichotomy of universal concepts and granular configuration with a refreshed modular structure to streamline implementation across architecture styles and expanded guidance and “how-to” materials to simplify the adoption of best practices across a broad range of use cases. 
it infrastructure library (itil) the itil framework was developed in the 1980s to address the lack of quality in it services procured by the british government as a methodology to achieve better quality at a lower cost. the framework, currently administered and updated by axelos, defines a 5-stage service lifecycle comprising service strategy, service design, service transition, service operations, and continuous service improvement. itil 4 continues to build on the core guidance of previous versions to deliver an adaptable framework that supports traditional service management activities, aligns to transformative cloud, automation, and ai technologies, and works seamlessly with devops, lean, and agile methodologies. although this is a framework for service management, it contains a number of interesting concepts, processes and metrics that relate to data management. one example is the use of configuration management databases (cmdbs). cmdbs are a fundamental component of itil. these databases are used to store, manage and track information about individual configuration items (cis), i.e., any asset or component involved in the delivery of it services. the information recorded about each ci’s attributes, dependencies, and configuration changes allows it teams to understand how components are connected across the infrastructure and focus on managing these connections to create more efficient processes. control objectives for information and related technology (cobit) cobit is a comprehensive framework, created by isaca (information systems audit and control association) and first released in 1996. it defines the components and design factors required for implementing a management and governance system for enterprise it. 
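the itil cmdb described above is essentially a dependency graph of configuration items. a minimal sketch, with made-up cis and an impact-analysis helper of our own:

```python
# minimal sketch of a cmdb as a dependency graph of configuration items
# (cis). the ci names, attributes, and impact helper are illustrative.
cmdb = {
    "lims-app":   {"type": "application", "depends_on": ["app-server"]},
    "app-server": {"type": "server",      "depends_on": ["db-server"]},
    "db-server":  {"type": "server",      "depends_on": []},
}

def impacted_by(ci: str) -> set:
    """return every ci that directly or indirectly depends on `ci`."""
    impacted = set()
    for name, item in cmdb.items():
        if ci in item["depends_on"]:
            impacted.add(name)
            impacted |= impacted_by(name)  # follow the chain upwards
    return impacted

# taking db-server down affects the whole chain above it
print(impacted_by("db-server"))
```

this is the kind of connection-tracking the text refers to: recording each ci's dependencies makes the blast radius of any change computable rather than tribal knowledge.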
governance framework principles (source: isaca). governance system design workflow (source: isaca). the latest release has shifted from iso/iec 33000 to the cmmi performance management scheme, and a new governance system design workflow has been adopted to streamline implementation. cobit, being a governance framework, contains interesting data governance-related metrics, key performance indicators (kpis), and processes for following up on quality and compliance, which is an essential part of good data management. these are just a few examples of the different approaches to information and data management that are currently in use across industries. in fact, there are many more to choose from, including the gartner enterprise information management framework, zachman framework, eckerson, pwc enterprise data governance framework, dcam, sas data governance framework, dgi data governance framework, and the list goes on. taking steps towards better data management the fair principles certainly provide a good starting point for getting the basics of your data management right. in this blog, we've shown that fair alone is not enough and surveyed the plethora of widely used and proven frameworks out there. they are invaluable for avoiding reinventing the wheel and can accelerate your journey towards better data management. at biostrand, we have taken useful elements from all of these standards and frameworks to arrive at a mature data management strategy that guides the implementation of all our services. in the following blog in this series, we'll look at how a modern information architecture could set the foundation for enterprise-centric ai deployments. stay tuned. read part 2 of our data management series: creating a unified data + information architecture for scalable ai read part 3 of our data management series: ai-powered data integration and management with data fabric
reproducibility, getting the same results using the original data and analysis strategy, is fundamental to valid, credible, and actionable scientific research. without reproducibility, replicability, the ability to confirm research results within different data contexts, becomes moot. a 2016 survey of researchers revealed a consensus that there was a crisis of reproducibility, with most researchers reporting that they had failed to reproduce not only the experiments of other scientists (70%) but even their own (>50%). in biomedical research, reproducibility testing is still extremely limited, with some attempts to do so failing to comprehensively or conclusively validate reproducibility and replicability. over the years, there have been several efforts to assess and improve reproducibility in biomedical research. however, there is a new front opening in the reproducibility crisis, this time in ml-based science. according to this study, the increasing adoption of complex ml models is creating widespread data leakage, resulting in "severe reproducibility failures," "wildly overoptimistic conclusions," and the inability to validate the superior performance of ml models over conventional statistical models. pharmaceutical companies have generally been cautious about accepting published results for a number of reasons, including the lack of scientifically reproducible data. an inability to reproduce and replicate preclinical studies can adversely impact drug development and has also been linked to drug and clinical trial failures. as drug development enters its latest innovation cycle, powered by computational in silico approaches and advanced ai-cadd integrations, reproducibility represents a significant obstacle to converting biomedical research into real-world results.
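as a concrete illustration of reproducibility in the sense defined above (same data and analysis strategy, same results), here is a toy sketch that fixes randomness and records the computing environment alongside the output. the "analysis" itself is a stand-in.

```python
# toy sketch of two basic reproducibility habits: fixing randomness and
# recording the computing environment alongside the result. the analysis
# here is a placeholder, not a real scientific workflow.
import json
import platform
import random
import sys

def run_analysis(seed: int) -> dict:
    random.seed(seed)  # same seed -> same pseudo-random draws
    result = sum(random.random() for _ in range(1000)) / 1000
    return {
        "result": round(result, 6),
        "seed": seed,
        # environment metadata stored with the output, so the run
        # can be re-created and audited later
        "python": sys.version.split()[0],
        "platform": platform.system(),
    }

a = run_analysis(seed=42)
b = run_analysis(seed=42)
assert a["result"] == b["result"]  # identical inputs, identical outputs
print(json.dumps(a, indent=2))
```

in real pipelines the same idea extends to pinned package versions, container images, and archived input data, but the principle is the one shown: every run carries enough metadata to be repeated exactly.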
reproducibility in in silico drug discovery the increasingly computational nature of modern scientific research has already prompted a significant shift, with some journals incentivizing authors and providing badges for reproducible research papers. many scientific publications also mandate the publication of all relevant research resources, including code and data. in 2020, elife launched executable research articles (eras) that allowed authors to add live code blocks and computed outputs to create computationally reproducible publications. however, creating a robust reproducibility framework to sustain in silico drug discovery would require more transformative developments across three key dimensions: infrastructure/incentives for reproducibility in computational biology, reproducible ecosystems in research, and reproducible data management. reproducible computational biology this approach to industry-wide transformation envisions a fundamental cultural shift with reproducibility as the fulcrum for all decision-making in biomedical research. the focus is on four key domains. first, creating courses and workshops to expose biomedical students to specific computational skills and real-world biological data analysis problems and impart the skills required to produce reproducible research. second, promoting truly open data sharing, along with all relevant metadata, to encourage larger-scale data reuse. third, leveraging platforms, workflows, and tools that support the open data/code model of reproducible research. and fourth, promoting, incentivizing, and enforcing reproducibility by adopting fair principles and mandating source code availability. computational reproducibility ecosystem a reproducible ecosystem should enable data and code to be seamlessly archived, shared, and used across multiple projects. computational biologists today have access to a broad range of open-source and commercial resources to ensure their ecosystem generates reproducible research.
for instance, data can now be shared across several recognized, domain and discipline-specific public data depositories such as pubchem, cdd vault, etc. public and private code repositories, such as github and gitlab, allow researchers to submit and share code with researchers around the world. and then there are computational reproducibility platforms like code ocean that enable researchers to share, discover, and run code. reproducible data management as per a recent data management and sharing (dms) policy issued by the nih, all applications for funding will have to be accompanied by a dms plan detailing the strategy and budget to manage and share research data. sharing scientific data, the nih points out, accelerates biomedical research discovery through validating research, increasing data access, and promoting data reuse. effective data management is critical to reproducibility and creating a formal data management plan prior to the commencement of a research project helps clarify two key facets of the research: one, key information about experiments, workflows, types, and volumes of data generated, and two, research output format, metadata, storage, and access and sharing policies. the next critical step towards reproducibility is having the right systems to document the process, including data/metadata, methods and code, and version control. for instance, reproducibility in in silico analyses relies extensively on metadata to define scientific concepts as well as the computing environment. in addition, metadata also plays a major role in making data fair. it is therefore important to document experimental and data analysis metadata in an established standard and store it alongside research data. similarly, the ability to track and document datasets as they adapt, reorganize, extend, and evolve across the research lifecycle will be crucial to reproducibility. 
it is therefore important to version control data so that results can be traced back to the precise subset and version of data. of course, the end game for all of that has to be the sharing of data and code, which is increasingly becoming a prerequisite as well as a voluntarily accepted practice in computational biology. one survey of 188 researchers in computational biology found that those who authored papers were largely satisfied with their ability to carry out key code-sharing tasks such as ensuring good documentation and that the code was running in the correct environment. the average researcher, however, would not commit any more time, effort, or expenditure to share code. plus, there still are certain perceived barriers that need to be addressed before the public archival of biomedical research data and code becomes prevalent. the future of reproducibility in drug discovery a 2014 report from the american association for the advancement of science (aaas) estimated that the u.s. alone spent approximately $28 billion yearly on irreproducible preclinical research. in the future, a set of blockchain-based frameworks may well enable the automated verification of the entire research process. meanwhile, in silico drug discovery has emerged as one of the maturing innovation areas in the pharmaceutical industry. the alliance between pharmaceutical companies and research-intensive universities has been a key component in de-risking drug discovery and enhancing its clinical and commercial success. reproducibility-related improvements and innovations will help move this alliance to a data-driven, ai/ml-based, in silico model of drug discovery.
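the data version-control point made in this piece can be as simple as content addressing: hash the dataset bytes so that every result can cite the exact version of the data it was computed from. a minimal sketch, where the short-id convention loosely mirrors abbreviated git hashes:

```python
# sketch of content-addressed dataset versioning: any change to the data
# yields a new version id, so results can be traced to the exact bytes
# they were computed from, the way git does for code.
import hashlib

def dataset_version(rows: list[str]) -> str:
    """derive a stable version id from dataset content."""
    digest = hashlib.sha256("\n".join(rows).encode()).hexdigest()
    return digest[:12]  # short id, like an abbreviated git hash

v1 = dataset_version(["sample,count", "s1,10", "s2,12"])
v2 = dataset_version(["sample,count", "s1,10", "s2,13"])  # one cell edited
assert v1 != v2  # the edit is detectable; results citing v1 stay traceable
```

dedicated data-versioning tools add storage, branching, and metadata on top, but the traceability guarantee rests on exactly this property: same bytes, same id; different bytes, different id.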
in 2020, seventeen pharmaceutical companies came together in an alliance called qupharm to explore the potential of quantum computing (qc) technology in addressing real-world life science problems. the simple reason for this early enthusiasm, especially in a sector widely seen as being too slow to embrace technology, is qc's promise to solve otherwise unsolvable problems. the combination of high-performance computing (hpc) and advanced ai more or less represents the cutting edge of drug discovery today. however, the sheer scale of the drug discovery space can overwhelm even the most advanced hpc resources available today. there are an estimated 10^63 potential drug-like molecules in the universe. meanwhile, caffeine, a molecule with just 24 atoms, is roughly the upper limit of what conventional hpcs can model exhaustively. qc can help bridge this great divide between chemical diversity and conventional computing. in theory, a 300-qubit quantum computer can instantly perform as many calculations as there are atoms in the visible universe (10^78-10^82). and qc is not all theory, though much of it is still proof-of-concept. just last year, ibm launched a new 433-qubit processor, more than tripling the qubit count in just a year. this march witnessed the deployment of the first quantum computer in the world to be dedicated to healthcare, though the high-profile cafeteria installation was more to position the technology front-and-center for biomedical researchers and physicians. most pharmaceutical majors, including biogen, boehringer ingelheim, roche, pfizer, merck, and janssen, have also launched their own partnerships to explore quantum-inspired applications. if qc is the next digital frontier in pharma r&d, the combination of ai and hpc is currently the principal engine accelerating drug discovery, with in silico drug discovery emerging as a key ai innovation area.
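the numbers above are worth a quick sanity check: n qubits span 2^n basis states, and 2^300 does indeed dwarf the usual 10^78 to 10^82 estimates for the number of atoms in the visible universe.

```python
# back-of-the-envelope check of the qubit claim above: n qubits span
# 2**n basis states, and 2**300 comfortably exceeds the upper estimate
# of ~1e82 atoms in the visible universe.
n_states = 2 ** 300
atoms_upper_estimate = 10 ** 82

# counting decimal digits gives the order of magnitude
print(f"2^300 ~ 10^{len(str(n_states)) - 1}")  # 2^300 ~ 10^90
assert n_states > atoms_upper_estimate
```

so the comparison in the text holds with roughly eight orders of magnitude to spare, though note that basis states are not the same thing as useful, error-corrected computations.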
computational in silico approaches are increasingly used alongside conventional in vivo and in vitro models to address issues related to the scale, time, and cost of drug discovery. ai, hpc & in silico drug discovery according to gartner, ai is one of the top workloads driving infrastructure decisions. cloud computing provides businesses with cost-effective access to analytics, compute, and storage facilities and enables them to operationalize ai faster and with lower complexity. when it comes to hpcs, data-intensive ai workloads are increasingly being run in the cloud, a market that is growing twice as fast as on-premise hpc. from a purely economic perspective, the cloud can be more expensive than on-premise solutions for workloads that require a large hpc cluster. for some pharma majors, this alone is reason enough to avoid a purely cloud-based hpc approach and instead augment on-premise hpc platforms with the cloud for high-performance workloads. in fact, a hybrid approach seems to be the preferred option for many users with the cloud being used mainly for workload surges rather than for critical production. however, there are several ways in which running ai/ml workloads on cloud hpc systems can streamline in silico drug discovery. in silico drug discovery in the cloud the presence of multiple data silos, the proliferation of proprietary data, and the abundance of redundant/replicated data are some of the biggest challenges currently undermining drug development. at the same time, incoming data volumes are not only growing exponentially but also becoming more heterogeneous as information is generated across different modalities and biological layers. the success of computational drug discovery will depend on the industry’s ability to generate solutions that can scale across an integrated view of all this data. 
leveraging a unified data cloud as a common foundation for all data and analytics infrastructure can help streamline every stage of the data lifecycle and improve data usage, accessibility, and governance. as ai adoption in the life sciences approaches the tipping point, organizations can no longer afford to have discrete strategies for managing their data clouds and ai clouds. most companies today choose their data cloud platform based on the support available for ai/ml model execution. drug development is a constantly changing process and ai/ml-powered in silico discovery represents a transformative new opportunity in computer-aided drug discovery. meanwhile, ai-driven drug discovery is itself evolving dramatically with the emergence of computationally intensive deep learning models and methodologies that are redefining the boundaries of state-of-the-art computation. in this shifting landscape, a cloud-based platform enables life sciences companies to continuously adapt and upgrade to the latest technologies and capabilities. most importantly, a cloud-first model can help streamline the ai/ml life cycle in drug discovery. data collection for in silico drug discovery covers an extremely wide range, from sequence data to clinical data to real-world data (rwd) to unstructured data from scientific tests. the diverse, distributed nature of pharmaceutical big data often poses significant challenges to data acquisition and integration. the elasticity and scalability of cloud-based data management solutions help streamline access and integrate data more efficiently. in the data preprocessing phase, a cloud-based solution can simplify the development and deployment of end-to-end pipelines/workflows and enhance transparency, reproducibility, and scalability. in addition, several public cloud services offer big data preprocessing and analysis as a service. on-premise solutions are a common approach to model training and validation in ai-driven drug discovery. 
apart from the up-front capital expenditure and ongoing maintenance costs, this approach can also limit the scalability of the solution across an organization's entire research team, leading to long wait times and loss of productivity. a cloud platform, on the other hand, instantly provides users with just the right amount of resources needed to run their workloads. and finally, ensuring that end users have access to the ai models that have been developed is the most critical phase of the ml lifecycle. apart from the validation and versioning of models, model management and serving have to address several broader requirements, such as resilience and scalability, as well as specific factors, such as access control, privacy, auditability, and governance. most cloud services offer production-grade solutions for serving and publishing ml models. the rise of drug discovery as a service according to a 2022 market report, the increasing usage of cloud-based technologies in the global in-silico drug discovery sector is expected to drive growth at a cagr of nearly 11% between 2021 and 2030, with the saas segment forecast to grow the fastest. as per another report, the increasing adoption of cloud-based applications and services by pharmaceutical companies is expected to propel the ai in drug discovery market to $2.99 billion by 2026, at a cagr of 30%. cloud-based ai-driven drug discovery has well and truly emerged as the current state of the art in pharma r&d, at least until quantum computing and quantum ai are ready for prime time.
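as a footnote, the market figures quoted above compound like interest (end value = start value x (1 + rate)^years). a quick back-calculation of the implied 2021 baseline, which is our own arithmetic and not a figure from the report:

```python
# sanity check of the market math quoted above. cagr compounds like
# interest: end_value = start_value * (1 + rate) ** years. the $2.99b
# target and 30% rate come from the cited report; the 2021 baseline is
# our own back-calculation, not a figure from the report.
def cagr_future_value(start: float, rate: float, years: int) -> float:
    return start * (1 + rate) ** years

# working backwards from $2.99b in 2026 at 30% cagr over 5 years:
implied_2021_baseline = 2.99 / (1.30 ** 5)
print(f"implied 2021 market size: ${implied_2021_baseline:.2f}b")  # ~$0.81b
assert abs(cagr_future_value(implied_2021_baseline, 0.30, 5) - 2.99) < 1e-9
```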
the earliest reference to the potential of antibodies for treatment in humans came from an 1890 publication, authored by dr. shibasaburo kitasato and dr. emil von behring, describing how serum from animals with diphtheria and tetanus could cure those diseases in other animals. since then, significant milestones have been achieved at a continuous pace. the chemical structure of antibodies was published in 1959, the first atomic-resolution structure of an antibody fragment in 1973, and a technique to create identical antibodies, or monoclonal antibodies, emerged in 1975 to catalyze the field of modern antibody research and discovery. the first therapeutic monoclonal antibody, muromonab-cd3, was approved by the fda in 1986, with the second, which incidentally was the first approved-for-use chimeric antibody, following 8 years later. the milestone 50th approval came 21 years later, the count crossed a hundred in 2021, and according to recent data, an estimated 175 antibody therapeutics are in regulatory review or approved, with an additional 1,200 currently in clinical studies. by 2027, monoclonal antibodies are projected to drive biologics sales revenue past that of small molecules. as we head towards that seminal milestone, here's a quick back-to-basics overview of some of the key terms and concepts underlying antibody drug discovery. antibodies what are antibodies? antibodies, or immunoglobulins (ig), are specialized proteins that are part of the adaptive immune system and help recognize, neutralize, and eliminate disease-causing pathogens. they are a key component of the natural immune response to infection. how do antibodies work? antibodies recognize and bind to specific molecules (antigens) on the surfaces of pathogens such as viruses and bacteria. when exposed to a foreign substance for the first time, the body's immune cells produce antibodies that target specific proteins associated with that particular pathogen, for example, the unique spike protein of sars-cov-2.
even after the threat of infection has been neutralized, these antibodies, along with the memory counterparts of antibody-producing immune cells, remain in the body to provide a faster immune response when exposed to the same pathogen in the future. this humoral immunological memory is created by long-lived plasma cells (llpcs), which produce the protective antibodies, and memory b cells, which are quiescent cells that remember past exposure to enable a quick response. structure of antibodies source: wikimedia commons all antibodies share the same y-shaped structure consisting of two identical light chains (lcs) and two identical heavy chains (hcs). most natural systems feature one lc paired with one hc, with this heterodimer associated with another identical heterodimer to create the final form. in mammals, lcs can be either of two functionally similar types, lambda (λ) and kappa (κ), with each type consisting of two domains, a constant domain (cl) and a variable domain (vl). hcs, on the other hand, can be any of five isotypes: iga, igd, ige, igg, and igm. the number of constant (c) and variable (v) domains varies across these five isotypes, with igas, igds, and iggs having three constant domains and one variable domain, and iges and igms having four constant domains and one variable domain. each of these variations differs slightly in terms of shape — for instance, igg is a straightforward y compared to igm, which looks like five ys stacked together — as well as function within the adaptive immune system. for instance, igm provides the primary antibody response to pathogens and its secreted form enables strong binding to the pathogen, igg, the most abundant antibody, neutralizes toxins and opsonizes bacteria, and iga is the first line of defense for mucosal membranes. a clear understanding of the relationships between the structure and functions of antibodies is vital to the successful development of antibody-related therapeutics. monoclonal antibodies what are monoclonal antibodies? 
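the isotype facts above can be captured in a tiny lookup table. this is purely an illustrative sketch in python: the domain counts come straight from the text, while the table and function names are our own.

```python
# illustrative sketch only: heavy-chain domain counts per antibody isotype,
# as described in the text (a summary table, not a modeling tool)
HEAVY_CHAIN_DOMAINS = {
    "iga": {"constant": 3, "variable": 1},
    "igd": {"constant": 3, "variable": 1},
    "igg": {"constant": 3, "variable": 1},
    "ige": {"constant": 4, "variable": 1},
    "igm": {"constant": 4, "variable": 1},
}

def heavy_chain_length(isotype: str) -> int:
    """total number of domains in one heavy chain of the given isotype."""
    d = HEAVY_CHAIN_DOMAINS[isotype.lower()]
    return d["constant"] + d["variable"]
```

for example, one igg heavy chain has 4 domains in this summary, while one igm heavy chain has 5.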
monoclonal antibodies (mabs) are man-made proteins designed to simulate the immune system's natural ability to make specialized proteins, or antibodies, that can recognize and destroy target antigens. they are called monoclonal because antibodies created in the laboratory are clones, i.e. exact copies, of just one antibody and bind only to one antigen. polyclonal antibodies (pabs), on the other hand, are a collection of antibodies, secreted by different b cell lineages in response to an antigen, that can recognize and react to several different epitopes of the same antigen. how are mabs produced? source: wikimedia commons there are 4 different ways to make mabs in the laboratory. murine (the names of treatments developed using this approach end in -omab): early monoclonal antibodies were created with mouse myeloma cells. even though these products were invaluable in laboratory research, the potential of these murine-derived antibodies was limited by human anti-mouse antibody responses. consequently, the focus turned to antibody engineering techniques that could lower the risk of immune reactions. chimeric (names end in -ximab): chimeric antibodies were the first engineered improvement and involved replacing murine constant regions with human constant regions. this approach, of sequentially replacing mouse sequence-derived amino acids, was able to significantly reduce the immunogenicity of murine antibodies. chimeric antibodies are molecules containing domains from different species, such as chimeric equine-human mabs and canine-derived chimeric mabs. humanized (names end in -zumab): humanization was the next evolutionary leap in the development of animal-derived antibodies, and the first product appeared in 1997 with the fda approval of daclizumab for the prevention of organ transplant rejection. approximately 40% of fda-approved antibodies are humanized. humanized antibodies, mabs that are mostly, but less than fully, human, are produced using different techniques. 
they can be nonhuman antibodies in which some proportion of the animal sequences has been replaced with human sequences using techniques like complementarity-determining region (cdr) grafting, resurfacing, and hyperchimerization. they can even be fully human mabs that have been engineered to be less than fully human or to acquire sequences beyond the human repertoire. fully human (names end in -umab): human sequence-derived antibodies, unlike humanized antibodies, do not contain murine-sequence-derived cdr regions. there are currently three platforms available for the development of human mabs: phage display technologies, transgenic mice, and yeast display. at last count, there were at least 28 fda-approved human mabs, of which a majority (19) were derived from transgenic mice and the rest via phage-display technology. in recent years, yeast surface displays (ysds) have emerged as a robust methodology for the isolation of mabs against various antigens, though they are yet to deliver a product to market. meanwhile, phage display, the technique that delivered the first fully human therapeutic antibody, adalimumab, in 2002, continues to be the gold standard for isolating mabs, especially for difficult-to-screen targets. how do mabs work? monoclonal antibodies today have a wide range of clinical and experimental applications. initially, mabs in clinical medicine were designed for activity against specific immune cells, progressing to specific cytokines and on to inhibiting the activity of specific enzymes, cell surface transporters, or signaling molecules. as the use of mabs continues to expand across diverse therapeutic applications, they are also being designed to work in different ways based on the protein being targeted. in the case of cancer, for example, there are two broad ways mabs can work as an immunotherapy. 
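the -omab/-ximab/-zumab/-umab naming stems described above lend themselves to a simple lookup. the sketch below is illustrative only: the function name is invented, and very early products like muromonab-cd3 predate the stem system, so real inn parsing needs considerably more care.

```python
# illustrative only: classify a therapeutic mab by the source stems
# described in the text (-omab murine, -ximab chimeric,
# -zumab humanized, -umab fully human)
MAB_STEMS = [
    ("ximab", "chimeric"),
    ("zumab", "humanized"),
    ("omab", "murine"),
    ("umab", "fully human"),
]

def mab_source(name: str) -> str:
    """guess a mab's source category from its inn stem; crude by design."""
    name = name.lower().strip()
    for suffix, source in MAB_STEMS:  # check the more specific stems first
        if name.endswith(suffix):
            return source
    return "unknown"
```

for example, mab_source("daclizumab") returns "humanized" and mab_source("adalimumab") returns "fully human".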
one, they can trigger the immune system by attaching themselves to otherwise hard-to-spot cancer cells in a process called antibody-dependent cell-mediated cytotoxicity (adcc). two, they can act on cells of the immune system to help attack cancer cells in a type of immunotherapy called checkpoint inhibition. in fact, a particular monoclonal antibody drug may function by more than one of these means. apart from flagging cancer cells and blocking checkpoints, mabs can also work by triggering cancer cell-membrane destruction, blocking connections to proteins that promote cell growth to prevent cancer growth and survival, and blocking protein-cell interactions to inhibit the development of new blood vessels that are required for a tumor to survive. so, that was a quick dive into the basic concepts related to antibodies in this first article in our three-part series on antibody drug discovery and development. we'll continue in the next with a closer look at the role of antibodies in modern drug development, examine some of the formats of approved and experimental antibodies, and explore the scope & potential of in silico antibody discovery. until then.
data-driven biomedical research is the foundation of effective precision medicine (pm). of course, over the long term, the success of the pm model will be determined as much by the productivity of data-driven research as it will be by our ability to translate research-based theories and discoveries into clinical precision medicine. the expanding universe of biomedical data this paradigm shift from a symptom-specific to a patient-specific approach to biomedical research and medicine is powered by exponentially growing volumes of multimodal data from a wide range of domains, including molecular data from primary/secondary biomedical research, clinical data from laboratory studies and ehrs, and non-clinical data pertaining to environmental, lifestyle, and socio-economic factors. since the advent of the first omics discipline of genomics, the scope of omics technologies has constantly been expanding across multiple biological layers. advances in ngs and mass spectrometry technologies have significantly expanded our knowledge of the biomolecular milieu and ushered in a new era of multi-omics analysis. however, the explosion of data from primary and secondary research also created new challenges in multi-omics data integration. it has been estimated that genome sequence data alone will be the biggest big data domain of all by 2025, the same year that healthcare data is expected to start doubling every 73 days. all this means that biomedical research needs new computational methods that are better at dealing with such huge volumes of data than traditional statistical approaches. ai/ml in biomedical research as omics data sets have become more multi-layered and multidimensional, advanced computational methods powered by intelligent ai/ml technologies have become the key to integrating and transforming complex multi-omics data sets into actionable knowledge. 
these technologies soon expanded beyond the omics domain into new fields, such as radiogenomics, a complementary field in precision medicine. ml/dl models have helped automate the assessment of radiological images for diagnosis, staging, and tumor segmentation while improving accuracy and reducing the time required. deep learning networks have also demonstrated high performance across low-level and high-level tasks in the rather complex field of histopathology image analysis. similar algorithmic approaches are now being adopted across a range of image analysis applications covering ct scans, mammograms, mris, etc. ai-based technologies like nlp have helped unlock access to unstructured textual data, i.e. documents, journal articles, blogs, emails, electronic health records, social media posts, etc., and bring the knowledge embedded in these data sets within the purview of integrated biomedical research. notwithstanding the immense potential of ai/ml technologies in biomedical research and, therefore, precision medicine, there are still several technical and ethical challenges that need to be addressed. the pm model is built on the principle of the integrated analysis of vast volumes of data, including proprietary, personal, and often sensitive information. the efficiency and accuracy of data-intensive ai/ml models depend on troves of representative data. however, this immediately raises a host of compliance issues around security, privacy, ownership, and consent. then there are the ethical questions of reliability, explainability, opacity, bias, trustworthiness, traceability, fairness, and moral responsibility. all these technical and ethical factors will have a collective influence on the evolution of data-driven research into a comprehensive and practical precision medicine model. 
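as a toy illustration of the nlp point above, even a crude pattern can lift structured candidates out of free text. real biomedical nlp pipelines use trained named-entity models; the pattern and the note text below are invented purely for illustration.

```python
import re

# toy sketch only: a crude regex for gene-symbol-shaped tokens
# (real pipelines use trained ner models, not patterns like this)
GENE_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]{1,5}\b")

def candidate_gene_mentions(text: str) -> list[str]:
    """return unique uppercase tokens that look like gene symbols."""
    seen = []
    for token in GENE_PATTERN.findall(text):
        if token not in seen:
            seen.append(token)
    return seen

note = "Patient shows EGFR amplification; TP53 status pending."
# candidate_gene_mentions(note) -> ["EGFR", "TP53"]
```

the point is not the pattern itself but the pipeline shape: unstructured text goes in, analyzable mentions come out, ready to be linked to the structured data discussed above.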
however, as mentioned at the outset, successful implementation of the pm model will also depend on the critical ability to translate research-based theories and discoveries into clinical practice. translational research & precision medicine according to a frequently cited statistic, there is a 17-year gap between the discovery of scientific evidence and its implementation in clinical practice. this gap between laboratory research and clinical practice is what the evolving scientific discipline of translational research seeks to address. translational research enables the transformation of theoretical and experimental knowledge into innovations at the point of care while also ensuring that there is a reverse flow of clinical information and insights back to biomedical research. a discovery-to-practice precision medicine pipeline will require the integration of enormous volumes of biomedical data from high-throughput multi-omics technologies, multi-modal imaging and clinical data, secondary research, ehrs, scientific publications, real-world data, and non-clinical digital devices and health apps, amongst others. the real-time integration and analysis of all this data will require the deployment of advanced ai tools and technologies in the clinical setting. however, the deployment of ai in clinical care, according to a 2022 report from the council of europe's steering committee for human rights in the fields of biomedicine and health (cdbio), remains nascent. some of the report's key observations include a huge lag between the scale of research activity and demonstrated clinical efficacy, inadequate generalization of performance from trials to clinical practice, and an inability to translate research, development, and testing into broader clinical deployment. a multiparametric practice like personalized medicine requires advanced ai/ml capabilities that span the spectrum of translational medicine from research to testing, clinical practice, and patient management. 
despite increasing ambitions and investments in ml for health (ml4h) in routine clinical care, the deployment of these models is currently limited to isolated workflows, such as radiology. this gap between research ai and clinical ai is now being addressed by the new field of translational ml, or translational ai. the futuristic vision for this approach is to combine discrete ai agents from across the spectrum of translational medicine into one collaborative translational ai that will enable real-time lab-to-clinic analysis. translational omics and precision medicine a more immediate approach to expanding the use of ai/ml in clinical practice could be to use omics as an ai strategy. ai-driven translational omics, the clinical utilization of molecular data derived from multiple biological domains, can create the framework for precision medicine initiatives. combining state-of-the-art multi-omics technologies with ai-based integration and analysis strategies has already had a significant impact on cancer precision medicine, in terms of early screening, diagnosis, response assessment, and prognosis prediction. ai technologies have the proven potential to streamline the integration of multi-modal omics with imaging, phenotypic, ehr, and patient-specific data to generate more precise insights into disease biology and enhance routine clinical decision-making. therefore, an ai-enabled multi-omics strategy may be the first evolutionary step toward realizing the vision of an end-to-end research-lab-to-clinical-care model of precision medicine.
we love multi-omics analysis. it is data-driven. it is continuously evolving and expanding across new modalities, techniques, and technologies. integrated multi-omics analysis is essential for a holistic understanding of complex biological systems and a foundational step on the road to a systems biology approach to innovation. and it is the key to innovation in biomedical and life sciences research, underpinning antibody discovery, biomarker discovery, and precision medicine, to name just a few. in fact, if you love multi-omics as much as we do, we have an extensive library of multi-perspective omics-related content just for you. today, however, we will take a closer look at some of the biggest data-related challenges — data integration, data quality, and data fairness — currently facing integrative multi-omics analysis. data integration over the years, multi-omics analysis has evolved beyond basic multi-staged integration, i.e. combining just two data features at a time. nowadays, true multi-level data integration, which transforms all data of research interest from across diverse datasets into a single matrix for concurrent analysis, is the norm. and yet, multi-omics data integration techniques still span multiple categories based on diverse methodologies with different objectives. for instance, there are two distinct approaches to multi-level data integration: horizontal and vertical integration. the horizontal model is used to integrate omics data of the same type derived from different studies, whereas the vertical model integrates different types of omics data from different experiments on the same cohort of samples. single-cell data integration expands this classification further to include diagonal integration, which goes beyond the previous two methods, and mosaic integration, which includes features shared across datasets as well as features exclusive to a single experiment. 
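the horizontal and vertical models described above can be sketched with toy matrices. here the sample and feature names are invented, and real pipelines would of course use dedicated dataframe or matrix libraries rather than plain dicts.

```python
# toy sketch of the two integration axes described above
# (sample and feature names are invented for illustration)

def horizontal_integration(study_a: dict, study_b: dict) -> dict:
    """same omics type, different studies: pool samples (rows)."""
    merged = dict(study_a)
    merged.update(study_b)
    return merged

def vertical_integration(layers: dict) -> dict:
    """different omics layers on the same cohort: join features per sample."""
    samples = set.intersection(*(set(layer) for layer in layers.values()))
    return {
        s: {f"{omics}:{feat}": val
            for omics, layer in layers.items()
            for feat, val in layer[s].items()}
        for s in samples
    }

# horizontal: two transcriptomics studies with disjoint samples
expr_a = {"s1": {"geneX": 1.2}, "s2": {"geneX": 0.8}}
expr_b = {"s3": {"geneX": 2.1}}
pooled = horizontal_integration(expr_a, expr_b)  # 3 samples, same feature

# vertical: transcriptome + proteome measured on the same samples
layers = {"rna": {"s1": {"geneX": 1.2}}, "protein": {"s1": {"protY": 0.5}}}
matrix = vertical_integration(layers)  # one row per sample, features from both layers
```

the design point: horizontal integration grows the sample axis while keeping the feature space fixed, whereas vertical integration grows the feature axis over a shared cohort, which is what produces the single analysis-ready matrix the text describes.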
the increasing use of ai/ml technologies has helped address many previous challenges inherent in multi-omics data integration but has only added to the complexity of classification. for instance, vertical data integration strategies for ml analysis are further subdivided into 5 groups based on a variety of factors. even the classification of supervised and unsupervised techniques covers several distinct approaches and categories. as a result, researchers today can choose from various applications and analytical frameworks for handling diverse omics data types, yet there are few standardized workflows for integrative data analysis. the biggest challenge, therefore, in multi-omics data integration is the lack of a universal framework that can unify all omics data. data quality the success of integrative multi-omics depends as much on an efficient and scalable data integration strategy as it does on the quality of omics data. and when it comes to multi-omics research, it is rarely prudent to assume that data values are precise representations of true biological values. there are several factors, between the actual sampling and the measurement, that affect the quality of a sample. this applies equally to data generated from manual small-scale experiments and from sophisticated high-throughput technologies. for instance, there can be intra-experimental quality heterogeneity, where there is variation in data quality even when the same omics procedure is used to conduct a large number of single experiments simultaneously. similarly, there can also be inter-experimental heterogeneity, in which the quality of data from one experimental procedure is affected by factors shared by other procedures. in addition, data quality also depends on the computational methods used to process raw experimental data into quantitative data tables. 
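the intra-experimental heterogeneity described above can be screened for with a very simple first-line check, for example flagging samples whose overall signal is far from the cohort median. this is a minimal sketch with an arbitrary threshold and invented sample names, not a production qc method.

```python
from statistics import median

# minimal illustrative qc sketch: flag samples whose total signal
# deviates strongly from the cohort median (the fold threshold is
# arbitrary, chosen only for illustration; real pipelines use
# richer, modality-specific metrics)
def flag_outlier_samples(totals: dict, fold: float = 3.0) -> list:
    """return sample ids whose total is > fold x (or < 1/fold x) the median."""
    m = median(totals.values())
    return [s for s, t in totals.items() if t > fold * m or t < m / fold]

library_sizes = {"s1": 1.0e6, "s2": 1.1e6, "s3": 9.0e6, "s4": 0.2e6}
# flag_outlier_samples(library_sizes) -> ["s3", "s4"]
```

even a crude gate like this illustrates the point in the text: quality assessment has to happen before integration, because no downstream model can fully compensate for poor input data.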
an effective multi-omics analysis solution must have first-line data quality assessment capabilities to guarantee high-quality datasets and ensure accurate biological inferences. however, there are currently few classification or prediction algorithms that can compensate for poor-quality input data. that said, in recent years there have been efforts to harmonize quality control vocabulary across different omics and high-throughput methods in order to develop a unified framework for quality control in multi-omics experiments. data fairness the ability to reuse life sciences data is critical for validating existing hypotheses, exploring novel hypotheses, and gaining new knowledge that can significantly advance interdisciplinary research. quality, for instance, is a key factor affecting the reusability of multi-omics and clinical data due to the lack of common quality control frameworks that can harmonize data across different studies, pipelines, and laboratories. the publication of the fair principles in 2016 represented one of the first concerted efforts to focus on improving the quality, standardization, and reusability of scientific data. the fair data principles, designed by a representative set of stakeholders, defined measurable guidelines for “those wishing to enhance the reusability of their data holdings”, both for individuals and for machines to automatically find and use the data. the four foundational principles — findability, accessibility, interoperability, and reusability — were applicable to data as well as to the algorithms, tools, and workflows that contributed to data generation. since then there have been several collaborative initiatives, such as the eatris-plus project and the global alliance for genomics and health (ga4gh), that have championed data fairness and advanced standards and frameworks to enhance data quality, harmonization, reproducibility, and reusability. 
despite these efforts, the use of bespoke and non-standard formats continues to be quite common in the life sciences. integrative multi-omics - the mindwalk model our approach to truly integrated and scalable multi-omics analysis is defined by three key principles. one, we have created a universal and automated framework, based on a proprietary transversal language called hyfts®, which has pre-indexed and organized all publicly available biological data into a multilayered, multidimensional knowledge graph of 660 million data objects that are currently linked by over 25 billion relations. we then further augmented this vast and continuously expanding knowledge network, using our unique lensai integrated intelligence platform, to provide instant access to over 33 million abstracts from the pubmed biomedical literature database. most importantly, our solution enables researchers to easily integrate proprietary datasets, both sequence- and text-based. with our unique data-centric model, researchers can integrate all research-relevant data into one distinct analysis-ready data matrix mosaic. two, we combined a simple user interface with a universal workflow that allows even non-data scientists to quickly explore, interrogate, and correlate all existing and incoming life sciences data. and three, we built a scalable platform with proven big data technologies and an intelligent, unified analytical framework that enables integrative multi-omics research. in conclusion, if you share our passion for integrated multi-omics analysis, then please do get in touch with us. we'd love to compare notes on how best to realize the full potential of truly data-driven multi-omics analysis.
the completion of the human genome project in 2003 set the stage for the modern era in precision medicine. the emergence of genomics, the first omics discipline, opened up new opportunities to personalize the prevention, diagnosis, and treatment of disease to patients' genetic profiles. over the past two decades, the scope of modern precision medicine has expanded much beyond that first omics. today, there are a variety of omics technologies beyond genomics, such as epigenomics, transcriptomics, proteomics, microbiomics, metabolomics, etc., generating valuable biomedical data from across different layers of biological systems. however, structured data from omics and other high-throughput technologies is just a small part of the biomedical data universe. today, there are several large-scale clinical and phenotypic studies generating massive volumes of data. and new unstructured data-intensive outputs, such as ehr/emrs and text-based information sources, are constantly creating even more volumes of quantitative, qualitative, and transactional data. as a result, precision medicine has evolved into a data-centric, multi-modal practice that traverses omics data, medical history, social/behavioral determinants, and other environmental factors to accurately diagnose health states and determine therapeutic options at an individual level. the challenge, however, with clinical and biomedical data is that they come in a wide variety of sizes, forms, formats, modalities, etc. seamlessly integrating a variety of complex and heterogeneous biological, medical, and environmental data into a unified analytical framework is therefore critical for truly data-centric precision medicine. ai/ml technologies currently play a central role in the analysis of clinical and biomedical big data. however, the complexity of classifying, labeling, indexing, and integrating heterogeneous datasets is often the bottleneck in achieving large-scale ai-enabled analysis. 
the sheer volume, heterogeneity, and complexity of life sciences data present an inherent limitation to fully harnessing the sophisticated analytical capabilities of ai/ml technologies in biotherapeutic and life sciences research. the key to driving innovation in precision medicine, therefore, will be to streamline the process of acquiring, processing, curating, storing, and exchanging biomedical data. as a full-service therapeutic antibody discovery company, our mission is to develop next-generation solutions with the intelligence to seamlessly transform complex data into biotherapeutic intelligence. next-generation ai technology for antibody discovery the lensai™ integrated intelligence platform represents a new approach to applying ai technologies to reduce the risk, time, and cost associated with antibody discovery. our approach to biotherapeutic research is designed around the key principle of data-centricity, with a dynamic network of biological and artificial intelligence technologies built around the data. there are three key building blocks in the lensai approach to data-centric drug development. one, intelligent automation to code and index all biological data, both structured and unstructured, and instantly make that data specific and applicable. two, a simple interface to facilitate the rapid exploration, interrogation, and correlation of all existing and incoming biomedical data. and three, a unified framework to enable the concurrent analysis of data from multiple domains and dimensions. the lensai platform is a google-like solution that provides biopharma researchers with instant access to the entire biosphere. using hyfts®, a universal framework for organizing all biological data, we have created a multidimensional network of 660 million data objects with multiple layers of information about sequence, syntax, and protein structure. 
there are currently over 25 billion relations that link the data objects in this vast network to create a unique knowledge graph of all data in the biosphere. more importantly, the hyfts framework allows researchers to effortlessly integrate their proprietary research into the existing knowledge network. and the network is constantly expanding and evolving. it is continuously updated with new metadata, relationships, and links, as with the recent addition of over 20 million structural hyfts. this continuous enrichment and updating of the network with newly emergent, biologically relevant data and relationships means that the knowledge graph of the biosphere is constantly and exponentially evolving in terms of the quantity and quality of the links connecting all the data. with lensai, researchers therefore have an integrated, sophisticated, and constantly up-to-date view of all biological data and context. the continuously evolving graph representation of all formal and explicit biological information in the biosphere creates a strong data foundation on which to build even more sophisticated ai/ml applications for antibody discovery and precision medicine. another unique characteristic of the lensai platform is that the hyfts network also links to textual information sources, such as scientific papers that are relevant to the biological context of the research. the platform provides out-of-the-box access to over 33 million abstracts from the pubmed biomedical literature database. plus, a built-in nlp pipeline means that researchers can easily integrate proprietary text-based data sets that are relevant to their research. lensai is currently the only ai platform that can analyze text, sequence, and protein structure concurrently. the unified analysis of all biological data across the three key dimensions of text, sequence, and protein structure can significantly enhance the efficiency and productivity of the drug discovery process. 
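hyfts and the lensai graph are proprietary, but the general idea of a typed knowledge graph linking sequences, structures, and literature can be illustrated with a toy structure. every identifier and edge type below is invented for this sketch; it is not the hyfts format.

```python
# toy illustration of a typed knowledge graph (not the hyfts format;
# all identifiers and edge types are invented for this sketch)
edges = [
    ("seq:ab123", "has_structure", "struct:1abc"),
    ("seq:ab123", "mentioned_in", "pubmed:0001"),
    ("struct:1abc", "mentioned_in", "pubmed:0002"),
]

def neighbors(node: str, edge_type: str) -> list:
    """follow all edges of one type out of a node."""
    return [dst for src, etype, dst in edges if src == node and etype == edge_type]

def literature_for_sequence(seq: str) -> list:
    """collect abstracts linked to a sequence directly or via its structures."""
    papers = neighbors(seq, "mentioned_in")
    for struct in neighbors(seq, "has_structure"):
        papers += neighbors(struct, "mentioned_in")
    return papers

# literature_for_sequence("seq:ab123") -> ["pubmed:0001", "pubmed:0002"]
```

the useful property this sketches is the one the text highlights: because relations are typed, a single traversal can hop between dimensions, from a sequence to its structures to the papers that discuss either.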
and to enable unified analysis, the lensai platform incorporates next-generation ai technologies that can instantly convert multidimensional data into meaningful knowledge that can transform drug discovery and development. a new lensai on biotherapeutic intelligence the sheer volume of data involved in biotherapeutic research and analytics has limited the capability of most conventional ai solutions to bridge the gap between wet lab limitations and in silico efficiencies. lensai is currently the only ai platform that can concurrently and instantly analyze text, sequence, and protein structure, in silico and in parallel. the platform organizes the entire biosphere and all relevant unstructured textual data into one vast, multi-level biotherapeutic intelligence network. next-generation intelligent technologies then render the data useful for drug discovery by crystallizing specificity from vast pools of heterogeneous data. with lensai, biopharma researchers now have an integrated, intelligently automated solution designed for the data-intensive task of developing precision drugs for the precision medicine era.
over the past year, we have looked at drug discovery and development from several different perspectives. for instance, we looked at the big data frenzy in biopharma, as zettabytes of sequencing data, real-world data (rwd), and textual data pile up and stress the data integration and analytic capabilities of conventional solutions. we also discussed how the time-consuming, cost-intensive, low-productivity characteristics of the prevalent roi-focused model of development have an adverse impact not just on commercial viability in the pharma industry but on the entire healthcare ecosystem. then we saw how antibody drug discovery processes continued to be cited as the biggest challenge in therapeutic r&d even as the industry was pivoting to biologics and mabs. no matter the context or frame of reference, the focus inevitably turns to how ai technologies can transform the entire drug discovery and development process, from research to clinical trials. biopharma companies have traditionally been slow to adopt innovative technologies like ai and the cloud. today, however, digital innovation has become an industry-wide priority, with drug development expected to be the most impacted by smart technologies. from application-centric to data-centric ai technologies have a range of applications across the drug discovery and development pipeline, from opening up new insights into biological systems and diseases to streamlining drug design to optimizing clinical trials. despite the wide-ranging potential of ai-driven transformation in biopharma, the process does entail some complex challenges. the most fundamental challenge will be to make the transformative shift from an application-centric to a data-centric culture, where data and metadata are operationalized at scale and across the entire drug design and development value chain. however, creating a data-centric culture in drug development comes with its unique set of data-related challenges. 
to start with, there is the sheer scale of data, which will require a scalable architecture in order to be efficient and cost-effective. most of this data is often distributed across disparate silos with unique storage practices, quality procedures, and naming and labeling conventions. then there is the issue of different data modalities, from mri or ct scans to unstructured clinical notes, that have to be extracted, transformed, and curated at scale for unified analysis. and finally, the level of regulatory scrutiny on sensitive biomedical data means that there is a constant tension between enabling collaboration and ensuring compliance. therefore, creating a strong data foundation that accounts for all these complexities in biopharma data management and analysis will be critical to ensuring the successful adoption of ai in drug development. three key requisites for an ai-ready data foundation successful ai adoption in drug development will depend on the creation of a data foundation that addresses these three key requirements. accessibility data accessibility is a key characteristic of ai leaders, irrespective of sector. in order to ensure effective and productive data democratization, organizations need to enable access to data distributed across complex technology environments spanning multiple internal and external stakeholders and partners. a key caveat of accessibility is that the data provided should be contextual to the analytical needs of specific data users and consumers. a modern, cloud-based, and connected enterprise data and ai platform designed as a “one-stop-shop” for all drug design and development-related data products, with ready-to-use analytical models, will be critical to ensuring broader and deeper data accessibility for all users. data management and governance the quality of any data ecosystem is determined by the data management and governance frameworks that ensure that relevant information is accessible to the right people at the right time. 
at the same time, these frameworks must also be capable of protecting confidential information, ensuring regulatory compliance, and facilitating the ethical and responsible use of ai. therefore, the key focus of data management and governance will be to consistently ensure the highest quality of data across all systems and platforms as well as full transparency and traceability in the acquisition and application of data.

ux and usability

successful ai adoption will require a data foundation that streamlines accessibility and prioritizes ux and usability. apart from democratizing access, the emphasis should also be on ensuring that even non-technical users are able to use data effectively and efficiently. different users often consume the same datasets from completely different perspectives. the key, therefore, is to provide a range of tools and features that help every user customize the experience to their specific roles and interests.

apart from creating the right data foundation, technology partnerships can also help accelerate the shift from an application-centric to a data-centric approach to ai adoption. in fact, a 2018 gartner report advised organizations to explore vendor offerings as a foundational approach to jump-start their efforts to make productive use of ai. more recently, pharma-technology partnerships have emerged as the fastest-moving model for externalizing innovation in ai-enabled drug discovery. according to a recent roots analysis report on the ai-based drug discovery market, partnership activity in the pharmaceutical industry has grown at a cagr of 50% between 2015 and 2021, with a majority of the deals focused on research and development. so with that trend as background, here’s a quick look at how a data-centric, full-service biotherapeutic platform can accelerate biopharma’s shift to an ai-first drug discovery model. 
the lensai™ approach to data-centric drug development

our approach to biotherapeutic research places data at the very core of a dynamic network of biological and artificial intelligence technologies. with our lensai platform, we have created a google-like solution for the entire biosphere, organizing it into a multidimensional network of 660 million data objects with multiple layers of information about sequence, syntax, and protein structure. this “one-stop-shop” model enables researchers to seamlessly access all raw sequence data. in addition, hyfts®, our universal framework for organizing all biological data, allows easy, one-click integration of all other research-relevant data from across public and proprietary data repositories. researchers can then leverage the power of the lensai integrated intelligence platform to integrate unstructured data from text-based knowledge sources such as scientific journals, ehrs, clinical notes, etc. here again, researchers have the ability to expand the core knowledge base, containing over 33 million abstracts from the pubmed biomedical literature database, by integrating data from multiple sources and knowledge domains, including proprietary databases. around this multi-source, multi-domain, data-centric core, we have designed next-generation ai technologies that can instantly and concurrently convert these vast volumes of text, sequence, and protein structure data into meaningful knowledge that can transform drug discovery and development.
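to make the “one-stop-shop” idea concrete, here is a toy sketch in python. to be clear, this is not the lensai or hyfts api; the class, record ids, and metadata tags are all invented for illustration. the point is only the general pattern: once heterogeneous records (sequence, structure, literature) carry shared metadata, a single query can span all layers.

```python
# Toy data catalog: heterogeneous records registered under uniform
# metadata become findable through one query. Illustrative only --
# NOT the lensAI/HYFTs API; all ids and tags below are invented.

class DataCatalog:
    def __init__(self):
        self._records = []

    def register(self, record_id, layer, tags, payload):
        """Register any data object with uniform metadata so it becomes
        discoverable alongside objects from other biological layers."""
        self._records.append(
            {"id": record_id, "layer": layer, "tags": set(tags), "payload": payload}
        )

    def search(self, tag, layers=None):
        """One query across all layers; optionally restrict to a subset."""
        return [
            r for r in self._records
            if tag in r["tags"] and (layers is None or r["layer"] in layers)
        ]

catalog = DataCatalog()
catalog.register("P01308", "sequence", {"insulin", "human"}, "MALWMRLLPL...")
catalog.register("4INS", "structure", {"insulin"}, "<structure coordinates>")
catalog.register("PMID:123", "literature", {"insulin", "diabetes"}, "abstract text")

hits = catalog.search("insulin")
print([r["id"] for r in hits])  # one query returns records from all three layers
```

the design choice worth noting is that discoverability comes from the shared metadata layer, not from the records themselves, which can remain in whatever raw form each repository uses.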
the key challenge to understanding complex biological systems is that they cannot be simply decoded as a sum of their parts. biomedical research, therefore, is transitioning from this reductionist approach to a more holistic and integrated systems biology model to understand the bigger picture. the first step in the transition to this holistic model is to catalog a complete parts list of biological systems and decode how they connect, interact, and individually and collectively correlate to the function and behavior of that specific system. omics is the science of analyzing the structure and functions of all the parts of a specific biological system, across different levels, including the gene, the protein, and metabolites. today, we’ll take an objective look at why we believe multi-omics is central to modern biomedical and life sciences research.

the importance of multi-omics in four points

it delivers a holistic, dynamic, high-resolution view

omics experiments have evolved considerably since the days of single-omics data. nowadays, it is fairly commonplace for researchers to combine multiple assays to generate multi-omics datasets. multi-omics is central to obtaining a detailed picture of molecular-level dynamics. the integration of multidimensional molecular datasets provides deeper insight into biological mechanisms and networks. more importantly, multi-omics can provide a dynamic view of different cell and tissue types over time, which can be vital to understanding the progressive effect of different environmental and genetic factors. combining data from different modalities enables a more holistic view of biological systems and a more comprehensive understanding of the underlying dynamics. the development of massively parallel genomic technologies is constantly broadening the scope and scale of biological modalities that can be integrated into research. 
at the same time, a new wave of multi-omics approaches is enabling researchers to simultaneously explore different layers of omics information to gain unparalleled insights into the internal dynamics of specific cells and tissues. emerging technologies such as single-cell sequencing and spatial analysis are opening up new layers of biological information to deliver a comprehensive, high-resolution view at the molecular level.

it is constantly expanding & evolving

genomics was the first omics discipline. since then the omics sciences have been constantly expanding beyond genomics, transcriptomics, proteomics, and metabolomics, which were derived from the central dogma. however, the increasing sophistication of modern high-throughput technologies means that today we have a continuously expanding variety of omics datasets focusing on multiple diverse yet complementary biological layers. in fact, the ‘omics’ suffix seems to have developed such a cachet that it has even crossed over into emerging scientific fields, such as polymeromics, humeomics, etc., that deal with huge volumes of data but are not related to the life sciences. omics technologies can be broadly classified into two categories. the first, technology-based omics, is itself further subdivided into sequencing-based omics, focusing on the genome, transcriptome, their epigenomes, and interactomes, and mass spectrometry-based omics that interrogate the proteome, metabolome, and interactomes not involving dna/rna. the second category, comprising knowledge-based omics such as immunomics and microbiomics, develops organically from the integration of multiple omics data from different computational approaches and molecular layers for specific research applications. the consistent development of techniques to cover new omics modalities has also contributed to the trend of combining multiple techniques to simultaneously collect information from different layers. 
next-generation multi-omics approaches, spearheaded by new single-cell and spatial sequencing technologies, enable researchers to concurrently explore multiple omics profiles of a sample and gain novel insights into the cell systems and mechanisms operating within specific cells and tissues, providing a greater understanding of cell biology.

it is data-driven

the omics revolution ushered in the era of big data in biological research. the exponential generation of high-throughput data following the hgp triggered the shift from traditional hypothesis-driven approaches to data-driven methodologies that opened up new perspectives and accelerated biological research and innovation. it was not just about data volumes though. with the continuous evolution of high-throughput omics technologies came the ability to measure a wider array of biological data. the rapid development of novel omics technologies in the post-genomic era produced a wealth of multilayered biological information across transcriptomics, proteomics, epigenomics, metabolomics, spatial omics, single-cell omics, etc. the increasing availability of large-scale, multidimensional, and heterogeneous datasets created unprecedented opportunities for biological research to gain deeper and holistic insights into the inner workings of biological systems and processes. the shift from single-layer to multi-dimensional analysis also yielded better results that would have a transformative impact on a range of research areas including biomarker identification, microbiome analysis, and systems microbiology. researchers have already taken on the much more complex challenge of referencing the human multi-ome and describing normal epigenetic conditions and levels of mrna, proteins, and metabolites in each of the 200 cell types in an adult human. when completed, this effort will deliver even more powerful datasets than those that emerged following the sequencing of the genome. 
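at its simplest, the multi-layer integration described above is a join of per-gene measurements from different assays on a shared identifier. the sketch below is deliberately minimal: the gene names and values are invented, and real pipelines operate on normalized assay matrices rather than hand-written dictionaries.

```python
# Minimal sketch of multi-omics integration: inner-join gene-level
# measurements from two layers (transcriptomics and proteomics) on a
# shared gene identifier. Values are made up for illustration.

transcriptomics = {"TP53": 8.2, "EGFR": 5.1, "MYC": 7.4}   # e.g. log2 expression
proteomics      = {"TP53": 1.9, "EGFR": 3.3, "KRAS": 2.2}  # e.g. protein abundance

def integrate(*layers):
    """Keep only genes measured in every layer, so each gene carries
    one value per layer -- a (tiny) multi-omics profile."""
    shared = set.intersection(*(set(layer) for layer in layers))
    return {gene: tuple(layer[gene] for layer in layers) for gene in sorted(shared)}

profile = integrate(transcriptomics, proteomics)
print(profile)  # {'EGFR': (5.1, 3.3), 'TP53': (8.2, 1.9)}
```

the inner join is the simplest possible choice; in practice, handling genes missing from one layer (imputation, outer joins, layer-specific models) is a large part of the integration problem.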
it is key to innovation

in recent years, multi-omics analysis has become a key component across several areas of biomedical and life sciences research. take precision medicine, for example, a practice that promotes the integration of collective and individualized clinical data with patient-specific multi-omics data to accurately diagnose health states and determine personalized therapeutic options at an individual level. modern ai/ml-powered bioinformatics platforms enable researchers to seamlessly integrate all relevant omics and clinical data, including unstructured textual data, in order to develop predictive models that are able to identify risks well before they become clinically apparent and thereby facilitate preemptive interventions. in the case of complex diseases, multi-omics data provide molecular profiles of disease-relevant cell types that, when integrated with gwas insights, help translate genetic findings into clinical applications. in drug discovery, multi-omics data is used to create multidimensional models that help identify and validate new drug targets, predict toxicity, and develop biomarkers for downstream diagnostics in the field. modern biomarker development relies on the effective integration of a range of omics datasets in order to obtain a more holistic understanding of diseases and to augment the accuracy and speed of identifying novel drug targets.

the future of multi-omics

integrated multi-omics analysis has revolutionized biology and opened up new horizons for basic biology and disease research. however, the complexity of managing and integrating the multi-dimensional data that drives such analyses continues to be a challenge. modern bioinformatics platforms are designed for multi-dimensional data. for instance, our integrated data-ingestion-to-insight platform eliminates all multi-omics data management challenges while prioritizing user experience, automation, and productivity. 
with unified access to all relevant data, researchers can focus on leveraging the ai-powered features of our solution to maximize the potential of multi-omics analysis.
in 1999, an innovative collaboration between 10 of the world’s largest pharmaceutical companies, the world’s largest medical research charity, and five leading academic centres emerged in the form of the snp consortium (tsc). focused on advancing the field of medicine and the development of genetic-based diagnostics and therapeutics, the tsc aimed to develop a high-density, single nucleotide polymorphism (snp) map of the human genome. a wall street journal article described how the two-year, $45 million program to create a map of genetic landmarks would usher in a new era of personal medicines. the following year, with the announcement of the "working draft" sequence, the consortium collaborated with the human genome project to accelerate the construction of a higher-density snp map. in 2002, a summary from the chairman of the consortium described how the program identified 1.7 million common snps, significantly outperforming its original objective to identify 300,000. he also observed that creating a high-quality snp map for the public domain would facilitate novel diagnostic tests, new ways to intervene in disease processes, and the development of new medicines to personalise therapies. in the 20 years since that milestone in modern personalised medicine, there have been several significant advances. today, the use of genotyping and genomics has progressed many cancer treatments from blanket approaches to more patient-centred models. the ability to decode dna and identify mutations has opened up the possibility of developing therapies that address those specific mutations. the sequencing of the human genome introduced the concept of the druggable gene and advanced the field of pharmacogenomics by enabling the exploration of the entire genome in terms of response to a medication, rather than to just a few candidate loci.

precision vs. personalisation in medicine

the broad consensus seems to be that these terms are interchangeable. 
for instance, the national human genome research institute highlights that the terms are generally considered analogous to personalised medicine or individualised medicine. additionally, the national cancer institute, american cancer society and food and drug administration include references to personalised medicine and personalised care. in fact, the view that the terms are interchangeable, or at least very similar, is common across a host of international institutions. however, for at least one organization, a clear distinction between, and preference for, one term over the other has been noted. this comes from the european society for medical oncology (esmo), with the unambiguous statement that precision medicine is preferred to personalised medicine. according to esmo, these concepts ‘generated the greatest discussion’ during the creation of their glossary, and their decision to go with precision medicine came down to these three reasons:

the term ‘personalised’ could be misinterpreted to imply that treatments and preventions are being developed uniquely for each individual.

personalised medicine describes all modern oncology, given that personal preference, cognitive aspects, and co-morbidities are considered alongside treatment and disease factors. in this context, personalised medicine describes the holistic approach of which biomarker-based precision medicine is just one part.

precision medicine communicates the highly accurate nature of new technologies used in base pair resolution dissection of cancer genomes. 
and finally, according to the national research council, precision medicine “does not literally mean the creation of drugs or medical devices that are unique to a patient, but rather the ability to classify individuals into subpopulations that differ in their susceptibility to a particular disease, in the biology and/or prognosis of those diseases they may develop, or in their response to a specific treatment.”

key elements of precision medicine

there are several models that seek to break down the complexity of the precision medicine ecosystem into a sequence of linked components. for instance, the university of california, san francisco (ucsf) envisions precision medicine as a fluid, circular process that informs both life sciences research and healthcare decision-making at the level of individuals or populations. this model integrates findings from basic, clinical, and population sciences research; data from digital health, omics technologies, imaging, and computational health sciences; and ethical and legal guidelines into a "google maps for health" knowledge network.

source: precision medicine at ucsf

in the publication, precision medicine: from science to value, authors ginsburg and phillips outline a knowledge-generating, learning health system model. in this model, information is constantly being generated and looped between clinical practice and research to improve the efficiency and effectiveness of precision medicine. this enables researchers to leverage data derived from clinical care settings, while clinicians get access to a vast knowledge base curated from research laboratories. participation in this system could be extended further to include industry, government agencies, policymakers, regulators, providers, payers, etc., to create a collaborative and productive precision medicine ecosystem. 
source: precision medicine: from science to value

the uc davis model visualises precision medicine as the ‘intersection between people, their environment, the changes in their markers of health and illness, and their social and behavioural factors over time’. this model focuses on four key components: 1) patient-related data from electronic health records; 2) scientific markers of health and illness, including genetics, genomics, metabolomics, phenomics, pharmacogenomics, etc.; 3) environmental exposure and influence on persons and populations, such as the internal environment (e.g., microbiomes) and the external environment (e.g., socio-economics); and 4) behavioural health factors (e.g., life choices).

source: uc davis health

another precision medicine approach, discussed in a recent brookings report, is presented as a simple, four-stage pipeline envisioned to help companies ethically innovate and equitably deploy precision medicine. the first stage, data acquisition and storage, deals with the aggregation of big data and the ownership, privacy, sovereignty, storage, and movement of this data. the second stage pertains to information access and research and the need to balance healthcare innovation with adequate oversight and protection. in the third stage, clinical trials and commercialization, a robust framework is in place to ensure the safety, efficacy, and durability of precision medicine treatments, as well as the commercialization of individualised products. the final stage involves evaluating societal benefits, including investments and innovations in healthcare systems, with an aim toward equitable precision medicine, so that products and treatments reach all patients with unmet medical needs.

integrating precision medicine and healthcare systems

the true potential of a patient-centric model such as precision medicine can only be realised when physicians are able to apply research insights to clinical decisions at the point of care. 
however, despite huge scientific and technological breakthroughs over the past two decades, healthcare providers face multiple challenges in integrating novel personalised medicine technologies and practices. a study of a representative sample of us-based health systems revealed that, despite widespread integration efforts, the clinical implementation of personalised medicine was measurable but incomplete system-wide. this practice gap could be attributed to any number of limitations and challenges, and addressing these will have to become a priority if the breakthroughs in precision medicine are to be translated into improved care for patients.
biopharmaceutical companies are increasingly turning to alliances & partnerships to drive external innovation. having raised over $80 billion in follow-on financing, venture funding, and initial public offerings (ipos) between january and november 2021, the focus in 2022 is expected to be on the more sustainable allocation of capital by leveraging the potential of alliances and strategic partnerships to access new talent and innovation. the race to market for covid-19 vaccines has only accentuated the value of alliances as companies with core vaccine capabilities turned to external partnerships to leverage the value of emergent mrna technology. and with alliances historically delivering higher return on investment (roi), major biopharmaceutical companies have been deploying more capital toward alliances and strategic partnerships since 2020. pharma-startup partnerships represent the fastest-moving model for externalizing innovation to accelerate r&d productivity and drive portfolio growth. within this broader trend, the ai-enabled drug discovery and development space continues to attract a lot of big pharma interest, spanning investments, acquisitions, and partnerships. ai is currently the top investment priority among big pharma players. biopharma majors, like pfizer, takeda, and astrazeneca, have unsurprisingly also been leading the way in terms of ai start-up deals. in addition, these industry players are focusing on forging partnerships in the ai space to improve r&d activities. just in the first quarter of 2022, leading industry players including pfizer, sanofi, glaxosmithkline, and bristol-myers squibb, have announced multi-billion-dollar strategic partnerships with ai vendors. however, the pharmaceutical sector has traditionally preferred to keep r&d and innovation in-house. managing these strategic partnerships, therefore, introduces some new challenges that go beyond relatively simpler build versus buy decisions involving informatics solutions. 
managing strategic ai partnerships

according to research data from accenture, the success rate of pharma-tech partnerships, assessed across a total of 149 partnerships between companies of all sizes, is around 60%. for early-stage partnerships, there are additional risks that can impact the success rate. the accenture report distilled the four most common pitfalls that can impact every pharma-tech partnership.

source: accenture

failing to prepare internally: according to executives of life science companies, defining partnership strategy and partner management functions is a key challenge in creating successful technology alliances. it is important to start by defining the appropriate partnership structure and governance for the alliance, with mutually agreed partnership objectives, a dedicated team with the right technical knowledge and resources, and clearly defined partnership management functions.

engaging with the wrong partner: despite the most stringent due diligence around technological relevance and strategic alignment, tech partnerships can fail because of organizational and cultural differences. sometimes the distinctive and complementary characteristics of each partner that make collaboration attractive can themselves create a “paradox of asymmetry” that makes working together difficult. most corporations may be well equipped to deal with the two main phases of collaboration between large companies and startups: the design phase, where the businesses meet and decide to engage, and the process phase, where the interactions and collaborations kick off. new research shows that a preceding upstream phase, to define and create conditions conducive to the design and process phases, can be decisive in the success of startup partnerships.

undefined partnership roadmap: technological partnerships can be structured in a myriad of ways. for instance, the financial structure could be based on revenue sharing, milestone-based payments, etc. 
it is necessary to clearly define each engagement structure in terms of its operational, organizational, financial, legal, and ip implications. formalize the roles, responsibilities, and accountabilities expected of each party. establish short- to medium-term goals, metrics, key milestones, and stage gates that build towards long-term partnership outcomes. continuously reassess and fine-tune based on milestones and key performance indicators (kpis).

poor execution: effective long-term partnerships are based on executional excellence. successful partnerships require a dedicated leader accountable for execution and results. this role is essential for providing daily oversight of operational issues, addressing inter-organizational bottlenecks, and enforcing accountability on both sides. there should also be partnership meetings involving senior leadership to discuss how to accelerate progress or how to change tactics in the face of challenges or changing market conditions.

building successful technology partnerships offers a fast, efficient, and cost-effective model for pharma and life sciences companies to develop new capabilities, accelerate r&d processes, and drive innovation. however, the scale and complexity of these partnerships, and the challenges of managing partnership networks, are bound to increase over time.

building end-to-end ai partnerships

in the race to become pharma ai leaders, many companies are looking at end-to-end ai coverage spanning biology (target discovery and disease modeling), chemistry (virtual screening, retrosynthesis, and small molecule generation), and clinical development (patient stratification, clinical trial design, and prediction of trial outcomes). this is where ai platforms like our lensai platform can play a key role in enabling value realization at scale. 
ai-native platforms based on multi-dimensional information models can seamlessly scale pharma r&d by automating data aggregation across different biological layers, multiple domains, and internal and external data repositories. given the diverse nature of ai-driven platforms and services, pharma companies have the flexibility to choose partnerships that address strategic gaps in their r&d value chain. this includes custom data science services, drug candidate or target discovery as a service, ai-powered cros, and platforms specializing in low-data targets. the focus has to be on enabling end-to-end ai coverage in pharma r&d, through a combination of partnerships and in-house investments in order to increase the productivity and efficiency of r&d processes while cutting the cost and the time to value.
it is estimated that adverse events (aes) are likely among the 10 leading causes of death and disability in the world. in high-income countries, one in every 10 patients is exposed to the harm that can be caused by a range of adverse events, at least 50% of which are preventable. in low- and middle-income countries, 134 million such events occur each year, resulting in 2.6 million deaths. across populations, the incidence of aes also varies based on age, gender, and ethnic and racial disparities. and according to a recent study, external disruptions, like the current pandemic, can significantly alter the incidence, dispersion and risk trajectory of these events. apart from their direct patient health-related consequences, aes also have significantly detrimental implications for healthcare costs and productivity. it is estimated that 15% of total hospital activity and expenditure in oecd countries is directly attributable to adverse events. there is therefore a dire need for a systematic approach to detecting and preventing adverse events in the global healthcare system. and that’s exactly where ai technologies are taking the lead.

ai applications in adverse drug events (ades)

a 2021 scoping review to identify potential ai applications to predict, prevent or mitigate the effects of ades homed in on four interrelated use cases.

first use case: prediction of patients with the likelihood to have a future ade in order to prevent or effectively manage these events.

second use case: predicting the therapeutic response of patients to medications in order to prevent ades, including in patients not expected to benefit from treatment.

third use case: predicting optimal dosing for specific medications in order to balance therapeutic benefits with ade-related risks.

fourth use case: predicting the most appropriate treatment options to guide the selection of safe and effective pharmacological therapies. 
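as a toy illustration of the first use case, the sketch below scores a patient's risk of a future ade from structured record features with a logistic-style model. the features, weights, and bias are invented for this example; in a real system they would be learned from health-record data, not set by hand.

```python
# Toy logistic-style ADE risk score over structured record features.
# WEIGHTS and BIAS are invented for illustration; a real model would be
# fitted to labeled health-record data.

import math

WEIGHTS = {"age_over_65": 0.8, "num_medications": 0.25, "renal_impairment": 1.1}
BIAS = -3.0

def ade_risk(patient):
    """Weighted sum of features pushed through a sigmoid -> probability-like score."""
    z = BIAS + sum(WEIGHTS[f] * patient.get(f, 0) for f in WEIGHTS)
    return 1 / (1 + math.exp(-z))

low  = ade_risk({"age_over_65": 0, "num_medications": 2, "renal_impairment": 0})
high = ade_risk({"age_over_65": 1, "num_medications": 9, "renal_impairment": 1})
print(round(low, 2), round(high, 2))  # the high-burden patient scores much higher
```

even this toy form shows why the review's point about structured data matters: every feature here must already exist as a coded field, which is exactly the limitation that nlp over clinical notes is meant to relax.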
the review concluded that ai technologies could play an important role in the prediction, detection and mitigation of ades. however, it also noted that even though the studies included in the review applied a range of ai techniques, model development was overwhelmingly based on structured data from health records and administrative health databases. therefore, the reviewers noted, integrating more advanced approaches like nlp and transformer neural networks would be essential in order to access and integrate unstructured data, like clinical notes, and improve the performance of predictive models.

nlp in pharmacovigilance

spontaneous reporting systems (srss) have traditionally been the cornerstone of pharmacovigilance, with reports being pooled from a wide range of sources. for instance, vigibase, the global database at the heart of the world health organization’s international global pharmacovigilance system, currently holds over 30 million reports of suspected drug-related adverse effects in patients from 170 member countries. the problem, however, is that spontaneous reporting is, by definition, a passive approach, and currently fewer than 5% of ades are reported even in jurisdictions with mandatory reporting. the vast majority of ade-related information resides in free-text channels: emails and phone calls to patient support centres, social media posts, news stories, doctor-pharma rep call transcripts, online patient forums, scientific literature, etc. mining these free-text channels and clinical narratives in ehrs can supplement spontaneous reporting and enable significant improvements in ade identification.

nlp & ehrs

ehrs provide a longitudinal electronic record of patient health information captured across different systems within the healthcare setting. one of the main benefits of integrating ehrs as a pharmacovigilance data source is that they provide real-time real-world data. 
these systems also contain multiple fields of unstructured data, like discharge summaries, lab test findings, nurse notifications, etc., that can be explored with nlp technologies to detect safety signals. and compared to srss, ehr data is not affected by duplication or under- or over-reporting and enables a more complete assessment of drug exposure and comorbidity status. in recent years, deep nlp models have been successfully used across a variety of text classification and prediction tasks in ehrs, including medical text classification, segmentation, word sense disambiguation, medical coding, outcome prediction, and de-identification. hybrid clinical nlp systems, combining a knowledge-based general clinical nlp system for medical concept extraction with a task-specific deep learning system for relation identification, have been able to automatically extract ade and medication-related information from clinical narratives. but some challenges still remain, such as the limited availability and complexity of domain-specific text, the lack of annotated data, and the extremely sensitive nature of ehr information.

nlp & biomedical literature

biomedical literature is one of the most valuable sources of drug-related information, stemming both from development cycles as well as the post-marketing phase. in post-marketing surveillance (pms), for instance, scientific literature is becoming essential to the detection of emerging safety signals. but with as many as 800,000 new articles in medicine and pharmacology published every year, the value of nlp in automating the extraction of events and safety information cannot be overstated. over the years, a variety of nlp techniques have been applied to a range of literature mining tasks to demonstrate the accuracy and versatility of the technology. take pms, for example, a time-consuming and manual intellectual review process to actively screen biomedical databases and literature for new ades. 
researchers were able to train an ml algorithm on historic screening knowledge data to automatically sort relevant articles for intellectual review. another deep learning pipeline, implemented with three nlp modules, not only monitors biomedical literature for adr signals but also filters and ranks publications across three output levels.

nlp & social media

there has been a lot of interest in the potential of nlp-based pipelines that can automate information extraction from social media and other online health forums. but these data sources, specifically social media networks, present a unique set of challenges. for instance, adr mentions on social media typically include long, varied and informal descriptions that are completely different from the formal terminology found in pubmed. one proposed way around this challenge has been to use an adversarial transfer framework to transfer auxiliary features from pubmed to social media datasets in order to improve generalization, mitigate noise and enhance adr identification performance. pharmacovigilance on social media data has predominantly focused on mining ades using annotated datasets. achieving the larger objective of detecting ade signals and informing public policy will require the development of end-to-end solutions that enable the large-scale analysis of social media for a variety of drugs. one project to evaluate the performance of automated ae recognition systems for twitter warned of a potentially large discrepancy between published performance results and actual performance based on independent data. the transferability of ae recognition systems, the study concluded, would be key to their more widespread use in pharmacovigilance. all that notwithstanding, there is little doubt that user-generated textual content on the internet will have a substantive influence on conventional pharmacovigilance processes. 
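to make the informal-language challenge concrete, here is a minimal sketch of one common step in social-media adr mining: normalizing colloquial symptom descriptions to standardized terms with a small lexicon. the phrase/term pairs below are invented examples; production systems rely on large curated vocabularies and learned models rather than exact pattern matching.

```python
# Hedged sketch: map informal ADR phrasing in a social-media post to
# standardized terms via a tiny regex lexicon. Pairs are invented
# examples, not a real terminology mapping.

import re

LEXICON = {
    r"\b(can'?t|couldn'?t) sleep\b": "insomnia",
    r"\b(super )?dizzy\b": "dizziness",
    r"\b(threw up|throwing up)\b": "vomiting",
}

def normalize_post(text):
    """Return the set of standardized ADR terms mentioned in a post."""
    found = set()
    for pattern, term in LEXICON.items():
        if re.search(pattern, text.lower()):
            found.add(term)
    return found

post = "started the new med, threw up twice and now i can't sleep"
print(normalize_post(post))  # vomiting and insomnia recognized despite informal wording
```

the gap this papers over is exactly the one the adversarial-transfer work above targets: "threw up" never appears in pubmed abstracts, so lexicons and models trained on formal text miss it.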
integrated pharmacovigilance pharmacovigilance is still a very fragmented and uncoordinated process, both in terms of data collection and analysis. the value of nlp technologies lies in their ability to unlock real-time real-world insights at scale from data sources that will enable a more proactive approach to predicting and preventing adverse events. but for this to happen, the focus has to be on the development of outcome-based hybrid nlp models that can unify all textual data across clinical trials, clinical narratives, ehrs, biomedical literature, user-generated content etc. at the same time, the approach to the collection and analysis of structured data in pharmacovigilance also needs to be modernised to augment efficiency, productivity and accuracy. combining structured and unstructured data will open up a new era in data-driven pharmacovigilance.
artificial intelligence (ai) technologies are currently the most disruptive trend in the pharmaceutical industry. over the past year, we have quite extensively covered the impact that these intelligent technologies can have on conventional drug discovery and development processes. we charted how ai and machine learning (ml) technologies came to be a core component of drug discovery and development, their potential to exponentially scale and autonomize drug discovery and development, their ability to expand the scope of drug research even in data-scarce specialties like rare diseases, and the power of knowledge graph-based drug discovery to transform a range of drug discovery and development tasks. ai/ml technologies can radically remake every stage of the drug discovery and development process, from research to clinical trials. today, we will dive deeper into the transformational possibilities of these technologies in two foundational stages — early drug discovery and preclinical development — of the drug development process. early drug discovery and preclinical development early drug discovery and preclinical development is a complex process that essentially determines the productivity and value of downstream development programs. therefore, even incremental improvements in accuracy and efficiency during these early stages could dramatically improve the entire drug development value chain. ai/ml in early drug discovery the early small molecule drug discovery process flows broadly across target identification, hit identification, lead identification, lead optimization, and finally, on to preclinical development. currently, this time-consuming and resource-intensive process relies heavily on translational approaches and assumptions. incorporating assumptions, especially those that cannot be validated due to lack of data, raises the risk of late-stage failure by advancing new molecular entities (nmes) into development without accurate evidence of human response.
even the drastically different process of large-molecule, or biologicals, development starts with an accurate definition of the most promising target. ai/ml methods can therefore play a critical role in accelerating the development process, and investigating drug-target interactions (dtis) is a critical step to enhancing the success rate of new drug discovery. predicting drug-target interactions despite the successful identification of the biochemical functions of a myriad of proteins and compounds with conventional biomedical techniques, the limitations of these approaches come into play when scaling across the volume and complexity of data. this is what makes ml methods ideal for drug-target interaction (dti) prediction at scale. there are currently several state-of-the-art ml models available for dti prediction. however, many conventional ml approaches regard dti prediction either as a classification or a regression task, both of which can lead to bias and variance errors. novel multi-dti models that balance bias and variance through a multi-task learning framework have been able to deliver superior performance and accuracy over even state-of-the-art methods. these dti prediction models combine a deep learning framework with a co-attention mechanism to model interactions from drug and protein modalities and improve the accuracy of drug target annotation. deep learning models perform significantly better at high-throughput dti prediction than conventional approaches and continue to evolve, from identifying simple interactions to revealing unknown mechanisms of drug action. lead identification & optimization this stage focuses on identifying and optimizing drug-like small molecules that exhibit therapeutic activity. the challenge in this hit-to-lead generation phase is twofold. firstly, the search space to extract hit molecules from compound libraries extends to millions of molecules.
for instance, a single database like the zinc database comprises 230 million purchasable compounds, and the universe of make-on-demand synthesis compounds can reach 10 billion. secondly, the hit rate of conventional high-throughput screening (hts) approaches in yielding a viable compound is just around 0.1%. over the years, there have been several initiatives to improve the productivity and efficiency of hit-to-lead generation, including the use of high-content screening (hcs) techniques to complement hts and improve efficiency, and computer-aided drug design (cadd) virtual screening methodologies to reduce the number of compounds to be tested. the availability of huge volumes of high-quality data combined with the ability of ai to parse and learn from these data has the potential to take the computational screening process to a new level. there are at least four ways — access to new biology, improved or novel chemistry, better success rates, and quicker and cheaper discovery processes — in which ai can add new value to small-molecule drug discovery. ai technologies can be applied to a variety of discovery contexts and biological targets and can play a critical role in redefining long-standing workflows and overcoming many of the challenges of conventional techniques. ai/ml in preclinical development preclinical development addresses several critical issues relevant to the success of new drug candidates. preclinical studies are a regulatory prerequisite to generating toxicology data that validate the safety of a drug for humans prior to clinical trials. these studies inform trial design and provide the pharmacokinetic, pharmacodynamic, tolerability, and safety information, such as in vitro off-target and tissue cross-reactivity (tcr) data, that defines optimal dosage. preclinical data also provide chemical, manufacturing, and control information that will be crucial for clinical production.
finally, they help pharma companies to identify candidates with the broadest potential benefits and the greatest chance of success. it is estimated that just 10 out of 10,000 small molecule drug candidates in preclinical studies make it to clinical trials. one reason for this extremely high attrition is the imperfect nature of preclinical in vivo research models: in vitro studies can typically confirm efficacy, mechanism of action, and so on, but in vivo models often struggle to accurately predict clinical outcomes. however, ai/ml technologies are increasingly being used to bridge the translational gap between preclinical discoveries and new therapeutics. for instance, a key approach to de-risking clinical development has been the use of translational biomarkers that demonstrate target modulation and target engagement and confirm proof of mechanism. in this context, ai techniques have been deployed to learn from large volumes of heterogeneous and high-dimensional omics data and provide valuable insights that streamline translational biomarker discovery. similarly, ml algorithms that learn from problem-specific training data have been successfully used to accurately predict bioactivity, absorption, distribution, metabolism, excretion, and toxicity (admet)-related endpoints, and physicochemical properties. these technologies also play a critical role in the preclinical development of biologicals, including in the identification of candidate molecules with a higher probability of providing species-agnostic reactive outcomes in animal/human testing, ortholog analysis, and off-target binding analysis. they have also been used to successfully predict drug interactions, including drug-target and drug-drug interactions, during preclinical testing.
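as a rough illustration of the multi-task dti framing discussed earlier, here is a sketch in which one shared representation feeds both a classification head (does the pair interact?) and a regression head (how strongly?). the dimensions and random weights are assumptions; a real multi-dti model would learn them jointly by backpropagation and add components such as co-attention:

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative dimensions: a drug fingerprint and a protein descriptor
d_drug, d_prot, d_hidden = 16, 24, 8

# shared weights feed two task-specific heads (random here; trained
# jointly in a real model so both tasks regularize each other)
w_shared = rng.normal(size=(d_drug + d_prot, d_hidden))
w_cls = rng.normal(size=d_hidden)  # head 1: interaction probability
w_reg = rng.normal(size=d_hidden)  # head 2: binding affinity

def forward(drug_vec, prot_vec):
    """one forward pass of a multi-task dti model: a single shared
    representation supports classification and regression together."""
    x = np.concatenate([drug_vec, prot_vec])
    h = np.tanh(x @ w_shared)                        # shared representation
    p_interact = 1.0 / (1.0 + np.exp(-(h @ w_cls)))  # probability in (0, 1)
    affinity = h @ w_reg                             # unbounded real value
    return float(p_interact), float(affinity)

p, a = forward(rng.normal(size=d_drug), rng.normal(size=d_prot))
```

the point of the shared layer is the bias-variance balance mentioned above: forcing one representation to serve both tasks discourages it from overfitting to either one.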
the age of data-driven drug discovery & development network-based approaches that enable a systems-level view of the mechanisms underlying disease pathophysiology are increasingly becoming the norm in drug discovery and development. this in turn has opened up a new era of data-driven drug development where the focus is on the integration of heterogeneous types and sources of data, including molecular, clinical trial, and drug label data. the preclinical space is being transformed by ai technologies like natural language processing (nlp) that are enabling the identification of novel targets and previously undiscovered drug-disease associations based on insights extracted from unstructured data sources like biomedical literature, electronic medical records (emrs), etc. sophisticated and powerful ml/ai algorithms now enable the unified analysis of huge volumes of diverse datasets to autonomously reveal complex non-linear relationships that streamline and accelerate drug discovery and development. ultimately, the efficiency and productivity of early drug discovery and preclinical development processes will determine the value of the entire pharma r&d value chain. and that’s where ai/ml technologies have been gaining the most traction in recent years.
natural language processing is a multidisciplinary field, and over the years several models and algorithms have been successfully used to parse text. ml approaches have been central to nlp development, with many of them focussing on a technique called sequence-to-sequence learning (seq2seq). first introduced by google in 2014, seq2seq models revolutionized translation and were quickly adopted for a variety of nlp tasks including text summarization, speech recognition, image captioning and question answering. prior to this, deep neural networks (dnns) had been used to tackle difficult problems such as speech recognition. however, they suffered from a significant limitation in that they required the dimensionality of inputs and outputs to be known and fixed. hence, they were not suitable for sequential problems, such as speech recognition, machine translation and question answering, where dimensionality cannot be pre-defined. as a result, recurrent neural networks (rnns), a type of artificial neural network, soon became the state of the art for sequential data. recurrent neural networks in a traditional dnn, the assumption is that inputs and outputs are independent of each other. rnns, however, operate on the principle that the output depends on both the current input and a “memory” of previous inputs from the sequence. the use of feedback loops to process sequential data allows information to persist, thereby giving rnns their “memory.” as a result, this approach is well suited to language applications where context is vital to the accuracy of the final output. however, there was the issue of vanishing gradients — information loss when dealing with long sequences, because the network ends up focusing only on the most recent information — that impaired meaningful learning in the context of large data sequences.
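the recurrence described above can be sketched in a few lines of numpy: each step's hidden state is computed from the current input and the previous hidden state, which is exactly the feedback loop that gives rnns their "memory". the dimensions and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hidden = 4, 3

w_xh = rng.normal(size=(d_in, d_hidden))      # input -> hidden
w_hh = rng.normal(size=(d_hidden, d_hidden))  # hidden -> hidden (the feedback loop)

def rnn_states(sequence):
    """run a vanilla rnn over a sequence: each hidden state depends on
    the current input and the previous hidden state."""
    h = np.zeros(d_hidden)
    states = []
    for x in sequence:
        h = np.tanh(x @ w_xh + h @ w_hh)  # "memory" carried through h
        states.append(h)
    return states

seq = [rng.normal(size=d_in) for _ in range(5)]
states = rnn_states(seq)
```

the vanishing-gradient problem arises in training this same loop: gradients flowing back through many applications of `w_hh` and `tanh` shrink toward zero, so early inputs stop influencing learning.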
rnns soon evolved into several specialized versions, like lstm (long short-term memory), gru (gated recurrent unit), time distributed layers, and convlstm2d layers, with the capability to process long sequences. each of these versions was designed to address specific situations; for instance, grus outperformed lstms on low-complexity sequences, consumed less memory and delivered faster results, whereas lstms performed better with high-complexity sequences and enabled higher accuracy. rnns and their variants soon became state-of-the-art for sequence translation. however, there were still several limitations related to long-term dependencies, parallelization, resource intensity and their inability to take full advantage of emerging computing devices such as tpus and gpus. a new model would soon emerge and go on to become the dominant architecture for complex nlp tasks. transformers by 2017, complex rnns and their variants had become the standard for sequence modelling and transduction, with the best models incorporating an encoder and decoder connected through an attention mechanism. that year, however, a paper from google called attention is all you need proposed a new model architecture, the transformer, based entirely on attention mechanisms. having dropped recurrence in favour of attention mechanisms, these models performed remarkably better at translation tasks, while enabling significantly more parallelization and requiring less time to train. what is the attention mechanism? the concept of an attention mechanism was first introduced in a 2014 paper on neural machine translation. prior to this, rnn encoder-decoder frameworks encoded variable-length source sentences into fixed-length vectors that would then be decoded into variable-length target sentences. this approach not only restricts the network’s ability to cope with large sentences but also results in performance deterioration for long input sentences.
rather than trying to force-fit all the information from an input sentence into a fixed-length vector, the paper proposed the implementation of a mechanism of attention in the decoder. in this approach, the information from an input sentence is encoded across a sequence of vectors, instead of a single fixed-length vector, with the attention mechanism allowing the decoder to adaptively choose a subset of these vectors to decode the translation. types of attention mechanisms the transformer was the first transduction model to implement self-attention as an alternative to recurrence and convolutions. a self-attention, or intra-attention, mechanism relates different positions of a single sequence in order to compute a representation of that sequence. and depending on the implementation, there can be several types of attention mechanisms. for instance, in terms of the source states that contribute to deriving the attention vector, there is global attention, where attention is placed on all source states; hard attention, where it is placed on just one source state; and soft attention, where it is placed on a limited set of source states. there is also luong attention from 2015, a variation on the original bahdanau, or additive, attention, which combined two classes of mechanisms, one global over all source words and the other local and focused on a selected subset of words, to predict the target sentence. the 2017 google paper introduced scaled dot-product attention, which is like dot-product, or multiplicative, attention but with a scaling factor. the same paper also defined multi-head attention, where instead of performing a single attention function, several attention functions are performed in parallel. this approach enables the model to concurrently attend to information from different representation subspaces at different positions. multi-head attention has played a central role in the success of transformer models, demonstrating consistent performance improvements over other attention mechanisms.
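scaled dot-product attention, as defined in the 2017 paper, can be sketched directly in numpy; the sequence length and dimensions below are arbitrary:

```python
import numpy as np

def softmax(x, axis=-1):
    """numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """attention(q, k, v) = softmax(q k^T / sqrt(d_k)) v
    the sqrt(d_k) scaling keeps large dot products from
    saturating the softmax."""
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))  # one row per query position
    return weights @ v, weights

rng = np.random.default_rng(2)
seq_len, d_k, d_v = 5, 8, 6
q = rng.normal(size=(seq_len, d_k))
k = rng.normal(size=(seq_len, d_k))
v = rng.normal(size=(seq_len, d_v))
out, weights = scaled_dot_product_attention(q, k, v)
```

multi-head attention simply runs this function several times in parallel on learned projections of q, k and v, then concatenates the results, which is what lets the model attend to different representation subspaces at once.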
in fact, rnns that would typically underperform transformers have been shown to outperform them when using multi-head attention. apart from rnns, multi-head attention has also been incorporated into other models like graph attention networks and convolutional neural networks. transformers in nlp the transformer architecture has become a dominant choice in nlp. some of the leading language models for nlp, such as bidirectional encoder representations from transformers (bert), generative pre-trained transformer models (gpt-3), and xlnet, are transformer-based. transformer-based pretrained language models (t-ptlms) have been successfully used in a variety of nlp tasks. built on transformers, self-supervised learning and transfer learning, t-ptlms are able to use self-supervised learning on large volumes of text data to understand universal language representations and then transfer this knowledge to downstream tasks. today, there is a long list of t-ptlms, including general, social media, monolingual, multilingual and domain-specific models. specialized biomedical language models, like biobert, bioelectra, bioalbert and bioelmo, have been able to produce meaningful concept representations that augment the power and accuracy of a range of bionlp applications such as named entity recognition, relationship extraction and question answering. transformer-based language models trained with large-scale drug-target interaction (dti) datasets have been able to outperform conventional methods in the prediction of novel drug-target interactions. it’s hard to tell if transformers will eventually replace rnns, but they are currently the model of choice for nlp.
nlp challenges can be classified into two broad categories. the first category is linguistic and refers to the challenges of decoding the inherent complexity of human language and communication. we covered this category in a recent "why is nlp challenging?" article. the second is data-related and refers to some of the data acquisition, accuracy, and analysis issues that are specific to nlp use cases. in this article, we will look at four of the most common data-related challenges in nlp. low resource languages there is currently a digital divide in nlp between high resource languages, such as english, mandarin, french, german, arabic, etc., and low resource languages, which include most of the remaining 7,000+ languages of the world. though there is a range of ml techniques that can reduce the need for labelled data, there still needs to be enough data, both labelled and unlabelled, to feed data-hungry ml techniques and to evaluate system performance. in recent times, multilingual language models (mllms) have emerged as a viable option to handle multiple languages in a single model. pretrained mllms have been successfully used to transfer nlp capabilities to low-resource languages. as a result, there is increasing focus on zero-shot transfer learning approaches to building bigger mllms that cover more languages, and on creating benchmarks to understand and evaluate the performance of these models on a wider variety of tasks. apart from transfer learning, there are a range of techniques, like data augmentation, distant & weak supervision, cross-lingual annotation projections, learning with noisy labels, and non-expert support, that have been developed to generate alternative forms of labelled data for low-resource languages and low-resource domains. today, there is even a no-code platform that allows users to build nlp models in low-resource languages. training data building accurate nlp models requires huge volumes of training data. 
though there has been a sharp increase in recent times in the number of nlp datasets, these are often collected through automation or crowdsourcing. there is, therefore, the potential for incorrectly labelled data which, when used for training, can lead to memorisation and poor generalisation. apart from finding enough raw data for training, the key challenge is to ensure accurate and extensive data annotation to make training data more reliable. data annotation broadly refers to the process of organising and annotating training data for specific nlp use cases. in text annotation, a subset of data annotation, text data is transcribed and annotated so that ml algorithms are able to make associations between actual and intended meanings. there are five main techniques for text annotation: sentiment annotation, intent annotation, semantic annotation, entity annotation, and linguistic annotation. however, there are several challenges that each of these has to address. for instance, data labelling for entity annotation typically has to contend with issues related to nested annotations, introducing new entity types in the middle of a project, managing extensive lists of tags, and categorising trailing and preceding whitespace and punctuation. currently, there are several annotation and classification tools for managing nlp training data at scale. however, manually-labelled gold standard annotations remain a prerequisite, and though ml models are increasingly capable of automated labelling, human annotation remains essential in cases where data cannot be auto-labelled with high confidence. large or multiple documents dealing with large or multiple documents is another significant challenge facing nlp models. most nlp research is about benchmarking models on small text tasks, and even state-of-the-art models have a limit on the number of words allowed in the input text. the second problem is that supervision is scarce and expensive to obtain.
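the whitespace-and-punctuation issue in entity annotation mentioned above can be illustrated with a small span-normalisation helper; the example sentence, offsets and label are made up:

```python
import string

def trim_span(text, start, end, label):
    """normalise an entity annotation by dropping leading/trailing
    whitespace and punctuation from the annotated character span."""
    junk = string.whitespace + string.punctuation
    while start < end and text[start] in junk:
        start += 1
    while end > start and text[end - 1] in junk:
        end -= 1
    return {"start": start, "end": end, "label": label, "text": text[start:end]}

doc = "patients on warfarin, showed elevated inr."
# a sloppy crowd-sourced span that caught a leading space and a comma
clean = trim_span(doc, 11, 21, "DRUG")   # raw span text: " warfarin,"
```

consistent rules like this matter because two annotators marking " warfarin," and "warfarin" have labelled the same entity with different offsets, which corrupts both training and evaluation.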
as a result, scaling up nlp to extract context from huge volumes of medium-to-long unstructured documents remains a technical challenge. current nlp models are mostly based on recurrent neural networks (rnns) that cannot represent longer contexts. however, there is a lot of focus on graph-inspired rnns as it emerges that a graph structure may serve as the best representation of nlp data. research at the intersection of dl, graphs and nlp is driving the development of graph neural networks (gnns). today, gnns have been applied successfully to a variety of nlp tasks, from classification tasks such as sentence classification, semantic role labelling and relation extraction, to generation tasks like machine translation, question generation, and summarisation. development time and resources as we mentioned in our previous article regarding the linguistic challenges of nlp, ai programs like alphago have evolved quickly to master a broader variety of games with less predefined knowledge. but nlp development cycles are yet to see that pace and degree of evolution. that’s because human language is inherently complex: it makes "infinite use of finite means" by enabling the generation of an infinite number of possibilities from a finite set of building blocks. the prevalent shape of the syntax of every language is the result of communicative needs and evolutionary processes that have developed over thousands of years. as a result, nlp development is a complex and time-consuming process that requires evaluating billions of data points in order to adequately train ai from scratch. meanwhile, the complexity of large language models is doubling every two months. a powerful language model like gpt-3 packs 175 billion parameters and requires 314 zettaflops, i.e. about 3.14 × 10²³ floating-point operations, to train.
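a quick back-of-the-envelope calculation shows what a training budget of that order means in practice. the per-accelerator throughput and cluster size below are assumptions for illustration, not published figures:

```python
total_flops = 3.14e23    # reported gpt-3 training compute (314 zettaflops)
per_gpu_flops = 1.0e14   # assumed 100 teraflop/s sustained per accelerator
n_gpus = 1024            # assumed cluster size

seconds = total_flops / (per_gpu_flops * n_gpus)
days = seconds / 86400   # roughly a month on the assumed cluster
```

on a single such accelerator the same workload would take on the order of a century, which is why training frontier language models is as much an infrastructure problem as a modelling one.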
it has been estimated that it would cost nearly $100 million in deep learning (dl) infrastructure to train the world’s largest and most powerful generative language model with 530 billion parameters. in 2021, google open-sourced a 1.6 trillion parameter model and the projected parameter count for gpt-4 is about 100 trillion. as a result, language modelling is quickly becoming as economically challenging as it is conceptually complex. scaling nlp nlp continues to be one of the fastest-growing sectors within ai. as the race to build larger transformer models continues, the focus will turn to cost-effective and efficient means to continuously pre-train gigantic generic language models with proprietary domain-specific data. even though large language models and computational graphs can help address some of the data-related challenges of nlp, they will also require infrastructure on a whole new scale. today, vendors like nvidia are offering fully packaged products that enable organisations with extensive nlp expertise but limited systems, hpc, or large-scale nlp workload expertise to scale-out faster. so, despite the challenges, nlp continues to expand and grow to include more and more new use cases.
data overload is becoming a real challenge for businesses of all stripes, even as a majority continue gathering data faster than they can analyse and harness its business value. and it’s not just about volume. much of modern big data, as much as 93%, comes in the form of unstructured data, and most if not all of it ends up as dark data, i.e. collected but not analysed. unlocking knowledge at scale from troves of unstructured organisational data is rapidly becoming one of the most pressing needs for businesses today. recurring themes in this regard include the importance of connected data, the value of applying knowledge in context and the benefits of using ai to contextualize data and create knowledge. and the need for connected, contextualised data, together with continuing developments in ai, has resulted in increasing interest in knowledge graphs as a means to generate context-based insights. in fact, gartner believes that graph technologies are the foundation of modern data and analytics, noting that most client inquiries on the topic of ai typically involve a discussion on graph technology. a brief history of knowledge graphs in 1735, in königsberg, swiss mathematician leonhard euler used a concept of nodes/objects and links/relationships to prove that there was no route across the city’s four districts that would involve crossing each of its seven interconnecting bridges exactly once, thereby laying the foundations for graph theory. cut to more modern times, and 1956 witnessed the development of a semantic network, a well-known ancestor of knowledge graphs, for machine translation of natural languages. fast forward to the early aughts, and sir timothy john berners-lee proposed a semantic web that would use structured and standardized metadata about webpages and their interlinks to make the knowledge stored in these relationships machine-readable.
unfortunately, the concept did not exactly scale but search and social companies were quick to latch on to the value of extremely large graphs and the potential in extracting knowledge from them. google is often credited with rebranding the semantic web and popularising knowledge graphs with the introduction of the google knowledge graph in 2012. most of the first big knowledge graphs, from companies such as google, ibm, amazon, samsung, ebay, bloomberg, ny times, compiled non-proprietary information into a single graph that served a wide range of interests. enterprise knowledge graphs emerged as the second wave and used ontologies to elucidate various conceptual models (schemas, taxonomies, vocabularies, etc.) used across different enterprise systems. back in 2019, gartner predicted that an annualised 100% growth in the application of graph processing and graph databases would help accelerate data preparation and enable more complex and adaptive data science. today, graphs are considered to be one of the fastest-growing database niches, having surpassed the growth rate of standard classical relational databases, and graph db + ai may well be the future of data management. defining knowledge graphs a knowledge graph is quite simply any graph of data that accumulates and conveys knowledge of the real world. data graphs can conform to different graph-based data models, such as a directed edge-labelled graph, a heterogeneous graph, a property graph, etc. for instance, a directed labelled knowledge graph consists of nodes representing entities of interest, edges that connect nodes and reference potential relationships between various entities, and labels that capture the nature of the relationship. so, knowledge graphs use a graph-based data model to integrate, manage and extract knowledge from diverse sources of data at scale. 
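a directed edge-labelled graph of this kind can be sketched as a set of (head, relation, tail) triples, where nodes are entities and labelled edges capture the nature of each relationship. the entities and the inference rule below are a toy illustration:

```python
# a directed edge-labelled graph stored as (head, relation, tail) triples
triples = [
    ("drug:imatinib", "targets", "gene:abl1"),        # illustrative facts
    ("gene:abl1", "associated_with", "disease:cml"),
    ("drug:imatinib", "treats", "disease:cml"),
]

def neighbours(node, relation=None):
    """entities reachable from `node`, optionally filtered by edge label."""
    return [t for h, r, t in triples
            if h == node and (relation is None or r == relation)]

def infer_candidates(disease):
    """toy link prediction: drugs targeting a gene associated with the
    disease are candidate treatments (a crude repurposing rule)."""
    genes = [h for h, r, t in triples
             if r == "associated_with" and t == disease]
    return sorted({h for h, r, t in triples
                   if r == "targets" and t in genes})

candidates = infer_candidates("disease:cml")
```

real knowledge graph systems replace the hand-written rule with learned embeddings and store billions of triples, but the underlying data model is the same.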
knowledge graph databases enable ai systems to deal with huge volumes of complex data by storing information as a network of data points correlated by the nature of their relationships. key characteristics of knowledge graphs by connecting multiple data points around relevant and contextually related attributes, graph technologies enable the creation of rich knowledge databases that enhance augmented analytics. some of the most defining characteristics of this approach include: knowledge graphs work across structured and unstructured datasets and represent the most credible means of aggregating all enterprise data regardless of structure variation, type, or format. compared to knowledge bases with flat structures and static content, knowledge graphs integrate adjacent information on how different data points are correlated to enable a human brain-like approach to derive new knowledge. knowledge graphs are dynamic and can be programmed to automatically identify attribute-based associations across new incoming data. the ability to create connected clusters of data based on levels of influence, frequency of interaction and probability opens up the possibility of developing and training highly complex models. knowledge graphs simplify the process of integrating and analysing complicated data by establishing a semantic layer of business definitions. the use of intelligent metadata enables users to even find insights that otherwise might have been beyond the scope of analytics. applications of knowledge graphs today, knowledge graphs are everywhere. every consumer-facing digital brand, such as google, amazon, facebook, spotify, etc., has invested significantly in building knowledge and the concept of graphs has evolved to underpin everything from critical infrastructure to supply chains and policing. here’s a quick look at how this technology can transform certain key sectors and functions. 
healthcare in the healthcare sector, it is especially critical that classification models are reliable and accurate. but this continues to be a challenge given the volume, quality and complexity of data within the sector. despite the application of advanced classification methodologies, including deep learning, the outcomes do not demonstrate adequate superiority over previous techniques. much of this boils down to the fact that conventional techniques disregard correlations between data instances. however, it has been demonstrated that knowledge graph algorithms, with their inherent focus on correlations, could significantly advance capabilities for the discovery of knowledge and insights from connected data. finance knowledge graphs, and their ability to uncover new dimensions of data-driven knowledge, are expected to be adopted by as much as 80% of financial services firms in the near future. in fact, a 2020 report from business and technology management consultancy capco provided a veritable laundry list of knowledge graph applications across the financial services value chain. for instance, graphs can be used across compliance, kyc and fraud detection to build a ‘deep client insight’ capability that can transform compliance from a cost to a revenue-driving function. the adoption of graph data models could also drive product innovations, given the inflexibility of current tabular data structures to reflect real-world needs. pharma machine learning approaches that use knowledge graphs have the potential to transform a range of drug discovery and development tasks, including drug repurposing, drug toxicity prediction and target gene-disease prioritisation. in knowledge graph-based drug discovery, genes, diseases, drugs, etc. are represented as entities, with the edges indicating relationships/interactions. as a result, an edge between a disease and a drug entity could indicate a successful clinical trial.
similarly, an edge between two drug entities could reference either a potentially harmful interaction or compatibility. the pharma sector is also emerging as the ideal target for text-enhanced knowledge graph representation models that utilise textual information to augment knowledge representations. knowledge graphs and ai/ml ai/ml technologies are playing an increasingly critical role in driving data-driven decision making in the digital enterprise. knowledge graphs will play a significant role in sustaining and growing this trend by providing the context required for more intelligent decision-making. there are two distinct reasons for knowledge graphs being at the epicentre of ai and machine learning. on the one hand, they are a manifestation of ai, given their ability to derive a connected and contextualised understanding of diverse data points. on the other, they also represent a new approach to integrating all the data, structured and unstructured, required to build the ml models that drive decision-making. the combination of knowledge graphs and ai technologies will therefore be critical not only for integrating all enterprise data but also for adding the power of context to augment ai/ml approaches.
today artificial intelligence (ai), machine learning (ml), deep learning (dl) and natural language processing (nlp) are all technologies that have become a part of the fabric of enterprise it. however, solutions providers and end-users often use these terms interchangeably. even though there can be significant conceptual overlaps, there are also important distinctions between these key technologies. increasingly, the value of ai in drug discovery is determined not by model complexity alone, but by how well biological context is preserved across data, computation, and experimentation. platforms such as mindwalk reflect this shift—prioritizing biological fidelity, traceability, and integration with experimental workflows so that computational insight remains actionable as discovery programs scale. here’s a quick overview of the definition and scope of each of these terms. artificial intelligence (ai) the term ai has been around since the 1950s and broadly refers to the simulation of human intelligence by machines. it encompasses several areas beyond computer science including psychology, philosophy, linguistics and others. ai can be classified into four types, from simplest to most advanced, as reactive machines, limited memory, theory of mind and self-awareness. reactive machines: purely reactive machines are trained to perform a basic set of tasks based on certain inputs. this ai cannot function beyond a specific context and is not capable of learning or evolving over time. examples: ibm’s deep blue chess ai, and google’s alphago ai. limited memory systems: as the nomenclature suggests, these ai systems have limited memory to store and analyze data. this memory is what enables “learning” and gives them the capability to improve over time. in practical terms, these are the most advanced ai systems we currently have. examples: self-driving vehicles, virtual voice assistants, chatbots. 
theory of mind: at this level, we are already into theoretical concepts that have not yet been achieved. with their ability to understand human thoughts and emotions, these advanced ai systems can facilitate more complex two-way interactions with users. self-awareness: self-aware ais with human-level desires, emotions and consciousness are the aspirational end state for ai and, as yet, pure science fiction. another broad approach to distinguishing between ai systems is in terms of narrow or weak ai, specialized intelligence trained to perform specific tasks better than humans, artificial general intelligence (agi) or strong ai, a theoretical system that could be applied to any task or problem, and artificial super intelligence (asi), ai that comprehensively surpasses human cognition. the concept of ai is continuously evolving based on the emergence of technologies that enable the most accurate simulation of human intelligence. some of those technologies include ml, dl, and artificial neural networks (ann) or simply neural networks (nn). ml, dl, rl, and drl here’s the tl;dr before we get into each of these concepts in a bit more detail: if ai’s objective is to endow machines with human intelligence, ml refers to methods for implementing ai by using algorithms for data-driven learning and decision-making. dl is a technology for realizing ml and expanding the scope of ai. reinforcement learning (rl), or evaluation learning, is an ml technique. and deep reinforcement learning (drl) combines dl and rl to realize optimization objectives and advance toward general ai. source: researchgate machine learning (ml) ml is a subset of ai that involves the implementation of algorithms and neural networks to give machines the ability to learn from experience and act automatically. ml algorithms can be broadly classified into three categories. 
supervised learning algorithms use a labelled input dataset and known responses to develop a regression/classification model that can then be applied to new datasets to generate predictions or draw conclusions. the limitation of this approach is that it depends on labelled data and does not generalise well beyond the contexts represented in the training set. unsupervised learning algorithms are subjected to “unknown” data that has yet to be categorized or labelled. in this case, the ml system itself learns to classify and process unlabelled data based on its inherent structure. there is also an intermediate approach between supervised and unsupervised learning, called semi-supervised learning, where the system is trained on a small amount of labelled data to determine correlations between data points. reinforcement learning (rl) is an ml paradigm where algorithms learn through ongoing interactions between an ai system and its environment. algorithms receive numerical scores as rewards for generating decisions and outcomes so that positive interactions and behaviours are reinforced over time. deep learning (dl) dl is a subset of ml where models built on deep neural networks work with unlabelled data to detect patterns with minimal human involvement. dl techniques aim to simulate the human brain, using neural networks to teach models to perceive, classify, and analyze information and continuously learn from these interactions. dl techniques can be classified into three major categories: deep networks for supervised or discriminative learning, deep networks for unsupervised or generative learning, and deep networks for hybrid learning, which integrate supervised and unsupervised models. deep reinforcement learning (drl) combines rl with dl techniques to solve challenging sequential decision-making problems. 
because of its ability to learn different levels of abstraction from data, drl is capable of addressing more complicated tasks. natural language processing (nlp) what is natural language processing? nlp is the branch of ai that deals with the training of machines to understand, process, and generate language. by enabling machines to process human languages, nlp helps streamline information exchange between human beings and machines and opens up new avenues by which ai algorithms can receive data. nlp functionality is derived from cross-disciplinary theories from linguistics, ai and computer science. there are two main types of nlp algorithms, rules-based and ml-based. rules-based systems use carefully designed linguistic rules whereas ml-based systems use statistical methods. nlp also consists of two core subsets, natural language understanding (nlu) and natural language generation (nlg). nlu enables computers to comprehend human languages and communicate back to humans in their own languages. nlg is the use of ai programming to mine large quantities of numerical data, identify patterns and share that information as written or spoken narratives that are easier for humans to understand. comparing rules-based and deep learning nlp approaches natural language processing (nlp) systems generally fall into two broad categories: rules-based and deep learning-based. rules-based systems rely on expert-defined heuristics and pattern matching, offering transparency and interpretability. however, they tend to be brittle and limited in scalability across biomedical domains. in contrast, deep learning approaches—including biomedical language models like biobert and libraries like scispacy—automatically learn contextual relationships from large biomedical corpora. these models serve as powerful biomedical text mining tools, offering greater flexibility and accuracy in processing complex, ambiguous language found in clinical narratives, scientific publications, and electronic health records (ehrs). 
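as a toy illustration of the rules-based approach, a hand-written regex can extract drug-dose mentions from a clinical note. the pattern and sample text below are invented for illustration; real biomedical pipelines use far richer rule sets or trained models:

```python
# rules-based nlp sketch: a regex pattern extracts drug-dose mentions.
# transparent and interpretable, but brittle: each new phrasing needs a new
# rule, which is the scalability limit that deep learning models address.

import re

# hypothetical rule: a word followed by a milligram dose, e.g. "aspirin 81 mg"
DOSE_RULE = re.compile(r"(?P<drug>[A-Za-z]+)\s+(?P<dose>\d+\s?mg)", re.IGNORECASE)

def extract_doses(text):
    return [(m.group("drug"), m.group("dose")) for m in DOSE_RULE.finditer(text)]

note = "Patient started on metformin 500 mg daily; aspirin 81 mg continued."
print(extract_doses(note))  # [('metformin', '500 mg'), ('aspirin', '81 mg')]
```

a phrasing the rule has never seen ("half a gram of metformin") silently yields nothing, which is exactly the brittleness the surrounding text describes.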
many life sciences applications now favor hybrid pipelines that combine the precision of rule-based systems with the adaptability of deep learning—balancing interpretability and performance in production settings. conclusion this overview outlines the key technological acronyms shaping today’s discussions around ai-driven drug discovery. you can also explore how ai/ml technologies are advancing intelligent bioinformatics and autonomous drug discovery and the importance and challenges of nlp in biomedical research. curious about nlp? dive deeper into our article for further exploration.
in our previous blog, we noted how the increasing utilization of ai across different phases of the drug discovery process has proven its strategic value in addressing some of the core efficiency and productivity challenges involved. as a result, ai in drug discovery and development has finally cut through the hype and become an industry-wide reality. a key milestone in this process has been the launch of clinical trials for the first drug developed completely using ai. currently, the rapid evolution of ai-powered protein folding algorithms, such as alphafold, rosettafold, and raptorx, promises to dramatically accelerate structural biology, protein engineering, and drug discovery. in fact, ai is expected to underpin a million-x drug discovery future, wherein the ability of these technologies to exponentially scale up protein structure prediction and chemical compound generation will increase the opportunity for drug discovery by a million times. ai-driven drug development also facilitates several other strategic outcomes such as access to larger datasets, reduced drug discovery costs, optimized drug designs, accelerated drug repurposing or repositioning, enabling the discovery of new and hidden drug targets, and turning previously undruggable targets into druggable ones. ai applications in drug design source: springer there are a range of applications for ai across different phases of drug development, from target discovery to clinical studies. here’s a quick overview of how ai can transform some of the key stages of drug design: ai in virtual screening drug discovery typically begins with the identification of targets for a disease of interest, followed by high-throughput screening (hts) of large chemical libraries to identify bioactive compounds. though hts has its advantages, it may not always be appropriate or even adequate, especially in the big data era when chemical libraries have expanded beyond a billion molecules. 
this is where ai-powered virtual screening (vs) methods are being used to complement hts to accelerate the exploratory research process in the discovery of potential drug components. this is due to the ability of ai-based vs to rapidly screen millions of compounds at a fraction of the costs associated with hts and with a prediction accuracy as high as 85%. ai in lead optimization lead optimization (lo) is an essential yet expensive and time-consuming phase in preclinical drug discovery. the fundamental utility of the lo process is to enhance the desirable properties of a compound while eliminating structural deficiencies and the potential for adverse side effects. however, this is a complex multiparameter optimization problem where several competing objectives have to be precisely balanced in order to arrive at optimal drug candidates. done right, lo can significantly reduce the chances of attrition in pre-clinical as well as clinical stages of drug development. and reducing the iterations required for optimization in the design-make-test-analyze (dmta) cycle can help accelerate the drug development process. deep learning generative models are now being successfully used to accelerate the generation of lead compounds while simultaneously ensuring that these compounds also conform to the requisite biological objectives. generative modeling platforms, with integrated predictive models for calculating various absorption, distribution, metabolism, excretion, and toxicity (admet) endpoints, can now significantly shorten the dmta cycle required to select and design compounds that satisfy all defined lo criteria. ai in computer-aided drug synthesis the integration of ai and drug synthesis has been accelerated over the last few years, significantly improving the design and synthesis of drug molecules. ai-driven computer-aided synthesis tools are being widely used in retrosynthetic analysis, reaction prediction, and automated synthesis. 
for instance, these tools can be applied to the retrosynthetic analysis of target compounds to identify feasible synthetic routes, predict reaction products and yields, and optimize hit compounds. ai in computer-aided synthesis planning (casp) is enabling chemists to objectively identify the most efficient and cost-effective synthetic route for a target molecule, thereby accelerating the ‘make’ phase of the dmta cycle. the emergence of intelligent and automated technologies for continuous-flow chemical synthesis promises a future of fully autonomous synthesis. these are just a few examples of the potential for ai in drug discovery and development. in fact, companies are using ai to address key challenges across the r&d pipeline and the life sciences value chain. the future of ai in drug development according to a research paper, the future of drug discovery will entail a centralized closed-loop ml-controlled workflow that autonomously generates hypotheses, synthesizes lead candidates, tests them, and stores the data. according to the paper, the human interface between conventional discovery processes, such as data analysis, computational prediction, and experimentation, results in bottlenecks and biased hypothesis generation, which could be eliminated by a completely automated closed-loop system. fully autonomous drug discovery may well be the future but in the near term, the human component will remain essential in the drug discovery and development process. in the current humans-in-the-loop approach to ai in drug design, ai algorithms are augmenting human intelligence by independently extracting and learning from patterns in vast volumes of complex big data. ai technologies like natural language processing (nlp) are helping to extract insights from unstructured data sources like scientific literature, clinical trials, electronic health records (ehrs), and social media posts that have thus far remained completely underutilized. 
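the ai-powered virtual screening idea described earlier, rapidly ranking a large compound library by predicted properties instead of physically testing every compound, can be sketched as follows. the compounds, property values, and scoring weights are invented for illustration; real pipelines use trained activity and admet models:

```python
# illustrative virtual-screening sketch: rank a compound library by a
# predicted-property score. all values and weights here are made up.

def screen(library, score, top_n=2):
    """return the top_n compounds ranked by predicted score, best first."""
    return sorted(library, key=score, reverse=True)[:top_n]

# hypothetical predicted properties per compound (0-1 scale)
library = [
    {"id": "cmpd-1", "affinity": 0.9, "toxicity": 0.7},
    {"id": "cmpd-2", "affinity": 0.8, "toxicity": 0.1},
    {"id": "cmpd-3", "affinity": 0.4, "toxicity": 0.2},
]

# balance competing objectives: reward binding affinity, penalise toxicity
combined = lambda c: c["affinity"] - 0.5 * c["toxicity"]

hits = screen(library, combined)
print([c["id"] for c in hits])  # ['cmpd-2', 'cmpd-1']
```

the weighted score makes the multiparameter trade-off explicit: cmpd-1 has the best affinity, but its toxicity penalty drops it below cmpd-2, which mirrors how competing lead-optimization objectives have to be balanced.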
most importantly though, ai in drug discovery has grown far beyond hype and hypothesis. as we mentioned in our ai in drug development - from hype to reality blog, today the ai-driven drug discovery space is rife with activity as big pharma, big tech, and big vc-funded scrappy startups jostle for a position in the next big innovation cycle in drug discovery and development. ai-driven innovation is already delivering measurable value across the biopharma research value chain. and companies continue to scale ai across their r&d systems, bringing the industry closer to a potential future of fully autonomous drug discovery.
there will be more than twice as much digital data created over the next five years as has been generated since the advent of digital storage. and a vast majority of that data, more than 80 per cent, will be unstructured and estimated to be growing at 55-65% per year. textual data, in the form of documents, journal articles, blogs, emails, electronic health records and social media posts, is one of the most common types of unstructured data. this is where ai-based technologies like nlp can help extract meaning and context from large volumes of unstructured textual data. nlp unlocks access to valuable new data sources that were hitherto beyond the purview of conventional data integration and analysis frameworks. biomedical-domain-specific nlp techniques open up a gamut of possibilities in automating the extraction of statistical and biological information from large volumes of text including scientific literature and medical/clinical data. more importantly, they bring several new benefits in terms of productivity, efficiency, performance and innovation. key benefits of nlp enabling scale, across multiple dimensions scientific journals and other specialized online publications are critical to the dissemination of experiments and studies in biomedical and life sciences research. every biomedical research project can benefit significantly from extracting relevant scientific knowledge, like protein-protein interactions, for example, embedded in this distributed information trove. and with an estimated 3000 biomedical articles being published every day, nlp becomes an indispensable tool for the collation and propagation of knowledge. it is a similar situation in the clinical context, where nlp can quickly extract meaning and context from a sprawl of unstructured text records such as ehrs, diagnostic reports, medical notes, lab data etc. nlp methods have also been successfully reimagined to scale across structured biological information like sequence data. 
today, high-throughput sequencing technologies are generating ever more biological sequence data that still lacks interpretation or biological annotation. this creates a major integration and analysis bottleneck for conventional downstream frameworks. for instance, at mindwalk we have applied nlp methods to transcribe the universal language of omics and develop a unified framework that can instantly scale across all omics data. uncovering new actionable insights using nlp to expand the scope of biomedical research to textual data can lead to the discovery of insights that lie outside the realm of clinical and biological data. in the clinical context, for example, effective patient-physician communication is vital for enhancing patient understanding of treatment and adherence in order to improve clinical outcomes and patient quality of life. and patient-reported outcome measures (proms) are often used to assess and improve communication. however, one study set out to complement conventional approaches by extracting a patient-centred view of diseases and treatments through social media analytics. the strategy was to use a text-mining methodology to analyse health-related forums to understand the therapeutic experience of patients affected by hypothyroidism and to detect possible adverse drug reactions (adrs) that may not necessarily be communicated in the formal clinical setting. the analysis of reported adrs revealed a pattern in which well-known side effects and uncertainties about proper administration were causing anxiety and fear. the other key finding was that some symptoms quite frequently reported online, like dizziness, memory impairment, and sexual dysfunction, were usually not discussed at in-person consultations. empowering researchers, accelerating research nlp technologies significantly expand the scope and potential of biological research by putting into play vast volumes of information that were hitherto underutilised. 
by automating the analysis of unstructured textual data, it empowers researchers with more data points to explore more correlations and possibilities. in addition, it relieves them from tedious, repetitive tasks thereby allowing them to focus on activities that add real value and accelerate time-to-insight. take rare disease drug development, for example, a field characterised by small patient populations and a shortage of data. to account for the inherent data scarcity, researchers had to manually scour through large volumes of information to identify any links between rare diseases and specific genes and gene variants. the advent of nlp relieves researchers from the tedium of manual search, instantly expands their data universe and helps accelerate the drug development process for rare diseases. enabling innovation nlp can help disrupt and reinvent tried and tested processes that have become part of the established convention in many industries. take biological research, for example, where sequence search and comparison is the launch point for a lot of projects. in this standard process, users typically input a research-relevant biological sequence, in a predefined and acceptable data format, and use relevant search results to chart their research pathway. even though the underlying frameworks, models and algorithms have evolved considerably over the years, the standard process still remains the same; users input a sequence to obtain a list of all pertinent sequences. however, nlp-based innovations, like the mindwalk platform, for example, can completely disrupt this process to yield significant improvements in efficiency, productivity and performance. in the nlp-based model, users can start with a simple text input, say covid, to launch their search. more importantly, the model surfaces all relevant results, both at the sequence and text levels, thereby facilitating a more data-inclusive and integrative approach to genomics research. 
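the text-first search flow described above can be sketched as a toy unified index that maps a keyword to results at both the text and sequence levels. this is an illustrative sketch under invented data, not the actual mindwalk implementation:

```python
# toy unified text + sequence search index (illustrative only):
# one keyword lookup surfaces both literature snippets and sequences.

from collections import defaultdict

index = defaultdict(lambda: {"text": [], "sequence": []})

def register(keyword, kind, item):
    # kind is either "text" (literature) or "sequence" (biological data)
    index[keyword][kind].append(item)

register("covid", "text", "Paper: spike protein structure of SARS-CoV-2")
register("covid", "sequence", "MFVFLVLLPLVSSQ...")  # spike fragment, truncated
register("kinase", "text", "Review: kinase inhibitors in oncology")

def search(keyword):
    # a single text query returns hits at both levels
    hits = index[keyword]
    return hits["text"] + hits["sequence"]

print(search("covid"))  # both the paper and the sequence are returned
```

contrast this with the conventional flow, where the user must already hold a correctly formatted sequence before any search can begin; here a plain keyword is enough to reach both data levels.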
integrative research with biostrand the mindwalk platform is our latest technology innovation in our continuing quest to make omics research more efficient, productive and integrative. by adding literature analysis to our existing omics and metadata integration framework, we now offer a unified solution that scales across sequence data and unstructured textual data to facilitate a truly integrative and data-driven approach to biological research. our platform's semantics-driven analysis framework is fully domain-agnostic and uses a bottom-up approach, which means that even proprietary literature with custom words can be easily parsed. our integrated framework traverses omics data, metadata and textual data to capture all correlated information across structured and unstructured data in one shot. this provides researchers with a ‘single pane of glass’ view of all entities, associations and relationships that are relevant to their research. and we believe that enabling this singular focus on all the most relevant data points and correlations that exist between a specific research purpose and all prior knowledge can help researchers significantly accelerate time to insight and value.
the covid-19 pandemic catalyzed the global life sciences sector into a new normal. the industry as a whole transitioned from a conventional inward-looking model to drive rapid innovation based on technology adoption and collaboration. the entire sector came together, combining individual contributions with collective action to accelerate the development, manufacture, and delivery of vaccines, diagnostics, and treatments for covid-19. there was a notable increase in co-developed assets, with collaborations and partnerships accounting for almost half of those in the late-stage pipeline. the industry also demonstrated the ability to adapt and innovate conventional r&d models in order to respond to the demands of the pandemic. the focus now has to be on building on the learnings and sustaining the momentum from this generational and disruptive experience. even though the life sciences r&d function more than adequately proved its mettle, there are still a few broad challenges that need to be addressed as we move forward. key challenges in life sciences r&d technology the life sciences industry has long relied on point solutions, often adapted from generic solutions, that have been designed to address specific, discrete issues along the r&d pipeline. this has resulted in many r&d organizations having to grapple with multiple loosely connected technologies and siloed legacy systems, each of which focuses on an isolated function rather than a singular strategic outcome. this patchwork integration of disparate solutions will also be unable to cope with the distinctive challenges of life sciences research in the big data age. and finally, these are not frameworks that are easily adapted or upgraded to include emerging technologies such as ml and ai that are becoming critical to data-intensive, outcome-focused, patient-centric research. 
the focus here has to be on reimagining the role of technology in life sciences r&d, with an emphasis on cloud-first modular architectures and integrated user-friendly solutions that facilitate desired research outcomes. data rapid innovations in ngs technologies have resulted in the exponential growth of genomic data that life sciences r&d organizations have to deal with. in addition, there is the ever-expanding catalogue of experimental data sources, including omics data, omics subdisciplines, ehrs, medical imaging data, social networks, wearables etc. data-driven r&d, therefore, has become both a challenge and an opportunity for the life sciences industry. the big data processing capabilities of ml/ai technologies have made them a critical component of most modern r&d pipelines. however, the process of scaling, normalizing, transforming and integrating vast volumes of heterogeneous data still remains a significant bottleneck in biological research. as a result, the life science industry is currently facing a data dilemma wherein the imperative for the democratization of ai to enable value at scale may be being stifled by the reality that 50% of the time is still spent on data preparation and deployment. productivity & innovation the 2020 edition of deloitte’s annual analysis of the returns on r&d investments of a cohort of biopharma companies found a small uptick in their average irr, from 1.5% to 2.7%, suggesting the reversal of a decade-long decline in r&d returns. by 2021, the irr had improved further, from 2.7% to 7.0%, representing the largest annual increase since the study began in 2010. as deloitte emphasized, even though the pandemic had accelerated r&d innovation, sustaining it would require expanding investments in digital technologies, data science approaches and transformative development models. 
moreover, the year-on-year decline in the average cost to bring an asset to market was mainly down to an increase in the number of assets in the late-stage pipeline, and even though average cycle time had improved slightly, it was still above pre-pandemic levels. the challenge now will be to move beyond incremental change and embrace the full-scale transformation of the r&d pipeline in order to boost innovation and productivity. regulation the growing volume of regulatory legislation, often cited as a reason for lower r&d pipeline yields, is emerging as a major challenge for life science organizations. as a result, safety, regulatory, and compliance functions now have to account for a broad range of intricate and complex requirements that vary by market and regulator. for instance, different governments have different evaluation requirements, from health technology assessment (hta) appraisals and health economic data to mandated reductions in price. in europe, life sciences companies are also facing the implementation of comprehensive clinical trials regulation as well as compliance with gdpr. as a result of the ongoing evolution of the regulatory regime, conventional compliance technologies and processes may no longer be enough to assess the risk or ensure compliance with emerging legislation. talent the life sciences sector has witnessed a significant transformation in the role of hr since the onset of the pandemic. over half of the human capital and c-suite leaders in the sector also cite talent scarcity as the factor with the most impact on their business. the life sciences industry requires a unique talent deployment model. according to a 2021 life sciences workforce trends report, high-skill positions account for nearly half (47%) of all life science industry employment, compared to just 27% for all other industries. 
the life sciences also have the highest concentration of stem talent, one in three employees, compared with one in 15 across all other industries. for life sciences companies, the challenge is not only to compete with conventional industries for highly skilled stem talent but also to attract specialist sector talent, such as computational biologists and bioinformaticians, away from deep-pocketed technology companies. and the battle for talent seems to have begun in earnest. in the us, for instance, life sciences companies are embracing skyrocketing real estate costs in key life sciences clusters just to give themselves an edge in the talent war. in the uk, the government has launched a life sciences future skills strategy report in order to plan how to develop future talent for the country’s life sciences sector. for the life sciences industry, the challenge will be to adopt new models of working that will help them attract, engage and retain the talent required for future growth and innovation. towards data-driven patient-centric r&d the life sciences industry is currently at a critical point of inflexion. the covid-19 experience has highlighted the value of technology adoption, collaboration and innovation around r&d models. however, there is still significant progress to be made in terms of addressing cost and productivity inefficiencies in r&d pipelines. concerted investments in technology, data management and talent can help address these issues and transition the sector to a truly data-driven patient-centric approach to r&d.
a biomarker, or biological marker, broadly refers to a range of objectively, accurately, and reproducibly measurable medical signs. these signs indicate the medical state of a patient in terms of the presence or progress of a particular disease, or the effect of treatment on it. a simple example of a biomarker is blood pressure, the most relied upon medical sign for the diagnosis and treatment of hypertension in clinical practice. in terms of a more formal definition, the european medicines agency (ema) defines biomarkers as any biological molecule found in body fluids or tissues that can be used to follow body processes and diseases in humans and animals. the biomarkers, endpoints, and other tools (best) glossary from the food and drug administration and national institutes of health (fda-nih) biomarker working group defines them as a characteristic that is measured as an indicator of normal biological processes, pathogenic processes, or biological responses to an exposure or intervention, including therapeutic interventions. in recent years, biomarker research has helped open new perspectives on the molecular dynamics of complex diseases. as a result, biomarkers are now used widely in healthcare settings as well as in different phases across the drug discovery and development pipeline. the use of biomarkers during drug development has increased significantly over the years. according to one study, well over half of all fda/ema approvals between 2015 and 2019 were supported by biomarker data during at least one of the development stages. in 2020, more than 33,000 clinical trials, including around 4000 phase 3 and 4 trials, registered with the clinicaltrials.gov database involved biomarkers. the evolution toward multi-component biomarkers the best glossary defines seven categories of biomarkers, with the possibility for a biomarker to have characteristics associated with different categories. 
susceptibility/risk biomarkers are linked to the likelihood that individuals will develop a disease or medical condition that they do not yet have. these biomarkers enable the detection of medical conditions years before the appearance of clinical signs and symptoms, though they do not describe a relationship to any specific treatment. prognostic biomarkers help identify the likelihood of a specific future clinical event, disease recurrence or progression in individuals who have already been diagnosed with a disease or medical condition. these biomarkers are measured at a defined baseline and may be associated with specific treatments. diagnostic biomarkers are used to detect or confirm a disease or condition of interest in individuals or to identify individuals with a subtype of the disease. biomarkers that enable the diagnosis of disease subtypes can often play a critical role as prognostic and predictive biomarkers. predictive biomarkers identify individuals based on the likelihood of their response, favourable or unfavourable, to exposure to a particular medical product or environmental agent. these biomarkers apply to a wide variety of interventions in a clinical trial setting and in informing patient care decisions. monitoring biomarkers are those that are assessed repeatedly over time and are used to evaluate the status of a disease or condition, including its progression and response to treatment. response biomarkers identify the biological response in individuals exposed to a medical product or an environmental agent. pharmacodynamic biomarkers belong to this category and measure the response, including the potential for harm, to establish proof-of-concept, assist in dose selection, etc. safety biomarkers are measured before or after exposure to a medical product or an environmental agent to indicate the likelihood, presence, or extent of toxicity as an adverse drug or exposure effect. 
these biomarkers can also be used to identify patients at risk from specific therapies. as the scene of biomarkers evolves, new categories are being added to these fundamental seven types. companion diagnostics (cdx) is the development of predictive biomarkers in conjunction with novel therapeutics. the fda’s concurrent approval of trastuzumab and the her2 immunohistochemical (ihc) assay through a coordinated procedure launched this new drug-diagnostic co-development model. companion diagnostics subdivide patients based on molecular biomarkers, which can then be used to drive decision-making on patient selection for drug trials and identification of clinically effective drugs for personalised treatments. another new category of note is digital biomarkers, defined as ‘characteristics, collected from digital health technologies, that are measured as an indicator of normal biological processes, pathogenic processes, or responses to an exposure or intervention, including therapeutic interventions. digital biomarkers open up new possibilities for the remote and continuous assessment of patients and preclinical subjects, even in non-clinical settings through clinical trials or disease progression. however, as biomarker research evolves, the use of multi-component biomarkers continues to expand. earlier this year, the fda convened a public meeting to identify concepts and terminology for further development and use of multi-component biomarkers. as per the agency, multi-component or multi-variate biomarkers could comprise multiple components of the same type or of different types based on independent measurements. these measurements could be used independently and/or in combination as a characteristic that indicates normal biological processes, pathogenic processes, or responses to an exposure or intervention, including therapeutic interventions and environmental exposures. 
at the same time, there is also ongoing research aimed at developing a multi-dimensional, evidence-based approach to biomarker-drug classification. the argument is that the usefulness of a biomarker for drug development has to be assessed across two dimensions: one, to determine the biological plausibility of the biomarker acting directly on the drug target, and two, to establish its precision in predicting the clinical outcome in terms of drug efficacy or safety.

biomarkers in drug development

research shows that the use of biomarkers for patient stratification improves the probability of success (pos) of clinical trials across all phases, most significantly in phases 1 and 2. the overall pos of trials that used biomarkers was almost double that of non-biomarker trials. however, clinical trials are just one phase in the development cycle. even within drug development pipelines that incorporate biomarkers, less than 10% use them across all stages of development. and oncology still leads other therapeutic areas in the effective adoption of biomarkers. moreover, the resource intensity of biomarker research means most studies are linked to specific drug development programmes. biomarker-driven drug development has the potential to enhance the efficiency and productivity of drug discovery and development while simultaneously reducing time and cost. increasing the availability of public biomarker data while expanding the scope of individual programmes will be key to realising the full potential of biomarker-driven drug development.
today, the integrative computational analysis of multi-omics data has become a central tenet of the big data-driven approach to biological research. and yet, there is still a lack of gold standards when it comes to evaluating and classifying integration methodologies that can be broadly applied across multi-omics analysis. more importantly, the lack of a cohesive or universal approach to big data integration is also creating new challenges in the development of novel computational approaches for multi-omics analysis. one aspect of sequence search and comparison, however, has not changed much at all – a biological sequence in a predefined and acceptable data format is still the primary input in most research. this approach is arguably valid in many, if not most, real-world research scenarios. take machine learning (ml) models, for instance, which increasingly play a central role in the analysis of genomic big data: biological data presents several unique challenges, such as missing values and precision variations across omics modalities, that expand the gamut of integration strategies required to address each specific challenge. for example, omics datasets often contain missing values, which can hamper downstream integrative bioinformatics analyses. this requires an additional imputation process to infer the missing values in these incomplete datasets before statistical analyses can be applied. then there is the high-dimension low sample size (hdlss) problem, where the variables significantly outnumber the samples, leading ml algorithms to overfit these datasets and thereby decreasing their generalisability on new data. in addition, there are multiple challenges inherent to all biological data, irrespective of analytical methodology or framework.
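the imputation step described above can be sketched in a few lines of python. this is a minimal illustration only – per-feature mean imputation on a toy expression matrix with invented values – not a substitute for the more sophisticated methods (e.g. knn or matrix-factorisation imputation) used in real pipelines.

```python
# minimal sketch: per-feature mean imputation on an omics matrix
# (rows = samples, columns = features); None marks a missing measurement.
# illustrative only -- real pipelines use dedicated imputation methods.

def impute_mean(matrix):
    """replace None entries with the mean of the observed values in that column."""
    n_cols = len(matrix[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        col_means.append(sum(observed) / len(observed))
    return [
        [row[j] if row[j] is not None else col_means[j] for j in range(n_cols)]
        for row in matrix
    ]

# toy expression matrix with two missing values
expression = [
    [1.0, 2.0, None],
    [3.0, None, 6.0],
    [5.0, 4.0, 9.0],
]
imputed = impute_mean(expression)
# observed column means are 3.0, 3.0, and 7.5
```

the same pattern generalises to any samples-by-features omics matrix; only after a step like this can downstream statistical analyses be applied to the complete dataset.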
to start with, there is the sheer heterogeneity of omics data, comprising a variety of datasets that originate from a range of data modalities and carry completely different data distributions and types, all of which have to be handled appropriately. integrating heterogeneous multi-omics data presents a cascade of challenges involving the unique data scaling, normalisation, and transformation requirements of each individual dataset. any effective integration strategy will also have to account for the regulatory relationships between datasets from different omics layers in order to accurately and holistically reflect the nature of this multidimensional data. furthermore, there is the issue of integrating omics and non-omics (ono) data – clinical, epidemiological, or imaging data, for example – in order to enhance analytical productivity and to access richer insights into biological events and processes. currently, the large-scale integration of non-omics data with high-throughput omics data is extremely limited due to a range of factors, including heterogeneity and the presence of subphenotypes. the crux of the matter is that without effective and efficient data integration, multi-omics analysis will only become more complex and resource-intensive without any proportional, or even significant, augmentation in productivity, performance, or insight generation.

an overview of multi-omics data integration

early approaches to multi-omics analysis involved the independent analysis of different data modalities and the combination of results for a quasi-integrated view of molecular interactions. the field has since evolved significantly into a broad range of novel, predominantly algorithmic meta-analysis frameworks and methodologies for the integrated analysis of multi-dimensional multi-omics data. however, the topic of data integration and the challenges involved is often overshadowed by the ground-breaking developments in integrated, multi-omics analysis.
it is therefore essential to understand the fundamental conceptual principles, rather than any specific method or framework, that define multi-omics data integration.

horizontal vs vertical data integration

multi-omics datasets are broadly organised as horizontal or vertical, corresponding to the complexity and heterogeneity of multi-omics data. horizontal datasets are typically generated from one or two technologies, for a specific research question and from a diverse population, and represent a high degree of real-world biological and technical heterogeneity. horizontal, or homogeneous, data integration therefore involves combining data from across different studies, cohorts, or labs that measure the same omics entities. vertical data refers to data generated using multiple technologies, probing different aspects of the research question, and traversing the possible range of omics variables, including the genome, metabolome, transcriptome, epigenome, proteome, microbiome, etc. vertical, or heterogeneous, data integration involves multi-cohort datasets from different omics levels, measured using different technologies and platforms. the fact that vertical integration techniques cannot be applied to horizontal integrative analysis, and vice versa, opens up an opportunity for conceptual innovation in multi-omics: data integration techniques that can enable an integrative analysis of both horizontal and vertical multi-omics datasets. of course, each of these broad data classes can be further broken down into a range of approaches based on utility and efficiency.

5 integration strategies for vertical data

a 2021 mini-review of general approaches to vertical data integration for ml analysis defined five distinct integration strategies – early, mixed, intermediate, late, and hierarchical – based not just on the underlying mathematics but on a variety of factors, including how they were applied. here’s a quick rundown of each approach.
early integration is a simple and easy-to-implement approach that concatenates all omics datasets into a single large matrix. this increases the number of variables without altering the number of observations, which results in a complex, noisy, high-dimensional matrix that discounts dataset size differences and data distributions.

mixed integration addresses the limitations of the early model by separately transforming each omics dataset into a new representation and then combining them for analysis. this approach reduces noise, dimensionality, and dataset heterogeneities.

intermediate integration simultaneously integrates multi-omics datasets to output multiple representations, one common and some omics-specific. however, this approach often requires robust pre-processing due to potential problems arising from data heterogeneity.

late integration circumvents the challenges of assembling different types of omics datasets by analysing each omics layer separately and combining the final predictions. this multiple single-omics approach does not capture inter-omics interactions.

hierarchical integration focuses on the inclusion of prior regulatory relationships between different omics layers so that analysis can reveal the interactions across layers. though this strategy truly embodies the intent of trans-omics analysis, it is still a nascent field, with many hierarchical methods focusing on specific omics types, thereby making them less generalisable.

this bewildering choice of conceptual approaches to multi-omics data integration – each with its own scope and limitations in terms of throughput, performance, and accuracy – represents one of the biggest bottlenecks to downstream analysis and biological innovation. researchers often spend more time mired in the tedium of data munging and wrangling than they do extracting knowledge and novel insights.
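as a rough mechanical illustration of how two of these five strategies differ, the python sketch below contrasts early integration (concatenating per-omics features into one wide matrix) with late integration (combining per-omics prediction scores). the data and scores are invented toy values; real implementations would plug in actual models and handle scaling and missingness first.

```python
# toy sketch of early vs late vertical integration.
# rows = samples; each omics block holds that layer's features per sample.

def early_integration(*omics_blocks):
    """concatenate per-omics feature rows into a single wide matrix."""
    return [sum(rows, []) for rows in zip(*omics_blocks)]

def late_integration(per_omics_scores):
    """combine per-omics prediction scores per sample by averaging."""
    return [sum(scores) / len(scores) for scores in zip(*per_omics_scores)]

transcriptome = [[0.1, 0.9], [0.8, 0.2]]   # 2 samples x 2 features
methylome = [[0.5], [0.4]]                 # 2 samples x 1 feature

# early: one wide, high-dimensional matrix for a single downstream model
wide = early_integration(transcriptome, methylome)

# late: each omics layer was modelled separately; only scores are combined
combined = late_integration([[0.7, 0.3], [0.9, 0.1]])
# combined[0] is the average of 0.7 and 0.9, i.e. roughly 0.8
```

note how late integration never sees features from two layers together, which is exactly why it cannot capture inter-omics interactions.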
most conventional approaches to data integration, moreover, seem to involve some form of compromise, affecting either the integrity of high-throughput multi-omics data or the achievement of true trans-omics analysis. there has to be a new approach to multi-omics data integration that can (1) enable the one-click integration of all omics and non-omics data, and (2) preserve biological consistency – in terms of correlations and associations across different regulatory datasets – for integrative multi-omics analysis in the process.

the mindwalk hyft model for data integration

at mindwalk, we took a lateral approach to the challenge of biological data integration. rather than start with a technological framework that could be customised for the complexity and heterogeneity of multi-omics data, we set out to decode the atomic units of all biological information, which we call hyfts™. hyfts are essentially the building blocks of biological information, which means that they enable the tokenisation of all biological data, irrespective of species, structure, or function, into a common omics data language. we then built the framework to identify, collate, and index hyfts from sequence data. this enabled us to create a proprietary pangenomic knowledge database of over 660 million hyfts, each containing comprehensive information about variation, mutation, structure, etc., from over 450 million sequences available across 12 popular public databases. with the mindwalk platform, researchers and bioinformaticians have instant access to all the data from some of the most widely used omics data sources. plus, our unique hyfts framework allows researchers the convenience of one-click normalisation and integration of all their proprietary omics data and metadata. based on our biological discovery, we were able to normalise and integrate all publicly available omics data, including patent data, at scale, and render them multi-omics analysis-ready.
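hyfts themselves are proprietary, so no public sketch can reproduce them. purely as a loose analogy for how a shared token vocabulary can make heterogeneous sequences mutually indexable, the python sketch below tokenises sequences into overlapping k-mers and intersects the resulting vocabularies; this is an invented illustration, not the hyft algorithm.

```python
# loose analogy only: overlapping k-mer tokens as a shared vocabulary
# across sequences. hyfts are a proprietary, richer construct.

def kmers(seq, k=4):
    """overlapping k-mer tokens of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def shared_tokens(seq_a, seq_b, k=4):
    """tokens common to two sequences -- a crude cross-sequence index."""
    return sorted(set(kmers(seq_a, k)) & set(kmers(seq_b, k)))

# two invented dna fragments that overlap around 'cgtacg'
common = shared_tokens("ATGGCGTACG", "CCGTACGTTA", k=4)
```

once every sequence is reduced to tokens from one vocabulary, indexing and comparison across species, formats, and databases becomes a set operation rather than a format-conversion problem – which is the general idea the text describes.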
the same hyft ip can also be applied to normalise and integrate proprietary omics data. the transversal language of hyfts enables the instant normalisation and integration of all research-relevant multi-omics data and metadata into one single source of truth. with the mindwalk approach to multi-omics data integration, it no longer matters whether research data is horizontal or vertical, homogeneous or heterogeneous, text or sequence, omics or non-omics. if it is data that is relevant to your research, mindwalk enables you to integrate it with just one click.
one of the key highlights of the global response to covid-19 has been the importance of effective, ethical, and equitable data sharing, and how it can exponentially accelerate outbreak research. however, data sharing cannot just be an incident response strategy. with the volume of biological data increasing exponentially year on year, the public sharing of experimental multi-omics data needs to become part of the culture of biological and life sciences research. the ability to assemble data from across domains and disciplines will pave the way for more integrated multi-omics and cross-disciplinary research and expand the potential for more sophisticated insights into biological systems. there are already concerted efforts within the industry to evolve towards an open data paradigm in biological research. however, the public availability of data will not automatically translate into enhanced value in terms of insights and knowledge. it has to be accompanied by a radical rethink of how data is generated, stored, shared, and accessed. and, equally importantly, data management frameworks will have to account for the value of metadata.

the challenges of metadata

metadata – essentially data that describes data – provides context and provenance to raw data and is crucial to both data discovery and validation. for instance, metadata can describe a sample in terms of the biomaterial it was derived from, how the sample was handled, and the details of the processes used for sample purification, profiling, and quantification, and can provide detailed information on the experimental set-up and procedure. integrating data from multiple analyses and experiments enables high-level research that can address more complex questions in the life sciences. however, several case studies indicate that even current efforts to prioritise open data have not been able to catalyse open analysis at a proportionate scale. there are several reasons for this lag.
a key challenge in this context is the fact that many research groups have adopted community-specific conventions that are not easily scalable for multidisciplinary research. many researchers still use no formal reporting conventions, or completely exclude the metadata critical to the interpretation and reuse of data. often, the metadata included with open datasets is incomplete and/or poorly annotated. when it comes to data integration, conventional methods fall into one of two categories – multi-staged or meta-dimensional analysis. though meta-dimensional analysis is capable of incorporating all data into a comprehensive metadata matrix, combining data from different datasets still remains a significant challenge. data integration is further complicated by the lack of user-friendly tools for researchers with limited bioinformatics, biostatistics, and programming expertise. open data without metadata adds no discernible new value to the research process. though the biomedical community is making a concerted effort to share omics data, there is still a lack of consistency among researchers in ensuring that this data is backed by complete, annotated, and usable metadata. so, if open analysis is to catch up with open data, there needs to be a universal standard for the reporting and sharing of valuable data.

standardising biological metadata

though metadata has been acknowledged as a key component of research infrastructure design, there is still no universal standard for reporting and sharing metadata. instead, there have been numerous initiatives for the development of hundreds of metadata standards with diverse characteristics. however, there has been a conceptual consensus on the three types of metadata standards – descriptive, administrative, and structural. there is also general agreement that metadata is key to supporting fair principles in order to overcome obstacles to data discovery and reuse for both humans and machines.
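a metadata completeness check of the kind such standards would enable can be sketched very simply. the required fields below are invented for illustration and are not drawn from any formal standard (miame, fair profiles, etc.):

```python
# minimal sketch of a metadata completeness check against a hypothetical
# set of required descriptive fields for an omics sample record.

REQUIRED_FIELDS = {"sample_id", "organism", "tissue", "assay", "platform"}

def missing_fields(record):
    """return the required fields that are absent or empty in a record."""
    return sorted(
        f for f in REQUIRED_FIELDS
        if f not in record or record[f] in (None, "")
    )

# an incomplete record of the kind commonly found in public repositories
record = {"sample_id": "S001", "organism": "homo sapiens", "assay": "rna-seq"}
gaps = missing_fields(record)  # the record lacks tissue and platform fields
```

the hard part, as the text notes, is not the check itself but agreeing on one set of required fields and one format across hundreds of repositories.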
as a result, there has been some change in the profile of public data, with many repositories showing some degree of “fairness” and several new projects emerging with fair as a central objective. in addition, many scientific journals are now urging researchers to make data shareable and public, even as they endorse public data repositories that implement fair principles. notwithstanding that, much of the data in public repositories is far from being perfectly fair. one limited study of engineered nanomaterial databases found that even though a majority met fair criteria, one of the potential areas of improvement was the use of standard schemas for metadata. another study evaluating the completeness of metadata, referenced to nine clinical phenotypes, in public omics data reported a large variability in both the number and consistency of reported clinical phenotypes. even coordinated efforts designed to encourage metadata sharing, like miame, have had a limited impact, given that they define the content but not the format of this information. the creation of a unified framework for metadata continues to be a significant challenge, with the public data landscape still characterised by diverse databases and standards that require users to devise and manage compatibility themselves. it is therefore going to require a monumental and orchestrated effort to ensure data and metadata quality adherence across the universe of public data repositories. the primary reason for this is that genomic data organisation is an inherently fraught endeavour. for instance, files come in multiple formats with semantics too widely different to fit neatly into a predefined universal framework. more importantly, there is no commonly accepted standard for a general yet basic data unit that can represent the heterogeneous and multi-dimensional data assets that are central to biological research.
hyfts – the atomic units of biological data

at mindwalk, we apply advanced nlp techniques to protein and dna sequences to transcribe the universal language of all omics data. in doing so, we were able to decode the atomic units of information, called hyfts™, that are the building blocks of biological information. with hyfts, all biological data, irrespective of species, structure, or function, can be tokenised into a common omics data language. in addition, these atomic data units are also extremely efficient carriers of biological information. each hyft pattern represents a unique signature sequence in dna, rna, and amino acid (aa) sequences, and integrates data and metadata across all omics layers. the transversal language of hyfts enables the unification, standardisation, and normalisation of all data, across species and domains, to create a single source of truth. the default integration of omics layers and associated metadata, combined with the lensai platform’s hyper-scalable technology and unified analytical framework, enables truly integrated multi-omics research. standardised metadata is the key to the usability and reproducibility of public data. with mindwalk, all public data is usable, and all biological research is reproducible.
current diagnostic alternatives for neurodegenerative diseases like alzheimer’s, parkinson’s, down’s syndrome, dementia, and motor neuron disease are either invasive lumbar punctures, expensive brain imaging scans, pen-and-paper cognitive tests, or a simple blood test in a primary care setting to check for nfl (neurofilament light chain) concentration. similarly, despite increasing evidence that exercise could delay or even prevent alzheimer’s, there are currently no cost-effective or scalable procedures to validate or measure that correlation. however, research has now revealed that post-exercise increases in levels of plasma ctsb, a protease positively associated with learning and memory, could help evaluate how training influences cognitive change. nfl and plasma ctsb are two prime examples of biomarkers – biological molecules or characteristics found in body fluids and tissues that can be objectively measured and evaluated to differentiate between normal biological processes and pathogenic processes, or to assess pharmacologic responses to therapeutic interventions.

the growing promise of biomarkers

in the seven decades since the term was first introduced, biomarkers have evolved from simple indicators of health and disease to transformative instruments in clinical care and precision medicine. today, biomarkers have a wide variety of applications – diagnostic, prognostic, predictive, disease screening and detection, treatment response, risk stratification, etc. – across a broad range of therapeutic areas (cancer, cardiovascular, hepatic, renal, respiratory, neuroscience, gastrointestinal, etc.). in keeping with the times, we now also have digital biomarkers – objective, quantifiable physiological and behavioural data collected and measured by digital devices.
biomarkers are at the heart of ground-breaking medical research to, for instance, reveal the underlying mechanism in acute myelogenous leukemia, improve the prognosis of gastric cancer, establish a new prognostic gene profile for ovarian cancer, and provide novel etiological insights into obesity that facilitate patient stratification and precision prevention. biomarkers are also playing an increasingly critical role in the drug discovery, development, and approval process. they enable a better understanding of the mechanism of action of a drug, help reduce the risk of failure and discovery costs, and allow for more precise patient stratification. between 2015 and 2019, more than half of the drugs approved by the ema and fda were supported by biomarker data during the development stage. it is, therefore, hardly surprising that there is currently a lot of focus on biomarker discovery. however, this inherently complex process is only getting more complex, data-driven, and time-consuming – and that introduces some significant new challenges along the way.

the increasing complexity of biomarker discovery

initially, a biomarker was a simple one-dimensional molecule whose presence, or absence, indicated a binary outcome. however, single biomarkers lack the sensitivity and specificity required for disease classification and outcome prediction in a clinical setting. soon, biomarker discovery came to include panels – sets of biomarkers working together to enhance diagnostic or prognostic performance. then the field shifted again, toward spatially resolved biomarkers that reflect the complexity of the underlying diseases. rather than just providing aggregated information, these higher-order biomarkers incorporate the spatial data of cells expressing relevant molecular markers.
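the idea behind panels – combining individually weak markers into one stronger score – can be sketched with a simple weighted logistic combination. the weights, intercept, and measurements below are purely illustrative, not fitted to any real cohort:

```python
# sketch: combining several biomarker measurements into one panel risk
# score via a weighted logistic function. all numbers are invented.
import math

def panel_score(markers, weights, intercept=0.0):
    """logistic combination of biomarker measurements into a 0-1 risk score."""
    z = intercept + sum(w * m for w, m in zip(weights, markers))
    return 1.0 / (1.0 + math.exp(-z))

# two markers, each only weakly informative on its own
score = panel_score(markers=[1.2, 0.8], weights=[0.9, 1.1], intercept=-1.5)
# z = -1.5 + 1.08 + 0.88 = 0.46, so the score is a little above 0.6
```

in practice the weights would be fitted (e.g. by logistic regression) and the panel validated for sensitivity and specificity, which is precisely what single markers tend to lack.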
at the same time, biomarker developers are also integrating a whole range of omics data sets, such as genomics, proteomics, metabolomics, epigenetics, etc., in order to get a more holistic view that could augment our ability to understand diseases and identify novel drug targets. the scope of biomarker discovery just keeps getting wider with the emergence of new data-gathering technologies – single-cell next-generation sequencing, liquid biopsy (blood samples) for circulating tumour dna, microbiomics, radiomics – and with high-throughput technologies generating enormous volumes of data at a relatively low cost. the big challenge, therefore, will be in the integration and analysis of these huge volumes of multimodal data. plus, biomarker data comes with some challenges of its own.

biomarker data challenges

data scarcity: despite their widespread currency, there are still very few biomarker databases available to developers. in addition, there can also be a lack of systemic omics studies and biological data relevant to biomarker research. for instance, metabolomics data, critical to biomarker research into radiation resistance in cancer therapy, is not part of large multi-omics initiatives such as the cancer genome atlas. it will therefore require a network-centric approach to analytics that enables data enrichment and modelling with other available datasets.

data fragmentation: biomarker data is typically distributed across subscription-based commercial databases with no provision for cross-database interconnectivity, and a few open-access databases, each with its own therapeutic or molecular specialisation. so, a truly multi-omics approach to analysis will depend entirely on the efficiency of data integration.

lack of data standardisation: many sources do not follow fair database principles and practices.
moreover, different datasets are also generated using heterogeneous profiling technologies, pre-processed using diverse normalisation procedures, and annotated in non-standard ways. intelligent, automated normalisation should be a priority.

how mindwalk can help

at mindwalk, we understand that a systems biology approach is crucial to the success of biomarker discovery. our unique hyft™ ip was born out of the acknowledgement that the only way to accelerate biological research was to unify all biological data with a common computational language.

access all biological data with hyft™: on the mindwalk platform, multi-omics data integration is as simple as logging in. using hyft™, we have already normalised, integrated, and indexed 450 million sequences available across 11 popular omics databases. that’s instant access to an extensive omics knowledge base with over a billion hyfts™, with information about variation, mutation, structure, etc. what’s more, integrating your own biomarker research is just a click away. add structured databases (icd codes, lab tests, etc.) and unstructured datasets (patient record data, scientific literature, clinical trial data, chemical data, etc.), and our technology will seamlessly normalise and standardise all your data and make it computable, enabling a truly integrative multi-omics approach to biomarker discovery.

accurate annotation and analysis: the mindwalk genomic analysis tools provide unmatched accuracy in annotation and variation analysis, such as in the large-scale whole-genome data of patients with a specific disease. use our platform’s advanced annotation capabilities to extract insights from genomic datasets and fill in the gaps in biomarker datasets.

comprehensive data mining: combine the power of our hyft™ database with the graph-based data mining capabilities of our ai-powered platform to discover knowledge that can accelerate the development process.
from single biomarkers to systems biology

biomarkers have evolved considerably since their days as simple single-molecule indicators of biological processes. today, biomarker discovery is a sophisticated systems biology practice that unravels complex molecular interactions and expands the boundaries of clinical medicine and drug development. as the practice gets more multifaceted, it will also require more advanced data integration, management, and analysis tools. the mindwalk platform provides an integrated solution for the normalisation, integration, and analysis of high-volume, high-dimensional data.
in our previous blog on integrated multi-omics, we gave an overview of the need for the increase in scale, dimensionality, and heterogeneity of genomic data to be matched by a shift from reductionist biology to a holistic systems biology approach to omics analysis. an integrated multi-dimensional and multivariate model for analysis is absolutely imperative if we are to create a more comprehensive multi-scale characterisation and understanding of biological systems. however, there are several limitations – scalability and reproducibility, for example – in applying conventional integration and interpretation techniques to multi-omics data. and conventional single-omics analyses are wholly ineffective in interpreting complex cellular mechanisms or identifying the underlying causes of multifaceted diseases. as a result, multi-omics analyses increasingly rely on advanced computational methods and intelligent ai-powered technologies like ml, deep learning, and nlp to optimise data management and transform multi-omics data into clinically actionable knowledge.

ai/ml in multi-omics analysis

ml and ai capabilities have evolved exponentially in recent years and have been applied consistently, if not extensively, for decision support in several healthcare scenarios, including the management of several communicable and noncommunicable diseases.

(image source: biomolecules)

a recent special issue of the journal biomolecules, focusing on the integrated analysis of omics data using ai, ml, and deep learning (dl), listed three areas in medical research – medical image analysis, omics analysis, and natural language processing – where ai is currently being implemented. here, then, is an overview of the transformative potential of ai in these specific areas.

medical image analysis

intelligent technologies are a critical component of radiogenomics, which focuses on studying the relationship between the imaging phenotypes and genomic phenotypes of specific diseases.
technologies such as ai/ml and dl have demonstrated their ability to extract meaningful information from medical imaging data, sometimes with greater precision than human experts, and have helped bring automated, accurate, and ultra-fast medical image analysis into the mainstream. for example, researchers in japan have used ai to successfully detect recurrent prostate cancer by analysing pathology images to identify features relevant to cancer prognosis that had not previously been noted by pathologists. ai techniques like deep learning have even been used to predict neurological diseases like alzheimer's and amyotrophic lateral sclerosis (als) before patients become symptomatic. by training convolutional neural networks on images of motor neurons prepared from the ips cells of 15 healthy donors and 15 als patients, researchers were able to predict whether donors were healthy or als patients with an area under the curve (auc) of over 0.97.

omics analysis

ai, ml, and dl have been used extensively across a range of applications, including the improvement of disease predictions, establishing the moas of compounds, studying gene regulation, and the molecular profiling of various diseases, to name a few. take the case of cancer, for instance, where treatments based solely on pathological features can yield very different outcomes. it is therefore important to be able to break down broad symptomatic classifications into more finely defined subtypes in order to administer more focused clinical care. by applying supervised and unsupervised learning techniques to rna, mirna, and dna data from hepatocellular carcinoma patients, researchers were able to identify two subgroups with significant survival differences, isolate consensus driver genes associated with survival, and establish that consensus driver mutations were associated more with mrna transcriptomes than with mirna transcriptomes.
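the auc figures quoted in studies like these measure the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. the python sketch below computes auc on invented toy scores via the rank-sum (mann–whitney) identity:

```python
# sketch: area under the roc curve (auc) via the rank-sum identity.
# scores are invented toy values, not from any of the cited studies.

def auc(pos_scores, neg_scores):
    """probability that a random positive scores above a random negative."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

value = auc(pos_scores=[0.9, 0.8, 0.6], neg_scores=[0.7, 0.3, 0.2])
# 8 of the 9 positive/negative pairs are correctly ordered, so auc = 8/9
```

an auc of 0.5 corresponds to random guessing and 1.0 to perfect separation, which puts figures like 0.97 (als prediction) and 0.74 (speech-based alzheimer’s prediction, discussed below) in context.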
ai has also been used to refine cancer prognosis and therapy by detecting subtypes among cancer patients, identifying biomarkers that determine the recurrence of cancer, and modelling drug response behaviour. in drug development, techniques like gene signature analysis and high-throughput screening have made it quicker and easier to identify compounds that affect a specific target or phenotype. even so, these compounds may still induce complex downstream functional consequences that could adversely affect their chances of approval. understanding the moas (modes of action) of compounds is crucial to increasing the success rate of clinical trials and drug approvals. researchers have demonstrated that it is possible to determine unknown moas, even in the absence of a comparable reference, by combining multi-omics with an interpretable machine learning model of transcriptomics, epigenomics, metabolomics, and proteomics.

natural language processing (nlp)

as omics data continues to pile up in the petabytes, the urgency to convert this data into meaningful biological and clinical insights only grows. however, not all omics researchers and bioinformaticians have the statistical expertise to do so. it is in this context that ai techniques like nlp can help accelerate the pace of multi-omics research by making analytics accessible to a wider audience. for example, one open-access, natural language-oriented, ai-driven platform for analysing and visualising cancer omics data lets users interact with the system through a natural language interface. researchers ask biological questions, and the system responds by identifying relevant genomics datasets, performing various analyses, returning appropriate results, and even learning from user feedback. one ai model uses nlp to analyse short speech samples from a clinically administered cognitive test to predict the eventual onset of alzheimer's disease in healthy people with an auc of 0.74.
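since auc figures come up repeatedly in these studies, it is worth seeing how the metric is computed. the sketch below (with illustrative toy scores, not data from any cited study) computes auc as the probability that a randomly chosen positive case is scored above a randomly chosen negative one.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for positive (e.g. a patient who develops the disease), 0 for negative.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores from a hypothetical classifier
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(round(auc(labels, scores), 3))  # → 0.889
```

an auc of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is why figures like 0.97 and 0.74 can be compared directly across very different models.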
nlp can also help researchers identify, relate, and analyse datasets from public repositories, with one solution combining nlp techniques, biomedical ontologies, and the r statistical framework to simplify the association of samples from large-scale repositories with their ontology-based annotations.

lensai platform - the ai-centric multi-omics platform

ai/ml technologies have become the defining component of multi-omics analysis, as they provide the speed, accuracy, and sophistication needed to deal with voluminous, diverse, heterogeneous, and high-dimensional genomic data. at a higher level, they also introduce innovative new capabilities into bioinformatics by augmenting data-driven decision-making and enabling a new era of predictive multi-omics. the lensai saas platform was designed from the ground up to fully leverage the potential of these versatile and intelligent technologies. take hyfts™, a biological discovery that enabled us to translate 440 million sequences of different species, types, and formats from across 11 popular public databases into one homogeneous pan-genomic multivariate knowledge database. with hyfts™, we have decoded the language of omics, which means that we can now translate any data that researchers bring to our platform and make it instantly computable. defining a universal framework that accommodates all omics data, public and proprietary, opens up whole new possibilities for innovation. for instance, with our platform, researchers can seamlessly integrate all kinds of structured and unstructured non-biological data, including patient record data, scientific literature, clinical trial data, chemical data, icd codes, lab tests, and more. however, developing a framework that unifies all omics data and metadata, both public and proprietary, into one integrated model is only half the battle.
truly intelligent and integrated multi-omics analysis will only become possible with the seamless integration of years of valuable experimental research insights scattered across volumes of scientific literature. these are insights with the potential to amplify, accelerate and transform bioinformatics, yet they are often left on the table simply for lack of easy-to-use textual data integration frameworks. and that's where the lensai platform comes into play, enabling the efficient integration of textual research data with multi-omics data for holistic analysis using sophisticated ai/ml technologies. though there are several biomedical nlp solutions available to the research community, they are currently rather limited in scope. the primary issue is that, as mentioned earlier, most of these solutions require significant computational and statistical proficiency, limiting their use to a select few. secondly, many of these solutions use a top-down approach that is effective at extracting information related to a specific query. however, this approach is not very accurate or efficient at extracting, at scale, all the information that is pertinent to research. the lensai platform is designed around a bottom-up approach that can reveal all research-relevant information about novel concepts and relationships available in scientific literature. and because our platform is domain agnostic, it can identify all meaningful relationships without the need for predefined knowledge. this means that researchers can completely bypass the training stage and get straight to their research.

integrated, intelligent multi-omics with lensai platform

mindwalk's unique sequence + text approach enables researchers to instantly unify all omics, non-omics and unstructured data into one multidimensional dataset that serves as the single source of truth for their research.
combine this with our platform's constantly evolving, advanced ai-powered analytics capabilities, and researchers have a seamless, user-friendly, integrated data-ingestion-to-insight multi-omics platform that enables holistic biological research. to explore what the lensai platform can do for your research, contact us here.
the exponential generation of data by modern high-throughput, low-cost next-generation sequencing (ngs) technologies is set to revolutionise genomics and molecular biology and enable a deeper and richer understanding of biological systems. and it is not just about greater volumes of highly accurate, multi-layered data. it's also about more types of omics datasets, such as glycomics, lipidomics, microbiomics, and phenomics. the increasing availability of large-scale, multidimensional and heterogeneous datasets has the potential to open up new insights into biological systems and processes, improve diagnostic yield, and pave the way for a shift from reductionist biology to a more holistic systems biology approach to decoding the complexities of biological entities. it has already been established that multi-dimensional analysis – as opposed to single-layer analyses – yields better results from both a statistical and a biological point of view, and can have a transformative impact on a range of research areas, such as genotype-phenotype interactions, disease biology, systems microbiology, and microbiome analysis. however, applying systems thinking principles to biological data requires the development of radically new integrative techniques and processes that can enable the multi-scale characterisation of biological systems. combining and integrating diverse types of omics data from different layers of biological regulation is the first computational challenge – and the next big opportunity – on the way to enabling a unified end-to-end workflow that is truly multi-omics. the challenge is quite colossal – indeed, a 2019 article in the journal of molecular endocrinology refers to the successful integration of more than two datasets as very rare.

data integration challenges in multi-omics

analysing omics datasets at just one level of biological complexity is challenging enough.
multi-omics analysis amplifies those challenges and introduces some unfamiliar new complications around data integration/fusion, clustering, visualisation, and functional characterisation. for instance, accounting for the inherent complexity of biological systems, the sheer number of biological variables and the relatively low number of biological samples can on its own turn out to be a particularly difficult assignment. over and above this, there is a litany of other issues, including process variations in data cleaning and normalisation, data dimensionality reduction, biological contextualisation, biomolecule identification, statistical validation, and so on. data heterogeneity, arguably the raison d'être for integrated omics, is often the primary hurdle in multi-omics data management. omics data is typically distributed across multiple silos defined by domain, type, and access type (public/proprietary), to name just a few variables. more often than not, there are significant variations between datasets in terms of the technologies/platforms used to generate them, as well as nomenclature, data modalities, assay types, and more. data harmonisation, therefore, becomes a standard pre-integration process. but the processes for data scaling, normalisation, and transformation used to harmonise data can vary across dataset types and sources. for example, the normalisation and scaling techniques for rna-seq datasets differ from those for small rna-seq datasets. multi-omics data integration has its own set of challenges, including unreliable parameter estimation, preserving accuracy in statistical inference, and the prevalence of large standard errors. there are, however, several tools currently available for multi-omics data integration, though they come with their own limitations.
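to illustrate why harmonisation matters, the sketch below (a simplified illustration, not any platform's actual pipeline) applies the common counts-per-million plus log-transform normalisation used for rna-seq count data, showing how two libraries sequenced at different depths become directly comparable, alongside a z-score scaler often used before cross-dataset integration.

```python
from math import log2

def cpm_log(counts):
    """RNA-seq library-size normalisation: counts-per-million, then log2(x + 1)."""
    total = sum(counts)
    return [log2(c * 1e6 / total + 1) for c in counts]

def zscore(values):
    """Cross-dataset scaling: shift to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

sample_a = [100, 900, 0]  # deeply sequenced library
sample_b = [10, 90, 0]    # shallow library, same underlying composition
# After normalisation the two samples are identical despite the 10x depth gap
assert cpm_log(sample_a) == cpm_log(sample_b)
```

different omics layers typically need different choices at each of these steps, which is exactly why ad hoc, per-dataset harmonisation becomes a bottleneck at scale.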
for example, there are web-based tools that require no computational experience – but the lack of visibility into their underlying processes makes it a challenge to deploy them for large-scale scientific research. on the other end of the spectrum, there are more sophisticated tools that afford more customisation and control – but also require considerable expertise in computational techniques. in this context, the development of a universal standard or unified framework for pre-analysis, let alone an integrated end-to-end pipeline for multi-omics analysis, seems rather daunting. however, if multi-omics analysis is to yield diagnostic value at scale, it must quickly evolve from a dispersed syndicate of tools, techniques and processes into a new integrated multi-omics paradigm that is versatile, computationally feasible and user-friendly.

a platform approach to multi-omics analysis

the data integration challenge in multi-omics essentially boils down to this: either there has to be a technological innovation designed specifically to handle the fine-grained and multidimensional heterogeneity of biological data, or there has to be a biological discovery that unifies all omics data and makes it instantly computable even for conventional technologies. at mindwalk, we took the latter route and came up with hyfts™, a biological discovery that can instantly make all omics data computable.

normalising/integrating data with hyfts™

we started with a new technique for indexing cellular blueprints and building blocks and used it to identify and catalogue unique signature sequences, or biological fingerprints, in dna, rna, and amino acid (aa) sequences that we call hyft™ patterns. each hyft comprises multiple layers of information – relating to function, structure, position, and more – that together create a multilevel information network.
we then designed a mindwalk parser to identify, collate and index hyfts from over 450 million sequences available across 11 popular public databases. this helped us create a proprietary pan-genomic knowledge database of over 660 million hyft patterns containing information about variation, mutation, structure, and more. based on our biological discovery, we were able to normalise and integrate all publicly available omics data, including patent data, at scale, and render it multi-omics analysis-ready. the same hyft ip can also be applied to normalise and integrate proprietary omics data.

making 660 million data points accessible

that's a lot of data points. so, we made them searchable. with google-like advanced indexing and exact-matching technologies, only exact matches to search inputs are returned. through a simple search interface – using plain text or a fasta file – researchers can now accurately retrieve all relevant information about sequence alignments, similarities, and differences from a centralised knowledge base covering millions of organisms in just 3 seconds.

synthesising knowledge with our ai-powered saas platform

around these core capabilities, we built the mindwalk saas platform with state-of-the-art ai tools to expand data management capabilities, mitigate data complexity, and empower researchers to intuitively synthesise knowledge out of petabytes of biological data. with our platform, researchers can easily add different types of structured and unstructured data, leverage its advanced graph-based data mining features to extract insights from huge volumes of data, and use built-in genomic analysis tools for annotation and variation analysis.

multi-omics as a platform

as omics datasets become more multi-layered and multidimensional, only a truly sequence-integrated multi-omics analysis solution can enable the discovery of novel and practically beneficial biological insights.
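hyfts™ itself is proprietary, but the general principle behind exact-match sequence search can be illustrated with a simple k-mer index (a hypothetical toy, not the actual hyft algorithm): every fixed-length subsequence is mapped to the sequences that contain it, so lookups are exact and effectively constant-time regardless of database size.

```python
from collections import defaultdict

def build_index(sequences, k=4):
    """Map every k-mer to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index

# Toy database; in practice this would be built once over millions of sequences
db = {"seq1": "ATGCGTACGT", "seq2": "GGGCGTACCC", "seq3": "TTTTAAAA"}
idx = build_index(db)
print(sorted(idx["CGTA"]))  # → ['seq1', 'seq2'] — sequences sharing this exact 4-mer
```

because the index is precomputed, each query is a single dictionary lookup, which is what makes the "only exact matches returned, in seconds" behaviour possible at scale.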
with the mindwalk platform, delivered as a saas, we believe we have created an integrated platform that enables a user-friendly, automated, intelligent, data-ingestion-to-insight approach to multi-omics analysis. it eliminates the data management challenges associated with conventional multi-omics analysis solutions and offers a cloud-based, platform-centric approach to multi-omics analysis that puts usability and productivity first.
in our previous blog post – 'the imperative for bioinformatics-as-a-service' – we addressed the profusion of choice in computational solutions in the field of bioinformatics research. traditionally, there has been a systemic, acute, and well-documented dearth of off-the-shelf technological solutions designed specifically for the scientific research community. in bioinformatics and omics research, this has forced users to invent their own system configurations, data pipelines, and workflows to best suit their research objectives. this years-long diy movement has generated a rich corpus of specialised bioinformatics tools and databases that the next generation of bioinformaticians can now broker, adapt, and chain into sequences of point solutions. on the one hand, next-generation high-throughput sequencing technologies are churning out genomics data more quickly, accurately, and cost-effectively than ever before. on the other, the pronounced lack of next-generation high-throughput sequence analysis technologies still requires researchers to build or broker their own computational solutions capable of coping with the volume and complexity of digital-age genomics big data. as a result, bioinformatics workflows are becoming longer, toolchains have grown more complex, and the number of software tools, programming interfaces, and libraries that have to be integrated has multiplied. even as cloud-based frameworks like saas become the default software delivery model across every industry, bioinformatics and omics research remain stranded in this diy mode. the industry urgently needs to shift to a cloud-based as-a-service paradigm that will enable more focused, efficient, and productive use of research talent for data-driven omics innovation and insight, instead of grappling with improvisation and implementation.
how saas transforms bioinformatics

analytics for the augmented bioinformatician

even as the cloud has evolved into the de facto platform for advanced analytics, the long-running theme of enabling self-service analytics for non-technical users and citizen data scientists has undergone a radical reinterpretation. for instance, predefined dashboards that support intuitive data manipulation and exploration have become a key differentiating factor for solutions in the marketplace. however, according to gartner's top ten data and analytics technology trends for 2021, dashboards will have to be supplemented with more intelligent capabilities in order to extend analytical power – thus far available only to specialist data scientists and analysts – to non-technical augmented consumers. these augmented analytics solutions enable ai/ml-powered automation across the entire data science process – from data preparation to insight generation – feature natural language interfaces built on nlp/nlg technologies to simplify how augmented consumers query and consume their insights, and democratise the development, management, and deployment of ai/ml models. specialised bioinformatics-as-a-service platforms need to adopt a similar development trajectory. the focus has to be on completely eliminating the tedium of wrangling with disparate technologies, tools, and interfaces, and empowering a new generation of augmented bioinformaticians to focus on their core research.

enhanced scalability and accessibility

a single human genome sequence contains about 200 gigabytes of data. as genome sequencing becomes more affordable, data from the human genome alone is expected to add up to over 40 exabytes by 2025. this is not a scale that a motley assortment of technologies and tools can accommodate. bioinformatics-as-a-service platforms, in comparison, are designed with these data volumes in mind.
a robust and scalable saas platform is built to effortlessly handle the normalisation, storage, analysis, cross-comparison, and presentation of petabytes of genomics data. for instance, our mindwalk platform utilises a container-based architecture to auto-scale seamlessly to handle over 200 petabytes of data with zero on-ramping issues. and scalability is not just about capacity. saas platforms also offer high vertical scalability in terms of the services and features that researchers need to access. all mindwalk platform users have simple "google-style" search-bar access to 350 million sequences spanning 11 of the most popular publicly available databases, as well as to built-in tools for sequence analysis, multiple sequence alignment, and protein domain analysis. over and above all this, saas solutions no longer restrict research to the lab environment. researchers can now access powerful and comprehensive bioinformatics-as-a-service via laptops – or even their smartphones, if mobile-first turns out to be the next big saas trend – in the comfort of their own homes or their favourite coffee shop.

increased speed and accuracy

bioinformatics has typically involved a trade-off between speed and accuracy. in some cases, methodologies make reductive assumptions about the data to deliver quicker results, while in others the error rate may increase in proportion to the complexity of a query. in multi-tool research environments, the end result is a discrete sum of the results received from each module in the sequence. this means that errors generated in one process are neither flagged nor addressed in subsequent stages, leading to an accumulation of errors in the final analysis.
a truly integrated multi-level solution consolidates the disparate stages of conventional bioinformatics and omics data analysis into one seamlessly integrated platform that facilitates in-depth data exploration, maximises researchers' view of their data, and accelerates time-to-insight without compromising on speed or accuracy.

access to continuous innovation

with a saas solution, end-users no longer need to worry about updates, patch management, and upgrades. with vertical saas solutions such as bioinformatics-as-a-service, continuous innovation becomes a priority to sustain growth in a narrow market. for users, this translates into more frequent rollouts of new features and capabilities based on user feedback to address real pain points in the industry. for instance, in just a few months since the official launch of our platform, we have added new capabilities for sdk/api-based integrations for proprietary data and infrastructure, expanded our tools and expertise to assay design, drug development, gene therapy, crop protection products, and biomarkers, and started building out an ai platform with state-of-the-art graph-based data mining to discover and synthesise knowledge out of a multitude of information sources.

the imperative to saasify bioinformatics

saas is currently the largest segment of the public cloud services market – and yet its footprint in bioinformatics is virtually non-existent. today, there are a few cloud-based technologies targeted at genomic applications that focus on specific workflows like sequence alignment, short-read mapping, and snp identification. however, what the industry really needs is a cloud-based, end-to-end bioinformatics-as-a-service solution that abstracts away all the technological complexity to deliver simple yet powerful tools for bioinformaticians and omics researchers.
nearly a decade ago, the human genome project successfully delivered a baseline definition of the dna sequences in the entire human genome. population genomics extends the scope of genomics research beyond baseline data to build a better understanding of gene variability at the level of individuals, populations, and continents. take india, for example, where an ambitious program called indigen has been rolled out to map whole-genome sequences across different populations in the country. the first phase of the program, involving extensive computational analysis of 1,029 sequenced genomes from india, identified 55,898,122 single-nucleotide variants in the india genome dataset, 32% of which were unique to the sequence samples from india. these findings are expected to provide the foundations for what will become an india-centric population-scale genomics initiative. population genomics opens up a range of region-specific opportunities, such as identifying genes responsible for complex diseases, predicting and mitigating disease outbreaks, focusing country-level drug development, usage, and dosing guidelines, and formulating precision public health strategies that deliver optimal value for the population. as a result, several countries across the globe have launched their own initiatives for the large-scale comparison of dna sequences in local populations.

the population genomics rush

(image source: iqvia)

the international hapmap project, launched in 2002 as a collaborative program of scientists from public and private organisations across six countries, is one of the earliest population-scale genomics programs. a 2020 analysis of the global genomics landscape reported close to 190 global genomic initiatives, with the u.s. and europe accounting for an overwhelming majority of these programs.
several countries have already launched large-scale sequencing programs, such as all of us (u.s.), genomics england, genome of greece, dna do brasil, the turkish genome project, and the saudi human genome program, to name just a few. then there is the "1+ million genomes" initiative in the eu to create a cross-border network of national genome cohorts that unifies population-scale data from several national initiatives. these projects collectively target a spectrum of objectives, including analysing normal and pathological genomic variation, improving infrastructure, and enabling personalised medicine. as a result, population genomics data is exploding. an estimated 40 million human genomes had been sequenced as of 2020, with the number of analysed genomes expected to grow to 52 million by 2025. this exponential increase in population-scale data presents significant challenges, both in crunching raw data at scale and in analysing and interpreting complex datasets.

the analytics challenge in population genomics

genomic data volumes have been increasing exponentially over the past decade, thanks in part to the plummeting costs of next-generation sequencing technologies. then there is the ever-expanding scope of health-related data – from electronic health records, biomonitoring devices, and the like – that is becoming extremely valuable for population-scale research. however, conventional integrative analysis techniques and computational methods that worked well with traditional genomics data are ill-equipped to deal with the unique data characteristics and overwhelming volumes of ngs and digital-era data. data exploration and analysis already lag data generation by a significant order of magnitude – and that deficit will only be exacerbated as we transition from ngs to third-generation sequencing technologies.

(image source: sciencedirect)

over the years, several de facto standards have emerged for processing genomics big data.
but in spite of the significant progress that has been made in this context, the gap between data generation and data exploration continues to grow. most large institutions are already heavily invested in hardware/software infrastructure and in standardised workflows for genomic data analysis. a wholesale remapping of these investments to integrate the agility, flexibility, and versatility required for big data genomics is just plain impractical. integrating a variety of datasets from multiple external sources is a hallmark of modern genomics research and still represents a fundamental challenge for genomic analysis workflows. the biggest challenge, however, is the demand for extremely specialised and scarce bioinformatics talent to build bespoke analytics pipelines for each research project. this significantly restricts the pace of progress in genomics research. for data analysis to catch up with data acquisition, researchers need access to an easy-to-use, powerful solution that spans the entire workflow – from raw data analysis to data exploration and insight.

the mindwalk "one model" approach

at mindwalk, we offer an end-to-end, self-service saas platform that unifies all components of the genomics analysis and research workflow into one intuitive, comprehensive, and powerful solution. we designed the platform to address every pain point in the genomics research value chain. for starters, it doesn't matter if you're a seasoned bioinformatician or a budding geneticist: our platform has a learning curve as easy to master as google search. at mindwalk, we believe that wrangling data is a tedious chore best left to technology. to that end, we have precomputed and indexed nearly 350 million sequences available across 11 public databases into one proprietary knowledge database that is continuously reviewed and updated.
ninety percent of population data from currently ongoing programs is soon expected to be publicly available, which means it will probably just be a click away. in addition, you can add self-owned databases with just one click to combine them with publicly available datasets to accelerate time-to-insight. if it’s genomic data, we’ll make it computable. with the mindwalk solution, you can use sequence or text to search through volumes of sequence data and instantly retrieve all pertinent information about alignments, similarities, and differences in sequences in a matter of seconds. no more choosing algorithms and building complex pipelines. our technology enables both experts and enthusiasts to focus entirely on their research objectives without being side-tracked by the technology. the mindwalk platform provides you with a range of intuitive, powerful, versatile, and multidimensional tools that allow you to define the scope, focus, and pace of your research without being restricted by any technological limitations. parse, slice, dice, sort, filter, drill down, pan out, and do whatever it takes to define and pursue the research pathways that you think have the maximum potential for a breakthrough. leverage the power of the mindwalk platform’s state-of-the-art ai tools to quickly and intuitively synthesise knowledge from a multitude of data sources and across structured and unstructured data types. 
with mindwalk research, researchers and bioinformaticians finally have access to a user-centric, multidimensional, secure, end-to-end data-to-insight research platform that enables a personalised and productive research experience by leveraging the power of modern digital technologies in the background.

harnessing the potential of population genomics

population genomic data will continue to grow as more and more countries, especially in the developing world, realise the positive impact large-scale sequencing can have on genomics research, personalised patient care and precision public health. however, data science is key to realising the inherent value of genomic data at scale. conventional approaches to genomic research and analysis are severely limited in their ability to efficiently extract value from genomics big data. and research is often hampered by the need for highly skilled human capital that is hard to come by. with the mindwalk platform, genomics research finally has an integrated solution that incorporates all research-related workflows, unifies discrete data sources and provides all the tools, features and functionality required for researchers to focus on what really matters – pushing the boundaries of genomics research, personalised patient care, and precision public health.
today, we have access to an enormous amount of data – and when it comes to diseases and the processes involving humans, plants, and animals, it is crucial to understand the complex flow of information from dna to rna to proteins, and to be able to analyse this data effectively and efficiently. when we consider this process, changes in the dna code can have different consequences. sometimes changes in dna do not alter the resulting protein at all. sometimes there is a beneficial effect, such as developing resistance to a certain type of bacteria; at other times the impact is detrimental and can cause several types of diseases. because of this, it is vital to have a global overview of these three levels of genomic data in all types of research across various industries.

the limitations of current dna, rna and protein analysis

genetic research involves analysing massive numbers of sequences and is time-consuming and overly fragmented, which delays discoveries. the fragmentation forces data analysis into silos, each with a complex series of procedures, hampering accuracy and slowing the pace at which research can produce novel discoveries. in addition, the current analysis methodology is based on heuristics, which results in an accumulation of errors during analysis. this lack of accuracy causes errors in data interpretation. an important challenge in current genetic research is the ability to analyse dna, rna, and proteins together, at all levels. until today, no tool has been available that can analyse all three levels – dna, rna, and proteins – at once and automatically provide the translations (for instance, searching for dna and receiving results for all three levels). under current procedures, several separate stages are involved, and the whole process becomes long and unclear.
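the dna-to-rna-to-protein flow described above can be sketched in a few lines. note that the codon table here is deliberately truncated to the codons used in the example; a full 64-codon table would be needed for arbitrary input.

```python
# Truncated standard codon table (assumption: only the codons needed below)
CODONS = {"AUG": "M", "UUC": "F", "GGU": "G", "UAA": "*"}

def transcribe(dna):
    """DNA coding strand -> mRNA (T becomes U)."""
    return dna.replace("T", "U")

def translate(rna):
    """mRNA -> protein, reading codon by codon and stopping at a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODONS[rna[i:i + 3]]
        if aa == "*":  # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)

dna = "ATGTTCGGTTAA"
rna = transcribe(dna)   # "AUGUUCGGUUAA"
print(translate(rna))   # → "MFG"
```

an integrated search across all three levels amounts to running exactly this kind of mapping automatically in both directions, so that a dna query also returns its rna and protein counterparts.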
after performing all the separate stages, the data has to be assembled into a single document, making it hard to get a comprehensive overview and causing difficulties in interpreting the data gathered.

the implications of new methods of analysis

the potential value of genetic information that can be unlocked by an approach in which molecular analysis of dna, rna and proteins is truly integrated is immense. one tool could then improve the health and quality of life of many individuals affected by different diseases. for instance, in unraveling disease mechanisms, it is common practice to analyse whole genomes to discover which genes are involved. once these genes are identified, their different forms of expression are examined to discover differences in mechanisms of regulation and their impact on symptomatology. in these research processes, an integrated view from dna down to proteins, and vice versa, is necessary to discover the impact of changes at each level on the development of disease and, ideally, on the prediction of its severity (severely or mildly affected). unraveling 'disease' is important not only in human medicine but also in veterinary medicine, to protect animal health. an example here is the development of phage therapy – the use of bacteriophage viruses to treat bacterial disease – for tuberculosis in cows. different virus variants have to be identified and tested for their efficacy. in agriculture, too, combating 'disease' is an important topic in which integrated analysis at the dna, rna and protein levels is necessary. an example here is the use of bacteriophage viruses against bacterial infections of the tomato plant root. discovering new associations in the data is currently complex and requires highly skilled individuals.
therefore, it is crucial to step forward and break down the silos in genetic analysis by providing integrated, comprehensive and relevant results, so that r&d departments can maximise their discoveries in the shortest amount of time. biostrand revolutionises genetic analysis with the lensai platform, which allows researchers to explore their data in a simple, intuitive, yet highly accurate way by extracting information at all levels (dna, rna and proteins) at once – providing an integrated view of the data, saving vast amounts of time and reducing costs.

(picture source: adobestock © chris 122533894)