The Blog
MindWalk is a biointelligence company uniting AI, multi-omics data, and advanced lab research into a customizable ecosystem for biologics discovery and development.
There is a compelling case underlying the tremendous interest in generative AI and LLMs as the next big technological inflection point in computational drug discovery and development. For starters, LLMs help expand the data universe of in silico drug discovery, opening up access to the huge volumes of valuable information locked away in unstructured textual sources: scientific literature, public databases, clinical trial notes, patient records, and more. LLMs provide the much-needed capability to analyze this information, identify patterns and connections, and extract novel insights about disease mechanisms and potential therapeutic targets. Their ability to interpret complex scientific concepts and elucidate connections between diseases, genes, and biological processes can help accelerate disease hypothesis generation and the identification of potential drug targets and biomarkers.

When integrated with biomedical knowledge graphs, LLMs help create a uniquely synergistic model that enables bidirectional data- and knowledge-based reasoning. The explicit structured knowledge of knowledge graphs enhances the knowledge of LLMs, while the power of language models streamlines graph construction and conversational interaction with complex knowledge bases.

However, several challenges still have to be addressed before LLMs can be reliably integrated into in silico drug discovery pipelines and workflows. One of these is hallucination.

Why do LLMs hallucinate?

At a time of some speculation about laziness and seasonal depression in LLMs, a hallucination leaderboard of 11 public LLMs revealed hallucination rates ranging from 3% at the top end to 27% at the bottom of the barrel. Another comparative study of two versions of a popular LLM used to generate ophthalmic scientific abstracts revealed very high rates (33% and 29%) of fake references.
This tendency of LLMs to hallucinate, that is, to present incorrect or unverifiable information as accurate, can have serious consequences in critical drug discovery applications even at a 3% rate. There are several reasons for LLM hallucinations. At the core of this behavior is the fact that generative AI models have no actual intelligence; they rely instead on a probability-based approach to predict the data most likely to occur based on patterns and contexts "learned" from their training data. Apart from this inherent lack of contextual understanding, other potential causes include noise, errors, biases, and inconsistencies in training data, the training and generation methods themselves, and even prompting techniques. For some researchers, hallucination is all LLMs do; others see it as inevitable for any prompt-based large language model. In the context of life sciences research, however, mitigating LLM hallucinations remains one of the biggest obstacles to the large-scale, strategic integration of this potentially transformative technology.

How to mitigate LLM hallucinations?

There are three broad and complementary approaches to mitigating hallucinations in large language models: prompt engineering, fine-tuning, and grounding plus prompt augmentation.

Prompt engineering

Prompt engineering is the process of strategically designing user inputs, or prompts, in order to guide model behavior and obtain optimal responses. There are three major approaches: zero-shot, few-shot, and chain-of-thought prompting. In zero-shot prompting, the model is given a task without any examples and must generate a reliable result from its pre-trained knowledge alone. Few-shot prompting provides the LLM with a handful of worked examples before presenting the actual query. Chain-of-thought (CoT) prompting is based on the finding that a series of intermediate reasoning steps, provided as examples in the prompt, can significantly improve the reasoning capabilities of large language models.
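As a rough illustration, the three prompting styles above can be sketched as plain string templates. The questions and answers below are invented examples, not drawn from any real benchmark:

```python
# Minimal sketch of zero-shot, few-shot, and chain-of-thought prompt assembly.
# The biomedical questions and answers are hypothetical illustrations.

def zero_shot(question: str) -> str:
    # No examples: the model relies entirely on pre-trained knowledge.
    return f"Question: {question}\nAnswer:"

def few_shot(question: str, examples: list[tuple[str, str]]) -> str:
    # Prepend worked (question, answer) pairs before the actual query.
    shots = "\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\nQuestion: {question}\nAnswer:"

def chain_of_thought(question: str) -> str:
    # A common CoT trigger phrase asks the model to show intermediate steps.
    return f"Question: {question}\nLet's think step by step."

examples = [("Is TP53 a tumor suppressor gene?", "Yes")]
prompt = few_shot("Does EGFR signaling drive some lung cancers?", examples)
```

In practice the few-shot examples would be curated from trusted, domain-verified sources, since the model tends to imitate both the format and the reasoning style of whatever it is shown.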
The chain-of-thought concept has been extended with newer techniques such as chain-of-verification (CoVe), a self-verification process that enables LLMs to check the accuracy and reliability of their output, and chain-of-density (CoD), which focuses on summarization rather than reasoning and controls the density of information in the generated text. Prompt engineering, however, has its own limitations, including prompt constraints that can cramp the ability to query complex domains and the lack of objective metrics to quantify prompt effectiveness.

Fine-tuning

Where the focus of prompt engineering is on the skill required to elicit better LLM output, fine-tuning emphasizes task-specific training to enhance the performance of pre-trained models in specific topics or domains. The conventional approach is full fine-tuning, which involves additional training of the pre-trained model on labeled, domain- or task-specific data so that it generates more contextually relevant responses. This is a time-, resource-, and expertise-intensive process. An alternative approach is parameter-efficient fine-tuning (PEFT), which trains a small set of extra parameters without adjusting the entire model. The modular nature of PEFT means that training can prioritize select portions or components of the original parameters, so the pre-trained model can be adapted for multiple tasks. LoRA (low-rank adaptation of large language models), a popular PEFT technique, can significantly reduce the resource intensity of fine-tuning while matching the performance of full fine-tuning. There are, however, challenges to fine-tuning, including domain shift, the potential for bias amplification and catastrophic forgetting, and the complexity of choosing the right hyperparameters to ensure optimal performance.
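LoRA is normally applied through a fine-tuning library, but its core idea can be illustrated in a few lines of numpy: freeze a weight matrix W and learn only a low-rank update (alpha / r) * B @ A. The dimensions below are arbitrary toy values chosen for illustration:

```python
import numpy as np

# Illustration (not a real training loop) of LoRA's core idea: instead of
# updating a full d x d weight matrix W, train two small matrices
# A (r x d) and B (d x r) and apply W_eff = W + (alpha / r) * B @ A.

rng = np.random.default_rng(0)
d, r, alpha = 1024, 8, 16

W = rng.standard_normal((d, d))          # frozen pre-trained weights
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # B starts at zero, so W_eff == W initially

W_eff = W + (alpha / r) * (B @ A)

full_params = d * d            # parameters updated by full fine-tuning
lora_params = d * r + r * d    # parameters updated by LoRA
print(full_params, lora_params)  # 1048576 vs 16384: a 64x reduction
```

Because B is initialized to zero, the adapted model starts out identical to the pre-trained one, and only the small A and B factors are updated during training.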
Grounding & augmentation

LLM hallucinations are often the result of language models attempting to generate knowledge from information they have not explicitly memorized or seen. The logical solution, therefore, is to give LLMs access to a curated knowledge base of high-quality contextual information that enables them to generate more accurate responses. Advanced grounding and prompt augmentation techniques can address many of the accuracy and reliability challenges associated with LLM performance. Both techniques rely on external knowledge sources to dynamically generate context. Grounding ensures that LLMs have access to up-to-date, use-case-specific information sources that provide relevant context not available in the training data alone. Similarly, prompt augmentation enriches a prompt with contextually relevant information that enables the LLM to generate a more accurate and pertinent output.

Factual grounding is typically applied in the pre-training phase to ensure that LLM output across a variety of tasks is consistent with a knowledge base of factual statements. Post-training grounding relies on a range of external knowledge bases, including documents, code repositories, and public and proprietary databases, to improve the accuracy and relevance of LLMs on specific tasks. Retrieval-augmented generation (RAG) is a distinct framework for the post-training grounding of LLMs in the most accurate, up-to-date information retrieved from external knowledge bases.

The RAG framework enables the optimization of biomedical LLM output along three key dimensions. One, access to targeted external knowledge sources ensures that the LLM's internal representation of information is dynamically refreshed with the most current and contextually relevant data. Two, access to an LLM's information sources means that responses can be validated for relevance and accuracy.
And three, there is the emerging potential to extend the RAG framework beyond text to multimodal knowledge retrieval, spanning images, audio, tables, and more, which can further boost the factuality, interpretability, and sophistication of LLMs.

Also read: How retrieval-augmented generation (RAG) can transform drug discovery

Among the key challenges of retrieval-augmented generation is the high initial cost of implementation compared to standalone generative AI. In the long run, however, the RAG-LLM combination is less expensive than frequently fine-tuning LLMs and provides the most comprehensive approach to mitigating LLM hallucinations. But even with better grounding and retrieval, scientific applications demand another layer of rigor: validation and reproducibility. Here is how teams can build confidence in LLM outputs before trusting them in high-stakes discovery workflows.

How to validate LLM outputs in drug discovery pipelines

In scientific settings like drug discovery, ensuring the validity of large language model (LLM) outputs is critical, especially when those outputs may inform downstream experimental decisions. Here are key validation strategies used to assess LLM-generated content in biomedical pipelines:

Validation checklist:
- Compare outputs to curated benchmarks: use structured, peer-reviewed datasets such as DrugBank, ChEMBL, or internal gold standards to benchmark LLM predictions.
- Cross-reference with experimental data: validate AI-generated hypotheses against published experimental results, or integrate with in-house wet lab data for verification.
- Establish feedback loops from in vitro validations: create iterative pipelines where lab-tested results refine future model prompts, improving accuracy over time.

Advancing reproducibility in AI-augmented science

For LLM-assisted workflows to be trustworthy and audit-ready, they must be reproducible, particularly when used in regulated environments.
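As a minimal sketch of what reproducibility can look like in code, assuming a simple JSON log format with invented field names, a run can be pinned to a content-addressed dataset snapshot and its full prompt logged:

```python
import hashlib
import json
import time

# Sketch of two reproducibility practices: pin a dataset snapshot by content
# hash, and log the full prompt with pinned model settings, so any generation
# can later be re-run under identical conditions. Field names are illustrative.

def snapshot_id(dataset_bytes: bytes) -> str:
    # Content-addressed ID: the same bytes always yield the same snapshot ID.
    return hashlib.sha256(dataset_bytes).hexdigest()[:12]

def log_run(prompt: str, model: str, params: dict, dataset_bytes: bytes) -> str:
    record = {
        "timestamp": time.time(),
        "model": model,              # pinned model version
        "params": params,            # temperature, max_tokens, etc.
        "dataset_snapshot": snapshot_id(dataset_bytes),
        "prompt": prompt,            # full prompt, including context
    }
    return json.dumps(record, sort_keys=True)

entry = log_run("Summarize TP53 findings.", "bio-llm-v1",
                {"temperature": 0.0}, b"dataset contents")
```

Storing such records alongside every generation makes it possible to audit which data snapshot, model version, and prompt produced a given output.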
Reproducibility practices:
- Dataset versioning: track changes in source datasets, ensuring that each model run references a consistent data snapshot.
- Prompt logging: store full prompts (including context and input structure) to reproduce specific generations and analyze outputs over time.
- Controlled inference environments: standardize model versions, hyperparameters, and APIs to eliminate variation in inference across different systems.

Integrated intelligence with lensai™

Holistic life sciences research requires the sophisticated orchestration of several innovative technologies and frameworks. lensai integrated intelligence, our next-generation data-centric AI platform, fluently blends some of the most advanced proprietary technologies into one seamless solution that empowers end-to-end drug discovery and development. lensai integrates RAG-enhanced bioLLMs with an ontology-driven NLP framework, combining neuro-symbolic logic techniques to connect and correlate syntax (multi-modal sequential and structural data) and semantics (biological functions). A comprehensive and continuously expanding knowledge graph, mapping a remarkable 25 billion relationships across 660 million data objects, links sequence, structure, function, and literature information from the entire biosphere to provide a comprehensive overview of the relationships between genes, proteins, structures, and biological pathways. Our next-generation, unified, knowledge-driven approach to the integration, exploration, and analysis of heterogeneous biomedical data empowers life sciences researchers with the high-tech capabilities needed to explore novel opportunities in drug discovery and development.
In a recent article on knowledge graphs and large language models (LLMs) in drug discovery, we noted that despite the transformative potential of LLMs in drug discovery, several critical challenges have to be addressed to ensure that these technologies conform to the rigorous standards demanded by life sciences research. Synergizing knowledge graphs and LLMs into one bidirectional data- and knowledge-based reasoning framework addresses several concerns related to hallucination and lack of interpretability. However, that still leaves the challenge of giving LLMs access to external data sources that address their limitations in factual accuracy and up-to-date knowledge recall. Retrieval-augmented generation (RAG), together with knowledge graphs and LLMs, is the third critical node in the trifecta of techniques required for the robust and reliable integration of language models into drug discovery pipelines.

Why retrieval-augmented generation?

One of the key limitations of general-purpose LLMs is their training data cutoff, which means that their responses to queries are typically out of step with the rapidly evolving nature of information. This is a serious drawback, especially in fast-paced domains like life sciences research. Retrieval-augmented generation enables biomedical research pipelines to optimize LLM output by:
- Grounding the language model in external sources of targeted, up-to-date knowledge, constantly refreshing the LLM's internal representation of information without having to completely retrain the model. This ensures that responses are based on the most current data and are more contextually relevant.
- Providing access to the model's information sources, so that responses can be validated and their claims checked for relevance and accuracy.
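A minimal sketch of this grounding loop, using naive token overlap in place of a production retriever, with invented document snippets:

```python
# Minimal retrieve -> integrate -> generate sketch. Retrieval here is simple
# token overlap; production systems use dense embeddings and vector indexes.
# The document snippets are invented for illustration.

documents = [
    "EGFR mutations are common in non-small cell lung cancer.",
    "Metformin is a first-line therapy for type 2 diabetes.",
    "TP53 is a frequently mutated tumor suppressor gene.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Rank documents by the number of query tokens they share.
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_context(query: str, docs: list[str]) -> str:
    # Integrated context: retrieved evidence plus the original query.
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

prompt = build_context("Which mutations occur in lung cancer?", documents)
# The assembled prompt would then be sent to the generation model.
```

The key design point is that the generator only ever sees evidence that was retrieved for this specific query, which both refreshes its knowledge and makes its sources inspectable.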
In short, retrieval-augmented generation provides the framework necessary to augment the recency, accuracy, and interpretability of LLM-generated information.

How does retrieval-augmented generation work?

Retrieval-augmented generation is a natural language processing (NLP) approach that combines elements of both information retrieval and text generation models to enhance performance on knowledge-intensive tasks. The retrieval component aggregates information relevant to a specific query from a predefined set of documents or knowledge sources, which then serves as context for the generation model. Once the information has been retrieved, it is combined with the input to create an integrated context containing both the original query and the relevant retrieved information. This integrated context is then fed into a generation model, which produces an accurate, coherent, and contextually appropriate response based on both pre-trained knowledge and the retrieved, query-specific information.

The RAG approach gives life sciences research teams more control over the grounding data used by a biomedical LLM by focusing it on enterprise- and domain-specific knowledge sources. It also enables the integration of the range of external data sources, such as document repositories, databases, or APIs, most relevant to enhancing the model's response to a query.

The value of RAG in biomedical research

Conceptually, the retrieve-plus-generate model's ability to deal with dynamic external information sources, minimize hallucinations, and enhance interpretability makes it a natural and complementary fit for augmenting the performance of bioLLMs. To quantify this augmentation, a recent research effort evaluated a retrieval-augmented generative agent on biomedical question answering against LLMs (GPT-3.5/4), state-of-the-art commercial tools (Elicit, Scite, and Perplexity), and humans (biomedical researchers).
The RAG agent, PaperQA, was first evaluated on a standard multiple-choice LLM-evaluation dataset, PubMedQA, with the provided context removed to test the agent's ability to retrieve information. Here, the RAG agent beat GPT-4 by roughly 30 points (86.3% vs. 57.9%). Next, the researchers constructed a more complex and more contemporary dataset (LitQA), based on recent full-text research papers outside the bounds of the LLMs' pre-training data, to compare the integrated abilities of PaperQA, LLMs, and human researchers to retrieve the right information and generate an accurate answer based on it. Again, the RAG agent outperformed both pre-trained LLMs and commercial tools, with overall accuracy (69.5%) and precision (87.9%) scores on par with biomedical researchers. More importantly, the RAG model produced zero hallucinated citations, compared to rates of 40-60% for the LLMs.

Despite being a narrow evaluation of the retrieval-plus-generation approach in biomedical QA, this research demonstrates the significantly enhanced value that RAG plus bioLLMs can deliver compared to purely generative AI. The combined sophistication of retrieval and generation models can be harnessed to enhance the accuracy and efficiency of a range of processes across the drug discovery and development pipeline.

Retrieval-augmented generation in drug discovery

In the context of drug discovery, RAG can be applied to a range of tasks, from literature reviews to biomolecule design. Generative models have demonstrated potential for de novo molecular design but are still hampered by their inability to integrate multimodal information or provide interpretability. The RAG framework can facilitate the retrieval of multimodal information from a range of sources, such as chemical databases, biological data, clinical trials, and images, that can significantly augment generative molecular design.
The same retrieval-plus-augmented-generation template applies to a whole range of applications in drug discovery, for example:
- Compound design: retrieve compounds and their properties, and generate improvements or new properties.
- Drug-target interaction prediction: retrieve known drug-target interactions, and generate potential interactions between new compounds and specific targets.
- Adverse effects prediction: retrieve known adverse effects, and generate modifications to eliminate them.

The template even applies to sub-processes and sub-tasks within drug discovery, leveraging a broader swathe of existing knowledge to generate novel, reliable, and actionable insights. In target validation, for example, retrieval-augmented generation can enable comprehensive generative analysis of a target of interest based on an extensive review of all existing knowledge about it: expression patterns and functional roles, known binding sites, pertinent biological pathways and networks, potential biomarkers, and more. In short, the more efficient and scalable retrieval of timely information ensures that generative models are grounded in factual, sourceable knowledge, a combination with limitless potential to transform drug discovery.

An integrated approach to retrieval-augmented generation

Retrieval-augmented generation addresses several critical limitations of bioLLMs and augments their generative capabilities. However, additional design rules and multiple technological profiles have to come together to successfully address the specific requirements and challenges of life sciences research. Our lensai™ integrated intelligence platform seamlessly unifies the semantic proficiency of knowledge graphs, the versatile information retrieval capabilities of retrieval-augmented generation, and the reasoning capabilities of large language models to reinvent the understand-retrieve-generate cycle in biomedical research.
Our unified approach empowers researchers to query a harmonized life science knowledge layer that integrates unstructured information and ontologies into a knowledge graph. A semantic-first approach enables a more accurate understanding of research queries, which in turn results in the retrieval of the content most pertinent to each query. The platform also integrates retrieval-augmented generation with structured biomedical data from our hyft technology to enhance the accuracy of generated responses. And finally, lensai combines deep learning LLMs with neuro-symbolic logic techniques to deliver comprehensive and interpretable outcomes. To experience this unified solution in action, please contact us.
There's more biomedical data than ever, but making sense of it is still tough. In this blog, we look at how semantic analysis, an essential part of natural language processing (NLP), helps researchers turn free text into structured insights. From identifying key biomedical terms to mapping relationships between them, we explore how these techniques support everything from literature mining to optimizing clinical trials.

What is semantic analysis in linguistics?

Semantic analysis is an important subfield of linguistics, the systematic scientific investigation of the properties and characteristics of natural human language. As the study of the meaning of words and sentences, semantic analysis complements the linguistic subbranches that study phonetics (sounds), morphology (word units), syntax (how words form sentences), and pragmatics (how context impacts meaning), to name just a few.

There are three broad subcategories of semantics:
- Formal semantics: the study of the meaning of linguistic expressions by applying mathematical-logical formalisms, such as first-order predicate logic or the lambda calculus, to natural languages.
- Conceptual semantics: the study of words, phrases, and sentences based not just on a set of strict semantic criteria but on schematic and prototypical structures in the minds of language users.
- Lexical semantics: the study of word meanings, not just the basic meaning of each lexical unit but the semantic relations that integrate these units into a broader linguistic system.

Semantic analysis in natural language processing (NLP)

In NLP, semantic analysis is the process of automatically extracting meaning from natural language in order to enable human-like comprehension in machines.
There are two broad methods for using semantic analysis to comprehend meaning in natural language: one, training machine learning models on vast volumes of text to uncover connections, relationships, and patterns that can be used to predict meaning (e.g., ChatGPT); and two, using structured ontologies and databases that pre-define linguistic concepts and relationships, enabling semantic analysis algorithms to quickly locate useful information in natural language text. Though generalized large language model (LLM) based applications are capable of handling broad, common tasks, specialized models built on a domain-specific taxonomy, ontology, and knowledge base design will be essential to power intelligent applications.

How does semantic analysis work?

There are two key components to semantic analysis in NLP. The first is lexical semantics, the study of the meaning of individual words and their relationships. This stage entails obtaining the dictionary definition of the words in the text, parsing each word to determine its individual function and properties, and designating a grammatical role for each. Key aspects of lexical semantics include identifying word senses, synonyms, antonyms, hyponyms, hypernyms, and morphology. In the next step, individual words are combined into a sentence and parsed to establish relationships, understand syntactic structure, and derive meaning.

There are several different approaches within semantic analysis to decoding the meaning of a text. Popular approaches include:

Semantic feature analysis (SFA): this approach involves the extraction and representation of shared features across different words in order to highlight word relationships and help determine the importance of individual factors within a text.
Key subtasks include feature selection, to highlight the attributes associated with each word; feature weighting, to distinguish the importance of different attributes; and feature vectors and similarity measurement, for insights into the relationships and similarities between words, phrases, and concepts.

Latent semantic analysis (LSA): this technique extracts meaning by capturing the underlying semantic relationships and context of words in a large corpus. By recognizing the latent associations between words and concepts, LSA enhances machines' capability to interpret natural language the way humans do. The LSA process includes creating a term-document matrix, applying singular value decomposition (SVD) to the matrix, dimension reduction, concept representation, and indexing and retrieval. Probabilistic latent semantic analysis (PLSA) is a variation on LSA that takes a statistical, probabilistic approach to finding latent relationships.

Semantic content analysis (SCA): this methodology goes beyond simple feature extraction and distribution analysis to consider word usage context and text structure in order to identify relationships and impute meaning to natural language text. The process broadly involves dependency parsing, to determine grammatical relationships; identifying thematic and case roles, to reveal relationships between actions, participants, and objects; and semantic frame identification, for a more refined understanding of contextual associations.

Semantic analysis techniques

Here's a quick overview of some of the key semantic analysis techniques used in NLP:

Word embeddings

These are techniques that represent words as vectors in a continuous vector space, capturing semantic relationships based on co-occurrence patterns.
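The LSA steps described above (term-document matrix, SVD, dimension reduction) can be sketched with numpy on a toy corpus; the resulting rows behave like small word vectors whose similarity reflects co-occurrence. The corpus and counts are invented for illustration:

```python
import numpy as np

# Sketch of the LSA pipeline: term-document matrix -> SVD -> reduced concept
# space. The tiny corpus is invented; real LSA uses thousands of documents.

terms = ["gene", "protein", "drug", "trial"]
# Rows = terms, columns = documents (raw counts).
X = np.array([
    [2, 1, 0],   # "gene"
    [1, 2, 0],   # "protein"
    [0, 0, 2],   # "drug"
    [0, 1, 2],   # "trial"
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                              # dimension reduction: keep top-2 concepts
term_vecs = U[:, :k] * s[:k]       # each term as a 2-d concept vector

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "gene" and "protein" co-occur, so their concept vectors should be more
# similar than those of "gene" and "drug", which never share a document.
sim_gene_protein = cosine(term_vecs[0], term_vecs[1])
sim_gene_drug = cosine(term_vecs[0], term_vecs[2])
```

The same cosine-similarity machinery underlies the word-embedding techniques surveyed next; the embeddings simply come from different (and usually far larger) models.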
Word-to-vector representation techniques fall into three categories: conventional (count-based or frequency-based) models; distributional, static word embedding models, including latent semantic analysis (LSA), word2vec, GloVe, and fastText; and contextual models, which include ELMo (embeddings from language models), generative pre-training (GPT), and bidirectional encoder representations from transformers (BERT).

Semantic role labeling

This is a technique that seeks to answer a question central to many NLP tasks: who did what to whom, how, when, and where? Semantic role labeling identifies the roles that different words play by recognizing the predicate-argument structure of a sentence. It is traditionally broken down into four subtasks: predicate identification, predicate sense disambiguation, argument identification, and argument role labeling. Given its ability to generate more realistic linguistic representations, semantic role labeling today plays a crucial role in several NLP tasks, including question answering, information extraction, and machine translation.

Named entity recognition (NER)

NER is a key information extraction task in NLP for detecting and categorizing named entities, such as names, organizations, locations, and events. NER uses machine learning algorithms trained on data sets with predefined entities to automatically analyze and extract entity-related information from new unstructured text. NER methods are classified as rule-based, statistical, machine learning, deep learning, and hybrid models.

Biomedical named entity recognition (BioNER) is a foundational step in biomedical NLP systems, with a direct impact on critical downstream applications involving biomedical relation extraction, drug-drug interactions, and knowledge base construction. However, the linguistic complexity of biomedical vocabulary makes the detection and prediction of biomedical entities such as diseases, genes, species, and chemicals even more challenging than general-domain NER. The challenge is often compounded by a shortage of large-scale labeled training data and domain knowledge for sequence labeling. Deep learning BioNER methods, such as bidirectional long short-term memory with a CRF layer (BiLSTM-CRF), embeddings from language models (ELMo), and bidirectional encoder representations from transformers (BERT), have been successful in addressing several of these challenges. There are now several variations of the BERT pre-trained language model, including BlueBERT, BioBERT, and PubMedBERT, that have been applied to BioNER tasks.

An associated and equally critical task in BioNLP is biomedical relation extraction (BioRE), the process of automatically extracting and classifying relationships between complex biomedical entities. In recent years, the integration of attention mechanisms and the availability of pre-trained biomedical language models have helped augment the accuracy and efficiency of BioRE tasks in biomedical applications. Other semantic analysis techniques involved in extracting meaning and intent from unstructured text include coreference resolution, semantic similarity, semantic parsing, and frame semantics.

The importance of semantic analysis in NLP

Semantic analysis is key to the foundational task of extracting context, intent, and meaning from natural human language and making it machine-readable. This fundamental capability is critical to NLP applications ranging from sentiment analysis and information retrieval to machine translation and question-answering systems. The continual refinement of semantic analysis techniques will therefore play a pivotal role in the evolution and advancement of NLP technologies.

How LLMs improve semantic search in biomedical NLP

Semantic search in biomedical literature has evolved far beyond simple keyword matching.
Today, large language models (LLMs) enable researchers to retrieve contextually relevant insights from complex, unstructured datasets, such as PubMed, by understanding meaning, not just matching words. Unlike traditional search, which depends heavily on exact term overlap, LLM-based systems leverage embeddings, dense vector representations of words and phrases, to capture nuanced relationships between biomedical entities. This is especially valuable when mining literature for drug-disease associations, extracting drug-gene relations, predicting modes of action, or identifying multi-sentence relationships between proteins and genes.

By embedding both queries and biomedical documents in the same high-dimensional space, LLMs support more relevant and context-aware retrieval. For instance, a query such as "inhibitors of PD-1 signaling" can retrieve relevant articles even if they don't explicitly use the phrase "PD-1 inhibitors." This approach has transformed PubMed mining with NLP by enabling deeper and more intuitive exploration of biomedical text. LLM-powered semantic search is already being used in PubMed mining tools, clinical trial data extraction, and knowledge graph construction.

Looking ahead: NLP trends in drug discovery

As semantic search continues to evolve, it is becoming central to biomedical research workflows, enabling faster, deeper insights from unstructured text. The shift from keyword matching to meaning-based retrieval marks a key turning point in NLP-driven drug discovery. These LLM-powered approaches are especially effective for use cases like:
- Extracting drug-gene interactions
- Identifying biomarkers from literature
- Linking unstructured data across sources

They also help address key challenges in biomedical NLP, such as ambiguity, synonymy, and entity disambiguation across documents.
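The meaning-based retrieval described above can be illustrated with hand-made toy vectors standing in for real LLM embeddings. The 3-d vectors and snippets below are invented; real systems embed text into hundreds or thousands of dimensions:

```python
import numpy as np

# Toy illustration of embedding-based semantic search: queries and documents
# share one vector space, and retrieval ranks by cosine similarity rather
# than exact term overlap. The 3-d vectors are hand-made stand-ins.

doc_embeddings = {
    "Pembrolizumab blocks PD-1 signaling in T cells": np.array([0.9, 0.1, 0.0]),
    "Metformin lowers hepatic glucose production":     np.array([0.0, 0.2, 0.9]),
}
query_embedding = np.array([0.8, 0.3, 0.1])  # stand-in for "inhibitors of PD-1 signaling"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ranked = sorted(doc_embeddings,
                key=lambda d: cosine(query_embedding, doc_embeddings[d]),
                reverse=True)
# The PD-1 document ranks first even though it never uses the exact
# phrase "PD-1 inhibitors".
```

Swapping the hand-made vectors for embeddings from a biomedical language model turns this toy into the core of a real semantic search index.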
In 2022, ELIZA, an early natural language processing (NLP) system developed in 1966, won a Peabody Award for demonstrating that software could be used to create empathy. Over 50 years later, human language technologies have evolved significantly beyond the basic pattern-matching and substitution methodologies that powered ELIZA. As we enter the new age of ChatGPT, generative AI, and large language models (LLMs), here's a quick primer on the key components of NLP systems: NLP, NLU (natural language understanding), and NLG (natural language generation).

What is NLP?

NLP is an interdisciplinary field that combines techniques from linguistics, computer science, AI, and statistics to enable machines to understand, interpret, and generate human language. The earliest language models were rule-based systems that were extremely limited in scalability and adaptability. The field soon shifted towards data-driven statistical models that used probability estimates to predict sequences of words. Though this approach was more powerful than its predecessor, it still had limitations in scaling across large sequences and capturing long-range dependencies. The advent of recurrent neural networks (RNNs) helped address several of these limitations, but it would take the emergence of transformer models in 2017 to bring NLP into the age of LLMs.

The transformer model introduced a new architecture based on attention mechanisms. Unlike sequential models like RNNs, transformers process all the words in an input sentence in parallel. More importantly, the concept of attention allows them to model long-term dependencies even over long sequences. Transformer-based LLMs trained on huge volumes of data can autonomously predict the next contextually relevant token in a sentence with an exceptionally high degree of accuracy.
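The attention mechanism at the heart of the transformer can be sketched in a few lines of numpy as scaled dot-product attention over toy-sized matrices:

```python
import numpy as np

# Sketch of scaled dot-product attention, the core transformer operation:
# every position attends to every other position in parallel, so long-range
# dependencies are modeled directly. Shapes are toy-sized for illustration.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stabilized
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of each query to each key
    weights = softmax(scores, axis=-1)    # each row is a distribution over positions
    return weights @ V, weights           # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                       # 4 tokens, 8-dimensional projections
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
out, weights = attention(Q, K, V)
```

Because the scores matrix compares all token pairs at once, the computation is fully parallel across the sequence, which is exactly the property that distinguishes transformers from step-by-step RNNs.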
in recent years, domain-specific biomedical language models have helped augment and expand the capabilities and scope of ontology-driven bionlp applications in biomedical research. these domain-specific models have evolved from non-contextual models, such as biowordvec and biosentvec, to masked language models, such as biobert and bioelectra, and on to generative language models, such as biogpt and biomedlm. knowledge-enhanced biomedical language models have proven more effective at knowledge-intensive bionlp tasks than generic llms. in 2020, researchers created the biomedical language understanding and reasoning benchmark (blurb), a comprehensive benchmark and leaderboard to accelerate the development of biomedical nlp.

nlp = nlu + nlg + nlq

nlp is a field of artificial intelligence (ai) that focuses on the interaction between human language and machines. it employs a constantly expanding range of techniques, such as tokenization, lemmatization, syntactic parsing, semantic analysis, and machine translation, to extract meaning from unstructured natural language and to facilitate more natural, bidirectional communication between humans and machines.

source: techtarget

modern nlp systems are powered by three distinct natural language technologies (nlts): nlp, nlu, and nlg. it takes a combination of all these technologies to convert unstructured data into actionable information that can drive insights, decisions, and actions. according to gartner's hype cycle for nlts, there has been increasing adoption of a fourth category called natural language query (nlq). so, here's a quick dive into nlu, nlg, and nlq.

nlu

while nlp converts unstructured language into structured machine-readable data, nlu helps bridge the gap between human language and machine comprehension by enabling machines to understand the meaning, context, sentiment, and intent behind human language.
nlu systems process human language across three broad linguistic levels: a syntactical level to understand language based on grammar and syntax, a semantic level to extract meaning, and a pragmatic level to decipher context and intent. these systems leverage several advanced techniques, including semantic analysis, named entity recognition, relation extraction, and coreference resolution, to assign structure, rules, and logic to language so that machines can achieve a human-level comprehension of natural languages. the challenge is to evolve from pipeline models, where each task is performed separately, to blended models that combine critical bionlp tasks, such as biomedical named entity recognition (bioner) and biomedical relation extraction (biore), into one unified framework.

nlg

where nlu focuses on transforming complex human language into machine-understandable information, nlg, another subset of nlp, involves rendering complex machine-readable data in natural, human-like language. this typically involves a six-stage process flow that includes content analysis, data interpretation, information structuring, sentence aggregation, grammatical structuring, and language presentation. nlg systems generate understandable and relevant narratives from large volumes of structured and unstructured machine data and present them as natural language outputs, thereby simplifying and accelerating the transfer of knowledge between machines and humans. to explain the nlp-nlu-nlg synergies in extremely simple terms: nlp converts language into structured data, nlu provides the syntactic, semantic, grammatical, and contextual comprehension of that data, and nlg generates natural language responses based on the data.

nlq

the increasing sophistication of modern language technologies has renewed research interest in natural language interfaces like nlq that allow even non-technical users to search, interact with, and extract insights from data using everyday language.
most nlq systems feature both nlu and nlg modules. the nlu module extracts and classifies the utterances, keywords, and phrases in the input query in order to understand the intent behind the database search. nlg becomes part of the solution when the results pertaining to the query are generated as written or spoken natural language. nlq tools are broadly categorized as either search-based or guided nlq. the search-based approach uses a free text search bar for typing queries, which are then matched to information in different databases. a key limitation of this approach is that it requires users to have enough information about the data to frame the right questions. the guided approach to nlq addresses this limitation by adding capabilities that proactively guide users to structure their data questions using modeled questions, autocomplete suggestions, and other relevant filters and options.

augmenting life sciences research with nlp

at mindwalk, our mission is to enable an authentic systems biology approach to life sciences research, and natural language technologies play a central role in achieving that mission. our lensai integrated intelligence platform leverages the power of our hyft® framework to organize the entire biosphere as a multidimensional network of 660 million data objects. our proprietary bionlp framework then integrates unstructured data from text-based information sources to enrich the structured sequence data and metadata in the biosphere. the platform also leverages the latest developments in llms to bridge the gap between syntax (sequences) and semantics (functions). for instance, the use of retrieval-augmented generation (rag) models enables the platform to scale beyond the typical limitations of llms, such as knowledge cutoffs and hallucinations, and provide the up-to-date contextual reference required for biomedical nlp applications.
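as a rough illustration of the rag pattern (not lensai's actual implementation), here is a minimal sketch: retrieve supporting passages first, then condition generation on them. the corpus, the toy lexical scorer, and the `call_llm` stub are all invented for illustration.

```python
# a minimal retrieval-augmented generation (rag) loop: retrieve supporting
# passages first, then condition generation on them. the corpus, scorer,
# and `call_llm` stub below are illustrative assumptions only.
corpus = [
    "pd-1 is an immune checkpoint receptor expressed on t cells.",
    "statins inhibit hmg-coa reductase.",
    "pembrolizumab is a monoclonal antibody targeting pd-1.",
]

def retrieve(query, k=2):
    # toy lexical overlap scorer; a production system would use vector search
    terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda p: len(terms & set(p.replace(".", "").split())),
        reverse=True,
    )
    return scored[:k]

def call_llm(prompt):
    # stand-in for a real llm api call; simply echoes the grounded prompt
    return "answer grounded in: " + prompt

def answer(query):
    # grounding the prompt in retrieved, up-to-date passages is what lets
    # rag mitigate hallucinations and knowledge cutoffs
    context = "\n".join(retrieve(query))
    prompt = f"context:\n{context}\n\nquestion: {query}"
    return call_llm(prompt)

print(answer("what does pd-1 do"))
```

the key design choice is that the model never answers from parametric memory alone: every answer is conditioned on passages fetched at query time, so updating the corpus updates the system's knowledge without retraining.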
with lensai, researchers can now choose to launch their research by searching for a specific biological sequence. or they may search the scientific literature with a general exploratory hypothesis related to a particular biological domain, phenomenon, or function. in either case, our unique technological framework returns all connected sequence-structure-text information, ready for further in-depth exploration and ai analysis. by combining the power of hyft®, nlp, and llms, we have created a unique platform that facilitates the integrated analysis of all life sciences data. thanks to our unique retrieval-augmented multimodal approach, we can now overcome llm limitations such as hallucinations and limited knowledge. stay tuned for more in our next blog.
natural language understanding (nlu) is an ai-powered technology that allows machines to understand the structure and meaning of human languages. nlu, like natural language generation (nlg), is a subset of natural language processing (nlp) that focuses on assigning structure, rules, and logic to human language so machines can understand the intended meaning of words, phrases, and sentences in text. nlg, on the other hand, deals with generating realistic written/spoken human-understandable information from structured and unstructured data. since the development of nlu is based on theoretical linguistics, the process can be explained in terms of the following linguistic levels of language comprehension.

linguistic levels in nlu

phonology is the study of sound patterns in different languages/dialects; in nlu it refers to the analysis of how sounds are organized, and their purpose and behavior. lexical or morphological analysis is the study of morphemes, the indivisible basic units of language with their own meaning, one at a time. indivisible words with their own meaning, or lexical morphemes (e.g. work), can be combined with plural morphemes (e.g. works) or grammatical morphemes (e.g. worked/working) to create word forms. lexical analysis identifies relationships between morphemes and converts words into their root form. syntactic analysis, or syntax analysis, is the process of applying grammatical rules to word clusters and organizing them on the basis of their syntactic relationships in order to determine meaning. this also involves detecting grammatical errors in sentences. while syntactic analysis involves extracting meaning from the grammatical syntax of a sentence, semantic analysis looks at the context and purpose of the text. it helps capture the true meaning of a piece of text by identifying text elements as well as their grammatical roles.
discourse analysis expands the focus from sentence-length units to look at the relationships between sentences and their impact on overall meaning. discourse refers to coherent groups of sentences that contribute to the topic under discussion. pragmatic analysis deals with aspects of meaning not reflected in syntactic or semantic relationships. here the focus is on identifying intended meaning by analyzing literal and non-literal components against the context of background knowledge.

common tasks/techniques in nlu

there are several techniques that are used in the processing and understanding of human language. here's a quick run-through of some of the key techniques used in nlu and nlp. tokenization is the process of breaking down a string of text into smaller units called tokens. for instance, a text document could be tokenized into sentences, phrases, words, subwords, and characters. this is a critical preprocessing task that converts unstructured text into numerical data for further analysis. stemming and lemmatization are two different approaches with the same objective: to reduce a particular word to its root word. in stemming, characters are removed from the end of a word to arrive at the "stem" of that word. algorithms determine the number of characters to be eliminated for different words even though they do not explicitly know the meaning of those words. lemmatization is a more sophisticated approach that uses complex morphological analysis to arrive at the root word, or lemma. parsing is the process of extracting the syntactic information of a sentence based on the rules of formal grammar. based on the type of grammar applied, the process can be classified broadly into constituency and dependency parsing. constituency parsing, based on context-free grammar, involves dividing a sentence into sub-phrases, or constituents, that belong to a specific grammar category, such as noun phrases or verb phrases.
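a minimal sketch of the tokenization, stemming, and lemmatization steps described above. the suffix rules and the tiny lemma table are simplified assumptions for illustration, not a production morphology engine.

```python
import re

def tokenize(text):
    # split text into lowercase word tokens
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    # crude suffix stripping, in the spirit of rule-based stemmers:
    # remove a known suffix without knowing the word's meaning
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# toy lemma table standing in for full morphological analysis;
# unlike stemming, lemmatization can map "better" to "good"
LEMMAS = {"better": "good", "worked": "work", "working": "work"}

def lemmatize(word):
    return LEMMAS.get(word, stem(word))

print([stem(w) for w in tokenize("Working proteins worked")])
```

note the difference the example exposes: a stemmer can only chop characters, while a lemmatizer with morphological knowledge can map irregular forms like "better" to their true lemma "good".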
dependency parsing defines the syntax of a sentence not in terms of constituents but in terms of the dependencies between the words in a sentence. the relationship between words is depicted as a dependency tree where words are represented as nodes and the dependencies between them as edges. part-of-speech (pos) tagging, or grammatical tagging, is the process of assigning a grammatical classification, like noun, verb, adjective, etc., to words in a sentence. automatic tagging can be broadly classified as rule-based, transformation-based, and stochastic pos tagging. rule-based tagging uses a dictionary, as well as a small set of rules derived from the formal syntax of the language, to assign pos. transformation-based tagging, or brill tagging, leverages transformation-based learning for automatic tagging. stochastic refers to any model that uses frequency or probability, e.g. word frequency or tag sequence probability, for automatic pos tagging. named entity recognition (ner) is an nlp subtask that is used to detect, extract, and categorize named entities, including names, organizations, locations, themes, topics, monetary values, etc., from large volumes of unstructured data. there are several approaches to ner, including rule-based systems, statistical models, dictionary-based systems, ml-based systems, and hybrid models. these are just a few examples of some of the most common techniques used in nlu. there are several other techniques, such as word sense disambiguation, semantic role labeling, and semantic parsing, that focus on different levels of semantic abstraction.

nlp/nlu in biomedical research

nlp/nlu technologies represent a strategic fit for biomedical research with its vast volumes of unstructured data: 3,000-5,000 papers published each day, clinical text data from ehrs, diagnostic reports, medical notes, lab data, etc., and non-standardized digital real-world data.
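as a toy example of the dictionary-based approach to ner mentioned above, the following sketch scans text against a small gazetteer. the gazetteer entries are invented for illustration; real systems combine such lexicons with statistical or ml models.

```python
# a minimal dictionary-based ner tagger; the tiny gazetteer is an
# invented example standing in for a full biomedical lexicon
GAZETTEER = {
    "tp53": "GENE",
    "imatinib": "DRUG",
    "chronic myeloid leukemia": "DISEASE",
}

def tag_entities(text):
    # longest-match-first scan over the lowercased text, so multi-word
    # entities are matched before any shorter overlapping entries
    found = []
    lowered = text.lower()
    for phrase in sorted(GAZETTEER, key=len, reverse=True):
        start = lowered.find(phrase)
        if start != -1:
            found.append((text[start:start + len(phrase)], GAZETTEER[phrase]))
    return found

sentence = "Imatinib is used to treat chronic myeloid leukemia."
print(tag_entities(sentence))
```

dictionary lookup is fast and precise for known terms but cannot recognize unseen entities or resolve ambiguity, which is why hybrid and ml-based approaches dominate in practice.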
nlp-enabled text mining has emerged as an effective and scalable solution for extracting biomedical entity relations from vast volumes of scientific literature. techniques like named entity recognition (ner) are widely used in relation extraction tasks in biomedical research, with conventional named entities, such as names, organizations, and locations, substituted with genes, proteins, biological processes and pathways, drug targets, etc. the unique vocabulary of biomedical research has necessitated the development of specialized, domain-specific bionlp frameworks. at the same time, the capabilities of nlu algorithms have been extended to the language of proteins and that of chemistry and biology itself. a 2021 article detailed the conceptual similarities between proteins and language that make them ideal for nlp analysis. more recently, an nlp model was trained to correlate amino acid sequences from the uniprot database with the english-language words, phrases, and sentences used to describe protein function, annotating over 40 million proteins. researchers have also developed an interpretable and generalizable drug-target interaction model, inspired by sentence classification techniques, that extracts relational information from drug-target biochemical sentences. large neural language models and transformer-based language models are opening up transformative opportunities for biomedical nlp applications across a range of bioinformatics fields including sequence analysis, genome analysis, multi-omics, spatial transcriptomics, and drug discovery. most importantly, nlp technologies have helped unlock the latent value in huge volumes of unstructured data to enable more integrative, systems-level biomedical research. read more about nlp's critical role in facilitating systems biology and ai-powered data-driven drug discovery. if you want more information on seamlessly integrating advanced bionlp frameworks into your research pipeline, please drop us a line here.
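one simple way to treat a protein as language, in the spirit of the protein-nlp work described above, is to tokenize an amino-acid sequence into overlapping k-mer "words" that a language model can then embed. the 3-mer choice and the toy sequence below are illustrative assumptions.

```python
# treating a protein as language: overlapping k-mers act as the "words"
# of the protein sentence before feeding it to an nlp-style model
def kmer_tokenize(sequence, k=3):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

seq = "MKTAYIAK"  # toy amino-acid sequence (one-letter codes)
print(kmer_tokenize(seq))
```

just as word tokens carry local context in a sentence, overlapping k-mers preserve local sequence context, which is what lets language-model machinery transfer to protein sequences.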
it is estimated that adverse events (aes) are likely one of the 10 leading causes of death and disability in the world. in high-income countries, one in every 10 patients is exposed to harm from a range of adverse events, at least 50% of which are preventable. in low- and middle-income countries, 134 million such events occur each year, resulting in 2.6 million deaths. across populations, the incidence of aes also varies based on age, gender, and ethnic and racial disparities. and according to a recent study, external disruptions, like the current pandemic, can significantly alter the incidence, dispersion, and risk trajectory of these events. apart from their direct patient health-related consequences, aes also have significantly detrimental implications for healthcare costs and productivity. it is estimated that 15% of total hospital activity and expenditure in oecd countries is directly attributable to adverse events. there is therefore a dire need for a systematic approach to detecting and preventing adverse events in the global healthcare system. and that's exactly where ai technologies are taking the lead.

ai applications in adverse drug events (ades)

a 2021 scoping review to identify potential ai applications to predict, prevent, or mitigate the effects of ades homed in on four interrelated use cases:

- predicting patients likely to have a future ade, in order to prevent or effectively manage these events.
- predicting the therapeutic response of patients to medications, in order to prevent ades, including in patients not expected to benefit from treatment.
- predicting optimal dosing for specific medications, in order to balance therapeutic benefits with ade-related risks.
- predicting the most appropriate treatment options, to guide the selection of safe and effective pharmacological therapies.
the review concluded that ai technologies could play an important role in the prediction, detection, and mitigation of ades. however, it also noted that even though the studies included in the review applied a range of ai techniques, model development was overwhelmingly based on structured data from health records and administrative health databases. therefore, the reviewers noted, integrating more advanced approaches like nlp and transformer neural networks would be essential in order to access and integrate unstructured data, like clinical notes, and improve the performance of predictive models.

nlp in pharmacovigilance

spontaneous reporting systems (srss) have traditionally been the cornerstone of pharmacovigilance, with reports being pooled from a wide range of sources. for instance, vigibase, the global database at the heart of the world health organization's international pharmacovigilance system, currently holds over 30 million reports of suspected drug-related adverse effects in patients from 170 member countries. the problem, however, is that spontaneous reporting is, by definition, a passive approach, and currently fewer than 5% of ades are reported even in jurisdictions with mandatory reporting. the vast majority of ade-related information resides in free-text channels: emails and phone calls to patient support centres, social media posts, news stories, doctor-pharma rep call transcripts, online patient forums, scientific literature, etc. mining these free-text channels and the clinical narratives in ehrs can supplement spontaneous reporting and enable significant improvements in ade identification.

nlp & ehrs

ehrs provide a longitudinal electronic record of patient health information captured across different systems within the healthcare setting. one of the main benefits of integrating ehrs as a pharmacovigilance data source is that they provide real-time real-world data.
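to illustrate how safety signals are screened in spontaneous-report databases such as vigibase, here is the proportional reporting ratio (prr), a standard disproportionality measure over a 2x2 table of report counts. the counts below are invented for illustration.

```python
# proportional reporting ratio (prr), a standard disproportionality
# measure used to flag drug-event signals in spontaneous-report data
def prr(a, b, c, d):
    """a: reports of the drug with the event, b: the drug without the event,
    c: all other drugs with the event, d: all other drugs without the event."""
    return (a / (a + b)) / (c / (c + d))

# toy counts: the drug of interest reports the event disproportionately often
score = prr(a=30, b=70, c=100, d=900)
print(round(score, 2))
```

a prr well above 1 (in practice, a threshold around 2 is commonly used for screening) indicates the event is reported disproportionately often for the drug of interest and warrants expert review.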
these systems also contain multiple fields of unstructured data, like discharge summaries, lab test findings, nurse notifications, etc., that can be explored with nlp technologies to detect safety signals. and compared to srss, ehr data is not affected by duplication or under- or over-reporting and enables a more complete assessment of drug exposure and comorbidity status. in recent years, deep nlp models have been successfully used across a variety of text classification and prediction tasks in ehrs, including medical text classification, segmentation, word sense disambiguation, medical coding, outcome prediction, and de-identification. hybrid clinical nlp systems, combining a knowledge-based general clinical nlp system for medical concept extraction with a task-specific deep learning system for relation identification, have been able to automatically extract ade- and medication-related information from clinical narratives. but some challenges remain, such as the limited availability and complexity of domain-specific text, the lack of annotated data, and the extremely sensitive nature of ehr information.

nlp & biomedical literature

biomedical literature is one of the most valuable sources of drug-related information, stemming both from development cycles and the post-marketing phase. in post-marketing surveillance (pms), for instance, scientific literature is becoming essential to the detection of emerging safety signals. but with as many as 800,000 new articles in medicine and pharmacology published every year, the value of nlp in automating the extraction of events and safety information cannot be overstated. over the years, a variety of nlp techniques have been applied to a range of literature mining tasks to demonstrate the accuracy and versatility of the technology. take pms, for example: a time-consuming, manual intellectual review process to actively screen biomedical databases and literature for new ades.
researchers were able to train an ml algorithm on historic screening knowledge data to automatically sort relevant articles for intellectual review. another deep learning pipeline implemented with three nlp modules not only monitors biomedical literature for adr signals but also filters and ranks publications across three output levels.

nlp & social media

there has been a lot of interest in the potential of nlp-based pipelines that can automate information extraction from social media and other online health forums. but these data sources, specifically social media networks, present a unique set of challenges. for instance, adr mentions on social media typically include long, varied, and informal descriptions that are completely different from the formal terminology found in pubmed. one proposed way around this challenge has been to use an adversarial transfer framework to transfer auxiliary features from pubmed to social media datasets in order to improve generalization, mitigate noise, and enhance adr identification performance. pharmacovigilance on social media data has predominantly focused on mining ades using annotated datasets. achieving the larger objective of detecting ade signals and informing public policy will require the development of end-to-end solutions that enable the large-scale analysis of social media for a variety of drugs. one project to evaluate the performance of automated ae recognition systems for twitter warned of a potentially large discrepancy between published performance results and actual performance on independent data. the transferability of ae recognition systems, the study concluded, would be key to their more widespread use in pharmacovigilance. all that notwithstanding, there is little doubt that user-generated textual content on the internet will have a substantive influence on conventional pharmacovigilance processes.
integrated pharmacovigilance

pharmacovigilance is still a very fragmented and uncoordinated process, both in terms of data collection and analysis. the value of nlp technologies lies in their ability to unlock real-time real-world insights at scale from these data sources, enabling a more proactive approach to predicting and preventing adverse events. but for this to happen, the focus has to be on the development of outcome-based hybrid nlp models that can unify all textual data across clinical trials, clinical narratives, ehrs, biomedical literature, user-generated content, etc. at the same time, the approach to the collection and analysis of structured data in pharmacovigilance also needs to be modernised to augment efficiency, productivity, and accuracy. combining structured and unstructured data will open up a new era in data-driven pharmacovigilance.
natural language processing is a multidisciplinary field, and over the years several models and algorithms have been successfully used to parse text. ml approaches have been central to nlp development, with many of them particularly focussing on a technique called sequence-to-sequence learning (seq2seq). first introduced by google in 2014, seq2seq models revolutionized translation and were quickly adopted for a variety of nlp tasks including text summarization, speech recognition, image captioning, and question answering. prior to this, deep neural networks (dnns) had been used to tackle difficult problems such as speech recognition. however, they suffered from a significant limitation in that they required the dimensionality of inputs and outputs to be known and fixed. hence, they were not suitable for sequential problems, such as speech recognition, machine translation, and question answering, where dimensionality cannot be pre-defined. as a result, recurrent neural networks (rnns), a type of artificial neural network, soon became the state of the art for sequential data.

recurrent neural networks

in a traditional dnn, the assumption is that inputs and outputs are independent of each other. rnns, however, operate on the principle that the output depends on both the current input and the "memory" of previous inputs from the sequence. the use of feedback loops to process sequential data allows information to persist, giving rnns their "memory." as a result, this approach is well suited to language applications, where context is vital to the accuracy of the final output. however, rnns suffered from the problem of vanishing gradients: as gradients are propagated back through long sequences they shrink towards zero, so the network effectively retains only the most recent information, which impaired meaningful learning over large data sequences.
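the recurrence described above can be sketched in a few lines: the new hidden state is computed from the current input and the previous hidden state, so information persists across the sequence. the sizes and random weights below are toy assumptions for illustration.

```python
import numpy as np

# one recurrent step: the new hidden state depends on the current input
# and the "memory" carried in the previous hidden state
rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))  # input -> hidden weights
W_h = rng.normal(size=(4, 4))  # hidden -> hidden weights (the feedback loop)

def rnn_step(x, h_prev):
    return np.tanh(W_x @ x + W_h @ h_prev)

h = np.zeros(4)
for x in rng.normal(size=(5, 3)):  # a sequence of 5 input vectors
    h = rnn_step(x, h)             # memory persists across steps
print(h.shape)
```

because gradients flow back through every `rnn_step` during training, repeated multiplication by `W_h` is exactly what makes them shrink (or explode) over long sequences, the vanishing-gradient problem that lstms and grus were designed to mitigate.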
rnns soon evolved into several specialized versions, like lstm (long short-term memory), gru (gated recurrent unit), time distributed layers, and the convlstm2d layer, with the capability to process long sequences. each of these versions was designed for specific situations: for instance, grus outperformed lstms on low-complexity sequences, consumed less memory, and delivered faster results, whereas lstms performed better on high-complexity sequences and enabled higher accuracy. rnns and their variants soon became state-of-the-art for sequence translation. however, there were still several limitations related to long-term dependencies, parallelization, resource intensity, and their inability to take full advantage of emerging computing devices such as tpus and gpus. a new model would soon emerge and go on to become the dominant architecture for complex nlp tasks.

transformers

by 2017, complex rnns and their variants had become the standard for sequence modelling and transduction, with the best models incorporating an encoder and decoder connected through an attention mechanism. that year, however, a paper from google called "attention is all you need" proposed a new model architecture called the transformer, based entirely on attention mechanisms. having dropped recurrence in favour of attention mechanisms, these models performed remarkably better at translation tasks, while enabling significantly more parallelization and requiring less time to train.

what is the attention mechanism?

the concept of an attention mechanism was first introduced in a 2014 paper on neural machine translation. prior to this, rnn encoder-decoder frameworks encoded variable-length source sentences into fixed-length vectors that would then be decoded into variable-length target sentences. this approach not only restricts the network's ability to cope with large sentences but also results in performance deterioration for long input sentences.
rather than trying to force-fit all the information from an input sentence into a fixed-length vector, the paper proposed the implementation of a mechanism of attention in the decoder. in this approach, the information from an input sentence is encoded across a sequence of vectors, instead of a single fixed-length vector, with the attention mechanism allowing the decoder to adaptively choose a subset of these vectors to decode the translation.

types of attention mechanisms

the transformer was the first transduction model to implement self-attention as an alternative to recurrence and convolutions. a self-attention, or intra-attention, mechanism relates different positions of a single sequence in order to compute a representation of that sequence. depending on the implementation, there can be several types of attention mechanisms. for instance, in terms of the source states that contribute to deriving the attention vector, there is global attention, where attention is placed on all source states; hard attention, where it is placed on just one source state; and soft attention, which uses a limited set of source states. there is also luong attention from 2015, a variation on the original bahdanau, or additive, attention, which combined two classes of mechanisms, one global, covering all source words, and the other local, focused on a selected subset of words, to predict the target sentence. the 2017 google paper introduced scaled dot-product attention, which is like dot-product, or multiplicative, attention but with a scaling factor. the same paper also defined multi-head attention, where several attention functions are performed in parallel instead of a single one. this approach enables the model to concurrently attend to information from different representation subspaces at different positions. multi-head attention has played a central role in the success of transformer models, demonstrating consistent performance improvements over other attention mechanisms.
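the scaled dot-product attention from the 2017 paper can be sketched directly from its formula, softmax(q k^T / sqrt(d_k)) v. the matrix sizes and random values below are toy assumptions for illustration.

```python
import numpy as np

def softmax(x):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # how much each position attends to every other
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 query positions, dimension d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out, w = attention(Q, K, V)
print(out.shape, w.sum(axis=-1))  # each row of attention weights sums to 1
```

multi-head attention simply runs several such attention functions in parallel on learned projections of q, k, and v and concatenates the results, which is what lets the model attend to different representation subspaces at once.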
in fact, rnns that would typically underperform transformers have been shown to outperform them when using multi-head attention. apart from rnns, multi-head attention has also been incorporated into other models like graph attention networks and convolutional neural networks.

transformers in nlp

the transformer architecture has become a dominant choice in nlp. some of the leading language models for nlp, such as bidirectional encoder representations from transformers (bert), generative pre-training models (gpt-3), and xlnet, are transformer-based. transformer-based pretrained language models (t-ptlms) have been successfully used in a variety of nlp tasks. built on transformers, self-supervised learning, and transfer learning, t-ptlms use self-supervised learning on large volumes of text data to learn universal language representations and then transfer this knowledge to downstream tasks. today, there is a long list of t-ptlms, including general, social media, monolingual, multilingual, and domain-specific t-ptlms. specialized biomedical language models, like biobert, bioelectra, bioalbert, and bioelmo, have been able to produce meaningful concept representations that augment the power and accuracy of a range of bionlp applications such as named entity recognition, relationship extraction, and question answering. transformer-based language models trained on large-scale drug-target interaction (dti) datasets have been able to outperform conventional methods in the prediction of novel drug-target interactions. it's hard to tell if transformers will eventually replace rnns, but they are currently the model of choice for nlp.
nlp challenges can be classified into two broad categories. the first category is linguistic and refers to the challenges of decoding the inherent complexity of human language and communication. we covered this category in a recent "why is nlp challenging?" article. the second is data-related and refers to some of the data acquisition, accuracy, and analysis issues that are specific to nlp use cases. in this article, we will look at four of the most common data-related challenges in nlp.

low resource languages

there is currently a digital divide in nlp between high-resource languages, such as english, mandarin, french, german, and arabic, and low-resource languages, which include most of the remaining 7,000+ languages of the world. though there is a range of ml techniques that can reduce the need for labelled data, there still needs to be enough data, both labelled and unlabelled, to feed data-hungry ml techniques and to evaluate system performance. in recent times, multilingual language models (mllms) have emerged as a viable option to handle multiple languages in a single model. pretrained mllms have been successfully used to transfer nlp capabilities to low-resource languages. as a result, there is increasing focus on zero-shot transfer learning approaches to building bigger mllms that cover more languages, and on creating benchmarks to understand and evaluate the performance of these models on a wider variety of tasks. apart from transfer learning, there is a range of techniques, like data augmentation, distant & weak supervision, cross-lingual annotation projections, learning with noisy labels, and non-expert support, that have been developed to generate alternative forms of labelled data for low-resource languages and low-resource domains. today, there is even a no-code platform that allows users to build nlp models in low-resource languages.

training data

building accurate nlp models requires huge volumes of training data.
Though there has been a sharp increase in NLP datasets in recent times, these are often collected through automation or crowdsourcing. There is, therefore, the potential for incorrectly labelled data which, when used for training, can lead to memorisation and poor generalisation. Apart from finding enough raw data for training, the key challenge is to ensure accurate and extensive data annotation that makes training data more reliable. Data annotation broadly refers to the process of organising and annotating training data for specific NLP use cases. In text annotation, a subset of data annotation, text data is transcribed and annotated so that ML algorithms can make associations between actual and intended meanings. There are five main techniques for text annotation: sentiment annotation, intent annotation, semantic annotation, entity annotation, and linguistic annotation. However, each of these has its own challenges to address. For instance, data labelling for entity annotation typically has to contend with issues related to nested annotations, introducing new entity types in the middle of a project, managing extensive lists of tags, and categorising trailing and preceding whitespace and punctuation. There are currently several annotation and classification tools for managing NLP training data at scale. However, manually labelled gold-standard annotations remain a prerequisite, and though ML models are increasingly capable of automated labelling, human annotation becomes essential in cases where data cannot be auto-labelled with high confidence.

Large or multiple documents

Dealing with large or multiple documents is another significant challenge facing NLP models. Most NLP research focuses on benchmarking models on short text tasks, and even state-of-the-art models have a limit on the number of words allowed in the input text. The second problem is that supervision is scarce and expensive to obtain.
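The whitespace-and-punctuation issue in entity annotation can be handled with a small span-normalisation step. The sketch below (plain Python, invented example) trims leading and trailing whitespace and punctuation from an annotated character span so that the stored entity boundaries are clean.

```python
def trim_span(text, start, end, strip_chars=" \t.,;:"):
    """Normalise an entity annotation span [start, end) by trimming
    preceding and trailing whitespace/punctuation from its boundaries."""
    while start < end and text[start] in strip_chars:
        start += 1
    while end > start and text[end - 1] in strip_chars:
        end -= 1
    return start, end

# A sloppy annotation that accidentally includes spaces and a comma:
text = "  BRCA1, "
start, end = trim_span(text, 0, len(text))
entity = text[start:end]  # "BRCA1"
```

Applying such a rule consistently across annotators avoids one common source of disagreement in gold-standard entity datasets.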
As a result, scaling up NLP to extract context from huge volumes of medium-to-long unstructured documents remains a technical challenge. Many current NLP models are based on recurrent neural networks (RNNs), which cannot represent longer contexts. However, there is considerable focus on graph-inspired RNNs as it emerges that a graph structure may serve as the best representation of NLP data. Research at the intersection of DL, graphs, and NLP is driving the development of graph neural networks (GNNs). Today, GNNs have been applied successfully to a variety of NLP tasks, from classification tasks such as sentence classification, semantic role labelling, and relation extraction, to generation tasks like machine translation, question generation, and summarisation.

Development time and resources

As we mentioned in our previous article on the linguistic challenges of NLP, AI programs like AlphaGo have evolved quickly to master a broader variety of games with less predefined knowledge. NLP development cycles are yet to see that pace and degree of evolution. That is because human language is inherently complex: it makes "infinite use of finite means", generating an infinite number of possibilities from a finite set of building blocks. The prevalent shape of the syntax of every language is the result of communicative needs and evolutionary processes that have developed over thousands of years. As a result, NLP development is a complex and time-consuming process that requires evaluating billions of data points in order to adequately train AI from scratch. Meanwhile, the complexity of large language models is doubling every two months. A powerful language model like GPT-3 packs 175 billion parameters and requires roughly 314 zettaFLOPs (3.14 × 10^23 floating-point operations, where one zettaFLOP is 10^21 operations) to train.
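That training-compute figure can be sanity-checked with a common back-of-the-envelope rule: training FLOPs ≈ 6 × parameters × training tokens. Using GPT-3's widely reported figures of about 175 billion parameters and about 300 billion training tokens:

```python
# Back-of-the-envelope estimate of GPT-3 training compute
# using the approximation: FLOPs ≈ 6 * parameters * training tokens.
params = 175e9   # ~175 billion parameters
tokens = 300e9   # ~300 billion training tokens (reported figure)

flops = 6 * params * tokens          # ≈ 3.15e23 floating-point ops
zettaflops = flops / 1e21            # 1 zettaFLOP = 10**21 ops
# zettaflops ≈ 315, in line with the ~314 zettaFLOPs quoted above
```

The agreement with the quoted number is expected, since estimates of this kind are typically derived from the same approximation.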
It has been estimated that it would cost nearly $100 million in deep learning (DL) infrastructure to train the world's largest and most powerful generative language model, with 530 billion parameters. In 2021, Google open-sourced a 1.6-trillion-parameter model, and some projections put the parameter count of GPT-4 at around 100 trillion. As a result, language modelling is quickly becoming as economically challenging as it is conceptually complex.

Scaling NLP

NLP continues to be one of the fastest-growing sectors within AI. As the race to build larger transformer models continues, the focus will turn to cost-effective and efficient means of continuously pre-training gigantic generic language models with proprietary domain-specific data. Even though large language models and computational graphs can help address some of the data-related challenges of NLP, they will also require infrastructure on a whole new scale. Today, vendors like NVIDIA offer fully packaged products that enable organisations with extensive NLP expertise, but limited systems, HPC, or large-scale NLP workload expertise, to scale out faster. So, despite the challenges, NLP continues to expand and grow to include more and more new use cases.
There will be more than twice as much digital data created over the next five years as has been generated since the advent of digital storage. A vast majority of that data, more than 80 per cent, will be unstructured and is estimated to be growing at 55-65% per year. Textual data, in the form of documents, journal articles, blogs, emails, electronic health records, and social media posts, is one of the most common types of unstructured data. This is where AI-based technologies like NLP can help extract meaning and context from large volumes of unstructured textual data. NLP unlocks access to valuable new data sources that were hitherto beyond the purview of conventional data integration and analysis frameworks. Biomedical-domain-specific NLP techniques open up a gamut of possibilities for automating the extraction of statistical and biological information from large volumes of text, including scientific literature and medical/clinical data. More importantly, they bring several new benefits in terms of productivity, efficiency, performance, and innovation.

Key benefits of NLP

Enabling scale, across multiple dimensions

Scientific journals and other specialised online publications are critical to the dissemination of experiments and studies in biomedical and life sciences research. Every biomedical research project can benefit significantly from extracting the relevant scientific knowledge, such as protein-protein interactions, embedded in this distributed information trove. And with an estimated 3,000 biomedical articles being published every day, NLP becomes an indispensable tool for the collation and propagation of knowledge. It is a similar situation in the clinical context, where NLP can quickly extract meaning and context from a sprawl of unstructured text records such as EHRs, diagnostic reports, medical notes, lab data, etc. NLP methods have also been successfully reimagined to scale across structured biological information like sequence data.
Today, high-throughput sequencing technologies are generating ever more biological sequence data that still lacks interpretation or biological annotation. This creates a major integration and analysis bottleneck for conventional downstream frameworks. At MindWalk, for instance, we have applied NLP methods to transcribe the universal language of all omics data and develop a unified framework that can instantly scale across all omics data.

Uncovering new actionable insights

Using NLP to expand the scope of biomedical research to textual data can lead to the discovery of insights that lie outside the realm of clinical and biological data. In the clinical context, for example, effective patient-physician communication is vital for enhancing patient understanding of treatment and adherence, in order to improve clinical outcomes and patient quality of life. Patient-reported outcome measures (PROMs) are often used to assess and improve communication. However, one study set out to complement conventional approaches by extracting a patient-centred view of diseases and treatments through social media analytics. The strategy was to use a text-mining methodology to analyse health-related forums in order to understand the therapeutic experience of patients affected by hypothyroidism and to detect possible adverse drug reactions (ADRs) that may not necessarily be communicated in a formal clinical setting. The analysis of reported ADRs revealed a pattern in which well-known side effects and uncertainties about proper administration were causing anxiety and fear. The other key finding was that some symptoms quite frequently reported online, like dizziness, memory impairment, and sexual dysfunction, were usually not discussed at in-person consultations.

Empowering researchers, accelerating research

NLP technologies significantly expand the scope and potential of biological research by putting into play vast volumes of information that were hitherto underutilised.
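The keyword-frequency side of such forum mining can be sketched in a few lines. The posts and term list below are invented for illustration; the actual study used a full text-mining pipeline with proper term normalisation and ADR dictionaries.

```python
from collections import Counter

# Hypothetical candidate adverse-reaction terms (illustrative only).
ADR_TERMS = {"dizziness", "fatigue", "memory impairment", "weight gain"}

# Invented forum posts standing in for scraped patient discussions.
posts = [
    "switched doses twice, still fatigue and dizziness every morning",
    "anyone else get memory impairment on this? plus weight gain",
    "dizziness went away after a month",
]

def count_adr_mentions(posts, terms):
    """Count how many posts mention each candidate ADR term."""
    counts = Counter()
    for post in posts:
        lowered = post.lower()
        for term in terms:
            if term in lowered:
                counts[term] += 1
    return counts

mentions = count_adr_mentions(posts, ADR_TERMS)
```

Even this crude frequency view hints at how symptoms discussed online can be surfaced and compared against what is recorded in formal clinical settings.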
By automating the analysis of unstructured textual data, NLP empowers researchers with more data points to explore more correlations and possibilities. In addition, it relieves them of tedious, repetitive tasks, allowing them to focus on activities that add real value and accelerate time-to-insight. Take rare disease drug development, for example, a field characterised by small patient populations and a shortage of data. To account for this inherent data scarcity, researchers had to manually scour large volumes of information to identify any links between rare diseases and specific genes and gene variants. The advent of NLP relieves researchers of the tedium of manual search, instantly expands their data universe, and helps accelerate the drug development process for rare diseases.

Enabling innovation

NLP can help disrupt and reinvent tried-and-tested processes that have become established convention in many industries. Take biological research, for example, where sequence search and comparison is the launch point for many projects. In this standard process, users typically input a research-relevant biological sequence, in a predefined and acceptable data format, and use the relevant search results to chart their research pathway. Even though the underlying frameworks, models, and algorithms have evolved considerably over the years, the standard process remains the same: users input a sequence to obtain a list of all pertinent sequences. However, NLP-based innovations, like the MindWalk platform, can completely disrupt this process to yield significant improvements in efficiency, productivity, and performance. In the NLP-based model, users can start with a simple text input, say "covid", to launch their search. More importantly, the model surfaces all relevant results, at both the sequence and text levels, thereby facilitating a more data-inclusive and integrative approach to genomics research.
Integrative research with MindWalk

The MindWalk platform is the latest technology innovation in our continuing quest to make omics research more efficient, productive, and integrative. By adding literature analysis to our existing omics and metadata integration framework, we now offer a unified solution that scales across sequence data and unstructured textual data to facilitate a truly integrative and data-driven approach to biological research. Our platform's semantics-driven analysis framework is fully domain-agnostic and uses a bottom-up approach, which means that even proprietary literature with custom vocabulary can be easily parsed. Our integrated framework traverses omics data, metadata, and textual data to capture all correlated information across structured and unstructured data in one shot. This provides researchers with a 'single pane of glass' view of all the entities, associations, and relationships relevant to their research. We believe that enabling this singular focus on the most relevant data points and correlations between a specific research purpose and all prior knowledge can help researchers significantly accelerate time to insight and value.