The Blog
MindWalk is a biointelligence company uniting AI, multi-omics data, and advanced lab research into a customizable ecosystem for biologics discovery and development.
Natural language understanding (NLU) is an AI-powered technology that allows machines to understand the structure and meaning of human languages. NLU, like natural language generation (NLG), is a subset of natural language processing (NLP); it focuses on assigning structure, rules, and logic to human language so machines can understand the intended meaning of words, phrases, and sentences in text. NLG, on the other hand, deals with generating realistic written or spoken, human-understandable information from structured and unstructured data. Since the development of NLU is grounded in theoretical linguistics, the process can be explained in terms of the following linguistic levels of language comprehension.

Linguistic Levels in NLU

Phonology is the study of sound patterns in different languages and dialects; in NLU it refers to the analysis of how sounds are organized, and of their purpose and behavior.

Lexical or morphological analysis is the study of morphemes, the indivisible basic units of language that carry meaning, one at a time. Indivisible words with their own meaning, or lexical morphemes (e.g., work), can be combined with plural morphemes (e.g., works) or grammatical morphemes (e.g., worked/working) to create word forms. Lexical analysis identifies relationships between morphemes and converts words into their root form.

Syntactic analysis, or syntax analysis, is the process of applying grammatical rules to word clusters and organizing them on the basis of their syntactic relationships in order to determine meaning. This also involves detecting grammatical errors in sentences.

While syntactic analysis extracts meaning from the grammatical syntax of a sentence, semantic analysis looks at the context and purpose of the text. It helps capture the true meaning of a piece of text by identifying text elements as well as their grammatical roles.

Discourse analysis expands the focus from sentence-length units to the relationships between sentences and their impact on overall meaning. Discourse refers to coherent groups of sentences that contribute to the topic under discussion.

Pragmatic analysis deals with aspects of meaning not reflected in syntactic or semantic relationships. Here the focus is on identifying the intended meaning by analyzing literal and non-literal components against the context of background knowledge.

Common Tasks/Techniques in NLU

There are several techniques used in the processing and understanding of human language. Here's a quick run-through of some of the key techniques used in NLU and NLP.

Tokenization is the process of breaking down a string of text into smaller units called tokens. For instance, a text document could be tokenized into sentences, phrases, words, subwords, and characters. This is a critical preprocessing task that converts unstructured text into units that can then be encoded numerically for further analysis.

Stemming and lemmatization are two different approaches with the same objective: reducing a word to its root. In stemming, characters are removed from the end of a word to arrive at its "stem"; stemming algorithms determine how many characters to strip from different words even though they do not explicitly know the meaning of those words. Lemmatization is a more sophisticated approach that uses morphological analysis to arrive at the root word, or lemma. A short sketch of tokenization, stemming, and lemmatization follows below.
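As a concrete illustration, here is a minimal sketch of tokenization, stemming, and lemmatization using the open-source NLTK library. NLTK is one option among many and is not prescribed by this post; the example assumes NLTK and its tokenizer/WordNet resources are installed.

```python
# Minimal sketch: tokenization, stemming, and lemmatization with NLTK.
# Assumes: pip install nltk. Depending on the NLTK version, the tokenizer
# resource may be named "punkt" or "punkt_tab".
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

text = "The researchers were studying how proteins interact in living cells."

# Tokenization: split the raw string into word-level tokens.
tokens = word_tokenize(text)

# Stemming: crude suffix stripping, with no knowledge of the word's meaning.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# Lemmatization: morphological analysis that maps each word to its lemma.
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t.lower(), pos="v") for t in tokens]

print(tokens)   # ['The', 'researchers', 'were', 'studying', ...]
print(stems)    # ['the', 'research', 'were', 'studi', ...]
print(lemmas)   # ['the', 'researchers', 'be', 'study', ...]
```

Note how the stemmer produces non-words such as "studi", while the lemmatizer returns dictionary forms such as "study" and "be".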
Parsing is the process of extracting the syntactic information of a sentence based on the rules of a formal grammar. Based on the type of grammar applied, parsing can be broadly classified into constituency and dependency parsing. Constituency parsing, based on context-free grammar, divides a sentence into sub-phrases, or constituents, that belong to a specific grammatical category, such as noun phrases or verb phrases. Dependency parsing defines the syntax of a sentence not in terms of constituents but in terms of the dependencies between the words in the sentence. The relationships between words are depicted as a dependency tree, where words are represented as nodes and the dependencies between them as edges.

Part-of-speech (POS) tagging, or grammatical tagging, is the process of assigning a grammatical class, such as noun, verb, or adjective, to each word in a sentence. Automatic tagging can be broadly classified as rule-based, transformation-based, or stochastic POS tagging. Rule-based tagging uses a dictionary, as well as a small set of rules derived from the formal syntax of the language, to assign POS tags. Transformation-based tagging, or Brill tagging, leverages transformation-based learning for automatic tagging. Stochastic tagging refers to any model that uses frequency or probability, e.g., word frequency or tag-sequence probability, for automatic POS tagging.

Named entity recognition (NER) is an NLP subtask used to detect, extract, and categorize named entities, including names, organizations, locations, themes, topics, monetary values, etc., from large volumes of unstructured data. There are several approaches to NER, including rule-based systems, statistical models, dictionary-based systems, ML-based systems, and hybrid models.

These are just a few examples of the most common techniques used in NLU. There are several others, such as word sense disambiguation, semantic role labeling, and semantic parsing, that focus on different levels of semantic abstraction. The sketch below shows POS tagging, dependency parsing, and NER applied to a single sentence.
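To make the last three techniques concrete, here is a minimal sketch using the open-source spaCy library and its small English model. This is an assumption for illustration only; NLTK, Stanza, or a custom pipeline could be used just as well.

```python
# Minimal sketch: POS tagging, dependency parsing, and NER with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Pfizer tested the new inhibitor in Boston last year.")

# POS tagging and dependency parsing: each token gets a part-of-speech tag,
# a dependency label, and a syntactic head (the edges of the dependency tree).
for token in doc:
    print(f"{token.text:10} pos={token.pos_:6} dep={token.dep_:10} head={token.head.text}")

# Named entity recognition: contiguous spans labeled with an entity type.
for ent in doc.ents:
    print(f"{ent.text:15} label={ent.label_}")  # e.g. Pfizer -> ORG, Boston -> GPE
```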
NLP/NLU in Biomedical Research

NLP/NLU technologies represent a strategic fit for biomedical research, with its vast volumes of unstructured data: 3,000-5,000 papers published each day, clinical text from EHRs, diagnostic reports, medical notes, lab data, etc., and non-standardized digital real-world data. NLP-enabled text mining has emerged as an effective and scalable solution for extracting biomedical entity relations from vast volumes of scientific literature. Techniques such as named entity recognition are widely used in relation extraction tasks in biomedical research, with conventional named entities, such as names, organizations, and locations, replaced by gene sequences, proteins, biological processes and pathways, drug targets, etc. The unique vocabulary of biomedical research has necessitated the development of specialized, domain-specific bioNLP frameworks.

At the same time, the capabilities of NLU algorithms have been extended to the language of proteins and that of chemistry and biology itself. A 2021 article detailed the conceptual similarities between proteins and language that make them ideal for NLP analysis. More recently, an NLP model was trained to correlate amino acid sequences from the UniProt database with English-language words, phrases, and sentences used to describe protein function, and was used to annotate over 40 million proteins. Researchers have also developed an interpretable and generalizable drug-target interaction model, inspired by sentence classification techniques, that extracts relational information from drug-target "biochemical sentences."

Large neural language models and transformer-based language models are opening up transformative opportunities for biomedical NLP applications across a range of bioinformatics fields, including sequence analysis, genome analysis, multi-omics, spatial transcriptomics, and drug discovery; a sketch of a transformer-based biomedical NER pipeline appears at the end of this post. Most importantly, NLP technologies have helped unlock the latent value in huge volumes of unstructured data to enable more integrative, systems-level biomedical research. Read more about NLP's critical role in facilitating systems biology and AI-powered, data-driven drug discovery. If you want more information on seamlessly integrating advanced bioNLP frameworks into your research pipeline, please drop us a line here.
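As referenced above, here is a hedged sketch of what a transformer-based biomedical NER pipeline can look like, using the Hugging Face transformers library. The model identifier is a hypothetical placeholder, not a model endorsed or used in this post; any token-classification model fine-tuned on biomedical entities (genes, proteins, diseases, chemicals) could be substituted.

```python
# Hedged sketch: transformer-based biomedical NER via the Hugging Face pipeline API.
# Assumes: pip install transformers torch. The model name below is a placeholder.
from transformers import pipeline

MODEL_ID = "your-org/biomedical-ner-model"  # hypothetical placeholder

ner = pipeline(
    "token-classification",
    model=MODEL_ID,
    aggregation_strategy="simple",  # merge word-piece tokens into whole entity spans
)

text = "Mutations in BRCA1 increase the risk of breast cancer and alter DNA repair pathways."
for entity in ner(text):
    # Each result carries the recognized span, its predicted type, and a confidence score.
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))
```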
Reproducibility, getting the same results using the original data and analysis strategy, is fundamental to valid, credible, and actionable scientific research. Without reproducibility, replicability, the ability to confirm research results within different data contexts, becomes moot. A 2016 survey of researchers revealed a consensus that there is a crisis of reproducibility, with most researchers reporting that they had failed to reproduce not only the experiments of other scientists (70%) but even their own (>50%). In biomedical research, reproducibility testing is still extremely limited, and some attempts to do so have failed to comprehensively or conclusively validate reproducibility and replicability. Over the years, there have been several efforts to assess and improve reproducibility in biomedical research. However, a new front is opening in the reproducibility crisis, this time in ML-based science. According to this study, the increasing adoption of complex ML models is creating widespread data leakage, resulting in "severe reproducibility failures," "wildly overoptimistic conclusions," and an inability to validate the superior performance of ML models over conventional statistical models; a minimal illustration of one common form of leakage follows below.

Pharmaceutical companies have generally been cautious about accepting published results for a number of reasons, including the lack of scientifically reproducible data. An inability to reproduce and replicate preclinical studies can adversely impact drug development and has also been linked to drug and clinical trial failures. As drug development enters its latest innovation cycle, powered by computational in silico approaches and advanced AI-CADD integrations, reproducibility represents a significant obstacle to converting biomedical research into real-world results.
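As a generic illustration of the data-leakage problem (not drawn from the study cited above): fitting a preprocessing step such as feature scaling on the full dataset before cross-validation lets information from the test folds leak into training, inflating performance estimates. Wrapping preprocessing and model in a single scikit-learn Pipeline keeps each fold's held-out data unseen.

```python
# Sketch: avoiding one common source of data leakage with scikit-learn.
# Assumes: pip install scikit-learn numpy
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

# Leaky approach: the scaler sees the *entire* dataset, including the samples
# that will later act as held-out test folds during cross-validation.
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y, cv=5)

# Leakage-free approach: the scaler is refit inside each training fold only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
clean_scores = cross_val_score(pipe, X, y, cv=5)

print("with leakage:   ", leaky_scores.mean())
print("without leakage:", clean_scores.mean())
```

With simple standardization the inflation is often small, but the same pattern becomes severe with more aggressive preprocessing, such as feature selection performed on the full dataset before splitting.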
Reproducibility in In Silico Drug Discovery

The increasingly computational nature of modern scientific research has already driven a significant shift, with some journals incentivizing authors and awarding badges for reproducible research papers. Many scientific publications also mandate the publication of all relevant research resources, including code and data. In 2020, eLife launched Executable Research Articles (ERAs), which allow authors to add live code blocks and computed outputs to create computationally reproducible publications. However, creating a robust reproducibility framework to sustain in silico drug discovery will require more transformative developments across three key dimensions: infrastructure and incentives for reproducibility in computational biology, reproducible ecosystems in research, and reproducible data management.

Reproducible Computational Biology

This approach to industry-wide transformation envisions a fundamental cultural shift, with reproducibility as the fulcrum for all decision-making in biomedical research. The focus is on four key domains. First, creating courses and workshops that expose biomedical students to specific computational skills and real-world biological data analysis problems and impart the skills required to produce reproducible research. Second, promoting truly open data sharing, along with all relevant metadata, to encourage larger-scale data reuse. Third, leveraging platforms, workflows, and tools that support the open data/code model of reproducible research. And fourth, promoting, incentivizing, and enforcing reproducibility by adopting FAIR principles and mandating source code availability.

Computational Reproducibility Ecosystem

A reproducible ecosystem should enable data and code to be seamlessly archived, shared, and used across multiple projects. Computational biologists today have access to a broad range of open-source and commercial resources to ensure their ecosystem generates reproducible research. For instance, data can now be shared across several recognized, domain- and discipline-specific public data repositories such as PubChem and CDD Vault. Public and private code repositories, such as GitHub and GitLab, allow researchers to submit and share code with researchers around the world. And then there are computational reproducibility platforms, like Code Ocean, that enable researchers to share, discover, and run code.

Reproducible Data Management

Under a recent Data Management and Sharing (DMS) policy issued by the NIH, all applications for funding must be accompanied by a DMS plan detailing the strategy and budget to manage and share research data. Sharing scientific data, the NIH points out, accelerates biomedical research discovery by validating research, increasing data access, and promoting data reuse. Effective data management is critical to reproducibility, and creating a formal data management plan before a research project begins helps clarify two key facets of the research: one, key information about experiments, workflows, and the types and volumes of data generated; and two, research output formats, metadata, storage, and access and sharing policies.

The next critical step towards reproducibility is having the right systems to document the process, including data/metadata, methods and code, and version control. For instance, reproducibility in in silico analyses relies extensively on metadata to define scientific concepts as well as the computing environment. Metadata also plays a major role in making data FAIR. It is therefore important to document experimental and data-analysis metadata in an established standard and store it alongside the research data; a minimal sketch of this practice appears at the end of this section. Similarly, the ability to track and document datasets as they adapt, reorganize, extend, and evolve across the research lifecycle will be crucial to reproducibility. It is therefore important to version-control data so that results can be traced back to the precise subset and version of the data used.

Of course, the end game for all of this has to be the sharing of data and code, which is increasingly becoming a prerequisite as well as a voluntarily accepted practice in computational biology. One survey of 188 researchers in computational biology found that those who authored papers were largely satisfied with their ability to carry out key code-sharing tasks, such as ensuring good documentation and that the code runs in the correct environment. The average researcher, however, would not commit any more time, effort, or expenditure to sharing code. And there are still certain perceived barriers that need to be addressed before the public archival of biomedical research data and code becomes prevalent.
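As a minimal, illustrative sketch (not a prescribed standard), the snippet below records the random seed, key package versions, and a hash of the input data to a JSON file stored alongside analysis outputs, using only the Python standard library. The file names are hypothetical.

```python
# Sketch: capturing run metadata alongside analysis outputs for reproducibility.
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path

SEED = 42
DATA_FILE = Path("expression_matrix.csv")   # hypothetical input dataset
RESULTS_DIR = Path("results")
RESULTS_DIR.mkdir(exist_ok=True)

def sha256_of(path: Path) -> str:
    """Hash the input data so results can be traced to an exact data version."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def installed_version(pkg: str) -> str:
    """Record dependency versions; tolerate packages that are not installed."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return "not installed"

run_metadata = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "random_seed": SEED,
    "python": sys.version,
    "platform": platform.platform(),
    # Adjust the package list to the analysis stack actually used.
    "packages": {pkg: installed_version(pkg) for pkg in ("numpy", "pandas", "scikit-learn")},
    "input_data": {"path": str(DATA_FILE), "sha256": sha256_of(DATA_FILE)},
}

with open(RESULTS_DIR / "run_metadata.json", "w") as fh:
    json.dump(run_metadata, fh, indent=2)
```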
The Future of Reproducibility in Drug Discovery

A 2014 report from the American Association for the Advancement of Science (AAAS) estimated that the U.S. alone spent approximately $28 billion per year on irreproducible preclinical research. In the future, blockchain-based frameworks may well enable automated verification of the entire research process. Meanwhile, in silico drug discovery has emerged as one of the maturing innovation areas in the pharmaceutical industry. The alliance between pharmaceutical companies and research-intensive universities has been a key component in de-risking drug discovery and enhancing its clinical and commercial success. Reproducibility-related improvements and innovations will help move this alliance to a data-driven, AI/ML-based, in silico model of drug discovery.