MindWalk is a biointelligence company uniting AI, multi-omics data, and advanced lab research into a customizable ecosystem for biologics discovery and development.
The Importance of an Integrated End-to-End Antibody Discovery Process

Drug discovery processes are typically organized in a step-by-step manner, going from target identification through to lead optimization. As a result, data are siloed at every stage, leading to a compounding loss of quantitative and qualitative insights across the different stages. To realize the full potential of drug discovery, data integration within a data-driven automation platform is essential.

The LENSai™ foundation AI model, powered by HYFT technology, is designed to solve the challenges behind AI-driven rational drug design, harnessing advanced AI and ML capabilities to navigate the complexities of drug discovery with high precision. By integrating predictive modelling, data analysis, and lead optimization functionalities, LENSai accelerates the end-to-end discovery and development of promising drug candidates.

The LENSai system uniquely integrates both structured and unstructured data, serving as a centralized graph for storing, querying, and analyzing diverse datasets, including different omics layers as well as chemical and pharmacological information. With LENSai, data from every phase of the drug discovery process is no longer siloed but represented as subgraphs within one interconnected graph that spans all processes. This interconnected approach enables bidirectional and cyclical information flow, allowing for flexibility and iterative refinement. For example, during in-silico lead optimization, challenges may arise regarding pharmacokinetic properties or off-target effects of lead compounds. By leveraging the integrated knowledge graph, we can navigate back to earlier phases to reassess decisions and explore alternative strategies. This holistic view ensures that insights and adjustments can be continuously incorporated throughout the drug discovery process.
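The interconnected-graph idea described above can be illustrated with a toy example. The node names, edge labels, and the `neighbors` helper are invented for illustration and are not the actual LENSai API; this is a minimal sketch of how entities from different discovery phases can live in one graph and be traversed in either direction.

```python
# Toy knowledge graph: entities from different discovery phases stored in
# one adjacency map, so later phases can navigate back to earlier ones.
# All node names and relations are hypothetical, not the LENSai API.
from collections import defaultdict

edges = [
    ("target:TNF", "epitope:E1", "has_epitope"),        # target identification
    ("epitope:E1", "antibody:Ab42", "bound_by"),        # hit discovery
    ("antibody:Ab42", "lead:Ab42-v3", "optimized_to"),  # lead optimization
]

graph = defaultdict(list)
for src, dst, rel in edges:
    graph[src].append((rel, dst))
    graph[dst].append((rel, src))  # bidirectional: enables cyclical flow

def neighbors(node):
    """All entities directly connected to `node`, in any phase."""
    return [dst for _, dst in graph[node]]

# From an optimized lead we can walk back toward its originating target.
print(neighbors("lead:Ab42-v3"))
print(neighbors("antibody:Ab42"))
```

Because every edge is stored in both directions, a question raised during lead optimization (e.g. an off-target effect) can be traced back through the same graph to the epitope and target decisions made earlier.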
Navigation through integrated knowledge graphs of complex biological data is made possible by the patented HYFT technology. HYFTs, amino acid patterns mined across the biosphere, serve as critical connectors within the knowledge graph by capturing diverse layers of information at both the subsequence and sequence levels. HYFTs encapsulate information about 'syntax' (the arrangement of amino acids) as well as 'structure' and 'function', and connect this data to textual information at the sentence and concept levels. This HYFT-based multi-modal integration ensures that we move beyond mere 'syntax' to incorporate 'biological semantics': the connection between structure and function. Within this single framework, detailed structural information is aligned with relevant textual metadata, providing a comprehensive understanding of biological sequences.

Exploring textual metadata can be very useful in the target identification stage, for example to gather detailed information on the target epitopes: "In which species are these epitopes represented?" "Can we extract additional information and insights on the epitopes from the literature?" Such questions can be answered by querying the knowledge graph and harnessing the fine-grained HYFT-based approach, which captures information at the subsequence level. Indeed, at the HYFT level, relevant textual concepts (sub-sentence level) are captured, which allows us to identify whether a specific HYFT represented in the target might reveal relevant epitopes. Apart from textual metadata, there is 'flat' metadata such as immunogenicity information, germline information, pharmacological data, developability data, and the presence of sequence liabilities. At each of these information layers, additional 'vector' data is obtained from various protein large language models (pLLMs), meaning that an embedding is associated with each (sub)sequence or concept.
This enables 'vector' searches which, based on the embeddings, can identify similar sequences, enhancing tasks like protein structure prediction and functional annotation. (For a deep dive into vector search, see our blog post on vector search in text analysis.) This capability allows a wider range of features to be extracted and hidden patterns to be uncovered across all these dimensions.

LENSai: The Importance of Embeddings at the Sub-Sequence Level

MindWalk LENSai's comprehensive approach to protein analytics is similar to text-based analytics. In text analysis, we refine semantic boundaries by intelligently grouping words to capture textual meaning. Similarly, in protein analytics, we strategically group residue tokens (amino acids) to form sequential HYFTs. Just as words are clustered into synonyms in text analytics, 'protein words' are identified and clustered based on their biological function in protein analytics. These 'protein words', when present in different sequences, reveal a conserved function. By leveraging this method, we gain a deeper understanding of functional conservation across protein sequences. Thus, the HYFT-based LENSai platform analyses proteins at the sub-sequence level, focusing on HYFT patterns, as well as at the full-sequence level. As in natural language, some residues are less relevant and do not contribute to meaning, which, in the case of proteins, translates into function. By focusing on HYFTs, we therefore obtain a more condensed representation of the information and reduce noise by excluding the information captured in non-critical regions. In text analysis, we can almost immediately recognize semantic similarity: we recognize sentences similar in meaning, even when composed of different words, because of our natural understanding of synonyms.
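This kind of similarity can be made quantitative with a vector search over embeddings. A minimal sketch, using invented 3-dimensional toy vectors in place of real pLLM embeddings (the sequence names and values are illustrative only):

```python
# Nearest-neighbour vector search over toy embeddings (invented values,
# standing in for real pLLM embeddings of protein sub-sequences).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

embeddings = {
    "VKKPGAS": [0.9, 0.1, 0.2],
    "VKRPGAS": [0.8, 0.2, 0.3],  # close in embedding space: a 'synonym'
    "GLEWVSA": [0.1, 0.9, 0.7],  # functionally unrelated
}

def most_similar(query, k=1):
    """Return the k entries closest to `query` by cosine similarity."""
    scores = [(cosine(vec, embeddings[query]), name)
              for name, vec in embeddings.items() if name != query]
    return [name for _, name in sorted(scores, reverse=True)[:k]]

print(most_similar("VKKPGAS"))  # the pattern nearest in embedding space
```

Real systems run the same idea over millions of high-dimensional vectors with approximate nearest-neighbour indexes rather than a brute-force loop, but the similarity principle is identical.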
In protein language, to identify 'functional similarity', in other words to determine whether two different amino acid patterns (HYFTs) might perform the same function, we use a mathematical method: pLLMs. pLLMs are transformer-based models that generate an embedding starting from single amino acid residues. Depending on the data a pLLM is trained on (typically millions of protein sequences), it discovers hidden properties by modelling residue-residue connections (neighboring residues at both short and longer distances).

Figure 1: MindWalk's method of chunking tokens

The dataset and task a pLLM was trained on determine the properties it represents, which can vary from one pLLM to another. By stacking the embeddings from different large language models (LLMs), a more complete view of the protein data is generated. Furthermore, we can use clustering and vector search algorithms to group sequences that are similar across a broad range of dimensions.

Protein embeddings are typically generated at the single amino acid level. In contrast, the HYFT-based model obtains embeddings from LLMs at the pattern level by concatenating residue-level embeddings. These 'protein word' or HYFT-level embeddings can be obtained from several pre-trained LLMs, varying from antibody-specific LLMs to more generic pLLMs. This HYFT-based embedding model offers several benefits. First, it captures richer and more informative embeddings than single-residue-level embeddings. Second, the concatenation of residue-level embeddings preserves sequence-specific patterns, enhancing the ability to identify functional and structural motifs within proteins. Lastly, integrating different LLMs ensures that these embeddings leverage vast amounts of learned biological knowledge, improving the accuracy and robustness of downstream tasks such as protein function prediction and annotation.
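The concatenation and stacking steps can be sketched as follows. The tiny per-residue lookup tables stand in for real pLLM encoders, which return one high-dimensional vector per residue; all names and values here are invented for illustration.

```python
# Build a HYFT-level embedding by concatenating per-residue embeddings,
# then stack the results from two (mock) models for a more complete view.
# The lookup tables are toy stand-ins for real pLLM encoders.

MODEL_A = {"V": [0.1, 0.4], "K": [0.3, 0.2], "P": [0.5, 0.1]}
MODEL_B = {"V": [0.9], "K": [0.7], "P": [0.6]}

def embed_hyft(hyft, model):
    """Concatenate residue-level embeddings into one pattern-level vector,
    preserving the order (and hence the sequence-specific pattern)."""
    vec = []
    for residue in hyft:
        vec.extend(model[residue])
    return vec

def stacked_embedding(hyft):
    """Stack embeddings from several models into one representation."""
    return embed_hyft(hyft, MODEL_A) + embed_hyft(hyft, MODEL_B)

hyft = "VKP"
print(embed_hyft(hyft, MODEL_A))  # 3 residues x 2 dims = 6 dimensions
print(stacked_embedding(hyft))    # 6 + 3 = 9 dimensions
```

Because concatenation preserves residue order, two HYFTs with the same residues in a different order get different vectors, which is exactly what allows positional motifs to survive the embedding step.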
So, to identify which HYFTs are 'synonyms', we deploy the HYFT-level embeddings. Returning to the language analogy, where 'apple' takes a similar place in the embedding space to 'orange' or 'banana' because they are all fruits, in protein analytics we are interested in the HYFTs that take similar places in the embedding space because they all perform the same function in a certain context.

Figure 2: Embeddings at the sub-sequence level: concepts versus HYFTs

As Figure 2 illustrates, just as the word 'apple' can have different meanings depending on the context (referring to a phone or a fruit), the sequence HYFT 'VKKPGAS' can also appear in various contexts, representing different protein annotations and classifications. For instance, a specific HYFT may be found in organisms ranging from bacteria and fungi to human immunoglobulins. Consequently, the embeddings for the HYFT VKKPGAS may occupy different positions in the embedding space, reflecting these distinct functional contexts.

Use Case: Transforming Antibody Discovery with Integrated Vector Search in Hit Expansion Analysis

In the LENSai hit expansion analysis pipeline, outputs from phage-display, B-cell, or hybridoma technologies are combined with a large-scale enriched antibody sequence dataset obtained by NGS. The primary goal is to expand the number and diversity of potential binders: functional antibodies from the NGS dataset that are closely related to a set of known binders. The data from the NGS repertoire set and the known binders are represented in a multi-modal knowledge graph, incorporating modalities such as sequence, structure, function, text, and embeddings. This comprehensive representation allows the NGS repertoire set to be queried for a diverse set of additional hits by simultaneously exploiting different information levels, such as structural, physicochemical, and pharmacological properties like immunogenicity.
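Such a multi-level query can be sketched as an embedding-similarity search combined with a metadata filter. The candidate records, similarity scores, and the 0.5 immunogenicity cut-off below are all invented for illustration, not actual pipeline values:

```python
# Hit expansion sketch: rank NGS candidates by embedding similarity to a
# known binder, then filter on 'flat' metadata (immunogenicity score).
# All records and thresholds are illustrative.

candidates = [
    {"id": "ngs-001", "similarity": 0.97, "immunogenicity": 0.2},
    {"id": "ngs-002", "similarity": 0.95, "immunogenicity": 0.8},  # too immunogenic
    {"id": "ngs-003", "similarity": 0.91, "immunogenicity": 0.1},
    {"id": "ngs-004", "similarity": 0.40, "immunogenicity": 0.1},  # not similar enough
]

def expand_hits(records, min_similarity=0.9, max_immunogenicity=0.5):
    """Keep candidates that are both similar to known binders and low-risk,
    ranked by decreasing similarity."""
    hits = [r for r in records
            if r["similarity"] >= min_similarity
            and r["immunogenicity"] <= max_immunogenicity]
    return sorted(hits, key=lambda r: r["similarity"], reverse=True)

print([r["id"] for r in expand_hits(candidates)])  # ids passing both filters
```

The point of combining the two signals in one query is that neither alone suffices: a highly similar sequence with poor developability metadata is filtered out just as readily as a safe but dissimilar one.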
A vital component of this multi-modal knowledge graph is the use of vector embeddings, in which antibody sequences are represented in a multi-dimensional space, enabling sophisticated analysis. These vector embeddings can be derived from different LLMs. For instance, in the example below, clinical antibodies obtain sequence-level embeddings from an antibody-specific LLM, represented in 2D space and colored by their immunogenicity score. This immunogenicity score can be used to filter out some of the antibodies, demonstrating how metadata can be used to select embedding-based clusters.

Furthermore, using vector embeddings allows for continuous data enrichment and the integration of the latest information into the knowledge graph at every step of the antibody discovery and development cycle, enhancing the overall process. In protein engineering, this continuous data enrichment proves advantageous in various respects, such as introducing specific mutations aimed at enhancing binding affinity, humanizing proteins, and reducing immunogenicity. This new data is dynamically added to the knowledge graph, ensuring a fully integrated view of all the data throughout the antibody design cycle. These modifications are pivotal in tailoring proteins for therapeutics, ensuring they interact more effectively with their targets while minimizing unwanted immune responses.

Figure 3: Clinical antibodies obtain sequence-level embeddings from an antibody-specific LLM, represented in 2D space and colored by their immunogenicity score (1 indicating high immunogenicity).

Conclusion

The LENSai platform provides a robust multi-modal approach to optimizing antibody discovery and development processes. By integrating sequence, structure, function, textual insights, and vector embeddings, LENSai bridges the gaps between disparate data sources.
The platform enhances feature extraction by leveraging embedding data from various LLMs, capturing a wide array of biologically relevant 'hidden properties' at the sub-sequence level. This capability ensures a comprehensive exploration of nuanced biological insights, facilitating an integrated data view. By utilizing vector search, the platform can efficiently query and analyze these embeddings, enabling the identification of similar sequences and functional motifs across large and complex datasets. This approach not only captures the 'syntax' and 'structure' of amino acid patterns but also integrates 'biological semantics', thereby providing a holistic understanding of protein functions and interactions.

Consequently, LENSai improves the efficiency of antibody discovery and development, from identifying novel targets to optimization steps in therapeutic development such as hit expansion analysis, affinity maturation, humanization, and immunogenicity screening. Furthermore, LENSai enables cyclical enrichment of antibody discovery and development processes by adding and integrating information into the knowledge graph at every step of the development cycle. This continuous enrichment sets a new benchmark for integrated, data-driven approaches in biotechnology, ensuring ongoing improvements and innovations.
AI-Driven Rational Drug Design

AI-driven rational drug design is central to MindWalk's mission to power the intersection of biotech discovery, biotherapeutics, and AI. 'AI-driven' signifies the application of artificial intelligence (AI), including machine learning (ML) and natural language processing (NLP). 'Rational' alludes to the process of designing drugs based on an understanding of biological targets. This approach leverages computational models and algorithms to predict how drug molecules interact with their target biological molecules, such as proteins or enzymes involved in disease processes. The goal is to create more effective and safer drugs by precisely targeting specific mechanisms within the body.

Integration of Complex Biological Data

The LENSai™ integrated intelligence platform, powered by patented HYFT technology, is unique in its integration of structured and unstructured data, including genomic sequences, protein structures, scientific literature, and clinical notes, facilitating a comprehensive understanding of biological systems.

Advanced Computational Techniques

MindWalk's approach to drug discovery combines AI for rapid compound screening and predictive modeling with text analysis to retrieve information from research articles. This helps to identify promising drug candidates and optimize their properties for better efficacy and safety, significantly reducing R&D timelines. The use of AI, ML, and NLP technologies within the LENSai platform, together with embeddings from different protein large language models (LLMs), facilitates the discovery of novel drug targets. These technologies allow for the identification of patterns, relationships, and insights within large datasets. The combination of MindWalk's technologies with the InterSystems IRIS data platform introduces a powerful vector search mechanism that facilitates semantic analysis.
This approach transforms the search for relevant biological and chemical information by enabling searches based on conceptual similarity rather than just keywords. As a result, researchers can uncover deeper insights into disease mechanisms and potential therapeutic targets. We wrote about vector search in an earlier blog post. Here we illustrate vector search for text; in a future post, we will dive into the application of vector search to protein analytics.

Utilizing Vector Search in Text Analysis

The primary challenge in text search is locating specific and accurate information within a vast amount of unstructured data: it is like finding a needle in a haystack. A simple keyword search in PubMed can yield thousands of results. While generative models can provide concise answers to questions within seconds, their accuracy is not always guaranteed. We implemented retrieval-augmented generation (RAG) to combat the hallucinations that generative chat systems may experience. Moreover, RAG systems deliver up-to-date results and can refer to their sources, which makes their responses traceable. However, like all generative systems, they struggle to handle large input prompts at once. This is where vector search becomes essential: it is a valuable tool to guide you to the precise area within your data haystack.

Representing Meaning in Vector Space

Search terms often have various meanings in different contexts. For instance, the abbreviation 'ADA' could refer to anti-drug antibodies, the American Dental Association, the American Diabetes Association, the Americans with Disabilities Act, adenosine deaminase, adalimumab, and other entities. By encoding text data with embeddings, one can narrow the focus to the meaning that aligns with the search query. The figure below shows a two-dimensional UMAP visualization of the embeddings of PubMed abstracts containing 'ADA'.
While this visual representation emphasizes similarity and does not provide a scalable measure of actual distance in the multidimensional vector space, it does demonstrate the semantic ambiguity present in the vector-based embeddings. Thus, encoding the input allows the data to be clustered so that the most relevant clusters can be singled out.

[Figure: UMAP visualization of embeddings of PubMed abstracts containing 'ADA']

The embeddings used here are dense vectors: compact numerical representations of data points, typically generated by large language models, in this case PubMedBERT. Dense vectors capture the semantic meaning of text, allowing the system to retrieve relevant information even if the exact keywords are not present. This nuanced, context-aware retrieval offers advantages over traditional keyword-based methods. Sparse vectors, on the other hand, are typically high-dimensional but contain many zero values. For example, a bag-of-words vector for a short English sentence contains a one for every word that is present in the sentence and a zero for every English word that is not, resulting in a very sparse vector with many zeros and only a couple of ones. Sparse vectors are often generated using traditional methods like TF-IDF or BM25, which focus on the presence or frequency of specific terms in the text. These vectors require fewer resources and offer faster retrieval speeds.

Searching in Vector Space

When generating embeddings, there are multiple levels to consider. Chunking is the process of breaking down large pieces of text into smaller segments that are treated as units of relevant context. From tokens to documents, each level offers a different way to understand and analyze text data. Starting at the most granular level, tokens represent individual words or parts of words within a text. Large language models often calculate embeddings based on single-word tokens, which may dilute the semantic richness of the text.
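The bag-of-words construction described above can be sketched in a few lines. The eight-word vocabulary is a toy stand-in for a full English vocabulary, which is what makes real bag-of-words vectors so sparse:

```python
# Sparse bag-of-words vectors: one dimension per vocabulary word,
# mostly zeros for any short sentence. The tiny vocabulary is a toy
# stand-in for a full English vocabulary.

VOCAB = ["ada", "antibody", "association", "dental", "diabetes",
         "drug", "protein", "search"]

def bag_of_words(sentence):
    """Return a sparse 0/1 vector over VOCAB for the given sentence."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in VOCAB]

vec = bag_of_words("ADA drug search")
print(vec)                                       # mostly zeros
print(sum(vec), "of", len(VOCAB), "dimensions are non-zero")
```

With a realistic vocabulary of tens of thousands of words, the same three-word sentence would still produce only three non-zero dimensions, which is why sparse representations are cheap to store and fast to intersect.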
The LENSai integrated intelligence platform uses concepts: words or word groups that form a unit of meaning. Concepts are more specific than tokens for keyword search. Moreover, dense embeddings of concepts within a sentence are particularly well suited to detecting synonyms.

[Figure: token embeddings versus 'concept' embeddings]

The following UMAP visualization of concept embeddings shows similar embeddings for the semantically related instances of 'ADA treatment' and 'ADA therapy', and also for instances of 'ADA inhibition' and 'ADA treatment', whereas the embeddings for 'ADA professional practice committee', 'ADA activity', 'ADA SCID', and 'ADA formation' form separate, non-overlapping clusters.

[Figure: UMAP visualization of ADA concept embeddings]

CRC constructs (concept-relation-concept patterns) effectively capture the intricate boundaries of semantic meaning. Focusing on CRCs enhances semantic similarity search while filtering out non-relevant sentence parts, yielding a more condensed representation of meaning. Moving up to the level of sentence and document embeddings can be useful for obtaining a more general idea of the context rather than focusing on a particular search term in a query. Which level of embedding is most relevant will depend on the specific use case at hand.

In conclusion, vector search presents numerous opportunities to optimize search results by guiding users to their most relevant data. Dense and sparse vectors, as well as embeddings at various levels, can be combined into a hybrid system tailored to specific use cases. In the field of AI-driven rational drug design, vector search is an additional computational technique that fits into a multidisciplinary approach, supporting more than text data alone, as will become clear in our future blog post about vector search for protein analytics.
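A hybrid of dense and sparse scoring, as suggested above, can be sketched by blending the two similarity signals with a weight. The vectors, the keyword-overlap heuristic, and the 0.5 weight are all invented for illustration; production systems typically use BM25 for the sparse side and learned embeddings for the dense side.

```python
# Hybrid retrieval sketch: blend a dense (semantic) similarity with a
# sparse (keyword-overlap) signal. All values here are illustrative.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def keyword_overlap(query, doc):
    """Sparse signal: fraction of query words present in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def hybrid_score(query, doc, dense_q, dense_d, alpha=0.5):
    """Weighted blend of the dense and sparse relevance signals."""
    return (alpha * cosine(dense_q, dense_d)
            + (1 - alpha) * keyword_overlap(query, doc))

# Toy example: the 2-d vectors stand in for sentence embeddings.
score = hybrid_score("ada therapy", "ADA treatment outcomes",
                     dense_q=[0.9, 0.1], dense_d=[0.8, 0.2])
print(round(score, 3))
```

Here the dense side rewards 'therapy'/'treatment' as near-synonyms even though they share no keyword, while the sparse side still credits the exact 'ADA' match; tuning `alpha` shifts the system between semantic and keyword behaviour.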
Combining the LENSai integrated intelligence platform with the InterSystems IRIS data platform creates a robust vector search mechanism, enhancing rational drug discovery and personalized medicine. Additionally, LENSai is designed to support hallucination-free, traceable, and up-to-date retrieval-augmented generation, helping researchers access accurate and reliable data.