Julie Delanote
The Importance of an Integrated End-to-End Antibody Discovery Process

Drug discovery is typically organized step by step, moving from target identification through lead optimization. As a result, data is siloed at every stage, and quantitative and qualitative insights are progressively lost across stages. To realize the full potential of drug discovery, data integration within a data-driven automation platform is essential.

The LENSai™ foundation AI model, powered by HYFT technology, is designed to solve the challenges behind AI-driven rational drug design, harnessing advanced AI and ML capabilities to navigate the complexities of drug discovery with high precision. By integrating predictive modelling, data analysis, and lead optimization functionalities, LENSai accelerates the end-to-end discovery and development of promising drug candidates.

The LENSai system uniquely integrates both structured and unstructured data, serving as a centralized graph for storing, querying, and analyzing diverse datasets, including different omics layers as well as chemical and pharmacological information. With LENSai, data from every phase of the drug discovery process is no longer siloed but represented as subgraphs within an interconnected graph that spans all stages. This interconnected approach enables bidirectional and cyclical information flow, allowing for flexibility and iterative refinement. For example, during in-silico lead optimization, challenges may arise regarding pharmacokinetic properties or off-target effects of lead compounds. By leveraging the integrated knowledge graph, we can navigate back to earlier phases to reassess decisions and explore alternative strategies. This holistic view ensures that insights and adjustments can be continuously incorporated throughout the drug discovery process.
Navigation through integrated knowledge graphs of complex biological data is made possible by the patented HYFT technology. HYFTs, amino acid patterns mined across the biosphere, serve as critical connectors within the knowledge graph by capturing diverse layers of information at both the subsequence and sequence levels. HYFTs encapsulate information about 'syntax' (the arrangement of amino acids) as well as 'structure' and 'function', and connect this data to textual information at the sentence and concept levels. This HYFT-based multi-modal integration ensures that we move beyond mere 'syntax' to incorporate 'biological semantics', representing the connection between structure and function. Within this single framework, detailed structural information is aligned with relevant textual metadata, providing a comprehensive understanding of biological sequences.

Exploring textual metadata can be very useful in the target identification stage, for example to gather detailed information on the target epitopes: "In which species are these epitopes represented?" "Can we extract additional information and insights on the epitopes from the literature?" This information can be obtained by querying the knowledge graph and harnessing the fine-grained HYFT-based approach, which captures information at the subsequence level. Indeed, at the HYFT level, relevant textual concepts (sub-sentence level) are captured, which allows us to identify whether a specific HYFT represented in the target might reveal relevant epitopes.

Apart from textual metadata, there is 'flat' metadata such as immunogenicity information, germline information, pharmacological data, developability data, and the presence of sequence liabilities. At each of these information layers, additional 'vector' data is obtained from various protein large language models (pLLMs): an embedding is associated with each (sub)sequence or concept.
This allows for 'vector' searches: based on the embeddings, similar sequences can be identified, enhancing tasks like protein structure prediction and functional annotation. For a deep dive into vector search, see our blog on vector search in text analysis. This capability allows for the extraction of a wider range of features and the uncovering of hidden patterns across all these dimensions.

LENSai: The Importance of Embeddings at the Sub-Sequence Level

Mindwalk LENSai's comprehensive approach to protein analytics is similar to text-based analytics. In text analysis, we refine semantic boundaries by intelligently grouping words to capture textual meaning. Similarly, in protein analytics, we strategically group residue tokens (amino acids) to form sequential HYFTs. Just as words are clustered into synonyms in text analytics, 'protein words' are identified and clustered based on their biological function in protein analytics. These 'protein words', when present in different sequences, reveal a conserved function. By leveraging this method, we gain a deeper understanding of functional conservation across protein sequences.

Thus, the LENSai platform based on HYFT technology analyzes proteins at the sub-sequence level, focusing on the HYFT patterns, as well as at the full-sequence level. As in natural language, some residues are less relevant and do not contribute to meaning, which in the case of proteins translates into function. By focusing on HYFTs, we therefore obtain a more condensed information representation and reduce noise by excluding the information captured in non-critical regions.

In text analysis, we can almost immediately recognize semantic similarity: we recognize sentences similar in meaning, although composed of different words, because of our natural understanding of synonyms.
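As an illustration of such a 'vector' search, nearest neighbours in embedding space can be retrieved with plain cosine similarity. This is a minimal sketch with toy 4-dimensional embeddings (real pLLM embeddings are far higher-dimensional, and production systems would use an approximate-nearest-neighbour index):

```python
import numpy as np

def cosine_similarity(query: np.ndarray, index: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of vectors."""
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    return m @ q

def vector_search(query: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k most similar vectors in the index."""
    sims = cosine_similarity(query, index)
    return np.argsort(sims)[::-1][:k]

# Toy 4-dimensional "sequence embeddings" (illustrative values only).
index = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
query = np.array([1.0, 0.05, 0.0, 0.0])
print(vector_search(query, index, k=2))  # indices of the two nearest embeddings
```

The same ranking mechanism underlies the similarity queries described above, only at scale and over embeddings produced by pLLMs.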
In protein language, to identify 'functional similarity', in other words to distinguish whether two different amino acid patterns (HYFTs) might yield the same function, we use a mathematical method: pLLMs. pLLMs are transformer-based models that generate an embedding starting from single amino acid residues. Depending on the data a pLLM is trained on (typically millions of protein sequences), it discovers hidden properties by modelling residue-residue connections (neighboring residues at both short and longer distances).

Figure 1: Mindwalk's method of chunking tokens

The dataset and task a pLLM was trained on determine the represented properties, which can vary from one pLLM to another. By stacking the embeddings from different large language models (LLMs), a more complete view of the protein data is generated. Furthermore, we can use clustering and vector search algorithms to group sequences that are similar across a broad range of dimensions.

Protein embeddings are typically generated at the single amino acid level. In contrast, the HYFT-based model obtains embeddings from LLMs at the pattern level by concatenating residue-level embeddings. These 'protein word' or HYFT-level embeddings can be obtained from several pre-trained LLMs, varying from antibody-specific LLMs to more generic pLLMs. This HYFT-based embedding model offers several benefits. First, it captures richer and more informative embeddings than single-residue-level embeddings. Second, the concatenation of residue-level embeddings preserves sequence-specific patterns, enhancing the ability to identify functional and structural motifs within proteins. Lastly, integrating different LLMs ensures that these embeddings leverage vast amounts of learned biological knowledge, improving the accuracy and robustness of downstream tasks such as protein function prediction and annotation.
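A minimal sketch of how a HYFT-level embedding could be assembled by concatenating residue-level embeddings and stacked across several pLLMs. The array shapes, the random values, and the two "models" are illustrative assumptions, not the platform's actual implementation:

```python
import numpy as np

def hyft_embedding(residue_embeddings: np.ndarray, start: int, end: int) -> np.ndarray:
    """Concatenate the residue-level embeddings spanned by one HYFT pattern.

    residue_embeddings: (sequence_length, dim) array from one pLLM.
    start/end: 0-based half-open residue span of the HYFT.
    """
    return residue_embeddings[start:end].reshape(-1)

def stacked_hyft_embedding(per_model_embeddings, start, end):
    """Stack HYFT-level embeddings from several pLLMs into one vector."""
    return np.concatenate([hyft_embedding(e, start, end) for e in per_model_embeddings])

# Toy example: a 7-residue HYFT (e.g. 'VKKPGAS') inside a 10-residue sequence,
# with fake 4-dim residue embeddings from two hypothetical pLLMs.
rng = np.random.default_rng(0)
model_a = rng.normal(size=(10, 4))
model_b = rng.normal(size=(10, 4))
emb = stacked_hyft_embedding([model_a, model_b], start=2, end=9)
print(emb.shape)  # 7 residues x 4 dims x 2 models = (56,)
```

Concatenation keeps the positional order of residues inside the pattern, which is what preserves sequence-specific motifs; a mean-pooled alternative would discard that order.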
So, if we want to identify which HYFTs are 'synonyms', we deploy the HYFT-level embeddings. Returning to the language analogy, where 'apple' takes a similar place in the embedding space as 'orange' or 'banana' because they are all fruits, in protein analytics we are interested in the HYFTs that take similar places in the embedding space because they all perform the same function in a certain context.

Figure 2: Embeddings at the sub-sequence level: concepts versus HYFTs

As Figure 2 illustrates, just as the word 'apple' can have different meanings depending on the context (referring to a phone or a fruit), the sequence HYFT 'VKKPGAS' can also appear in various contexts, representing different protein annotations and classifications. For instance, a specific HYFT may be found in organisms ranging from bacteria and fungi to human immunoglobulins. Consequently, the embeddings for the HYFT VKKPGAS might occupy different positions in the embedding space, reflecting these distinct functional contexts.

Use Case: Transforming Antibody Discovery with Integrated Vector Search in Hit Expansion Analysis

In the LENSai hit expansion analysis pipeline, outputs from phage-display, B-cell, or hybridoma technologies are combined with a large-scale enriched antibody sequence dataset sequenced by NGS. The primary goal is to expand the number and diversity of potential binders: functional antibodies from the NGS dataset that are closely related to a set of known binders. The data from the NGS repertoire set and the known binders are represented in a multi-modal knowledge graph, incorporating modalities such as sequence, structure, function, text, and embeddings. This comprehensive representation allows the NGS repertoire set to be queried to identify a diverse set of additional hits by simultaneously exploiting different information levels, such as structural, physicochemical, and pharmacological properties like immunogenicity.
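The hit expansion idea above, ranking repertoire sequences by embedding similarity to known binders and then filtering on metadata such as an immunogenicity score, can be sketched roughly as follows. The embeddings and per-sequence scores are hypothetical toy values, not the LENSai pipeline itself:

```python
import numpy as np

def expand_hits(binder_embs, repertoire_embs, immunogenicity, k=5, max_score=0.5):
    """Rank NGS repertoire sequences by cosine similarity to known binders,
    then filter the ranked candidates on an immunogenicity metadata score."""
    b = binder_embs / np.linalg.norm(binder_embs, axis=1, keepdims=True)
    r = repertoire_embs / np.linalg.norm(repertoire_embs, axis=1, keepdims=True)
    best_sim = (r @ b.T).max(axis=1)      # best similarity to any known binder
    ranked = np.argsort(best_sim)[::-1]   # most similar candidates first
    return [int(i) for i in ranked if immunogenicity[i] <= max_score][:k]

# Toy 2-D "embeddings": two known binders, four repertoire sequences.
binders = np.array([[1.0, 0.0], [0.0, 1.0]])
repertoire = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7], [-1.0, 0.0]])
immunogenicity = [0.2, 0.9, 0.1, 0.1]  # hypothetical per-sequence scores
hits = expand_hits(binders, repertoire, immunogenicity, k=3)
print(hits)  # candidate 1 ranks highly but is excluded by its high score
```

The point of the sketch is the combination: the vector search proposes candidates, and the 'flat' metadata attached to each sequence in the knowledge graph prunes them.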
A vital component of this multi-modal knowledge graph is the use of vector embeddings, where antibody sequences are represented in multi-dimensional space, enabling sophisticated analysis. These vector embeddings can be derived from different LLMs. For instance, in the example below, clinical antibodies obtain sequence-level embeddings from an antibody-specific LLM, represented in 2D space and colored by their immunogenicity score. This immunogenicity score can be used to filter out some of the antibodies, demonstrating how metadata can be used to select embedding-based clusters.

Furthermore, using vector embeddings allows for continuous data enrichment and the integration of the latest information into the knowledge graph at every step of the antibody discovery and development cycle, enhancing the overall process. In protein engineering, this continuous enrichment proves advantageous in various respects, such as introducing specific mutations aimed at enhancing binding affinity, humanizing proteins, and reducing immunogenicity. This new data is dynamically added to the knowledge graph, ensuring a fully integrated view of all the data throughout the antibody design cycle. These modifications are pivotal in tailoring proteins for therapeutics, ensuring they interact more effectively with their targets while minimizing unwanted immune responses.

Figure 3: Clinical antibodies obtain sequence-level embeddings from an antibody-specific LLM, represented in 2D space and colored by their immunogenicity score (1 indicating highly immunogenic)

Conclusion

The LENSai platform provides a robust multi-modal approach to optimize antibody discovery and development processes. By solving the integration of sequence, structure, function, textual insights, and vector embeddings, LENSai bridges gaps between disparate data sources.
The platform enhances feature extraction by leveraging embedding data from various LLMs, capturing a wide array of biologically relevant 'hidden properties' at the sub-sequence level. This capability ensures a comprehensive exploration of nuanced biological insights, facilitating an integrated data view. By utilizing vector search, the platform can efficiently query and analyze these embeddings, enabling the identification of similar sequences and functional motifs across large and complex datasets. This approach captures not only the 'syntax' and 'structure' of amino acid patterns but also integrates 'biological semantics', thereby providing a holistic understanding of protein functions and interactions.

Consequently, LENSai improves the efficiency of antibody discovery and development, from identifying novel targets to optimizations in therapeutic development such as hit expansion analysis, affinity maturation, humanization, and immunogenicity screening. Furthermore, LENSai enables cyclical enrichment of antibody discovery and development by adding and integrating information into the knowledge graph at every step of the development cycle. This continuous enrichment sets a new benchmark for integrated, data-driven approaches in biotechnology, ensuring ongoing improvements and innovations.

References:

Many routes to an antibody heavy-chain CDR3: necessary, yet insufficient, for specific binding. Frontiers in Immunology.
Benchmarking antibody clustering methods using sequence, structural, and machine learning similarity measures for antibody discovery applications.
Rosario Vitale, Leandro A. Bugnon, Emilio Luis Fenoy, Diego H. Milone, Georgina Stegmayer. Evaluating large language models for annotating proteins. Briefings in Bioinformatics, Volume 25, Issue 3, May 2024, bbae177.
In the beginning of 2023, ChatGPT achieved a significant milestone of 100 million users. The utilization of generative AI defined the year, with prominent large language models such as GPT-4 captivating the world with their remarkable mastery of natural language. Interestingly, OpenAI's latest upgrade to ChatGPT introduces powerful multimodal capabilities, enabling the model to handle various types of input, going beyond text to process images, audio, and video. This showcases the future potential of generative AI for hyper-personalization and diverse applications.

What if these models progress to the point of mastering the language of life? Imagine protein-level LLMs learning the 'semantics' and 'grammar' of proteins, not just as static structures but as dynamic multimodal entities, enabling us to unravel the intricacies of their functions and behaviors at a level of detail previously unimaginable.

The Need for Multi-Modality in Protein Engineering Workflows

In protein engineering workflows, too, multi-modal models should be introduced, integrating multiple sources of data. Going beyond exclusively sequence data might help to solve a vast array of known problems such as protein classification, mutational effect prediction, and structure prediction. In antibody discovery, an interesting problem is functional clonotyping, i.e. the grouping of antibody clonal groups that target the same antigen and epitope. Typically, the heavy chain CDR3 is used as a unique identifier, and clustering is therefore frequently performed by requiring a high percentage of HCDR3 sequence similarity and identical V-J assignments. However, it has been shown that many different HCDR3s can be identified within a target-specific antibody population [1].
Moreover, the same HCDR3 can be generated by many different rearrangements, and specific target binding is an outcome of unique rearrangements and VL pairing: "the HCDR3 is necessary, albeit insufficient, for specific antibody binding" [1]. In addition, it has been demonstrated that antibodies within the same cluster, targeting the same epitope, can encompass highly divergent HCDR sequences [2]. This underscores the necessity of incorporating additional 'layers' of information in pursuit of the clustering objective. For instance, SPACE2 excels at clustering antibodies that bind to shared epitopes, highlighting that these clusters, characterized by functional coherence and structural similarity, embrace diversity in terms of sequence, genetic lineage, and species origin [3]. Nevertheless, the potential for significant advancements may reside in the transformative capacities of LLMs, not only because of their substantial scaling advantages but also because of the extensive array of possibilities they present.

LLMs Grasping the Language of Life

While natural language large language models (LLMs) excel at grasping context, protein-based LLMs (pLMs) are advancing their understanding of the meanings, contexts, and intricate relationships between the fundamental building blocks, amino acids. Much like the word 'apple' assumes different meanings based on context, different amino acid patterns might carry different nuances within protein sequences. The process begins with the tokenization of protein sequence data, transforming sequences into linear strings of amino acids. Some amino acids might 'impact' other, more distant amino acids in such a way that a different function is revealed [semantics]. Again, compare this to two phrases: "apple, pear and banana" versus "I bought an Apple phone"; the semantics change with context. To unravel the workings of the models behind LLMs, the so-called transformer models, attention layers yield valuable information.
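The conventional clonotyping baseline discussed earlier, a high percentage of HCDR3 sequence similarity plus identical V-J assignments, can be sketched as a greedy clustering. The toy antibodies and the 0.8 threshold are hypothetical; thresholds and linkage rules vary in practice:

```python
def hcdr3_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two same-length HCDR3s."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def conventional_clonotypes(antibodies, threshold=0.8):
    """Greedy clustering: same V and J genes, same HCDR3 length, and HCDR3
    identity above a threshold. antibodies: list of (hcdr3, v_gene, j_gene)."""
    clusters = []
    for i, (cdr3, v, j) in enumerate(antibodies):
        for cluster in clusters:
            c_cdr3, c_v, c_j = antibodies[cluster[0]]  # compare to cluster seed
            if v == c_v and j == c_j and hcdr3_identity(cdr3, c_cdr3) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical toy antibodies: (HCDR3, V gene, J gene).
antibodies = [
    ("CARDYW", "IGHV3-23", "IGHJ4"),
    ("CARDFW", "IGHV3-23", "IGHJ4"),  # one mismatch vs. the first
    ("CARDYW", "IGHV1-2",  "IGHJ4"),  # same HCDR3, different V gene
    ("CTTTTW", "IGHV3-23", "IGHJ4"),  # same genes, divergent HCDR3
]
print(conventional_clonotypes(antibodies))  # [[0, 1], [2], [3]]
```

The sketch makes the limitation concrete: clustering is driven entirely by sequence identity and gene assignments, which is exactly what the functional-similarity arguments in [1] and [2] push back against.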
Which contextual information is important to classify 'apple' as a fruit or as a tech company? Now ask a similar question for classifying proteins: which context residues or residue patterns influence another residue or pattern to take part in a different function? Does the model learn residue-residue interactions (reflected in attention weights) that overlap with structural interactions? By overlaying protein-domain knowledge on the model's learnt embedding representations, we can uncover underlying protein intricacies. Moreover, we believe that utilizing these lower-layer embeddings as predictive features instead of, or on top of, the final-layer embeddings might help make the model more understandable and transparent. This fits squarely into the idea of strategically combining multi-modal data.

The potential for improving predictive performance, e.g. improving functional clonotyping of antibodies, lies in the strategic concatenation of embeddings from different layers across various protein language models. Indeed, pLMs are trained for different purposes: AbLang [4] is trained to predict missing amino acids in antibody sequences, while AntiBERTy [5] is trained to predict paratope-binding residues. A model's embeddings could therefore encompass distinct, perhaps non-overlapping and unique angles of protein-relevant information, whether structural, functional, physicochemical, immunogenicity-related, or a combination thereof.

Delving deeper into the realm of functional clonotyping, where epitope binning gains importance, relying solely on antigen-agnostic models may prove insufficient. Our interest lies in understanding how residues on the paratope interact with those on the epitope, a two-fold perspective that has been addressed through cross-modal attention.
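The strategic concatenation of embeddings from different layers and models can be sketched as follows. The layer counts, dimensions, and random values are illustrative assumptions; real hidden states would come from pLMs such as AbLang or AntiBERTy with hidden-state output enabled:

```python
import numpy as np

def layer_features(hidden_states, layers=(0, -1)):
    """Mean-pool selected transformer layers over the sequence and concatenate.

    hidden_states: list of (seq_len, dim) arrays, one per layer of one pLM.
    """
    return np.concatenate([hidden_states[l].mean(axis=0) for l in layers])

def multi_model_features(per_model_hidden_states, layers=(0, -1)):
    """Concatenate layer features across several pLMs into one feature vector,
    suitable as input for a downstream classifier (e.g. clonotype prediction)."""
    return np.concatenate([layer_features(h, layers) for h in per_model_hidden_states])

# Toy hidden states: two hypothetical pLMs with 3 layers each over a
# 5-residue sequence (embedding dims 4 and 8; values are illustrative).
rng = np.random.default_rng(1)
model_a = [rng.normal(size=(5, 4)) for _ in range(3)]
model_b = [rng.normal(size=(5, 8)) for _ in range(3)]
features = multi_model_features([model_a, model_b], layers=(0, -1))
print(features.shape)  # 2 layers x (4 + 8) dims = (24,)
```

Selecting a lower layer alongside the final layer mirrors the suggestion above: earlier layers may expose more local, interpretable signal than the task-specialized last layer.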
This method, akin to a graph attention network applied to a bipartite antibody-antigen graph, emerges as a compelling approach for modelling multimodality in antibody-antigen interactions and, more broadly, in protein-protein interactions [6]. In general, we should build comprehensive representations that go beyond individual layers to open up new avenues for understanding protein language.

Protein Words to Capture Semantics

Language models for natural language learn how words are used in context, i.e. words with similar contexts have similar meanings. This allows a model to understand meaning from distributional patterns alone. In natural language, symbols like spaces and punctuation help identify meaningful words, making explicit linguistic knowledge less necessary. Applying this idea to proteins is less straightforward, however, because there is no clear definition of meaningful protein units, or 'protein words'. We need a more analytical, expertise-driven approach to identify meaningful parts in protein sequences. This is where BioStrand's HYFT technology comes into play.

Amino acid patterns offer a more refined approach to embeddings than full-sequence embeddings, analogous to the way semantic embeddings capture 'logical' word groups or phrases to improve understanding in textual language. While full-sequence embeddings encapsulate the entire protein sequence holistically, amino acid patterns focus on specific meaningful blocks within the sequence. BioStrand's proprietary HYFTs, which serve as protein building blocks with well-defined boundaries, enhance robustness to sequence variability by emphasizing critical regions and downplaying non-critical or less relevant areas of the full protein sequence. Moreover, the HYFTs serve as a central and unifying connector element, laying the foundation for a holistic data management system.
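Returning to the cross-modal attention mentioned above: a minimal single-head sketch in NumPy in which paratope residue embeddings (queries) attend over epitope residue embeddings (keys and values). The residue counts and 4-dimensional embeddings are toy assumptions, standing in for the learned projections of a full model:

```python
import numpy as np

def cross_attention(paratope: np.ndarray, epitope: np.ndarray):
    """Single-head cross-attention between two residue sets.

    paratope: (n_para, dim) query embeddings; epitope: (n_epi, dim) key/value
    embeddings sharing the same dim. Returns epitope-aware paratope features
    and the attention weight matrix.
    """
    d = paratope.shape[1]
    scores = paratope @ epitope.T / np.sqrt(d)          # (n_para, n_epi)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                   # softmax over epitope residues
    return w @ epitope, w

rng = np.random.default_rng(2)
para = rng.normal(size=(3, 4))   # 3 paratope residues, toy 4-dim embeddings
epi = rng.normal(size=(5, 4))    # 5 epitope residues
out, weights = cross_attention(para, epi)
print(out.shape, weights.shape)  # (3, 4) (3, 5)
```

Each row of the weight matrix shows how strongly one paratope residue attends to each epitope residue, which is the two-fold, interaction-aware view the text argues antigen-agnostic models lack.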
This integration extends beyond protein sequential, structural, and functional data, encompassing both flat metadata and vector embedding data, as well as textual enrichment data extracted from the literature. These connector elements can traverse omics databases or external datasets such as IEDB, serving as starting points for NLP searches. In this way, a bridge is established between genetic information and the relevant literature.

LENSai as a Holistic Integrator

Taking all this information together, an integrated data management system becomes necessary to build generalized foundation models for biology, rather than siloing each step independently. The antibody discovery process then undergoes a transformative shift, becoming a more informed journey in which the flow of information is rooted in genetic building blocks. At each step, a comprehensive understanding is cultivated by synthesizing insights from the amalgamation of genetic, textual, and structural dimensions, including diverse embeddings from different layers of LLMs capturing varying information sources.

This is where LENSai comes into play. By leveraging a vast knowledge graph interconnecting syntax (multi-modal sequential and structural data) and semantics (biological function), combined with the insights captured at the residue, region, or HYFT level and harnessed by the power of LLM embeddings, LENSai paves the way to improving drug-discovery-relevant tasks such as functional clustering, developability prediction, and immunogenicity risk prediction. LENSai's advanced capabilities empower researchers to explore innovative protein structures and functionalities, unlocking new opportunities in antibody design and engineering.
Sources

[1] https://www.frontiersin.org/articles/10.3389/fimmu.2018.00395/full
[2] https://www.nature.com/articles/s41598-023-45538-w#sec10
[3] https://www.frontiersin.org/articles/10.3389/fmolb.2023.1237621/full
[4] https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac046/6609807
[5] https://arxiv.org/abs/2112.07782
[6] https://arxiv.org/abs/1806.04398