Introduction

As a testament to the recent breakthrough of deep-learning technologies in the field of (structural) bioinformatics, half of the Nobel Prize in Chemistry 2024 [1] has been awarded to John Jumper and Demis Hassabis, the main contributors to AlphaFold 2, and the other half to Prof. David Baker (University of Washington, Seattle). Speaking of breakthroughs is not an understatement: at the time of writing, the original AlphaFold 2 publication [2] has been cited more than 27,800 times (according to Google Scholar [3]). For comparison, on Feb 21, 2023 (roughly 1.5 years ago), the number of citations was just 8,783. AlphaFold 2 is a solution to the protein folding problem: it can predict, with near-experimental accuracy, the structure of proteins as long as their primary structure (the sequence of amino acids along the protein chain) is known. This technology has been integrated with the Lensᵃⁱ™ in silico discovery platform, and we have discussed it at length in several blog posts since the public release of AlphaFold 2 [4, 5, 6]. In this new blog post, we will review how these latest developments are impacting drug discovery, examine what can technically be achieved with current technology, and assess the limitations that hinder discovery processes and future outcomes. Finally, we will briefly present how these breakthrough technologies are integrated within the MindWalk LensAI platform.

AlphaFold 3: expanding the horizons of structural biology

In May 2024, DeepMind and Isomorphic Labs (a subsidiary of Alphabet founded by Demis Hassabis) released AlphaFold 3, with a closed-source web server accessible to academic researchers. At the level of protein structure prediction, AlphaFold 3 is an improvement over AlphaFold 2: it is better at predicting monomeric and multimeric structures [7], specifically in the field of antibody-antigen complex modeling, where AlphaFold 2 was notoriously lacking [8].
In addition to proteins, AlphaFold 3 introduces capabilities for predicting the structures of nucleic acids (such as RNA) and small molecules. This expanded versatility makes it a powerful tool for drug discovery, as it can model the interactions between proteins and ligands. These substantial improvements are critical advancements for biotherapeutic development, where understanding such interactions is essential both for developing targeted therapies like monoclonal antibodies and, more broadly, for devising in silico screening strategies. While the authors' study in the original publication shows beyond-state-of-the-art performance on many tasks, third-party benchmarks are still missing for AlphaFold 3, partly due to the limited capacity of the web server and its initial closed-source nature. As announced earlier in the year [13], the source code was released in November 2024, although under a restricted license; AlphaFold 3 is thus less amenable to tweaking, in-depth analysis, and integration into protein design pipelines than AlphaFold 2 [6]. In addition to the base AlphaFold 3 code, several third-party groups have undertaken to reproduce the architecture of the model, as was done for AlphaFold 2 before its release [3], and many AlphaFold 3-like prediction pipelines have been released, such as Boltz-1 or Chai-1.

AlphaFold 3 success rates on different benchmark sets, for (from left to right) ligand docking, nucleic acids, covalent modifications and protein predictions, compared to state-of-the-art methods. Adapted from ref. 7.

So far, the substantial improvements of AlphaFold 3 outweigh its known limitations. For instance, the algorithm struggles with molecular chirality. Atomic clashes also occur, specifically for large proteins, so molecules can partially overlap, which is physically impossible. And as success rates for some tasks remain low, "hallucinations" may happen.
Finally, predictions remain static in nature and completely ignore any dynamical aspect of molecular interaction. These limitations are, of course, not specific to AlphaFold 3, and there are many ways to mitigate these shortcomings by integrating structure prediction within a broader molecular modeling framework. For instance, models generated by AlphaFold can be used in molecular dynamics simulations to assess conformational dynamics, interaction energies between molecular partners, and much more.

Structure prediction in practice

Despite their fame, the practical use of structure prediction tools such as AlphaFold is often not well understood. These tools work within the paradigm that, for a given input of sequential molecular data (a sequence of amino acids for a protein, of nucleic acids, ...), there is a "static" 3D structure (a set of atomic positions) which can be predicted solely from this data and which is representative of the interactions between all involved atoms. This picture is simplistic: it ignores the dynamical nature of macromolecular interactions, which is only partially captured by static representations. Within this paradigm, one would ideally expect a given input to yield a single prediction. Yet this is not the case. For AlphaFold 2 monomers, there are five sets of trained model weights, which output five predictions for a single input. These predictions are scored and ranked by the model using a so-called confidence metric, and the most accurate model is expected to be ranked at the top. For AlphaFold 2 Multimer, it has been found that more than five predictions are necessary to obtain accurate models; the standard pipeline thus outputs 25 models which can be inspected later. However, it is not guaranteed that the most accurate prediction (compared to a ground-truth structure) is always ranked at the top. Typically, a correctness criterion is defined; if the top-ranked model matches that criterion, the prediction is considered correct.
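The ranking-and-criterion logic just described is easy to sketch in code. Here is a toy Python illustration: the confidence values and correctness flags are fabricated, and the "correct" flag stands in for a real assessment (e.g. a CAPRI or DockQ evaluation against a reference structure), which is not shown.

```python
# Each prediction carries a model confidence score and a flag telling whether
# it meets the chosen correctness criterion (both fabricated for illustration).
predictions = [
    {"confidence": 0.91, "correct": False},
    {"confidence": 0.87, "correct": True},
    {"confidence": 0.63, "correct": False},
    {"confidence": 0.55, "correct": True},
    {"confidence": 0.40, "correct": False},
]

def top_n_success(preds, n):
    """True if at least one correct prediction appears among the n top-ranked models."""
    ranked = sorted(preds, key=lambda p: p["confidence"], reverse=True)
    return any(p["correct"] for p in ranked[:n])

print(top_n_success(predictions, 1))  # False: the top-ranked model misses the criterion
print(top_n_success(predictions, 5))  # True: a correct model exists within the top 5
```

Averaging this boolean over a whole benchmark dataset gives the top-n success rate discussed below.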
In benchmarks, the top-n success rate is the fraction of correct predictions up to rank n. For instance, the top-1 success rate is the number of cases with a successful top-1 prediction over the full dataset, the top-5 success rate considers all ranks up to 5, and so on. As the pool of predictions considered grows, the probability of finding a correct prediction increases. In the case of protein complexes, it is notoriously hard to predict bound conformations using traditional docking techniques. The top-1 success rate of traditional docking methods is typically low (a few percent), and with these methods it is often necessary to consider a wider pool of predictions, along with complementary methods such as molecular dynamics, to assess which prediction is likely correct. AlphaFold 2 Multimer became the gold standard for protein complex prediction, and AlphaFold 3 extends it to a much larger landscape of interactions, involving nearly all kinds of molecules in life science. Once a satisfying prediction is obtained, downstream tasks may be performed with tools other than AlphaFold. Long molecular dynamics simulations can be used to sample the conformational landscape, identify key functional domains, assess stability, perform mutagenesis analysis, and so on. Structure prediction is thus one of the early steps of the drug discovery phases, and must be complemented with additional analyses.

Is AlphaFold 2 obsolete?

With the release of AlphaFold 3, one might wonder if AlphaFold 2 is now outdated. The answer is rather nuanced. While AlphaFold 3 offers improvements in specific areas like nucleic acid/protein predictions and ligand docking, AlphaFold 2 remains highly relevant.
The reality is that AlphaFold 2 has been integrated within more intricate workflows which, in some cases, extend its use beyond simple structure prediction and, in other cases, significantly improve its performance on specific tasks such as multimeric predictions, as witnessed by the results of CASP15 [14]. For example, AlphaFold 2 and ProteinMPNN have been integrated into a pipeline for complete de novo design of complex protein folds with targeted properties [23, 24]. Another example is protein complex prediction, which is greatly improved through techniques like massive sampling and dropout layer activation during inference [15]. This improvement beyond base performance is obtained through slight tweaking, without re-training or fine-tuning the neural networks.

Antibody-antigen modeling: a persistent challenge

One particular shortcoming of the first release of the AlphaFold 2 pipeline is its lack of accuracy in predicting antibody-antigen or nanobody-antigen bound complexes [8]. The problem itself is notoriously difficult, and it comes as no surprise that the observed accuracy of AlphaFold 2 on many other tasks motivated further inquiry into its performance on this specific use case. An initial benchmark showed a very low success rate (~10%) in this area [8], compared to other tasks. It has been argued that while the integration of coevolution data was the source of AlphaFold 2's overall performance, such data do not exist for antibody-antigen binding, which partially explains the lack of accurate results. Nevertheless, a much more recent study [17] highlighted increased performance for newer versions of AlphaFold Multimer (2.2 and 2.3) compared to the initial release. Moreover, novel strategies, such as the aforementioned augmented sampling approach, have shown an even larger leap in success rates.
Indeed, a key feature of AlphaFold 2 (and its successors) is the ability to rank its own predictions using predicted accuracy metrics: in massive sampling approaches, such metrics can be used to identify conformational models of relevance [16]. Using a benchmark dataset of 37 antibody-antigen complexes (not part of the training set of AlphaFold 2), it has been reported [17] that the top-1 success rate was ~60%, which is quite close to the ~64% top-1 success rate of AlphaFold 3 (albeit on a much larger dataset [7], sampling 1,000 seeds); similar metrics were reported by other groups on other benchmark datasets [9]. In less than two years, the top-1 success rate has been multiplied by a factor of 6! If we consider larger pools of predictions beyond the top-ranked one, up to top-25, the success rate comes close to 75%, meaning there is at least one correct prediction among 25 in 3 out of 4 cases. Combining physics-based approaches with deep-learning predictions typically increases the complex structure prediction success rate further. In massive sampling approaches, a large number of predictions are analysed (at least a few thousand), and in practice correct predictions have a large probability of being retrieved in the set.

Antibody-antigen success rates for different AlphaFold versions/implementations. The success rate is calculated as the percentage of cases that had at least one model among their top-n predictions meeting a specified level of CAPRI accuracy. Adapted from ref. 17.

Powering up drug discovery with LensAI

At BioStrand, we have integrated AlphaFold 2 into our LensAI platform to enhance drug discovery workflows. The platform allows users to perform protein structure predictions within an optimized environment that balances speed and accuracy.
Most comparable services come with limitations such as restricted sequence lengths or a reduced database search (~600 GB of storage, compared to 2.62 TB for the full database); these are tradeoffs made to accommodate heavy usage, with a potential drop in accuracy in some cases.

AlphaFold workflows readily available in AWS HealthOmics Ready2Run.

Improvements like GPU acceleration (at the inference and structure relaxation levels) may be desired, especially if the input sequences are large. To further improve performance, parallelization (which is not a feature of the official DeepMind release) may be highly desirable in the case of augmented sampling. Beyond standard structure prediction tasks, LensAI incorporates advanced features like automated reporting and augmented sampling to improve prediction confidence. Moreover, LensAI integrates AlphaFold into specialized pipelines such as epitope mapping and affinity maturation (a case study has been documented and is accessible at the following link [22]). These pipelines exploit state-of-the-art methodologies (physics- and data-driven approaches) to accelerate discovery rates in biotherapeutic research.

Conclusion: the future of AI-driven structural biology

The field of structural biology has witnessed groundbreaking progress within the past few years. AlphaFold's journey from version 1 to version 3 represents a transformative leap in our ability to predict biological macromolecule structures with unprecedented accuracy. While AlphaFold 3 expands into new territories like nucleic acids and small molecules, it does not render its predecessor obsolete. Both versions offer unique strengths that can be leveraged depending on specific research needs, and they pave new ways toward more intricate in silico and de novo generation of biotherapeutics to be integrated within pre-clinical research workflows.
As we continue to integrate these models into platforms like LensAI, we are improving our ability to predict protein structures and accelerating the entire drug discovery process, from target identification to lead optimization. The future is bright for AI-driven structural biology, and BioStrand is at the forefront of this exciting revolution.

References

[1] https://www.nobelprize.org/prizes/chemistry/2024/press-release/ , consulted 2024/10/09
[2] Jumper, John, et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596.7873 (2021): 583-589.
[3] https://scholar.google.com/scholar?cites=6286436358625670901 , consulted 2024/10/21
[4] https://blog.biostrand.ai/explained-a-brief-look-into-alphafold-2 , consulted 2024/10/21
[5] https://blog.biostrand.ai/explained-how-to-plot-the-prediction-quality-metrics-with-alphafold2 , consulted 2024/10/21
[6] https://blog.biostrand.ai/scaling-up-structural-biology-with-alphafold2 , consulted 2024/10/21
[7] Abramson, Josh, et al. "Accurate structure prediction of biomolecular interactions with AlphaFold 3." Nature (2024): 1-3.
[8] Yin, R., Feng, B. Y., Varshney, A., & Pierce, B. G. (2022). Benchmarking AlphaFold for protein complex modeling reveals accuracy determinants. Protein Science, 31(8), e4379.
[9] Bernard, C., Postic, G., Ghannay, S., & Tahi, F. (2024). Has AlphaFold 3 reached its success for RNAs?. bioRxiv, 2024-06.
[10] Callaway, E. (2024). Who will make AlphaFold3 open source? Scientists race to crack AI model. Nature, 630(8015), 14-15.
[11] Callaway, E. (2022). After AlphaFold: protein-folding contest seeks next big breakthrough. Nature, 613: 13-14.
[12] Editorial, Nature 629, 728 (2024).
[13] https://x.com/pushmeet/status/1790086453520691657 , consulted 2024/10/21
[14] Proteins: Structure, Function, and Bioinformatics: Volume 91, Issue 12 - Special Issue: CASP15: Critical Assessment of Methods for Structure Prediction, 15th round, C1-C4, 1535-1951 (2023).
[15] Wallner, B. (2023). Improved multimer prediction using massive sampling with AlphaFold in CASP15. Proteins: Structure, Function, and Bioinformatics, 91(12), 1734-1746.
[16] Raouraoua, N., Lensink, M., & Brysbaert, G. (2024). Massive sampling strategy for antibody-antigen targets in CAPRI round 55 with MassiveFold. Authorea Preprints.
[17] Yin, R., & Pierce, B. G. (2024). Evaluation of AlphaFold antibody-antigen modeling with implications for improving predictive accuracy. Protein Science, 33(1), e4865.
[18] Hitawala, F. N., & Gray, J. J. (2024). What has AlphaFold3 learned about antibody and nanobody docking, and what remains unsolved?. bioRxiv, 2024-09.
[19] Harmalkar, A., Lyskov, S., & Gray, J. J. (2023). Reliable protein-protein docking with AlphaFold, Rosetta, and replica-exchange. bioRxiv.
[20] Gao, M., & Skolnick, J. (2024). Improved deep learning prediction of antigen-antibody interactions. Proceedings of the National Academy of Sciences, 121(41), e2410529121.
[21] Zheng, W., Wuyun, Q., Freddolino, P. L., & Zhang, Y. (2023). Integrating deep learning, threading alignments, and a multi-MSA strategy for high-quality protein monomer and complex structure prediction in CASP15. Proteins: Structure, Function, and Bioinformatics, 91(12), 1684-1703.
[22] https://www.biostrand.ai/insight-hub/use-cases , consulted 2024/10/21
[23] Goverde, C. A., Pacesa, M., Goldbach, N., Dornfeld, L. J., Balbi, P. E., Georgeon, S., ... & Correia, B. E. (2024). Computational design of soluble and functional membrane protein analogues. Nature, 1-10.
[24] Dauparas, J., Anishchenko, I., Bennett, N., Bai, H., Ragotte, R. J., Milles, L. F., ... & Baker, D. (2022). Robust deep learning-based protein sequence design using ProteinMPNN. Science, 378(6615), 49-56.
The unveiling of AlphaFold2 during the 14th edition of the Critical Assessment of Structure Prediction (CASP14) has been a turning point in the field of structural biology, as a solution to the protein structure problem. This decades-old problem stems from the observation that a protein's structure is almost uniquely determined by its sequence of amino acids (its primary structure). This implies the existence of a general design law, which could be used to model the 3-dimensional structure of proteins whenever their sequence is known. Despite ever-increasing computing power, such a design law has remained unknown. More worryingly, first-principles methods have historically underperformed template-based methods such as homology modelling, which seemed to go against the empirically observed deterministic nature of protein folding. With AlphaFold2, DeepMind provided a convincing argument that de novo protein structure prediction is possible, and that a deep learning model can somehow capture the aforementioned design law. The performance of AlphaFold2 in contrast to its competitors at CASP14 led to a huge amount of hype (Fig. 1), with prominent figures in the field such as John Moult (co-founder of CASP) claiming that the protein structure prediction problem was, to some extent, solved [1]. Since then, the AlphaFold2 source code and model weights have been released, prompting most research groups to incorporate it in their research (Fig. 2), with several possible downstream applications, such as empirical model refinement [4], molecular dynamics, etc. In addition, several deep learning-based competitor algorithms with comparable performance have appeared, such as RoseTTAFold [5] (Baker lab) or ESMFold [6] (Meta). Nowadays, there is sufficient hindsight to understand to what extent protein structure prediction has become transformative with respect to how research is performed.
Figure 1 - Interest in AlphaFold over time, as measured by Google Trends (21 Feb 2023). The first peak corresponds to the CASP14 conference week, whereas the second peak corresponds to the public release of the model. Adapted from Google Trends.

Figure 2 - Number of citations of the original AlphaFold2 manuscript [3] (not to be confused with the original AlphaFold paper), as measured by Scopus (21 Feb 2023). The number may differ across databases; for instance, Google Scholar reports 8,783 citations (21 Feb 2023). Adapted from Scopus.

Structure prediction as a routine bioinformatics task

Until recently, the field of bioinformatics revolved mainly around sequence data and annotations. Indeed, with the advent of next-generation sequencing platforms, sequence data production has been growing exponentially, raising many challenges for data integration and analysis at massive scale. A similar trend has been observed for protein structures, albeit at a much smaller scale (Fig. 3). This is due to different reasons, the main one being that empirical studies of protein structure remain too costly to scale up with the availability of sequence data. Before AlphaFold2, the main method to produce a model in silico was homology modelling, which produced satisfying models based on templates with large sequence similarity. Unfortunately, this meant that all cases where sequence similarity is not high enough would not yield quality models. Moreover, the choice of a good template is critical, which means that such methods cannot be easily automated. On the other hand, AlphaFold2 requires very little input, only the protein sequence(s) to be modelled, which facilitates automation. This led to the release of AlphaFoldDB [7], which now contains more than 200 million protein structures predicted from sequences stored in the UniProt database. In comparison, at the time of writing (21 Feb 2023), there are only 201,515 entries in the PDB [8], a factor of ~1,000 in terms of size!
While AlphaFold2 cannot be run on any standard laptop, it can be run in a high-performance computing (HPC) environment and scaled up for routine predictions, as we do at MindWalk on our cloud-based setup.

Figure 3 - Number of released PDB structures per year. Adapted from PDB RCSB.

Can predicted structural models be substituted for empirical models?

The prospect of predicting protein structures with a precision that rivals experimental methods has strong implications for accelerating research. The distribution of structure resolutions in the PDB shows that most models have a resolution of about 2 Å (Fig. 4). In benchmarks such as CASP14, AlphaFold2 demonstrated such capacities, with a mean root-mean-square deviation of 1.6 Å on Cα atoms. These early results supported the claim that de novo structure prediction had reached the precision boundary of empirical models.

Figure 4 - Distribution of structure resolutions (in Å); data shown include structures solved by X-ray crystallography or electron microscopy. Adapted from RCSB.

A recent review [9] of the first database of structures predicted by AlphaFold2 (365,198 protein models) highlighted the strengths and limitations of AlphaFold2's predictions and related output metrics, which give local and non-local confidence scores at atomic precision (see previous blog post). The authors argue that, for the 11 proteomes covered by the database, an average of 25% additional residues are confidently modelled compared to structures built through homology modelling. These high-confidence regions can be used for downstream modelling tasks (for instance, protein-ligand docking). However, not all AlphaFold2 predictions can be trusted and used for downstream tasks. Roughly 50% of the residues in the database of 11 proteomes are of low confidence (low pLDDT). These residues have been argued to often correspond to intrinsically disordered proteins/regions (IDPs/IDRs).
The authors of the paper benchmarked AlphaFold2 against other tools for predicting IDPs/IDRs and showed that AlphaFold2 outperformed state-of-the-art algorithms such as IUPred2. The authors also compared AlphaFold2 Multimer to state-of-the-art protein-protein docking algorithms and argue that it outperforms them for predicting complexes, as also confirmed by other groups [10]. One particular domain where AlphaFold2 does not outperform traditional approaches is antibody-antigen docking [10]. This relates to the co-evolution data required at the beginning of AlphaFold2's pipeline. Indeed, antibody-antigen binding strength does not result from co-evolution, but from somatic hypermutation and affinity maturation. Hence, the key component of AlphaFold2's strength, the multiple sequence alignment (MSA) embedding, cannot help for this particular use case. Such shortcomings were also highlighted during the CASP15 conference, which took place last December in Antalya (Turkey) [11].

The legacy of AlphaFold2

AlphaFold2 is bound to leave a lasting legacy in the field of structural biology. Despite the notable absence of DeepMind at CASP15, the top-performing methods incorporated AlphaFold2 as part of their prediction pipelines, and for single-domain prediction, it can be expected that improvements will only be incremental from now on. Instead, an increase in protein complex prediction performance was noted, as various groups integrated or hacked the AlphaFold2 pipeline to predict models that AlphaFold-Multimer, the default pipeline for multimeric complex prediction, failed to predict correctly [11]. Around the same period, Meta's Fundamental AI Research (FAIR) protein team released ESM-2, a protein language model, as well as ESMFold, a protein structure prediction engine built on top of ESM-2.
While ESMFold is not as performant as AlphaFold2, it has a notable feature (or rather, a notable absence): the MSA pre-processing step is missing from the ESMFold pipeline. This MSA step relies on scanning a 2-terabyte database, which accounts for most of the runtime of an AlphaFold2 prediction. Instead, ESMFold relies on the information stored in the ESM-2 model weights to produce accurate models. Meta produced over 600 million models, which have been released in the ESM Metagenomic Atlas [12]. The sudden increase in structure prediction performance brought by AlphaFold2 highlights the remaining challenges of protein structure modelling. The current state-of-the-art methods output static models from sequence input. However, the reality is that protein structures are far from static: some parts of proteins are less rigid than others, and protein motion can be of crucial importance for function. Many proteins can adopt different conformations depending on the context. The protein structure problem, as formulated in the introduction of this blog post, suggests that there is a one-to-one mapping between sequence and structure, whereas this is far from being the case; it highlights our biased thinking toward this simplification. While the PDB database is biased toward single-structure models, it still displays heterogeneity in its structural data. In a recent publication [13], Thomas J. Lane (CFEL-PBio) argues in favor of continuous distributions of protein structures as models, instead of single snapshots. Particular attention is given to AlphaFold2's predictions with respect to its training data: by analyzing the distribution of root-mean-square deviations (RMSD) between PDB models of the SARS-CoV-2 main protease (Mpro), and the distribution of RMSD between AlphaFold2 predictions and these PDB models, it turns out that the distributions overlap but have distinct peaks.
This means that, on average, two randomly selected PDB models are more likely to be similar to each other than to an AlphaFold2 prediction. It is also argued that AlphaFold2 models can lie in between conformational states represented in the PDB: an example is given in terms of hemoglobin states, unbound or bound to ligands (such as O2 or CO). The AlphaFold2 structure is shown to lie in between, as a kind of averaged structure which does not correspond to a real, stable physical state. Finally, protein domains with low confidence metrics output by AlphaFold2 have also been shown to correspond to regions with structural flexibility. This further supports the need to move away from the single-structure paradigm. In summary, AlphaFold2 marked a turning point in structural biology, and arguably resolved the single protein structure problem. This has led to a change of focus toward even more complicated challenges, such as the prediction of protein complexes and interactions. Single protein structure prediction can now be a routine preliminary task for downstream applications and research, for instance, modelling folding mechanisms in molecular dynamics simulations, protein-ligand docking, etc. These downstream applications will be a topic of discussion in this blog, so stay tuned!

References

[1] https://www.deepmind.com/blog/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology, accessed 2023/02/21
[2] https://github.com/deepmind/alphafold, accessed 2023/02/21
[3] Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596, no. 7873 (2021): 583-589.
[4] Barbarin-Bocahu, Irène, and Marc Graille. "The X-ray crystallography phase problem solved thanks to AlphaFold and RoseTTAFold models: a case-study report." Acta Crystallographica Section D: Structural Biology 78, no. 4 (2022).
[5] Baek, Minkyung, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang et al. "Accurate prediction of protein structures and interactions using a three-track neural network." Science 373, no. 6557 (2021): 871-876.
[6] Lin, Zeming, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin et al. "Evolutionary-scale prediction of atomic level protein structure with a language model." bioRxiv (2022): 2022-07.
[7] https://alphafold.ebi.ac.uk/, accessed 2023/02/21
[8] https://www.rcsb.org/stats/growth/growth-released-structures, accessed 2023/02/21
[9] Akdel, Mehmet, Douglas EV Pires, Eduard Porta Pardo, Jürgen Jänes, Arthur O. Zalevsky, Bálint Mészáros, Patrick Bryant et al. "A structural biology community assessment of AlphaFold2 applications." Nature Structural & Molecular Biology (2022): 1-12.
[10] Yin, Rui, Brandon Y. Feng, Amitabh Varshney, and Brian G. Pierce. "Benchmarking AlphaFold for protein complex modeling reveals accuracy determinants." Protein Science 31, no. 8 (2022): e4379.
[11] Ewen Callaway, "After AlphaFold: protein-folding contest seeks next big breakthrough", Nature no. 613 (2023), 13-14.
[12] https://esmatlas.com/, accessed 2023/02/21.
[13] Lane, Thomas J. "Protein structure prediction has reached the single-structure frontier." Nature Methods (2023): 1-4.
In a previous blog post, we discussed the importance of AlphaFold2 (as well as other deep-learning based methods) for accelerating the process of drug discovery by predicting the 3-dimensional structure of proteins and protein complexes. MindWalk provides a pipeline integrating the standard AlphaFold2 workflow, as well as additional post-processing workflows, all in a simple-to-run and scalable package. We will now discuss how the prediction evaluation metrics can be extracted. After a successful AlphaFold2 run, you should have several files in the output directory. Note that depending on the type of prediction (default, monomer_ptm, multimer), the naming can change a bit.

/
  features.pkl
  ranked_{0,1,2,3,4}.pdb
  ranking_debug.json
  relaxed_model_{1,2,3,4,5}_pred_0.pdb
  result_model_{1,2,3,4,5}_pred_0.pkl
  timings.json
  unrelaxed_model_{1,2,3,4,5}_pred_0.pdb
  msas/
    bfd_uniclust_hits.a3m
    mgnify_hits.sto
    pdb70_hits.hhr
    uniref90_hits.sto

The content of each file is as follows:

1. The unrelaxed_model_{*}_pred_0.pdb files contain the raw predictions of AlphaFold. Several models are predicted, and AlphaFold outputs the 5 best ones by default. These PDB files can be opened like any normal PDB file, with some differences: for example, the B-factor, which is determined experimentally in empirical structures, has been replaced by AlphaFold's own quality score (pLDDT).

2. The relaxed_model_{*}_pred_0.pdb files contain the refined predictions of AlphaFold. The corresponding unrelaxed models are fed to a force-field algorithm, Amber, which computes the forces exerted on the side chains of each residue, then computes displacements for each side-chain atom. This process is repeated iteratively until the energy of the whole molecule is minimised. The final configuration of each model is then stored in the corresponding relaxed_model_{*}_pred_0.pdb file. This step is expected to generally improve the predictions and is enabled by default when AlphaFold runs.

3. The ranked_{*}.pdb files are redundant files sorted with respect to the quality of the models. They correspond either to the unrelaxed_model_{*}_pred_0.pdb files or, if the relaxation step is activated, to the relaxed_model_{*}_pred_0.pdb files.

4. The ranking_debug.json file contains the quality score of each model, as well as the ranking according to this score. The score varies from 100 (perfect confidence) to 0 (no confidence).

5. The msas/ directory contains the multiple sequence alignments computed in the pre-processing step of the AlphaFold run.

6. The timings.json file contains the run time of each step. In a typical run, most of the run time is actually spent on feature extraction, which is done in the pre-processing step (mostly hhblits); the GPU is only used for the predict_and_compile_model steps.

AlphaFold2 does not just output models; it also computes several confidence metrics about its predictions. These metrics are stored in .pkl files and are not directly readable: they need to be deserialized ("unpickled") in order to be read and integrated into your workflow. The features.pkl file can be opened with Python (3.7) using the pickle module. The object stored in features.pkl is a simple Python dictionary, with many keys such as sequence, msa, template_sequence, etc. The values associated with these keys are typically NumPy arrays of various dimensions. We will not discuss all of these features; they relate to the pre-processing step of AlphaFold2 (look-up + multiple sequence alignment). The confidence metrics are stored in the other .pkl files, and we can open one to display its content. Once again, the unpickled file contains a Python dictionary, whose keys refer to different metrics. The metrics of most interest are the Predicted Aligned Error, as well as the predicted lDDT.
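As a minimal, self-contained sketch of the unpickling step, the snippet below fabricates a tiny dictionary shaped like an AlphaFold2 monomer_ptm result file (the values are made up for illustration); on a real run you would open the result_model_{*}_pred_0.pkl files from your output directory instead.

```python
import pickle
import numpy as np

# Fabricated stand-in for a real AlphaFold2 result_model_*_ptm_pred_0.pkl file,
# holding only the two metrics discussed in the text.
dummy = {
    "plddt": np.array([92.1, 88.4, 45.0]),        # per-residue confidence (0-100)
    "predicted_aligned_error": np.zeros((3, 3)),  # pairwise expected error (Å)
}
with open("result_model_1_ptm_pred_0.pkl", "wb") as f:
    pickle.dump(dummy, f)

# This is the part you would run on a real output directory:
with open("result_model_1_ptm_pred_0.pkl", "rb") as f:
    result = pickle.load(f)

print(sorted(result.keys()))   # the metric names stored in the file
print(result["plddt"].mean())  # mean pLDDT over residues
```

The same pickle.load call works for features.pkl, whose dictionary holds the pre-processing features instead of confidence metrics.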
Note that when using the default settings, some of the metric entries shown here will not appear, such as the predicted aligned error, which is one of the metrics displayed in AlphaFoldDB entries. If you set --model_preset=monomer_ptm, then you should have all the items listed above. The most important ones for assessing model quality are 'plddt' and 'predicted_aligned_error'.

Assessing the quality of a model prediction

First, let's have a look at a random entry in AlphaFoldDB.

Fig. 1 - An entry in AlphaFoldDB

At the top, we see the sequence(s) of the protein (complex), the 3D model with secondary structure, and the Predicted Aligned Error (PAE) plot. The colours in the model reflect its local confidence, as given by the scale in the upper left: dark blue regions have high confidence (pLDDT > 90) and orange regions have low confidence (pLDDT < 50). pLDDT corresponds to the model's prediction of its score on the local Distance Difference Test (lDDT-Cα) and is a measure of local accuracy. This metric replaces the B-factor found in structural models built with empirical methods. A low confidence score may have several causes:

1. The corresponding sub-sequence may not have a significant number of homologs in the training data.
2. The corresponding area may be represented in the training data, but its conformation is not fixed: it can be an intrinsically disordered region.

Below the protein model is the PAE plot, which reports, in a pairwise fashion, the expected error (in ångströms) in the relative position between residues. In the colour scale of AlphaFoldDB, dark green corresponds to 0 Å, whereas larger errors are coloured white. Along the diagonal of the heat map, most elements are expected to be close to 0 Å. Well-defined blocks represent domains of high confidence. A very good model would display a completely dark green heat map; in the example above, we see that this is not the case for the majority of the model.
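This block structure can also be quantified. Here is a small sketch (the helper function and the toy 4-residue matrix are ours, not part of AlphaFold's output) that averages the off-diagonal PAE between two putative domains; a large value signals that the relative placement of the domains is unreliable even if each domain is well folded:

```python
import numpy as np

def inter_domain_pae(pae, split):
    """Mean PAE (in angstroms) between residues of the two domains
    obtained by splitting the chain at index `split`."""
    off_blocks = np.concatenate([pae[:split, split:].ravel(),
                                 pae[split:, :split].ravel()])
    return float(off_blocks.mean())

# Toy matrix: two tight 2-residue "domains" (near-zero PAE within each
# diagonal block) whose relative orientation is poorly defined
# (high PAE in the off-diagonal blocks).
pae = np.array([[ 0.0,  1.0, 20.0, 22.0],
                [ 1.0,  0.0, 21.0, 20.0],
                [20.0, 21.0,  0.0,  1.0],
                [22.0, 20.0,  1.0,  0.0]])
print(inter_domain_pae(pae, 2))  # 20.75
```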
If you try to predict the fold of two domains (by setting --model_preset=multimer), you can end up with the following heat map:

Fig. 2 - A PAE plot featuring two domains with low relative-conformation confidence

In this case, the two domains have been well predicted individually, but their relative positions are not well defined. While these plots are available in AlphaFoldDB, the current version of AlphaFold2 (2.2.2) does not provide them out of the box; the user must generate them from the output files described above. The LENSai pipeline outputs them automatically, but we will now describe how this is done using Python.

Plotting from your own AlphaFold output using Python

The first step is to load all of the dictionaries stored in the .pkl files. This can be done using the following Python script (adapted from a publicly available script):

```python
import os
import pickle
import json
import numpy as np
import matplotlib.pyplot as plt

class arg:
    def __init__(self, repo):
        self.input_dir = repo
        self.output_dir = repo
        self.name = repo

repo = ['rbd_sars-cov-2']  # this is a list of all output directories

for r in repo:
    args = arg(r)
    with open(os.path.join(r, "ranking_debug.json"), 'r') as f:
        ranking_dict = json.load(f)
    feature_dict = pickle.load(open(f'{args.input_dir}/features.pkl', 'rb'))
    is_multimer = ('result_model_1_multimer_v2_pred_0.pkl'
                   in [os.path.basename(f) for f in os.listdir(path=args.input_dir)])
    if not is_multimer:
        model_dicts = [pickle.load(open(f'{args.input_dir}/result_model_{f}_ptm_pred_0.pkl', 'rb'))
                       for f in range(1, 6)]
    else:
        model_dicts = [pickle.load(open(f'{args.input_dir}/result_model_{f}_multimer_v2_pred_{g}.pkl', 'rb'))
                       for f in range(1, 6)
                       for g in range(5)]
```

First, a few libraries are imported to handle file paths, manipulate JSON and NumPy objects, and plot the figures using the matplotlib
library. Next, the arg class is defined; its purpose is to store the input directory, the output directory, and the name of the output files. For the sake of clarity, we set all of these to the same value, which will be the name of the output directory. The names of the output directories are then stored in a list; we expect these directories to sit in the same location as the script/notebook. Finally, the script iterates over the directories and loads the files as a list of dictionaries, which can then be used for plotting.

To plot the metrics, we define two functions:

```python
def get_pae_plddt(model_dicts):
    out = {}
    for i, d in enumerate(model_dicts):
        out[f'model_{i+1}'] = {'plddt': d['plddt'],
                               'pae': d['predicted_aligned_error']}
    return out

def generate_output_images(feature_dict, model_dicts, ranking_dict,
                           out_dir, name, pae_plddt_per_model):
    msa = feature_dict['msa']
    seqid = (np.array(msa[0] == msa).mean(-1))
    seqid_sort = seqid.argsort()
    non_gaps = (msa != 21).astype(float)
    non_gaps[non_gaps == 0] = np.nan
    final = non_gaps[seqid_sort] * seqid[seqid_sort, None]

    ###################### plot MSA with coverage ####################
    plt.figure(figsize=(14, 4), dpi=100)
    plt.subplot(1, 2, 1)
    plt.title(f"Sequence coverage ({name})")
    plt.imshow(final, interpolation='nearest', aspect='auto',
               cmap="rainbow_r", vmin=0, vmax=1, origin='lower')
    plt.plot((msa != 21).sum(0), color='black')
    plt.xlim(-0.5, msa.shape[1] - 0.5)
    plt.ylim(-0.5, msa.shape[0] - 0.5)
    plt.colorbar(label="Sequence identity to query")
    plt.xlabel("Positions")
    plt.ylabel("Sequences")
    ##################################################################

    ###################### plot lDDT per position ####################
    plt.subplot(1, 2, 2)
    plt.title(f"Predicted lDDT per position ({name})")
    s = 0
    for model_name, value in pae_plddt_per_model.items():
        plt.plot(value["plddt"],
                 label=f"{model_name}, plddts: {round(list(ranking_dict['plddts'].values())[s], 6)}")
        s += 1
    plt.legend()
    plt.ylim(0, 100)
    plt.ylabel("Predicted lDDT")
    plt.xlabel("Positions")
    plt.savefig(f"{out_dir}/{name + ('_' if name else '')}coverage_lddt.pdf")
    ##################################################################

    ################ plot the predicted aligned error ################
    num_models = len(model_dicts)
    plt.figure(figsize=(3 * num_models, 2), dpi=100)
    for n, (model_name, value) in enumerate(pae_plddt_per_model.items()):
        plt.subplot(1, num_models, n + 1)
        plt.title(model_name)
        plt.imshow(value["pae"], label=model_name, cmap="bwr", vmin=0, vmax=30)
        plt.colorbar()
    plt.savefig(f"{out_dir}/{name + ('_' if name else '')}pae.pdf")
    ##################################################################
```

The first function extracts the relevant metrics, whereas the second plots 3 different charts and saves them in each of the output directories. We can then call these functions with the loaded dictionaries to get the different plots:

```python
pae_plddt_per_model = get_pae_plddt(model_dicts)
generate_output_images(feature_dict, model_dicts, ranking_dict,
                       args.output_dir if args.output_dir else args.input_dir,
                       args.name, pae_plddt_per_model)
```

Let's have a look at the plots extracted from an example (here, the receptor binding domain of SARS-CoV-2).

Multiple sequence alignment (MSA)

Fig. 3 - The multiple sequence alignment summarised as a heat map

This heat-map representation of the MSA shows all sequences mapped to the input sequence. The colour scale indicates the identity score, and sequences are ordered from top (highest identity) to bottom (lowest identity). White regions are not covered, which occurs with sub-sequence entries in the database. The black line indicates the relative coverage of each position with respect to the total number of aligned sequences.

pLDDT plot

Fig. 4 - The predicted lDDT per residue for the 5 models obtained after an AlphaFold2 job

This plot displays the predicted lDDT per residue position.
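The per-model scores shown in the plot legend can also be recomputed directly from the unpickled dictionaries. The sketch below uses toy arrays standing in for real 'plddt' values; for monomer runs, the mean per-residue pLDDT is the global score AlphaFold2 uses to rank the models:

```python
import numpy as np

# Toy per-residue pLDDT arrays standing in for the real d['plddt'] values.
pae_plddt_per_model = {
    "model_1": {"plddt": np.array([90.0, 85.0, 47.0])},
    "model_2": {"plddt": np.array([95.0, 92.0, 60.0])},
}

# Mean pLDDT per model, i.e. the per-model score shown in the legend.
mean_plddt = {name: float(d["plddt"].mean())
              for name, d in pae_plddt_per_model.items()}
print(mean_plddt)  # model_1 -> 74.0, model_2 -> ~82.33
```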
Here is what the AlphaFold developers report about this metric:

· Regions with pLDDT > 90 are expected to be modelled to high accuracy. These should be suitable for any application that benefits from high accuracy (e.g. characterising binding sites).
· Regions with pLDDT between 70 and 90 are expected to be modelled well (a generally good backbone prediction).
· Regions with pLDDT between 50 and 70 are low confidence and should be treated with caution.
· The 3D coordinates of regions with pLDDT < 50 often have a ribbon-like appearance and should not be interpreted. We show in our paper that pLDDT < 50 is a reasonably strong predictor of disorder, i.e. it suggests such a region is either unstructured in physiological conditions or only structured as part of a complex.
· Structured domains with many inter-residue contacts are likely to be more reliable than extended linkers or isolated long helices.
· Unphysical bond lengths and clashes do not usually appear in confident regions. Any part of a structure with several of these should be disregarded.

(From https://alphafold.ebi.ac.uk/faq)

In the predicted model PDB files, the predicted lDDT per residue replaces the traditional B-factor reported for empirically derived models, even though most visualisation software still refers to it as the B-factor.

PAE plot

Fig. 5 - The PAE plots for the 5 models obtained after an AlphaFold2 job

These heat maps are provided for each final model and show the predicted aligned error between each pair of residues in the model. The colour scale contains three colours to accentuate the contrast between the high-confidence and the low-confidence regions. At a glance, we can see that model_4 is the one with the largest errors, followed by model_3. This is reflected in the ranking of the models in the ranking_debug.json file. These scores are also displayed in the legend of the lDDT plot.
We can see there that model_2, with a pLDDT score of 90.073, is the model with the smallest errors (in the PAE plot, it also contains less red area than the other models).

Comparing the 3D models with molecular visualisation tools

PDB models can be visualised with different tools:

· ChimeraX: https://www.cgl.ucsf.edu/chimerax/
· PyMOL: https://pymol.org/2/
· YASARA View: http://www.yasara.org/viewdl.htm
· …

In this example, we will use YASARA View, which is quite popular amongst researchers in academia and features integration with FoldX for force-field calculations. Let's open all the ranked_{0,1,2,3,4}.pdb files from the example above: File > Load > PDB file. We will see the following:

Fig. 6 - Visualisation (with YASARA View) of the 5 unaligned models obtained after an AlphaFold2 job

In this view, all 5 models are superimposed but unaligned, because the coordinate frame of each model can differ. You can align all of the models as follows: Analyze > Align > Multiple, based on structure, objects with MUSTANG. You can also choose a single reference on which to align all the other models.

Fig. 7 - Structural alignment algorithms provided with YASARA View

Once this is done, all the models will be superimposed. You can give them different colours by clicking on an object name (model name) on the upper right side of the UI (ranked_0, ranked_1, …), then right-clicking and choosing Color. If desired, you can colour all the models by B-factor, which in this case corresponds to the predicted lDDT score. Doing so results in something like this:

Fig. 8 - Predicted AlphaFold2 models after structural alignment. The colour scheme is indexed on the B-factor value (or rather, the predicted lDDT score per residue)

Here you see that all of the regions coloured in yellow (high confidence) tend to overlap quite well.
On the other hand, the regions coloured in red (low confidence) are less likely to overlap. Notably, these regions are also predicted as loops and can potentially correspond to disordered regions, unstructured in physiological conditions or only structured as part of a complex. However, this cannot be determined solely from the predictions of AlphaFold, and it is important to remain cautious when interpreting the results.

Conclusions

We have described how to plot the relevant prediction quality metrics of a standard AlphaFold2 job. These are all integrated within the LENSai pipeline, which gathers all of the relevant charts into an easy-to-read report. This pipeline will also be integrated within a larger ecosystem of structural bioinformatics pipelines, for which MindWalk has secured a new VLAIO research grant to accelerate the development of a knowledge base around protein structures and metadata integration.
The year 2021 has been a wild ride for structural biologists, with the advent of AlphaFold2, DeepMind's solution to the 50-year-old protein structure prediction problem, which won the CASP14 competition by a large margin in late 2020. At the time, aside from its reported performance and the promise it held to solve protein structures that challenge state-of-the-art experimental methods, nothing was said about how DeepMind would deliver its technology to the world. In summer 2021, DeepMind announced that it would release the source code of AlphaFold2 and actively collaborate with EMBL-EBI to produce what is now known as the AlphaFold Protein Structure Database (AlphaFoldDB). This database contains AlphaFold2's predictions for the human proteome, as well as the proteomes of 20 additional organisms. Since then, the source code and the model parameters have been available under the Apache 2.0 and Creative Commons Attribution 4.0 International (CC BY 4.0) licenses, respectively.

Why is it important?

The protein structure prediction problem is a long-standing problem in structural biology. Proteins are primarily characterised by their sequence of amino acids, conventionally written from the amino-terminal extremity to the carboxyl-terminal extremity. However, proteins adopt different 3-dimensional shapes, which in turn determine their functions. It is also observed that proteins with similar sequences tend to adopt the same overall shape, yet the converse is not true: proteins of similar shape can have vastly different sequences. Since the sequence directly specifies the constituents of a protein, the overall protein structure is expected to be determined by this sequence. While protein sequences can be obtained experimentally on a massive scale, it is much harder to resolve 3D shapes using empirical methods, which are highly impractical for industrial purposes such as drug development.
Hence, several efforts have been made to predict protein structure in silico, using methods such as homology modelling or ab initio prediction with physical models. These approaches were not efficient enough to reach the level of precision of experimental methods. Recent approaches based on deep learning, a machine-learning framework built on artificial neural networks (ANNs), showed promising improvements over the more historical methods. It is then hardly a surprise that DeepMind, a company which built its reputation on ANN algorithms surpassing human performance, came up with a solution of its own to this long-standing problem.

AlphaFold2 and its competitors, such as RoseTTAFold, bring the promise of accelerated drug discovery and development by providing protein structure predictions with resolution almost as good as empirical methods, for example by modelling the active molecules in drugs or pathogens. An often-cited example is the prediction of the structure of ORF8, a protein expressed by SARS coronaviruses that is supposedly involved in immune evasion. In addition to structure prediction, AlphaFold2 has shown promising results for predicting protein complexes and can possibly predict protein-protein interactions in the broader sense.

Beyond the hype, improvements still remain to be made before end-to-end in silico drug discovery becomes reality: the current precision of AlphaFold's atomic positions is not yet sufficient for accurate prediction of binding sites. There are other limitations as well: the model outputs only static structures, and if multiple conformations are output, it can be difficult to assess whether these actually exist in real biological systems. Moreover, AlphaFold2 models are not reliable for intrinsically disordered proteins. Nevertheless, it is likely only a matter of time until these issues are solved.
AlphaFold2 in a nutshell

AlphaFold2 is an end-to-end solution to the protein structure prediction problem, taking a sequence as input and producing a protein structure model (PDB file format) as output. Without going into the finer details, the sequence is first fed to a pre-processing pipeline whose task is to build a multiple sequence alignment (MSA) from all homologous sequences retrieved from different databases (such as UniRef90). This MSA is then embedded by the neural network architecture, which outputs 3-dimensional models; these are further relaxed using force-field calculations (Amber). In addition to the models, AlphaFold2 outputs confidence scores to assess model quality.

Hurdles with running AlphaFold2

Using AlphaFold2 is rather straightforward, as it requires a minimal input file (in .fasta format) and few parameters. However, it is not possible to run it on a standard computer: the deep-learning model is optimised for GPU/TPU, and the pre-processing step with the MSA requires look-ups in a collection of datasets of up to 2.2 TB. Hence, only laboratories with high-performance computing (HPC) resources can run AlphaFold2 locally. For research groups without such resources, it is also possible to use cloud-based services such as Microsoft Azure, Google Cloud, or Amazon Web Services, although this solution can be costly if no care is taken with resource allocation. There are also solutions based on notebooks hosted on Google Colab, such as the official AlphaFold Colab notebook or the ColabFold initiative, but these have their shortcomings, as the freely provided resources are limited; in particular, it is not possible to run several jobs in batch. At MindWalk, we opted for a solution based on AWS EC2 instances and EBS volumes for data storage. We have built an AlphaFold Docker image, which is stored in and pulled from our Elastic Container Registry for direct use.
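For reference, the minimal input mentioned above is just a FASTA file with a header line and a sequence. The sketch below writes one (the protein name and the amino-acid fragment are made-up placeholders, not a real target):

```python
# A minimal single-sequence FASTA input for AlphaFold2; the header name
# and the amino-acid sequence are made-up placeholders.
record = (">example_protein\n"
          "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ\n")
with open("protein_sequence.fasta", "w") as fh:
    fh.write(record)

# Quick sanity check of the file we just wrote.
with open("protein_sequence.fasta") as fh:
    header, sequence = fh.read().splitlines()
print(header)         # >example_protein
print(len(sequence))  # 66
```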
We chose to store the 2.2 TB dataset on a volume of similar size, of which we made a snapshot. This choice reduces the cost of keeping such a large dataset in our AWS environment while making it easy and fast to create the resources needed to launch new AlphaFold2 jobs. Moreover, it improves the scalability of the process, so that larger prediction jobs can be launched on EC2 instances with more resources. Once installed, AlphaFold2 is quite easy to run, requiring very few parameters to be set on the command line:

```shell
nohup python3 /data/alphafold/docker/run_docker.py \
  --db_preset=full_dbs \
  --model_preset=monomer_ptm \
  --fasta_paths=/data/input/protein_sequence.fasta \
  --max_template_date=2020-05-14
```

In this example, we used the provided Python script to spin up a Docker container running AlphaFold2, with a protein sequence input stored in the /data/input/protein_sequence.fasta file. The parameter --model_preset can be set to multimer to switch the model to protein complex prediction. After a successful run, AlphaFold2 outputs several files as well as confidence metrics. The MindWalk pipeline takes care of putting all of these into an appropriate report, so that the user does not have to worry about the technicalities of post-processing these files.

Visualising the output of AlphaFold2

AlphaFold2 outputs several models at the same time, depending on the input parameters (5 for a standard run, 25 for complex prediction). In the case of a standard run, you should find the following files in the output directory:

```
/
  features.pkl
  ranked_{0,1,2,3,4}.pdb
  ranking_debug.json
  relaxed_model_{1,2,3,4,5}_pred_0.pdb
  result_model_{1,2,3,4,5}_pred_0.pkl
  timings.json
  unrelaxed_model_{1,2,3,4,5}_pred_0.pdb
  msas/
    bfd_uniclust_hits.a3m
    mgnify_hits.sto
    pdb70_hits.hhr
    uniref90_hits.sto
```

The models are ranked internally by AlphaFold2, #0 being the one with the highest confidence metric. Those metrics are stored in the ranking_debug.json file, along with the ranking.
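As a quick illustration of how this ranking can be consumed programmatically (the file written below is a dummy stand-in with the shape a monomer run produces, where per-model scores sit under the "plddts" key and "order" lists the models from best to worst):

```python
import json

# Dummy stand-in for AlphaFold2's ranking_debug.json, written here only
# so the snippet is self-contained; normally the run produces this file.
dummy = {"plddts": {"model_1_ptm_pred_0": 82.14, "model_2_ptm_pred_0": 90.07},
         "order": ["model_2_ptm_pred_0", "model_1_ptm_pred_0"]}
with open("ranking_debug.json", "w") as fh:
    json.dump(dummy, fh)

# ranked_0.pdb corresponds to the first entry of "order".
with open("ranking_debug.json") as fh:
    ranking = json.load(fh)
best = ranking["order"][0]
print(best, ranking["plddts"][best])  # model_2_ptm_pred_0 90.07
```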
The multiple sequence alignments are stored in the msas/ directory, and the timings are stored in the timings.json file. You can find below an example (fig. 1) of the values stored for each process after running AlphaFold on a random sequence (in this case, taken from an immunoglobulin domain):

Fig. 1 - Timing of the different steps during a single AlphaFold run, using an input sequence from an immunoglobulin domain

We can see that the step taking the largest amount of time is feature extraction (MSA), at approximately 70% of the runtime. The models are stored in the *.pdb files. Let's open the 5 ranked_{*}.pdb files simultaneously with a molecular viewer such as YASARA View. For each residue, the models contain a local prediction confidence score (the predicted lDDT score), stored where the B-factor usually sits in models obtained from empirical methods. After structural alignment, and colouring the models according to the predicted lDDT score per residue, we can compare them (fig. 2):

Fig. 2 - Structural alignment of the structures predicted by AlphaFold2. The colour scheme corresponds to the predicted lDDT score, yellow being the theoretical maximum value

We can see that the predictions are consistent across the structure, except in one specific region (in red in the 3D model). In that region, the score drops to around 50, whereas it is above 90 in all other regions. Regions with predicted lDDT scores above 90 are predicted with high confidence, whereas regions around 50 are predicted with low confidence. It is no surprise, then, to see different conformation predictions around that region, which must be interpreted with great care. AlphaFold2 also outputs non-local metrics to assess model quality, which are highly relevant for judging whether the predicted conformation between domains in a protein complex is trustworthy.
These are stored in the output pickle files (*.pkl) and can be read from Python through the pickle module of the standard library.

Conclusions

Since AlphaFold2's reveal, many things have changed in the field of protein structure prediction. More tools and more databases are popping up left and right, bringing the dream of in silico drug discovery closer than ever before. While improvements remain to be made, AlphaFold2 already provides a fast way to model proteins with capabilities beyond those of homology modelling, drastically accelerating the drug discovery process. In the following blog post, we will discuss the post-processing of AlphaFold2's output in more detail, and how to evaluate the predicted structures using local and non-local metrics.