A new AI algorithm, “Foldseek Cluster,” can cluster hundreds of millions of protein structures. It was developed by a team of researchers from EMBL-EBI, the Institute of Molecular Systems Biology at ETH Zurich, and Seoul National University. The researchers used Foldseek Cluster to cluster the entire AlphaFold database, identifying 2.3 million non-singleton structural clusters, 31% of which are unannotated and likely represent previously undiscovered structures. The study offers new insights into the origins of human immune proteins and advances our understanding of protein evolution.

The study explores AlphaFold’s vast database, which contains more than 214 million predicted protein structures. The new clustering algorithm, Foldseek Cluster, organized this huge data set into 2.3 million non-singleton clusters. About 31% of these clusters lacked annotations, suggesting potential undiscovered protein structures. The unannotated clusters were sparsely populated, covering only 4% of all proteins in the AlphaFold database. Evolutionary analysis indicated that most clusters were ancient, yet 4% appeared to be species-specific, possibly representing lower-quality predictions or cases of de novo gene birth. The analysis also identified several human immune-related proteins with putative remote homologs in prokaryotic species.

Structure-based clustering of AFDB

The AFDB covers over 214 million predicted protein structures and has grown in several stages. The initial release focused on 20 key model organisms, while subsequent updates provided predictions for the Swiss-Prot dataset of the Universal Protein Resource (UniProt) and proteomes relevant to global health, taken from priority lists compiled by the World Health Organisation. 

The AFDB parses and archives these data and makes them accessible through bulk download options, programmatic access endpoints, and interactive web pages. The programmatic access, in particular, has facilitated the integration of AlphaFold models into other biological data repositories, such as the Protein Data Bank in Europe (PDBe), UniProt, Pfam, InterPro, and Ensembl.

The AlphaFold UniProt v.3 database houses 214 million predicted protein structures. To make sense of this volume, a two-step clustering approach was devised. First, the database was clustered by sequence similarity, resulting in 52 million clusters, and the most confidently predicted structure in each cluster was chosen as its representative. Then, a structure-based algorithm clustered these representatives, resulting in 18.7 million clusters. After filtering out certain sequences, 2.3 million non-singleton clusters remained, along with 13 million singletons. This process helps organize and interpret the vast protein structural landscape.
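The two bookkeeping steps in this pipeline, choosing a representative per sequence cluster and separating singletons from larger clusters, can be sketched in a few lines of Python. This is a toy illustration only: the protein identifiers, the confidence values (assumed here to be a mean pLDDT score), and the function names are hypothetical, and the structural clustering itself is not implemented.

```python
from typing import Dict, List, Tuple

# Hypothetical toy input: sequence-cluster id -> list of (protein id, mean pLDDT).
SeqClusters = Dict[str, List[Tuple[str, float]]]


def pick_representatives(seq_clusters: SeqClusters) -> Dict[str, str]:
    """Keep the most confidently predicted member of each sequence cluster."""
    return {
        cluster_id: max(members, key=lambda member: member[1])[0]
        for cluster_id, members in seq_clusters.items()
    }


def split_singletons(struct_clusters: Dict[str, List[str]]):
    """Separate structural clusters with several members from singletons."""
    multi = {rep: m for rep, m in struct_clusters.items() if len(m) > 1}
    single = {rep: m for rep, m in struct_clusters.items() if len(m) == 1}
    return multi, single


toy_seq_clusters = {
    "s1": [("P1", 71.2), ("P2", 93.5)],
    "s2": [("P3", 88.0)],
    "s3": [("P4", 64.9), ("P5", 60.1)],
}
representatives = pick_representatives(toy_seq_clusters)

# Assumed outcome of a structural clustering step run on the representatives.
toy_struct_clusters = {"P2": ["P2", "P3"], "P4": ["P4"]}
non_singletons, singletons = split_singletons(toy_struct_clusters)
```

Here only the representatives move on to the second, more expensive structural step, which is why the number of objects shrinks so sharply between stages.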

Cluster purity analysis

The quality of AlphaFold Database clusters was assessed by examining their structural and Pfam consistency. The clusters demonstrated structural homogeneity, with most members exhibiting similar structures. Additionally, members within the same clusters typically shared the same Pfam domain, confirming their consistency. The relationship between the structural and functional similarity was evident, with higher structural similarity correlating with greater functional similarity based on Pfam and Enzyme Commission annotations. The study also delved into the evolutionary relationships within the clusters, revealing that the vast majority of compared cluster members likely share a common evolutionary origin, indicating a composition primarily dominated by homologous proteins.
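As a rough illustration of what a Pfam consistency check can look like, the sketch below scores a cluster by the share of its annotated members that carry the most common Pfam family. This is a simplified purity measure chosen for clarity, not necessarily the exact metric used in the study, and the function name and accessions are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def pfam_consistency(member_pfams: List[Optional[str]]) -> Optional[float]:
    """Share of annotated members that carry the cluster's most common Pfam family.

    Members without any Pfam annotation (None) are ignored. Returns None when
    no member is annotated, since consistency is undefined in that case.
    """
    annotated = [pfam for pfam in member_pfams if pfam is not None]
    if not annotated:
        return None
    top_count = Counter(annotated).most_common(1)[0][1]
    return top_count / len(annotated)


# Toy cluster: three annotated members, two of which agree, plus one unannotated.
score = pfam_consistency(["PF00001", "PF00001", "PF00002", None])
```

A score near 1 means the annotated members agree on a single family; lower scores flag clusters that mix families.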

Clusters with unknown structure and function

In the vast AlphaFold Protein Structure Database (AFDB), efforts were made to identify ‘dark clusters,’ representing structurally and functionally unknown proteins. Initially, clusters with some similarity to known structures in the Protein Data Bank (PDB) were filtered out. Subsequently, representative proteins from the remaining clusters were annotated using the Pfam database, and clusters that received an annotation were also set aside. This process identified 30.9% of AFDB clusters as ‘dark clusters,’ potentially comprising novel protein structures. Interestingly, these clusters contained only a small fraction (4.06%) of all AFDB proteins, which means that dark clusters tend to be small while the well-populated clusters are mostly annotated. Even so, the result emphasizes how much of protein structural space remains to be explored.
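The difference between the share of clusters that are dark and the share of proteins they contain is easy to see on toy data. The sketch below assumes a cluster is dark when it has neither a PDB match nor a Pfam annotation; the `Cluster` record, its field names, and the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Cluster:
    name: str
    size: int           # number of member proteins
    has_pdb_hit: bool   # similar to a known PDB structure
    has_pfam_hit: bool  # representative received a Pfam annotation


def is_dark(cluster: Cluster) -> bool:
    """A cluster counts as dark when it has neither a PDB nor a Pfam match."""
    return not cluster.has_pdb_hit and not cluster.has_pfam_hit


def dark_summary(clusters: List[Cluster]) -> Tuple[float, float]:
    """Return (fraction of clusters that are dark, fraction of proteins in them)."""
    dark = [c for c in clusters if is_dark(c)]
    cluster_fraction = len(dark) / len(clusters)
    protein_fraction = sum(c.size for c in dark) / sum(c.size for c in clusters)
    return cluster_fraction, protein_fraction


toy_clusters = [
    Cluster("c1", 90, True, True),
    Cluster("c2", 6, False, True),
    Cluster("c3", 2, False, False),
    Cluster("c4", 2, False, False),
]
cluster_fraction, protein_fraction = dark_summary(toy_clusters)
```

In this toy set half of the clusters are dark but they hold only a few percent of the proteins, mirroring the pattern described above in which many small clusters lack annotation.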

Novel enzymes and small-molecule binders

Among the ‘dark clusters’ (uncharacterized protein groups), 33,842 with high-confidence predictions were singled out. Each cluster’s most confidently predicted member was investigated to predict potential new enzymes. By searching for pockets in these structures and employing a structure-based function prediction method (DeepFRI), potential functions were inferred. A total of 1,770 pockets were identified across 1,707 structures, leading to 5,324 functional assignments. A few high-confidence structure predictions appeared to lack compactness and defined structural elements, potentially suggesting erroneous predictions. ‘Transporter activity’ emerged as a frequently predicted molecular function, hinting at membrane-bound proteins, whose structures are often challenging to determine experimentally. Additionally, diverse functions were predicted, showcasing the array of potential roles these proteins might play, from ribonucleotide-binding proteins to chromosome maintenance-associated proteins.

Taxonomic analysis of the clusters

The taxonomic composition of the structural clusters was examined to understand how protein machinery is distributed across the superkingdoms. The clusters were mapped to the tree of life, revealing conservation at various taxonomic levels. Most clusters appeared to be ancient, with a significant portion shared across bacteria, eukaryotes, and archaea. Interestingly, a small fraction of clusters (3.91%) were specific to individual species. These species-specific clusters tended to have fewer members, often lacked annotations, and contained smaller proteins. However, their prediction confidence was comparable to that of the remaining clusters.
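One common way to map a cluster onto the tree of life, assumed here for illustration, is to take the lowest common ancestor of its members' lineages: a cluster whose ancestor is the root is shared very broadly, while a cluster whose ancestor is a species is species-specific. The function name and the shortened lineages below are hypothetical.

```python
from typing import List, Optional


def lowest_common_ancestor(lineages: List[List[str]]) -> Optional[str]:
    """Return the deepest taxon shared by every lineage.

    Each lineage is a list of taxon names ordered from the root downwards.
    Returns None when there are no lineages or they share nothing.
    """
    if not lineages:
        return None
    common = None
    # zip stops at the shortest lineage, so we only compare ranks all members have.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            common = ranks[0]
        else:
            break
    return common


human = ["cellular organisms", "Eukaryota", "Metazoa", "Homo sapiens"]
yeast = ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"]
ecoli = ["cellular organisms", "Bacteria", "Pseudomonadota", "Escherichia coli"]

eukaryote_cluster = lowest_common_ancestor([human, yeast])
ancient_cluster = lowest_common_ancestor([human, yeast, ecoli])
species_cluster = lowest_common_ancestor([human, human])
```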

To place human proteins in an evolutionary context, the researchers searched for clusters specific to humans. Out of 13 identified clusters, four contained human proteins along with viral or specific protein units. However, these were predicted with low confidence and did not form unique human-specific clusters. Next, human protein clusters were associated with their respective functions using Gene Ontology (GO) terms. Remarkably, human proteins share structural similarities with proteins across various organisms, reflecting evolutionary conservation. The shared functions include enzyme activities, cellular components such as microtubule-organizing centers, and diverse activities such as immune response and hormone functions, underscoring the evolutionary connections of human proteins.

The analysis revealed intriguing connections between bacterial and human immune-related proteins. Although certain biological processes were predominantly associated with specific groups, exceptions were found, such as histone-related clusters bridging eukaryotes and bacteria, supporting evolutionary links. Additionally, similarities were observed between human and bacterial immunity-related proteins. For instance, TNFRSF4, a human protein, shares structural features with bacterial proteins due to common cysteine-rich repeat regions. Another notable finding was the resemblance of the human BPI protein, a component of the innate immune system, to bacterial structures. This suggests that the bacterial homologs may play a role in regulating the bacterial outer membrane. Furthermore, structural and functional similarities pointed to the repurposing of ancient DNA-sensing proteins in the AIM2 inflammasome, emphasizing evolutionary adaptations in immune mechanisms across species.

Potential domain families within proteins were predicted using a method based on structural similarity matches. By comparing representative protein structures from Foldseek clusters, probable domain regions based on structural similarities were identified. These regions were then grouped into potential domain families using a network clustering approach. Many matching and novel families were observed when comparing the predictions with known domain families (Pfam). This technique offers a promising way to predict unexplored domain families within proteins, shedding light on their functional components.
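The grouping step can be pictured as a graph problem: predicted domain regions are nodes, structural matches are edges, and each group of linked nodes is a candidate family. The sketch below uses plain connected components (via union-find) as a stand-in for the network clustering step; the actual method in the study may use a different graph algorithm, and the domain identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def connected_components(edges: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group nodes into families: two nodes share a family if a path links them."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps trees shallow
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups: Dict[str, Set[str]] = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Toy structural matches between predicted domain regions.
matches = [("d1", "d2"), ("d2", "d3"), ("d4", "d5")]
families = connected_components(matches)
```

Connected components merge anything linked by a chain of matches, which is permissive; stricter graph clustering methods would split loosely connected groups.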

Structural similarity in distant domains

The network clustering technique revealed approximately 500 connections between clusters, indicating structural similarity between predicted domain families. Notably, certain clusters rich in Pfam annotations shared this similarity with domains lacking clear annotations, hinting at potential functions. One example is the Frag1-like domains, which differ in sequence yet share a strikingly similar structure. This similarity is observed not only in eukaryotic organisms but also in bacteria and archaea. In other words, despite their sequence differences, these domains are built on a common structural framework.

Identification of gasdermin domains

Two structures similar to the gasdermin domain were found. Gasdermin is a crucial protein that triggers a specific kind of cell death in humans to fight off infections. The structures discovered shared a common shape, like a twisted sheet, and had similar parts. Some structures looked like the active form of gasdermin found in both humans and ancient bacteria, suggesting a shared function in defending against invaders. 

The approach used for structure clustering

The clustering algorithm used Foldseek’s 3Di alphabet to represent protein structures as sequences. It combined Linclust and cascaded MMseqs2 clustering, allowing efficient clustering of millions of structures.

First, structures were converted to 3Di sequences and processed using Linclust’s workflow: k-mers were extracted and grouped, and within each group the structures were assigned to the longest sequence. These candidate assignments were then checked with an ungapped alignment algorithm and clustered with MMseqs2 according to the alignment criteria. The clusters were further refined using Foldseek’s structural alignment. The method effectively distinguished homologs from analogs. Cluster purity was assessed using structural similarity, Pfam consistency, and EC number consistency. Functional predictions and domain predictions were also carried out. A web server was developed for user-friendly cluster exploration.
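The k-mer grouping idea can be sketched as follows. This is a heavily simplified illustration, not Linclust’s actual implementation: the 3Di strings are made up, the hash function and the parameters `k` and `m` are arbitrary choices, and the alignment and clustering steps that follow are omitted.

```python
import zlib
from collections import defaultdict
from typing import Dict, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(seqs: Dict[str, str], k: int = 3, m: int = 2) -> Set[Tuple[str, str]]:
    """Return (center, member) pairs that share at least one selected k-mer.

    For each sequence only the m k-mers with the smallest hash are kept, so the
    number of groups grows linearly with the number of sequences. Within each
    k-mer group the longest sequence acts as the center.
    """
    groups = defaultdict(list)
    for name, seq in seqs.items():
        selected = sorted(kmers(seq, k), key=lambda km: zlib.crc32(km.encode()))[:m]
        for kmer in selected:
            groups[kmer].append(name)

    pairs = set()
    for names in groups.values():
        # Ties in length are broken by name so the result is deterministic.
        center = max(names, key=lambda n: (len(seqs[n]), n))
        for name in names:
            if name != center:
                pairs.add((center, name))
    return pairs


# Hypothetical 3Di-like strings; m is set high so every k-mer is kept.
toy_3di = {"a": "DDVVLLPP", "b": "DVVLL", "c": "QQQSSS"}
pairs = candidate_pairs(toy_3di, k=3, m=100)
```

Because each structure is only compared against the centers of the few k-mer groups it falls into, the number of candidate comparisons stays far below an all-against-all search.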


Analyzing a vast number of protein structures presents challenges in data handling. A method was developed to group similar protein structures, which reveals millions of clusters. Over 30% of these clusters are unlike any known structures. This method also has some limitations, potentially missing similarities due to strict criteria. This tool could unlock insights into distant evolutionary links and shed light on how certain functions evolved across species, especially in the immune system. 

Article source: Reference Paper


Prachi is an enthusiastic M.Tech Biotechnology student with a strong passion for merging technology and biology. This journey has propelled her into the captivating realm of Bioinformatics. She aspires to integrate her engineering prowess with a profound interest in biotechnology, aiming to connect academic and real-world knowledge in the field of Bioinformatics.

