Discrepancy in Foldseek clustering results compared to AFDB clusters #428

YFeriel · 2025-02-19T19:23:22Z

Hello,

I concatenated my database containing approximately 600k protein structures with the AlphaFold database I downloaded using foldseek databases. I then performed clustering on the concatenated database using default parameters, resulting in approximately 2.7 million clusters.

foldseek concatdbs /data/foldseek/openprot_proteins_db /data/foldseek/sp /data/foldseek/concat_db
foldseek concatdbs /data/foldseek/openprot_proteins_db_ca /data/foldseek/sp_ca /data/foldseek/concat_db_ca
foldseek concatdbs /data/foldseek/openprot_proteins_db_h /data/foldseek/sp_h /data/foldseek/concat_db_h
foldseek concatdbs /data/foldseek/openprot_proteins_db_ss /data/foldseek/sp_ss /data/foldseek/concat_db_ss

foldseek cluster /data/foldseek/concat_db /data/cluster_results $1/tmp_clusters -k 7 --threads 64

When comparing my clusters with yours, I noticed that only about 270k of my clusters (roughly 10%) share a representative protein with yours; the remaining clusters have different representatives. Do you know what causes this discrepancy? I understand that in your analysis the representative was chosen as the member with the highest pLDDT, and that this step was done with MMseqs2. I did not use MMseqs2 for clustering; I used Foldseek directly. Could this explain the difference? If so, what criteria does Foldseek use to choose the representative protein?
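For reference, the comparison described above can be sketched roughly as follows. This is a minimal illustration, not the exact script I ran: it assumes both clusterings are available as two-column, tab-separated files (representative, member), such as the output of `foldseek createtsv`, and that both files use the same identifier format. The function names and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes two-column TSVs (col 1 = representative, col 2 = member)
# with identical ID formats in both files. Function names are hypothetical.

# Print "<shared>\t<total>": how many representatives in MINE also appear
# as representatives in THEIRS, and how many representatives MINE has.
shared_reps() {
    local mine="$1" theirs="$2"
    awk -F'\t' '
        NR == FNR { ref[$1] = 1; next }   # first file (theirs): remember representatives
        !seen[$1]++ {                     # second file (mine): visit each representative once
            total++
            if ($1 in ref) shared++
        }
        END { printf "%d\t%d\n", shared + 0, total + 0 }
    ' "$theirs" "$mine"
}

# Print how many representatives in MINE occur as a member (col 2) of any
# cluster in THEIRS, i.e. the protein is present but may not be the representative.
reps_as_members() {
    local mine="$1" theirs="$2"
    awk -F'\t' '
        NR == FNR { member[$2] = 1; next }  # first file (theirs): remember all members
        !seen[$1]++ && ($1 in member) { n++ }
        END { print n + 0 }
    ' "$theirs" "$mine"
}
```

Usage would look like `shared_reps my_clusters.tsv afdb_clusters.tsv`. Comparing the two counts gives a rough idea of whether a mismatch reflects a different choice of representative or a protein that is missing from the reference clustering altogether.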

Next, I took a closer look at the 270k clusters shared with your results. I labeled each one as "annotated" or "non-annotated" (dark) based on your file 2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv.gz. I then wanted to attach the Pfam and GO annotations you provided to these shared clusters, using the files 4-domain-clustering.zip and 3-sapId_sapGO_repId_cluFlag_LCAtaxId.tsv.gz. However, I could not find all of the clusters marked as annotated in these files. Could you explain their content in more detail? It seems that, at the end of your analysis, only 700k of the 2.3M clusters were considered dark, yet I cannot find the remaining annotated clusters in the data at afdb-cluster.steineggerlab.workers.dev.
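The labeling step can be sketched as below. This is only an illustration under stated assumptions: the column order of file 2 is inferred from its file name (repId in column 1, isDark in column 2), the table is tab-separated and already decompressed, and the function name and the "NA" placeholder are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: looks up the isDark value for each representative ID.
# Assumes TABLE is tab-separated with repId in column 1 and isDark in
# column 2 (inferred from the file name, not confirmed), and that REPS
# holds one representative ID per line. Assumes TABLE is non-empty.
annotate_reps() {
    local reps="$1" table="$2"
    awk -F'\t' -v OFS='\t' '
        NR == FNR { dark[$1] = $2; next }               # table: repId -> isDark
        { print $1, (($1 in dark) ? dark[$1] : "NA") }  # "NA" when the ID is absent
    ' "$table" "$reps"
}
```

For example, after decompressing the table with `gzip -dc`, `annotate_reps shared_reps.txt table.tsv` would print one line per representative. IDs reported as "NA" are the ones that cannot be matched to any cluster in file 2.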

Thank you in advance for your help!
