Discrepancy in Foldseek clustering results compared to AFDB clusters #428

Open
YFeriel opened this issue Feb 19, 2025 · 0 comments
YFeriel commented Feb 19, 2025

Hello,

I concatenated my own database of approximately 600k protein structures with the AlphaFold database that I downloaded using foldseek databases. I then clustered the concatenated database using default parameters, which produced approximately 2.7 million clusters. These are the commands I ran:

foldseek concatdbs /data/foldseek/openprot_proteins_db /data/foldseek/sp /data/foldseek/concat_db
foldseek concatdbs /data/foldseek/openprot_proteins_db_ca /data/foldseek/sp_ca /data/foldseek/concat_db_ca
foldseek concatdbs /data/foldseek/openprot_proteins_db_h /data/foldseek/sp_h /data/foldseek/concat_db_h
foldseek concatdbs /data/foldseek/openprot_proteins_db_ss /data/foldseek/sp_ss /data/foldseek/concat_db_ss

foldseek cluster /data/foldseek/concat_db /data/cluster_results $1/tmp_clusters -k 7 --threads 64

When I compared my clusters with yours, only about 270k clusters (roughly 10%) had the same representative protein as yours; the remaining clusters have different representatives. Do you know what causes this discrepancy? I understand that in your analysis the representative is the protein with the highest pLDDT, and that this selection is done by MMseqs2. I did not use MMseqs2 for clustering; I used Foldseek directly. Could this explain the difference? If so, what criteria does Foldseek use to choose the representative protein?
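A comparison of representatives like the one described above could be sketched as follows. This is only an illustration under stated assumptions: both clusterings are available as two-column, tab-separated files (representative, member), for example as written by foldseek createtsv; the function names rep_ids and shared_reps are made up for this example; and the ID normalization assumes that entries may be named like AF-<accession>-F1-model_v4 in one file and by bare accession in the other, which should be checked against the actual IDs before trusting the count.

```shell
# Hypothetical helpers, not part of Foldseek.
# Input: two-column TSV files (representative<TAB>member).

# Print the unique representative IDs of a cluster TSV.
# Assumption: AlphaFold entries may look like AF-P12345-F1-model_v4;
# such names are reduced to the bare accession (P12345).
rep_ids() {
    cut -f1 "$1" \
        | sed -E 's/^AF-([A-Z0-9]+)-F[0-9]+-model_v[0-9]+.*$/\1/' \
        | LC_ALL=C sort -u
}

# Print the representatives that occur in both clusterings.
shared_reps() {
    a=$(mktemp)
    b=$(mktemp)
    rep_ids "$1" > "$a"
    rep_ids "$2" > "$b"
    LC_ALL=C comm -12 "$a" "$b"
    rm -f "$a" "$b"
}

# Example usage (file names are placeholders):
#   shared_reps my_clusters.tsv afdb_clusters.tsv | wc -l
```

If the number of shared representatives changes noticeably with and without the normalization step, part of the apparent discrepancy may come from identifier format rather than from the clustering itself.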

Next, I took a closer look at the 270k clusters that we have in common. I labeled each of them as "annotated" or "non-annotated" (dark cluster) based on your file 2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv.gz. I then wanted to attach the Pfam and GO annotations you provide to these common clusters, using the files 4-domain-clustering.zip and 3-sapId_sapGO_repId_cluFlag_LCAtaxId.tsv.gz. However, not all of the clusters you marked as annotated appear in these files. Could you please explain the content of these files in more detail? It seems that at the end of your analysis only 700k of the 2.3M clusters were considered dark clusters, yet I cannot find the remaining annotated clusters in your data at afdb-cluster.steineggerlab.workers.dev.
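The labeling step described above can be sketched as a simple lookup. This is a hedged illustration, not the maintainers' procedure: it assumes the annotation file has been decompressed to a tab-separated file whose first two columns are the representative ID and the isDark flag (an assumption inferred from the file name), and that the representatives of interest are listed one per line using the same ID format. The function name label_dark is made up for this example.

```shell
# Hypothetical helper, not part of Foldseek or the AFDB cluster release.
# $1: decompressed annotation TSV; assumed columns: repId<TAB>isDark<TAB>...
# $2: file with one representative ID per line
# Output: repId<TAB>isDark, or repId<TAB>NA when the ID is absent from $1.
label_dark() {
    awk -F'\t' '
        # First file: remember the isDark flag of every representative.
        FILENAME == ARGV[1] { dark[$1] = $2; next }
        # Second file: look up each requested representative.
        { print $1 "\t" (($1 in dark) ? dark[$1] : "NA") }
    ' "$1" "$2"
}

# Example usage (file names are placeholders):
#   gzip -dc annotation.tsv.gz > annotation.tsv
#   label_dark annotation.tsv shared_reps.txt > shared_reps_labeled.tsv
```

Rows labeled NA are representatives that could not be matched at all, which is worth separating from clusters that are present but lack Pfam or GO entries in the other files.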

Thank you in advance for your help!
