Hello,
I concatenated my database containing approximately 600k protein structures with the AlphaFold database I downloaded using foldseek databases. I then performed clustering on the concatenated database using default parameters, resulting in approximately 2.7 million clusters.
foldseek concatdbs /data/foldseek/openprot_proteins_db /data/foldseek/sp /data/foldseek/concat_db
foldseek concatdbs /data/foldseek/openprot_proteins_db_ca /data/foldseek/sp_ca /data/foldseek/concat_db_ca
foldseek concatdbs /data/foldseek/openprot_proteins_db_h /data/foldseek/sp_h /data/foldseek/concat_db_h
foldseek concatdbs /data/foldseek/openprot_proteins_db_ss /data/foldseek/sp_ss /data/foldseek/concat_db_ss
foldseek cluster /data/foldseek/concat_db /data/cluster_results $1/tmp_clusters -k 7 --threads 64
When comparing my clusters with yours, I noticed that only 270k clusters (about 10%) have the same representative protein as yours. The remaining clusters have different representative proteins. Do you know why there is this discrepancy? I am aware that in your analysis, the representative protein is chosen based on the highest pLDDT, and this is done by MMseqs2. However, I did not use MMseqs2 for clustering; I directly used Foldseek. Could this explain the difference in results? If so, on what criteria does Foldseek base its choice of representative protein?
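To make the comparison above concrete, here is a minimal sketch of how the overlap between the two sets of representatives could be counted. It assumes both inputs are tab-separated text with the representative ID in the first column (as in a two-column representative/member cluster TSV, and as the name of file 2 suggests), and that both files use the same ID format. The function names are hypothetical and not part of Foldseek.

```python
from typing import Dict, Iterable, Set


def read_representatives(lines: Iterable[str], column: int = 0) -> Set[str]:
    """Collect the distinct IDs found in one column of a tab-separated file.

    Assumption: the representative ID sits in `column` (0 by default).
    """
    reps: Set[str] = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column and fields[column]:
            reps.add(fields[column])
    return reps


def compare_representatives(mine: Set[str], theirs: Set[str]) -> Dict[str, int]:
    """Count representatives shared between two clusterings and unique to each."""
    return {
        "shared": len(mine & theirs),
        "mine_only": len(mine - theirs),
        "theirs_only": len(theirs - mine),
    }
```

With real files, the line iterables would come from `open(...)` or `gzip.open(..., "rt")`. If the two sources name entries differently, the IDs would have to be normalized before the comparison.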
Next, I took a closer look at the 270k clusters that share a representative with your results. I labeled each of them as "annotated" or "non-annotated" (dark clusters) using your file 2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv.gz. I then tried to attach the Pfam and GO annotations you provide to these shared clusters, using the files 4-domain-clustering.zip and 3-sapId_sapGO_repId_cluFlag_LCAtaxId.tsv.gz. However, I could not find annotations for all of the clusters that are marked as annotated. Could you please explain the content of these files in more detail? It seems that at the end of your analysis, only 700k of the 2.3M clusters were considered dark clusters, yet I cannot find the remaining annotated clusters in your data at afdb-cluster.steineggerlab.workers.dev.
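The labeling step described above can be sketched as follows. This is only an illustration under stated assumptions: file 2 is taken to be tab-separated without a header, with columns in the order its name suggests (repId, isDark, nMem, repLen, avgLen, repPlddt, avgPlddt, LCAtaxId), and isDark is assumed to be coded as "1" for dark clusters. Neither assumption is confirmed here, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Column order inferred from the file name; this is an assumption.
COLUMNS = ["repId", "isDark", "nMem", "repLen", "avgLen",
           "repPlddt", "avgPlddt", "LCAtaxId"]


def load_cluster_table(lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Index the rows of the cluster summary file by representative ID."""
    table: Dict[str, Dict[str, str]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip rows that do not match the assumed layout
        row = dict(zip(COLUMNS, fields))
        table[row["repId"]] = row
    return table


def split_by_darkness(
    reps: Iterable[str], table: Dict[str, Dict[str, str]]
) -> Tuple[List[str], List[str], List[str]]:
    """Sort representatives into annotated, dark, and not found in the table."""
    annotated: List[str] = []
    dark: List[str] = []
    missing: List[str] = []
    for rep in sorted(reps):
        row = table.get(rep)
        if row is None:
            missing.append(rep)
        elif row["isDark"] == "1":  # assumed encoding of the dark flag
            dark.append(rep)
        else:
            annotated.append(rep)
    return annotated, dark, missing
```

The `missing` list makes it visible when a shared representative has no row in the summary file, which is the kind of gap this question is about.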
Thank you in advance for your help!