Duplicate Foldseek hits for PDB 1hxz chain D searching pdb100 #376

Open
tomgoddard opened this issue Oct 29, 2024 · 2 comments

@tomgoddard

Expected Behavior

When searching the pdb100 database with a monomer query, I expect the Foldseek server (https://search.foldseek.com/api) not to return duplicate hits for the same deposited PDB chain with identical sequence alignments.

Current Behavior

Submitting PDB 1hxz chain D to the Foldseek server returns 4 PDB chains that are each reported twice in the results. The duplicates are equivalent copies of the same chain within a PDB assembly, and each pair has exactly the same sequence alignment: 1hxz-assembly1.cif.gz_D and 1hxz-assembly1.cif.gz_D-2, 1hxz-assembly1.cif.gz_C and 1hxz-assembly1.cif.gz_C-2, 1hxl-assembly1.cif.gz_C and 1hxl-assembly1.cif.gz_C-2, 1hxl-assembly1.cif.gz_D and 1hxl-assembly1.cif.gz_D-2.
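For anyone who wants to check their own results for the same pattern, here is a minimal sketch of how such pairs could be detected in an .m8 file. It is not part of Foldseek or ChimeraX. It assumes the file is tab-separated with the query in column 1 and the target in column 2, that a trailing `-<number>` on the target name marks an extra copy of a chain in the assembly (inferred only from the names listed above), and that two hits count as duplicates when every remaining column is identical. The names `base_chain` and `find_duplicate_hits` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a trailing "-<number>" (as in "..._D-2") marks an additional
# copy of a chain within the assembly. Inferred from the names in this report.
_COPY_SUFFIX = re.compile(r"-\d+$")


def base_chain(target):
    """Return the target identifier without a trailing copy suffix."""
    name = target.split()[0]  # drop any description after the identifier
    return _COPY_SUFFIX.sub("", name)


def find_duplicate_hits(m8_lines):
    """Group hits that differ only by the copy suffix on the target name.

    Returns a list of lists; each inner list holds the target identifiers
    that share a query, a base chain, and identical remaining columns.
    """
    groups = defaultdict(list)
    for line in m8_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2 or not fields[1].strip():
            continue  # not a hit row in the assumed layout
        query, target = fields[0], fields[1]
        # Everything after the target column is treated as the alignment signature.
        key = (query, base_chain(target), tuple(fields[2:]))
        groups[key].append(target.split()[0])
    return [names for names in groups.values() if len(names) > 1]
```

Passing an open file handle for the .m8 file (or any iterable of lines) to `find_duplicate_hits` returns the groups of suspected duplicates. Keeping only the first name in each group would be one way to filter them downstream, at the cost of hiding genuinely distinct copies if the suffix assumption is wrong.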

Steps to Reproduce (for bugs)

Go to the web site https://search.foldseek.com/search, use Load Accession with 1hxz, select only the PDB100 database, press Search, and look at the results for job_D.

Foldseek Output (for bugs)

I've attached the .m8 output file with the duplicate entries and a screenshot of the Foldseek server web page showing the duplicate results.

Context

I found the duplicate results using the ChimeraX Foldseek search capability but thought it would be easier for you to reproduce using the official foldseek web server.

Your Environment

Using the Foldseek web server https://search.foldseek.com/search. I don't see the Foldseek version anywhere on the server web site. It lists the PDB database as PDB100 20240101.

1hxz_duplicates.zip

@martin-steinegger
Collaborator

We probably did not cluster this because of their extremely short length. Is this causing issues in ChimeraX?

@tomgoddard
Author

It is not a serious problem. Duplicate results propagate through subsequent processing of the search results and waste time when a human ends up looking at them and has to figure out why there are duplicates or delete them. I like to fix even rare problems, since heavily used software that works correctly 99.999% of the time has only about half the value of software that works correctly 100% of the time.
