
possible change in uchime*_denovo results between v2.22 and v2.29 #591

Open
frederic-mahe opened this issue Feb 7, 2025 · 12 comments

@frederic-mahe
Collaborator

Tests developed by @colinbrislawn (see qiime2/q2-vsearch#100) show that the --uchime*_denovo commands give different chimera detection results in vsearch v2.22.1 and in vsearch v2.29.3. This should be investigated.

@frederic-mahe frederic-mahe self-assigned this Feb 7, 2025
@torognes
Owner

torognes commented Feb 7, 2025

Seems like the changes happened in commit c2ffd0e on 24 February 2023. The previous commit b90cdc1 is okay. This is between version 2.22.1 and 2.23.0.

@torognes torognes self-assigned this Feb 7, 2025
@torognes
Owner

torognes commented Feb 7, 2025

On closer inspection, it seems I had already introduced the bug in commit c5f1645 on 13 Feb 2023. The code diverged into separate branches at this point, so the dates are not always informative. Commit cb43bc7 from 9 February and earlier commits seem okay.

It appears to be related to the experimental chimera detection in long sequences that was introduced at this point. It should have been independent of the other chimera detection algorithms, but unfortunately it seems to have disturbed them.

@torognes
Owner

torognes commented Feb 7, 2025

Changes were already introduced in commit 22ffa0e, which gives different results from the previous commit, eca02bc. It seems the window size for smoothing was changed from 32 to 64. The calculation of the smoothed score may also have been changed.

@torognes
Owner

There seem to be two reasons why results from version 2.22.1 and earlier are different from later versions.

The first reason is that the size of the smoothing window was increased from 32 to 64 nucleotides, starting in commit 22ffa0e. This was an unintended change. It didn't actually affect the test example that @colinbrislawn provided, but it has changed the results in other cases.

The second reason is that chimera parent candidates that had no "winning" positions at all in the smoothed window may have been evaluated in version 2.22.1 and earlier. This was an error that was actually corrected in version 2.23.1 and later.

Here is an alignment of the specific case that showed different behavior in the different versions (using the --uchimeout option):

Query   (  160 nt) derep_16;size=146
ParentA (  380 nt) derep_2;size=485
ParentB (  164 nt) derep_5;size=315

A     1 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
Q     1 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
B     1 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
Diffs                                                                                   
Votes                                                                                   
Model   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

A    81 ACCGCGTGCACTGGTCCGGCCGG----gcctTTcccTcTgTgGAAccCcAtaccCTtCactgGgcgTgGCggggaaacAG 156
Q    81 ACCGCGTGCACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGC------CAAG 154
B    81 ACCGCGTGtACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGC------CAAG 154
Diffs           A                  BBBB  BBB B B B   BB B BBBB  B BBBB BBB B        BB  
Votes           +                   +++  +++ + + +   ++ + ++++  + ++++ +++ +         +  
Model   AAAAAAAAAxxxxxxxxxxxxxxxxxxBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

A   157 GAcaTTtactttgaaaaaattagagtgctccaggcaggcctatgctcgaatacattagcatggaataataaaataggacg 236
Q   155 GATGTT-------------------------------------------------------------------------- 160
B   155 GATGTTttca---------------------------------------------------------------------- 164
Diffs     BB                                                                            
Votes     ++                                                                            
Model   BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

A   237 cgcggttctattttgttggtttataggaccgccgtaatgattaatagggacagtcgggggcatcagtattcaactgtcag 316
Q   161 -------------------------------------------------------------------------------- 160
B   165 -------------------------------------------------------------------------------- 164
Diffs                                                                                   
Votes                                                                                   
Model   BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

A   317 aggtgaaattcttggatcagttgaagactaactactgcgaaagcatttgccaaggatgttttca 380
Q   161 ---------------------------------------------------------------- 160
B   165 ---------------------------------------------------------------- 164
Diffs                                                                   
Votes                                                                   
Model   BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Ids.  QA 81.5%, QB 99.3%, AB 80.8%, QModel 100.0%, Div. +0.7%
Diffs Left 1: N 0, A 0, Y 1 (100.0%); Right 28: N 0, A 0, Y 28 (100.0%), Score 0.2232

Here, sequence Q is the query, B is the best-scoring candidate, and A is another candidate. There is only a single position where A is more similar to Q than B is: position 89. After smoothing out matches within a window of 32 (or 64) nucleotides, A has no winning positions at all, so A shouldn't really have been considered. In most cases, I think sequences like A would have obtained a very low score and would have been eliminated anyway.

Here the score is 0.2232, which is below the default limit of 0.28 in uchime_denovo. Also, B has just a single difference (fewer than the required 3) and is barely different from Q (0.7%), which is not different enough according to the rules for uchime_denovo (the default limit is 0.8%). Q is therefore not classified as a chimera by uchime_denovo. In uchime2_denovo the score etc. is ignored, and Q is considered a chimera because the model matches the query perfectly while both A and B are different from Q. In uchime3_denovo Q is not classified as a chimera because the abundance ratios are too low compared to the abskew parameter.
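
The three default uchime_denovo limits quoted above can be checked against this alignment with a small sketch. The names `min_score`, `min_diffs`, and `min_div` are hypothetical labels for the limits mentioned in the text, not vsearch identifiers, and the function only reports which limits are missed. How vsearch combines them internally is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical names; the values are the defaults quoted in the comment above.
    min_score: float = 0.28  # default score limit for uchime_denovo
    min_diffs: int = 3       # required number of differences
    min_div: float = 0.8     # default divergence limit, in percent


def failed_criteria(score, diffs, div, t=Thresholds()):
    """Return the names of the limits that a candidate chimera does not reach."""
    failed = []
    if score < t.min_score:
        failed.append("score")
    if diffs < t.min_diffs:
        failed.append("diffs")
    if div < t.min_div:
        failed.append("divergence")
    return failed


# Values read from the alignment above: score 0.2232, 1 difference, Div. +0.7%
print(failed_criteria(0.2232, 1, 0.7))  # ['score', 'diffs', 'divergence']
```

With these inputs all three limits are missed, which is consistent with Q not being reported as a chimera by uchime_denovo.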

I think the right thing to do is to adjust the window size back to 32 as it originally was, but otherwise keep it as it is now.

I'll do some testing to see the effect of the changes. Preliminary tests with uchime_denovo indicate that the window size increase had a minor effect and reduced the number of chimeras somewhat, while the other change had negligible effect.

@colinbrislawn
Contributor

Thank you for your prompt investigation, Torbjørn!

Related: can you recommend a toy alignment we should use for testing in Qiime2? In that other thread I was looking for a child and two parents that produce different results for uchime, uchime2, and uchime3, so we could distinguish the algorithms by their results.

@torognes
Owner

Hi @colinbrislawn,

Here is a set of toy sequences that should give different results with uchime_denovo, uchime2_denovo and uchime3_denovo:

>A;size=5
AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC
ACCGCATGTACTGGTGCGGCCGGTGAACTTCTTGGATTAATTGAAGACTAGCTACTGTGAAAGCAATTGCCAACGATGTT
>B;size=5
AGCTGCAATAGCGTTTATTAACGTTGTTGAGGTTAAAAAGTTCGTAGTAGAACCTTGGGACTGGCAGGCCCGTCCGCGTC
ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT
>P;size=1
AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC
ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT
>Q;size=1
AGCTCCAATAGCGTATACTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC
ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAGAGCATTTGCCAAGGATGTT

I started with sequence P of 160 bp. It was then duplicated into Q, A, and B. In A I introduced a few substitutions in the second half of the sequence, while in B I introduced a few substitutions in the first half of the sequence. Q is identical to P except for 1 substitution in each half of the sequence, not in the same positions as the substitutions in A or B. P and Q have an abundance of 1, while A and B have an abundance of 5.
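
The construction described above can be double-checked with a short sketch that counts substitutions relative to P in each 80 bp half. The sequences are copied from the toy FASTA above; the helper name `half_diffs` is made up for this illustration.

```python
# Toy sequences from the FASTA above (160 bp each, written as two 80 bp halves).
A = ("AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC"
     "ACCGCATGTACTGGTGCGGCCGGTGAACTTCTTGGATTAATTGAAGACTAGCTACTGTGAAAGCAATTGCCAACGATGTT")
B = ("AGCTGCAATAGCGTTTATTAACGTTGTTGAGGTTAAAAAGTTCGTAGTAGAACCTTGGGACTGGCAGGCCCGTCCGCGTC"
     "ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT")
P = ("AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC"
     "ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT")
Q = ("AGCTCCAATAGCGTATACTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC"
     "ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAGAGCATTTGCCAAGGATGTT")


def diff_positions(x, y):
    """0-based positions where two equal-length sequences differ."""
    return [i for i, (a, b) in enumerate(zip(x, y)) if a != b]


def half_diffs(x, y):
    """Number of differences in the first and second half of the sequences."""
    mid = len(x) // 2
    positions = diff_positions(x, y)
    return (sum(p < mid for p in positions), sum(p >= mid for p in positions))


for name, seq in (("A", A), ("B", B), ("Q", Q)):
    print(name, half_diffs(P, seq))
```

The counts agree with the Left/Right vote totals that vsearch reports for query P further down (10 and 8).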

P and Q could be chimeras with A and B as parents. Using uchime_denovo, both P and Q will be detected as chimeras due to the similarities at each end of the sequences. With uchime2_denovo only P will be recognized, because it matches the model of the combined A and B perfectly, while Q has mismatches and will not be recognized. In both cases the abundance skew (ratio) is 5, which is more than the required 2. However, in uchime3_denovo the abundance skew must be at least 16 by default, which is not the case here, and no chimeras will be detected.
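
The abundance skew argument can be written out as a small sketch. It assumes the skew is the abundance of the less abundant parent divided by the abundance of the query (with both parents at 5, the choice of parent makes no difference here). The required values 2 and 16 are the defaults quoted above.

```python
def abundance_skew(parent_a_size, parent_b_size, query_size):
    # Assumption: skew = abundance of the less abundant parent / query abundance.
    return min(parent_a_size, parent_b_size) / query_size


# Required skew per command, as quoted in the comment above.
required_skew = {
    "uchime_denovo": 2.0,
    "uchime2_denovo": 2.0,
    "uchime3_denovo": 16.0,
}

skew = abundance_skew(5, 5, 1)  # A;size=5, B;size=5, query P or Q with size=1
for command, required in required_skew.items():
    print(f"{command}: skew {skew} >= {required}? {skew >= required}")
```

Only the uchime3_denovo requirement is missed, which matches the outcome that it reports no chimeras for this toy set.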

@torognes
Owner

Here are the results of uchime_denovo.

Command:

% vsearch --uchime_denovo toy.fasta --uchimeout uchimeout.tsv --uchimealn uchimealn.txt ; cat uchimeout.tsv ; cat uchimealn.txt

Output:

vsearch v2.29.3_macos_aarch64, 32.0GB RAM, 12 cores
https://github.com/torognes/vsearch

Reading file toy.fasta 100%  
640 nt in 4 seqs, min 160, max 160, avg 160
Masking 100% 
Sorting by abundance 100%
Counting k-mers 100% 
Detecting chimeras 100%  
Found 2 (50.0%) chimeras, 2 (50.0%) non-chimeras,
and 0 (0.0%) borderline sequences in 4 unique sequences.
Taking abundance information into account, this corresponds to
2 (16.7%) chimeras, 10 (83.3%) non-chimeras,
and 0 (0.0%) borderline sequences in 12 total sequences.

Results in uchimeout.tsv:

0.0000	A;size=5	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N
0.0000	B;size=5	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N
0.6378	P;size=1	A;size=5	B;size=5	A;size=5	100.0	95.0	93.8	88.8	95.0	10	0	0	8	0	0	5.0	Y
0.5375	Q;size=1	A;size=5	B;size=5	A;size=5	98.8	93.8	92.5	88.8	93.8	10	0	1	8	0	1	5.0	Y

Results in uchimealn.txt:

------------------------------------------------------------------------
Query   (  160 nt) P;size=1
ParentA (  160 nt) A;size=5
ParentB (  160 nt) B;size=5

A     1 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
Q     1 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
B     1 AGCTgCAATAGCGTtTATTAAcGTTGTTGaGGTTAAAAAGtTCGTAGTaGAACCTTGGGaCTGGCaGGCCcGTCCGCgTC 80
Diffs       A         A      A       A          A       A          A     A    A      A  
Votes       +         +      +       +          +       +          +     +    +      +  
Model   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAxx

A    81 ACCGCaTGTACTGGTgCGGCCGGTGAAcTTCTTGGATTaATTGAAGACTAgCTACTGtGAAAGCAaTTGCCAAcGATGTT 160
Q    81 ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT 160
B    81 ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT 160
Diffs        B         B           B          B           B      B       B       B      
Votes        +         +           +          +           +      +       +       +      
Model   xxxxxBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Ids.  QA 95.0%, QB 93.8%, AB 88.8%, QModel 100.0%, Div. +5.3%
Diffs Left 10: N 0, A 0, Y 10 (100.0%); Right 8: N 0, A 0, Y 8 (100.0%), Score 0.6378

------------------------------------------------------------------------
Query   (  160 nt) Q;size=1
ParentA (  160 nt) A;size=5
ParentB (  160 nt) B;size=5

A     1 AGCTCCAATAGCGTATAtTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
Q     1 AGCTCCAATAGCGTATACTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
B     1 AGCTgCAATAGCGTtTAtTAAcGTTGTTGaGGTTAAAAAGtTCGTAGTaGAACCTTGGGaCTGGCaGGCCcGTCCGCgTC 80
Diffs       A         A  N   A       A          A       A          A     A    A      A  
Votes       +         +  0   +       +          +       +          +     +    +      +  
Model   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAxx

A    81 ACCGCaTGTACTGGTgCGGCCGGTGAAcTTCTTGGATTaATTGAAGACTAgCTACTGtGAaAGCAaTTGCCAAcGATGTT 160
Q    81 ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAGAGCATTTGCCAAGGATGTT 160
B    81 ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAaAGCATTTGCCAAGGATGTT 160
Diffs        B         B           B          B           B      B  N    B       B      
Votes        +         +           +          +           +      +  0    +       +      
Model   xxxxxBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Ids.  QA 93.8%, QB 92.5%, AB 88.8%, QModel 98.8%, Div. +5.3%
Diffs Left 11: N 0, A 1, Y 10 (90.9%); Right 9: N 0, A 1, Y 8 (88.9%), Score 0.5375
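
The scores in these reports can be reproduced from the vote counts on the "Diffs" lines. This is a sketch that assumes the scoring formula of the original UCHIME publication, where each side of the breakpoint scores Y / (xn * (N + dn) + A) and the two sides are multiplied, with weights xn = 8.0 and dn = 1.4 (believed to be the defaults of the --xn and --dn options). The function name is made up for illustration.

```python
def uchime_score(left, right, xn=8.0, dn=1.4):
    """Score a chimera model from (Y, N, A) vote counts on each side.

    Assumes the original UCHIME formula; xn and dn are assumed defaults.
    """
    def side(yes, no, abstain):
        return yes / (xn * (no + dn) + abstain)

    return side(*left) * side(*right)


# (Y, N, A) counts for the left and right segments, read from the "Diffs" lines.
examples = {
    "derep_16": ((1, 0, 0), (28, 0, 0)),  # reported score 0.2232
    "P": ((10, 0, 0), (8, 0, 0)),         # reported score 0.6378
    "Q": ((10, 0, 1), (8, 0, 1)),         # reported score 0.5375
}

for name, (left, right) in examples.items():
    print(name, round(uchime_score(left, right), 4))
```

Under these assumptions the three computed values match the scores reported in this thread to four decimals.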

@torognes
Owner

Here are the results of uchime2_denovo.

Command:

% vsearch --uchime2_denovo toy.fasta --uchimeout uchimeout.tsv --uchimealn uchimealn.txt ; cat uchimeout.tsv ; cat uchimealn.txt

Output:

vsearch v2.29.3_macos_aarch64, 32.0GB RAM, 12 cores
https://github.com/torognes/vsearch

Reading file toy.fasta 100%  
640 nt in 4 seqs, min 160, max 160, avg 160
Masking 100% 
Sorting by abundance 100%
Counting k-mers 100% 
Detecting chimeras 100%  
Found 1 (25.0%) chimeras, 3 (75.0%) non-chimeras,
and 0 (0.0%) borderline sequences in 4 unique sequences.
Taking abundance information into account, this corresponds to
1 (8.3%) chimeras, 11 (91.7%) non-chimeras,
and 0 (0.0%) borderline sequences in 12 total sequences.

Results in uchimeout.tsv:

0.0000	A;size=5	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N
0.0000	B;size=5	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N
0.6378	P;size=1	A;size=5	B;size=5	A;size=5	100.0	95.0	93.8	88.8	95.0	10	0	0	8	0	0	5.0	Y
0.5375	Q;size=1	A;size=5	B;size=5	A;size=5	98.8	93.8	92.5	88.8	93.8	10	0	1	8	0	1	5.0	N

Results in uchimealn.txt:

------------------------------------------------------------------------
Query   (  160 nt) P;size=1
ParentA (  160 nt) A;size=5
ParentB (  160 nt) B;size=5

A     1 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
Q     1 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
B     1 AGCTgCAATAGCGTtTATTAAcGTTGTTGaGGTTAAAAAGtTCGTAGTaGAACCTTGGGaCTGGCaGGCCcGTCCGCgTC 80
Diffs       A         A      A       A          A       A          A     A    A      A  
Votes       +         +      +       +          +       +          +     +    +      +  
Model   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAxx

A    81 ACCGCaTGTACTGGTgCGGCCGGTGAAcTTCTTGGATTaATTGAAGACTAgCTACTGtGAAAGCAaTTGCCAAcGATGTT 160
Q    81 ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT 160
B    81 ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT 160
Diffs        B         B           B          B           B      B       B       B      
Votes        +         +           +          +           +      +       +       +      
Model   xxxxxBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Ids.  QA 95.0%, QB 93.8%, AB 88.8%, QModel 100.0%, Div. +5.3%
Diffs Left 10: N 0, A 0, Y 10 (100.0%); Right 8: N 0, A 0, Y 8 (100.0%), Score 0.6378
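
For automated tests, the verdicts of the two runs can be compared directly from the --uchimeout files. The sketch below reads the first field as the score, the second as the query label, and the last as the Y/N verdict, which is how the rows shown in this thread appear to be laid out. The rows for P and Q are copied from the two outputs above, with tabs shown as spaces.

```python
# Rows for P and Q from the uchime_denovo and uchime2_denovo outputs above.
UCHIME_ROWS = """
0.6378 P;size=1 A;size=5 B;size=5 A;size=5 100.0 95.0 93.8 88.8 95.0 10 0 0 8 0 0 5.0 Y
0.5375 Q;size=1 A;size=5 B;size=5 A;size=5 98.8 93.8 92.5 88.8 93.8 10 0 1 8 0 1 5.0 Y
"""
UCHIME2_ROWS = """
0.6378 P;size=1 A;size=5 B;size=5 A;size=5 100.0 95.0 93.8 88.8 95.0 10 0 0 8 0 0 5.0 Y
0.5375 Q;size=1 A;size=5 B;size=5 A;size=5 98.8 93.8 92.5 88.8 93.8 10 0 1 8 0 1 5.0 N
"""


def parse_uchimeout(text):
    """Map each query label to (score, verdict).

    split() works for both tabs and spaces because no field contains whitespace.
    """
    results = {}
    for line in text.strip().splitlines():
        fields = line.split()
        results[fields[1]] = (float(fields[0]), fields[-1])
    return results


uchime = parse_uchimeout(UCHIME_ROWS)
uchime2 = parse_uchimeout(UCHIME2_ROWS)

# Queries whose Y/N verdict differs between the two commands.
changed = [q for q in uchime if uchime[q][1] != uchime2[q][1]]
print(changed)  # ['Q;size=1']
```

Note that the score for Q is the same in both files; only the final verdict differs.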

@torognes
Owner

Here are the results of uchime3_denovo.

Command:

% vsearch --uchime3_denovo toy.fasta --uchimeout uchimeout.tsv --uchimealn uchimealn.txt ; cat uchimeout.tsv ; cat uchimealn.txt

Output:

vsearch v2.29.3_macos_aarch64, 32.0GB RAM, 12 cores
https://github.com/torognes/vsearch

Reading file toy.fasta 100%  
640 nt in 4 seqs, min 160, max 160, avg 160
Masking 100% 
Sorting by abundance 100%
Counting k-mers 100% 
Detecting chimeras 100%  
Found 0 (0.0%) chimeras, 4 (100.0%) non-chimeras,
and 0 (0.0%) borderline sequences in 4 unique sequences.
Taking abundance information into account, this corresponds to
0 (0.0%) chimeras, 12 (100.0%) non-chimeras,
and 0 (0.0%) borderline sequences in 12 total sequences.

Results in uchimeout.tsv:

0.0000	A;size=5	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N
0.0000	B;size=5	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N
0.0000	P;size=1	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N
0.0000	Q;size=1	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N

The file uchimealn.txt is empty.

@torognes
Owner

I have tested the uchime_denovo command on the BioMarKs dataset (312503 sequences) with four different variants of vsearch version 2.29.3 with and without the bug and with window sizes of 32 and 64.

Here are the results:

Window  Bug  Chimeras       Borderline    Non-chimeras
32      N    12730 (4.1%)   8258 (2.6%)   291515 (93.3%)
32      Y    12731 (4.1%)   8261 (2.6%)   291511 (93.3%)
64      N    12104 (3.9%)   6546 (2.1%)   293853 (94.0%)
64      Y    12138 (3.9%)   6562 (2.1%)   293803 (94.0%)

As can be seen from the table, the bug seems to increase the number of chimera predictions by a very small amount (especially for window size 32). These may be false positives, but they have passed the score and divergence tests.

The longer window size had a larger effect and has reduced the number of positive predictions somewhat. It might have increased the number of false negatives.

The real number of chimeras in this dataset is unknown, as far as I know.
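
As a quick consistency check, the counts in the table above can be summed per row and turned back into percentages. This sketch only uses numbers from the table and the dataset size stated above; the variable names are made up.

```python
TOTAL = 312503  # number of sequences in the BioMarKs dataset, as stated above

# (window, bug) -> (chimeras, borderline, non-chimeras), copied from the table.
uchime_counts = {
    (32, "N"): (12730, 8258, 291515),
    (32, "Y"): (12731, 8261, 291511),
    (64, "N"): (12104, 6546, 293853),
    (64, "Y"): (12138, 6562, 293803),
}

for (window, bug), counts in uchime_counts.items():
    # Percentage of the dataset in each class, rounded to one decimal.
    percentages = [round(100 * c / TOTAL, 1) for c in counts]
    print(window, bug, sum(counts) == TOTAL, percentages)
```

Every row adds up to the full dataset, so the three classes partition the sequences in each variant.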

@torognes
Owner

I also tested uchime2_denovo and uchime3_denovo directly on the same dataset, even though the algorithms are designed for sequences with fewer errors.

Here are the numbers for uchime2_denovo:

Window  Bug  Chimeras        Non-chimeras
32      N    52034 (16.7%)   260469 (83.3%)
32      Y    52435 (16.8%)   260068 (83.2%)
64      N    49039 (15.7%)   263464 (84.3%)
64      Y    49838 (15.9%)   262665 (84.1%)

And here are the numbers for uchime3_denovo:

Window  Bug  Chimeras       Non-chimeras
32      N    17229 (5.5%)   295274 (94.5%)
32      Y    17459 (5.6%)   295044 (94.4%)
64      N    15973 (5.1%)   296530 (94.9%)
64      Y    16448 (5.3%)   296055 (94.7%)

In both cases, it seemed like the bug had a larger effect on the number of predicted chimeras than for uchime_denovo. But the increased window size again had the largest effect.
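
To put numbers on that comparison, the chimera counts from the three tables can be differenced. This sketch uses only the reported counts. "Bug effect" is the count with the bug minus the count without it at a given window size, and "window effect" is the count at window 32 minus the count at window 64 without the bug.

```python
# (window, bug) -> number of predicted chimeras, copied from the tables in this thread.
chimeras = {
    "uchime_denovo": {(32, "N"): 12730, (32, "Y"): 12731, (64, "N"): 12104, (64, "Y"): 12138},
    "uchime2_denovo": {(32, "N"): 52034, (32, "Y"): 52435, (64, "N"): 49039, (64, "Y"): 49838},
    "uchime3_denovo": {(32, "N"): 17229, (32, "Y"): 17459, (64, "N"): 15973, (64, "Y"): 16448},
}


def effects(counts):
    """Return (bug effect at window 32, bug effect at window 64, window effect)."""
    bug_32 = counts[(32, "Y")] - counts[(32, "N")]
    bug_64 = counts[(64, "Y")] - counts[(64, "N")]
    window = counts[(32, "N")] - counts[(64, "N")]
    return bug_32, bug_64, window


for command, counts in chimeras.items():
    print(command, effects(counts))
```

For all three commands the window effect is larger than either bug effect, in line with the summary above.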

@torognes
Owner

Version 2.29.4 has been released with a fix for the window size.

https://github.com/torognes/vsearch/releases/tag/v2.29.4
