possible change in uchime*_denovo results between v2.22 and v2.29 #591

frederic-mahe · 2025-02-07T12:49:53Z

Tests developed by @colinbrislawn (see qiime2/q2-vsearch#100) show different chimera detection results when using the --uchime*_denovo commands, with vsearch v2.22.1.

With vsearch 2.29.3, the results are different. This should be investigated.

torognes · 2025-02-07T13:31:53Z

Seems like the changes happened in commit c2ffd0e on 24 February 2023. The previous commit b90cdc1 is okay. This is between version 2.22.1 and 2.23.0.

torognes · 2025-02-07T14:38:31Z

By closer inspection, it seems like I introduced the bug already in commit c5f1645 on 13 Feb 2023. The code diverged into separate branches at this point, so the dates are not always informative. Commit cb43bc7 from 9 February and earlier seems okay.

It appears to be related to the experimental chimera detection in long sequences that was introduced at this point. It should have been independent of the other chimera detection algorithms, but unfortunately it seems to have disturbed them.

torognes · 2025-02-07T18:23:53Z

Changes have been introduced already in commit 22ffa0e giving different results than the previous commit, eca02bc. Seems like the window size for smoothing has been changed from 32 to 64. The calculation of the smoothed score may also have been changed.

torognes · 2025-02-10T16:56:56Z

There seem to be two reasons why results from version 2.22.1 and earlier are different from later versions.

The first reason is that the size of the smoothing window was increased from 32 to 64 nucleotides, starting in commit 22ffa0e. This was an unintended change. It didn't actually affect the test example that @colinbrislawn provided, but it has changed the results in other cases.

The second reason is that chimera parent candidates that had no "winning" positions at all in the smoothed window may have been evaluated in version 2.22.1 and earlier. This was an error that was actually corrected in version 2.23.1 and later.

Here is an alignment of the specific case that showed different behavior in the different versions (using the --uchimeout option):

Query   (  160 nt) derep_16;size=146
ParentA (  380 nt) derep_2;size=485
ParentB (  164 nt) derep_5;size=315

A     1 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
Q     1 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
B     1 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
Diffs                                                                                   
Votes                                                                                   
Model   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

A    81 ACCGCGTGCACTGGTCCGGCCGG----gcctTTcccTcTgTgGAAccCcAtaccCTtCactgGgcgTgGCggggaaacAG 156
Q    81 ACCGCGTGCACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGC------CAAG 154
B    81 ACCGCGTGtACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGC------CAAG 154
Diffs           A                  BBBB  BBB B B B   BB B BBBB  B BBBB BBB B        BB  
Votes           +                   +++  +++ + + +   ++ + ++++  + ++++ +++ +         +  
Model   AAAAAAAAAxxxxxxxxxxxxxxxxxxBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

A   157 GAcaTTtactttgaaaaaattagagtgctccaggcaggcctatgctcgaatacattagcatggaataataaaataggacg 236
Q   155 GATGTT-------------------------------------------------------------------------- 160
B   155 GATGTTttca---------------------------------------------------------------------- 164
Diffs     BB                                                                            
Votes     ++                                                                            
Model   BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

A   237 cgcggttctattttgttggtttataggaccgccgtaatgattaatagggacagtcgggggcatcagtattcaactgtcag 316
Q   161 -------------------------------------------------------------------------------- 160
B   165 -------------------------------------------------------------------------------- 164
Diffs                                                                                   
Votes                                                                                   
Model   BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

A   317 aggtgaaattcttggatcagttgaagactaactactgcgaaagcatttgccaaggatgttttca 380
Q   161 ---------------------------------------------------------------- 160
B   165 ---------------------------------------------------------------- 164
Diffs                                                                   
Votes                                                                   
Model   BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Ids.  QA 81.5%, QB 99.3%, AB 80.8%, QModel 100.0%, Div. +0.7%
Diffs Left 1: N 0, A 0, Y 1 (100.0%); Right 28: N 0, A 0, Y 28 (100.0%), Score 0.2232

Here, sequence Q is the query sequence, B is the best scoring candidate, and A is another candidate. There is only a single position where A is more similar to Q than B: position 89. After smoothing out matches within a window of 32 (or 64) nucleotides, A has no winning positions at all. A therefore shouldn't really have been considered. In most cases I think sequences like sequence A would have obtained a very low score and would have been eliminated anyway. Here the score is 0.2232, which is below the default limit of 0.28 in uchime_denovo. Also, B has just a single difference (less than the required 3) and is barely different from Q (0.7%), which is not different enough (the default limit is 0.8%) according to the rules for uchime_denovo. Q is therefore not reported as a chimera by uchime_denovo. In uchime2_denovo the score and the other thresholds are ignored, and Q is considered a chimera because the model matches the query perfectly while both A and B are different from Q. In uchime3_denovo it is rejected because the abundance ratios are too low compared to the abskew parameter.
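The two uchime_denovo thresholds quoted above can be written down as a small check. This is only a sketch and not the vsearch code: the function and parameter names are made up for illustration, the defaults 0.28 and 0.8 are the ones mentioned above, the use of `>=` is an assumption, and both the minimum number of differences and the borderline category are left out.

```python
def passes_uchime_denovo_thresholds(score, divergence_pct,
                                    min_score=0.28, min_div_pct=0.8):
    """Return True if both the score and the divergence reach their limits.

    Hypothetical helper: names and the >= comparisons are assumptions,
    only the two default limits come from the discussion above.
    """
    return score >= min_score and divergence_pct >= min_div_pct


# The case discussed above: score 0.2232 and divergence 0.7%.
print(passes_uchime_denovo_thresholds(0.2232, 0.7))  # False: both are below the limits
```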

I think the right thing to do is to adjust the window size back to 32 as it originally was, but otherwise keep it as it is now.

I'll do some testing to see the effect of the changes. Preliminary tests with uchime_denovo indicate that the window size increase had a minor effect and reduced the number of chimeras somewhat, while the other change had negligible effect.

colinbrislawn · 2025-02-10T17:30:50Z

Thank you for your prompt investigation, Torbjørn!

Related, can you recommend a toy alignment we should use for testing in Qiime2? In that other thread I was looking for a child and two parents that produced different results for uchime, uchime2, and uchime3, so we could distinguish the algorithms by their results.

torognes · 2025-02-13T08:20:38Z

Hi @colinbrislawn,

Here is a set of toy sequences that should give different results with uchime_denovo, uchime2_denovo and uchime3_denovo:

>A;size=5
AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC
ACCGCATGTACTGGTGCGGCCGGTGAACTTCTTGGATTAATTGAAGACTAGCTACTGTGAAAGCAATTGCCAACGATGTT
>B;size=5
AGCTGCAATAGCGTTTATTAACGTTGTTGAGGTTAAAAAGTTCGTAGTAGAACCTTGGGACTGGCAGGCCCGTCCGCGTC
ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT
>P;size=1
AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC
ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT
>Q;size=1
AGCTCCAATAGCGTATACTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC
ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAGAGCATTTGCCAAGGATGTT

I started with sequence P of 160 bp. It was then duplicated into Q, A, and B. In A I introduced a few substitutions in the second half of the sequence, while in B I introduced a few substitutions in the first half of the sequence. Q is identical to P except for 1 substitution in each half of the sequence, not in the same positions as the substitutions in A or B. P and Q have an abundance of 1, while A and B have an abundance of 5.
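To double-check the construction described above, here is a small Python sketch that counts substitutions per half of the sequence, relative to P. The helper names (`read_fasta`, `diffs_per_half`) are hypothetical and the sequences are simply copied from the toy set.

```python
TOY_FASTA = """\
>A;size=5
AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC
ACCGCATGTACTGGTGCGGCCGGTGAACTTCTTGGATTAATTGAAGACTAGCTACTGTGAAAGCAATTGCCAACGATGTT
>B;size=5
AGCTGCAATAGCGTTTATTAACGTTGTTGAGGTTAAAAAGTTCGTAGTAGAACCTTGGGACTGGCAGGCCCGTCCGCGTC
ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT
>P;size=1
AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC
ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT
>Q;size=1
AGCTCCAATAGCGTATACTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC
ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAGAGCATTTGCCAAGGATGTT
"""


def read_fasta(text):
    """Parse FASTA text into {short name: sequence}, dropping the ;size= part."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split(";")[0]
            records[name] = ""
        else:
            records[name] += line
    return records


def diffs_per_half(a, b):
    """Count mismatches between two equal-length sequences, per half."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    half = len(a) // 2
    left = sum(x != y for x, y in zip(a[:half], b[:half]))
    right = sum(x != y for x, y in zip(a[half:], b[half:]))
    return left, right


seqs = read_fasta(TOY_FASTA)
for name in ("A", "B", "Q"):
    print(name, diffs_per_half(seqs["P"], seqs[name]))
# A (0, 8)
# B (10, 0)
# Q (1, 1)

# P is exactly the first half of A followed by the second half of B.
print(seqs["A"][:80] + seqs["B"][80:] == seqs["P"])  # True
```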

P and Q could be chimeras with A and B as parents. Using uchime_denovo both P and Q will be detected as chimeras due to the similarities in each end of the sequences. With uchime2_denovo only P will be recognized because it matches the model of the combined A and B perfectly, while Q has mismatches and will not be recognized. In both cases the abundance skew (ratio) is 5, which is more than the required 2. However, in uchime3_denovo the abundance skew must be at least 16 by default, which is not the case here, and no chimeras will be detected.
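The abundance skew argument can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, the skew is taken here as the abundance of the least abundant parent divided by the abundance of the query, and the `>=` comparison is an assumption. The limits 2 and 16 are the defaults mentioned above.

```python
import re


def size_of(label):
    """Extract the abundance from a label such as 'A;size=5'."""
    match = re.search(r";size=(\d+)", label)
    if match is None:
        raise ValueError(f"no ;size= annotation in {label!r}")
    return int(match.group(1))


def abundance_skew(parent_labels, query_label):
    """Ratio of the least abundant parent to the query (assumed definition)."""
    return min(size_of(p) for p in parent_labels) / size_of(query_label)


skew = abundance_skew(["A;size=5", "B;size=5"], "P;size=1")
print(skew)          # 5.0
print(skew >= 2.0)   # True: enough for uchime_denovo and uchime2_denovo
print(skew >= 16.0)  # False: not enough for the uchime3_denovo default
```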

torognes · 2025-02-13T08:24:07Z

Here are the results of uchime_denovo.

Command:

% vsearch --uchime_denovo toy.fasta --uchimeout uchimeout.tsv --uchimealn uchimealn.txt ; cat uchimeout.tsv ; cat uchimealn.txt

Output:

vsearch v2.29.3_macos_aarch64, 32.0GB RAM, 12 cores
https://github.com/torognes/vsearch

Reading file toy.fasta 100%  
640 nt in 4 seqs, min 160, max 160, avg 160
Masking 100% 
Sorting by abundance 100%
Counting k-mers 100% 
Detecting chimeras 100%  
Found 2 (50.0%) chimeras, 2 (50.0%) non-chimeras,
and 0 (0.0%) borderline sequences in 4 unique sequences.
Taking abundance information into account, this corresponds to
2 (16.7%) chimeras, 10 (83.3%) non-chimeras,
and 0 (0.0%) borderline sequences in 12 total sequences.

Results in uchimeout.tsv:

0.0000	A;size=5	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N
0.0000	B;size=5	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N
0.6378	P;size=1	A;size=5	B;size=5	A;size=5	100.0	95.0	93.8	88.8	95.0	10	0	0	8	0	0	5.0	Y
0.5375	Q;size=1	A;size=5	B;size=5	A;size=5	98.8	93.8	92.5	88.8	93.8	10	0	1	8	0	1	5.0	Y

Results in uchimealn.txt:

------------------------------------------------------------------------
Query   (  160 nt) P;size=1
ParentA (  160 nt) A;size=5
ParentB (  160 nt) B;size=5

A     1 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
Q     1 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
B     1 AGCTgCAATAGCGTtTATTAAcGTTGTTGaGGTTAAAAAGtTCGTAGTaGAACCTTGGGaCTGGCaGGCCcGTCCGCgTC 80
Diffs       A         A      A       A          A       A          A     A    A      A  
Votes       +         +      +       +          +       +          +     +    +      +  
Model   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAxx

A    81 ACCGCaTGTACTGGTgCGGCCGGTGAAcTTCTTGGATTaATTGAAGACTAgCTACTGtGAAAGCAaTTGCCAAcGATGTT 160
Q    81 ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT 160
B    81 ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT 160
Diffs        B         B           B          B           B      B       B       B      
Votes        +         +           +          +           +      +       +       +      
Model   xxxxxBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Ids.  QA 95.0%, QB 93.8%, AB 88.8%, QModel 100.0%, Div. +5.3%
Diffs Left 10: N 0, A 0, Y 10 (100.0%); Right 8: N 0, A 0, Y 8 (100.0%), Score 0.6378

------------------------------------------------------------------------
Query   (  160 nt) Q;size=1
ParentA (  160 nt) A;size=5
ParentB (  160 nt) B;size=5

A     1 AGCTCCAATAGCGTATAtTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
Q     1 AGCTCCAATAGCGTATACTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
B     1 AGCTgCAATAGCGTtTAtTAAcGTTGTTGaGGTTAAAAAGtTCGTAGTaGAACCTTGGGaCTGGCaGGCCcGTCCGCgTC 80
Diffs       A         A  N   A       A          A       A          A     A    A      A  
Votes       +         +  0   +       +          +       +          +     +    +      +  
Model   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAxx

A    81 ACCGCaTGTACTGGTgCGGCCGGTGAAcTTCTTGGATTaATTGAAGACTAgCTACTGtGAaAGCAaTTGCCAAcGATGTT 160
Q    81 ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAGAGCATTTGCCAAGGATGTT 160
B    81 ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAaAGCATTTGCCAAGGATGTT 160
Diffs        B         B           B          B           B      B  N    B       B      
Votes        +         +           +          +           +      +  0    +       +      
Model   xxxxxBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Ids.  QA 93.8%, QB 92.5%, AB 88.8%, QModel 98.8%, Div. +5.3%
Diffs Left 11: N 0, A 1, Y 10 (90.9%); Right 9: N 0, A 1, Y 8 (88.9%), Score 0.5375
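For reading the rows of uchimeout.tsv programmatically, a minimal Python sketch follows. The field names are my reading of the 18-column layout described in the vsearch manual for --uchimeout, so treat them as an assumption and check the manual. The function name is hypothetical.

```python
# Assumed names for the 18 tab-separated fields of --uchimeout.
FIELDS = ["score", "Q", "A", "B", "T",
          "idQM", "idQA", "idQB", "idAB", "idQT",
          "LY", "LN", "LA", "RY", "RN", "RA",
          "div", "YN"]


def parse_uchimeout_line(line):
    """Split one uchimeout line into a dict keyed by the assumed field names."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


# The row for P from the results above (spaces turned into tabs).
row_text = ("0.6378 P;size=1 A;size=5 B;size=5 A;size=5 "
            "100.0 95.0 93.8 88.8 95.0 10 0 0 8 0 0 5.0 Y").replace(" ", "\t")
row = parse_uchimeout_line(row_text)

print(row["Q"], row["score"], row["YN"])             # P;size=1 0.6378 Y
# The div column matches idQM minus idQT for this row.
print(float(row["idQM"]) - float(row["idQT"]))       # 5.0
```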

torognes · 2025-02-13T08:25:53Z

Here are the results of uchime2_denovo.

Command:

% vsearch --uchime2_denovo toy.fasta --uchimeout uchimeout.tsv --uchimealn uchimealn.txt ; cat uchimeout.tsv ; cat uchimealn.txt

Output:

vsearch v2.29.3_macos_aarch64, 32.0GB RAM, 12 cores
https://github.com/torognes/vsearch

Reading file toy.fasta 100%  
640 nt in 4 seqs, min 160, max 160, avg 160
Masking 100% 
Sorting by abundance 100%
Counting k-mers 100% 
Detecting chimeras 100%  
Found 1 (25.0%) chimeras, 3 (75.0%) non-chimeras,
and 0 (0.0%) borderline sequences in 4 unique sequences.
Taking abundance information into account, this corresponds to
1 (8.3%) chimeras, 11 (91.7%) non-chimeras,
and 0 (0.0%) borderline sequences in 12 total sequences.

Results in uchimeout.tsv:

0.0000	A;size=5	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N
0.0000	B;size=5	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N
0.6378	P;size=1	A;size=5	B;size=5	A;size=5	100.0	95.0	93.8	88.8	95.0	10	0	0	8	0	0	5.0	Y
0.5375	Q;size=1	A;size=5	B;size=5	A;size=5	98.8	93.8	92.5	88.8	93.8	10	0	1	8	0	1	5.0	N

Results in uchimealn.txt:

------------------------------------------------------------------------
Query   (  160 nt) P;size=1
ParentA (  160 nt) A;size=5
ParentB (  160 nt) B;size=5

A     1 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
Q     1 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC 80
B     1 AGCTgCAATAGCGTtTATTAAcGTTGTTGaGGTTAAAAAGtTCGTAGTaGAACCTTGGGaCTGGCaGGCCcGTCCGCgTC 80
Diffs       A         A      A       A          A       A          A     A    A      A  
Votes       +         +      +       +          +       +          +     +    +      +  
Model   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAxx

A    81 ACCGCaTGTACTGGTgCGGCCGGTGAAcTTCTTGGATTaATTGAAGACTAgCTACTGtGAAAGCAaTTGCCAAcGATGTT 160
Q    81 ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT 160
B    81 ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT 160
Diffs        B         B           B          B           B      B       B       B      
Votes        +         +           +          +           +      +       +       +      
Model   xxxxxBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Ids.  QA 95.0%, QB 93.8%, AB 88.8%, QModel 100.0%, Div. +5.3%
Diffs Left 10: N 0, A 0, Y 10 (100.0%); Right 8: N 0, A 0, Y 8 (100.0%), Score 0.6378

torognes · 2025-02-13T08:27:49Z

Here are the results of uchime3_denovo.

Command:

% vsearch --uchime3_denovo toy.fasta --uchimeout uchimeout.tsv --uchimealn uchimealn.txt ; cat uchimeout.tsv ; cat uchimealn.txt

Output:

vsearch v2.29.3_macos_aarch64, 32.0GB RAM, 12 cores
https://github.com/torognes/vsearch

Reading file toy.fasta 100%  
640 nt in 4 seqs, min 160, max 160, avg 160
Masking 100% 
Sorting by abundance 100%
Counting k-mers 100% 
Detecting chimeras 100%  
Found 0 (0.0%) chimeras, 4 (100.0%) non-chimeras,
and 0 (0.0%) borderline sequences in 4 unique sequences.
Taking abundance information into account, this corresponds to
0 (0.0%) chimeras, 12 (100.0%) non-chimeras,
and 0 (0.0%) borderline sequences in 12 total sequences.

Results in uchimeout.tsv:

0.0000	A;size=5	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N
0.0000	B;size=5	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N
0.0000	P;size=1	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N
0.0000	Q;size=1	*	*	*	*	*	*	*	*	0	0	0	0	0	0	*	N

The file uchimealn.txt is empty.
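For a regression test, the three runs above boil down to the chimera verdict (last column) for each sequence. Here is a sketch of how that could be checked in Python. The function name and the `EXPECTED` table are hypothetical, the expected values are just the Y/N columns of the three uchimeout.tsv files above, and actually running vsearch is left out.

```python
# Verdicts taken from the last column of the three uchimeout.tsv files above.
EXPECTED = {
    "uchime_denovo":  {"A": "N", "B": "N", "P": "Y", "Q": "Y"},
    "uchime2_denovo": {"A": "N", "B": "N", "P": "Y", "Q": "N"},
    "uchime3_denovo": {"A": "N", "B": "N", "P": "N", "Q": "N"},
}


def verdicts(uchimeout_text):
    """Map each query name (without ;size=) to the value of the last column."""
    result = {}
    for line in uchimeout_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        result[fields[1].split(";")[0]] = fields[-1]
    return result


# The uchime3_denovo output above, rebuilt with explicit tabs.
no_hit = "\t".join(["*"] * 8 + ["0"] * 6 + ["*", "N"])
uchime3_text = "\n".join(
    f"0.0000\t{label}\t{no_hit}"
    for label in ("A;size=5", "B;size=5", "P;size=1", "Q;size=1")
)

print(verdicts(uchime3_text) == EXPECTED["uchime3_denovo"])  # True
# The toy set is useful because the three verdict tables are all different.
print(len({tuple(sorted(v.items())) for v in EXPECTED.values()}))  # 3
```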

torognes · 2025-02-13T10:14:28Z

I have tested the uchime_denovo command on the BioMarKs dataset (312503 sequences) using four variants of vsearch version 2.29.3: with and without the bug, and with window sizes of 32 and 64.

Here are the results:

Window	Bug	Chimeras	Borderline	Non-chimeras
32	N	12730 (4.1%)	8258 (2.6%)	291515 (93.3%)
32	Y	12731 (4.1%)	8261 (2.6%)	291511 (93.3%)
64	N	12104 (3.9%)	6546 (2.1%)	293853 (94.0%)
64	Y	12138 (3.9%)	6562 (2.1%)	293803 (94.0%)

As can be seen from the table, the bug seems to increase the number of chimera predictions by a very small amount (especially for window size 32). These may be false positives, but they have passed the score and divergence tests.

The longer window size had a larger effect and has reduced the number of positive predictions somewhat. It might have increased the number of false negatives.

The real amount of chimeras in this dataset is unknown, as far as I know.

torognes · 2025-02-14T13:40:23Z

I also tested uchime2_denovo and uchime3_denovo directly on the same dataset, even though the algorithms are designed for sequences with fewer errors.

Here are the numbers for uchime2_denovo:

Window	Bug	Chimeras	Non-chimeras
32	N	52034 (16.7%)	260469 (83.3%)
32	Y	52435 (16.8%)	260068 (83.2%)
64	N	49039 (15.7%)	263464 (84.3%)
64	Y	49838 (15.9%)	262665 (84.1%)

And here are the numbers for uchime3_denovo:

Window	Bug	Chimeras	Non-chimeras
32	N	17229 (5.5%)	295274 (94.5%)
32	Y	17459 (5.6%)	295044 (94.4%)
64	N	15973 (5.1%)	296530 (94.9%)
64	Y	16448 (5.3%)	296055 (94.7%)

In both cases, it seemed like the bug had a larger effect on the number of predicted chimeras than for uchime_denovo. But the increased window size again had the largest effect.
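To make the comparison concrete, the differences in chimera counts can be computed directly from the three tables (uchime_denovo, uchime2_denovo and uchime3_denovo on the BioMarKs dataset). A short Python sketch, where the dictionary layout and the function name are just for illustration:

```python
# Chimera counts from the tables, keyed by (window size, bug present).
CHIMERAS = {
    "uchime_denovo":  {(32, False): 12730, (32, True): 12731,
                       (64, False): 12104, (64, True): 12138},
    "uchime2_denovo": {(32, False): 52034, (32, True): 52435,
                       (64, False): 49039, (64, True): 49838},
    "uchime3_denovo": {(32, False): 17229, (32, True): 17459,
                       (64, False): 15973, (64, True): 16448},
}


def effects(counts):
    """Differences in chimera counts due to the bug and to the window size."""
    return {
        "bug_at_32": counts[(32, True)] - counts[(32, False)],
        "bug_at_64": counts[(64, True)] - counts[(64, False)],
        "window_64_vs_32_no_bug": counts[(64, False)] - counts[(32, False)],
    }


for command, counts in CHIMERAS.items():
    print(command, effects(counts))
# uchime_denovo {'bug_at_32': 1, 'bug_at_64': 34, 'window_64_vs_32_no_bug': -626}
# uchime2_denovo {'bug_at_32': 401, 'bug_at_64': 799, 'window_64_vs_32_no_bug': -2995}
# uchime3_denovo {'bug_at_32': 230, 'bug_at_64': 475, 'window_64_vs_32_no_bug': -1256}
```

In all three cases the change from window size 32 to 64 moves the count more than the bug does.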

torognes · 2025-02-14T15:04:59Z

Version 2.29.4 has been released with a fix for the window size.

https://github.com/torognes/vsearch/releases/tag/v2.29.4

frederic-mahe self-assigned this Feb 7, 2025

torognes self-assigned this Feb 7, 2025
