TABLE 2.
Degree of overestimation due to intragenomic heterogeneity and of underestimation due to insufficient interspecific variation for different 16S rRNA gene regions at the ASV (100% identity) and 97% OTU levels
| Identity threshold | 16S gene region | HIQ-T-NCBI^a, no. of sequences | HIQ-T-NCBI^a, no. of OTUs | HIQ-C-NCBI^a, no. of sequences | HIQ-C-NCBI^a, no. of OTUs | Overestimation (%)^b | Underestimation (%)^b |
|---|---|---|---|---|---|---|---|
| 100% | Full-length | 29,416 | 15,727 | 6,550 | 6,131 | 156.5 | 6.4 |
| 100% | V1–V2 | 29,459 | 10,287 | 6,562 | 5,433 | 89.3 | 17.2 |
| 100% | V1–V3 | 28,883 | 11,623 | 6,339 | 5,523 | 110.5 | 12.9 |
| 100% | V3 | 29,246 | 5,467 | 6,554 | 4,060 | 34.6 | 38.0 |
| 100% | V3–V4 | 29,994 | 8,079 | 6,866 | 5,325 | 51.7 | 22.4 |
| 100% | V4 | 29,903 | 5,593 | 6,829 | 4,341 | 28.8 | 36.4 |
| 100% | V4–V5 | 30,020 | 5,636 | 6,890 | 4,392 | 28.3 | 36.2 |
| 100% | V5–V7 | 29,058 | 7,333 | 6,402 | 4,823 | 52.0 | 24.7 |
| 100% | V6 | 26,669 | 3,816 | 5,713 | 2,998 | 27.3 | 47.5 |
| 100% | V6–V8 | 30,027 | 8,310 | 6,890 | 5,407 | 53.7 | 21.5 |
| 100% | V7–V9 | 26,179 | 6,474 | 5,875 | 4,374 | 48.0 | 25.6 |
| 97% | Full-length | 29,416 | 3,181 | 6,550 | 3,035 | 4.8 | 53.7 |
| 97% | V1–V2 | 29,459 | 4,074 | 6,562 | 3,647 | 11.7 | 44.4 |
| 97% | V1–V3 | 28,883 | 3,788 | 6,339 | 3,478 | 8.9 | 45.1 |
| 97% | V3 | 29,246 | 2,715 | 6,554 | 2,556 | 6.2 | 61.0 |
| 97% | V3–V4 | 29,994 | 2,794 | 6,866 | 2,663 | 4.9 | 61.2 |
| 97% | V4 | 29,903 | 2,284 | 6,829 | 2,186 | 4.5 | 68.0 |
| 97% | V4–V5 | 30,020 | 2,314 | 6,890 | 2,217 | 4.4 | 67.8 |
| 97% | V5–V7 | 29,058 | 2,511 | 6,402 | 2,384 | 5.3 | 62.8 |
| 97% | V6 | 26,669 | 2,989 | 5,713 | 2,623 | 14.0 | 54.1 |
| 97% | V6–V8 | 30,027 | 2,692 | 6,890 | 2,566 | 4.9 | 62.8 |
| 97% | V7–V9 | 26,179 | 2,114 | 5,875 | 2,025 | 4.4 | 65.5 |
^a HIQ-T-NCBI, the data set constructed considering intragenomic heterogeneity; HIQ-C-NCBI, the data set constructed ruling out intragenomic heterogeneity.
^b The overestimation rate was calculated as (A – B)/B × 100%, where A is the number of OTUs from HIQ-T-NCBI and B is the number of OTUs from HIQ-C-NCBI. The underestimation rate was calculated as (C – B)/C × 100%, where C is the number of sequences in HIQ-C-NCBI.
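
As a quick check of the two formulas above, the following minimal Python sketch recomputes both rates for the full-length rows of the table. The function and argument names are illustrative and do not come from the original study; only the formulas and the input counts are taken from the table and its footnote.

```python
def overestimation_rate(otus_t: int, otus_c: int) -> float:
    """(A - B) / B * 100, with A = OTUs in HIQ-T-NCBI and B = OTUs in HIQ-C-NCBI."""
    return (otus_t - otus_c) / otus_c * 100.0


def underestimation_rate(seqs_c: int, otus_c: int) -> float:
    """(C - B) / C * 100, with C = sequences in HIQ-C-NCBI and B = OTUs in HIQ-C-NCBI."""
    return (seqs_c - otus_c) / seqs_c * 100.0


# Counts copied from the full-length rows of Table 2:
# (threshold, A, B, C) -- names A, B, C follow the footnote.
rows = [
    ("100%", 15_727, 6_131, 6_550),
    ("97%", 3_181, 3_035, 6_550),
]

for threshold, a, b, c in rows:
    over = overestimation_rate(a, b)
    under = underestimation_rate(c, b)
    print(f"{threshold} full-length: overestimation {over:.1f}%, underestimation {under:.1f}%")
```

For these two rows the sketch yields 156.5% and 6.4% at the 100% threshold and 4.8% and 53.7% at the 97% threshold, in agreement with the tabulated values.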