Training dataset
Sequence count for each superkingdom
Note: This plot is interactive.
Hovering over a superkingdom in the donut shows
the number of sequences for that superkingdom.
The table next to it can be searched and
sorted as well.
About this chart
This donut chart shows the distribution of
sequences per superkingdom from the filtered
dataset. The chart distinguishes between two
superkingdoms: Bacteria (19,931 sequences) and
Archaea (929 sequences). The counts for each
superkingdom are also presented in the adjacent
searchable and sortable table.
Superkingdom | Number of sequences |
---|---|
Bacteria | 19931 |
Archaea | 929 |
Sequence count for each phylum
Note:
This chart is interactive. When hovering over
each phylum in the donut, the number of sequences
for the phylum is shown. "Others" is also
clickable; it will then show the remaining phyla
in a bar chart. You will then be able to hover
over each bar and see the number of sequences
per remaining phylum. The table next to it can
be searched and ordered as well.
About this chart
This donut chart shows the distribution of
sequences per phylum from the filtered dataset.
The chart distinguishes between the 4 most
frequent phyla (Proteobacteria, Actinobacteria,
Firmicutes, and Bacteroidetes) and a fifth
category called "Others", which includes all
remaining phyla. The counts for each phylum are
also presented in the adjacent searchable and
sortable table. No axes apply to the donut chart;
the bar chart behind "Others" plots each remaining
phylum against its number of sequences.
Proteobacteria has the highest number of
sequences (7,561 sequences). The phylum with
the fewest sequences is Fibrobacteres
(10 sequences).
*Click on "Others" to see the phyla that are included in this category
Phylum | Number of sequences |
---|---|
Proteobacteria | 7561 |
Actinobacteria | 4952 |
Firmicutes | 4008 |
Bacteroidetes | 2187 |
Others | 2152 |
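The grouping behind the donut charts (the most frequent categories plus one "Others" bucket) can be sketched as follows. This is a minimal illustration, not the dashboard's own code: the function name `top_n_with_others` and the toy labels are assumptions. The last lines check that the phylum and superkingdom tables add up to the same total.

```python
from collections import Counter

def top_n_with_others(labels, n=4, others_label="Others"):
    """Keep the n most frequent labels and sum the rest into one bucket."""
    counts = Counter(labels)
    top = counts.most_common(n)
    remainder = sum(counts.values()) - sum(count for _, count in top)
    grouped = dict(top)
    if remainder:
        grouped[others_label] = remainder
    return grouped

# Toy labels (hypothetical), one entry per sequence.
labels = (["Proteobacteria"] * 5 + ["Actinobacteria"] * 4 + ["Firmicutes"] * 3
          + ["Bacteroidetes"] * 2 + ["Chlorobi", "Aquificae"])
grouped = top_n_with_others(labels)

# Consistency check on the counts reported in the two tables.
phylum_table = {"Proteobacteria": 7561, "Actinobacteria": 4952,
                "Firmicutes": 4008, "Bacteroidetes": 2187, "Others": 2152}
superkingdom_table = {"Bacteria": 19931, "Archaea": 929}
totals_match = sum(phylum_table.values()) == sum(superkingdom_table.values())
```

Both tables sum to 20,860 sequences, so the "Others" bucket accounts for every sequence outside the four largest phyla.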
Sequence count for each class
Note: This chart is interactive.
When hovering over each class in the donut,
the number of sequences for the class is shown.
"Others" is also clickable; it will then show
the remaining classes in a bar chart, where
you can hover over each bar to see the
number of sequences per remaining class.
The table next to it can be searched
and sorted as well.
About this chart
This donut chart presents the distribution of
sequences per class from the filtered dataset.
The top 4 most frequent classes are shown.
Actinobacteria has the largest number of
sequences (4,831 sequences), excluding the
"Others" section, which groups the remaining
classes. Within "Others", Thermoanaerobaculia
has the fewest sequences (1 sequence). The
adjacent table also shows each class and the
corresponding sequence counts. No axes apply to
the donut chart; the bar chart behind "Others"
plots each remaining class against its number
of sequences.
*Click on "Others" to see the classes that are included in this category
Class | Number of sequences |
---|---|
Actinobacteria | 4831 |
Gammaproteobacteria | 3244 |
Bacilli | 2639 |
Alphaproteobacteria | 2605 |
Others | 7541 |
Sequence length distribution of the filtered dataset
Note
This chart is interactive.
When hovering over each bar you will
see the number of sequences within that length range.
About this chart
This chart represents the sequence length distribution for
the filtered dataset. The y-axis represents the number
of sequences on a logarithmic scale, and the x-axis shows
the sequence lengths in nucleotides. The bars indicate
how many sequences fall within each length interval.
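As a rough sketch of how such a histogram is built, the snippet below bins sequence lengths into fixed-width intervals and computes the log10 values that a logarithmic y-axis would plot. The bin width of 75 nucleotides, the function name `length_histogram`, and the toy lengths are assumptions for illustration; the chart's actual bin edges are not stated.

```python
import math
from collections import Counter

def length_histogram(lengths, bin_width):
    """Count sequences per length interval [start, start + bin_width)."""
    bins = Counter((length // bin_width) * bin_width for length in lengths)
    return dict(sorted(bins.items()))

# Toy sequence lengths in nucleotides (hypothetical values).
lengths = [1460, 1490, 1500, 1527, 1390, 1760]
hist = length_histogram(lengths, bin_width=75)

# A logarithmic y-axis plots log10(count), which keeps very small bins
# visible next to very large ones.
log_heights = {start: math.log10(count) for start, count in hist.items()}
```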
Interpretation of this chart (Discussion)
The majority of the sequences (10,602) are between 1,455 and 1,528
nucleotides long, while the fewest sequences (3) are between
1,751 and 1,827 nucleotides. This suggests that
most sequences in the dataset are of similar length, which
is typical for 16S rRNA gene sequences. The distribution
is approximately normal, with a single peak. This is expected, as the
16S rRNA gene is a conserved region in Bacteria and Archaea.
Evaluation results Multinomial Naive Bayes
About this table
This table presents the evaluation of a Multinomial Naive Bayes
taxonomic classification model trained on 8-mer features from
16S rRNA gene sequences and tested on a held-out test set drawn
from the fully filtered dataset. The table includes
precision, recall, F1-score, and support for each phylum.
The "support" column indicates the number of true instances per class
in the test set (totaling 4,172). For example, Acidobacteria
appeared 12 times in the test set. The support values show
considerable variation in the number of samples per phylum.
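To make the term "8-mer features" concrete, the sketch below counts every overlapping 8-mer in a sequence. It is an illustration under assumptions: the function name `kmer_counts` is hypothetical, and whether the original pipeline used overlapping windows, raw counts, or normalized frequencies is not stated here.

```python
from collections import Counter

def kmer_counts(sequence, k=8):
    """Count every overlapping k-mer in a nucleotide sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

counts = kmer_counts("acgtacgtac")   # toy 10-nt sequence -> three 8-mers
n_possible = 4 ** 8                  # distinct 8-mers over A, C, G, T
```

A sequence of length L yields L - 7 overlapping 8-mers, so a 16S rRNA sequence of about 1,500 nucleotides contributes roughly 1,500 counts spread over at most 65,536 possible features.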
Interpretation of this table (Discussion)
Most phyla, 20 out of 25, reached an F1-score of 1.00.
Balneolaeota (0.50), Deferribacteres (0.80), Fibrobacteres (0.67),
Nitrospirae (0.00), and Thermodesulfobacteria (0.80) scored lower.
In each case recall is reduced, and Nitrospirae scores 0 on all three
metrics, meaning neither of its 2 test sequences was predicted correctly.
These five phyla have only 2 or 3 sequences each in the test set.
Additionally, Firmicutes has slightly reduced precision (0.99) and
Actinobacteria has slightly reduced recall (0.99).
All values are unitless scores between 0 and 1. The overall model
performance was high, with a macro-average F1-score of 0.910 and a
Matthews Correlation Coefficient (MCC) of 0.996. The macro average
weights every phylum equally, which is why the few poorly classified
small phyla pull it below the weighted average of 1.00.
F1 Score: 0.9102338105407413
MCC: 0.9959149483996512
Phylum | Precision | Recall | F1-score | Support |
---|---|---|---|---|
Acidobacteria | 1.00 | 1.00 | 1.00 | 12 |
Actinobacteria | 1.00 | 0.99 | 1.00 | 990 |
Aquificae | 1.00 | 1.00 | 1.00 | 9 |
Bacteroidetes | 1.00 | 1.00 | 1.00 | 437 |
Balneolaeota | 1.00 | 0.33 | 0.50 | 3 |
Chlamydiae | 1.00 | 1.00 | 1.00 | 6 |
Chlorobi | 1.00 | 1.00 | 1.00 | 4 |
Chloroflexi | 1.00 | 1.00 | 1.00 | 11 |
Crenarchaeota | 1.00 | 1.00 | 1.00 | 20 |
Cyanobacteria | 1.00 | 1.00 | 1.00 | 23 |
Deferribacteres | 1.00 | 0.67 | 0.80 | 3 |
Deinococcus-Thermus | 1.00 | 1.00 | 1.00 | 26 |
Euryarchaeota | 1.00 | 1.00 | 1.00 | 166 |
Fibrobacteres | 1.00 | 0.50 | 0.67 | 2 |
Firmicutes | 0.99 | 1.00 | 1.00 | 801 |
Fusobacteria | 1.00 | 1.00 | 1.00 | 15 |
Nitrospirae | 0.00 | 0.00 | 0.00 | 2 |
Planctomycetes | 1.00 | 1.00 | 1.00 | 10 |
Proteobacteria | 1.00 | 1.00 | 1.00 | 1512 |
Spirochaetes | 1.00 | 1.00 | 1.00 | 31 |
Synergistetes | 1.00 | 1.00 | 1.00 | 6 |
Tenericutes | 1.00 | 1.00 | 1.00 | 53 |
Thermodesulfobacteria | 1.00 | 0.67 | 0.80 | 3 |
Thermotogae | 1.00 | 1.00 | 1.00 | 14 |
Verrucomicrobia | 1.00 | 1.00 | 1.00 | 13 |
accuracy | | | 1.00 | 4172 |
macro avg | 0.96 | 0.89 | 0.91 | 4172 |
weighted avg | 1.00 | 1.00 | 1.00 | 4172 |
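The macro-average F1-score can be reproduced from the per-phylum values in the Multinomial Naive Bayes report. The sketch below uses the rounded two-decimal scores as printed, so it only recovers the macro average to about two decimals; the helper name `f1` is an assumption.

```python
# Per-phylum F1-scores as printed in the report (rounded to two decimals):
# 20 phyla at 1.00 plus the five lower scores.
f1_scores = [1.00] * 20 + [0.50, 0.80, 0.67, 0.00, 0.80]

def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Macro average: every phylum counts equally, regardless of support.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Example row: Balneolaeota, precision 1.00 and recall 0.33.
balneolaeota_f1 = f1(1.00, 0.33)
```

Because each of the 25 phyla carries the same weight, five small phyla with 2 or 3 test sequences each are enough to lower the macro average to about 0.91, while the support-weighted average stays at 1.00.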
Evaluation results Random Forest
About this table
This table presents the evaluation of a Random Forest
taxonomic classification model trained on 8-mer
features from 16S rRNA gene sequences and tested on
a held-out test set drawn from the fully filtered
dataset. The table includes precision, recall, F1-score,
and support for each phylum. The "support"
column indicates the number of true instances per class
in the test set (totaling 4,172). For example,
Acidobacteria appeared 12 times in the test set.
The support values show considerable variation in the
number of samples per phylum.
Interpretation of this table (Discussion)
Most phyla, 23 out of 25, reached an F1-score of 1.00.
Only Tenericutes (F1 = 0.97) and Cyanobacteria (F1 = 0.98)
scored lower, both because of reduced precision (0.95 and
0.96): some sequences from other phyla were predicted as
these two. Additionally, Firmicutes had slightly reduced
recall (0.99).
All values are unitless scores between 0 and 1. The overall
model performance was high, with a macro-averaged F1-score
of 0.998 and a Matthews Correlation Coefficient (MCC) of
0.998. This indicates that the model correctly classified
nearly all sequences in the test set, with only a few
misclassifications.
F1 Score: 0.9979095607832456
MCC: 0.9984299928689293
Phylum | Precision | Recall | F1-score | Support |
---|---|---|---|---|
Acidobacteria | 1.00 | 1.00 | 1.00 | 12 |
Actinobacteria | 1.00 | 1.00 | 1.00 | 990 |
Aquificae | 1.00 | 1.00 | 1.00 | 9 |
Bacteroidetes | 1.00 | 1.00 | 1.00 | 437 |
Balneolaeota | 1.00 | 1.00 | 1.00 | 3 |
Chlamydiae | 1.00 | 1.00 | 1.00 | 6 |
Chlorobi | 1.00 | 1.00 | 1.00 | 4 |
Chloroflexi | 1.00 | 1.00 | 1.00 | 11 |
Crenarchaeota | 1.00 | 1.00 | 1.00 | 20 |
Cyanobacteria | 0.96 | 1.00 | 0.98 | 23 |
Deferribacteres | 1.00 | 1.00 | 1.00 | 3 |
Deinococcus-Thermus | 1.00 | 1.00 | 1.00 | 26 |
Euryarchaeota | 1.00 | 1.00 | 1.00 | 166 |
Fibrobacteres | 1.00 | 1.00 | 1.00 | 2 |
Firmicutes | 1.00 | 0.99 | 1.00 | 801 |
Fusobacteria | 1.00 | 1.00 | 1.00 | 15 |
Nitrospirae | 1.00 | 1.00 | 1.00 | 2 |
Planctomycetes | 1.00 | 1.00 | 1.00 | 10 |
Proteobacteria | 1.00 | 1.00 | 1.00 | 1512 |
Spirochaetes | 1.00 | 1.00 | 1.00 | 31 |
Synergistetes | 1.00 | 1.00 | 1.00 | 6 |
Tenericutes | 0.95 | 1.00 | 0.97 | 53 |
Thermodesulfobacteria | 1.00 | 1.00 | 1.00 | 3 |
Thermotogae | 1.00 | 1.00 | 1.00 | 14 |
Verrucomicrobia | 1.00 | 1.00 | 1.00 | 13 |
accuracy | | | 1.00 | 4172 |
macro avg | 1.00 | 1.00 | 1.00 | 4172 |
weighted avg | 1.00 | 1.00 | 1.00 | 4172 |
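The reduced precision values in the Random Forest report can be translated back into approximate error counts. The sketch below solves precision = TP / (TP + FP) for FP. The function name `implied_false_positives` is an assumption, and because the printed precision is rounded to two decimals, the results are estimates rather than exact counts.

```python
def implied_false_positives(true_positives, precision):
    """Solve precision = TP / (TP + FP) for FP and round to a whole count."""
    return round(true_positives / precision - true_positives)

# Recall is 1.00 for both phyla, so TP equals the support value.
tenericutes_fp = implied_false_positives(53, 0.95)
cyanobacteria_fp = implied_false_positives(23, 0.96)
```

Under these assumptions, about 3 sequences from other phyla were predicted as Tenericutes and about 1 as Cyanobacteria, which illustrates how few errors lie behind the lower scores.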
Confusion matrix Multinomial Naïve Bayes
Note: This interactive
visualization shows the proportion of predictions
per class. Hovering over each cell displays the actual
class, the predicted class, and the corresponding
normalized value, which is the fraction of sequences
from the actual class that received that prediction.
About this plot
This plot shows the normalized confusion matrix from a
machine learning experiment that aimed to classify
Bacteria and Archaea at the phylum level.
The classification was based on 8-mer frequency
vectors derived from 16S rRNA genes. The model was trained
and evaluated using the scikit-learn
(Pedregosa et al., 2011) Multinomial Naïve Bayes classifier.
The test set included 4,172 sequences
(20% of the filtered dataset).
Each row represents the actual phylum, and each column
represents the predicted phylum. Cell values are normalized
per row, ranging from 0 (dark blue) to 1 (yellow),
indicating the fraction of sequences from each actual
phylum that were predicted as each class.
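Row normalization can be sketched in a few lines. The example below is an illustration with toy labels, not the code behind the plot, and the helper names are assumptions; scikit-learn offers the same normalization through `confusion_matrix(..., normalize="true")`.

```python
def confusion_counts(actual, predicted, labels):
    """Raw confusion matrix: rows are actual classes, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def normalize_rows(matrix):
    """Divide each row by its total so every non-empty row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized

# Toy test set (hypothetical): 3 Deferribacteres and 4 Proteobacteria,
# with one Deferribacteres sequence predicted as Proteobacteria.
labels = ["Deferribacteres", "Proteobacteria"]
actual = ["Deferribacteres"] * 3 + ["Proteobacteria"] * 4
predicted = (["Deferribacteres", "Deferribacteres", "Proteobacteria"]
             + ["Proteobacteria"] * 4)
matrix = confusion_counts(actual, predicted, labels)
normalized = normalize_rows(matrix)
```

With only 3 sequences in a row, one error already shows up as 0.333 in the off-diagonal cell, and the diagonal cell of a row equals the recall of that class.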
Interpretation of this plot (Discussion)
Most sequences were correctly classified, as shown by the strong
diagonal. However, some misclassifications occurred, including:
- 33.3% of Deferribacteres sequences were misclassified
as Proteobacteria.
- 50% of Fibrobacteres sequences were misclassified as
Firmicutes.
- 33.3% of Thermodesulfobacteria sequences were
misclassified as Proteobacteria.
These confusions likely stem from compositional similarities in
8-mer patterns, rather than evolutionary closeness between phyla.
Multinomial Naive Bayes models are sensitive to such distributional
similarities, especially in high-dimensional 8-mer space. Each of
these phyla has only 2 or 3 sequences in the test set, so a single
error already accounts for a third or half of the row.
A yellow cell on the diagonal indicates high recall (sensitivity)
for that phylum, meaning the model correctly identifies a large
proportion of the sequences of that true class. The diagonal is
yellow for most phyla, so the model is effective at retrieving the
correct class label in most cases.
Confusion matrix Random Forest
Note: This interactive
visualization shows the proportion of predictions
per class. Hovering over each cell displays the actual
class, the predicted class, and the corresponding
normalized value, which is the fraction of sequences
from the actual class that received that prediction.
About this plot
This figure presents the normalized confusion matrix
resulting from a machine learning experiment that
aimed to classify Bacteria and Archaea at the phylum level
based on 8-mer frequency vectors derived from 16S rRNA
sequences. The model was a Random Forest
classifier trained and evaluated using scikit-learn
(Pedregosa et al., 2011). The plot shows the
performance on the test set, which consisted of
4,172 sequences (20% of the filtered dataset).
The rows represent the actual phyla, and the
columns represent the predicted phyla.
Each cell in the matrix shows the proportion
of predictions (normalized per true class) made
for each class. This means each row sums to 1,
and values range from 0 (dark blue) to 1 (yellow).
A value of 1 on the diagonal means that 100% of sequences
from a given phylum were correctly classified into that phylum.
Interpretation of this plot (Discussion)
The majority of values lie along the diagonal, indicating
that the model achieved high classification accuracy overall.
Some misclassifications occurred, most notably with sequences
from Firmicutes, which were occasionally predicted as:
- Tenericutes (0.37% of Firmicutes sequences),
- Proteobacteria (0.12%),
- Cyanobacteria (0.12%).
These fractions match the Firmicutes recall of 0.99 in the evaluation table. The off-diagonal values count as false negatives for Firmicutes and as false positives for the predicted phylum, and their magnitude indicates how often one phylum was mistaken for another during prediction.
These errors suggest subtle overlaps in sequence patterns between these groups.
Top 20 Feature importances per selected class Multinomial Naive Bayes
Note: This plot is interactive.
Double-clicking a class in the legend on the right will
display the top 20 features for the selected phylum or phyla.
The buttons in the top right corner allow you to zoom and adjust
the plot view. Hovering over a feature will display its
associated log probability.
About this plot
This figure presents the top 20 8-mer features for the selected
phylum, ranked by the log probabilities assigned by the
Multinomial Naive Bayes classifier.
The features were extracted from 8-mer frequency vectors derived
from 16S rRNA sequences used in the same classification experiment
as shown in the confusion matrix.
The bar chart displays the 8-mer sequences on the y-axis and their
corresponding log probabilities on the x-axis. Assuming these are
the per-class feature log probabilities of the fitted model,
log P(8-mer | phylum), they are always negative, and values closer
to zero mean the 8-mer makes up a larger share of the 8-mers seen
in that phylum. The top 20 are therefore the 8-mers the model
considers most characteristic of the selected phylum, although an
8-mer that is frequent in one phylum can be frequent in others as well.
Interpretation of this plot (Discussion)
The top-ranked 8-mer for Cyanobacteria in this plot is
'gaagaaca' with a log probability of -7.37. The differences in
log probabilities among the top 8-mers are small, which
suggests that the classifier does not rely heavily on a single highly
predictive 8-mer but rather on a combination of multiple moderately
informative 8-mers to correctly identify Cyanobacteria sequences.
When looking at Proteobacteria, you will see that the log
probabilities are less negative, meaning each top 8-mer accounts for
a larger share of that phylum's 8-mer counts.
In contrast, the log probabilities for Thermodesulfobacteria
are more negative, meaning each top 8-mer accounts for a smaller share.
For a phylum with very few training sequences this can partly be an
effect of smoothing, so it does not by itself indicate stronger or
weaker feature associations.
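For reference, a Multinomial Naive Bayes model derives a smoothed log probability log P(feature | class) from the feature counts of each class; scikit-learn's `MultinomialNB` exposes these values as `feature_log_prob_`. The sketch below reproduces that calculation on toy counts. The function names, the four-feature vocabulary, and the smoothing value `alpha=1.0` are assumptions for illustration, and whether the plot uses exactly this attribute is also an assumption.

```python
import math

def feature_log_probs(class_counts, vocabulary, alpha=1.0):
    """Smoothed log P(feature | class) from the feature counts of one class."""
    total = (sum(class_counts.get(f, 0) for f in vocabulary)
             + alpha * len(vocabulary))
    return {f: math.log((class_counts.get(f, 0) + alpha) / total)
            for f in vocabulary}

def top_features(log_probs, n):
    """Features with the highest (least negative) log probability."""
    return sorted(log_probs, key=log_probs.get, reverse=True)[:n]

# Toy counts for one class over a 4-feature vocabulary (hypothetical).
vocabulary = ["AAAAAAAA", "CCCCCCCC", "GGGGGGGG", "TTTTTTTT"]
counts = {"AAAAAAAA": 6, "CCCCCCCC": 3, "GGGGGGGG": 1}
log_probs = feature_log_probs(counts, vocabulary)
best = top_features(log_probs, n=2)

# Reading a plotted value: a log probability of -7.37 corresponds to
# a probability of about 1 in 1,590.
one_in = 1 / math.exp(-7.37)
```

A 16S rRNA sequence of about 1,500 nucleotides holds about 1,500 overlapping 8-mers, so a value near -7.37 is roughly what an 8-mer occurring once in nearly every sequence of a phylum would receive. This also explains why the top 20 values lie close together.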
Top 20 Feature importances Random Forest
Note: This plot is interactive.
The importance score is shown when hovering over the 8-mer
line. You can also control the view of the plot with the
buttons on the right side.
About this plot
This plot shows the 20 most important 8-mers according
to a Random Forest model trained on 8-mer frequency
vectors extracted from 16S rRNA sequences. The y-axis
lists the 8-mer features; the x-axis shows their normalized
feature importance.
Feature importance values are normalized so that the
total across all 8-mer features sums to 1.0. The dataset
contains roughly 29 million 8-mers in total, but importance
is assigned per distinct 8-mer, and at most 4^8 = 65,536
distinct 8-mers exist over the A, C, G, T alphabet.
For comparison: if all 65,536 features were equally
important, each would have an importance of approximately
1.53×10⁻⁵. A score of 0.0039 means that this 8-mer is
roughly 256× more important than the average feature,
indicating a strong influence on the
model's decisions.
Interpretation of this plot (Discussion)
The 8-mer 'CGTTGCGC' is the most important, with an
importance score of 0.0039 (0.39% of the total),
followed by 'TAGTAACC' (0.37%) and 'TGGAATGT' (0.36%).
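To put such scores in proportion, an importance can be compared with the share each feature would get if importance were spread evenly. In scikit-learn, `RandomForestClassifier.feature_importances_` sums to 1 over the features, so the even share is 1 divided by the number of features. The sketch below assumes the feature space is the 4^8 possible 8-mers over A, C, G, and T; the actual number of columns in the trained model is not stated, and the function name is hypothetical.

```python
def relative_importance(score, n_features):
    """How many times larger a score is than the even share 1 / n_features."""
    return score * n_features

n_features = 4 ** 8              # assumption: all 8-mers over A, C, G, T
uniform_share = 1 / n_features   # about 1.53e-5
ratio = relative_importance(0.0039, n_features)

# Combined share of the three top-ranked 8-mers reported in the plot.
top3_share = 0.0039 + 0.0037 + 0.0036
```

Under this assumption the top 8-mer is about 256 times more important than an average feature, while the three top-ranked 8-mers together still account for only about 1.1% of the total importance.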
Multiclass ROC Curve Multinomial Naïve Bayes
Note: This plot is interactive.
Double-clicking a phylum on the right side shows only that
phylum in the plot. Hovering over the curve displays detailed
performance metrics (e.g., TPR, FPR) at that point.
The view of the plot can also be controlled with the buttons
on the top right side.
About this plot
This ROC curve illustrates the performance of a multiclass
classification model that classifies microorganisms at the
phylum level using sequence-based features (8-mers).
The model was trained on a balanced training set and
evaluated on an independent test set.
The graph plots the True Positive Rate (y-axis) against
the False Positive Rate (x-axis).
- A True Positive Rate (TPR) of 0 means the model
correctly identifies none of the actual positive
cases, while a TPR of 1 means the model correctly
identifies all of them.
- A False Positive Rate (FPR) of 0 means no false positives
are made, while an FPR of 1 means all negative cases
are incorrectly classified as positive.
The black diagonal line (AUC = 0.5) represents random
classification. The model achieves micro- and macro-averaged
AUC scores of 1.00, as well as AUC values of 1.00 for all
individual classes.
The macro-average computes the AUC for each class
independently and then averages them, giving equal weight
to each class. In contrast, the micro-average pools the
one-vs-rest outcomes (true and false positives, true and
false negatives) of all classes before computing a single
AUC, thereby weighting classes according to their frequency.
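The difference between the two averages can be sketched with a small one-vs-rest calculation. The example computes AUC as the probability that a positive example receives a higher score than a negative one, with ties counted as half. The function names and the toy scores are assumptions, and this is not the code behind the plot.

```python
def auc(labels, scores):
    """AUC as the chance that a positive outranks a negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_micro_auc(y_true, y_score, classes):
    """One-vs-rest AUC per class, plus macro and micro averages."""
    per_class, flat_labels, flat_scores = {}, [], []
    for k, cls in enumerate(classes):
        labels = [1 if y == cls else 0 for y in y_true]
        scores = [row[k] for row in y_score]
        per_class[cls] = auc(labels, scores)
        flat_labels += labels      # micro: pool all class/sample pairs
        flat_scores += scores
    macro = sum(per_class.values()) / len(classes)
    micro = auc(flat_labels, flat_scores)
    return per_class, macro, micro

# Toy predicted probabilities (hypothetical): 4 samples, 3 classes.
classes = ["A", "B", "C"]
y_true = ["A", "A", "B", "C"]
y_score = [[0.8, 0.1, 0.1], [0.4, 0.5, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
per_class, macro, micro = macro_micro_auc(y_true, y_score, classes)

# Final labels come from the highest score per sample.
predicted = [classes[row.index(max(row))] for row in y_score]
accuracy = sum(p == t for p, t in zip(predicted, y_true)) / len(y_true)
```

In this toy case every per-class AUC is 1.0 even though the second sample is labelled incorrectly (accuracy 0.75), because AUC evaluates how scores rank within each class rather than the final label. The pooled micro-average is slightly lower at 31/32.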
Interpretation of this plot (Discussion)
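AUC scores of 1.00 indicate that, for every phylum, the predicted scores
rank sequences of that phylum above the other sequences almost perfectly
on the test set. AUC measures this ranking rather than the final predicted
label, which is why it can round to 1.00 even for phyla with reduced
recall in the evaluation table.
The exact score is shown when hovering over the plot.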
Multiclass ROC Curve Random Forest
Note: This plot is interactive.
The exact score is shown when hovering over the plot.
You can also control the view of the plot with the buttons
on the right side. Double-clicking a phylum on the
right side shows only that phylum.
About this plot
This ROC curve illustrates the performance of a multiclass
Random Forest classification model that classifies
Bacteria and Archaea at the phylum level, using 8-mer
frequency vectors derived from 16S rRNA sequences as
features.
The model was trained on a balanced training set comprising
80% of the fully filtered dataset and was evaluated on an
independent test set containing the remaining 20%. In the
ROC curve, the True Positive Rate (y-axis) is plotted
against the False Positive Rate (x-axis), both ranging
from 0 to 1. The black diagonal line (AUC = 0.5) serves
as a reference for random classification.
The macro-average computes the AUC for each class
independently and then averages them, giving equal weight
to each class. In contrast, the micro-average pools the
one-vs-rest outcomes (true and false positives, true and
false negatives) of all classes before computing a single
AUC, thereby weighting classes according to their frequency.
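The support values in the evaluation tables are close to 20% of each phylum's sequence count (for example, 1,512 of 7,561 Proteobacteria sequences), which is what an 80/20 split stratified by phylum would produce. Whether the original pipeline stratified its split, and how the training set was balanced, is not stated. The sketch below only illustrates the idea of a stratified split, and the function name, seed, and toy labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class contributes test_fraction to the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Toy labels (hypothetical): one large and one small phylum.
labels = ["Proteobacteria"] * 50 + ["Fibrobacteres"] * 10
train_idx, test_idx = stratified_split(labels)
small_in_test = sum(labels[i] == "Fibrobacteres" for i in test_idx)
```

Stratifying keeps rare phyla represented in the test set, although with only 2 or 3 test sequences per phylum, each individual prediction still has a large effect on that phylum's scores.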
Interpretation of this plot (Discussion)
All individual classes, as well as the macro- and micro-averages,
achieved an AUC of 1.00, indicating near-perfect separation of
every phylum from the others on the test set.