Training dataset

Sequence count for each superkingdom

Note: This plot is interactive. When hovering over a superkingdom in the donut, the number of sequences for that superkingdom is shown. The table next to it can be searched and ordered as well.

About this chart

This donut chart shows the distribution of sequences per superkingdom from the filtered dataset. The chart distinguishes between two superkingdoms: Bacteria (19,931 sequences) and Archaea (929 sequences). The counts for each superkingdom are also presented in the adjacent searchable and sortable table.

Superkingdom Number of sequences
Bacteria 19931
Archaea 929

Sequence count for each phylum

Note: This chart is interactive. When hovering over each phylum in the donut, the number of sequences for the phylum is shown. "Others" is also clickable; it will then show the remaining phyla in a bar chart. You will then be able to hover over each bar and see the number of sequences per remaining phylum. The table next to it can be searched and ordered as well.

About this chart

This donut chart shows the distribution of sequences per phylum from the filtered dataset. The chart distinguishes between the top 4 most frequent phyla (Proteobacteria, Actinobacteria, Firmicutes, and Bacteroidetes) and a fifth category called "Others", which includes all remaining phyla. The counts for each phylum are also presented in the adjacent searchable and sortable table. No axes apply to the donut chart. In the bar chart, one axis lists the remaining phyla and the other shows the number of sequences per phylum.

Proteobacteria has the highest number of sequences (7,561 sequences). The phylum with the fewest sequences is Fibrobacteres (10 sequences).

*Click on "Others" to see the phyla that are included in this category

Phylum Number of sequences
Proteobacteria 7561
Actinobacteria 4952
Firmicutes 4008
Bacteroidetes 2187
Others 2152
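
The grouping into the top 4 categories plus "Others" can be sketched as follows. This is a minimal illustration, not the code behind the chart: the function name `top_n_with_others` and the toy counts are hypothetical, and only the numbers in `phylum_table` are taken from the table above.

```python
from collections import Counter

def top_n_with_others(counts, n=4, other_label="Others"):
    """Keep the n largest categories and sum the rest into one bucket."""
    ranked = Counter(counts).most_common()
    top = ranked[:n]
    rest = sum(value for _, value in ranked[n:])
    if rest:
        top.append((other_label, rest))
    return top

# Hypothetical toy counts, not the real per-phylum numbers.
toy_counts = {"A": 50, "B": 30, "C": 10, "D": 5, "E": 3, "F": 2}
grouped = top_n_with_others(toy_counts, n=4)
print(grouped)

# Consistency check with values from the tables above:
# the phylum counts add up to the Bacteria + Archaea total.
phylum_table = {
    "Proteobacteria": 7561,
    "Actinobacteria": 4952,
    "Firmicutes": 4008,
    "Bacteroidetes": 2187,
    "Others": 2152,
}
phylum_total = sum(phylum_table.values())
print(phylum_total, 19931 + 929)
```

The final line prints the same total twice (20,860), which confirms that the phylum table and the superkingdom table describe the same filtered dataset.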

Sequence count for each class

Note: This chart is interactive. When hovering over each class in the donut, the number of sequences for the class is shown. "Others" is also clickable; it will then show the remaining classes in a bar chart. You will then be able to hover over each bar and see the number of sequences per remaining class. The table next to it can be searched and ordered as well.

About this chart

This donut chart presents the distribution of sequences per class from the filtered dataset. The top 4 most frequent classes are shown.

Actinobacteria has the largest number of sequences (4,831 sequences), excluding the "Others" section, which groups the remaining classes. Within "Others", Thermoanaerobaculia has the fewest sequences (1 sequence). The adjacent table also shows each class and the corresponding sequence counts. No axes apply to the donut chart. In the bar chart, one axis lists the remaining classes and the other shows the number of sequences per class.

*Click on "Others" to see the classes that are included in this category

Class Number of sequences
Actinobacteria 4831
Gammaproteobacteria 3244
Bacilli 2639
Alphaproteobacteria 2605
Others 7541

Sequence length distribution of the filtered dataset

Note: This chart is interactive. When hovering over each bar you will see the number of sequences within that length range.

About this chart

This chart represents the sequence length distribution for the filtered dataset. The y-axis represents the number of sequences on a logarithmic scale, and the x-axis shows the sequence lengths in nucleotides. The bars indicate how many sequences fall within each length interval.
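
As a rough sketch of how such a histogram can be built, the snippet below counts sequences per length interval. The equal bin width of 50 nucleotides, the function name `length_histogram`, and the toy lengths are assumptions for illustration; the intervals used in the actual chart may differ.

```python
import math
from collections import Counter

def length_histogram(lengths, bin_width):
    """Count sequences per half-open length interval [start, start + bin_width)."""
    bins = Counter((length // bin_width) * bin_width for length in lengths)
    return dict(sorted(bins.items()))

# Hypothetical lengths in nucleotides, for illustration only.
toy_lengths = [1450, 1460, 1471, 1499, 1502, 1510, 1530, 1760]
hist = length_histogram(toy_lengths, bin_width=50)
for start, count in hist.items():
    # log10 of the count is what a logarithmic y-axis effectively displays.
    print(f"{start}-{start + 49}: {count} (log10 = {math.log10(count):.2f})")
```

A logarithmic y-axis keeps sparsely populated intervals visible next to the dominant one, which a linear axis would flatten to near zero.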

Interpretation of this chart (Discussion)

The majority of the sequences (10,602) are between 1,455 and 1,528 nucleotides long, while the fewest sequences (3) are between 1,751 and 1,827 nucleotides. This suggests that most sequences in the dataset are of similar length, which is typical for 16S rRNA gene sequences. The distribution is approximately normal. This is expected, as the 16S rRNA gene is conserved in Bacteria and Archaea.

Evaluation results Multinomial Naive Bayes

About this table

This table presents the evaluation of a taxonomic classification model trained on genomic sequence data using 8-mer features and tested on the fully filtered dataset containing 16S rRNA sequences. The table includes precision, recall, F1-score, and support for each taxonomic group. The "support" column indicates the number of true instances per class in the test set (totaling 4,172). For example, Acidobacteria appeared 12 times in the test set. Based on the "support" values, the dataset shows considerable variation in the number of samples per phylum.
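
To make "8-mer features" concrete, here is a minimal sketch that counts all overlapping 8-mers in a sequence. The function name `kmer_counts`, the lowercase normalization, and the toy fragment are assumptions; the feature extraction used in the experiment is not described in detail here.

```python
from collections import Counter

def kmer_counts(sequence, k=8):
    """Count every overlapping k-mer in a nucleotide sequence."""
    seq = sequence.lower()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Hypothetical short fragment; real 16S rRNA sequences are about 1,500 nt long.
fragment = "ACGTACGTACGT"
counts = kmer_counts(fragment, k=8)
print(counts)
```

A sequence of length L yields L - 8 + 1 overlapping 8-mers, so the 12-nucleotide fragment produces 5 of them, with `acgtacgt` occurring twice. Stacking these count vectors over all sequences gives the frequency matrix a classifier can be trained on.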

Interpretation of this table (Discussion)

Most phyla, 20 out of 25, were classified with an F1-score of 1.00. Balneolaeota, Deferribacteres, Fibrobacteres, Nitrospirae and Thermodesulfobacteria had lower F1-scores, in each case because of reduced recall, and Nitrospirae scored 0 on precision, recall and F1. Each of these phyla has only 2 or 3 sequences in the test set, so a single misclassified sequence changes the score considerably. Additionally, Firmicutes has slightly reduced precision (0.99) and Actinobacteria has slightly reduced recall (0.99).
All values in the table are unitless scores between 0 and 1. The overall model performance was high, with a macro-averaged F1-score of 0.910 and a Matthews Correlation Coefficient (MCC) of 0.996.

F1 Score: 0.9102338105407413
MCC: 0.9959149483996512

                       precision    recall  f1-score   support

        Acidobacteria       1.00      1.00      1.00        12
       Actinobacteria       1.00      0.99      1.00       990
            Aquificae       1.00      1.00      1.00         9
        Bacteroidetes       1.00      1.00      1.00       437
         Balneolaeota       1.00      0.33      0.50         3
           Chlamydiae       1.00      1.00      1.00         6
             Chlorobi       1.00      1.00      1.00         4
          Chloroflexi       1.00      1.00      1.00        11
        Crenarchaeota       1.00      1.00      1.00        20
        Cyanobacteria       1.00      1.00      1.00        23
      Deferribacteres       1.00      0.67      0.80         3
  Deinococcus-Thermus       1.00      1.00      1.00        26
        Euryarchaeota       1.00      1.00      1.00       166
        Fibrobacteres       1.00      0.50      0.67         2
           Firmicutes       0.99      1.00      1.00       801
         Fusobacteria       1.00      1.00      1.00        15
          Nitrospirae       0.00      0.00      0.00         2
       Planctomycetes       1.00      1.00      1.00        10
       Proteobacteria       1.00      1.00      1.00      1512
         Spirochaetes       1.00      1.00      1.00        31
        Synergistetes       1.00      1.00      1.00         6
          Tenericutes       1.00      1.00      1.00        53
Thermodesulfobacteria       1.00      0.67      0.80         3
          Thermotogae       1.00      1.00      1.00        14
      Verrucomicrobia       1.00      1.00      1.00        13

             accuracy                           1.00      4172
            macro avg       0.96      0.89      0.91      4172
         weighted avg       1.00      1.00      1.00      4172
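
The difference between the macro and the weighted average in the table above can be reproduced from its rows. The sketch below uses the rounded F1-scores and supports as printed, so the result is approximate; the variable names are illustrative.

```python
# Rows with an F1-score below 1.00, as printed in the table: phylum -> (f1, support).
rows_below_one = {
    "Balneolaeota": (0.50, 3),
    "Deferribacteres": (0.80, 3),
    "Fibrobacteres": (0.67, 2),
    "Nitrospirae": (0.00, 2),
    "Thermodesulfobacteria": (0.80, 3),
}
n_phyla = 25          # number of rows in the table
total_support = 4172  # size of the test set

# Macro average: every phylum counts equally, regardless of its support.
n_perfect = n_phyla - len(rows_below_one)
macro_f1 = (sum(f1 for f1, _ in rows_below_one.values()) + n_perfect * 1.0) / n_phyla

# Weighted average: every phylum counts in proportion to its support.
perfect_support = total_support - sum(s for _, s in rows_below_one.values())
weighted_f1 = (
    sum(f1 * s for f1, s in rows_below_one.values()) + perfect_support * 1.0
) / total_support

print(round(macro_f1, 3), round(weighted_f1, 3))
```

With the rounded inputs the macro average comes out at about 0.911, close to the reported 0.910, while the weighted average stays at about 0.999. The five weak phyla together account for only 13 of the 4,172 test sequences, which is why they pull down the macro average but barely affect the weighted one.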
  
        

Evaluation results Random Forest

About this table

This table presents the evaluation of a taxonomic classification model trained on genomic sequence data using 8-mer features and tested on the fully filtered dataset containing 16S rRNA sequences. The table includes precision, recall, F1-score, and support for each taxonomic group. The "support" column indicates the number of true instances per class in the test set (totaling 4,172). For example, Acidobacteria appeared 12 times in the test set. Based on the "support" values, the dataset shows considerable variation in the number of samples per phylum.

Interpretation of this table (Discussion)

Most phyla, 23 out of 25, were classified with an F1-score of 1.00. Only Tenericutes (F1 = 0.97) and Cyanobacteria (F1 = 0.98) had lower F1-scores, in both cases because of lower precision (0.95 and 0.96, respectively). Additionally, Firmicutes had slightly reduced recall (0.99). All values in the table are unitless scores between 0 and 1. The overall model performance was high, with a macro-averaged F1-score of 0.998 and a Matthews Correlation Coefficient (MCC) of 0.998. This indicates that the model correctly classified the large majority of sequences in the test set, with only a few misclassifications.

F1 Score: 0.9979095607832456
MCC: 0.9984299928689293

                       precision    recall  f1-score   support

        Acidobacteria       1.00      1.00      1.00        12
       Actinobacteria       1.00      1.00      1.00       990
            Aquificae       1.00      1.00      1.00         9
        Bacteroidetes       1.00      1.00      1.00       437
         Balneolaeota       1.00      1.00      1.00         3
           Chlamydiae       1.00      1.00      1.00         6
             Chlorobi       1.00      1.00      1.00         4
          Chloroflexi       1.00      1.00      1.00        11
        Crenarchaeota       1.00      1.00      1.00        20
        Cyanobacteria       0.96      1.00      0.98        23
      Deferribacteres       1.00      1.00      1.00         3
  Deinococcus-Thermus       1.00      1.00      1.00        26
        Euryarchaeota       1.00      1.00      1.00       166
        Fibrobacteres       1.00      1.00      1.00         2
           Firmicutes       1.00      0.99      1.00       801
         Fusobacteria       1.00      1.00      1.00        15
          Nitrospirae       1.00      1.00      1.00         2
       Planctomycetes       1.00      1.00      1.00        10
       Proteobacteria       1.00      1.00      1.00      1512
         Spirochaetes       1.00      1.00      1.00        31
        Synergistetes       1.00      1.00      1.00         6
          Tenericutes       0.95      1.00      0.97        53
Thermodesulfobacteria       1.00      1.00      1.00         3
          Thermotogae       1.00      1.00      1.00        14
      Verrucomicrobia       1.00      1.00      1.00        13

             accuracy                           1.00      4172
            macro avg       1.00      1.00      1.00      4172
         weighted avg       1.00      1.00      1.00      4172
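
As a worked example of how precision, recall and F1 relate, the sketch below recomputes the Tenericutes and Cyanobacteria rows. The supports (53 and 23) come from the table; the false-positive counts (3 and 1) are assumptions chosen because they reproduce the rounded table values, not numbers reported in the source.

```python
def precision_recall_f1(tp, fp, fn):
    """Return precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# tp = support because recall is 1.00; fp values are assumed (see above).
tenericutes = precision_recall_f1(tp=53, fp=3, fn=0)
cyanobacteria = precision_recall_f1(tp=23, fp=1, fn=0)
print([round(v, 2) for v in tenericutes])
print([round(v, 2) for v in cyanobacteria])
```

The same function also shows why a phylum with no correct predictions gets 0.00 in every column: with `tp=0`, precision, recall and F1 are all zero.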
  
        

Confusion matrix Multinomial Naïve Bayes

Note: This interactive visualization shows the proportion of predictions per class. Hovering over each cell displays the actual class, the predicted class, and the corresponding normalized value, which represents the percentage of predictions for that combination.

About this plot

This plot shows the normalized confusion matrix from a machine learning experiment that aimed to classify Bacteria and Archaea at the phylum level. The classification was based on 8-mer frequency vectors derived from 16S rRNA genes. The model was trained and evaluated using the scikit-learn (Pedregosa et al., 2011) Multinomial Naïve Bayes classifier. The test set included 4,172 sequences (20% of the filtered dataset).
Each row represents the actual phylum, and each column represents the predicted phylum. Cell values are normalized per row, ranging from 0 (dark blue) to 1 (yellow), indicating the proportion of sequences from each actual phylum that were predicted as each class.
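
Row normalization can be sketched in a few lines. The function name `normalized_confusion` and the toy labels are hypothetical; the example is built so that one of three Deferribacteres sequences is predicted as Proteobacteria.

```python
from collections import Counter

def normalized_confusion(y_true, y_pred):
    """Return {actual: {predicted: fraction}} with every row summing to 1."""
    pair_counts = Counter(zip(y_true, y_pred))
    row_totals = Counter(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    return {
        actual: {
            pred: pair_counts[(actual, pred)] / row_totals[actual]
            for pred in labels
        }
        for actual in labels
        if row_totals[actual]
    }

# Hypothetical labels: 3 Deferribacteres (one predicted as Proteobacteria)
# and 4 Proteobacteria (all predicted correctly).
y_true = ["Deferribacteres"] * 3 + ["Proteobacteria"] * 4
y_pred = ["Deferribacteres", "Deferribacteres", "Proteobacteria"] + ["Proteobacteria"] * 4
matrix = normalized_confusion(y_true, y_pred)
print(round(matrix["Deferribacteres"]["Proteobacteria"], 3))
```

Because each row is divided by the number of true sequences in that phylum, the diagonal entry of a row equals the recall for that phylum, and a single error in a phylum with 3 test sequences already shows up as a value of about 0.333.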

Interpretation of this plot (Discussion)

Most sequences were correctly classified, as shown by the strong diagonal. However, some misclassifications occurred, including:

  • 33.3% of Deferribacteres sequences were misclassified as Proteobacteria.
  • 50% of Fibrobacteres sequences were misclassified as Firmicutes.
  • 33.3% of Thermodesulfobacteria sequences were misclassified as Proteobacteria.
These confusions likely stem from compositional similarities in 8-mer patterns, rather than evolutionary closeness between phyla. Multinomial Naive Bayes models are sensitive to such distributional similarities, especially in high-dimensional 8-mer space.
A yellow cell on the diagonal indicates high recall (sensitivity) for that phylum, meaning that a large proportion of its sequences were assigned to their true class. The mostly yellow diagonal therefore shows that the model retrieves the correct label for most phyla.

Confusion matrix Random Forest

Note: This interactive visualization shows the proportion of predictions per class. Hovering over each cell displays the actual class, the predicted class, and the corresponding normalized value, which represents the percentage of predictions for that combination.

About this plot

This figure presents the normalized confusion matrix from a machine learning experiment that aimed to classify Bacteria and Archaea at the phylum level based on 8-mer frequency vectors derived from 16S rRNA sequences. The model was a Random Forest classifier trained and evaluated using scikit-learn (Pedregosa et al., 2011). The plot shows the performance on the test set, which consisted of 4,172 sequences (20% of the filtered dataset). The rows represent the actual phyla, and the columns represent the predicted phyla. Each cell in the matrix shows the proportion of predictions (normalized per true class) made for each class. This means each row sums to 1, and values range from 0 (dark blue) to 1 (yellow). A value of 1 on the diagonal means that 100% of sequences from a given phylum were correctly classified into that phylum.

Interpretation of this plot (Discussion)

The majority of values lie along the diagonal, indicating that the model achieved high classification accuracy overall. Some misclassifications occurred, most notably with sequences from Firmicutes, which were occasionally predicted as:

  • Tenericutes (3.7% of Firmicutes misclassified),
  • Proteobacteria (1.2%),
  • Cyanobacteria (1.2%).
These off-diagonal values are false negatives for Firmicutes and false positives for the predicted phylum, and their magnitude indicates how often one phylum was mistaken for another during prediction. These errors suggest subtle overlaps in sequence patterns between these groups.

Top 20 Feature importances per selected class Multinomial Naive Bayes

Note: This plot is interactive. Double-clicking a class in the legend on the right will display the top 20 features for the selected phylum or phyla. The buttons in the top right corner allow you to zoom and adjust the plot view. Hovering over a feature will display its associated log probability.

About this plot

This figure presents the top 20 most important 8-mer features for the classification of the chosen phylum, based on the log probabilities assigned by the Multinomial Naive Bayes classifier. The features were extracted from 8-mer frequency vectors derived from 16S rRNA sequences used in the same classification experiment as shown in the confusion matrix.
The bar chart displays the 8-mer sequences on the y-axis and their corresponding log probabilities on the x-axis. Log probabilities are never greater than zero, and values closer to zero mean that the 8-mer is more probable within the selected class. The 8-mers shown are the ones the classifier associates most strongly with this class.
For example, the most distinctive 8-mer for Cyanobacteria in this plot is 'gaagaaca' with a log probability of -7.37. The differences in log probabilities among the top 8-mers are small.
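
The log probabilities of a Multinomial Naive Bayes model can be sketched as smoothed relative frequencies of each 8-mer within one class. The snippet below uses additive (Laplace) smoothing with `alpha=1.0`, which is a common default; the smoothing value used in the experiment is not stated, and the function name, vocabulary and counts are hypothetical. Ranking the 8-mers in descending order of log probability is likewise an assumption about how the "top 20" are selected.

```python
import math

def feature_log_probs(kmer_counts, vocabulary, alpha=1.0):
    """Smoothed log P(k-mer | class) from the pooled k-mer counts of one class."""
    total = sum(kmer_counts.get(kmer, 0) for kmer in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {
        kmer: math.log((kmer_counts.get(kmer, 0) + alpha) / denom)
        for kmer in vocabulary
    }

# Hypothetical pooled counts for one class over a tiny three-feature vocabulary.
vocabulary = ["gaagaaca", "cgttgcgc", "tagtaacc"]
class_counts = {"gaagaaca": 6, "cgttgcgc": 1}
log_probs = feature_log_probs(class_counts, vocabulary)

# Rank 8-mers from the highest (least negative) log probability downwards.
top = sorted(log_probs, key=log_probs.get, reverse=True)
for kmer in top:
    print(kmer, round(log_probs[kmer], 3))
```

Because the probabilities of all 8-mers within one class sum to 1, a realistic vocabulary with tens of thousands of 8-mers produces strongly negative values for every feature, which is why even the top 8-mers in the plot have log probabilities around -7.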

Interpretation of this plot (Discussion)

The small differences in log probabilities among the top 8-mers for Cyanobacteria suggest that the classifier does not rely heavily on a single highly predictive 8-mer but rather on a combination of multiple moderately informative 8-mers to correctly identify Cyanobacteria sequences.
When looking at Proteobacteria, you will see that the log probabilities are less negative. This indicates that the top 8-mers of this phylum each have a higher probability within the class under the Multinomial Naive Bayes (MNB) model. In contrast, the log probabilities for Thermodesulfobacteria are more negative, meaning that each of its top 8-mers has a lower probability within that class. Because these values are probabilities of an 8-mer given the class, they show which 8-mers are frequent within a phylum, not necessarily which ones separate it best from other phyla.

Top 20 Feature importances Random Forest

Note: This plot is interactive. The importance score is shown when hovering over the 8-mer line. You can also control the view of the plot with the buttons on the right side.

About this plot

This plot shows the 20 most important 8-mers according to a Random Forest model trained on 8-mer frequency vectors extracted from 16S rRNA sequences. The y-axis lists the 8-mer features; the x-axis shows their normalized feature importance.
Feature importance values are normalized so that the total across all ~29 million 8-mers in the dataset sums to 1.0. For comparison: if all features were equally important, each would have an importance of approximately 3.45×10⁻⁸. A score of 0.0039 is therefore over 113,000× the importance of an average feature, indicating an exceptionally strong influence on the model's decisions.
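
The comparison with a uniform baseline is simple arithmetic and depends entirely on the number of features that share the total importance of 1.0. The sketch below first uses the count stated above. The second call is an assumption added for comparison: with only the four unambiguous nucleotides A, C, G and T, at most 4^8 = 65,536 distinct 8-mers exist, which would give a much larger baseline. The function name is illustrative.

```python
def relative_importance(score, n_features):
    """Return the uniform baseline 1 / n_features and the score relative to it."""
    baseline = 1.0 / n_features
    return baseline, score / baseline

# Feature count as stated in the text above.
baseline, ratio = relative_importance(score=0.0039, n_features=29_000_000)
print(f"{baseline:.2e}", round(ratio))

# Assumption for comparison: only the 4**8 possible 8-mers over A, C, G, T.
acgt_baseline, acgt_ratio = relative_importance(score=0.0039, n_features=4 ** 8)
print(f"{acgt_baseline:.2e}", round(acgt_ratio))
```

With 29 million features the ratio is about 113,100; with 65,536 features it would be about 256. Which figure applies depends on how many distinct 8-mer features the model was actually trained on.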

Interpretation of this plot (Discussion)

The 8-mer 'CGTTGCGC' is the most important, with an importance score of 0.0039 (0.39% of the total), followed by 'TAGTAACC' (0.37%) and 'TGGAATGT' (0.36%).

Multiclass ROC Curve Multinomial Naïve Bayes

Note: This plot is interactive. When double clicking a phylum on the right side, that phylum is shown in the plot. Hovering over the curve displays detailed performance metrics (e.g., TPR, FPR) at that point. The view of the plot is also controllable with the buttons on the top right side.

About this plot

This ROC curve illustrates the performance of a multiclass classification model that classifies microorganisms at the phylum level using sequence-based features (8-mers). The model was trained on a balanced training set and evaluated on an independent test set.
The graph plots the True Positive Rate (y-axis) against the False Positive Rate (x-axis).

  • A True Positive Rate (TPR) of 0 means the model correctly identifies none of the actual positive cases, while a TPR of 1 means the model correctly identifies all of them.
  • A False Positive Rate (FPR) of 0 means no false positives are made, while an FPR of 1 means all negative cases are incorrectly classified as positive.
The black diagonal line (AUC = 0.5) represents random classification. The model achieves micro- and macro-averaged AUC scores of 1.00, as well as AUC values of 1.00 for all individual classes. The macro-average computes the AUC for each class independently and then averages them, giving equal weight to each class. In contrast, the micro-average aggregates all true positives, false positives, and false negatives across all classes before computing the AUC, thereby weighting classes according to their frequency.
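
The two averaging schemes can be illustrated with a small one-vs-rest example. This is a sketch with hypothetical scores for four sequences and two classes; the `auc` helper computes the area under the ROC curve as the probability that a positive example receives a higher score than a negative one, which is equivalent to integrating the curve.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities, one column per class.
classes = ["Bacteroidetes", "Firmicutes"]
y_true = ["Bacteroidetes", "Bacteroidetes", "Firmicutes", "Firmicutes"]
y_score = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7], [0.2, 0.8]]

# One-vs-rest: turn the labels into one 0/1 column per class.
onehot = [[int(label == c) for c in classes] for label in y_true]

# Macro-average: one AUC per class, then the unweighted mean.
per_class = [
    auc([row[j] for row in onehot], [row[j] for row in y_score])
    for j in range(len(classes))
]
macro_auc = sum(per_class) / len(per_class)

# Micro-average: pool every (label, score) pair across classes, then one AUC.
micro_auc = auc(
    [v for row in onehot for v in row],
    [s for row in y_score for s in row],
)
print(per_class, macro_auc, micro_auc)
```

In this toy example both per-class AUCs are 1.0, even though the second sequence would be assigned to the wrong class by its highest score (0.6 for Firmicutes). AUC measures how well the scores rank positives above negatives within a class, so a high AUC can coexist with individual misclassifications; the pooled micro-average (0.9375 here) is more sensitive to such cases.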

Interpretation of this plot (Discussion)

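AUC scores of 1.00 indicate that, on the test set, the predicted scores separate the sequences of each phylum from all other sequences almost perfectly. Because AUC measures how well sequences are ranked rather than which label is finally assigned, these scores do not rule out the misclassifications reported in the evaluation results.
The exact score is shown when hovering over the plot.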

Multiclass ROC Curve Random Forest

Note: This plot is interactive. The exact score is shown when hovering over the plot. You can also control the view of the plot with the buttons on the right side. When double-clicking a phylum in the legend on the right side, only that phylum is shown.

About this plot

This ROC curve illustrates the performance of a multiclass Random Forest classification model that classifies Bacteria and Archaea at the phylum level, using 8-mer frequency vectors derived from 16S rRNA sequences as features.
The model was trained on a balanced training set comprising 80% of the fully filtered dataset and was evaluated on an independent test set containing the remaining 20%. In the ROC curve, the True Positive Rate (y-axis) is plotted against the False Positive Rate (x-axis), both ranging from 0 to 1. The black diagonal line (AUC = 0.5) serves as a reference for random classification.
The macro-average computes the AUC for each class independently and then averages them, giving equal weight to each class. In contrast, the micro-average aggregates all true positives, false positives, and false negatives across all classes before computing the AUC, thereby weighting classes according to their frequency.
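
The 80/20 split sizes follow directly from the size of the filtered dataset. The sketch below shuffles sample indices and splits them; the function name, the fixed seed and the plain random shuffle are assumptions, and the sketch does not attempt to reproduce the balancing of the training set mentioned above.

```python
import random

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

n_sequences = 19931 + 929  # Bacteria + Archaea in the filtered dataset
train_idx, test_idx = split_indices(n_sequences)
print(len(train_idx), len(test_idx))
```

This gives 16,688 training and 4,172 test sequences, matching the support total in the evaluation tables.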

Interpretation of this plot (Discussion)

All individual classes, as well as the macro- and micro-averages, achieved an AUC of 1.00, indicating perfect classification.