YABAC: Yet Another Bacteria and Archaea Classifier

This website provides a user-friendly interface for classifying bacteria and archaea. You can upload your own 16S rRNA sequences in .fasta format, and the tool classifies them at the phylum level using machine learning models.

The tool treats phylum prediction as a multiclass classification task and uses Multinomial Naïve Bayes (MNB) and Random Forest (RF) models to predict the taxonomy of Bacteria and Archaea at the phylum level. For each sequence, the output shows the predicted phylum and the probability that the sequence belongs to it.

Currently, it is not possible to save or download your results. Once the classification is done, the results are shown on the website only.

To contact us, send a letter to the following address:

HAN University of Applied Sciences
For the attention of class BIN-3c HBO Bioinformatics
Laan van Scheut 2
6525 EM Nijmegen
The Netherlands

Make sure to include "YABAC" in the subject line of your letter.

Both machine learning models (MNB and RF) were trained on a dataset of 16S rRNA sequences from Archaea and Bacteria, extracted from the NCBI RefSeq database. This dataset provides comprehensive coverage of 16S rRNA sequences across a wide range of bacterial and archaeal taxa. Its detailed taxonomic resolution makes it well-suited for use in phylogenetic studies. The inclusion of RefSeq IDs and taxonomic identifiers facilitates interoperability with NCBI. The data dates from 2019. For more details on the dataset, please read this section on our Materials & Methods page.

The tool only supports .fasta files (example).
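
For readers unfamiliar with the format: a FASTA file consists of header lines starting with ">" followed by one or more lines of sequence. The sketch below shows one way such a file could be read in Python. The function name parse_fasta and the example sequences are hypothetical and only illustrate the format; this is not the validation code used by the website.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; the header is everything after '>'.
            header = line[1:].strip()
            if header in records:
                raise ValueError(f"duplicate header: {header!r}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequence lines may be wrapped; join them and normalize case.
            records[header] += line.upper()
    return records


# Hypothetical example content, made up for illustration only.
example = """>seq1 hypothetical 16S rRNA fragment
AGAGTTTGATCCTGGCTCAG
gattgaacgctggcggc
>seq2 another hypothetical fragment
TTGATCCTGGCTCAGGATGA
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

A sequence wrapped over several lines is joined into a single string, so line wrapping does not change the parsed result.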

Multinomial Naïve Bayes (MNB) is a probabilistic machine learning algorithm based on Bayes’ Theorem and is particularly well-suited for discrete frequency-based data, such as word counts in text classification (MultinomialNB, n.d.). Bayes’ Theorem, named after the 18th-century British mathematician Thomas Bayes, describes the conditional probability of an event based on prior knowledge of conditions that might be related to the event (Hayes, 2025). In MNB, this principle is used to estimate the probability that a sequence belongs to a particular class, based on its k-mer frequency profile. For more information about the MNB algorithm, please read this section on Multinomial Naïve Bayes (MNB) on our Materials & Methods page.
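
To make the idea concrete, here is a minimal sketch of MNB on k-mer counts, written in plain Python rather than with the library used for the actual models. Everything in it is an assumption chosen for illustration: the k-mer length (k = 3), the Laplace smoothing value (alpha = 1.0), the toy sequences, and the class names "PhylumA" and "PhylumB". It does not reflect the settings of the models behind this website.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train_mnb(data: List[Tuple[str, str]], k: int = 3, alpha: float = 1.0):
    """Estimate class priors and Laplace-smoothed k-mer log-likelihoods."""
    class_counts = Counter(label for _, label in data)
    kmer_totals: Dict[str, Counter] = {c: Counter() for c in class_counts}
    for seq, label in data:
        kmer_totals[label].update(kmer_counts(seq, k))
    vocab = sorted({km for counts in kmer_totals.values() for km in counts})
    model = {}
    for c, counts in kmer_totals.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        model[c] = {
            "log_prior": math.log(class_counts[c] / len(data)),
            "log_lik": {km: math.log((counts[km] + alpha) / denom) for km in vocab},
        }
    return model


def predict_proba(model, seq: str, k: int = 3) -> Dict[str, float]:
    """Return P(class | sequence) for every class via Bayes' Theorem."""
    scores = {}
    for c, params in model.items():
        score = params["log_prior"]
        for km, n in kmer_counts(seq, k).items():
            if km in params["log_lik"]:  # k-mers unseen in training are ignored
                score += n * params["log_lik"][km]
        scores[c] = score
    # Normalize the log scores into probabilities (log-sum-exp for stability).
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exp.values())
    return {c: v / total for c, v in exp.items()}


# Toy training data with hypothetical class labels.
training = [
    ("ACGACGACGACG", "PhylumA"),
    ("ACGACGTACGAC", "PhylumA"),
    ("TTGTTGTTGTTG", "PhylumB"),
    ("TTGTTGATTGTT", "PhylumB"),
]
model = train_mnb(training)
probs = predict_proba(model, "ACGACGACG")
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

The probabilities returned for a sequence sum to 1 across all classes; the class with the highest value is the prediction, which matches the kind of output the tool reports.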

Random Forest (RF) is an ensemble learning method that builds multiple decision trees using bootstrapped samples of the training data. At each decision point within a tree, a random subset of features is selected to determine the best split. This randomness helps reduce overfitting and enhances the model’s ability to generalize to unseen data. Once all trees are trained, RF makes predictions by combining their individual outputs, using majority voting for classification tasks or averaging for regression tasks. This ensemble approach improves prediction accuracy and robustness, especially when working with structured or high-dimensional data. RF also performs well even when some input features are noisy or irrelevant (Koehrsen, 2017). For more information about the RF algorithm, please read this section on Random Forest (RF) on our Materials & Methods page.
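
The two key ingredients described above, bootstrapped samples and majority voting, can be sketched in a few lines of plain Python. To keep the example short, each "tree" here is simplified to a single split (a decision stump) on a randomly chosen feature, which is an assumption of this sketch and not how a full Random Forest grows its trees. The data, labels, and parameter values are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence

Row = Sequence[float]


def gini(labels: List[str]) -> float:
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(X: List[Row], y: List[str], features: List[int]):
    """Find the best single threshold split among the given features."""
    majority = Counter(y).most_common(1)[0][0]
    best = (float("inf"), None, None, majority, majority)
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue  # not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best[1:]  # (feature, threshold, left_label, right_label)


def predict_stump(stump, row: Row) -> str:
    f, t, left_label, right_label = stump
    if f is None:  # no valid split was found; fall back to the majority label
        return left_label
    return left_label if row[f] <= t else right_label


def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap: sample with replacement
        feats = rng.sample(range(len(X[0])), n_features)  # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row: Row) -> str:
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Toy data: two features, two hypothetical classes.
X = [[5.0, 1.0], [6.0, 0.0], [7.0, 1.0], [1.0, 6.0], [0.0, 5.0], [1.0, 7.0]]
y = ["A", "A", "A", "B", "B", "B"]
forest = fit_forest(X, y)
prediction = predict_forest(forest, [6.0, 0.0])
print(prediction)
```

Because every stump sees a different bootstrap sample and a different feature, an individual stump can be wrong while the vote across all of them is still correct, which is the robustness the ensemble is meant to provide.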

We started the project in February 2025 and expect to complete it by June 2025.

The plots on the website are generated using the Plotly JavaScript library. This library is known for its interactive and visually appealing plots, but it can be slower to load compared to static plots. The loading time may vary depending on your internet connection and the complexity of the plot. We are working on optimizing the loading time of the plots to improve the user experience.

Frequently Asked Questions (FAQ)