About the YABAC-project

Yet Another Bacteria and Archaea Classifier (YABAC) is a standalone tool and web application for classifying Bacteria and Archaea down to phylum rank based on 16S rRNA sequences using machine learning (Multinomial Naïve Bayes, Random Forest).

Glossary

Term Definition
16S rRNA 16S ribosomal RNA (16S rRNA) is an RNA component of the 30S subunit of a prokaryotic ribosome (SSU rRNA). The 16S rRNA molecule plays a crucial role in protein synthesis by ensuring proper ribosome alignment with the start codon during translation (Jha et al., 2020). Due to its evolutionary conservation, 16S rRNA is widely used as a molecular marker for microbial classification (Woese & Fox, 1977).
Accuracy (metric) Accuracy is the ratio between the number of correctly classified samples and the overall number of samples (Chicco & Jurman, 2023), showing how often a classification ML model is correct overall (Evidently AI, 2025). Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN). According to Hendricks (n.d.), industry standards for accuracy are between 70% and 90%.
Algorithm A series of steps in a particular order, i.e. "a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer" (Training Arkansas Computing Teachers (TACT), n.d.)
Archaea Unicellular prokaryotic microorganisms that "were initially isolated from extreme environments (e.g., high temperatures, low pH, hypersalinity)" (Van Wolferen, Pulschen, Baum, Gribaldo & Albers, 2022).
Class A taxonomic rank, i.e. a level of classification. In the hierarchy of biological classification, class sits below phylum and above order.
Classification The process of organizing entities into categories based on shared characteristics or properties (Cambridge, 2025).
Confusion Matrix A table that helps visualize and evaluate the performance of a classification model. It shows the count of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), allowing for a detailed analysis of where a model's predictions may be accurate or inaccurate (Murel & Kavlakoglu, 2024).
Data preparation “Data preparation is the umbrella term for all the activities involved in getting your data ready for analysis or use in a machine learning model” (Chanaka, 2024).
Data preprocessing “Data preprocessing […] is a specific step within data preparation that focuses on cleaning and transforming the data itself” (Chanaka, 2024).
Data science "The usage of scientific methods, algorithms and/or systems to extract knowledge from structured and unstructured data" (Van Bentum, 2025)
F1-score (metric) The F1-score combines “precision and recall into a single performance metric that gives equal weight to both measures” (Frank, 2023). Formula: F1 = (2 * TP) / (2 * TP + FP + FN), which can also be written as: F1 = (2 * precision * recall) / (precision + recall).
Identification The process of recognizing and naming an entity based on its characteristics or features (Cambridge, 2025a).
JSON-file JavaScript Object Notation, a text-based file format with extension .json.
K-mer A substring of length k of a biological sequence.
K-mer analysis Analysis of a (set of) biological sequence(s) based on the k-mers derived from those sequences.
Machine learning "Machine learning is the study of computer algorithms that can improve automatically through experience and by the use of data." (Van Bentum, 2025)
MCC-score (metric) The MCC-score (Matthews correlation coefficient) is similar to the F1-score, but unlike the F1-score, MCC considers the ratio between positive and negative elements, making MCC suitable for measuring performance on imbalanced datasets (Chicco & Jurman, 2023). Formula: MCC = (TP * TN - FP * FN) / √((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)). “An MCC score above 0.3 is considered moderate, and a score above 0.5 is considered strong” (Activeloop, n.d.).
Metagenome The collection of all genetic material present within an environment and/or sample. Therefore, a metagenome typically contains sequences from different species and/or multiple individuals.
Microbiome "A characteristic microbial community occupying a reasonable well-defined habitat which has distinct physio-chemical properties. The microbiome not only refers to the microorganisms involved but also encompass their theatre of activity, which results in the formation of specific ecological niches" (Berg et al., 2020)
Microbiota "The assembly of microorganisms belonging to different kingdoms (Prokaryotes [Bacteria, Archaea], Eukaryotes [e.g., Protozoa, Fungi, and Algae])" (Berg et al., 2020)
Multinomial Naïve Bayes (MNB) A probabilistic machine learning algorithm based on Bayes’ Theorem that is particularly well-suited for discrete frequency-based data, such as word counts in text classification (GeeksforGeeks, 2025).
Phylum A taxonomic rank, i.e. a level of classification. In the hierarchy of biological classification, phylum sits below kingdom and above class.
Precision (metric) Precision “measures how often a ML model correctly predicts the positive class” (Evidently AI, 2025). Formula: Precision = TP / (TP + FP). Precision can be measured on a scale of 0 to 1 or as a percentage. The desired score is as close as possible to 1.0 (100%); a score of exactly 1.0 means that there are no false positives.
Principal Component "2-dimensional representation of a dataset that contains as much as possible of the variation" (Van Bentum, 2025)
Principal Component Analysis (PCA) A linear dimensionality reduction technique for reducing large amounts of data to a few principal components.
Random Forest (RF) A machine learning algorithm: Random Forest (RF) is an ensemble learning method that builds multiple decision trees using bootstrapped samples of the training data (Baumann, Ehlers, Vogt & Rosenhahn, 2013).
Recall (metric) Recall measures which proportion of all actual positives are correctly predicted as positive. Formula: Recall = TP / (TP + FN). Recall can be measured on a scale of 0 to 1 or as a percentage. Recall is also known as the sensitivity or true positive rate (Evidently AI, 2025).
ROC-curve A Receiver Operating Characteristic (ROC) curve is a graphical representation used to evaluate the performance of a binary classification model. It plots the True Positive Rate (TPR, sensitivity) against the False Positive Rate (FPR). TPR=1.0 with FPR=0.0 indicates a perfect model, while TPR=FPR indicates that the model performs no better than a coin flip, i.e. a random classifier (Van Bentum, 2025).
Superkingdom A taxonomic rank, i.e. a level of classification. "Superkingdom" is also known as "Kingdom". In the hierarchy of biological classification, superkingdom sits below domain and above phylum. Superkingdom contains Archaea, Bacteria, and Eukarya.
Supervised learning "Supervised learning is a machine learning approach that is defined by its use of labeled data sets" (Delua, 2021), which means that "for each observation of the predictor measurement(s) xi, i = 1, . . . , n there is an associated response measurement yi" (James et al., 2023).
Sustainable Development Goal (SDG) A target defined by the United Nations in 2015, aimed at creating a more sustainable and equitable future for all by 2030. In total, there are 17 SDGs, with topics ranging from gender equality and poverty to climate action.
Taxonomy The science of classification (Cain, 2025). In the context of this project, 'taxonomy' refers to the biological classification of living and extinct organisms into taxonomic ranks.
Unsupervised learning Unsupervised learning is a machine learning approach for analyzing and clustering unlabeled data sets (Delua, 2021). "These algorithms discover hidden patterns in data without the need for human intervention (hence, they are “unsupervised”)" (Delua, 2021). One should choose unsupervised learning over supervised learning when one can observe "a vector of measurements xi but no associated response yi" (James et al., 2023).
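
The metric formulas in the glossary (accuracy, precision, recall, F1 and MCC) can be checked numerically with a small Python sketch. This is an illustration only, not the code used in YABAC: the function name classification_metrics and the example counts are hypothetical, and the sketch covers the binary case, whereas a multi-class problem such as phylum classification would need per-class counts or a multi-class generalisation.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the glossary metrics from binary confusion-matrix counts.

    Hypothetical helper for illustration; not part of the YABAC code base.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("The confusion matrix contains no samples.")

    accuracy = (tp + tn) / total
    # Guard against empty denominators (e.g. no positive predictions at all).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * tp) / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    # MCC denominator: square root of the product of the four marginal sums.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Made-up counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the accuracy is 0.875 and the recall is 0.9, while precision (90/105) is lower because of the 15 false positives.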

Why the YABAC-project?

Since 2015, the number of people facing hunger and food insecurity has been rising, reaching ~9.2 percent of the world population (≈735 million people) in 2022 (United Nations, 2023). In response, the United Nations has formulated Sustainable Development Goal 2 to achieve a world with zero hunger by 2030. Causes of food insecurity can be divided into two categories: insufficient food availability and insufficient access to food (Smith, Obeid, & Jensen, 2000). Food availability is determined by food production. Crops (plants) form the basis of every food chain (Morrissey, Dow, Mark, & O'Gara, 2004) and are grown in soil, a complex ecosystem containing microbes such as bacteria, fungi, viruses, archaea, and protists (Marzouk, Kwaslema, Omar, & Mohamed, 2024). The quantity and quality of crop yields, i.e. food production, depend on the soil microbiota, as some microbes “promote plant growth by cycling plant nutrients, suppressing plant pests and diseases, and contributing to soil aggregation, porosity, moisture retention, and organic matter accumulation” while other microbes can be harmful to crops if they cause disease or compete for nutrients (Kaminsky et al., 2021).

Therefore, in addition to metrics such as pH, nutrient levels, and soil compaction, DNA-based assessment of soil microbes is becoming common practice for measuring soil health (Fierer, Wood & Bueno de Mesquita, 2020). Examples of DNA-based classification methods are pairwise alignments and profile Hidden Markov Models; however, these methods are limited because the majority of the soil microbiome has not yet been sequenced (Iqbal, Begum, Ullah, Jalal, & Shaw, 2024). Today, machine learning (ML), a branch of artificial intelligence, is emerging as an alternative method for microbial classification (Wu & Gadsden, 2023).

To explore the possibilities of machine learning in the assessment of soil microbes, specifically Archaea and Bacteria, this project aims to develop two ML classifiers, each based on a different ML algorithm, that classify the taxonomic lineage of Archaea and Bacteria from 16S ribosomal RNA (16S rRNA) sequences with accuracy ≥85% and MCC ≥0.3. 16S rRNA is commonly used for classification of Bacteria and Archaea due to its evolutionarily conserved nature (Kim & Chun, 2014).

How we did it

We trained two machine learning models, Multinomial Naïve Bayes (MNB) and Random Forest (RF), on publicly available 16S rRNA reference sequences from the NCBI. This dataset, collected in 2019, contained 934 archaeal 16S rRNA sequences and 19,981 bacterial 16S rRNA sequences. 8-mers were used as features for training. The two models, based on different algorithms, were validated and tested to compare their accuracy and effectiveness in classifying microorganisms.
The final models and intermediate results were integrated into an accessible dashboard, the tool page, that allows users to upload their own data, run predictions on that data with the trained machine learning models, explore classifications, and visualize microbial distribution in soil samples. For the full materials and methods, please see the materials and methods page.
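
To illustrate what using 8-mers as features means, the sketch below counts overlapping k-mers in a sequence and turns the counts into a fixed-length vector. It is a minimal illustration, not the YABAC implementation: the function names count_kmers and kmer_vector are hypothetical, and both converting U to T and skipping windows that contain ambiguous bases are assumptions made here for simplicity.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence: str, k: int = 8) -> Counter:
    """Count overlapping k-mers in a nucleotide sequence.

    Assumptions (not taken from the YABAC methods): U is treated as T,
    and any window containing a character other than A, C, G or T is skipped.
    """
    seq = sequence.upper().replace("U", "T")
    counts = Counter()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_vector(sequence: str, k: int, vocabulary: list) -> list:
    """Turn k-mer counts into a fixed-length count vector over a vocabulary."""
    counts = count_kmers(sequence, k)
    return [counts[kmer] for kmer in vocabulary]


# Toy example with k=3 to keep the output readable; for k=8 the full
# vocabulary has 4**8 = 65536 possible k-mers.
toy_counts = count_kmers("ACGTACGTN", k=3)
vocabulary_3 = ["".join(p) for p in product("ACGT", repeat=3)]
toy_vector = kmer_vector("ACGTACGTN", 3, vocabulary_3)

print(toy_counts.most_common())
print(len(toy_vector), sum(toy_vector))
```

Count vectors of this kind are discrete frequency data, which is the type of input that Multinomial Naïve Bayes is designed for.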

Contributors

YABAC was developed by a group of BSc Bioinformatics undergraduate students from the HAN University of Applied Sciences (Nijmegen, The Netherlands), under supervision of Douwe van der Leest and the HAN BioCentre Centre of Expertise. The official website of the HAN BioCentre can be found here.

The developers of YABAC:

Esmay Wissink
Laurie Straver
Tom Ummenthun
Rens Heerkens
Luuk Veeken

How to cite YABAC

If you find YABAC useful in your work, please cite our poster publication:

Wissink, E., Ummenthun, T., Straver, L., Heerkens, R. & Veeken, L. (2025, June 24). Classifying Bacteria and Archaea phyla based on 16s rRNA 8-mers using Multinomial Naïve Bayes and Random Forest. BIN-2 and BIN-3 poster conference 2025, HAN University of Applied Sciences, Nijmegen, The Netherlands.

License

Yet Another Bacteria and Archaea Classifier © 2025 by Wissink, E., Ummenthun, T., Straver, L., Heerkens, R. & Veeken, L. is licensed under CC BY-NC-SA 4.0 and owned by the HAN University of Applied Sciences.

Contact

To contact us, send a letter to the following address:

HAN University of Applied Sciences
For the attention of class BIN-3c HBO Bioinformatics
Laan van Scheut 2
6525 EM Nijmegen
The Netherlands


Make sure to include "YABAC" in the subject line of your letter.

References

Image attribution

About the YABAC-website

Explanation of the web pages

  • Home

    The homepage of YABAC provides an overview of the project. It serves as a central hub, offering navigation to the other web pages, such as the Tool, the Data Dashboard and the FAQ.

  • Data Dashboard

    The Data Dashboard allows users to explore the data. It includes visualizations and statistics that help users better understand the dataset used for the machine learning models.

  • Material and methods

    The Material and methods page provides an overview of the data sources, preprocessing steps, and machine learning techniques used in the YABAC project. It describes how the models were trained, validated, and tested, including information about the algorithms, feature extraction using 8-mers, and the evaluation metric (MCC). The goal is to give users insight into the methodology behind this project.

  • Tool

    The Tool page enables users to apply the machine learning models to their own data. Users can choose between the Multinomial Naïve Bayes (MNB) model and the Random Forest (RF) model. After uploading a FASTA file, the user starts the selected model by pressing Run. When the model has finished running, the results are shown in a new window.

  • FAQ

    The FAQ page contains answers to frequently asked questions about the YABAC project, the use of the tool, the interpretation of the results and more.