YABAC: Yet Another Bacteria and Archaea Classifier

About the YABAC-project

Yet Another Bacteria and Archaea Classifier (YABAC) is a standalone tool and web application that classifies Bacteria and Archaea down to phylum rank from 16S rRNA sequences using machine learning (Multinomial Naive Bayes, Random Forest).

Glossary

Term	Definition
16S rRNA	16S ribosomal RNA (16S rRNA) is an RNA component of the 30S subunit of a prokaryotic ribosome (SSU rRNA). The 16S rRNA molecule plays a crucial role in protein synthesis by ensuring proper ribosome alignment with the start codon during translation (Jha et al., 2020). Due to its evolutionary conservation, 16S rRNA is widely used as a molecular marker for microbial classification (Woese & Fox, 1977).
Accuracy (metric)	Accuracy is the ratio between the number of correctly classified samples and the overall number of samples (Chicco & Jurman, 2023), showing how often a classification ML model is correct overall (Evidently AI, 2025). Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN). According to Hendricks (n.d.), industry standards for accuracy are between 70% and 90%.
Algorithm	A series of steps in a particular order, i.e. "a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer" (Training Arkansas Computing Teachers (TACT), n.d.)
Archaea	Unicellular prokaryotic microorganisms that "were initially isolated from extreme environments (e.g., high temperatures, low pH, hypersalinity)" (Van Wolferen, Pulschen, Baum, Gribaldo & Albers, 2022).
Class	A taxonomic rank, i.e. a level of classification. In the hierarchy of biological classification, class sits below phylum and above order.
Classification	The process of organizing entities into categories based on shared characteristics or properties (Cambridge, 2025).
Confusion Matrix	A table that helps visualize and evaluate the performance of a classification model. It shows the count of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), allowing for a detailed analysis of where a model's predictions may be accurate or inaccurate (Murel & Kavlakoglu, 2024).
Data preparation	“Data preparation is the umbrella term for all the activities involved in getting your data ready for analysis or use in a machine learning model” (Chanaka, 2024).
Data preprocessing	“Data preprocessing […] is a specific step within data preparation that focuses on cleaning and transforming the data itself” (Chanaka, 2024).
Data science	"The usage of scientific methods, algorithms and/or systems to extract knowledge from structured and unstructured data" (Van Bentum, 2025).
F1-score (metric)	The F1-score combines “precision and recall into a single performance metric that gives equal weight to both measures” (Frank, 2023). Formula: F1 = (2 * TP) / (2 * TP + FP + FN), which can also be written as: F1 = (2 * precision * recall) / (precision + recall).
Identification	The process of recognizing and naming an entity based on its characteristics or features (Cambridge, 2025a).
JSON-file	JavaScript Object Notation, a text-based file format with extension .json.
K-mer	A substring of length k of a biological sequence.
K-mer analysis	Analysis of a (set of) biological sequence(s) based on the k-mers derived from those sequences.
Machine learning	"Machine learning is the study of computer algorithms that can improve automatically through experience and by the use of data." (Van Bentum, 2025)
MCC-score (metric)	The MCC-score (Matthews correlation coefficient) is similar to the F1-score, but unlike the F1-score, MCC considers the ratio between positive and negative elements, making MCC suitable for measuring performance on imbalanced datasets (Chicco & Jurman, 2023). Formula: MCC = (TP * TN – FP * FN) / √((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)). “An MCC score above 0.3 is considered moderate, and a score above 0.5 is considered strong” (Activeloop, n.d.).
Metagenome	The collection of all genetic material present within an environment and/or sample. Therefore, a metagenome typically contains sequences from different species and/or multiple individuals.
Microbiome	"A characteristic microbial community occupying a reasonable well-defined habitat which has distinct physio-chemical properties. The microbiome not only refers to the microorganisms involved but also encompass their theatre of activity, which results in the formation of specific ecological niches" (Berg et al., 2020)
Microbiota	"The assembly of microorganisms belonging to different kingdoms (Prokaryotes [Bacteria, Archaea], Eukaryotes [e.g., Protozoa, Fungi, and Algae])" (Berg et al., 2020)
Multinomial Naive Bayes (MNB)	A probabilistic machine learning algorithm based on Bayes’ Theorem that is particularly well-suited for discrete frequency-based data, such as word counts in text classification (GeeksforGeeks, 2025).
Phylum	A taxonomic rank, i.e. a level of classification. In the hierarchy of biological classification, phylum sits below kingdom and above class.
Precision (metric)	Precision “measures how often a ML model correctly predicts the positive class” (Evidently AI, 2025). Formula: Precision = TP / (TP + FP). Precision can be expressed on a scale of 0 to 1 or as a percentage. The closer the score is to 1.0 (100%), the fewer false positives there are; a score of exactly 1.0 means there are none.
Principal Component	"2-dimensional representation of a dataset that contains as much as possible of the variation" (Van Bentum, 2025)
Principal Component Analysis (PCA)	A linear dimensionality reduction technique for reducing large amounts of data to a few principal components.
Random Forest (RF)	A machine learning algorithm: Random Forest (RF) is an ensemble learning method that builds multiple decision trees using bootstrapped samples of the training data (Baumann, Ehlers, Vogt & Rosenhahn, 2013)
Recall (metric)	Recall measures which proportion of all actual positives are correctly predicted as positive. Formula: Recall = TP / (TP + FN). Recall can be expressed on a scale of 0 to 1 or as a percentage. Recall is also known as the sensitivity or true positive rate (Evidently AI, 2025).
ROC-curve	A Receiver Operating Characteristic (ROC) curve is a graphical representation used to evaluate the performance of a binary classification model. It plots the True Positive Rate (TPR, sensitivity) against the False Positive Rate (FPR). TPR=1.0 with FPR=0.0 indicates a perfect model, while TPR=FPR indicates that the model performs no better than a coin flip (i.e. a random classifier) (Van Bentum, 2025).
Superkingdom	A taxonomic rank, i.e. a level of classification. "Superkingdom" is also known as "Kingdom". In the hierarchy of biological classification, superkingdom sits below domain and above phylum. Superkingdom contains Archaea, Bacteria, and Eukarya.
Supervised learning	"Supervised learning is a machine learning approach that is defined by its use of labeled data sets" (Delua, 2021), which means that "for each observation of the predictor measurement(s) xi, i = 1, . . . , n there is an associated response" (James et al., 2023).
Sustainable Development Goal (SDG)	A target defined by the United Nations in 2015, aimed at creating a more sustainable and equitable future for all by 2030. In total, there are 17 SDGs, whose topics range from addressing gender equality and poverty to promoting climate action.
Taxonomy	The science of classification (Cain, 2025). In the context of this project, 'taxonomy' refers to the biological classification of living and extinct organisms into taxonomic ranks.
Unsupervised learning	Unsupervised learning is a machine learning approach for analyzing and clustering unlabeled data sets (Delua, 2021). "These algorithms discover hidden patterns in data without the need for human intervention (hence, they are “unsupervised”)" (Delua, 2021). One should choose unsupervised learning over supervised learning when one can observe "a vector of measurements xi but no associated response yi" (James et al., 2023).
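
As a minimal sketch of how the metric formulas in the glossary fit together, the Python function below computes accuracy, precision, recall, F1 and MCC from the four confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the YABAC code base. The formulas are the binary ones listed in the glossary, so for a multi-class problem such as phylum classification they would have to be applied per class (one class versus the rest) or replaced by a multi-class variant.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute the glossary metrics from binary confusion-matrix counts.

    Assumption: any metric whose denominator is zero is reported as 0.0.
    This is a convention chosen for this sketch, not a rule from the glossary.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    # MCC denominator: square root of the product of the four marginal sums.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Computing all five metrics from the same counts makes it easy to see that a model can score well on one metric and less well on another, which is why the project targets are stated for both accuracy and MCC.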

Why the YABAC-project?

Since 2015, the number of people facing hunger and food insecurity has been rising; in 2022 it stood at roughly 9.2 percent of the world population (≈735 million people) (United Nations, 2023). In response, the United Nations formulated Sustainable Development Goal 2, which aims to achieve zero hunger by 2030. Causes of food insecurity fall into two categories: insufficient food availability and insufficient access to food (Smith, Obeid, & Jensen, 2000). Food availability is determined by food production. Crops (plants) form the basis of every food chain (Morrissey, Dow, Mark, & O'Gara, 2004) and are grown in soil, a complex ecosystem containing microbes such as bacteria, fungi, viruses, archaea, and protists (Marzouk, Kwaslema, Omar, & Mohamed, 2024). The quantity and quality of crop yields, i.e. food production, depend on the soil microbiota: some microbes “promote plant growth by cycling plant nutrients, suppressing plant pests and diseases, and contributing to soil aggregation, porosity, moisture retention, and organic matter accumulation”, while other microbes can harm crops by causing disease or competing for nutrients (Kaminsky et al., 2021).

Therefore, in addition to metrics such as pH, nutrient levels, and soil compaction, DNA-based assessment of soil microbes is becoming common practice for measuring soil health (Fierer, Wood & Bueno de Mesquita, 2020). Examples of DNA-based classification methods are pairwise alignments and profile Hidden Markov Models; however, these methods are limited because the majority of the soil microbiome has not yet been sequenced (Iqbal, Begum, Ullah, Jalal, & Shaw, 2024). Today, machine learning (ML), a branch of artificial intelligence, is emerging as an alternative method for microbial classification (Wu & Gadsden, 2023).

To explore the possibilities of machine learning in the assessment of soil microbes, specifically Archaea and Bacteria, this project aims to develop two ML classifiers, each based on a different ML algorithm, that classify the taxonomic lineage of Archaea and Bacteria from 16S ribosomal RNA (16S rRNA) sequences with an accuracy of ≥85% and an MCC of ≥0.3. 16S rRNA is commonly used for classification of Bacteria and Archaea because it is evolutionarily conserved (Kim & Chun, 2014).

How we did it

We trained two machine learning models, Multinomial Naïve Bayes (MNB) and Random Forest (RF), on publicly available 16S rRNA reference sequences from NCBI. This dataset, collected in 2019, contained 934 archaeal 16S rRNA sequences and 19,981 bacterial 16S rRNA sequences. The two models, based on different algorithms, were validated and tested to compare their accuracy and effectiveness in classifying microorganisms. 8-mers were used as features for training.
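
To illustrate what using 8-mers as features means, the sketch below counts the overlapping k-mers in a single sequence. It is a hedged example, not the YABAC implementation: the function name, the toy sequence, the conversion of U to T, and the choice to skip k-mers containing ambiguous characters such as N are all assumptions made for this illustration. With four nucleotides there are 4^8 = 65,536 possible 8-mers, so each sequence can be represented as a vector of counts over that vocabulary.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def count_kmers(sequence, k=8):
    """Count overlapping k-mers in a nucleotide sequence.

    Assumptions for this sketch: U is treated as T, and any k-mer that
    contains a character other than A, C, G or T is skipped.
    """
    seq = sequence.upper().replace("U", "T")
    counts = Counter()
    # Slide a window of length k over the sequence, one position at a time.
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    return counts


# Hypothetical toy sequence, far shorter than a real 16S rRNA sequence.
toy_sequence = "ACGTACGTACGTN"
kmer_counts = count_kmers(toy_sequence, k=8)
for kmer, count in sorted(kmer_counts.items()):
    print(kmer, count)
```

Count vectors of this kind are the sort of discrete, frequency-based input that the glossary describes as well suited to Multinomial Naive Bayes.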
The final models and intermediate results were integrated into an accessible dashboard, the tool page, where users can upload their own data, run predictions on it with the trained machine learning models, explore classifications, and visualize microbial distribution in soil samples. For the full materials and methods, please see the materials and methods page.

Contributors

YABAC was developed by a group of BSc Bioinformatics undergraduate students from the HAN University of Applied Sciences (Nijmegen, The Netherlands), under the supervision of Douwe van der Leest and the HAN BioCentre Centre of Expertise. The official website of the HAN BioCentre can be found here.

The developers of YABAC:

Esmay Wissink
Laurie Straver
Tom Ummenthun
Rens Heerkens
Luuk Veeken

How to cite YABAC

If you find YABAC useful in your work, please cite our poster publication:

Wissink, E., Ummenthun, T., Straver, L., Heerkens, R. & Veeken, L. (2025, June 24). Classifying Bacteria and Archaea phyla based on 16s rRNA 8-mers using Multinomial Naïve Bayes and Random Forest. BIN-2 and BIN-3 poster conference 2025, HAN University of Applied Sciences, Nijmegen, The Netherlands.

License

Yet Another Bacteria and Archaea Classifier © 2025 by Wissink, E., Ummenthun, T., Straver, L., Heerkens, R. & Veeken, L. is licensed under CC BY-NC-SA 4.0 and owned by the HAN University of Applied Sciences.

Contact

To contact us, send a letter to the following address:

HAN University of Applied Sciences
For the attention of class BIN-3c HBO Bioinformatics
Laan van Scheut 2
6525 EM Nijmegen
The Netherlands

Make sure to include "YABAC" in the subject line of your letter.

References

Activeloop (n.d.) What is Matthews Correlation Coefficient. https://www.activeloop.ai/resources/glossary/matthews-correlation-coefficient-mcc/
Anand, A. (2024, September 21). Bagging and Boosting in AI: A Comprehensive Guide to Ensemble Learning. DEV Community. https://dev.to/abhinowww/bagging-and-boosting-in-ai-a-comprehensive-guide-to-ensemble-learning-2cf0
Baumann, F., Ehlers, A., Vogt, K., & Rosenhahn, B. (2013). Cascaded Random Forest for Fast Object Detection. In Lecture notes in computer science (pp. 131–142). https://doi.org/10.1007/978-3-642-38886-6_13
Berg, G., Rybakova, D., Fischer, D., Cernava, T., Vergès, M. C., Charles, T., Chen, X., Cocolin, L., Eversole, K., Corral, G. H., Kazou, M., Kinkel, L., Lange, L., Lima, N., Loy, A., Macklin, J. A., Maguin, E., Mauchline, T., McClure, R., . . . Schloter, M. (2020). Microbiome definition re-visited: old concepts and new challenges. Microbiome, 8(1). https://doi.org/10.1186/s40168-020-00875-0
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/a:1010933404324
Cain, A.J. (2025, April 25). Taxonomy | Definition, Examples, Levels, & Classification. Encyclopedia Britannica. https://www.britannica.com/science/taxonomy
Cambridge. (2025). classifying. In Cambridge. https://dictionary.cambridge.org/dictionary/english/classifying
Cambridge. (2025). identification. In Cambridge. https://dictionary.cambridge.org/dictionary/english/identification
Chanaka. (2024, November 22). Data preparation vs. data preprocessing. Medium. https://medium.com/@ChanakaDev/data-preparation-vs-data-preprocessing-17403b9a1e14
Chicco, D., & Jurman, G. (2023). The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification. BioData Mining, 16(1). https://doi.org/10.1186/s13040-023-00322-4
Deban, Ningthoujam, A. S., Sanasam, S., Tamreihao, K., & Nimaich, S. (2009). Antagonistic activities of local actinomycete isolates against rice fungal pathogens. African Journal Of Microbiology Research, 3(11), 737–742. https://doi.org/10.5897/ajmr.9000038
Delua, J. (2021, March 12). Supervised vs Unsupervised Learning. Think. Retrieved May 28, 2025, from https://www.ibm.com/think/topics/supervised-vs-unsupervised-learning
Donald, L., Pipite, A., Subramani, R., Owen, J., Keyzers, R. A., & Taufa, T. (2022). Streptomyces: Still the Biggest Producer of New Natural Secondary Metabolites, a Current Perspective. Microbiology Research, 13(3), 418–465. https://doi.org/10.3390/microbiolres13030031
Evidently AI (2025, January 9). Accuracy vs. precision vs. recall in machine learning: what’s the difference?. Evidently AI. https://www.evidentlyai.com/classification-metrics/accuracy-precision-recall
Fierer, N., Wood, S. A., & Bueno De Mesquita, C. P. (2020). How microbes can, and cannot, be used to assess soil health. Soil Biology and Biochemistry, 153, 108111. https://doi.org/10.1016/j.soilbio.2020.108111
Frank, E. (2023, October 13). Understanding the F1 Score. Medium. https://ellielfrank.medium.com/understanding-the-f1-score-55371416fbe1
Gao, B., & Gupta, R. S. (2012). Phylogenetic framework and molecular signatures for the main clades of the phylum Actinobacteria. Microbiology and molecular biology reviews : MMBR, 76(1), 66–112. https://doi.org/10.1128/MMBR.05011-11
GeeksforGeeks. (2025, January 29). Multinomial naive Bayes. GeeksforGeeks. https://www.geeksforgeeks.org/multinomial-naive-bayes/
Great Learning. (2024, September 2). Multinomial naive Bayes explained. Great Learning Blog: Free Resources What Matters To Shape Your Career! https://www.mygreatlearning.com/blog/multinomial-naive-bayes-explained/#disadvantages
Hayes, A. (2025, February 24). Bayes' Theorem: What it is, formula, and examples. Investopedia. https://www.investopedia.com/terms/b/bayes-theorem.asp
Hendricks, R. (n.d.). What is a good accuracy score in Machine Learning?. deepchecks. https://www.deepchecks.com/question/what-is-a-good-accuracy-score-in-machine-learning/
Iqbal, S., Begum, F., Ullah, I., Jalal, N., & Shaw, P. (2024). Peeling off the layers from microbial dark matter (MDM): recent advances, future challenges, and opportunities. Critical Reviews in Microbiology, 1–21. https://doi.org/10.1080/1040841x.2024.2319669
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2023). An Introduction to Statistical Learning: with Applications in R (Second Edition) [E-book]. Springer. https://www.statlearning.com/
Jha, V., Roy, B., Jahagirdar, D., McNutt, Z. A., Shatoff, E. A., Boleratz, B. L., Watkins, D. E., Bundschuh, R., Basu, K., Ortega, J., & Fredrick, K. (2020). Structural basis of sequestration of the anti-Shine-Dalgarno sequence in the Bacteroidetes ribosome. Nucleic Acids Research, 49(1), 547–567. https://doi.org/10.1093/nar/gkaa1195
Kaminsky, L., Cloutier, M., Fleishman, S., Isbell, Borrelli, K. & Bell, T. (2021, April 2). Soil Microbes in Organic Cropping Systems 101. eOrganic.org. Retrieved June 11, 2025, from https://eorganic.org/node/34601
Kim, M., & Chun, J. (2014). 16S rRNA Gene-Based Identification of Bacteria and Archaea using the EzTaxon Server. In Methods in microbiology (pp. 61–74). https://doi.org/10.1016/bs.mim.2014.08.001
Koehrsen, W. (2017, December 27). Random Forest Simple Explanation. Medium. https://williamkoehrsen.medium.com/random-forest-simple-explanation-377895a60d2d
Liu, K., & Wong, T. (2013). Naïve Bayesian Classifiers with Multinomial Models for rRNA Taxonomic Assignment. IEEE/ACM Transactions On Computational Biology And Bioinformatics, 10(5), 1. https://doi.org/10.1109/tcbb.2013.114
Magne, F., Gotteland, M., Gauthier, L., Zazueta, A., Pesoa, S., Navarrete, P., & Balamurugan, R. (2020). The Firmicutes/Bacteroidetes Ratio: A Relevant Marker of Gut Dysbiosis in Obese Patients?. Nutrients, 12(5), 1474. https://doi.org/10.3390/nu12051474
Marzouk, S. H., Kwaslema, D. R., Omar, M. M., & Mohamed, S. H. (2024). “Harnessing the power of soil microbes: Their dual impact in integrated nutrient management and mediating climate stress for sustainable rice crop production” A systematic review. Heliyon, 11(1), e41158. https://doi.org/10.1016/j.heliyon.2024.e41158
Million, M., Lagier, J. C., Yahav, D., & Paul, M. (2013). Gut bacterial microbiota and obesity. Clinical microbiology and infection : the official publication of the European Society of Clinical Microbiology and Infectious Diseases, 19(4), 305–313. https://doi.org/10.1111/1469-0691.12172
Ministerie van Algemene Zaken. (2024, September 26). The Netherlands and the Sustainable Development Goals (SDGs). United Nations | Government.nl. https://www.government.nl/topics/united-nations/sustainable-development-goals
Morrissey, J. P., Dow, J. M., Mark, G. L., & O'Gara, F. (2004). Are microbes at the root of a solution to world food production? Rational exploitation of interactions between microbes and plants can help to transform agriculture. EMBO reports, 5(10), 922–926. https://doi.org/10.1038/sj.embor.7400263
Murel, J. & Kavlakoglu, E. (2024, January 19). Confusion matrix. Think. https://www.ibm.com/think/topics/confusion-matrix
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Müller, A., Nothman, J., Louppe, G., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal Of Machine Learning Research, 12(85), 2825–2830. https://doi.org/10.48550/arXiv.1201.0490
Sayers, E. W., Beck, J., Bolton, E. E., Brister, J. R., Chan, J., Connor, R., Feldgarden, M., Fine, A. M., Funk, K., Hoffman, J., Kannan, S., Kelly, C., Klimke, W., Kim, S., Lathrop, S., Marchler-Bauer, A., Murphy, T. D., O'Sullivan, C., Schmieder, E., Skripchenko, Y., … Pruitt, K. D. (2025). Database resources of the National Center for Biotechnology Information in 2025. Nucleic acids research, 53(D1), D20–D29. https://doi.org/10.1093/nar/gkae979
Schoch, C. L., Ciufo, S., Domrachev, M., Hotton, C. L., Kannan, S., Khovanskaya, R., Leipe, D., Mcveigh, R., O'Neill, K., Robbertse, B., Sharma, S., Soussov, V., Sullivan, J. P., Sun, L., Turner, S., & Karsch-Mizrachi, I. (2020). NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database : the journal of biological databases and curation, 2020, baaa062. https://doi.org/10.1093/database/baaa062
Slonczewski, J., Foster, J. W., & Zinser, E. R. (2020). Microbiology: An Evolving Science.
Smith, L. C., Obeid, A. E. E., & Jensen, H. H. (2000). The geography and causes of food insecurity in developing countries. Agricultural Economics, 22(2), 199–215. https://doi.org/10.1111/j.1574-0862.2000.tb00018.x
The pandas development team (2020). pandas-dev/pandas: Pandas. https://doi.org/10.5281/zenodo.3509134
Training Arkansas Computing Teachers (TACT). (n.d.). Big idea 4: Algorithms. University of Arkansas. https://tact.uark.edu/big-idea-4-algorithms/
United Nations. (2023). The Sustainable Development Goals Report (ISBN: 978-92-1-101460-0). United Nations Publications. Retrieved June 10, 2025, from https://unstats.un.org/sdgs/report/2023/The-Sustainable-Development-Goals-Report-2023.pdf
Van Benthum, G. (2025, February 5). Bi10T_DS_BIN_Data Science_1_kickoff [Course presentation]. HAN University of Applied Sciences. https://leren.han.nl/content/enforced/55023-DATSCT04_2024_P3N/Bi10T_DS_BIN_Data%20Science_1_kickoff.pptx
Van Benthum, G. (2025, March 26). Bi10T_DS_BIN_Data Science_7_Evaluation and ROC [Course presentation]. HAN University of Applied Sciences. https://leren.han.nl/content/enforced/55023-DATSCT04_2024_P3N/Bi10T_DS_BIN_Data%20Science_7_Evaluation%20and%20ROC.pptx
Verenigde Naties (2020, January 17). Verenigde Naties - Nederlands. https://unric.org/nl/duurzame-ontwikkelingsdoelstellingen/sdg-2/
Wang, Q., Garrity, G. M., Tiedje, J. M., & Cole, J. R. (2007). Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and environmental microbiology, 73(16), 5261–5267. https://doi.org/10.1128/AEM.00062-07
Woese, C. R., & Fox, G. E. (1977). Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proceedings of the National Academy of Sciences, 74(11), 5088–5090. https://doi.org/10.1073/pnas.74.11.5088
Wu, Y., & Gadsden, S. A. (2023). Machine learning algorithms in microbial classification: a comparative analysis. Frontiers in Artificial Intelligence, 6. https://doi.org/10.3389/frai.2023.1200994
van Wolferen, M., Pulschen, A. A., Baum, B., Gribaldo, S., & Albers, S. V. (2022). The cell biology of archaea. Nature microbiology, 7(11), 1744–1755. https://doi.org/10.1038/s41564-022-01215-8

Image attribution

Bacteria free icon by Freepik is licensed under the Flaticon License.
Search free icon by Freepik is licensed under the Flaticon License.
Blue Athletic Field by Mateusz Dach is licensed under the Pexels License.
Low-Angle Shot of a Person's Hands Holding a Petri Dish by Edward Jenner is licensed under the Pexels License.
Close-Up Shot of Scrabble Tiles on a Blue Surface by Ann H is licensed under the Pexels License.
A Person Holding a Thought Bubble by Cup of Couple is licensed under the Pexels License.
