YABAC: Yet Another Bacteria and Archaea Classifier

Introduction

Our project aims to accurately classify Bacteria and Archaea into their respective taxonomic groups using machine learning models based on 16S rRNA sequences. This classification supports improved microbial identification in soil samples, contributing to broader soil health assessments.

On this Materials & Methods page, you will find a detailed overview of the data sources, preprocessing steps, and machine learning techniques applied in this microbiome classification project.

This page covers a detailed explanation of the following steps:

Dataset selection
Data filtering and transformation
Feature extraction from sequences
Machine learning model selection and training (Multinomial Naive Bayes and Random Forest)
Creation of a validation FASTA file, including its rationale

The full list of references used in this project is available on the About page.

Initial dataset

This dataset provides comprehensive coverage of 16S rRNA sequences from 19,931 bacterial and 929 archaeal taxa. Because the dataset distinguishes organisms at a very fine taxonomic level, such as genus and species, it allows researchers to accurately compare and analyze evolutionary relationships between Bacteria and Archaea. The inclusion of RefSeq IDs and taxonomic identifiers facilitates interoperability with NCBI.

When clicking on the 'Download initial dataset' button, a .zip file (11.1MB) will be downloaded containing the initial dataset (.xlsx) and the metadata file (.json).

File specifications

File name	hanbc_metagenomics_refdbase_16s_april_2024_v10_datscp.xlsx
File format	Excel (.xlsx)
File size	11.2MB
Number of columns	30
Number of rows	21053
Author	Christoff Francke
Description	A curated reference dataset of 16S rRNA sequences with detailed taxonomic annotations for metagenomic analysis.

File name	hanbc_metagenomics_refdbase_16s_april_2024_v10_datscp_metadata.json
File format	JSON (.json)
File size	9 kB
Author	Esmay Wissink
Description	A metadata file containing detailed information about the initial dataset.

Column description

Term	Explanation
Species Name (B)	Name of the species, including subspecies or strains.
RefSeq ID (C)	Accession number for the RefSeq record.
GI/RefSeq GI (D + E)	Unique GenInfo Identifier assigned by NCBI to each sequence.
Molecule (F)	Type of molecule (e.g., 16S ribosomal RNA, genome).
Type of data (G)	Specifies the type of sequence data (e.g., genome, mRNA, protein, rRNA).
Partial/full (H)	Indicates whether the sequence is complete or partial.
Length (I)	The total length of the sequence in nucleotides.
Ordered phyla (J)	These are unique identifiers (Taxonomy IDs) assigned by NCBI, representing specific taxa or hierarchical relationships. The '>' symbol may indicate a categorization method used in this dataset.
Sequence (K)	The nucleotide sequence associated with the record.
Change (M)	Indicates whether the record has undergone a name change (Y = changed).
Family (O, AD)	Taxonomic family of the organism.
Genus (P, AE)	Taxonomic genus of the organism.
Subspecies (Q)	Further classification of the species, if applicable.
BIOVAR/SEROVAR (R)	Bacterial groupings based on biochemical or antigenic properties.
Strain (S, AG)	Specific strain of the organism.
Tax ID (W)	Unique identifier in the NCBI Taxonomy database.
Taxonomy (X)	The general classification of the organism.
T:Superkingdom (Y)	Highest-level taxonomic category in the dataset.
T:Phylum (AA)	Broad classification within the superkingdom.
T:Class (AB)	Taxonomic class of the organism.
T:Order (AC)	Taxonomic order classification.
T:Species (AF)	Taxonomic species name.
Alternative Names (T:Strain.1, ALTERNATIVE1, ALTERNATIVE2)	Additional strain designations if applicable.

Filtering dataset

Before the filter steps were applied, a cut-off of 10 was chosen, based on the rule of 10. This rule of thumb states that roughly ten observations are needed for each parameter that a model estimates.

The dataset has been filtered based on the following criteria:

Filter Criterion	Description
Adjustment of RefSeq ID NR_121590.1	The size has been adjusted from >7883 to 'complete' due to this NCBI link: NR_121590.1. It does not contain a complete operon, but it does contain the complete 16S rRNA sequence.
Removed rows with 'linear DNA'	All rows where the type is 'linear DNA' have been removed, because the research is about RNA and not DNA. For most organisms, an RNA variant ('linear RNA') is available, except for: Aeropyrum camini SY1 = JCM 12091 culture JCM:12091; Pyrobaculum neutrophilum culture JCM:9278; Ruminiclostridium cellulolyticum H10.
Removed rows with unclassified phylum	Rows where the phylum is 'unclassified' have been removed. Because we perform specific taxonomic classification at the phylum level, these rows cannot be used.
Removed redundant data	Rows where both the sequence and the species name match exactly have been filtered. For each organism, the first row has been retained and the rest removed, because the sequences are identical and the duplicate is likely a clone or a measurement from another year.

After filtering, the dataset was reduced to 20,199 entries, which is 854 fewer than the initial dataset.
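
The filter steps listed above can be sketched with pandas roughly as follows. This is a minimal illustration on a toy table, not the project's actual script: the column names (`species_name`, `molecule_type`, `phylum`, `sequence`) and the helper `filter_dataset` are hypothetical stand-ins for the spreadsheet columns described earlier.

```python
import pandas as pd

# Toy stand-in for the spreadsheet; all column names are hypothetical.
df = pd.DataFrame({
    "species_name": ["A one", "A one", "B two", "C three", "D four"],
    "molecule_type": ["linear RNA", "linear RNA", "linear DNA", "linear RNA", "linear RNA"],
    "phylum": ["Pseudomonadota", "Pseudomonadota", "Bacillota", "unclassified", "Bacillota"],
    "sequence": ["ACGT", "ACGT", "GGCC", "TTAA", "ACGT"],
})


def filter_dataset(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the three row filters described in the table above."""
    # 1. Remove DNA records, keeping only RNA entries.
    out = frame[frame["molecule_type"] != "linear DNA"]
    # 2. Remove rows without a usable phylum label.
    out = out[out["phylum"].str.lower() != "unclassified"]
    # 3. Remove exact duplicates (same species name AND same sequence),
    #    keeping the first occurrence.
    out = out.drop_duplicates(subset=["species_name", "sequence"], keep="first")
    return out.reset_index(drop=True)


filtered = filter_dataset(df)
print(f"{len(df) - len(filtered)} rows removed, {len(filtered)} rows kept")
```

Note that the duplicate filter only drops a row when both the species name and the sequence match, so two different species sharing an identical sequence are both retained.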

Cut-off threshold

A cut-off threshold of 10 entries per phylum was applied due to limited data availability for some phyla. After applying this threshold, the dataset retained 34 distinct phyla, including 2 Archaea and 32 Bacteria.
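
A per-phylum cut-off of this kind could be applied as in the sketch below. The column name `phylum` and the helper `apply_cutoff` are hypothetical, and the sketch assumes that a phylum with exactly 10 entries is kept (that is, "at least 10"); the text does not state whether the threshold is inclusive.

```python
import pandas as pd

MIN_PER_PHYLUM = 10  # cut-off from the text; inclusive threshold is an assumption


def apply_cutoff(frame: pd.DataFrame, column: str = "phylum",
                 minimum: int = MIN_PER_PHYLUM) -> pd.DataFrame:
    """Keep only rows whose phylum has at least `minimum` entries."""
    counts = frame[column].value_counts()
    phyla_to_keep = counts[counts >= minimum].index
    return frame[frame[column].isin(phyla_to_keep)].reset_index(drop=True)


# Toy example: one well-represented phylum and one rare phylum.
toy = pd.DataFrame({"phylum": ["Bacillota"] * 12 + ["Rare phylum"] * 3})
kept = apply_cutoff(toy)
print(kept["phylum"].value_counts().to_dict())
```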

K-mer analysis

K-mer analysis was performed by counting the occurrences of each k-mer present in each sequence (total number of sequences: 20915). A k-mer length of 8 bases was chosen because Wang et al. (2007) reported that sizes of 6 and 7 bases were less accurate than sizes of 8 and 9 bases, and that “sizes of 8 and 9 bases gave nearly identical results”, so a size of 8 can be chosen over a size of 9 to reduce memory requirements. The k-mer counts were stored in a pandas DataFrame and exported to a .json file.
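
As an illustration of the counting step, the sketch below counts overlapping 8-mers per sequence and collects them in a pandas DataFrame. The sequence names, the helper `count_kmers`, and the JSON layout (`orient="index"`) are assumptions made for the example; the handling of ambiguous bases such as 'N' is not covered here.

```python
from collections import Counter

import pandas as pd

K = 8  # k-mer length used in this project


def count_kmers(sequence: str, k: int = K) -> Counter:
    """Count every overlapping k-mer (sliding window, step 1) in a sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy sequences; real input would be the filtered 16S rRNA sequences.
sequences = {"seq_a": "ACGTACGTACGT", "seq_b": "ACGTACGTTTTT"}

# Rows are sequences, columns are k-mers; absent k-mers get a count of 0.
kmer_table = (
    pd.DataFrame.from_dict(
        {name: dict(count_kmers(seq)) for name, seq in sequences.items()},
        orient="index",
    )
    .fillna(0)
    .astype(int)
)

# Serialise to JSON text; this string could then be written to a .json file.
json_text = kmer_table.to_json(orient="index")
print(kmer_table)
```

A sequence of length n yields n - k + 1 windows, so each toy sequence of 12 bases contributes 5 k-mer occurrences.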

Machine learning model exploration

For this project, two machine learning models using two different algorithms were selected and evaluated independently: Random Forest (RF) and Multinomial Naïve Bayes (MNB). The goal was to classify 16S rRNA sequences at the phylum level based on their k-mer representations.

This is an exploratory classification task, as we are working with a biologically complex dataset. K-mer representations can capture both conserved and variable regions, but distinguishing between closely related phyla remains a challenge. We are also investigating which models perform best in identifying taxonomic groups from sequence-derived features. Our dataset is imbalanced and incomplete at deeper taxonomic ranks, making phylum-level classification a practical and informative target. The aim is not to develop a final predictive system, but to evaluate model performance under realistic bioinformatics constraints, such as the imbalanced dataset.

Boosting algorithms were intentionally avoided. While they often achieve high accuracy (Anand, 2024), they are more prone to overfitting, especially in datasets with class imbalance, like ours, where some phyla are much more represented than others. This would hinder the generalizability of the model to unseen data.

Initially, we considered using a Cascading Random Forest approach, which builds a hierarchical classifier that mirrors the taxonomic structure and can be particularly effective for resolving deeper taxonomic levels, such as genus or species (Zhang et al., 2022). This method is capable of uncovering complex relationships, even when some classes have relatively small sample sizes. However, cascading classifiers are especially sensitive to class imbalance, a limitation that becomes more pronounced at lower taxonomic ranks.

Given the characteristics of our dataset, we chose not to pursue this approach. Our dataset is limited in both quality and coverage below the phylum level. The quality is affected by sequencing errors, the presence of ambiguous bases (denoted as 'N'), and incomplete or uncertain taxonomic annotations. Coverage refers to the number of sequences per taxonomic group, which is often insufficient in our dataset. Many phyla are represented by fewer than the commonly recommended minimum of 100 sequences per class, reducing statistical power and increasing the risk of overfitting. These limitations, combined with the sensitivity of cascading classifiers to class imbalance, made a standard Random Forest approach a more suitable choice for our exploratory analysis.

Random Forest is an ensemble method that is well-suited for high-dimensional, structured data such as k-mer frequency tables. It is known for its robustness, even in the presence of noisy or imbalanced data (Qi, 2012). Moreover, Random Forest offers interpretable outputs, such as feature importance scores and out-of-bag error rates, allowing us to understand which k-mers contribute most to the classification. These characteristics make it a strong candidate for exploratory classification tasks. We used GridSearchCV (GridSearchCV, n.d.) to tune hyperparameters and optimize model performance.

Multinomial Naïve Bayes (MNB) is a probabilistic classifier based on Bayes' theorem, typically used in text classification. It is especially efficient with discrete, frequency-based features such as word counts—or in our case, k-mer counts. While MNB assumes feature independence and ignores the order of features—which conflicts with biological realities like motif co-occurrence and sequence context—it has still been shown to perform well in many practical applications (Soria et al., 2011). Moreover, MNB has been previously applied to taxonomic classification using k-mer analysis (e.g., Wang et al., 2007), making it a reasonable baseline model for comparison.

Multinomial Naive Bayes

Multinomial Naïve Bayes (MNB) is a probabilistic machine learning algorithm based on Bayes’ Theorem and is particularly well-suited for discrete frequency-based data, such as word counts in text classification (GeeksforGeeks, 2025). In this context, MNB is used to classify 16S rRNA sequences based on k-mer frequencies, where k-mers (subsequences of length k) represent the "words" of the sequence.

One key limitation of MNB is that it does not consider the order of k-mers (Liu & Wong, 2013); instead, it only uses the frequency of each k-mer across the sequence. While this simplifies computation, it may reduce accuracy when distinguishing between closely related species with subtle variations in sequence structure. Interactions between nucleotides within the same sequence, such as base pairing when the rRNA folds, are likewise ignored.

However, when taxonomic differences are captured in variable regions of the 16S rRNA gene, the resulting differences in k-mer frequency can still provide sufficient signal for accurate classification, particularly at ranks above the species level such as genus or phylum.

MNB is highly efficient and scalable, making it well-suited for large datasets. It performs reliably even with limited training data, and its probabilistic framework helps to mitigate the effects of class imbalance. Additionally, MNB assumes that all features (i.e., k-mers) contribute independently to the classification outcome, a simplification that does not strictly hold for biological data.

While MNB lacks specificity at the species level due to its disregard for k-mer order, this limitation can be addressed by:

Using longer k-mers (e.g., 8-9 nucleotides) to increase uniqueness
Applying TF-IDF (term frequency – inverse document frequency) weighting to highlight more informative k-mers
Preprocessing sequences to focus on variable regions (e.g., V3–V4) where taxonomic signal is the strongest

Bayes’ Theorem, named after the 18th-century British mathematician Thomas Bayes, describes the conditional probability of an event based on prior knowledge of conditions that might be related to the event (Hayes, 2025). In MNB, this principle is used to estimate the probability that a sequence belongs to a particular class, based on its k-mer frequency profile.
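
For reference, the decision rule can be written out as below. This is the standard textbook form of Multinomial Naïve Bayes with additive smoothing, given here as a sketch rather than as the exact configuration used in this project. The symbols are assumptions introduced for the example: $x_i$ is the count of k-mer $i$ in the sequence to classify, $N_{c,i}$ is the total count of k-mer $i$ over all training sequences of class $c$, $N_c = \sum_i N_{c,i}$, $n$ is the number of distinct k-mers, and $\alpha$ is the smoothing parameter.

```latex
% Bayes' theorem for class c given the k-mer count vector x
P(c \mid \mathbf{x}) = \frac{P(c)\, P(\mathbf{x} \mid c)}{P(\mathbf{x})}

% Smoothed estimate of the probability of k-mer i in class c
\hat{\theta}_{c,i} = \frac{N_{c,i} + \alpha}{N_c + \alpha n}

% Predicted class: the denominator P(x) is the same for every class and drops out
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} x_i \log \hat{\theta}_{c,i} \right]
```

The sum over $x_i \log \hat{\theta}_{c,i}$ is where the independence assumption enters: each k-mer occurrence contributes to the score separately, regardless of its position or its neighbours.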

In summary, while MNB may not be optimal for high-resolution classification (e.g., at the species level), it offers a computationally efficient and robust approach for broader taxonomic assignments, particularly when using well-structured features like k-mer counts.

How the model works:

K-mer Extraction:
- Each 16S rRNA sequence is broken down into overlapping k-mers (subsequences of length k).
- The model uses a k-mer length of 8 and processes all possible sliding windows across the sequence, following Wang et al. (2007).
Feature Vector Creation:
- A bag-of-k-mers approach is used, where k-mers are treated as independent features.
- The frequency of each k-mer is counted and stored in a feature matrix.
- TF-IDF transformation (optional): Applies weighting to emphasize distinguishing k-mers.
Training the Model:
- The MNB classifier is trained on labeled microbial sequences (e.g., Archaea vs. Bacteria).
- The model computes the probability of a sequence belonging to a specific microbial class.
- Classification is based on Bayes’ theorem, assuming k-mers contribute independently to the classification decision.
Prediction:
- For a new sequence, its k-mer frequency vector is computed.
- The trained MNB model assigns probabilities to each microbial class.
- The sequence is assigned to the most probable class (e.g., Archaea or Bacteria).
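
The four steps above can be sketched with scikit-learn as follows. This is a minimal illustration on made-up toy sequences, not the project's trained model: the sequences, labels, and variable names are assumptions, and scikit-learn's `CountVectorizer` with character n-grams is used here as one convenient way to obtain overlapping 8-mer counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical sequences and labels).
train_sequences = [
    "GGGGCCCCGGGGCCCCGGGG",
    "CCCCGGGGCCCCGGGGCCCC",
    "AAAATTTTAAAATTTTAAAA",
    "TTTTAAAATTTTAAAATTTT",
]
train_labels = ["Archaea", "Archaea", "Bacteria", "Bacteria"]

model = make_pipeline(
    # Steps 1 and 2: overlapping 8-mers as a bag-of-k-mers count matrix.
    CountVectorizer(analyzer="char", ngram_range=(8, 8), lowercase=False),
    # Step 3: Multinomial Naive Bayes with default additive smoothing.
    MultinomialNB(),
)
model.fit(train_sequences, train_labels)

# Step 4: class probabilities and the most probable class for a new sequence.
new_sequence = "GGGGCCCCGGGGCCCC"
probabilities = model.predict_proba([new_sequence])[0]
prediction = model.predict([new_sequence])[0]

for label, probability in zip(model.classes_, probabilities):
    print(f"{label}: {probability:.4f}")
print("Predicted class:", prediction)
```

An optional TF-IDF weighting step could be inserted between the two pipeline stages with scikit-learn's `TfidfTransformer`; it is left out here to keep the sketch close to plain k-mer counts.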

Random Forest

The Random Forest (RF) model was selected for its robustness, efficiency, and suitability for structured biological data such as 16S rRNA sequences. RF is particularly effective in handling noisy features, class imbalances, and small to medium-sized datasets. In addition to its predictive performance, RF trains quickly and offers useful internal estimates such as error, strength, correlation, and variable importance (Baumann et al., 2013). To tune hyperparameters efficiently and limit the effect of class imbalance during tuning, we used random under-sampling to create balanced subsets (25 samples per class). The final Random Forest model was then trained on the full filtered dataset with the optimized hyperparameters, so that it could learn from all available data.
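
The tuning strategy described above (balanced under-sampling for the grid search, then a final fit on all data) might look roughly like the sketch below. The synthetic count matrix, the class names, and the parameter grid are all assumptions chosen to keep the example small; they are not the values used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

SAMPLES_PER_CLASS = 25  # balanced subset size used during tuning

# Synthetic, imbalanced stand-in for the k-mer count table (hypothetical).
rng = np.random.default_rng(0)
sizes = {"phylum_a": 120, "phylum_b": 60, "phylum_c": 30}
class_rates = {"phylum_a": 1.0, "phylum_b": 3.0, "phylum_c": 6.0}
y = pd.Series(np.repeat(list(sizes), list(sizes.values())), name="phylum")
X = pd.DataFrame(
    rng.poisson(lam=y.map(class_rates).to_numpy()[:, None], size=(len(y), 20)),
    columns=[f"kmer_{i}" for i in range(20)],
)

# Random under-sampling: draw the same number of rows from every class.
balanced_idx = y.groupby(y).sample(n=SAMPLES_PER_CLASS, random_state=0).index

# Grid search on the balanced subset only (illustrative grid).
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [None, 5]},
    cv=3,
)
search.fit(X.loc[balanced_idx], y.loc[balanced_idx])

# Final model: best hyperparameters, trained on the full dataset.
final_model = RandomForestClassifier(random_state=0, **search.best_params_)
final_model.fit(X, y)
print("Best parameters:", search.best_params_)
```

Tuning on a balanced subset keeps the grid search fast and prevents the largest classes from dominating the cross-validation score, while the final fit still sees every available sequence.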

How this model works

Bootstrapping (Training data selection):
- Random Forest uses bootstrapping to generate multiple datasets by randomly selecting subsets from the original training data with replacement.
Building multiple decision trees:
- For each subset, a decision tree is trained. It splits the data based on features to classify or predict values.
Random feature selection:
- At each split, a random subset of features is selected to reduce correlation and improve model robustness.
Prediction through majority vote:
- All trees vote. The majority wins (classification), or the average is taken (regression).
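
The mechanics listed above map onto scikit-learn's `RandomForestClassifier` as in the sketch below. The data come from `make_classification` and are purely synthetic, and the hyperparameter values are illustrative assumptions. Note that scikit-learn combines the trees by averaging their predicted class probabilities rather than by counting hard votes, which usually leads to the same class as a majority vote.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class dataset standing in for a k-mer feature table.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees
    bootstrap=True,       # each tree sees a bootstrap sample (with replacement)
    max_features="sqrt",  # random subset of features considered at each split
    oob_score=True,       # evaluate on the samples each tree did not see
    random_state=0,
)
forest.fit(X, y)

# Out-of-bag accuracy: an internal estimate that needs no separate test set.
print(f"Out-of-bag accuracy: {forest.oob_score_:.3f}")

# Indices of the five features with the highest importance scores.
top_features = forest.feature_importances_.argsort()[::-1][:5]
print("Most important feature indices:", top_features.tolist())

# Combined prediction of all trees for the first five samples.
predicted = forest.predict(X[:5])
print("Predicted classes:", predicted.tolist())
```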

Validation dataset

To construct the validation set, we obtained new sequences from the NCBI Nucleotide database. We first searched for 16S rRNA sequences within the Bacteria domain and selected 21 unique sequences that were not present in our original dataset. Using the same search criteria, we then selected 7 unique 16S rRNA sequences from the Archaea domain. To challenge the models further and assess how they handle input from outside the training classes, we included a so-called “wildcard” sequence: an 18S rRNA sequence from the kingdom Fungi, a group the models were not trained on. Together, these 29 sequences are used to evaluate the models' classification performance and robustness on unseen data across different taxonomic groups.

Organism	Accession	Name
Bacteria	AH002783.2	Beet leafhopper transmitted virescence phytoplasma 16S ribosomal RNA (16SrRNA) gene, complete sequence
Bacteria	AF049088.1	'Aporospora terricola' (nom. ined.) 18S ribosomal RNA gene, complete sequence
Bacteria	AB180383.1	Pseudoalteromonas sp. No. 47 gene for 16S ribosomal RNA, partial sequence
Bacteria	X79854.1	Streptomyces lincolnensis 16S rRNA gene, strain NRRL2936
Bacteria	X77468.1	A.methanolicus 16S rRNA gene
Bacteria	OR857363.1	Micrococcus yunnanensis strain Bact 16SrRNA B4 16S ribosomal RNA gene, partial sequence
Bacteria	ON406240.1	Rossellomorea aquimaris strain 2G9 16SrRNA 16S ribosomal RNA gene, partial sequence
Bacteria	OM078501.1	Microbacterium sp. strain QUMS 16SrRNA 16S ribosomal RNA gene, partial sequence
Bacteria	X57308.1	B.polymyxa 16SrRNA
Bacteria	E10216.1	16SrRNA gene of Lactobacillus brevis L63
Bacteria	E10214.1	16SrRNA gene of Lactobacillus sp. DA1
Bacteria	AJ601392.1	Flavobacterium xanthum 16SrRNA gene, strain R-9010
Bacteria	OQ976904.1	Pseudomonas sp. strain PE01-27-16SrRNA 16S ribosomal RNA gene, partial sequence
Bacteria	MW857479.1	Lactiplantibacillus plantarum strain AS.6 16S ribosomal RNA gene, partial sequence
Bacteria	MW582095.1	Bacillus subtilis strain PI 16S ribosomal RNA gene, partial sequence
Bacteria	MK918483.1	Escherichia coli strain ZSM93 16S ribosomal RNA gene, partial sequence
Bacteria	KP941769.1	Klebsiella pneumoniae strain JCM 1662 16S ribosomal RNA gene, partial sequence
Bacteria	KP941764.1	Shigella flexneri strain 29 str301 16S ribosomal RNA gene, partial sequence
Bacteria	KP941758.1	Klebsiella pneumoniae strain AT22 16S ribosomal RNA gene, partial sequence
Bacteria	AM911037.1	Uncultured Aeromonas sp. partial 16SrRNA gene, clone REC_91
Bacteria	E09456.1	DNA encoding Micobacterium tuberculosis 16SrRNA
Archaea	MK063890.1	Natrialba chahannaoensis strain M6 16S ribosomal RNA gene, partial sequence
Archaea	MW794195.1	Natrialba chahannaoensis strain GHMN55 16S ribosomal RNA gene, partial sequence
Archaea	OM302165.1	Natronolimnohabitans innermongolicus strain GHWN83 16S ribosomal RNA gene, partial sequence
Archaea	AY672462.1	Uncultured archaeon clone CH1_19_ARC_16SrRNA_9N_EPR 16S ribosomal RNA gene, partial sequence
Archaea	AY672465.1	Uncultured archaeon clone CH1_3_ARC_16SrRNA_9N_EPR 16S ribosomal RNA gene, partial sequence
Archaea	EU244166.1	Uncultured archaeon clone sl3113 16S ribosomal RNA gene, partial sequence
Archaea	AM911037.1	Uncultured Aeromonas sp.
Fungi	AF049088.1	'Aporospora terricola' (nom. ined.) 18S ribosomal RNA gene, complete sequence

When clicking on the 'Download validation dataset' button, a .zip file (10kB) will be downloaded containing the validation sequences (.FASTA) and the metadata file (.json).

File specifications

File name	validation_sequences_with_lineage.fasta
File format	Fasta (.fasta)
File size	34 kB
Number of entries	29
Author	Tom Ummenthun
Description	A curated reference dataset of 16S rRNA sequences with detailed taxonomic annotations in the header for metagenomic analysis.

File name	validation_sequences_with_lineage_metadata.json
File format	JSON (.json)
File size	2 kB
Author	Esmay Wissink
Description	A metadata file containing detailed information about the validation dataset.
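
A validation file in FASTA format can be read with a few lines of plain Python, as sketched below. The helper `read_fasta` and the example headers are hypothetical; the actual header layout of validation_sequences_with_lineage.fasta is not shown on this page, so the sketch keeps each header as an unparsed string.

```python
from io import StringIO


def read_fasta(handle):
    """Return a list of (header, sequence) tuples from a FASTA file handle."""
    records = []
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input with made-up headers; a real file would be opened with open(path).
example = StringIO(">record_1 example lineage\nACGTAC\nGTACGT\n>record_2 example lineage\nTTGGCC\n")
records = read_fasta(example)
print(len(records), "records read")
```

Counting the returned records is a quick check that the file contains the expected number of entries before it is passed to the models.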