16S rRNA |
16S ribosomal RNA (16S rRNA) is an RNA
component of the 30S subunit of a
prokaryotic ribosome (SSU rRNA). The 16S
rRNA molecule plays a crucial role in
protein synthesis by ensuring proper
ribosome alignment with the start codon
during translation (Jha et al., 2020). Due
to its evolutionary conservation, 16S rRNA
is widely used as a molecular marker for
microbial classification (Woese & Fox,
1977).
|
Accuracy (metric) |
Accuracy is the ratio between the number of
correctly classified samples and the
overall number of samples (Chicco & Jurman,
2023), showing how often a classification
ML model is correct overall (Evidently AI,
2025). Formula: Accuracy = (TP + TN) / (TP
+ TN + FP + FN). According to Hendricks
(n.d.), industry standards for accuracy are
between 70% and 90%.
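
As a minimal sketch of the formula above, the Python function below computes accuracy from the four confusion-matrix counts. The function name `accuracy` and the example counts are illustrative assumptions, not values from this project.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total


# Hypothetical counts: 40 TP, 45 TN, 5 FP and 10 FN (100 samples in total).
print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
```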
|
Algorithm |
A series of steps in a particular order,
i.e. "a process or set of rules to be
followed in calculations or other
problem-solving operations, especially by a
computer" (Training Arkansas Computing
Teachers (TACT), n.d.)
|
Archaea |
Unicellular prokaryotic microorganisms
that "were initially isolated from extreme
environments (e.g., high temperatures, low
pH, hypersalinity)" (Van Wolferen,
Pulschen, Baum, Gribaldo & Albers, 2022)
|
Class |
A taxonomic rank, i.e. a level of
classification. In the hierarchy of
biological classification, class sits
below phylum and above order.
|
Classification |
The process of organizing entities into
categories based on shared characteristics
or properties (Cambridge, 2025).
|
Confusion Matrix |
A table that helps visualize and evaluate
the performance of a classification model.
It shows the count of true positives (TP),
true negatives (TN), false positives (FP),
and false negatives (FN), allowing for a
detailed analysis of where a model's
predictions may be accurate or inaccurate
(Murel & Kavlakoglu, 2024).
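
The sketch below shows one way the four counts could be tallied for a binary task. The function name `confusion_counts`, the `positive` parameter, and the example labels are assumptions made for illustration only.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification task."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct if the actual label is positive.
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the actual label is positive.
            counts["FN" if actual == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "TN", "FP", "FN")}


# Hypothetical labels (1 = positive class, 0 = negative class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))
# {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```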
|
Data preparation |
“Data preparation is the umbrella term for
all the activities involved in getting your
data ready for analysis or use in a machine
learning model” (Chanaka, 2024)
|
Data preprocessing |
“Data preprocessing […] is a specific step
within data preparation that focuses on
cleaning and transforming the data
itself” (Chanaka, 2024)
|
Data science |
"The usage of scientific methods, algorithms
and/or systems to extract knowledge from
structured and unstructured data" (Van
Bentum, 2025)
|
F1-score (metric) |
The F1-score combines “precision and recall
into a single performance metric that gives
equal weight to both measures” (Frank, 2023).
Formula: F1 = (2 * TP) / (2 * TP + FP + FN),
which can also be written as: F1 = (2 *
precision * recall) / (precision + recall).
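
To illustrate that the two ways of writing the F1 formula agree, the sketch below computes the score both from raw counts and from precision and recall. All function names and the example counts are hypothetical, and zero denominators are deliberately not handled.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = (2 * TP) / (2 * TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


def f1_from_rates(p: float, r: float) -> float:
    """F1 = (2 * precision * recall) / (precision + recall)."""
    return 2 * p * r / (p + r)


# Hypothetical counts: 40 TP, 5 FP, 10 FN.
tp, fp, fn = 40, 5, 10
p, r = precision(tp, fp), recall(tp, fn)
print(round(f1_from_counts(tp, fp, fn), 4))  # 0.8421
print(round(f1_from_rates(p, r), 4))         # 0.8421
```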
|
Identification |
The process of recognizing and naming an
entity based on its characteristics or
features (Cambridge, 2025a).
|
JSON-file |
JavaScript Object Notation, a text-based
file format with extension .json.
|
K-mer |
A substring of length k of a biological
sequence.
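
A minimal sketch of k-mer extraction, assuming overlapping k-mers taken with a step of one position. The function name `kmers` and the example sequence are illustrative only.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return all overlapping substrings of length k, in order."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n contains n - k + 1 k-mers (none if k > n).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("ATGCGA", 3))  # ['ATG', 'TGC', 'GCG', 'CGA']
```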
|
K-mer analysis |
Analysis of a (set of) biological
sequence(s) based on the k-mers derived
from those sequences.
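
One common form of k-mer analysis is counting how often each k-mer occurs across a set of sequences. The sketch below assumes overlapping k-mers; the function name `kmer_counts` and the toy sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(sequences, k: int) -> Counter:
    """Count every overlapping k-mer across all given sequences."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts


# Toy input: two short sequences, counted as 2-mers.
profile = kmer_counts(["ATATAT", "ATGC"], k=2)
print(profile.most_common(2))  # [('AT', 4), ('TA', 2)]
```

Such counts could serve as frequency-based features for a classifier, although the exact feature encoding is a design choice not specified here.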
|
Machine learning |
"Machine learning is the study of
computer algorithms that can improve
automatically through experience and by the
use of data." (Van Bentum, 2025)
|
MCC-score (metric) |
The MCC-score is similar to the F1-score,
but unlike the F1-score, MCC considers the
ratio between positive and negative
elements, making MCC suitable for measuring
performance on imbalanced datasets (Chicco
& Jurman, 2023). Formula: MCC = (TP * TN –
FP * FN) / √((TP + FP) * (TP + FN) * (TN +
FP) * (TN + FN)). “An MCC score above 0.3 is
considered moderate, and a score above 0.5
is considered strong” (Activeloop, n.d.).
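
A minimal sketch of the MCC computation, using the denominator √((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)). The function name `mcc` and the example counts are hypothetical, and returning 0.0 for a zero denominator is a common convention assumed here, not something stated by the cited sources.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: report 0.0 when a row or column sum is zero.
        return 0.0
    return numerator / denominator


# Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN.
print(round(mcc(tp=40, tn=45, fp=5, fn=10), 4))
```

The result always lies between -1 (every prediction wrong) and 1 (every prediction correct).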
|
Metagenome |
The collection of all genetic material
present within an environment and/or
sample. Therefore, a metagenome typically
contains sequences from different species
and/or multiple individuals.
|
Microbiome |
"A characteristic microbial
community occupying a reasonable
well-defined habitat which has distinct
physio-chemical properties. The microbiome
not only refers to the microorganisms
involved but also encompass their theatre
of activity, which results in the formation
of specific ecological niches" (Berg et
al., 2020)
|
Microbiota |
"The assembly of microorganisms belonging
to different kingdoms (Prokaryotes
[Bacteria, Archaea], Eukaryotes [e.g.,
Protozoa, Fungi, and Algae])" (Berg et al.,
2020)
|
Multinomial Naive Bayes (MNB) |
A machine learning algorithm: Multinomial
Naïve Bayes (MNB) is a probabilistic
machine learning algorithm based on Bayes’
Theorem and is particularly well-suited for
discrete frequency-based data, such as word
counts in text classification
(GeeksforGeeks, 2025).
|
Phylum |
A taxonomic rank, i.e. a level of
classification. In the hierarchy of
biological classification, phylum sits
below kingdom and above class.
|
Precision (metric) |
Precision "measures how often a ML model
correctly predicts the positive class”
(Evidently AI, 2025). Formula: Precision =
TP / (TP + FP). Precision can be expressed
on a scale of 0 to 1 or as a percentage.
The closer the score is to 1.0 (100%), the
fewer false positives there are; a score of
exactly 1.0 means there are none.
|
Principal Component |
"2-dimensional representation of a dataset
that contains as much as possible of the
variation" (Van Bentum, 2025)
|
Principal Component Analysis (PCA) |
A linear dimensionality reduction technique
for reducing large amounts of data to a few
principal components.
|
Random Forest (RF) |
A machine learning algorithm: Random Forest
(RF) is an ensemble learning method that
builds multiple decision trees using
bootstrapped samples of the training data
(Baumann, Ehlers, Vogt & Rosenhahn, 2013)
|
Recall (metric) |
Recall measures which proportion of all
actual positives the model correctly
identifies. Formula: Recall = TP / (TP +
FN). Recall can be measured on a scale of 0
to 1 or as a percentage. Recall is also
known as the sensitivity or true positive
rate (Evidently AI, 2025).
|
ROC-curve |
A Receiver Operating Characteristic (ROC)
curve is a graphical representation used to
evaluate the performance of a binary
classification model. It plots the True
Positive Rate (TPR, sensitivity) against
the False Positive Rate (FPR). TPR=1.0 with
FPR=0.0 indicates a perfect model, while
TPR=FPR indicates that the model performs
no better than a coin flip (i.e. a random
classifier) (Van Bentum, 2025).
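
As a rough sketch, the points of a ROC curve can be obtained by sweeping a decision threshold over the model's scores and recording FPR and TPR at each step. The function name `roc_points`, the 0/1 label encoding, and the example scores are assumptions for illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct score threshold."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical labels (1 = positive) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
for fpr, tpr in roc_points(y_true, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting these pairs with FPR on the x-axis and TPR on the y-axis gives the curve; the curve always runs from (0, 0) to (1, 1).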
|
Superkingdom |
A taxonomic rank, i.e. a level of
classification. "Superkingdom" is also known
as "Kingdom". In the hierarchy of
biological classification, superkingdom
sits below domain and above phylum.
Superkingdom contains Archaea, Bacteria,
and Eukarya.
|
Supervised learning |
"Supervised learning is a machine learning
approach that is defined by its use of
labeled data sets" (Delua, 2021), which
means that "for each observation of the
predictor measurement(s) xi, i = 1, . . . ,
n there is an associated response
measurement yi" (James et al., 2023).
|
Sustainable Development Goal (SDG) |
A target defined by the United Nations in
2015, aimed at creating a more sustainable
and equitable future for all by 2030. In
total, there are 17 SDGs, with topics ranging
from addressing gender equality and poverty
to promoting climate action.
|
Taxonomy |
The science of classification (Cain, 2025).
In the context of this project, 'taxonomy'
refers to the biological classification of
living and extinct organisms into taxonomic
ranks.
|
Unsupervised learning |
Unsupervised learning is a machine learning
approach for analyzing and clustering
unlabeled data sets (Delua, 2021). "These
algorithms discover hidden patterns in data
without the need for human intervention
(hence, they are “unsupervised”)" (Delua,
2021). One should choose unsupervised
learning over supervised learning when one
can observe "a vector of measurements xi
but no associated response yi" (James et
al., 2023).
|