Date of Award
5-2025
Document Type
Thesis
Degree Name
Master of Science
Degree Discipline
Computer Science
Abstract
Unsupervised machine learning (ML) techniques are powerful tools for identifying similarity patterns and can be used to categorize data into related groups. This study showcases the application of an unsupervised method, hierarchical clustering, to determine related groups of newly identified genomic islands (GIs). Genomic islands are mobile genetic elements integrated into bacterial chromosomes. GIs can impact the evolution of bacteria, for example by carrying virulence or metabolic genes. Precisely identifying GIs calls for a sophisticated process, which we implemented in a previously published tool called TIGER (Target/Integrative Genetic Element Retriever). TIGER identifies mobile DNAs in each genome and identifies genomic islands with high accuracy. We employed TIGER to identify approximately 131,000 GIs in E. coli bacteria. FastANI is a bioinformatics tool used to estimate the average nucleotide identity (ANI) between two bacterial genomes. ANI is defined as the mean nucleotide identity of orthologous gene pairs shared between two microbial genomes. A 131,000 by 131,000 GI similarity matrix was obtained by comparing all pairs of GI sequences with FastANI, and the scores were stored in a sparse matrix, a format suited to large-scale datasets. To identify the similarities among the E. coli GIs, hierarchical clustering algorithms and our novel heuristic approaches successfully categorized the 131k GIs into relevant groups. The dendrogram, which is the output of hierarchical clustering, was created to display the closeness among the GIs and to identify related groups of GIs as well as singleton GIs. To obtain the ideal number of clusters, two initial grouping methods were implemented: quantity-customized clustering and cutline-based clustering.
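The workflow described above (sparse ANI scores, hierarchical clustering, then two ways of cutting the dendrogram) can be illustrated with a minimal sketch. It is not the thesis implementation. The five-GI toy data, the conversion of ANI to a distance as 1 - ANI/100, the treatment of unreported pairs as maximally distant, the choice of average linkage, and the function names `cut_by_count` and `cut_by_height` are all assumptions made for illustration. Reading "quantity-customized" as "cut to a requested number of clusters" and "cutline-based" as "cut at a dendrogram height" is likewise an interpretation of the abstract.

```python
from itertools import combinations

# Hypothetical sparse ANI scores (percent) for 5 GIs; pairs that are
# absent are assumed to have no reportable similarity.
ani = {(0, 1): 99.0, (0, 2): 97.0, (1, 2): 98.0, (3, 4): 95.0}
n = 5

# Assumption: distance = 1 - ANI/100, and missing pairs get distance 1.0.
dist = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
for (i, j), score in ani.items():
    dist[i][j] = dist[j][i] = 1.0 - score / 100.0


def average_distance(dist, a, b):
    """Average-linkage distance between two clusters of item indices."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))


def merge_steps(dist):
    """Yield (clusters, height of the next merge) for each dendrogram level.

    Naive agglomerative clustering for toy inputs only; it recomputes all
    pairwise cluster distances at every step.
    """
    clusters = [(i,) for i in range(len(dist))]
    while len(clusters) > 1:
        (x, y), height = min(
            (((a, b), average_distance(dist, a, b))
             for a, b in combinations(clusters, 2)),
            key=lambda pair_and_height: pair_and_height[1],
        )
        yield clusters, height
        merged = tuple(sorted(x + y))
        clusters = [c for c in clusters if c not in (x, y)] + [merged]
    yield clusters, float("inf")


def cut_by_count(dist, k):
    """Stop merging once at most k clusters remain."""
    for clusters, _ in merge_steps(dist):
        if len(clusters) <= k:
            return sorted(clusters)


def cut_by_height(dist, cutline):
    """Stop merging before the first merge that exceeds the cutline."""
    for clusters, next_height in merge_steps(dist):
        if next_height > cutline:
            return sorted(clusters)


by_height = cut_by_height(dist, 0.04)  # GIs 3 and 4 stay singletons
by_count = cut_by_count(dist, 2)       # GIs 3 and 4 are forced together
print(by_height, by_count)
```

At the scale reported in the abstract, a naive loop like this would not be practical; an optimized library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `fcluster` is the usual starting point, though the abstract does not state which implementation was used.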
Three cluster optimization approaches were used to improve the clustering algorithm: ANI score-based optimization, GI site type-based optimization, and dendrogram height-based optimization. Height-based optimization performed better and separated the meaningful clusters well. Purity, Dunn’s Index, and alignment ratios were used to further verify and narrow down the clusters. Our results provide a promising method for categorizing large DNA segments that can be compared using a similarity measure and grouped into more precise clusters for further analysis.
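Two of the validation measures named above, purity and Dunn's Index, have standard textbook definitions that can be sketched briefly. The example below is an illustration, not the thesis code. The distance matrix, the cluster assignment, and the reference labels are hypothetical; using a GI site type as the reference label for purity is an assumption; and Dunn's Index is computed here in its common form (smallest between-cluster distance divided by largest within-cluster diameter). Alignment ratios are not sketched because the abstract does not define them.

```python
from collections import Counter
from itertools import combinations

# Hypothetical symmetric distance matrix for 5 GIs.
dist = [
    [0.00, 0.01, 0.03, 1.00, 1.00],
    [0.01, 0.00, 0.02, 1.00, 1.00],
    [0.03, 0.02, 0.00, 1.00, 1.00],
    [1.00, 1.00, 1.00, 0.00, 0.05],
    [1.00, 1.00, 1.00, 0.05, 0.00],
]
clusters = [(0, 1, 2), (3, 4)]                      # hypothetical clustering
labels = ["tRNA", "tRNA", "tRNA", "protein", "tRNA"]  # hypothetical labels


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    matched = sum(
        Counter(labels[i] for i in cluster).most_common(1)[0][1]
        for cluster in clusters
    )
    return matched / sum(len(cluster) for cluster in clusters)


def dunn_index(clusters, dist):
    """Smallest between-cluster distance over largest cluster diameter.

    Requires at least two clusters. Returns infinity when every cluster
    is a singleton, because all diameters are then zero.
    """
    separation = min(
        dist[i][j]
        for a, b in combinations(clusters, 2)
        for i in a
        for j in b
    )
    diameter = max(
        (dist[i][j] for c in clusters for i, j in combinations(c, 2)),
        default=0.0,
    )
    return float("inf") if diameter == 0 else separation / diameter


p = purity(clusters, labels)
d = dunn_index(clusters, dist)
print(f"purity={p:.2f}, dunn={d:.1f}")
```

Higher values of both measures indicate a cleaner partition: purity approaches 1 when each cluster is dominated by one label, and Dunn's Index grows as clusters become compact and well separated.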
Index Terms – Bioinformatics, genomic islands, hierarchical clustering, high-performance computing, phylogenetic trees, unsupervised learning.
Committee Chair/Advisor
Noushin Ghaffari
Committee Member
Lin Li
Committee Member
Yonghui Wang
Committee Member
Md. Shuvo
Publisher
Prairie View A&M University
Rights
© 2021 Prairie View A & M University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Date of Digitization
6-20-2025
Contributing Institution
John B Coleman Library
City of Publication
Prairie View
MIME Type
Application/PDF
Recommended Citation
Zhou, L. (2025). Unsupervised Machine Learning Techniques To Categorize Genomic Islands. Retrieved from https://digitalcommons.pvamu.edu/pvamu-theses/1548