Date of Award

5-2025

Document Type

Thesis

Degree Name

Master of Science

Degree Discipline

Computer Science

Abstract

Unsupervised machine learning (ML) techniques are powerful tools for identifying similarity patterns and can be used to categorize data into related groups. This study applies one such method, hierarchical clustering, to determine related groups of newly identified genomic islands (GIs). Genomic islands are mobile genetic elements integrated into bacterial chromosomes. GIs can influence bacterial evolution, for example by carrying virulence or metabolic genes. Identifying GIs precisely calls for a sophisticated process, which we previously implemented in a published tool called TIGER (Target/Integrative Genetic Element Retriever). TIGER identifies mobile DNAs in each genome and identifies genomic islands with high accuracy. We employed TIGER to identify approximately 131,000 GIs in E. coli. FastANI is a bioinformatics tool used to estimate the average nucleotide identity (ANI) between two bacterial genomes. ANI is defined as the mean nucleotide identity of orthologous gene pairs shared between two microbial genomes. We obtained a 131,000 by 131,000 GI similarity matrix by running FastANI on the 131k by 131k pairs of GI sequences, and stored it as a sparse matrix, a format suited to large-scale datasets. To identify the similarities among the E. coli GIs, hierarchical clustering algorithms and our novel heuristic approaches categorized the 131k GIs into relevant groups. The dendrogram, which is the output of hierarchical clustering, was created to display the closeness among the GIs and to identify related groups of GIs as well as singleton GIs. To obtain the ideal number of clusters, two initial grouping methods were implemented: quantity-customized clustering and cutline-based clustering.
Three cluster optimization approaches were used to improve the clustering: ANI score-based optimization, GI site type-based optimization, and dendrogram height-based optimization. Height-based optimization performed better and separated the meaningful clusters well. Purity, Dunn's index, and alignment ratios were used to further verify and narrow down the clusters. Our results provide a promising method for comparing large DNA segments using a similarity measure and categorizing them into more precise clusters for further analysis.
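
The two initial grouping methods and the Dunn's index check named in the abstract can be illustrated on a toy example. The following is a minimal sketch using SciPy, not the thesis implementation: the six-GI ANI matrix, the choice of average linkage, the treatment of unreported ANI values as 0, the cutline of 0.10, and the helper name dunn_index are all assumptions made for illustration. It reads "quantity-customized clustering" as requesting a fixed number of clusters and "cutline-based clustering" as cutting the dendrogram at a fixed height.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy symmetric ANI matrix (percent identity) for six hypothetical GIs.
# Pairs without a reported ANI are set to 0 here (an assumption).
ani = np.array([
    [100,  98,  97,   0,   0,   0],
    [ 98, 100,  96,   0,   0,   0],
    [ 97,  96, 100,   0,   0,   0],
    [  0,   0,   0, 100,  95,   0],
    [  0,   0,   0,  95, 100,   0],
    [  0,   0,   0,   0,   0, 100],
], dtype=float)

# Convert similarity to distance: identical sequences are at distance 0.
dist = 1.0 - ani / 100.0
np.fill_diagonal(dist, 0.0)

# Build the dendrogram from the condensed (upper-triangle) distance vector.
# Average linkage is an assumed choice, not one stated in the abstract.
Z = linkage(squareform(dist), method="average")

# Grouping by quantity: ask for at most 3 clusters.
labels_by_count = fcluster(Z, t=3, criterion="maxclust")

# Grouping by cutline: cut the dendrogram at height 0.10 (about 90% ANI).
labels_by_cut = fcluster(Z, t=0.10, criterion="distance")


def dunn_index(dist, labels):
    """Smallest between-cluster distance divided by largest cluster diameter.

    Assumes at least two clusters and at least one non-singleton cluster.
    """
    clusters = [np.flatnonzero(labels == k) for k in np.unique(labels)]
    max_diameter = max(dist[np.ix_(c, c)].max() for c in clusters)
    min_separation = min(
        dist[np.ix_(a, b)].min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter


print("by count:", labels_by_count)
print("by cutline:", labels_by_cut)
print("Dunn's index:", dunn_index(dist, labels_by_cut))
```

On this toy matrix both grouping methods place the first three GIs in one cluster, the next two in a second, and leave the last GI as a singleton. A higher Dunn's index indicates clusters that are compact relative to their separation. At the scale of 131k GIs, a dense matrix like this one would not be practical, which is why the study stores the similarities in a sparse matrix.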

Index Terms – Bioinformatics, genomic islands, hierarchical clustering, high-performance computing, phylogenetic trees, unsupervised learning.

Committee Chair/Advisor

Noushin Ghaffari

Committee Member

Lin Li

Committee Member

Yonghui Wang

Committee Member

Md. Shuvo

Publisher

Prairie View A&M University

Rights

© 2021 Prairie View A & M University

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Date of Digitization

6-20-2025

Contributing Institution

John B Coleman Library

City of Publication

Prairie View

MIME Type

Application/PDF

