Date of Award
8-2025
Document Type
Thesis
Degree Name
Master of Science
Degree Discipline
Computer Science
Abstract
Transposons play a pivotal role in genome evolution and contribute to genome expansion, and the discovery of novel transposons offers valuable insights into genetic function and its implications for health and disease. This study focused on identifying novel transposons within a large-scale genomic dataset, using unsupervised machine learning approaches to uncover hidden patterns and detect elements that deviate from known transposon groups. To address the scale of the data and the resulting computational complexity, two complementary approaches were adopted: first, analyzing the entire dataset to capture the broad spectrum of diversity; second, applying a strategic reduction and filtering method to produce a manageable dataset that enabled efficient identification of novel elements.
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), a density-based clustering algorithm, was applied to both the full and filtered datasets because of its robust capacity to detect outliers. In our study, transposons that did not fit into any well-defined cluster were treated as outliers, and these outliers represent promising candidates for novel transposons. In addition to detecting novel transposons, the clustering patterns revealed meaningful phylogenetic relationships among transposon groups, shedding light on their evolutionary trajectories and biological interconnections. This integrated method significantly enhanced the detection of novel transposons, deepening understanding of their impact on genomic architecture and their potential roles in human health. Ultimately, these findings offer a more nuanced view of genome dynamics and expand the landscape of functional genomics research.
Index Terms: Bioinformatics, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) Clustering, high performance computing, transposons, unsupervised learning.
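The idea of treating unclustered points as candidates can be illustrated with a minimal sketch. It is not the thesis pipeline: it uses plain DBSCAN with a fixed radius, swapped in for HDBSCAN (which instead builds a hierarchy over varying densities), and the two-dimensional feature vectors, `eps`, and `min_samples` values are hypothetical, since the abstract does not state how sequences were encoded or how the algorithm was configured.

```python
from math import dist


def dbscan_labels(points, eps, min_samples):
    """Return a cluster id (0, 1, ...) per point, or -1 for noise."""
    n = len(points)
    # Neighborhoods include the point itself, as in the usual DBSCAN definition.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unassigned core point.
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            p = frontier.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not extend it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    frontier.append(q)
        cluster_id += 1
    return labels


# Hypothetical feature vectors: two dense groups standing in for known
# transposon families, plus one isolated element.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (2.5, 9.0),
]
labels = dbscan_labels(points, eps=0.5, min_samples=3)

# Points left with label -1 belong to no cluster; in the framing above,
# these are the elements flagged for follow-up as possible novel candidates.
candidates = [i for i, label in enumerate(labels) if label == -1]
print(labels)
print(candidates)
```

In this toy input the isolated point is the only one left unassigned. On real data, a noise label only marks an element for further inspection and does not by itself establish novelty.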
Committee Chair/Advisor
Noushin Ghaffari
Committee Member
Lin Li
Committee Member
Sherri S. Frizell
Committee Member
Md. Shuvo
Publisher
Prairie View A&M University
Rights
© 2021 Prairie View A&M University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Date of Digitization
10/22/2025
Contributing Institution
J. B. Coleman Library
City of Publication
Prairie View
MIME Type
Application/PDF
Recommended Citation
Nazara, R. (2025). From Diversity To Novelty: Unsupervised Learning For Novel Transposon Identification And Evolutionary Mapping For Large-Scale Genomic Datasets. Retrieved from https://digitalcommons.pvamu.edu/pvamu-theses/1625