Date of Award


Document Type


Degree Name

Master of Science

Degree Discipline

Electrical Engineering


Mining clinical notes for relevant information has attracted considerable interest in natural language processing (NLP). Medical documents contain language whose distribution differs from that of the general domain, and their vocabulary evolves over time. Recently, attention-based deep learning language models have become the state of the art in language modeling. They capture strong representations of language with respect to its context and improve on classic clinical NLP tasks such as medication detection and medication classification.

This thesis addresses the Harvard Medical School 2022 National NLP Clinical Challenges (n2c2), for which the Contextualized Medication Event Dataset (CMED) was provided. CMED is a dataset of unstructured Electronic Health Records (EHRs) together with annotated notes that contain task-relevant information about the EHRs. The goal of the challenge is to develop effective, data-driven solutions for extracting contextual information related to medications from EHRs. In this thesis, variations of Google's attention-based BERT architecture are applied to the challenge, namely BERT Base, BioBERT, and two variations of Bio+Clinical BERT, which are pre-trained on general-domain, biomedical-domain, and clinical-domain corpora, respectively. They are used to perform named entity recognition (NER) for medication extraction and medical event detection. Pre-processing methods have been developed to break down EHRs for compatibility with the BERT model on the NER task, and the BERT variations are fine-tuned on CMED for the n2c2 task. Performance is analyzed with a script that constructs medical terms from the evaluation portion of CMED and reports recall, precision, and F1-score. The results demonstrate that Bio+Clinical BERT outperforms BERT Base and BioBERT, as well as three of the top ten performers in the challenge.
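As a rough illustration of the evaluation metrics named above, the following minimal sketch scores predicted entities against gold entities using strict exact-match spans built from BIO tags. It is not the thesis's evaluation script: the BIO tagging scheme, the `Drug` label, and the function names `bio_to_spans` and `precision_recall_f1` are all assumptions made for this example.

```python
from typing import List, Set, Tuple

# (start token index, end token index exclusive, label)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed tagging scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # An "I-" tag with no open span of the same label is ignored.
    return spans


def precision_recall_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Strict exact-match scoring: a prediction counts only if span and label agree."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: "Drug" is a placeholder label, not a CMED label.
gold_tags = ["O", "B-Drug", "I-Drug", "O", "B-Drug"]
pred_tags = ["O", "B-Drug", "I-Drug", "O", "O"]
p, r, f1 = precision_recall_f1(bio_to_spans(gold_tags), bio_to_spans(pred_tags))
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case the model finds one of the two gold entities and predicts nothing spurious, so precision is perfect while recall is halved. Lenient (overlap-based) matching would be an alternative design choice.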

Index terms: Bidirectional encoder representations from transformers, electronic health records, natural language processing, transformer

Committee Chair/Advisor

Lijun Qian

Committee Member

Xishuang Dong

Committee Member

Xiangfang Li

Committee Member

Richard Wilkins


Prairie View A&M University


© 2021 Prairie View A&M University

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Date of Digitization


Contributing Institution

John B Coleman Library

City of Publication

Prairie View




