CIND820 - Big Data Analytics Project

TMU CIND820 Capstone Project - Parliament of Canada Hansard Debate Records (2006 - 2023)

This project explores what information can be gained by applying text mining and topic modeling approaches to the historical Hansard debate records from the Parliament of Canada. The dataset covers April 2006 to December 2023 (from the 39-1 Parliament to part of 44-1), a period that spans several election cycles and two Prime Ministers. The dataset includes 1972 PDF files in total, which collectively represent 155,385 pages and 128,933,818 words.

The initial results of the comparative analysis between the LDA and HDP models show that, while HDP offers greater flexibility in processing a corpus of text (including not requiring the total number of topics to be defined upfront), the LDA model continues to produce individual topics that can be easily interpreted as distinct. However, the HDP model identified a greater diversity of keywords that covered a higher proportion of documents, which could allow for a more nuanced assessment of changes in topic keywords across different timescales, whether parliamentary sessions or calendar years. In terms of coherence values:

For the sample size of 25, the HDP model produced the higher result (0.410 for LDA and 0.458 for HDP)
For the sample size of 385, the LDA model produced the higher result (0.701 for LDA and 0.315 for HDP)
For the full dataset of 1972 documents, the LDA model produced the higher result (0.628 for LDA and 0.319 for HDP)

With respect to the BERTopic model, processing limitations and capacity issues were initially encountered due to compatibility problems with the computer being used. Running the script on a virtual machine with cloud computing resources proved to be a successful solution to the high computational demands of the BERTopic model. While BERTopic was able to identify 7 topics within the corpus, its ability to process documents without preprocessing turned out to be a disadvantage in the context of Hansard debate records, which contain jargon and language related to protocol and parliamentary etiquette. The outcomes of the BERTopic model are shared below.

Background: Literature Review

The literature review of past research highlighted the importance of using a systematic approach to text mining techniques, the benefits of leveraging a combined topic model from LDA and TF-IDF, as well as considerations for the evaluation of topic models and the validation of the labels for the extracted topics. The literature review also helped to identify comparable alternatives to LDA, including HDP and the more recent BERTopic. While the literature review identified possible advantages for these alternative topic models, they have the potential to be more computationally intensive, and the performance of the different models will be explored in the final report.

Copies of research articles that are referenced in the final paper can be found in /Reports/Literature_Review_Articles/

Original Data Source: https://www.ourcommons.ca/documentviewer/en/37-1/house/hansard-index

Sample of DataFrame created from the full dataset:

Relevant Source Code: Exploratory Analysis and Final Results

The relevant files in this repository that explore the LDA, HDP and BERTopic models are referenced below. The results are divided between "Initial Results", which assessed a representative sample of documents (s=385), and "Final Results", for analysis conducted on the full dataset:

Initial_Results_CreateSample: Documents how the two sample sizes were identified and how files were randomly selected from the full dataset.
Initial_Results_HDP_LDA_Models: Initial exploration of the algorithms for the LDA and HDP topic models using a sample size of 25 files. The objective was to test the general performance of the models in a time-efficient manner.
Initial_Results_HDP_LDA_Models_Sample385: The same code run on a statistically representative sample of 385 files out of the full 1972 PDFs in the dataset.
BERTopic_Model: A Jupyter notebook that was run on a virtual machine, because the initial assessment of the BERTopic model was limited by processing capacity. Some setup steps (e.g., pip install commands) are included in the notebook because the Python script was run on the virtual machine.
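
The README does not state how the sample size of 385 was derived. One common route to that figure is Cochran's formula at 95% confidence with a 5% margin of error, so the sketch below assumes that derivation. The function names, the seed and the placeholder file names are hypothetical and are not taken from Initial_Results_CreateSample.

```python
import math
import random


def cochran_sample_size(z=1.96, p=0.5, margin=0.05):
    """Cochran's sample size for a large population (no finite-population correction)."""
    return math.ceil((z ** 2) * p * (1 - p) / margin ** 2)


def draw_sample(file_names, size, seed=42):
    """Randomly select `size` files without replacement, reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(file_names, size))


# Placeholder names standing in for the 1972 Hansard PDFs (assumption).
all_files = [f"hansard_{i:04d}.pdf" for i in range(1972)]

n = cochran_sample_size()               # 385 under the assumptions above
sample_385 = draw_sample(all_files, n)
sample_25 = draw_sample(all_files, 25)  # small sample for quick test runs
print(n, len(sample_385), len(sample_25))
```

Fixing the seed keeps the selection reproducible, so the same files can be reused across the LDA and HDP runs.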

Additionally, the source data for initial results can be found in the following folders:

Dataset_Sample: 25 file sample that was leveraged by the LDA and HDP models
Dataset_Sample_385: the larger sample of 385 files that was leveraged by the LDA and HDP models
Datasource: the full set of documents collected from the Hansard archives.

Preprocessing - Determining Total Topics for LDA Model:

Leveraging the outcomes from the literature review and the exploratory analysis of the dataset, it was determined that the ideal number of topics for the LDA model was 7. This value was carried forward into the analysis of the representative sample (s=385) and the full dataset (s=1972). The number of topics to use in the LDA model was determined through an assessment of coherence across a range of topic numbers (2-10 and 2-40).
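
A coherence sweep of this kind can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: it assumes gensim is used, that the coherence measure is c_v, and that `texts` is a list of token lists produced by preprocessing. The helper names `sweep_topic_counts` and `make_lda_scorer`, and the `passes` and `seed` defaults, are hypothetical.

```python
def sweep_topic_counts(score_fn, k_values):
    """Score each candidate topic count and return (best_k, all_scores)."""
    scores = {k: score_fn(k) for k in k_values}
    best_k = max(scores, key=scores.get)
    return best_k, scores


def make_lda_scorer(texts, passes=5, seed=0):
    """Return a function that trains LDA with k topics and reports c_v coherence.

    Requires gensim; it is imported lazily so the sweep helper works without it.
    """
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]

    def score(k):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       passes=passes, random_state=seed)
        cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                            coherence="c_v")
        return cm.get_coherence()

    return score
```

With tokenized documents in hand, a call such as `sweep_topic_counts(make_lda_scorer(texts), range(2, 11))` would cover the 2-10 range. The highest score is only a starting point: interpretability of the resulting topics also matters when settling on a value.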

Model Performance - Coherence Values

Coherence values of LDA model

Coherence values of HDP model

The initial assessment of model performance produced the following overall coherence values:

LDA had an overall coherence value of 0.410 (s=25) and 0.701 (s=385) for 7 topics
HDP had an overall coherence value of 0.458 for 19 topics (s=25) and 0.315 for 39 topics (s=385)

Assessment of Topic Keywords

Reviewing the outputs for each identified topic highlighted differences in how well LDA and HDP identify clearly distinct topics. For example, on first review of the LDA topics, at least three of the 7 topics for s=385 were distinct, which was further confirmed when isolating representative text.

LDA Topic Keywords - Representative Text (s=385)

By comparison, the HDP model produced more total topics, but many keywords were common across topics, which makes it hard for a reader to interpret the differences between individual topics. However, collectively the identified keywords were found across a higher percentage of total documents in the dataset.

HDP Topic Keywords - Representative Text (s=385)

Further supporting the representative text table above, plotting document word count against the total number of documents for each topic highlighted the same topic numbers for each model. For example, the two representative topics #0 and #5 in the LDA model for s=25 and topics #0, #2 and #4 are shown in the graphs below to contain the highest number of total documents.

LDA Model Outputs (s=385):

HDP Model Outputs (s=385):

Further assessment showed that the LDA model produced topics that were more clearly defined than those of the HDP model, as demonstrated when comparing the two Intertopic Distance Maps (seen below). The results imply that the HDP model is characterized by high dimensionality, weak topic differentiation and similar topic content.

LDA Intertopic Distance Map & Top Relevant Keywords Per Topic (Sample 385)

HDP Intertopic Distance Map & Top Relevant Keywords Per Topic (Sample 385)

BERTopic Model Outputs (s=385)

The BERTopic model was trained on the dataset without preprocessing and leveraged many of the default parameters. However, due to the nature of the Hansard debate records, the identified topics were dominated by the protocol jargon and parliamentary etiquette terms that were removed during the preprocessing stage for the LDA and HDP models. Further evaluation of the BERTopic model may require some degree of preprocessing prior to running the model.
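
One light form of preprocessing would be to strip procedural vocabulary from each document before it reaches the model. The sketch below is an illustration only: the `PROCEDURAL_TERMS` set is a hypothetical example, since the stop-word list used for the LDA and HDP models is not given in this README, and the function name is likewise an assumption.

```python
import re

# Hypothetical examples of procedural vocabulary; the project's actual
# stop-word list is not given in this README.
PROCEDURAL_TERMS = {"speaker", "hon", "member", "madam", "mr", "house", "motion"}

# Matches lowercase words, optionally with an internal apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def strip_procedural_terms(text, extra_stopwords=PROCEDURAL_TERMS):
    """Lowercase the text, tokenize it, and drop procedural terms."""
    tokens = TOKEN_RE.findall(text.lower())
    return " ".join(t for t in tokens if t not in extra_stopwords)


example = "Mr. Speaker, the hon. member asked about housing affordability."
print(strip_procedural_terms(example))
```

The cleaned strings could then be passed to the model in place of the raw text. Because lowercasing and removing punctuation also changes what the embedding step sees, it would be worth comparing the topics against a run on the raw text.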

Analysis Constraints & Limitations:

The computing and processing resources required to model the Hansard debate records proved challenging, especially when running the BERTopic model. To complete the training of the BERTopic model, a virtual machine was used in order to access higher processing capacity (CPU, RAM and GPU resources).

Even with access to the virtual machine, the long processing times required to preprocess text, train the models and analyze coherence values limited the extent of the analysis conducted. Incorporating bigrams and trigrams into the corpus used to train the LDA and HDP models is one example of an additional step that future research could take to improve model performance.
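
To illustrate the bigram idea, the sketch below merges adjacent token pairs that occur often enough across the corpus. It is a simplified, count-based stand-in written with the standard library, not the project's method; phrase-detection utilities in topic modeling libraries typically use a scoring threshold rather than raw counts. The function name, the `min_count` default and the toy documents are all assumptions.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times into one token."""
    pair_counts = Counter(
        pair for doc in docs for pair in zip(doc, doc[1:])
    )
    frequent = {pair for pair, c in pair_counts.items() if c >= min_count}

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(f"{doc[i]}_{doc[i + 1]}")
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Toy tokenized documents (illustrative only, not Hansard data).
docs = [
    ["climate", "change", "policy"],
    ["climate", "change", "target"],
    ["health", "care", "funding"],
]
bigram_docs = merge_frequent_bigrams(docs)
print(bigram_docs[0])
```

Running the function a second time on its own output would merge frequent bigram-plus-word pairs, which is one way to reach trigrams.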
