The 117th Congress: Assessing Congressional Committee Referrals Using NLP Embeddings and Unsupervised Clustering of Legislative Texts
Note
K-Means Clustering UMAP: High-dimensional NLP embeddings of 11,154 House bills introduced in the 117th Congress, labeled by the semantic theme of each cluster
Having served as a congressional staffer in the U.S. House of Representatives, I have firsthand experience with the complexities and procedural bottlenecks of the legislative process. For instance, during the 117th Congress (which was in session from January 2021 to January 2023), over 11,460 bills and resolutions were introduced across 330 legislative days—averaging about 34 new pieces of legislation per day.
Each bill requires referral to one or more of the 21 standing or select committees, a task that is both labor-intensive and time-sensitive. This volume of legislation can place a significant burden on the House Speaker, House Parliamentarian, and their respective staff to assign bills appropriately, relying not only on intricate knowledge of committee jurisdictions, but also factoring in legislative and political nuances.
Recognizing the potential for advances in Transformer-based Neural Networks to assist in the streamlining of such processes—especially for less critical legislation and procedures—I was motivated to explore whether natural language processing (NLP) techniques and unsupervised learning algorithms could inform how we might automate, assist, or assess the committee referral process.
Specifically, I aimed to investigate if clustering bill texts—based on semantic similarities identified within their embedded representations—could reveal thematic structures aligning with committee jurisdictions, thereby serving as a foundation for an automated referral system.
Tip
- Scripts: 15 different Python scripts were written for each part of the process, all of which are linked below or can be found in the repo here. Each Python script is annotated with my comments explaining certain aspects of the process in more detail.
- Data: All data, including the LegiScan `.json` files, original `.pdf` files, cleaned `.txt` files, and `.npy` embedding files for all 11,154 House bills and 21 committees I analyzed, are in the repo and available for reproduction or further analysis.
- Results: The heat-map and UMAP plots, the `.npy` array for the K-Means centroids, individual distance and silhouette metrics for each bill, and a curated sample of bills from each cluster can all be found here.
Following the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, I structured the project into the following phases:
The primary objective was to determine if NLP models and clustering algorithms could effectively predict House committee assignments, potentially easing the workload of the House Parliamentarian and expediting the legislative process. My hypothesis was that by clustering bills based on their semantic content, we might uncover patterns that mirror the existing committee referral system.
Scripts Used: `bill_pull.py`, `bill_titles.py`
To work with a complete dataset, I focused on the recently concluded 117th Congress rather than the current 118th. I sourced comprehensive legislative data from LegiScan, which provided detailed metadata for each bill, including bill numbers, titles, sponsors, initial committee assignments, and links to the full texts on Congress.gov. I limited the analysis to House legislation to avoid the complexities of the Senate's different committee structure. (I'm also partial to the House, having worked only for Members of that chamber.)
- Metadata Parsing: Using `bill_pull.py` and `bill_titles.py`, I extracted essential metadata and ensured completeness (a minimal sketch of this metadata pull appears below). After parsing and cleaning, I had a dataset of 11,154 bills—about 97% of all bills introduced during that Congress.
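For illustration only, here is a minimal sketch of what such a metadata pull might look like, assuming LegiScan's bulk-download JSON layout; the field names, paths, and filtering logic are placeholders rather than the actual contents of `bill_pull.py`:

```python
import csv
import json
from pathlib import Path

# Hypothetical sketch: walk a folder of LegiScan bulk-download JSON files and
# collect the metadata fields needed downstream. Field and path names are
# assumptions and should be checked against the actual files.
rows = []
for path in Path("legiscan_json").glob("*.json"):
    with open(path, encoding="utf-8") as f:
        bill = json.load(f)["bill"]
    if not bill.get("bill_number", "").startswith("HB"):  # keep House bills only
        continue
    committee = bill.get("committee") or {}
    rows.append({
        "bill_number": bill["bill_number"],
        "title": bill.get("title", ""),
        "committee": committee.get("name", "") if isinstance(committee, dict) else "",
        "text_url": bill["texts"][-1]["state_link"] if bill.get("texts") else "",
    })

with open("house_bills_master.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["bill_number", "title", "committee", "text_url"])
    writer.writeheader()
    writer.writerows(rows)
```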
Scripts Used: `bill_download.py`, `bill_convert_TXT.py`, `data_check.py`, `bill_cleaner_part1.py`, `bill_cleaner_part2.py`, `bill_counter.py`, `committee_counter.py`, `embedding_committee_rules.py`, `embedding_bills_under_8100.py`, `embedding_bills_over_8100.py`
- Data Extraction and Cleaning:
  - PDF Downloading and Conversion: With `bill_download.py` and `bill_convert_TXT.py`, I data-mined all bill PDFs from Congress.gov (over 2 GB of PDF data) and converted them into plain text files.
  - Integrity Check: I regularly employed `data_check.py` to ensure that each bill in my master CSV file had corresponding PDF and text files, flagging any discrepancies.
- Text Cleaning:
  - Removing Extraneous Information: I used `bill_cleaner_part1.py` and `bill_cleaner_part2.py` to strip out unnecessary metadata, such as sponsor names and prior committee assignments, so that this information would not bias the embeddings and, in turn, the clustering and later prediction steps.
  - Standardizing Formats: I ensured consistency across all bill texts so the embeddings would focus on the legislative content, using manual and programmatic review of a sample of the cleaned text files and adjusting both `bill_cleaner_part1.py` and `bill_cleaner_part2.py` until I was satisfied that all final text files had parity for the embedding process.
- Committee Representation:
  - Compiling Descriptions: I extracted textual descriptions of each of the 21 House committees from the House Rules for the 117th Congress, which outline their responsibilities, powers, and areas of jurisdiction. These 21 text documents were augmented with the detailed descriptions the committees publish about themselves on their official websites.
  - Creating Representative Texts: These descriptions served as representative documents for each committee, which could be compared with the legislative texts in a shared embedding space.
- Model Selection and Pre-Embedding Preparation:
  - NLP Model: OpenAI's `text-embedding-3-large` was selected to embed all texts, due to its large token window, my familiarity with the OpenAI API, and the favorable cost-to-quality ratio of OpenAI models. Embedding all legislation and committee texts cost less than $7 total in API credit. Additionally, OpenAI embeddings come pre-normalized to unit norm, allowing me to skip a separate step normalizing the vectors to account for magnitude.
  - Token Limitations: Because `text-embedding-3-large` has a maximum input of 8,190 tokens, I used `bill_counter.py` and `committee_counter.py` to identify texts exceeding this limit. No committee texts were above the limit, while around 600 legislative texts were.
- Generating Embeddings:
  - Bills: I employed `embedding_bills_under_8100.py` for bills within the token limit.
  - Chunking Long Texts: For bills over the limit, `embedding_bills_over_8100.py` split them into manageable chunks, embedded each chunk, and averaged the embeddings to represent the entire bill (a rough sketch of this chunk-and-average approach follows this list).
  - Committees: I used `embedding_committee_rules.py` to create embeddings for each of the 21 committees' textual representations.
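To make the chunk-and-average step concrete, here is a minimal sketch assuming the OpenAI Python SDK and `tiktoken`; the 8,100-token budget echoes the script names above, but the helper itself is illustrative and not a copy of `embedding_bills_over_8100.py`:

```python
import numpy as np
import tiktoken
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "text-embedding-3-large"
MAX_TOKENS = 8100  # stay safely under the model's input limit

def embed_bill(text: str) -> np.ndarray:
    """Embed a bill, chunking and mean-pooling if it exceeds the token budget."""
    enc = tiktoken.encoding_for_model(MODEL)
    tokens = enc.encode(text)
    # Split the token stream into windows of at most MAX_TOKENS tokens each.
    chunks = [enc.decode(tokens[i:i + MAX_TOKENS])
              for i in range(0, len(tokens), MAX_TOKENS)] or [""]
    response = client.embeddings.create(model=MODEL, input=chunks)
    vectors = np.array([item.embedding for item in response.data])
    mean_vec = vectors.mean(axis=0)
    # The model returns unit-norm vectors, but the mean of several unit
    # vectors is not unit-norm, so re-normalize after pooling.
    return mean_vec / np.linalg.norm(mean_vec)

# Example usage (file name is hypothetical):
# np.save("embeddings/HB1234.npy", embed_bill(bill_text))
```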
Script Used: `kmeans_predict_rules.py`
- Clustering Approach: I implemented K-Means clustering with k=21, matching the number of House committees, hypothesizing that a cluster may form representing each Committee's area of jurisdiction.
- Hyper-parameters: I tested multiple random seeds and increased the number of initializations for each to as many as 5,000 to ensure convergence and reproducibility. All seeds converged to approximately the same centroids within 1,000 initializations.
- Visual Graphs: I relied on heat-map plots illustrating the number of bills per committee that fell into each cluster, as well as 2D UMAP projections of the clustering, colored by either committee labels or cluster labels.
- Assignment Prediction: I mapped committee embeddings to clusters within the heat-map graphic to assess how well each committee's predicted cluster aligned with the clusters containing its bills (a minimal sketch of this clustering-and-prediction step appears below).
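A minimal sketch of this clustering-and-prediction step with scikit-learn, under the assumption that the bill and committee embeddings have been stacked into `.npy` arrays; the file names are illustrative and not necessarily those used by `kmeans_predict_rules.py`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative file names; X stacks one embedding row per bill,
# committees stacks one row per committee description.
X = np.load("bill_embeddings.npy")                 # e.g. shape (11154, 3072)
committees = np.load("committee_embeddings.npy")   # e.g. shape (21, 3072)

# One cluster per standing/select committee, many initializations for stability.
kmeans = KMeans(n_clusters=21, n_init=1000, random_state=42)
bill_clusters = kmeans.fit_predict(X)

# Map each committee description to its nearest centroid so the committee's
# "predicted" cluster can be compared against where its bills actually landed.
committee_clusters = kmeans.predict(committees)
np.save("kmeans_centroids.npy", kmeans.cluster_centers_)
```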
Scripts Used: `explore_clusters.py`, `ClusterGraphic.py`
- Cluster Analysis:
  - After identifying widespread anomalies in the graphical results, I used `explore_clusters.py` to extract and examine representative bills in each cluster (by proximity to the centroid), identifying common themes within each cluster (a sketch of this centroid-proximity exploration follows this list).
  - Many clusters turned out to correspond not to specific committee jurisdictions but to themes that were sometimes more abstract or more specific. These theme labels were documented and incorporated into further scripting.
- Visualization:
  - Finally, I generated the final heat-maps and UMAP projections with `ClusterGraphic.py` to visualize the relationships between clusters, bills, committees, and the cluster themes.
  - The final heat-maps illustrated how many bills from each committee fell into each cluster, revealing patterns and anomalies that made sense only in light of a deeper textual examination.
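For reference, a hedged sketch of the cluster-exploration step described above: pulling the bills nearest each centroid and building the committee-by-cluster table behind the heat-map. The column names and `.npy`/`.csv` file names are assumptions, not the exact contents of `explore_clusters.py` or `ClusterGraphic.py`:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Assumed artifacts from the earlier steps (names are illustrative).
X = np.load("bill_embeddings.npy")
centroids = np.load("kmeans_centroids.npy")
df = pd.read_csv("house_bills_master.csv")        # must align row-for-row with X
df["cluster"] = np.load("bill_cluster_labels.npy")

# Distance from every bill to every centroid.
dist = pairwise_distances(X, centroids)

# The ten most central bills in each cluster, used to read off its theme.
for k in range(centroids.shape[0]):
    members = np.flatnonzero(df["cluster"].to_numpy() == k)
    nearest = members[np.argsort(dist[members, k])[:10]]
    print(f"Cluster {k}:", df.iloc[nearest]["bill_number"].tolist())

# Committee x cluster contingency table that underlies the heat-map.
heat = pd.crosstab(df["committee"], df["cluster"])
```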
The clustering yielded some insightful, albeit unexpected, results:
- Partial Alignment: Certain clusters closely aligned with specific committees.
  - Cluster 3 was dominated by bills related to veterans' services and benefits, aligning with the Veterans' Affairs and Armed Services committees. Both the bills and the committee embeddings fell into this cluster.
  - Cluster 12 centered on foreign policy and national security, matching the Foreign Affairs and Intelligence committees.
- Discrepancies:
  - In many cases, committee embeddings did not fall into the clusters containing the majority of their bills. For example, Energy and Commerce had its highest number of bills in Cluster 8, yet its committee embedding fell into Cluster 11.
  - Some clusters were thematic but spanned multiple committees, indicating overlapping jurisdictions.
Note
Heatmap: Committee Predictions are Noted by Cells Outlined in Blue
Below is a table identifying the major themes of each cluster. To see examples of specific bills that fell into each cluster and helped shape these themes, go here.
Cluster Number | Identified Theme | Committees with Strongest Bill Showings | Committees with Embeddings in Cluster |
---|---|---|---|
0 | Prohibition of Federal Funding for COVID-19 Mandates and Limitation of Government Actions | Homeland Security, Judiciary, Oversight and Reform | None |
1 | Education and Workforce Development for Underserved Communities | Education and Labor | None |
2 | Congressional Oversight and Requests for Executive Documentation | Rules, Judiciary, Oversight and Reform | Rules, Budget, Ethics, Oversight and Reform, Judiciary, Admin |
3 | Enhancement of Veterans’ Services and Benefits | Veterans' Affairs, Armed Services | Veterans' Affairs, Armed Services |
4 | Environmental Conservation and Renewable Energy Initiatives | Natural Resources, Transportation, Energy & Commerce | None |
5 | Healthcare Access Improvement and Medicare Policy Reforms | Energy and Commerce, Ways and Means | None |
6 | Recognition and Commemoration of Military and Service Personnel | Armed Services, Financial Services | None |
7 | Government Ethics, Accountability, and Regulatory Reforms | Judiciary, Energy and Commerce, Financial Services | None |
8 | Healthcare Access and Public Health Reforms | Energy and Commerce | None |
9 | Designation of National Awareness Weeks and Observances | Education and Labor, Energy and Commerce, Oversight | None |
10 | Tax Policy Amendments and Economic Incentives | Ways and Means | Ways and Means |
11 | Infrastructure Development and Environmental Management | Natural Resources, Transportation, Energy & Commerce, Agriculture | Science, Space & Tech, Energy & Commerce, Transportation, Natural Resources |
12 | Foreign Policy, Sanctions, and National Security | Foreign Affairs, Intelligence | Foreign Affairs, Intelligence |
13 | Law Enforcement, Immigration, and Border Security Policies | Judiciary, Homeland Security | Homeland Security |
14 | Tax Policy Amendments, Economic Incentives, and Infrastructure Investment | Ways and Means | None |
15 | Comprehensive Infrastructure, Climate Action, Health, Housing, and Democracy Protection | Energy and Commerce, Financial Services, Natural Resources | None |
16 | Gun Control and Firearms Regulation | Judiciary | None |
17 | Education, Small Business Support, Social Programs, and Regulatory Reforms | Education and Labor, Financial Services, Small Business | Education and Labor, Financial Services, Small Business |
18 | Resolutions and Recognitions for Social Issues, Human Rights, and Commendations | Foreign Affairs, Oversight | None |
19 | Law Enforcement Recognition, National Security, and Social Resolutions | Foreign Affairs, Judiciary, Oversight | None |
20 | Designation of Federal and Postal Facilities and Memorials | Oversight and Reform | None |
Note
How Each Cluster Relates to Other Clusters
The project presented several challenges:
- Data Acquisition and Processing:
  - Text Availability: Obtaining plain text versions of all bills was more challenging than I anticipated, requiring the download and conversion of over 2 GB of PDF files.
  - Data Cleaning: Hidden metadata from the PDF files that carried over into the text files, along with somewhat inconsistent formatting between different bill types, necessitated extensive cleaning to ensure accurate embeddings.
- Token Limitations:
  - Embedding Constraints: Handling bills exceeding the 8,190-token limit of the embedding model required strategic chunking, embedding, and averaging. This required a standalone script and added some complexity to the process.
- Clustering Complexities:
  - Semantic Overlap: The high dimensionality and semantic overlap of legislative texts made it difficult for clusters to neatly correspond to committee jurisdictions, contrary to my hypothesis.
  - Discrepancies: Committee embeddings sometimes did not align with the clusters containing the majority of their bills, suggesting that the textual descriptions I compiled may not capture the full scope of certain committees' activities and areas of jurisdiction.
- Unexpected Results:
  - Thematic Broadness: Some clusters encompassed broad themes spanning multiple committees, reflecting the interconnected nature of legislative issues.
  - Overlapping Jurisdictions: The results highlighted the complexity of congressional committee jurisdictions, where subject matters often intersect and legislation often passes through multiple committees beyond the one to which it is initially referred.
  - Deeper Textual Analysis: The visual results in the heat-map and UMAP plots initially caused more confusion than clarity. Only after a deeper analysis of representative texts, and after labeling the plots with the identified themes, did the visual results begin to make sense.
While the clustering approach revealed interesting and coherent thematic groupings and underlying structures within the legislative texts, it did not consistently predict or correlate with the committee assignments of those texts. This suggests that although NLP and unsupervised learning hold potential for augmenting the legislative process, further refinement is necessary.
Future Directions:
- Supervised Learning Integration: Training classifier or LLM models on existing committee assignments to enhance prediction capabilities.
- Expanded Data Features: Expanding the texts used for committee embeddings, and exploring ways to include metadata such as bill sponsors, political context, and historical referral patterns within the legislative text representations.
- Alternative Clustering Methods: Exploring algorithms like DBSCAN or hierarchical clustering to capture more complex relationships.
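As a starting point for that last idea, a hedged sketch of how hierarchical clustering or DBSCAN could be run on the same embeddings with scikit-learn; the parameter values are untuned placeholders, not recommendations:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN

X = np.load("bill_embeddings.npy")  # same unit-norm bill embeddings as before

# Ward-linkage hierarchical clustering, keeping k=21 for comparability with K-Means.
hier_labels = AgglomerativeClustering(n_clusters=21).fit_predict(X)

# Density-based clustering with cosine distance; eps and min_samples would need
# tuning on this corpus and may leave many bills labeled as noise (-1).
db_labels = DBSCAN(eps=0.35, min_samples=10, metric="cosine").fit_predict(X)
```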
Note
UMAP of Bills labeled by Committee Assignments
Note
UMAP of Bills labeled by Cluster Theme