Parliament Scraper

Web scraper for scraping Hansard Reports from the parliament website. Still a work progress but I'll look at implementing a robust scraper for scraping parliament notes and neatly formatting the output in JSON. Already there is a helpful structure to the report, providing us with question and answer pairs, debates and general talk. This can then be used for NLP projects to obtain pre/semi-labelled data. As it's parliament reports we can use party alignment to visualise and validate different aspects of natual language in a very controlled talking space with clean data.

Usage

Install conda environment with conda env create -f environment.yml, and then activate it with conda activate parliament_scraper.

OUTDATED Call the scraper with: python scraper.py <DATE1> <optional: DATE2>

Date format is YYYYMMDD, a possible 2 dates are passed, check the URL of the report you are looking to scrape, they are formatted as: https://www.parliament.nz/en/pb/hansard-debates/rhr/combined/HansD_<DATE1>_<DATE2>. If DATE1 and DATE2 are the same, only one date needs to be passed to the scraper.

Output

The scraper will write to a file called debate_DATE1_DATE2.json in the current directory.

Check example_out.json for an example of the output format.

State

~~Needs work, can ony scrape 20210708 currently~~
~~Needs to be more robust to tag structure~~
Update: Scraper code working more robust to tag structure, see notebooks/parser.ipynb, scraper.py needs to be updated with changes
BillDebate -> Debate -> Subdebate structure doesn't nest when parent is BillDebate
Code needs better formatting and structure
- parliament module and submodules need better naming and documentation
All report links can be crawled, see notebooks/crawler.ipynb
Vote parsing needs to be properly re-implemented
Needs to be more robust to tag pairs being broken up (i.e. SubsQuestion and SubsAnswer)
- need to setdefault on expected keys + log the error

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
notebooks		notebooks
parliament		parliament
scripts		scripts
.gitignore		.gitignore
README.md		README.md
environment.yml		environment.yml
example_out.json		example_out.json
model_env.yml		model_env.yml
scraper.py		scraper.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Parliament Scraper

Usage

Output

State

About

Releases

Packages

Languages

mjdall/parliament-scraper

Folders and files

Latest commit

History

Repository files navigation

Parliament Scraper

Usage

Output

State

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages