Merge pull request #76 from openstates/query-scraper-output-data

Add doc on querying scraper output using DuckDB
openstates · Sep 26, 2024 · 431b513 · 431b513
2 parents 88cbdae + e7ae8dc
commit 431b513
Show file tree

Hide file tree

Showing 7 changed files with 743 additions and 308 deletions.
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -11,13 +11,15 @@ jobs:
   build:
     steps:
       # Python & dependency installation
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v4
         with:
           submodules: true
       - name: setup Python
         uses: actions/setup-python@v2
+        with:
+          python-version: '3.9'
       - name: install Poetry
-        uses: snok/install-poetry@v1.2.1
+        uses: snok/install-poetry@v1.4.1
       - name: set poetry config path
         run: poetry config virtualenvs.path ~/.virtualenvs
       - name: install dependencies

diff --git a/.python-version b/.python-version
@@ -0,0 +1 @@
+3.9
diff --git a/docs/contributing/scrapers.md b/docs/contributing/scrapers.md
@@ -149,6 +149,10 @@ directory and inspect the JSON files. If you're trying to make a small
 fix this is often sufficient, you can confirm that the scraped data
 looks correct and move on.
 
+Please see our document on [Querying Scraper Output Data](../data/query-scraper-output-data.md)
+for tools you can use to investigate data issues across a set of many
+scraped data output files.
+
 
 !!! note
     It is of course possible that the scrape fails. If so, there's a good

diff --git a/docs/data/query-scraper-output-data.md b/docs/data/query-scraper-output-data.md
@@ -0,0 +1,233 @@
+# How to query and examine data file output created by scrapers
+
+This document is aimed to help with scraper development and data debugging.
+
+Open States scrapers typically produce a set of output data files in the JSON format. Each file represents a primary
+entity that the scraper has produced, such as a Bill, an Event or a Vote Event. When working on scraper code, it can
+be really helpful to examine this output for data quality issues.
+
+## Simple examination
+
+The JSON format is relatively easy to view in any text editor or Integrated Development Environment (IDE). This makes
+it easy to examine a specific output file.
+
+And you can take it a step further by using a tool like [jq](https://jqlang.github.io/jq/) to query the JSON document
+so that you only see the attributes that are relevant to you. For example:
+
+```shell
+cat vote_event_ff4ae3ef-656f-11ef-8b4c-732c295f582c.json | jq .counts
+```
+
+## Querying a full set of scraper output files
+
+But what if you want to check data quality issues that might be distributed across the output, which may consist of
+thousands of data points? You can put together shell scripts/commands that iterate through and find certian files,
+then parse them with `jq`. But this can be tricky, especially if you are not familiar with advanced shell scripting.
+
+The good news is that, if you are familiar with SQL, there is a way to treat a set of scraper output files like a
+database and use SQL to query that database. A tool called [DuckDB](https://duckdb.org/) makes this possible.
+
+Let's assume I've run one or more scrapers for Delaware. I open my terminal and look at the scraper output directory,
+`_data/de`. There are over 3000 JSON files in that folder, so it would be painful to spot check lots of files:
+
+```
+jesse@work:~/repo/openstates/openstates-scrapers/scrapers/_data/de$ ls -alh | grep ".json" | wc -l
+3760
+```
+
+However, installing the DuckDB tool allows me to use it inside that folder to query the Bills data:
+
+```
+jesse@work:~/repo/openstates/openstates-scrapers/scrapers/_data/de$ duckdb
+v0.10.2 1601d94f94
+Enter ".help" for usage hints.
+Connected to a transient in-memory database.
+Use ".open FILENAME" to reopen on a persistent database.
+D SELECT * FROM read_json('bill_*.json');
+```
+
+Running `duckdb` in this way opens a "transient" database, meaning that anything you do will be gone once you close
+the `duckdb` terminal (by using the `ctrl+d` keystroke). If you want to save anything you create in your session,
+instead open it with a file that will save those Views/Tables: `duckdb scraper_output.db`
+
+You can see the data schema that DuckDB has identified from the JSON files by using `DESCRIBE`:
+
+```
+D DESCRIBE SELECT * FROM read_json('bill_*.json');
+┌─────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
+│     column_name     │                                         column_type                                          │  null   │   key   │ default │  extra  │
+│       varchar       │                                           varchar                                            │ varchar │ varchar │ varchar │ varchar │
+├─────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
+│ legislative_session │ VARCHAR                                                                                      │ YES     │         │         │         │
+│ identifier          │ VARCHAR                                                                                      │ YES     │         │         │         │
+│ title               │ VARCHAR                                                                                      │ YES     │         │         │         │
+│ from_organization   │ VARCHAR                                                                                      │ YES     │         │         │         │
+│ classification      │ VARCHAR[]                                                                                    │ YES     │         │         │         │
+│ subject             │ JSON[]                                                                                       │ YES     │         │         │         │
+│ abstracts           │ STRUCT(note VARCHAR, abstract VARCHAR)[]                                                     │ YES     │         │         │         │
+│ other_titles        │ STRUCT(note VARCHAR, title VARCHAR)[]                                                        │ YES     │         │         │         │
+│ other_identifiers   │ JSON[]                                                                                       │ YES     │         │         │         │
+│ actions             │ STRUCT(description VARCHAR, date DATE, organization_id VARCHAR, classification VARCHAR[], …  │ YES     │         │         │         │
+│ sponsorships        │ STRUCT("name" VARCHAR, classification VARCHAR, entity_type VARCHAR, "primary" BOOLEAN, per…  │ YES     │         │         │         │
+│ related_bills       │ JSON[]                                                                                       │ YES     │         │         │         │
+│ versions            │ STRUCT(note VARCHAR, links STRUCT(url VARCHAR, media_type VARCHAR)[], date VARCHAR, classi…  │ YES     │         │         │         │
+│ documents           │ STRUCT(note VARCHAR, links STRUCT(url VARCHAR, media_type VARCHAR)[], date VARCHAR, classi…  │ YES     │         │         │         │
+│ citations           │ STRUCT("publication" VARCHAR, citation VARCHAR, citation_type VARCHAR, effective JSON, exp…  │ YES     │         │         │         │
+│ sources             │ STRUCT(url VARCHAR, note VARCHAR)[]                                                          │ YES     │         │         │         │
+│ extras              │ JSON                                                                                         │ YES     │         │         │         │
+│ _id                 │ UUID                                                                                         │ YES     │         │         │         │
+├─────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┤
+│ 18 rows                                                                                                                                          6 columns │
+└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
+```
+
+You'll notice that related entities like Bill Actions, Bill Sponsorships, etc... are embedded in the data as `STRUCT`
+data: `actions`, `sponsorships`, etc.. This is typically structured as a list of `STRUCT` data, which is like a dict
+or object (`STRUCT(property_a VARCHAR, property_b VARCHAR)[]`). We can drill down into those.
+
+### Use views to drill into the data
+
+To make this data structure simpler to reason about, we can use Views. Here's a set of recommended views you can try.
+Please note that these views are used in many of the suggested queries below. In order to execute a query that uses
+a view, you need to create the view first!
+
+```sql
+-- Bills
+CREATE VIEW bills AS
+SELECT *
+FROM read_json('bill_*.json');
+CREATE VIEW bill_actions AS
+SELECT legislative_session, identifier AS bill_identifier, _id AS bill_id, unnest(actions, recursive := true)
+FROM bills;
+CREATE VIEW bill_action_classifications AS
+SELECT legislative_session, bill_identifier, bill_id, unnest(classification) AS classification
+FROM bill_actions;
+CREATE VIEW bill_sponsorships AS
+SELECT legislative_session, identifier AS bill_identifier, _id AS bill_id, unnest(sponsorships, recursive := true)
+FROM bills;
+CREATE VIEW bill_versions AS
+SELECT legislative_session, identifier AS bill_identifier, _id AS bill_id, unnest(versions, recursive := true)
+FROM bills;
+CREATE VIEW bill_version_links AS
+SELECT legislative_session, bill_identifier, bill_id, unnest(links, recursive := true)
+FROM bill_versions;
+
+-- Vote Events
+CREATE VIEW vote_events AS
+SELECT *
+FROM read_json('vote_event_*.json');
+CREATE VIEW vote_event_votes AS
+SELECT legislative_session,
+       bill_identifier,
+       identifier AS vote_identifier,
+       _id        AS vote_id,
+       unnest(votes, recursive := true)
+FROM vote_events;
+CREATE VIEW vote_event_counts AS
+SELECT legislative_session,
+       bill_identifier,
+       identifier AS vote_identifier,
+       _id        AS vote_id,
+       unnest(counts, recursive := true)
+FROM vote_events;
+```
+
+To view Views in your DuckDB database: `SHOW TABLES;`
+
+### Useful queries
+
+These queries will assume you've created the Views listed above! Execute the Views queries above before trying these!
+
+General utility: describe the shape of a given dataset:
+
+```sql
+DESCRIBE SELECT * FROM bills;
+DESCRIBE SELECT * FROM vote_events;
+```
+
+### Bills
+
+#### Bills that are missing a property that is a list (abstracts, actions, sponsorships, etc.)
+
+```sql
+SELECT (len(abstracts) > 0) AS has_abstract, COUNT(*)
+FROM bills
+GROUP BY 1;
+SELECT (len(sponsorships) > 0) AS has_sponsorship, COUNT(*)
+FROM bills
+GROUP BY 1;
+SELECT (len(related_bills) > 0) AS has_related_bills, COUNT(*)
+FROM bills
+GROUP BY 1;
+```
+
+#### Distribution of bill classifications
+
+```sql
+SELECT classification, COUNT(*)
+FROM bills
+GROUP BY classification;
+```
+
+#### Distribution of bill action classifications
+
+```sql
+SELECT classification, COUNT(*)
+FROM bill_action_classifications
+GROUP BY 1;
+```
+
+#### Bill actions that are unclassified
+
+```sql
+SELECT (classification = []) AS action_is_unclassified, COUNT(*)
+FROM bill_actions
+GROUP BY 1;
+```
+
+#### Bill version media type distribution
+
+```sql
+SELECT media_type, COUNT(*)
+FROM bill_version_links
+GROUP BY 1;
+```
+
+### Vote Events
+
+#### Vote Events over time (by month)
+
+```sql
+SELECT date_trunc('month', start_date), COUNT(*)
+FROM vote_events
+GROUP BY 1,
+ORDER BY 1;
+```
+
+#### Vote events with zero or unexpected number of votes
+
+```sql
+SELECT len(votes) AS num_votes, COUNT(*) AS events_with_this_number
+FROM vote_events
+GROUP BY 1
+ORDER BY 1 DESC;
+```
+
+or check via count values for vote events where all counts are zero
+
+```sql
+SELECT vote_id, SUM(value)
+FROM vote_event_counts
+GROUP BY 1
+HAVING SUM(value) = 0;
+```
+
+#### Vote event counts check distribution for anomalies
+
+```sql
+SELECT
+option, value, COUNT (*)
+FROM vote_event_counts
+GROUP BY 1, 2
+ORDER BY 1, 2 DESC;
+```
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -76,6 +76,7 @@ nav:
     - 'contributing/documentation.md'
   - Data:
     - 'data/index.md'
+    - 'data/query-scraper-output-data.md'
     - 'data/categorization.md'
     - 'data/session-naming.md'
   - Enhancement Proposals: