notes.txt

CVR (Cast Vote Records) Notes:

Documentation: https://www.sfelections.org/results/20200303/data/DSuite_CVRExportFormat.pdf

BallotTypeContestManifest.json - mapping between BallotTypeId and ContestId. Seems to be many-to-many. Basically which contests were present on which ballots

BallotTypeManifest.json - lists types of ballots for this election

CandidateManifest.json - lists all candidates, and specify which contest they belong to

Configuration.json - config options selected used when generating these files. mostly not relevant

ContestManifest.json - lists all contests, notes how many votes are allowed for each contest, and number of ranks

CountingGroupManifest.json - lists counting groups (Election Day vs VBM)

DistrictManifest.json - not documented! Seems to have several different granularities of districts, ex, state-wide, congressional district, and more

DistrictPrecinctPortionManifest.json - mapping between precincts and districts (plural!) it belongs to

DistrictTypeManifest.json - lists granularities of districts, referenced by DistrictManifest.json

ElectionEventManifest.json - seems to describe the election itself

OutstackConditionManifest.json - describes various error conditions for ballots (overvote, undervote, etc)

PartyManifest.json - lists parties

PrecinctManifest.json - lists all precincts

PrecinctPortionManifest.json - seems very similar to PrecinctManifest.json, not sure the difference

TabulatorManifest.json - lists each voting machine

CvrExport_#.json - actual results from ballots! most data is in the Sessions list.

  TabulatorId - which voting machine counted this ballot
  BatchId, RecordId - just ids for this vote
  CountingGroupId - which counting group this ballot belongs to
  Original/Modified: actual votes, use IsCurrent to determine which is most up-to-date
    PrecinctPortionId - precinct ballot was cast at
    BallotTypeId
    IsCurrent
    Cards - list of cards scanned for this ballot
      OutstackConditionIds - errors during processing this ballot?
      PaperIndex - ballot card #?
      Contests - list of each contest on scanned page.
        Id - contest id
        Overvotes, Undervotes, OutstackConditionIds - errors during processing this contest?
        Marks - votes cast in this contest
          CandidateId - which candidate was voted for
          PartyId - party of candidate
          Rank - only relevant for ranked choice contests, rank of this mark
          WriteinIndex - only present if vote is for write-in candidate, not totally sure what this is
          MarkDensity, WriteinDensity - how filled in bubble was!
          IsAmbiguous - if mark is unclear
          IsVote - true if mark was counted as a vote

Question: if a ballot spans multiple cards, are they grouped together in a single CVR record, or can they be split across multiple records? If they are split, can you associate those cards with each other? Do adjacent record ids matter?


Data model:

Candidate table:
id, name, contest_id

Contest table:
id, name, num_votes, num_ranks

Ballot type table:
id, name

Ballot type contest assoc table:
ballot_type_id, contest_id

Counting group table:
id, name

Party table:
id, name

Precinct portion table:
id, name

District table:
id, name, district_type_id

District type table:
id, name

Precinct portion district assoc table:
precinct_portion_id, district_id

Vote
id, tabulator_id, batch_id, record_id, precinct_portion_id, ballot_type_id, contest_id, candidate_id, party_id, rank

Can we have one row per-ballot page instead of one row per-vote? Would go from 4.7m rows to 574k.
Aggregated vote:
id, tabulator_id, batch_id, record_id, precinct_portion_id, ballot_type_id, votes:
[ { candidate_id, rank } ]

Queries:

SELECT candidate.id as candidate_id, candidate.name as candidate_name, candidate.contest_id as contest_id, contest.name as contest_name, COUNT(vote.id) as votes
FROM candidate
JOIN contest ON contest.id = candidate.contest_id
JOIN vote ON vote.candidate_id = candidate.id
GROUP BY candidate.id, contest.id;

~4s

SELECT candidate.id as candidate_id, candidate.name as candidate_name, contest.id as contest_id, contest.name as contest_name, a.votes as votes
FROM (
  SELECT (jsonb_array_elements_text(ballot.votes)::jsonb)->'candidate_id' as candidate_id, COUNT(*) as votes
  FROM ballot
  GROUP BY (jsonb_array_elements_text(ballot.votes)::jsonb)->'candidate_id'
) AS a
JOIN candidate ON candidate.id = a.candidate_id::integer
JOIN contest ON candidate.contest_id = contest.id
GROUP BY candidate.id, contest.id, a.votes;

~5s

  WHERE ballot.votes @> '[{"candidate_id":18}'
+ gin index on votes

582ms

WHERE EXISTS (SELECT id FROM vote as vote_s WHERE vote.tabulator_id = vote_s.tabulator_id AND vote.batch_id = vote_s.batch_id AND vote.record_id = vote_s.record_id AND vote_s.candidate_id = 18)
+ no index

~4s

best approach: unagg but only query votes table (with index on candidate_id) and then do separate query for candidate + contest data