This repository contains Singapore legal data obtained from various public sources and converted into a machine-readable format, including the following:
- Court hearings: /data/hearings.json
- Senior Counsels: /data/sc.json
- PDPC undertakings: /data/pdpc-undertakings.json
- PDPC decisions: /data/pdpc-decisions.json
- LSS DT reports: /data/lss-dt-reports.json
- State Court judgments: /data/stc-judgments.json
- Family Court and Juvenile Court judgments: /data/fc-judgments.json
- Telecommunications FBO licences: /data/telco-fbo.json
You can view and query the data using this Datasette instance.
The code and configuration files in this repository are licensed under the EUPL-1.2 as set out in the LICENCE file.
This repository is not affiliated with the Singapore Academy of Law, Singapore Courts, Law Society, or any other organisation, and is provided for educational purposes only.
This is a high-level overview of the architecture of this project:
flowchart LR
subgraph pipeline["Data Pipeline"]
subgraph /input/ scripts
Website-->data["/data/ (JSON files)"]
end
subgraph build_script["build_db.bb script"]
data-->sqlite["SQLite DB (/data/data.db)"]
end
end
subgraph backend["Backend"]
Datasette
end
build_script-->backend
subgraph frontend["Frontend"]
html["HTML templates"]
cljs["CLJS scripts"]
end
frontend-- served by -->backend
In the data pipeline, everything is just a script (aka a microservice™). Although most of the scripts are Babashka scripts written in Clojure, new scripts can be in any language.
The data is obtained periodically via scheduled GitHub Actions workflows and committed to this repository. Each workflow runs one of the input scripts in the /input folder. Each input script stores the data it obtains in a JSON file in the /data folder. Each JSON file is a snapshot in time: it contains only the data obtained in the last run of its script, not all data ever obtained by that script.
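As the paragraph above says, new input scripts can be in any language. The following is a minimal Python sketch of the snapshot behaviour only, not one of the actual input scripts. The function name `write_snapshot` and the sample records are hypothetical; a real input script would obtain its records from a website.

```python
import json
import tempfile
from pathlib import Path


def write_snapshot(records, path):
    """Overwrite `path` with the records from the latest run only."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Stable key order and indentation keep the diffs between commits readable.
    text = json.dumps(records, indent=2, sort_keys=True, ensure_ascii=False)
    path.write_text(text + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    # Hypothetical records written to a temporary folder for demonstration.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "data" / "example.json"
        write_snapshot([{"id": "example-1", "title": "First run"}], target)
        # A second run replaces the file; nothing from the first run is kept.
        write_snapshot([{"id": "example-2", "title": "Second run"}], target)
        print(target.read_text(encoding="utf-8"))
```

Because each run overwrites the file, the history of the data lives in the git commits rather than in the JSON file itself.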
The /.github/workflows/deploy.yml workflow runs the /scripts/build_db.bb script, which uses the git-history tool to create a SQLite database from the historical data across all the commits in this repository. The script then builds a Datasette Docker image and deploys it via Fly.io.
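To illustrate the idea of turning a series of per-commit snapshots into one SQLite table, here is a simplified Python sketch. It is not how git-history works internally and does not reproduce its schema; the function `load_snapshots`, the column names, and the sample records are all assumptions made for illustration.

```python
import json
import sqlite3


def load_snapshots(conn, table, snapshots, id_key="id"):
    """Store one row per item version seen across chronological snapshots.

    `snapshots` is an iterable of (commit_hash, records) pairs, oldest first.
    """
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" '
        "(item_id TEXT, commit_hash TEXT, payload TEXT)"
    )
    latest = {}  # item_id -> most recent payload already stored
    for commit_hash, records in snapshots:
        for record in records:
            item_id = str(record[id_key])
            payload = json.dumps(record, sort_keys=True)
            if latest.get(item_id) == payload:
                continue  # unchanged since the previous snapshot
            latest[item_id] = payload
            conn.execute(
                f'INSERT INTO "{table}" VALUES (?, ?, ?)',
                (item_id, commit_hash, payload),
            )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical history: two commits, each holding a full snapshot.
    history = [
        ("commit-1", [{"id": 1, "status": "scheduled"}]),
        ("commit-2", [{"id": 1, "status": "vacated"}, {"id": 2, "status": "scheduled"}]),
    ]
    load_snapshots(conn, "hearings", history)
    for row in conn.execute('SELECT * FROM "hearings"'):
        print(row)
```

Storing a row only when an item is new or has changed keeps the table small even though every commit contains a full snapshot.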
Some of the scripts in the /scripts folder run Python tools. This project uses Poetry to manage its Python dependencies, so install Poetry and the dependencies before running those scripts.
See /app/README.md for frontend development.
This project uses devenv to quickly and conveniently set up a reproducible development environment. devenv is particularly useful here because this project contains code written in various languages and has a variety of dependencies to be installed.
After installing devenv, enter its shell. This should automatically set up the environment and install all the dependencies:
devenv shell
The JSON data files across the various commits to the git repository should then be aggregated into a SQLite database for ease of analysis. To create and populate the SQLite database:
devenv shell build-db
This may take some time (possibly more than an hour) as there have been many commits to this repository. The build_db.bb script also does some processing on the data, e.g. it creates and populates certain columns derived from the raw data for ease of use (see e.g. /scripts/computed_columns.bb). Alternatively, you can download a copy of the database from lacunadb.huey.xyz.
You can run the same command above to update the SQLite database as necessary (e.g. after pulling subsequent commits).
Once you have the SQLite database, you can analyse it by running Datasette locally:
devenv shell dev-datasette
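If you would rather inspect the database from a script than through Datasette, Python's built-in sqlite3 module is enough for a first look. This sketch lists every table with its row count. The path data/data.db is taken from the architecture diagram and may differ on your machine; the function name `table_row_counts` is hypothetical, and no table names are assumed.

```python
import os
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in the database."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


if __name__ == "__main__":
    db = "data/data.db"  # assumed location; adjust to where your database lives
    # Check first: sqlite3.connect would otherwise create an empty file.
    if os.path.exists(db):
        for name, count in table_row_counts(db).items():
            print(f"{name}: {count}")
    else:
        print(f"{db} not found; build or download the database first")
```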
Make sure you have Babashka, Python, and Poetry installed.
Install the Poetry dependencies by running poetry install --no-root.
This project uses various CLI utilities, which you will need to install to run the input scripts:
pdftotext is used to extract text from PDFs. It is bundled with poppler.
On Ubuntu/Debian:
sudo apt install poppler-utils
On macOS, you can install it using Homebrew:
brew install poppler
ocrmypdf is used to run OCR on PDFs. It is already a Poetry dependency, but it requires tesseract and ghostscript to be installed.
On Ubuntu/Debian:
sudo apt install tesseract-ocr ghostscript
On macOS:
brew install tesseract ghostscript
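Before running the input scripts, it can help to confirm that the required command-line tools are on your PATH. The sketch below is one way to check. The executable names are assumptions about how the packages above install themselves (bb for Babashka, gs for Ghostscript), and the function name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Assumed executable names: poppler provides pdftotext,
    # tesseract-ocr provides tesseract, and Ghostscript provides gs.
    required = ["bb", "poetry", "pdftotext", "tesseract", "gs"]
    missing = missing_tools(required)
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```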
After cloning this repository and following the setup steps above, you can generate the SQLite database on your machine by running the /scripts/build_db.bb script:
bb --main scripts.build-db
If you do not have SQLite installed, you will need to install it.
On Ubuntu/Debian:
sudo apt install sqlite3
On macOS:
brew install sqlite3
You can use the /scripts/dev_docker.bb script:
bb ./scripts/dev_docker.bb
It may be helpful to refer to the Docker images or the devenv.nix configuration file for a better idea of how the project functions and how to run certain scripts.