The aim of this project is simple: create a basic Python library to explore and interact with open data catalogues.
This will improve and speed up how users:
- Navigate open data catalogues
- Find the data that they need
- Get that data into a format and/or location for further analysis
Install it with pip or Poetry:
pip install HerdingCats
or
poetry add HerdingCats
Note: Herding-CATs is currently under active development. Features may change as the project evolves.
Because organisations set up and deploy their open data catalogues in slightly different ways, some methods may not work for every catalogue.
We will do our best to ensure that most methods work across all catalogues and that a good variety of data catalogues is present.
Herding-CATs supports the following catalogues by default:
Catalogue Name | Website | Catalogue Backend |
---|---|---|
London Datastore | data.london.gov.uk | CKAN |
Subak Data Catalogue | data.subak.org | CKAN |
UK Gov Open Data | data.gov.uk | CKAN |
Humanitarian Data Exchange | data.humdata.org | CKAN |
UK Power Networks | ukpowernetworks.opendatasoft.com | Open Datasoft |
Infrabel | opendata.infrabel.be | Open Datasoft |
Paris | opendata.paris.fr | Open Datasoft |
Toulouse | data.toulouse-metropole.fr | Open Datasoft |
Elia Belgian Energy | opendata.elia.be | Open Datasoft |
EDF Energy | opendata.edf.fr | Open Datasoft |
Cadent Gas | cadentgas.opendatasoft.com | Open Datasoft |
French Gov Open Data | data.gouv.fr | CKAN |

The following catalogues are under investigation and are not yet fully supported:

Catalogue Name | Website | API Endpoint | Status |
---|---|---|---|
Bristol Open Data | opendata.bristol.gov.uk | TBC | Need to figure out catalogue backend |
Icebreaker One | ib1.org | TBC | Authentication with API key required |
Data Mill North | datamillnorth.org | TBC | Different implementation - may not work with all methods |
Canada Open Data | open.canada.ca | TBC | Different implementation needs investigation |
This Python library provides a way to explore and interact with CKAN and OpenDataSoft data catalogues. It includes four main classes:
- `CkanCatExplorer`: For exploring CKAN-based data catalogues
- `OpenDataSoftCatExplorer`: For exploring OpenDataSoft-based data catalogues
- `CkanCatResourceLoader`: For loading and transforming CKAN catalogue data
- `OpenDataSoftResourceLoader`: For loading and transforming OpenDataSoft catalogue data
All explorer classes work with a `CatSession` object that handles the connection to the chosen data catalogue.
import HerdingCats as hc

def main():
    # Open a session against the London Datastore and attach a CKAN explorer to it
    with hc.CatSession(hc.CkanDataCatalogues.LONDON_DATA_STORE) as session:
        explore = hc.CkanCatExplorer(session)

if __name__ == "__main__":
    main()
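The `with` statement above relies on Python's context-manager protocol, which guarantees clean-up when the block exits. The following is a minimal sketch of that pattern only; `DemoSession`, its attributes, and the example URL are hypothetical and are not taken from the HerdingCats source.

```python
class DemoSession:
    """Hypothetical stand-in for a catalogue session (not the HerdingCats class)."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url
        self.is_open = False

    def __enter__(self) -> "DemoSession":
        # A real session would typically set up an HTTP client here.
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs even if the body raised; returning None lets exceptions propagate.
        self.is_open = False


with DemoSession("https://catalogue.example.org") as session:
    state_inside = session.is_open

state_after = session.is_open
print(state_inside, state_after)
```

Keeping all explorer calls inside the `with` block means they run while the session is still open.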
- `check_site_health()`: Checks the health of the CKAN site
- `get_package_count()`: Returns the total number of packages in a catalogue
- `package_list_dictionary()`: Returns a dictionary of all available packages
- `package_list_dataframe(df_type: Literal["pandas", "polars"])`: Returns a dataframe of all available packages
- `package_list_dictionary_extra()`: Returns a dictionary with extra package information
- `catalogue_freshness()`: Provides a view of how many resources have been updated in the last 6 months (London Datastore only)
- `package_show_info_json(package_name: Union[str, dict, Any])`: Returns package metadata including resource information
- `package_search_json(search_query: str, num_rows: int)`: Searches for packages and returns results as JSON
- `package_search_condense_json_unpacked(search_query: str, num_rows: int)`: Returns a condensed view of package information
- `package_search_condense_dataframe_packed(search_query: str, num_rows: int, df_type: Literal["pandas", "polars"])`: Returns a condensed view with packed resources
- `package_search_condense_dataframe_unpacked(search_query: str, num_rows: int, df_type: Literal["pandas", "polars"])`: Returns a condensed view with unpacked resources
- `extract_resource_url(package_info: List[Dict], resource_name: str)`: Extracts the URL and format of a specific resource
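Several of the method names above echo CKAN Action API actions (`package_list`, `package_show`, `package_search`). As a rough sketch of what such requests look like, assuming the standard `/api/3/action/` routes, the URLs can be composed with the standard library. The `ckan_action_url` helper and the `catalogue.example.org` host are illustrative assumptions; whether HerdingCats builds its requests this way is not confirmed here.

```python
from urllib.parse import urlencode, urljoin


def ckan_action_url(base_url: str, action: str, **params) -> str:
    """Build a CKAN Action API URL (hypothetical helper, for illustration only)."""
    url = urljoin(base_url.rstrip("/") + "/", f"api/3/action/{action}")
    if params:
        # urlencode handles escaping of spaces and special characters
        url += "?" + urlencode(params)
    return url


base = "https://catalogue.example.org"  # placeholder host, not a real catalogue
list_url = ckan_action_url(base, "package_list")
search_url = ckan_action_url(base, "package_search", q="air quality", rows=5)
show_url = ckan_action_url(base, "package_show", id="some-package-name")

print(list_url)
print(search_url)
print(show_url)
```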
import HerdingCats as hc

def main():
    with hc.CatSession(hc.CkanDataCatalogues.LONDON_DATA_STORE) as session:
        explore = hc.CkanCatExplorer(session)
        # The loader is created without a session
        loader = hc.CkanCatResourceLoader()

if __name__ == "__main__":
    main()
- `polars_data_loader(resource_data: Optional[List]) -> Optional[pl.DataFrame]`
  - Loads data into a Polars DataFrame
  - Supports Excel (.xlsx) and CSV formats
- `pandas_data_loader(resource_data: Optional[List]) -> Optional[pd.DataFrame]`
  - Loads data into a Pandas DataFrame
  - Supports Excel (.xlsx) and CSV formats
- `duckdb_data_loader(resource_data: Optional[List], duckdb_name: str, table_name: str)`
  - Loads data into a local DuckDB database
  - Supports Excel (.xlsx) and CSV formats
- `motherduck_data_loader(resource_data: Optional[List[str]], token: str, duckdb_name: str, table_name: str)`
  - Loads data into MotherDuck
  - Supports Excel (.xlsx), CSV, and JSON formats
- `aws_s3_data_loader(resource_data: Optional[List[str]], bucket_name: str, custom_name: str, mode: Literal["raw", "parquet"])`
  - Loads data into an AWS S3 bucket
  - Supports raw file upload or Parquet conversion
  - Supports Excel (.xlsx), CSV, and JSON formats
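Each loader above accepts only certain file formats, which implies some form of dispatch on the resource format. The sketch below illustrates that idea with the standard library only. The `read_rows` helper is hypothetical, works on in-memory bytes rather than a downloaded file, and leaves out Excel; it is not how HerdingCats itself is implemented.

```python
import csv
import io
import json
from typing import Dict, List


def read_rows(payload: bytes, file_format: str) -> List[Dict]:
    """Parse raw bytes into a list of row dictionaries based on the declared format."""
    fmt = file_format.lower().lstrip(".")
    text = payload.decode("utf-8")
    if fmt == "csv":
        # DictReader uses the first line as column names; values stay as strings
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        data = json.loads(text)
        # Normalise a single JSON object to a one-row list
        return data if isinstance(data, list) else [data]
    raise ValueError(f"Unsupported format: {file_format!r}")


rows = read_rows(b"borough,count\nCamden,3\nHackney,5\n", "CSV")
print(rows)
```

Failing fast with a clear error on an unsupported format is usually easier to debug than a silent `None`.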
import HerdingCats as hc

def main():
    # Open a session against UK Power Networks and attach an OpenDataSoft explorer to it
    with hc.CatSession(hc.OpenDataSoftDataCatalogues.UK_POWER_NETWORKS) as session:
        explore = hc.OpenDataSoftCatExplorer(session)

if __name__ == "__main__":
    main()
- `fetch_all_datasets()`: Retrieves all datasets from an OpenDataSoft catalogue
- `show_dataset_info_dict(dataset_id)`: Returns detailed metadata about a specific dataset
- `show_dataset_export_options_dict(dataset_id)`: Returns available export formats and download URLs
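For context on what an export download URL can look like, here is a sketch based on the Opendatasoft Explore API v2.1 route layout (`/api/explore/v2.1/catalog/datasets/{dataset_id}/exports/{format}`). The `export_url` helper is hypothetical, and it is an assumption that the URLs returned by `show_dataset_export_options_dict` follow this exact shape.

```python
from urllib.parse import quote

# Route prefix used by the Opendatasoft Explore API v2.1
EXPLORE_PREFIX = "api/explore/v2.1/catalog/datasets"


def export_url(base_url: str, dataset_id: str, export_format: str = "parquet") -> str:
    """Compose an export URL for one dataset (illustrative helper only)."""
    safe_id = quote(dataset_id, safe="")  # escape any characters unsafe in a path segment
    return f"{base_url.rstrip('/')}/{EXPLORE_PREFIX}/{safe_id}/exports/{export_format}"


url = export_url(
    "https://ukpowernetworks.opendatasoft.com",
    "ukpn-smart-meter-installation-volumes",
)
print(url)
```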
import HerdingCats as hc

def main():
    with hc.CatSession(hc.OpenDataSoftDataCatalogues.UK_POWER_NETWORKS) as session:
        explore = hc.OpenDataSoftCatExplorer(session)
        # The loader is created without a session
        loader = hc.OpenDataSoftResourceLoader()

if __name__ == "__main__":
    main()
- `polars_data_loader(resource_data: Optional[List[Dict]], format_type: Literal["parquet"], api_key: Optional[str] = None) -> pl.DataFrame`
  - Loads Parquet data into a Polars DataFrame
  - Optional API key for authenticated access
- `pandas_data_loader(resource_data: Optional[List[Dict]], format_type: Literal["parquet"], api_key: Optional[str] = None) -> pd.DataFrame`
  - Loads Parquet data into a Pandas DataFrame
  - Optional API key for authenticated access
- `duckdb_data_loader(resource_data: Optional[List[Dict]], format_type: Literal["parquet"], api_key: Optional[str] = None) -> duckdb.DuckDBPyConnection`
  - Loads Parquet data into an in-memory DuckDB database
  - Optional API key for authenticated access
- `aws_s3_data_loader(resource_data: Optional[List[Dict]], bucket_name: str, custom_name: str, api_key: Optional[str] = None)`
  - Loads Parquet data into an AWS S3 bucket
  - Optional API key for authenticated access
  - Requires configured AWS credentials
import HerdingCats as hc

def main():
    with hc.CatSession(hc.CkanDataCatalogues.HUMANITARIAN_DATA_STORE) as session:
        explore = hc.CkanCatExplorer(session)
        loader = hc.CkanCatResourceLoader()

        # List every package in the catalogue
        packages = explore.package_list_dictionary()

        # Fetch metadata for one package, then pick out a single resource by name
        data = explore.package_show_info_json("burkina-faso-violence-against-civilians-and-vital-civilian-facilities")
        data_prep = explore.extract_resource_url(data, "2020-2024-BFA Aid Worker KIKA Incident Data.xlsx")

        # Load the same resource into Polars and Pandas
        df = loader.polars_data_loader(data_prep)
        df_2 = loader.pandas_data_loader(data_prep)

        print(df.head(15))
        print(df_2.head(15))

if __name__ == "__main__":
    main()
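To make the `extract_resource_url` step in the example above more concrete, here is a sketch of that kind of lookup on CKAN-style package metadata, where each package carries a `resources` list with `name`, `format`, and `url` fields. The `find_resource` helper, the sample record, and the `[format, url]` return shape are illustrative assumptions, not the library's actual behaviour.

```python
from typing import Dict, List, Optional


def find_resource(package: Dict, resource_name: str) -> Optional[List[str]]:
    """Return [format, url] for the named resource, or None if it is absent."""
    for resource in package.get("resources", []):
        if resource.get("name") == resource_name:
            return [resource.get("format", "").lower(), resource["url"]]
    return None


# Made-up metadata in the shape CKAN returns for a single package
sample_package = {
    "name": "example-package",
    "resources": [
        {"name": "incidents.xlsx", "format": "XLSX",
         "url": "https://catalogue.example.org/files/incidents.xlsx"},
        {"name": "incidents.csv", "format": "CSV",
         "url": "https://catalogue.example.org/files/incidents.csv"},
    ],
}

match = find_resource(sample_package, "incidents.csv")
missing = find_resource(sample_package, "does-not-exist.csv")
print(match, missing)
```

Because the lookup is by exact resource name, a typo yields `None`, so it is worth checking the result before passing it to a loader.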
Some data catalogues require a free API key.
Sign up to the relevant datastore to generate one.
import HerdingCats as hc

def main():
    with hc.CatSession(hc.OpenDataSoftDataCatalogues.UK_POWER_NETWORKS) as session:
        explore = hc.OpenDataSoftCatExplorer(session)
        loader = hc.OpenDataSoftResourceLoader()

        # Look up the export options for the dataset, then load its Parquet export
        data = explore.show_dataset_export_options_dict("ukpn-smart-meter-installation-volumes")
        pl_df = loader.polars_data_loader(data, "parquet", "your_api_key")

        print(pl_df.head(15))

if __name__ == "__main__":
    main()
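Rather than hard-coding the key as in the example above, one common approach is to read it from an environment variable. This is a minimal sketch; the `get_api_key` helper and the `HERDINGCATS_API_KEY` variable name are hypothetical and are not defined by the library.

```python
import os
from typing import Optional


def get_api_key(env_var: str = "HERDINGCATS_API_KEY") -> Optional[str]:
    """Read an API key from the environment; return None when unset or blank."""
    key = os.environ.get(env_var, "").strip()
    return key or None


api_key = get_api_key()
if api_key is None:
    print("No API key found; only unauthenticated requests are possible")
else:
    print("API key loaded from the environment")
```

Since the loaders declare `api_key: Optional[str] = None`, the result could be passed straight through, for example `loader.polars_data_loader(data, "parquet", get_api_key())`.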