Skip to content

Latest commit

 

History

History
120 lines (97 loc) · 5.88 KB

README.md

File metadata and controls

120 lines (97 loc) · 5.88 KB

GDELT with Nushell

The Global Database of Events, Language and Tone (GDELT) project claims to be largest, most comprehensive, and highest resolution open database of human society ever created. GDELT data goes back decades and continues to get updated every 15mins. The link provided is well worth exploring if you are not familiar with this fascinating dataset.

nu-gdelt grabs that new data each 15mins from the raw compressed CSV files, casts the data attributes to the correct types and saves the data into parquet files that are stored in monthly partitioned directories. The raw CSV files are stored in a Bronze partitioned directory, and the parquet files are stored in a Silver partitioned directory - adopting the data lake medallion architecture.

nu-gdelt uses nushell scripts while harnessing duckdb for the data transformations.

nu-gdelt is intended to be initiated by a cron job every 15mins, such as in the following:

*/15 * * * * /full/file/path/filename.nu

Once the data builds up in your own personal data lake, you can query it using duckdb rather simply. For example, to view the schema and details of the data, you could run:

duckdb
.mode line
DESCRIBE SELECT * FROM read_parquet('silver/2024/09/*.parquet');

Or, to see some aggregates of all of your data, you could run:

SUMMARIZE SELECT * FROM read_parquet('silver/2024/10/*.parquet');

As your data grows, it stays organised and in a form that is efficient and easy to query.

The final parquet files have the following schema. This schema closely matches that of the GDELT source, as per the documentation.

column_name column_type nullable
GlobalEventID INTEGER YES
Day INTEGER YES
MonthYear INTEGER YES
Year INTEGER YES
FractionDate FLOAT YES
Actor1Code VARCHAR YES
Actor1Name VARCHAR YES
Actor1CountryCode VARCHAR YES
Actor1KnownGroupCode VARCHAR YES
Actor1EthnicCode VARCHAR YES
Actor1Religion1Code VARCHAR YES
Actor1Religion2Code VARCHAR YES
Actor1Type1Code VARCHAR YES
Actor1Type2Code VARCHAR YES
Actor1Type3Code VARCHAR YES
Actor2Code VARCHAR YES
Actor2Name VARCHAR YES
Actor2CountryCode VARCHAR YES
Actor2KnownGroupCode VARCHAR YES
Actor2EthnicCode VARCHAR YES
Actor2Religion1Code VARCHAR YES
Actor2Religion2Code VARCHAR YES
Actor2Type1Code VARCHAR YES
Actor2Type2Code VARCHAR YES
Actor2Type3Code VARCHAR YES
IsRootEvent INTEGER YES
EventCode VARCHAR YES
EventBaseCode VARCHAR YES
EventRootCode VARCHAR YES
QuadClass INTEGER YES
GoldsteinScale FLOAT YES
NumMentions INTEGER YES
NumSources INTEGER YES
NumArticles INTEGER YES
AvgTone FLOAT YES
Actor1Geo_Type INTEGER YES
Actor1Geo_FullName VARCHAR YES
Actor1Geo_CountryCode VARCHAR YES
Actor1Geo_ADM1Code VARCHAR YES
Actor1Geo_ADM2Code VARCHAR YES
Actor1Geo_Lat FLOAT YES
Actor1Geo_Long FLOAT YES
Actor1Geo_FeatureID VARCHAR YES
Actor2Geo_Type INTEGER YES
Actor2Geo_FullName VARCHAR YES
Actor2Geo_CountryCode VARCHAR YES
Actor2Geo_ADM1Code VARCHAR YES
Actor2Geo_ADM2Code VARCHAR YES
Actor2Geo_Lat FLOAT YES
Actor2Geo_Long FLOAT YES
Actor2Geo_FeatureID VARCHAR YES
ActionGeo_Type INTEGER YES
ActionGeo_FullName VARCHAR YES
ActionGeo_CountryCode VARCHAR YES
ActionGeo_ADM1Code VARCHAR YES
ActionGeo_ADM2Code VARCHAR YES
ActionGeo_Lat FLOAT YES
ActionGeo_Long FLOAT YES
ActionGeo_FeatureID VARCHAR YES
DATEADDED BIGINT YES
SOURCEURL VARCHAR YES

Logging

nu-gdelt also produces and saves logs into a custom log file called gdelt.log. The format for logs is simply Datetime, Severtity and Message. The log file has been formatted in such a way to make it easy to read. Additionally, because we are using nushell, we can very easily navigate and filter our logs using the following command:

open gdelt.log | lines | split column " - " | rename "Datetime" "Severity" "Message" | into value

This command will provide a table of log data that can be filtered and sorted, including by time as the Datetime data is read in as an actual date date type.

If you want to observe the logs as the program is running, you can run the following:

tail -f gdelt.log

CTRL+C to exit tailing.

Future iterations of the logging functionality will include auto-rotating of logs, archiving and compression etc.

--

This repo is under active development.