Schema Driven Processing Language (SDPL)
SDPL introduces data schemas to major data processing languages such as Apache Pig, Spark and Hive. SDPL supports generic operations such as `LOAD`, `STORE`, `JOIN` and `PROJECT`, while complex transformations and fine-tuning are performed in the target language via quotation.
SDPL links 3 artifacts:
- DataRepository file describes the data source and the credentials needed to access it
- Schema file describes the data
- Source code describes what data to load and which transformations to apply
Supported target languages are Apache Pig and Spark. A DataRepository is a short YAML file. A Schema can be read from SDPL YAML, Avro and Protobuf formats.
Main repository: https://bitbucket.org/mushkevych/sdpl
Mirror: https://github.com/mushkevych/sdpl
- Python 3.5+
- antlr4 package: `$> sudo apt-get install antlr4`
- antlr4-python3-runtime: `$> pip install antlr4-python3-runtime`
- PyYAML: `$> pip install PyYAML`
- Avro: `$> pip install avro-python3`
- Protobuf: `$> pip install protobuf`
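
Before generating the parser, it can help to confirm that the dependencies listed above are actually installed. The following is a minimal sketch and not part of SDPL: the function names are hypothetical, and it only assumes the usual import names of the pip packages (`antlr4`, `yaml`, `avro`, `google.protobuf`) and that the ANTLR tool is exposed as an `antlr4` executable on the `PATH`.

```python
import importlib.util
import shutil
import sys

# Usual import names for the pip packages listed above.
REQUIRED_MODULES = {
    "antlr4-python3-runtime": "antlr4",
    "PyYAML": "yaml",
    "avro-python3": "avro",
    "protobuf": "google.protobuf",
}


def python_is_supported(version_info=sys.version_info):
    """Return True for Python 3.5 or newer."""
    return tuple(version_info[:2]) >= (3, 5)


def is_importable(module_name):
    """Return True if module_name can be found on the import path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. "google") is itself missing.
        return False


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip package names whose modules cannot be found."""
    return [pkg for pkg, module in required.items() if not is_importable(module)]


def antlr4_tool_available():
    """Return True if an `antlr4` executable is on the PATH (assumed name)."""
    return shutil.which("antlr4") is not None
```

Calling `missing_packages()` returns the names to pass to `pip install`; an empty list means all four Python packages were found.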
`$> antlr4 -Dlanguage=Python3 sdpl.g4`
`$> python3 sdpl.py pig -i tests/snippet_1.sdpl`
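
The two commands above can also be chained from a small Python script, for example in a build step. This is only a sketch under stated assumptions: the helper names are hypothetical, and reading `pig` as the target language and `-i` as the input file is inferred from the command line shown here rather than from SDPL's documentation.

```python
import subprocess


def parser_generation_command(grammar="sdpl.g4"):
    """Command that generates the Python 3 lexer and parser from the grammar."""
    return ["antlr4", "-Dlanguage=Python3", grammar]


def compile_command(target="pig", snippet="tests/snippet_1.sdpl"):
    """Command that runs sdpl.py on a snippet (argument meaning is assumed)."""
    return ["python3", "sdpl.py", target, "-i", snippet]


def run_all(target="pig", snippet="tests/snippet_1.sdpl"):
    """Run both steps in order; stop with CalledProcessError on failure."""
    for command in (parser_generation_command(), compile_command(target, snippet)):
        subprocess.run(command, check=True)
```

`run_all()` should be called from the repository root, since both commands use paths relative to it.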