Have you ever gotten stuck in the data mud and had to write piles of code just to handle a tiny data type problem? Or have you spent a long time digging through your spaghetti code 🍝 to find out what went wrong while cleaning your data? Don't worry! 😌 Here comes DataSika 🦌!

DataSika is a simple Python package that lets you build your own data pipeline locally by writing basic, standard YAML syntax. You can do web scraping and API requests with our built-in functions. We also provide filters for extracting content with XPath (for HTML responses), JSONPath (for JSON responses), and SQL (for manipulating dataframes). Can't wait to try it? Just install it and test it with the examples we provide! 😆 ✨
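To give a feel for what XPath and SQL filtering mean, here is a minimal sketch that uses only the Python standard library. It is not DataSika's own API or YAML syntax: the helper names `xpath_filter` and `sql_filter`, the sample markup, and the `gems` table with its download counts are all hypothetical and made up for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, well-formed HTML snippet standing in for a scraped page.
HTML = """
<html><body>
  <ul>
    <li class="gem">rails</li>
    <li class="gem">rake</li>
    <li class="other">readme</li>
  </ul>
</body></html>
""".strip()


def xpath_filter(markup, expression):
    """Return the text of every element matching an XPath expression.

    ElementTree only supports a limited XPath subset, which is enough here.
    """
    root = ET.fromstring(markup)
    return [node.text for node in root.findall(expression)]


def sql_filter(rows, query):
    """Load (name, downloads) rows into an in-memory table and run a query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE gems (name TEXT, downloads INTEGER)")
        conn.executemany("INSERT INTO gems VALUES (?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# XPath: keep only the list items tagged as gems.
gem_names = xpath_filter(HTML, ".//li[@class='gem']")

# SQL: keep only rows above a (made-up) download threshold.
popular = sql_filter(
    [("rails", 500), ("rake", 900), ("tiny", 3)],
    "SELECT name FROM gems WHERE downloads > 100 ORDER BY name",
)

print(gem_names)
print(popular)
```

In DataSika itself you would express filters like these in your YAML file rather than in Python; see the bundled examples for the real syntax.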
- Python version > 3.7
- Using command:
git clone git@github.com:rainyjonne/DataSika.git
- Manual download: click Download ZIP from the green Code button
- Install pip
- macOS:
python -m ensurepip --upgrade
- WSL, Linux:
python -m ensurepip --upgrade
- Just execute this command:
pip install DataSika
Then you can happily use the sika command with your YAML files! 🎉 🎊
- Sika command usage:
usage: sika [-h] [--input INPUT] [--output OUTPUT] [--rerun]
Build a simple pipeline by a yaml file
optional arguments:
-h, --help show this help message and exit
--input INPUT put in an input yaml file path
--output OUTPUT put a path for your output db
--rerun rerun the whole pipeline again, delete all data tables in your db file
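
If you would rather trigger a pipeline from a Python script, one option is to call the `sika` command through `subprocess`. The sketch below is only an assumption about how you might wrap it: it uses just the three flags listed in the help text above, the helper names `build_sika_command` and `run_sika` are hypothetical, and `gems.db` is a made-up output path.

```python
import subprocess


def build_sika_command(input_path, output_path=None, rerun=False):
    """Build the argument list for the sika CLI from the documented flags."""
    command = ["sika", "--input", input_path]
    if output_path is not None:
        command += ["--output", output_path]
    if rerun:
        # --rerun deletes all data tables in the db file before running again.
        command.append("--rerun")
    return command


def run_sika(input_path, output_path=None, rerun=False):
    """Run the pipeline; raises CalledProcessError if sika exits with an error."""
    return subprocess.run(
        build_sika_command(input_path, output_path, rerun),
        check=True,
    )


if __name__ == "__main__":
    # Only prints the command, so this is safe to run without sika installed.
    print(build_sika_command("examples/repominer.yaml", "gems.db", rerun=True))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when a YAML path contains spaces.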
- Making this package's command line tool work:
python setup.py install
- Running our four examples:
- Using command line tools:
- (ETL) Getting Ruby Gem Details Example:
sika --input examples/repominer.yaml
- (ETL) Airbnb UK Hostings + UK Crime Data Example:
sika --input examples/airbnb-uk-crime.yaml
- (EL) Getting Ruby Gem Lists Example:
sika --input examples/repominer-el.yaml
- (EL) Airbnb Japan Hostings Example:
sika --input examples/airbnb-tokyo.yaml
- Using python scripts:
- (ETL) Getting Ruby Gem Details Example:
python sika/main.py --input examples/repominer.yaml
- (ETL) Airbnb UK Hostings + UK Crime Data Example:
python sika/main.py --input examples/airbnb-uk-crime.yaml
- (EL) Getting Ruby Gem Lists Example:
python sika/main.py --input examples/repominer-el.yaml
- (EL) Airbnb Japan Hostings Example:
python sika/main.py --input examples/airbnb-tokyo.yaml
- Tests should be run under the root folder of DataSika
- read type tasks:
pytest tests/test_read.py
- filter type tasks:
pytest tests/test_filter.py
- merge type tasks:
pytest tests/test_merge.py
- transform type tasks:
pytest tests/test_transform.py
pytest tests/test_tasks.py
pytest tests/test_stages.py