Releases: gumdropsteve/turbo-telegram
SQL Based + Full 2015 Data Download & Processing
Turbo-telegram v0.0.3-beta
This release brings 2 major updates for the price of 1!
BlazingSQL based NYC Dashboard
Rework of taxi_dashboard.ipynb to utilize SQL queries when producing all DataFrames.
- BlazingSQL table of query results that's then focused accordingly
- Apply cuDF's .to_pandas() for HoloViews plots
This eliminates post-query filtering of results, freeing up GPU memory & enabling use of much larger datasets.
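As a rough illustration of the idea (not the notebook's actual code), the sketch below pushes the filter into the SQL query so only matching rows come back. Python's built-in sqlite3 stands in for BlazingSQL so the snippet runs without a GPU; the table name `taxi`, the column names, the sample rows, and the `build_query` helper are all assumptions made for illustration.

```python
import sqlite3


def build_query(riders, fare_min, fare_max):
    """Return a SQL string that filters inside the query (hypothetical helper).

    Values are coerced to int/float before formatting so that only numbers
    can end up in the query text.
    """
    return (
        "SELECT dropoff_x, dropoff_y FROM taxi "
        f"WHERE passenger_count = {int(riders)} "
        f"AND fare_amount BETWEEN {float(fare_min)} AND {float(fare_max)}"
    )


# Stand-in engine: sqlite3 instead of BlazingSQL (assumption, CPU only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxi ("
    "passenger_count INTEGER, fare_amount REAL, dropoff_x REAL, dropoff_y REAL)"
)
# Made-up sample rows, only here to exercise the query.
conn.executemany(
    "INSERT INTO taxi VALUES (?, ?, ?, ?)",
    [
        (1, 7.5, -8235000.0, 4975000.0),
        (2, 12.0, -8236000.0, 4976000.0),
        (1, 55.0, -8237000.0, 4977000.0),
        (1, 20.0, -8238000.0, 4978000.0),
    ],
)

# Only rows that already satisfy the filter are returned; nothing is
# filtered after the query.
rows = conn.execute(build_query(1, 5, 20)).fetchall()
conn.close()
```

With BlazingSQL, the same query string would instead go to `bc.sql(...)` on a `BlazingContext`, and the resulting cuDF DataFrame would be converted with `.to_pandas()` before plotting.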
2015 Taxi Data Download & Processing
Users can now download & pre-process all 12 months of 2015 NYC yellow cab data. Total download size is ~20.07 GB before processing and ~18.94 GB (CSV [0]) after processing.
NOTE: taxi_dashboard.ipynb does NOT yet point to this new data. This will be implemented soon, but issues such as optimizing big data integration for single-GPU users need to be addressed first.
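To illustrate how the 12 monthly files might be enumerated and prepared for map-based plotting, here is a minimal sketch in plain Python. The file-name pattern `yellow_tripdata_2015-MM.csv`, the `monthly_files` and `lnglat_to_meters` helpers, and the choice of a Web Mercator conversion as the pre-processing step are assumptions for illustration; the actual notebook does its processing with BlazingSQL and NumPy.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius used by Web Mercator


def monthly_files(year=2015):
    """File names for each month (the naming pattern is an assumption)."""
    return [f"yellow_tripdata_{year}-{month:02d}.csv" for month in range(1, 13)]


def lnglat_to_meters(lon, lat):
    """Convert longitude/latitude in degrees to Web Mercator meters."""
    x = math.radians(lon) * EARTH_RADIUS_M
    y = math.log(math.tan(math.pi / 4 + math.radians(lat) / 2)) * EARTH_RADIUS_M
    return x, y


files = monthly_files()

# Handling one file per loop iteration keeps only a single month in
# memory at a time; here we just convert one sample point.
x, y = lnglat_to_meters(-73.9857, 40.7484)  # a point in Midtown Manhattan
```

Looping over `files` and processing each month independently is what lets a single GPU with limited memory work through the full year.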
New Files
-
- based off HoloViz taxi_preprocessing_example.py
- downloads & processes all 12 months of 2015 NYC taxi data
- uses BlazingSQL & NumPy [1] to configure data for use with Datashader / HoloViews
- single node / processes 1 month at a time to ensure anyone w/ compatible GPU can run
- tested w/ 16GB Tesla T4 GPU on AWS, runs end-to-end in 7-8 min [2]
- GPU capacity test: the final visualization under "Extra" (at the end) queries January thru August (8 of 12 months) [3][4]
-
- based off RAPIDS sql_check.py
- checks for installation of BlazingSQL & installs via Anaconda if not found
- called in the download_data.ipynb imports section if BSQL is not found & the user wants to install
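The check-then-install flow could look roughly like the sketch below. It uses only the standard library; the helper names and the idea of passing the install command in as an argument are assumptions, and the actual conda command is deliberately left to the caller rather than guessed here.

```python
import importlib.util
import subprocess


def is_installed(module_name):
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(module_name, install_cmd, confirm=lambda: False):
    """Install only when the module is missing and the user agrees.

    install_cmd is a list of command-line arguments supplied by the caller
    (hypothetical, e.g. a conda install command). confirm is a callable
    returning True when the user wants to install.
    Returns True if the module is available afterwards.
    """
    if is_installed(module_name):
        return True
    if not confirm():
        return False
    subprocess.run(install_cmd, check=True)
    importlib.invalidate_caches()  # make the fresh install discoverable
    return is_installed(module_name)
```

In a notebook, `confirm` could wrap an `input()` prompt so nothing is installed without the user's say-so.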
Footnotes
[0] 12 files, 18 columns each, 135,216,505 rows combined
[1] elimination of NumPy expected w/ resolution of BlazingDB/blazingsql#334 (UPDATE 4 Feb: BSQL only merged to master branch 95c963c)
[2] last run: 4m 27s download; 3m 25s processing (largely from writing with .to_csv()); 7m 52s total
[3] sticking to consecutive months starting with January, this was the largest table query that processed w/o the kernel crashing: ~12.6GB of CSV, which is ~25GB on GPU, running on a single 16GB Tesla T4 GPU AWS EC2 instance
[4] plot from that final visualization (image not reproduced here)
More data, less bugs [Taxi Dashboard Update]
NYC Taxi Dashboard Update
improvements
- more data
- added filtered & converted data for February & March of 2015
- base table now created from Q1 2015 via wildcard (*) in file path
- simplified code, clearer notes & docstrings
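To see which files a wildcard pattern would pick up, a quick standard-library check can help. The directory layout and file names below are made up for the demonstration, and `glob` is only a stand-in: how the table path is actually resolved may differ.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as data_dir:
    # Create empty placeholder files (hypothetical names).
    for name in ("taxi_2015-01.csv", "taxi_2015-02.csv",
                 "taxi_2015-03.csv", "notes.txt"):
        open(os.path.join(data_dir, name), "w").close()

    # One pattern covers every monthly CSV and skips unrelated files.
    pattern = os.path.join(data_dir, "taxi_2015-*.csv")
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))
```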
issues addressed
- resolved #8: moved the riders & fare input checks out of common_filtering to the point where its inputs are supplied, so common_filtering is not engaged until the input is checked
- resolved #9: simplified if/elif/else statements under map outputs
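The reordering in #8 amounts to validating inputs before the filtering function is ever called. A minimal sketch, assuming hypothetical signatures, row layout, and validation rules (the real common_filtering and its checks live in the notebook and may differ):

```python
def check_inputs(riders, fare):
    """Return an error message, or None when the inputs look usable.

    The specific rules here are illustrative assumptions.
    """
    if not isinstance(riders, int) or riders < 1:
        return "riders must be a positive integer"
    if not isinstance(fare, (int, float)) or fare < 0:
        return "fare must be a non-negative number"
    return None


def common_filtering(rows, riders, fare):
    """Stand-in filter: keep rows with matching riders and fare at or below the cap."""
    return [r for r in rows if r["riders"] == riders and r["fare"] <= fare]


def filtered_view(rows, riders, fare):
    """Check inputs first; common_filtering runs only if they pass."""
    error = check_inputs(riders, fare)
    if error is not None:
        return error
    return common_filtering(rows, riders, fare)
```

Keeping the checks outside the filter means bad input never reaches the more expensive filtering step.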
extra
- removed seasonal Christmas NYC map
NYC Taxi Dashboard
- Simple dashboard for exploring NYC Taxi dataset
- Relies on:
- BlazingSQL for data processing
- ipywidgets for engagement
- HoloViews for visualization
- Jupyter Notebook for environment