Repository holds test scripts for benchmarking different file formats. CSV is a relatively uncompressed, plain-text format, but it is very common for data tasks such as import, export, and storage. When it comes to the performance of creating, reading, and writing files, how does CSV stand up against other formats?
Benchmarking different file formats for cloud storage.
- CSV
- AVRO
- Parquet
- Pickle
- ORC
- TXT
create_df()  # generate the test data frame used by all write functions
# results for write (total time in seconds for number_of_runs repetitions)
print(timeit.Timer(WRITE_CSV_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_ORC_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_PARQUET_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_PICKLE_fun_timeIt).timeit(number=number_of_runs))
CLEAN_files()  # remove the files generated during the benchmark
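Each WRITE_*_fun_timeIt name above is a zero-argument callable that timeit invokes repeatedly. As a rough sketch only (the data frame, output paths, and function bodies below are assumptions for illustration, not the repository's exact helpers), such a helper can be as simple as:

import pandas as pd

# assumed test data; the repository builds its own data frame in create_df()
test_df = pd.DataFrame({"id": range(1_000_000), "value": range(1_000_000)})

def WRITE_CSV_fun_timeIt():
    # one full CSV write, timed as a whole by timeit
    test_df.to_csv("test_df.csv", index=False)

def WRITE_PARQUET_fun_timeIt():
    # one full Parquet write (requires pyarrow or fastparquet)
    test_df.to_parquet("test_df.parquet")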
- CSV
- Parquet
- Feather
benchmark_write <- data.frame(summary(microbenchmark(
"test_df.csv" = write.csv(test_df, file = file_csv),
"test_df_readr.csv" = readr::write_csv(test_df, file = file_csv_readr),
"test_df_datatable.csv" = data.table::fwrite(test_df, file = file_csv_datatable),
"test_df.feather" = write_feather(test_df, file_feather),
"test_df.parquet" = write_parquet(test_df, file_parquet),
"test_df.rds" = save(test_df, file = file_rdata),
"test_df.RData" = saveRDS(test_df, file_rds),
times = nof_repeat)))
Compare read and write times for each file format and see which one performs better for the given task.
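A read benchmark mirrors the write benchmark: generate the files once, then time repeated reads of each format and compare. A minimal Python sketch, where the file names, data size, and repeat count are assumptions for illustration:

import timeit
import pandas as pd

# assumed sample data and file names
test_df = pd.DataFrame({"id": range(100_000), "value": range(100_000)})
test_df.to_csv("test_df.csv", index=False)
test_df.to_parquet("test_df.parquet")  # requires pyarrow or fastparquet

number_of_runs = 10  # assumed repeat count
csv_time = timeit.timeit(lambda: pd.read_csv("test_df.csv"), number=number_of_runs)
parquet_time = timeit.timeit(lambda: pd.read_parquet("test_df.parquet"), number=number_of_runs)
print(f"CSV read: {csv_time:.3f} s, Parquet read: {parquet_time:.3f} s")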
Example when testing with R:
You can follow the steps below to clone the repository.
git clone https://github.com/tomaztk/Benchmarking-file-formats-for-cloud.git
For running SQL Server on-premises and uploading data to the data lake, there is a Python script (Jupyter notebook) with detailed steps and code.
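In outline, the flow is: query SQL Server into a data frame, write it to a columnar file, and upload that file to Azure Data Lake Storage Gen2. A hedged sketch follows; the connection string, table, storage account, container, and credential are placeholders, not the values used in the repository:

import pandas as pd
import pyodbc
from azure.storage.filedatalake import DataLakeServiceClient

# 1) read the source table from on-premises SQL Server (placeholder connection string)
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=TestDB;Trusted_Connection=yes")
df = pd.read_sql("SELECT * FROM dbo.test_table", conn)

# 2) write it locally as Parquet (requires pyarrow)
df.to_parquet("test_df.parquet")

# 3) upload the file to an Azure Data Lake Storage Gen2 container (placeholder account and key)
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential="<account-key-or-sas>",
)
file_client = service.get_file_system_client("benchmark-files").get_file_client("test_df.parquet")
with open("test_df.parquet", "rb") as fh:
    file_client.upload_data(fh, overwrite=True)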
- CSV or alternatives? Exporting data from SQL Server to ORC, AVRO, Parquet, and Feather files and storing them in Azure Data Lake
- Comparing performances of CSV to RDS, Parquet, and Feather file formats in R
Thanks to these wonderful people from the R community for upgrading and improving these benchmarks. Your contributions are highly appreciated!
Ryan Duryea