Repository holds test scripts for benchmarking different file formats. CSV is a relatively uncompressed, plain-text format, but it is very common for data tasks such as import, export, and storage. When it comes to the performance of creating, reading, and writing files, how does CSV stand up against other formats?
Benchmarking different file formats for cloud storage.
- CSV
- AVRO
- Parquet
- Pickle
- ORC
- TXT
create_df()  # generate the test data frame used by all write functions
# results for write (total time in seconds for number_of_runs repetitions)
print(timeit.Timer(WRITE_CSV_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_ORC_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_PARQUET_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_PICKLE_fun_timeIt).timeit(number=number_of_runs))
CLEAN_files()  # remove the files generated during the benchmark
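Each WRITE_*_fun_timeIt name above is a zero-argument callable that timeit invokes repeatedly. As a rough sketch only (the data frame, output paths, and function bodies below are assumptions for illustration, not the repository's exact helpers), such a helper can be as simple as:

import pandas as pd

# assumed test data; the repository builds its own data frame in create_df()
test_df = pd.DataFrame({"id": range(1_000_000), "value": range(1_000_000)})

def WRITE_CSV_fun_timeIt():
    # one full CSV write, timed as a whole by timeit
    test_df.to_csv("test_df.csv", index=False)

def WRITE_PARQUET_fun_timeIt():
    # one full Parquet write (requires pyarrow or fastparquet)
    test_df.to_parquet("test_df.parquet")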
- CSV
- Parquet
- Feather
benchmark_write <- data.frame(summary(microbenchmark(
"test_df.csv" = write.csv(test_df, file = file_csv),
"test_df_readr.csv" = readr::write_csv(test_df, file = file_csv_readr),
"test_df_datatable.csv" = data.table::fwrite(test_df, file = file_csv_datatable),
"test_df.feather" = write_feather(test_df, file_feather),
"test_df.parquet" = write_parquet(test_df, file_parquet),
"test_df.rds" = save(test_df, file = file_rdata),
"test_df.RData" = saveRDS(test_df, file_rds),
times = nof_repeat)))
Compare read and write times for each file format and see which one performs better for the given task.
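A read benchmark mirrors the write benchmark: generate the files once, then time repeated reads of each format and compare. A minimal Python sketch, where the file names, data size, and repeat count are assumptions for illustration:

import timeit
import pandas as pd

# assumed sample data and file names
test_df = pd.DataFrame({"id": range(100_000), "value": range(100_000)})
test_df.to_csv("test_df.csv", index=False)
test_df.to_parquet("test_df.parquet")  # requires pyarrow or fastparquet

number_of_runs = 10  # assumed repeat count
csv_time = timeit.timeit(lambda: pd.read_csv("test_df.csv"), number=number_of_runs)
parquet_time = timeit.timeit(lambda: pd.read_parquet("test_df.parquet"), number=number_of_runs)
print(f"CSV read: {csv_time:.3f} s, Parquet read: {parquet_time:.3f} s")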
Example when testing with R:
You can follow the steps below to clone the repository.
git clone https://github.com/tomaztk/Benchmarking-file-formats-for-cloud.git
For running SQL Server on-premises and uploading data to the data lake, there is a Python script (Jupyter notebook) with detailed steps and code.
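In outline, the flow is: query SQL Server into a data frame, write it to a columnar file, and upload that file to Azure Data Lake Storage Gen2. A hedged sketch follows; the connection string, table, storage account, container, and credential are placeholders, not the values used in the repository:

import pandas as pd
import pyodbc
from azure.storage.filedatalake import DataLakeServiceClient

# 1) read the source table from on-premises SQL Server (placeholder connection string)
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=TestDB;Trusted_Connection=yes")
df = pd.read_sql("SELECT * FROM dbo.test_table", conn)

# 2) write it locally as Parquet (requires pyarrow)
df.to_parquet("test_df.parquet")

# 3) upload the file to an Azure Data Lake Storage Gen2 container (placeholder account and key)
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential="<account-key-or-sas>",
)
file_client = service.get_file_system_client("benchmark-files").get_file_client("test_df.parquet")
with open("test_df.parquet", "rb") as fh:
    file_client.upload_data(fh, overwrite=True)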
- CSV or alternatives? Exporting data from SQL Server to ORC, AVRO, Parquet, and Feather files and storing them in Azure Data Lake
- Comparing performances of CSV to RDS, Parquet, and Feather file formats in R
Thanks to these wonderful people from the R community for upgrading and improving these benchmarks. Your contributions are highly appreciated!
Ryan Duryea