- Version: 0.1-0
- Status:
- License: BSD 2-Clause
- Author: Drew Schmidt and Amil Williamson
A set of high-level utilities for working with HDF5.
This package is not meant to expose anywhere near the full capabilities of HDF5. For that, see the rhdf5 and hdf5r packages (we actually use hdf5r internally). The goal of this package is to try to make HDF5 as simple to use as read.csv() and write.csv(), but with the added benefits of a binary file format.
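To give the flavor up front, here is a minimal sketch of that workflow (the dataframe and file paths are made up for illustration):

library(hdfio)
df = data.frame(x=1:10, y=letters[1:10])
# write.csv(df, "/tmp/df.csv") becomes:
write_h5df(df, "/tmp/df.h5")
# read.csv("/tmp/df.csv") becomes:
read_h5df("/tmp/df.h5")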
Our main focus is on storing and reading dataframes. We call these "h5df", as in "h5 dataframe". I understand that this is annoying and difficult to convince your fingers to type if you are familiar with HDF5, but it's too good of a name to pass up. Right now we support reading two kinds of formats written by Python's pandas (really PyTables), with some restrictions (no strings when format=fixed). We have full support for a format that is good for working with R, and should soon have a format that is useful if the goal is to regularly share data between Python and R.
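As a hedged example of the pandas interoperability: suppose a colleague writes a dataframe from Python with df.to_hdf("mydata.h5", key="mydf", format="table"). Subject to the restrictions above, and assuming the pandas key maps to the dataset argument, it should be readable with:

read_h5df("mydata.h5", dataset="mydf")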
The current documentation is a train wreck, but we're working on it.
The development version is maintained on GitHub, and can easily be installed by any of the packages that offer installations from GitHub:
remotes::install_github("wrathematics/lineSampler")
remotes::install_github("RBigData/hdfio")
We'll take a look at some examples using the famous airlines dataset, working from a directory that contains all of the uncompressed csv files:
dir(pattern="*.csv")
## [1] "1987.csv" "1988.csv" "1989.csv" "1990.csv" "1991.csv" "1992.csv"
## [7] "1993.csv" "1994.csv" "1995.csv" "1996.csv" "1997.csv" "1998.csv"
## [13] "1999.csv" "2000.csv" "2001.csv" "2002.csv" "2003.csv" "2004.csv"
## [19] "2005.csv" "2006.csv" "2007.csv" "2008.csv"
The hdfio package has several ways of taking a csv and turning it into an HDF5 file. We'll start with the basic write_h5df() interface, which takes a dataframe that has already been read into memory and writes it out to HDF5:
library(hdfio)
airlines1987 = data.table::fread("1987.csv")
write_h5df(airlines1987, "/tmp/airlines.h5")
The default format supported by the package allows us to read subsets of rows and columns back in:
read_h5df("/tmp/airlines.h5", rows=1:5, cols=1:8)
## Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime
## 1 1987 10 14 3 741 730 912 849
## 2 1987 10 15 4 729 730 903 849
## 3 1987 10 17 6 741 730 918 849
## 4 1987 10 18 7 729 730 847 849
## 5 1987 10 19 1 749 730 922 849
Note that the read_h5df() function supports only a few formats (including several employed by Python's pandas); it does not handle arbitrary HDF5 files.
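If you do need to poke around an arbitrary HDF5 file, you can drop down to hdf5r (which hdfio uses internally anyway). A minimal sketch, with a made-up file name:

library(hdf5r)
f = H5File$new("arbitrary.h5", mode="r")
f$ls(recursive=TRUE)             # list everything in the file
x = f[["some/dataset"]]$read()   # read a dataset by its path
f$close_all()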
We can also get a quick summary of the dataset:
summarize_h5df("/tmp/airlines.h5")
## Filename: /tmp/airlines.h5
## File size: 10.710 MiB
## Datasets:
## airlines1987
## Format: hdfio_column
## Dimensions: 1311826 x 29
Each HDF5 file can support multiple datasets:
airlines1988 = data.table::fread("1988.csv")
write_h5df(airlines1988, "/tmp/airlines.h5")
If there are multiple datasets, you have to tell the reader which one you want (when only one dataset is present, it is picked automatically):
read_h5df("/tmp/airlines.h5", rows=1:5, cols=1:8)
## Error in h5_get_dataset(h5_fp, dataset) : multiple datasets available
read_h5df("/tmp/airlines.h5", dataset="airlines1988", rows=1:5, cols=1:8)
## Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime
## 1 1988 1 9 6 1348 1331 1458 1435
## 2 1988 1 10 7 1334 1331 1443 1435
## 3 1988 1 11 1 1446 1331 1553 1435
## 4 1988 1 12 2 1334 1331 1438 1435
## 5 1988 1 13 3 1341 1331 1503 1435
summarize_h5df() shows all datasets by default, but we can specify a single dataset, and also optionally show all the column names:
summarize_h5df("/tmp/airlines.h5")
## Filename: /tmp/airlines.h5
## File size: 52.090 MiB
## Datasets:
## airlines1987
## Format: hdfio_column
## Dimensions: 1311826 x 29
## airlines1988
## Format: hdfio_column
## Dimensions: 5202096 x 29
summarize_h5df("/tmp/airlines.h5", "airlines1988", colnames=TRUE)
## Filename: /tmp/airlines.h5
## File size: 52.090 MiB
## Datasets:
## airlines1988
## Format: hdfio_column
## Dimensions: 5202096 x 29
## Columns:
## 1. Year integer
## 2. Month integer
## 3. DayofMonth integer
## 4. DayOfWeek integer
## 5. DepTime integer
## 6. CRSDepTime integer
## 7. ArrTime integer
## 8. CRSArrTime integer
## 9. UniqueCarrier character
## 10. FlightNum integer
## 11. TailNum logical
## 12. ActualElapsedTime integer
## 13. CRSElapsedTime integer
## 14. AirTime logical
## 15. ArrDelay integer
## 16. DepDelay integer
## 17. Origin character
## 18. Dest character
## 19. Distance integer
## 20. TaxiIn logical
## 21. TaxiOut logical
## 22. Cancelled integer
## 23. CancellationCode logical
## 24. Diverted integer
## 25. CarrierDelay logical
## 26. WeatherDelay logical
## 27. NASDelay logical
## 28. SecurityDelay logical
## 29. LateAircraftDelay logical
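With the column indices from the summary in hand, reading specific columns is straightforward; for example, the carrier and route columns (output omitted):

read_h5df("/tmp/airlines.h5", dataset="airlines1988", rows=1:5, cols=c(9, 17, 18))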
The other way to transform a csv file into an HDF5 file is the csv2h5() function:
csv2h5("1995.csv", h5out="/tmp/airlines1995.h5", verbose=TRUE)
## Scanning all input files for storage info...ok!
## Processing batch 1/1
## done!
As the output implies, the function will process the csv file in batches if the entire dataset can't fit into RAM. This dataset is small enough that it easily fits into RAM on this computer, so the entire file is processed in a single batch. If a dataset name is not specified, one is inferred from the name of the input file: here the dataset is named 1995, since the input file was 1995.csv and no dataset name (for the HDF5 file) was specified.
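To pick the name yourself, pass one explicitly. A sketch, assuming csv2h5() accepts a dataset argument like dir2h5() does (the output file name is just an example):

csv2h5("1995.csv", h5out="/tmp/airlines1995_named.h5", dataset="y1995")

Sticking with the auto-named file, the summary confirms the inferred dataset name: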
summarize_h5df("/tmp/airlines1995.h5")
## Filename: /tmp/airlines1995.h5
## File size: 66.188 MiB
## Datasets:
## 1995
## Format: hdfio_column
## Dimensions: 5327435 x 29
Finally, we can convert a directory of csv files into a single HDF5 file using dir2h5(). We can do this in two ways. The first combines the files into a single dataset, conceptually equivalent to reading all of the files and rbind-ing them together (see the sketch below). This is done one file at a time to accommodate large directories of files, and we assume that the csv files are "easily" stackable (no dropping/swapping columns across files).
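For intuition only, here is what the combined dataset amounts to in ordinary R; unlike dir2h5(), this sketch would need the entire dataset in memory at once:

files = dir(pattern="*.csv")
# read every file and stack the pieces row-wise
airlines = data.table::rbindlist(lapply(files, data.table::fread))

Doing the same thing out-of-core, we can create a single HDF5 file for the airlines dataset: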
file.remove("/tmp/airlines.h5")
dir2h5(".", h5out="/tmp/airlines.h5", dataset="airlines", verbose=TRUE)
## Checking input files for common header lines...ok!
## Scanning all input files for storage info...ok!
## Processing 22 files:
## ./1987.csv: reading...ok! writing...ok!
## ./1988.csv: reading...ok! writing...ok!
## ./1989.csv: reading...ok! writing...ok!
## ./1990.csv: reading...ok! writing...ok!
## ./1991.csv: reading...ok! writing...ok!
## ./1992.csv: reading...ok! writing...ok!
## ./1993.csv: reading...ok! writing...ok!
## ./1994.csv: reading...ok! writing...ok!
## ./1995.csv: reading...ok! writing...ok!
## ./1996.csv: reading...ok! writing...ok!
## ./1997.csv: reading...ok! writing...ok!
## ./1998.csv: reading...ok! writing...ok!
## ./1999.csv: reading...ok! writing...ok!
## ./2000.csv: reading...ok! writing...ok!
## ./2001.csv: reading...ok! writing...ok!
## ./2002.csv: reading...ok! writing...ok!
## ./2003.csv: reading...ok! writing...ok!
## ./2004.csv: reading...ok! writing...ok!
## ./2005.csv: reading...ok! writing...ok!
## ./2006.csv: reading...ok! writing...ok!
## ./2007.csv: reading...ok! writing...ok!
## ./2008.csv: reading...ok! writing...ok!
## done!
summarize_h5df("/tmp/airlines.h5")
## Filename: /tmp/airlines.h5
## File size: 1.345 GiB
## Datasets:
## airlines
## Format: hdfio_column
## Dimensions: 123534969 x 29
The process can take a lot of time, since every file has to be scanned first to ensure type safety for the writer. This scan can be disabled by passing yolo=TRUE, but then the process may fail partway through (as it does with this example).
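For reference, the scan-skipping call looks like this (the output file name is just an example; again, with this dataset it fails):

dir2h5(".", h5out="/tmp/airlines_yolo.h5", dataset="airlines", yolo=TRUE)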
If you just want to drop a bunch of different (possibly unrelated) csv files into a single HDF5 file, you can use dir2h5() with combined=FALSE. In this case, each input file becomes its own dataset:
dir2h5(".", h5out="/tmp/airlines_split.h5", combined=FALSE)
summarize_h5df("/tmp/airlines_split.h5")
## Filename: /tmp/airlines_split.h5
## File size: 1.611 GiB
## Datasets:
## 1987
## Format: hdfio_column
## Dimensions: 1311826 x 29
## 1988
## Format: hdfio_column
## Dimensions: 5202096 x 29
## 1989
## Format: hdfio_column
## Dimensions: 5041200 x 29
## 1990
## Format: hdfio_column
## Dimensions: 5270893 x 29
## 1991
## Format: hdfio_column
## Dimensions: 5076925 x 29
## 1992
## Format: hdfio_column
## Dimensions: 5092157 x 29
## 1993
## Format: hdfio_column
## Dimensions: 5070501 x 29
## 1994
## Format: hdfio_column
## Dimensions: 5180048 x 29
## 1995
## Format: hdfio_column
## Dimensions: 5327435 x 29
## 1996
## Format: hdfio_column
## Dimensions: 5351983 x 29
## 1997
## Format: hdfio_column
## Dimensions: 5411843 x 29
## 1998
## Format: hdfio_column
## Dimensions: 5384721 x 29
## 1999
## Format: hdfio_column
## Dimensions: 5527884 x 29
## 2000
## Format: hdfio_column
## Dimensions: 5683047 x 29
## 2001
## Format: hdfio_column
## Dimensions: 5967780 x 29
## 2002
## Format: hdfio_column
## Dimensions: 5271359 x 29
## 2003
## Format: hdfio_column
## Dimensions: 6488540 x 29
## 2004
## Format: hdfio_column
## Dimensions: 7129270 x 29
## 2005
## Format: hdfio_column
## Dimensions: 7140596 x 29
## 2006
## Format: hdfio_column
## Dimensions: 7141922 x 29
## 2007
## Format: hdfio_column
## Dimensions: 7453215 x 29
## 2008
## Format: hdfio_column
## Dimensions: 7009728 x 29
Notice the effectiveness of the default compression level (compression=4) here:
library(memuse)
Sys.dirsize(".")
## 11.203 GiB
Sys.filesize("/tmp/airlines.h5")
## 1.345 GiB
Sys.filesize("/tmp/airlines_split.h5")
## 1.611 GiB
So the uncompressed csv files total just over 11 GiB, while the compressed HDF5 files are roughly 1.5 GiB each. The compression makes creating the file and reading from it somewhat slower, but the performance/size tradeoff is quite good. The highest compression level is 9; it is slower to read and significantly slower to write, while only shaving a few hundred MiB off the compression=4 variant, so we won't bother showing it here. We can compare the compressed version to an uncompressed one, but we have to rewrite the file:
dir2h5(".", dataset="airlines", h5out="/tmp/airlines_nocompression.h5", compression=0)
Sys.filesize("/tmp/airlines_nocompression.h5")
## 10.647 GiB
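If you want to quantify the read-speed side of the tradeoff on your own machine, timing the same slice of each file is a reasonable check (results omitted, since they are machine-dependent):

system.time(read_h5df("/tmp/airlines.h5", dataset="airlines", rows=1:1000000, cols=1:8))
system.time(read_h5df("/tmp/airlines_nocompression.h5", dataset="airlines", rows=1:1000000, cols=1:8))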
It's worth pointing out that this machine doesn't have nearly that much RAM (so the data really is being processed in chunks):
Sys.meminfo()
## Totalram: 7.697 GiB
## Freeram: 5.519 GiB
TODO