File I/O Abstraction Design Thoughts

glitch edited this page Jul 15, 2021 · 2 revisions

Thoughts on getting to a better model of file I/O abstraction in Arkouda. NOTE: this is in the very early stages of investigating and creating a design.

Issue

The current design focuses mainly on HDF5, since it is the only format currently supported for data I/O from disk. (Technically you can read on the client side and push to the server to construct pdarray objects, but this is not really a scalable approach.)

As the Arkouda community looks to add new formats such as Apache Parquet, we would like to generalize the API for reading/writing files, if possible, in both the internal code and the client's interactive API.

Initial constraints

1D arrays

Currently pdarray only supports one-dimensional arrays. While HDF5 supports multi-dimensional datasets, we limit our support to groups of one-dimensional arrays. This maps pretty nicely to the concept of columns, which makes it a bit easier to add support for columnar stores such as Parquet.

C API compatibility

With Chapel we really only have compatibility with the C APIs of the various formats. I believe it is possible to link in C++ libraries, but (anecdotally) this gets tricky, so in effect we are currently limited to C wrappers and straight C implementations.

Separation of Concerns

In thinking through a possible API and abstraction, one idea is to separate the concerns along the natural lines of the client-server relationship. Users may or may not care which underlying format persists the data. A subtle point here is that they should be allowed to care, but not be required to care; i.e., they should be able to call save/read(pdarray, filename) and not worry about the underlying format, but at the same time we should allow folks to choose with save/read(pdarray, filename, format="hdf5"). A good example of this is the variety of image formats supported by modern operating systems (jpeg/png/tiff/etc.): a user can double-click an image and open it without knowing its format.
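To make the "allowed to care, but not required to care" idea concrete, here is a minimal Python sketch of how an optional format argument might be resolved on the client. Everything in it is an assumption for illustration: the names resolve_format, DEFAULT_FORMAT, and SUPPORTED_FORMATS are hypothetical, as is the shape of the request dictionary, and none of it is existing Arkouda API.

```python
from typing import Optional

DEFAULT_FORMAT = "hdf5"                  # assumption: HDF5 stays the default
SUPPORTED_FORMATS = {"hdf5", "parquet"}  # assumption: illustrative list only


def resolve_format(format: Optional[str] = None) -> str:
    """Return the caller's format if one was given, else the default."""
    if format is None:
        return DEFAULT_FORMAT
    name = format.lower()
    if name not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {format!r}")
    return name


def save(data, filename: str, format: Optional[str] = None) -> dict:
    """Build the request a client might send to the server (hypothetical shape)."""
    return {"cmd": "save", "filename": filename, "format": resolve_format(format)}
```

With this shape, save(a, "out") and save(a, "out", format="parquet") go through the same code path, and the format only shows up for users who ask for it.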

Client

The client will likely need to deal with object abstraction. We support pdarray and a number of composite types, such as:

  • pdarray
  • Strings
  • Categorical
  • GroupBy

Server

On the server everything is essentially treated as a pdarray or SymEntry with SegmentedArray support (which is basically two pdarray objects). Here we are mainly concerned with file type abstraction when it comes to I/O.

  • HDF5
  • Parquet?
  • Feather/Arrow?
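Since a segmented array is basically two pdarray objects, one way to keep the format layer simple is to flatten every client object into named 1D columns before it reaches a format-specific writer. The sketch below uses plain Python lists in place of pdarray objects. The function name flatten_strings, the column naming scheme, and the offsets-plus-bytes layout are all assumptions for illustration, not a description of Arkouda's actual internals.

```python
def flatten_strings(name, strings):
    """Split a list of str into two 1D columns: start offsets and UTF-8 byte values.

    Hypothetical layout: offsets[i] is the index in `values` where string i begins.
    """
    offsets = []
    values = bytearray()
    for s in strings:
        offsets.append(len(values))       # where this string starts
        values.extend(s.encode("utf-8"))  # append its bytes to the flat buffer
    return {f"{name}/offsets": offsets, f"{name}/values": list(values)}
```

A writer for any format would then only ever see a dictionary of 1D columns, which fits the 1D constraint above.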

What are similar communities doing?

I took a look at other projects and languages such as Pandas, Java, Go, etc. I believe programming languages have a slightly different perspective than applications/libraries, in the sense that they have to deal with the filesystem itself. Java's nio package is a good example of this, and I don't believe Arkouda needs to worry about exposing functionality to interact with the filesystem beyond telling the server a location to read from / save to.

Looking at the Pandas io module, they don't really have much of a design abstraction either, probably because they don't have a client/server type of architecture. They effectively have ways to read/write csv and sql, and support transformation into/out of other types of formats; but at the end of the day their functions are named something like read_csv and read_sql.

Design Patterns

One idea I've been thinking through is using an interface-type / API approach where we define method names with optional arguments for the file type. For read operations we already have the ability to detect the format based on its magic number, so reading already has a bit of abstraction present, but it is not wired in. As far as writing goes, I believe we could use a simple factory pattern to generate a concrete implementation to handle writing.

i.e.

interface Writer(file_path, prefix, dict_info, optional_type=default_type)

factory(..., type t) {
  switch (t) {
    case "HDF5":
      return HDF5Writer(args...);
    case "Parquet":
      return ParquetWriter(args...);
    case ...:
      // etc.
  }
}
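The same factory idea can be sketched in runnable form. This is a Python illustration only; the real server-side version would be written in Chapel against the C APIs. The class names follow the pseudocode, while make_writer, the write method, the registry dictionary, and the placeholder return values are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class Writer(ABC):
    """Common interface that every format-specific writer implements."""

    def __init__(self, file_path: str, prefix: str):
        self.file_path = file_path
        self.prefix = prefix

    @abstractmethod
    def write(self, columns: dict) -> str:
        """Persist a dict of named 1D columns; returns a description here."""


class HDF5Writer(Writer):
    def write(self, columns: dict) -> str:
        # Placeholder: a real version would call into the HDF5 C API.
        return f"hdf5:{self.file_path}:{sorted(columns)}"


class ParquetWriter(Writer):
    def write(self, columns: dict) -> str:
        # Placeholder: a real version would call into a Parquet C wrapper.
        return f"parquet:{self.file_path}:{sorted(columns)}"


# Registry used by the factory; adding a format means adding one entry.
_WRITERS = {"hdf5": HDF5Writer, "parquet": ParquetWriter}


def make_writer(file_path: str, prefix: str, file_type: str = "hdf5") -> Writer:
    """Factory: pick the concrete Writer for the requested file type."""
    try:
        cls = _WRITERS[file_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported file type: {file_type!r}") from None
    return cls(file_path, prefix)
```

Using a registry dictionary instead of a switch statement is a design choice of this sketch: callers depend only on Writer, and a new format does not require touching the factory body.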

A Reader could work similarly, prefaced with file-magic detection to help instantiate the proper type.
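As a rough illustration of the file-magic step, the Python sketch below checks the leading bytes of a file against the HDF5 signature and the Parquet "PAR1" marker. The function names are hypothetical and this is not Arkouda's existing detection code. It also assumes the HDF5 signature sits at offset 0; HDF5 files with a user block place it at a later offset, which this sketch does not handle.

```python
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # 8-byte HDF5 format signature
PARQUET_MAGIC = b"PAR1"            # 4-byte marker at the start of a Parquet file


def detect_format(header: bytes) -> str:
    """Map the leading bytes of a file to a format name."""
    if header.startswith(HDF5_MAGIC):
        return "hdf5"
    if header.startswith(PARQUET_MAGIC):
        return "parquet"
    raise ValueError("unrecognized file format")


def detect_file_format(path: str) -> str:
    """Read just enough of the file to identify it (8 bytes covers both magics)."""
    with open(path, "rb") as f:
        return detect_format(f.read(8))
```

The returned name could then be used to select a concrete reader, in the same way the writer side selects on an explicit type.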