Skip to content
Kirill Müller edited this page Jan 10, 2016 · 33 revisions

Problem

Getting data in and out of R is a minor but very important part of a statistician’s or data scientist’s work. Sometimes, the data are packaged as R package (nycflights13: Wickham (2014); Lahman: Friendly (2015)); however, in the majority of cases one has to deal with third-party data sources. Using a database for storage and retrieval is often the only feasible option with today’s ever-growing data. At present, R offers access to a variety of database engines (DBMS), both open-source like SQLite (RSQLite: Wickham, James, and Falcon (2014)), MySQL (RSQLite: Wickham, James, and Falcon (2014)), PostgreSQL (RPostgreSQL: Conway et al. (2013); RPostgres: Wickham (2015)), or MonetDB (MonetDB.R: Muehleisen and Damico (2015)), and proprietary like Oracle (ROracle: Mukhin, James, and Luciani (2015)) or Microsoft SQL Server (RSQLServer: Costigan and Project (2014)). All of the examples above are so-called DBI backends.

DBI is R’s native DataBase Interface: a set of virtual functions declared in the DBI package (DBI: Databases (2014)). A common interface is helpful for both users and backend implementers. Users need to memoize only one set of functions; the more people are familiar with an interface, the easier it is to receive support for it. Backend developers can focus on functionality instead of design decisions, and have a large base of potential users right from the start. Below is an illustration of the (hypothetic) nightmare of no common DBMS interface:

Figure 1: No common interface

ODBC and JDBC are two such interfaces for C and Java, respectively; the packages RODBC (Ripley and Lapsley 2015) and RJDBC (Urbanek 2016) wrap these interfaces again to offer access from R. In contrast, DBI backends communicate directly with the DBMS using its API, requiring one communication and transformation layer less (see figure below).

Figure 2: DBI interface

The contract for DBI’s virtual functions is defined only in their R documentation, this definition is often imprecise. The package DBItest (Müller 2015) is a work in progress, kindly supported by RStudio Inc., that attempts to consolidate the contract by defining a set of test cases that check DBI conformance, as suggested by Ż’ołtak (2015). Among other topics, the tests cover:

  • Behavior and return values of the DBI functions
  • Data type transformation and preservation
  • Identifier and string quoting rules
  • Parametrized queries
  • Error conditions

Using this package for currently existing backends to SQLite, MySQL and PostgreSQL helped identify about 40 different inconsistencies and software defects in all of the areas mentioned above. Consequently, it is more than likely that the behavior of most other DBI backends will also be inconsistent or even faulty, especially in corner cases.

Inconsistencies between implementations complicate the creation of database-agnostic code, which is of paramount importance both for generic data access packages such as dplyr (Wickham and Francois 2015), biglm (Lumley 2013), or TSdbi (Gilbert 2015), as well as for the casual user. Using DBI with different database backends should cause minimal if any friction and the least possible surprise, and above all produce correct results.

Perfect consistency will be difficult to achieve, e.g., the level of support for the boolean or timestamp data type varies between DBMS. DBI should be defined so that it allows implementing backends to as many DBMS as possible, with as few inconsistencies as possible, offering good support for R’s basic data types. Recently, the DBI package has seen a few modifications towards a more simplistic and robust design, which is helpful in that process.

A clean and testable specification of DBI, and its implementation in existing DBI backends, will improve the quality and reliability of database access from R, for both end users and package authors. The updated definition will be designed based on the experience with the current DBI implementation, sacrificing rarely used features for the sake of improved consistency. DBI compatibility can be tested in a fully automated fashion with minimal code; backend maintainers will be encouraged to do so in order to maintain a high standard across DBMS. Should the need arise to modify the definition in the future, enhancing and adapting the existing centralized tests will allow DBI backend maintainers to update their backends in a test-driven fashion. New DBI backends do not need to be implemented from scratch anymore: A fully DBI-compatible boilerplate package (RKazam) will be created, with placeholders to be filled in by the backend implementer. Ultimately, consistent and stable access to a variety of data sources will increase productivity whenever database access is required.

Plan

The work can be conceptually divided into two parts:

  1. Creation of the DBI specification, the corresponding centralized test, and the boilerplate for DBI backends.
  2. Improvement of existing DBI backends to adhere to this specification to the widest extent possible, clearly documenting the deviation from the specification where it is inevitable. The focus will be on backends for three of the most popular (solid IT 2016) open-source database engines: RMySQL, RPostgres, and RSQLite.

The two parts influence each other, therefore the timeline below can only be a rough outline. Each week in the timeline corresponds to 20-24 hours worked, which allows for some flexibility when implementing the project.

Note that, for the PostgreSQL DBMS, two DBI backends are currently available: RPostgreSQL (Conway et al. 2013), on CRAN, unchanged since early 2013, and RPostgres (Wickham 2015), a reimplementation that uses Rcpp (Eddelbuettel and François 2011; Eddelbuettel 2013), actively developed. This proposal focuses on improving RPostgres, because it seems to be easier to achieve full DBI compatibility with the more recent code.

Week 1-2: Setup

  • Publish first blog post
  • Contact maintainers of reverse dependencies
  • Discuss integration of RPostgres and RPostgreSQL
  • Discuss important design issues on R-sig-DB and GitHub
    • Date, time and timestamp variables
    • Parametrized queries
    • Schemas
    • Subdivision of DBI into smaller packages to separate SQL from non-SQL DBMS
    • Server-side execution of R code

Week 3-5: Baseline

  • Review and merge pending pull requests to DBI and the existing backends
  • Resolve known technical issues with the existing DBI backends
    • Connectivity
    • Character encoding
    • Build configuration
  • Implement automated tests on Linux, Windows, and OS X
    • Preferably, the automated tests should use the new r-hub infrastructure sponsored by the R consortium. Database support (MySQL and Postgres) does not seem to be part of the r-hub proposal, but can be contributed in the scope of this project. Gábor Csardi, who is implementing r-hub, has signaled interest in including database support.

Week 6-9: DBI, DBItest, and RKazam

  • Finalize the design of the DBI in underspecified areas
    • Fix open GitHub issues in DBI
  • Write draft of DBI specification document
  • Implement DBI specification in the DBItest package
    • Fix open GitHub issues in DBItest
    • Specification = documentation of tester functions
  • Check of reverse CRAN dependencies
  • Work closely with the maintainers of other existing backends towards full DBI consistency, planning CRAN release close to the end of the project
  • Implement the RKazam boilerplate package

Week 10-13: MySQL

  • Fix open GitHub issues in RMySQL
  • Ensure maximal consistency with DBI by satisfying as many tests as possible from DBItest
  • Check reverse CRAN dependencies
  • Communicate RMySQL as role model for other backends

Week 14-17: PostgreSQL

  • Fix open GitHub issues in RPostgres
  • Ensure maximal consistency with DBI by satisfying as many tests as possible from DBItest
  • Work on integrating RPostgres and RPostgreSQL if required

Week 18-21: SQLite

  • Fix open GitHub issues in RSQLite
  • Ensure maximal consistency with DBI by satisfying as many tests as possible from DBItest
  • Check reverse CRAN dependencies

Week 22-24: Dissemination

  • Stabilize
  • Release to CRAN, jointly with the maintainers of other existing backends
  • Finalize DBI specification document
  • Publish second blog post
  • Reserve time for unexpected problems with any of the DBI backends, DBMS, or continuous integration systems involved

Failure modes and recovery

The success of the project also depends on the cooperation of maintainers of dependent packages, and original reporters of GitHub issues. They will be involved at a very early stage of the project, in the hope that a consensus can be reached on critical design and scheduling issues (for dependent packages), and that all necessary information is available when needed (for GitHub issues). In the unfortunate case of weak cooperation, it might become necessary to break compatibility with dependent packages, or even postpone the release and resolve problems with dependent packages in a follow-up project. If a GitHub issue cannot be replicated without further input, and this input is not provided, the issue will be considered resolved after a grace period.

Should it turn out to be very difficult to add database support to r-hub when it is needed, there is always the option to fall back to Travis-CI and Appveyor, which already offer support for MySQL and Postgres out of the box.

The effects of potential hardware failure are alleviated by the reservation of funds and/or warranty, to allow for quick replacement of broken equipment. Data recovery is guaranteed by frequent pushes to GitHub and backup to external storage.

In the unlikely case of software defects in upstream dependencies (mostly DBMS client libraries), I intend to work with upstream maintainers on resolving the issues, and implement workarounds or document deficiencies for stable versions.

Help requested

I’d like to ask for financial support for the implementation of the project and for attending one conference to present the results. Only open-source operating systems and DBMS will be required locally, licensing fees are not an issue.

Based on the timeline and the feature description, I’d like to ask for USD 50’000 to cover the implementation costs. It would be great if 20% could be paid up front, 40% after week 13 (implementation of RMySQL backend), and the rest upon finalization and approval of the project. I will provide the equipment, and reserve a part of the up-front payment to cover potential costs for replacement/repair. It would be great to present the work at an R conference (e.g., useR! 2017), I’d also like to ask for reimbursement of the conference fee, accomodation, and transportation (approx. USD 1’500).

Proprietary DBMS such as Microsoft SQL Server, Oracle, or Google BigQuery, usually require licensing fees, in contrast to open-source DBMS. In these cases, the R Consortium could support DBI backend maintainers by covering the license and/or hosting fees. The implementation of a centralized service that provides on-demand DBMS instances for testing purposes may be the aim of a small follow-up project.

Beyond this project, I’d be happy to work with other database vendors who wish to implement a DBI backend, or establish full DBI consistency or otherwise improve an existing backend. I would be grateful for any references provided by members of the R Consortium.

Dissemination

The work will be continuously uploaded to GitHub, tested on r-hub (subject to availability, with fallback to Travis-CI and AppVeyor), and released to CRAN at the end of the project. GitHub seems to be very popular in the R community, its pull request mechanism allows contribution and code review in a very simple and convenient way. Maintainers of dependent packages (other DBI backends or packages that use DBI) will be contacted and involved in a very early stage of the project. Furthermore, important design discussions will be discussed on the R-sig-DB mailing list and on GitHub.

The existing licensing model will be maintained (LGPL for DBI and DBItest, GPL for the DBI backends).

The most important artifact will be the written DBI specification, derived from the centralized DBI testing code, in the form of an R package vignette and submitted to the “R journal” or a similar venue. If funding allows, I’ll be happy to present the work at the useR! (or a similar) conference. Two blog posts for the R consortium blog will complement this.

Acknowledgements

I would like to thank Henrik Bengtsson, Marcel Boldt, Jason Bryer, Imanuel Costigan, Gábor Csardi, Dirk Eddelbuettel, Paul Gilbert, Hannes Mühleisen, and Hadley Wickham, for their constructive feedback to an earlier draft of this proposal.

About the author

I have a background in computer science, with more than 20 years of programming experience (mostly C++). I am currently working as a scientific assistant at the IVT, ETH Zurich, and as freelance IT consultant. I am using R for data processing and statistical modeling since 2012. During this time I contributed numerous patches to various R packages, co-maintained the tikzDevice and ProjectTemplate packages, and implemented a few of my own, most notably the DBItest package for testing DBI backends. My r-appveyor script helps automated testing of R packages on Windows. From previous projects, I have gained experience with PostgreSQL and Microsoft SQL Server.

References

Conway, Joe, Dirk Eddelbuettel, Tomoaki Nishiyama, Sameer Kumar Prayaga, and Neil Tiffin. 2013. RPostgreSQL: R Interface to the PostgreSQL Database System. https://CRAN.R-project.org/package=RPostgreSQL.

Costigan, Imanuel, and The jTDS Project. 2014. RSQLServer: SQL Server R Database Interface (DBI). https://CRAN.R-project.org/package=RSQLServer.

Databases, R Special Interest Group on. 2014. DBI: R Database Interface. https://CRAN.R-project.org/package=DBI.

Eddelbuettel, Dirk. 2013. Seamless R and C++ Integration with Rcpp. New York: Springer.

Eddelbuettel, Dirk, and Romain François. 2011. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software 40 (8): 1–18. http://www.jstatsoft.org/v40/i08/.

Friendly, Michael. 2015. Lahman: Sean Lahman’s Baseball Database. https://CRAN.R-project.org/package=Lahman.

Gilbert, Paul. 2015. TSdbi: Time Series Database Interface. https://CRAN.R-project.org/package=TSdbi.

Lumley, Thomas. 2013. Biglm: Bounded Memory Linear and Generalized Linear Models. https://CRAN.R-project.org/package=biglm.

Muehleisen, Hannes, and Anthony Damico. 2015. MonetDB.R: Connect MonetDB to R. https://CRAN.R-project.org/package=MonetDB.R.

Mukhin, Denis, David A. James, and Jake Luciani. 2015. ROracle: OCI Based Oracle Database Interface for R. https://CRAN.R-project.org/package=ROracle.

Müller, Kirill. 2015. DBItest: Testing “DBI” Back Ends.

Ripley, Brian, and Michael Lapsley. 2015. RODBC: ODBC Database Access. https://CRAN.R-project.org/package=RODBC.

solid IT. 2016. “DB-Engines Ranking.” http://db-engines.com/en/ranking.

Urbanek, Simon. 2016. RJDBC: Provides Access to Databases Through the JDBC Interface. http://www.rforge.net/RJDBC/.

Wickham, Hadley. 2014. Nycflights13: Data About Flights Departing NYC in 2013. https://CRAN.R-project.org/package=nycflights13.

———. 2015. RPostgres: Experimental Rcpp Interface to PostgreSQL.

Wickham, Hadley, and Romain Francois. 2015. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, David A. James, and Seth Falcon. 2014. RSQLite: SQLite Interface for R. https://CRAN.R-project.org/package=RSQLite.

Ż’ołtak, Mateusz. 2015. “R Database Connectivity: What Did We Leave Behind?” Aalborg, Denmark. http://user2015.math.aau.dk/contributed_talks#54.