README.Rmd

---
output: github_document
---

[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/cdata)](https://cran.r-project.org/package=cdata)
[![status](https://tinyverse.netlify.com/badge/cdata)](https://CRAN.R-project.org/package=cdata)

<!-- README.md is generated from README.Rmd. Please edit that file -->


```{r, echo = FALSE, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = " # ",
  fig.path = "tools/README-"
)
```


[`cdata`](https://CRAN.R-project.org/package=cdata) is a general data re-shaper that has the great virtue of adhering to Raymond's "Rule of Representation", and using Codd's "Guaranteed Access Rule".

> Fold knowledge into data, so program logic can be stupid and robust.
>
> [*The Art of Unix Programming*, Erick S. Raymond, Addison-Wesley, 2003](http://www.catb.org/esr/writings/taoup/html/ch01s06.html#id2878263)

> Rule 2: The guaranteed access rule.
>
> Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.
>
> [Edgar F. Codd](https://en.wikipedia.org/wiki/Codd%27s_12_rules)

The point being: it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.  `cdata` also has a `Python` implementation that it can inter-operate with in the [`data_algebra` package](https://github.com/WinVector/data_algebra) (example [here](https://github.com/WinVector/data_algebra/blob/master/Examples/cdata/cdata.ipynb)).


![](https://raw.githubusercontent.com/WinVector/cdata/master/tools/cdata.png)

Briefly: `cdata` supplies data transform operators that:

 * Work on local data or with any `DBI` data source.
 * Are powerful generalizations of the operations commonly called `pivot` and `un-pivot`.
 * Allow for example-driven graphical specification of data transforms or data layout control.
 * Work in-memory or with `SQL` databases.

A quick example: plot iris petal and sepal dimensions in a faceted graph.


```{r ex0}
iris <- data.frame(iris)
iris$iris_id <- seq_len(nrow(iris))

# show the data
head(iris)

library("ggplot2")
library("cdata")

#
# build a control table with a "key column" flower_part
# and "value columns" Length and Width
#
controlTable <- wrapr::qchar_frame(
  "flower_part", "Length"     , "Width"     |
    "Petal"    , Petal.Length , Petal.Width |
    "Sepal"    , Sepal.Length , Sepal.Width )

transform <- rowrecs_to_blocks_spec(
  controlTable,
  recordKeys = c("iris_id", "Species"))

# do the unpivot to convert the row records to block records
iris_aug <- iris %.>% transform

# show the tranformed data
head(iris_aug)

# plot the graph
ggplot(iris_aug, aes(x=Length, y=Width)) +
  geom_point(aes(color=Species, shape=Species)) + 
  facet_wrap(~flower_part, labeller = label_both, scale = "free") +
  ggtitle("Iris dimensions") +  scale_color_brewer(palette = "Dark2")

# show the transform
print(transform)

# show the representation of the transform
unclass(transform)
```

More details on the above example can be found [here](https://win-vector.com/2018/10/21/faceted-graphs-with-cdata-and-ggplot2/). A tutorial on how to design a `controlTable` can be found [here](https://winvector.github.io/cdata/articles/design.html).  And some discussion of the nature of records in `cdata` can be found [here](https://winvector.github.io/cdata/articles/blocksrecs.html).

----

A more detailed video tutorial is available [here](https://github.com/WinVector/cdata/blob/master/Examples/OrderedGrouping/OrderedGrouping.md).

----

We can also exhibit a larger example of using `cdata` to create a scatter-plot matrix, or pair plot:

```{r ex0_1}

iris <- data.frame(iris)
iris$iris_id <- seq_len(nrow(iris))

library("ggplot2")
library("cdata")

# declare our columns of interest
meas_vars <- qc(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
category_variable <- "Species"

# build a control with all pairs of variables as value columns
# and pair_key as the key column
controlTable <- data.frame(expand.grid(meas_vars, meas_vars, 
                                       stringsAsFactors = FALSE))
# one copy of columns is coordinate names second copy is values
controlTable <- cbind(controlTable, controlTable)
# name the value columns value1 and value2
colnames(controlTable) <- qc(v1, v2, value1, value2)
transform <- rowrecs_to_blocks_spec(
  controlTable,
  recordKeys = c("iris_id", "Species"),
  controlTableKeys = qc(v1, v2),
  checkKeys = FALSE)

# do the unpivot to convert the row records to multiple block records
iris_aug <- iris %.>% transform
# alternate notation: layout_by(transform, iris)


ggplot(iris_aug, aes(x=value1, y=value2)) +
  geom_point(aes_string(color=category_variable, shape=category_variable)) + 
  facet_grid(v2~v1, labeller = label_both, scale = "free") +
  ggtitle("Iris dimensions") +
  scale_color_brewer(palette = "Dark2") +
  ylab(NULL) + 
  xlab(NULL)

# show transform
print(transform)
```


The above is now wrapped into a [one-line command in `WVPlots`](https://winvector.github.io/WVPlots/reference/PairPlot.html).


----

The `cdata` package develops the idea of the ["coordinatized data" theory](https://winvector.github.io/FluidData/RowsAndColumns.html) and includes an implementation of the ["fluid data" methodology](https://winvector.github.io/FluidData/FluidData.html).   

The main `cdata` interfaces are given by the following set of methods:

  * [`rowrecs_to_blocks_spec()`](https://winvector.github.io/cdata/reference/rowrecs_to_blocks_spec.html), for specifying how single row records map to general multi-row (or block) records.
  * [`blocks_to_rowrecs_spec()`](https://winvector.github.io/cdata/reference/blocks_to_rowrecs_spec.html), for specifying how multi-row block records map to single-row records.
  * [`layout_specification()`](https://winvector.github.io/cdata/reference/layout_specification.html), for specifying transforms from multi-row records to other multi-row records.
  * [`layout_by()`](https://winvector.github.io/cdata/reference/layout_by.html) or the [wrapr dot arrow pipe](https://winvector.github.io/wrapr/reference/dot_arrow.html) for applying a layout to re-arrange data.
  * `t()` (transpose/adjoint) to invert or reverse layout specifications.

Some convenience functions include:

  * [`pivot_to_rowrecs()`](https://winvector.github.io/cdata/reference/pivot_to_rowrecs.html), for moving data from multi-row block records with one value per row (a single column of values) to single-row records `spread` or `dcast`.
  * [`pivot_to_blocks()`/`unpivot_to_blocks()`](https://winvector.github.io/cdata/reference/unpivot_to_blocks.html), for moving data from single-row records to possibly multi row block records with one row per value (a single column of values) `gather` or `melt`.
  * [`wrapr::qchar_frame()`](https://winvector.github.io/wrapr/reference/qchar_frame.html) a helper function for specifying record control table layout specifications.
 * [`wrapr::build_frame()`](https://winvector.github.io/wrapr/reference/build_frame.html) a helper function for specifying data frames.
  
The package vignettes can be found in the "Articles" tab of [the `cdata` documentation site](https://winvector.github.io/cdata/).

The (older) recommended tutorial is: [Fluid data reshaping with cdata](https://winvector.github.io/FluidData/FluidDataReshapingWithCdata.html). We also have a (older) [short free cdata screencast](https://youtu.be/4cYbP3kbc0k) (and another example can be found [here](https://winvector.github.io/FluidData/DataWranglingAtScale.html)).  These concepts were later adapted from `cdata` by the `tidyr` package.

----

Install via CRAN:

```{r, eval=FALSE}
install.packages("cdata")
```

----


Note: `cdata` is targeted at data with "tame column names" (column names that are valid both in databases, and as `R` unquoted variable names) and basic types (column values that are simple `R` types such as `character`, `numeric`, `logical`, and so on).