dataobservatory-eu/dataset

The dataset R Package

The dataset package extends the R statistical environment so that the most important R object holding a dataset (a data.frame, or an inherited tibble, tsibble or data.table) carries the metadata needed to reuse and validate its contents. We aim to offer a novel solution for individuals or small groups of data scientists working in business, academic or policy research roles who cannot count on the support of librarians, knowledge engineers, and extensive documentation processes.

The dataset package extends the concept of tidy data and adds further, standardized semantic information to the user’s dataset to increase the (re-)use value of the data object.

  • More descriptive information about the dataset as a creation: its authors, contributors, reuse rights and other metadata that make it easier to find and use.
  • More standardized and linked metadata, such as standard variable definitions and code lists, so that the data owner can gather far more information from third parties and third parties can understand and use the data correctly.
  • More information about the data provenance, which makes quality assessment easier and reduces the need for time-consuming and unnecessary re-processing steps.
  • More structural information about the data, which makes it easier to reuse and to join with new information, and less prone to logical errors.

Further development plans for the peer review are being added until 5 November 2024 here: New Requirement setting.

The current version of the dataset package is in an early, experimental stage. You can follow the discussion of this package on rOpenSci.

library(dataset)
iris_ds <- dataset(
  x = iris,
  title = "Iris Dataset",
  author = person("Edgar", "Anderson", role = "aut"),
  publisher = "American Iris Society",
  source = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
  date = 1935,
  language = "en",
  description = "This famous (Fisher's or Anderson's) iris data set."
)

A title and an author are mandatory for every dataset; if the date is not specified, the current date is added.

Because the dataset has only just been created and is not yet published, the identifier receives the default :tba value, the version defaults to 0.1.0, and the publisher field defaults to :unas (unassigned).
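
The defaulting rules above can be sketched in base R. This is a minimal illustration, not the package's implementation: the constructor new_dataset_sketch() and the dataset_meta attribute are hypothetical names, and filling the date with Sys.Date() is an assumption about how "the current date" is obtained.

```r
# Hypothetical constructor; it only mimics the defaults described in this README.
new_dataset_sketch <- function(x, title, author,
                               date = Sys.Date(),      # assumption: current date
                               identifier = ":tba",    # to be announced
                               version = "0.1.0",
                               publisher = ":unas") {  # unassigned
  # Title and author are mandatory.
  if (missing(title) || missing(author)) {
    stop("Both `title` and `author` are required.")
  }
  # Keep the metadata in a single attribute (hypothetical name).
  attr(x, "dataset_meta") <- list(
    title = title, author = author, date = date,
    identifier = identifier, version = version, publisher = publisher
  )
  x
}

ds <- new_dataset_sketch(
  head(iris),
  title = "Iris Dataset",
  author = utils::person("Edgar", "Anderson", role = "aut")
)
meta <- attr(ds, "dataset_meta")
meta$identifier
meta$publisher
```

Calling new_dataset_sketch(head(iris)) without a title or an author stops with an error, which mirrors the mandatory fields.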

The dataset behaves as expected, with all data.frame methods applicable. If the dataset was originally a tibble or data.table object, it retains all methods of these S3 classes, because the dataset class only extends the original object through its attributes.

summary(iris_ds)
#> Anderson E (2024). "Iris Dataset."
#> Further metadata: describe(iris_ds)
#>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
#>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
#>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
#>  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
#>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
#>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
#>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
#>        Species  
#>  setosa    :50  
#>  versicolor:50  
#>  virginica :50  
#>                 
#>                 
#> 
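
One way to picture this behaviour is a plain S3 subclass. The sketch below illustrates the general technique under stated assumptions and is not the dataset package's code: the class dataset_sketch and its title attribute are hypothetical names.

```r
# Hypothetical subclass: a data.frame with one extra attribute and one extra class.
df <- structure(
  head(iris),
  title = "Iris Dataset",
  class = c("dataset_sketch", "data.frame")
)

# A summary method that prints the title, then defers to the data.frame method.
summary.dataset_sketch <- function(object, ...) {
  cat(attr(object, "title"), "\n")
  NextMethod()
}

is.data.frame(df)  # the object still inherits from data.frame
nrow(df)           # data.frame methods still apply
s <- summary(df)   # title line first, then the usual summary table
print(s)
```

Because the extra class is placed in front of data.frame, any generic without a dataset_sketch method falls through to the data.frame method unchanged.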

A brief description of the extended metadata attributes:

describe(iris_ds)
#> Iris Dataset 
#> Dataset with 150 observations (rows) and 5 variables (columns).
#> Description: This famous (Fisher's or Anderson's) iris data set.
#> Creator: Edgar Anderson [aut]
#> Publisher: American Iris Society
paste0("Publisher:", publisher(iris_ds))
#> [1] "Publisher:American Iris Society"
paste0("Rights:", rights(iris_ds))
#> [1] "Rights::unas"

The descriptive metadata are stored in a utils::bibentry object, which has many printing options (see ?bibentry).

mybibentry <- dataset_bibentry(iris_ds)
print(mybibentry, "text")
#> Anderson E (2024). "Iris Dataset."
print(mybibentry, "Bibtex")
#> @Misc{,
#>   title = {Iris Dataset},
#>   author = {Edgar Anderson},
#>   publisher = {American Iris Society},
#>   year = {2024},
#>   resourcetype = {Dataset},
#>   identifier = {:tba},
#>   version = {0.1.0},
#>   description = {This famous (Fisher's or Anderson's) iris data set.},
#>   language = {en},
#>   format = {application/r-rds},
#>   rights = {:unas},
#> }
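
For readers unfamiliar with utils::bibentry, a comparable entry can be built directly in base R. This sketch does not use the dataset package; it only assumes that a "Misc" entry with the same title, author, year and publisher is a fair stand-in for what dataset_bibentry() returns.

```r
# Build a bibliographic entry by hand; "Misc" has no required fields,
# and additional fields such as publisher are passed through as-is.
entry <- utils::bibentry(
  bibtype = "Misc",
  title = "Iris Dataset",
  author = utils::person("Edgar", "Anderson", role = "aut"),
  year = "2024",
  publisher = "American Iris Society"
)

print(entry, style = "text")    # human-readable citation
print(entry, style = "Bibtex")  # BibTeX record

# Individual fields can be read back with `$`.
entry$title

# toBibtex() returns the BibTeX lines as a character vector.
bibtex_lines <- utils::toBibtex(entry)
```
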
rights(iris_ds) <- "CC0"
rights(iris_ds)
#> [1] "CC0"
rights(iris_ds, overwrite = FALSE) <- "GNU-2"
#> The dataset has already a rights field: CC0

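Some important metadata fields are protected from accidental overwriting; only the default placeholder values :unas (unassigned) and :tba (to be announced) can be replaced freely. To replace a protected value deliberately, set overwrite = TRUE: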

rights(iris_ds, overwrite = TRUE)  <- "GNU-2"
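
The overwrite protection can be approximated with a base-R replacement function. This is a sketch under stated assumptions, not the package's implementation: rights_sketch<- is a hypothetical name, the value is kept in a plain rights attribute, and the default of overwrite = FALSE is an assumption.

```r
# Hypothetical replacement function mimicking the protection described above.
`rights_sketch<-` <- function(x, overwrite = FALSE, value) {
  current <- attr(x, "rights")
  # Missing values and the placeholders :unas / :tba may always be replaced.
  is_placeholder <- is.null(current) || current %in% c(":unas", ":tba")
  if (is_placeholder || overwrite) {
    attr(x, "rights") <- value
  } else {
    message("The dataset already has a rights field: ", current)
  }
  x
}

df <- head(iris)
rights_sketch(df) <- "CC0"                       # no value yet, so it is set
first_value <- attr(df, "rights")
rights_sketch(df, overwrite = FALSE) <- "GNU-2"  # refused, a message is shown
protected_value <- attr(df, "rights")
rights_sketch(df, overwrite = TRUE) <- "GNU-2"   # deliberate replacement
final_value <- attr(df, "rights")
```

Returning x in every branch matters: a replacement function must always return the (possibly unchanged) object, otherwise the assignment would overwrite df with something else.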

Code of Conduct

Please note that the dataset package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Furthermore, the rOpenSci Community Contributing Guide, a guide to help people find ways to contribute to rOpenSci, also applies, because dataset is under software review for potential inclusion in rOpenSci.