diff --git a/docs/404.html b/docs/404.html
index 3231886..9853c7d 100644
--- a/docs/404.html
+++ b/docs/404.html
@@ -32,7 +32,7 @@
diff --git a/docs/articles/index.html b/docs/articles/index.html
index e85433d..09d16c5 100644
--- a/docs/articles/index.html
+++ b/docs/articles/index.html
@@ -17,7 +17,7 @@
diff --git a/docs/articles/madshapR-vignette.html b/docs/articles/madshapR-vignette.html
index b80e0ea..8a1e80a 100644
--- a/docs/articles/madshapR-vignette.html
+++ b/docs/articles/madshapR-vignette.html
@@ -33,7 +33,7 @@
@@ -91,14 +91,14 @@
-The goal of madshapR is to provide functions to support rigorous
-processes in data cleaning, evaluation, and documentation across
-datasets from different studies based on Maelstrom Research guidelines.
-The package includes the core functions to evaluate and format the main
-inputs that define the process, diagnose errors, and summarize and
-evaluate datasets and their associated data dictionaries. The main
-outputs are clean datasets and associated metadata, and tabular and
-visual summary reports.
+The madshapR package provides functions for efficient data cleaning,
+evaluation, and documentation across different datasets. It was
+developed to support work at Maelstrom Research and includes functions
+to evaluate and summarize datasets and their associated data
+dictionaries, identify potential issues in content and structure, and
+prepare datasets and metadata for further processing. The key outputs
+provided by the functions are formatted datasets, standardized metadata,
+and tabular and visual summary reports.
-# To update the R package in your R environment you may first need to remove
-# it, and use the exit command quit() to terminate the current R session.
-
-# To install the R package:
+# To install madshapR:
install.packages('madshapR')
-library(madshapR)
-#if you need help with the package, please use:
-madshapR_help()
Maelstrom-research group. Copyright holder, funder.
-Alexandre Trottier. Contributor. -
-Tina Wey. Contributor. -
-Samuel El Bouzaïdi Tiali. Contributor. -
-Fabre G (2023). madshapR: Support Technical Processes Following 'Maelstrom Research' Standards.
-R package version 1.0.3.0003, https://github.com/maelstrom-research/madshapR.
+R package version 1.0.3, https://github.com/maelstrom-research/madshapR.
 @Manual{,
   title = {madshapR: Support Technical Processes Following 'Maelstrom Research' Standards},
   author = {Guillaume Fabre},
   year = {2023},
-  note = {R package version 1.0.3.0003},
+  note = {R package version 1.0.3},
   url = {https://github.com/maelstrom-research/madshapR},
 }
diff --git a/docs/index.html b/docs/index.html
index 5400787..c7bd345 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -41,7 +41,7 @@
The goal of madshapR is to provide functions to support rigorous processes in data cleaning, evaluation, and documentation across datasets from different studies based on Maelstrom Research guidelines. The package includes the core functions to evaluate and format the main inputs that define the process, diagnose errors, and summarize and evaluate datasets and their associated data dictionaries. The main outputs are clean datasets and associated metadata, and tabular and visual summary reports.
+The madshapR package provides functions for efficient data cleaning, evaluation, and documentation across different datasets. It was developed to support work at Maelstrom Research and includes functions to evaluate and summarize datasets and their associated data dictionaries, identify potential issues in content and structure, and prepare datasets and metadata for further processing. The key outputs provided by the functions are formatted datasets, standardized metadata, and tabular and visual summary reports.
-# To update the R package in your R environment you may first need to remove it,
-# and use the exit command quit() to terminate the current R session.
-
-# To install the R package:
+# To install madshapR:
install.packages('madshapR')
-library(madshapR)
+library(madshapR)
# If you need help with the package, please use:
-madshapR_help()
Some of the tests were made with another package (Rmonize) which as ‘madshapR’ as a dependence.
+Some of the tests were made with another package (Rmonize) which has “madshapR” as a dependency.
In visual reports, avoid confusing changes in color scheme.
Histograms for date variables display valid ranges.
In reports, % NA is now shown as a proportion.
harmonized_dossier_visualize()
report shows variable labels in the same lang.
dossier_visualize()
report shows variable labels in the same lang.
in visual reports, the bar plot only appears when there are multiple missing value types, otherwise only the pie chart is shown.
in reports, all of the percentages are now included under “Other values (non categorical)”, which gives a single value.
col_id()
function which is a shortcut for calling the attribute madshapR::col_id
of a dataset.
as_category()
,is_category()
,drop_category()
function which coerces a vector as a categorical object. Typically a column in a dataset that needs to be coerced into a categorical variable (The data dictionary is updated accordingly).
check_data_dict_categories()
, check_data_dict_missing_categories()
, check_data_dict_taxonomy()
, check_data_dict_variables()
, check_data_dict_valueType()
, check_dataset_categories()
, check_dataset_valueType()
, check_dataset_variables()
, check_name_standards()
These helper functions evaluate the content of a dataset and/or data dictionary to extract summary statistics and elements such as missing values, NA, category names, etc. This information is stored in a tibble that can be used to summarize inputs.
dataset_preprocess()
, summary_variables()
, summary_variables_categorical()
,summary_variables_date()
, summary_variables_numeric()
,summary_variables_text()
R/experimental.R
+ as_category.Rd
+Converts a vector object to a categorical object, typically a column in a +data frame. The categories come from non-missing values present in the +object and are added to an associated data dictionary (when present).
+as_category(x)
A vector object to be coerced to categorical.
A vector with class haven_labelled.
+{
+
+library(dplyr)
+mtcars <- tibble(mtcars)
+as_category(mtcars[['cyl']])
+
+head(mtcars %>% mutate(cyl = as_category(cyl)))
+
+
+}
+#>
+#> Attaching package: 'dplyr'
+#> The following objects are masked from 'package:stats':
+#>
+#> filter, lag
+#> The following objects are masked from 'package:base':
+#>
+#> intersect, setdiff, setequal, union
+#> # A tibble: 6 × 11
+#> mpg cyl disp hp drat wt qsec vs am gear carb
+#> <dbl> <dbl+lbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
+#> 1 21 6 [6] 160 110 3.9 2.62 16.5 0 1 4 4
+#> 2 21 6 [6] 160 110 3.9 2.88 17.0 0 1 4 4
+#> 3 22.8 4 [4] 108 93 3.85 2.32 18.6 1 1 4 1
+#> 4 21.4 6 [6] 258 110 3.08 3.22 19.4 1 0 3 1
+#> 5 18.7 8 [8] 360 175 3.15 3.44 17.0 0 0 3 2
+#> 6 18.1 6 [6] 225 105 2.76 3.46 20.2 1 0 3 1
+
+
R/02-dictionaries_functions.R
as_data_dict.Rd
Validates the input object as a valid data dictionary and coerces it with
-the appropriate madshapR::class
attribute. This function mainly helps
-validate input within other functions of the package but could be used to
-check if an object is valid for use in a function.
Checks if an object is a valid data dictionary and returns it with the
+appropriate madshapR::class
attribute. This function mainly helps validate
+inputs within other functions of the package but could be used to check if
+an object is valid for use in a function.
A potential valid data dictionary to be coerced.
A potential data dictionary object to be coerced.
A list of data frame(s) with Rmonize::class
'data_dict'.
A list of data frame(s) with madshapR::class
'data_dict'.
R/02-dictionaries_functions.R
as_data_dict_mlstr.Rd
Whether the output data dictionary has a simple -data dictionary structure or not (meaning has a Maelstrom data dictionary -structure, compatible with Maelstrom Research ecosystem, including Opal). -FALSE by default.
Whether the input data dictionary should not be coerced +with specific format restrictions for compatibility with other +Maelstrom Research software. FALSE by default.
A list of data frame(s) with Rmonize::class
'data_dict_mlstr'.
A list of data frame(s) with madshapR::class
'data_dict_mlstr'.
variable
and name
.
The object may be specifically formatted to be compatible with additional Maelstrom Research software, -in particular Opal environments.
+in particular Opal environments.R/03-dataset_functions.R
as_dataset.Rd
Confirms that the input object is a valid dataset and returns it as a dataset
-with the appropriate madshapR::class
attribute. This function mainly helps
-validate inputs within other functions of the package but could be used to
-check if a dataset is valid.
Checks if an object is a valid dataset and returns it with the appropriate
+madshapR::class
attribute. This function mainly helps validate inputs
+within other functions of the package but could be used separately to check
+if a dataset is valid.
A potential dataset to be coerced.
A potential dataset object to be coerced.
A character string specifying the name(s) of the column(s) -which refer to key identifier of the dataset. The column(s) can be named -or indicated by position.
An optional character string specifying the name(s) or +position(s) of the column(s) used as identifiers.
A list of data frame(s) with Rmonize::class
'dataset'.
A data frame with madshapR::class
'dataset'.
A dataset is a data table containing variables. A dataset object is a data frame and can be associated with a data dictionary. If no data dictionary is provided with a dataset, a minimum workable -data dictionary will be generated as needed within relevant functions. An -identifier variable(s) for indexing can be specified by the user. +data dictionary will be generated as needed within relevant functions. +Identifier variable(s) for indexing can be specified by the user. The id values must be non-missing and will be used in functions that require it. If no identifier variable is specified, indexing is handled automatically by the function.
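The dataset rules described above can be illustrated with a short base R sketch. This is not madshapR's implementation: `sketch_as_dataset()` is a hypothetical helper, and while the attribute names `madshapR::class` and `madshapR::col_id` come from this documentation, storing them as plain attributes is an assumption. The sketch only mirrors the documented behaviour: id values must be non-missing, and the id column appears first, as in the `iris` example output.

```r
# Hypothetical helper for illustration only; not part of madshapR.
sketch_as_dataset <- function(x, col_id = NULL) {
  stopifnot(is.data.frame(x))
  if (!is.null(col_id)) {
    stopifnot(all(col_id %in% names(x)))
    # Documented rule: id values must be non-missing.
    if (any(vapply(x[col_id], anyNA, logical(1)))) {
      stop("id values must be non-missing")
    }
    # Place the id column(s) first, as in the iris example output.
    x <- x[c(col_id, setdiff(names(x), col_id))]
    attr(x, "madshapR::col_id") <- col_id
  }
  # Assumption: the class is recorded as a plain attribute.
  attr(x, "madshapR::class") <- "dataset"
  x
}

ds <- sketch_as_dataset(iris, col_id = "Species")
names(ds)[1]
attr(ds, "madshapR::col_id")
```

In real use, `as_dataset()` and `col_id()` from the package should be preferred; the sketch is only meant to make the id rules concrete.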
@@ -107,165 +106,31 @@{
# use madshapR_DEMO provided by the package
+library(dplyr)
-###### Example 1: a dataset can have an id column(s) which is specified as
-# an attribute.
+###### Example 1: A dataset can have an id column specified as an attribute.
dataset <- as_dataset(madshapR_DEMO$dataset_MELBOURNE, col_id = "id")
+glimpse(dataset)
+
+###### Example 2: Any data frame can be a dataset by definition.
+glimpse(as_dataset(iris, col_id = "Species"))
-###### Example 2: any data frame can be a dataset by definition.
-as_dataset(iris, col_id = "Species")
}
-#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
-#> 1 setosa 5.1 3.5 1.4 0.2
-#> 2 setosa 4.9 3.0 1.4 0.2
-#> 3 setosa 4.7 3.2 1.3 0.2
-#> 4 setosa 4.6 3.1 1.5 0.2
-#> 5 setosa 5.0 3.6 1.4 0.2
-#> 6 setosa 5.4 3.9 1.7 0.4
-#> 7 setosa 4.6 3.4 1.4 0.3
-#> 8 setosa 5.0 3.4 1.5 0.2
-#> 9 setosa 4.4 2.9 1.4 0.2
-#> 10 setosa 4.9 3.1 1.5 0.1
-#> 11 setosa 5.4 3.7 1.5 0.2
-#> 12 setosa 4.8 3.4 1.6 0.2
-#> 13 setosa 4.8 3.0 1.4 0.1
-#> 14 setosa 4.3 3.0 1.1 0.1
-#> 15 setosa 5.8 4.0 1.2 0.2
-#> 16 setosa 5.7 4.4 1.5 0.4
-#> 17 setosa 5.4 3.9 1.3 0.4
-#> 18 setosa 5.1 3.5 1.4 0.3
-#> 19 setosa 5.7 3.8 1.7 0.3
-#> 20 setosa 5.1 3.8 1.5 0.3
-#> 21 setosa 5.4 3.4 1.7 0.2
-#> 22 setosa 5.1 3.7 1.5 0.4
-#> 23 setosa 4.6 3.6 1.0 0.2
-#> 24 setosa 5.1 3.3 1.7 0.5
-#> 25 setosa 4.8 3.4 1.9 0.2
-#> 26 setosa 5.0 3.0 1.6 0.2
-#> 27 setosa 5.0 3.4 1.6 0.4
-#> 28 setosa 5.2 3.5 1.5 0.2
-#> 29 setosa 5.2 3.4 1.4 0.2
-#> 30 setosa 4.7 3.2 1.6 0.2
-#> 31 setosa 4.8 3.1 1.6 0.2
-#> 32 setosa 5.4 3.4 1.5 0.4
-#> 33 setosa 5.2 4.1 1.5 0.1
-#> 34 setosa 5.5 4.2 1.4 0.2
-#> 35 setosa 4.9 3.1 1.5 0.2
-#> 36 setosa 5.0 3.2 1.2 0.2
-#> 37 setosa 5.5 3.5 1.3 0.2
-#> 38 setosa 4.9 3.6 1.4 0.1
-#> 39 setosa 4.4 3.0 1.3 0.2
-#> 40 setosa 5.1 3.4 1.5 0.2
-#> 41 setosa 5.0 3.5 1.3 0.3
-#> 42 setosa 4.5 2.3 1.3 0.3
-#> 43 setosa 4.4 3.2 1.3 0.2
-#> 44 setosa 5.0 3.5 1.6 0.6
-#> 45 setosa 5.1 3.8 1.9 0.4
-#> 46 setosa 4.8 3.0 1.4 0.3
-#> 47 setosa 5.1 3.8 1.6 0.2
-#> 48 setosa 4.6 3.2 1.4 0.2
-#> 49 setosa 5.3 3.7 1.5 0.2
-#> 50 setosa 5.0 3.3 1.4 0.2
-#> 51 versicolor 7.0 3.2 4.7 1.4
-#> 52 versicolor 6.4 3.2 4.5 1.5
-#> 53 versicolor 6.9 3.1 4.9 1.5
-#> 54 versicolor 5.5 2.3 4.0 1.3
-#> 55 versicolor 6.5 2.8 4.6 1.5
-#> 56 versicolor 5.7 2.8 4.5 1.3
-#> 57 versicolor 6.3 3.3 4.7 1.6
-#> 58 versicolor 4.9 2.4 3.3 1.0
-#> 59 versicolor 6.6 2.9 4.6 1.3
-#> 60 versicolor 5.2 2.7 3.9 1.4
-#> 61 versicolor 5.0 2.0 3.5 1.0
-#> 62 versicolor 5.9 3.0 4.2 1.5
-#> 63 versicolor 6.0 2.2 4.0 1.0
-#> 64 versicolor 6.1 2.9 4.7 1.4
-#> 65 versicolor 5.6 2.9 3.6 1.3
-#> 66 versicolor 6.7 3.1 4.4 1.4
-#> 67 versicolor 5.6 3.0 4.5 1.5
-#> 68 versicolor 5.8 2.7 4.1 1.0
-#> 69 versicolor 6.2 2.2 4.5 1.5
-#> 70 versicolor 5.6 2.5 3.9 1.1
-#> 71 versicolor 5.9 3.2 4.8 1.8
-#> 72 versicolor 6.1 2.8 4.0 1.3
-#> 73 versicolor 6.3 2.5 4.9 1.5
-#> 74 versicolor 6.1 2.8 4.7 1.2
-#> 75 versicolor 6.4 2.9 4.3 1.3
-#> 76 versicolor 6.6 3.0 4.4 1.4
-#> 77 versicolor 6.8 2.8 4.8 1.4
-#> 78 versicolor 6.7 3.0 5.0 1.7
-#> 79 versicolor 6.0 2.9 4.5 1.5
-#> 80 versicolor 5.7 2.6 3.5 1.0
-#> 81 versicolor 5.5 2.4 3.8 1.1
-#> 82 versicolor 5.5 2.4 3.7 1.0
-#> 83 versicolor 5.8 2.7 3.9 1.2
-#> 84 versicolor 6.0 2.7 5.1 1.6
-#> 85 versicolor 5.4 3.0 4.5 1.5
-#> 86 versicolor 6.0 3.4 4.5 1.6
-#> 87 versicolor 6.7 3.1 4.7 1.5
-#> 88 versicolor 6.3 2.3 4.4 1.3
-#> 89 versicolor 5.6 3.0 4.1 1.3
-#> 90 versicolor 5.5 2.5 4.0 1.3
-#> 91 versicolor 5.5 2.6 4.4 1.2
-#> 92 versicolor 6.1 3.0 4.6 1.4
-#> 93 versicolor 5.8 2.6 4.0 1.2
-#> 94 versicolor 5.0 2.3 3.3 1.0
-#> 95 versicolor 5.6 2.7 4.2 1.3
-#> 96 versicolor 5.7 3.0 4.2 1.2
-#> 97 versicolor 5.7 2.9 4.2 1.3
-#> 98 versicolor 6.2 2.9 4.3 1.3
-#> 99 versicolor 5.1 2.5 3.0 1.1
-#> 100 versicolor 5.7 2.8 4.1 1.3
-#> 101 virginica 6.3 3.3 6.0 2.5
-#> 102 virginica 5.8 2.7 5.1 1.9
-#> 103 virginica 7.1 3.0 5.9 2.1
-#> 104 virginica 6.3 2.9 5.6 1.8
-#> 105 virginica 6.5 3.0 5.8 2.2
-#> 106 virginica 7.6 3.0 6.6 2.1
-#> 107 virginica 4.9 2.5 4.5 1.7
-#> 108 virginica 7.3 2.9 6.3 1.8
-#> 109 virginica 6.7 2.5 5.8 1.8
-#> 110 virginica 7.2 3.6 6.1 2.5
-#> 111 virginica 6.5 3.2 5.1 2.0
-#> 112 virginica 6.4 2.7 5.3 1.9
-#> 113 virginica 6.8 3.0 5.5 2.1
-#> 114 virginica 5.7 2.5 5.0 2.0
-#> 115 virginica 5.8 2.8 5.1 2.4
-#> 116 virginica 6.4 3.2 5.3 2.3
-#> 117 virginica 6.5 3.0 5.5 1.8
-#> 118 virginica 7.7 3.8 6.7 2.2
-#> 119 virginica 7.7 2.6 6.9 2.3
-#> 120 virginica 6.0 2.2 5.0 1.5
-#> 121 virginica 6.9 3.2 5.7 2.3
-#> 122 virginica 5.6 2.8 4.9 2.0
-#> 123 virginica 7.7 2.8 6.7 2.0
-#> 124 virginica 6.3 2.7 4.9 1.8
-#> 125 virginica 6.7 3.3 5.7 2.1
-#> 126 virginica 7.2 3.2 6.0 1.8
-#> 127 virginica 6.2 2.8 4.8 1.8
-#> 128 virginica 6.1 3.0 4.9 1.8
-#> 129 virginica 6.4 2.8 5.6 2.1
-#> 130 virginica 7.2 3.0 5.8 1.6
-#> 131 virginica 7.4 2.8 6.1 1.9
-#> 132 virginica 7.9 3.8 6.4 2.0
-#> 133 virginica 6.4 2.8 5.6 2.2
-#> 134 virginica 6.3 2.8 5.1 1.5
-#> 135 virginica 6.1 2.6 5.6 1.4
-#> 136 virginica 7.7 3.0 6.1 2.3
-#> 137 virginica 6.3 3.4 5.6 2.4
-#> 138 virginica 6.4 3.1 5.5 1.8
-#> 139 virginica 6.0 3.0 4.8 1.8
-#> 140 virginica 6.9 3.1 5.4 2.1
-#> 141 virginica 6.7 3.1 5.6 2.4
-#> 142 virginica 6.9 3.1 5.1 2.3
-#> 143 virginica 5.8 2.7 5.1 1.9
-#> 144 virginica 6.8 3.2 5.9 2.3
-#> 145 virginica 6.7 3.3 5.7 2.5
-#> 146 virginica 6.7 3.0 5.2 2.3
-#> 147 virginica 6.3 2.5 5.0 1.9
-#> 148 virginica 6.5 3.0 5.2 2.0
-#> 149 virginica 6.2 3.4 5.4 2.3
-#> 150 virginica 5.9 3.0 5.1 1.8
+#> Rows: 19
+#> Columns: 6
+#> $ id <dbl> 377943, 497013, 927676, 995667, 21829, 209432, 272983, 5806…
+#> $ Gender <dbl> 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2
+#> $ BMI <dbl> 221, 1850655594, 2457679588, 1571539833, 1855378065, 159317…
+#> $ age <dbl> 52, 49, 43, 59, 40, 47, -888, 53, 35, 40, 41, 34, 48, 43, -…
+#> $ smo_status <dbl> 1, 2, 3, -77, NA, 2, -77, 2, 1, 1, NA, 3, 2, 1, 2, 1, NA, 1…
+#> $ prg_curr <dbl> 0, -77, -77, 1, 0, -77, 8, 0, 0, -77, -77, 1, 1, 9, -77, -7…
+#> Rows: 150
+#> Columns: 5
+#> $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
+#> $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
+#> $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
+#> $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
+#> $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
Applies a data dictionary to a data structure, creating a labelled dataset. -All previous attributes will be preserved. For factors, the attribute -'levels' will be transformed into attribute 'labels' and values will be -recast into appropriate datatypes.
+Applies a data dictionary to a dataset, creating a labelled dataset with +variable attributes. Any previous attributes will be preserved. For +variables that are factors, variables will be transformed into +haven-labelled variables.
A data frame identifying the input dataset observations -associated to its data dictionary.
A dataset object.
A list of data frame(s) representing meta data of an -associated dataset. Automatically generated if not provided.
A list of data frame(s) representing metadata of the input +dataset. Automatically generated if not provided.
A data frame identifying the dataset with the data dictionary applied to each -variable as attributes.
+A labelled data frame with metadata as attributes, specified for each +variable from the input data dictionary.
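To make the idea of "metadata as attributes" concrete, here is a minimal base R sketch. It assumes a simplified data dictionary with only `name` and `label` columns (the package's own examples use `label:en`), and `sketch_apply()` is a hypothetical stand-in, not the package's `data_dict_apply()`. It shows labels being attached per variable while a previous attribute is preserved.

```r
# Hypothetical, simplified stand-in for applying a data dictionary.
sketch_apply <- function(dataset, data_dict) {
  vars <- data_dict$Variables
  for (i in seq_len(nrow(vars))) {
    nm <- vars$name[i]
    if (nm %in% names(dataset)) {
      # Attach the label as a variable attribute.
      attr(dataset[[nm]], "label") <- vars$label[i]
    }
  }
  dataset
}

dataset <- data.frame(id = c("a", "b"), age = c(52, 49),
                      stringsAsFactors = FALSE)
attr(dataset$age, "unit") <- "years"  # a previous attribute

data_dict <- list(
  Variables = data.frame(
    name  = c("id", "age"),
    label = c("id of the participant", "Age of participant"),
    stringsAsFactors = FALSE
  )
)

labelled <- sketch_apply(dataset, data_dict)
attributes(labelled$age)
```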
A dataset is a data table containing variables. A dataset object is a data frame and can be associated with a data dictionary. If no data dictionary is provided with a dataset, a minimum workable -data dictionary will be generated as needed within relevant functions. An -identifier variable(s) for indexing can be specified by the user. +data dictionary will be generated as needed within relevant functions. +Identifier variable(s) for indexing can be specified by the user. The id values must be non-missing and will be used in functions that require it. If no identifier variable is specified, indexing is handled automatically by the function.
@@ -112,7 +111,7 @@
-A list of data frame(s) representing meta data to be
+A list of data frame(s) representing metadata to be
 transformed.
R/06-data_evaluate.R
data_dict_evaluate.Rd
Assesses the content and structure of a data dictionary and reports potential -issues to facilitate the assessment of input data. -The report can be used to help assess data structure, presence of fields, -coherence across elements, and taxonomy or data dictionary formats. This -report is compatible with Excel and can be exported as an Excel spreadsheet.
+Assesses the content and structure of a data dictionary and generates reports +of the results. The report can be used to help assess data dictionary +structure, presence of fields, coherence across elements, and taxonomy +or data dictionary formats.
A list of data frame(s) representing meta data to be evaluated.
A list of data frame(s) representing metadata to be evaluated.
An optional data frame identifying a variable -classification schema.
An optional data frame identifying a variable classification +schema.
Whether the output data dictionary has a simple -data dictionary structure or not (meaning has a Maelstrom data dictionary -structure, compatible with Maelstrom Research ecosystem, including Opal). -TRUE by default.
Whether the input data dictionary should be coerced +with specific format restrictions for compatibility with other +Maelstrom Research software. TRUE by default.
A list of data frames of report for one data dictionary.
+A list of data frames containing assessment reports.
The object may be specifically formatted to be compatible with additional Maelstrom Research software, -in particular Opal environments.
+in particular Opal environments.{
# use madshapR_DEMO provided by the package
+library(dplyr)
data_dict <- madshapR_DEMO$`data_dict_TOKYO - errors`
-data_dict_evaluate(data_dict)
+glimpse(data_dict_evaluate(data_dict))
}
#> - DATA DICTIONARY ASSESSMENT: data_dict --------------
@@ -145,51 +143,26 @@ Examples
#>
#> - WARNING MESSAGES (if any): --------------------------------------------
#>
-#> $`Data dictionary summary`
-#> # A tibble: 14 × 11
-#> index name `label:en` valueType `Categories::table` `Categories::label:en`
-#> <int> <chr> <chr> <chr> <chr> <chr>
-#> 1 1 part_id id of the… text NA NA
-#> 2 2 gndr gndr boolean "Male = DEMO ; \nF… "Male = Male ; \nFema…
-#> 3 3 height height decimal "SKIP PATTERN = DE… "SKIP PATTERN = SKIP …
-#> 4 4 weight… weight_ms NA "-88 = DEMO ; \n-9… "-88 = Don’t want to …
-#> 5 5 weight… weight_dc integer NA NA
-#> 6 6 weight… weight_dc integr NA NA
-#> 7 7 dob NA date NA NA
-#> 8 8 prg_evr prg_ever integer NA NA
-#> 9 9 empty empty integer NA NA
-#> 10 10 opente… opentext character NA NA
-#> 11 11 opente… NA NA NA NA
-#> 12 12 NA no name integer "6 = DEMO" "6 = Inaccurate"
-#> 13 NA prg_ev… NA NA "0 = DEMO ; \n1 = … "0 = never pregnant ;…
-#> 14 NA weight… NA NA "-99 = DEMO" "-99 = Don’t want to …
-#> # ℹ 5 more variables: `Categories::missing` <chr>, table <chr>,
-#> # `description:en` <chr>, unit <chr>, `copy units` <chr>
-#>
-#> $`Data dictionary assessment`
-#> # A tibble: 18 × 6
-#> sheet col_name name_var Quality assessment co…¹ value suggestion
-#> <chr> <chr> <chr> <chr> <chr> <chr>
-#> 1 Variables copy units NA [INFO] - Possible dupl… unit… NA
-#> 2 Variables label:en dob [ERR] - The column `la… NA NA
-#> 3 Variables label:en opentext [ERR] - The column `la… NA NA
-#> 4 Variables name opentext [ERR] - duplicated var… NA NA
-#> 5 Variables name row number: 12 [ERR] - missing variab… NA NA
-#> 6 Variables name weight dc [INFO] - Incompatible … NA NA
-#> 7 Variables unit NA [INFO] - Possible dupl… unit… NA
-#> 8 Variables valueType gndr [ERR] - valueType con… bool… text
-#> 9 Variables valueType height [ERR] - valueType con… deci… text
-#> 10 Variables valueType opentext [ERR] - Incompatible v… char… NA
-#> 11 Variables valueType weight_dc [ERR] - Incompatible v… inte… NA
-#> 12 Categories label:en 1 [ERR] - The column `la… NA NA
-#> 13 Categories missing prg_ever [ERR] - incompatible v… FLASE NA
-#> 14 Categories variable prg_ever [ERR] - Categories not… NA NA
-#> 15 Categories variable prg_ever [ERR] - In 'name', the… row … NA
-#> 16 Categories variable prg_ever [ERR] - Category names… -7 NA
-#> 17 Categories variable weight_sm [ERR] - Categories not… NA NA
-#> 18 Categories variable NA [ERR] - In 'variable',… row … NA
-#> # ℹ abbreviated name: ¹`Quality assessment comment`
-#>
+#> List of 2
+#> $ Data dictionary summary : tibble [14 × 11] (S3: tbl_df/tbl/data.frame)
+#> ..$ index : int [1:14] 1 2 3 4 5 6 7 8 9 10 ...
+#> ..$ name : chr [1:14] "part_id" "gndr" "height" "weight_ms" ...
+#> ..$ label:en : chr [1:14] "id of the participant" "gndr" "height" "weight_ms" ...
+#> ..$ valueType : chr [1:14] "text" "boolean" "decimal" NA ...
+#> ..$ Categories::table : chr [1:14] NA "Male = DEMO ; \nFemale = DEMO ; \n-77 = DEMO" "SKIP PATTERN = DEMO" "-88 = DEMO ; \n-99 = DEMO" ...
+#> ..$ Categories::label:en: chr [1:14] NA "Male = Male ; \nFemale = Female ; \n-77 = Don’t want to answer" "SKIP PATTERN = SKIP PATTERN" "-88 = Don’t want to answer ; \n-99 = Do not remember" ...
+#> ..$ Categories::missing : chr [1:14] NA "Male = FALSE ; \nFemale = FALSE ; \n-77 = TRUE" "SKIP PATTERN = TRUE" "-88 = TRUE ; \n-99 = TRUE" ...
+#> ..$ table : chr [1:14] "DEMO" "DEMO" "DEMO" "DEMO" ...
+#> ..$ description:en : chr [1:14] "id of the participant" "gender of the participant" "height of the participant" "weight of the participant - measured" ...
+#> ..$ unit : chr [1:14] NA NA "cm" "kg" ...
+#> ..$ copy units : chr [1:14] NA NA "cm" "kg" ...
+#> $ Data dictionary assessment: tibble [18 × 6] (S3: tbl_df/tbl/data.frame)
+#> ..$ sheet : chr [1:18] "Variables" "Variables" "Variables" "Variables" ...
+#> ..$ col_name : chr [1:18] "copy units" "label:en" "label:en" "name" ...
+#> ..$ name_var : chr [1:18] NA "dob" "opentext" "opentext" ...
+#> ..$ Quality assessment comment: chr [1:18] "[INFO] - Possible duplicated columns" "[ERR] - The column `label(:xx)` must exist contain no 'NA' values" "[ERR] - The column `label(:xx)` must exist contain no 'NA' values" "[ERR] - duplicated variable name" ...
+#> ..$ value : chr [1:18] "unit ; copy units" NA NA NA ...
+#> ..$ suggestion : chr [1:18] NA NA NA NA ...
A list of data frame(s) representing meta data to be -transformed. Automatically generated if not provided.
A list of data frame(s) representing metadata to be +transformed.
-Symbol identifying the name of the element (data frame) to take
+A symbol identifying the name of the element (data frame) to take
 column(s) from. Default is 'Variables'.
-Symbol identifying the name of the element (data frame) to create
+A symbol identifying the name of the element (data frame) to create
 column(s) in. Default is 'Categories'.
R/02-dictionaries_functions.R
data_dict_extract.Rd
Creates a data dictionary in a format compliant with formats used in -Maelstrom Research ecosystem, including Opal (with 'Variables' and -'Categories' in separate data frames and standard columns in each) from any -dataset in data frame format. If the input dataset has no associated -metadata, a data dictionary with minimal required information is created -from the column (variable) names to create the data dictionary structure -required for the package. All columns except variable names will be blank.
+Generates a data dictionary from a dataset. If the dataset variables have no +associated metadata, a minimum data dictionary is created by using variable +attributes.
A data frame identifying the input dataset observations which -contains meta data as attributes.
A dataset object.
Whether the output data dictionary has a simple -data dictionary structure or not (meaning has a Maelstrom data dictionary -structure, compatible with Maelstrom Research ecosystem, including Opal). -TRUE by default.
Whether the input data dictionary should be coerced +with specific format restrictions for compatibility with other +Maelstrom Research software. TRUE by default.
A list of data frame(s) identifying a data dictionary.
+A list of data frame(s) representing metadata of the dataset variables.
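As a rough illustration of generating a minimal data dictionary from a dataset, the base R sketch below builds a `Variables` data frame from the column names and any `label` attribute present. `sketch_extract()` is a hypothetical helper, and the single `label` column is a simplifying assumption; the package's `data_dict_extract()` produces a fuller structure.

```r
# Hypothetical helper; not madshapR's data_dict_extract().
sketch_extract <- function(dataset) {
  labels <- vapply(
    dataset,
    function(col) {
      lab <- attr(col, "label", exact = TRUE)
      # Variables without a label attribute get a missing label.
      if (is.null(lab)) NA_character_ else as.character(lab)
    },
    character(1)
  )
  list(
    Variables = data.frame(
      name  = names(dataset),
      label = unname(labels),
      stringsAsFactors = FALSE
    )
  )
}

dataset <- data.frame(id = c("a", "b"), age = c(52, 49),
                      stringsAsFactors = FALSE)
attr(dataset$age, "label") <- "Age of participant"

dd <- sketch_extract(dataset)
dd$Variables
```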
A dataset is a data table containing variables. A dataset object is a data frame and can be associated with a data dictionary. If no data dictionary is provided with a dataset, a minimum workable -data dictionary will be generated as needed within relevant functions. An -identifier variable(s) for indexing can be specified by the user. +data dictionary will be generated as needed within relevant functions. +Identifier variable(s) for indexing can be specified by the user. The id values must be non-missing and will be used in functions that require it. If no identifier variable is specified, indexing is handled automatically by the function.
@@ -118,7 +108,7 @@variable
and name
.
The object may be specifically formatted to be compatible with additional Maelstrom Research software, -in particular Opal environments.
+in particular Opal environments.A list of data frame(s) representing meta data to be -transformed.
A list of data frame(s) representing metadata to be +filtered.
-A list of data frame(s) representing meta data to be
+A list of data frame(s) representing metadata to be
 transformed.
-A list of data frame(s) representing meta data to be
+A list of data frame(s) representing metadata to be
 transformed.
-A list of data frame(s) representing meta data to be
+A list of data frame(s) representing metadata to be
 transformed.
{
# use madshapR_DEMO provided by the package
+library(dplyr)
+
# Create a list of data dictionaries where the column 'table' is added to
# refer to the associated dataset. The object created is not a
# data dictionary per se, but can be used as a structure which can be
@@ -117,45 +119,24 @@ Examples
data_dict_2 <- madshapR_DEMO$data_dict_MELBOURNE)
names(data_dict_list) = c("dataset_TOKYO","dataset_MELBOURNE")
-data_dict_list_nest(data_dict_list, name_group = 'table')
+glimpse(data_dict_list_nest(data_dict_list, name_group = 'table'))
}
-#> $Variables
-#> # A tibble: 15 × 7
-#> table index name `label:en` `description:en` valueType unit
-#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
-#> 1 dataset_TOKYO 1 part_id id of the… id of the parti… text NA
-#> 2 dataset_TOKYO 2 gndr gndr gender of the p… text NA
-#> 3 dataset_TOKYO 3 height height height of the p… integer cm
-#> 4 dataset_TOKYO 4 weight_ms weight_ms weight of the p… integer kg
-#> 5 dataset_TOKYO 5 weight_dc weight_dc weight of the p… decimal kg
-#> 6 dataset_TOKYO 6 dob dob date of birth o… date years
-#> 7 dataset_TOKYO 7 prg_ever prg_ever whether the par… integer NA
-#> 8 dataset_TOKYO 8 empty empty empty column integer NA
-#> 9 dataset_TOKYO 9 opentext opentext open text text NA
-#> 10 dataset_MELBOURNE 1 id id id of the parti… text NA
-#> 11 dataset_MELBOURNE 2 Gender Gender Gender integer NA
-#> 12 dataset_MELBOURNE 3 BMI BMI Body Mass Index decimal kg/m…
-#> 13 dataset_MELBOURNE 4 age age Age of Particip… integer years
-#> 14 dataset_MELBOURNE 5 smo_stat… smo_curr Whether the par… integer NA
-#> 15 dataset_MELBOURNE 6 prg_curr prg_curr Are you current… integer NA
-#>
-#> $Categories
-#> # A tibble: 23 × 5
-#> table variable name `label:en` missing
-#> <chr> <chr> <chr> <chr> <chr>
-#> 1 dataset_TOKYO gndr Male Male FALSE
-#> 2 dataset_TOKYO gndr Female Female FALSE
-#> 3 dataset_TOKYO gndr -77 Don’t want to answer TRUE
-#> 4 dataset_TOKYO weight_ms -88 Don’t want to answer TRUE
-#> 5 dataset_TOKYO weight_ms -99 Don’t know TRUE
-#> 6 dataset_TOKYO prg_ever 0 never pregnant FALSE
-#> 7 dataset_TOKYO prg_ever 1 pregnant once or more FALSE
-#> 8 dataset_TOKYO prg_ever 2 currently pregnant FALSE
-#> 9 dataset_TOKYO prg_ever 8 Don’t want to answer TRUE
-#> 10 dataset_TOKYO prg_ever 9 Don’t know TRUE
-#> # ℹ 13 more rows
-#>
+#> List of 2
+#> $ Variables : tibble [15 × 7] (S3: tbl_df/tbl/data.frame)
+#> ..$ table : chr [1:15] "dataset_TOKYO" "dataset_TOKYO" "dataset_TOKYO" "dataset_TOKYO" ...
+#> ..$ index : chr [1:15] "1" "2" "3" "4" ...
+#> ..$ name : chr [1:15] "part_id" "gndr" "height" "weight_ms" ...
+#> ..$ label:en : chr [1:15] "id of the participant" "gndr" "height" "weight_ms" ...
+#> ..$ description:en: chr [1:15] "id of the participant" "gender of the participant" "height of the participant" "weight of the participant - measured" ...
+#> ..$ valueType : chr [1:15] "text" "text" "integer" "integer" ...
+#> ..$ unit : chr [1:15] NA NA "cm" "kg" ...
+#> $ Categories: tibble [23 × 5] (S3: tbl_df/tbl/data.frame)
+#> ..$ table : chr [1:23] "dataset_TOKYO" "dataset_TOKYO" "dataset_TOKYO" "dataset_TOKYO" ...
+#> ..$ variable: chr [1:23] "gndr" "gndr" "gndr" "weight_ms" ...
+#> ..$ name : chr [1:23] "Male" "Female" "-77" "-88" ...
+#> ..$ label:en: chr [1:23] "Male" "Female" "Don’t want to answer" "Don’t want to answer" ...
+#> ..$ missing : chr [1:23] "FALSE" "FALSE" "TRUE" "TRUE" ...
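The nested structure shown in the output above can be approximated in base R. The sketch below is an illustration under stated assumptions, not the package's code: `sketch_nest()` is a hypothetical helper, and the two toy data dictionaries contain only a few of the columns seen in the demo output. It stacks the `Variables` and `Categories` elements of each data dictionary and adds a grouping column named by `name_group`.

```r
# Hypothetical helper; not madshapR's data_dict_list_nest().
sketch_nest <- function(data_dict_list, name_group = "table") {
  bind_element <- function(element) {
    parts <- lapply(names(data_dict_list), function(nm) {
      df <- data_dict_list[[nm]][[element]]
      if (is.null(df)) return(NULL)
      # Prefix each row with the name of its source data dictionary.
      group <- data.frame(rep(nm, nrow(df)), stringsAsFactors = FALSE)
      names(group) <- name_group
      cbind(group, df)
    })
    do.call(rbind, Filter(Negate(is.null), parts))
  }
  list(Variables  = bind_element("Variables"),
       Categories = bind_element("Categories"))
}

data_dict_list <- list(
  dataset_TOKYO = list(
    Variables  = data.frame(name = c("part_id", "gndr"),
                            valueType = c("text", "text"),
                            stringsAsFactors = FALSE),
    Categories = data.frame(variable = "gndr",
                            name = c("Male", "Female"),
                            stringsAsFactors = FALSE)
  ),
  dataset_MELBOURNE = list(
    Variables  = data.frame(name = c("id", "Gender"),
                            valueType = c("text", "integer"),
                            stringsAsFactors = FALSE),
    Categories = data.frame(variable = "Gender",
                            name = c("1", "2"),
                            stringsAsFactors = FALSE)
  )
)

nested <- sketch_nest(data_dict_list, name_group = "table")
str(nested)
```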
A data frame identifying the input dataset observations.
A dataset object.
A list of data frame(s) representing meta data -associated to a dataset.
A list of data frame(s) representing metadata of the input +dataset.
whether to apply the data dictionary to its dataset. -The resulting data frame will have for each column its associated meta data -as attributes. FALSE by default.
Whether data dictionary(ies) should be applied to +associated dataset(s), creating labelled dataset(s) with variable attributes. +Any previous attributes will be preserved. FALSE by default.
A dataset is a data table containing variables. A dataset object is a data frame and can be associated with a data dictionary. If no data dictionary is provided with a dataset, a minimum workable -data dictionary will be generated as needed within relevant functions. An -identifier variable(s) for indexing can be specified by the user. +data dictionary will be generated as needed within relevant functions. +Identifier variable(s) for indexing can be specified by the user. The id values must be non-missing and will be used in functions that require it. If no identifier variable is specified, indexing is handled automatically by the function.
@@ -131,64 +131,26 @@A list of data frame(s) representing meta data to be +
A list of data frame(s) representing metadata to be transformed.
An optional data frame identifying a variable -classification schema.
An optional data frame identifying a variable classification +schema.
A list of data frame(s) representing meta data to be +
A list of data frame(s) representing metadata to be transformed.
An optional data frame identifying a variable -classification schema.
An optional data frame identifying a variable classification +schema.
A list of data frame(s) representing meta data to be +
A list of data frame(s) representing metadata to be transformed.
A list of data frame(s) representing meta data of an -associated dataset (to be generated).
A list of data frame(s) representing metadata.
Whether to apply the data dictionary to its dataset. -The resulting data frame will have for each column its associated meta data -as attributes. FALSE by default.
Whether data dictionary(ies) should be applied to +associated dataset(s), creating labelled dataset(s) with variable attributes. +Any previous attributes will be preserved. FALSE by default.
A data dictionary contains the list of variables in a dataset and metadata +
A dataset is a data table containing variables. A dataset object is a +data frame and can be associated with a data dictionary. If no +data dictionary is provided with a dataset, a minimum workable +data dictionary will be generated as needed within relevant functions. +Identifier variable(s) for indexing can be specified by the user. +The id values must be non-missing and will be used in functions that +require it. If no identifier variable is specified, indexing is +handled automatically by the function.
+A data dictionary contains the list of variables in a dataset and metadata about the variables and can be associated with a dataset. A data dictionary object is a list of data frame(s) named 'Variables' (required) and 'Categories' (if any). To be usable in any function, the data frame diff --git a/docs/reference/dataset_cat_as_labels.html b/docs/reference/dataset_cat_as_labels.html index 5113d4b..75125d5 100644 --- a/docs/reference/dataset_cat_as_labels.html +++ b/docs/reference/dataset_cat_as_labels.html @@ -18,7 +18,7 @@
A data frame identifying the input dataset observations -associated to its data dictionary.
A dataset object.
A list of data frame(s) representing meta data of an -associated dataset (to be generated).
A list of data frame(s) representing metadata of the input +dataset. Automatically generated if not provided.
A dataset is a data table containing variables. A dataset object is a data frame and can be associated with a data dictionary. If no data dictionary is provided with a dataset, a minimum workable -data dictionary will be generated as needed within relevant functions. An -identifier variable(s) for indexing can be specified by the user. +data dictionary will be generated as needed within relevant functions. +Identifier variable(s) for indexing can be specified by the user. The id values must be non-missing and will be used in functions that require it. If no identifier variable is specified, indexing is handled automatically by the function.
@@ -116,160 +115,26 @@{
-dataset_cat_as_labels(iris['Sepal.Length'])
+dataset = madshapR_DEMO$dataset_PARIS
+data_dict = as_data_dict_mlstr(madshapR_DEMO$data_dict_PARIS)
+dataset_cat_as_labels(dataset, data_dict, col_names = 'SEX')
}
-#> Sepal.Length
-#> 1 5.1
-#> 2 4.9
-#> 3 4.7
-#> 4 4.6
-#> 5 5.0
-#> 6 5.4
-#> 7 4.6
-#> 8 5.0
-#> 9 4.4
-#> 10 4.9
-#> 11 5.4
-#> 12 4.8
-#> 13 4.8
-#> 14 4.3
-#> 15 5.8
-#> 16 5.7
-#> 17 5.4
-#> 18 5.1
-#> 19 5.7
-#> 20 5.1
-#> 21 5.4
-#> 22 5.1
-#> 23 4.6
-#> 24 5.1
-#> 25 4.8
-#> 26 5.0
-#> 27 5.0
-#> 28 5.2
-#> 29 5.2
-#> 30 4.7
-#> 31 4.8
-#> 32 5.4
-#> 33 5.2
-#> 34 5.5
-#> 35 4.9
-#> 36 5.0
-#> 37 5.5
-#> 38 4.9
-#> 39 4.4
-#> 40 5.1
-#> 41 5.0
-#> 42 4.5
-#> 43 4.4
-#> 44 5.0
-#> 45 5.1
-#> 46 4.8
-#> 47 5.1
-#> 48 4.6
-#> 49 5.3
-#> 50 5.0
-#> 51 7.0
-#> 52 6.4
-#> 53 6.9
-#> 54 5.5
-#> 55 6.5
-#> 56 5.7
-#> 57 6.3
-#> 58 4.9
-#> 59 6.6
-#> 60 5.2
-#> 61 5.0
-#> 62 5.9
-#> 63 6.0
-#> 64 6.1
-#> 65 5.6
-#> 66 6.7
-#> 67 5.6
-#> 68 5.8
-#> 69 6.2
-#> 70 5.6
-#> 71 5.9
-#> 72 6.1
-#> 73 6.3
-#> 74 6.1
-#> 75 6.4
-#> 76 6.6
-#> 77 6.8
-#> 78 6.7
-#> 79 6.0
-#> 80 5.7
-#> 81 5.5
-#> 82 5.5
-#> 83 5.8
-#> 84 6.0
-#> 85 5.4
-#> 86 6.0
-#> 87 6.7
-#> 88 6.3
-#> 89 5.6
-#> 90 5.5
-#> 91 5.5
-#> 92 6.1
-#> 93 5.8
-#> 94 5.0
-#> 95 5.6
-#> 96 5.7
-#> 97 5.7
-#> 98 6.2
-#> 99 5.1
-#> 100 5.7
-#> 101 6.3
-#> 102 5.8
-#> 103 7.1
-#> 104 6.3
-#> 105 6.5
-#> 106 7.6
-#> 107 4.9
-#> 108 7.3
-#> 109 6.7
-#> 110 7.2
-#> 111 6.5
-#> 112 6.4
-#> 113 6.8
-#> 114 5.7
-#> 115 5.8
-#> 116 6.4
-#> 117 6.5
-#> 118 7.7
-#> 119 7.7
-#> 120 6.0
-#> 121 6.9
-#> 122 5.6
-#> 123 7.7
-#> 124 6.3
-#> 125 6.7
-#> 126 7.2
-#> 127 6.2
-#> 128 6.1
-#> 129 6.4
-#> 130 7.2
-#> 131 7.4
-#> 132 7.9
-#> 133 6.4
-#> 134 6.3
-#> 135 6.1
-#> 136 7.7
-#> 137 6.3
-#> 138 6.4
-#> 139 6.0
-#> 140 6.9
-#> 141 6.7
-#> 142 6.9
-#> 143 5.8
-#> 144 6.8
-#> 145 6.7
-#> 146 6.7
-#> 147 6.3
-#> 148 6.5
-#> 149 6.2
-#> 150 5.9
+#> Processing of : SEX
+#> # A tibble: 24 × 7
+#> ID SEX BMI AGE SMO SMO_QTY PRG_EVER
+#> * <chr> <chr+lbl> <dbl> <dbl> <dbl> <dbl> <dbl>
+#> 1 Paris_687393 Femme [1] 2224298583 52 1 32 0
+#> 2 Paris_585666 Homme [0] 1523935376 49 1 8 -8
+#> 3 Paris_75802 Homme [0] 2266888359 43 1 48 -8
+#> 4 Paris_412072 Femme [1] NA 59 1 11 0
+#> 5 Paris_404333 Femme [1] 2618221463 40 1 18 1
+#> 6 Paris_554985 Homme [0] 1598702068 47 1 7 -8
+#> 7 Paris_714168 Femme [1] 1904634522 46 1 18 NA
+#> 8 Paris_145477 Femme [1] 168600545 53 0 -8 1
+#> 9 Paris_202076 Femme [1] 2992421287 35 1 36 1
+#> 10 Paris_847235 Homme [0] 2064553154 NA 0 -8 -8
+#> # ℹ 14 more rows
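The labelled SEX column above pairs each stored code with its category label (0 = Homme, 1 = Femme). As a rough base-R illustration of that idea, and not of how dataset_cat_as_labels() is implemented, the same mapping can be sketched with factor(). The vector sex_codes and the lookup table sex_labels are hypothetical names made up for this sketch; only the 0/1 to Homme/Femme correspondence comes from the output above.

```r
# Hypothetical codes and lookup table (assumptions for illustration only).
sex_codes  <- c(1, 0, 0, 1, NA)
sex_labels <- c("0" = "Homme", "1" = "Femme")

# factor() matches each code against `levels` and substitutes the
# corresponding entry of `labels`; missing codes stay NA.
sex_as_labels <- as.character(
  factor(sex_codes, levels = names(sex_labels), labels = unname(sex_labels))
)

print(sex_as_labels)
```

Keeping the lookup table separate from the data mirrors the role of the 'Categories' table in a data dictionary: the dataset stores codes, and the labels are applied only when needed.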
R/06-data_evaluate.R
dataset_evaluate.Rd
Assesses the content and structure of a dataset and reports possible issues -in the dataset and data dictionary to facilitate assessment of input data. -The report can be used to help assess data structure, presence of fields, -coherence across elements, and taxonomy or data dictionary formats. This -report is compatible with Excel and can be exported as an Excel spreadsheet.
+Assesses the content and structure of a dataset object and generates reports +of the results. This function can be used to evaluate data structure, +presence of specific fields, coherence across elements, and data dictionary +formats.
A data frame identifying the input dataset observations -associated to its data dictionary.
A dataset object.
A list of data frame(s) representing meta data of an -associated dataset. Automatically generated if not provided.
A list of data frame(s) representing metadata of the input +dataset. Automatically generated if not provided.
An optional data frame identifying a variable -classification schema.
An optional data frame identifying a variable classification +schema.
A character string specifying the name of the dataset
-(internally used in the function dossier_evaluate()
).
dossier_evaluate()
).
Whether the output data dictionary has a simple -data dictionary structure or not (meaning has a Maelstrom data dictionary -structure, compatible with Maelstrom Research ecosystem, including Opal). -TRUE by default.
Whether the input data dictionary should be coerced +with specific format restrictions for compatibility with other +Maelstrom Research software. TRUE by default.
A list of data frames of report for one data dictionary.
+A list of data frames containing assessment reports.
A dataset is a data table containing variables. A dataset object is a data frame and can be associated with a data dictionary. If no data dictionary is provided with a dataset, a minimum workable -data dictionary will be generated as needed within relevant functions. An -identifier variable(s) for indexing can be specified by the user. +data dictionary will be generated as needed within relevant functions. +Identifier variable(s) for indexing can be specified by the user. The id values must be non-missing and will be used in functions that require it. If no identifier variable is specified, indexing is handled automatically by the function.
@@ -147,7 +143,7 @@The object may be specifically formatted to be compatible with additional Maelstrom Research software, -in particular Opal environments.
+in particular Opal environments.A data frame identifying the input dataset observations -associated to its data dictionary.
A dataset object.
A list of data frame(s) representing meta data of an -associated dataset. Automatically generated if not provided.
A list of data frame(s) representing metadata of the input +dataset. Automatically generated if not provided.
A dataset is a data table containing variables. A dataset object is a data frame and can be associated with a data dictionary. If no data dictionary is provided with a dataset, a minimum workable -data dictionary will be generated as needed within relevant functions. An -identifier variable(s) for indexing can be specified by the user. +data dictionary will be generated as needed within relevant functions. +Identifier variable(s) for indexing can be specified by the user. The id values must be non-missing and will be used in functions that require it. If no identifier variable is specified, indexing is handled automatically by the function.
@@ -135,24 +134,19 @@{
-###### Example : any data frame can be a dataset by definition.
-dataset_preprocess(iris)
+###### Example : Any data frame can be a dataset by definition.
+head(dataset_preprocess(iris))
}
-#> # A tibble: 750 × 10
-#> index name `Categorical variable` valid_class value_var_occur value_var
-#> <int> <chr> <chr> <chr> <dbl> <chr>
-#> 1 1 Sepal.Len… no 3_Valid ot… 1 5.1
-#> 2 1 Sepal.Len… no 3_Valid ot… 1 4.9
-#> 3 1 Sepal.Len… no 3_Valid ot… 1 4.7
-#> 4 1 Sepal.Len… no 3_Valid ot… 1 4.6
-#> 5 1 Sepal.Len… no 3_Valid ot… 1 5
-#> 6 1 Sepal.Len… no 3_Valid ot… 1 5.4
-#> 7 1 Sepal.Len… no 3_Valid ot… 1 4.6
-#> 8 1 Sepal.Len… no 3_Valid ot… 1 5
-#> 9 1 Sepal.Len… no 3_Valid ot… 1 4.4
-#> 10 1 Sepal.Len… no 3_Valid ot… 1 4.9
-#> # ℹ 740 more rows
+#> # A tibble: 6 × 10
+#> index name `Categorical variable` valid_class value_var_occur value_var
+#> <int> <chr> <chr> <chr> <dbl> <chr>
+#> 1 1 Sepal.Leng… no 3_Valid ot… 1 5.1
+#> 2 1 Sepal.Leng… no 3_Valid ot… 1 4.9
+#> 3 1 Sepal.Leng… no 3_Valid ot… 1 4.7
+#> 4 1 Sepal.Leng… no 3_Valid ot… 1 4.6
+#> 5 1 Sepal.Leng… no 3_Valid ot… 1 5
+#> 6 1 Sepal.Leng… no 3_Valid ot… 1 5.4
#> # ℹ 4 more variables: index_value <int>, cat_index <int>, cat_label <chr>,
#> # index_in_dataset <int>
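The preprocessed output above holds one row per variable value rather than one row per observation. A minimal base-R sketch of that wide-to-long reshaping is given below. It is an assumption-laden illustration, not the package's implementation: only the column names name and value_var are borrowed from the printed output, and the remaining columns computed by dataset_preprocess() are not reproduced.

```r
# Base-R sketch of a wide-to-long reshape (not madshapR's implementation).
# Every value is converted to character so that columns of different
# types can share a single `value_var` column.
long <- data.frame(
  name      = rep(names(iris), each = nrow(iris)),
  value_var = unlist(lapply(iris, as.character), use.names = FALSE),
  stringsAsFactors = FALSE
)

head(long)
nrow(long)  # 150 rows x 5 columns in iris give 750 values
```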
diff --git a/docs/reference/dataset_summarize.html b/docs/reference/dataset_summarize.html
index 1f9669e..e8248d7 100644
--- a/docs/reference/dataset_summarize.html
+++ b/docs/reference/dataset_summarize.html
@@ -1,11 +1,9 @@
-Generate a report and summary of a dataset — dataset_summarize • madshapR Generate an assessment report and summary of a dataset — dataset_summarize • madshapR
@@ -23,7 +21,7 @@
R/07-data_summarise.R
dataset_summarize.Rd
Assesses and summarizes the content and structure of a dataset and data -dictionary and reports potential issues to facilitate the assessment of -input. The report can be used to help assess data structure, presence of -fields, coherence across elements, and taxonomy or data dictionary formats. -The summary provides additional information about variable distributions and -descriptive statistics. This report is compatible with Excel and can be -exported as an Excel spreadsheet.
+Assesses and summarizes the content and structure of a dataset and generates +reports of the results. This function can be used to evaluate data structure, +presence of specific fields, coherence across elements, and data dictionary +formats, and to summarize additional information about variable distributions +and descriptive statistics.
A data frame identifying the input dataset observations -associated to its data dictionary.
A dataset object.
A list of data frame(s) representing meta data of an -associated dataset. Automatically generated if not provided.
A list of data frame(s) representing metadata of the input +dataset. Automatically generated if not provided.
A character string identifying the column in the dataset -to use as a grouping variable. Visual elements will be grouped by this +to use as a grouping variable. Elements will be grouped by this column.
An optional data frame identifying a variable -classification schema.
An optional data frame identifying a variable classification +schema.
Whether the output should be generated based on more -precise valueType inferred from the data. FALSE by default -(will use the valueType declared).
Whether the output should include a more accurate +valueType that could be applied to the dataset. FALSE by default.
A list of data frames of report for one data dictionary.
+A list of data frames containing assessment reports and summaries.
A dataset is a data table containing variables. A dataset object is a data frame and can be associated with a data dictionary. If no data dictionary is provided with a dataset, a minimum workable -data dictionary will be generated as needed within relevant functions. An -identifier variable(s) for indexing can be specified by the user. +data dictionary will be generated as needed within relevant functions. +Identifier variable(s) for indexing can be specified by the user. The id values must be non-missing and will be used in functions that require it. If no identifier variable is specified, indexing is handled automatically by the function.
@@ -158,12 +152,13 @@The valueType is a declared property of a variable that is required in certain functions to determine handling of the variables. Specifically, valueType refers to the -OBiBa data type of a variable. +OBiBa data type of a variable. The valueType is specified in a data dictionary in a column 'valueType' and can be associated with variables as attributes. Acceptable valueTypes include 'text', 'integer', 'decimal', 'boolean', 'datetime', 'date'. The full list of OBiBa valueType possibilities and their correspondence with R data -types are available using valueType_list.
+types are available using valueType_list. The valueType can be used to +coerce the variable to the corresponding data type.R/08-data_visualize.R
dataset_visualize.Rd
Generates a visual report for a dataset in an HTML bookdown document. The
-report provides figures and descriptive statistics for each variable to
-facilitate the assessment of input data. Statistics and figures are
-generated according to variable data type. The report can be used to help
-assess data structure, coherence across elements, and taxonomy or
-data dictionary formats. The summaries and figures provide additional
-information about variable distributions and descriptive statistics.
-The charts and tables are produced based on their data type. The variable
-can be grouped using group_by
parameter, which is a (categorical) column
-in the dataset. The user may need to use as.factor()
in this context. To
-fasten the process (and allow recycling object in a workflow) the user can
-feed the function with a dataset_summary
, which is the output of the function
-dataset_summarize()
of the column(s) col
and group_by
. The summary
-must have the same parameters to operate.
Generates a visual report of a dataset in an HTML bookdown +document, with summary figures and statistics for each variable. The report +outputs can be grouped by a categorical variable.
A data frame identifying the input dataset observations -associated to its data dictionary.
A dataset object.
A list of data frame(s) representing meta data of an -associated dataset. Automatically generated if not provided.
A list of data frame(s) representing metadata of the input +dataset. Automatically generated if not provided.
A character string identifying the column in each -dataset to use as a grouping variable. Visual elements will be grouped by -this column.
A character string identifying the column in the dataset +to use as a grouping variable. Elements will be grouped by this +column.
An optional data frame identifying a variable -classification schema.
An optional data frame identifying a variable classification +schema.
Whether the output should be generated based on more -precise valueType inferred from the data. FALSE by default -(will use the valueType declared).
Whether the output should include a more accurate +valueType that could be applied to the dataset. FALSE by default.
A character string specifying the name of the dataset -(used for internal processes and programming).
dossier_evaluate()
).
A bookdown folder containing files in the specified output folder. To
-open the file in browser, open 'docs/index.html'. Or use
-bookdown_open()
A folder containing files for the bookdown site. To open the bookdown site
+in a browser, open 'docs/index.html', or use bookdown_open()
with the
+folder path.
A dataset is a data table containing variables. A dataset object is a data frame and can be associated with a data dictionary. If no data dictionary is provided with a dataset, a minimum workable -data dictionary will be generated as needed within relevant functions. An -identifier variable(s) for indexing can be specified by the user. +data dictionary will be generated as needed within relevant functions. +Identifier variable(s) for indexing can be specified by the user. The id values must be non-missing and will be used in functions that require it. If no identifier variable is specified, indexing is handled automatically by the function.
@@ -187,12 +161,13 @@The valueType is a declared property of a variable that is required in certain functions to determine handling of the variables. Specifically, valueType refers to the -OBiBa data type of a variable. +OBiBa data type of a variable. The valueType is specified in a data dictionary in a column 'valueType' and can be associated with variables as attributes. Acceptable valueTypes include 'text', 'integer', 'decimal', 'boolean', 'datetime', 'date'. The full list of OBiBa valueType possibilities and their correspondence with R data -types are available using valueType_list.
+types are available using valueType_list. The valueType can be used to +coerce the variable to the corresponding data type.A taxonomy is a classification schema that can be defined for variable attributes. A taxonomy is usually extracted from an Opal environment, and a @@ -203,7 +178,8 @@
A data frame identifying the input dataset observations -associated to its data dictionary.
A dataset object.
A dataset is a data table containing variables. A dataset object is a data frame and can be associated with a data dictionary. If no data dictionary is provided with a dataset, a minimum workable -data dictionary will be generated as needed within relevant functions. An -identifier variable(s) for indexing can be specified by the user. +data dictionary will be generated as needed within relevant functions. +Identifier variable(s) for indexing can be specified by the user. The id values must be non-missing and will be used in functions that require it. If no identifier variable is specified, indexing is handled automatically by the function.
@@ -108,23 +107,18 @@R/03-dataset_functions.R
dossier_create.Rd
Assembles a dossier object from the listed datasets. A dossier is a list -containing at least one valid dataset and is the input used by key functions -of the package.
+Generates a dossier object (list of one or more datasets).
A list of data frame(s), identifying the input data -observations.
A list of data frame(s), each of them being a dataset object.
whether to apply the data dictionary to its dataset. -The resulting data frame will have for each column its associated meta data -as attributes. FALSE by default.
Whether data dictionary(ies) should be applied to +associated dataset(s), creating labelled dataset(s) with variable attributes. +Any previous attributes will be preserved. FALSE by default.
A list of data frame(s), each of them identifying datasets in a dossier.
+A list of data frame(s), containing input dataset(s).
A dataset is a data table containing variables. A dataset object is a data frame and can be associated with a data dictionary. If no data dictionary is provided with a dataset, a minimum workable -data dictionary will be generated as needed within relevant functions. An -identifier variable(s) for indexing can be specified by the user. +data dictionary will be generated as needed within relevant functions. +Identifier variable(s) for indexing can be specified by the user. The id values must be non-missing and will be used in functions that require it. If no identifier variable is specified, indexing is handled automatically by the function.
@@ -106,207 +101,61 @@{
# use madshapR_DEMO provided by the package
+library(dplyr)
###### Example 1: datasets can be gathered into a dossier which is a list.
dossier <- dossier_create(
dataset_list = list(
dataset_MELBOURNE = madshapR_DEMO$dataset_MELBOURNE,
dataset_PARIS = madshapR_DEMO$dataset_PARIS ))
+
+glimpse(dossier)
-###### Example 2: any data frame can be gathered into a dossier
-dossier_create(list(iris, mtcars))
+###### Example 2: Any data frame can be gathered into a dossier
+glimpse(dossier_create(list(iris, mtcars)))
}
-#> $iris
-#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
-#> 1 5.1 3.5 1.4 0.2 setosa
-#> 2 4.9 3.0 1.4 0.2 setosa
-#> 3 4.7 3.2 1.3 0.2 setosa
-#> 4 4.6 3.1 1.5 0.2 setosa
-#> 5 5.0 3.6 1.4 0.2 setosa
-#> 6 5.4 3.9 1.7 0.4 setosa
-#> 7 4.6 3.4 1.4 0.3 setosa
-#> 8 5.0 3.4 1.5 0.2 setosa
-#> 9 4.4 2.9 1.4 0.2 setosa
-#> 10 4.9 3.1 1.5 0.1 setosa
-#> 11 5.4 3.7 1.5 0.2 setosa
-#> 12 4.8 3.4 1.6 0.2 setosa
-#> 13 4.8 3.0 1.4 0.1 setosa
-#> 14 4.3 3.0 1.1 0.1 setosa
-#> 15 5.8 4.0 1.2 0.2 setosa
-#> 16 5.7 4.4 1.5 0.4 setosa
-#> 17 5.4 3.9 1.3 0.4 setosa
-#> 18 5.1 3.5 1.4 0.3 setosa
-#> 19 5.7 3.8 1.7 0.3 setosa
-#> 20 5.1 3.8 1.5 0.3 setosa
-#> 21 5.4 3.4 1.7 0.2 setosa
-#> 22 5.1 3.7 1.5 0.4 setosa
-#> 23 4.6 3.6 1.0 0.2 setosa
-#> 24 5.1 3.3 1.7 0.5 setosa
-#> 25 4.8 3.4 1.9 0.2 setosa
-#> 26 5.0 3.0 1.6 0.2 setosa
-#> 27 5.0 3.4 1.6 0.4 setosa
-#> 28 5.2 3.5 1.5 0.2 setosa
-#> 29 5.2 3.4 1.4 0.2 setosa
-#> 30 4.7 3.2 1.6 0.2 setosa
-#> 31 4.8 3.1 1.6 0.2 setosa
-#> 32 5.4 3.4 1.5 0.4 setosa
-#> 33 5.2 4.1 1.5 0.1 setosa
-#> 34 5.5 4.2 1.4 0.2 setosa
-#> 35 4.9 3.1 1.5 0.2 setosa
-#> 36 5.0 3.2 1.2 0.2 setosa
-#> 37 5.5 3.5 1.3 0.2 setosa
-#> 38 4.9 3.6 1.4 0.1 setosa
-#> 39 4.4 3.0 1.3 0.2 setosa
-#> 40 5.1 3.4 1.5 0.2 setosa
-#> 41 5.0 3.5 1.3 0.3 setosa
-#> 42 4.5 2.3 1.3 0.3 setosa
-#> 43 4.4 3.2 1.3 0.2 setosa
-#> 44 5.0 3.5 1.6 0.6 setosa
-#> 45 5.1 3.8 1.9 0.4 setosa
-#> 46 4.8 3.0 1.4 0.3 setosa
-#> 47 5.1 3.8 1.6 0.2 setosa
-#> 48 4.6 3.2 1.4 0.2 setosa
-#> 49 5.3 3.7 1.5 0.2 setosa
-#> 50 5.0 3.3 1.4 0.2 setosa
-#> 51 7.0 3.2 4.7 1.4 versicolor
-#> 52 6.4 3.2 4.5 1.5 versicolor
-#> 53 6.9 3.1 4.9 1.5 versicolor
-#> 54 5.5 2.3 4.0 1.3 versicolor
-#> 55 6.5 2.8 4.6 1.5 versicolor
-#> 56 5.7 2.8 4.5 1.3 versicolor
-#> 57 6.3 3.3 4.7 1.6 versicolor
-#> 58 4.9 2.4 3.3 1.0 versicolor
-#> 59 6.6 2.9 4.6 1.3 versicolor
-#> 60 5.2 2.7 3.9 1.4 versicolor
-#> 61 5.0 2.0 3.5 1.0 versicolor
-#> 62 5.9 3.0 4.2 1.5 versicolor
-#> 63 6.0 2.2 4.0 1.0 versicolor
-#> 64 6.1 2.9 4.7 1.4 versicolor
-#> 65 5.6 2.9 3.6 1.3 versicolor
-#> 66 6.7 3.1 4.4 1.4 versicolor
-#> 67 5.6 3.0 4.5 1.5 versicolor
-#> 68 5.8 2.7 4.1 1.0 versicolor
-#> 69 6.2 2.2 4.5 1.5 versicolor
-#> 70 5.6 2.5 3.9 1.1 versicolor
-#> 71 5.9 3.2 4.8 1.8 versicolor
-#> 72 6.1 2.8 4.0 1.3 versicolor
-#> 73 6.3 2.5 4.9 1.5 versicolor
-#> 74 6.1 2.8 4.7 1.2 versicolor
-#> 75 6.4 2.9 4.3 1.3 versicolor
-#> 76 6.6 3.0 4.4 1.4 versicolor
-#> 77 6.8 2.8 4.8 1.4 versicolor
-#> 78 6.7 3.0 5.0 1.7 versicolor
-#> 79 6.0 2.9 4.5 1.5 versicolor
-#> 80 5.7 2.6 3.5 1.0 versicolor
-#> 81 5.5 2.4 3.8 1.1 versicolor
-#> 82 5.5 2.4 3.7 1.0 versicolor
-#> 83 5.8 2.7 3.9 1.2 versicolor
-#> 84 6.0 2.7 5.1 1.6 versicolor
-#> 85 5.4 3.0 4.5 1.5 versicolor
-#> 86 6.0 3.4 4.5 1.6 versicolor
-#> 87 6.7 3.1 4.7 1.5 versicolor
-#> 88 6.3 2.3 4.4 1.3 versicolor
-#> 89 5.6 3.0 4.1 1.3 versicolor
-#> 90 5.5 2.5 4.0 1.3 versicolor
-#> 91 5.5 2.6 4.4 1.2 versicolor
-#> 92 6.1 3.0 4.6 1.4 versicolor
-#> 93 5.8 2.6 4.0 1.2 versicolor
-#> 94 5.0 2.3 3.3 1.0 versicolor
-#> 95 5.6 2.7 4.2 1.3 versicolor
-#> 96 5.7 3.0 4.2 1.2 versicolor
-#> 97 5.7 2.9 4.2 1.3 versicolor
-#> 98 6.2 2.9 4.3 1.3 versicolor
-#> 99 5.1 2.5 3.0 1.1 versicolor
-#> 100 5.7 2.8 4.1 1.3 versicolor
-#> 101 6.3 3.3 6.0 2.5 virginica
-#> 102 5.8 2.7 5.1 1.9 virginica
-#> 103 7.1 3.0 5.9 2.1 virginica
-#> 104 6.3 2.9 5.6 1.8 virginica
-#> 105 6.5 3.0 5.8 2.2 virginica
-#> 106 7.6 3.0 6.6 2.1 virginica
-#> 107 4.9 2.5 4.5 1.7 virginica
-#> 108 7.3 2.9 6.3 1.8 virginica
-#> 109 6.7 2.5 5.8 1.8 virginica
-#> 110 7.2 3.6 6.1 2.5 virginica
-#> 111 6.5 3.2 5.1 2.0 virginica
-#> 112 6.4 2.7 5.3 1.9 virginica
-#> 113 6.8 3.0 5.5 2.1 virginica
-#> 114 5.7 2.5 5.0 2.0 virginica
-#> 115 5.8 2.8 5.1 2.4 virginica
-#> 116 6.4 3.2 5.3 2.3 virginica
-#> 117 6.5 3.0 5.5 1.8 virginica
-#> 118 7.7 3.8 6.7 2.2 virginica
-#> 119 7.7 2.6 6.9 2.3 virginica
-#> 120 6.0 2.2 5.0 1.5 virginica
-#> 121 6.9 3.2 5.7 2.3 virginica
-#> 122 5.6 2.8 4.9 2.0 virginica
-#> 123 7.7 2.8 6.7 2.0 virginica
-#> 124 6.3 2.7 4.9 1.8 virginica
-#> 125 6.7 3.3 5.7 2.1 virginica
-#> 126 7.2 3.2 6.0 1.8 virginica
-#> 127 6.2 2.8 4.8 1.8 virginica
-#> 128 6.1 3.0 4.9 1.8 virginica
-#> 129 6.4 2.8 5.6 2.1 virginica
-#> 130 7.2 3.0 5.8 1.6 virginica
-#> 131 7.4 2.8 6.1 1.9 virginica
-#> 132 7.9 3.8 6.4 2.0 virginica
-#> 133 6.4 2.8 5.6 2.2 virginica
-#> 134 6.3 2.8 5.1 1.5 virginica
-#> 135 6.1 2.6 5.6 1.4 virginica
-#> 136 7.7 3.0 6.1 2.3 virginica
-#> 137 6.3 3.4 5.6 2.4 virginica
-#> 138 6.4 3.1 5.5 1.8 virginica
-#> 139 6.0 3.0 4.8 1.8 virginica
-#> 140 6.9 3.1 5.4 2.1 virginica
-#> 141 6.7 3.1 5.6 2.4 virginica
-#> 142 6.9 3.1 5.1 2.3 virginica
-#> 143 5.8 2.7 5.1 1.9 virginica
-#> 144 6.8 3.2 5.9 2.3 virginica
-#> 145 6.7 3.3 5.7 2.5 virginica
-#> 146 6.7 3.0 5.2 2.3 virginica
-#> 147 6.3 2.5 5.0 1.9 virginica
-#> 148 6.5 3.0 5.2 2.0 virginica
-#> 149 6.2 3.4 5.4 2.3 virginica
-#> 150 5.9 3.0 5.1 1.8 virginica
-#>
-#> $mtcars
-#> mpg cyl disp hp drat wt qsec vs am gear carb
-#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
-#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
-#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
-#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
-#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
-#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
-#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
-#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
-#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
-#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
-#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
-#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
-#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
-#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
-#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
-#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
-#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
-#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
-#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
-#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
-#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
-#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
-#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
-#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
-#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
-#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
-#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
-#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
-#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
-#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
-#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
-#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
-#>
-#> attr(,"madshapR::class")
-#> [1] "dossier"
+#> List of 2
+#> $ dataset_MELBOURNE: tibble [19 × 6] (S3: tbl_df/tbl/data.frame)
+#> ..$ id : num [1:19] 377943 497013 927676 995667 21829 ...
+#> ..$ Gender : num [1:19] 2 1 1 2 2 1 2 2 2 1 ...
+#> ..$ BMI : num [1:19] 2.21e+02 1.85e+09 2.46e+09 1.57e+09 1.86e+09 ...
+#> ..$ age : num [1:19] 52 49 43 59 40 47 -888 53 35 40 ...
+#> ..$ smo_status: num [1:19] 1 2 3 -77 NA 2 -77 2 1 1 ...
+#> ..$ prg_curr : num [1:19] 0 -77 -77 1 0 -77 8 0 0 -77 ...
+#> ..- attr(*, "madshapR::class")= chr "dataset"
+#> $ dataset_PARIS : tibble [24 × 7] (S3: tbl_df/tbl/data.frame)
+#> ..$ ID : chr [1:24] "Paris_687393" "Paris_585666" "Paris_75802" "Paris_412072" ...
+#> ..$ SEX : num [1:24] 1 0 0 1 1 0 1 1 1 0 ...
+#> ..$ BMI : num [1:24] 2.22e+09 1.52e+09 2.27e+09 NA 2.62e+09 ...
+#> ..$ AGE : num [1:24] 52 49 43 59 40 47 46 53 35 NA ...
+#> ..$ SMO : num [1:24] 1 1 1 1 1 1 1 0 1 0 ...
+#> ..$ SMO_QTY : num [1:24] 32 8 48 11 18 7 18 -8 36 -8 ...
+#> ..$ PRG_EVER: num [1:24] 0 -8 -8 0 1 -8 NA 1 1 -8 ...
+#> ..- attr(*, "madshapR::class")= chr "dataset"
+#> - attr(*, "madshapR::class")= chr "dossier"
+#> List of 2
+#> $ iris :'data.frame': 150 obs. of 5 variables:
+#> ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
+#> ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
+#> ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
+#> ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
+#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
+#> ..- attr(*, "madshapR::class")= chr "dataset"
+#> $ mtcars:'data.frame': 32 obs. of 11 variables:
+#> ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
+#> ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
+#> ..$ disp: num [1:32] 160 160 108 258 360 ...
+#> ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
+#> ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
+#> ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
+#> ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
+#> ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
+#> ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
+#> ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
+#> ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
+#> ..- attr(*, "madshapR::class")= chr "dataset"
+#> - attr(*, "madshapR::class")= chr "dossier"
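The glimpse() output above shows that each element of a dossier carries a "madshapR::class" attribute set to "dataset", while the enclosing list carries the same attribute set to "dossier". The base-R sketch below only mimics that printed structure. The helper tag_as() is hypothetical and introduced for illustration; the real dossier_create() may perform additional checks that are not shown here.

```r
# Hypothetical helper: attach the attribute seen in the printed output.
tag_as <- function(x, class_name) {
  attr(x, "madshapR::class") <- class_name
  x
}

# A named list of tagged data frames, itself tagged as a dossier.
dossier_like <- tag_as(
  list(
    iris   = tag_as(iris, "dataset"),
    mtcars = tag_as(mtcars, "dataset")
  ),
  "dossier"
)

str(dossier_like, max.level = 1)
```

Because the tag is a plain attribute rather than an S3 class, the data frames keep behaving as ordinary data frames in base R and tidyverse functions.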
R/06-data_evaluate.R
dossier_evaluate.Rd
Assesses the content and structure of a dossier object (list of -datasets) and reports possible issues in the datasets and data dictionaries -to facilitate assessment of input data. -The report can be used to help assess data structure, presence of fields, -coherence across elements, and taxonomy or data dictionary formats.This -report is compatible with Excel and can be exported as an Excel spreadsheet.
+Assesses the content and structure of a dossier object (list of datasets) +and generates reports of the results. This function can be used to evaluate +data structure, presence of specific fields, coherence across elements, and +data dictionary formats.
An optional data frame identifying a variable -classification schema.
An optional data frame identifying a variable classification +schema.
Whether the output data dictionary has a simple -data dictionary structure or not (meaning has a Maelstrom data dictionary -structure, compatible with Maelstrom Research ecosystem, including Opal). -TRUE by default.
Whether the input data dictionary should be coerced +with specific format restrictions for compatibility with other +Maelstrom Research software. TRUE by default.
A list of data frames of report for each dataset.
+A list of data frames containing assessment reports.
A dossier is a named list containing at least one data frame or more, -each of them being datasets. The name of each tibble will be used as the +each of them being datasets. The name of each data frame will be used as the reference name of the dataset.
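The structural rule described above (a named list holding one or more data frames) can be illustrated with base R alone. This is a minimal sketch: `looks_like_dossier()` is a hypothetical helper written for this example and is not a madshapR function, and it checks only the list structure, not the class attribute or any data dictionary.

```r
# Two small datasets taken from base R; any data frames would do.
dataset_list <- list(
  mtcars = head(mtcars),
  iris   = head(iris)
)

# Hypothetical helper (not part of madshapR): checks that the input is a
# non-empty, fully named list whose elements are all data frames.
looks_like_dossier <- function(x) {
  is.list(x) && !is.data.frame(x) &&
    length(x) >= 1 &&
    !is.null(names(x)) && all(nzchar(names(x))) &&
    all(vapply(x, is.data.frame, logical(1)))
}

looks_like_dossier(dataset_list)

# The list names serve as the reference names of the datasets.
names(dataset_list)
```

In the package itself, validation of this kind is the role of the functions listed in the reference index, such as the test for a valid dossier.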
A taxonomy is a classification schema that can be defined for variable attributes. A taxonomy is usually extracted from an @@ -114,7 +109,7 @@
The object may be specifically formatted to be compatible with additional Maelstrom Research software, -in particular Opal environments.
+in particular Opal environments.
R/07-data_summarise.R
dossier_summarize.Rd
Assesses and summarizes the content and structure of a dossier -(list of datasets) and reports potential issues to facilitate the -assessment of input data. The report can be used to help assess data -structure, presence of fields, coherence across elements, and taxonomy or -data dictionary formats. The summary provides additional information about -variable distributions and descriptive statistics. This report is compatible -with Excel and can be exported as an Excel spreadsheet.
+(list of datasets) and generates reports of the results. This function can +be used to evaluate data structure, presence of specific fields, coherence +across elements, and data dictionary formats, and to summarize additional +information about variable distributions and descriptive statistics.
A character string identifying the column in the datasets -to use as a grouping variable. Visual elements will be grouped by this +
A character string identifying the column in the dataset +to use as a grouping variable. Elements will be grouped by this column.
An optional data frame identifying a variable -classification schema.
An optional data frame identifying a variable classification +schema.
Whether the output should be generated based on more -precise valueType inferred from the data. FALSE by default -(will use the valueType declared).
Whether the output should include a more accurate +valueType that could be applied to the dataset. FALSE by default.
A list of data frames of report for each listed dataset.
+A list of data frames containing overall assessment reports and summaries grouped by dataset.
A dossier is a named list containing at least one data frame or more, -each of them being datasets. The name of each tibble will be used as the +each of them being datasets. The name of each data frame will be used as the reference name of the dataset.
A taxonomy is a classification schema that can be defined for variable attributes. A taxonomy is usually extracted from an @@ -127,12 +122,13 @@
The valueType is a declared property of a variable that is required in certain functions to determine handling of the variables. Specifically, valueType refers to the -OBiBa data type of a variable. +OBiBa data type of a variable. The valueType is specified in a data dictionary in a column 'valueType' and can be associated with variables as attributes. Acceptable valueTypes include 'text', 'integer', 'decimal', 'boolean', 'datetime', 'date'. The full list of OBiBa valueType possibilities and their correspondence with R data -types are available using valueType_list.
+types are available using valueType_list. The valueType can be used to +coerce the variable to the corresponding data type.
R/experimental.R
+ drop_category.Rd
+Converts a vector object to a non-categorical object, typically a column in a +data frame. The categories come from non-missing values present in the +object and are suppressed from an associated data dictionary (when present).
+drop_category(x)
object to be coerced.
An R object.
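As a rough base R analogue of the conversion described for `drop_category(x)`, the sketch below turns a factor column into a plain vector. It does not call madshapR and does not touch any data dictionary; the variable names are made up for illustration.

```r
# A categorical column stored as a factor, with one missing value.
gear <- factor(c(4, 4, 3, 5, NA))

# Base R analogue, not madshapR's drop_category(): keep the observed values
# and discard the category (level) metadata.
gear_plain <- as.numeric(as.character(gear))

is.factor(gear_plain)
gear_plain
```

Going through `as.character()` first matters here: calling `as.numeric()` directly on a factor returns the internal level codes rather than the original values.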
+Validate and coerce any object as a categorical variable.
Validate and coerce an object to dataset format
Validate and coerce any object as a dataset
Validate and coerce any object as data dictionary
Validate and coerce any object as a data dictionary
Validate and coerce an object to an Opal data dictionary format
Validate and coerce any object as an Opal data dictionary format
Validate and coerce an object to a workable data dictionary structure
Validate and coerce any object as a workable data dictionary structure
Validate and coerce an object to dossier format
Validate and coerce any object as a dossier (list of dataset(s))
Validate and coerce an object to taxonomy format
Validate and coerce any object as a taxonomy
Validate and coerce an object according to a given valueType
Validate and coerce any object according to a given valueType
Generate a quality assessment report of a dataset
Generate an assessment report for a dataset
Generate a report and summary of a dataset
Generate an assessment report and summary of a dataset
Generate a web-based bookdown visual report of a dataset
Generate a web-based visual report for a dataset
Generate a quality assessment report of a data dictionary
Generate an assessment report for a data dictionary
Create a data dictionary from a dataset
Generate a data dictionary from a dataset
Create a dossier object from a list of dataset(s)
Generate a dossier from a list of one or more datasets
Generate a quality assessment report of a dossier (list of datasets)
Generate an assessment report of a dossier
Generate a report and summary of a dossier (list of datasets)
Generate an assessment report and summary of a dossier
Validate and coerce any object as a non-categorical variable.
Test if an object is a valid dataset
Test if an object is a valid dossier
Test if an object is a valid dossier (list of dataset(s))
Tests if the input object is a valid dataset. This function mainly helps +validate input within other functions of the package but could be used +to check if a dataset is valid.
++Test if a vector object is a categorical variable, typically a column in a +data frame. This function mainly helps validate input within other functions +of the package.
+is_category(x, threshold = NULL)
object to be coerced.
Optional. The function returns TRUE if the number of unique +values in the input vector is lower than this threshold.
A logical.
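The `threshold` rule can be pictured as a count of distinct values compared against a cutoff. The following is a minimal base R sketch of that idea only: `has_few_unique_values()` is a hypothetical helper, not the implementation of `is_category()`, and both the strict comparison and the exclusion of missing values are assumptions made for this example.

```r
# Hypothetical helper (not madshapR's is_category()): TRUE when the number of
# distinct non-missing values is strictly below `threshold`.
has_few_unique_values <- function(x, threshold) {
  n_unique <- length(unique(x[!is.na(x)]))
  n_unique < threshold
}

# cyl takes only the values 4, 6 and 8, so it falls under a cutoff of 5.
has_few_unique_values(mtcars$cyl, threshold = 5)

# mpg has many distinct values, so it does not.
has_few_unique_values(mtcars$mpg, threshold = 5)
```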
+A dataset is a data table containing variables. A dataset object is a data frame and can be associated with a data dictionary. If no data dictionary is provided with a dataset, a minimum workable -data dictionary will be generated as needed within relevant functions. An -identifier variable(s) for indexing can be specified by the user. +data dictionary will be generated as needed within relevant functions. +Identifier variable(s) for indexing can be specified by the user. The id values must be non-missing and will be used in functions that require it. If no identifier variable is specified, indexing is handled automatically by the function.
diff --git a/docs/reference/is_dossier.html b/docs/reference/is_dossier.html index 5df5e52..7be9e1f 100644 --- a/docs/reference/is_dossier.html +++ b/docs/reference/is_dossier.html @@ -1,5 +1,5 @@ -