# Test projects
The three following case studies are tested in detail within FAKIN (i.e. the proposed
best practices will be applied to these case studies and it will be cross-checked
whether their application is useful).
## Geogenic Salination
Two types of datasets are handled in the Geosalz project: spreadsheets of mostly
hydrochemical laboratory analyses and archived paper files. Accordingly, the
`rawdata` and `processing` folders are divided into two parts, `labor` and `archiv`.
**Data processing**
- Archive data
  * Import into [Endnote DB](#endnote)
  * Extract data values from pictures
- Lab data
  * Manual pre-processing: data cleaning, required due to heterogeneous formats
  * Automated workflow: data import, data aggregation and export with R
**Folder structure**
```
//server/rawdata
  geosalz
    BWB_archiv
    BWB_labor
    README.yml
//server/processing
  geosalz
    archiv
    labor
      README.yml
      documents/
      precleaned-data/
      cleaned-data/
      figures/
      <rawdata.lnk> (link to "//server/rawdata/geosalz/BWB_labor")
      <results.lnk> (link to "//server/results/geosalz/report/")
//server/results
  geosalz
    admin
    reports
      final_report.docx
      <processing.lnk> (link to "//server/processing/geosalz/labor/cleaned-data/v1.0")
      README.yml
```
Folder names indicate the owner of the data, here BWB. The `README.yml` gives
information on the licensing of the data; in this case, the data are for restricted use only.
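A minimal `README.yml` could look as follows; the field names are only a sketch,
not a fixed schema:
```
project: geosalz
data_owner: BWB
license: restricted use only
```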
The `BWB_labor` folder contains the raw data files together with a `METADATA.yml`.
The `METADATA.yml` comprises information on the origin of each file. In this case
the data were received by email, so each email is exported as a txt file (`select` -
`export as…`) and copied into the `METADATA.yml`. The `METADATA.yml` also contains
the email text itself, which may provide additional meta information. The `METADATA.yml`
thus makes clear when and by whom the data were sent and who received them.
A hyperlink can be inserted that links directly to the corresponding `processing`
folder. No further subfolders are required.
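An entry in the `METADATA.yml` could, for example, record per file (all field
names and values below are hypothetical placeholders):
```
- file: labordaten_2017.xlsx
  received: 2017-05-03
  from: data.owner@bwb-example.de
  to: researcher@kwb-example.de
  source_email: 2017-05-03_mail.txt
```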
**Workflow for creating the above folder structure**
1. Define the project acronym `geosalz` (add it to `PROJECTS.yml`)
2. Create the initial folder structure on `//servername/rawdata/geosalz`
3. Create the initial folder structure on `//servername/processing/geosalz` with one
subfolder for each task/work package, i.e. `archiv` and `labor`
4. Create a `README.yml` for each task describing the folder contents (steps 2 to 4
can be scripted; see the R sketch below)
5. Link relevant results to `//server/projekte/AUFTRAEGE/Rahmenvertrag GRW-WV/Data and documents/Versalzung`
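The following minimal R sketch illustrates how steps 2 to 4 could be scripted;
the helper name `create_project_folders`, the server path and the fixed set of
task folders are illustrative assumptions, not a prescribed FAKIN tool:

```{r create-folders, eval=FALSE}
# Sketch: create the initial folder structure and a README.yml stub
# for each task of a new project (helper name and paths are illustrative)
create_project_folders <- function(acronym, server = "//servername",
                                   tasks = c("archiv", "labor")) {
  for (top in c("rawdata", "processing")) {
    for (task in tasks) {
      task_dir <- file.path(server, top, acronym, task)
      dir.create(task_dir, recursive = TRUE, showWarnings = FALSE)
      # README.yml stub describing the folder contents
      writeLines("# describe the folder contents here",
                 file.path(task_dir, "README.yml"))
    }
  }
}

create_project_folders("geosalz")
```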
**Lessons learnt:**
* Manual data preparation is indispensable for heterogeneous data sets
* Naming conventions for large heterogeneous data sets do not always comply
with the [best-practices](#naming)
<!-- Adapt and test with new folder drive workflow as proposed -->
<!-- - `rawdata` -->
<!-- - `processing` -->
## LCA Modelling
**Challenge:**
The LCA modelling software [Umberto](https://www.ifu.com/umberto/) can produce
large raw data output files (> 300 MB csv files) that are sometimes even too big
for EXCEL 2010 (> 1 million rows) but need to be aggregated (e.g. grouped
by specific criteria). This aggregation was usually performed manually within
EXCEL, provided that the model output data were below EXCEL's 1 million row limit.
**Workflow improvement developed within FAKIN:**
An open source R package [kwb.umberto](https://kwb-r.github.io/kwb.umberto)
was programmed to automate:
- importing the Umberto model results,
- aggregating the data according to the user's needs, and finally
- exporting the aggregated results to a `results.xlsx` EXCEL spreadsheet.
This `results.xlsx` EXCEL spreadsheet is referenced by another EXCEL
spreadsheet, `figures.xlsx`, which contains the figure templates and just links
to `results.xlsx` in order to update the predefined figures.
This workflow replaces the formerly manual, time-consuming and error-prone
data aggregation in EXCEL, whilst still enabling the users to adapt
the figures to their needs without coding knowledge.
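The following R sketch illustrates this import-aggregate-export pattern in
general terms. The file names and the grouping columns (`process`, `material`,
`value`) are assumptions for illustration; this is not the actual kwb.umberto API:

```{r lca-pattern, eval=FALSE}
library(openxlsx)  # CRAN package for writing .xlsx files

# import: read the (potentially very large) raw Umberto csv export
umberto_raw <- read.csv("umberto_output.csv")

# aggregate: e.g. sum up the values grouped by specific criteria
results <- aggregate(value ~ process + material, data = umberto_raw, FUN = sum)

# export: write the aggregated results, to be referenced by figures.xlsx
write.xlsx(results, "results.xlsx")
```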
## Pilot Plants {#aquanes}
**Challenges:**
The output of (on-line) monitoring technologies is often difficult to interpret
and also inconvenient to handle, as the output formats of different devices
(in one water treatment scheme) can vary strongly. Furthermore, frequent
reporting and documentation of the treatment performance via (on-line)
monitoring can be time-consuming for the personnel and requires advanced
software solutions. An alternative to commercial (and often expensive) software
solutions are tools based on the open source software [R](https://www.r-project.org/) `r citep(citation())`. The free software approach allows any [R](https://www.r-project.org/) programmer to produce customised tools for each individual end-user.
Thus, an automated reporting tool was developed within the [AQUANES](http://aquanes-h2020.eu/Default.aspx?t=1593) project, enabling an
integrative assessment of the different monitoring devices together with
water quality data obtained from laboratory analyses, for four different
pilot plant sites, in order to:
- Increase the reliability and reproducibility of handling large amounts of data
by reducing the likelihood of human error in complex systems and by increasing
the transparency of the data processing.
- Promote the use of customised R tools for different end-users such as utilities,
consultants and other research teams.
Therefore the open source R package [aquanes.report](https://kwb-r.github.io/aquanes.report)
`r citep(manual["Rustler_2018"])` was programmed, which is able to:
- import operational and lab data for each pilot site,
- perform temporal aggregation (e.g. 5 min, 1 h or 24 h median values; see the
sketch after this list), and
- visualise raw or aggregated data, either interactively in a web browser or by
creating a standardised report (e.g. monthly) in html, pdf or docx format.
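A minimal sketch of the temporal aggregation step (here: hourly median values)
in base R; the example data and the column names are illustrative and do not
reflect the actual aquanes.report API:

```{r temporal-aggregation, eval=FALSE}
# Illustrative raw data: one measurement every 2 minutes for one day
raw <- data.frame(
  dateTime = seq(as.POSIXct("2018-01-01 00:00"), by = "2 min", length.out = 720),
  turbidity = runif(720)
)

# assign each timestamp to its hour and compute the median per hour
raw$hour <- cut(raw$dateTime, breaks = "1 hour")
hourly <- aggregate(turbidity ~ hour, data = raw, FUN = median)
```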
For the four different pilot plant sites, the data (operational and lab data)
to be imported into the R tool came from various sources at different temporal
resolutions, which are detailed below:
- [Haridwar](http://aquanes-h2020.eu/Default.aspx?t=1668): operational data are
stored by [Autarcon](http://www.autarcon.com/) in a MySQL database (temporal
resolution: ~ 2-3 minutes, i.e. ~ 0.7 million data points per month), which is
accessible from the web and thus could be easily imported into R. Lab data were
initially provided by Autarcon in an unstructured format, which made an automated
import into R impossible. However, after agreeing on a standardised EXCEL
spreadsheet format (e.g. one spreadsheet per site, one sheet per parameter and
additional sheets providing metadata for parameters and sites; see the layout
sketch after this list) it was possible to integrate the lab data into the R tool.
- [Basel Lange-Erlen](http://aquanes-h2020.eu/Default.aspx?t=1663): operational
data are provided by the water supplier in EXCEL spreadsheets on a weekly basis
for each site (i.e. "Rein" and "Wiese") with a temporal resolution of 5 minutes
(i.e. ~ 0.5 million data points per month). Lab data are provided by the water
supplier in a single comma-separated csv file, which is exported from a database.
The structure of the lab data was thus standardised and could be easily imported
into the R tool.
- [Berlin-Schönerlinde](http://aquanes-h2020.eu/Default.aspx?t=1666): operational
data from the WEDECO pilot plant are collected using a SCADA system
(temporal resolution: ~ seconds, i.e. ~ 10 million data points per month). Lab data
are provided by BWB in a single EXCEL spreadsheet. However, its structure often
changes when it is updated by BWB, making an automated import using R impossible
without adapting the import functions. Thus lab data were not integrated into the
R tool for this site.
- [Berlin-Tiefwerder](http://aquanes-h2020.eu/Default.aspx?t=1576): operational
data from the PENTAIR pilot plant are collected using a SCADA system
(temporal resolution: ~ seconds, i.e. ~ 10 million data points per month).
Lab data are provided by BWB in a single EXCEL spreadsheet. However, its structure
often changes when it is updated by BWB, making an automated import using
R impossible without adapting the import functions. Thus lab data were not
integrated into the R tool for this site.
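For illustration, the standardised lab data spreadsheet format agreed on for
Haridwar could look as follows (the file, sheet and parameter names are hypothetical):
```
haridwar_lab_data.xlsx          one spreadsheet per site
  sheet "turbidity"             one sheet per parameter (dateTime, value)
  sheet "pH"
  sheet "parameters"            metadata for parameters (name, unit, method)
  sheet "sites"                 metadata for sampling sites
```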
```{block2 type="rmdwarning"}
The high temporal resolution (~ seconds) of the operational data for both
Berlin pilot plants resulted in large data amounts (~ 10 million data points
per month), which required a large effort to optimise the performance of the
R reporting tool in order to enable the visualisation of the pilot plants' raw
data for the test operation period (~ 18 months) on computers with limited RAM
resources (~ 8-12 GB).
```
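One generic measure for handling such data volumes (a sketch of the idea, not
necessarily what aquanes.report does internally) is to import only the required
columns with `data.table::fread()` and to aggregate before plotting, so that the
full raw resolution never has to be held at plot time; the file and column names
below are illustrative:

```{r large-data, eval=FALSE}
library(data.table)  # CRAN package with a fast csv reader

# fast import, restricted to the columns that are actually needed
raw <- fread("scada_export.csv", select = c("dateTime", "flow"))
raw[, dateTime := as.POSIXct(dateTime)]

# aggregate to hourly medians before plotting to reduce the data volume
raw[, hour := as.POSIXct(format(dateTime, "%Y-%m-%d %H:00:00"))]
hourly <- raw[, .(flow = median(flow)), by = hour]
plot(hourly$hour, hourly$flow, type = "l")
```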
The R tool is used regularly by KWB (for the sites [Berlin-Schönerlinde](http://aquanes-h2020.eu/Default.aspx?t=1666)
and [Berlin-Tiefwerder](http://aquanes-h2020.eu/Default.aspx?t=1576))
for assessing the pilot plants' operational performance interactively. In addition,
for advanced assessments only the data import and aggregation routines are used,
combined with R scripts developed by KWB students.
For the other two pilot plant sites, [Haridwar](http://aquanes-h2020.eu/Default.aspx?t=1668)
and [Basel Lange-Erlen](http://aquanes-h2020.eu/Default.aspx?t=1663), the AQUANES
project partners use the automated R reporting tool in a similar way.