## Data Publishing and Sharing
```{r fig.align='center', echo=FALSE, include=grepl('html', knitr:::pandoc_to()), fig.cap= 'Modern life context for the ten simple rules [@Boland_2017]', fig.link='https://doi.org/10.1371/journal.pcbi.1005278/'}
knitr::include_graphics("images/tensimplerules_datasharing.png", dpi = NA)
```
```{r fig.align='center', echo=FALSE, fig.width = 5, fig.height = 5, include=grepl('latex', knitr:::pandoc_to()), fig.cap= 'Modern life context for the ten simple rules \\citep{Boland_2017}', fig.link='https://doi.org/10.1371/journal.pcbi.1005278/'}
knitr::include_graphics("images/tensimplerules_datasharing.png", dpi = NA)
```
```{r fig.align='center', echo=FALSE, fig.width = 5, fig.height = 5, include=grepl('docx|epub', knitr:::pandoc_to()), fig.cap= 'Modern life context for the ten simple rules [@Boland_2017]', fig.link='https://doi.org/10.1371/journal.pcbi.1005278/'}
knitr::include_graphics("images/tensimplerules_datasharing.png", dpi = NA)
```

>"This figure provides a framework for understanding how the “Ten Simple Rules to
>Enable Multi-site Collaborations through Data Sharing” `r citep("10.1371/journal.pcbi.1005278")`
>can be translated into easily understood modern life concepts.
>
>**Rule 1** is Open-Source Software. The openness is signified by a window to a room
>filled with algorithms that are represented by gears.
>
>**Rule 2** involves making the source data available whenever possible. Source data
>can be very useful for researchers. However, data are often housed in
>institutions and are not publicly accessible. These files are often stored
>externally; therefore, we depict this as a shed or storehouse of data, which,
>if possible, should be provided to research collaborators.
>
>**Rule 3** is to “use multiple platforms to share research products.” This
>increases the chances that other researchers will find and be able to utilize your
>research product—this is represented by multiple locations (i.e., shed and house).
>
>**Rule 4** involves the need to secure all necessary permissions a priori. Many
>datasets have data use agreements that restrict usage. These restrictions can
>sometimes prevent researchers from performing certain types of analyses or
>publishing in certain journals (e.g., journals that require all data to be
>openly accessible); therefore, we represent this rule as a key that can lock or
>unlock the door of your research.
>
>**Rule 5** discusses the privacy issues that surround source data. Researchers
>need to understand what they can and cannot do (i.e., the privacy rules) with
>their data. Privacy often requires allowing certain users to have access to
>sections of data while restricting access to other sections of data. Researchers
>need to understand what can and cannot be revealed about their data (i.e., when
>to open and close the curtains).
>
>**Rule 6** is to facilitate reproducibility whenever possible. Since
>communication is the forte of reproducibility, we depicted it as two
>researchers sharing a giant scroll, because data documentation is required and
>is often substantial.
>
>**Rule 7** is to “think global.” We conceptualize this as a cloud. This cloud
>allows the research property (i.e., the house and shed) to be accessed across
>large distances.
>
>**Rule 8** is to publicize your work. Think of it as “shouting from the rooftops.”
>Publicizing is critical for enabling other researchers to access your research
>product.
>
>**Rule 9** is to “stay realistic.” It is important for researchers to “stay
>grounded” and resist the urge to overstate the claims made by their research.
>
>**Rule 10** is to be engaged, and this is depicted as a person waving an “I heart
>research” sign. It is vitally important to stay engaged and enthusiastic about
>one’s research. This enables you to draw others to care about your research."
>
> ---- `r citep("10.1371/journal.pcbi.1005278")`


Recommended literature:

- [Ten simple rules to enable multi-site collaborations through data sharing](https://doi.org/10.1371/journal.pcbi.1005278)
  `r citep("10.1371/journal.pcbi.1005278")`
- Guidelines for publishing (PhD) research data `r citep(manual["Kaden_2018"])`

### Repositories

Repositories for permanently depositing data include, for example:

- General
    + [Figshare](https://figshare.com),
    + [Zenodo](https://zenodo.org/) (a joint project between
      [OpenAIRE](https://www.openaire.eu/) and [CERN](https://home.cern/)),
    + [Mendeley data](https://data.mendeley.com/),
    + [Dataverse](https://dataverse.org/),
    + [Dryad](https://datadryad.org/)
- Focus on environmental and earth sciences
    + [PANGAEA](https://www.pangaea.de/)
    + [GFZ Potsdam data services](http://dataservices.gfz-potsdam.de/portal/)

Repositories for publishing program code are:

- [GitHub](https://github.com) or
- [GitLab](https://gitlab.com).

Neither offers long-term data preservation by default. With
[GitHub](https://github.com), however, it is possible to make the code citable by linking
it with [Zenodo](https://zenodo.org/) (see: https://guides.github.com/activities/citable-code/).

We are currently using the following three repositories for publishing program
code (mainly R packages):

- [GitHub](https://github.com): for developing and publishing program code (mainly
  R packages) we use [https://github.com/kwb-r](https://github.com/kwb-r). Currently
  81 (i.e. 38 public and 43 private) repositories are published on this GitHub
  account. For all 32 public R packages there is also a detailed status
  report available at https://kwb-r.github.io/status/, e.g. with information on license,
  documentation and the "health" of the R package (i.e. whether it can be
  successfully installed on Linux or Windows platforms).
- [Zenodo](https://zenodo.org/): for automatically getting a [DOI](https://www.doi.org/)
  for each software release made in one of our public GitHub repositories, e.g.
  [aquanes.report](https://doi.org/10.5281/zenodo.825029) (for details see: https://guides.github.com/activities/citable-code/)
  and
- [GitLab](https://gitlab.com): as backup mirror ([https://gitlab.com/kwb-r](https://gitlab.com/kwb-r))
  for all of the 81 (i.e. 34 public and 47 private) repositories currently
  published on our GitHub account ([https://github.com/kwb-r](https://github.com/kwb-r))
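
Once a release has a DOI, it can be cited like any other publication. The following is a minimal sketch of how such a citation could be built in R with `utils::bibentry()`, using the Zenodo DOI mentioned above. The title text is a placeholder chosen for illustration and the author field is left out on purpose, because neither is given here.

```r
# Sketch: a citation entry for a software release archived on Zenodo.
# The DOI is the one listed above; the title wording is a placeholder.
release_citation <- utils::bibentry(
  bibtype = "Misc",
  title = "aquanes.report (software release archived on Zenodo)",
  doi = "10.5281/zenodo.825029",
  url = "https://doi.org/10.5281/zenodo.825029"
)

# Export as BibTeX, e.g. for a reference manager
utils::toBibtex(release_citation)
```

Storing the DOI in the citation rather than the GitHub URL means the reference keeps pointing to the archived release even if the repository moves.
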

**Proposal: define company-wide QMS policy ("top-down") for publishing program code**

The above workflow was established "bottom-up" (i.e. by Michael Rustler and
Hauke Sonnenberg) with the aim of making the code as open as possible
(e.g. by choosing the permissive [MIT license](https://choosealicense.com/licenses/mit/) as default for all of our public
R packages).

However, no company-wide ("top-down") strategy has been defined so far
that would legitimise this "bottom-up" approach. This creates uncertainty (e.g.
what can be published?), so that much more code than necessary is labelled
as "private". To reduce this uncertainty, the following QMS policy is proposed,
which should be discussed and agreed on in one of the next KWB management meetings:

- **Sponsor projects (e.g. funded by BMBF, EU)**: source code will be published
  by default at https://github.com/kwb-r in **public** repositories (i.e.
  it will be accessible to everyone) under the permissive [MIT license](https://choosealicense.com/licenses/mit/),
  provided that the source code does not contain:

    + security-critical paths (e.g. to our company server) or
    + confidential data.

    Code should be developed in such a way that both criteria
    (security-critical paths, confidential data) are taken into account.
    Making the code openly available lowers the barrier to installing it (e.g.
    not every student needs an "access token" to install private repositories, as
    required for "contract" projects, see below).

- **Contract projects (e.g. funded by BWB, Veolia)**: source code will be published in
  **private** repositories by default at [https://github.com/kwb-r](https://github.com/kwb-r),
  unless the funder pre-defines a specific repository. Access to the source
  code is thus restricted to KWB researchers and students working on the contract
  project. Project partners and funders can access the source code only if they get
  an "access token" from the KWB project team.
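
The proposed defaults can be written down as a small decision rule. The sketch below is only an illustration: `default_publication()` is a hypothetical helper, not part of any KWB package, and it assumes that sponsor-project code which fails one of the two criteria falls back to a private repository, which the proposal does not state explicitly.

```r
# Hypothetical helper encoding the proposed publication defaults
default_publication <- function(project_type = c("sponsor", "contract"),
                                has_critical_paths = FALSE,
                                has_confidential_data = FALSE) {
  project_type <- match.arg(project_type)

  # Public only for sponsor projects that meet both criteria
  is_public <- project_type == "sponsor" &&
    !has_critical_paths && !has_confidential_data

  list(
    visibility = if (is_public) "public" else "private",
    # The proposal names a license only for public code
    license = if (is_public) "MIT" else NA_character_,
    needs_access_token = !is_public
  )
}

default_publication("sponsor")$visibility
default_publication("contract")$needs_access_token
```

Keeping the rule in one function makes the consequence of each criterion explicit, which is the uncertainty the policy is meant to remove.
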
```{block2 type = "rmdtip"}
A [blog post](https://101innovations.wordpress.com/2016/10/09/github-and-more-sharing-data-code/) by @Bosman_2016 provides results of a large survey carried out in 2015 among more than 15000 researchers. It offers insights into:

- which scholarly communication tools are used and
- whether there are disciplinary differences in usage.

They finally summarise:

"Another surprising finding is the overall low use of Zenodo – a CERN-hosted
repository that is the recommended archiving and sharing solution for data from
EU-projects and -institutions. The fact that Zenodo is a data-sharing platform
that is available to anyone (thus not just for EU project data) might not be
widely known yet."
```
### ORCID

Problem:

>"Two large challenges that researchers face today are discovery and evaluation.
We are overwhelmed by the volume of new research works, and traditional discovery
tools are no longer sufficient. We are spending considerable amounts of time
optimizing the impact—and discoverability—of our research work so as to support
grant applications and promotions, and the traditional measures for this are not
enough."
>
> --- `r citep(manual["Fenner2014"])`


Solution:

>"Open Researcher & Contributor ID ([ORCID](http://orcid.org/)) is an international, interdisciplinary, open and not-for-profit organization created to solve the
researcher name ambiguity problem for the benefit of all stakeholders.
[ORCID](http://orcid.org/) was built with the goal of becoming the universally
accepted unique identifier for researchers:
>
>1. ORCID is a community-driven organization
>
>2. ORCID is not limited by discipline, institution, or geography
>
>3. ORCID is an inclusive and transparently governed not-for profit organization
>
>4. ORCID data and source code are available under recognized open licenses
>
>5. the ORCID iD is part of institutional, publisher, and funding agency
infrastructures.
>
>Furthermore, [ORCID](http://orcid.org/) recognizes that existing researcher and
identifier schemes serve specific communities, and is working to link with,
rather than replace, existing infrastructures."
>
> --- `r citep(manual["Fenner2014"])`
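
An ORCID iD consists of 16 characters in four hyphen-separated groups, the last of which is a check character computed with ISO 7064 MOD 11-2. This detail comes from ORCID's published identifier structure, not from the quoted text. The sketch below shows how an iD could be checked before it is stored in project metadata; the function names are chosen for illustration, and the sample iD is the one ORCID uses in its own documentation.

```r
# Compute the check character (ISO 7064 MOD 11-2) from the first 15 digits
orcid_check_char <- function(base_digits) {
  total <- 0
  for (digit in base_digits) {
    total <- (total + digit) * 2
  }
  result <- (12 - (total %% 11)) %% 11
  if (result == 10) "X" else as.character(result)
}

# Check the format first, then compare the last character with the checksum
is_valid_orcid <- function(orcid) {
  if (!grepl("^\\d{4}-\\d{4}-\\d{4}-\\d{3}[\\dX]$", orcid, perl = TRUE)) {
    return(FALSE)
  }
  chars <- strsplit(gsub("-", "", orcid, fixed = TRUE), "")[[1]]
  orcid_check_char(as.integer(chars[1:15])) == chars[16]
}

is_valid_orcid("0000-0002-1825-0097")
```

The check only catches typing errors. It does not confirm that the iD is registered or that it belongs to a particular person.
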
### Licenses
>"In most countries in the world, creative work is protected by copyright laws.
International conventions, and primarily the Berne Convention of 1886, protect
the copyright of creators even across international borders for 50 years after
the death of the creator. This means that copying and using the creative work is
limited by conditions set by the creator, or another copyright holder. For
example, in many cases musical recordings may not be copied and further
distributed without the permission of the musician, or of the production company
that has acquired the copyright from the musician. Facts about the universe that
are discovered through research are not subject to copyright, but the
collection, aggregation, analysis and interpretation of research data may be
considered creative work, and could be protected by copyright laws. Thus, the
consumption of research publications is governed by copyright law. Furthermore,
even data sharing is often governed by copyright laws, because the compilation
of data to be shared often requires a creative effort. Another case of
research-relevant copyrighted products is software that is developed in the
course of research. In all of these cases, if license terms are not explicitly
specified, the work is considered to be protected as "all rights reserved". This
means that no one but the creator of the work can use the work unencumbered. For
software this means that copying and further distribution of the software is
prohibited. Even running the software may be restricted. The exact selection of
a license is beyond the scope of this section, but depends on your intentions
and goals with regard to the software"
>
> --- `r citep(manual["Rokem_2018"])`

Recommended literature:

- [Intellectual Property and Computational Science](https://link.springer.com/chapter/10.1007/978-3-319-00026-8_19) `r citep(manual["Stodden2014"])`
- [forschungslizenzen.de](http://www.forschungslizenzen.de) `r citep(manual["Forschungslizenzen"])`
- [Creative Commons Licences](https://link.springer.com/chapter/10.1007/978-3-319-00026-8_19) `r citep(manual["Friesike2014"])`
- [choosealicense.com/](https://choosealicense.com/)
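
For R packages, the license is declared in the `License` field of the `DESCRIPTION` file, and the MIT license is conventionally written there as `MIT + file LICENSE`. As a minimal sketch, the following checks whether a package declares a license at all. The `DESCRIPTION` content is a made-up example written to a temporary file, and `declared_license()` is a name chosen for illustration.

```r
# Made-up DESCRIPTION file for a hypothetical package
description_file <- tempfile()
writeLines(c(
  "Package: examplepkg",
  "Title: Hypothetical Example Package",
  "Version: 0.1.0",
  "License: MIT + file LICENSE"
), description_file)

# Return the declared license, or a reminder of the legal default
declared_license <- function(path) {
  fields <- read.dcf(path, fields = "License")
  license <- unname(fields[1, "License"])
  if (is.na(license)) "none declared (all rights reserved)" else license
}

declared_license(description_file)
```
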
### File Formats
>"Scientific data is saved in a myriad of file formats. A typical file format
might include a file header, describing the layout of the data on disk, metadata
associated with the data, and the data itself, often stored in binary format. In
some cases (e.g., CSV (or comma-separated value) files), data will be stored as
text. The danger of proliferation of file formats in scientific data lies in the
need to build and maintain separate software tools to read, write and process
all these data formats. This makes interoperability between different
practitioners more difficult, and limits the value of data sharing, because
access to the data in the files remains limited."
>
> --- `r citep(manual["Rokem_2018"])`
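
Text-based formats such as CSV can be read without any special tooling. The following minimal sketch writes a small data frame to a CSV file and reads it back with base R only. The data values and column names are made up for illustration.

```r
# Made-up example data
measurements <- data.frame(
  site = c("A", "B"),
  value = c(1.5, 2.25),
  stringsAsFactors = FALSE
)

# Write to a plain-text CSV file (no row names, so only the data is stored)
csv_file <- tempfile(fileext = ".csv")
utils::write.csv(measurements, csv_file, row.names = FALSE)

# The file is human-readable text ...
readLines(csv_file)

# ... and can be read back into an equivalent data frame
restored <- utils::read.csv(csv_file, stringsAsFactors = FALSE)
```

Note that CSV stores no column types or units, so these should be documented alongside the file.
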
```{r echo = FALSE, warning=FALSE, message=FALSE}
# Install tibble on first use (needed to build the table)
if (!require("tibble")) {
  install.packages("tibble")
}

# Suitability of file formats for long-term preservation, by content type
tbl_longterm_file_formats <- tibble::tribble(
  ~., ~More.than.ten.years, ~Up.to.ten.years, ~Not.suitable,
  "Text", "PDF/A, TXT, ASC, XML", "PDF, RTF, HTML, DOCX, PPTX, ODT, LATEX", "DOC, PPT",
  "Data", "CSV", "XLSX, ODS", "XLS",
  "Pictures", "TIFF, PNG, JPG 2000, SVG", "GIF, BMP, JPEG", "INDD, EPS",
  "Audio", "WAV", "MP3, MP4", "",
  "Video", "Motion JPG 2000, MOV", "MP4", "WMV"
)

# Replace the syntactic column names with readable table headers
names(tbl_longterm_file_formats) <- c(
  "", "More than ten years", "Up to ten years",
  "Not suitable"
)

# kableExtra provides column_spec() and the %>% pipe used in the table chunks
if (!require("kableExtra")) {
  install.packages("kableExtra")
}
library(kableExtra, quietly = TRUE)
```
```{r echo = FALSE, include=grepl('html', knitr:::pandoc_to())}
knitr::kable(tbl_longterm_file_formats,
  format = "html",
  align = "c",
  caption = "Suitability of file formats for long-term preservation [@Kaden_2018]"
) %>%
  kableExtra::column_spec(2:4, width = "5cm")
```
```{r echo = FALSE, include=grepl('latex', knitr:::pandoc_to())}
kableExtra::column_spec(
  kable_input = knitr::kable(tbl_longterm_file_formats,
    "latex",
    align = "c",
    booktabs = TRUE,
    caption = "Suitability of file formats for long-term preservation \\citep{Kaden_2018}"
  ),
  column = 2:4, width = "4cm"
)
```
```{r echo = FALSE, include=grepl('docx|epub', knitr:::pandoc_to()), fig.width = 5, fig.height = 5 }
knitr::kable(tbl_longterm_file_formats,
  format = "markdown",
  align = "c",
  caption = "Suitability of file formats for long-term preservation [@Kaden_2018]"
)
```
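
The table can also be used programmatically, for example to check the files in a project folder before archiving. The sketch below is one possible way to do this. The lookup vector covers only a subset of the formats listed in the table, and the names `longterm_suitability` and `classify_file()` are hypothetical.

```r
# Subset of the table above, keyed by lower-case file extension
longterm_suitability <- c(
  csv = "more than ten years", txt = "more than ten years",
  xml = "more than ten years", png = "more than ten years",
  wav = "more than ten years",
  xlsx = "up to ten years", ods = "up to ten years",
  docx = "up to ten years", mp3 = "up to ten years",
  xls = "not suitable", doc = "not suitable",
  ppt = "not suitable", wmv = "not suitable"
)

# Classify files by extension; unknown extensions are reported as "not listed"
classify_file <- function(paths) {
  extensions <- tolower(tools::file_ext(paths))
  result <- unname(longterm_suitability[extensions])
  result[is.na(result)] <- "not listed"
  result
}

classify_file(c("data.csv", "old_results.XLS", "notes.md"))
```

A file extension is only a hint. PDF and PDF/A, for example, share the `.pdf` extension, which is why PDF is left out of the lookup.
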
### Data Exchange Standards

[WaterML2](http://www.waterml2.org/):

>"...is a new data exchange standard in Hydrology which can basically be used to
exchange many kinds of hydro-meteorological observations and measurements.
WaterML2 has been initiated and designed over a period of several years by a
group of major national and international organizations from public and private
sector, such as CSIRO, CUAHSI, USGS, BOM, NOAA, KISTERS and others. WaterML2 has
been developed within the OGC Hydrology Domain Working group which has a mandate
by the WMO, too."
>
> --- [WaterML2](http://www.waterml2.org/)
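
Because WaterML2 is XML-based, time series can be extracted with generic XML tools. The sketch below assumes that the `xml2` package is installed and uses a heavily simplified, made-up fragment. Real WaterML2 documents contain considerably more metadata, so this illustrates the principle and is not a complete reader.

```r
library(xml2)

# Simplified, made-up fragment with two time-value pairs
wml_fragment <- paste0(
  '<wml2:MeasurementTimeseries xmlns:wml2="http://www.opengis.net/waterml/2.0">',
  "<wml2:point><wml2:MeasurementTVP>",
  "<wml2:time>2020-01-01T00:00:00Z</wml2:time><wml2:value>1.2</wml2:value>",
  "</wml2:MeasurementTVP></wml2:point>",
  "<wml2:point><wml2:MeasurementTVP>",
  "<wml2:time>2020-01-01T01:00:00Z</wml2:time><wml2:value>1.4</wml2:value>",
  "</wml2:MeasurementTVP></wml2:point>",
  "</wml2:MeasurementTimeseries>"
)

doc <- xml2::read_xml(wml_fragment)
ns <- xml2::xml_ns(doc) # namespace prefixes declared in the document

# Collect the time stamps and values of all time-value pairs
times <- xml2::xml_find_all(doc, ".//wml2:MeasurementTVP/wml2:time", ns)
values <- xml2::xml_find_all(doc, ".//wml2:MeasurementTVP/wml2:value", ns)

observations <- data.frame(
  time = xml2::xml_text(times),
  value = as.numeric(xml2::xml_text(values)),
  stringsAsFactors = FALSE
)
```
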

[ODM2](http://www.odm2.org/) is an information model and supporting software
ecosystem for feature-based earth observations.