02_data_publication.Rmd

## Data Publishing and Sharing

```{r fig.align='center', echo=FALSE, include=grepl('html', knitr:::pandoc_to()), fig.cap= 'Modern life context for the ten simple rules [@Boland_2017]', fig.link='https://doi.org/10.1371/journal.pcbi.1005278/'}
knitr::include_graphics("images/tensimplerules_datasharing.png", dpi = NA)
```

```{r fig.align='center', echo=FALSE, fig.width = 5, fig.height = 5, include=grepl('latex', knitr:::pandoc_to()), fig.cap= 'Modern life context for the ten simple rules \\citep{Boland_2017}', fig.link='https://doi.org/10.1371/journal.pcbi.1005278/'}
knitr::include_graphics("images/tensimplerules_datasharing.png", dpi = NA)
```

```{r fig.align='center', echo=FALSE, fig.width = 5, fig.height = 5, include=grepl('docx|epub', knitr:::pandoc_to()), fig.cap= 'Modern life context for the ten simple rules [@Boland_2017]', fig.link='https://doi.org/10.1371/journal.pcbi.1005278/'}
knitr::include_graphics("images/tensimplerules_datasharing.png", dpi = NA)
```


>"This figure provides a framework for understanding how the “Ten Simple Rules to
Enable Multi-site Collaborations through Data Sharing” `r citep("10.1371/journal.pcbi.1005278")`
can be translated into easily understood modern life concepts.
>
**Rule 1** is Open-Source Software. The openness is signified by a window to a room 
>filled with algorithms that are represented by gears. 
>
**Rule 2** involves making the source data available whenever possible. Source data 
>can be very useful for researchers. However, data are often housed in 
>institutions and are not publicly accessible. These files are often stored 
>externally; therefore, we depict this as a shed or storehouse of data, which, 
>if possible, should be provided to research collaborators. 
>
**Rule 3** is to “use multiple platforms to share research products.” This 
>increases thechances that other researchers will find and be able to utilize your 
>research product—this is represented by multiple locations (i.e., shed and house). 
>
**Rule 4** involves the need to secure all necessary permissions a priori. Many 
>datasets have data use agreements that restrict usage. These restrictions can 
>sometimes prevent researchers from performing certain types of analyses or 
>publishing in certain journals (e.g., journals that require all data to be 
>openly accessible); therefore, we represent this rule as a key that can lock or 
>unlock the door of your research. 
>
**Rule 5** discusses the privacy issues that surround source data. Researchers 
>need to understand what they can and cannot do (i.e., the privacy rules) with 
>their data. Privacy often requires allowing certain users to have access to 
>sections of data while restricting access to other sections of data. Researchers 
>need to understand what can and cannot be revealed about their data (i.e., when 
>to open and close the curtains). 
>
**Rule 6** is to facilitate reproducibility whenever possible. Since 
communication is the forte of reproducibility, we depicted it as two 
researchers sharing a giant scroll, because data documentation is required and 
is often substantial. 
>
**Rule 7** is to “think global.” We conceptualize this as a cloud. This cloud 
allows the research property (i.e., the house and shed) to be accessed across 
large distances. 
>
>**Rule 8** is to publicize your work. Think of it as “shouting from the rooftops.”
Publicizing is critical for enabling other researchers to access your research
product. 
>
>**Rule 9** is to “stay realistic.” It is important for researchers to “stay
grounded” and resist the urge to overstate the claims made by their research.
>
**Rule 10** is to be engaged, and this is depicted as a person waving an “I heart
research” sign. It is vitally important to stay engaged and enthusiastic about
one’s research. This enables you to draw others to care about your research."
>
> ---- `r citep("10.1371/journal.pcbi.1005278")` 

Recommended literature: 

- [Ten simple rules to enable multi-site collaborations through data sharing](https://doi.org/10.1371/journal.pcbi.1005278) 
`r citep("10.1371/journal.pcbi.1005278")`

- Guidelines for publishing (PhD) research data `r citep(manual["Kaden_2018"])`

###	Repositories 

Repositories for permanently deposing data are for example:

- General

    + [Figshare](https://figshare.com), 

    + [Zenodo](https://zenodo.org/) (a joint project between 
    [OpenAIRE](https://www.openaire.eu/) and [CERN](https://home.cern/)),

    + [Mendeley data](https://data.mendeley.com/), 

    + [Dataverse](https://dataverse.org/),

    + [Dryad](https://datadryad.org/)
    
- Focus on environmental and earth sciences

    + [Pangea](https://www.pangaea.de/)
    
    + [GFZ Potsdam data services](http://dataservices.gfz-potsdam.de/portal/)


Repositories for publishing program code are: 

- [Github](https://github.com) or 

- [Gitlab](https://gitlab.com). 

However, both do not offer long term data preservation by default, but using
[Github](https://github.com)  it is posible to make the code citable by linking 
it with [Zenodo](https://zenodo.org/) (see: https://guides.github.com/activities/citable-code/).

We are currently using the following three repositories for publishing program 
code (mainly R packages): 

- [Github](https://github.com): for developing and publishing program code (mainly
R packages) we use [https://github.com/kwb-r](https://github.com/kwb-r). Currently 
81 (i.e. 38 public and 43 private) repositories are published on this Github 
account. For all 32 public R packages there is also a detailed status 
report available available at https://kwb-r.github.io/status/ , e.g. with information on license, 
documentation and the "health" of the R package (i.e. whether it can be 
successfully installed on Linux or Windows platforms).


- [Zenodo](https://zenodo.org/): for automatically getting a [DOI](https://www.doi.org/) 
for each software release made in one of our public Github repositories, e.g. 
[aquanes.report](https://doi.org/10.5281/zenodo.825029) (for details see: https://guides.github.com/activities/citable-code/)
and 

- [Gitlab](https://gitlab.com): as backup mirror ([https://gitlab.com/kwb-r](https://gitlab.com/kwb-r)) 
for all of our currently 81 (i.e. 34 public and 47 private) repositories currently 
published on our Github account ([https://github.com/kwb-r](https://github.com/kwb-r)) 


**Proposal: define company-wide QMS policy ("top-down") for publishing program code**

The above workflow was established from "bottom-up" (i.e. Michael Rustler and 
Hauke Sonnenberg) with the idea in mind to make the code as open as possible 
(e.g. by chossing the permisse [MIT license](https://choosealicense.com/licenses/mit/) as default for all of our public 
R packages). 

However, up to now there is no company wide strategy ("top-down") defined yet 
that would legitimate this "bottom-up" approach. This creates uncertainty (e.g. 
what can be published?), so that much more code than necessary is labelled 
as "private". To reduce this uncertainty the following QMS policy is proposed, 
which should be discussed and agreed on in one of the next KWB management meetings:

- **Sponsor projects (e.g. funded by BMBF, EU)**: source code will be published 
by default at https://github.com/kwb-r in **public** repositories (i.e. 
it will be accessible for everyone) under the permissive [MIT license](https://choosealicense.com/licenses/mit/) 
in case that the source code does not: 

  + contain security critical paths (e.g. to our company server) or
    
  + confidential data. 

  Code should be developed in such a way that both of the criteria 
  (security critical paths, confidential data) defined above are considered. 
  Making the code openly available will decrease the burden to install them (e.g. 
  not each student needs to get an "access" token to install private repositories, as 
  required for "contract" projects, see below).

- **Contract projects (e.g. funded by BWB, Veolia)**: will be published in 
**private** repositories by default at [https://github.com/kwb-r](https://github.com/kwb-r) 
in case the funder does not pre-define a specific repository. Access to the source 
code is thus resticted to KWB researchers and students working in the contract 
project. Project partners and funders can access the source code only if they get 
an "access token" from the KWB project team. 


```{block2 type = "rmdtip"}
A [blog post](https://101innovations.wordpress.com/2016/10/09/github-and-more-sharing-data-code/) by @Bosman_2016 provide results of a large survey carried out in 2015 among more than 15000 researchers. Insights can be gained on:   

- Which scholary communications tools are used and

- Are there disciplinary differences in usage?

They finally summarise:
"Another surprising finding is the overall low use of Zenodo – a CERN-hosted 
repository that is the recommended archiving and sharing solution for data from 
EU-projects and -institutions. The fact that Zenodo is a data-sharing platform 
that is available to anyone (thus not just for EU project data) might not be 
widely known yet."
```


### ORCID

Problem:

>"Two large challenges that researchers face today are discovery and evaluation. 
We are overwhelmed by the volume of new research works, and traditional discovery 
tools are no longer sufficient. We are spending considerable amounts of time 
optimizing the impact—and discoverability—of our research work so as to support 
grant applications and promotions, and the traditional measures for this are not 
enough. 
> --- `r citep(manual["Fenner2014"])`

Solution:

>"Open Researcher & Contributor ID ([ORCID](http://orcid.org/)) is an international, interdisciplinary, open and not-for-profit organization created to solve the 
researcher name ambiguity problem for the benefit of all stakeholders. 
[ORCID](http://orcid.org/)was built with the goal of becoming the universally 
accepted unique identifier for researchers: 
>
>1. ORCID is a community-driven organization
>
>2. ORCID is not limited by discipline, institution, or geography
>
>3. ORCID is an inclusive and transparently governed not-for profit organization
>
>4. ORCID data and source code are available under recognized open licenses
>
>5. the ORCID iD is part of institutional, publisher, and funding agency 
infrastructures.
>
>Furthermore, [ORCID](http://orcid.org/) recognizes that existing researcher and 
identifier schemes serve specific communities, and is working to link with, 
rather than replace, existing infrastructures."
>
> --- `r citep(manual["Fenner2014"])`


### Licenses

>"In most countries in the world, creative work is protected by copyright laws.
International conventions, and primarily the Berne Convention of 1886, protect
the copyright of creators even across international borders for 50 years after
the death of the creator. This means that copying and using the creative work is
limited by conditions set by the creator, or another copyright holder. For
example, in many cases musical recordings may not be copied and further
distributed without the permission of the musician, or of the production company
that has acquired the copyright from the musician. Facts about the universe that
are discovered through research are not subject to copyright, but the
collection, aggregation, analysis and interpretation of research data may be
considered creative work, and could be protected by copyright laws. Thus, the
consumption of research publications is governed by copyright law. Furthermore,
even data sharing is often governed by copyright laws, because the compilation
of data to be shared often requires a creative effort. Another case of
resarch-relevant copyrighted products is software that is developed in the
course of research. In all of these cases, if license terms are not explicitly
specified, the work is considered to be protected as "all rights reserved". This
means that no one but the creator of the work can use the work unencumbered. For
software this means that copying and further distribution of the software is
prohibited. Even running the software may be restricted. The exact selection of
a license is beyond the scope of this section, but depends on your intentions
and goals with regard to the software"
>
> --- `r citep(manual["Rokem_2018"])`

Recommended literature: 

- [Intellectual Property and Computational Science](https://link.springer.com/chapter/10.1007/978-3-319-00026-8_19) `r citep(manual["Stodden2014"])`

- [forschungslizenzen.de](http://www.forschungslizenzen.de) `r citep(manual["Forschungslizenzen"])`

- [Creative Commons Licences](https://link.springer.com/chapter/10.1007/978-3-319-00026-8_19) `r citep(manual["Friesike2014"])`

- [choosealicense.com/](https://choosealicense.com/)


### File Formats 

>"Scientific data is saved in a myriad of file formats. A typical file format
might include a file header, describing the layout of the data on disk, metadata
associated with the data, and the data itself, often stored in binary format. In
some cases (e.g., CSV (or comma-separated value) files), data will be stored as
text. The danger of proliferation of file formats in scientific data lies in the
need to build and maintain separate software tools to read, write and process
all these data formats. This makes interoperability between different
practitioners more difficult, and limits the value of data sharing, because
access to the data in the files remains limited."
>
> --- `r citep(manual["Rokem_2018"])`


```{r echo = FALSE, warning=FALSE, message=FALSE}

if (!require("tibble")) {
  install.packages("tibble")
}

tbl_longterm_file_formats <- tibble::tribble(
  ~., ~More.than.ten.years, ~Up.to.ten.years, ~Not.suitable,
  "Text", "PDF/A, TXT, ASC, XML", "PDF, RTF, HTML, DOCX, PPTX, ODT, LATEX", "DOC, PPT",
  "Data", "CSV", "XLSX, ODS", "XLS",
  "Pictures", "TIFF, PNG, JPG 2000, SVG", "GIF, BMP, JPEG", "INDD, EPS",
  "Audio", "WAV", "MP3, MP4", "",
  "Video", "Motion JPG 2000, MOV", "MP4", "WMV"
)

names(tbl_longterm_file_formats) <- c(
  "", "More than ten years", "Up to ten years",
  "Not suitable"
)

if (!require("kableExtra")) {
  install.packages("kableExtra")
}
library(kableExtra, quietly = TRUE)
```

```{r echo = FALSE, include=grepl('html', knitr:::pandoc_to())}
knitr::kable(tbl_longterm_file_formats,
  format = "html",
  align = "c",
  caption = "Suitability of file formats for long-term preservation [@Kaden_2018]"
) %>%
  kableExtra::column_spec(2:4, width = "5cm")
```

```{r echo = FALSE, include=grepl('latex', knitr:::pandoc_to())}

kableExtra::column_spec(
  kable_input = knitr::kable(tbl_longterm_file_formats, 
    "latex",
    align = "c",
    booktabs = T,
    caption = "Suitability of file formats for long-term preservation \\citep{Kaden_2018}"
  ),
  column = 2:4, width = "4cm"
)
```


```{r echo = FALSE, include=grepl('docx|epub', knitr:::pandoc_to()), fig.width = 5, fig.height = 5 }
knitr::kable(tbl_longterm_file_formats,
             format = "markdown",
             align = "c", 
  caption = "Suitability of file formats for long-term preservation [@Kaden_2018]"
)
```


### Data Exchange Standards

[WaterML2](http://www.waterml2.org/): 

>"...is a new data exchange standard in Hydrology which can basically be used to 
exchange many kinds of hydro-meteorological observations and measurements. 
WaterML2 has been initiated and designed over a period of several years by a 
group of major national and international organizations from public and private 
sector, such as CSIRO, CUAHSI, USGS, BOM, NOAA, KISTERS and others. WaterML2 has 
been developed within the OGC Hydrology Domain Working group which has a mandate 
by the WMO, too."
>
> --- [WaterML2](http://www.waterml2.org/)


[ODM2](http://www.odm2.org/): is an information model and supporting software 
ecosystem for feature-based earth observations