Commit
New release of docs See merge request hmc/hmc-public/unhide/documentation!7
Showing 12 changed files with 770 additions and 697 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,13 @@ | ||
# About UnHIDE | ||
# About UnHIDE and its mission | ||
|
||
![unhide_overview](../images/unhide_overview.png) | ||
## Mission | ||
|
||
The unHIDE initiative is one part of the efforts by the Helmholtz Metadata Collaboration (HMC) to improve the quality, knowledge management, and conservation of the Helmholtz Association's research output with respect to and through metadata. This is accomplished by making research output `FAIR` through better metadata or, put differently, by creating, to a certain extent, a form of semantic web encompassing Helmholtz research. | ||
|
||
With the unHIDE initiative, our goal is to improve metadata at the source and to make data providers as well as scientists more aware of what metadata they put out on the web, how, and with what quality. | ||
For this, we create and expose the Helmholtz knowledge graph, which contains open high-level metadata exposed by different Helmholtz infrastructures. Such a graph also enables services that serve the needs of certain stakeholder groups and empower their work in different ways. | ||
|
||
Beyond the knowledge graph, in unHIDE we communicate and work together with Helmholtz infrastructures to improve metadata (or make it available in the first place) through consulting, help, and fostering networking between the infrastructures and the respective experts. | ||
|
||
|
||
![unhide_overview](../images/unhide_overview.png) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,30 @@ | ||
# Data harvesting | ||
# Data harvesting: extracting metadata from the web | ||
|
||
How does UnHIDE harvest data? | ||
How does UnHIDE harvest data? | ||
|
||
Data harvesting and mining for the knowledge graph is done by `Harvester classes`. | ||
For each interface a specific Harvester class should be implemented. | ||
All Harvester classes should inherit from existing Harvesters or the [`BaseHarvester`](https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/blob/main/data_harvesting/base_harvester.py?ref_type=heads), which currently specifies that: | ||
|
||
1. Each harvester needs a `run` method. | ||
2. Each harvester can read from the [`config.yml`](https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/blob/main/data_harvesting/configs/config.yaml?ref_type=heads). | ||
3. Each harvester reads the time it was last run from a `<harvesterclass>.last_run` file. | ||
|
||
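The three requirements listed above can be sketched as a small abstract base class. This is a minimal illustration only, not the actual `BaseHarvester`: the names `SketchBaseHarvester`, `EchoHarvester`, `read_last_run`, `write_last_run`, and `state_dir`, as well as the ISO 8601 timestamp format, are assumptions made for this example.

```python
from abc import ABC, abstractmethod
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


class SketchBaseHarvester(ABC):
    """Hypothetical stand-in for a base harvester; not the real BaseHarvester."""

    def __init__(self, config: dict, state_dir: Path) -> None:
        # Assumption: the parsed configuration is handed in as a plain dict.
        self.config = config
        # Assumption: the <harvesterclass>.last_run file lives in state_dir.
        self.last_run_file = state_dir / f"{type(self).__name__}.last_run"

    def read_last_run(self) -> Optional[datetime]:
        """Return the time of the previous run, or None if there was none."""
        if not self.last_run_file.exists():
            return None
        return datetime.fromisoformat(self.last_run_file.read_text().strip())

    def write_last_run(self, when: datetime) -> None:
        """Persist the time of this run as an ISO 8601 string (assumed format)."""
        self.last_run_file.write_text(when.isoformat())

    @abstractmethod
    def run(self) -> list:
        """Harvest records; every concrete harvester must implement this."""


class EchoHarvester(SketchBaseHarvester):
    """Toy harvester that echoes its configured sources instead of harvesting."""

    def run(self) -> list:
        since = self.read_last_run()
        records = [
            {"source": source, "since": since}
            for source in self.config.get("sources", [])
        ]
        self.write_last_run(datetime.now(timezone.utc))
        return records
```

A concrete harvester would replace the body of `run` with interface-specific logic and could use the value returned by `read_last_run` to request only records changed since the previous run.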
Implemented harvester classes include: | ||
|
||
| Name (CLI) | Class Name | Interface | Comment | | ||
|-------------|------------|-----------|---------| | ||
|sitemap | SitemapHarvester | Sitemaps | Selecting record links from the sitemap requires expression matching. Relies on the advertools lib.| | ||
|oai | OAIHarvester | OAI-PMH | Relies on the oai lib. For the library providers, Dublin Core is converted to schema.org. | | ||
|git | GitHarvester | Git, GitLab/GitHub API | Relies on codemetapy and codemeta-harvester as well as the GitLab/GitHub APIs. | | ||
|datacite | DataciteHarvester | REST API & GraphQL endpoint | schema.org metadata is extracted through content negotiation.| | ||
|feed | FeedHarvester | RSS & Atom feeds | Relies on the atoma library and only works if schema.org metadata can be extracted from the landing pages. Can only retrieve recent data; useful for event metadata.| | ||
|indico | IndicoHarvester | Indico REST API | Directly extracts schema.org metadata through the API; requires an access token. | | ||
|
||
JSON-LD metadata from the landing pages of records is extracted via the `extruct` library if it cannot be retrieved directly through a standardized interface. | ||
|
||
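To illustrate what extracting JSON-LD from a landing page involves, here is a minimal sketch that uses only the Python standard library. It is a simplified stand-in for the idea, not the `extruct` API and not UnHIDE's code; `JsonLdParser` and `extract_jsonld` are names invented for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed JSON-LD blocks instead of failing.


def extract_jsonld(html: str) -> list:
    """Return every JSON-LD document embedded in the given HTML string."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.documents


# Example landing page with one embedded schema.org record.
page = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Dataset", "name": "Example"}'
    "</script></head><body></body></html>"
)
records = extract_jsonld(page)
```

A dedicated library such as `extruct` handles further syntaxes and edge cases that this sketch ignores.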
All harvesters are exposed on the `hmc-unhide` command-line interface. | ||
By default, they store the extracted metadata in the internal data model [`LinkedDataObject`](https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/blob/main/data_harvesting/data_model.py?ref_type=heads), | ||
which has a serialization containing some provenance information, the original source data, and the uplifted data, and which provides methods for validation. | ||
|
||
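The idea of a record that keeps the original source data, the uplifted data, and provenance together can be sketched with a small dataclass. This is a hypothetical illustration and not the real `LinkedDataObject`: the class name `SketchLinkedDataObject`, its field names, and the toy validation rule are all assumptions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SketchLinkedDataObject:
    """Hypothetical record container; field names are illustrative only."""

    original: dict  # metadata exactly as harvested from the source
    derived: dict   # uplifted (enriched or corrected) metadata
    provenance: dict = field(default_factory=dict)  # e.g. harvester, time

    def validate(self) -> bool:
        """Toy check (assumption): both versions must declare a JSON-LD @type."""
        return "@type" in self.original and "@type" in self.derived

    def serialize(self) -> str:
        """Serialize all three parts into one JSON string."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def deserialize(cls, text: str) -> "SketchLinkedDataObject":
        """Rebuild an object from the output of serialize()."""
        return cls(**json.loads(text))


record = SketchLinkedDataObject(
    original={"@type": "Dataset", "name": "example"},
    derived={"@type": "Dataset", "name": "Example"},
    provenance={"harvester": "sitemap"},
)
restored = SketchLinkedDataObject.deserialize(record.serialize())
```

Keeping the original next to the uplifted version makes it possible to trace what was changed after harvesting.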
A single central YAML configuration file called [`config.yml`](https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/blob/main/data_harvesting/configs/config.yaml?ref_type=heads) specifies, for each harvester class, the sources to harvest as well as harvester-specific or source-specific configuration. |
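
As a rough illustration of how one configuration can hold both harvester-level and source-level settings, the sketch below uses a plain Python dict as a stand-in for the parsed YAML. The layout is entirely hypothetical: the keys `defaults`, `sources`, `url`, and `match_pattern`, the source names, and the helper `resolve_sources` are assumptions and do not reflect the real file's schema.

```python
# Hypothetical stand-in for the parsed configuration file.
CONFIG = {
    "SitemapHarvester": {
        # Assumed harvester-level settings shared by all of its sources.
        "defaults": {"match_pattern": "/record/"},
        "sources": {
            "example_center": {"url": "https://example.org/sitemap.xml"},
            "other_center": {
                "url": "https://example.com/sitemap.xml",
                # A source-level setting overrides the harvester default.
                "match_pattern": "/dataset/",
            },
        },
    },
    "OAIHarvester": {
        "sources": {"example_library": {"url": "https://example.org/oai"}},
    },
}


def resolve_sources(config: dict, harvester_class: str) -> dict:
    """Merge harvester defaults into each source; source settings win."""
    section = config.get(harvester_class, {})
    defaults = section.get("defaults", {})
    return {
        name: {**defaults, **settings}
        for name, settings in section.get("sources", {}).items()
    }


sitemap_sources = resolve_sources(CONFIG, "SitemapHarvester")
```

Under these assumptions, each harvester only sees its own section, with settings already merged per source.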