shortname: DEC-PROV
name: Decentralized Data Provenance
type: Standard
status: Raw
editor: Aitor Argomaniz <aitor@oceanprotocol.com>
contributors: Fang Gong <fang@oceanprotocol.com>
- Table of Contents
- Abstract
- Change Process
- Language
- Motivation
- Decentralized Data Provenance
- Integration of Data Provenance in Ocean Protocol
This OEP introduces the concept of asset Data Provenance in Ocean Protocol based on the W3C Provenance specification.
The process to change this document is described in OEP-2 (COSS).
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
The intention of this document is to discuss how data provenance can be established in a decentralized system, permitting integrity validation of this provenance information.
The main objectives of this OEP are:
- Understand what provenance information needs to be tracked
- Specify how the provenance integrity check is stored on-chain
- Identify the actors involved in the publishing and visualization of provenance information
- Detail how to register relationships between source and derived entities
- Detail how to register activities
- Understand how to associate activities with the input and output entities in a workflow
- Validate cryptographically that an entity was generated from a specific input entity in a specific activity
The W3C PROV specification defines Provenance as:
Provenance is information about entities, activities, and people
involved in producing a piece of data or thing, which can be used to
form assessments about its quality, reliability or trustworthiness.
The W3C PROV data model includes a core set of types and relations commonly found in provenance tracking for more specific uses.
In particular, the data model includes both Type and Relation categories. The Type category contains entity, activity and agent, which are core components. The Relation category defines key relationships between different types of components, which can be mapped into specific PROV model relations.
Provenance information can be modeled as the interaction between Agents and Entities, related via the Activities between them.
In PROV, physical, digital, conceptual, or other kinds of things are called entities. Examples of such entities are assets, a web page, a chart, a spellchecker, etc.
An agent takes a role in an activity such that the agent can be assigned some degree of responsibility for the activity taking place. An agent can be a person, a piece of software, an inanimate object, an organization, or other entities that may be ascribed responsibility.
Activities are how entities come into existence and how their attributes
change to become new entities, often making use of previously existing
entities to achieve this. They are dynamic aspects of the world, such as
actions, processes, etc. For example, if the second version of document D
was generated by a translation from the first version of the document in
another language, then this translation is an activity.
Generation is the completion of production of a new entity by an activity. This entity did not exist before generation and becomes available for usage after this generation.
Generation, written wasGeneratedBy(id; e, a, t, attrs) in PROV-N (a notation for provenance aimed at human consumption), has:
- id: an optional identifier for a generation;
- entity: an identifier (e) for a created entity. In Ocean a DID;
- activity: an optional identifier (a) for the activity that creates the entity. In Ocean a DID;
- time: an optional "generation time" (t), the time at which the entity was completely created;
- attributes: an optional set (attrs) of attribute-value pairs representing additional information about this generation.
While each of id, activity, time, and attributes is optional, at least one of them must be present.
PROV uses qualified names to identify things for data provenance. A qualified name is essentially a shortened representation of a URI in the form prefix:localpart.
Example:
wasGeneratedBy(did:op:e1, did:op:a1, 2001-10-26T21:32:52)
wasGeneratedBy(did:op:e2, did:op:a1, 2001-10-26T10:00:00)
The above example shows the existence of two generations (with respective times 2001-10-26T21:32:52 and 2001-10-26T10:00:00), at which new entities, identified by did:op:e1 and did:op:e2, were created by an activity, identified by did:op:a1.
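The generation records above could be modeled along the following lines. This is a minimal sketch: the Generation class and its to_provn method are hypothetical names introduced for illustration and are not part of any PROV or Ocean library. Absent optional fields are rendered with the PROV-N marker "-".

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Generation:
    """Hypothetical container for one PROV generation record."""
    entity: str                       # DID of the created entity (required)
    activity: Optional[str] = None    # DID of the generating activity
    time: Optional[str] = None        # generation time, ISO 8601 string
    id: Optional[str] = None          # identifier of the generation itself
    attributes: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # At least one of id, activity, time, attributes has to be present.
        if not (self.id or self.activity or self.time or self.attributes):
            raise ValueError("generation needs id, activity, time or attributes")

    def to_provn(self) -> str:
        """Render the record in PROV-N, using '-' for absent optional fields."""
        parts = [self.entity, self.activity or "-", self.time or "-"]
        if self.attributes:
            attrs = ", ".join(f'{k}="{v}"' for k, v in self.attributes.items())
            parts.append(f"[{attrs}]")
        prefix = f"{self.id}; " if self.id else ""
        return "wasGeneratedBy(" + prefix + ", ".join(parts) + ")"


g = Generation("did:op:e1", "did:op:a1", "2001-10-26T21:32:52")
print(g.to_provn())
```

Constructing a Generation with only an entity raises ValueError, mirroring the rule that at least one of the optional fields must be present.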
Derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity.
A derivation, written wasDerivedFrom(id; e2, e1, a, g2, u1, attrs) in PROV-N, has:
- id: an optional identifier for a derivation;
- generatedEntity: the identifier (e2) of the derived entity generated by the derivation. In Ocean a DID;
- usedEntity: the identifier (e1) of the source entity used by the derivation. In Ocean a DID;
- activity: an optional identifier (a) for the activity using and generating the above entities. In Ocean a DID;
- generation: an optional identifier (g2) for the generation involving the generated entity (e2) and activity (a). In Ocean a DID;
- usage: an optional identifier (u1) for the usage involving the used entity (e1) and activity (a). In Ocean a DID;
- attributes: an optional set (attrs) of attribute-value pairs representing additional information about this derivation.
The following descriptions are about derivations between did:op:e1 and did:op:e2, but no information is provided as to the identity of the activity (and usage and generation) underpinning the derivation. In the second line, a type attribute is also provided.
wasDerivedFrom(did:op:e2, did:op:e1)
wasDerivedFrom(did:op:e2, did:op:e1, [ prov:type="physical transform" ])
The following description expresses that activity did:op:a, using the entity did:op:e1 according to usage did:op:01, derived the entity did:op:e2 and generated it according to generation did:op:02. It is followed by descriptions for generation did:op:02 and usage did:op:01.
wasDerivedFrom(did:op:e2, did:op:e1, did:op:a, did:op:02, did:op:01)
wasGeneratedBy(did:op:02; did:op:e2, did:op:a, -)
used(did:op:01; did:op:a, did:op:e1, -)
With such a comprehensive description of derivation, a program that analyzes provenance can identify the activity underpinning the derivation. It can also identify how the preceding entity did:op:e1 was used by the activity (for instance, which argument it was passed as, if the activity is the result of a function invocation) and which output the derived entity did:op:e2 was obtained from (say, for a function returning multiple results).
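The kind of analysis described above can be sketched as a consistency check between a derivation and the generation and usage records it references. The Derivation tuple, the two lookup tables and the is_consistent function are hypothetical names chosen for this illustration; the identifiers are the ones from the example.

```python
from typing import Dict, NamedTuple, Optional


class Derivation(NamedTuple):
    generated_entity: str
    used_entity: str
    activity: Optional[str] = None
    generation: Optional[str] = None
    usage: Optional[str] = None


# Generation and usage records indexed by their identifiers (illustrative data).
generations: Dict[str, dict] = {
    "did:op:02": {"entity": "did:op:e2", "activity": "did:op:a"},
}
usages: Dict[str, dict] = {
    "did:op:01": {"activity": "did:op:a", "entity": "did:op:e1"},
}


def is_consistent(d: Derivation) -> bool:
    """Check that the referenced generation and usage match the derivation."""
    gen = generations.get(d.generation or "")
    use = usages.get(d.usage or "")
    if gen is None or use is None:
        # Without both records the activity cannot be cross-checked.
        return False
    return (
        gen["entity"] == d.generated_entity
        and use["entity"] == d.used_entity
        and gen["activity"] == use["activity"] == d.activity
    )


full = Derivation("did:op:e2", "did:op:e1", "did:op:a", "did:op:02", "did:op:01")
print(is_consistent(full))
```

A derivation that names only the two entities, like the first example, cannot be checked this way because it carries no generation or usage identifier.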
An activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities.
An activity, written activity(id, st, et, [attr1=val1, ...]) in PROV-N, has:
- id: an identifier for an activity;
- startTime: an optional time (st) for the start of the activity;
- endTime: an optional time (et) for the end of the activity;
- attributes: an optional set of attribute-value pairs ((attr1, val1), ...) representing additional information about this activity.
Example:
activity(a1, 2011-11-16T16:05:00, 2011-11-16T16:06:00,
[ ex:host="server.example.org", prov:type='ex:edit' ])
The above example shows the existence of an activity with identifier a1, start time 2011-11-16T16:05:00, and end time 2011-11-16T16:06:00, running on host server.example.org, and of type edit.
The activities could be mapped to the Ocean Protocol Workflows.
To support the inclusion of the PROV information in a W3C DID Document (DDO), the PROV modeling of Assets uses the W3C PROV-JSON spec.
From the PROV-JSON spec, each type of PROV assertion (e.g. entity, activity, generation, usage, etc.) is organised into a separate property with the same name in the top-level object as follows.
{
"entity": { // Map of entities by entities' IDs
},
"activity": { // Map of activities by IDs
},
"agent": { // Map of agents by IDs
},
<relationName>: { // A map of relations of type relationName by their IDs
}
}
Each property itself is a JSON object, which is a map-like structure to hold all the PROV-JSON representation of assertions of the same type indexed by their identifiers. Hence, a PROV-JSON document is an indexed representation of a PROV document, in which PROV assertions are indexed by their types and by their identifiers.
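The two-level indexing (first by assertion type, then by identifier) can be sketched with plain dictionaries. The add_assertion helper and the ProvDocument alias are hypothetical names used only for this illustration.

```python
from typing import Any, Dict

# type name -> identifier -> properties
ProvDocument = Dict[str, Dict[str, Dict[str, Any]]]


def add_assertion(doc: ProvDocument, kind: str, identifier: str,
                  properties: Dict[str, Any]) -> None:
    """Store one assertion under its type, then under its identifier."""
    doc.setdefault(kind, {})[identifier] = properties


doc: ProvDocument = {}
add_assertion(doc, "entity", "did:op:1234", {"prov:type": "dataset"})
add_assertion(doc, "activity", "did:op:a1", {})
add_assertion(doc, "wasGeneratedBy", "8989",
              {"prov:entity": "did:op:1234", "prov:activity": "did:op:a1"})

# Lookup is by type first, then by identifier.
generated_by = doc["wasGeneratedBy"]["8989"]["prov:activity"]
print(generated_by)
```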
This section describes the JSON representations for all PROV elements: entity, activity, and agent.
Each entity is represented as a property in the entity object, identified by the entity's ID. The property's value is itself an object containing the entity's attribute-value pairs.
...
"entity": {
"did:op:1234": {
"prov:type": "dataset",
"ex:version": "5"
}
},
...
...
"agent": {
"did:op:abcd": {
"prov:type": {
"$": "prov:Person",
"type": "xsd:QName"
}
}
...
},
...
...
"activity": {
"did:op:a1": {
"prov:startTime": "2011-11-16T16:05:00",
"prov:endTime": "2011-11-16T16:06:00",
"ex:host": "server.example.org",
"prov:type": {
"$": "ex:edit",
"type": "xsd:QName"
}
}
},
...
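As a rough sketch of how the PROV-N fields of an activity map onto such a PROV-JSON entry, the helper below builds the value stored under the activity's identifier. The function name activity_entry is hypothetical; the property names prov:startTime and prov:endTime are the ones used in the fragment above.

```python
from typing import Any, Dict, Optional


def activity_entry(start: Optional[str] = None,
                   end: Optional[str] = None,
                   attributes: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build the PROV-JSON value for one activity; optional fields are omitted."""
    entry: Dict[str, Any] = dict(attributes or {})
    if start:
        entry["prov:startTime"] = start
    if end:
        entry["prov:endTime"] = end
    return entry


activities = {
    "did:op:a1": activity_entry(
        "2011-11-16T16:05:00",
        "2011-11-16T16:06:00",
        {"ex:host": "server.example.org",
         "prov:type": {"$": "ex:edit", "type": "xsd:QName"}},
    )
}
print(activities["did:op:a1"]["prov:startTime"])
```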
In general, a PROV relation expression has the following form:
relationExpression := relationName (relationID; identifier1, identifier2, [additionalProperties])
where identifier1 and identifier2 are identifiers to PROV elements. Relations are grouped by relationName in a PROV-JSON structure.
{
...
"wasGeneratedBy": {
"8989": {
"prov:entity": "did:op:1234",
"prov:activity": "did:op:a1",
"prov:time": "2001-10-26T21:32:52",
"ex:port": "p1"
},
"8990": {
"prov:entity": "did:op:1235",
"prov:activity": "did:op:a1",
"prov:time": "2001-10-26T10:00:00",
"ex:port": "p1"
}
},
...
}
Different types of relations are:
- wasGeneratedBy
- used
- wasDerivedFrom
- wasInformedBy
- wasStartedBy
- wasEndedBy
- wasInvalidatedBy
- wasAttributedTo
- wasAssociatedWith
- actedOnBehalfOf
- wasInfluencedBy
- specializationOf
- alternateOf
- hadMember
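A consumer might want to check that the top-level keys of a provenance object are limited to the element types and the relation names listed above. The sketch below does that; the function name unknown_sections is hypothetical, and the check deliberately knows only the names mentioned in this document, so any other top-level key a full PROV-JSON document may carry would be reported as unknown.

```python
from typing import Any, Dict, List

ELEMENT_TYPES = {"entity", "activity", "agent"}
RELATION_NAMES = {
    "wasGeneratedBy", "used", "wasDerivedFrom", "wasInformedBy",
    "wasStartedBy", "wasEndedBy", "wasInvalidatedBy", "wasAttributedTo",
    "wasAssociatedWith", "actedOnBehalfOf", "wasInfluencedBy",
    "specializationOf", "alternateOf", "hadMember",
}


def unknown_sections(doc: Dict[str, Any]) -> List[str]:
    """Return top-level keys that are neither element types nor listed relations."""
    known = ELEMENT_TYPES | RELATION_NAMES
    return sorted(key for key in doc if key not in known)


sample = {"entity": {}, "used": {}, "wasMadeBy": {}}
print(unknown_sections(sample))
```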
All the provenance information will be stored as part of the DDO as an independent service:
{
"@context": "https://w3id.org/future-method/v1",
"authentication": [{}],
"created": "2019-02-08T08:13:49Z",
"id": "did:op:0000",
"proof": {},
"publicKey": [{}],
"service": [{
"index": "0",
"type": "metadata",
"serviceEndpoint": "https://service/api/v1/metadata/assets/ddo/did:op:0ebed8226ada17fde24b6bf2b95d27f8f05fcce09139ff5cec31f6d81a7cd2ea",
"attributes": {
"main": {},
"additional": {},
"curation": {}
}
}, {
"index": "1",
"type": "provenance",
"serviceEndpoint": "https://service/api/v1/provenance/assets/ddo/did:op:0ebed8226ada17fde24b6bf2b95d27f8f05fcce09139ff5cec31f6d81a7cd2ea",
"attributes": {
"main": {
"entity": {
"did:op:1234": {
"ex:version": "5",
"prov:type": "dataset"
}
},
"activity": {
"ex:edit1": {
"prov:type": "edit"
}
},
"comment": {
"ex:comment1": {
"prov:type": "comment"
}
},
"wasGeneratedBy": {
"did:op:abcd": {
"prov:activity": "ex:edit1",
"prov:entity": "did:op:1234"
}
},
"wasAssociatedWith": {
"did:op:eeff": {
"prov:activity": "ex:comment1",
"prov:entity": "did:op:1234"
}
},
"agent": {
"did:op:abcd": {
"prov:type": {
"$": "prov:Person",
"type": "xsd:QName"
}
},
"did:op:eeff": {
"prov:type": {
"$": "prov:Person",
"type": "xsd:QName"
}
}
}
}
}
}]
}
The example above shows a DDO that includes a Provenance service. It records that the entity did:op:1234 wasGeneratedBy the activity ex:edit1, performed by the agent did:op:abcd, and wasAssociatedWith the activity ex:comment1, performed by the agent did:op:eeff.
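Reading this information back out of a DDO could look like the sketch below. It assumes the layout of the example (a service of type "provenance" whose PROV-JSON object sits under attributes.main); the function names provenance_of and generating_activities are hypothetical, and the DDO is trimmed to the fields the sketch needs.

```python
from typing import List, Optional


def provenance_of(ddo: dict) -> Optional[dict]:
    """Return the PROV-JSON object of the first provenance service, if any."""
    for service in ddo.get("service", []):
        if service.get("type") == "provenance":
            return service.get("attributes", {}).get("main")
    return None


def generating_activities(prov: dict, entity_id: str) -> List[str]:
    """List the activities recorded as having generated the given entity."""
    return [
        relation["prov:activity"]
        for relation in prov.get("wasGeneratedBy", {}).values()
        if relation.get("prov:entity") == entity_id and "prov:activity" in relation
    ]


# Trimmed version of the example DDO.
ddo = {
    "id": "did:op:0000",
    "service": [
        {"index": "0", "type": "metadata", "attributes": {"main": {}}},
        {"index": "1", "type": "provenance", "attributes": {"main": {
            "entity": {"did:op:1234": {"prov:type": "dataset"}},
            "wasGeneratedBy": {"did:op:abcd": {
                "prov:activity": "ex:edit1",
                "prov:entity": "did:op:1234",
            }},
        }}},
    ],
}

prov = provenance_of(ddo)
print(generating_activities(prov, "did:op:1234"))
```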
To guarantee the integrity of the provenance information, the complete provenance JSON object inside of the Provenance service is serialized and hashed using Hash.sha3. This generated checksum will be stored as part of the Provenance service in the checksum attribute.
Additionally, this checksum can be registered on-chain so that tampering with the provenance information can be detected. The checksum can also be used as part of the DID computation.
To generate the checksum in a way that can be computed and verified repeatedly, the JSON object must be serialized using the following rules:
- All \n, \t and \r characters must be removed
- All whitespace outside JSON names and values must be removed
- The keys of the JSON document and of all nested objects must be sorted alphabetically
The result must be a compact, single-line JSON object whose hash can be recomputed for verification.
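The serialization rules above can be sketched with the Python standard library. Two assumptions are made here and should be checked against the actual implementation: the function names canonical_json and provenance_checksum are hypothetical, and hashlib.sha3_256 (NIST SHA3-256) stands in for Hash.sha3. If Hash.sha3 is Keccak-256, as is common in Ethereum tooling, the digests will differ from the ones produced here.

```python
import hashlib
import json
from typing import Any


def canonical_json(obj: Any) -> str:
    """Serialize with sorted keys and no whitespace between JSON tokens."""
    # sort_keys orders every nested object; compact separators drop the
    # spaces json.dumps would otherwise emit after ',' and ':'.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def provenance_checksum(obj: Any) -> str:
    """Hash the canonical form. ASSUMPTION: SHA3-256 stands in for Hash.sha3."""
    return hashlib.sha3_256(canonical_json(obj).encode("utf-8")).hexdigest()


a = {"entity": {"did:op:1234": {"prov:type": "dataset", "ex:version": "5"}}}
b = {"entity": {"did:op:1234": {"ex:version": "5", "prov:type": "dataset"}}}

# Key order in the input does not affect the canonical form or the checksum.
print(canonical_json(a))
print(provenance_checksum(a) == provenance_checksum(b))
```

A verifier recomputes the checksum from the provenance object it received and compares it with the stored value; any change to a name or value produces a different digest.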
- W3C PROV-O - The Provenance Ontology: https://www.w3.org/TR/2013/REC-prov-o-20130430/
- W3C PROV-DM - The Provenance Data Model: https://www.w3.org/TR/2013/REC-prov-dm-20130430/
- W3C PROV-XML: https://www.w3.org/TR/2013/NOTE-prov-xml-20130430/
- W3C PROV-JSON Serialization: https://www.w3.org/Submission/prov-json/
The Open Provenance Model (OPM) defines a data model that is open from an inter-operability viewpoint but also with respect to the community of its contributors, reviewers and users.
It has several tools & libraries:
- ProvToolbox - a Java toolbox for handling PROV (with a tutorial)
- Prov Python - a Python implementation of the PROV data model
- ProvJS - a JavaScript implementation of the PROV data model
- ProvStore - Provenance storage and distribution
- ProvExtract - for dealing with PROV embedded in web pages
- ProvVis - experimental visualizations of PROV
- PROV-N Editor - a text editor with PROV-N syntax highlighted