
persistent identifiers for annotation data #43

Open
balmas opened this issue Nov 18, 2020 · 15 comments

balmas commented Nov 18, 2020

The following types of identifiers are considered viable as PIDs for open data:

  • Handles
  • DOIs (built on Handles)
  • URNs
  • ARKs
  • PURLs

(some helpful refs:
https://www.pidforum.org/t/pids-for-publications-and-data/297
https://journal.code4lib.org/articles/14978
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5944906/
)

Handles and DOIs are more standard in the global science domain and are backed by global infrastructure and sustainability plans, but they are more costly in terms of either fees or infrastructure (we could become a member of CLARIN or ERIC for Handles, or pay $50 annually for a Handle prefix and run our own Handle server; DOIs are probably cost-prohibitive). URNs have no standard or community-supported resolution services. PURLs can be used for vocabulary terms but are not intended for individual data objects (afaik). We can register a Name Assigning Authority Number (NAAN) for ARKs for free, and the N2T.net service hosted at CDL will perform simple resolution to our own servers for free, or we could subscribe to EZID for a fee and get ID generation and resolution services.

From both a technical and global support perspective I prefer Handles to ARKs, and I think that would be the way to go if we were part of a larger institution, but ARKs are probably more appropriate for Alpheios as a standalone project, both in terms of cost and portability.

Essentially these characteristics (outlined at https://arks.org/learn-about-arks/) all align well with our needs:

  • affordability – there are no fees to assign or use ARKs
  • self-sufficiency – you can host ARKs on your own web server, e.g., with the Noid (Nice Opaque Identifiers) open source software
  • portability – you can move ARKs to other servers without losing their core identities
  • global resolvability – you can host ARKs at a well-known server, e.g., the N2T.net (Name-to-Thing) resolver
  • density – ARKs handle mixed case, permitting shorter identifiers (CD, Cd, cD, cd are all distinct)
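
For concreteness, here is what an ARK might look like, using the reserved 99999 test NAAN as a placeholder (Alpheios would register its own NAAN); the shoulder and name shown are purely illustrative:

```
ark:/99999/a1wd8f3k2                    ← NAAN (99999), shoulder (a1), opaque name (wd8f3k2)
https://n2t.net/ark:/99999/a1wd8f3k2    ← globally resolvable form via the N2T resolver
```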

balmas commented Nov 18, 2020

#40 lists the proposed data model. From this model, I think all of the following are candidates for persistent identifiers.

So far I have:

  • All Entity Node Objects
  • Vocabulary terms (in the hopefully rare case that we cannot use a term from a standard ontology)

For Translation Alignments, I think any Alignment published to the Alpheios Data Store should have a persistent identifier, as should probably all addressable component parts, but while in the editing stage, local, document-specific identifiers should suffice. In other words, segments and tokens within an alignment document do not need PIDs until those data objects are published to the Alpheios Data Store (see the sketch below).
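
A minimal sketch of that distinction, with purely illustrative field names:

```javascript
// While editing, segments and tokens carry only local, document-scoped ids.
// A persistent identifier is minted for the alignment (and its addressable
// parts) only at publish time.
const alignment = {
  pid: null, // assigned at publish, e.g. an ARK such as 'ark:/99999/a1wd8f3k2'
  segments: [
    { localId: 'seg-1', tokens: [{ localId: 'tok-1-1', word: 'arma' }] }
  ]
}

// mintPid is an assumed minting function (e.g. backed by Noid).
function publish (alignment, mintPid) {
  return { ...alignment, pid: mintPid() }
}
```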

I need to think more about users -- ideally, if a user supplied an ORCID (https://orcid.org/) we could use that, but I don't know if we want to make an ORCID a requirement for using Alpheios' annotation features.

@irina060981

I think this is a good choice for alignments:

  • use a unique PID for each alignment
  • use local identifiers for objects inside each alignment

Because any public link we could give to users would need to identify the alignment first, and all other objects could be addressed relative to the alignment PID as a parent.

The only case I can imagine that would need more than an alignment PID is when two users work with the same alignment and each of them has their own alignment groups and comments.

In our model this is not yet possible, and I believe that if we added a feature to share an alignment among users, we would clone it. But if we don't clone, then we would need the user's PID, and each alignment object (child of the alignment) would need to have two PIDs: the alignment's and the user's.

What are our plans for such a feature, @balmas?


balmas commented Dec 4, 2020

In our model this is not yet possible, and I believe that if we added a feature to share an alignment among users, we would clone it. But if we don't clone, then we would need the user's PID, and each alignment object (child of the alignment) would need to have two PIDs: the alignment's and the user's.
What are our plans for such a feature, @balmas?

This is a good question. Supporting collaborative work by multiple users on a single alignment could be a future requirement, but as PIDs, once assigned, will be unique across ALL alignments, regardless of user or object, I don't think it changes the requirements for the PIDs themselves. It might instead change when PIDs are assigned and require additional access levels (e.g. 'shared' in addition to 'public' and 'private').

@irina060981

Then I think we could use PIDs only for alignments for now, and maybe in the future it would be useful to add an additional data layer to separate work between users.


balmas commented Dec 18, 2020

In deciding how to uniquely identify the data in the Alpheios popup as annotation targets, we need to consider that this data, as presented to the user, is really a view over data that may be combined from many sources.

Take the following scenario:

  • the user has selected to see short definitions from lexiconA and lexiconB, and lexiconA doesn't have the lemma but lexiconB does
  • the page the user is viewing has treebank data that is used for disambiguation. The user sees only the inflection from the morph service which matched the one in the treebank; the stem+suffix of the inflection come from the Whitaker morphology engine, while the case, gender and number come from both.

If the user chooses to annotate this, they are potentially annotating multiple things at once:

  1. the lemma, short definition and morphology of the word in context
  2. the lemma and morphology as reported by the Whitaker morphology engine
  3. the lemma and morphology as reported by the Treebank data file
  4. the missing short definition in lexiconA
  5. the short definition produced by lexiconB

Suppose, in the simplest case, the user wants to make a comment on the case of the inflection that was shown. The potential targets of that annotation are:

  • the word
  • the word in the specific context
  • the lexeme
  • the inflection
  • the case of the inflection
  • the treebank file
  • the case of the specific word in the specific sentence of the specific treebank file
  • the case reported by the Whitaker morphological engine for the word

If we ask users themselves to define which of these they would like their annotation to apply to, I think that would make the act of annotation too onerous. (However, I would still very much like a debugging version of the view which allows us to see clearly how the different pieces of data are combined to create the view.)

The next time they look up the same word, they might get different results if the resources chosen at that time are not the same as the ones that were used when they made the annotation. But we will still want to be able to include their annotation if any of the same targets are applicable.

We therefore need to be able to uniquely and distinctly identify all of the sources that contribute to the different parts of the view, as well as all of the lexical entities that are represented in the view, and include all that are applicable as the target when the annotation is saved (see the sketch below).
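
A minimal sketch of what this could look like if we adopt the W3C Web Annotation model, with a single annotation carrying every applicable target; all IRIs below are hypothetical placeholders:

```javascript
// One comment on the case of the inflection, attached to all applicable
// targets at once (lexical entities and source resources alike).
const annotation = {
  '@context': 'http://www.w3.org/ns/anno.jsonld',
  id: 'ark:/99999/a1anno42',
  type: 'Annotation',
  bodyValue: 'The case shown here should be dative, not ablative.',
  target: [
    'ark:/99999/a1lexeme-fero',          // the lexeme
    'ark:/99999/a1infl-fero-case',       // the case of the inflection
    'ark:/99999/a1tb-file7-s3-t12-case', // the case in the treebank file
    'ark:/99999/a1whitaker-fero-case'    // the case from the Whitaker engine
  ]
}
```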


balmas commented Dec 18, 2020

  • A new IRIProvider component should be created which presents an API for creating, retrieving, or minting identifiers for:

    • lexical entities
    • annotations
    • other hashable resources

    Depending on the type of id being requested, the component should have the capability to (1) construct an identifier by hashing the data that is provided to it, and (2) call a remote service to query and retrieve an identifier when hashing is not possible for the resource type. (A sketch of this API follows after this list.)

  • The ClientAdapter should be responsible for reporting the IRI of the resource it provides. How this is done will depend upon the individual resource. Some resources may provide their own IRIs; for others, the IRI will need to be determined by the adapter code. A ClientAdapter may use the IRIProvider component to generate IRIs.

  • Functions which disambiguate and combine lexical query results must retain the underlying IRIs of the resource(s) which contributed to the resulting data objects.

  • Components which display an annotatable view of annotatable query results must convey the identifiers of the annotatable elements to the annotation handler.
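
A minimal sketch of what the IRIProvider API could look like; the class shape, method names, and the ARK prefix are all assumptions, not settled design:

```javascript
import crypto from 'crypto'

class IRIProvider {
  constructor ({ remoteMintUrl }) {
    this.remoteMintUrl = remoteMintUrl // hypothetical minting service endpoint
  }

  // (1) For hashable resources: construct an identifier by hashing
  // the data provided.
  iriFromHash (resource) {
    const hash = crypto.createHash('md5')
      .update(JSON.stringify(resource))
      .digest('hex')
    return `ark:/99999/a1h${hash}` // placeholder test NAAN and shoulder
  }

  // (2) When hashing is not possible for the resource type: call a remote
  // service to query and retrieve an identifier.
  async iriFromService (resourceType) {
    const resp = await fetch(`${this.remoteMintUrl}?type=${resourceType}`)
    const { iri } = await resp.json()
    return iri
  }
}
```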


balmas commented Dec 18, 2020

@kirlat and @irina060981 please provide your questions and comments on the above. Thank you!


kirlat commented Dec 18, 2020

I have a conceptual question about the way annotations should work. We can probably assume there are two types of data:
a. Data on remote services such as the Tufts morphology service or the Perseids treebank. We cannot change this data. Nor can we rely on this data to be persistent: items may change or disappear at any moment.
b. The data we synthesize by taking information from one or several (a) sources and combining it in the way we think is most appropriate. This is the information that is displayed to the user. The choice of items presented to the user, and the way they are combined together, will define the set of items the user will be able to comment upon; the way items are combined may entice the user to make a comment (or not).

It seems the way we combine information into objects of type (b) is extremely important, as it affects the decisions of users during commenting. However, we do not store the objects that we synthesize (type b) anywhere. We create them based on a set of rules which are often complex and, even more importantly, may change over time with new updates of our app.

So we cannot guarantee that a year from now we would return the same objects for the same lexical query. The user comments made a year before, however, may relate to a combination of objects (a synthesized object of type b) that existed a year before but does not exist now. And I'm not even talking about resources of type (a) that, being part of an object of type (b), may disappear or be altered.

Is the above based on the correct assumptions?


kirlat commented Dec 18, 2020

The other question is about the subject of commenting. Do we think that users would mostly try to comment upon:

  1. What data is selected to be combined to create an object of type (b).
  2. How this data is combined (i.e. which morphology items are attached to which words).

Users would probably NOT comment on the correctness of data from sources of type (a) directly, because they would not see it in the information we present to them.

So can we assume that all comments will relate to what data we select to present to the user and the way we choose to combine it together (i.e. how we build an object of type b), but not to how good the data is in the sources of type (a)?


balmas commented Dec 18, 2020

I have a conceptual question about the way annotations should work. We can probably assume there are two types of data:
a. Data on remote services such as the Tufts morphology service or the Perseids treebank. We cannot change this data. Nor can we rely on this data to be persistent: items may change or disappear at any moment.
b. The data we synthesize by taking information from one or several (a) sources and combining it in the way we think is most appropriate. This is the information that is displayed to the user. The choice of items presented to the user, and the way they are combined together, will define the set of items the user will be able to comment upon; the way items are combined may entice the user to make a comment (or not).

It seems the way we combine information into objects of type (b) is extremely important, as it affects the decisions of users during commenting. However, we do not store the objects that we synthesize (type b) anywhere. We create them based on a set of rules which are often complex and, even more importantly, may change over time with new updates of our app.

So we cannot guarantee that a year from now we would return the same objects for the same lexical query. The user comments made a year before, however, may relate to a combination of objects (a synthesized object of type b) that existed a year before but does not exist now. And I'm not even talking about resources of type (a) that, being part of an object of type (b), may disappear or be altered.

Is the above based on the correct assumptions?

That is mostly correct, yes. The availability of the remote services is not as ephemeral as you suggest -- many are hosted on Alpheios servers. However, especially as we add new sources of data and make more configuration options available to users for how to combine them, it is reasonable to assume that the combined view of resources available for a word is not stable across times or circumstances. And it's not clear that it should be.

One possibility I thought of was that when a user annotates an item in a view, we store a full replica of the data they were seeing as they annotated, as the target. However, we do not want to build up a data set of "frozen" views that the user gets whenever they look up something they have annotated. What I think we really want to do is as I have described above: annotate the source resources for the view, and the lexical entities that they reference.


balmas commented Dec 18, 2020

The other question is about the subject of commenting. Do we think that users would mostly try to comment upon:

  1. What data is selected to be combined to create an object of type (b).
  2. How this data is combined (i.e. which morphology items are attached to which words).

Users would probably NOT comment on the correctness of data from sources of type (a) directly, because they would not see it in the information we present to them.

So can we assume that all comments will relate to what data we select to present to the user and the way we choose to combine it together (i.e. how we build an object of type b), but not to how good the data is in the sources of type (a)?

While the user might not be specifically aware that they are commenting on the data that is in the sources, incorrect or incomplete data in the sources is the most likely reason a user would be annotating the data in the first place.


kirlat commented Dec 21, 2020

Thanks for the comments! If I understand correctly (please let me know if not), here is how the annotation-enabled workflow might look.

Right now data goes through several stages before it is displayed to the user:

  1. We retrieve information about individual lexical entities (such as lexemes and definitions) from remote sources. This is done in client adapters.
  2. We transform it to the format that we use internally and do some data corrections, if necessary, based on additional knowledge we have.
  3. The lexical query workflow gathers data returned from several client adapters (lexemes, definitions, translations, etc.) and composes a homonym object from it. That homonym object is displayed to the user.

It seems we could have two types of annotations. The first type are annotations that correct pieces of data from various sources. These annotations need to be applied during step (2). I'm not sure if the code that does that should belong to the client adapters or not. On the one hand, it would be similar to the transformations the client adapters already perform. On the other hand, doing so would require the client adapters to query more than one source (the lexical data source and the annotations data source), so they might lose their specialization as a result. It would also mean tighter integration between the client adapters and the annotations package. I don't think that is a good thing. So, in my opinion, it should be a layer of transformations separate from the client adapters.

The second type are annotations that correct the relationships between lexical entities. Those annotations should be applied during the composition of the homonym object (3). Right now the lexical query workflow is responsible for that.

The workflow with annotations added seems to be a slightly modified one (sketched in code after the list):

  1. Client adapters get data from the source.
  2. Client adapters transform data.
    2a. We retrieve annotations for the lexical entities that were returned by the client adapters.
  3. We compose a homonym out of lexical entities.
    3a. We apply annotations of the relationships to the homonym assembled in step (3).
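
A rough sketch of this modified workflow; every function and object name here (adapters, annotationSource, composeHomonym, and so on) is an assumption for illustration only:

```javascript
async function lexicalQuery (word, language) {
  // 1-2. Client adapters get data from the sources and transform it.
  const results = await Promise.all(adapters.map(a => a.fetch(word, language)))
  const entities = results.flat()

  // 2a. Retrieve annotations relevant to the returned lexical entities
  // and apply data corrections.
  const entityAnnotations = await annotationSource.forEntities(entities)
  const corrected = applyCorrections(entities, entityAnnotations)

  // 3. Compose a homonym out of the lexical entities.
  const homonym = composeHomonym(corrected)

  // 3a. Apply relationship annotations to the assembled homonym.
  const relAnnotations = await annotationSource.forRelationships(homonym)
  return applyRelationshipAnnotations(homonym, relAnnotations)
}
```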

What is specific about the lexical entities we handle during step (2a) is that they will most likely not have any IDs attached to them (as not all external sources will provide them). It might also happen that the lexical entity we have an annotation for differs slightly from the one being returned by the client adapter, and yet we might still want this annotation to be attached (it would be up to our business logic to decide whether the attachment should take place). So the exchange seems to be: "Hey, annotation data source! Here is the lexical entity (lexeme, definitions, etc.) we've got. Do you have any annotations that could be relevant to it?". It may even specify what level of relevance is desired, something similar to the relevance level used in text search. In response, the annotation data source would return all annotation records that might be relevant.

Let's take a lexeme returned by the client adapter. A query to the annotation data source should include no ID of the lexeme (because it is most likely not provided by the remote source), but it should contain information that can identify the lexeme uniquely: word, language, and context (see the sketch below). The annotation data source should look through its database and return exact or close matches (for example, it may return records with the same word and language, but a different context). It means that, in order to be able to retrieve the information requested, the annotation DB (or some other related DB) should store an "essence" of an annotatable item (word, language, and context, in the case of the lexeme). It also means that an ID can be assigned to the lexeme within the annotation data source and used to establish a connection between an annotation and the "essence" of the lexical data this annotation is associated with.
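
A sketch of such a query, with a hypothetical API shape; note that no lexeme ID is sent, only the identifying "essence":

```javascript
const matches = await annotationSource.findRelevant({
  type: 'lexeme',
  word: 'fero',
  language: 'lat',
  context: 'arma virumque cano', // illustrative context
  minRelevance: 0.8              // desired relevance level, as in text search
})
// Returns exact matches plus close ones, e.g. records with the same word
// and language but a different context.
```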

The next step, homonym composition, is slightly different. Let's say that the lexeme we've got from the client adapters has a context that we don't have a record for in the annotation data source. But let's say we have two records in the annotation data source whose contexts are close enough that it would be useful to display them to the user. Let's say those annotation records have IDs RA101 and RA102.

The lexeme entity in the composed homonym would then consist of three data pieces: the lexeme in the form that came from the client adapter (let's say it has no matching records in the annotation DB), and the RA101 and RA102 annotation records. Let's say that during the composition phase we attached a definition to this lexeme. We don't know if this definition has any matching records in our annotation database.

We would want to check if there are any annotations attached to the relationship between the lexeme and the definition. We might also be interested in any annotations of relationships between RA101, RA102 and the definition. So we will need to send a query like:

item1:
  lexeme from the source, presented by its "essence": word, language, context
  RA101 (the ID of an item we found relevant)
  RA102 (same as above)
item2:
  definition (the definition text as the information identifying it)

The database should return annotations for any of the relationships listed below:

  • lexeme – definition
  • RA101 – definition
  • RA102 – definition

The lexical query business logic would use the annotation data source response to decide whether the relationship created between the lexeme and the definition should be amended. For example, if we have annotations saying that this is not a correct definition of the lexeme, we might decide to break the relationship and detach the definition. In addition, the lexical query will also attach annotations to the homonym object so that they can be displayed to the user.

Does the above make sense?

If so, it leads to several important conclusions:

  • The annotation (or some other related) database(s) should store the "essence" of lexical entities (word/language/context in the case of a lexeme).
  • We need to identify what that essential information would be for each type of lexical entity. We would also need to identify which types of lexical entities we will recognize (we do that already, but we may re-review it).
  • A lexical entity's ID will be assigned when a record using the entity is created in the annotation database(s). When we create that record, we can keep a reference (a URI) to an external resource, but that seems not essential to the workflow described above.
  • An annotation database can use IDs of items that are minted by the DB that stores the lexical entities. Both DBs can be part of an annotation data source.
  • Right now, once the homonym is composed, its structure is "monolithic" and cannot be decomposed. To make decomposition possible, we have to keep all the entities from which the homonym was composed as separate items. The homonym should probably become a set of resources that compose it, plus a function that returns an assembled homonym representation (something similar to the concept of a View in a database). We might have several such functions that would allow us to compose the homonym differently for different purposes if we need to: for debugging, or according to user preferences, for example. That composition will allow us, when the user annotates the homonym, to decide exactly which entity (or entities) an annotation should be attached to (we can rely on our business logic here, or we can ask the user to choose in cases that are not so obvious).
  • Not every lexical entity that is part of a homonym needs to have an ID (though they might anyway). Only those items that have matching records in our annotation databases must have IDs assigned to them.
  • Once we define an "essence" that makes an entity unique, we can consider any two entities with the same essence to be the same. This also means that we can use an MD5 hash of all essence data as an ID: two entities with the same hash can be considered the same (see the sketch below).
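
A sketch of such hash-based IDs, with illustrative field names:

```javascript
import crypto from 'crypto'

function essenceId (essence) {
  // Canonicalize key order so that equal essences always hash identically.
  const canonical = JSON.stringify(essence, Object.keys(essence).sort())
  return crypto.createHash('md5').update(canonical).digest('hex')
}

// Two lexemes with the same word/language/context produce the same ID:
essenceId({ word: 'fero', language: 'lat', context: 'arma virumque cano' })
```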

With a design like that, we are not so dependent on external resources. Since we have an "essence" stored in our database, we would be able to display an annotation to the user in a meaningful (although abridged) way using that "essence" information.

What do you think? Sorry it is lengthy, but I was not able to make it any shorter, unfortunately.


balmas commented Dec 21, 2020

Let's take a lexeme returned by the client adapter. A query to the annotation data source should include no ID of the lexeme (because it is most likely not provided by the remote source), but it should contain information that can identify the lexeme uniquely: word, language, and context.

This is generally how I thought this would work, yes. I think we may, in many instances, be able to assign IDs to the responses returned by the resource by creating a hash of the resource contents. However, that doesn't obviate the need to be able to query on the "essence" of the lexical entity as well.


balmas commented Dec 21, 2020

An annotation database can use IDs of items that are minted by the DB that stores the lexical entities. Both DBs can be part of an annotation data source.

I believe this to be one database, at least on our side. That was the design in #40.

@irina060981

After all this description, I think that annotations are really a difficult task with a lot of undefined steps :)

From my point of view we should not forget about the following:

  • the code that gets data and composes it into homonyms should be independent of annotations and usable without any annotation injections. This is especially important for the client-adapters repository, because it could be used by other applications, or inside another embed library, with no need for annotations.

  • the process of getting annotations should run in a workflow parallel to the data retrieval for the lookup; otherwise it would make getting data much slower (especially in the case of non-strict queries to the database). We could spend a lot of time waiting for a response with annotations while the user may want only the morph data (see the sketch below).
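
A sketch of that parallel workflow; all function names are hypothetical:

```javascript
async function lookup (word, language) {
  // Start both requests at once; neither blocks the other.
  const morphPromise = fetchLexicalData(word, language)
  const annoPromise = annotationSource
    .findRelevant({ word, language })
    .catch(() => []) // a slow or failed annotation query must not break lookup

  // Show morph data to the user as soon as it is ready.
  const homonym = composeHomonym(await morphPromise)
  render(homonym)

  // Enrich the view later, when (and if) annotations arrive.
  const annotations = await annoPromise
  if (annotations.length > 0) {
    render(applyAnnotations(homonym, annotations))
  }
}
```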
