[ENG-4647] search api docs #813

Merged Nov 9, 2023 (24 commits)

Changes from 14 commits
150 changes: 14 additions & 136 deletions how-to/use-the-api.md
@@ -1,150 +1,28 @@
# How to use the API

## Harvesting metadata records in bulk

`/oaipmh` -- an implementation of the Open Archives Initiative's [Protocol for Metadata Harvesting](https://www.openarchives.org/OAI/openarchivesprotocol.html), an open standard for harvesting metadata
from open repositories. You can use this to list metadata in bulk, or query by a few simple
parameters (date range or source).
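
For example, a bulk-harvest request might look like this (an illustrative sketch; `verb`, `metadataPrefix`, and `from` are standard OAI-PMH parameters):
```
GET https://share.osf.io/oaipmh?verb=ListRecords&metadataPrefix=oai_dc&from=2023-01-01
```
The response is XML; when it contains a `<resumptionToken>`, pass that token in a follow-up `ListRecords` request to page through the rest of the records.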


## Searching metadata records

`/api/v2/search/creativeworks/_search` -- an elasticsearch API endpoint that can be used for
searching metadata records and for compiling summary statistics and analyses of the
completeness of data from the various sources.

You can search by sending a GET request with the query parameter `q`, or a POST request
with a body that conforms to the [elasticsearch query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).

For example, the following two queries are equivalent:
```
GET https://share.osf.io/api/v2/search/creativeworks/_search?q=badges
```
```
POST https://share.osf.io/api/v2/search/creativeworks/_search
{
    "query": {
        "query_string": {
            "query": "badges"
        }
    }
}
```
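
The POST body accepts the full query DSL, not just `query_string`. For example, here is a sketch of the same text query combined with a date-range filter (using the `date` field described below):
```
POST https://share.osf.io/api/v2/search/creativeworks/_search
{
    "query": {
        "bool": {
            "must": {"query_string": {"query": "badges"}},
            "filter": {"range": {"date": {"gte": "2017-01-01"}}}
        }
    }
}
```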

You can also use the [SHARE Discover page](https://share.osf.io/discover) to generate query DSL.
Use the filters in the sidebar to construct a query, then click "View query body" to see the query in JSON form.
### Fields Indexed by Elasticsearch

The search endpoint has the following metadata fields available:

```
'title'
'description'
'type'
'date'
'date_created'
'date_modified'
'date_updated'
'date_published'
'tags'
'subjects'
'sources'
'language'
'contributors'
'funders'
'publishers'
```
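
Any of these fields can be used to scope terms in the `q` parameter, using Lucene query-string syntax (a sketch, assuming default Elasticsearch query-string behavior):
```
GET https://share.osf.io/api/v2/search/creativeworks/_search?q=title:badges
```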

#### Date fields
There are five date fields, and each has a different meaning. Two are given to SHARE by the data source:

``date_published``
When the work was first published, issued, or made publicly available in any form.
Not all sources provide this, so some works in SHARE have no ``date_published``.
``date_updated``
When the work was last updated by the source. For example, an OAI-PMH record's ``<datestamp>``.
Most works have a ``date_updated``, but some sources do not provide this.

Three date fields are populated by SHARE itself:

``date_created``
When SHARE first ingested the work and added it to the SHARE dataset. Every work has a ``date_created``.
``date_modified``
When SHARE last ingested the work and modified the work's record in the SHARE dataset. Every work
has a ``date_modified``.
``date``
Because many works may not have ``date_published`` or ``date_updated`` values, sorting and filtering works
by date can be confusing. The ``date`` field is intended to help. It contains the most useful available
date. If the work has a ``date_published``, ``date`` contains the value of ``date_published``. If the work
has no ``date_published`` but does have ``date_updated``, ``date`` is set to ``date_updated``. If the work
has neither ``date_published`` nor ``date_updated``, ``date`` is set to ``date_created``.
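
For example, a work whose source provides ``date_updated`` = ``2020-05-01`` but no ``date_published`` gets ``date`` = ``2020-05-01``, even if SHARE's ``date_created`` for it is years later.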

## Sample and search for index-cards

`GET /trove/index-card-search`: search index-cards

`GET /trove/index-value-search`: search values for specific properties on index-cards

(see [openapi docs](/trove/docs/openapi.html) for detail)
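
For example, a minimal full-text search (a sketch -- the `cardSearchText` parameter also appears in the index-strategy code further down; other parameters are listed in the openapi docs):
```
GET https://share.osf.io/trove/index-card-search?cardSearchText=badges
```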

## Posting index-cards
> NOTE: currently used only by other COS projects, not yet for public use

`POST /trove/ingest?focus_iri=...&record_identifier=...`: currently supports only `Content-Type: text/turtle`
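
A sketch of a push request (the focus IRI, record identifier, and turtle body are illustrative, and the `dcterms:title` property is an assumption; in practice the `focus_iri` parameter value should be URL-encoded):
```
POST https://share.osf.io/trove/ingest?focus_iri=https://osf.io/abcde/&record_identifier=abcde
Authorization: Bearer ACCESS_TOKEN
Content-Type: text/turtle

<https://osf.io/abcde/> <http://purl.org/dc/terms/title> "This is a preprint!"@en .
```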

## Deleting index-cards
> NOTE: currently used only by other COS projects, not yet for public use

`DELETE /trove/ingest?record_identifier=...`: request deletion of a previously posted index-card

## Pushing metadata records

`/api/v2/normalizeddata` -- how to push data into SHARE/Trove (instead of waiting to be harvested)

```
POST /api/v2/normalizeddata HTTP/1.1
Host: share.osf.io
Authorization: Bearer ACCESS_TOKEN
Content-Type: application/vnd.api+json

{
    "data": {
        "type": "NormalizedData",
        "attributes": {
            "data": {
                "central_node_id": "...",
                "@graph": [/* see below */]
            }
        }
    }
}
```

### NormalizedData format
The normalized metadata format used internally by SHARE/Trove is a subset of
[JSON-LD graph](https://www.w3.org/TR/json-ld/#named-graphs).
Each graph node must contain `@id` and `@type`, plus other key/value pairs
according to the
["SHARE schema"](https://github.com/CenterForOpenScience/SHARE/blob/develop/share/schema/schema-spec.yaml)

In this case, `@id` will always be a "blank" identifier, which begins with `'_:'`
and is used only to define relationships between nodes in the graph -- nodes
may reference each other with `@id`/`@type` pairs --
e.g. `{'@id': '...', '@type': '...'}`

Example serialization: The following SHARE-style JSON-LD document represents a
preprint with one "creator" and one identifier -- the graph contains nodes for
the preprint, person, and identifier, plus another node representing the
"creator" relationship between the preprint and person:
```
{
    "central_node_id": "_:foo",
    "@graph": [
        {
            "@id": "_:foo",
            "@type": "preprint",
            "title": "This is a preprint!"
        },
        {
            "@id": "_:bar",
            "@type": "workidentifier",
            "uri": "https://osf.io/foobar/",
            "creative_work": {"@id": "_:foo", "@type": "preprint"}
        },
        {
            "@id": "_:baz",
            "@type": "person",
            "name": "Magpie Jones"
        },
        {
            "@id": "_:qux",
            "@type": "creator",
            "creative_work": {"@id": "_:foo", "@type": "preprint"},
            "agent": {"@id": "_:baz", "@type": "person"}
        }
    ]
}
```
2 changes: 2 additions & 0 deletions project/urls.py
@@ -12,13 +12,15 @@

 from share.admin import admin_site
 from share.oaipmh.views import OAIPMHView
+from trove.views.vocab import TroveVocabView


 urlpatterns = [
     url(r'^admin/', admin_site.urls),
     # url(r'^api-auth/', include('rest_framework.urls', namespace='rest_framework')),
     path('api/v3/', include('trove.urls', namespace='trove')), # same as 'trove/' but more subtle
     path('trove/', include('trove.urls', namespace='trovetrove')),
+    path('vocab/2023/trove/<path:vocab_term>', view=TroveVocabView.as_view(), name='trove-vocab'),
     url(r'^api/v2/', include('api.urls', namespace='api')),
     url(r'^api/(?P<path>(?!v\d+).*)', APIVersionRedirectView.as_view()),
     url(r'^api/v1/', include('api.urls_v1', namespace='api_v1')),
3 changes: 2 additions & 1 deletion requirements.txt
@@ -21,6 +21,7 @@ gevent==22.10.2 # MIT
 jsonschema==3.2.0 # MIT
 lxml==4.9.1 # BSD
 kombu==5.1.0 # BSD 3 Clause
+markdown2==2.4.10 # MIT
 nameparser==1.0.6 # LGPL
 networkx==2.5.1 # BSD
 newrelic==8.4.0 # newrelic APM agent, Custom License
@@ -42,4 +43,4 @@ xmltodict==0.12.0 # MIT
 # Allows custom-rendered IDs, hiding null values, and including data in error responses
 git+https://github.com/cos-forks/django-rest-framework-json-api.git@v4.2.1+cos0

-git+https://github.com/aaxelb/gather.git@0.2023.45
+git+https://github.com/aaxelb/primitive_metadata.git@0.2023.57
40 changes: 20 additions & 20 deletions share/search/index_strategy/trove_indexcard_flats.py
@@ -12,7 +12,7 @@
 from django.conf import settings
 from django.db.models import Exists, OuterRef
 import elasticsearch8
-from gather import primitive_rdf
+from primitive_metadata import primitive_rdf

 from share.search import exceptions
 from share.search import messages
@@ -167,7 +167,7 @@ def index_mappings(self):
         }

     def _build_sourcedoc(self, indexcard_rdf):
-        _rdfdoc = primitive_rdf.TripledictWrapper(indexcard_rdf.as_rdf_tripledict())
+        _rdfdoc = primitive_rdf.RdfGraph(indexcard_rdf.as_rdf_tripledict())
         if _should_skip_card(indexcard_rdf, _rdfdoc):
             return None  # will be deleted from the index
         _nested_iris = defaultdict(set)
@@ -182,13 +182,13 @@ def _build_sourcedoc(self, indexcard_rdf):
                 _nested_dates[_walk_path].add(datetime.date.isoformat(_walk_obj))
             elif is_date_property(_walk_path[-1]):
                 try:
-                    datetime.date.fromisoformat(_walk_obj.unicode_text)
+                    datetime.date.fromisoformat(_walk_obj.unicode_value)
                 except ValueError:
-                    logger.debug('skipping malformatted date "%s" in %s', _walk_obj.unicode_text, indexcard_rdf)
+                    logger.debug('skipping malformatted date "%s" in %s', _walk_obj.unicode_value, indexcard_rdf)
                 else:
-                    _nested_dates[_walk_path].add(_walk_obj.unicode_text)
-            elif isinstance(_walk_obj, primitive_rdf.Text):
-                _nested_texts[(_walk_path, _walk_obj.language_iri)].add(_walk_obj.unicode_text)
+                    _nested_dates[_walk_path].add(_walk_obj.unicode_value)
+            elif isinstance(_walk_obj, primitive_rdf.Literal):
+                _nested_texts[(_walk_path, tuple(_walk_obj.datatype_iris))].add(_walk_obj.unicode_value)
         _focus_iris = {indexcard_rdf.focus_iri}
         _suffuniq_focus_iris = {get_sufficiently_unique_iri(indexcard_rdf.focus_iri)}
         for _identifier in indexcard_rdf.indexcard.focus_identifier_set.all():
@@ -224,10 +224,10 @@ def _build_sourcedoc(self, indexcard_rdf):
             'nested_text': [
                 {
                     **_iri_path_as_indexable_fields(_path),
-                    'language_iri': _language_iri,
+                    'language_iri': _language_iris,
                     'text_value': list(_value_set),
                 }
-                for (_path, _language_iri), _value_set in _nested_texts.items()
+                for (_path, _language_iris), _value_set in _nested_texts.items()
             ],
         }

@@ -816,13 +816,13 @@ def _gather_textmatch_evidence(self, es8_hit) -> Iterable[TextMatchEvidence]:
                 json.loads(_innerhit['fields']['nested_text.path_from_focus'][0]),
             )
             try:
-                _language_iri = _innerhit['fields']['nested_text.language_iri'][0]
+                _language_iris = _innerhit['fields']['nested_text.language_iri']
             except KeyError:
-                _language_iri = None
+                _language_iris = ()
             for _highlight in _innerhit['highlight']['nested_text.text_value']:
                 yield TextMatchEvidence(
                     property_path=_property_path,
-                    matching_highlight=primitive_rdf.text(_highlight, language_iri=_language_iri),
+                    matching_highlight=primitive_rdf.literal(_highlight, datatype_iris=_language_iris),
                     card_iri=_innerhit['_id'],
                 )

@@ -858,6 +858,7 @@ def fuzzy_text_must_query(self, text: str) -> dict:
             self._text_field: {
                 'query': text,
                 'fuzziness': 'AUTO',
+                # TODO: 'operator': 'and' (by query param FilterOperator, `cardSearchText[*][every-word]=...`)
             },
         }}

@@ -961,7 +962,6 @@ def _pathset_as_nestedvalue_filter(propertypath_set: frozenset[tuple[str, ...]],
     _glob_path_lengths = []
     for _path in propertypath_set:
         if all(_pathstep == GLOB_PATHSTEP for _pathstep in _path):
-            logger.critical(f'{_path=}')
             _glob_path_lengths.append(len(_path))
         else:
             _suffuniq_iri_paths.append(iri_path_as_keyword(_path, suffuniq=True))
@@ -1075,7 +1075,7 @@ def walk_from_subject(self, iri_or_blanknode, last_path: tuple[str, ...] = ()) -
         '''
         with self._visit(iri_or_blanknode):
             _twopledict = (
-                primitive_rdf.twopleset_as_twopledict(iri_or_blanknode)
+                primitive_rdf.twopledict_from_twopleset(iri_or_blanknode)
                 if isinstance(iri_or_blanknode, frozenset)
                 else self.tripledict.get(iri_or_blanknode, {})
             )
@@ -1114,19 +1114,19 @@ def for_iri_at_path(cls, path: tuple[str, ...], iri: str, rdfdoc):
             type_iris=frozenset(rdfdoc.q(iri, RDF.type)),
             # TODO: don't discard language for name/title/label
             name_text=frozenset(
-                _text.unicode_text
+                _text.unicode_value
                 for _text in rdfdoc.q(iri, NAME_PROPERTIES)
-                if isinstance(_text, primitive_rdf.Text)
+                if isinstance(_text, primitive_rdf.Literal)
             ),
             title_text=frozenset(
-                _text.unicode_text
+                _text.unicode_value
                 for _text in rdfdoc.q(iri, TITLE_PROPERTIES)
-                if isinstance(_text, primitive_rdf.Text)
+                if isinstance(_text, primitive_rdf.Literal)
             ),
             label_text=frozenset(
-                _text.unicode_text
+                _text.unicode_value
                 for _text in rdfdoc.q(iri, LABEL_PROPERTIES)
-                if isinstance(_text, primitive_rdf.Text)
+                if isinstance(_text, primitive_rdf.Literal)
             ),
         )
