[ENG-4647] search api docs #813

Merged Nov 9, 2023 (24 commits)

Changes from 14 commits
150 changes: 14 additions & 136 deletions how-to/use-the-api.md
@@ -1,150 +1,28 @@
# How to use the API

## Harvesting metadata records in bulk

`/oaipmh` -- an implementation of the Open Archives Initiative's [Protocol for Metadata Harvesting](https://www.openarchives.org/OAI/openarchivesprotocol.html), an open standard for harvesting metadata
from open repositories. You can use this to list metadata in bulk, or query by a few simple
parameters (date range or source).
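
For example, a bulk-harvest request might look like this (an illustrative sketch; `verb`, `metadataPrefix`, and `from` are standard OAI-PMH parameters):
```
GET https://share.osf.io/oaipmh?verb=ListRecords&metadataPrefix=oai_dc&from=2023-01-01
```
The response is XML; when it contains a `<resumptionToken>`, pass that token in a follow-up `ListRecords` request to page through the rest of the records.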


## Searching metadata records

`/api/v2/search/creativeworks/_search` -- an elasticsearch API endpoint that can be used for
searching metadata records and for compiling summary statistics and analyses of the
completeness of data from the various sources.

You can search by sending a GET request with the query parameter `q`, or a POST request
with a body that conforms to the [elasticsearch query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).

For example, the following two queries are equivalent:
```
GET https://share.osf.io/api/v2/search/creativeworks/_search?q=badges
```
```
POST https://share.osf.io/api/v2/search/creativeworks/_search
{
    "query": {
        "query_string": {
            "query": "badges"
        }
    }
}
```
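
The POST body accepts the full query DSL, not just `query_string`. For example, here is a sketch of the same text query combined with a date-range filter (using the `date` field described below):
```
POST https://share.osf.io/api/v2/search/creativeworks/_search
{
    "query": {
        "bool": {
            "must": {"query_string": {"query": "badges"}},
            "filter": {"range": {"date": {"gte": "2017-01-01"}}}
        }
    }
}
```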

You can also use the [SHARE Discover page](https://share.osf.io/discover) to generate query DSL.
Use the filters in the sidebar to construct a query, then click "View query body" to see the query in JSON form.
### Fields Indexed by Elasticsearch

The search endpoint has the following metadata fields available:

```
'title'
'description'
'type'
'date'
'date_created'
'date_modified'
'date_updated'
'date_published'
'tags'
'subjects'
'sources'
'language'
'contributors'
'funders'
'publishers'
```
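
Any of these fields can be used to scope terms in the `q` parameter, using Lucene query-string syntax (a sketch, assuming default Elasticsearch query-string behavior):
```
GET https://share.osf.io/api/v2/search/creativeworks/_search?q=title:badges
```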

#### Date fields
There are five date fields, and each has a different meaning. Two are given to SHARE by the data source:

``date_published``
When the work was first published, issued, or made publicly available in any form.
Not all sources provide this, so some works in SHARE have no ``date_published``.
``date_updated``
When the work was last updated by the source. For example, an OAI-PMH record's ``<datestamp>``.
Most works have a ``date_updated``, but some sources do not provide this.

Three date fields are populated by SHARE itself:

``date_created``
When SHARE first ingested the work and added it to the SHARE dataset. Every work has a ``date_created``.
``date_modified``
When SHARE last ingested the work and modified the work's record in the SHARE dataset. Every work
has a ``date_modified``.
``date``
Because many works may not have ``date_published`` or ``date_updated`` values, sorting and filtering works
by date can be confusing. The ``date`` field is intended to help. It contains the most useful available
date. If the work has a ``date_published``, ``date`` contains the value of ``date_published``. If the work
has no ``date_published`` but does have ``date_updated``, ``date`` is set to ``date_updated``. If the work
has neither ``date_published`` nor ``date_updated``, ``date`` is set to ``date_created``.
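
For example, a work whose source provides ``date_updated`` = ``2020-05-01`` but no ``date_published`` gets ``date`` = ``2020-05-01``, even if SHARE's ``date_created`` for it is years later.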

## Sample and search for index-cards

`GET /trove/index-card-search`: search index-cards

`GET /trove/index-value-search`: search values for specific properties on index-cards

(see [openapi docs](/trove/docs/openapi.html) for detail)
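
For example, a minimal full-text search (a sketch -- the `cardSearchText` parameter also appears in the index-strategy code further down; other parameters are listed in the openapi docs):
```
GET https://share.osf.io/trove/index-card-search?cardSearchText=badges
```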

## Posting index-cards
> NOTE: currently used only by other COS projects, not yet for public use

`POST /trove/ingest?focus_iri=...&record_identifier=...`: currently supports only `Content-Type: text/turtle`
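
A sketch of a push request (the focus IRI, record identifier, and turtle body are illustrative, and the `dcterms:title` property is an assumption; in practice the `focus_iri` parameter value should be URL-encoded):
```
POST https://share.osf.io/trove/ingest?focus_iri=https://osf.io/abcde/&record_identifier=abcde
Authorization: Bearer ACCESS_TOKEN
Content-Type: text/turtle

<https://osf.io/abcde/> <http://purl.org/dc/terms/title> "This is a preprint!"@en .
```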

## Deleting index-cards
> NOTE: currently used only by other COS projects, not yet for public use

`DELETE /trove/ingest?record_identifier=...`: request deletion of a previously posted index-card

## Pushing metadata records

`/api/v2/normalizeddata` -- how to push data into SHARE/Trove (instead of waiting to be harvested)

```
POST /api/v2/normalizeddata HTTP/1.1
Host: share.osf.io
Authorization: Bearer ACCESS_TOKEN
Content-Type: application/vnd.api+json

{
    "data": {
        "type": "NormalizedData",
        "attributes": {
            "data": {
                "central_node_id": "...",
                "@graph": [/* see below */]
            }
        }
    }
}
```

### NormalizedData format
The normalized metadata format used internally by SHARE/Trove is a subset of
[JSON-LD graph](https://www.w3.org/TR/json-ld/#named-graphs).
Each graph node must contain `@id` and `@type`, plus other key/value pairs
according to the
["SHARE schema"](https://github.com/CenterForOpenScience/SHARE/blob/develop/share/schema/schema-spec.yaml)

In this case, `@id` will always be a "blank" identifier, which begins with `'_:'`
and is used only to define relationships between nodes in the graph -- nodes
may reference each other with `@id`/`@type` pairs --
e.g. `{'@id': '...', '@type': '...'}`

Example serialization: The following SHARE-style JSON-LD document represents a
preprint with one "creator" and one identifier -- the graph contains nodes for
the preprint, person, and identifier, plus another node representing the
"creator" relationship between the preprint and person:
```
{
    "central_node_id": "_:foo",
    "@graph": [
        {
            "@id": "_:foo",
            "@type": "preprint",
            "title": "This is a preprint!"
        },
        {
            "@id": "_:bar",
            "@type": "workidentifier",
            "uri": "https://osf.io/foobar/",
            "creative_work": {"@id": "_:foo", "@type": "preprint"}
        },
        {
            "@id": "_:baz",
            "@type": "person",
            "name": "Magpie Jones"
        },
        {
            "@id": "_:qux",
            "@type": "creator",
            "creative_work": {"@id": "_:foo", "@type": "preprint"},
            "agent": {"@id": "_:baz", "@type": "person"}
        }
    ]
}
```
2 changes: 2 additions & 0 deletions project/urls.py
@@ -12,13 +12,15 @@

 from share.admin import admin_site
 from share.oaipmh.views import OAIPMHView
+from trove.views.vocab import TroveVocabView


 urlpatterns = [
     url(r'^admin/', admin_site.urls),
     # url(r'^api-auth/', include('rest_framework.urls', namespace='rest_framework')),
     path('api/v3/', include('trove.urls', namespace='trove')), # same as 'trove/' but more subtle
     path('trove/', include('trove.urls', namespace='trovetrove')),
+    path('vocab/2023/trove/<path:vocab_term>', view=TroveVocabView.as_view(), name='trove-vocab'),
     url(r'^api/v2/', include('api.urls', namespace='api')),
     url(r'^api/(?P<path>(?!v\d+).*)', APIVersionRedirectView.as_view()),
     url(r'^api/v1/', include('api.urls_v1', namespace='api_v1')),
3 changes: 2 additions & 1 deletion requirements.txt
@@ -21,6 +21,7 @@ gevent==22.10.2 # MIT
 jsonschema==3.2.0 # MIT
 lxml==4.9.1 # BSD
 kombu==5.1.0 # BSD 3 Clause
+markdown2==2.4.10 # MIT
 nameparser==1.0.6 # LGPL
 networkx==2.5.1 # BSD
 newrelic==8.4.0 # newrelic APM agent, Custom License
@@ -42,4 +43,4 @@ xmltodict==0.12.0 # MIT
 # Allows custom-rendered IDs, hiding null values, and including data in error responses
 git+https://github.com/cos-forks/django-rest-framework-json-api.git@v4.2.1+cos0

-git+https://github.com/aaxelb/gather.git@0.2023.45
+git+https://github.com/aaxelb/primitive_metadata.git@0.2023.57
40 changes: 20 additions & 20 deletions share/search/index_strategy/trove_indexcard_flats.py
@@ -12,7 +12,7 @@
 from django.conf import settings
 from django.db.models import Exists, OuterRef
 import elasticsearch8
-from gather import primitive_rdf
+from primitive_metadata import primitive_rdf

 from share.search import exceptions
 from share.search import messages
@@ -167,7 +167,7 @@ def index_mappings(self):
         }

     def _build_sourcedoc(self, indexcard_rdf):
-        _rdfdoc = primitive_rdf.TripledictWrapper(indexcard_rdf.as_rdf_tripledict())
+        _rdfdoc = primitive_rdf.RdfGraph(indexcard_rdf.as_rdf_tripledict())
         if _should_skip_card(indexcard_rdf, _rdfdoc):
             return None  # will be deleted from the index
         _nested_iris = defaultdict(set)
@@ -182,13 +182,13 @@ def _build_sourcedoc(self, indexcard_rdf):
                 _nested_dates[_walk_path].add(datetime.date.isoformat(_walk_obj))
             elif is_date_property(_walk_path[-1]):
                 try:
-                    datetime.date.fromisoformat(_walk_obj.unicode_text)
+                    datetime.date.fromisoformat(_walk_obj.unicode_value)
                 except ValueError:
-                    logger.debug('skipping malformatted date "%s" in %s', _walk_obj.unicode_text, indexcard_rdf)
+                    logger.debug('skipping malformatted date "%s" in %s', _walk_obj.unicode_value, indexcard_rdf)
                 else:
-                    _nested_dates[_walk_path].add(_walk_obj.unicode_text)
-            elif isinstance(_walk_obj, primitive_rdf.Text):
-                _nested_texts[(_walk_path, _walk_obj.language_iri)].add(_walk_obj.unicode_text)
+                    _nested_dates[_walk_path].add(_walk_obj.unicode_value)
+            elif isinstance(_walk_obj, primitive_rdf.Literal):
+                _nested_texts[(_walk_path, tuple(_walk_obj.datatype_iris))].add(_walk_obj.unicode_value)
         _focus_iris = {indexcard_rdf.focus_iri}
         _suffuniq_focus_iris = {get_sufficiently_unique_iri(indexcard_rdf.focus_iri)}
         for _identifier in indexcard_rdf.indexcard.focus_identifier_set.all():
@@ -224,10 +224,10 @@ def _build_sourcedoc(self, indexcard_rdf):
             'nested_text': [
                 {
                     **_iri_path_as_indexable_fields(_path),
-                    'language_iri': _language_iri,
+                    'language_iri': _language_iris,
                     'text_value': list(_value_set),
                 }
-                for (_path, _language_iri), _value_set in _nested_texts.items()
+                for (_path, _language_iris), _value_set in _nested_texts.items()
             ],
         }

@@ -816,13 +816,13 @@ def _gather_textmatch_evidence(self, es8_hit) -> Iterable[TextMatchEvidence]:
                 json.loads(_innerhit['fields']['nested_text.path_from_focus'][0]),
             )
             try:
-                _language_iri = _innerhit['fields']['nested_text.language_iri'][0]
+                _language_iris = _innerhit['fields']['nested_text.language_iri']
             except KeyError:
-                _language_iri = None
+                _language_iris = ()
             for _highlight in _innerhit['highlight']['nested_text.text_value']:
                 yield TextMatchEvidence(
                     property_path=_property_path,
-                    matching_highlight=primitive_rdf.text(_highlight, language_iri=_language_iri),
+                    matching_highlight=primitive_rdf.literal(_highlight, datatype_iris=_language_iris),
                     card_iri=_innerhit['_id'],
                 )

@@ -858,6 +858,7 @@ def fuzzy_text_must_query(self, text: str) -> dict:
             self._text_field: {
                 'query': text,
                 'fuzziness': 'AUTO',
+                # TODO: 'operator': 'and' (by query param FilterOperator, `cardSearchText[*][every-word]=...`)
             },
         }}

@@ -961,7 +962,6 @@ def _pathset_as_nestedvalue_filter(propertypath_set: frozenset[tuple[str, ...]],
     _glob_path_lengths = []
     for _path in propertypath_set:
         if all(_pathstep == GLOB_PATHSTEP for _pathstep in _path):
-            logger.critical(f'{_path=}')
             _glob_path_lengths.append(len(_path))
         else:
             _suffuniq_iri_paths.append(iri_path_as_keyword(_path, suffuniq=True))
@@ -1075,7 +1075,7 @@ def walk_from_subject(self, iri_or_blanknode, last_path: tuple[str, ...] = ()) -
         '''
         with self._visit(iri_or_blanknode):
             _twopledict = (
-                primitive_rdf.twopleset_as_twopledict(iri_or_blanknode)
+                primitive_rdf.twopledict_from_twopleset(iri_or_blanknode)
                 if isinstance(iri_or_blanknode, frozenset)
                 else self.tripledict.get(iri_or_blanknode, {})
             )
@@ -1114,19 +1114,19 @@ def for_iri_at_path(cls, path: tuple[str, ...], iri: str, rdfdoc):
             type_iris=frozenset(rdfdoc.q(iri, RDF.type)),
             # TODO: don't discard language for name/title/label
             name_text=frozenset(
-                _text.unicode_text
+                _text.unicode_value
                 for _text in rdfdoc.q(iri, NAME_PROPERTIES)
-                if isinstance(_text, primitive_rdf.Text)
+                if isinstance(_text, primitive_rdf.Literal)
             ),
             title_text=frozenset(
-                _text.unicode_text
+                _text.unicode_value
                 for _text in rdfdoc.q(iri, TITLE_PROPERTIES)
-                if isinstance(_text, primitive_rdf.Text)
+                if isinstance(_text, primitive_rdf.Literal)
             ),
             label_text=frozenset(
-                _text.unicode_text
+                _text.unicode_value
                 for _text in rdfdoc.q(iri, LABEL_PROPERTIES)
-                if isinstance(_text, primitive_rdf.Text)
+                if isinstance(_text, primitive_rdf.Literal)
             ),
         )
