Harvesting using OAI PMH
Along with its ResourceSync service, the POD Aggregator also has an OAI-PMH service for harvesting data records. Some portions of the OAI-PMH protocol not strictly necessary for POD-ReShare data consumption are not supported yet.
Consumers can use the OAI-PMH interface to get all the normalized records from a particular provider stream (specified with a set argument), or just to get records that have been added or changed (as well as identifiers of records that have been deleted) since a particular time. Records are returned as MARC XML elements in the XML documents returned by the service. Up to 10,000 records may be returned at a time. (The number of records in a response is subject to change, however.) If the full answer to a request involves more records than fit in an initial response, a resumption token is also returned that a client can use to get more records, as described in the OAI-PMH protocol specification.
The base URL for the API server is https://pod.stanford.edu/oai. As with other POD API requests, all POD OAI-PMH requests must include an access token generated by the POD Aggregator. Requests use one of four verbs described in the OAI-PMH protocol specification: Identify, ListSets, ListMetadataFormats, and ListRecords. (The ListIdentifiers and GetRecord verbs are not supported at this time.)
The ListSets verb (full URL: https://pod.stanford.edu/oai?verb=ListSets) shows the sets of records available for harvesting.
A set is available for each stream that can be harvested. The ListSets response includes setDescription elements inside the returned set elements. The set description for the current default stream has a dc:type element (inside a Dublin Core element) with the exact value default. (Other streams may have dc:type elements that include the word "default" within a larger string, such as "former default". Those are not current default streams.) The dc:contributor element in the setDescription contains the slug of the institution providing the data in the set.
OAI-PMH clients can use the dc:contributor and dc:type elements to discover the current default set for institutions of interest. For example, to find the current default set for the University of Pennsylvania libraries, a client can parse the ListSets response to find a set whose setDescription includes a dc:contributor element with value penn and a dc:type element with value default.
Once the desired set is discovered, the client can then request records using the identifier of the applicable set, which is given in the set's setSpec element.
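The discovery step described above can be sketched in Python as follows. This is a minimal illustration, not the Aggregator's own client code: the function name `find_default_set` is hypothetical, and the sketch assumes the response uses the standard OAI-PMH and Dublin Core XML namespaces. It matches dc:type against the exact string default, so a stream marked "former default" is skipped.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH 2.0 and Dublin Core elements.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def find_default_set(list_sets_xml, contributor):
    """Return the setSpec of the current default set for `contributor`.

    `list_sets_xml` is the body of a ListSets response. Returns None if
    no set has both the given dc:contributor and a dc:type of exactly
    "default".
    """
    root = ET.fromstring(list_sets_xml)
    for set_el in root.iter(OAI_NS + "set"):
        contributors = [(el.text or "").strip()
                        for el in set_el.iter(DC_NS + "contributor")]
        types = [(el.text or "").strip()
                 for el in set_el.iter(DC_NS + "type")]
        # Exact comparison: "former default" must not match.
        if contributor in contributors and "default" in types:
            spec = set_el.find(OAI_NS + "setSpec")
            if spec is not None and spec.text:
                return spec.text.strip()
    return None
```

For example, `find_default_set(response_body, "penn")` would return the setSpec to pass as the set argument of a ListRecords request, or None if no current default set is listed for that institution.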
The most up-to-date records for an institution will generally be in the set for its default stream. The records in that stream can be harvested with the ListRecords verb.
A full harvest of the MARC records in a set can be requested with this URL:
https://pod.stanford.edu/oai?verb=ListRecords&metadataPrefix=marc21&set=$SET
where $SET is the set for the institutional stream to be harvested. (It is also possible to call ListRecords without a set. However, such a request will return records from all sets, which will typically include a large amount of redundant data that will take a long time to deliver. We therefore recommend that clients not issue set-less requests.)
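As a small illustration, the request URLs can be assembled with Python's standard library rather than by string concatenation, so that set identifiers are escaped correctly. The helper names `oai_url` and `full_harvest_url` are hypothetical. The sketch also rejects the verbs this service does not support, and it leaves the access token to whatever mechanism the client already uses for other POD API requests.

```python
from urllib.parse import urlencode

BASE_URL = "https://pod.stanford.edu/oai"

# The four verbs the POD Aggregator OAI-PMH service currently supports.
SUPPORTED_VERBS = {"Identify", "ListSets", "ListMetadataFormats", "ListRecords"}


def oai_url(verb, params=None):
    """Build a request URL for a supported verb plus optional arguments."""
    if verb not in SUPPORTED_VERBS:
        raise ValueError(f"verb not supported by this service: {verb}")
    query = {"verb": verb}
    query.update(params or {})
    return BASE_URL + "?" + urlencode(query)


def full_harvest_url(set_spec):
    """Build the URL for a full harvest of one set in MARC XML."""
    return oai_url("ListRecords", {"metadataPrefix": "marc21", "set": set_spec})
```

Requiring a set in `full_harvest_url` follows the recommendation above that clients not issue set-less ListRecords requests.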
A full harvest may include millions of records. The OAI service will return batches of between 500 and 10,000 records at a time. For each batch, the service will provide a resumption token that clients can use to get more records, until all records are harvested. Resumption requests take the form
https://pod.stanford.edu/oai?verb=ListRecords&resumptionToken=$TOKEN
where $TOKEN is the resumption token returned by the server. Each subsequent response will include a different resumption token, until all records have been returned in response to a request.
Resumption tokens should be used in a timely manner, as their validity may eventually time out.
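The resumption loop might look like the following sketch. It is an illustration under stated assumptions, not a reference client: `harvest` is a hypothetical name, and `fetch` is a caller-supplied function that takes a URL and returns the response body, which keeps HTTP and access-token handling out of the example. Following the OAI-PMH specification, the loop treats a missing or empty resumptionToken element as the end of the list.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "https://pod.stanford.edu/oai"


def harvest(first_url, fetch):
    """Yield every OAI-PMH <record> element, following resumption tokens.

    `fetch(url)` must return the XML body of the response as str or bytes.
    """
    url = first_url
    while url:
        root = ET.fromstring(fetch(url))
        # Hand back this batch's records before asking for the next batch.
        yield from root.iter(OAI_NS + "record")
        token_el = root.find(f".//{OAI_NS}resumptionToken")
        token = (token_el.text or "").strip() if token_el is not None else ""
        if token:
            query = urlencode({"verb": "ListRecords", "resumptionToken": token})
            url = BASE_URL + "?" + query
        else:
            url = None  # No token, or an empty one: the harvest is complete.
```

Because `harvest` is a generator, each token is used as soon as the previous batch has been consumed. A consumer that processes records slowly may prefer to write each batch to disk first so that tokens do not expire between requests.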
Once a consumer has completed a full harvest from a particular set, it can get updates from that set without having to re-harvest the entire set, using the incremental harvesting features of OAI-PMH.
A request of the form https://pod.stanford.edu/oai?verb=ListRecords&metadataPrefix=marc21&set=$SET&from=$DATE (where $SET is the set of interest and $DATE is the date a consumer's previous harvest began) will return all of the records that have been added, changed, or deleted since that prior harvest. As with the full harvest above, this response will be partitioned into batches, with resumption tokens, if necessary.
Consumers may continue to make further incremental harvests as often as desired, though the increments used in POD Aggregator harvests are full days. (Hence, there is no point in making, for example, hourly incremental harvests.) Consumers may also re-do full harvests when desired, and the records returned will reflect the current state of the set's stream.
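An incremental request and the handling of its response could be sketched as below. The function names are hypothetical. The sketch relies on two details of the OAI-PMH protocol rather than anything specific to POD: day-granularity dates are written as YYYY-MM-DD, and a deleted record is reported as a header carrying status="deleted" with no metadata.

```python
import xml.etree.ElementTree as ET
from datetime import date
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def incremental_url(set_spec, since):
    """Build a ListRecords URL for changes on or after `since` (a date)."""
    query = {
        "verb": "ListRecords",
        "metadataPrefix": "marc21",
        "set": set_spec,
        "from": since.isoformat(),  # YYYY-MM-DD, matching full-day increments
    }
    return "https://pod.stanford.edu/oai?" + urlencode(query)


def split_changes(response_xml):
    """Return (changed_ids, deleted_ids) for one ListRecords response."""
    changed, deleted = [], []
    root = ET.fromstring(response_xml)
    for record in root.iter(OAI_NS + "record"):
        header = record.find(OAI_NS + "header")
        if header is None:
            continue
        identifier = header.findtext(OAI_NS + "identifier", default="").strip()
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            changed.append(identifier)
    return changed, deleted
```

A consumer would typically record the date on which each harvest began and pass it as `since` on the next run, then remove the deleted identifiers from its local copy and upsert the rest.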
Data providers will from time to time change the stream they use as their default; for example, when they want to start a fresh stream with a new full dump of records. (They should not make that stream default until they have fully populated that stream, however.) Data consumers will most likely want to start harvesting from the set for that new stream, since the old set is unlikely to get further updates. Consumers can use ListSets as above to discover when a provider's default set has changed. They can then do a full harvest from that set to get a full, up-to-date set of records, and also do further incremental updates.
There is no provision for incremental harvests between different streams or sets. Rather, consumers will need to do a full harvest to get a complete set of records from a new default set. (Hopefully, no provider will change its default set too often, though. We recommend doing so no more than once a quarter if possible.)
Due to the characteristics of the OAI-PMH protocol, the OAI-PMH IDs of the records in each OAI-PMH set will be different, even when they were derived from the same underlying record in the data provider's catalog. However, it is possible for a harvester to tell when a record from one provider set comes from the same underlying record in a different set from the same provider. For example, the internal record ID, as recorded in the 001 MARC field, will be the same. The OAI-PMH IDs are also formatted to end with the provider's internal identifier, after a colon.
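Matching records across sets by the trailing identifier might be sketched as follows. The function names and the example identifiers are hypothetical, and the sketch assumes that the provider's internal identifier itself contains no colon, so that the text after the last colon is the whole identifier. Where that assumption might not hold, comparing the 001 MARC field is the safer check.

```python
def provider_record_id(oai_identifier):
    """Return the part of an OAI-PMH identifier after its last colon.

    Assumes the provider's internal identifier contains no colon.
    """
    _, separator, local_id = oai_identifier.rpartition(":")
    if not separator or not local_id:
        raise ValueError(f"no provider identifier found in {oai_identifier!r}")
    return local_id


def same_underlying_record(oai_id_a, oai_id_b):
    """True if two OAI-PMH IDs from one provider end in the same local ID."""
    return provider_record_id(oai_id_a) == provider_record_id(oai_id_b)


# Hypothetical identifiers from an old and a new set of the same provider.
old_id = "oai:example:stream-1:a1234"
new_id = "oai:example:stream-2:a1234"
print(same_underlying_record(old_id, new_id))
```

The comparison is only meaningful for two sets from the same provider, since different providers may reuse the same internal identifiers.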
Verb | URL | Description |
---|---|---|
Identify | https://pod.stanford.edu/oai?verb=Identify | Returns basic information about the server and its capabilities. |
ListMetadataFormats | https://pod.stanford.edu/oai?verb=ListMetadataFormats | Returns information on the metadata formats the server supports. Currently this is just MARC XML (marc21). Dublin Core, though required by the OAI-PMH specification, is not currently supported. |
ListIdentifiers | https://pod.stanford.edu/oai?verb=ListIdentifiers&metadataPrefix=marc21&set=$SET | Not currently supported by the POD Aggregator OAI-PMH service. (If implemented, it would work like ListRecords, but return headers instead of full records.) |
GetRecord | https://pod.stanford.edu/oai?verb=GetRecord&identifier=$ID&metadataPrefix=marc21 | Not currently supported by the POD Aggregator OAI-PMH service. (If implemented, it would return the contents of the record with the identifier $ID, in MARC XML format.) |