Generate static HTML sites, stored using specifications like OCFL and BagIt, from feed formats like OAI-PMH and RSS. Part of a suite of static HTML repository tools.
This project was inspired by Professor Hussein Suleman (University of Cape Town), who gave a rousing closing keynote at Open Repositories 2023 about what really makes an open access repository accessible, and who called for less complexity in digital repository and library development.
Rather than re-invent a whole repository platform, this tool is one step in that direction: using existing protocols that have served us well for years (OAI-PMH and RSS), we can harvest existing repositories on the web, no matter how complex they happen to be, and produce our own simple, file-based repositories using standards and simple document formats like HTML.
If you came here from the OR2024 closing presentation, you might want to check out the 'mets' spider instead of the oai_dc one. It harvests an OAI-PMH METS with MODS feed, downloads linked files, and writes markdown documents with frontmatter instead of using XSLT to try to produce all the HTML itself. This lets us feed static site generators, which will generally do a better job of producing the HTML.
See the slides and pitches from this presentation at [TODO: Put link here once slides read-only]
The rest of the instructions here should still apply, but you should also read through some of the initial settings, such as the base file paths for storage, not just the OCFL paths.
Right now, the tool is only tested on DSpace OAI-PMH feeds using the oai_dc (simple Dublin Core elements) metadata format.
- Make sure you have Python 3 installed. I also recommend pyenv for virtualenv and version management.
- Clone this repository with
git clone https://github.com/kshepherd/feed2html.git
- Optional: Create or activate a virtualenv with the standard tools or pyenv
- Install requirements with
pip install -r requirements.txt
- Identify the start URL for your DSpace ListRecords OAI verb, e.g. https://openaccess.myinstitution.edu/oai/request?verb=ListRecords&metadataPrefix=oai_dc
- Give the output/oaidc2html.xsl stylesheet a quick check to make sure it is transforming the fields you're interested in
- Set up a base directory for your OCFL repository and css files and note the full path
- Copy or symlink output/css to this base directory
- Begin a crawl! Let's go with that example URL and a base dir of /tmp/site
scrapy crawl oaipmh_dc_xml \
-a url="https://openaccess.myinstitution.edu/oai/request?verb=ListRecords&metadataPrefix=oai_dc" \
-a website_title="Test" \
-a website_subtitle="open access research" \
-a path_to_assets="/tmp/site" \
-a path_to_ocfl="/tmp/site/repository" \
-L INFO
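If you want to script the steps above, the start URL and the crawl arguments can be assembled in Python. This is a minimal sketch, not part of the tool: `build_start_url` and `build_crawl_command` are hypothetical helper names, and the only arguments used are the ones shown in the example command. It prints the command rather than running it.

```python
import shlex
from urllib.parse import urlencode


def build_start_url(oai_endpoint, metadata_prefix="oai_dc"):
    """Return a ListRecords URL for an OAI-PMH endpoint (hypothetical helper)."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{oai_endpoint}?{query}"


def build_crawl_command(start_url, site_dir, title, subtitle):
    """Return the argument list for the example crawl shown in this README."""
    return [
        "scrapy", "crawl", "oaipmh_dc_xml",
        "-a", f"url={start_url}",
        "-a", f"website_title={title}",
        "-a", f"website_subtitle={subtitle}",
        "-a", f"path_to_assets={site_dir}",
        # Assumption: the OCFL repository lives in a "repository" subdirectory,
        # as in the example command.
        "-a", f"path_to_ocfl={site_dir}/repository",
        "-L", "INFO",
    ]


if __name__ == "__main__":
    url = build_start_url("https://openaccess.myinstitution.edu/oai/request")
    command = build_crawl_command(url, "/tmp/site", "Test", "open access research")
    # Print a shell-quoted command that can be pasted into a terminal.
    print(shlex.join(command))
```

The list form can also be passed to `subprocess.run` from the project directory if you prefer to launch the crawl from a script.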
To test just the first page of the OAI results, uncomment CLOSESPIDER_ITEMCOUNT in feed2html/spiders/oaipmh_dc_xml.py
- Take a look at the parse_record method in feed2html/spiders/oaipmh_dc_xml.py to see how the simple item objects are constructed. This can be extended the same way as any other Scrapy XML feed spider
- If the spider does not properly follow resumption tokens (to get the next page), run the crawl in debug mode with -L DEBUG and compare the expected XML with the token extraction in parse_node
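When debugging resumption tokens, it can help to check the token logic outside Scrapy. The sketch below is not the spider's parse_node implementation; `next_page_url` is a hypothetical helper and the sample XML is made up. It relies on two OAI-PMH rules: the follow-up request carries only the verb and the resumptionToken, and an empty or missing token means the list is complete.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def next_page_url(oai_endpoint, response_xml):
    """Return the URL of the next ListRecords page, or None on the last page."""
    root = ET.fromstring(response_xml)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    # A missing or empty resumptionToken signals the end of the list.
    if token is None or not (token.text or "").strip():
        return None
    # resumptionToken is an exclusive argument: metadataPrefix is not repeated.
    query = urlencode({"verb": "ListRecords", "resumptionToken": token.text.strip()})
    return f"{oai_endpoint}?{query}"


# Made-up response fragment for illustration only.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken completeListSize="250" cursor="0">oai_dc////100</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

# Made-up final page: the token element is present but empty.
SAMPLE_LAST_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken completeListSize="250" cursor="200"/>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    endpoint = "https://openaccess.myinstitution.edu/oai/request"
    print(next_page_url(endpoint, SAMPLE_PAGE))
    print(next_page_url(endpoint, SAMPLE_LAST_PAGE))
```

If the URL produced this way differs from the one the spider requests in the DEBUG log, the token extraction is the likely culprit.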
Python 3 is a popular, accessible language and is widely used by researchers, librarians and other open access practitioners.
Scrapy is a well-supported, extensible Python framework which can scrape web resources and process the results through pipelines. It allows a lot of customisation while leaving the low-level HTTP and document parsing work to an existing framework which has its own open source community, and it can easily be extended for more advanced solutions.
The Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage of digital information in a structured, transparent, and predictable manner. It is designed to promote long-term object management best practices within digital repositories.
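To make "structured, transparent, and predictable" concrete, here is a rough sketch of what a single-version OCFL object looks like on disk. It is based on my reading of the OCFL 1.1 specification, has not been run through a validator, and is not necessarily how this tool builds its objects. `write_minimal_ocfl_object` is a hypothetical name, and the storage root (with its own declaration file and layout) is left out.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_minimal_ocfl_object(object_root, object_id, filename, data):
    """Write a one-file, one-version OCFL object and return its inventory."""
    root = Path(object_root)
    content_dir = root / "v1" / "content"
    content_dir.mkdir(parents=True, exist_ok=True)
    (content_dir / filename).write_bytes(data)

    # Conformance declaration file that marks the directory as an OCFL object.
    (root / "0=ocfl_object_1.1").write_text("ocfl_object_1.1\n")

    digest = hashlib.sha512(data).hexdigest()
    inventory = {
        "id": object_id,
        "type": "https://ocfl.io/1.1/spec/#inventory",
        "digestAlgorithm": "sha512",
        "head": "v1",
        # manifest: digest -> content paths relative to the object root
        "manifest": {digest: [f"v1/content/{filename}"]},
        "versions": {
            "v1": {
                "created": datetime.now(timezone.utc).isoformat(),
                # state: digest -> logical paths in this version
                "state": {digest: [filename]},
                "message": "Initial import",
            }
        },
    }
    inventory_bytes = json.dumps(inventory, indent=2).encode("utf-8")
    # Sidecar file holding the digest of the inventory itself.
    sidecar = f"{hashlib.sha512(inventory_bytes).hexdigest()} inventory.json\n"
    # The root inventory is a copy of the inventory in the latest version.
    for directory in (root, root / "v1"):
        (directory / "inventory.json").write_bytes(inventory_bytes)
        (directory / "inventory.json.sha512").write_text(sidecar)
    return inventory
```

Because every file is listed by digest in a plain JSON inventory, the object can be inspected and verified with nothing more than a file browser and a checksum tool.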
Extensible Stylesheet Language Transformations (XSLT) is an XML-based language used, in conjunction with specialized processing software, for the transformation of XML documents. It has long been popular in the library sector.
- Turn feeds (OAI, RSS/Atom, ActivityPub) into complete static websites
- "Put DSpace on a CD-ROM"
- Start with the most basic requirements
  - OAI-PMH, Dublin Core elements and terms
  - RSS 2.0 for blogs, podcasts
  - (RDF, jsonld, other formats and protocols come later)
- Spider
  - Build OCFL layer
  - Read XML feed with resumption tokens
- Pipelines:
  - Initialize OCFL repository on disk
  - Create BagIt fs structure (simpler alternative to OCFL)
  - Search OA services (unpaywall etc) for OA links
  - Transform to HTML with XSLT
  - Create OCFL object and version and add to repository
  - Send to search index (solr, ES, zincsearch?)
- Documentation
  - Complete pydoc coverage
  - Installation and usage instructions
  - Complete this README with thorough explanation of the spider and pipelines, and advanced usage instructions
- Release
  - Create requirements.txt and INSTALL.md
  - Create LICENSE.txt for BSD 3-Clause license.
  - Release to PyPI (or figure out the best way to package and distribute releases) once the project is beyond prototype
See NOTES.md for informal notes, ideas, links, references.