Project: eTheses for openVirus
eTheses for openVirus
Because Ph.D theses are an under-utilised body of scholarly writing and research.
- Structurally similar to the DOAJ work, in that the goal is to set up a new data source we can feed into the openVirus tool chain, where the data source is too large to process directly and requires pre-processing in some way.
We have taken the data from the EThOS service and re-used the tools of the UK Web Archive to build a full-text search API that can be used to find relevant theses.
- DRAFT BLOG 1: Searching eTheses for the openVirus project
- DRAFT BLOG 2: Bringing Metadata & Full-text Together
This notebook illustrates how to use the API
- The openVirus tools need to be extended/supplemented to use the API and then download the PDFs of the relevant theses.
  - This would work in the same way as `getpapers`/`quickscrape`/`ami download` (/`ferret`?), e.g. adding an `ethos-api` source to `getpapers` would be one implementation approach.
  - An alternative would be to write a Scrapy crawler that outputs a suitable CProject.
- The whole workflow needs to be verified with a realistic/useful example.
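As a rough illustration of the download step, here is a minimal Python sketch that takes a list of search results and writes one directory per thesis. It is not how `getpapers` or `ami download` work internally. The record keys (`id`, `pdf_url`), the example identifier, and the `fulltext.pdf` file name are assumptions made for this sketch; the exact layout the `ami` tools expect from a CProject should be checked before relying on it.

```python
import re
import urllib.request
from pathlib import Path


def safe_name(identifier):
    """Turn a thesis identifier into a string that is safe as a directory name."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", identifier)


def http_fetch(url):
    """Download a URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def write_cproject(records, project_dir, fetch=http_fetch):
    """Create one sub-directory per thesis and store its PDF inside it.

    `records` is a list of dicts with the (assumed) keys "id" and "pdf_url".
    `fetch` is injectable so the function can be exercised without a network.
    """
    project = Path(project_dir)
    project.mkdir(parents=True, exist_ok=True)
    written = []
    for record in records:
        tree = project / safe_name(record["id"])
        tree.mkdir(exist_ok=True)
        target = tree / "fulltext.pdf"  # assumed file name
        if not target.exists():  # skip theses that were already downloaded
            target.write_bytes(fetch(record["pdf_url"]))
        written.append(target)
    return written
```

Calling `write_cproject(results, "etheses-project")` with results from the search API would then give a directory that can be inspected by hand before wiring it into the rest of the tool chain. Re-running it only fetches PDFs that are not already present.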
???
Step 1 complete, the API works well enough.
Step 2 needs to be implemented, but it's not clear how best to proceed. Andy Jackson is currently working on understanding `ami download`/`getpapers`/etc. well enough to work out what might work best.
One idea I keep coming back to is that the core of the work done by `ami search` is very similar to the core of Apache Solr itself. The upshot of this is that rather than adding this Solr index as a data source, the initial part of the `ami search` process could be done directly in Solr.
(PMR comment). Yes, I am working to replace the engines in `ami search` and `ami words` by Solr.
Specifically, for each query term in each dictionary, we could:
- Search for that term using the Solr API
- Export the full result set, including the text surrounding each hit.
- THEN: Generate the `snippets` XML from that, and pass to the rest of the `ami search` chain.
- OR: Generate the results tables and co-occurrences plots etc. directly from Solr.
(PMR comment). Agreed.
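The first two steps of the per-term loop might look something like the Python sketch below. It builds a Solr select URL with highlighting switched on, then flattens a Solr JSON response into one row per hit with its surrounding text. The endpoint URL, the core name `etheses`, and the field names `text` and `id` are placeholders chosen for this example, not the names used by the real index. The highlighting and cursor parameters are standard Solr request parameters, and cursor paging needs a sort on the unique key, assumed here to be `id`.

```python
from urllib.parse import urlencode

# Hypothetical values: the real endpoint and field names are not given on this page.
SOLR_SELECT_URL = "http://localhost:8983/solr/etheses/select"
TEXT_FIELD = "text"


def build_term_query(term, rows=100, cursor_mark="*"):
    """Return a Solr select URL that searches for one dictionary term as a phrase."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'{TEXT_FIELD}:"{escaped}"',
        "hl": "true",               # ask Solr for highlighted fragments
        "hl.fl": TEXT_FIELD,        # field to take the fragments from
        "hl.snippets": 5,           # maximum fragments per document
        "hl.fragsize": 200,         # approximate fragment length in characters
        "rows": rows,
        "sort": "id asc",           # cursor paging requires a sort on the unique key
        "cursorMark": cursor_mark,  # "*" starts a new cursor
        "wt": "json",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


def extract_snippets(term, response):
    """Flatten a parsed Solr JSON response into one dict per highlighted fragment."""
    highlighting = response.get("highlighting", {})
    hits = []
    for doc in response.get("response", {}).get("docs", []):
        doc_id = doc["id"]
        for snippet in highlighting.get(doc_id, {}).get(TEXT_FIELD, []):
            hits.append({"term": term, "id": doc_id, "snippet": snippet})
    return hits
```

To export the full result set, the request would be repeated with the `nextCursorMark` value from each response until it stops changing. The flat list of hits could then either be serialised into the `snippets` XML or aggregated directly into tables and co-occurrence counts.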