Extract missing pages and chapters from blue book #310
Comments
Missing pages were extracted and output as RDF triples by a Python script, which is kept in a private repo due to license issues: https://github.com/ThomasPause/MissingPageFinder
solve the following problems:
I think the main difference is that the algorithm finds every occurrence of a given word or word group specified by the regex, whilst the human extraction may be a little inconsistent.
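The actual script lives in the private repo, so the following is only a minimal sketch of the idea, assuming the book text is already available as one string per page. The function name `find_pages` and the sample page texts are hypothetical.

```python
import re

def find_pages(pages, pattern):
    """Return the 1-based numbers of all pages whose text matches the regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [number for number, text in enumerate(pages, start=1) if regex.search(text)]

# Hypothetical page texts for illustration, not taken from the book.
pages = [
    "The hospital information system supports patient care.",
    "Strategic information management in hospitals.",
    "This page covers something else entirely.",
]

# Word boundaries decide whether the plural form counts as a hit.
singular_only = find_pages(pages, r"\bhospital\b")
with_plural = find_pages(pages, r"\bhospitals?\b")
```

Because every matching page is reported, a frequent term quickly collects far more pages than a human reader would list.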
This shows that some buzzwords occur very often. The question is whether it is useful to store an array of about 250 page numbers for every page on which the term "hospital" appears in the book.
[Figure: distribution after manual extraction]
[Figure: distribution after automated extraction]
Manual extraction yields at most 2 pages per class, whereas automated extraction produces classes with several hundred occurrences. So the question is whether it makes sense to set a cutoff, and if so, where.
The cutoff is now at 3, which means that classes that appear more than 3 times are not written to the file.
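A cutoff like this could be applied as a simple filter before writing the output. This is a sketch under the assumption that the results are held as a mapping from class name to a list of page numbers; `apply_cutoff` and the sample data are hypothetical.

```python
CUTOFF = 3  # value stated in this thread

def apply_cutoff(pages_by_class, cutoff=CUTOFF):
    """Keep only classes that occur on at most `cutoff` pages."""
    return {
        name: pages
        for name, pages in pages_by_class.items()
        if len(pages) <= cutoff
    }

# Hypothetical results for illustration.
results = {
    "Hospital": list(range(1, 251)),  # far too frequent to be useful
    "ProjectPortfolio": [112, 113],
    "ServiceDesk": [201, 202, 203],
}
kept = apply_cutoff(results)
```

Classes with exactly 3 pages are kept, since only those with more than 3 occurrences are dropped.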
The algorithm now converts text to normalized Unicode. The sorted results are attached here as an .nt file.
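For reference, normalization and sorted N-Triples output could look roughly like the sketch below. The normalization form (NFC), the namespace, and the page property IRI are all assumptions made for illustration, and class names are assumed to be valid IRI local names.

```python
import unicodedata

# Hypothetical IRIs for illustration; the real namespace and property may differ.
CLASS_NS = "http://example.org/bb/"
PAGE_PROPERTY = "http://example.org/meta/page"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

def normalize(text, form="NFC"):
    """Normalize Unicode so that composed and decomposed characters compare equal."""
    return unicodedata.normalize(form, text)

def to_ntriples(pages_by_class):
    """Build one N-Triples line per (class, page) pair, sorted lexicographically."""
    lines = [
        f'<{CLASS_NS}{normalize(name)}> <{PAGE_PROPERTY}> "{page}"^^<{XSD_INTEGER}> .'
        for name, pages in pages_by_class.items()
        for page in pages
    ]
    return sorted(lines)

# "e" followed by a combining acute accent becomes the single character "é".
assert normalize("e\u0301") == "\u00e9"

triples = to_ntriples({"ServiceDesk": [201], "ProjectPortfolio": [113, 112]})
```

Sorting the lines keeps the attached file stable between runs, which makes diffs easier to review.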
Upload done.
See also: #296, #309.
Despite efforts to restore missing chapters, there are still several hundred classes in the blue book without a page, and some of them lack a chapter as well.
All classes with a page now have a chapter because we automatically generated those, see snikproject/graph#214.
However, for classes that don't have a page, and possibly no chapter either, we should extract both from the blue book. Using the digital version and its search function, this should be doable semi-automatically, and it would be a huge help for the chapter search feature used in teaching.