-
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse Policy Doc PDFs #1
Comments
@multimeric I tried to integrate the MTR grammar you wrote for Venser's Journal, but I ran into some issues regarding bullet lists. Take the current MTR as an example.
but instead it's parsed as
which makes it impossible to detect where the last item ends and the following paragraph begins.
Instead, I assume because the points each take multiple lines, the second one is parsed as
which is just obviously wrong. Sometimes a rogue bullet point is inserted on an empty line, either out of nowhere or behind the list item instead of at the beginning of it (this happens for the first item of the second list in section 1.4). Since you wrote the thing (and I don't have much experience with this kind of parsing), I was wondering if you had any insight into how to solve these issues before I go digging too deep into it. |
Cool, thanks for looking into this. I remember there were some issues with the parser, but I never finished them off because the VJ guy wasn't actually hosting it anyway. |
It occurs to me that the MTG Judge Core app is successfully parsing the IPG and MTR. Maybe Andrew Teo would share their data/method for doing that? |
The API currently only creates diffs of the CR, because all policy docs are only available as PDFs. To be able to diff policy docs, we must first transform them into some machine-readable representation. There are a number of available PDF parsers available, all working slightly differently, so some research should be done into which one can work best for this use case.
The text was updated successfully, but these errors were encountered: