Parse Policy Doc PDFs #1

lunakv · 2022-08-11T17:30:09Z

The API currently creates diffs only for the CR, because the policy docs are available only as PDFs. To diff policy docs, we must first transform them into a machine-readable representation. Several PDF parsers are available, all working slightly differently, so some research is needed into which one works best for this use case.

lunakv · 2022-08-11T18:03:37Z

@multimeric I tried to integrate the MTR grammar you wrote for Venser's Journal, but I ran into some issues regarding bullet lists. Take the current MTR as an example.

The newline after the last list item is erroneously removed during the cleanup process. For example, in section 1.3, the list should end with

•  Player 
•  Spectator 
The first four roles above are [...],

but instead it's parsed as

•  Player 
•  Spectator The first four roles above are [...].

which makes it impossible to detect where the last item ends and the following paragraph begins.

Sometimes the list itself is parsed incorrectly, inserting extra bullet points. In section 1.4, the first list should read

[...]
• Individuals currently suspended by the DCI. Individuals currently suspended from the DCI may not act as tournament officials;
• Other individuals specifically prohibited from participation by DCI or Wizards of the Coast policy (such determination is at Wizards of the Coast’s sole discretion);
[...]

Instead, presumably because each point spans multiple lines, the second one is parsed as

• Other individuals specifically prohibited from participation by DCI or Wizards of the Coast policy 
• (such determination is at Wizards of the Coast’s sole discretion);

which is just obviously wrong. Sometimes a rogue bullet point is inserted on an empty line, either out of nowhere or behind the list item instead of at the beginning of it (this happens for the first item of the second list in section 1.4).
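The stray-bullet cases described above could in principle be patched with a post-processing pass over the extracted lines. The following is only a minimal sketch, not the Venser's Journal grammar or its cleanup step: `repair_bullets` and the `ITEM_END` heuristic are hypothetical names introduced here, and it assumes that a bullet whose text starts with `(` or a lowercase letter is really a continuation of the previous item. It does not recover the missing newline after the last list item, and it does not handle a bullet misplaced at the end of a line.

```python
BULLET = "•"
# Assumption: a complete list item ends with one of these characters.
ITEM_END = (".", ";", ":", ",")


def repair_bullets(lines):
    """Merge spurious bullet lines into the previous item and drop empty bullets."""
    repaired = []
    prev_is_bullet = False
    for raw in lines:
        line = raw.strip()
        if not line.startswith(BULLET):
            # Ordinary text (or a blank line) passes through unchanged.
            repaired.append(line)
            prev_is_bullet = False
            continue
        text = line[len(BULLET):].strip()
        if not text:
            # Rogue bullet on an otherwise empty line: drop it.
            continue
        looks_like_continuation = text[0] == "(" or text[0].islower()
        if (
            prev_is_bullet
            and looks_like_continuation
            and not repaired[-1].endswith(ITEM_END)
        ):
            # The previous item was cut mid-sentence, so glue this text back on.
            repaired[-1] = f"{repaired[-1]} {text}"
        else:
            repaired.append(f"{BULLET} {text}")
            prev_is_bullet = True
    return repaired
```

Applied to the section 1.4 example, the two bullet lines would be merged back into a single item. The heuristic is deliberately conservative: it only merges when the previous item lacks terminal punctuation, so a legitimate item that happens to start with a lowercase letter is left alone if the item before it ended cleanly.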

Since you wrote the thing (and I don't have much experience with this kind of parsing), I was wondering if you had any insight into how to solve these issues before I go digging too deep into it.

multimeric · 2022-08-12T01:50:48Z

Cool, thanks for looking into this. I remember there were some issues with the parser, but I never finished them off because the VJ guy wasn't actually hosting it anyway.

KingSupernova31 · 2022-12-25T17:41:21Z

It occurs to me that the MTG Judge Core app is successfully parsing the IPG and MTR. Maybe Andrew Teo would share their data/method for doing that?

lunakv added the enhancement label Aug 11, 2022

lunakv mentioned this issue Mar 5, 2023

Add MTR parsing #37

Merged
