Exception on invalid xml. #145

Open
rillian opened this issue Jun 4, 2019 · 2 comments

Comments

@rillian
Contributor

rillian commented Jun 4, 2019

Some logging output got into my TEI files, and HookTest crashes with an unhandled exception instead of reporting the error:

  File "${HOME}/HookTest/HookTest/capitains_units/cts.py", line 434, in auto_rng
    xml = parse(self.path)
  File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1840, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1866, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1770, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "tests/repo1/data/hafez/divan/hafez.divan.perseus-eng1.xml", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

To reproduce, prepend the string 'Garbage text\n' to a test file such as tests/repo1/data/hafez/divan/hafez.divan.perseus-eng1.xml.
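
The same failure can be sketched without touching the repository. This is a minimal illustration, not HookTest code: it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so it runs without extra dependencies (lxml would raise `XMLSyntaxError` where this raises `ParseError`), and `parse_error_for` and `VALID_XML` are hypothetical names.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Tiny stand-in document; the real test file is much larger.
VALID_XML = "<TEI><text><body><p>Hello</p></body></text></TEI>"


def parse_error_for(content):
    """Write content to a temporary file, parse it, and return the error message (or None)."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".xml", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(content)
        path = handle.name
    try:
        ET.parse(path)
        return None
    except ET.ParseError as error:
        return str(error)
    finally:
        os.remove(path)


clean_result = parse_error_for(VALID_XML)
# Prepending plain text before the root element makes the file ill-formed.
garbage_result = parse_error_for("Garbage text\n" + VALID_XML)
print(clean_result)
print(garbage_result)
```

The clean document yields no error, while the prefixed one fails on the first line, which matches the position reported in the traceback above.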

The XMLSyntaxError is hidden because the checks run through an imap_unordered call on the threadpool: it surfaces instead as a MaybeEncodingError, since lxml.etree's _ListErrorLog cannot be pickled. Replacing the parallel iterator with a serial one reveals the underlying error.
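
One possible way around the pickling problem, sketched under the assumption that each file is checked by a single worker function: catch the parse error inside the worker and return plain strings, which always pickle. `check_file` is a hypothetical name, not part of HookTest, and the standard library parser again stands in for lxml (with lxml one would catch `lxml.etree.XMLSyntaxError`).

```python
import os
import pickle
import tempfile
import xml.etree.ElementTree as ET


def check_file(path):
    """Parse one file; return (path, error_message), with None meaning success."""
    try:
        ET.parse(path)
    except (ET.ParseError, OSError) as error:
        # Keep only text: no parser-internal objects travel with the result.
        return (path, f"{type(error).__name__}: {error}")
    return (path, None)


with tempfile.TemporaryDirectory() as folder:
    samples = {"good.xml": "<TEI/>", "bad.xml": "Garbage text\n<TEI/>"}
    paths = []
    for name, content in samples.items():
        path = os.path.join(folder, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(content)
        paths.append(path)
    # Serial loop here; the same function could be handed to a pool's imap_unordered.
    results = [check_file(path) for path in paths]

# Round-trip through pickle, as a process pool does when returning results.
restored = pickle.loads(pickle.dumps(results))
errors = {os.path.basename(path): message for path, message in restored}
print(errors)
```

Because the worker never raises, the pool has nothing unpicklable to send back, and the caller can report every bad file rather than stopping at the first one.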

@rillian
Contributor Author

rillian commented Jun 4, 2019

The problem occurs with any XML parsing failure, e.g. the unrecognized &sect; entity on line 776 of tlg0004.tlg001.perseus-eng1.xml from canonical-greekLit.
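
A small sketch of the entity case, again using the standard library parser as a stand-in for lxml. `&sect;` is an HTML entity and not one of XML's five predefined ones, so a document that uses it without declaring it is not well-formed. The snippet below is made up for illustration, and the numeric reference is only one possible repair; the real file may call for a different one.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment, not taken from the real file.
snippet = "<p>&sect; 12</p>"

try:
    ET.fromstring(snippet)
    entity_error = None
except ET.ParseError as error:
    entity_error = str(error)

# A numeric character reference (U+00A7 is the section sign) needs no declaration.
fixed = ET.fromstring("<p>&#167; 12</p>")
print(entity_error)
print(fixed.text)
```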

rillian changed the title from "Exception on garbage at the start of an xml file." to "Exception on invalid xml." on Jun 4, 2019

@PonteIneptique
Member

Yes, this seems like something that needs work. The XML parsing vs. Capitains parsing distinction has been in the codebase for a long time. Feel free to propose a fix, including one that introduces a new exception :)
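
As a rough illustration of what such an exception could look like (a sketch only: `InvalidXMLError` and `parse_or_raise` are hypothetical names, not existing HookTest API, and the standard library parser stands in for lxml), the main design point is to carry the error as plain strings and to pass every constructor argument to `Exception`, so the exception can be rebuilt from its `args` when it crosses a process boundary.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


class InvalidXMLError(Exception):
    """Hypothetical exception for files that are not well-formed XML."""

    def __init__(self, path, message):
        # Pass every field to Exception so self.args mirrors the constructor
        # signature; by default exceptions are rebuilt by calling cls(*args).
        super().__init__(path, message)
        self.path = path
        self.message = message

    def __str__(self):
        return f"{self.path}: {self.message}"


def parse_or_raise(path):
    """Parse a file, translating parser errors into InvalidXMLError."""
    try:
        return ET.parse(path)
    except ET.ParseError as error:
        # Store the message as a plain string rather than the parser's own objects.
        raise InvalidXMLError(path, str(error)) from error


with tempfile.TemporaryDirectory() as folder:
    bad_path = os.path.join(folder, "bad.xml")
    with open(bad_path, "w", encoding="utf-8") as handle:
        handle.write("Garbage text\n<TEI/>")
    try:
        parse_or_raise(bad_path)
        caught = None
    except InvalidXMLError as error:
        caught = error

print(caught)
```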
