XML character references are not unescaped/escaped #17

Open
stemann opened this issue Aug 29, 2023 · 4 comments

Comments

@stemann
Contributor

stemann commented Aug 29, 2023

XML character entity references, e.g. &Aring; ("Å"), and XML numeric character references, e.g. &#xC5; ("Å"), are not unescaped/escaped by the XML.unescape and XML.escape methods.

@stemann
Contributor Author

stemann commented Aug 29, 2023

Something like the following may help (for unescaping hexadecimal numeric character references):

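function unescape_unicode(s::AbstractString)
    # Replace each hexadecimal numeric character reference, e.g. &#xC5;, with its character.
    # Only hex digits are matched (not \w), so unescape_string never sees an invalid escape.
    # chop strips the leading "&#x" and the trailing ";" from the matched text.
    hex_to_char(m) = unescape_string("\\u" * chop(m; head=3, tail=1))
    return replace(s, r"&#x[0-9A-Fa-f]{1,4};" => hex_to_char)
end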
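
Since numeric character references come in both decimal (&#197;) and hexadecimal (&#xC5;) form, a slightly broader variant might look like the sketch below. The names `numeric_ref_to_char` and `unescape_numeric` are hypothetical and not part of XML.jl; references that do not denote a Unicode scalar value are left untouched, which is an assumption about the desired behavior rather than anything the package specifies.

```julia
# Hypothetical helper: convert one matched reference ("&#197;" or "&#xC5;") to its character.
function numeric_ref_to_char(m::AbstractString)
    body = chop(m; head=2, tail=1)  # drop "&#" and ";", leaving "197" or "xC5"
    code = if startswith(body, "x")
        tryparse(UInt32, chop(body; head=1, tail=0); base=16)
    else
        tryparse(UInt32, body; base=10)
    end
    # Leave the reference as-is if it overflows or is not a Unicode scalar value
    # (surrogates 0xD800-0xDFFF and anything above 0x10FFFF are rejected).
    code === nothing && return m
    (code <= 0xd7ff || 0xe000 <= code <= 0x10ffff) || return m
    return string(Char(code))
end

# Unescape decimal and hexadecimal numeric character references in one pass.
function unescape_numeric(s::AbstractString)
    return replace(s, r"&#(x[0-9A-Fa-f]+|[0-9]+);" => numeric_ref_to_char)
end
```

For example, `unescape_numeric("&#197;rhus")` and `unescape_numeric("&#xC5;rhus")` both give "Århus". The sketch does not enforce XML's further restrictions on control characters.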

@joshday
Member

joshday commented Aug 29, 2023

Hmm, these entities need to be defined in the DTD, correct? I think we'd need (un)escape methods that take in an XML.DTDBody as well as the string.
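
To make the idea concrete, here is a rough sketch of unescaping against declared entities. It works on raw DTD text rather than on an XML.DTDBody, because nothing here assumes how that type is laid out; `entity_table` and `unescape_entities` are hypothetical names, not existing XML.jl functions. Only simple internal declarations of the form `<!ENTITY name "value">` are recognized.

```julia
# Hypothetical helper: collect internal entity declarations from raw DTD text.
# Parameter entities (<!ENTITY % ...>) and external entities (SYSTEM/PUBLIC)
# do not match the pattern and are therefore ignored.
function entity_table(dtd::AbstractString)
    table = Dict{String,String}()
    for m in eachmatch(r"<!ENTITY\s+([A-Za-z_:][\w.:-]*)\s+\"([^\"]*)\"\s*>", dtd)
        table[String(m[1])] = String(m[2])
    end
    return table
end

# Replace each &name; with its declared value; undeclared names are left as-is.
function unescape_entities(s::AbstractString, table::AbstractDict)
    lookup(m) = get(table, String(chop(m; head=1, tail=1)), m)
    return replace(s, r"&[A-Za-z_:][\w.:-]*;" => lookup)
end
```

A possible use would be `unescape_entities("&Aring;rhus", entity_table(dtd_text))`. Note that a declared value may itself contain a reference (the XHTML entity sets declare Aring as a numeric reference), so a real implementation would need a further expansion step.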

@stemann
Contributor Author

stemann commented Aug 30, 2023

Ah - yes - that's right - my ancient memory of XML, and in particular HTML, led me to believe that they were built-in also in XML, but I see now that XML only defines five entities - and all of the HTML-like entities are mostly/solely defined for HTML: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Standard_public_entity_sets_for_characters
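
For reference, the five entities that XML itself predefines are lt, gt, amp, apos and quot. A minimal sketch of unescaping just those in a single pass is shown below; `PREDEFINED_ENTITIES` and `unescape_predefined` are illustrative names and not the package's actual implementation.

```julia
# The five entities predefined by XML.
const PREDEFINED_ENTITIES = Dict(
    "lt" => "<", "gt" => ">", "amp" => "&", "apos" => "'", "quot" => "\"",
)

# Unescape only the predefined entities. A single pass means that "&amp;lt;"
# becomes "&lt;" and is not decoded a second time into "<".
function unescape_predefined(s::AbstractString)
    lookup(m) = PREDEFINED_ENTITIES[String(chop(m; head=1, tail=1))]
    return replace(s, r"&(lt|gt|amp|apos|quot);" => lookup)
end
```

Anything else, such as &Aring;, passes through unchanged, which matches the point that such names only exist once a DTD declares them.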

Perhaps one could just have some html-convenience escape methods...
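
One shape such a convenience method could take, on the escaping side, is to emit hexadecimal numeric character references for everything outside ASCII, since those need no DTD at all. This is only a sketch: `escape_non_ascii` is a hypothetical name, and it assumes the input is valid UTF-8.

```julia
# Hypothetical helper: escape the XML special characters and write every
# non-ASCII character as a hexadecimal numeric character reference.
function escape_non_ascii(s::AbstractString)
    io = IOBuffer()
    for c in s
        if c == '&'
            print(io, "&amp;")
        elseif c == '<'
            print(io, "&lt;")
        elseif c == '>'
            print(io, "&gt;")
        elseif c == '"'
            print(io, "&quot;")
        elseif c == '\''
            print(io, "&apos;")
        elseif isascii(c)
            print(io, c)
        else
            # UInt32(c) is the code point; assumes c is a well-formed character.
            print(io, "&#x", uppercase(string(UInt32(c); base=16)), ';')
        end
    end
    return String(take!(io))
end
```

With this, `escape_non_ascii("Århus")` yields "&#xC5;rhus", which any XML parser can read back without entity declarations.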

@stemann
Contributor Author

stemann commented Aug 30, 2023

Or perhaps provide something convenient for getting common DTDs like http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd
