Mojibake for some Idiomdrottning posts #5

Open
wasamasa opened this issue Jul 13, 2022 · 4 comments
Comments

@wasamasa
Contributor

wasamasa commented Jul 13, 2022

The Atom feed contains a lot of posts from https://idiomdrottning.org/blog/programs, some of which are rendered incorrectly. My observations so far:

  • Only non-ASCII characters, such as typographic quotes and emoji, are affected
  • Not every post using these characters is affected. For example, the "zshbrev" post renders fine, but "Romancing Sisyphus’ Stone" doesn't.
  • The source posts on the original blog look fine, so it's unclear at which step the conversion issues happen and why
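For readers unfamiliar with the symptom, here is a minimal Python 3 sketch of the classic mojibake pattern, where UTF-8 bytes get decoded with a single-byte codec. The choice of cp1252 as the wrong codec is an assumption for illustration; nothing in this issue establishes which step or codec is actually involved.

```python
# Sketch only: shows how a typographic quote can turn into mojibake.
# Assumption: the wrong codec is cp1252; the real pipeline may differ.

title = "Romancing Sisyphus\u2019 Stone"  # U+2019 RIGHT SINGLE QUOTATION MARK

utf8_bytes = title.encode("utf-8")     # the quote becomes b"\xe2\x80\x99"
garbled = utf8_bytes.decode("cp1252")  # each byte is read as one character

# Pure-ASCII text survives the same round trip unchanged.
ascii_title = "plain ASCII title"
ascii_roundtrip = ascii_title.encode("utf-8").decode("cp1252")

# If the damage really is a wrong decode, reversing it recovers the title.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired == title)
```

If the broken titles in the feed show the same three-character sequence in place of the quote, that would point at a decoding step somewhere between fetching and serializing.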

I've tried to reproduce the issue by setting up planet/venus with the configuration files of this repository, but without success: the generated Atom feed isn't affected, for some reason. I'd much appreciate help pinpointing which configuration leads to this.

FROM debian:10

RUN apt-get -y update
RUN apt-get -y install python2.7 curl xsltproc

RUN cd /root && curl -LO https://intertwingly.net/code/venus.tgz && tar -xf venus.tgz
RUN cd /root/venus && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/planet.ini
RUN mkdir /root/venus/templates
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/atom.xml.xslt
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/index.html.tmpl
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/rss20.xml.tmpl

RUN cd /root/venus && python2.7 planet.py planet.ini
CMD ["/usr/bin/env", "-C", "/root/venus", "python2.7", "-m", "SimpleHTTPServer"]

Repro:

vim Dockerfile
docker build -t planetscheme:debian10 .
docker run -p 8000:8000 -it planetscheme:debian10
# Run in another terminal
xdg-open http://localhost:8000/output/atom.xml
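Instead of eyeballing the feed in a browser, the generated atom.xml could be scanned for typical mojibake sequences. The following Python 3 sketch is a heuristic, not part of planet/venus: the function name `suspicious_entries` and the marker list are hypothetical, and the markers only cover typographic quotes and emoji mis-decoded as cp1252.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Sequences that usually result from UTF-8 being decoded as cp1252.
# Heuristic only: legitimate text could contain them as well.
SUSPICIOUS = (
    "\u00e2\u20ac",  # start of a mis-decoded typographic quote or dash
    "\u00f0\u0178",  # start of a mis-decoded emoji
)


def suspicious_entries(atom_xml):
    """Return (title, field) pairs for entries that look garbled.

    atom_xml is the feed as bytes, e.g. the body of an HTTP response.
    """
    root = ET.fromstring(atom_xml)
    hits = []
    for entry in root.iter(ATOM_NS + "entry"):
        title = entry.findtext(ATOM_NS + "title", default="")
        for field in ("title", "summary", "content"):
            node = entry.find(ATOM_NS + field)
            if node is None:
                continue
            text = "".join(node.itertext())
            if any(marker in text for marker in SUSPICIOUS):
                hits.append((title, field))
    return hits


# Tiny inline feed: one clean entry, one with a garbled title.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>zshbrev</title><content type="html">fine \u2019 here</content></entry>
  <entry><title>Romancing Sisyphus\u00e2\u20ac\u2122 Stone</title><content type="html">text</content></entry>
</feed>"""

print(suspicious_entries(SAMPLE.encode("utf-8")))
```

Against the container above, the bytes could come from `urllib.request.urlopen("http://localhost:8000/output/atom.xml").read()`. Running the same check on both the vendored and the Debian build would make the comparison repeatable.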
@wasamasa
Contributor Author

I discovered in #4 (comment) that the Debian package is used. After adjusting the Dockerfile accordingly, I can reproduce the mojibake:

FROM debian:10

RUN apt-get -y update
RUN apt-get -y install curl planet-venus

RUN mkdir /root/venus
RUN cd /root/venus && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/planet.ini
RUN mkdir /root/venus/templates
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/atom.xml.xslt
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/index.html.tmpl
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/rss20.xml.tmpl

RUN cd /root/venus && planet planet.ini
CMD ["/usr/bin/env", "-C", "/root/venus", "python2.7", "-m", "SimpleHTTPServer"]

I'll experiment some more and hope I'll figure this one out.

@lassik
Contributor

lassik commented Jul 13, 2022

Wow, you've put in a lot of effort reproducing the setup. If you want to look at it directly on the server, I'm happy to make you an account.

It's still using Python 2?

@lassik
Contributor

lassik commented Jul 13, 2022

Proposals for how to best rewrite the planet in Scheme/Racket are welcome as well. If we solve this problem, we'll soon hit another, then another...

@wasamasa
Contributor Author

I've reproduced the Debian build using a Dockerfile, including their patches. Without their patches, the issue disappears. I've bisected it to the patches that remove the vendored dependencies and discovered that the issue is with html5lib. Normally I'd suggest filing a Debian bug, but the package has been removed from Debian for being essentially unmaintained, so we're out of luck here.
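Since the behavior depends on which html5lib copy gets imported, it may help to confirm that directly. The sketch below is a hypothetical diagnostic, not part of venus: `module_origin` and `is_vendored` are made-up helper names, and it assumes the check runs with the same interpreter and `sys.path` that planet.py ends up using. It only relies on `importlib.import_module`, so it should work on Python 2.7 as well as Python 3.

```python
import importlib


def module_origin(name):
    """Return (file path, version) of the module the interpreter imports.

    Either value is None if the module does not expose it.
    """
    module = importlib.import_module(name)
    return (getattr(module, "__file__", None),
            getattr(module, "__version__", None))


def is_vendored(path, marker="planet/vendor"):
    """Guess whether a module path lies inside venus' vendor directory."""
    return path is not None and marker in path.replace("\\", "/")


# Demonstrated with a stdlib module so the sketch runs anywhere;
# in the container one would pass "html5lib" instead.
origin, version = module_origin("json")
print(origin, version, is_vendored(origin))
```

In the unpatched checkout the reported path would be expected to fall under planet/vendor, while in the Debian build it should point at the system package; the exact system path is not stated in this issue.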

I'm sympathetic to rewriting Planet, but then we'd essentially become its maintainers. That isn't too bad since we only use a subset, but it's still a considerable burden. Other replacement options:

  • Unpatched Git release (master branch): This would require monitoring whether it behaves the same otherwise. I can understand why Debian added patches, but most seem unrelated and the CVE ones do not really apply.
  • Unpatched Git release (dev branch): Lots of changes in here. Might remove the need for Debian's patches completely. We're talking about 12 years out of date vs 5 years out of date. It's marked as experimental though.
  • Use a Ruby alternative. Not sure whether it's a full replacement and whether Ruby's approach to encodings is better, but maybe it delivers.

Minimal broken Dockerfile:

FROM debian:10

RUN apt-get -y update
RUN apt-get -y install python git curl python-chardet python-feedparser python-html5lib python-htmltmpl python-httplib2 python-librdf python-portalocker python-utidylib python-libxml2 python-libxslt1

RUN cd /root && git clone https://github.com/rubys/venus.git
RUN cd /root && curl -LO http://deb.debian.org/debian/pool/main/p/planet-venus/planet-venus_0~git9de2109-4.2.debian.tar.xz
RUN mkdir /root/planet-debian && tar -C /root/planet-debian -xf /root/planet-venus*.tar.xz
RUN cd /root/venus && patch -p1 < /root/planet-debian/debian/patches/html5lib-no_XHTMLSerializer.patch
RUN rm /root/venus/planet/vendor/html5lib/__init__.py && \
    rm /root/venus/planet/vendor/html5lib/constants.py && \
    rm /root/venus/planet/vendor/html5lib/filters/__init__.py && \
    rm /root/venus/planet/vendor/html5lib/filters/_base.py && \
    rm /root/venus/planet/vendor/html5lib/filters/formfiller.py && \
    rm /root/venus/planet/vendor/html5lib/filters/inject_meta_charset.py && \
    rm /root/venus/planet/vendor/html5lib/filters/lint.py && \
    rm /root/venus/planet/vendor/html5lib/filters/optionaltags.py && \
    rm /root/venus/planet/vendor/html5lib/filters/sanitizer.py && \
    rm /root/venus/planet/vendor/html5lib/filters/whitespace.py && \
    rm /root/venus/planet/vendor/html5lib/html5parser.py && \
    rm /root/venus/planet/vendor/html5lib/ihatexml.py && \
    rm /root/venus/planet/vendor/html5lib/inputstream.py && \
    rm /root/venus/planet/vendor/html5lib/sanitizer.py && \
    rm /root/venus/planet/vendor/html5lib/serializer/__init__.py && \
    rm /root/venus/planet/vendor/html5lib/serializer/htmlserializer.py && \
    rm /root/venus/planet/vendor/html5lib/serializer/xhtmlserializer.py && \
    rm /root/venus/planet/vendor/html5lib/tokenizer.py && \
    rm /root/venus/planet/vendor/html5lib/treebuilders/__init__.py && \
    rm /root/venus/planet/vendor/html5lib/treebuilders/_base.py && \
    rm /root/venus/planet/vendor/html5lib/treebuilders/dom.py && \
    rm /root/venus/planet/vendor/html5lib/treebuilders/etree.py && \
    rm /root/venus/planet/vendor/html5lib/treebuilders/etree_lxml.py && \
    rm /root/venus/planet/vendor/html5lib/treebuilders/simpletree.py && \
    rm /root/venus/planet/vendor/html5lib/treebuilders/soup.py && \
    rm /root/venus/planet/vendor/html5lib/treewalkers/__init__.py && \
    rm /root/venus/planet/vendor/html5lib/treewalkers/_base.py && \
    rm /root/venus/planet/vendor/html5lib/treewalkers/dom.py && \
    rm /root/venus/planet/vendor/html5lib/treewalkers/etree.py && \
    rm /root/venus/planet/vendor/html5lib/treewalkers/genshistream.py && \
    rm /root/venus/planet/vendor/html5lib/treewalkers/lxmletree.py && \
    rm /root/venus/planet/vendor/html5lib/treewalkers/pulldom.py && \
    rm /root/venus/planet/vendor/html5lib/treewalkers/simpletree.py && \
    rm /root/venus/planet/vendor/html5lib/treewalkers/soup.py && \
    rm /root/venus/planet/vendor/html5lib/utils.py

RUN cd /root/venus && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/planet.ini
RUN sed -i -e '30,200d' /root/venus/planet.ini
RUN mkdir /root/venus/templates
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/atom.xml.xslt
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/index.html.tmpl
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/rss20.xml.tmpl

RUN cd /root/venus && python2.7 planet.py planet.ini
CMD ["/usr/bin/env", "-C", "/root/venus", "python2.7", "-m", "SimpleHTTPServer"]
