Mojibake for some Idiomdrottning posts #5
Comments
I discovered in #4 (comment) that the Debian package is used. After adjusting the Dockerfile accordingly, I can reproduce the mojibake:
FROM debian:10
RUN apt-get -y update
RUN apt-get -y install curl planet-venus
RUN mkdir /root/venus
RUN cd /root/venus && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/planet.ini
RUN mkdir /root/venus/templates
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/atom.xml.xslt
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/index.html.tmpl
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/rss20.xml.tmpl
RUN cd /root/venus && planet planet.ini
CMD ["/usr/bin/env", "-C", "/root/venus", "python2.7", "-m", "SimpleHTTPServer"]
I'll experiment some more and hope to figure this one out. |
Wow, you've put in a lot of effort reproducing the setup. If you want to look at it directly on the server, I'm happy to make you an account. It's still using Python 2? |
Proposals for how best to rewrite the planet in Scheme/Racket are welcome as well. If we solve this problem, we'll soon hit another, then another. |
I've reproduced the Debian build using a Dockerfile, including their patches. Without their patches, the issue disappears. I've bisected it to the patches that remove the vendored dependencies and found that the issue is with html5lib: the build breaks once the vendored copy is replaced by Debian's python-html5lib package. Normally I'd suggest filing a Debian bug, but the package has been removed from Debian for being essentially unmaintained, so we're out of luck here. I'm sympathetic to rewriting Planet, but then we'd essentially become its maintainers. That isn't too bad, since we only use a subset, but it's still a considerable burden. Other replacement options:
Minimal broken Dockerfile:
FROM debian:10
RUN apt-get -y update
RUN apt-get -y install python git curl python-chardet python-feedparser python-html5lib python-htmltmpl python-httplib2 python-librdf python-portalocker python-utidylib python-libxml2 python-libxslt1
RUN cd /root && git clone https://github.com/rubys/venus.git
RUN cd /root && curl -LO http://deb.debian.org/debian/pool/main/p/planet-venus/planet-venus_0~git9de2109-4.2.debian.tar.xz
RUN mkdir /root/planet-debian && tar -C /root/planet-debian -xf /root/planet-venus*.tar.xz
RUN cd /root/venus && patch -p1 < /root/planet-debian/debian/patches/html5lib-no_XHTMLSerializer.patch
RUN rm /root/venus/planet/vendor/html5lib/__init__.py && \
rm /root/venus/planet/vendor/html5lib/constants.py && \
rm /root/venus/planet/vendor/html5lib/filters/__init__.py && \
rm /root/venus/planet/vendor/html5lib/filters/_base.py && \
rm /root/venus/planet/vendor/html5lib/filters/formfiller.py && \
rm /root/venus/planet/vendor/html5lib/filters/inject_meta_charset.py && \
rm /root/venus/planet/vendor/html5lib/filters/lint.py && \
rm /root/venus/planet/vendor/html5lib/filters/optionaltags.py && \
rm /root/venus/planet/vendor/html5lib/filters/sanitizer.py && \
rm /root/venus/planet/vendor/html5lib/filters/whitespace.py && \
rm /root/venus/planet/vendor/html5lib/html5parser.py && \
rm /root/venus/planet/vendor/html5lib/ihatexml.py && \
rm /root/venus/planet/vendor/html5lib/inputstream.py && \
rm /root/venus/planet/vendor/html5lib/sanitizer.py && \
rm /root/venus/planet/vendor/html5lib/serializer/__init__.py && \
rm /root/venus/planet/vendor/html5lib/serializer/htmlserializer.py && \
rm /root/venus/planet/vendor/html5lib/serializer/xhtmlserializer.py && \
rm /root/venus/planet/vendor/html5lib/tokenizer.py && \
rm /root/venus/planet/vendor/html5lib/treebuilders/__init__.py && \
rm /root/venus/planet/vendor/html5lib/treebuilders/_base.py && \
rm /root/venus/planet/vendor/html5lib/treebuilders/dom.py && \
rm /root/venus/planet/vendor/html5lib/treebuilders/etree.py && \
rm /root/venus/planet/vendor/html5lib/treebuilders/etree_lxml.py && \
rm /root/venus/planet/vendor/html5lib/treebuilders/simpletree.py && \
rm /root/venus/planet/vendor/html5lib/treebuilders/soup.py && \
rm /root/venus/planet/vendor/html5lib/treewalkers/__init__.py && \
rm /root/venus/planet/vendor/html5lib/treewalkers/_base.py && \
rm /root/venus/planet/vendor/html5lib/treewalkers/dom.py && \
rm /root/venus/planet/vendor/html5lib/treewalkers/etree.py && \
rm /root/venus/planet/vendor/html5lib/treewalkers/genshistream.py && \
rm /root/venus/planet/vendor/html5lib/treewalkers/lxmletree.py && \
rm /root/venus/planet/vendor/html5lib/treewalkers/pulldom.py && \
rm /root/venus/planet/vendor/html5lib/treewalkers/simpletree.py && \
rm /root/venus/planet/vendor/html5lib/treewalkers/soup.py && \
rm /root/venus/planet/vendor/html5lib/utils.py
RUN cd /root/venus && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/planet.ini
RUN sed -i -e '30,200d' /root/venus/planet.ini
RUN mkdir /root/venus/templates
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/atom.xml.xslt
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/index.html.tmpl
RUN cd /root/venus/templates && curl -LO https://raw.githubusercontent.com/schemeorg/planet.scheme.org/master/planet/templates/rss20.xml.tmpl
RUN cd /root/venus && python2.7 planet.py planet.ini
CMD ["/usr/bin/env", "-C", "/root/venus", "python2.7", "-m", "SimpleHTTPServer"] |
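A quick way to tell whether a given build is affected is to scan the generated feed for character sequences typical of mojibake. The following Python 3 sketch is only a heuristic and not part of Venus; the function name `find_mojibake`, the pattern, and the feed path passed on the command line are assumptions for illustration.

```python
import re
import sys

# Heuristic (assumption): UTF-8 text misread as Latin-1/cp1252 tends to show
# "Â" or "Ã" followed by a character in U+0080..U+00BF, or "â" followed by "€".
SUSPECT = re.compile("[\u00c2\u00c3][\u0080-\u00bf]|\u00e2\u20ac")


def find_mojibake(text):
    """Return (line_number, line) pairs that contain a suspicious sequence."""
    # split("\n") is used instead of splitlines() because splitlines() also
    # breaks on U+0085, which can itself be part of a garbled sequence.
    return [
        (number, line)
        for number, line in enumerate(text.split("\n"), 1)
        if SUSPECT.search(line)
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path of the generated feed, e.g. the atom.xml your planet.ini writes.
    with open(sys.argv[1], encoding="utf-8") as feed:
        for number, line in find_mojibake(feed.read()):
            print(f"{number}: {line.strip()[:80]}")
```

Running it against the feed produced with and without a given patch gives a rough yes/no signal for each bisection step. Legitimate text can occasionally match the pattern, so hits are worth checking by eye.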
The Atom feed contains a lot of posts from https://idiomdrottning.org/blog/programs, some of which are rendered incorrectly. My observations so far:
I've tried to reproduce the issue by setting up planet/venus with the configuration files of this repository, but the generated Atom feed isn't affected, and I don't know why. I'd much appreciate help pinpointing which part of the server's setup leads to this.
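For readers unfamiliar with the symptom, here is a minimal Python 3 sketch of how mojibake typically arises: UTF-8 bytes get decoded with the wrong codec. It is an illustration only; the thread does not establish that this exact mix-up is what happens in the affected build, and the helper names `garble` and `repair` and the sample string are made up for the example.

```python
def garble(text, wrong_encoding="latin-1"):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled, wrong_encoding="latin-1"):
    """Undo the mistake; only works if the wrong codec is known and lossless."""
    return garbled.encode(wrong_encoding).decode("utf-8")


original = "na\u00efve caf\u00e9"  # "naïve café"
broken = garble(original)

print(broken)                      # each accented letter becomes two characters
print(repair(broken) == original)  # the round trip restores the text
```

If the garbled posts in the feed can be restored this way, that would point at a decoding step using the wrong charset rather than at corrupted source data.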
Repro: