Support client-side rendered content #20

deklanw · 2021-09-03T14:12:41Z

Many sites aren't rendered server-side, so they are unusable with consume_web. For example, all the articles on Khan Academy: https://www.khanacademy.org/humanities/world-history/medieval-times/cross-cultural-diffusion-of-knowledge/a/the-golden-age-of-islam

Integration with Selenium, Splash, etc. would be one way to fix this.

thiswillbeyourgithub · 2021-09-03T23:24:55Z

Hi! Thanks for your interest in autocards.

I've contributed quite a lot to autocards through PRs (see, for example, the pending PR), but sadly I'm terrible at web design, so I will very probably not do this myself.

If you provide a clean way to simply get text data from a URL, I can integrate it into the codebase very quickly, if you want.

Have a nice day!

deklanw · 2021-09-06T15:26:34Z

I looked into it and it seems that basically every solution either requires 1) integration with a web browser or 2) using a paid service (which probably uses 1 under the hood).

Here's one working example:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import FirefoxOptions

# need Firefox installed, and the corresponding Firefox driver
# see https://selenium-python.readthedocs.io/installation.html#drivers
opts = FirefoxOptions()

# I'm using WSL, so I need this option
opts.add_argument("--headless")

url = "https://www.khanacademy.org/humanities/world-history/medieval-times/cross-cultural-diffusion-of-knowledge/a/the-golden-age-of-islam"

driver = webdriver.Firefox(options=opts)
driver.get(url)

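# name the parser explicitly; bs4 warns when it has to guess one
soup = BeautifulSoup(driver.page_source, "html.parser")

# quit() ends the whole browser session; close() only closes the current window
driver.quit()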

Unfortunately, it requires having Firefox installed and the corresponding web driver on your PATH. There is also requests-html, which is supposed to be a drop-in replacement for requests. It supports 'rendering' the JS in the page, but it also seems to work by just downloading a Chromium instance the first time you call it. And I'm getting an error with it anyway (maybe WSL related).

This is to say that all of these methods are brittle, and trying to support them in the library itself would be a pain. But including instructions on how to do it somewhere might be useful.
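
As a starting point for such instructions, here is a minimal sketch of the Selenium approach wrapped up as "get text data from a URL". The names `html_to_text` and `fetch_rendered_text` are hypothetical and not part of autocards. The text extraction uses only the standard library so that part runs without extra installs (BeautifulSoup's `get_text()` would work just as well). The rendering part still assumes Firefox and its driver are installed, and pages that load content late may need an explicit wait before reading `page_source`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_rendered_text(url):
    """Load url in headless Firefox and return the text of the rendered page.

    Hypothetical helper: assumes Firefox and its driver are available.
    """
    # Imported lazily so html_to_text() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver import FirefoxOptions

    opts = FirefoxOptions()
    opts.add_argument("--headless")
    driver = webdriver.Firefox(options=opts)
    try:
        driver.get(url)
        return html_to_text(driver.page_source)
    finally:
        # Always shut the browser down, even if loading the page fails.
        driver.quit()
```

With this, `fetch_rendered_text(url)` returns a plain string that could be passed on as text, and the `try`/`finally` keeps stray Firefox processes from piling up when a page fails to load.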

thiswillbeyourgithub · 2021-09-06T16:00:50Z

Yes, that's my conclusion as well. I think dynamic websites can be exported to PDF or just copied and pasted into autocards, so that's "fine" :/

Thanks for looking into this!
