This repository has been archived by the owner on May 5, 2023. It is now read-only.

Support client-side rendered content #20

Open
deklanw opened this issue Sep 3, 2021 · 3 comments

@deklanw

deklanw commented Sep 3, 2021

Many sites aren't rendered server-side and so are unusable with consume_web. For example, all the articles on Khan Academy: https://www.khanacademy.org/humanities/world-history/medieval-times/cross-cultural-diffusion-of-knowledge/a/the-golden-age-of-islam

Integration with Selenium, Splash, etc. would be one way to fix this.
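
To illustrate the problem, here is a minimal sketch using only the standard library. The two HTML strings are made-up examples, not the real markup of Khan Academy or any other site, and `visible_text` is a hypothetical helper, not part of autocards. It shows why a plain HTTP fetch of a client-side rendered page tends to yield no usable text: the response is mostly an empty mount point plus scripts.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a naive scraper would see in this HTML string."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up examples (assumption): a server-rendered page and a JS app shell.
SERVER_RENDERED = (
    "<html><body><article><h1>Title</h1><p>Body text.</p></article></body></html>"
)
CLIENT_RENDERED = (
    "<html><body><div id='root'></div>"
    "<script>window.__DATA__ = {};</script>"
    "<script src='/bundle.js'></script></body></html>"
)

print(repr(visible_text(SERVER_RENDERED)))  # 'Title Body text.'
print(repr(visible_text(CLIENT_RENDERED)))  # ''
```

In the second case the article text only exists after a browser has executed the scripts, which is why a browser-based tool is needed.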

@thiswillbeyourgithub
Collaborator

Hi! Thanks for your interest in autocards.

I've contributed quite a lot to PRs of autocards (see for example the pending PR), but sadly I'm terrible at web design, so I will very probably not do this myself.

If you provide a clean way to simply get text data from a URL, I can integrate it into the codebase very quickly, if you want.

Have a nice day!

@deklanw
Author

deklanw commented Sep 6, 2021

I looked into it and it seems that basically every solution either requires 1) integration with a web browser or 2) using a paid service (which probably uses 1 under the hood).

Here's one working example:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import FirefoxOptions

# need Firefox installed, and the corresponding Firefox driver (geckodriver)
# see https://selenium-python.readthedocs.io/installation.html#drivers
opts = FirefoxOptions()

# I'm using WSL, so I need this option
opts.add_argument("--headless")

url = "https://www.khanacademy.org/humanities/world-history/medieval-times/cross-cultural-diffusion-of-knowledge/a/the-golden-age-of-islam"

driver = webdriver.Firefox(options=opts)
try:
    driver.get(url)
    # name the parser explicitly; otherwise bs4 guesses one and emits a warning
    soup = BeautifulSoup(driver.page_source, "html.parser")
finally:
    # quit() ends the whole browser session; close() only closes the current window
    driver.quit()

# plain text of the rendered page
text = soup.get_text(separator="\n", strip=True)
```
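
If a single "URL in, text out" function is what's wanted, the snippet above could be wrapped roughly like this. This is only a sketch: `fetch_rendered_text` and `clean_text` are hypothetical names, and the `wait_for_tag` default is an assumption. On a client-rendered page `body` exists immediately, so you would probably want to pass a tag that only appears once the content has loaded (for example `article`, which I have not verified against Khan Academy).

```python
import re


def clean_text(raw):
    """Collapse runs of spaces/tabs and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def fetch_rendered_text(url, wait_for_tag="body", timeout=15):
    """Load url in headless Firefox and return the visible text of one element.

    Requires Firefox and its driver, same as the snippet above.
    """
    # Imported lazily so this module still imports without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver import FirefoxOptions
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    opts = FirefoxOptions()
    opts.add_argument("--headless")
    driver = webdriver.Firefox(options=opts)
    try:
        driver.get(url)
        # Block until an element with the given tag is present, or time out.
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, wait_for_tag))
        )
        # .text is the text the browser actually renders for that element.
        return clean_text(element.text)
    finally:
        # Always shut the browser down, even if the wait times out.
        driver.quit()
```

Usage would be something like `text = fetch_rendered_text(url, wait_for_tag="article")`, with the result handed to autocards as plain text.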

Unfortunately, it requires having Firefox installed and the corresponding web driver on your PATH. There is also requests-html, which is supposed to be a drop-in replacement for requests. It supports 'rendering' the JS in the page, but it also seems to work by just downloading a Chromium instance the first time you call it. And I'm getting an error with it anyway (maybe WSL-related).

This is to say that all of these methods are brittle, and trying to support them in the library itself would be a pain. But including instructions on how to do it somewhere might be useful.
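
One possible middle ground, sketched below under assumptions: the library keeps its plain fetch and accepts an optional user-supplied callable for pages that need a browser, so no browser dependency lives in the library itself. `get_html`, `fetch_static_html` and the `renderer` parameter are hypothetical names for illustration, not anything autocards currently provides.

```python
from typing import Callable, Optional
from urllib.request import urlopen


def fetch_static_html(url: str) -> str:
    """Plain HTTP fetch with the standard library; runs no JavaScript."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def get_html(url: str, renderer: Optional[Callable[[str], str]] = None) -> str:
    """Return HTML for url, using the caller's renderer when one is supplied.

    renderer is any function taking a URL and returning HTML, for example
    one built on Selenium that the user maintains on their own machine.
    """
    fetch = renderer if renderer is not None else fetch_static_html
    return fetch(url)
```

The brittle part (browser, driver, PATH setup) then stays on the user's side, and the docs only need to show an example renderer.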

@thiswillbeyourgithub
Collaborator

thiswillbeyourgithub commented Sep 6, 2021

Yes, that's my conclusion as well. I think dynamic websites can be exported to PDF or just copied and pasted into autocards, so that's "fine" :/

Thanks for looking into this!
