Support for custom user agents in is_live_page() #114

Open
drFerg opened this issue Aug 29, 2024 · 1 comment

Comments

drFerg commented Aug 29, 2024

Hi!

We're currently using courlan via trafilatura for crawling. When we run liveness checks on a host's URL, we get blocked because of the user-agent header, and we have no way to change it. I noticed that the redirection test, which is_live_page uses, contains some commented-out code that references user-agent headers.

Is there any interest in supporting changing the headers or having a different one set?
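
For illustration, here is a minimal sketch of the kind of check being asked for: a HEAD request that sends a caller-supplied User-Agent header. It uses only the standard library and is not courlan's code; the function name `is_live_page_with_ua` and its parameters are hypothetical, and treating any 2xx/3xx status as "live" is an assumption.

```python
import urllib.error
import urllib.request

DEFAULT_TIMEOUT = 10  # seconds; arbitrary value chosen for this sketch


def is_live_page_with_ua(url: str, user_agent: str,
                         timeout: float = DEFAULT_TIMEOUT) -> bool:
    """Hypothetical helper (not part of courlan): send a HEAD request
    with a custom User-Agent and report whether the server answered
    with a success or redirect status."""
    try:
        # Request() itself raises ValueError on a malformed URL,
        # so it is built inside the try block.
        request = urllib.request.Request(
            url, method="HEAD", headers={"User-Agent": user_agent}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors (e.g. 403), network failures and malformed URLs
        # all count as "not live" here.
        return False
```

A call would then look like `is_live_page_with_ua("https://example.org", "my-crawler/1.0")`. Whether courlan should expose the header as a function argument or through configuration is left open.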

Thanks.

drFerg changed the title from "is_live_url is sometimes failing due to user agent blocking" to "is_live_page is sometimes failing due to user agent blocking" on Aug 29, 2024
adbar changed the title from "is_live_page is sometimes failing due to user agent blocking" to "Support for custom user agents in is_live_page()" on Aug 29, 2024
adbar (Owner) commented Aug 29, 2024

Hi @drFerg, definitely. Trafilatura supports custom user-agent settings, and courlan could do so as well. The config file approach could be replicated here.

Are you interested in drafting a pull request?
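
As a rough sketch of what a config file approach might look like, the snippet below reads a list of user agents from an INI-style file with `configparser` and picks one at random. The option name `USER_AGENTS`, the one-agent-per-line layout, and the helpers `load_user_agents` and `pick_user_agent` are assumptions made for this example, not an existing courlan interface.

```python
import configparser
import random
from typing import List, Optional

# Example configuration; the option name and layout are assumptions.
SAMPLE_CONFIG = """
[DEFAULT]
# One user agent per line; leave empty to keep the library default.
USER_AGENTS =
    Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0
    my-crawler/1.0 (+https://example.org/bot)
"""


def load_user_agents(config_text: str) -> List[str]:
    """Hypothetical helper: return the non-empty lines of USER_AGENTS."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser.get("DEFAULT", "USER_AGENTS", fallback="")
    return [line.strip() for line in raw.splitlines() if line.strip()]


def pick_user_agent(agents: List[str]) -> Optional[str]:
    """Pick one agent at random, or None to signal 'use the default'."""
    return random.choice(agents) if agents else None


if __name__ == "__main__":
    print(pick_user_agent(load_user_agents(SAMPLE_CONFIG)))
```

Returning `None` for an empty list lets the caller fall back to whatever header the HTTP client sends by default.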
