encoding errors when using the BeautifulSoupCrawlingContext #695
Hello @Rigos0! I tried reproducing this with the following snippet:

```python
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

from .routes import router


async def main() -> None:
    """The crawler entry point."""
    crawler = BeautifulSoupCrawler(
        request_handler=router,
        max_requests_per_crawl=50,
        http_client=CurlImpersonateHttpClient(),
    )

    await crawler.run(
        [
            'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=25',
            'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=50',
        ]
    )
```

...and there was no error 🤯 Can you please provide a better reproduction script?
I can currently reproduce the error using this script.

Note: the scraped website will change over time because we are using the offset parameter; new reviews will push the problematic ones to higher offsets. If you are not getting any errors, please verify that you scraped the entire HTML by searching for the string `c-review-block"`.
I tried again; I'm getting 25 review blocks consistently and no errors. Could you please try to save the HTML of the page that crashes BeautifulSoup? We should try to put together a reproduction example that doesn't depend on the current state of a website that changes this often.
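One way to capture the failing page is to write the raw response bytes to disk inside the request handler, before any parsing happens. Here is a minimal sketch of such a dumping helper; the `failed_pages` directory and the filename scheme are my own invention, and I'm assuming the handler can obtain the raw body bytes (e.g. via something like `context.http_response.read()`):

```python
from pathlib import Path


def dump_response(body: bytes, url: str, out_dir: str = "failed_pages") -> Path:
    """Write raw response bytes to disk so the exact failing HTML can be shared.

    Writing bytes (not a decoded str) preserves the original encoding, which is
    what matters when debugging an encoding error.
    """
    Path(out_dir).mkdir(exist_ok=True)
    # Hypothetical naming scheme: derive a filename from the URL's offset parameter.
    name = url.rsplit("offset=", 1)[-1] + ".html"
    path = Path(out_dir) / name
    path.write_bytes(body)
    return path
```

Inside the handler you would call it as e.g. `dump_response(raw_body, context.request.url)` before handing the body to the parser.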
Got an error using offset=200. The zip contains the original HTML source code and also the HTML scraped by the script from the previous message. It contains only 3/25 blocks, and some special characters are garbled.
This is strange. Could you dump the response headers of a failing request for me? `print(context.http_response.headers)`
Okay, nothing suspicious there. Could you also provide the complete stack trace of your error?
The stack trace unfortunately contains nothing beyond the encoding error itself. I tried catching the error/warning with custom handling, but that did not help. This is what Claude had to say about it: Also, just a note: I was wondering whether I had done something wrong with the libraries or the script, but then I remembered I have already replicated the error both locally and on the Apify platform.
Interesting. Could you link me to the run on the Apify platform then? |
When running a crawler using the BeautifulSoupCrawlingContext, I am getting encoding errors that I cannot fix. They are thrown even before the handler function is called:

"encoding error : input conversion failed due to input error, bytes 0xEB 0x85 0x84 0x20"
The error occurs in roughly 30% of requests when scraping reviews from Booking.com. Some example links for reproduction:
https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=25
https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=50
I found a relevant issue (mitmproxy/mitmproxy#657) stating: "Libxml2 does not support the GB2312 encoding so a way to get around this problem is to convert it to utf-8. I did it and it works for me." However, I did not manage to fix your BeautifulSoupCrawlingContext code by specifying the encoding.
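For what it's worth, the bytes quoted in the error message are themselves valid UTF-8, which suggests libxml2 is applying the wrong charset rather than receiving corrupt data. A small stdlib-only check, plus a sketch of the convert-to-UTF-8 workaround from the linked issue (decoding the body yourself so the parser never has to guess the charset):

```python
# The failing sequence 0xEB 0x85 0x84 from the error message decodes cleanly
# as UTF-8: it is the Korean syllable U+B144 ('년'), which appears in dates
# like "2023년" in Korean-language reviews.
raw = bytes([0xEB, 0x85, 0x84])
print(raw.decode("utf-8"))  # → 년

# Workaround sketch: decode the response body explicitly (replacing anything
# genuinely broken) and hand a str to the HTML parser, so libxml2 never
# attempts its own charset detection on the raw bytes.
body = b"<p>\xeb\x85\x84 2023</p>"
text = body.decode("utf-8", errors="replace")
print(text)  # → <p>년 2023</p>
```

How to wire this into the crawler's context is the open question here; the decoding step itself is the part the mitmproxy issue describes.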