encoding errors when using the BeautifulSoupCrawlingContext #695

Open
Rigos0 opened this issue Nov 13, 2024 · 9 comments
Rigos0 commented Nov 13, 2024

When running a crawler with BeautifulSoupCrawler (i.e. using BeautifulSoupCrawlingContext), I am getting encoding errors that I have not been able to fix. They are thrown even before the handler function is called.

"encoding error : input conversion failed due to input error, bytes 0xEB 0x85 0x84 0x20"

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    # (Apify Actor context removed for this minimal example.)
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        print(f"Processing URL: {context.request.url}")

    await crawler.run(['https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=25'])

asyncio.run(main())

The error occurs in roughly 30% of requests when scraping reviews from Booking.com. Some example links for replication:

https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=25

https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=50

I found a relevant issue, mitmproxy/mitmproxy#657, stating: "Libxml2 does not support the GB2312 encoding, so a way to get around this problem is to convert it to utf-8. I did it and it works for me." However, I did not manage to fix the BeautifulSoupCrawlingContext behaviour by specifying the encoding.
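For reference, here is a minimal sketch of that convert-to-UTF-8 workaround outside of crawlee (parse_as_utf8 is a hypothetical helper, not part of any library; httpx stands in for the crawler's HTTP client):

import httpx
from bs4 import BeautifulSoup

def parse_as_utf8(raw: bytes, declared_encoding: str = 'utf-8') -> BeautifulSoup:
    # Decode with the declared (or detected) encoding, replacing any
    # undecodable bytes, so the parser only ever sees clean text.
    text = raw.decode(declared_encoding, errors='replace')
    return BeautifulSoup(text, 'html.parser')

response = httpx.get(
    'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=25'
)
soup = parse_as_utf8(response.content)
print(soup.prettify()[:200])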

The github-actions bot added the t-tooling label (issues owned by the tooling team) on Nov 13, 2024.
janbuchar (Collaborator) commented:

Hello @Rigos0! I tried reproducing this with the following snippet:

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

from .routes import router


async def main() -> None:
    """The crawler entry point."""
    crawler = BeautifulSoupCrawler(
        request_handler=router,
        max_requests_per_crawl=50,
        http_client=CurlImpersonateHttpClient(),
    )

    await crawler.run(
        [
            'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=25',
            'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=50'
        ]
    )

...and there was no error 🤯 Can you please provide a better reproduction script?


Rigos0 commented Nov 14, 2024

I can currently reproduce the error using this script.

import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

from crawlee.router import Router

router = Router[BeautifulSoupCrawlingContext]()

@router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    url = context.request.url
    print(f"Processing URL: {url}")

    html_content = context.soup.prettify()
    print(html_content)


async def main() -> None:
    crawler = BeautifulSoupCrawler(request_handler=router)
    await crawler.run([
            'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=50'
        ])


if __name__ == '__main__':
    asyncio.run(main())

Note: the scraped page will change over time because we are using the offset parameter - new reviews will push the problematic ones to a higher offset.

If you are not getting any errors, please verify that you scraped the entire HTML by searching for the string c-review-block". If the site is scraped correctly, you should find 25 review blocks - one for each review.
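For example, a quick way to count them inside the handler (assuming the reviews sit in elements carrying the c-review-block CSS class):

review_blocks = context.soup.select('.c-review-block')
print(f'Found {len(review_blocks)} review blocks (expected 25)')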

In my case, I get this error and only 11/25 review blocks:
[screenshot: encoding error and truncated output]

janbuchar (Collaborator) commented:

> Note: the scraped page will change over time because we are using the offset parameter - new reviews will push the problematic ones to a higher offset.
>
> If you are not getting any errors, please verify that you scraped the entire HTML by searching for the string c-review-block". If the site is scraped correctly, you should find 25 review blocks - one for each review.

I tried again; I'm getting 25 review blocks consistently and no errors.

Could you please save the HTML of the page that crashes BeautifulSoup? We should try to build a reproduction example that doesn't depend on the current state of a website that changes this often.
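Something along these lines in the request handler should do; this is only a sketch, and depending on your crawlee version http_response.read() may need to be awaited:

from pathlib import Path

@router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Dump the raw response body so the failing page can be replayed offline.
    raw = context.http_response.read()  # possibly `await context.http_response.read()`
    Path('failing_page.html').write_bytes(raw)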


Rigos0 commented Nov 18, 2024

html_contents.zip

Got an error using offset=200.

The zip contains the original HTML source and the HTML scraped by the script from the previous message. The scraped version contains only 3/25 review blocks, and some special characters are garbled.

janbuchar (Collaborator) commented:

This is strange. Could you dump the response headers of a failing request for me?

print(context.http_response.headers)


Rigos0 commented Nov 18, 2024

root={'cache-control': 'private', 'content-encoding': 'br', 'content-length': '10185', 'content-security-policy-report-only': "frame-ancestors 'none'; report-uri https://nellie.booking.com/csp-report-uri?type=report&tag=112&pid=2a8173b6036f02ae&e=UmFuZG9tSVYkc2RlIyh9YeCr9sjcycwx2MIpyyQyTpmqV_3QMFueVZyxbPr4tb7Q", 'content-type': 'text/html; charset=UTF-8', 'date': 'Mon, 18 Nov 2024 16:27:25 GMT', 'nel': '{"max_age":604800,"report_to":"default"}', 'report-to': '{"group":"default","max_age":604800,"endpoints":[{"url":"https://nellie.booking.com/report"}]}', 'server': 'nginx', 'strict-transport-security': 'max-age=63072000; includeSubDomains; preload', 'vary': 'User-Agent, Accept-Encoding', 'via': '1.1 fbd2b51fce9ee4f3aa7b93dbbda3d698.cloudfront.net (CloudFront)', 'x-amz-cf-id': 'KhVqUbybzfGTSTkfDeZIo7vTNT1GvYPI0WPTTsfMMvZio5OpiTlFXw==', 'x-amz-cf-pop': 'FRA56-P8', 'x-cache': 'Miss from cloudfront', 'x-content-type-options': 'nosniff', 'x-recruiting': 'Like HTTP headers? Come write ours: https://careers.booking.com', 'x-xss-protection': '1; mode=block'}

janbuchar (Collaborator) commented:

Okay, nothing suspicious there. Could you also provide the complete stack trace of your error?


Rigos0 commented Nov 19, 2024

The stack trace is unfortunately just:

encoding error : input conversion failed due to input error, bytes 0xF0 0x9F 0x98 0x89
encoding error : input conversion failed due to input error, bytes 0xF0 0x9F 0x98 0x89

I tried catching the error/warning with custom handling, but that did not help. This is what Claude had to say about it:

> That's the entire error message, repeated twice. The bytes 0xF0 0x9F 0x98 0x89 represent a UTF-8 encoded emoji (the winking face 😉). There's no additional stack trace or context for these specific encoding errors because they are likely generated at a lower level by the XML/HTML parser, without proper Python exception handling.

Also, just a note: I wondered whether I had done something wrong with the libraries or the script, but I have already replicated the error both locally and on the Apify platform.
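(One workaround worth experimenting with, assuming your crawlee version exposes a parser argument on the BeautifulSoupCrawler constructor: switch to Python's built-in parser, which bypasses libxml2 entirely.)

crawler = BeautifulSoupCrawler(
    request_handler=router,
    parser='html.parser',  # pure-Python parser, no libxml2 involved
)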

janbuchar (Collaborator) commented:

Interesting. Could you link me to the run on the Apify platform then?
