encoding errors when using the BeautifulSoupCrawlingContext #695

Open
Rigos0 opened this issue Nov 13, 2024 · 9 comments
Rigos0 commented Nov 13, 2024

When running a crawler with BeautifulSoupCrawler (i.e. using BeautifulSoupCrawlingContext), I am getting encoding errors that I have not been able to fix. They are thrown even before the handler function is called.

"encoding error : input conversion failed due to input error, bytes 0xEB 0x85 0x84 0x20"

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    # (Apify Actor context removed for this minimal example.)
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        print(f"Processing URL: {context.request.url}")

    await crawler.run(['https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=25'])

asyncio.run(main())

The error occurs in roughly 30% of requests when scraping reviews from Booking.com. Some example links for replication:

https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=25

https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=50

I found a relevant issue, mitmproxy/mitmproxy#657, stating: "Libxml2 does not support the GB2312 encoding, so a way to get around this problem is to convert it to utf-8. I did it and it works for me." However, I did not manage to fix the BeautifulSoupCrawlingContext behaviour by specifying the encoding.
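For reference, here is a minimal sketch of that convert-to-UTF-8 workaround outside of crawlee (parse_as_utf8 is a hypothetical helper, not part of any library; httpx stands in for the crawler's HTTP client):

import httpx
from bs4 import BeautifulSoup

def parse_as_utf8(raw: bytes, declared_encoding: str = 'utf-8') -> BeautifulSoup:
    # Decode with the declared (or detected) encoding, replacing any
    # undecodable bytes, so the parser only ever sees clean text.
    text = raw.decode(declared_encoding, errors='replace')
    return BeautifulSoup(text, 'html.parser')

response = httpx.get(
    'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=25'
)
soup = parse_as_utf8(response.content)
print(soup.prettify()[:200])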

The github-actions bot added the t-tooling label (issues owned by the tooling team) on Nov 13, 2024.
janbuchar (Collaborator) commented:

Hello @Rigos0! I tried reproducing this with the following snippet:

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

from .routes import router


async def main() -> None:
    """The crawler entry point."""
    crawler = BeautifulSoupCrawler(
        request_handler=router,
        max_requests_per_crawl=50,
        http_client=CurlImpersonateHttpClient(),
    )

    await crawler.run(
        [
            'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=25',
            'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=50'
        ]
    )

...and there was no error 🤯 Can you please provide a better reproduction script?


Rigos0 commented Nov 14, 2024

I can currently reproduce the error using this script.

import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

from crawlee.router import Router

router = Router[BeautifulSoupCrawlingContext]()

@router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    url = context.request.url
    print(f"Processing URL: {url}")

    html_content = context.soup.prettify()
    print(html_content)


async def main() -> None:
    crawler = BeautifulSoupCrawler(request_handler=router)
    await crawler.run([
            'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=50'
        ])


if __name__ == '__main__':
    asyncio.run(main())

Note: the scraped page will change over time because we are using the offset parameter - new reviews will push the problematic ones to a higher offset.

If you are not getting any errors, please verify that you scraped the entire HTML by searching for the string c-review-block". If the site is scraped correctly, you should find 25 review blocks - one for each review.
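For example, a quick way to count them inside the handler (assuming the reviews sit in elements carrying the c-review-block CSS class):

review_blocks = context.soup.select('.c-review-block')
print(f'Found {len(review_blocks)} review blocks (expected 25)')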

In my case, I get this error and only 11/25 review blocks:
[screenshot: encoding error and truncated output]

janbuchar (Collaborator) commented:

> Note: the scraped page will change over time because we are using the offset parameter - new reviews will push the problematic ones to a higher offset.
>
> If you are not getting any errors, please verify that you scraped the entire HTML by searching for the string c-review-block". If the site is scraped correctly, you should find 25 review blocks - one for each review.

I tried again; I'm getting 25 review blocks consistently and no errors.

Could you please save the HTML of the page that crashes BeautifulSoup? We should try to build a reproduction example that doesn't depend on the current state of a website that changes this often.
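Something along these lines in the request handler should do; this is only a sketch, and depending on your crawlee version http_response.read() may need to be awaited:

from pathlib import Path

@router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Dump the raw response body so the failing page can be replayed offline.
    raw = context.http_response.read()  # possibly `await context.http_response.read()`
    Path('failing_page.html').write_bytes(raw)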


Rigos0 commented Nov 18, 2024

html_contents.zip

Got an error using offset=200.

The zip contains the original HTML source and the HTML scraped by the script from the previous message. The scraped version contains only 3/25 review blocks, and some special characters are garbled.

janbuchar (Collaborator) commented:

This is strange. Could you dump the response headers of a failing request for me?

print(context.http_response.headers)


Rigos0 commented Nov 18, 2024

root={'cache-control': 'private', 'content-encoding': 'br', 'content-length': '10185', 'content-security-policy-report-only': "frame-ancestors 'none'; report-uri https://nellie.booking.com/csp-report-uri?type=report&tag=112&pid=2a8173b6036f02ae&e=UmFuZG9tSVYkc2RlIyh9YeCr9sjcycwx2MIpyyQyTpmqV_3QMFueVZyxbPr4tb7Q", 'content-type': 'text/html; charset=UTF-8', 'date': 'Mon, 18 Nov 2024 16:27:25 GMT', 'nel': '{"max_age":604800,"report_to":"default"}', 'report-to': '{"group":"default","max_age":604800,"endpoints":[{"url":"https://nellie.booking.com/report"}]}', 'server': 'nginx', 'strict-transport-security': 'max-age=63072000; includeSubDomains; preload', 'vary': 'User-Agent, Accept-Encoding', 'via': '1.1 fbd2b51fce9ee4f3aa7b93dbbda3d698.cloudfront.net (CloudFront)', 'x-amz-cf-id': 'KhVqUbybzfGTSTkfDeZIo7vTNT1GvYPI0WPTTsfMMvZio5OpiTlFXw==', 'x-amz-cf-pop': 'FRA56-P8', 'x-cache': 'Miss from cloudfront', 'x-content-type-options': 'nosniff', 'x-recruiting': 'Like HTTP headers? Come write ours: https://careers.booking.com', 'x-xss-protection': '1; mode=block'}

janbuchar (Collaborator) commented:

Okay, nothing suspicious there. Could you also provide the complete stack trace of your error?


Rigos0 commented Nov 19, 2024

The stack trace is unfortunately just:

encoding error : input conversion failed due to input error, bytes 0xF0 0x9F 0x98 0x89
encoding error : input conversion failed due to input error, bytes 0xF0 0x9F 0x98 0x89

I tried catching the error/warning with custom handling, but that did not help. This is what Claude had to say about it:

> That's the entire error message, repeated twice. The bytes 0xF0 0x9F 0x98 0x89 represent a UTF-8 encoded emoji (the winking face 😉). There's no additional stack trace or context for these specific encoding errors because they are likely generated at a lower level by the XML/HTML parser, without proper Python exception handling.

Also, just a note: I wondered whether I had done something wrong with the libraries or the script, but I have already replicated the error both locally and on the Apify platform.
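(One workaround worth experimenting with, assuming your crawlee version exposes a parser argument on the BeautifulSoupCrawler constructor: switch to Python's built-in parser, which bypasses libxml2 entirely.)

crawler = BeautifulSoupCrawler(
    request_handler=router,
    parser='html.parser',  # pure-Python parser, no libxml2 involved
)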

janbuchar (Collaborator) commented:

Interesting. Could you link me to the run on the Apify platform then?
