Improve the deduplication of requests #178
Comments
I think yes, you should do that.
Yes. For example, for cases where the server returns a 200 response status, but the response body indicates that an error occurred and the request should be executed again.
Let's try to find a better name than …
Context
A while ago, Honza Javorek raised some good points regarding the deduplication process in the request queue (#190).
The first one concerned the unique key generation logic. In response, we improved this logic in the Python SDK (PR #193) to align it with the TypeScript version of Crawlee. It was later copied to crawlee-python and can now be found in crawlee/_utils/requests.py.
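For readers unfamiliar with that file, the general shape of such logic looks roughly like this (a minimal sketch for illustration only, not the actual crawlee/_utils/requests.py implementation; the function names and signatures here are assumptions):

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str, *, keep_url_fragment: bool = False) -> str:
    """Normalize a URL so that trivially different spellings deduplicate together."""
    parts = urlsplit(url.strip())
    # Sort query parameters so that ?a=1&b=2 and ?b=2&a=1 yield the same key.
    query = urlencode(sorted(parse_qsl(parts.query)))
    fragment = parts.fragment if keep_url_fragment else ''
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path, query, fragment))


def compute_unique_key(url: str, method: str = 'GET', payload: bytes | None = None) -> str:
    """Derive a deduplication key from the request's URL, method, and payload."""
    # Hashing the payload keeps POST requests with different bodies distinct.
    payload_hash = hashlib.sha256(payload or b'').hexdigest()
    return f'{method.upper()}|{payload_hash}|{normalize_url(url)}'
```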
The second one concerned bypassing deduplication. Currently, HTTP headers are not considered in the computation of unique keys. Additionally, we do not offer an option to explicitly bypass request deduplication, unlike the dont_filter option in Scrapy (docs).
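For reference, this is how the Scrapy option is used: passing dont_filter=True makes the scheduler enqueue a request even if its fingerprint has already been seen.

```python
import scrapy


class RetrySpider(scrapy.Spider):
    name = 'retry_example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Without dont_filter=True, this request would be dropped as a
        # duplicate, since the same URL has already been crawled.
        yield scrapy.Request(response.url, callback=self.parse_retry, dont_filter=True)

    def parse_retry(self, response):
        self.logger.info('Re-fetched %s', response.url)
```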
Questions
- Should HTTP headers be included in the unique_key and extended_unique_key computation?
- Should we implement a dont_filter-style feature? If so, under what name (e.g. always_enqueue)? See the hypothetical sketch after this list.
- Should use_extended_unique_key be set as the default behavior?
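To make the naming question more concrete, here is a purely hypothetical sketch of what such an option could look like. Note that always_enqueue does not exist in Crawlee today, and the explicit unique_key override shown first is only assumed to be available on Request.from_url:

```python
from uuid import uuid4

from crawlee import Request

# Workaround possible today (assuming Request.from_url accepts an explicit
# unique_key): a random key guarantees the request is never deduplicated.
request = Request.from_url('https://example.com', unique_key=str(uuid4()))

# Hypothetical dedicated flag, name pending the discussion above:
# request = Request.from_url('https://example.com', always_enqueue=True)
```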