Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the deduplication of requests #178

Closed
vdusek opened this issue Jun 10, 2024 · 3 comments
Closed

Improve the deduplication of requests #178

vdusek opened this issue Jun 10, 2024 · 3 comments
Assignees
Labels
solutioning The issue is not being implemented but only analyzed and planned. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@vdusek
Copy link
Collaborator

vdusek commented Jun 10, 2024

Context

A while ago, Honza Javorek raised some good points regarding the deduplication process in the request queue (#190).

The first one:

Is it possible that Apify's request queue dedupes the requests only based on the URL? Because the POSTs all have the same URL, just different payload. Which should be very common - by definition of what POST is, or even in practical terms with all the GraphQL APIs around.

In response, we improved the unique key generation logic in the Python SDK (PR #193) to align with the TS Crawlee. This logic was lates copied to crawlee-python and can be found in crawlee/_utils/requests.py.

The second one:

Also wondering whether two identical requests with one different HTTP header should be considered same or different. Even with a simple GET request, I could make one with Accept-Language: cs, another with Accept-Language: en, and I can get two wildly different responses from the same server.

Currently, HTTP headers are not considered in the computation of unique keys. Additionally, we do not offer an option to explicitly bypass request deduplication, unlike the dont_filter option in Scrapy (docs).

Questions

  • Should we include HTTP headers in the unique_key and extended_unique_key computation?
    • Yes.
  • Should we implement a dont_filter feature?
    • It will be just a syntax sugar appending some random string to a unique key.
    • Also come up with a better name (e.g. always_enqueue)?
  • Should use_extended_unique_key be set as the default behavior?
    • Probably not now.
@vdusek vdusek added t-tooling Issues with this label are in the ownership of the tooling team. enhancement New feature or request. labels Jun 10, 2024
@vdusek vdusek added this to the 0.2 milestone Jun 10, 2024
@Mantisus
Copy link
Contributor

Should we include HTTP headers in the unique_key (extended_unique_key) computation?

I think yes, you should do that.
In addition to the Accept-Language mentioned. You can consider the situation when within the crawler work requests are executed from different authorized users. The only difference is the header containing the authorization token.
There are other special cases when the header has a significant impact on the content of the response.

Should we implement a dont_filter feature?

Yes. For example, for cases where the server returns a 200 response status. But the response body contains data that an error occurred and this request should be executed again.
If I see the current implementation correctly, this will not be possible without this option.

@vdusek vdusek removed this from the v0.2 milestone Jul 15, 2024
@B4nan
Copy link
Member

B4nan commented Sep 23, 2024

Let's try to find a better name than dont_filter, we want a new option that will put a random value into unique_key.

@vdusek vdusek added solutioning The issue is not being implemented but only analyzed and planned. and removed enhancement New feature or request. hacktoberfest labels Sep 27, 2024
@vdusek vdusek self-assigned this Sep 27, 2024
@vdusek vdusek added this to the 99th sprint - Tooling team milestone Sep 27, 2024
@vdusek
Copy link
Collaborator Author

vdusek commented Sep 27, 2024

Thanks for the inputs, based on this and our internal discussions I opened #547 and #548 and closing this one.

@vdusek vdusek closed this as completed Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
solutioning The issue is not being implemented but only analyzed and planned. t-tooling Issues with this label are in the ownership of the tooling team.
Projects
None yet
Development

No branches or pull requests

3 participants