Add always_enqueue option to Request for bypassing deduplication #547

Closed
vdusek opened this issue Sep 27, 2024 · 4 comments · Fixed by #621
Labels
enhancement New feature or request. hacktoberfest t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@vdusek
Collaborator

vdusek commented Sep 27, 2024

  • Add an always_enqueue option (or use a better name for it, but avoid negative terms) as an input parameter to the Request.from_url constructor.
    • This will allow users to easily opt out of the request deduplication process.
  • Implement the option as a convenient wrapper that generates a random unique_key, ensuring that each request is always enqueued and processed.
  • Address edge cases where both unique_key and always_enqueue=True are provided.
  • It should work in the same way as the dont_filter option in Scrapy (docs).
@vdusek vdusek added enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. hacktoberfest labels Sep 27, 2024
@belloibrahv

Hi @vdusek,

I'm interested in working on this enhancement to add an always_enqueue option for bypassing request deduplication. To ensure I understand the requirements correctly, I have a few questions:

  1. Regarding the option name:

    • Is always_enqueue the preferred name, or would you like suggestions for alternatives?
    • Should it be implemented as a boolean parameter?
  2. For the implementation of the random unique_key wrapper:

    • Do you have any preferences for the random key generation method?
    • Should there be a specific format or prefix for these generated keys?
  3. About the edge cases:

    • When both unique_key and always_enqueue=True are provided, which should take precedence?
    • Should we add a warning or raise an exception in this case?
  4. Regarding similarity to Scrapy's dont_filter:

    • Are there any specific behaviors from Scrapy's implementation that we should replicate or avoid?
    • Should we maintain exact parity with Scrapy's behavior, or are there Crawlee-specific considerations?

I'm familiar with Python and HTTP clients, and I'd be happy to work on this enhancement. Let me know if you need any clarification or have additional requirements before I start implementing the solution.

Thank you for considering my contribution!

@vdusek
Collaborator Author

vdusek commented Oct 2, 2024

Hi @belloibrahv, thanks for your interest in Crawlee.

> Is always_enqueue the preferred name, or would you like suggestions for alternatives?

always_enqueue is good, but if you come up with something better, we can consider it.

> Should it be implemented as a boolean parameter?

Yes.

> Do you have any preferences for the random key generation method?
> Should there be a specific format or prefix for these generated keys?

You can probably generate a standard unique key and then append some random string using crypto_random_object_id.
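
A minimal sketch of that idea, using only the standard library. Both helpers are hypothetical: `random_suffix` is a stand-in for `crypto_random_object_id`, and `compute_unique_key` assumes the standard key is just the trimmed URL, which is simpler than whatever normalization Crawlee actually applies.

```python
import secrets
import string


def random_suffix(length: int = 17) -> str:
    """Hypothetical stand-in for crypto_random_object_id: a random alphanumeric string."""
    alphabet = string.ascii_letters + string.digits
    return ''.join(secrets.choice(alphabet) for _ in range(length))


def compute_unique_key(url: str, *, always_enqueue: bool = False) -> str:
    """Derive a deduplication key from the URL (simplified assumption).

    With always_enqueue=True a random suffix is appended, so two requests
    for the same URL do not share a key and are not deduplicated.
    """
    # Assumed "standard" key: the URL with surrounding whitespace removed.
    key = url.strip()
    if always_enqueue:
        key = f'{key}_{random_suffix()}'
    return key
```

Without the flag, the same URL always maps to the same key, so the second request would be filtered out. With the flag, each call yields a distinct key while the URL stays recognizable as a prefix.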

> When both unique_key and always_enqueue=True are provided, which should take precedence?
> Should we add a warning or raise an exception in this case?

In that case, raise an exception.
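
One way this check could look, as a sketch only. The function name `check_request_options` and the choice of `ValueError` are assumptions for illustration; the thread does not specify the exception type or where the check lives.

```python
from typing import Optional


def check_request_options(unique_key: Optional[str], always_enqueue: bool) -> None:
    """Reject the ambiguous combination of an explicit key and always_enqueue=True."""
    # An explicit unique_key asks for deduplication on that key, while
    # always_enqueue=True asks to bypass deduplication, so the two conflict.
    if unique_key is not None and always_enqueue:
        raise ValueError('Pass either unique_key or always_enqueue=True, not both.')
```

Failing early keeps the caller from silently getting one behavior when they may have expected the other.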

> Are there any specific behaviors from Scrapy's implementation that we should replicate or avoid?
> Should we maintain exact parity with Scrapy's behavior, or are there Crawlee-specific considerations?

I am not aware of any.

@paradoxxx09

/assign

@vdusek
Collaborator Author

vdusek commented Oct 14, 2024

@paradoxxx09 We don't assign issues for hacktoberfest. If you want to work on this, open a PR. First mergeable one gets merged.
