docs: add tiered proxy blog #2552

Merged: B4nan merged 8 commits into master from tiered-proxies on Jun 28, 2024

Conversation

souravjain540
Collaborator

add tiered proxy blog

@souravjain540 souravjain540 requested a review from B4nan June 24, 2024 05:10
@B4nan B4nan requested a review from barjin June 24, 2024 08:40
Member

@B4nan B4nan left a comment

A few formatting nits; I haven't read the text in detail yet.

Co-authored-by: Martin Adámek <banan23@gmail.com>
@souravjain540 souravjain540 requested a review from B4nan June 24, 2024 09:12
Contributor

image

This 😄 Plus, I don't know how readable something like this would be for a layman. Maybe adding chronological numbers to the actions, à la a UML communication diagram, would help with readability?

Cool graphics though :)

Collaborator Author

My bad, I copied the wrong one here. Thanks for pointing it out!

Collaborator Author

As for the second part, I've noted the feedback and will take care of it next time! :) Let's go with this one; otherwise, design would have to create a new one 😅


Proxies vary in quality, speed, reliability, and cost. There are a [few types of proxies](https://blog.apify.com/types-of-proxies/), such as datacenter and residential proxies. Datacenter proxies are cheaper but, on the other hand, more prone to getting blocked, and vice versa with residential proxies.

It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use [datacenter proxies](https://blog.apify.com/datacenter-proxies-when-to-use-them-and-how-to-make-the-most-of-them/) for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can manage both costs and avoid getting blocked. To manage this, we recently introduced `TieredProxies` in Crawlee. Let’s take a look at it.
Contributor

Suggested change
It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use [datacenter proxies](https://blog.apify.com/datacenter-proxies-when-to-use-them-and-how-to-make-the-most-of-them/) for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can manage both costs and avoid getting blocked. To manage this, we recently introduced `TieredProxies` in Crawlee. Let’s take a look at it.
It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use [datacenter proxies](https://blog.apify.com/datacenter-proxies-when-to-use-them-and-how-to-make-the-most-of-them/) for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can manage both costs and avoid getting blocked. To manage this, we recently introduced tiered proxies in Crawlee. Let’s take a look at it.

We don't call this feature TieredProxies anywhere


Tiered proxies are a method of organizing and using different types of proxies based on their quality, speed, reliability, and cost. Tiered proxies allow you to rotate between a mix of proxy types to optimize your scraping activities.

**Define proxy tiers**: You categorize your proxies into different tiers based on their quality. For example:
Contributor

**Define proxy tiers**

This seems like a rogue ChatGPT prompt left behind 😄

- **Adjusting tiers**: Higher-tier proxies are used if a domain shows more errors. Conversely, if a domain performs well with a high-tier proxy, the system will occasionally test lower-tier proxies. If successful, it continues using the lower tier, optimizing costs.
- **Forgetting old errors**: Old errors are given less weight over time, allowing the system to adjust tiers dynamically as proxies' performance changes.
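
The two behaviours in the list above can be pictured with a small, self-contained sketch. It is only an illustration under stated assumptions, not Crawlee's actual implementation: the class name `DomainTierTracker`, the `decay`, `escalateAt`, and `stepDownBelow` parameters, and their default values are all hypothetical.

```js
// Hypothetical per-domain tracker; names and thresholds are illustrative only.
class DomainTierTracker {
    constructor(tierCount, { decay = 0.5, escalateAt = 1.5, stepDownBelow = 0.1 } = {}) {
        this.tierCount = tierCount;
        this.decay = decay;
        this.escalateAt = escalateAt;
        this.stepDownBelow = stepDownBelow;
        this.tier = 0;       // index of the tier in use (0 = cheapest)
        this.errorScore = 0; // weighted count of recent errors
    }

    // Call once per finished request; returns the tier to use next.
    record(failed) {
        // Forgetting old errors: every request shrinks the previous score.
        this.errorScore = this.errorScore * this.decay + (failed ? 1 : 0);

        if (this.errorScore >= this.escalateAt && this.tier < this.tierCount - 1) {
            this.tier += 1; // too many recent errors: move to a higher tier
        } else if (this.errorScore < this.stepDownBelow && this.tier > 0) {
            this.tier -= 1; // quiet for a while: try the cheaper tier again
        }
        return this.tier;
    }
}
```

With these assumed defaults and a fresh tracker, two consecutive failures move a domain from tier 0 to tier 1, and four clean requests in a row bring it back down.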

### Working
Contributor

I hope it is

Jokes aside, the paragraphs from Working to Implementation seem very bare-bones. The feature is IMO very simple to explain, so I don't think we need to inflate the blog post. It can be short; there is just not enough content to talk about.

Collaborator Author

The need to add these things is mostly because there is already a lot of content on Scrapy rotating proxies boasting about their features, and we don't want users to feel our feature does not do those things when it actually does.

I understand it's not a fully developer-oriented part, but in the end it is a blog, not exactly docs, so we can, and I think should, have a little more explanation here :)

Contributor

I understand the need for good SEO content, but I'm not sure this is it. E.g.:

Structure - the structure of what? and why is the Structure heading nested under the Features? Also "Tiered Array" doesn't tell me anything (plus the description is wrong, the type is string[][], i.e. each element in the array is an array of strings - proxy URLs).

In general, I'm all for the "show don't tell" approach - the example shows much more to me than any amount of text.
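
For readers following the `string[][]` point, here is a minimal sketch of that shape. The URLs are placeholders, and `pickProxyUrl` is a hypothetical helper written for illustration; it is not part of Crawlee.

```js
// Outer array: the tiers, cheapest first. Inner arrays: proxy URLs in that tier.
// All URLs below are placeholders.
const tieredProxyUrls = [
    ['http://datacenter-1.example.com:8000', 'http://datacenter-2.example.com:8000'],
    ['http://residential-1.example.com:8000'],
];

// Hypothetical helper: rotate through the URLs of one tier by request count.
function pickProxyUrl(tiers, tierIndex, requestCount) {
    const tier = tiers[tierIndex];
    return tier[requestCount % tier.length];
}
```

Each element of the outer array is itself an array of strings, which is the detail the Structure paragraph got wrong.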

Contributor

I'm still hating on this part 😄 But as long as you fix the wrong type in the Structure paragraph, it's a go from me - I don't want to block this for too long.

Collaborator Author

I got your point; thanks for putting in the effort. 🫶

Collaborator Author

I removed the Structure and Tiered Array parts; neither was adding any value. 👍

I also corrected the headings!

Collaborator Author

@barjin if it's okay, let's release it today? :)

Contributor

Sure, it's a go from me 👍🏽


**Fallback Mechanism**: Crawlee starts with the first tier of proxies. If proxies in the current tier fail, it will switch to the next tier.
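
The fallback described above can be sketched in a few lines. This is a simplified illustration, not how Crawlee implements it: `requestWithFallback` is a hypothetical name, and `tryProxy` is an assumed callback standing in for a real request that returns `true` when the proxy worked.

```js
// Hypothetical sketch of tier fallback; `tryProxy` stands in for a real request.
function requestWithFallback(tiers, tryProxy) {
    for (let tierIndex = 0; tierIndex < tiers.length; tierIndex++) {
        for (const proxyUrl of tiers[tierIndex]) {
            if (tryProxy(proxyUrl)) {
                return { tierIndex, proxyUrl }; // first proxy that worked
            }
        }
        // Every proxy in this tier failed: fall through to the next tier.
    }
    return null; // all tiers exhausted
}
```

Because the tiers are tried in order, the cheapest proxies are always attempted first and the more expensive ones are used only when needed.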

### Implementation:
Contributor

This is an example, not implementation

Member

also if it's a heading it shouldn't have the colon at the end I guess

```js
await crawler.run();
```

## How tiered proxies use Session Pool under the hood
Contributor

While this part is not wrong, the implementation of all of this in Crawlee is a mess. I don't think people profit much from knowing how it's implemented.

Collaborator Author

The thinking, as I said earlier, was to explain that we do rotate and create sessions and use the session pool, since it is one of the features that makes us different from Scrapy and we under-promote it.

This also makes the blog a little more technical than just talking about the product/Crawlee update.

Contributor

Radial gradients are notoriously hard to compress for PNG.

In the video below, I'm switching between the original image (~800 kB) and a WebP compressed image (50 kB) - see the lower left corner. I don't see any visual degradation, especially at the sizes we'll be showing this picture on the web. Plus as long as the blog is still sharing the repo with Crawlee (the library), it would be nice to keep the git history as small as possible.

Peek.2024-06-24.11-41.mov



```js
const { CheerioCrawler, ProxyConfiguration } = require('crawlee');
```
Contributor

@barjin barjin Jun 25, 2024

Where did the ESM imports (import {} from "") go? 😅

For context, we're now (exclusively?) using the ESM import syntax in our examples, as it's the new "standard" way of working with JavaScript modules (also, the TypeScript syntax afaik only allows for the ESM imports).

@souravjain540
Collaborator Author

@B4nan good to go 👍

#2552 (comment)

@B4nan B4nan merged commit 899c064 into master Jun 28, 2024
9 checks passed
@B4nan B4nan deleted the tiered-proxies branch June 28, 2024 11:27
3 participants