docs: add tiered proxy blog #2552

souravjain540 · 2024-06-24T05:10:30Z

add tiered proxy blog

B4nan

A few formatting nits; I haven't read the text in detail yet.

website/blog/2024/06-24-proxy-management-in-crawlee/index.md

Co-authored-by: Martin Adámek <banan23@gmail.com>

barjin · 2024-06-24T09:14:07Z

website/blog/2024/06-24-proxy-management-in-crawlee/img/session-pool-working.png

this 😄 plus idk how readable something like this would be for a layman - maybe adding chronological numbers ala UML communication diagram to the actions would help with readability?

Cool graphics though :)

My bad, I copied the wrong one here. Thanks for pointing it out!

As for the second part, I've noted the feedback and will take care of it next time! :) Let's go with this one; otherwise the design team has to create a new one 😅

barjin · 2024-06-24T09:21:13Z

website/blog/2024/06-24-proxy-management-in-crawlee/index.md

+
+Proxies vary in quality, speed, reliability, and cost. There are a [few types of proxies](https://blog.apify.com/types-of-proxies/), such as datacenter and residential proxies. Datacenter proxies are cheaper but, on the other hand, more prone to getting blocked, and vice versa with residential proxies.
+
+It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use [datacenter proxies](https://blog.apify.com/datacenter-proxies-when-to-use-them-and-how-to-make-the-most-of-them/) for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can manage both costs and avoid getting blocked. To manage this, we recently introduced `TieredProxies` in Crawlee. Let’s take a look at it.


Suggested change

-It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use [datacenter proxies](https://blog.apify.com/datacenter-proxies-when-to-use-them-and-how-to-make-the-most-of-them/) for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can manage both costs and avoid getting blocked. To manage this, we recently introduced `TieredProxies` in Crawlee. Let’s take a look at it.

+It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use [datacenter proxies](https://blog.apify.com/datacenter-proxies-when-to-use-them-and-how-to-make-the-most-of-them/) for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can manage both costs and avoid getting blocked. To manage this, we recently introduced tiered proxies in Crawlee. Let’s take a look at it.

We don't call this feature TieredProxies anywhere

barjin · 2024-06-24T09:22:43Z

website/blog/2024/06-24-proxy-management-in-crawlee/index.md

+
+Tiered proxies are a method of organizing and using different types of proxies based on their quality, speed, reliability, and cost. Tiered proxies allow you to rotate between a mix of proxy types to optimize your scraping activities.
+
+**Define proxy tiers**: You categorize your proxies into different tiers based on their quality. For example:


**Define proxy tiers**

This seems like a rogue ChatGPT prompt left behind 😄

barjin · 2024-06-24T09:28:27Z

website/blog/2024/06-24-proxy-management-in-crawlee/index.md

+- **Adjusting tiers**: Higher-tier proxies are used if a domain shows more errors. Conversely, if a domain performs well with a high-tier proxy, the system will occasionally test lower-tier proxies. If successful, it continues using the lower tier, optimizing costs.
+- **Forgetting old errors**: Old errors are given less weight over time, allowing the system to adjust tiers dynamically as proxies' performance changes.
+
+### Working


I hope it is

jokes aside, the paragraphs from Working to Implementation seem very bare-bones. The feature is IMO very simple to explain, I don't think we need to inflate the blog post. It can be short, there is just not enough content to talk about.

These sections are there mostly because there is already a lot of content about Scrapy's rotating proxies boasting their features, and we don't want users to feel that our feature doesn't do those things when it actually does.

I understand it's not a fully developer-oriented part, but in the end this is a blog post, not docs, so we can, and I think should, include a little more explanation here :)

I understand the need for good SEO content, but I'm not sure this is it. E.g.:

Structure - the structure of what? and why is the Structure heading nested under the Features? Also "Tiered Array" doesn't tell me anything (plus the description is wrong, the type is string[][], i.e. each element in the array is an array of strings - proxy URLs).

In general, I'm all for the "show don't tell" approach - the example shows much more to me than any amount of text.
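To make the `string[][]` point concrete, here is a minimal sketch of the shape: an outer array of tiers, each tier an array of proxy URL strings. The proxy URLs and both helper names are hypothetical. To my knowledge this is the value passed to `ProxyConfiguration` as `tieredProxyUrls`, but that call is left out so the snippet runs without Crawlee installed.

```javascript
// Hypothetical proxy URLs, for illustration only.
const tieredProxyUrls = [
    // Tier 0: cheapest option, tried first (e.g. datacenter proxies).
    ['http://datacenter-1.example.com:8000', 'http://datacenter-2.example.com:8000'],
    // Tier 1: pricier fallback (e.g. residential proxies).
    ['http://residential-1.example.com:8000'],
];

// True when `value` is a string that parses as a URL.
function isProxyUrl(value) {
    if (typeof value !== 'string') return false;
    try {
        new URL(value);
        return true;
    } catch {
        return false;
    }
}

// Checks the string[][] shape: a non-empty array of non-empty arrays of URL strings.
function isValidTierList(tiers) {
    return Array.isArray(tiers)
        && tiers.length > 0
        && tiers.every((tier) => Array.isArray(tier) && tier.length > 0 && tier.every(isProxyUrl));
}
```

A flat array such as `['http://a.example.com']` fails this check, which is exactly the mistake the "Tiered Array" description invited.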

I'm still hating on this part 😄 But as long as you fix the wrong type in the Structure paragraph, it's a go from me - I don't want to block this for too long.

I got your point; thanks for putting in the effort. 🫶

I removed the Structure and Tiered Array parts; neither was adding any value. 👍

Also corrected the headings!

@barjin if it's okay, let's release it today? :)

Sure, it's a go from me 👍🏽
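In the spirit of "show don't tell", the "Adjusting tiers" and "Forgetting old errors" bullets quoted at the top of this thread can be sketched roughly as below. This is not Crawlee's implementation: the class name, the thresholds, and the multiplicative decay rule are all assumptions chosen to keep the example short.

```javascript
// Sketch only: names, thresholds, and the decay rule are assumptions, not Crawlee internals.
class TierTracker {
    constructor(tierCount, options = {}) {
        const { stepUpAt = 2, decay = 0.8, probeChance = 0.1, random = Math.random } = options;
        this.tierCount = tierCount;
        this.stepUpAt = stepUpAt;       // error score at which a domain moves up one tier
        this.decay = decay;             // factor applied to the old score on every report
        this.probeChance = probeChance; // probability of testing the next lower tier
        this.random = random;           // injectable so tests can be deterministic
        this.domains = new Map();       // domain -> { tier, errorScore }
    }

    stateFor(domain) {
        if (!this.domains.has(domain)) {
            this.domains.set(domain, { tier: 0, errorScore: 0 });
        }
        return this.domains.get(domain);
    }

    // Returns the tier index to use for the next request to `domain`.
    pickTier(domain) {
        const state = this.stateFor(domain);
        const shouldProbe = state.tier > 0 && this.random() < this.probeChance;
        return shouldProbe ? state.tier - 1 : state.tier;
    }

    // Records the outcome of a request to `domain` made through `tier`.
    report(domain, tier, succeeded) {
        const state = this.stateFor(domain);
        state.errorScore *= this.decay; // forgetting old errors
        if (succeeded) {
            if (tier < state.tier) {
                // A cheaper tier worked, so keep using it.
                state.tier = tier;
                state.errorScore = 0;
            }
            return;
        }
        if (tier !== state.tier) return; // a failed probe does not count against the current tier
        state.errorScore += 1;
        if (state.errorScore >= this.stepUpAt && state.tier < this.tierCount - 1) {
            state.tier += 1; // adjusting tiers: escalate after repeated errors
            state.errorScore = 0;
        }
    }
}
```

Each domain gets its own state, so a site that blocks datacenter proxies does not push every other site onto the expensive tier.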

barjin · 2024-06-24T09:28:55Z

website/blog/2024/06-24-proxy-management-in-crawlee/index.md

+
+**Fallback Mechanism**: Crawlee starts with the first tier of proxies. If proxies in the current tier fail, it will switch to the next tier.
+
+### Implementation:


This is an example, not an implementation.

Also, if it's a heading, it shouldn't have the colon at the end, I guess.
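For reference, the fallback order described in the quoted paragraph (first tier first, next tier only when the current one fails) can be sketched as follows. This is a deliberately simplified illustration, not Crawlee's code: both function names are hypothetical, and `tryProxy` is a synchronous stand-in for what would be an asynchronous request in practice.

```javascript
// Yields every proxy in fallback order: all of tier 0 first, then tier 1, and so on.
function* proxiesInFallbackOrder(tiers) {
    for (const [tierIndex, tier] of tiers.entries()) {
        for (const proxyUrl of tier) {
            yield { tierIndex, proxyUrl };
        }
    }
}

// Returns the first candidate for which `tryProxy` reports success, or null if all fail.
// `tryProxy` is a hypothetical synchronous stand-in for a real (asynchronous) request.
function firstWorkingProxy(tiers, tryProxy) {
    for (const candidate of proxiesInFallbackOrder(tiers)) {
        if (tryProxy(candidate.proxyUrl)) return candidate;
    }
    return null;
}
```

With tiers `[['http://dc-1.example.com'], ['http://res-1.example.com']]` and a `tryProxy` that only accepts the residential URL, the result reports `tierIndex` 1, meaning the first tier was tried and skipped.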

website/blog/2024/06-24-proxy-management-in-crawlee/index.md

barjin · 2024-06-24T09:32:59Z

website/blog/2024/06-24-proxy-management-in-crawlee/index.md

+await crawler.run();
+```
+
+## How tiered proxies use Session Pool under the hood


While this part is not wrong, the implementation of all of this in Crawlee is a mess. I don't think people profit much from knowing how it's implemented.

As I said earlier, the thinking was to explain that we do rotate and create sessions and use the session pool, since it is one of the features that sets us apart from Scrapy and we under-promote it.

It also makes the blog a little more technical than just talking about the product or a Crawlee update.

barjin · 2024-06-24T09:52:28Z

website/blog/2024/06-24-proxy-management-in-crawlee/img/tiered-proxies.png

Radial gradients are notoriously hard to compress for PNG.

In the video below, I'm switching between the original image (~800 kB) and a WebP compressed image (50 kB) - see the lower left corner. I don't see any visual degradation, especially at the sizes we'll be showing this picture on the web. Plus as long as the blog is still sharing the repo with Crawlee (the library), it would be nice to keep the git history as small as possible.

Peek.2024-06-24.11-41.mov

barjin · 2024-06-25T09:24:18Z

website/blog/2024/06-24-proxy-management-in-crawlee/index.md

+
+
+```js
+const { CheerioCrawler, ProxyConfiguration } = require('crawlee');


Where did the ESM imports (import {} from "") go? 😅

For context, we're now (exclusively?) using the ESM import syntax in our examples, as it's the new "standard" way of working with JavaScript modules (also, the TypeScript syntax afaik only allows for the ESM imports).

souravjain540 · 2024-06-28T07:42:58Z

@B4nan good to go 👍


add tiered proxy blog

7316dd5

souravjain540 requested a review from B4nan June 24, 2024 05:10

add images

a1eb1c7

B4nan requested a review from barjin June 24, 2024 08:40

B4nan requested changes Jun 24, 2024


Apply suggestions from code review

04e3950

Co-authored-by: Martin Adámek <banan23@gmail.com>

souravjain540 requested a review from B4nan June 24, 2024 09:12

barjin requested changes Jun 24, 2024


Saurav Jain added 3 commits June 24, 2024 15:56

minor changes

63fca7a

small fix

8f6d5a2

fix

a953cde

souravjain540 requested a review from barjin June 25, 2024 04:03

barjin reviewed Jun 25, 2024


esm

90d3d4a

souravjain540 requested a review from barjin June 26, 2024 05:51

correct headings

c7e0b24

souravjain540 force-pushed the tiered-proxies branch from 31b6a56 to c7e0b24 Compare June 26, 2024 18:20

B4nan merged commit 899c064 into master Jun 28, 2024
9 checks passed

B4nan deleted the tiered-proxies branch June 28, 2024 11:27
