docs: add tiered proxy blog #2552
few formatting nits, haven't read the text in detail yet
Co-authored-by: Martin Adámek <banan23@gmail.com>
this 😄 plus idk how readable something like this would be for a layman - maybe adding chronological numbers ala UML communication diagram to the actions would help with readability?
Cool graphics though :)
my bad, I copied the wrong one here, thanks for pointing it out!
And for the second part, I noted the feedback and will take care of it next time! :) Let's go with this one, otherwise design will have to create a new one 😅
Proxies vary in quality, speed, reliability, and cost. There are a [few types of proxies](https://blog.apify.com/types-of-proxies/), such as datacenter and residential proxies. Datacenter proxies are cheaper but, on the other hand, more prone to getting blocked, and vice versa with residential proxies.

It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use [datacenter proxies](https://blog.apify.com/datacenter-proxies-when-to-use-them-and-how-to-make-the-most-of-them/) for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can manage both costs and avoid getting blocked. To manage this, we recently introduced `TieredProxies` in Crawlee. Let’s take a look at it.
Suggested change:
It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use [datacenter proxies](https://blog.apify.com/datacenter-proxies-when-to-use-them-and-how-to-make-the-most-of-them/) for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can manage both costs and avoid getting blocked. To manage this, we recently introduced tiered proxies in Crawlee. Let’s take a look at it.
We don't call this feature `TieredProxies` anywhere
Tiered proxies are a method of organizing and using different types of proxies based on their quality, speed, reliability, and cost. Tiered proxies allow you to rotate between a mix of proxy types to optimize your scraping activities.

**Define proxy tiers**: You categorize your proxies into different tiers based on their quality. For example:
`**Define proxy tiers**`: This seems like a rogue ChatGPT prompt left behind 😄
- **Adjusting tiers**: Higher-tier proxies are used if a domain shows more errors. Conversely, if a domain performs well with a high-tier proxy, the system will occasionally test lower-tier proxies. If successful, it continues using the lower tier, optimizing costs.
- **Forgetting old errors**: Old errors are given less weight over time, allowing the system to adjust tiers dynamically as proxies' performance changes.
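
The two behaviours in the bullets above can be sketched in a few lines of plain JavaScript. This is only an illustration and not Crawlee's actual implementation: the `TierTracker` class, its `report` method, and the `decay`, `errorThreshold`, and `probeAfter` options are all hypothetical names chosen for this sketch.

```js
// Hypothetical sketch of per-domain tier tracking; not Crawlee's real code.
class TierTracker {
    constructor(tierCount, { decay = 0.9, errorThreshold = 2, probeAfter = 10 } = {}) {
        this.tierCount = tierCount;           // number of proxy tiers available
        this.decay = decay;                   // how quickly old errors are forgotten
        this.errorThreshold = errorThreshold; // error score that triggers a move up
        this.probeAfter = probeAfter;         // success streak before trying a cheaper tier
        this.domains = new Map();             // domain -> { tier, errorScore, successStreak }
    }

    stateFor(domain) {
        if (!this.domains.has(domain)) {
            this.domains.set(domain, { tier: 0, errorScore: 0, successStreak: 0 });
        }
        return this.domains.get(domain);
    }

    // Record the outcome of one request and return the tier to use next.
    report(domain, ok) {
        const state = this.stateFor(domain);
        // Forgetting old errors: every new observation shrinks the old score.
        state.errorScore *= this.decay;
        if (ok) {
            state.successStreak += 1;
            // Adjusting tiers: after enough successes, test a lower (cheaper) tier.
            if (state.tier > 0 && state.successStreak >= this.probeAfter) {
                state.tier -= 1;
                state.successStreak = 0;
            }
        } else {
            state.errorScore += 1;
            state.successStreak = 0;
            // Adjusting tiers: too many recent errors move the domain up a tier.
            if (state.errorScore >= this.errorThreshold && state.tier < this.tierCount - 1) {
                state.tier += 1;
                state.errorScore = 0;
            }
        }
        return state.tier;
    }
}

const demo = new TierTracker(3, { decay: 0.5, errorThreshold: 1.5, probeAfter: 2 });
demo.report('example.com', false); // one error: stays on tier 0
demo.report('example.com', false); // second error crosses the threshold: tier 1
```

If the probe of the lower tier fails, the same error path pushes the domain back up, so the cheaper tier is kept only while it keeps working.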

### Working
I hope it is
jokes aside, the paragraphs from `Working` to `Implementation` seem very bare-bones. The feature is IMO very simple to explain, I don't think we need to inflate the blog post. It can be short, there is just not enough content to talk about.
the need to add these things is mostly because there is already a lot of content on Scrapy rotating proxies boasting about their features, and we don't want users to feel our feature does not do those things when it actually does.
I understand it's not a fully developer-oriented part, but in the end it is a blog, not exactly docs, so we can, and I think should, have a little more explanation here :)
I understand the need for good SEO content, but I'm not sure this is it. E.g.:
Structure - the structure of what? and why is the Structure heading nested under the Features? Also "Tiered Array" doesn't tell me anything (plus the description is wrong, the type is `string[][]`, i.e. each element in the array is an array of strings - proxy URLs).
In general, I'm all for the "show don't tell" approach - the example shows much more to me than any amount of text.
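
For illustration, a value of the `string[][]` shape described above could look like the sketch below. The proxy URLs are placeholders and `isTieredProxyList` is a hypothetical helper written for this example, not a Crawlee function; as far as I know, Crawlee takes such an array through the `tieredProxyUrls` option of `ProxyConfiguration`, but treat that as an assumption and check the API reference.

```js
// Each inner array is one tier; each string in it is a proxy URL.
// The URLs below are placeholders, not real proxies.
/** @type {string[][]} */
const tieredProxyUrls = [
    ['http://datacenter-1.example.com:8000', 'http://datacenter-2.example.com:8000'], // tier 0: cheapest
    ['http://residential-1.example.com:8000'],                                        // tier 1: pricier fallback
];

// Hypothetical helper: checks that a value is a non-empty array of
// non-empty arrays of strings, i.e. has the string[][] shape.
function isTieredProxyList(value) {
    return Array.isArray(value)
        && value.length > 0
        && value.every((tier) => Array.isArray(tier)
            && tier.length > 0
            && tier.every((url) => typeof url === 'string'));
}
```

A flat list such as `['http://a.example.com']` fails this check, which is exactly the mistake the original "Tiered Array" description invited.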
I'm still hating on this part 😄 But as long as you fix the wrong type in the `Structure` paragraph, it's a go from me - I don't want to block this for too long.
I got your point; thanks for putting in the effort. 🫶
i removed the structure and tiered array, both were not giving any value. 👍
also corrected the headings!
@barjin if it's okay, let's release it today? :)
Sure, it's a go from me 👍🏽

**Fallback Mechanism**: Crawlee starts with the first tier of proxies. If proxies in the current tier fail, it will switch to the next tier.

### Implementation:
This is an example, not implementation
also if it's a heading it shouldn't have the colon at the end I guess
await crawler.run();
```

## How tiered proxies use Session Pool under the hood
While this part is not wrong, the implementation of all of this in Crawlee is a mess. I don't think people profit much from knowing how it's implemented.
the thinking, as I said earlier, was to explain that we do rotate/create sessions and use the session pool, as it is one of the features that makes us different from Scrapy and we under-promote it.
and this makes the blog a little more technical than just talking about the product/Crawlee update.
Radial gradients are notoriously hard to compress for PNG.
In the video below, I'm switching between the original image (~800 kB) and a WebP compressed image (50 kB) - see the lower left corner. I don't see any visual degradation, especially at the sizes we'll be showing this picture on the web. Plus as long as the blog is still sharing the repo with Crawlee (the library), it would be nice to keep the git history as small as possible.
Peek.2024-06-24.11-41.mov

```js
const { CheerioCrawler, ProxyConfiguration } = require('crawlee');
Where did the ESM imports (`import {} from ""`) go? 😅
For context, we're now (exclusively?) using the ESM import syntax in our examples, as it's the new "standard" way of working with JavaScript modules (also, the TypeScript syntax afaik only allows for the ESM imports).
31b6a56 to c7e0b24
@B4nan good to go 👍
add tiered proxy blog