feat: tieredProxyUrls for ProxyConfiguration #2348
nice!
do you have a plan how this can be integrated with the apify sdk and its proxy groups?
hmm maybe we should really think about using proxy chain for this, it would be good to get rid of the need for this option (since it has a huge perf impact)
Right, that will need some touch-ups directly in SDK... I'd be happy with an interface like this:
const apifyProxy = new ProxyConfiguration({
    // // existing options:
    // password?: string;
    // groups?: string[];
    // countryCode?: string;
    proxyTiersConfig: [
        { // Basically have the same options as the constructor, only as items in an array.
            groups: ['DATACENTER'], // The further in the array, the higher tier the proxy is.
            countryCode: 'US'
        },
        {
            groups: ['DATACENTER123'],
            countryCode: 'US'
        },
        {
            groups: ['DATACENTER-PRICY', 'DATACENTER-PRICY-ALT'],
        },
        {
            groups: ['RESIDENTIALS'],
            countryCode: 'DE'
        }
    ]
})
These are 1:1 mappable to proxy URLs (using the username parameters), so we could construct the
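To make the "1:1 mappable" claim concrete, here is a rough sketch of how one tier item could be turned into a proxy URL. It is not SDK code: `ProxyTierOptions` and `tierToProxyUrl` are hypothetical names, and the username format (comma-separated `key-value` pairs, groups joined with `+`, `auto` when nothing is set) and the `proxy.apify.com:8000` host are assumptions about Apify Proxy that should be checked against its docs.

```typescript
// Hypothetical shape of one tier, mirroring the options in the proposal above.
interface ProxyTierOptions {
    groups?: string[];
    countryCode?: string;
}

// Assumption: options are encoded in the proxy username as comma-separated
// `key-value` pairs, multiple groups are joined with `+`, and `auto` is used
// when no option is set. Host and port are assumed as well.
function tierToProxyUrl(tier: ProxyTierOptions, password: string): string {
    const parts: string[] = [];
    if (tier.groups && tier.groups.length > 0) parts.push(`groups-${tier.groups.join('+')}`);
    if (tier.countryCode) parts.push(`country-${tier.countryCode}`);
    const username = parts.length > 0 ? parts.join(',') : 'auto';
    return `http://${username}:${password}@proxy.apify.com:8000`;
}

const tiers: ProxyTierOptions[] = [
    { groups: ['DATACENTER'], countryCode: 'US' },
    { groups: ['DATACENTER-PRICY', 'DATACENTER-PRICY-ALT'] },
];

// One single-URL group per tier, lowest tier first.
const tierUrls: string[][] = tiers.map((tier) => [tierToProxyUrl(tier, 'PASSWORD')]);
```

With a mapping like this, the array of tier options could be translated into nested arrays of URLs without changing how the tiers themselves are ordered.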
Yeah, unfortunately, we cannot circumvent this without a smart proxy afaiac, for reasons similar to #2065 - you just cannot switch the proxy URL for a context / browser in the middle. The session management understands this (and kills the old browser on a retried session), but a feature like this adaptive proxy calls for something we can do per domain.
After today's brainstorming (me with myself), it seems that it should work with Turning this into a draft until it gets a bit less hairy and a bit better tested.
Now, this was a bit wilder ride than I expected. Looking forward to Crawlee v4, this is definitely something we want to rethink and make more space for in the current interfaces/classes. Right now, it works as follows:
For the
As mentioned above, there are various parts I'm not too happy with (especially the |
Could we (not necessarily now) handle #2065 too with this new approach?
just a few nits, looking great otherwise
packages/browser-pool/src/abstract-classes/browser-controller.ts
Regarding the "disabling proxy with |
packages/core/src/request.ts
/**
 * Local hooks for the request. Note that the hooks are not persisted once the request is stored to a storage.
 */
hooks: Partial<Record<RequestEvent, ((request: Request) => void)[]>> = {};
I agree with you that this is not ideal. The Request type is an amalgamation of multiple things and this adds one more. Also adding actual behavior to a previously more or less data-only class doesn't feel right.
I guess this is not a good fit for the EventManager that we already have, is it?
Yeah, Request being a data-only class worries me the most here, the hooks are a bit forced. Not sure about EventManager, right now it only handles Actor-wide events (abort, migration etc.), and is used as such... making it handle the crawling logic doesn't feel right either, mostly from the what-depends-on-what POV.
In a perfect world, we would have the ProxyConfiguration predict the proxy tier from the request statelessly - the request contains the sessionRotationCount (and can contain the lastUsedProxyTier), which is everything you need to determine the proxy - this is how it worked initially... until I realized that to go up one tier, you have to process the entire queue (of new requests) until you get to the retried request which carries the info about the failed proxy.
Although - maybe reclaimRequest with forefront should make the crawler retry the request sooner? gotta try this, brb :)
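To illustrate why reclaiming to the forefront helps here, a toy in-memory queue is enough. This is only a sketch of the idea: `ToyQueue` and its methods are made up for the example and are not Crawlee's request queue.

```typescript
// A toy FIFO queue; `reclaim` with `forefront` puts the item back at the head.
class ToyQueue<T> {
    private items: T[] = [];

    add(item: T): void {
        this.items.push(item);
    }

    next(): T | undefined {
        return this.items.shift();
    }

    reclaim(item: T, options: { forefront?: boolean } = {}): void {
        if (options.forefront) this.items.unshift(item);
        else this.items.push(item);
    }
}

const queue = new ToyQueue<string>();
['a', 'b', 'c'].forEach((url) => queue.add(url));

const failed = queue.next(); // the first request is picked up and (say) gets blocked
if (failed !== undefined) queue.reclaim(failed, { forefront: true });
const retriedNext = queue.next(); // the failed request comes back before 'b' and 'c'
```

Without the forefront flag the failed request would land behind every pending request, so the information about the failed proxy would only surface after the rest of the queue had been processed.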
it works! 💥 check out the latest few commits - the way of predicting the proxy tier has changed completely (once again)
request.addHook('sessionRotation', () => {
    tracker.addError(tierPrediction);
});
I wouldn't expect a function called getProxyTier to modify its arguments in any way.
Yeah, good point - the new approach still does this, but in a more... manageable way? I'm curious to hear your opinion there.
Also, so we all know how the proxy tier prediction works, I made a 10-minute animation (it took 10 minutes, the animation is 5 seconds). The green blob is the current tier prediction, and the red bars are error counters for different tiers. Every visited tier is used as a prediction at least once - if the tier doesn't produce errors, it doesn't push the blob away (and so it stays and is used for next requests too). The error counters decrement with time, so we try lower proxy tiers from time to time (potentially saving some money).
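For readers without the animation, the behaviour described above can be approximated in a few lines. This is a minimal sketch and not the code from this PR: the class name `TierTrackerSketch`, the error weight of 10, the one-step-per-prediction decay and the exact move rule are all assumptions.

```typescript
class TierTrackerSketch {
    private errors: number[];  // the "red bars": one error counter per tier
    private current = 0;       // the "green blob": the current tier prediction
    private errorWeight: number;

    constructor(tierCount: number, errorWeight = 10) {
        this.errors = new Array<number>(tierCount).fill(0);
        this.errorWeight = errorWeight;
    }

    // Called when a request using `tier` fails (e.g. gets blocked).
    addError(tier: number): void {
        this.errors[tier] = (this.errors[tier] ?? 0) + this.errorWeight;
    }

    // Decays all counters by one step, then moves the prediction at most one tier.
    predict(): number {
        this.errors = this.errors.map((count) => Math.max(0, count - 1));
        const left = this.errors[this.current - 1] ?? Infinity;
        const right = this.errors[this.current + 1] ?? Infinity;
        const here = this.errors[this.current] ?? 0;
        if (here > Math.min(left, right)) {
            // The current tier is worse than a neighbour: move to the better side.
            this.current += left <= right ? -1 : 1;
        } else if (here === left) {
            // On a tie with the lower tier, prefer the lower (cheaper) one.
            this.current -= 1;
        }
        return this.current;
    }
}

const tracker = new TierTrackerSketch(3);
const first = tracker.predict();      // nothing has failed yet, so the lowest tier
tracker.addError(first);
const afterError = tracker.predict(); // the error pushes the prediction one tier up
```

Because the counters decay on every prediction, the sketch drifts back to the lower tier once the recorded errors have worn off, which is the "try lower proxy tiers from time to time" part.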
Code looks fine to me, am pretty smooth brain tho
Alright, I'm for merging now (not releasing) and implementing the |
sure, let's merge
Introduces the tieredProxyUrls option for ProxyConfiguration, allowing the user to pass proxy URL groups.
The newProxyInfo now takes an optional request parameter, which can be used for picking the correct proxy tier. Because of this, the proxy tiering doesn't work properly in browser crawlers without useIncognitoPages (as the proxy is tied to the launched browser instance and can be used for multiple requests).
To do:
useIncognitoPages: false
sessionId with the tieredProxyUrls - would require tracking the session - URL - proxyTier mapping, which might be a lot of data for large crawls... still worth trying out.
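As an illustration of what "proxy URL groups" could look like, here is a small standalone sketch. It assumes the option is an array of URL arrays with the lowest tier first; the example URLs and the `pickProxyUrl` helper are hypothetical and only show how a predicted tier might be turned into a concrete URL.

```typescript
// Assumed shape: one inner array of proxy URLs per tier, lowest tier first.
const tieredProxyUrls: string[][] = [
    ['http://datacenter-1.example.com:8000', 'http://datacenter-2.example.com:8000'],
    ['http://residential-1.example.com:8000'],
];

// Hypothetical helper: clamps the tier index and rotates round-robin within the tier.
function pickProxyUrl(tiers: string[][], tier: number, counter: number): string {
    const clamped = Math.min(Math.max(tier, 0), tiers.length - 1);
    const urls = tiers[clamped] ?? [];
    const url = urls[counter % urls.length];
    if (url === undefined) throw new Error('No proxy URLs configured for this tier');
    return url;
}

const firstPick = pickProxyUrl(tieredProxyUrls, 0, 0);
const secondPick = pickProxyUrl(tieredProxyUrls, 0, 1);
const escalated = pickProxyUrl(tieredProxyUrls, 5, 0); // out-of-range tier falls back to the highest one
```

In a browser crawler without incognito pages, a pick like this would only apply when a browser is launched, which is the limitation the description mentions.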