feat: `tieredProxyUrls` for ProxyConfiguration #2348

barjin · 2024-02-21T09:55:01Z

Introduces the tieredProxyUrls option for ProxyConfiguration, allowing the user to pass proxy URL groups.

The newProxyInfo method now takes an optional request parameter, which can be used for picking the correct proxy tier. Because of this, proxy tiering doesn't work properly in browser crawlers without useIncognitoPages (as the proxy is tied to the launched browser instance, which can be used for multiple requests).
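To illustrate the idea of proxy URL groups ordered by tier, here is a minimal standalone sketch. It is not Crawlee's implementation: `TieredProxyPicker`, its `pick` method, and the example URLs are hypothetical, and the round-robin choice within a tier is an assumption made for the example.

```typescript
// Hypothetical sketch, not Crawlee's actual code.
// Each inner array is one tier; later tiers are assumed to be "stronger" (and pricier).
type TieredProxyUrls = string[][];

class TieredProxyPicker {
    private tiers: TieredProxyUrls;
    // One round-robin cursor per tier.
    private cursors: number[];

    constructor(tiers: TieredProxyUrls) {
        if (tiers.length === 0 || tiers.some((tier) => tier.length === 0)) {
            throw new Error('Every proxy tier needs at least one URL');
        }
        this.tiers = tiers;
        this.cursors = tiers.map(() => 0);
    }

    /** Returns the next URL from the requested tier, clamping out-of-range tiers. */
    pick(tier: number): string {
        const index = Math.min(Math.max(tier, 0), this.tiers.length - 1);
        const urls = this.tiers[index];
        const url = urls[this.cursors[index] % urls.length];
        this.cursors[index] += 1;
        return url;
    }
}

// Example tiers with placeholder URLs.
const exampleTiers: TieredProxyUrls = [
    ['http://cheap-1.example.com:8000', 'http://cheap-2.example.com:8000'],
    ['http://pricey-1.example.com:8000'],
];
```

A caller would ask for `pick(0)` first and only move to `pick(1)` once the lower tier keeps getting blocked.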

To do:

maybe figure out the tiers for useIncognitoPages: false
use sessionId with the tieredProxyUrls - would require tracking the session - URL - proxyTier mapping, which might be a lot of data for large crawls... still worth trying out.

B4nan

nice!

do you have a plan for how this can be integrated with the Apify SDK and its proxy groups?

packages/core/src/proxy_configuration.ts

B4nan · 2024-02-21T12:30:01Z

> maybe figure out the tiers for useIncognitoPages: false

hmm maybe we should really think about using proxy chain for this, it would be good to get rid of the need for this option (since it has a huge perf impact)

barjin · 2024-02-21T15:04:25Z

> Do you have a plan for how this can be integrated with the apify SDK and its proxy groups?

Right, that will need some touch-ups directly in SDK... I'd be happy with an interface like this:

const apifyProxy = new ProxyConfiguration({
    // Existing options:
    // password?: string;
    // groups?: string[];
    // countryCode?: string;
    proxyTiersConfig: [
        // Each item basically has the same options as the constructor.
        // The further in the array, the higher tier the proxy is.
        {
            groups: ['DATACENTER'],
            countryCode: 'US',
        },
        {
            groups: ['DATACENTER123'],
            countryCode: 'US',
        },
        {
            groups: ['DATACENTER-PRICY', 'DATACENTER-PRICY-ALT'],
        },
        {
            groups: ['RESIDENTIALS'],
            countryCode: 'DE',
        },
    ],
});

These are 1:1 mappable to proxy URLs (using the username parameters), so we could construct the tieredProxyUrls from these... although the session logic is still missing (which is more important here). Maybe it would be worth looking into a more generic solution in Crawlee than just hard-coded URLs (one more level of dispatch, something like getUrlForTier(tier)).
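A rough sketch of that extra level of dispatch. It assumes Apify Proxy's username-parameter convention (`groups-A+B,country-XX`, or `auto` when nothing is set) and the `proxy.apify.com:8000` endpoint; `ProxyTierConfig`, `tierToProxyUrl`, and this `getUrlForTier` signature are hypothetical, and session handling is left out entirely.

```typescript
// Hypothetical sketch: maps one tier config to a proxy URL via username parameters.
interface ProxyTierConfig {
    groups?: string[];
    countryCode?: string;
}

function tierToProxyUrl(
    tier: ProxyTierConfig,
    password: string,
    hostname = 'proxy.apify.com', // assumed default endpoint
    port = 8000,
): string {
    const parts: string[] = [];
    if (tier.groups && tier.groups.length > 0) {
        parts.push(`groups-${tier.groups.join('+')}`);
    }
    if (tier.countryCode) {
        parts.push(`country-${tier.countryCode}`);
    }
    // With no parameters at all, fall back to the "auto" username.
    const username = parts.length > 0 ? parts.join(',') : 'auto';
    return `http://${username}:${password}@${hostname}:${port}`;
}

// The "one more level of dispatch": resolve the URL lazily for a given tier index.
function getUrlForTier(tiers: ProxyTierConfig[], tier: number, password: string): string {
    if (tier < 0 || tier >= tiers.length) {
        throw new RangeError(`Unknown proxy tier ${tier}`);
    }
    return tierToProxyUrl(tiers[tier], password);
}

const proxyTiersConfig: ProxyTierConfig[] = [
    { groups: ['DATACENTER'], countryCode: 'US' },
    { groups: ['DATACENTER-PRICY', 'DATACENTER-PRICY-ALT'] },
    { groups: ['RESIDENTIALS'], countryCode: 'DE' },
];
```

Resolving the URL per tier on demand, rather than precomputing a fixed list, is what would leave room for adding a session parameter later.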

barjin · 2024-02-21T15:20:45Z

> maybe we should think about using proxy-chain for this, it would be good to get rid of the need for this option

Yeah, unfortunately, we cannot circumvent this without a smart proxy AFAICT, for reasons similar to #2065 - you just cannot switch the proxy URL for a context / browser in the middle. The session management understands this (and kills the old browser on a retried session), but a feature like this adaptive proxy calls for something we can do per domain.

proxy-chain sounds perfect for the job... if it was complete. Using it now as-is would be 100% breaking for some people's use cases, because right now we just pass proxy URLs to the browser, which knows how to use them. Compare that to what proxy-chain can do (e.g. browsers support SOCKS proxies while proxy-chain doesn't, WebSocket connections cannot be made over HTTP, etc.)

barjin · 2024-03-11T13:23:06Z

After today's brainstorming (me with myself), it seems that it should work with useIncognitoPages: false, but this definitely needs more testing.

Turning this into a draft until it gets a bit less hairy and a bit better tested.

barjin · 2024-03-14T14:55:18Z

Now, this was a bit of a wilder ride than I expected.

Looking forward to Crawlee v4, this is definitely something we want to rethink and make more space for in the current interfaces/classes.

Right now, it works as follows:

The proxy configuration is now getting the request instance to determine the correct proxy tier.
The Request class can now receive hooks. The only hook for now is sessionRotated, which is used by the ProxyConfiguration: it first predicts a proxy tier and then tracks session rotation. If a rotation happens, it marks the proxy tier given to the request as a bad recommendation.

For useIncognitoPages: false, we also had to integrate the proxy tier logic into the browser-pool:

we want to open a new page in a given browser pool
we got the predicted proxy tier for this request / page, so we search for a browser-controller that has been launched with this proxy tier
if it doesn't exist, we open a new browser with this proxy tier.
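The three steps above can be sketched roughly as follows. This is a simplified model, not the browser-pool code: `BrowserController` here is a plain record, and `TieredBrowserPool`, `newPage`, and the per-browser page limit are hypothetical names and assumptions.

```typescript
// Hypothetical model of a launched browser that is pinned to one proxy tier.
interface BrowserController {
    id: number;
    proxyTier: number;
    activePages: number;
}

class TieredBrowserPool {
    private controllers: BrowserController[] = [];
    private nextId = 1;
    private maxPagesPerBrowser: number;

    constructor(maxPagesPerBrowser = 20) {
        this.maxPagesPerBrowser = maxPagesPerBrowser;
    }

    // Stand-in for actually launching a browser with a proxy from the given tier.
    private launchBrowser(proxyTier: number): BrowserController {
        const controller: BrowserController = { id: this.nextId++, proxyTier, activePages: 0 };
        this.controllers.push(controller);
        return controller;
    }

    /** Opens a "page" in a browser launched with the predicted tier, launching one if needed. */
    newPage(proxyTier: number): BrowserController {
        let controller = this.controllers.find(
            (c) => c.proxyTier === proxyTier && c.activePages < this.maxPagesPerBrowser,
        );
        if (!controller) {
            controller = this.launchBrowser(proxyTier);
        }
        controller.activePages += 1;
        return controller;
    }
}
```

Because the proxy is fixed at launch, a tier change never reconfigures a running browser; it only changes which browser the next page lands in.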

As mentioned above, there are various parts I'm not too happy with (especially the Request hooks). Ideas welcome!

B4nan · 2024-03-15T09:15:25Z

Could we (not necessarily now) handle #2065 too with this new approach?

B4nan

just a few nits, looking great otherwise

test/core/proxy_configuration.test.ts

packages/core/src/request.ts

packages/browser-pool/src/abstract-classes/browser-controller.ts

packages/core/src/proxy_configuration.ts

barjin · 2024-03-15T14:52:07Z

Regarding the "disabling proxy with false" issue - this might actually be the correct solution (with yet another option for browserPool._pickBrowserWithFreeCapacity). The interfaces for this are strictly private/internal anyway, so IMO there is no need to think too much about the design right now (and we can easily split those in separate PRs).

test/core/proxy_configuration.test.ts

janbuchar · 2024-03-18T10:01:29Z

packages/core/src/request.ts

+    /**
+     * Local hooks for the request. Note that the hooks are not persisted once the request is stored to a storage.
+     */
+    hooks: Partial<Record<RequestEvent, ((request: Request) => void)[]>> = {};


I agree with you that this is not ideal. The Request type is an amalgamation of multiple things and this adds one more. Also adding actual behavior to a previously more or less data-only class doesn't feel right.

I guess this is not a good fit for the EventManager that we already have, is it?

Yeah, Request being a data-only class worries me the most here, the hooks are a bit forced. Not sure about EventManager, right now it only handles Actor-wide events (abort, migration etc.), and is used as such... making it handle the crawling logic doesn't feel right either, mostly from the what-depends-on-what POV.

In a perfect world, we would have the ProxyConfiguration predict the proxy tier from the request statelessly - the request contains the sessionRotationCount (and can contain the lastUsedProxyTier), which is everything you need to determine the proxy - this is how it worked initially... until I realized that to go up one tier, you have to process the entire queue (of new requests) until you get to the retried request which carries the info about the failed proxy.

Although - maybe reclaimRequest with forefront should make the crawler retry the request sooner? gotta try this, brb :)

it works! 💥 check out the latest few commits - the way of predicting the proxy tier has changed completely (once again)
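A toy model of why forefront reclaiming helps, under loud assumptions: `SimpleQueue` is a stand-in for the request queue, and `predictTier` is one hypothetical stateless rule (tier grows with `sessionRotationCount`), not the logic that was merged.

```typescript
// Hypothetical minimal request record carrying the retry information.
interface QueuedRequest {
    url: string;
    sessionRotationCount: number;
}

class SimpleQueue {
    private items: QueuedRequest[] = [];

    /** Adds a request to the back, or to the front when forefront is true. */
    add(request: QueuedRequest, forefront = false): void {
        if (forefront) {
            this.items.unshift(request);
        } else {
            this.items.push(request);
        }
    }

    fetchNext(): QueuedRequest | undefined {
        return this.items.shift();
    }
}

// Assumed stateless rule: each session rotation bumps the tier by one, up to the top tier.
function predictTier(request: QueuedRequest, tierCount: number): number {
    return Math.min(request.sessionRotationCount, tierCount - 1);
}

const queue = new SimpleQueue();
queue.add({ url: 'https://example.com/a', sessionRotationCount: 0 });
queue.add({ url: 'https://example.com/b', sessionRotationCount: 0 });
queue.add({ url: 'https://example.com/c', sessionRotationCount: 0 });

// The first request gets blocked: rotate the session and reclaim it to the front.
const blocked = queue.fetchNext()!;
blocked.sessionRotationCount += 1;
queue.add(blocked, true);

// The retried request is processed next, so the tier information is not stuck
// behind the rest of the queue.
const retried = queue.fetchNext()!;
```

Without `forefront`, the retried request would only come back after `/b` and `/c`, which would keep running on the tier that had just failed.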

janbuchar · 2024-03-18T10:38:00Z

packages/core/src/proxy_configuration.ts

+        request.addHook('sessionRotation', () => {
+            tracker.addError(tierPrediction);
+        });


I wouldn't expect a function called getProxyTier to modify its arguments in any way.

Yeah, good point - the new approach still does this, but in a more... manageable way? I'm curious to hear your opinion there.

packages/core/src/proxy_configuration.ts

barjin · 2024-03-18T13:00:17Z

Also, so we all know how the proxy tier prediction works, I made a 10-minute animation (it took 10 minutes, the animation is 5 seconds). The green blob is the current tier prediction, and the red bars are error counters for different tiers. Every visited tier is used as a prediction at least once - if the tier doesn't produce errors, it doesn't push the blob away (and so it stays and is used for the next requests too). The error counters decrement with time, so we try lower proxy tiers from time to time (potentially saving some money).
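Since the animation does not survive in text form, here is a simplified sketch of the behavior described above. It is an approximation written for illustration, not the merged tracker: the class name `TierTracker`, the error weight of 10, and the choice to decay every counter by one per prediction are all assumptions.

```typescript
// Hypothetical sketch of the "blob and bars" model.
class TierTracker {
    private errors: number[]; // the "red bars", one counter per tier
    private current = 0;      // the "green blob", i.e. the current prediction

    constructor(tierCount: number) {
        this.errors = new Array<number>(tierCount).fill(0);
    }

    /** Records that the given tier produced an error (assumed weight: 10). */
    addError(tier: number): void {
        this.errors[tier] += 10;
    }

    /** Decays the counters, lets the prediction move one step, and returns it. */
    predictTier(): number {
        // Counters shrink over time so that lower tiers get retried eventually.
        this.errors = this.errors.map((e) => (e > 0 ? e - 1 : e));

        const here = this.errors[this.current];
        const left = this.current > 0 ? this.errors[this.current - 1] : Infinity;
        const right = this.current < this.errors.length - 1 ? this.errors[this.current + 1] : Infinity;

        if (here > Math.min(left, right)) {
            // The current tier is worse than a neighbor: move towards the better one.
            this.current += left <= right ? -1 : 1;
        } else if (here === left) {
            // On a tie with the lower tier, prefer the lower (cheaper) one.
            this.current -= 1;
        }
        return this.current;
    }
}
```

In this model, errors on a tier push the prediction upward right away, while the way back down only opens up once the lower tier's counter has decayed to the level of the current one.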

vladfrangu

Code looks fine to me, am pretty smooth brain tho

packages/core/src/proxy_configuration.ts

barjin · 2024-03-25T10:39:05Z

Alright, I'm for merging now (not releasing) and implementing the newUrlFunction revamp in another PR - even though it's similar to this, it might get quite messy as it's not clear what code does what. Who's with me? ✋🏽

B4nan · 2024-03-25T10:47:46Z

sure, let's merge

Follow-up PR #2392: Based on changes from #2348, this PR simplifies the proxy handling in the browser crawlers and makes those more intuitive. Closes #2065

barjin added 2 commits February 21, 2024 10:38

feat: tieredProxyUrls for ProxyConfiguration

e4ebc7d

docs: reformat jsdoc

a2b8269

barjin added the adhoc Ad-hoc unplanned task added during the sprint. label Feb 21, 2024

barjin requested review from janbuchar and B4nan February 21, 2024 09:55

barjin self-assigned this Feb 21, 2024

github-actions bot added this to the 83rd sprint - Tooling team milestone Feb 21, 2024

github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Feb 21, 2024

B4nan reviewed Feb 21, 2024

packages/core/src/proxy_configuration.ts (outdated, resolved)

fix: lint, fix tests

3206810

feat: make adaptive proxy work with useIncognitoPages: false

9c0a5c9

barjin marked this pull request as draft March 11, 2024 13:23

barjin added 5 commits March 11, 2024 14:26

chore: lint fix

17592ae

feat: add Request.addHook, ProxyTierTracker and optimum-search

7589f4b

chore: lint fixes

62981e2

chore: fix tests

e68362f

chore: rewrite tests for new tiered proxy logic

45ed2ac

barjin marked this pull request as ready for review March 14, 2024 14:39

chore: remove unused code

6df3356

B4nan reviewed Mar 15, 2024

B4nan requested a review from vladfrangu March 15, 2024 10:28

chore: lint, PR comments

4904358

janbuchar reviewed Mar 18, 2024

barjin added 3 commits March 18, 2024 16:09

feat: use forefront for stateless proxy tier rotation

9b1ca39

chore: naming

ae40392

fix: fix tests, cover more paths

13c7b3f

vladfrangu approved these changes Mar 21, 2024

packages/core/src/proxy_configuration.ts (resolved)

barjin merged commit 5408c7f into master Mar 25, 2024
8 checks passed

barjin deleted the feat/tiered-proxy-rotation branch March 25, 2024 10:48

barjin mentioned this pull request Mar 26, 2024

feat: better newUrlFunction for ProxyConfiguration #2392

Merged

barjin added a commit that referenced this pull request Apr 4, 2024

feat: better newUrlFunction for ProxyConfiguration (#2392)

330598b

barjin mentioned this pull request Apr 4, 2024

feat: support for proxy tiers apify/apify-sdk-js#290

Merged
