Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: tieredProxyUrls for ProxyConfiguration #2348

Merged
merged 14 commits into from
Mar 25, 2024
Merged

Conversation

barjin
Copy link
Contributor

@barjin barjin commented Feb 21, 2024

Introduces the tieredProxyUrls option for ProxyConfigration, allowing the user to pass proxy URL groups.

The newProxyInfo now takes an optional request parameter, which can be used for picking the correct proxy tier. Because of this, the proxy tiering doesn't work properly in browser crawlers without useIncogintoPages (as the proxy is tied to the launched browser instance and can be used for multiple requests).

To do:

  • maybe figure out the tiers for useIncognitoPages: false
  • use sessionId with the tieredProxyUrls - would require tracking the session - URL - proxyTier mapping, which might be a lot of data for large crawls... still worth trying out.

@barjin barjin added the adhoc Ad-hoc unplanned task added during the sprint. label Feb 21, 2024
@barjin barjin self-assigned this Feb 21, 2024
@github-actions github-actions bot added this to the 83rd sprint - Tooling team milestone Feb 21, 2024
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Feb 21, 2024
Copy link
Member

@B4nan B4nan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

do you have a plan how this can be integrated this with the apify sdk and its proxy groups?

packages/core/src/proxy_configuration.ts Outdated Show resolved Hide resolved
@B4nan
Copy link
Member

B4nan commented Feb 21, 2024

maybe figure out the tiers for useIncognitoPages: false

hmm maybe we should really think about using proxy chain for this, it would be good to get rid of the need for this option (since it has a huge perf impact)

@barjin
Copy link
Contributor Author

barjin commented Feb 21, 2024

Do you have a plan for how this can be integrated with the apify SDK and its proxy groups?

Right, that will need some touch-ups directly in SDK... I'd be happy with an interface like this:

const apifyProxy = new ProxyConfiguration({
    // // existing options:
    // password?: string;
    // groups?: string[];
    // countryCode?: string;
    proxyTiersConfig: [              
        {                            // Basically have the same options as the constructor, only as items in an array.
           groups: ['DATACENTER'],   // The further in the array, the higher tier the proxy is.
           countryCode: 'US'
        },
        {
           groups: ['DATACENTER123'],
           countryCode: 'US'
        },
        {
           groups: ['DATACENTER-PRICY', 'DATACENTER-PRICY-ALT'],
        },
        {
           groups: ['RESIDENTIALS'],
           countryCode: 'DE'
        }
    ]
})

These are 1:1 mappable to proxy URLs (using the username parameters), so we could construct the tieredProxyUrls from these... although there is still the session logic missing (which is more important here) - maybe it would be worth it looking into a more generic solution in Crawlee than just hard-coded URLs (one more level of dispatch, something like getUrlForTier(tier)).

@barjin
Copy link
Contributor Author

barjin commented Feb 21, 2024

maybe we should think about using proxy-chain for this, it would be good to get rid of the need for this option

Yeah, unfortunately, we cannot circumvent this without a smart proxy afaiac, for reasons similar to #2065 - you just cannot switch the proxy URL for a context / browser in the middle. The session management understands this (and kills the old browser on a retried session), but a feature like this adaptive proxy calls for something we can do per domain.

proxy-chain sounds perfect for the job... if it was complete. Using it now as-is would be 100% breaking for some people's use cases, because now we just pass proxy URLs to the browser that know how to use them... Compare to what proxy-chain can do (e.g. browsers have support for SOCKS proxies while proxy-chain doesn't, WebSocket connection cannot be made over HTTP etc etc.)

@barjin
Copy link
Contributor Author

barjin commented Mar 11, 2024

After the today's brainstorming (me with myself), it seems that it should work with useIncognitoPages: false, but this definitely needs more testing.

Turning this into a draft until it gets a bit less hairy and a bit better tested.

@barjin barjin marked this pull request as draft March 11, 2024 13:23
@barjin barjin marked this pull request as ready for review March 14, 2024 14:39
@barjin
Copy link
Contributor Author

barjin commented Mar 14, 2024

Now, this was a bit wilder ride than I expected.

Looking forward to Crawlee v4, this is definitely something we want to rethink and make more space for in the current interfaces/classes.

Right now, it works as follows:

  • The proxy configuration is now getting the request instance to determine the correct proxy tier.
  • The Request class now can receive hooks - the only hook now is sessionRotated - this is used by the ProxyConfiguration (it first predicts a proxy tier and tracks the session rotation - if it happens, it marks the proxy tier given to the request as a bad recommendation)

For the useIncognitoPages:false, we also had to integrate the proxy tiers logic to the browser-pool:

  • we want to open a new page in a given browser pool
  • we got the predicted proxy tier for this request / page, so we search for a browser-controller that has been launched with this proxy tier
  • if it doesn't exist, we open a new browser with this proxy tier.

As mentioned above, there are various parts I'm not too happy with (especially the Request hooks). Ideas welcome!

@B4nan
Copy link
Member

B4nan commented Mar 15, 2024

Could we (not necessarily now) handle #2065 too with this new approach?

Copy link
Member

@B4nan B4nan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few nits, looking great otherwise

test/core/proxy_configuration.test.ts Outdated Show resolved Hide resolved
packages/core/src/request.ts Outdated Show resolved Hide resolved
packages/core/src/proxy_configuration.ts Show resolved Hide resolved
@B4nan B4nan requested a review from vladfrangu March 15, 2024 10:28
@barjin
Copy link
Contributor Author

barjin commented Mar 15, 2024

Regarding the "disabling proxy with false" issue - this might actually be the correct solution (with yet another option for browserPool._pickBrowserWithFreeCapacity). The interfaces for this are strictly private/internal anyway, so IMO there is no need to think too much about the design right now (and we can easily split those in separate PRs).

test/core/proxy_configuration.test.ts Outdated Show resolved Hide resolved
Comment on lines 136 to 139
/**
* Local hooks for the request. Note that the hooks are not persisted once the request is stored to a storage.
*/
hooks: Partial<Record<RequestEvent, ((request: Request) => void)[]>> = {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you that this is not ideal. The Request type is an amalgamation of multiple things and this adds one more. Also adding actual behavior to a previously more or less data-only class doesn't feel right.

I guess this is not a good fit for the EventManager that we already have, is it?

Copy link
Contributor Author

@barjin barjin Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Request being data-only class worries me the most here, the hooks are a bit forced. Not sure about EventManager, right now it only handles Actor-wide events (abort, migration etc.), and is used as such... making it handle the crawling logic doesn't feel right either, mostly from the what-depends-on-what POV.

In a perfect world, we would have the ProxyConfiguration predict the proxy tier from the request statelessly - the request contains the sessionRotationCount, (and can contain the lastUsedProxyTier), which is everything you need to determine the proxy - this is how it worked initially... until I realized that to go up one tier, you have to process the entire queue (of new requests) until you get to the retried request which carries the info about the failed proxy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although - maybe reclaimRequest with forefront should make the crawler retry the request sooner? gotta need to try this, brb :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it works! 💥 check out the latest few commits - the way of predicting the proxy tier has changed completely (once again)

Comment on lines 307 to 309
request.addHook('sessionRotation', () => {
tracker.addError(tierPrediction);
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't expect a function called getProxyTier to modify its arguments in any way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point - the new approach still does this, but in a more... manageable way? I'm curious to hear your opinion there.

packages/core/src/proxy_configuration.ts Show resolved Hide resolved
packages/core/src/proxy_configuration.ts Show resolved Hide resolved
packages/core/src/proxy_configuration.ts Outdated Show resolved Hide resolved
@barjin
Copy link
Contributor Author

barjin commented Mar 18, 2024

Also, so we all know how the proxy tier prediction works, I made a 10-minute animation (it took 10 minutes, the animation is 5 seconds). The green blob is the current tier prediction, and the red bars are error counters for different tiers. Every visited tier is used as a prediction at least once - if the tier doesn't produce errors, it doesn't push the blob away (and so it stays and is used for next requests too). The error counters decrement with time, so we try lower proxy tiers from time to time (potentially saving some money).

anim

Copy link
Member

@vladfrangu vladfrangu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks fine to me, am pretty smooth brain tho

packages/core/src/proxy_configuration.ts Show resolved Hide resolved
@barjin
Copy link
Contributor Author

barjin commented Mar 25, 2024

Alright, I'm for merging now (not releasing) and implementing the newUrlFunction revamp in another PR - even though it's similar to this, it might get quite messy as it's not clear what code does what. Who's with me? ✋🏽

@B4nan
Copy link
Member

B4nan commented Mar 25, 2024

sure, lets merge

@barjin barjin merged commit 5408c7f into master Mar 25, 2024
8 checks passed
@barjin barjin deleted the feat/tiered-proxy-rotation branch March 25, 2024 10:48
barjin added a commit that referenced this pull request Apr 4, 2024
Based on changes from #2348 , this PR simplifies the proxy handling in
the browser crawlers and makes those more intuitive.

Closes #2065
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants