Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Native re-implementation of Web Discovery (part 1) #24421

Closed
wants to merge 36 commits into from

Conversation

DJAndries
Copy link
Collaborator

Resolves brave/brave-browser#39439

Submitter Checklist:

  • I confirm that no security/privacy review is needed and no other type of reviews are needed, or that I have requested them
  • There is a ticket for my issue
  • Used Github auto-closing keywords in the PR description above
  • Wrote a good PR/commit description
  • Squashed any review feedback or "fixup" commits before merge, so that history is a record of what happened in the repo, not your PR
  • Added appropriate labels (QA/Yes or QA/No; release-notes/include or release-notes/exclude; OS/...) to the associated issue
  • Checked the PR locally:
    • npm run test -- brave_browser_tests, npm run test -- brave_unit_tests wiki
    • npm run presubmit wiki, npm run gn_check, npm run tslint
  • Ran git rebase master (if needed)

Reviewer Checklist:

  • A security review is not needed, or a link to one is included in the PR description
  • New files have MPL-2.0 license header
  • Adequate test coverage exists to prevent regressions
  • Major classes, functions and non-trivial code blocks are well-commented
  • Changes in component dependencies are properly reflected in gn
  • Code follows the style guide
  • Test plan is specified in PR before merging

After-merge Checklist:

Test Plan:

TBD

@github-actions github-actions bot added CI/run-audit-deps Check for known npm/cargo vulnerabilities (audit_deps) CI/storybook-url Deploy storybook and provide a unique URL for each build labels Jun 28, 2024
@DJAndries DJAndries requested a review from remusao June 28, 2024 00:03
kMaxRequestInterval,
kMaxRetries,
base::BindRepeating(&DoubleFetcher::OnFetchTimer,
base::Unretained(this))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reported by reviewdog 🐶
[semgrep] base::Unretained is most of the time unrequited, and a weak reference is better suited for secure coding.
Consider swapping Unretained for a weak reference.
base::Unretained usage may be acceptable when a callback owner is guaranteed
to be destroyed with the object base::Unretained is pointing to, for example:

- PrefChangeRegistrar
- base::*Timer
- mojo::Receiver
- any other class member destroyed when the class is deallocated


Source: https://github.com/brave/security-action/blob/main/assets/semgrep_rules/client/chromium-uaf.yaml


Cc @thypon @goodov @iefremov

std::move(resource_request), kFetchNetworkTrafficAnnotation);
url_loader_->DownloadToString(
shared_url_loader_factory_.get(),
base::BindOnce(&DoubleFetcher::OnRequestComplete, base::Unretained(this),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reported by reviewdog 🐶
[semgrep] base::Unretained is most of the time unrequited, and a weak reference is better suited for secure coding.
Consider swapping Unretained for a weak reference.
base::Unretained usage may be acceptable when a callback owner is guaranteed
to be destroyed with the object base::Unretained is pointing to, for example:

- PrefChangeRegistrar
- base::*Timer
- mojo::Receiver
- any other class member destroyed when the class is deallocated


Source: https://github.com/brave/security-action/blob/main/assets/semgrep_rules/client/chromium-uaf.yaml


Cc @thypon @goodov @iefremov

kMaxRequestInterval,
kMaxRetries,
base::BindRepeating(&Reporter::PrepareRequest,
base::Unretained(this))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reported by reviewdog 🐶
[semgrep] base::Unretained is most of the time unrequited, and a weak reference is better suited for secure coding.
Consider swapping Unretained for a weak reference.
base::Unretained usage may be acceptable when a callback owner is guaranteed
to be destroyed with the object base::Unretained is pointing to, for example:

- PrefChangeRegistrar
- base::*Timer
- mojo::Receiver
- any other class member destroyed when the class is deallocated


Source: https://github.com/brave/security-action/blob/main/assets/semgrep_rules/client/chromium-uaf.yaml


Cc @thypon @goodov @iefremov

credential_signer_->Sign(
std::vector<const uint8_t>(payload_hash.begin(), payload_hash.end()),
basename_result->basename,
base::BindOnce(&Reporter::OnRequestSigned, base::Unretained(this),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reported by reviewdog 🐶
[semgrep] base::Unretained is most of the time unrequited, and a weak reference is better suited for secure coding.
Consider swapping Unretained for a weak reference.
base::Unretained usage may be acceptable when a callback owner is guaranteed
to be destroyed with the object base::Unretained is pointing to, for example:

- PrefChangeRegistrar
- base::*Timer
- mojo::Receiver
- any other class member destroyed when the class is deallocated


Source: https://github.com/brave/security-action/blob/main/assets/semgrep_rules/client/chromium-uaf.yaml


Cc @thypon @goodov @iefremov

content_scraper_->ScrapePage(url, false,
document_extractor_remotes_.Get(remote_id),
base::BindOnce(&WDPService::OnContentScraped,
base::Unretained(this), false));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reported by reviewdog 🐶
[semgrep] base::Unretained is most of the time unrequited, and a weak reference is better suited for secure coding.
Consider swapping Unretained for a weak reference.
base::Unretained usage may be acceptable when a callback owner is guaranteed
to be destroyed with the object base::Unretained is pointing to, for example:

- PrefChangeRegistrar
- base::*Timer
- mojo::Receiver
- any other class member destroyed when the class is deallocated


Source: https://github.com/brave/security-action/blob/main/assets/semgrep_rules/client/chromium-uaf.yaml


Cc @thypon @goodov @iefremov

alive_message_timer_.Start(
FROM_HERE, kAliveCheckInterval,
base::BindRepeating(&WDPService::MaybeSendAliveMessage,
base::Unretained(this)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reported by reviewdog 🐶
[semgrep] base::Unretained is most of the time unrequited, and a weak reference is better suited for secure coding.
Consider swapping Unretained for a weak reference.
base::Unretained usage may be acceptable when a callback owner is guaranteed
to be destroyed with the object base::Unretained is pointing to, for example:

- PrefChangeRegistrar
- base::*Timer
- mojo::Receiver
- any other class member destroyed when the class is deallocated


Source: https://github.com/brave/security-action/blob/main/assets/semgrep_rules/client/chromium-uaf.yaml


Cc @thypon @goodov @iefremov

server_config_loader_ = std::make_unique<ServerConfigLoader>(
local_state_, user_data_dir_, shared_url_loader_factory_.get(),
base::BindRepeating(&WDPService::OnConfigChange,
base::Unretained(this)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reported by reviewdog 🐶
[semgrep] base::Unretained is most of the time unrequited, and a weak reference is better suited for secure coding.
Consider swapping Unretained for a weak reference.
base::Unretained usage may be acceptable when a callback owner is guaranteed
to be destroyed with the object base::Unretained is pointing to, for example:

- PrefChangeRegistrar
- base::*Timer
- mojo::Receiver
- any other class member destroyed when the class is deallocated


Source: https://github.com/brave/security-action/blob/main/assets/semgrep_rules/client/chromium-uaf.yaml


Cc @thypon @goodov @iefremov

base::BindRepeating(&WDPService::OnConfigChange,
base::Unretained(this)),
base::BindRepeating(&WDPService::OnPatternsLoaded,
base::Unretained(this)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reported by reviewdog 🐶
[semgrep] base::Unretained is most of the time unrequited, and a weak reference is better suited for secure coding.
Consider swapping Unretained for a weak reference.
base::Unretained usage may be acceptable when a callback owner is guaranteed
to be destroyed with the object base::Unretained is pointing to, for example:

- PrefChangeRegistrar
- base::*Timer
- mojo::Receiver
- any other class member destroyed when the class is deallocated


Source: https://github.com/brave/security-action/blob/main/assets/semgrep_rules/client/chromium-uaf.yaml


Cc @thypon @goodov @iefremov

service_manager::BinderRegistry* registry)
: content::RenderFrameObserver(render_frame), render_frame_(render_frame) {
registry->AddInterface<mojom::DocumentExtractor>(base::BindRepeating(
&BlinkDocumentExtractor::BindReceiver, base::Unretained(this)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reported by reviewdog 🐶
[semgrep] base::Unretained is most of the time unrequited, and a weak reference is better suited for secure coding.
Consider swapping Unretained for a weak reference.
base::Unretained usage may be acceptable when a callback owner is guaranteed
to be destroyed with the object base::Unretained is pointing to, for example:

- PrefChangeRegistrar
- base::*Timer
- mojo::Receiver
- any other class member destroyed when the class is deallocated


Source: https://github.com/brave/security-action/blob/main/assets/semgrep_rules/client/chromium-uaf.yaml


Cc @thypon @goodov @iefremov

@brave-builds
Copy link
Collaborator

A Storybook has been deployed to preview UI for the latest push

Copy link
Member

@fmarier fmarier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

script/brave_license_helper.py changes look good.

* This Source Code Form is subject to the terms of the Mozilla Public
* License, v. 2.0. If a copy of the MPL was not distributed with this file,
* You can obtain one at https://mozilla.org/MPL/2.0/. */

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall my concern with this content scraper is it seems to be doing parsing of web content in the browser process, which would violate the Rule of 2: https://chromium.googlesource.com/chromium/src/+/master/docs/security/rule-of-2.md. In general C++ code in the browsre process should not process untrustworthy inputs unless they are very trivial or first passed through a sanitizer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above the class declaration provides info on the two modes of operation:

https://github.com/brave/brave-core/blob/4ec17ed3ddf4ed1cef849e88b1e40c46c7563c45/components/web_discovery/browser/content_scraper.h#L47-L58

The first mode involves gathering contents from the current page in the renderer process. The untrusted content is processed by the sandboxed renderer process.

The second mode retrieves the untrusted content via the sandboxed network service process, and processes the HTML/DOM in the browser process using Rust.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sounds fine to me as the processing that does happen in C++ seems to be relatively minimal but would ike @bridiver to sign off

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are several existing methods for extracting html content from the page, why are you introducing something new here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am unaware of those existing methods, could you please provide a couple pointers?

auto final_payload = GenerateFinalPayload(*payload_dict);

std::string final_payload_json;
if (!base::JSONWriter::Write(final_payload, &final_payload_json)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the payload could contain any untrustworthy content (like any properties which can be controlled by a webpage, for instance) this should follow the rule of 2.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The payload is generated from the output of the ContentScraper, which follows Rule of 2.

@DJAndries
Copy link
Collaborator Author

I noted that:

  • components/web_discovery/browser/document_extractor/rs/BUILD.gn has allow_unsafe = true,
  • components/web_discovery/browser/document_extractor/rs/src/lib.rs has
    #[allow(unsafe_op_in_unsafe_fn)].

I didn't see any unsafe blocks while going over this. Do we have any unsafe code or is it in dependencies?

allow_unsafe = true is required for all cxx bindings due to the way cxx works internally, afaik. however, there is no other unsafe code within the binding itself and was able to remove #[allow(unsafe_op_in_unsafe_fn)] from both bindings.

@brave-builds
Copy link
Collaborator

A Storybook has been deployed to preview UI for the latest push

@@ -86,6 +86,7 @@ public class BravePrivacySettings extends PrivacySettings implements ConnectionE
public static final String PREF_FINGERPRINTING_PROTECTION2 = "fingerprinting_protection2";
private static final String PREF_CLOSE_TABS_ON_EXIT = "close_tabs_on_exit";
private static final String PREF_SEND_P3A = "send_p3a_analytics";
private static final String PREF_SEND_WEB_DISCOVERY = "send_web_discovery";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a mechanism to expose cpp prefs to java, but I don't recall what it is off-hand. @SergeyZhukovsky ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefs from cpp to java are exposed via that section https://github.com/brave/brave-core/blob/master/browser/android/preferences/BUILD.gn#L37. You basically add files with prefs that you want to expose and the script does. The prefs could be found in an auto generated BrsavePref.java file that is made from a template https://github.com/brave/brave-core/blob/master/browser/preferences/android/java_templates/BravePref.java.tmpl
That's an example on how a pref would look like in Java
public static final String BRAVE_CHAT_PURCHASE_TOKEN_ANDROID = "brave.ai_chat.purchase_token_android"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like we already do this:

if (BraveConfig.WEB_DISCOVERY_ENABLED
&& ChromeFeatureList.isEnabled(BraveFeatureList.BRAVE_WEB_DISCOVERY_NATIVE)) {
mSendWebDiscovery = (ChromeSwitchPreference) findPreference(PREF_SEND_WEB_DISCOVERY);
mSendWebDiscovery.setOnPreferenceChangeListener(this);
} else {
removePreferenceIfPresent(PREF_SEND_WEB_DISCOVERY);
}

java_cpp_strings("java_pref_names_srcjar") {
sources = [
"//brave/components/brave_adaptive_captcha/pref_names.h",
"//brave/components/brave_rewards/common/pref_names.h",
"//brave/components/brave_shields/core/common/pref_names.h",
"//brave/components/brave_vpn/common/pref_names.h",
"//brave/components/constants/pref_names.h",
"//brave/components/de_amp/common/pref_names.cc",
"//brave/components/de_amp/common/pref_names.h",
"//brave/components/debounce/core/common/pref_names.h",
"//brave/components/decentralized_dns/core/pref_names.h",
"//brave/components/ipfs/pref_names.h",
"//brave/components/ntp_background_images/common/pref_names.h",
"//brave/components/omnibox/browser/brave_omnibox_prefs.cc",
"//brave/components/omnibox/browser/brave_omnibox_prefs.h",
"//brave/components/p3a/pref_names.h",
"//brave/components/request_otr/common/pref_names.h",
"//brave/components/speedreader/speedreader_pref_names.h",
"//brave/components/web_discovery/browser/pref_names.h",
"//components/translate/core/browser/translate_pref_names.h",
]

constexpr size_t kMinPageCountForAliveMessage = 2;
} // namespace

WDPService::WDPService(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use consistent naming, WebDiscovery appears to be used everywhere else.

void WebDiscoveryTabHelper::DidFinishLoad(
content::RenderFrameHost* render_frame_host,
const GURL& url) {
if (!render_frame_host->IsInPrimaryMainFrame()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DocumentOnLoadCompletedInPrimaryMainFrame?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that method does not pass a url


const TEXT_CONTENT_ATTRIBUTE_NAME: &str = "textContent";

fn extract_attributes_from_nodes(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we use the axtree which would also take the place of the document extrator?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The crate is used to process the page contents from the anonymous request made in the browser process. If there is a way to construct an AxTree from a string, would it be easy to do this in a sandboxed process? if not, wouldn't that violate rule of 2?


namespace web_discovery {

class WebDiscoveryContentScraperTest : public PlatformBrowserTest {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the only browser test? I think we need some higher level tests for the full process and also tests to ensure that it's properly enabled/disabled with pref/flag

RSAKeyInfo::RSAKeyInfo() = default;
RSAKeyInfo::~RSAKeyInfo() = default;

std::unique_ptr<RSAKeyInfo> GenerateRSAKeyPair() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use RsaHashedAlgorithm from webcrypto?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. i meant to swap this code out with RsaHashedAlgorithm before making the PR (i learned about its existence later on during implementation), but i think we can use it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or crypto::RSAPrivateKey?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also what thread does this run on?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all callers use a thread pool

std::optional<AESEncryptResult> DeriveAESKeyAndEncrypt(
const std::string& server_pub_key_b64,
const std::vector<uint8_t>& data) {
auto server_pub_key_data = base::Base64Decode(server_pub_key_b64);
Copy link
Collaborator

@bridiver bridiver Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems strange to me that we're doing this low level encryption stuff ourselves. Are there really not any utilities in chromium for this? There must be something in crypto? Aead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for our specific use case of ecdh, i couldn't find anything useful from webcrypto

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AesGcmImplementation?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or crypt::Aead with `AES_256_GCM?

Copy link
Collaborator

@bridiver bridiver Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crypto::ECPrivateKey/EcdhImplementation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've tried using stuff from components/webcrypto/algorithms previously, but it ended up being a no-go since it required a few renderer specific blink dependencies

crypt::Aead should be an option, however. there may have been a reason why i chose not to use it but i can't recall. i'll revisit nevertheless.

the ecdh stuff would still remain, however

Copy link
Collaborator Author

@DJAndries DJAndries Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, i remembered why crypt::Aead wasn't an option. we need the aes-128-gcm algo, and the class does not expose that algo. i suppose we could patch it to support it, but it probably wouldn't be worth it especially considering the simplicity of the aes portion of the code

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ECPrivateKey is also not useful since only EC points are exchanged between the server and client, and not PKCS#8/X.509 blocks which the ECPrivateKey uses for serialization.

@bridiver bridiver requested review from iefremov and goodov July 30, 2024 00:30
start_join_result.gsk.end());
pool_sequenced_task_runner_->PostTaskAndReplyWithResult(
FROM_HERE,
base::BindOnce(&GenerateJoinRequest, &**anonymous_credential_manager_,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not look safe, anonymous_credential_manager_ is accessed on more than one thread dca6a55#diff-41b339c44a2b795ab1bfc9a8a48f5a1a1a1ce380252a5cde46a18338ff5052baR388

also why do you have a rust::Box inside a unique_ptr? You can pass the rust::Box with &*

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the unique_ptr is just for OnTaskRunnerDeleter I think it makes things a bit confusing with &** and you can just std::move the rust::Box and PostTask

pool_sequenced_task_runner_->PostTask(
        FROM_HERE, base::BindOnce([](::rust::Box<anonymous_credentials::CredentialManager> credential_manager) {},
                                  std::move(anonymous_credential_manager_)));

Copy link
Collaborator Author

@DJAndries DJAndries Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be possible, but then we wouldn't be able to perform multiple join and sign requests concurrently

Copy link
Collaborator Author

@DJAndries DJAndries Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not look safe, anonymous_credential_manager_ is accessed on more than one thread

the issue in the diff that you linked was fixed in a later commit. all calls to the manager are through the task runner

@@ -71,15 +85,18 @@ class CredentialManager {
net::BackoffEntry backoff_entry_;
base::WallClockTimer retry_timer_;

base::RepeatingClosure credentials_loaded_callback_;
scoped_refptr<base::SequencedTaskRunner> pool_sequenced_task_runner_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this pool_* ?

@goodov
Copy link
Member

goodov commented Jul 30, 2024

This PR has a nice history of commits, but they are not very usable as-is. Some significant changes occur midway, making a per-commit review obsolete.

At this point, the PR should be split into multiple smaller ones to gradually introduce these changes. There are too many modifications affecting performance, stability, and security.

The new PRs can use the existing commit history as a base, but some commits should be reordered and squash-fixed to create the "final" version of each logical change.

@DJAndries DJAndries force-pushed the wdp-native-anon-creds branch 3 times, most recently from 6c85c0b to 0eb2038 Compare August 1, 2024 23:04
@DJAndries
Copy link
Collaborator Author

superseded by #24969, #24970 and #24971

@DJAndries DJAndries closed this Aug 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/run-audit-deps Check for known npm/cargo vulnerabilities (audit_deps) CI/storybook-url Deploy storybook and provide a unique URL for each build needs-security-review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Native re-implementation of Web Discovery (part 1)