diff --git a/docs/.gitbook/assets/Screenshot 2024-11-30 at 15.33.27.png b/docs/.gitbook/assets/Screenshot 2024-11-30 at 15.33.27.png new file mode 100644 index 000000000..c18d3bee7 Binary files /dev/null and b/docs/.gitbook/assets/Screenshot 2024-11-30 at 15.33.27.png differ diff --git a/docs/advanced/migrations/reacher-configuration-v0.10.md b/docs/advanced/migrations/reacher-configuration-v0.10.md index 4dd59d190..010a1bfc9 100644 --- a/docs/advanced/migrations/reacher-configuration-v0.10.md +++ b/docs/advanced/migrations/reacher-configuration-v0.10.md @@ -106,28 +106,10 @@ enable = true # Env variable: RCH__WORKER__RABBITMQ__URL url = "amqp://guest:guest@localhost:5672" -# Queues to consume emails from. By default, the worker consumes from all -# queues. -# -# To consume from only a subset of queues, uncomment the line `queues = "all"` -# and specify the queues you want to consume from. -# -# Below is the exhaustive list of queue names that the worker can consume from: -# - "check.gmail": subscribe exclusively to Gmail emails. -# - "check.hotmailb2b": subscribe exclusively to Hotmail B2B emails. -# - "check.hotmailb2c": subscribe exclusively to Hotmail B2C emails. -# - "check.yahoo": subscribe exclusively to Yahoo emails. -# - "check.everything_else": subscribe to all emails that are not Gmail, Yahoo, or Hotmail. -# -# Env variable: RCH__WORKER__RABBITMQ__QUEUES -# -# queues = ["check.gmail", "check.hotmail.b2b", "check.hotmail.b2c", "check.yahoo", "check.everything_else"] -queues = "all" - -# Number of concurrent emails to verify for this worker across all queues. +# Number of concurrent emails to verify for this worker. # # Env variable: RCH__WORKER__RABBITMQ__CONCURRENCY -concurrency = 20 +concurrency = 5 # Throttle the maximum number of requests per second, per minute, per hour, and # per day for this worker. 
@@ -159,6 +141,7 @@ db_url = "postgresql://localhost/reacherdb" # # Env variable: RCH__SENTRY_DSN # sentry_dsn = "" + ``` ## Usage with Docker diff --git a/docs/self-hosting/scaling-for-production.md b/docs/self-hosting/scaling-for-production.md index 099ce85c8..1b7712408 100644 --- a/docs/self-hosting/scaling-for-production.md +++ b/docs/self-hosting/scaling-for-production.md @@ -12,31 +12,29 @@ The architecture contains 4 components: Note that Reacher provides the same Docker image `reacherhq/backend` which can act as both a **Worker** and a **HTTP server**. -

Reacher architecture for scaling

+

Reacher queue architecture

-With this architecture, it's possible to horizontally scale the number of workers, while making sure that the individual IPs don't get blacklisted. To do so, we propose to start with two types of workers. +With this architecture, it's possible to horizontally scale the number of workers. However, to avoid spawning too many workers at once and getting IPs blacklisted, we need to configure the concurrency and throttling parameters below. -### Shared Configuration between both workers +### Worker Configuration -To enable the above worker architecture, set the following parameters in [reacher-configuration-v0.10.md](../advanced/migrations/reacher-configuration-v0.10.md "mention"): +To enable the above worker architecture without getting blacklisted, we need to set some parameters in [reacher-configuration-v0.10.md](../advanced/migrations/reacher-configuration-v0.10.md "mention"): * `worker.enable`: true * `worker.rabbitmq.url`: Points to the URL of the RabbitMQ instance. * `worker.postgres.db_url`: A Postgres database to store the email verification results. -### 1st worker type: SMTP worker using Proxy +Since spawning workers (generally on cloud providers) doesn't guarantee that a reputable IP is assigned to the worker, we propose configuring all workers to use a proxy. Proxies are generally priced per IP per month; we recommend buying one IP for every 10000 email verifications you perform per day. -These workers will consume all emails that should be verified through SMTP. Currently, this includes all emails, except Hotmail B2C and Yahoo emails, which are best verified using a headless navigator. Since maintaing IP addresses is hard, we recommend using a proxy, see [proxies.md](proxies.md "mention"). +* `worker.proxy.{host,port}`: Set a proxy to route all SMTP requests through. You can optionally pass in `username` and `password` if required. 
-Assuming your proxy has `N` available IP addresses, we recommend spawning the same number `N` of workers, each with the config below: +We also recommend the following values for the concurrency and throttling parameters. These parameters help keep the proxy's IPs in good standing. -* `worker.rabbitmq.queues`: `["check.gmail","check.hotmailb2b","everything_else"]`. The SMTP workers will listen to these queues. -* `worker.proxy.{host,port}`: Set a proxy to route all SMTP requests through. You can optionally pass in `username` and `password` if required. -* `worker.rabbitmq.concurrency`: 10. -* `worker.throttle.max_requests_per_minute`: 100. -* `worker.throttle.max_requests_per_day`: 10000. This is the recommended number of verifications per IP per day. Assuming there are `N` IP addresses and `N` workers, each worker should perform 10000 verifications per day. +* `worker.rabbitmq.concurrency`: 5. Each worker can process 5 emails at a time. +* `worker.throttle.max_requests_per_minute`: 60. If this value is too high, the recipient SMTP server might see sudden spikes of email verifications and blacklist the IP. +* `worker.throttle.max_requests_per_day`: 10000. This is the recommended number of verifications per IP per day. Assuming our proxy has `N` IP addresses and we run `N` workers, each worker will perform 10000 verifications per day on average. -You can scale up the number `N` as much as you need. Remember, the rule of thumb is 10000 verifications per IP per day. For example, if you're aiming for 10 millions verifications per month, we recommend 33 or 34 IPs. +You can scale up the number `N` as much as you need by buying more IPs and spawning more workers. Remember, the rule of thumb is 10000 verifications per IP per day. 
For example, if you're aiming for 10 million verifications per month, we recommend buying 33 or 34 IPs: ``` 10,000,000 emails per month / 30 = 33,000 emails per day / 10000 = 33 IPs @@ -44,17 +42,6 @@ You can scale up the number `N` as much as you need. Remember, the rule of thumb Refer to [reacher-configuration-v0.10.md](../advanced/migrations/reacher-configuration-v0.10.md "mention")to see how to set these settings. -### 2nd worker type: Headless worker - -These workers will consume all emails that are best verified using a headless browser. The idea behind this verification method is to spawn a headless browser that will navigate to the email provider's password recovery page, and parse the website's response to inputting emails. This method currently works well for Hotmail and Yahoo emails. - -To spawn such a worker, provide the config: - -* `worker.rabbitmq.queues`: `["check.hotmailb2c","check.yahoo"]`. These are the emails that are best verified using headless. -* `worker.throttle.max_requests_per_minute`: 100 - -Refer to [reacher-configuration-v0.10.md](../advanced/migrations/reacher-configuration-v0.10.md "mention")to see how to set these settings. - ## Understanding the architecture with Docker Compose We do not recommend using Docker Compose for a high-volume production setup. However, for understanding the architecture, the different Docker images, as well as how to configure the workers, this [`docker_compose.yaml`](../../docker-compose.yaml) file can be useful. @@ -64,4 +51,4 @@ We do not recommend using Docker Compose for a high-volume production setup. How Contact [amaury@reacher.email](https://app.gitbook.com/u/F1LnsqPFtfUEGlcILLswbbp5cgk2 "mention")if you have more questions about this architecture, such as: * deploying on Kubernetes (Ansible playbook, Pulumi) -* more specialized workers (e.g. Gmail and Hotmail B2B workers can be separated) +* more specialized workers (e.g. some workers doing headless verification only, others doing SMTP only)
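
Review note on the sizing rule of thumb in `scaling-for-production.md`: the arithmetic can be checked with a small sketch. It assumes only the figures stated in the doc (10000 verifications per IP per day, a 30-day month); the function name `ips_needed` and its parameters are hypothetical and not part of Reacher. Note that 10,000,000 / 30 is roughly 333,333 emails per day, so the `33,000` in the existing code-fence context line appears to drop a digit, although the final figure of 33 to 34 IPs still holds.

```python
def ips_needed(verifications_per_month: int,
               per_ip_per_day: int = 10_000,   # rule of thumb from the doc
               days_per_month: int = 30) -> int:
    """Return the number of proxy IPs (and workers) to provision.

    Hypothetical helper for illustration only; it rounds up so that no IP
    exceeds the per-day budget.
    """
    monthly_budget_per_ip = per_ip_per_day * days_per_month
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-verifications_per_month // monthly_budget_per_ip)


if __name__ == "__main__":
    # 10 million verifications per month, as in the doc's example.
    print(ips_needed(10_000_000))
```

Rounding up gives 34 here; rounding down to 33 would put each IP slightly above 10000 verifications per day, which is why the doc's "33 or 34" range is reasonable.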