
Come up with architecture for a scalable and highly available secondary server #1733

Open

VJag opened this issue Jan 8, 2024 · 6 comments

Labels: enhancement (New feature or request)

VJag (Member) commented Jan 8, 2024

Is your feature request related to a problem? Please describe.

The secondary server, as it exists today, is not scalable; the only available option is vertical scaling. Because of the way our persistence works, it is not possible to run a secondary server per region while honoring data locality, replication, and so on.

Describe the solution you'd like

Come up with a design for the problem described above. The task will have the following sub-tasks:

Requirements Analysis:

Understand the current and anticipated future requirements: data volume, traffic patterns, and performance expectations.

Scalability Considerations:

Horizontal Scaling: Plan for distributing the load across multiple servers or instances. Implement load-balancing mechanisms to distribute incoming traffic evenly (a minimal load-balancer sketch follows this list).
Vertical Scaling: Consider scaling up resources (CPU, RAM) on individual servers if needed, although horizontal scaling often provides better long-term scalability.
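As a minimal sketch of the horizontal-scaling idea above, the following Python snippet round-robins incoming TCP connections across a hypothetical pool of secondary-server instances. The addresses and port are placeholders, and a real deployment would more likely use an established balancer such as HAProxy or nginx.

```python
# Minimal round-robin TCP load balancer sketch (hypothetical backends/port).
# Illustrates horizontal scaling: traffic is spread across several
# secondary-server instances instead of one larger machine.
import asyncio
import itertools

# Hypothetical pool of secondary-server instances behind the balancer.
BACKENDS = [("10.0.0.1", 6464), ("10.0.0.2", 6464), ("10.0.0.3", 6464)]
_pool = itertools.cycle(BACKENDS)

async def pipe(reader, writer):
    try:
        while data := await reader.read(4096):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle_client(client_reader, client_writer):
    host, port = next(_pool)  # pick the next backend in round-robin order
    backend_reader, backend_writer = await asyncio.open_connection(host, port)
    # Shuttle bytes in both directions until either side closes.
    await asyncio.gather(
        pipe(client_reader, backend_writer),
        pipe(backend_reader, client_writer),
        return_exceptions=True,
    )

async def main():
    server = await asyncio.start_server(handle_client, "0.0.0.0", 6464)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```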

High Availability Design:

Redundancy and Failover: Design the system with redundancy in mind to mitigate single points of failure. Implement failover mechanisms to ensure continuous service when a server fails (a client-side failover sketch follows this list).
Replication: Employ data replication strategies to duplicate data across multiple servers or regions for resilience and data availability.

Fault-Tolerant Architecture: Use fault-tolerant technologies and practices to handle failures without service disruptions.
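As a minimal sketch of the failover idea above: the client tries each replica in turn with a short timeout, so a single failed instance does not interrupt service. The replica hostnames and port are hypothetical.

```python
# Hedged sketch of client-side failover across replicas.
import socket

# Hypothetical replica endpoints.
REPLICAS = [("replica-a.example.com", 6464), ("replica-b.example.com", 6464)]

def send_with_failover(payload: bytes, timeout: float = 2.0) -> bytes:
    last_error = None
    for host, port in REPLICAS:
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(payload)
                return sock.recv(4096)
        except OSError as exc:  # connection refused, timeout, etc.
            last_error = exc    # fall through to the next replica
    raise ConnectionError("all replicas failed") from last_error
```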

Database Considerations:

Scalable Database: Choose a database system that can scale horizontally, for example by sharding keys across nodes (a sharding sketch follows this list).

Replication and Backups: Implement database replication for data redundancy and backups to prevent data loss in case of failures.
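One common technique for the horizontal database scaling mentioned above is key sharding via consistent hashing. A minimal sketch follows, with hypothetical node names; this is an illustration of the technique, not a committed design.

```python
# Hedged sketch of key sharding with consistent hashing: adding a node
# moves only a fraction of the keys between shards.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Place several virtual points per node on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["shard-0", "shard-1", "shard-2"])
print(ring.node_for("phone.wavi@alice"))  # -> one of the shards
```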

Load Balancing and Traffic Management:

Implement load balancers to distribute incoming traffic evenly across multiple servers or regions.

Describe alternatives you've considered

No response

Additional context

No response

VJag (Member Author) commented Feb 5, 2024

Here is the document that captures the aspects related to this ticket: https://docs.google.com/document/d/1UNgcTBlvDSCqX5N-ai05vl6vrGBfTtgdPh-j50TibkU/edit?usp=sharing

VJag (Member Author) commented Feb 20, 2024

As part of the ongoing ticket, our team has initiated efforts to benchmark server performance. The key objectives for this task include:

  1. Designing and implementing a robust framework for stress testing. The code should be optimized to run seamlessly within a virtual machine (VM) environment (a minimal harness sketch follows this list).
  2. Developing the actual stress testing code, running it locally, and ensuring that it not only accomplishes its intended purposes but also operates efficiently within a VM setting.
  3. Presenting and demonstrating the outcomes of this work to a broader audience through an architecture call or stand-up meeting.
  4. Collaborating with Chris to execute the benchmarking against atSigns in the production environment.
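To make objective 1 concrete, here is a hedged sketch of such a harness, assuming a locally running secondary server: worker threads open plain TCP connections and time a single round-trip. The host, port, and the unauthenticated info: round-trip are illustrative assumptions, not the actual framework code (a real secondary server would typically require TLS).

```python
# Hedged stress-test harness sketch: N worker threads each time one
# connect/request/response round-trip against a secondary server.
import socket
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

HOST, PORT = "127.0.0.1", 6464  # hypothetical local secondary server

def one_round_trip() -> float:
    start = time.perf_counter()
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        sock.sendall(b"info:\n")
        sock.recv(4096)  # read the first chunk of the server's response
    return time.perf_counter() - start

def run(workers: int = 50, requests: int = 500) -> None:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda _: one_round_trip(), range(requests)))
    print(f"p50={statistics.median(latencies) * 1000:.1f} ms "
          f"max={max(latencies) * 1000:.1f} ms")

if __name__ == "__main__":
    run()
```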

The timeline for achieving objectives 3 and 4 is within the current sprint (PR 81). The results of this benchmarking exercise will play a crucial role in shaping our subsequent actions and decisions.

purnimavenkatasubbu (Member) commented Mar 4, 2024

In PR-81, we wrote tests for the following scenarios, along with documentation, and demonstrated them in the architecture call:

  1. Parallel_put_sync test
  2. sync_pull_load test
  3. parallel_notify_same_atsign test
  4. monitor_test

The goal for this sprint is to expand the tests to cover all the notification scenarios and to work with Chris to execute the benchmarking against atSigns in the production environment.

The documentation of the tests completed so far can be found in the branch.

We are also planning to explore Locust for load testing: https://locust.io/

purnimavenkatasubbu (Member) commented Mar 18, 2024

We have finished phase 1 of writing scripts for the above-mentioned scenarios and have moved on to a Locust script that runs multiple clients performing an unauthenticated scan/info. The Locust script can be found here:
locust_test_script
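For illustration, here is a hedged sketch of what such a Locust user might look like; it is not the actual locust_test_script, and the host and port are placeholders. Locust drives HTTP by default, so a raw-TCP user has to report its own timings through Locust's request event.

```python
# Hedged sketch of a Locust user that opens a raw TCP connection and
# issues an unauthenticated scan against a secondary server.
import socket
import time

from locust import User, task, between

class AtServerUser(User):
    wait_time = between(0.5, 2)
    host = "127.0.0.1"   # placeholder secondary-server host
    port = 6464          # placeholder port

    @task
    def scan(self):
        start = time.perf_counter()
        exception, length = None, 0
        try:
            with socket.create_connection((self.host, self.port), timeout=5) as sock:
                sock.sendall(b"scan\n")
                length = len(sock.recv(65536))  # first chunk of the response
        except OSError as exc:
            exception = exc
        # Report the result so it shows up in Locust's statistics.
        self.environment.events.request.fire(
            request_type="tcp",
            name="scan",
            response_time=(time.perf_counter() - start) * 1000,
            response_length=length,
            exception=exception,
        )
```

Such a script can then be run headless, e.g. `locust -f locustfile.py --headless -u 100 -r 10 -t 1m`.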

The next goal is to narrow down performance, i.e., to collect metrics for the following scenarios and be able to predict the point at which the server breaks down (a memory-sampling sketch follows this list):

  • Memory consumed when starting a fresh secondary server
  • How memory consumption grows as the number of connections increases
  • How the memory allocated for Hive relates to the size of the keys in steady state
  • How memory grows with the number of clients running the scan verb (through Locust scripts), and at what point the server breaks
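One way these numbers could be sampled, as a hedged sketch: poll the server process's resident set size (RSS) with the psutil package while the connection count grows. Obtaining the server's PID and the sampling cadence are assumptions here.

```python
# Hedged sketch: sample the secondary server's RSS and open TCP
# connection count over time using psutil.
import time
import psutil

def rss_mb(proc: psutil.Process) -> float:
    return proc.memory_info().rss / (1024 * 1024)

def sample(pid: int, interval: float = 1.0, samples: int = 60) -> None:
    proc = psutil.Process(pid)  # PID of the running secondary server
    for _ in range(samples):
        conns = len(proc.connections(kind="tcp"))
        print(f"rss={rss_mb(proc):.1f} MB tcp_connections={conns}")
        time.sleep(interval)
```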

Details collected so far can be seen in the following sheet:
performance_metrics

purnimavenkatasubbu (Member) commented Apr 2, 2024

During this sprint, we used the Locust script to run a series of performance tests aimed at evaluating the scalability and resilience of our server infrastructure. Specifically, we ran lookup tests in which we systematically increased both the number of client connections and the number of keys stored in the server.

Test Conditions:

Number of Keys: We systematically increased the quantity of keys stored within the server. We initiated the test with 5 unique keys and incrementally expanded it to 10, 100, 1000, and eventually 10,000 keys.

Number of Clients: Simultaneously, we varied the number of client connections accessing the server. Beginning with a single client, we progressively scaled the load up to 10, 100, 200, 500, 1000, and ultimately 10,000 concurrent clients (a sketch for sweeping these runs follows the conditions below).

  • Key size:
    1. Not a fixed size
    2. 240 characters
  • Value size: 1 KB
  • Test duration:
    1. 30 seconds
    2. 1 minute
    3. 2 minutes
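As referenced above, a hedged sketch of sweeping those client counts via headless Locust runs; the locustfile name is a placeholder, while --headless, -u (users), -r (spawn rate), -t (run time), and --csv (results prefix) are standard Locust CLI options.

```python
# Hedged sketch: run one headless Locust pass per client count and
# write each pass's statistics to CSV files.
import subprocess

for users in (1, 10, 100, 200, 500, 1000, 10000):
    subprocess.run(
        [
            "locust", "-f", "locustfile.py", "--headless",
            "-u", str(users), "-r", str(max(1, users // 10)),
            "-t", "1m", "--csv", f"results_u{users}",
        ],
        check=True,
    )
```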

All the collected performance test metrics can be found in the following sheet:
Load_testing_metrics

purnimavenkatasubbu (Member) commented

We collected metrics by running both the client and the server on the same VM. The metrics can be found in the following sheet: Same_VM_Metrics.

Next, we aim to run the server and client on separate virtual machines (VMs) to ensure that running both on the same machine does not skew the overall performance results.
