
Come up with architecture for a scalable and highly available secondary server #1733

Open

VJag opened this issue Jan 8, 2024 · 6 comments

Labels: enhancement (New feature or request)

VJag (Member) commented Jan 8, 2024

Is your feature request related to a problem? Please describe.

The secondary server, as it exists today, is not scalable; the only available option is vertical scaling. Because of the way our persistence works, it is not possible to run a secondary server per region while honoring data locality, replication, and so on.

Describe the solution you'd like

Come up with a design for the problem described above. The task will have the following sub-tasks:

Requirements Analysis:

Understand the current and anticipated future requirements: data volume, traffic patterns, and performance expectations.

Scalability Considerations:

Horizontal Scaling: Plan for distributing the load across multiple servers or instances. Implement load-balancing mechanisms to distribute incoming traffic evenly (a minimal load-balancer sketch follows this list).
Vertical Scaling: Consider scaling up resources (CPU, RAM) on individual servers if needed, although horizontal scaling often provides better long-term scalability.
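As a minimal sketch of the horizontal-scaling idea above, the following Python snippet round-robins incoming TCP connections across a hypothetical pool of secondary-server instances. The addresses and port are placeholders, and a real deployment would more likely use an established balancer such as HAProxy or nginx.

```python
# Minimal round-robin TCP load balancer sketch (hypothetical backends/port).
# Illustrates horizontal scaling: traffic is spread across several
# secondary-server instances instead of one larger machine.
import asyncio
import itertools

# Hypothetical pool of secondary-server instances behind the balancer.
BACKENDS = [("10.0.0.1", 6464), ("10.0.0.2", 6464), ("10.0.0.3", 6464)]
_pool = itertools.cycle(BACKENDS)

async def pipe(reader, writer):
    try:
        while data := await reader.read(4096):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle_client(client_reader, client_writer):
    host, port = next(_pool)  # pick the next backend in round-robin order
    backend_reader, backend_writer = await asyncio.open_connection(host, port)
    # Shuttle bytes in both directions until either side closes.
    await asyncio.gather(
        pipe(client_reader, backend_writer),
        pipe(backend_reader, client_writer),
        return_exceptions=True,
    )

async def main():
    server = await asyncio.start_server(handle_client, "0.0.0.0", 6464)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```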

High Availability Design:

Redundancy and Failover: Design the system with redundancy in mind to mitigate single points of failure. Implement failover mechanisms to ensure continuous service when a server fails (a client-side failover sketch follows this list).
Replication: Employ data replication strategies to duplicate data across multiple servers or regions for resilience and data availability.

Fault-Tolerant Architecture: Use fault-tolerant technologies and practices to handle failures without service disruptions.
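As a minimal sketch of the failover idea above: the client tries each replica in turn with a short timeout, so a single failed instance does not interrupt service. The replica hostnames and port are hypothetical.

```python
# Hedged sketch of client-side failover across replicas.
import socket

# Hypothetical replica endpoints.
REPLICAS = [("replica-a.example.com", 6464), ("replica-b.example.com", 6464)]

def send_with_failover(payload: bytes, timeout: float = 2.0) -> bytes:
    last_error = None
    for host, port in REPLICAS:
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(payload)
                return sock.recv(4096)
        except OSError as exc:  # connection refused, timeout, etc.
            last_error = exc    # fall through to the next replica
    raise ConnectionError("all replicas failed") from last_error
```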

Database Considerations:

Scalable Database: Choose a database system that can scale horizontally, for example by sharding keys across nodes (a sharding sketch follows this list).

Replication and Backups: Implement database replication for data redundancy and backups to prevent data loss in case of failures.
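One common technique for the horizontal database scaling mentioned above is key sharding via consistent hashing. A minimal sketch follows, with hypothetical node names; this is an illustration of the technique, not a committed design.

```python
# Hedged sketch of key sharding with consistent hashing: adding a node
# moves only a fraction of the keys between shards.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Place several virtual points per node on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["shard-0", "shard-1", "shard-2"])
print(ring.node_for("phone.wavi@alice"))  # -> one of the shards
```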

Load Balancing and Traffic Management:

Implement load balancers to distribute incoming traffic evenly across multiple servers or regions.

Describe alternatives you've considered

No response

Additional context

No response

VJag (Member Author) commented Feb 5, 2024

Here is the document that captures the aspects related to this ticket: https://docs.google.com/document/d/1UNgcTBlvDSCqX5N-ai05vl6vrGBfTtgdPh-j50TibkU/edit?usp=sharing

VJag (Member Author) commented Feb 20, 2024

As part of the ongoing ticket, our team has initiated efforts to benchmark server performance. The key objectives for this task include:

  1. Designing and implementing a robust framework for stress testing. The code should be optimized to run seamlessly within a virtual machine (VM) environment (a minimal harness sketch follows this list).
  2. Developing the actual stress testing code, running it locally, and ensuring that it not only accomplishes its intended purposes but also operates efficiently within a VM setting.
  3. Presenting and demonstrating the outcomes of this work to a broader audience through an architecture call or stand-up meeting.
  4. Collaborating with Chris to execute the benchmarking against atSigns in the production environment.
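To make objective 1 concrete, here is a hedged sketch of such a harness, assuming a locally running secondary server: worker threads open plain TCP connections and time a single round-trip. The host, port, and the unauthenticated info: round-trip are illustrative assumptions, not the actual framework code (a real secondary server would typically require TLS).

```python
# Hedged stress-test harness sketch: N worker threads each time one
# connect/request/response round-trip against a secondary server.
import socket
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

HOST, PORT = "127.0.0.1", 6464  # hypothetical local secondary server

def one_round_trip() -> float:
    start = time.perf_counter()
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        sock.sendall(b"info:\n")
        sock.recv(4096)  # read the first chunk of the server's response
    return time.perf_counter() - start

def run(workers: int = 50, requests: int = 500) -> None:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda _: one_round_trip(), range(requests)))
    print(f"p50={statistics.median(latencies) * 1000:.1f} ms "
          f"max={max(latencies) * 1000:.1f} ms")

if __name__ == "__main__":
    run()
```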

The timeline for achieving objectives 3 and 4 is within the current sprint (PR 81). The results of this benchmarking exercise will play a crucial role in shaping our subsequent actions and decisions.

purnimavenkatasubbu (Member) commented Mar 4, 2024

In PR-81, we wrote tests for the following scenarios, along with documentation, and demonstrated them in the architecture call:

  1. Parallel_put_sync test
  2. sync_pull_load test
  3. parallel_notify_same_atsign test
  4. monitor_test

The goal for this sprint is to expand the tests to cover all the notification scenarios and to work with Chris to execute the benchmarking against atSigns in the production environment.

The documentation of the tests completed so far can be found in the branch.

We are also planning to explore Locust for load testing: https://locust.io/

purnimavenkatasubbu (Member) commented Mar 18, 2024

We have finished phase 1 of writing scripts for the above-mentioned scenarios and have moved on to a Locust script that runs multiple clients performing an unauthenticated scan/info. The Locust script can be found here:
locust_test_script
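For illustration, here is a hedged sketch of what such a Locust user might look like; it is not the actual locust_test_script, and the host and port are placeholders. Locust drives HTTP by default, so a raw-TCP user has to report its own timings through Locust's request event.

```python
# Hedged sketch of a Locust user that opens a raw TCP connection and
# issues an unauthenticated scan against a secondary server.
import socket
import time

from locust import User, task, between

class AtServerUser(User):
    wait_time = between(0.5, 2)
    host = "127.0.0.1"   # placeholder secondary-server host
    port = 6464          # placeholder port

    @task
    def scan(self):
        start = time.perf_counter()
        exception, length = None, 0
        try:
            with socket.create_connection((self.host, self.port), timeout=5) as sock:
                sock.sendall(b"scan\n")
                length = len(sock.recv(65536))  # first chunk of the response
        except OSError as exc:
            exception = exc
        # Report the result so it shows up in Locust's statistics.
        self.environment.events.request.fire(
            request_type="tcp",
            name="scan",
            response_time=(time.perf_counter() - start) * 1000,
            response_length=length,
            exception=exception,
        )
```

Such a script can then be run headless, e.g. `locust -f locustfile.py --headless -u 100 -r 10 -t 1m`.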

The next goal is to narrow down performance, i.e., to collect metrics for the following scenarios and be able to predict the point at which the server breaks down (a memory-sampling sketch follows this list):

  • Memory consumed when starting a fresh secondary server
  • How memory consumption grows as the number of connections increases
  • How the memory allocated for Hive relates to the size of the keys in steady state
  • How memory grows with the number of clients running the scan verb (through Locust scripts), and at what point the server breaks
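One way these numbers could be sampled, as a hedged sketch: poll the server process's resident set size (RSS) with the psutil package while the connection count grows. Obtaining the server's PID and the sampling cadence are assumptions here.

```python
# Hedged sketch: sample the secondary server's RSS and open TCP
# connection count over time using psutil.
import time
import psutil

def rss_mb(proc: psutil.Process) -> float:
    return proc.memory_info().rss / (1024 * 1024)

def sample(pid: int, interval: float = 1.0, samples: int = 60) -> None:
    proc = psutil.Process(pid)  # PID of the running secondary server
    for _ in range(samples):
        conns = len(proc.connections(kind="tcp"))
        print(f"rss={rss_mb(proc):.1f} MB tcp_connections={conns}")
        time.sleep(interval)
```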

Details collected so far can be seen in the following sheet:
performance_metrics

purnimavenkatasubbu (Member) commented Apr 2, 2024

During this sprint, we used the Locust script to run a series of performance tests aimed at evaluating the scalability and resilience of our server infrastructure. Specifically, we ran lookup tests in which we systematically increased both the number of client connections and the number of keys stored in the server.

Test Conditions:

Number of Keys: We systematically increased the quantity of keys stored within the server. We initiated the test with 5 unique keys and incrementally expanded it to 10, 100, 1000, and eventually 10,000 keys.

Number of Clients: Simultaneously, we varied the number of client connections accessing the server. Beginning with a single client, we progressively scaled the load up to 10, 100, 200, 500, 1000, and ultimately 10,000 concurrent clients (a sketch for sweeping these runs follows the conditions below).

  • Key size:
    1. Not a fixed size
    2. 240 characters
  • Value size: 1 KB
  • Test duration:
    1. 30 seconds
    2. 1 minute
    3. 2 minutes
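As referenced above, a hedged sketch of sweeping those client counts via headless Locust runs; the locustfile name is a placeholder, while --headless, -u (users), -r (spawn rate), -t (run time), and --csv (results prefix) are standard Locust CLI options.

```python
# Hedged sketch: run one headless Locust pass per client count and
# write each pass's statistics to CSV files.
import subprocess

for users in (1, 10, 100, 200, 500, 1000, 10000):
    subprocess.run(
        [
            "locust", "-f", "locustfile.py", "--headless",
            "-u", str(users), "-r", str(max(1, users // 10)),
            "-t", "1m", "--csv", f"results_u{users}",
        ],
        check=True,
    )
```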

All the collected performance test metrics can be found in the following sheet:
Load_testing_metrics

purnimavenkatasubbu (Member) commented

We collected metrics by running both the client and the server on the same VM. The metrics can be found in the following sheet: Same_VM_Metrics.

Next, we aim to run the server and client on separate virtual machines (VMs) to ensure that running both on the same machine does not skew the overall performance results.
