This repository contains the code for Assignment-3 of the Distributed Systems (CS60002) course, Spring 2024.
- Assignment-3 Distributed Systems
- Implementing Write-Ahead Logging for Consistency in a Replicated Database with Sharding
- Table of Contents
- Group Details
- Prerequisite
- Getting Started
- Design Choices
- Troubleshooting
- Evaluation
Repo Link: https://github.com/PranavMehrotra/A3_Distributed_Systems
- Pranav Mehrotra (20CS10085)
- Saransh Sharma (20CS30065)
- Pranav Nyati (20CS30037)
- Shreyas Jena (20CS30049)
sudo apt-get update
sudo apt-get install \
ca-certificates \
curl \
gnupg \
lsb-release
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
sudo curl -SL https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-linux-x86_64 -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose
To create the necessary Docker images for the load balancer and servers, execute the following command:
make install
To initiate the deployment of load balancer containers, execute the following command:
make deploy
This command launches the load balancer container, which, in turn, spawns the initial N server Docker containers along with their heartbeat probing threads. Ensure that the necessary configurations are in place for a seamless deployment. The command also clears any existing containers that use the server or load balancer image (i.e., it executes make clean first).
Note: The deployment command runs the containers in detached (background) mode. To view the load balancer logs, check the docker-compose logs.
To interact with the load balancer and send GET/POST requests, launch the interactive terminal using the following command:
bash client.sh
To stop and remove all containers using the server image and load balancer, run the following command:
make clean
Executing this command is recommended before running the main code to ensure there are no conflicting container instances already running. This helps prevent potential errors and ensures a clean environment for the code execution.
To remove previously created server and load balancer images, execute the following command:
make deepclean
It is advisable to run this command before executing the main code to eliminate any pre-existing images with the same name. This ensures a clean slate and avoids potential conflicts during the code execution.
- Initialization of heartbeat threads: When the load_balancer receives a client request to add or remove servers, or an init_servers request, it forwards the information about the servers to be added/removed to the shard_manager. The shard manager then initializes a heartbeat thread for each server, which in turn monitors that server.
- Primary election: For an init or add-server request, the Shard Manager randomly selects one of the servers as the primary for each new shard added in the request. For a remove-server request, if a removed server was the primary for any of its shards, the Shard Manager runs an election to choose a new primary from among that shard's secondary servers. The election picks a server holding the latest (most up-to-date) data, which the Shard Manager determines by querying each secondary server over the network for its latest transaction ID (see the sketch after this list).
- Crash recovery: When a server crashes or stops, that server's heartbeat thread (maintained by the Shard_Manager) handles it through a crash recovery mechanism. It first runs the primary election (as discussed in point 2) for every shard for which the crashed server was the primary. It then respawns the server and configures it with the appropriate schema. Next, for each shard the server was maintaining, the heartbeat thread copies the data back from that shard's primary server. Once the data for all shards has been copied, the server has recovered and is up and running.
- Write-ahead logging: The log file records each transaction's ID, type (write, update, or delete), and associated data as soon as the incoming request is received. Data is logged at this initial stage, not when the database server fulfils the request or when the request is committed. A minimal sketch of this discipline follows below.
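The sketch below illustrates, in simplified form, the three mechanisms described above: the per-server heartbeat loop, the election of a new primary based on the latest transaction ID, and the write-ahead log entry appended as soon as a request arrives. All function names, endpoints (e.g. /heartbeat, /lasttxn), the port, and field names are illustrative assumptions, not the actual interfaces of this repository.

```python
# Hypothetical sketch of the mechanisms above; endpoint names (/heartbeat,
# /lasttxn), the port, and all field names are assumptions, not this repo's API.
import json
import random
import time

import requests


def heartbeat_loop(server, on_failure, interval=2, max_misses=3):
    """Periodically probe a server; trigger crash recovery after repeated misses."""
    misses = 0
    while True:
        try:
            requests.get(f"http://{server}:5000/heartbeat", timeout=1)
            misses = 0
        except requests.RequestException:
            misses += 1
            if misses >= max_misses:
                on_failure(server)   # e.g. elect new primaries, respawn, copy data
                return
        time.sleep(interval)


def elect_new_primary(shard_id, secondaries):
    """Pick the secondary holding the most recent transaction for this shard."""
    best_server, best_txn = None, -1
    for server in secondaries:
        try:
            resp = requests.get(f"http://{server}:5000/lasttxn",
                                params={"shard": shard_id}, timeout=2)
            txn_id = resp.json().get("last_txn_id", -1)
        except requests.RequestException:
            continue  # unreachable secondaries cannot become primary
        if txn_id > best_txn:
            best_server, best_txn = server, txn_id
    # Fall back to a random secondary if no transaction info could be fetched
    return best_server if best_server is not None else random.choice(secondaries)


def append_wal_entry(log_path, txn_id, op_type, data):
    """Append a log record when the request is *received*, before it is applied."""
    entry = {"txn_id": txn_id, "type": op_type, "data": data}
    with open(log_path, "a") as log_file:
        log_file.write(json.dumps(entry) + "\n")
        log_file.flush()   # ensure the entry reaches the log before the DB write
```

In this sketch, the Shard Manager would run one such heartbeat loop per server in its own thread, invoking the election and recovery steps from the on_failure callback.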
Exit code 137 indicates a memory (RAM) related issue. Stop and remove already running containers to free up space.
Stop a particular container: docker stop <container_id>
Stop all containers: docker stop $(docker ps -a -q)
Remove a particular container: docker rm <container_id>
Remove all containers: docker rm $(docker ps -a -q)
# initialise and deploy containers
make install
make deploy
# initialise database container with default configuration
cd db_analysis/
python p1.py --type init
python p1.py --type status
# run analysis file
python p1.py --type write --nreqs 10000
python p1.py --type read --nreqs 10000
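The statistics reported below are obtained by firing --nreqs requests at the load balancer and measuring the elapsed wall-clock time. A minimal sketch of such a timing loop is given here for reference; it assumes a hypothetical /write endpoint on localhost:5000 and an illustrative payload, and is not the actual implementation of p1.py.

```python
# Hypothetical sketch of the measurement loop; the /write endpoint, port and
# payload layout are assumptions, not necessarily what p1.py actually sends.
import time

import requests


def run_write_benchmark(nreqs=10000, url="http://localhost:5000/write"):
    success, failure = 0, 0
    start = time.time()
    for i in range(nreqs):
        try:
            resp = requests.post(url,
                                 json={"data": [{"id": i, "value": f"entry_{i}"}]},
                                 timeout=5)
            success += resp.status_code == 200
            failure += resp.status_code != 200
        except requests.RequestException:
            failure += 1
    elapsed = time.time() - start
    print(f"No of successful requests: {success}/{nreqs}")
    print(f"No of failed requests: {failure}/{nreqs}")
    print(f"Time taken to send {nreqs} requests: {elapsed} seconds")


if __name__ == "__main__":
    run_write_benchmark()
```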
Leveraging the default configuration, i.e.,
NUM_SERVERS: 6
NUM_SHARDS: 4
NUM_REPLICAS: 3
We obtain the following statistics for 10000 write and read requests respectively:
- Request Type: write
No of successful requests: 10000/10000
No of failed requests: 0/10000
Time taken to send 10000 requests: 171.6880922317505 seconds
- Request Type: read
No of successful requests: 10000/10000
No of failed requests: 0/10000
Time taken to send 10000 requests: 44.47232532501221 seconds
# initialise database container with specific configuration
cd db_analysis/
python p2.py --type init # initialise database
python p2.py --type status
# run analysis file
python p2.py --type write --nreqs 10000
python p2.py --type read --nreqs 10000
On setting NUM_REPLICAS=6, keeping the number of servers and shards fixed, i.e.,
NUM_SERVERS: 6
NUM_SHARDS: 4
NUM_REPLICAS: 6
We obtain the following statistics for 10000 write and read requests respectively:
- Request Type: write
No of successful requests: 10000/10000
No of failed requests: 0/10000
Time taken to send 10000 requests: 571.5665924549103 seconds
- Request Type: read
No of successful requests: 9995/10000
No of failed requests: 5/10000
Time taken to send 10000 requests: 109.68647050857544 seconds
The increased latency for write and read requests can be attributed to the increased number of replicas for each shard. This implies that both write and read requests need to access all replicas of a shard to maintain consistency, increasing the time taken to handle requests.
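To make the trade-off concrete, here is a simplified sketch of the write path under this scheme: the request is acknowledged only after every replica of the shard has applied it, so the time per write grows with NUM_REPLICAS. The endpoint, port, and payload names are assumptions for illustration only.

```python
# Hypothetical illustration of the replication fan-out discussed above; the
# /write endpoint, port and payload fields are assumptions, not the repo's API.
import requests


def replicated_write(shard_id, entry, replicas, timeout=5):
    """Apply a write to every replica of the shard before acknowledging it."""
    for server in replicas:               # cost grows linearly with NUM_REPLICAS
        resp = requests.post(f"http://{server}:5000/write",
                             json={"shard": shard_id, "data": [entry]},
                             timeout=timeout)
        resp.raise_for_status()           # abort if any replica fails to apply
    return {"status": "committed", "replicas": len(replicas)}
```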
# initialise database container with specific configuration
cd db_analysis/
python p3.py --type init
python p3.py --type status
# run analysis file
python p3.py --type write --nreqs 10000
python p3.py --type read --nreqs 10000
The following configuration for the database server, i.e.,
NUM_SERVERS: 10
NUM_SHARDS: 6
NUM_REPLICAS: 8
yields the following statistics for 10000 write and read requests respectively:
- Request Type: write
No of successful requests: 10000/10000
No of failed requests: 0/10000
Time taken to send 10000 requests: 758.3099572658539 seconds
- Request Type: read
No of successful requests: 9999/10000
No of failed requests: 1/10000
Time taken to send 10000 requests: 110.17270064353943 seconds
In this case, there is a noticeable, albeit slight, increase in the latency for write and read requests compared to Part-2.
Why isn't the increase in latency for read requests as prominent as in Part-2? Because an increase in the number of servers leads to a better distribution of read requests: incoming requests face less contention while accessing shard replicas spread across a larger number of servers. This leads to only a slight increase in the latency for handling read requests, as shown above.
For write requests, the increase in the number of replicas that must be updated outweighs the benefit of reduced contention from having more servers, leading to a marked increase in write latency.
# initialise database container with specific configuration
cd db_analysis/
python p1.py --type init # initialise database
python p1.py --type status # check status
# write/read requests
python p1.py --type write --nreqs 100
python p1.py --type read --nreqs 100
# update/delete requests
python p1.py --type update # updates a random db entry
python p1.py --type delete # deletes a random entry from all involved replicas
# add/remove servers
python p1.py --type add # adds list of servers mentioned in script
python p1.py --type rm # removes list of servers mentioned in script
The initial server configuration consists of 6 servers (Server0 to Server5), as shown in Fig.2.
Now, Server1 is the primary server for shards sh1 and sh3, so we kill Server1.
# list all active server containers
docker ps
# find the <container_id> of Server1 and stop the container
docker stop <container_id>
# re-check the server configuration to see whether the stopped container has respawned
docker ps
Fig.3: Configuration just after stopping `Server1`
We can observe that after the crash of Server1, new primary servers have been elected for sh1 (Server0 as the new primary) and sh3 (Server5 as the new primary), and after respawning, Server1 has become one of the secondary servers for both sh1 and sh3.
Load balancer's logs showing the killing and restarting of Server1
Shard Manager's logs showing the killing and restarting of Server1
docker ps output indicating the respawned Server1