update
mikeizbicki committed Apr 4, 2024
1 parent c21e8a4 commit c769cf2
Showing 3 changed files with 35 additions and 25 deletions.
48 changes: 31 additions & 17 deletions README.md
@@ -120,6 +120,8 @@ The resulting code is much more complicated than the code you wrote for your `lo
Instead, I am providing it for you.
You should see that the test cases for `test_normalizedbatch_sequential` are already passing.
#### Verifying Correctness
Your first task is to make the other two sequential tests pass.
Do this by:
1. Copying the `load_tweets.py` file from your `twitter_postgres` homework into this repo.
@@ -128,20 +130,19 @@ Do this by:
You should be able to use the same lines of code as you used in the `load_tweets.sh` file from the previous assignment.
Once you've done those two steps, verify that the test cases pass by uploading to github and getting green badges.
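If you want to check locally before pushing, you can also run the repo's `run_tests.sh` script (referenced again below); the exact invocation may differ, but mirroring how the other scripts in this README are run:
```
$ sh run_tests.sh
```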
Bring up a fresh version of your containers by running the commands:
#### Measuring Runtimes
Once you've verified that the test cases pass, run the following commands on the lambda server to load the data.
```
$ docker-compose down
$ docker volume prune
$ docker-compose up -d
```
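Optionally, before loading any data, you can confirm that all three database containers came up; this is plain Docker, nothing specific to this repo:
```
$ docker ps --format '{{.Names}}\t{{.Ports}}'
```
You should see one line per service in `docker-compose.yml`, each with its host port mapped to container port 5432.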
Run the following command to insert data into each of the containers sequentially.
(Note that you will have to modify the ports to match the ports of your `docker-compose.yml` file.)
```
$ sh load_tweets_sequential.sh
```
Record the elapsed time in the table in the Submission section below.
You should notice that batching significantly improves insertion performance speed.
The `load_tweets_sequential.sh` file reports the runtime of loading data into each of the three databases.
Record the runtime in the table in the Submission section below.
You should notice that batching significantly improves insertion performance,
but loading the denormalized database is still the fastest.
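For orientation, a sequential driver script of this shape looks roughly like the sketch below; this is only an illustration (the actual `load_tweets_sequential.sh` provided in this repo is the reference, and the loaders for the two normalized databases are assumptions):
```
# rough sketch only -- the real load_tweets_sequential.sh is authoritative
files=$(find data/*)

echo 'load pg_denormalized'
time for file in $files; do
    ./load_denormalized.sh "$file"
done

# ...then repeat the same timed loop for pg_normalized and pg_normalized_batch,
# calling whatever per-file loader each of those databases uses.
```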
> **NOTE:**
> The `time` command outputs 3 times:
@@ -169,8 +170,8 @@ You should notice that batching significantly improves insertion performance spe
There are 10 files in the `/data` folder of this repo.
If we process each file in parallel, then we should get a theoretical 10x speedup.
The file `load_tweets_parallel.sh` will insert the data in parallel and get nearly a 10-fold speedup,
but there are several changes that you'll have to make first to get this to work.
The file `load_tweets_parallel.sh` will insert the data in parallel, and if you implement it correctly you will observe this speedup.
There are several changes that you'll have to make to your code to get this to work.
#### Denormalized Data
@@ -190,21 +191,33 @@ Complete the following steps:
2. Call the `load_denormalized.sh` file using the `parallel` program from within the `load_tweets_parallel.sh` script.
You know you've completed this step correctly if the `run_tests.sh` script passes (locally) and the test badge turns green (on the lambda server).
The `parallel` program takes a single parameter as input: the command that it will execute.
For each line that it receives on stdin, it passes that line as an argument to that command.
All of these commands will be run in parallel,
and the `parallel` program will terminate once all of the individual commands terminate.
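As a toy illustration of this stdin-to-arguments behavior (not part of the assignment):
```
$ printf 'a\nb\nc\n' | parallel echo processing
processing a
processing b
processing c
```
Because the three jobs run concurrently, the output lines are not guaranteed to appear in this order.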
My solution looks like
```
time echo "$files" | parallel ./load_denormalized.sh
```
Notice that I also use the `time` command to time the insertion operation.
One of the advantages of using the `parallel` command over the `&` operator we used previously is that it is easier to time your parallel computations.
You know you've completed this step correctly if the `run_tests.sh` script passes (locally) and the test badge turns green (on the lambda server).
#### Normalized Data (unbatched)
Parallel loading of the unbatched data should "just work."
Modify the `load_tweets_parallel.sh` file to load the `pg_normalized` database in parallel following the same procedure above.
Parallel loading of the unbatched data will probably "just work."
The code in the `load_tweets.py` file is structured so that you never run into deadlocks.
Unfortunately, the code is extremely slow,
so even when run in parallel it is still slower than the batched code.
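Concretely, the `pg_normalized` stanza of `load_tweets_parallel.sh` can follow the same pattern as the denormalized one; a minimal sketch, assuming a per-file `load_normalized.sh` wrapper analogous to `load_denormalized.sh` (adjust the name to whatever your loader is actually called):
```
echo 'load pg_normalized'
time echo "$files" | parallel ./load_normalized.sh
```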
> **NOTE:**
> The `tests_normalizedbatch_parallel` test is currently failing because the `load_tweets_parallel.sh` script is not yet implemented.
> After you use GNU parallel to implement this script, everything should pass.
#### Normalized Data (batched)
Modify the `load_tweets_parallel.sh` file to load the `pg_normalized_batch` database in parallel following the same procedure above.
Parallel loading of the batched data will fail due to deadlocks.
These deadlocks will cause some of your parallel loading processes to crash.
So not all of the data will get inserted,
@@ -247,8 +260,9 @@ Once you remove these constraints, this will cause downstream errors in both the
> In a production database where you are responsible for the consistency of your data,
> you would never want to remove these constraints.
> In our case, however, we're not responsible for the consistency of the data.
> The data comes straight from Twitter, and so Twitter is responsible for the data consistency.
> We want to represent the data exactly how Twitter represents it "upstream",
> and so Twitter is responsible for ensuring the data's consistency;
> removing the UNIQUE/FOREIGN KEY constraints is therefore reasonable.
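As a rough illustration of what dropping such a constraint involves (the table and constraint names here are made up; in this assignment you would presumably edit the SQL schema files under `services/pg_normalized_batch/` rather than alter a live database):
```
$ psql postgresql://postgres:pass@localhost:<port>/ -c 'ALTER TABLE urls DROP CONSTRAINT IF EXISTS urls_url_key;'
```
In psql, `\d tablename` lists the indexes and constraints that your schema actually defines.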
#### Results
6 changes: 3 additions & 3 deletions docker-compose.yml
@@ -11,7 +11,7 @@ services:
- POSTGRES_PASSWORD=pass
- PGUSER=postgres
ports:
- 15433:5432
- 1:5432

pg_normalized:
build: services/pg_normalized
@@ -23,7 +23,7 @@
- POSTGRES_PASSWORD=pass
- PGUSER=postgres
ports:
- 25433:5432
- 2:5432

pg_normalized_batch:
build: services/pg_normalized_batch
@@ -35,7 +35,7 @@
- POSTGRES_PASSWORD=pass
- PGUSER=postgres
ports:
- 35433:5432
- 3:5432

volumes:
pg_normalized:
6 changes: 1 addition & 5 deletions load_tweets_parallel.sh
@@ -5,11 +5,7 @@ files=$(find data/*)
echo '================================================================================'
echo 'load pg_denormalized'
echo '================================================================================'
-#time for file in $files; do
-# ./load_denormalized.sh "$file"
-#unzip -p "$file" | sed 's/\\u0000//g' | psql postgresql://postgres:pass@localhost:15433/ -c "COPY tweets_jsonb (data) FROM STDIN csv quote e'\x01' delimiter e'\x02';"
-#done
-time echo "$files" | parallel ./load_denormalized.sh
+# FIXME: implement this with GNU parallel

echo '================================================================================'
echo 'load pg_normalized'
