initial release
peter-emil committed Feb 27, 2021
0 parents commit 675ab2e
Showing 11 changed files with 508 additions and 0 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
.idea/
dist/
storagebox.egg-info/
__pycache__/
*.pyc
116 changes: 116 additions & 0 deletions README.md
@@ -0,0 +1,116 @@
# StorageBox

StorageBox is a Python module that you can use to de-duplicate data
among distributed components.

For example, let's assume you run a movie store. You have
voucher codes you'd like to distribute to the first 30 users who press
a button. You are concerned that some users might try to get more
than 1 voucher code by exploiting race conditions (maybe clicking the
button from multiple machines at the same time).



Here is what StorageBox allows you to do:
```
# Setup Code
import storagebox
item_repo = storagebox.ItemBankDynamoDbRepository(table_name="voucher_codes")
deduplication_repo = storagebox.DeduplicationDynamoDbRepository(table_name="storage_box_deduplication_table")
# You can add items to the item repo (for example add list of voucher codes)
item_repo.batch_add_items(voucher_codes)
# You can then assign voucher codes to User IDs
deduplicator = storagebox.Deduplicator(item_repo=item_repo, deduplication_repo=deduplication_repo)
voucher_code = deduplicator.fetch_item_for_deduplication_id(
    deduplication_id=user_id
)
```
And that's it!

As long as you use a suitable `deduplication_id`, all race conditions
and data hazards will be taken care of for you. Examples of suitable
candidates for `deduplication_id` can be User ID, IP Address,
Email Address or anything that works best with your application.


## Prerequisites
To use StorageBox, you need the following already set up.

- An ItemBank DynamoDB table. The current implementation requires the table to have 1 column
called `item`. This is where you will store items (in the case of the example:
voucher codes).
- A Deduplication DynamoDB table. `StorageBox` uses it to achieve idempotency,
that is, to make sure that if you call `fetch_item_for_deduplication_id` multiple times with
the same `deduplication_id`, you will always get the same result.
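
As a rough sketch of what those two tables could look like, the helper below builds the arguments for boto3's `create_table` call. It assumes `item` is the partition key of the ItemBank table and `deduplication_id` is the partition key of the Deduplication table, both stored as strings, with on-demand billing. The function names `table_definition` and `create_tables` are hypothetical and not part of StorageBox.

```python
def table_definition(table_name: str, key_name: str) -> dict:
    """Build keyword arguments for the DynamoDB create_table call.

    Assumption: a single string partition key and on-demand billing.
    """
    return {
        "TableName": table_name,
        "KeySchema": [{"AttributeName": key_name, "KeyType": "HASH"}],
        "AttributeDefinitions": [
            {"AttributeName": key_name, "AttributeType": "S"}
        ],
        "BillingMode": "PAY_PER_REQUEST",
    }


def create_tables(item_table: str, dedup_table: str) -> None:
    """Create both tables; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the definitions can be inspected offline

    client = boto3.client("dynamodb")
    # Assumed key names: `item` for the bank, `deduplication_id` for lookups.
    client.create_table(**table_definition(item_table, "item"))
    client.create_table(**table_definition(dedup_table, "deduplication_id"))
```

Calling `create_tables("voucher_codes", "storage_box_deduplication_table")` once would prepare the tables used in the first example. Adjust billing mode and key names if your setup differs.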

If you prefer a backend other than DynamoDB, you can implement your own `ItemBankRepository`
and/or `DeduplicationRepository` for it. Your implementation has to subclass
the corresponding abstract class. If you do that, contributions are welcome!
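
To illustrate the methods such a backend needs, here is a minimal in-memory sketch. It is not part of StorageBox: the classes `InMemoryItemBank` and `InMemoryDeduplication`, the exception `AlreadyAssigned`, and the helper `fetch_item` are all hypothetical names. The helper mirrors the flow of `Deduplicator.fetch_item_for_deduplication_id`. Note that the shipped `Deduplicator` catches botocore's `ClientError`, so a real custom backend would likely need to signal "already assigned" in a compatible way.

```python
import threading
from typing import Dict, List, Optional


class AlreadyAssigned(Exception):
    """Hypothetical stand-in for a failed conditional put."""


class InMemoryItemBank:
    def __init__(self) -> None:
        self._items: List[str] = []
        self._lock = threading.Lock()

    def batch_add_items(self, items: List[str]) -> None:
        with self._lock:
            self._items.extend(items)

    def add_item_to_bank(self, item_string: str) -> None:
        with self._lock:
            self._items.append(item_string)

    def get_item_from_bank(self) -> Optional[str]:
        # Removes and returns one item, or None when the bank is empty.
        with self._lock:
            return self._items.pop() if self._items else None


class InMemoryDeduplication:
    def __init__(self) -> None:
        self._assigned: Dict[str, str] = {}
        self._lock = threading.Lock()

    def get_value_for_deduplication_id(self, deduplication_id: str) -> Optional[str]:
        with self._lock:
            return self._assigned.get(deduplication_id)

    def put_deduplication_id(self, deduplication_id: str, item_string: str) -> None:
        # Only succeeds if no entry exists yet for this ID.
        with self._lock:
            if deduplication_id in self._assigned:
                raise AlreadyAssigned(deduplication_id)
            self._assigned[deduplication_id] = item_string


def fetch_item(bank, dedup, deduplication_id: str) -> Optional[str]:
    """Take an item, try to claim it, and hand it back if the ID already has one."""
    item = bank.get_item_from_bank()
    if item is None:
        return None
    try:
        dedup.put_deduplication_id(deduplication_id, item)
        return item
    except AlreadyAssigned:
        existing = dedup.get_value_for_deduplication_id(deduplication_id)
        if existing != item:
            bank.add_item_to_bank(item)  # return the unused item to the bank
        return existing
```

Calling `fetch_item` twice with the same ID yields the same item, and the second item that was taken is put back, so the bank only shrinks by one.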


## Installation
```
pip install storagebox
```


## Other Example Use Cases
Hosting a big event with only 10,300 seats that will be booked within the first few minutes?
```
# Before the event, add 10,300 numbers to the bank
item_repo.batch_add_items([str(i) for i in range(10300)])
# From your webserver
assignment_number = deduplicator.fetch_item_for_deduplication_id(
    deduplication_id=email
)
```

Are you an influencer and only have 5000 people to give special referral links to? (First 5000
people who click the link in the description get a free something!)
```
# Before you post your content
item_repo.batch_add_items(referral_links_list)
# From your webserver
referral_link = deduplicator.fetch_item_for_deduplication_id(
    deduplication_id=ip_address
)
```

Are you organizing online classes for your 150 students? You're willing to host 3 classes (50 students
each), but you'd like to be sure that no student attends more than 1 class?
```
# Before you host your classes
class_1_codes = storagebox.ItemBankDynamoDbRepository(table_name="class_1_codes")
class_2_codes = storagebox.ItemBankDynamoDbRepository(table_name="class_2_codes")
class_3_codes = storagebox.ItemBankDynamoDbRepository(table_name="class_3_codes")
deduplication_repo = storagebox.DeduplicationDynamoDbRepository(table_name="myonline_classes_deduplication_table")
class_1_codes.batch_add_items([str(i) for i in range(0, 50)])
class_2_codes.batch_add_items([str(i) for i in range(50, 100)])
class_3_codes.batch_add_items([str(i) for i in range(100, 150)])
# From your webserver
deduplicators = {
    'class_1': storagebox.Deduplicator(item_repo=class_1_codes, deduplication_repo=deduplication_repo),
    'class_2': storagebox.Deduplicator(item_repo=class_2_codes, deduplication_repo=deduplication_repo),
    'class_3': storagebox.Deduplicator(item_repo=class_3_codes, deduplication_repo=deduplication_repo),
}
deduplicators[requested_class].fetch_item_for_deduplication_id(
    deduplication_id=student_id
)
```

# How It Works
A blog post explaining how `storagebox` works is available [here](https://blog.peteremil.com/2021/02/realtime-distributed-deduplication-how.html).
114 changes: 114 additions & 0 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

16 changes: 16 additions & 0 deletions pyproject.toml
@@ -0,0 +1,16 @@
[tool.poetry]
name = "storagebox"
version = "1.0.5"
description = "A reusable, idempotent, and exactly once deduplication API"
authors = ["Peter Emil Halim <peter@peteremil.com>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.8"
boto3 = "^1.16.63"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"
3 changes: 3 additions & 0 deletions storagebox/__init__.py
@@ -0,0 +1,3 @@
from storagebox.repository.deduplication import DeduplicationDynamoDbRepository
from storagebox.repository.item_bank import ItemBankDynamoDbRepository
from storagebox.api import Deduplicator
44 changes: 44 additions & 0 deletions storagebox/api.py
@@ -0,0 +1,44 @@
import logging
import typing
from botocore.exceptions import ClientError
from storagebox import settings
from storagebox import repository


log = logging.getLogger('storageBox')
log.setLevel(settings.DEFAULT_LOGGING_LEVEL)


class Deduplicator:
    def __init__(self, item_repo, deduplication_repo):
        self.item_repo = item_repo
        self.deduplication_repo = deduplication_repo

    def fetch_item_for_deduplication_id(self, deduplication_id):
        item_string = self.item_repo.get_item_from_bank()
        if item_string is None:
            return item_string
        try:
            self.deduplication_repo.put_deduplication_id(
                deduplication_id=deduplication_id,
                item_string=item_string
            )
            return item_string
        except ClientError:
            log.debug("deduplication_id is already assigned, will check if I"
                      " should return item_string %s to the bank", item_string)
            existing_item_string = self.deduplication_repo.get_value_for_deduplication_id(
                deduplication_id=deduplication_id
            )
            if existing_item_string != item_string:
                self.item_repo.add_item_to_bank(
                    item_string=item_string
                )
                log.debug("Item %s was returned", item_string)
                return existing_item_string
            return item_string

    def add_items_to_bank(self, items: typing.List[str]):
        self.item_repo.batch_add_items(
            items=items
        )
2 changes: 2 additions & 0 deletions storagebox/repository/__init__.py
@@ -0,0 +1,2 @@
from storagebox.repository.deduplication import DeduplicationDynamoDbRepository
from storagebox.repository.item_bank import ItemBankDynamoDbRepository
40 changes: 40 additions & 0 deletions storagebox/repository/deduplication.py
@@ -0,0 +1,40 @@
import abc
import logging
from storagebox.repository.dynamodb import DynamoDBBasedRepository


log = logging.getLogger('storageBox')


class DeduplicationRepository(abc.ABC):
    @abc.abstractmethod
    def get_value_for_deduplication_id(self, deduplication_id: str):
        raise NotImplementedError

    @abc.abstractmethod
    def put_deduplication_id(self, deduplication_id: str, item_string: str):
        raise NotImplementedError


class DeduplicationDynamoDbRepository(DeduplicationRepository, DynamoDBBasedRepository):
    def get_value_for_deduplication_id(self, deduplication_id: str):
        response = self.table.get_item(
            Key={
                'deduplication_id': str(deduplication_id)
            }
        )
        return response.get('Item', {}).get('item_string')  # Returns None if not found

    def put_deduplication_id(self, deduplication_id: str, item_string: str):
        obj = {
            'deduplication_id': deduplication_id,
            'item_string': item_string
        }
        self.table.put_item(  # should only be put if there is no existing entry
            Item=obj,
            Expected={
                'deduplication_id': {
                    'Exists': False
                }
            }
        )
18 changes: 18 additions & 0 deletions storagebox/repository/dynamodb.py
@@ -0,0 +1,18 @@
import boto3


class DynamoDBBasedRepository:
    def __init__(self, table_name):
        self.table_name = table_name
        self.client = boto3.client('dynamodb')
        if not self.table_alreaedy_exists(table_name=self.table_name):
            raise RuntimeError(f"DynamoDB table {self.table_name} does not exist")
        dynamodb = boto3.resource('dynamodb')
        self.table = dynamodb.Table(self.table_name)

    def table_alreaedy_exists(self, table_name) -> bool:
        try:
            self.client.describe_table(TableName=table_name)
            return True
        except self.client.exceptions.ResourceNotFoundException:
            return False