docs: 📝 How to extract a batch of images from AWS (#10830)
After some discussions during the OFF Days, it became clear that extracting a batch of images from the OFF AWS S3 bucket was complex for users. We provide the code and explanations in the documentation.
jeremyarancio authored Sep 27, 2024
1 parent 2f93877 commit a55e995
Showing 1 changed file with 20 additions and 0 deletions.
20 changes: 20 additions & 0 deletions docs/api/aws-images-dataset.md
@@ -43,3 +43,23 @@ JSON) before downloading them. For example, to keep only 400px versions of all
images:

`zcat data_keys.gz | grep '.400.jpg'`

If you want to download a random sample of images, you can use the code snippet below:

```bash
# Extract a random sample of images from AWS
n=1000
images_dir="images"
bucket_url="https://openfoodfacts-images.s3.eu-west-3.amazonaws.com/"

mkdir -p "$images_dir"

zcat data_keys.gz |
grep '\.jpg$' | # Keep only JPEG keys
shuf -n "$n" | # Take a random sample of n keys
sed "s|^|$bucket_url|" | # Prepend the bucket URL: "https://openfoodfacts-images.s3.eu-west-3.amazonaws.com/data/376/005/047/0099/1.jpg"
while read -r url; do
    # Build a flat filename such as 376_005_047_0099_1.jpg
    filename=$(echo "$url" | sed "s|$bucket_url||" | tr '/' '_' | sed 's|data_||')
    wget -O "$images_dir/$filename" "$url"
done
```

You can further refine the image extraction process by applying additional filters like `last_editor` or `last_edited_date`. This can be done by combining the Open Food Facts database [dump](https://world.openfoodfacts.org/data) with **DuckDB** and the `data_keys.gz` file. For detailed instructions on using DuckDB to efficiently process the OFF database, refer to our [blog post](https://medium.com/@jeremyarancio/duckdb-open-food-facts-the-largest-open-food-database-in-the-palm-of-your-hand-0d4ab30d0701).
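
For instance, here is one possible way to combine the two, as a minimal sketch: select the barcodes of interest with DuckDB, convert them into the key prefixes used in `data_keys.gz`, and filter the keys against that list. The dump file name and the `code` / `last_editor` column names below are assumptions and may differ depending on which export you download.

```bash
# Extract image keys for products last edited by a given user
editor="some-user" # Hypothetical user name

# 1. Select the barcodes of interest from the dump with DuckDB
#    (file name and column names are assumptions; adjust CSV options,
#    e.g. delimiter or quoting, to the export you downloaded)
duckdb -csv -noheader :memory: "
  SELECT code
  FROM read_csv_auto('en.openfoodfacts.org.products.csv')
  WHERE last_editor = '$editor'
" > codes.txt

# 2. Convert each barcode into the key prefix used in data_keys.gz:
#    barcodes longer than 8 digits are split as 3/3/3/rest,
#    e.g. 3760050470099 -> data/376/005/047/0099
awk 'length($0) > 8  { printf "data/%s/%s/%s/%s\n", substr($0,1,3), substr($0,4,3), substr($0,7,3), substr($0,10) }
     length($0) <= 8 { printf "data/%s\n", $0 }' codes.txt > prefixes.txt

# 3. Keep only the 400px image keys matching those barcodes
zcat data_keys.gz | grep '\.400\.jpg$' | grep -F -f prefixes.txt > filtered_keys.txt
```

The resulting `filtered_keys.txt` can then be prefixed with the bucket URL and fed to the download loop shown above.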
