deploy: 0667dd6
Hagellach37 committed Aug 26, 2024
1 parent e9af1d1 commit 439666c
Showing 26 changed files with 766 additions and 130 deletions.
7 changes: 4 additions & 3 deletions README.html
Original file line number Diff line number Diff line change
Expand Up @@ -181,11 +181,12 @@
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Getting Started</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="book/00_motivation.html">Why you should be excited about this workshop</a></li>
<li class="toctree-l1"><a class="reference internal" href="book/00_motivation.html">Why should you be excited about <em>ohsome-data-insights</em>?</a></li>
<li class="toctree-l1"><a class="reference internal" href="book/00_MinIO_Object_Store.html">Connect to MinIO Object Store</a></li>

<li class="toctree-l1"><a class="reference internal" href="book/00_Iceberg_Catalog.html">Connect to Apache Iceberg</a></li>
<li class="toctree-l1"><a class="reference internal" href="book/00_data_structure.html">Data Structure</a></li>
<li class="toctree-l1"><a class="reference internal" href="book/00_partitioning_and_sorting.html">Partitioning and Sorting</a></li>
<li class="toctree-l1"><a class="reference internal" href="book/00_MinIO_Object_Store.html">DuckDB: Connect to MinIO Object Store</a></li>
<li class="toctree-l1"><a class="reference internal" href="book/00_Iceberg_Catalog.html">PyIceberg: Connect to Iceberg Catalog</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Data Extraction</span></p>
<ul class="nav bd-sidenav">
Expand Down
Binary file added _images/flexibility_clients.png
Binary file added _images/minio_access_key_1.png
Binary file added _images/minio_access_key_2.png
Binary file added _images/minio_login.png
222 changes: 219 additions & 3 deletions _sources/book/00_Iceberg_Catalog.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -5,16 +5,232 @@
"id": "0aabe61e-c6fe-49e5-babf-3108f7aef591",
"metadata": {},
"source": [
"# PyIceberg: Connect to Iceberg Catalog"
"# Connect to Apache Iceberg\n",
"\n",
"## What is Apache Iceberg?\n",
"* Iceberg combines the MinIO object store with the features you are used to from a database"
]
},
{
"cell_type": "markdown",
"id": "dcb06ce8-3974-436b-9c1b-f994e2b4093f",
"metadata": {},
"source": [
"## Connect to Apache Iceberg Catalog via PyIceberg"
]
},
{
"cell_type": "markdown",
"id": "22b5dd18-6672-48f8-bddc-1f0cae1962bd",
"metadata": {},
"source": [
"Adjust the code below and add your MinIO access keys in there:"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "b79af493-a3cd-4e08-9b3b-4a5fa7e19329",
"metadata": {},
"outputs": [],
"source": []
"source": [
"import os\n",
"\n",
"s3_user = os.environ[\"S3_ACCESS_KEY_ID\"] # add your user here\n",
"s3_password = os.environ[\"S3_SECRET_ACCESS_KEY\"] # add your password here"
]
},
{
"cell_type": "markdown",
"id": "6609c52a-a0ef-46b9-b021-a09a6cd80167",
"metadata": {},
"source": [
"Run this line if you haven't installed the Python libraries yet, e.g. when you are running this in Google Colab."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d65fa4ba-39b7-40fa-9dad-f8d2832e3aba",
"metadata": {},
"outputs": [],
"source": [
"!pip install \"pyiceberg[s3fs,duckdb,sql-sqlite,pyarrow]\""
]
},
{
"cell_type": "markdown",
"id": "d2e411c4-3aa9-41e4-8dc1-65932496caae",
"metadata": {},
"source": [
"Set up the connection to the Iceberg catalog."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "120d9b6c-5786-4fd3-b8f1-e327d97c4a01",
"metadata": {},
"outputs": [],
"source": [
"from pyiceberg.catalog.rest import RestCatalog\n",
"\n",
"catalog = RestCatalog(\n",
" name=\"default\",\n",
" **{\n",
" \"uri\": \"https://sotm2024.iceberg.ohsome.org\",\n",
" \"s3.endpoint\": \"https://sotm2024.minio.heigit.org\",\n",
" \"py-io-impl\": \"pyiceberg.io.pyarrow.PyArrowFileIO\",\n",
" \"s3.access-key-id\": s3_user,\n",
" \"s3.secret-access-key\": s3_password,\n",
" \"s3.region\": \"eu-central-1\"\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"id": "2bb7b974-13ac-46ea-89e3-fe5f1192193d",
"metadata": {},
"source": [
"## Get an overview\n",
"Find out what data exists and where to find it.\n",
"Tables in Iceberg are organized in groups called NAMESPACES. \n",
"1. List all existing namespaces\n",
"2. List the tables that exist in a namespace\n",
"3. Get some table metadata"
]
},
{
"cell_type": "markdown",
"id": "3f3b44bf-9762-4b01-9ab4-4a0962acc4aa",
"metadata": {},
"source": [
"Currently this catalog consists of only a single namespace. You can think of a namespace as similar to a `schema` in PostgreSQL or other databases."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4eeb0dda-4cf9-48bd-8f56-42e9605d6fd7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('geo_sort',)]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"catalog.list_namespaces()"
]
},
{
"cell_type": "markdown",
"id": "0ea360c7-b3c9-41e8-9388-d17038706e4a",
"metadata": {},
"source": [
"In this step we list which tables are available in this namespace."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "041f29b6-e207-4004-a97c-053173fbf735",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('geo_sort', 'benni_test_heidelberg'),\n",
" ('geo_sort', 'contributions'),\n",
" ('geo_sort', 'contributions_germany')]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"catalog.list_tables('geo_sort')"
]
},
{
"cell_type": "markdown",
"id": "da820525-03fd-4241-ba2c-7e4e5f4e5b47",
"metadata": {},
"source": [
"Let's inspect a single Iceberg table and list all columns / attributes from this table. (We will explain these in detail again on the next page.)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "7232059f-142b-4e4d-9b9a-5df7701bc64e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"contributions(\n",
" 1: user_id: optional int,\n",
" 2: valid_from: optional timestamp,\n",
" 3: valid_to: optional timestamp,\n",
" 4: osm_type: optional string,\n",
" 5: osm_id: optional string,\n",
" 6: osm_version: optional int,\n",
" 7: contrib_type: optional string,\n",
" 8: members: optional list<struct<32: type: optional string, 33: id: optional long, 34: role: optional string, 35: geometry: optional binary>>,\n",
" 9: status: optional string,\n",
" 10: changeset: optional struct<36: id: optional long, 37: timestamp: optional timestamp, 38: tags: optional map<string, string>, 39: hashtags: optional list<string>, 40: editor: optional string>,\n",
" 11: tags: optional map<string, string>,\n",
" 12: tags_before: optional map<string, string>,\n",
" 13: map_features: optional struct<48: aerialway: optional boolean, 49: aeroway: optional boolean, 50: amenity: optional boolean, 51: barrier: optional boolean, 52: boundary: optional boolean, 53: building: optional boolean, 54: craft: optional boolean, 55: emergency: optional boolean, 56: geological: optional boolean, 57: healthcare: optional boolean, 58: highway: optional boolean, 59: historic: optional boolean, 60: landuse: optional boolean, 61: leisure: optional boolean, 62: man_made: optional boolean, 63: military: optional boolean, 64: natural: optional boolean, 65: office: optional boolean, 66: place: optional boolean, 67: power: optional boolean, 68: public_transport: optional boolean, 69: railway: optional boolean, 70: route: optional boolean, 71: shop: optional boolean, 72: sport: optional boolean, 73: telecom: optional boolean, 74: water: optional boolean, 75: waterway: optional boolean>,\n",
" 14: area: optional long,\n",
" 15: area_delta: optional long,\n",
" 16: length: optional long,\n",
" 17: length_delta: optional long,\n",
" 18: xzcode: optional struct<76: level: optional int, 77: code: optional long>,\n",
" 19: country_iso_a3: optional list<string>,\n",
" 20: bbox: optional struct<79: xmin: optional double, 80: ymin: optional double, 81: xmax: optional double, 82: ymax: optional double>,\n",
" 21: xmin: optional double,\n",
" 22: xmax: optional double,\n",
" 23: ymin: optional double,\n",
" 24: ymax: optional double,\n",
" 25: centroid: optional struct<83: x: optional double, 84: y: optional double>,\n",
" 26: quadkey_z10: optional string,\n",
" 27: h3_r5: optional long,\n",
" 28: geometry_type: optional string,\n",
" 29: geometry_valid: optional boolean,\n",
" 30: geometry: optional string\n",
"),\n",
"partition by: [status, geometry_type],\n",
"sort order: [],\n",
"snapshot: Operation.APPEND: id=1440840715635230871, schema_id=0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"iceberg_table = catalog.load_table(('geo_sort', 'contributions'))\n",
"display(iceberg_table)"
]
},
{
"cell_type": "markdown",
"id": "845ce15e-a61e-4e2f-be53-3bc8fa49c5b3",
"metadata": {},
"source": [
"Let's dive deeper now into the data structure and what you can expect for your data analysis."
]
}
],
"metadata": {
Expand Down
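The three overview steps in the notebook above (list namespaces, list the tables in a namespace, then inspect a table) can be combined into one small helper. The following is a minimal sketch, not part of the commit: `catalog_overview` and `DemoCatalog` are hypothetical names, and `DemoCatalog` only replays the outputs shown in the notebook so that the example runs without credentials. It assumes the catalog object exposes `list_namespaces()` and `list_tables(namespace)` returning tuples, as the notebook outputs suggest.

```python
from typing import Dict, List, Sequence, Tuple, Union

Identifier = Tuple[str, ...]


def catalog_overview(catalog) -> Dict[str, List[str]]:
    """Map each namespace name to the names of the tables it contains."""
    overview: Dict[str, List[str]] = {}
    for namespace in catalog.list_namespaces():
        # Namespaces come back as tuples such as ('geo_sort',).
        tables = catalog.list_tables(namespace)
        # Table identifiers are tuples too; the last element is the table name.
        overview[".".join(namespace)] = [identifier[-1] for identifier in tables]
    return overview


class DemoCatalog:
    """Hypothetical stand-in that replays the outputs shown in the notebook."""

    def list_namespaces(self) -> Sequence[Identifier]:
        return [("geo_sort",)]

    def list_tables(self, namespace: Union[str, Identifier]) -> Sequence[Identifier]:
        return [
            ("geo_sort", "benni_test_heidelberg"),
            ("geo_sort", "contributions"),
            ("geo_sort", "contributions_germany"),
        ]


demo_overview = catalog_overview(DemoCatalog())
print(demo_overview)
```

With a real connection, passing the `catalog` object created in the notebook instead of `DemoCatalog()` should give the same kind of dictionary, provided your access keys are valid.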
68 changes: 47 additions & 21 deletions _sources/book/00_MinIO_Object_Store.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -5,45 +5,71 @@
"id": "775ea76a-be8d-4abd-b13c-5fb96d54e8da",
"metadata": {},
"source": [
"# DuckDB: Connect to MinIO Object Store"
"# Connect to MinIO Object Store\n",
"\n",
"### Log in with OSM Account\n",
"\n",
"* Go to the https://sotm2024.minio.heigit.org website.\n",
"* Log in with your OSM account credentials.\n",
"\n",
"![minio_login](../figs/minio_login.png)\n",
"\n",
"### Create Access Key\n",
"* Create a new access key.\n",
" \n",
"![minio_access_key_1](../figs/minio_access_key_1.png)\n",
"\n",
"* Copy both keys; you'll need them in the next step.\n",
" \n",
"![minio_access_key_2](../figs/minio_access_key_2.png)"
]
},
{
"cell_type": "markdown",
"id": "8a8d65c2-8638-4385-a40d-0dfa190d4601",
"metadata": {},
"source": [
"# Connect to MinIO via DuckDB"
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 5,
"id": "9682aa64-27d7-4b95-98e4-cb49b5a0c4c6",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import duckdb\n",
"\n",
"con = duckdb.connect(\n",
" config={\n",
" 'threads': 32,\n",
" 'max_memory': '50GB',\n",
" }\n",
")\n",
"con.install_extension(\"spatial\")\n",
"con.load_extension(\"spatial\")"
"con = duckdb.connect()"
]
},
{
"cell_type": "markdown",
"id": "a9654816-9167-486b-9c68-0436ac5133e6",
"metadata": {},
"source": [
"Adjust the code below and add your access keys in there:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 6,
"id": "c3c9bc77-45da-4c7e-8ec5-1276c5bf92f6",
"metadata": {},
"outputs": [],
"source": [
"s3_user = os.environ[\"S3_ACCESS_KEY_ID\"] # add your user here\n",
"s3_password = os.environ[\"S3_SECRET_ACCESS_KEY\"] # add your password here"
"import os\n",
"s3_user = os.environ[\"S3_ACCESS_KEY_ID\"] # s3_user = 'my_user_access_key'\n",
"s3_password = os.environ[\"S3_SECRET_ACCESS_KEY\"] # s3_password = 'my_user_secret_key'"
]
},
{
"cell_type": "markdown",
"id": "e1589972-7ef6-4a21-9bca-9b5f5303faa3",
"metadata": {},
"source": []
"source": [
"Create the DuckDB connection to MinIO."
]
},
{
"cell_type": "code",
Expand Down Expand Up @@ -82,12 +108,12 @@
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "edc5c4a3-8ee8-40fd-b27d-055c2642fab6",
"cell_type": "markdown",
"id": "1c482713-4ab0-411f-98f3-69f54826c5ca",
"metadata": {},
"outputs": [],
"source": []
"source": [
"Now you are ready to explore the Apache Iceberg catalog and Iceberg tables in the next step."
]
}
],
"metadata": {
Expand Down
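The cell that actually configures DuckDB for MinIO is collapsed in this diff, so the following is only a sketch of how such a setup might look, not the notebook's own code. It assumes DuckDB's `httpfs` extension and its `s3_*` settings, reuses the endpoint host and region that appear in the PyIceberg configuration, and assumes path-style URLs (common for MinIO, but not confirmed by the source). The helper names `read_s3_credentials` and `duckdb_s3_statements` are hypothetical.

```python
import os
from typing import List, Mapping, Tuple


def read_s3_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read the MinIO access keys and fail with a readable message if unset."""
    names = ("S3_ACCESS_KEY_ID", "S3_SECRET_ACCESS_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]


def _quote(value: str) -> str:
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def duckdb_s3_statements(
    user: str,
    password: str,
    endpoint: str = "sotm2024.minio.heigit.org",  # host taken from the PyIceberg config
    region: str = "eu-central-1",  # region taken from the PyIceberg config
) -> List[str]:
    """Build the SQL statements that point DuckDB's S3 client at MinIO."""
    return [
        "INSTALL httpfs",
        "LOAD httpfs",
        f"SET s3_endpoint={_quote(endpoint)}",
        f"SET s3_region={_quote(region)}",
        f"SET s3_access_key_id={_quote(user)}",
        f"SET s3_secret_access_key={_quote(password)}",
        "SET s3_url_style='path'",  # assumption: path-style addressing
        "SET s3_use_ssl=true",
    ]


# Usage with the connection from the notebook (requires duckdb and valid keys):
# for statement in duckdb_s3_statements(*read_s3_credentials()):
#     con.execute(statement)
```

Building the statements separately from executing them keeps the credential handling easy to check before anything is sent to the connection.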
30 changes: 29 additions & 1 deletion _sources/book/00_motivation.md
Original file line number Diff line number Diff line change
@@ -1,2 +1,30 @@
# Why you should be excited about this workshop
# Why should you be excited about *ohsome-data-insights*?

### Fast data extraction
* there is no easier way to access the full OpenStreetMap data
* the Parquet file format allows you to download large datasets easily



### Enriched attributes
* get OSM elements + changeset information + geographic attributes + precomputed statistics


### Fits your style
* choose your own favorite client and programming language

![](../figs/flexibility_clients.png)


### Data integration
* combine OSM data with other datasets (e.g. from Overture, Mapillary, your own data)

### Flexibility
* write your own specialized queries




---
* optimize resource consumption
