deploy: 0667dd6

GIScience · Aug 26, 2024 · 439666c
1 parent e9af1d1
commit 439666c
Showing 26 changed files with 766 additions and 130 deletions.
diff --git a/README.html b/README.html
@@ -181,11 +181,12 @@
         </ul>
         <p aria-level="2" class="caption" role="heading"><span class="caption-text">Getting Started</span></p>
 <ul class="nav bd-sidenav">
-<li class="toctree-l1"><a class="reference internal" href="book/00_motivation.html">Why you should be excited about this workshop</a></li>
+<li class="toctree-l1"><a class="reference internal" href="book/00_motivation.html">Why should you be excited about <em>ohsome-data-insights</em>?</a></li>
+<li class="toctree-l1"><a class="reference internal" href="book/00_MinIO_Object_Store.html">Connect to MinIO Object Store</a></li>
+
+<li class="toctree-l1"><a class="reference internal" href="book/00_Iceberg_Catalog.html">Connect to Apache Iceberg</a></li>
 <li class="toctree-l1"><a class="reference internal" href="book/00_data_structure.html">Data Structure</a></li>
 <li class="toctree-l1"><a class="reference internal" href="book/00_partitioning_and_sorting.html">Partitioning and Sorting</a></li>
-<li class="toctree-l1"><a class="reference internal" href="book/00_MinIO_Object_Store.html">DuckDB: Connect to MinIO Object Store</a></li>
-<li class="toctree-l1"><a class="reference internal" href="book/00_Iceberg_Catalog.html">PyIceberg: Connect to Iceberg Catalog</a></li>
 </ul>
 <p aria-level="2" class="caption" role="heading"><span class="caption-text">Data Extraction</span></p>
 <ul class="nav bd-sidenav">

diff --git a/_images/flexibility_clients.png b/_images/flexibility_clients.png
diff --git a/_images/minio_access_key_1.png b/_images/minio_access_key_1.png
diff --git a/_images/minio_access_key_2.png b/_images/minio_access_key_2.png
diff --git a/_images/minio_login.png b/_images/minio_login.png
diff --git a/_sources/book/00_Iceberg_Catalog.ipynb b/_sources/book/00_Iceberg_Catalog.ipynb
@@ -5,16 +5,232 @@
    "id": "0aabe61e-c6fe-49e5-babf-3108f7aef591",
    "metadata": {},
    "source": [
-    "# PyIceberg: Connect to Iceberg Catalog"
+    "# Connect to Apache Iceberg\n",
+    "\n",
+    "## What is Apache Iceberg?\n",
+    "* Iceberg combines the MinIO object store with the features you are used to from a database"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dcb06ce8-3974-436b-9c1b-f994e2b4093f",
+   "metadata": {},
+   "source": [
+    "## Connect to Apache Iceberg Catalog via PyIceberg"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "22b5dd18-6672-48f8-bddc-1f0cae1962bd",
+   "metadata": {},
+   "source": [
+    "Adjust the code below and add your MinIO access keys:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "id": "b79af493-a3cd-4e08-9b3b-4a5fa7e19329",
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "import os\n",
+    "\n",
+    "s3_user = os.environ[\"S3_ACCESS_KEY_ID\"]  # alternatively: s3_user = 'my_user_access_key'\n",
+    "s3_password = os.environ[\"S3_SECRET_ACCESS_KEY\"]  # alternatively: s3_password = 'my_user_secret_key'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6609c52a-a0ef-46b9-b021-a09a6cd80167",
+   "metadata": {},
+   "source": [
+    "Run this cell if you haven't installed the Python libraries yet, e.g. when you are running this in Google Colab."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d65fa4ba-39b7-40fa-9dad-f8d2832e3aba",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install \"pyiceberg[s3fs,duckdb,sql-sqlite,pyarrow]\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d2e411c4-3aa9-41e4-8dc1-65932496caae",
+   "metadata": {},
+   "source": [
+    "Set up the connection to the Iceberg catalog."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "120d9b6c-5786-4fd3-b8f1-e327d97c4a01",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pyiceberg.catalog.rest import RestCatalog\n",
+    "\n",
+    "catalog = RestCatalog(\n",
+    "    name=\"default\",\n",
+    "    **{\n",
+    "        \"uri\": \"https://sotm2024.iceberg.ohsome.org\",\n",
+    "        \"s3.endpoint\": \"https://sotm2024.minio.heigit.org\",\n",
+    "        \"py-io-impl\": \"pyiceberg.io.pyarrow.PyArrowFileIO\",\n",
+    "        \"s3.access-key-id\": s3_user,\n",
+    "        \"s3.secret-access-key\": s3_password,\n",
+    "        \"s3.region\": \"eu-central-1\"\n",
+    "    }\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2bb7b974-13ac-46ea-89e3-fe5f1192193d",
+   "metadata": {},
+   "source": [
+    "## Get an overview\n",
+    "Find out what data exists and where to find it.\n",
+    "Tables in Iceberg are organized in groups called namespaces.\n",
+    "1. List all existing namespaces\n",
+    "2. List the tables that exist in a namespace\n",
+    "3. Get some table metadata"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3f3b44bf-9762-4b01-9ab4-4a0962acc4aa",
+   "metadata": {},
+   "source": [
+    "Currently, this catalog contains only a single namespace. You can think of a namespace as the equivalent of a `schema` in PostgreSQL or other databases."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "4eeb0dda-4cf9-48bd-8f56-42e9605d6fd7",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[('geo_sort',)]"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "catalog.list_namespaces()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0ea360c7-b3c9-41e8-9388-d17038706e4a",
+   "metadata": {},
+   "source": [
+    "In this step we list which tables are available in this namespace."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "041f29b6-e207-4004-a97c-053173fbf735",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[('geo_sort', 'benni_test_heidelberg'),\n",
+       " ('geo_sort', 'contributions'),\n",
+       " ('geo_sort', 'contributions_germany')]"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "catalog.list_tables('geo_sort')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "da820525-03fd-4241-ba2c-7e4e5f4e5b47",
+   "metadata": {},
+   "source": [
+    "Let's inspect a single Iceberg table and list all of its columns (attributes). (We will explain these in detail on the next page.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "7232059f-142b-4e4d-9b9a-5df7701bc64e",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "contributions(\n",
+       "  1: user_id: optional int,\n",
+       "  2: valid_from: optional timestamp,\n",
+       "  3: valid_to: optional timestamp,\n",
+       "  4: osm_type: optional string,\n",
+       "  5: osm_id: optional string,\n",
+       "  6: osm_version: optional int,\n",
+       "  7: contrib_type: optional string,\n",
+       "  8: members: optional list<struct<32: type: optional string, 33: id: optional long, 34: role: optional string, 35: geometry: optional binary>>,\n",
+       "  9: status: optional string,\n",
+       "  10: changeset: optional struct<36: id: optional long, 37: timestamp: optional timestamp, 38: tags: optional map<string, string>, 39: hashtags: optional list<string>, 40: editor: optional string>,\n",
+       "  11: tags: optional map<string, string>,\n",
+       "  12: tags_before: optional map<string, string>,\n",
+       "  13: map_features: optional struct<48: aerialway: optional boolean, 49: aeroway: optional boolean, 50: amenity: optional boolean, 51: barrier: optional boolean, 52: boundary: optional boolean, 53: building: optional boolean, 54: craft: optional boolean, 55: emergency: optional boolean, 56: geological: optional boolean, 57: healthcare: optional boolean, 58: highway: optional boolean, 59: historic: optional boolean, 60: landuse: optional boolean, 61: leisure: optional boolean, 62: man_made: optional boolean, 63: military: optional boolean, 64: natural: optional boolean, 65: office: optional boolean, 66: place: optional boolean, 67: power: optional boolean, 68: public_transport: optional boolean, 69: railway: optional boolean, 70: route: optional boolean, 71: shop: optional boolean, 72: sport: optional boolean, 73: telecom: optional boolean, 74: water: optional boolean, 75: waterway: optional boolean>,\n",
+       "  14: area: optional long,\n",
+       "  15: area_delta: optional long,\n",
+       "  16: length: optional long,\n",
+       "  17: length_delta: optional long,\n",
+       "  18: xzcode: optional struct<76: level: optional int, 77: code: optional long>,\n",
+       "  19: country_iso_a3: optional list<string>,\n",
+       "  20: bbox: optional struct<79: xmin: optional double, 80: ymin: optional double, 81: xmax: optional double, 82: ymax: optional double>,\n",
+       "  21: xmin: optional double,\n",
+       "  22: xmax: optional double,\n",
+       "  23: ymin: optional double,\n",
+       "  24: ymax: optional double,\n",
+       "  25: centroid: optional struct<83: x: optional double, 84: y: optional double>,\n",
+       "  26: quadkey_z10: optional string,\n",
+       "  27: h3_r5: optional long,\n",
+       "  28: geometry_type: optional string,\n",
+       "  29: geometry_valid: optional boolean,\n",
+       "  30: geometry: optional string\n",
+       "),\n",
+       "partition by: [status, geometry_type],\n",
+       "sort order: [],\n",
+       "snapshot: Operation.APPEND: id=1440840715635230871, schema_id=0"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "iceberg_table = catalog.load_table(('geo_sort', 'contributions'))\n",
+    "display(iceberg_table)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "845ce15e-a61e-4e2f-be53-3bc8fa49c5b3",
+   "metadata": {},
+   "source": [
+    "Let's dive deeper now into the data structure and what you can expect for your data analysis."
+   ]
   }
  ],
  "metadata": {

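The overview steps in the notebook above (list the namespaces, then list the tables in each one) can be combined into a single helper. The following is a minimal sketch, not part of the commit: `catalog_overview` and `StubCatalog` are hypothetical names, and the stub only replays the outputs shown in the notebook so that the example runs without network access. With a real PyIceberg catalog, the `catalog` object created above would be passed in instead.

```python
from typing import Dict, List, Protocol, Tuple


class CatalogLike(Protocol):
    """Minimal interface used here; PyIceberg catalogs offer these two methods."""

    def list_namespaces(self) -> List[Tuple[str, ...]]: ...

    def list_tables(self, namespace) -> List[Tuple[str, ...]]: ...


def catalog_overview(catalog: CatalogLike) -> Dict[str, List[str]]:
    """Map each namespace name to the names of the tables it contains."""
    overview: Dict[str, List[str]] = {}
    for namespace in catalog.list_namespaces():
        # Table identifiers are tuples such as ('geo_sort', 'contributions');
        # the last element is the table name.
        tables = catalog.list_tables(namespace)
        overview[".".join(namespace)] = [identifier[-1] for identifier in tables]
    return overview


class StubCatalog:
    """Hypothetical stand-in that replays the outputs shown in the notebook."""

    def list_namespaces(self):
        return [("geo_sort",)]

    def list_tables(self, namespace):
        return [
            ("geo_sort", "benni_test_heidelberg"),
            ("geo_sort", "contributions"),
            ("geo_sort", "contributions_germany"),
        ]


overview = catalog_overview(StubCatalog())
print(overview)
```

Keeping the helper independent of a concrete catalog class makes it easy to try out offline before pointing it at the workshop catalog.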
diff --git a/_sources/book/00_MinIO_Object_Store.ipynb b/_sources/book/00_MinIO_Object_Store.ipynb
@@ -5,45 +5,71 @@
    "id": "775ea76a-be8d-4abd-b13c-5fb96d54e8da",
    "metadata": {},
    "source": [
-    "# DuckDB: Connect to MinIO Object Store"
+    "# Connect to MinIO Object Store\n",
+    "\n",
+    "### Log in with your OSM account\n",
+    "\n",
+    "* Go to the https://sotm2024.minio.heigit.org website.\n",
+    "* Log in with your OSM account credentials.\n",
+    "\n",
+    "![minio_login](../figs/minio_login.png)\n",
+    "\n",
+    "### Create an access key\n",
+    "* Create a new access key.\n",
+    " \n",
+    "![minio_access_key_1](../figs/minio_access_key_1.png)\n",
+    "\n",
+    "* Copy both keys; you'll need them in the next step.\n",
+    "  \n",
+    "![minio_access_key_2](../figs/minio_access_key_2.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8a8d65c2-8638-4385-a40d-0dfa190d4601",
+   "metadata": {},
+   "source": [
+    "# Connect to MinIO via DuckDB"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 5,
    "id": "9682aa64-27d7-4b95-98e4-cb49b5a0c4c6",
    "metadata": {},
    "outputs": [],
    "source": [
-    "import os\n",
     "import duckdb\n",
-    "\n",
-    "con = duckdb.connect(\n",
-    "    config={\n",
-    "        'threads': 32,\n",
-    "        'max_memory': '50GB',\n",
-    "    }\n",
-    ")\n",
-    "con.install_extension(\"spatial\")\n",
-    "con.load_extension(\"spatial\")"
+    "con = duckdb.connect()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a9654816-9167-486b-9c68-0436ac5133e6",
+   "metadata": {},
+   "source": [
+    "Adjust the code below and add your access keys:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 6,
    "id": "c3c9bc77-45da-4c7e-8ec5-1276c5bf92f6",
    "metadata": {},
    "outputs": [],
    "source": [
-    "s3_user = os.environ[\"S3_ACCESS_KEY_ID\"]  # add your user here\n",
-    "s3_password = os.environ[\"S3_SECRET_ACCESS_KEY\"]  # add your password here"
+    "import os\n",
+    "s3_user = os.environ[\"S3_ACCESS_KEY_ID\"]  # s3_user = 'my_user_access_key'\n",
+    "s3_password = os.environ[\"S3_SECRET_ACCESS_KEY\"]  # s3_password = 'my_user_secret_key'"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "e1589972-7ef6-4a21-9bca-9b5f5303faa3",
    "metadata": {},
-   "source": []
+   "source": [
+    "Create the DuckDB connection to MinIO."
+   ]
   },
   {
    "cell_type": "code",
@@ -82,12 +108,12 @@
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "edc5c4a3-8ee8-40fd-b27d-055c2642fab6",
+   "cell_type": "markdown",
+   "id": "1c482713-4ab0-411f-98f3-69f54826c5ca",
    "metadata": {},
-   "outputs": [],
-   "source": []
+   "source": [
+    "Now you are ready to explore the Apache Iceberg catalog and Iceberg tables in the next step."
+   ]
   }
  ],
  "metadata": {

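The cell that actually connects DuckDB to MinIO falls outside the hunks shown in this diff, so the following is only a sketch of one possible approach, not the notebook's code. It builds the `SET` statements for DuckDB's S3 settings as plain strings. The helper names `sql_quote` and `minio_settings` are hypothetical, the path-style URL setting is an assumption that is common for MinIO, and the endpoint and region are taken from the Iceberg catalog configuration in this commit.

```python
from typing import List


def sql_quote(value: str) -> str:
    """Wrap a value in single quotes, doubling embedded quotes (SQL string literal)."""
    return "'" + value.replace("'", "''") + "'"


def minio_settings(endpoint: str, access_key: str, secret_key: str,
                   region: str = "eu-central-1") -> List[str]:
    """Build DuckDB SET statements for an S3-compatible store such as MinIO."""
    settings = {
        "s3_endpoint": endpoint,            # host name only, without "https://"
        "s3_access_key_id": access_key,
        "s3_secret_access_key": secret_key,
        "s3_region": region,
        "s3_url_style": "path",             # assumption: path-style URLs
    }
    return [f"SET {name}={sql_quote(value)};" for name, value in settings.items()]


statements = minio_settings(
    endpoint="sotm2024.minio.heigit.org",
    access_key="my_user_access_key",        # placeholder, not a real key
    secret_key="my_user_secret_key",        # placeholder, not a real key
)
for statement in statements:
    print(statement)
```

Each statement could then be run with `con.execute(statement)` on the connection created above, provided DuckDB's `httpfs` extension is available.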
diff --git a/_sources/book/00_motivation.md b/_sources/book/00_motivation.md
@@ -1,2 +1,30 @@
-# Why you should be excited about this workshop
+# Why should you be excited about *ohsome-data-insights*?
+
+### Fast data extraction
+* there is no easier way to access the full OpenStreetMap data
+* the Parquet file format allows you to download large datasets easily
+
+
+
+### Enriched attributes
+* get OSM elements + changeset information + geographic attributes + precomputed statistics
+
+
+### Fits your style
+* choose your own favorite client and programming language
+
+![](../figs/flexibility_clients.png)
+
+
+### Data integration
+* combine OSM data with other datasets (e.g. from Overture, Mapillary, your own data)
+
+### Flexibility
+* write your own specialized queries
+
+
+
+
+---
+* optimize resource consumption