commit

recohut · Feb 13, 2022 · 45002f6 · 45002f6
1 parent 510ed56
commit 45002f6
Show file tree

Hide file tree

Showing 28 changed files with 9,879 additions and 0 deletions.
diff --git a/_notebooks/2022-01-29-finbert.ipynb b/_notebooks/2022-01-29-finbert.ipynb
diff --git a/_notebooks/2022-01-29-infersent.ipynb b/_notebooks/2022-01-29-infersent.ipynb
diff --git a/_notebooks/2022-01-29-movie-transformer.ipynb b/_notebooks/2022-01-29-movie-transformer.ipynb
diff --git a/_notebooks/2022-01-29-mta.ipynb b/_notebooks/2022-01-29-mta.ipynb
diff --git a/_notebooks/2022-01-29-mts.ipynb b/_notebooks/2022-01-29-mts.ipynb
diff --git a/_notebooks/2022-01-29-ngcf.ipynb b/_notebooks/2022-01-29-ngcf.ipynb
diff --git a/_notebooks/2022-01-29-nlp.ipynb b/_notebooks/2022-01-29-nlp.ipynb
diff --git a/_notebooks/2022-01-29-obj-det-2.ipynb b/_notebooks/2022-01-29-obj-det-2.ipynb
diff --git a/_notebooks/2022-01-29-obj-det.ipynb b/_notebooks/2022-01-29-obj-det.ipynb
diff --git a/_notebooks/2022-01-29-opencv-video.ipynb b/_notebooks/2022-01-29-opencv-video.ipynb
diff --git a/_notebooks/2022-01-29-profiling.ipynb b/_notebooks/2022-01-29-profiling.ipynb
diff --git a/_notebooks/2022-01-29-snp.ipynb b/_notebooks/2022-01-29-snp.ipynb
diff --git a/_notebooks/2022-01-30-benchmark.ipynb b/_notebooks/2022-01-30-benchmark.ipynb
diff --git a/_notebooks/2022-01-30-finance.ipynb b/_notebooks/2022-01-30-finance.ipynb
diff --git a/_notebooks/2022-01-30-glt-community.ipynb b/_notebooks/2022-01-30-glt-community.ipynb
@@ -0,0 +1,345 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Graph Learning Tasks - Community Detection"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One common problem data scientists face when dealing with networks is how to identify clusters and communities within a graph. This often arises when graphs are derived from social networks and communities are known to exist. However, the underlying algorithms and methods can also be used in other contexts, representing another option to perform clustering and segmentation. For example, these methods can effectively be used in text mining to identify emerging topics and to cluster documents that refer to single events/topics. A community detection task consists of partitioning a graph such that nodes belonging to the same community are tightly connected with each other and are weakly connected with nodes from other communities."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this notebook, we will explore some methods to perform a community detection using several algortihms."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install git+https://github.com/palash1992/GEM\n",
+    "!pip install communities"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import networkx as nx \n",
+    "from sklearn.manifold import TSNE\n",
+    "from gem.embedding.hope import HOPE \n",
+    "from matplotlib import pyplot as plt\n",
+    "from sklearn.decomposition import NMF\n",
+    "from sklearn.mixture import GaussianMixture\n",
+    "from communities.algorithms import girvan_newman\n",
+    "from communities.algorithms import louvain_method\n",
+    "from communities.algorithms import spectral_clustering\n",
+    "\n",
+    "%matplotlib inline"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import networkx as nx \n",
+    "G = nx.barbell_graph(m1=10, m2=4)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Matrix Factorization "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We start by using some matrix factorization technique to extract the embeddings, which are visualized and then clustered traditional clustering algorithms.  "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gf = HOPE(d=4, beta=0.01) \n",
+    "gf.learn_embedding(G) \n",
+    "embeddings = gf.get_embedding()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tsne = TSNE(n_components=2) \n",
+    "emb2d = tsne.fit_transform(embeddings)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.plot(embeddings[:, 0], embeddings[:, 1], 'o', linewidth=0);"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gm = GaussianMixture(n_components=3, random_state=0) #.(embeddings)\n",
+    "labels = gm.fit_predict(embeddings)\n",
+    "colors = [\"blue\", \"green\", \"red\"]\n",
+    "nx.draw_spring(G, node_color=[colors[label] for label in labels])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Spectral Clustering"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We now perform a spectral clustering based on the adjacency matrix of the graph. It is worth noting that this clustering is not a mutually exclusive clustering and nodes may belong to more than one community"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "adj = np.array(nx.adjacency_matrix(G).todense())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "communities = spectral_clustering(adj, k=3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the next plot we highlight the nodes that belong to a community using the red color. The blue nodes do not belong to the given community"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure(figsize=(20, 5))\n",
+    "\n",
+    "for ith, community in enumerate(communities):\n",
+    "    cols = [\"red\" if node in community else \"blue\" for node in G.nodes]\n",
+    "    plt.subplot(1,3,ith+1)\n",
+    "    plt.title(f\"Community {ith}\")\n",
+    "    nx.draw_spring(G, node_color=cols)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The next command shows the node ids belonging to the different communities"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "communities"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Non Negative Matrix Factorization "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here, we again use matrix factorization, but now using the Non-Negative Matrix Factorization, and associating the clusters with the latent dimensions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nmf = NMF(n_components=2)\n",
+    "emb = nmf.fit_transform(adj)\n",
+    "plt.plot(emb[:, 0], emb[:, 1], 'o', linewidth=0)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "By setting a threshold value of 0.01, we determine which nodes belong to the given community."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "communities = [set(np.where(emb[:,ith]>0.01)[0]) for ith in range(2)]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure(figsize=(20, 5))\n",
+    "\n",
+    "for ith, community in enumerate(communities):\n",
+    "    cols = [\"red\" if node in community else \"blue\" for node in G.nodes]\n",
+    "    plt.subplot(1,3,ith+1)\n",
+    "    plt.title(f\"Community {ith}\")\n",
+    "    nx.draw_spring(G, node_color=cols)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Although the example above does not show this, in general also this clustering method may be non-mutually exclusive, and nodes may belong to more than one community"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Louvain and Modularity Optimization"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here, we use the Louvain method, which is one of the most popular methods for performing community detection, even on fairly large graphs. As described in the chapter, the Louvain method basically optimize the partitioning (it is a mutually exclusing community detection algorithm), identifying the one that maximize the modularity score, meaning that nodes belonging to the same community are very well connected among themself, and weakly connected to the other communities. \n",
+    "\n",
+    "**Louvain, unlike other community detection algorithms, does not require to specity the number of communities in advance and find the best, optimal number of communities.**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "communities = louvain_method(adj)\n",
+    "\n",
+    "c = pd.Series({node: colors[ith] for ith, nodes in enumerate(communities[0]) for node in nodes}).values\n",
+    "nx.draw_spring(G, node_color=c)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Girvan Newman"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The Girvan–Newman algorithm detects communities by progressively removing edges from the original graph. The algorithm removes the “most valuable” edge, traditionally the edge with the highest betweenness centrality, at each step. As the graph breaks down into pieces, the tightly knit community structure is exposed and the result can be depicted as a dendrogram.\n",
+    "\n",
+    "**BE AWARE that because of the betweeness centrality computation, this method may not scale well on large graphs**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "communities = girvan_newman(adj, n=2)\n",
+    "c = pd.Series({node: colors[ith] for ith, nodes in enumerate(communities[0]) for node in nodes}).values\n",
+    "nx.draw_spring(G, node_color=c)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "communities[0]"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "authorship_tag": "ABX9TyO9Y9GoAZ9oghYFJpCXF8Cx",
+   "collapsed_sections": [],
+   "name": "rec-tut-gml-05-glt-community-detection.ipynb",
+   "provenance": [
+    {
+     "file_id": "1sAKOySokSkK8dTp6GYjmIh3AjBOT1R0J",
+     "timestamp": 1627993731574
+    },
+    {
+     "file_id": "1FlR0Nt00zRzrjciEpl46j51IIomB0K_p",
+     "timestamp": 1627989061002
+    }
+   ]
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/_notebooks/2022-01-30-glt-missing-link.ipynb b/_notebooks/2022-01-30-glt-missing-link.ipynb