Documentation and gensim integration
avaimar committed Nov 9, 2024
1 parent c808b42 commit 45b4ffc
Showing 14 changed files with 1,150 additions and 830 deletions.
8 changes: 0 additions & 8 deletions .pre-commit-config.yaml
@@ -16,14 +16,6 @@ repos:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
-  - repo: https://github.com/roy-ht/pre-commit-jupyter
-    rev: v1.2.1
-    hooks:
-      - id: jupyter-notebook-cleanup
-        args:
-          # - --remove-kernel-metadata
-          - --pin-patterns
-          - "[pin];[donotremove]"
  - repo: https://github.com/pre-commit/mirrors-prettier
    rev: v3.1.0
    hooks:
71 changes: 58 additions & 13 deletions README.md
@@ -5,43 +5,46 @@ This is the code repository for the paper "Statistical Uncertainty in Word Embeddings: GloVe-V"

**We introduce a method to obtain approximate, easy-to-use, and scalable uncertainty estimates for the GloVe word embeddings and demonstrate its usefulness in natural language tasks and computational social science analysis. This repository contains code to download pre-computed GloVe embeddings and GloVe-V variances for several corpora from our HuggingFace repository, to interact with these data products, and to propagate uncertainty to downstream tasks.**


![GloVe-V](figures/glove_diagram.jpg)

## Available Corpora

We provide embeddings and variances for the following corpora:

- **Toy Corpus (300-dim)**: a subset of 11 words from the Corpus of Historical American English (1900-1999). Downloadable as `Toy-Embeddings`
- **Corpus of Historical American English (COHA) (1900-1999) (300-dim)**: Downloadable as `COHA_1900-1999_300d`
- More to come!

## HuggingFace Repository
We store our data products on HuggingFace. You can find them [here](https://huggingface.co/datasets/reglab/glove-v).

Each dataset contains the following files (see the **Storage of GloVe-V Variances** section below for more details on the differences between the complete and approximated variances):
- `vocab.txt`: a list of the words in the corpus with associated frequencies
- `vectors.safetensors`: a safetensors file containing the embeddings for each word in the corpus
- `complete_chunk_{i}.safetensors`: a set of safetensors files containing the complete variances for each word in the corpus. These variances are of size $D \times D$, where $D$ is the embedding dimensionality, and thus are very storage-intensive.
- `approx_info.txt`: a text file containing information on the method used to approximate the full variance of each word (diagonal approximation or SVD approximation)
- `ApproximationVariances.safetensors`: a safetensors file containing the approximated variances for each word in the corpus. These approximations require storing far fewer floating point numbers than the full variances. If a word has been approximated by a diagonal approximation, this file contains only $D$ floating point numbers for that word. Alternatively, if a word has been approximated by an SVD approximation of rank $k$, this file contains $k(2D + 1)$ floating point numbers for that word.

If using the approximated variances, the `glove_v.variance.load_variance` function automatically handles the reconstruction of the variances from these files.
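
As a back-of-the-envelope illustration of the storage counts above (plain arithmetic, not part of the package; the rank $k = 10$ is a hypothetical value):

```python
# Floats stored per word under each representation, for D = 300.
# k = 10 is a hypothetical SVD rank chosen for illustration.
D, k = 300, 10
complete = D * D        # full D x D variance: 90,000 floats
diagonal = D            # diagonal approximation: 300 floats
svd = k * (2 * D + 1)   # rank-k SVD approximation: 6,010 floats
print(f"complete={complete:,} diagonal={diagonal:,} svd={svd:,}")
```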

## Storage of GloVe-V Variances

Let $V$ be the size of the vocabulary and $D$ be the embedding dimension. While GloVe embeddings only require storing $V \times D$ floating point numbers, the GloVe-V variances require storing $V \times D \times D$ floating point numbers. For this reason, we offer two download options:

1. **Approximation Variances**: These are approximations to the full GloVe-V variances, using either a diagonal approximation to the full variance or a low-rank Singular Value Decomposition (SVD) approximation. We optimize this approximation at the level of each word to guarantee at least 90% reconstruction of the original variance. These approximations require storing far fewer floating point numbers than the full variances.
2. **Complete Variances**: These are the full GloVe-V variances, which require storing $V \times D \times D$ floating point numbers. For example, in the case of the 300-dimensional embeddings for the COHA (1900-1999) corpus, this would be approximately 6.4 billion floating point numbers!

Our [tutorial](https://github.com/reglab/glove-v/blob/main/glove_v/docs/tutorial.ipynb) compares results using the approximated and complete variances with an illustration from the paper.

## Setup

First, clone this repo:

```bash
git clone https://github.com/reglab/glove-v.git glove_v
```

Next, install uv:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

@@ -50,16 +53,58 @@
Then, create a virtual environment:

```bash
cd glove_v
uv venv # optionally add --python 3.11 or another version
```

To activate the virtual environment:

```bash
source .venv/bin/activate # If using fish shell, use `source .venv/bin/activate.fish` instead

uv sync
```
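
To confirm the environment is set up, a minimal import check (the `__all__` list is the one defined in `glove_v/__init__.py`, shown further below):

```python
# Quick smoke test: the package should import and expose its public API.
import glove_v

print(glove_v.__all__)
# ['propagate', 'variance', 'vector', 'download_embeddings', 'GloVeVKeyedVectors']
```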

## Usage

Our tutorial notebook is available [here](https://github.com/reglab/glove-v/blob/main/glove_v/docs/tutorial.ipynb) and offers a more detailed walkthrough of the process of downloading and interacting with the GloVe-V data products.

Here is a quick example of how to download the approximated embeddings for the Toy Corpus:

```python
import glove_v

glove_v.data.download_embeddings(
    embedding_name='Toy-Embeddings',
    approximation=True,
)
```
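
The complete variances can be fetched with the same call by flipping `approximation` (a sketch based on the `download_embeddings` signature in `glove_v/data.py`; the chunked files are reconstructed into a single `CompleteVariances.safetensors` on download):

```python
# Same entry point, but downloads and reconstructs the complete D x D variances.
glove_v.data.download_embeddings(
    embedding_name='Toy-Embeddings',
    approximation=False,
)
```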

We can easily load the vocabulary and embeddings for the Toy Corpus in several formats (dictionary, numpy arrays, gensim KeyedVectors):
```python
vocab, ivocab = glove_v.vector.load_vocab(
    embedding_name='Toy-Embeddings',
)
vectors = glove_v.vector.load_vectors(
    embedding_name='Toy-Embeddings',
    format='dictionary',
)
```
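
With the dictionary format, each entry maps a word to its embedding, so standard NumPy operations apply directly. A small sketch (the two words are picked arbitrarily from the vocabulary, not fixed by the package):

```python
import numpy as np

# Cosine similarity between two arbitrary words from the Toy Corpus vocabulary.
w1, w2 = list(vocab)[0], list(vocab)[1]
v1, v2 = vectors[w1], vectors[w2]
cosine = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(f"cosine({w1}, {w2}) = {cosine:.3f}")
```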

Next, we load the approximated variances for the Toy Corpus. This function automatically handles the reconstruction of the variances from the approximated files, such that the variances in the `approx_variances` dictionary are of size $D \times D$.

```python
approx_variances = {}
for word in vocab:
    approx_variances[word] = glove_v.variance.load_variance(
        embedding_name='Toy-Embeddings',
        approximation=True,
        word_idx=vocab[word],
    )
```
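
Once a $D \times D$ variance $\Sigma_w$ is in hand, uncertainty propagates through any linear functional of the embedding exactly: for $f(x) = a^\top x$, $\mathrm{Var}[f] = a^\top \Sigma_w a$. A hand-rolled sketch of this (a generic illustration with an arbitrary direction $a$, not the `glove_v.propagate` API):

```python
import numpy as np

# Propagate GloVe-V uncertainty through a linear projection a @ x.
word = next(iter(vocab))
x = vectors[word]                 # D-dimensional embedding
sigma = approx_variances[word]    # reconstructed D x D variance
a = np.ones_like(x) / np.sqrt(x.shape[0])  # arbitrary unit-norm direction
proj = a @ x
se = np.sqrt(a @ sigma @ a)       # standard error of the projection
print(f"{word}: projection = {proj:.3f} +/- {1.96 * se:.3f} (95% interval)")
```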

We also offer a Gensim integration for working with GloVe-V embeddings using Gensim's KeyedVectors.

```python
gensim_glovev_kv = glove_v.GloVeVKeyedVectors(
    embedding_name='Toy-Embeddings',
)
```
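
Since the class builds on gensim's `KeyedVectors`, the familiar gensim query surface should apply. A hedged example using the stock `most_similar` method (how uncertainty is surfaced through such queries is covered in the tutorial, not assumed here):

```python
# Standard gensim-style query; `most_similar` is the stock KeyedVectors method.
query = next(iter(vocab))
print(gensim_glovev_kv.most_similar(query, topn=3))
```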
13 changes: 10 additions & 3 deletions glove_v/__init__.py
@@ -1,4 +1,11 @@
from . import propagate, variance, vector
from .data import download_embeddings
from .gensim_integration import GloVeVKeyedVectors

__all__ = [
    "propagate",
    "variance",
    "vector",
    "download_embeddings",
    "GloVeVKeyedVectors",
]
78 changes: 40 additions & 38 deletions glove_v/data.py
@@ -4,80 +4,84 @@
more lightweight, and guarantee 90% reconstruction of the original variance for each word.
"""

from pathlib import Path

import numpy as np
from huggingface_hub import hf_hub_download
from safetensors import safe_open
from safetensors.numpy import save_file

import glove_v.utils.file as file_utils


AVAILABLE_EMBEDDINGS = [
"Toy-Embeddings",
'COHA_1900-1999_300d',
"COHA_1900-1999_300d",
]


def download_embeddings(
    embedding_name: str,
    approximation: bool = True,
    download_dir: str = f"{file_utils.get_data_path()}/glove-v",
) -> None:
"""
Downloads the vectors and variances for a selected corpus.
Args:
embedding_name: (str) The specific embedding to download. This should match one of the keys in the AVAILABLE_EMBEDDINGS dictionary
approximation: (bool) Whether to download the approximate or complete GloVe-V variances. The GloVe embeddings
are the same for both cases.
download_dir: (str) path where GloVe-V files should be saved
"""
    if embedding_name not in AVAILABLE_EMBEDDINGS:
        raise ValueError(
            f"[ERROR] Embeddings should be one of the following: {AVAILABLE_EMBEDDINGS}"
        )

    final_download_dir = Path(download_dir) / embedding_name
    final_download_dir.mkdir(parents=True, exist_ok=True)

    # Download vocabulary and embeddings
    for file in ["vocab.txt", "vectors.safetensors"]:
        file_path = final_download_dir / file
        if not file_path.exists():
            downloaded_path = hf_hub_download(
                repo_id="reglab/glove-v",
                filename=f"{embedding_name}/{file}",
                local_dir=download_dir,
                repo_type="dataset",
            )
            print(f"[INFO] Downloaded {file}: {downloaded_path}")
        else:
            print(f"[INFO] {file} already exists in {final_download_dir}")

    # Download variances
    if approximation:
        print("[INFO] Downloading file containing approximated variances.")
        for file in ["ApproximationVariances.safetensors", "approx_info.txt"]:
            file_path = final_download_dir / file
            if not file_path.exists():
                downloaded_path = hf_hub_download(
                    repo_id="reglab/glove-v",
                    filename=f"{embedding_name}/{file}",
                    local_dir=download_dir,
                    repo_type="dataset",
                )
                print(f"[INFO] Downloaded {file}: {downloaded_path}")
            else:
                print(f"[INFO] {file} already exists in {final_download_dir}")
    else:
        print("[INFO] Downloading file containing complete variances.")

        output_path = final_download_dir / "CompleteVariances.safetensors"
        if not output_path.exists():
            download_and_reconstruct_complete_safetensor(
                embedding_name=embedding_name,
                download_dir=download_dir,
                output_path=str(output_path),
            )
        else:
            print(f"[INFO] CompleteVariances.safetensors already exists in {final_download_dir}")


def download_and_reconstruct_complete_safetensor(
@@ -88,41 +92,39 @@
    embedding_name: str,
    download_dir: str,
    output_path: str,
) -> None:
"""
Downloads chunked safetensor files from HuggingFace and reconstructs the complete safetensor
containing the original variances.
Args:
embedding_name: Name of the corpus on HuggingFace
download_dir: Path where to save the downloaded chunks
output_path: Path where to save the reconstructed complete safetensor
"""
    chunk_idx = 0
    all_variances = []

    while True:
        try:
            # Download chunk
            chunk_path = hf_hub_download(
                repo_id="reglab/glove-v",
                filename=f"{embedding_name}/complete_chunk_{chunk_idx}.safetensors",
                local_dir=download_dir,
                repo_type="dataset",
            )

            # Load chunk data
            with safe_open(chunk_path, framework="numpy") as f:
                # Get variances from chunk
                variances_chunk = f.get_tensor("variances")
                all_variances.append(variances_chunk)

            chunk_idx += 1

        except Exception:
            # No more chunks to download
            break

    # Concatenate all variance chunks
    complete_variances = np.concatenate(all_variances, axis=0)

    # Save reconstructed complete safetensor
    save_file({"variances": complete_variances}, output_path)
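
For reference, reading the reconstructed file back is a plain safetensors load. A minimal sketch (the path assumes the default `download_dir` layout and is hypothetical; the `"variances"` key and axis-0 chunking match the code above, so the expected shape is `(V, D, D)`):

```python
from safetensors import safe_open

# Hypothetical default path: <data_path>/glove-v/<embedding_name>/CompleteVariances.safetensors
path = "data/glove-v/Toy-Embeddings/CompleteVariances.safetensors"
with safe_open(path, framework="numpy") as f:
    variances = f.get_tensor("variances")  # expected shape: (V, D, D)
print(variances.shape)
```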