Documentation and gensim integration
avaimar committed Nov 9, 2024
1 parent c808b42 commit 45b4ffc
Showing 14 changed files with 1,150 additions and 830 deletions.
8 changes: 0 additions & 8 deletions .pre-commit-config.yaml
@@ -16,14 +16,6 @@ repos:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
-  - repo: https://github.com/roy-ht/pre-commit-jupyter
-    rev: v1.2.1
-    hooks:
-      - id: jupyter-notebook-cleanup
-        args:
-          # - --remove-kernel-metadata
-          - --pin-patterns
-          - "[pin];[donotremove]"
  - repo: https://github.com/pre-commit/mirrors-prettier
    rev: v3.1.0
    hooks:
71 changes: 58 additions & 13 deletions README.md
@@ -5,43 +5,46 @@ This is the code repository for the paper "Statistical Uncertainty in Word Embeddings: GloVe-V"

**We introduce a method to obtain approximate, easy-to-use, and scalable uncertainty estimates for the GloVe word embeddings and demonstrate its usefulness in natural language tasks and computational social science analysis. This repository contains code to download pre-computed GloVe embeddings and GloVe-V variances for several corpora from our HuggingFace repository, to interact with these data products, and to propagate uncertainty to downstream tasks.**


![GloVe-V](figures/glove_diagram.jpg)

## Available Corpora

We provide embeddings and variances for the following corpora:

- **Toy Corpus (300-dim)**: a subset of 11 words from the Corpus of Historical American English (1900-1999). Downloadable as `Toy-Embeddings`
- **Corpus of Historical American English (COHA) (1900-1999) (300-dim)**: Downloadable as `COHA_1900-1999_300d`
- More to come!

## HuggingFace Repository
We store our data products on HuggingFace. You can find them [here](https://huggingface.co/datasets/reglab/glove-v).

Each dataset contains the following files (see the **Storage of GloVe-V Variances** section below for more details on the differences between the complete and approximated variances):
- `vocab.txt`: a list of the words in the corpus with associated frequencies
- `vectors.safetensors`: a safetensors file containing the embeddings for each word in the corpus
- `complete_chunk_{i}.safetensors`: a set of safetensors files containing the complete variances for each word in the corpus. These variances are of size $D \times D$, where $D$ is the embedding dimensionality, and thus are very storage-intensive.
- `approx_info.txt`: a text file containing information on the method used to approximate the full variance of each word (diagonal approximation or SVD approximation)
- `ApproximationVariances.safetensors`: a safetensors file containing the approximated variances for each word in the corpus. These approximations require storing far fewer floating point numbers than the full variances. If a word has been approximated by a diagonal approximation, this file contains only $D$ floating point numbers for that word. Alternatively, if a word has been approximated by an SVD approximation of rank $k$, this file contains $k(2D + 1)$ floating point numbers for that word.

If using the approximated variances, the `glove_v.variance.load_variance` function automatically handles the reconstruction of the variances from these files.
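
As a back-of-the-envelope illustration of the storage counts above (plain arithmetic, not part of the package; the rank $k = 10$ is a hypothetical value):

```python
# Floats stored per word under each representation, for D = 300.
# k = 10 is a hypothetical SVD rank chosen for illustration.
D, k = 300, 10
complete = D * D        # full D x D variance: 90,000 floats
diagonal = D            # diagonal approximation: 300 floats
svd = k * (2 * D + 1)   # rank-k SVD approximation: 6,010 floats
print(f"complete={complete:,} diagonal={diagonal:,} svd={svd:,}")
```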

## Storage of GloVe-V Variances

Let $V$ be the size of the vocabulary and $D$ be the embedding dimension. While GloVe embeddings only require storing $V \times D$ floating point numbers, the GloVe-V variances require storing $V \times D \times D$ floating point numbers. For this reason, we offer two download options:

1. **Approximation Variances**: These are approximations to the full GloVe-V variances, using either a diagonal approximation to the full variance or a low-rank Singular Value Decomposition (SVD) approximation. We optimize this approximation at the level of each word to guarantee at least 90% reconstruction of the original variance. These approximations require storing far fewer floating point numbers than the full variances.
2. **Complete Variances**: These are the full GloVe-V variances, which require storing $V \times D \times D$ floating point numbers. For example, in the case of the 300-dimensional embeddings for the COHA (1900-1999) corpus, this would be approximately 6.4 billion floating point numbers!

Our [tutorial](https://github.com/reglab/glove-v/blob/main/glove_v/docs/tutorial.ipynb) compares results using the approximated and complete variances with an illustration from the paper.

## Setup

First, clone this repo:

```bash
git clone https://github.com/reglab/glove-v.git glove_v
```

Next, install uv:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

@@ -50,16 +53,58 @@
Then, create a virtual environment:

```bash
cd glove_v
uv venv # optionally add --python 3.11 or another version
```

To activate the virtual environment:

```bash
source .venv/bin/activate # If using fish shell, use `source .venv/bin/activate.fish` instead

uv sync
```
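
To confirm the environment is set up, a minimal import check (the `__all__` list is the one defined in `glove_v/__init__.py`, shown further below):

```python
# Quick smoke test: the package should import and expose its public API.
import glove_v

print(glove_v.__all__)
# ['propagate', 'variance', 'vector', 'download_embeddings', 'GloVeVKeyedVectors']
```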

## Usage

Our tutorial notebook is available [here](https://github.com/reglab/glove-v/blob/main/glove_v/docs/tutorial.ipynb) and offers a more detailed walkthrough of the process of downloading and interacting with the GloVe-V data products.

Here is a quick example of how to download the approximated embeddings for the Toy Corpus:

```python
import glove_v

glove_v.data.download_embeddings(
    embedding_name='Toy-Embeddings',
    approximation=True,
)
```
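
The complete variances can be fetched with the same call by flipping `approximation` (a sketch based on the `download_embeddings` signature in `glove_v/data.py`; the chunked files are reconstructed into a single `CompleteVariances.safetensors` on download):

```python
# Same entry point, but downloads and reconstructs the complete D x D variances.
glove_v.data.download_embeddings(
    embedding_name='Toy-Embeddings',
    approximation=False,
)
```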

We can easily load the vocabulary and embeddings for the Toy Corpus in several formats (dictionary, numpy arrays, gensim KeyedVectors):
```python
vocab, ivocab = glove_v.vector.load_vocab(
    embedding_name='Toy-Embeddings',
)
vectors = glove_v.vector.load_vectors(
    embedding_name='Toy-Embeddings',
    format='dictionary',
)
```
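
With the dictionary format, each entry maps a word to its embedding, so standard NumPy operations apply directly. A small sketch (the two words are picked arbitrarily from the vocabulary, not fixed by the package):

```python
import numpy as np

# Cosine similarity between two arbitrary words from the Toy Corpus vocabulary.
w1, w2 = list(vocab)[0], list(vocab)[1]
v1, v2 = vectors[w1], vectors[w2]
cosine = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(f"cosine({w1}, {w2}) = {cosine:.3f}")
```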

Next, we load the approximated variances for the Toy Corpus. This function automatically handles the reconstruction of the variances from the approximated files, such that the variances in the `approx_variances` dictionary are of size $D \times D$.

```python
approx_variances = {}
for word in vocab:
    approx_variances[word] = glove_v.variance.load_variance(
        embedding_name='Toy-Embeddings',
        approximation=True,
        word_idx=vocab[word],
    )
```
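
Once a $D \times D$ variance $\Sigma_w$ is in hand, uncertainty propagates through any linear functional of the embedding exactly: for $f(x) = a^\top x$, $\mathrm{Var}[f] = a^\top \Sigma_w a$. A hand-rolled sketch of this (a generic illustration with an arbitrary direction $a$, not the `glove_v.propagate` API):

```python
import numpy as np

# Propagate GloVe-V uncertainty through a linear projection a @ x.
word = next(iter(vocab))
x = vectors[word]                 # D-dimensional embedding
sigma = approx_variances[word]    # reconstructed D x D variance
a = np.ones_like(x) / np.sqrt(x.shape[0])  # arbitrary unit-norm direction
proj = a @ x
se = np.sqrt(a @ sigma @ a)       # standard error of the projection
print(f"{word}: projection = {proj:.3f} +/- {1.96 * se:.3f} (95% interval)")
```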

We also offer a Gensim integration for working with GloVe-V embeddings using Gensim's KeyedVectors.

```python
gensim_glovev_kv = glove_v.GloVeVKeyedVectors(
    embedding_name='Toy-Embeddings',
)
```
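
Since the class builds on gensim's `KeyedVectors`, the familiar gensim query surface should apply. A hedged example using the stock `most_similar` method (how uncertainty is surfaced through such queries is covered in the tutorial, not assumed here):

```python
# Standard gensim-style query; `most_similar` is the stock KeyedVectors method.
query = next(iter(vocab))
print(gensim_glovev_kv.most_similar(query, topn=3))
```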
13 changes: 10 additions & 3 deletions glove_v/__init__.py
@@ -1,4 +1,11 @@
from . import propagate, variance, vector
from .data import download_embeddings
from .gensim_integration import GloVeVKeyedVectors

__all__ = [
    "propagate",
    "variance",
    "vector",
    "download_embeddings",
    "GloVeVKeyedVectors",
]
78 changes: 40 additions & 38 deletions glove_v/data.py
@@ -4,80 +4,84 @@
more lightweight, and guarantee 90% reconstruction of the original variance for each word.
"""

from pathlib import Path

import numpy as np
from huggingface_hub import hf_hub_download
from safetensors import safe_open
from safetensors.numpy import save_file

import glove_v.utils.file as file_utils


AVAILABLE_EMBEDDINGS = [
"Toy-Embeddings",
'COHA_1900-1999_300d',
"COHA_1900-1999_300d",
]


def download_embeddings(
    embedding_name: str,
    approximation: bool = True,
    download_dir: str = f"{file_utils.get_data_path()}/glove-v",
) -> None:
"""
Downloads the vectors and variances for a selected corpus.
Args:
embedding_name: (str) The specific embedding to download. This should match one of the keys in the AVAILABLE_EMBEDDINGS dictionary
approximation: (bool) Whether to download the approximate or complete GloVe-V variances. The GloVe embeddings
are the same for both cases.
download_dir: (str) path where GloVe-V files should be saved
"""
    if embedding_name not in AVAILABLE_EMBEDDINGS:
        raise ValueError(
            f"[ERROR] Embeddings should be one of the following: {AVAILABLE_EMBEDDINGS}"
        )

    final_download_dir = Path(download_dir) / embedding_name
    final_download_dir.mkdir(parents=True, exist_ok=True)

    # Download vocabulary and embeddings
    for file in ["vocab.txt", "vectors.safetensors"]:
        file_path = final_download_dir / file
        if not file_path.exists():
            downloaded_path = hf_hub_download(
                repo_id="reglab/glove-v",
                filename=f"{embedding_name}/{file}",
                local_dir=download_dir,
                repo_type="dataset",
            )
            print(f"[INFO] Downloaded {file}: {downloaded_path}")
        else:
            print(f"[INFO] {file} already exists in {final_download_dir}")

    # Download variances
    if approximation:
        print("[INFO] Downloading file containing approximated variances.")
        for file in ["ApproximationVariances.safetensors", "approx_info.txt"]:
            file_path = final_download_dir / file
            if not file_path.exists():
                downloaded_path = hf_hub_download(
                    repo_id="reglab/glove-v",
                    filename=f"{embedding_name}/{file}",
                    local_dir=download_dir,
                    repo_type="dataset",
                )
                print(f"[INFO] Downloaded {file}: {downloaded_path}")
            else:
                print(f"[INFO] {file} already exists in {final_download_dir}")
    else:
        print("[INFO] Downloading file containing complete variances.")

        output_path = final_download_dir / "CompleteVariances.safetensors"
        if not output_path.exists():
            download_and_reconstruct_complete_safetensor(
                embedding_name=embedding_name,
                download_dir=download_dir,
                output_path=str(output_path),
            )
        else:
            print(f"[INFO] CompleteVariances.safetensors already exists in {final_download_dir}")


def download_and_reconstruct_complete_safetensor(
@@ -88,41 +92,39 @@
    embedding_name: str,
    download_dir: str,
    output_path: str,
) -> None:
"""
Downloads chunked safetensor files from HuggingFace and reconstructs the complete safetensor
containing the original variances.
Args:
embedding_name: Name of the corpus on HuggingFace
download_dir: Path where to save the downloaded chunks
output_path: Path where to save the reconstructed complete safetensor
"""
    chunk_idx = 0
    all_variances = []

    while True:
        try:
            # Download chunk
            chunk_path = hf_hub_download(
                repo_id="reglab/glove-v",
                filename=f"{embedding_name}/complete_chunk_{chunk_idx}.safetensors",
                local_dir=download_dir,
                repo_type="dataset",
            )

            # Load chunk data
            with safe_open(chunk_path, framework="numpy") as f:
                # Get variances from chunk
                variances_chunk = f.get_tensor("variances")
                all_variances.append(variances_chunk)

            chunk_idx += 1

        except Exception:
            # No more chunks to download
            break

    # Concatenate all variance chunks
    complete_variances = np.concatenate(all_variances, axis=0)

    # Save reconstructed complete safetensor
    save_file({"variances": complete_variances}, output_path)
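
For reference, reading the reconstructed file back is a plain safetensors load. A minimal sketch (the path assumes the default `download_dir` layout and is hypothetical; the `"variances"` key and axis-0 chunking match the code above, so the expected shape is `(V, D, D)`):

```python
from safetensors import safe_open

# Hypothetical default path: <data_path>/glove-v/<embedding_name>/CompleteVariances.safetensors
path = "data/glove-v/Toy-Embeddings/CompleteVariances.safetensors"
with safe_open(path, framework="numpy") as f:
    variances = f.get_tensor("variances")  # expected shape: (V, D, D)
print(variances.shape)
```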