Skip to content

A collection of code for Dimensionality Reduction for Documents with Nearest Neighbor Queries

License

Notifications You must be signed in to change notification settings

sfingram/apq-qsne

Repository files navigation

apq-qsne

A collection of C/C++ code implementing the algorithms from the paper "Dimensionality Reduction for Documents with Nearest Neighbor Queries" by Ingram and Munzner, 2014.

This includes the All-Pairs Query, or APQ, nearest-neighbor algorithm and the Q-SNE, dimensionality reduction algorithm.

Building

The code requires a C/C++ compiler. I have successfully used both clang and gcc.

Prerequisites

You will first need to have GLib installed on your system.

For Mac (with homebrew installed):

brew install glib

For Ubuntu Linux:

sudo apt-get install libglib2.0-dev

For other Linux build from source.

Compiling

For building the APQ algorithm just type

make

which will make the executable file testapq. For building BH-SNE with arbitrary input (use input from apq for Q-SNE) type

make tsne

As a fun bonus, there are some other executables you can build if you dig around in the makefile.

Input File Format

APQ processes document term-vectors into nearest-neighbor files (both in vec format).

QSNE processes nearest-neighbor files into coordinate files (in csv file).

Term-Vector File Description

Term vectors represent documents as a vector of term counts in a term vectors space. Each dimension represents a different term and the value represents the TFIDF weight of that term. Because most terms don't appear in a given document, APQ expects sparse vectors as input.

Nearest-Neighbor File Description

APQ computes the approximate k most similar documents for each document in the input as well as the approximate cosine distance between the specified document pairs.

Vec File Format

Both term term vectors and nearest neighbors are stored in vec format. Each line describes a document.

(index1,value1) (index2,value2)

For term-vectors, indices represent term numbers and values. For nearest neighbors, indices represent document numbers.

Running

(one day, maybe)

Command Line Options

for help running the APQ algorithm

testapq -?

for help running BH-SNE (and Q-SNE)

tsne -?

Some Examples

(one day, maybe)

About the Algorithms

(one day, maybe)

APQ

Q-SNE

About

A collection of code for Dimensionality Reduction for Documents with Nearest Neighbor Queries

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published