AdaOja

This repository contains the Python code that produces all of the experimental results in the paper "AdaOja: Adaptive Learning Rates for Streaming PCA". AdaOja is a new version of Oja's method with an adaptive learning rate. It performs comparably to other state-of-the-art methods and better than Oja's method with standard learning rate choices such as eta_i = c/i and eta_i = c/sqrt(i). The file streaming_subclass.py provides the framework for several streaming principal component analysis algorithms, including AdaOja, and can be applied to a wider set of problems and datasets than those presented here.
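To make the role of the learning rate concrete, here is a minimal NumPy sketch of block Oja's method with the two standard schedules, next to an adaptive variant. It is not taken from streaming_subclass.py: the function names `oja` and `adaptive_oja`, the parameter `b0`, and the AdaGrad-norm style step 1/b_i are assumptions chosen for illustration. See the paper and streaming_subclass.py for the actual AdaOja update.

```python
import numpy as np

def oja(blocks, k, lr, seed=0):
    """Block Oja's method; lr(i) is the step size for the i-th block (1-indexed)."""
    rng = np.random.RandomState(seed)
    Q = None
    for i, X in enumerate(blocks, start=1):
        if Q is None:
            # Random orthonormal d x k starting basis.
            Q, _ = np.linalg.qr(rng.randn(X.shape[1], k))
        G = X.T @ (X @ Q) / X.shape[0]       # block sample covariance applied to Q
        Q, _ = np.linalg.qr(Q + lr(i) * G)   # take a step, then re-orthonormalize
    return Q

def adaptive_oja(blocks, k, b0=1e-5, seed=0):
    """Same update with step 1/b_i, where b_i^2 = b_{i-1}^2 + ||G_i||_F^2.

    This step rule is an illustrative assumption, not the repository's code.
    """
    rng = np.random.RandomState(seed)
    Q, b = None, b0
    for X in blocks:
        if Q is None:
            Q, _ = np.linalg.qr(rng.randn(X.shape[1], k))
        G = X.T @ (X @ Q) / X.shape[0]
        b = np.sqrt(b ** 2 + np.linalg.norm(G, "fro") ** 2)  # accumulate gradient norms
        Q, _ = np.linalg.qr(Q + G / b)
    return Q

# Usage on random data split into 20 blocks of 50 samples each.
data = np.random.RandomState(1).randn(1000, 10)
blocks = np.split(data, 20)
c = 1.0
Q_inv = oja(blocks, k=2, lr=lambda i: c / i)
Q_sqrt = oja(blocks, k=2, lr=lambda i: c / np.sqrt(i))
Q_ada = adaptive_oja(blocks, k=2)
```

The fixed schedules need a value of c chosen by hand, while the adaptive variant sets its step size from the observed gradient norms.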

Dependencies

Python: tested with version 3.5.2
Jupyter Notebook
NumPy: tested with version 1.13.1
SciPy: tested with version 0.19.1
Matplotlib: tested with version 2.0.2

Note that all of these packages can most easily be installed using Anaconda as follows:

conda install (package-name)

The Anaconda distribution can be downloaded from the Anaconda website.

Streaming PCA Objects

The key code containing our streaming PCA objects is found in streaming_subclass.py. The main functionality for our PCA objects is found in StreamingPCA. Additionally, several subclasses are defined for specific algorithms:

AdaOja ¹
Oja: Oja's method ² for learning rates c/t and c/sqrt(t).
HPCA: History Principal Component Analysis ³
SPM: Streaming Power Method ⁴ ⁵

The file data_strm_subclass.py provides several examples of how to stream data into these classes. Current functionality runs AdaOja, HPCA and SPM simultaneously by streaming data from a list of blocks (run_sim_blocklist), from an array already loaded fully into memory (run_sim_fullX), or directly from a bag-of-words file (run_sim_bag).
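As a rough illustration of how an in-memory array relates to a list of blocks, the sketch below yields consecutive row blocks of size B from an array. The helper `iter_blocks` is hypothetical and is not part of data_strm_subclass.py.

```python
import numpy as np

def iter_blocks(X, B):
    """Yield consecutive row blocks of at most B rows from an in-memory array."""
    for start in range(0, X.shape[0], B):
        yield X[start:start + B]

# Usage: 10 samples of dimension 3, streamed in blocks of 4 rows.
X = np.arange(30.0).reshape(10, 3)
block_list = list(iter_blocks(X, 4))
block_shapes = [blk.shape for blk in block_list]  # the last block may be smaller
```

Each block would then be passed to a streaming PCA object in turn, so that only one block needs to be processed at a time.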

Plotting and Comparing AdaOja to other Algorithms

Datasets

We run AdaOja against several other streaming algorithms on three different kinds of datasets.

Synthetic Data

The functions to generate synthetic data are found in simulated_data.py.
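For readers who want a quick stand-in without opening simulated_data.py, one common way to generate synthetic PCA data is to draw Gaussian samples with a prescribed covariance spectrum. The sketch below is an assumption about what such a generator might look like; the function name `make_synthetic` and its parameters are hypothetical and do not come from the repository.

```python
import numpy as np

def make_synthetic(n, d, eigvals, seed=0):
    """Draw n samples from N(0, V diag(eigvals) V^T) with a random orthonormal V."""
    rng = np.random.RandomState(seed)
    eigvals = np.asarray(eigvals, dtype=float)
    V, _ = np.linalg.qr(rng.randn(d, d))   # random orthonormal eigenvectors
    Z = rng.randn(n, d)                    # standard normal samples
    X = (Z * np.sqrt(eigvals)) @ V.T       # scale each coordinate, then rotate
    return X, V

# Usage: 200 samples in dimension 5 with a decaying spectrum.
X_syn, V_true = make_synthetic(200, 5, eigvals=[10.0, 5.0, 1.0, 0.5, 0.1])
```

Returning V alongside X makes it possible to compare a learned basis against the true leading eigenvectors.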

Bag-of-words

These sparse, real-world bag-of-words datasets are available on the UCI Machine Learning Repository. Note that in order to run ExpVar_Comparison.ipynb your working directory must contain the following files:

docword.kos.txt
docword.nips.txt
docword.enron.txt
docword.nytimes.txt
docword.pubmed.txt

The file data_strm_subclass.py contains functions for parsing these bag-of-words text files in Python.

For example, for small bag-of-words datasets, the dimensions n and d, the number of non-zeros, the density, the dataset (as a sparse n x d CSR matrix) and the squared norm of the dataset are computed by running:

n, d, nnz, dense, SpX, norm2 = dssb.get_bagX('docword.kos.txt')

Alternatively, a list of the first m sparse blocks of size B can be returned by running the following:

n, d, nnz, dense, SpX, norm2 = dssb.get_bagXblocks('docword.nytimes.txt', B, block_total=m)
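For reference, the UCI docword files start with three header lines (number of documents, vocabulary size, number of non-zeros) followed by one `docID wordID count` triple per line, with 1-based IDs. The sketch below parses that layout with NumPy only. The function `read_docword` is hypothetical and is not the repository's get_bagX; it returns the entries as index and value arrays rather than a CSR matrix.

```python
import io
import numpy as np

def read_docword(lines):
    """Parse UCI docword text: 3 header lines, then 'docID wordID count' triples."""
    it = iter(lines)
    n = int(next(it))      # number of documents (rows)
    d = int(next(it))      # vocabulary size (columns)
    nnz = int(next(it))    # number of non-zero entries
    rows = np.empty(nnz, dtype=np.int64)
    cols = np.empty(nnz, dtype=np.int64)
    vals = np.empty(nnz, dtype=np.float64)
    j = 0
    for line in it:
        if not line.strip():
            continue       # skip blank lines
        doc, word, count = line.split()
        rows[j], cols[j], vals[j] = int(doc) - 1, int(word) - 1, float(count)
        j += 1
    density = nnz / (n * d)
    # Each (doc, word) pair appears once, so this is the squared Frobenius norm.
    norm2 = float(np.sum(vals ** 2))
    return n, d, nnz, density, (rows, cols, vals), norm2

# Usage on a tiny in-memory example with 3 documents and 4 words.
sample = io.StringIO("3\n4\n4\n1 1 2\n1 3 1\n2 2 3\n3 4 1\n")
n, d, nnz, density, (rows, cols, vals), norm2 = read_docword(sample)
```

With SciPy available, `scipy.sparse.csr_matrix((vals, (rows, cols)), shape=(n, d))` turns the three arrays into a sparse n x d CSR matrix.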

CIFAR-10

The CIFAR-10 dataset is available online. It is a subset of the considerably larger Tiny Images Dataset. Note that in order to run ExpVar_Comparison.ipynb, you must download the following files and include them in your working directory:

data_batch_1
data_batch_2
data_batch_3
data_batch_4
data_batch_5
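If these are the batch files from the Python version of CIFAR-10, each one is a pickled dictionary whose `b"data"` entry holds one flattened image per row. A minimal loading sketch under that assumption follows; the helper `load_cifar_batches` is hypothetical and is not the loader used in the notebook.

```python
import pickle
import numpy as np

def load_cifar_batches(paths):
    """Stack the image rows from CIFAR-10 (Python version) batch files."""
    chunks = []
    for path in paths:
        with open(path, "rb") as f:
            # The batches were pickled under Python 2, hence encoding="bytes".
            batch = pickle.load(f, encoding="bytes")
        chunks.append(np.asarray(batch[b"data"], dtype=np.float64))
    return np.vstack(chunks)
```

For example, `load_cifar_batches(["data_batch_%d" % i for i in range(1, 6)])` would stack all five training batches into a single array, one image per row.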

Running Experiments

We generate our comparison plots in ExpVar_Comparison.ipynb. These plots largely draw on two files: data_strm_subclass.py and plot_functions.py. To run this notebook, download the CIFAR-10 dataset and Bag-of-Words datasets as outlined in the section above and make sure the necessary files are in your working directory.

The file plot_functions.py compares and visualizes the final explained variance achieved by Oja's method, varying c for the learning rates eta_i = c/i and eta_i = c/sqrt(i), against the final explained variance achieved by AdaOja. These methods are stored in the class compare_lr. The file also plots HPCA, AdaOja, and SPM against each other using the function plot_hpca_ada in conjunction with the streaming methods from data_strm_subclass.py.
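Explained variance is commonly defined for a data matrix X and an orthonormal basis Q as ||XQ||_F^2 / ||X||_F^2. The sketch below assumes that definition, which should be checked against plot_functions.py, and shows how it can be computed either in one shot or accumulated block by block. Both function names are hypothetical.

```python
import numpy as np

def explained_variance(X, Q):
    """||X Q||_F^2 / ||X||_F^2 for an orthonormal d x k basis Q."""
    return np.linalg.norm(X @ Q, "fro") ** 2 / np.linalg.norm(X, "fro") ** 2

def explained_variance_blocks(blocks, Q):
    """Same quantity, accumulated over row blocks so X never has to be held whole."""
    num = den = 0.0
    for X in blocks:
        num += np.linalg.norm(X @ Q, "fro") ** 2
        den += np.linalg.norm(X, "fro") ** 2
    return num / den

# Usage: a diagonal matrix whose top direction is the first coordinate axis.
X = np.diag([3.0, 2.0, 1.0])
Q = np.eye(3)[:, :1]
ev_full = explained_variance(X, Q)                         # 9 / 14
ev_stream = explained_variance_blocks([X[:2], X[2:]], Q)   # same value
```

A value of 1 means the basis captures all of the data's energy, which happens when Q spans the full space.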

The class compare_time contains preliminary functionality to compare the time costs of AdaOja, HPCA, and SPM.
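A simple way to time any callable is to take the best of a few wall-clock measurements. The helper `time_call` below is a hypothetical sketch of that idea and is not how compare_time is implemented.

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeats` calls of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best

# Usage: time a cheap built-in call.
elapsed = time_call(sum, range(1000))
```

Taking the minimum rather than the mean reduces the influence of unrelated system activity on the measurement.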

Sources

License and Reference

This repository is licensed under the 3-clause BSD license; see LICENSE.md.

To reference this code base, please cite:

Amelia Henriksen and Rachel Ward. AdaOja: Adaptive Learning Rates for Streaming PCA. arXiv e-prints, page arXiv:1905.12115, May 2019
