by Tyler Renslow
This repository contains all scripts and data used for my master's thesis.
The repository structure loosely follows that of the Team Data Science Process developed by Microsoft. More info can be found here.
All scripts were written in Python 3, with additional packages used for different tasks.
Packages for data processing:
Packages for modeling:
- TensorFlow (the code was written when v1.7 was the latest release and may not work with newer versions)
- For training TensorFlow models on NVIDIA GPUs, follow the instructions at this link.
TODO:
- refactor all paths in scripts
- check compatibility with latest TensorFlow version
- find a more efficient way to store scraped Wikipedia articles, with the goal of making them easier to share and process
- store all large data in compressed files to save disk space
- reformat log files so that they show which feature set was used to train each model
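
One possible approach to the article storage and compression items above is a gzip-compressed JSON Lines file, with one article per line. The sketch below is not part of the repository: the function names (`write_articles`, `read_articles`), the file name, and the `title`/`text` record fields are all assumptions made for illustration. It uses only the Python 3 standard library.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_articles(path, articles):
    """Write an iterable of article dicts to a gzip-compressed JSON Lines file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for article in articles:
            # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
            f.write(json.dumps(article, ensure_ascii=False) + "\n")


def read_articles(path):
    """Yield article dicts one at a time, without loading the whole file into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical sample records; the real article schema may differ.
    sample = [
        {"title": "Example article", "text": "First paragraph."},
        {"title": "Another article", "text": "Some other text."},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "articles.jsonl.gz"
        write_articles(out_path, sample)
        for article in read_articles(out_path):
            print(article["title"])
```

A single compressed file of this kind is easy to share, and reading it line by line keeps memory use low even when the collection is large.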