A sequence-to-sequence (seq2seq) model that maps mathematical expressions to their Taylor series expansions. Built for the July 2021 Carnegie Mellon University hackathon at the AI for physics conference: https://events.mcs.cmu.edu/qtc2021/
The Jupyter notebooks in this package depend on the following Python packages:
modules | description |
---|---|
re | regular expressions |
sympy | an excellent symbolic algebra module |
numpy | array manipulation and numerical analysis |
random | random numbers |
torch | PyTorch machine learning toolkit |
This package can be installed as follows:
```
git clone https://github.com/hbprosper/symbolic-ai
```
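The re and random modules are part of the Python standard library; sympy, numpy, and torch are third-party packages. A quick check that everything needed is importable (a minimal sketch, not part of the notebooks):

```python
# Confirm that the required packages are importable and report their versions.
import re        # standard library: regular expressions
import random    # standard library: random numbers
import numpy
import sympy
import torch

print('numpy', numpy.__version__)
print('sympy', sympy.__version__)
print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())
```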
notebook | description |
---|---|
seq2seq_data_generation | generate pairs of mathematical expressions |
seq2seq_data_prep | data preparation |
seq2seq_train | train! |
The file data/seq2seq_data.txt contains sequence pairs, one per line, in the format
```
<symbolic-expression><tab><symbolic-Taylor-series-expression>
```
which were created using seq2seq_data_generation.ipynb. For example, the first line in that file is
```
sinh(-2*x)	-2*x - 4*x**3/3
```
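A pair like this can be produced with sympy; the sketch below is illustrative only and is not necessarily the code used in seq2seq_data_generation.ipynb (the order of terms in the printed series may also differ from the file):

```python
import sympy as sp

x = sp.symbols('x')
expr = sp.sinh(-2*x)

# Expand about x = 0, keep terms below x**4, and drop the O() term.
taylor = sp.series(expr, x, 0, 4).removeO()

# Write the pair as a single tab-separated line, the format used in data/seq2seq_data.txt.
print('%s\t%s' % (expr, taylor))
```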
Since the data are already available, there is no need to run this notebook. The notebook seq2seq_data_prep.ipynb applies some filtering to data/seq2seq_data.txt and creates the filtered text files data/seq2seq_data_10000.txt and data/seq2seq_data_60000.txt, in which all spaces in the expressions have been removed. The first file contains 10,000 sequence pairs and the second contains 60,000 sequence pairs.
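A rough sketch of that preparation step, assuming the only transformations are removing spaces and keeping a fixed number of pairs (the notebook may apply additional filtering):

```python
# Read tab-separated pairs, strip all spaces, and keep the first N pairs.
N = 10000
with open('data/seq2seq_data.txt') as fin:
    pairs = [line.rstrip('\n').split('\t') for line in fin if line.strip()]

with open('data/seq2seq_data_%d.txt' % N, 'w') as fout:
    for expr, taylor in pairs[:N]:
        fout.write('%s\t%s\n' % (expr.replace(' ', ''), taylor.replace(' ', '')))
```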
The symbolic translation model can be trained using the Jupyter notebook seq2seq_train.ipynb, which should be run on a system with GPU support (e.g., Google Colaboratory). The notebook defines a sequence-to-sequence (seq2seq) model comprising a sequence encoder followed by a sequence decoder, each built from two or more layers of Long Short-Term Memory (LSTM) units. An LSTM returns output, (hidden, cell), where, for a given sequence, output is a tensor of floats whose size equals the number of unique tokens from which the sequences are formed plus 2. The extra length of 2 accounts for two extra tokens: one for padding (a space) and one for an unknown character (a question mark). The objects hidden and cell are the so-called hidden and cell states, respectively, which provide encodings of the input sequence. In spite of its suggestive name, an LSTM is just another very clever non-linear function that was developed by conceptualizing a device containing various filtering elements.
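A minimal sketch of such an encoder (layer sizes and names are illustrative, not necessarily those used in the notebook):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Illustrative encoder: token embeddings fed through a stack of LSTM layers."""
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=128, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm      = nn.LSTM(embed_dim, hidden_dim, num_layers)

    def forward(self, src):
        # src: (seq_len, batch) tensor of token indices
        embedded = self.embedding(src)                # (seq_len, batch, embed_dim)
        output, (hidden, cell) = self.lstm(embedded)  # states: (num_layers, batch, hidden_dim)
        return hidden, cell                           # keep only the states, as described above

# Example: a vocabulary of 30 tokens plus the 2 extras (padding and unknown).
enc = Encoder(vocab_size=30 + 2)
src = torch.randint(0, 32, (15, 4))                   # a batch of 4 sequences of length 15
hidden, cell = enc(src)
print(hidden.shape, cell.shape)                       # torch.Size([2, 4, 128]) twice
```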
The seq2seq_train.ipynb notebook does the following:
- Read a filtered text file and delimit each sequence of characters with a tab and a newline character.
- Build a character (i.e., token) to integer map for the input sequences and another for the target (that is, output) sequences.
- Use the maps to convert each sequence to an array of integers, where each integer corresponds to a unique character. We do not use padding.
- Create an Encoder, which performs the following tasks:
  - Map each integer encoding of a token to a dense vector representation using the PyTorch Embedding class.
  - Call a stack of LSTMs, keeping only the hidden and cell states.
- Create a Decoder, which performs the following tasks:
  1. Take in the first character (the tab) from an input sequence (in batches, of course :), plus the hidden and cell states from the Encoder.
  2. Compute output, (hidden, cell).
  3. From output, determine the index of the predicted character, or, during training, use the target index with some probability (teacher forcing), as sketched after this list.
  4. Return to step 2 until the output sequence is complete.
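The decoding loop with teacher forcing might look roughly like this (class and variable names are illustrative, not those in seq2seq_train.ipynb):

```python
import random
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Illustrative decoder: one character in, scores over the vocabulary out."""
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=128, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm      = nn.LSTM(embed_dim, hidden_dim, num_layers)
        self.fc        = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, hidden, cell):
        # token: (batch,) indices of the current character
        embedded = self.embedding(token.unsqueeze(0))            # (1, batch, embed_dim)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        scores = self.fc(output.squeeze(0))                      # (batch, vocab_size)
        return scores, hidden, cell

def decode(decoder, hidden, cell, target, teacher_forcing=0.5):
    """Generate target[1:], feeding back either the prediction or the true character."""
    token   = target[0]                               # indices of the first character (the tab)
    outputs = []
    for t in range(1, target.size(0)):
        scores, hidden, cell = decoder(token, hidden, cell)
        outputs.append(scores)
        predicted = scores.argmax(dim=1)
        # With some probability use the true next character instead (teacher forcing).
        token = target[t] if random.random() < teacher_forcing else predicted
    return torch.stack(outputs)                       # (target_len - 1, batch, vocab_size)

# Example with random data: a vocabulary of 40 tokens, batch of 4, target length 12.
dec    = Decoder(vocab_size=40)
hidden = torch.zeros(2, 4, 128)
cell   = torch.zeros(2, 4, 128)
target = torch.randint(0, 40, (12, 4))
print(decode(dec, hidden, cell, target).shape)        # torch.Size([11, 4, 40])
```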
If you wish to use Google Colab, you need the following cell at the start of your version of the notebook seq2seq_train.ipynb:
```python
from google.colab import drive
drive.mount('/content/gdrive')

import sys
sys.path.append('/content/gdrive/My Drive/AI')
```
This assumes that you have a Google account and that you've created a folder called AI on your Google Drive. Follow the instructions on the screen. When you get to a website showing a long cryptic code, click the button to the right of the code to copy it, then paste it into the entry window that appears in the notebook. To make sure Python knows where to find modules (such as seq2sequtil.py), add the full path of the AI folder to sys.path.