\input{common/header.tex}
\inbpdocument
\chapter{Automatic Model Description}
\label{ch:description}
\begin{quotation}
``Not a wasted word.
This has been a main point to my literary thinking all my life.''
\hspace*{\fill} \emph{ -- Hunter S. Thompson}
\end{quotation}
The previous chapter showed how to automatically build structured models by searching through a language of kernels.
It also showed how to decompose the resulting models into the different types of structure present, and how to visually illustrate the type of structure captured by each component.
This chapter shows how to automatically describe the resulting model structures using English text.
The main idea is to describe every part of a given product of kernels as an adjective, or as a short phrase that modifies the description of a kernel.
To see how this could work, recall that the model decomposition plots of \cref{ch:grammar} showed that most of the structure in each component was determined by that component's kernel.
Even across different datasets, the meanings of individual parts of different kernels are consistent in some ways.
For example, $\kPer$ indicates repeating structure, and $\kSE$ indicates smooth change over time.
This chapter also presents a system that generates reports combining automatically generated text and plots, highlighting interpretable features discovered in a data set.
A complete example of an automatically-generated report can be found in appendix~\ref{ch:example-solar}.
The work appearing in this chapter was written in collaboration with James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani, and was published in \citet{LloDuvGroetal14}.
The procedure translating kernels into adjectives developed out of discussions between James and myself.
James Lloyd wrote the code to automatically generate reports, and ran all of the experiments.
The paper upon which this chapter is based was written mainly by James Lloyd and me.
\section{Generating descriptions of composite kernels}
There are two main features of our language of \gp{} models that allow description to be performed automatically.
First, any kernel expression in the language can be simplified into a sum of products.
As discussed in \cref{sec:modeling-sums-of-functions}, a sum of kernels corresponds to a sum of functions, so each resulting product of kernels can be described separately, as part of a sum.
Second, each kernel in a product modifies the resulting model in a consistent way.
Therefore, one can describe a product of kernels by concatenating descriptions of the effect of each part of the product.
One part of the product needs to be described using a noun, which is modified by the other parts.
For example, one can describe the product of kernels $\kPer \kerntimes \kSE$ by representing $\kPer$ by a noun (``a periodic function'') modified by a phrase representing the effect of the $\kSE$ kernel (``whose shape varies smoothly over time'').
To simplify the system, we restricted the base kernels to the set $\{\kC, \kLin, \kWN, \kSE, \kPer, \vsigma\}$.
Recall that the sigmoidal kernel $\vsigma(x, x') = \sigma(x)\sigma(x')$ allows changepoints and change-windows.
\subsection{Simplification rules}
\label{sec:desc-simplification}
In order to use the same phrase to describe the effect of each base kernel in different circumstances, our system converts each kernel expression into a standard, simplified form.
First, our system distributes all products of sums into sums of products.
Then, it applies several simplification rules to the kernel expression:
\begin{itemize}
\item Products of two or more $\kSE$ kernels can be equivalently replaced by a single $\kSE$ with different parameters.
\item Multiplying the white-noise kernel ($\kWN$) by any stationary kernel ($\kC$, $\kWN$, $\kSE$, or $\kPer$) gives another $\kWN$ kernel.
\item Multiplying any kernel by the constant kernel ($\kC$) only changes the parameters of the original kernel, and so can be factored out of any product in which it appears.
\end{itemize}
After applying these rules, any composite kernel expressible by the grammar can be written as a sum of terms of the form:
\begin{align}
K \prod_m \kLin^{(m)} \prod_n \boldsymbol\sigma^{(n)},
\label{eq:sop}
\end{align}
where $K$ is one of $\kWN$, $\kC$, $\kSE$, $\prod_k \kPer^{(k)}$, or $\kSE \kerntimes \prod_k \kPer^{(k)}$,
and $\prod_i\kernel^{(i)}$ denotes a product of kernels, each having different parameters.
Superscripts distinguish different instances of the same kernel appearing in a product: $\kSE^{(1)}$ can have different kernel parameters than $\kSE^{(2)}$.
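To make these rules concrete, the following sketch applies them to a toy symbolic representation of kernel expressions.
This is a minimal illustration only: the representation and all function names are invented here, and are not taken from the actual implementation.
\begin{verbatim}
# Sketch of the simplification rules.  A base kernel is a string;
# sums and products are ('+', [...]) and ('*', [...]) nodes.
STATIONARY = {'C', 'WN', 'SE', 'Per'}

def distribute(expr):
    """Expand all products of sums into a flat sum of products."""
    if isinstance(expr, str):
        return [[expr]]                  # a sum with a single product term
    op, args = expr
    if op == '+':
        return [term for a in args for term in distribute(a)]
    # product node: cartesian product of the summands of each factor
    terms = [[]]
    for a in args:
        terms = [t + u for t in terms for u in distribute(a)]
    return terms

def simplify_product(factors):
    """Apply the three rewrite rules to one product of base kernels."""
    # Rule 1: a product of SE kernels is a single SE (parameters change).
    n_se = factors.count('SE')
    factors = [f for f in factors if f != 'SE'] + (['SE'] if n_se else [])
    # Rule 2: WN times any stationary kernel is another WN.
    if 'WN' in factors:
        factors = [f for f in factors if f not in STATIONARY] + ['WN']
    # Rule 3: multiplying by C only rescales, so C can be dropped.
    non_const = [f for f in factors if f != 'C']
    return non_const if non_const else ['C']

def simplify(expr):
    return [simplify_product(t) for t in distribute(expr)]

# SE * (WN * Lin + CP(C, Per)), with the changepoint already expanded
# into sigmoidal kernels 'sig' and 'sigbar':
expr = ('*', ['SE', ('+', [('*', ['WN', 'Lin']),
                           ('*', ['C', 'sig']),
                           ('*', ['Per', 'sigbar'])])])
print(simplify(expr))
# [['Lin', 'WN'], ['sig', 'SE'], ['Per', 'sigbar', 'SE']]
\end{verbatim}
The final line reproduces the simplified sum of products derived by hand in the worked example below.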
\subsection{Describing each part of a product of kernels}
Each kernel in a product modifies the resulting \gp{} model in a consistent way.
This allows one to describe the contribution of each kernel in a product as an adjective, or more generally as a modifier of a noun.
We now describe how each of the kernels in our grammar modifies a \gp{} model:
%, and how this effect can be described in natural language:
\begin{itemize}
\item {\bf Multiplication by $\kSE$} removes long-range correlations from a model, since $\kSE(x,x')$ decreases monotonically to 0 as $|x - x'|$ increases.
%This can be described as making an existing model's correlation structure `local' or `approximate'.
This converts any global correlation structure into local correlation only.
\item {\bf Multiplication by $\kLin$} is equivalent to multiplying the function being modeled by a linear function.
If $f(x) \dist \gp{}(0, \kernel)$, then $x \kerntimes f(x) \dist \gp{}\left(0, \kLin \kerntimes \kernel \right)$.
This causes the standard deviation of the model to vary linearly, without affecting the correlation between function values.
% and can be described as \eg `with linearly increasing standard deviation'.
\item {\bf Multiplication by $\boldsymbol\sigma$} is equivalent to multiplying the function being modeled by a sigmoid, which means that the function goes to zero before or after some point.
%This can be described as \eg `from [time]' or `until [time]'.
\item {\bf Multiplication by $\kPer$}
removes correlation between all pairs of function values not close to one period apart, allowing variation within each period, but maintaining correlation between periods.
\item {\bf Multiplication by any kernel}
modifies the covariance in the same way as multiplying by a function drawn from a corresponding \gp{} prior.
This follows from the fact that if ${f_1(x) \dist \gp{}(0, \kernel_1)}$ and ${f_2(x) \dist \gp{}(0, \kernel_2)}$ then
%
\begin{align}
\textrm{Cov} \big[f_1(x)f_2(x), \; f_1(x')f_2(x') \big] = \kernel_1(x,x') \kerntimes \kernel_2(x,x').
\end{align}
%
Put more plainly, a \gp{} whose covariance is a product of kernels has the same covariance as a product of two functions, each drawn from the corresponding \gp{} prior.
The identity holds because $f_1$ and $f_2$ are independent and zero-mean, so the expectation of the product factorizes.
However, the product $f_1 \kerntimes f_2$ is not itself \gp{}-distributed in general -- its distribution can have nonzero third and higher central moments.
This identity can be used to generate a cumbersome ``worst-case'' description in cases where a more concise description of the effect of a kernel is not available.
For example, it is used in our system to describe products of more than one periodic kernel.
A numerical check of the identity is sketched just after this list.
\end{itemize}
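The product-covariance identity above is straightforward to check numerically.
The following sketch, which assumes only \texttt{numpy} and uses arbitrary kernel choices, draws many independent pairs of function values at two inputs and compares the empirical covariance of their product with the product of the two kernels.
\begin{verbatim}
import numpy as np

def k_se(x, xp, ell=1.0):
    # squared-exponential kernel, evaluated on all pairs of inputs
    return np.exp(-0.5 * (x[:, None] - xp[None, :])**2 / ell**2)

def k_per(x, xp, p=1.0):
    # a simple periodic kernel
    d = np.abs(x[:, None] - xp[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / p)**2)

rng = np.random.default_rng(0)
x = np.array([0.3, 1.7])               # the two inputs x and x'
K1, K2 = k_se(x, x), k_per(x, x)

n = 200000
f1 = rng.multivariate_normal(np.zeros(2), K1, size=n)  # f1 ~ GP(0, k1)
f2 = rng.multivariate_normal(np.zeros(2), K2, size=n)  # f2 ~ GP(0, k2)
g = f1 * f2                            # pointwise product of the draws

emp = np.cov(g[:, 0], g[:, 1])[0, 1]   # empirical Cov[g(x), g(x')]
print(emp, K1[0, 1] * K2[0, 1])        # agree up to Monte Carlo error
\end{verbatim}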
\Cref{table:modifiers} gives the corresponding description of the effect of each type of kernel in a product, written as a post-modifier.
%
\begin{table}[h!]
\centering
\begin{tabular}{l|l}
Kernel & Postmodifier phrase \\
\midrule
$\kSE$ & whose shape changes smoothly \\
$\kPer$ & modulated by a periodic function \\
$\kLin$ & with linearly varying amplitude \\
$\prod_k \kLin^{(k)}$ & with polynomially varying amplitude \\
$\prod_k \boldsymbol{\sigma}^{(k)}$ & which applies until / from [changepoint] \\
\end{tabular}
\caption[Descriptions of the effect of each kernel, written as a post-modifier]{
Descriptions of the effect of each kernel, written as a post-modifier.
}
\label{table:modifiers}
\end{table}
\Cref{table:nouns} gives the corresponding description of each kernel before it has been multiplied by any other, written as a noun phrase.
%
\begin{table}[h!]
\centering
\begin{tabular}{l|l}
Kernel & Noun phrase \\
\midrule
$\kWN$ & uncorrelated noise \\
$\kC$ & constant \\
$\kSE$ & smooth function \\
$\kPer$ & periodic function \\
$\kLin$ & linear function \\
$\prod_k \kLin^{(k)}$ & \{quadratic, cubic, quartic, \dots \} function %polynomial \\
\end{tabular}
\caption[Noun phrase descriptions of each type of kernel]{
Noun phrase descriptions of each type of kernel.}
\label{table:nouns}
\end{table}
\subsection{Combining descriptions into noun phrases}
In order to build a noun phrase describing a product of kernels, our system chooses one kernel to act as the head noun, which is then modified by appending descriptions of the other kernels in the product.
%described by the functions it encodes for when unmodified \eg `smooth function' for $\kSE$.
%Modifiers corresponding to the other kernels in the product are then appended to this description, forming a noun phrase of the form:
%\begin{align*}
%\textnormal{Determiner} + \textnormal{Premodifiers} + \textnormal{Noun} + \textnormal{Postmodifiers}
%\end{align*}
As an example, a kernel of the form $\kPer \kerntimes \kLin \kerntimes \boldsymbol{\sigma}$ could be described as a
\begin{align*}
\underbrace{\kPer}_{\textnormal{\small periodic function}} \times
\underbrace{\kLin}_{\textnormal{\small with linearly varying amplitude}} \times
\underbrace{\boldsymbol{\sigma}}_{\textnormal{\small which applies until 1700.}}
\end{align*}
where $\kPer$ was chosen to be the head noun.
In our system, the head noun is chosen according to the following ordering:
\begin{align}
\kPer, \kWN, \kSE, \kC, \prod_m \kLin^{(m)}, \prod_n \boldsymbol\sigma^{(n)}
\label{eq:noun-ordering}
\end{align}
%\ie $\kPer$ is always chosen as the head noun when present.
Combining \cref{table:modifiers,table:nouns} with the ordering in \cref{eq:noun-ordering} provides a general method to produce descriptions of sums and products of these base kernels.
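As an illustration, the following sketch generates such a noun phrase from a product of base kernels, using abbreviated versions of the phrases in \cref{table:modifiers,table:nouns} and the head-noun ordering above.
The dictionaries and function names are hypothetical simplifications of the actual system.
\begin{verbatim}
# Sketch of noun-phrase generation.  NOUN and POST abbreviate the
# tables of noun phrases and post-modifiers; ORDER is the head-noun
# priority.  All names here are invented for illustration.
NOUN = {'Per': 'periodic function', 'WN': 'uncorrelated noise',
        'SE': 'smooth function', 'C': 'constant',
        'Lin': 'linear function'}
POST = {'SE': 'whose shape changes smoothly',
        'Per': 'modulated by a periodic function',
        'Lin': 'with linearly varying amplitude',
        'sig': 'which applies until [changepoint]',
        'sigbar': 'which applies from [changepoint]'}
ORDER = ['Per', 'WN', 'SE', 'C', 'Lin', 'sig', 'sigbar']

def describe(product):
    head = min(product, key=ORDER.index)   # pick the head noun
    rest = list(product)
    rest.remove(head)                      # the rest act as modifiers
    return ' '.join([NOUN[head]] + [POST[k] for k in rest])

print(describe(['Per', 'Lin', 'sig']))
# periodic function with linearly varying amplitude
#   which applies until [changepoint]
\end{verbatim}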
\subsubsection{Extensions and refinements}
%\subsubsection{Refinements to the Descriptions}
In practice, the system also incorporates a number of other rules which help to make the descriptions shorter, easier to parse, or clearer:
%There are a number of ways in which the descriptions of the kernels can be made more interpretable and informative:
\begin{itemize}
% \item We incorporate rules for which kernel is chosen as the head noun can change the interpretability of a description.
\item The system adds extra adjectives depending on kernel parameters (one such rule is sketched after this list).
For example, an $\kSE$ with a relatively short lengthscale might be described as ``a rapidly-varying smooth function'' as opposed to just ``a smooth function''.
\item Descriptions can include kernel parameters.
For example, the system might write that a function is ``repeating with a period of 7 days''.
\item Descriptions can include extra information about the model not contained in the kernel.
For example, based on the posterior distribution over the function's slope, the system might write ``a linearly increasing function'' as opposed to ``a linear function''.
\item Some kernels can be described through pre-modifiers.
For example, the system might write ``an approximately periodic function'' as opposed to ``a periodic function whose shape changes smoothly''.
\end{itemize}
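As one example of such a rule, the sketch below picks an extra adjective from a fitted $\kSE$ lengthscale, measured relative to the range of the inputs; the thresholds are invented for illustration and may differ from those used in the real system.
\begin{verbatim}
# Hypothetical refinement rule: choose an extra adjective from the
# fitted SE lengthscale relative to the range of the inputs.
def se_adjective(lengthscale, input_range):
    ratio = lengthscale / input_range
    if ratio < 0.05:
        return 'rapidly-varying '
    if ratio > 0.5:
        return 'slowly-varying '
    return ''

print(se_adjective(0.02, 1.0) + 'smooth function')
# rapidly-varying smooth function
\end{verbatim}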
%The reports in the supplementary material and in \cref{sec:example-descriptions} include some of these refinements.
%The parameters and design choices of these refinements have been chosen by our best judgement, but learning these parameters objectively from expert statisticians would be an interesting area for future study.
\subsubsection{Ordering additive components}
The reports generated by our system attempt to present the most interesting or important features of a dataset first.
As a heuristic, the system orders components greedily: at each step it adds the component that most reduces the 10-fold cross-validated mean absolute error, as sketched below.
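A minimal sketch of this heuristic, assuming a hypothetical function \texttt{cv\_mae} that fits the model restricted to a given set of components and returns its 10-fold cross-validated mean absolute error:
\begin{verbatim}
# Greedy ordering of additive components.  cv_mae is a hypothetical
# stand-in for fitting the model restricted to the chosen components
# and scoring it by 10-fold cross-validated mean absolute error.
def order_components(components, cv_mae):
    chosen, remaining = [], list(components)
    while remaining:
        best = min(remaining, key=lambda c: cv_mae(chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
\end{verbatim}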
\subsection{Worked example}
This section shows an example of our procedure describing a compound kernel containing every type of base kernel in our set:
%
\begin{align}
\kSE \kerntimes (\kWN \kerntimes \kLin \kernplus \kCP(\kC, \kPer)).
\end{align}
%
The kernel is first converted into a sum of products, and the changepoint is converted into sigmoidal kernels (recall the definition of changepoint kernels in \cref{sec:changepoint-definition}):
%
\begin{align}
\kSE \kerntimes \kWN \kerntimes \kLin \kernplus
\kSE \kerntimes \kC \kerntimes \boldsymbol{\sigma} \kernplus
\kSE \kerntimes \kPer \kerntimes \boldsymbol{\bar\sigma}
\end{align}
%
which is then simplified using the rules in \cref{sec:desc-simplification} to
%
\begin{align}
\kWN \kerntimes \kLin \kernplus
\kSE \kerntimes \boldsymbol{\sigma} \kernplus
\kSE \kerntimes \kPer \kerntimes \boldsymbol{\bar\sigma}.
\end{align}
To describe the first component, $(\kWN \kerntimes \kLin)$, the head noun description for $\kWN$, ``uncorrelated noise'', is concatenated with a modifier for $\kLin$, ``with linearly increasing standard deviation''.
The second component, $(\kSE \kerntimes \vsigma)$, is described by the noun phrase for the $\kSE$, ``a smooth function with a lengthscale of [lengthscale] [units]'', followed by the modifier for $\vsigma$, ``which applies until [changepoint]''.
Finally, the third component, $(\kSE \kerntimes \kPer \kerntimes \boldsymbol{\bar\sigma})$, is described as ``An approximately periodic function with a period of [period] [units] which applies from [changepoint]''.
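Tying the earlier sketches together, the worked example can be run end-to-end using the hypothetical \texttt{simplify} and \texttt{describe} functions defined above:
\begin{verbatim}
# expr is the symbolic form of SE * (WN * Lin + CP(C, Per))
# defined in the simplification sketch
for product in simplify(expr):
    print(describe(product))
# uncorrelated noise with linearly varying amplitude
# smooth function which applies until [changepoint]
# periodic function which applies from [changepoint]
#   whose shape changes smoothly
\end{verbatim}
The output mirrors the hand-derived descriptions, minus the parameter-dependent refinements such as ``approximately periodic''.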
\section{Example descriptions}
\label{sec:example-descriptions}
In this section, we demonstrate the ability of our procedure, \procedurename, to write intelligible descriptions of the structure present in two time series.
The examples presented here describe models produced by the automatic search method presented in \cref{ch:grammar}.
\subsection{Summarizing 400 years of solar activity}
\label{sec:solar}
First, we show excerpts from the report automatically generated on annual solar irradiation data from 1610 to 2011.
This dataset is shown in \cref{fig:solar}.
%
\begin{figure}[ht!]
\centering
\includegraphics[trim=0.2cm 18.0cm 8cm 2cm, clip, width=0.6\columnwidth]{\grammarfiguresdir/solarpages/pg_0002-crop}
\caption[Solar irradiance dataset]
{Solar irradiance data~\citep{lean1995reconstruction}.}
\label{fig:solar}
\end{figure}
This time series has two pertinent features:
first, a roughly 11-year cycle of solar activity;
second, a period lasting from 1645 to 1715 with almost no variance.
This flat region is known as the Maunder minimum, a period in which sunspots were extremely rare \citep{lean1995reconstruction}.
The Maunder minimum is an example of the type of structure that can be captured by change-windows.
\begin{figure}[ht!]
\centering
\fbox{\includegraphics[trim=0cm 10.8cm 0cm 9.3cm, clip, width=0.98\columnwidth]{\grammarfiguresdir/solarpages/pg_0002-crop}}
\caption[Automatically-generated description of the solar irradiance data set]
{Automatically generated descriptions of the first four components discovered by \procedurename{} on the solar irradiance data set.
The dataset has been decomposed into diverse structures having concise descriptions.}
\label{fig:exec}
\end{figure}
The first section of each report generated by \procedurename{} is a summary of the structure found in the dataset.
\Cref{fig:exec} shows natural-language summaries of the top four components discovered by \procedurename{} on the solar dataset.
From these summaries, we can see that the system has identified the Maunder minimum (second component) and the 11-year solar cycle (fourth component).
These components are visualized and described in figures~\ref{fig:maunder} and \ref{fig:periodic}, respectively.
The third component, visualized in \cref{fig:smooth}, captures the smooth variation over time of the overall level of solar activity.
\begin{figure}[]%
\centering
\fbox{
\begin{tabular}{@{}c@{}}
\includegraphics[trim=0cm 10.5cm 0cm 0.7cm, clip, width=0.98\columnwidth]{\grammarfiguresdir/solarpages/pg_0005-crop} \\
\includegraphics[trim=0cm 4.75cm 0cm 3.12cm, clip, width=0.98\columnwidth]{\grammarfiguresdir/solarpages/pg_0005-crop}
\end{tabular}
}
\caption[A component corresponding to the Maunder minimum]
{Extract from an automatically-generated report describing the model component corresponding to the Maunder minimum.}
\label{fig:maunder}
%\end{figure}
%
%\begin{figure}[h!]
\centering
\fbox{
\begin{tabular}{@{}c@{}}
\includegraphics[trim=0cm 10.5cm 0cm 1.05cm, clip, width=0.98\columnwidth]{\grammarfiguresdir/solarpages/pg_0006-crop} \\
\includegraphics[trim=0cm 4.75cm 0cm 3.9cm, clip, width=0.98\columnwidth]{\grammarfiguresdir/solarpages/pg_0006-crop}
\end{tabular}
}
\caption[\procedurename{} isolating part of the signal explained by a slowly-varying trend]
{Characterizing the medium-term smoothness of solar activity levels.
By allowing other components to explain the periodicity, noise, and the Maunder minimum, \procedurename{} can isolate the part of the signal best explained by a slowly-varying trend.}
\label{fig:smooth}
%\end{figure}
%
%\begin{figure}[ht!]
\centering
\fbox{
\begin{tabular}{@{}c@{}}
\includegraphics[trim=0cm 10.5cm 0cm 1.05cm, clip, width=0.98\columnwidth]{\grammarfiguresdir/solarpages/pg_0007-crop} \\
\includegraphics[trim=0cm 4.75cm 0cm 4.66cm, clip, width=0.98\columnwidth]{\grammarfiguresdir/solarpages/pg_0007-crop}
\end{tabular}
}
\caption[Automatic description of the solar cycle]
{
This part of the report isolates and describes the approximately 11-year sunspot cycle, also noting its disappearance during the Maunder minimum.}
% also noting its disappearance during the 16th century, a time known as the Maunder minimum \citep{lean1995reconstruction}.}
\label{fig:periodic}
\end{figure}
%
The complete report generated on this dataset can be found in appendix~\ref{ch:example-solar}.
Each report also contains samples from the model posterior.
\phantom{a}
%\newpage
\subsection{Describing changing noise levels} % in Air Traffic Data}
\label{sec:airline}
%\begin{figure}[ht!]
%\centering
%\includegraphics[trim=0.4cm 16.8cm 8cm 2cm, clip, width=0.5\columnwidth]{\grammarfiguresdir/airlinepages/pg_0002-crop}
%\caption[International airline passenger monthly volume dataset]
%{International airline passenger monthly volume \citep[e.g.][]{box2013time}.}
%\label{fig:airline}
%\end{figure}
Next, we present excerpts of the description generated by our procedure on a model of international airline passenger counts over time, shown in \cref{fig:airline_decomp}.
%The model described is the same as that shown in \cref{fig:airline_decomp}.
%constructed by \procedurename{} has four components: $\kLin \kernplus \kSE \kerntimes \kPer \kerntimes \kLin \kernplus \kSE \kernplus \kWN \kerntimes \kLin$, with
%
High-level descriptions of the four components discovered are shown in \cref{fig:exec-airline}.
\begin{figure}[ht!]
\centering
\fbox{\includegraphics[trim=0cm 9.5cm 0cm 9.25cm, clip, width=0.98\columnwidth]{\grammarfiguresdir/airlinepages/pg_0002-crop}}
\caption[Short descriptions of the four components of the airline model]
{Short descriptions of the four components of a model describing the airline dataset.}
\label{fig:exec-airline}
\end{figure}
\begin{figure}[ht!]
\centering
\fbox{
\begin{tabular}{@{}c@{}}
\includegraphics[trim=0cm 10.5cm 0cm 1cm, clip, width=0.98\columnwidth]{\grammarfiguresdir/airlinepages/pg_0004-crop} \\
\includegraphics[trim=0cm 4.7cm 0cm 4.3cm, clip, width=0.98\columnwidth]{\grammarfiguresdir/airlinepages/pg_0004-crop}
\end{tabular}
}
\caption[Describing non-stationary periodicity in the airline data]
{Describing non-stationary periodicity in the airline data.}
\label{fig:lin_periodic}
\end{figure}
\begin{figure}[ht!]
\centering
\fbox{
\begin{tabular}{@{}c@{}}
\includegraphics[trim=0cm 6.7cm 0cm 0.6cm, clip, width=0.98\columnwidth]{\grammarfiguresdir/airlinepages/pg_0006-crop} \\
\includegraphics[trim=0cm 0cm 0cm 3.98cm, clip, width=0.98\columnwidth]{\grammarfiguresdir/airlinepages/pg_0006-crop}
\end{tabular}
}
\caption[Describing time-changing variance in the airline dataset]
{Describing time-changing variance in the airline dataset.}
\label{fig:heteroscedastic}
\end{figure}
The second component, shown in \cref{fig:lin_periodic}, is accurately described as approximately ($\kSE$) periodic ($\kPer$) with linearly growing amplitude ($\kLin$).
The description of the fourth component, shown in \cref{fig:heteroscedastic}, expresses the fact that the scale of the unstructured noise in the model grows linearly with time.
%By multiplying a white noise kernel by a linear kernel, the model is able to express heteroscedasticity (covariance which changes over time).
The complete report generated on this dataset can be found in the supplementary material of \citet{LloDuvGroetal14}.
Other example reports describing a wide variety of time series can be found at \url{http://mlg.eng.cam.ac.uk/lloyd/abcdoutput/}.
\section{Related work}
\label{sec:description-related-work}
To the best of our knowledge, our procedure is the first example of automatic textual description of a nonparametric statistical model.
However, systems with natural language output have been developed for automatic video description \citep{barbu2012video} and automated theorem proving \citep{ganesalingam2013fully}.
Although not a description procedure, \citet{durrande2013gaussian} developed an analytic method for decomposing \gp{} posteriors into entirely periodic and entirely non-periodic parts, even when using non-periodic kernels.
\section{Limitations of this approach}
\label{sec:limitations-of-abcd}
During development, we noted several difficulties with this overall approach:
\begin{itemize}
\item {\bf Some kernels are hard to describe.}
For instance, we did not include the $\kRQ$ kernel in the text-generation procedure.
This was done for several reasons.
First, the $\kRQ$ kernel can be equivalently expressed as a scale mixture of $\kSE$ kernels, making it redundant in principle.
Second, it was difficult to think of a clear and concise description of the effect of the hyperparameter that controls the heaviness of the tails of the $\kRQ$ kernel.
Third, a product of two $\kRQ$ kernels does not give another $\kRQ$ kernel, which raises the question of how to concisely describe products of $\kRQ$ kernels.
\item {\bf Reliance on additivity.}
Much of the modularity of the description procedure is due to the additive decomposition.
%However, some types of structure do not allow additive decomposition.
%For example, A multiplication of , or only decompose additively after being warped.
However, additivity is lost under any nonlinear transformation of the output.
Such warpings can be learned~\citep{snelson2004warped}, but descriptions of transformations of the data may not be as clear to the end user.
\item {\bf Difficulty of expressing uncertainty.}
A natural extension to the model search procedure would be to report a posterior distribution on structures and kernel parameters, rather than point estimates.
Describing uncertainty about the hyperparameters of a particular structure may be feasible,
but describing even a few of the most probable structures might result in excessively long reports.
\end{itemize}
\subsubsection{Source code}
Source code to perform all experiments is available at \\\url{http://www.github.com/jamesrobertlloyd/gpss-research}.
\section{Conclusions}
This chapter presented a system which automatically generates detailed reports describing statistical structure captured by a \gp{} model.
The properties of \gp{}s and the kernels being used allow a modular description, avoiding an exponential blowup in the number of special cases that need to be considered.
Combining this procedure with the model search of \cref{ch:grammar} gives a system combining all the elements of an automatic statistician listed in \cref{sec:ingredients}:
an open-ended language of models, a method to search through model space, a model comparison procedure, and a model description procedure.
Each particular element used in the system presented here is merely a proof-of-concept.
However, even this simple prototype demonstrated the ability to discover and describe a variety of patterns in time series.
%We hope that systems such as these will allow the use of
\outbpdocument{
\bibliographystyle{plainnat}
\bibliography{references.bib}
}