\chapter{Data preprocessing}
\label{chap:preprocess}
\glsresetall
\chapterprecishere{I find your lack of faith disturbing.
\par\raggedleft--- \textup{Darth Vader}, Star Wars: Episode IV -- A New Hope (1977)}
\begin{mainbox}{Chapter remarks}
\boxsubtitle{Contents}
\startcontents[chapters]
\printcontents[chapters]{}{1}{}
\vspace{1em}
\boxsubtitle{Context}
\begin{itemize}
\itemsep0em
\item \dots
\end{itemize}
\boxsubtitle{Objectives}
\begin{itemize}
\itemsep0em
\item \dots
\end{itemize}
\boxsubtitle{Takeaways}
\begin{itemize}
\itemsep0em
\item \dots
\end{itemize}
\end{mainbox}
{}
\clearpage
\section{Introduction}
In \cref{chap:data,chap:handling}, we discussed data semantics and the tools to
handle data. They provide the grounds for preparation of the data as we described in the
data sprint tasks in \cref{sub:workflow}. However, the focus is to guarantee that the
data is tidy and in the observational unit of interest, not to prepare it for modeling.
As a result, although data might be appropriate for the learning tasks we described in
\cref{chap:slt} --- in the sense that we know what the feature vectors and the target
variable are ---, they might not be suitable for the machine learning methods we will use.
One simple example is the perceptron (\cref{sub:perceptron}), which assumes that all
input variables are real numbers. If the data contains categorical variables, we must
convert them to numerical variables before applying the perceptron.
For this reason, the solution sprint tasks in \cref{sub:workflow} include not only the
learning tasks but also the \emph{data preprocessing} tasks, which are dependent on the
chosen machine learning methods.
\begin{defbox}{Data preprocessing}{preprocessing}
The process of adjusting the data to make it suitable for a particular learning machine
or, at the least, to ease the learning process.
\end{defbox}
This is done by applying a series of operations to the data, as in data handling. The
difference here is that some of the parameters of the operations are not fixed; rather,
they are fitted from a data sample. Once fitted, the operations can be applied to new
data, sample by sample.
As a result, a data processing technique acts in three steps:
\begin{enumerate}
\itemsep0em
\item \textbf{Fitting}: The parameters of the operation are adjusted to the training
data (which has already been integrated and tidied, represents well the phenomenon of
interest, and each sample is in the correct observational unit);
\item \textbf{Adjustment}: The training data is adjusted according to the fitted
parameters, possibly changing the sample size and distribution;
\item \textbf{Applying}: The operation is applied to new data, sample by sample.
\end{enumerate}
Understanding these steps and correctly defining the behavior of each of them is crucial
to avoid \gls{leakage} and to guarantee that the model will behave as expected in
production.
\subsection{Formal definition}
Let $T = (K, H, c)$ be a table that represents the data in the desired observational unit
--- as defined in \cref{sec:formal-structured-data}. In this chapter, without loss of
generality --- as the keys are not used in the modeling process ---, we can consider $K =
\{1, 2, \dots\}$ such that $\rowcard[i] = 0$ if, and only if, $i > n$. That means that
every row $r \in \{1, \dots, n\}$ is present in the table.
A data preprocessing strategy $F$ is a function that takes a table $T = (K, H, c)$ and
returns an adjusted table $T' = (K', H', c')$ and a fitted \emph{preprocessor} $f(z; \phi)
= f_\phi(z)$ such that $$z \in \bigtimes_{h\, \in\, H} \domainof{h} \cup \{?\}$$ and $\phi$ are
the fitted parameters of the operation. Similarly, $z' = f_\phi(z)$, called the
preprocessed tuple, satisfies $$z' \in \bigtimes_{h'\, \in\, H'} \domainof{h'} \cup
\{?\}\text{.}$$ Note that we make no restrictions on the number of rows in the adjusted
table, i.e., preprocessing techniques can change the number of rows in the training table.
In practice, strategy $F$ is a chain of dependent preprocessing operations $F_1$, \dots,
$F_m$ such that, given $T = T^{(0)}$, each operation $F_i$ is applied to the table
$T^{(i-1)}$ to obtain $T^{(i)}$ and the fitted preprocessor $f_{\phi_i}$. Thus, $T' =
T^{(m)}$ and $$f(z; \phi = \{\phi_1, \dots, \phi_m\}) = \left(f_{\phi_m} \circ \dots \circ
f_{\phi_1}\right)(z)\text{,}$$ where $\circ$ is the composition operator, i.e., $f_{\phi_1}$ is applied first. I say that
they are dependent since none of the operations can be applied to the table without the
previous ones.
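
To make these steps concrete, the following is a minimal sketch in Python of a two-step
chain; the step names and the list-of-values representation are illustrative and not tied
to any particular library.
\begin{verbatim}
def fit_mean_imputer(column):
    # F_i: fit the mean on the observed values (phi_i), adjust the training
    # column, and return the fitted preprocessor f_phi_i.
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    f = lambda x: mean if x is None else x
    return [f(x) for x in column], f

def fit_rescaler(column):
    # F_{i+1}: fit the minimum and maximum, rescale to [0, 1].
    lo, hi = min(column), max(column)
    f = lambda x: (x - lo) / (hi - lo)
    return [f(x) for x in column], f

# Fitting and adjustment on the training column (here, a single variable).
column = [1.0, None, 3.0]
column, impute = fit_mean_imputer(column)
column, rescale = fit_rescaler(column)

# Applying: the composition rescale(impute(.)) on a new value, sample by sample.
z_new = rescale(impute(None))   # -> 0.5
\end{verbatim}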
\subsection{Degeneration}
The objective of the fitted preprocessor is to adjust the data to make it suitable for the
model. However, sometimes it cannot achieve this goal for a particular input $z$. This
can happen for many reasons, such as unexpected values, information ``too incomplete'' to
make a prediction, etc.
Formally, we say that the preprocessor $f_\phi$ degenerates over tuple $z$ if it outputs
$z' = f_\phi(z)$ such that $z' = (?, \dots, ?)$. In practice, that means that the
preprocessor decided that it has no strategy to adjust the data to make it suitable for
the model. For the sake of simplicity, if any step $f_{\phi_i}$ degenerates over
tuple $z^{(i)}$, the whole preprocessing chain degenerates\footnote{Usually, this is
implemented as an exception or similar programming mechanism.} over $z = z^{(0)}$.
Consequently, in the implementation of the solution, the developer must choose a default
behavior for the model when the preprocessing chain degenerates over a tuple. It can
be as simple as returning a default value or as complex as redirecting the tuple to a
different pair of preprocessor and model. Sometimes, the developer can choose to
integrate this as an error or warning in the user application.
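
A minimal sketch of this mechanism in Python, assuming the exception-based implementation
mentioned in the footnote (the variable names and the fallback value are illustrative):
\begin{verbatim}
class Degenerated(Exception):
    """Raised when a preprocessing step has no strategy for the input tuple."""

def preprocess(z):
    # Degenerate when the tuple is "too incomplete" to be adjusted.
    if z.get("age") is None and z.get("income") is None:
        raise Degenerated("too incomplete to make a prediction")
    return z

def predict_with_default(model, z, default=0.0):
    try:
        return model(preprocess(z))
    except Degenerated:
        # Default behavior chosen by the developer: a fixed fallback value.
        return default

print(predict_with_default(lambda z: 2.0 * z["age"],
                           {"age": None, "income": None}))
\end{verbatim}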
\subsection{Data preprocessing tasks}
The most common data preprocessing tasks can be divided into three categories:
\begin{itemize}
\itemsep0em
\item Data cleaning;
\item Data sampling; and
\item Data transformation. % colocar enhancement aqui
\end{itemize}
In the next sections, I address some of the most common data preprocessing tasks in each
of these categories. I present them in the order they are usually applied in the
preprocessing pipeline, but note that the order is not fixed and can be changed according
to the needs of the problem.
\section{Data cleaning}
Data cleaning is the process of removing errors and inconsistencies from the data. This is
usually done to make the data more reliable for training and to avoid bias in the learning
process. Usually, such errors and inconsistencies make the learning machines ``confused''
and can lead to poorly performing models.
Also, it includes the process of dealing with missing information, which most machine
learning methods do not cope with. Solutions range from the simple removal of the
observations with missing data to the creation of information to encode the missing data.
\subsection{Invalid and inconsistent data}
% TODO: move this somewhere when we talk about data handling and/or tidying
% Sometimes, during data collection, information is recorded using special codes. For
% instance, the value 9999 might be used to indicate that the data is missing. Such codes
% must be replaced with more appropriate values before modeling. If a single variable
% encodes more than one concept, new variables must be created to represent each concept.
There are a few, but important, tasks to be done during data preprocessing in terms of
invalid and inconsistent data --- note that we assume that most of the issues in terms of
the semantics of the data have been solved in the data handling phase. Especially in
production, the developer must be aware of the behavior of the model when it faces
information that is not supposed to be present in the data.
One of the tasks is to ensure that physical quantities are expressed in standard units. One must
check whether all columns that store physical quantities have the same unit of
measurement. If not, one must convert the values to the same unit. A summary of this
preprocessing task is presented in \cref{tab:unit-conversion}.
\begin{tablebox}[label=tab:unit-conversion]{Unit conversion preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Unit conversion}} \\
\midrule
% \textbf{Requirements} &
% A variable with the physical quantity and a variable with the unit of measurement. \\
\textbf{Goal} &
Convert physical quantities into the same unit of measurement. \\
\textbf{Fitting} &
None. User must declare the units to be used and, if appropriate, the conversion
factors. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor converts the numerical values and drops the unit of measurement column. \\
\bottomrule
\end{tabular}
\end{tablebox}
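
For instance, a minimal sketch of such a preprocessor in Python, assuming a weight column
accompanied by a unit column (the column names and factors are illustrative; 1~lb is
exactly 0.45359237~kg):
\begin{verbatim}
# Conversion factors declared by the user (no fitting).
TO_KILOGRAMS = {"kg": 1.0, "g": 1e-3, "lb": 0.45359237}

def convert_weight(sample):
    factor = TO_KILOGRAMS[sample["weight_unit"]]
    out = dict(sample)
    out["weight"] = sample["weight"] * factor   # convert the numerical value
    del out["weight_unit"]                      # drop the unit column
    return out

print(convert_weight({"weight": 150.0, "weight_unit": "lb"}))
\end{verbatim}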
Moreover, if one knows that a variable must follow a specific range of values, one can
check whether the values are within this range. If not, one must replace the values with
missing data or with the closest valid value. Alternatively, one can discard the
observation based on that criterion. Consult \cref{tab:range-check} for a summary of this
operation.
\begin{tablebox}[label=tab:range-check]{Range check preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Range check}} \\
\midrule
% \textbf{Requirements} &
% A numerical variable. \\
\textbf{Goal} &
Check whether the values are within the expected range. \\
\textbf{Fitting} &
None. User must declare the valid range of values. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. If appropriate,
degenerated samples are removed. \\
\textbf{Applying} &
Preprocessor checks whether the value $x$ of a variable is within the range $[a,
b]$. If not, it replaces $x$ with: (a) the missing value $?$, (b) the closest valid
value $\max(a, \min(b, x))$, or (c) it degenerates (discards the observation). \\
\bottomrule
\end{tabular}
\end{tablebox}
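
A minimal sketch of these three behaviors in Python (the behavior names are illustrative):
\begin{verbatim}
def range_check(x, a, b, behavior="clamp"):
    if a <= x <= b:
        return x
    if behavior == "missing":
        return None                   # (a) replace with the missing value
    if behavior == "clamp":
        return max(a, min(b, x))      # (b) closest valid value
    raise ValueError("degenerated")   # (c) discard the observation

print(range_check(120.0, 0.0, 100.0))   # -> 100.0
\end{verbatim}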
Another common problem with inconsistent information is that the same category might be
represented by different strings. This is usually addressed by creating a dictionary that
maps the different names to a single one, by standardizing lower or upper case, by
removing special characters, or by more advanced fuzzy-matching techniques --- see
\cref{tab:text-standardization}.
\begin{tablebox}[label=tab:text-standardization]{Category standardization preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Category standardization}} \\
\midrule
% \textbf{Requirements} &
% A categorical variable. \\
\textbf{Goal} &
Create a dictionary and/or function to map different names to a single one. \\
\textbf{Fitting} &
None. User must declare the mapping. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor replaces the value $x$ of a categorical variable with the mapped value
$f(x)$, where $f$ implements case standardization, special character removal, and/or
dictionary-based fuzzy matching. \\
\bottomrule
\end{tabular}
\end{tablebox}
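
A minimal sketch in Python (the synonym dictionary and category names are illustrative):
\begin{verbatim}
import re

# User-declared dictionary mapping cleaned strings to a canonical category.
SYNONYMS = {"sao paulo": "SP", "s paulo": "SP", "sp": "SP"}

def standardize_category(x):
    x = x.lower().strip()
    x = re.sub(r"[^a-z0-9 ]", "", x)   # remove special characters
    return SYNONYMS.get(x, x)          # dictionary lookup (no fuzzy matching here)

print(standardize_category("S. Paulo"))   # -> "SP"
\end{verbatim}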
Note that these techniques' parameters are not fitted from the data; rather, they are fixed
by the problem definition. As a result, they could be done in the data handling phase.
The reason we put them here is that new data in production usually comes with the
same issues. Having the fixes programmed into the preprocessor makes it easier to
guarantee that the model will behave as expected in production.
\subsection{Missing data}
Since most models cannot handle missing data, it is crucial to deal with it in the data
preprocessing.
There are four main strategies to deal with missing data:
\begin{itemize}
\itemsep0em
\item Remove the observations (rows) with missing data;
\item Remove the variables (columns) with missing data;
\item Just impute the missing data;
\item Use an indicator variable to mark the missing data and impute it.
\end{itemize}
Removing rows or columns is commonly used when the amount of missing data is small
compared to the total number of rows or columns. However, be aware that removing rows
``on demand'' can artificially change the data distribution, especially when the data is
not missing at random. Row removal suffers from the same problem as any filtering
operation in the preprocessing step (degeneration); the developer must specify a default
behavior for the model when a row is discarded in production. See
\cref{tab:row-removal-missing}.
\begin{tablebox}[label=tab:row-removal-missing]{Task of filtering rows based on missing data.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Row removal based on missing data}} \\
\midrule
% \textbf{Requirements} &
% A dataset with missing data. \\
\textbf{Goal} &
Remove the observations with missing data in any (or some) variables. \\
\textbf{Fitting} &
None. Variables to look for missing data are declared beforehand. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently, removing
degenerated samples. \\
\textbf{Applying} &
Preprocessor degenerates over the rows with missing data in the specified variables.
\\
\bottomrule
\end{tabular}
\end{tablebox}
In the case of column removal, the preprocessor simply learns, during fitting, which
columns have missing data and drops them. Beware that valuable information might be lost
when removing a column for all the samples.
See \cref{tab:col-drop-missing}.
\begin{tablebox}[label=tab:col-drop-missing]{Task of dropping columns based on missing data.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Column removal based on missing data}} \\
\midrule
% \textbf{Requirements} &
% A dataset with missing data. \\
\textbf{Goal} &
Remove the variables with missing data. \\
\textbf{Fitting} &
All variables with missing data in the training set are marked to be removed. \\
\textbf{Adjustment} &
Columns marked are dropped from the training set. \\
\textbf{Applying} &
Preprocessor drops the chosen columns in fitting. \\
\bottomrule
\end{tabular}
\end{tablebox}
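
A minimal sketch of both removal strategies with pandas (column names are illustrative):
\begin{verbatim}
import pandas as pd

train = pd.DataFrame({"a": [1.0, None, 3.0],
                      "b": [None, None, None],
                      "c": [1, 2, 3]})

# Row removal: filter the training set; in production, such rows degenerate.
rows_kept = train.dropna(subset=["a"])

# Column removal: the columns to drop are learned during fitting...
cols_to_drop = train.columns[train.isna().any()]
# ...and dropped from the training set and from any new data.
reduced = train.drop(columns=cols_to_drop)
\end{verbatim}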
Imputing the missing data is usually done by replacing the missing values with some
statistic of the available values in the column, such as the mean, the median, or the
mode\footnote{More sophisticated methods can be used, such as the k-nearest neighbors
algorithm, for example, consult \fullcite{Troyanskaya2001}.}. This is a simple and
effective strategy, but it can introduce bias in the data, especially when the number of
samples with missing data is large. See \cref{tab:imputation}.
\begin{tablebox}[label=tab:imputation]{Task of imputing missing data.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Imputation of missing data}} \\
\midrule
% \textbf{Requirements} &
% A dataset with missing data. \\
\textbf{Goal} &
Replace the missing data with a statistic of the available values. \\
\textbf{Fitting} &
The statistic is calculated from the available data in the training set. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor replaces the missing values with the chosen statistic. \\
\bottomrule
\end{tabular}
\end{tablebox}
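
A minimal sketch of mean imputation with pandas, following the three steps above (the
column name is illustrative):
\begin{verbatim}
import pandas as pd

train = pd.DataFrame({"income": [1000.0, None, 3000.0]})

# Fitting: the statistic is computed on the available training values.
income_mean = train["income"].mean()

# Adjustment: fill the missing values in the training set.
train["income"] = train["income"].fillna(income_mean)

# Applying: the same fitted statistic is reused for new samples.
new = pd.DataFrame({"income": [None, 2500.0]})
new["income"] = new["income"].fillna(income_mean)
\end{verbatim}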
Imputation alone is not suitable when one is not sure whether the data is missing because
of a systematic error or phenomenon. In such cases, a model can learn the effect of the
underlying reason for the missingness on the predictive task.
In that case, creating an indicator variable is a good strategy. This is done by creating
a new column that contains a logical value indicating whether the data is missing or
not\footnote{Some kind of imputation is still needed, but we expect the model to deal
better with it since it can decide using both the indicator and the original variable.}.
% TODO: table
% \footnote{\color{red}Sometimes the indicator variable is already present: pregnancy and sex
% example.}.
Many times, the indicator variable is already present in the data. For instance, consider
a dataset that contains information about pregnancy, say the number of days since
the last pregnancy. This information will certainly be missing if the sex is male
or the number of children is zero. In this case, no new indicator variable is needed.
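
When the indicator is not already present, it can be created as in the following sketch
with pandas (column names are illustrative):
\begin{verbatim}
import pandas as pd

train = pd.DataFrame({"days_since_pregnancy": [120.0, None, 45.0]})

# Indicator column marking the missing values.
train["days_since_pregnancy_missing"] = train["days_since_pregnancy"].isna()

# Some imputation is still needed for the original column.
median = train["days_since_pregnancy"].median()
train["days_since_pregnancy"] = train["days_since_pregnancy"].fillna(median)
\end{verbatim}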
\subsection{Outliers}
Outliers are observations that are significantly different from the other observations.
They can be caused by errors in the data collection process or by the presence of a
different phenomenon. In both cases, it is important to deal with outliers before
modeling.
There are many outlier detection methods, such as the z-score, the interquartile range (IQR) rule, and DBSCAN.
% TODO
As with filtering operations in the pipeline, the developer must specify a default
behavior for the model when an outlier is detected in production.
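
For instance, a minimal sketch of the z-score approach (the threshold and values are
illustrative):
\begin{verbatim}
import numpy as np

# Fitting: mean and standard deviation of the training column.
train = np.array([9.8, 10.1, 10.0, 9.9, 10.2])
mu, sigma = train.mean(), train.std()

def is_outlier(x, threshold=3.0):
    # Applying: flag values more than `threshold` standard deviations away.
    return abs(x - mu) / sigma > threshold

print(is_outlier(12.0))    # True: far from the training distribution
print(is_outlier(10.05))   # False
\end{verbatim}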
% \begin{slidebox}{Data cleaning}{}
% \begin{itemize}
% \item Data cleaning is the process of removing errors and inconsistencies from the data;
% \item Use the following strategies to deal with missing data:
% \begin{itemize}
% \item Remove the rows with missing data;
% \item Remove the columns with missing data;
% \item Impute the missing data;
% \item Use an indicator variable to mark the missing data.
% \end{itemize}
% \item Replace special codes with more appropriate values;
% \item Create a dictionary to map different names to a single one;
% \item Check whether all columns that store physical quantities have the same unit of
% measurement;
% \item Check whether the values are within the expected range;
% \item Use outlier detection methods to deal with outliers.
% \end{itemize}
% \end{slidebox}
\section{Data transformation}
One important task in data preprocessing is data transformation. This is the process of
adjusting the format and the types of the data to make it suitable for analysis.
Before data transformation, we must make sure that the data is tidy, i.e., that each
variable is in a column and each observation is in a row. Remember that, depending on the
problem definition, we target a particular observational unit. Having a clear picture of
the observational unit is important to define the columns and the rows of the dataset.
Then, when the data format is acceptable, we can perform a series of operations to make
the columns' types and values suitable for modeling. The reason for this is that most
machine learning methods impose restrictions on the input variables. For instance, some
methods require the input variables to be real numbers, others require them to be in a
specific range, etc.
% \subsection{Reshaping}
%
% \textcolor{red}{TODO: pipeline exceptions: like pivoting and aggregating are kept outside
% the pipeline.}
%
% Reshaping is the process of changing the format of the data. The most common reshaping
% operations are pivoting and unpivoting, which we have already discussed. However, there
% are other reshaping operations that are useful in practice.
%
% For instance, one can reshape a dataset by splitting a column into multiple columns. This
% is useful when a column contains multiple values that should be separated. This can be
% done with mutation with appropriate expressions. Some frameworks might provide special functions
% to do this, usually called splitting functions.
%
% We can also consider reshaping the operations of filtering, selecting, and aggregating.
% Filtering is usually done to reduce the scope of the data, given some conditions on the
% variables. Selecting is usually done to remove irrelevant variables or highly correlated
% ones. Aggregating in a reshaping task is usually applied together with pivoting to change the
% observational unit of the dataset.
% \begin{slidebox}{Reshaping}{}
% \begin{itemize}
% \item Reshaping is the process of changing the format of the data;
% \item The most common reshaping operations are pivoting and unpivoting;
% \item Other common operation include:
% \begin{itemize}
% \item Splitting a column into multiple columns;
% \item Filtering to reduce the scope of the data;
% \item Selecting to remove irrelevant variables or highly correlated ones;
% \item Aggregating to change the observational unit of the dataset.
% \end{itemize}
% \end{itemize}
% \end{slidebox}
\subsection{Type conversion}
Type conversion is the process of changing the type of the values in the columns. This
is usually done to make the data suitable for modeling. For instance, some machine
learning methods require the input variables to be real numbers.
The most common type conversions are from categorical to numerical and from numerical to
categorical. The former is usually done by creating dummy variables, i.e., a new column
for each possible value of the categorical variable; this transformation is also known as
one-hot encoding. The latter is usually done by binning (also called discretization or
quantization) the numerical variable, either by frequency or by range.
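
A minimal sketch of both conversions with pandas (column names are illustrative). Note
that the category levels and bin edges found here are the fitted parameters that must be
stored to apply the same conversion to new data:
\begin{verbatim}
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "age": [15, 42, 67]})

# Categorical -> numerical: one column per category (one-hot encoding).
dummies = pd.get_dummies(df["color"], prefix="color")

# Numerical -> categorical: equal-width binning into three ranges.
age_bins = pd.cut(df["age"], bins=3, labels=["young", "adult", "senior"])
\end{verbatim}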
% \begin{slidebox}{Type conversion}{}
% \begin{itemize}
% \item Type conversion is the process of changing the type of the values in the columns;
% \item Use one-hot encoding to convert categorical variables to numerical;
% \item Use binning to convert numerical variables to categorical.
% \end{itemize}
% \end{slidebox}
\subsection{Normalization}
Normalization is the process of scaling the values in the columns. This is usually done to
keep data in a specific range or to make the data comparable. For instance, some machine
learning methods require the input variables to be in the range $[0, 1]$.
The most common normalization methods are standardization and rescaling. The former is done
by subtracting the mean and dividing by the standard deviation of the values in the column.
The latter is performed so the values are in a specific range, usually $[0, 1]$ or $[-1, 1]$.
\begin{hlbox}{Clamping after rescaling}
In production, it is common to clamp the values after rescaling. This is done to avoid
the model making predictions on values that are out of the range of the training data.
\end{hlbox}
Related to normalization is the log transformation. This is usually done to make the data
more symmetric or to reduce the effect of outliers. The log transformation is the process
of taking the logarithm of the values in the column.
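
A minimal sketch of both normalization methods in Python, including the clamping remark
above (values are illustrative):
\begin{verbatim}
import numpy as np

train = np.array([2.0, 4.0, 6.0, 8.0])

# Fitting: the statistics below are the fitted parameters.
mean, std = train.mean(), train.std()
lo, hi = train.min(), train.max()

def standardize(x):
    return (x - mean) / std

def rescale(x):
    z = (x - lo) / (hi - lo)
    return min(max(z, 0.0), 1.0)   # clamp to [0, 1] in production

print(standardize(5.0), rescale(10.0))   # rescale clamps 10.0 to 1.0
\end{verbatim}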
% \begin{slidebox}{Normalization}{}
% \begin{itemize}
% \item Normalization is the process of scaling the values in the columns;
% \item Use standardization to make the values have mean 0 and standard deviation 1;
% \item Use rescaling to make the values be in a specific range;
% \item Use the log transformation to make the data more symmetric or to reduce the effect
% of outliers.
% \end{itemize}
% \end{slidebox}
\subsection{Sampling}
Sampling is the process of selecting a random subset of the data. This is usually done to
reduce the size of the data or to create a balanced dataset. For instance, some machine
learning methods are heavily affected by the number of observations in each class.
Also, some methods are computationally expensive and a smaller dataset might be enough to
solve the problem.
The most common sampling methods are random sampling and resampling\footnote{Resampling is
the process of sampling with replacement, sometimes called bootstrapping.}. The former is
done by selecting a random subset of the data. The latter is done by selecting a random
subset of the data with replacement.
While random sampling is useful to reduce the size of the data, resampling can be used to
increase it (although this has some caveats). Moreover, resampling can also create
variations of the original dataset with the same distribution of values.
More advanced sampling methods are usually used to create balanced datasets. For
instance, one can use the SMOTE algorithm\footfullcite{chawla2002smote} to create
synthetic observations of the minority class.
Note that sampling acts only on the training data; in production, the fitted preprocessor
simply passes each new sample through unchanged.
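
A minimal sketch of both sampling operations with pandas (column names are illustrative):
\begin{verbatim}
import pandas as pd

df = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})

# Random sampling: keep half of the training rows.
subset = df.sample(frac=0.5, random_state=0)

# Resampling (bootstrapping): sample with replacement up to the original size.
bootstrap = df.sample(n=len(df), replace=True, random_state=0)
\end{verbatim}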
\subsection{Dimensionality reduction}
Dimensionality reduction is the process of reducing the number of variables in the data.
This is usually done to reduce the complexity of the model or to identify irrelevant
variables. The so-called \emph{curse of dimensionality} is a common problem in machine
learning, where the number of variables is much larger than the number of observations.
There are two main types of dimensionality reduction algorithms: feature selection and
feature extraction. The former is done by selecting a subset of the variables that leads
to the best models. The latter is done by creating new variables that are combinations
of the original ones.
Feature selection can be performed before modeling (filter), together with the model
search (wrapper), or as a part of the model itself (embedded).
Feature extraction is usually done by linear methods, such as principal component analysis
(PCA), or by non-linear methods, such as convolution layers and autoencoders. These methods are able to
compress the information in the data into a smaller number of variables.
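
A minimal sketch of feature extraction with PCA in scikit-learn (the data and the number
of components are illustrative):
\begin{verbatim}
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.default_rng(0).normal(size=(100, 5))

pca = PCA(n_components=2).fit(X_train)    # fitting
X_reduced = pca.transform(X_train)        # adjustment of the training matrix
x_new = pca.transform(np.zeros((1, 5)))   # applying, sample by sample
\end{verbatim}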
% \begin{slidebox}{Dimensionality reduction}{}
% \begin{itemize}
% \item Dimensionality reduction is the process of reducing the number of variables in the data;
% \item Use feature selection to select a subset of the variables that leads to the best models;
% \item Use feature extraction to create new variables that are combinations of the original ones.
% \end{itemize}
% \end{slidebox}
% \begin{hlbox}{Practice!}
% Can you identify which data transformation operations are used to make datasets
% presented in \cref{chap:data} tidy?
% \end{hlbox}
\section{Data enhancement}
Data integration is the process of combining data from different sources into a single
dataset. This is usually done to create a more complete dataset or to create a dataset
with a different observational unit.
To perform integration, consider the discussions in \cref{sec:normalization,sub:bridge}.
Additionally, one must consider the following points:
\begin{itemize}
\item Sometimes the same column may have different names in different datasets. Redundant
columns must be removed.
\item Separate datasets that share the same variables usually exist because there is a
hidden variable that is not explicitly present in them. During integration, this
variable must be made explicit as a new column.
\end{itemize}
% \begin{slidebox}{Data integration}{}
% \begin{itemize}
% \item Data integration is the process of combining data from different sources into a single dataset;
% \item Not every join is possible, consider the discussions in \cref{sec:normalization,sub:bridge};
% \item Remove redundant columns;
% \item Create new variables to represent the hidden variables.
% \end{itemize}
% \end{slidebox}
In the data handling pipeline, data integration is useful for data enhancement. This is
the process of adding new columns to the dataset or to single instances. For example,
imagine that the tidy data has a column with the zip code of the customers. We can use
this information to join (in this case, a left join) a dataset with social and economic
information about the region of each zip code.
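
A minimal sketch of this enhancement with pandas (column names and values are
illustrative):
\begin{verbatim}
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "zip": ["01310", "20040"]})
region = pd.DataFrame({"zip": ["01310", "20040"],
                       "avg_income": [5200.0, 4100.0]})

# Left join: every customer is kept, enhanced with regional information.
enhanced = customers.merge(region, on="zip", how="left")
\end{verbatim}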
\section{Comments on unstructured data}
% vim: spell spelllang=en