A straightforward approach to categorization and to measuring object-to-category distances would be to use a k-NN classifier with a Euclidean metric. However, with few data points a comparison-based approach such as k-NN is sensitive to outliers, and comparisons in high dimensions are complicated by the curse of dimensionality. In addition, the intrinsic metric of the data might not be reflected by the Euclidean metric.
A more sensible approach is therefore to learn the metric from the data, in such a way that irrelevant features are discounted. To this end we employ a specific family of metric learning techniques, Mahalanobis metric learning, which learns a transform of the data such that similar items end up close together and dissimilar items far apart.
\input{metric_learning_explanation}
Mahalanobis metric learning algorithms are a set of metric learning algorithms that learn a positive semidefinite (PSD) matrix $\mathbf{M}$ parameterizing the squared distance
\begin{equation} D(x,y) = (x-y)^{T} \mathbf{M} (x-y) = \parallel \mathbf{L}(x-y) \parallel^{2}_{2} \enspace , \end{equation}
where the second equality follows since any PSD matrix can be factorized as $\mathbf{M} = \mathbf{L}^{T}\mathbf{L}$.
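As a minimal illustration of this equivalence (a sketch on hypothetical toy data, not part of our pipeline), the two forms of the distance can be checked numerically:
\begin{verbatim}
import numpy as np

# Toy check that (x-y)^T M (x-y) equals ||L(x-y)||^2 when M = L^T L.
rng = np.random.default_rng(0)
L = rng.standard_normal((3, 5))        # linear transform, may be rectangular
M = L.T @ L                            # induced PSD matrix
x, y = rng.standard_normal(5), rng.standard_normal(5)

d_M = (x - y) @ M @ (x - y)            # quadratic form
d_L = np.linalg.norm(L @ (x - y))**2   # squared Euclidean distance after L
assert np.isclose(d_M, d_L)
\end{verbatim}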
In our previous work \cite{Hjelm:2015hw} we used the Large Margin Nearest Neighbor (LMNN) model due to Weinberger et al. \cite{Weinberger:2009to} to learn metrics over grasp affordance features. There, we forced the learned matrix $\mathbf{M}$ to remain positive semidefinite during the optimization.
Here we instead use the Large Margin Component Analysis (LMCA) model due to Torresani et al. \cite{Torresani:2006wb}, who formulate an objective function similar to that of \cite{Weinberger:2009to}, but where the optimization is carried out over the transformation matrix $\mathbf{L}$ itself,
\small
\begin{equation} \label{lossfun}
\begin{aligned}
\epsilon(\mathbf{L}) = & \sum\limits_{i, i \leadsto j} ||\mathbf{L} (x_i-x_j)||^{2} \\
& + c \sum\limits_{i, i \leadsto j, l}(1-y_{il})\, h\big(||\mathbf{L} (x_i-x_j)||^{2} - ||\mathbf{L} (x_i-x_l)||^{2} + 1\big)
\end{aligned}
\end{equation}
\begin{equation} \label{lossfunderivative}
\begin{aligned}
\frac{d\epsilon(\mathbf{L})}{d\mathbf{L}} = 2 \mathbf{L} \Big( & \sum\limits_{i, i \leadsto j} (x_i-x_j)(x_i-x_j)^{T} \\
& + c \sum\limits_{i, i \leadsto j, l}(1-y_{il}) \big[(x_i-x_j)(x_i-x_j)^{T} - (x_i-x_l)(x_i-x_l)^{T}\big] \\
& \quad \cdot h'\big(||\mathbf{L} (x_i-x_j)||^{2} - ||\mathbf{L} (x_i-x_l)||^{2} + 1\big) \Big)
\end{aligned}
\end{equation}
\normalsize
Here $i \leadsto j$ denotes that $x_j$ is a target neighbor of $x_i$, i.e., one of its $k$ nearest neighbors with the same label, $y_{il} \in \{0,1\}$ indicates whether $x_i$ and $x_l$ belong to the same class, $h(\cdot)$ is the hinge loss, used in a smoothed form so that its derivative $h'$ is defined, and $c$ is a constant trading off the two terms.
By optimizing over $\mathbf{L}$ rather than $\mathbf{M}$, the PSD constraint is satisfied by construction, and $\mathbf{L}$ can be chosen rectangular so that the data is simultaneously mapped to a lower-dimensional space.
The drawback of this approach is that the objective function is no longer convex, so the optimization may end up in local minima. Torresani et al. \cite{Torresani:2006wb} showed empirically that this is not a problem in practice: the algorithm consistently converged to adequate solutions, and the kernelized version outperformed k-NN and LMNN on several benchmark datasets.
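For concreteness, the following sketch implements the loss and gradient above in our notation; the logistic-smoothed hinge, the plain gradient-descent update, and the helper names are assumptions of this sketch rather than the exact optimizer of \cite{Torresani:2006wb}.
\begin{verbatim}
import numpy as np

def smooth_hinge(s):
    """Smooth surrogate of the hinge max(s, 0) and its derivative
    (an assumption of this sketch)."""
    return np.log1p(np.exp(s)), 1.0 / (1.0 + np.exp(-s))

def lmca_loss_grad(L, X, labels, targets, c=1.0):
    """Loss and gradient of the equations above.
    X: (N, D) data, labels: (N,) class labels,
    targets[i]: indices j of the target neighbours of i (i ~> j)."""
    N, D = X.shape
    outer_sum = np.zeros((D, D))        # accumulates the outer products
    loss = 0.0
    for i, js in enumerate(targets):
        for j in js:
            d_ij = X[i] - X[j]
            loss += np.sum((L @ d_ij) ** 2)
            outer_sum += np.outer(d_ij, d_ij)
            for l in range(N):
                if labels[l] == labels[i]:      # y_il = 1, term vanishes
                    continue
                d_il = X[i] - X[l]
                margin = (np.sum((L @ d_ij) ** 2)
                          - np.sum((L @ d_il) ** 2) + 1.0)
                h, h_prime = smooth_hinge(margin)
                loss += c * h
                outer_sum += c * h_prime * (np.outer(d_ij, d_ij)
                                            - np.outer(d_il, d_il))
    return loss, 2.0 * L @ outer_sum

# One gradient-descent step on a rectangular (d x D) matrix L:
#   loss, grad = lmca_loss_grad(L, X, labels, targets)
#   L = L - step_size * grad
\end{verbatim}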
Learning a matrix that re-represents the input data opens up for an analysis of the correlation between different features, by considering the columns of $\mathbf{L}$ (equivalently, the entries of $\mathbf{M} = \mathbf{L}^{T}\mathbf{L}$): the $j$-th column governs the influence of the $j$-th input feature on the transformed representation. For such an analysis to be meaningful, the columns corresponding to irrelevant features should be driven towards zero.
We therefore introduce the elastic net regularization of \cite{Cimpoi:2015eg}, which penalizes the column vectors of $\mathbf{L}$ and thereby encourages sparsity over the input features.
The modified elastic net regularized loss function now has the form,
\small
\begin{equation} \label{lossfunnew}
\epsilon_{new}(\mathbf{L}) = \beta \left( \lambda_1 \sum_{j=1}^{D} \left\lVert L_j \right\rVert_1 + \lambda_2 \sum_{j=1}^{D} \left\lVert L_j \right\rVert_{2}^{2} \right) + \epsilon(\mathbf{L})
\end{equation}
\begin{equation} \label{lossfunderivativenew}
\begin{aligned}
\frac{d\epsilon_{new}(\mathbf{L})}{d\mathbf{L}} &= \beta(\lambda_1 \tilde{L}_1 + \lambda_2 \tilde{L}_2) + \frac{d\epsilon(\mathbf{L})}{d\mathbf{L}} \\
(\tilde{L}_1)_{ij} &= \operatorname{sign}(L_{ij}), \;\;\; (\tilde{L}_2)_{ij} = 2 L_{ij},
\end{aligned}
\end{equation}
\normalsize
where, for entries at which $L_{ij} = 0$ and the sign is undefined, we instead set
\begin{equation}
( \tilde{L}_1 )_{ij} =
\begin{cases}
1 & \left( \frac{d\epsilon(\mathbf{L})}{d\mathbf{L}} \right)_{ij} > \lambda \\
-1 & \left( \frac{d\epsilon(\mathbf{L})}{d\mathbf{L}} \right)_{ij} < \lambda \enspace .
\end{cases}
\end{equation}
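A minimal sketch of the resulting (sub)gradient computation is given below; the function and argument names are hypothetical, the handling of zero entries mirrors the case rule above as written, and taking $\lambda$ to be $\lambda_1$ is an assumption of this sketch.
\begin{verbatim}
import numpy as np

def elastic_net_grad(L, base_grad, beta, lam1, lam2):
    """(Sub)gradient of the regularized loss:
    beta * (lam1 * L1_tilde + lam2 * L2_tilde) + d eps / d L,
    where base_grad is the gradient of the unregularized loss."""
    L1_tilde = np.sign(L)                  # sign(L_ij) where L_ij != 0
    zero = (L == 0.0)                      # entries with undefined sign
    L1_tilde[zero & (base_grad > lam1)] = 1.0
    L1_tilde[zero & (base_grad < lam1)] = -1.0
    L2_tilde = 2.0 * L                     # derivative of the ||L_j||_2^2 terms
    return beta * (lam1 * L1_tilde + lam2 * L2_tilde) + base_grad
\end{verbatim}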
\begin{comment} [We should maybe investigate if the penalization has the same effect on the eigenvalues as ridge regression ] \end{comment}
We want to use features that have at least a weak semantic meaning to a human observer and that are as generic as possible. From a human perspective, affordances can be understood in terms of both global and local properties of an object, and we therefore try to cover both. Due to lack of space we only provide a short list, with motivations where needed.
\small
\begin{itemize}
\item Object Volume -- The volume of the convex hull enclosing the object.
\item Primitive Shape -- Scoring functions for how well the object resembles a spherical, cylindrical, or cuboid shape.
\item Elongatedness -- The ratio of the minor axes of the object to the major axis (a sketch is given after this list).
\item Opening -- Whether our opening detection algorithm detects an opening.
\item Intensity and Gradient Filters -- We apply first- and second-order gradient filters over the 2D image window containing the object and, in addition, extract the intensity values of the image. Since the pixels correspond to points in the point cloud of the object, we map the filter and intensity values onto each point. We then compute a count-normalized histogram (CNH) over all points of the object for each of the filter responses and for the intensity values (a sketch is given after this list).
\item Color Quantization -- We segment the image using the segmentation algorithm of \cite{Felzenszwalb:2004vi} and encode each patch with its closest color from a chart of 15 colors defined in CIELab space. Each point in the point cloud of the object is thus assigned a quantized color, and we form a CNH over all points of the object.
\item Color Uniformity -- We compute the entropy, mean, and variance of the color quantization.
\item Bag of Words Histogram over Fast Point Feature Histograms (FPFH) \cite{Rusu:2009hf} -- We encode the local curvature of the object using a Bag of Words model over the FPFH features of the point cloud, with a small codebook of 40 words. We then extract the histogram of codewords for the object (a sketch follows after this list).
\item Texture -- We employ an approach similar to \cite{Cimpoi:2015eg} by learning a Fisher vector representation from dense SIFT features over a set of labeled images containing the following textures: glass, carton, porcelain, metal, plastic, and wood. We also extract the output of the fifth layer of a GoogLeNet CNN and concatenate it with the Fisher vectors. These two features complement each other: the Fisher vectors are trained on textures, while the GoogLeNet features capture object appearance, and certain objects are more likely to be made of a specific material. We use the final feature vector to train six one-against-all linear SVMs. The outputs of the six SVMs when classifying an object can be regarded as scores of how similar the object is to each of the textures, and we use these scores as the feature vector (a sketch follows at the end of this list).
\end{itemize}
\normalsize
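As a minimal sketch of the elongatedness feature, the axis lengths can be taken as square roots of the PCA eigenvalues of the point cloud; this particular notion of object axes is an assumption of the sketch.
\begin{verbatim}
import numpy as np

def elongatedness(points):
    """Ratio of the minor axes to the major axis of the object's point cloud.
    points: (P, 3) object point cloud."""
    centered = points - points.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(centered.T))   # ascending order
    axes = np.sqrt(np.maximum(eigvals, 0.0))           # axis lengths
    major, minors = axes[-1], axes[:-1]
    return minors / major                               # two ratios in [0, 1]
\end{verbatim}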
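For the intensity and gradient filter feature, a sketch of the count-normalized histograms follows; the specific filters, the bin count, and the helper names are assumptions.
\begin{verbatim}
import numpy as np
from scipy import ndimage

def filter_response_cnh(gray, point_pixels, n_bins=16):
    """Count-normalized histograms (CNH) of intensity and first/second-order
    gradient responses, evaluated at the pixels backing the object's points.
    gray: 2D image window containing the object,
    point_pixels: (P, 2) integer (row, col) pixel for every point."""
    rows, cols = point_pixels[:, 0], point_pixels[:, 1]
    channels = [
        gray,                          # raw intensity
        ndimage.sobel(gray, axis=0),   # first-order gradient (vertical)
        ndimage.sobel(gray, axis=1),   # first-order gradient (horizontal)
        ndimage.laplace(gray),         # second-order gradient
    ]
    hists = []
    for ch in channels:
        values = ch[rows, cols]        # map image values onto the points
        hist, _ = np.histogram(values, bins=n_bins)
        hists.append(hist / max(hist.sum(), 1))   # count normalization
    return np.concatenate(hists)
\end{verbatim}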
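Similarly, a sketch of the Bag of Words histogram over FPFH descriptors; the k-means codebook and the use of scikit-learn are assumptions, and the FPFH extraction itself (e.g.\ via PCL) is outside the sketch.
\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

def fpfh_bow_histogram(fpfh_train, fpfh_object, n_words=40):
    """Bag of Words histogram over FPFH descriptors with a 40-word codebook.
    fpfh_train: (M, 33) FPFH descriptors pooled from training point clouds,
    fpfh_object: (P, 33) FPFH descriptors of the object's points."""
    codebook = KMeans(n_clusters=n_words, n_init=10, random_state=0)
    codebook.fit(fpfh_train)
    words = codebook.predict(fpfh_object)        # nearest codeword per point
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / max(hist.sum(), 1.0)           # count-normalized histogram
\end{verbatim}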
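Finally, a sketch of how the six texture scores can be obtained from one-against-all linear SVMs; the helper names are hypothetical, and the Fisher vector and GoogLeNet feature extraction are outside the sketch.
\begin{verbatim}
import numpy as np
from sklearn.svm import LinearSVC

TEXTURES = ["glass", "carton", "porcelain", "metal", "plastic", "wood"]

def train_texture_scorers(features, texture_labels):
    """One linear SVM per texture class (one-against-all), trained on the
    concatenated Fisher vector + CNN features of the labeled texture images."""
    svms = []
    for t in TEXTURES:
        y = (np.asarray(texture_labels) == t).astype(int)
        svms.append(LinearSVC(C=1.0).fit(features, y))
    return svms

def texture_score_vector(svms, object_feature):
    """The six SVM decision values, used directly as the texture feature."""
    x = object_feature.reshape(1, -1)
    return np.array([clf.decision_function(x)[0] for clf in svms])
\end{verbatim}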