📌 Paper
SECTOR: a model to support machine reading systems by segmenting documents into coherent sections and assigning topic labels to each section.
Motivation:
- From a human perspective, it is mostly the authors themselves who help best to understand a text. Especially in long documents, an author thoughtfully designs a readable structure and guides the reader through the text by arranging topics into coherent passages
- In many cases, this structure is not formally expressed as section headings (e.g. in news articles, reviews, discussion forums) or it is structured according to domain-specific aspects (e.g. health reports, research papers, insurance documents)
- Ideally, systems for text analytics, such as topic detection and tracking (TDT), text summarization, information retrieval (IR) or question answering (QA) could access a document representation that is aware of both topical (i.e. latent semantic content) and structural information (i.e. segmentation) in the text
It is therefore important to understand topic segmentation and classification as a mutual task that requires encoding both topic information and document structure coherently.
In this article, we present SECTOR, an end-to-end model which learns an embedding of latent topics from potentially ambiguous headings and can be applied to entire documents to predict local topics on sentence level.
To the best of our knowledge, the combined task of segmentation and classification has not been approached on full document level before.
We introduce WIKISECTION, a large novel dataset of 38k articles from the English and German Wikipedia labeled with 242k sections, original headings and normalized topic labels for up to 30 topics from two domains: diseases and cities.
Three main problems:
- Topic modeling
- Text segmentation
- Text classification
Our method unifies those strongly interwoven tasks and is the first to evaluate the combined topic segmentation and classification task using a corresponding dataset with long structured documents.
For the evaluation of this task, we created WikiSection, a novel dataset containing a gold standard of 38k full-text documents from English and German Wikipedia comprehensively annotated with sections and topic labels.
Our dataset contains the article abstracts, plain text of the body, positions of all sections given by the Wikipedia editors with their original headings (e.g. "Causes | Genetic sequence") and a normalized topic label (e.g. disease.cause).
Initially, we expected articles to share congruent structure in naming and order. Instead, we observe a high variance with 8.5k distinct headings in the diseases domain and over 23k for English cities. A closer inspection reveals that Wikipedia authors utilize headings at different granularity levels, frequently copy and paste from other articles, but also introduce synonyms or hyponyms, which leads to a vocabulary mismatch problem.
As a result, the distribution of headings is heavy-tailed across all articles. Roughly 1% of headings appear more than 25 times while the vast majority (88%) appear 1 or 2 times only.
In order to use Wikipedia headings as a source for topic labels, we contribute a normalization method that reduces the high variance of headings to a few representative labels based on the clustering of BabelNet synsets.
We introduce SECTOR, a neural embedding model that predicts a latent topic distribution for every position in a document. Because we do not know the expected number of sections, we formulate the objective of our model on sentence level and later segment based on the predictions.
We approach two variations of this task:
- WikiSection-topics: we choose a single topic label out of a small number of normalized topic labels (~25 classes). However, this simplified classification task gives rise to an entailment problem, because topics might be hierarchically structured. For example, a section with heading "Treatment | Gene Therapy" might describe genetics as a subtopic of treatment.
- WikiSection-headings: an extended task with a larger target vocabulary (~1.5k words) that captures ambiguity in a heading. It further eliminates the need for normalized topic labels.
Our SECTOR architecture consists of four stages: sentence encoding, topic embedding, topic classification and topic segmentation.
The first stage of our SECTOR model transforms each sentence from plain text into a fixed-size sentence vector which serves as input into the neural network layers.
- Bag-of-words encoding: As a baseline, we compose sentence vectors using a weighted bag-of-words scheme.
- Bloom filter embedding: For large vocabularies and long documents, input matrices grow too large to fit into GPU memory, especially with larger batch sizes. Therefore we apply a compression technique for sparse sentence vectors based on Bloom filters (see the sketch after this list).
- Sentence embeddings: We use the strategy of Arora et al. (2017) to generate a distributional sentence representation based on pre-trained word2vec embeddings.
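To make the Bloom filter compression concrete, here is a minimal Python sketch that hashes a token bag into a fixed-size vector. The vector size `m`, the number of hash functions `k`, and the salted MD5 hashing are illustrative assumptions, not the paper's exact configuration.

```python
import hashlib
import numpy as np

def bloom_filter_vector(tokens, m=4096, k=5):
    """Compress a sparse bag-of-words into an m-dimensional Bloom filter vector.

    Each token activates k buckets chosen by k salted hash functions; hash
    collisions are tolerated, trading a little precision for a much smaller
    and fixed-size input, independent of the vocabulary size.
    """
    vec = np.zeros(m, dtype=np.float32)
    for token in tokens:
        for seed in range(k):
            h = hashlib.md5(f"{seed}:{token}".encode("utf-8")).hexdigest()
            vec[int(h, 16) % m] += 1.0   # counting variant; use = 1.0 for a binary filter
    return vec

# Two sentences mapped into fixed-size vectors regardless of vocabulary size
s1 = bloom_filter_vector("the disease is caused by a genetic mutation".split())
s2 = bloom_filter_vector("treatment includes gene therapy".split())
print(s1.shape, s2.shape)  # (4096,) (4096,)
```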
We model the second stage in our architecture to produce a dense distributional representation of latent topics for each sentence in the document. We use two LSTM layers with forget gates, connected to read the document in forward and backward directions. We feed the LSTM outputs to a 'bottleneck' layer with tanh activation, which serves as the topic embedding.
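A minimal PyTorch sketch of this stage, assuming illustrative dimensions (512-dim sentence vectors, 256 hidden units, 128-dim topic embedding) rather than the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn

class TopicEmbedding(nn.Module):
    """Bidirectional LSTM over sentence vectors with a tanh 'bottleneck' layer."""
    def __init__(self, input_dim=512, hidden_dim=256, topic_dim=128):
        super().__init__()
        # one forward and one backward LSTM reading the document sentence by sentence
        self.bilstm = nn.LSTM(input_dim, hidden_dim, num_layers=1,
                              batch_first=True, bidirectional=True)
        # bottleneck layer producing the dense topic embedding
        self.bottleneck = nn.Linear(2 * hidden_dim, topic_dim)

    def forward(self, sentence_vectors):
        # sentence_vectors: (batch, num_sentences, input_dim)
        lstm_out, _ = self.bilstm(sentence_vectors)         # (batch, T, 2*hidden_dim)
        topic_emb = torch.tanh(self.bottleneck(lstm_out))   # (batch, T, topic_dim)
        return topic_emb

# one document of 30 sentences, each encoded as a 512-dim sentence vector
doc = torch.randn(1, 30, 512)
print(TopicEmbedding()(doc).shape)  # torch.Size([1, 30, 128])
```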
The third stage in our architecture is the output layer that decodes the class labels. To learn model parameters required by the embedding, we need to optimize the full model for a training target.
- WikiSection-topics task: we use a simple one-hot encoding of the topic labels with a softmax activation output layer.
- WikiSection-headings task: we encode each heading as a lowercase bag-of-words vector, e.g. {gene, therapy, treatment}. We then use a sigmoid activation function in the output layer (see the sketch after this list).
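A small PyTorch sketch of the two output variants; the class and vocabulary sizes are placeholders, and the loss functions (cross-entropy vs. binary cross-entropy) are the standard choices implied by softmax vs. sigmoid outputs, not necessarily the paper's exact training setup.

```python
import torch
import torch.nn as nn

topic_dim, num_topics, heading_vocab = 128, 27, 1500   # illustrative sizes

# WikiSection-topics: one label per sentence -> softmax over topic classes
topics_head = nn.Linear(topic_dim, num_topics)
topics_loss = nn.CrossEntropyLoss()          # applies log-softmax internally

# WikiSection-headings: bag-of-words heading -> independent sigmoid per word
headings_head = nn.Linear(topic_dim, heading_vocab)
headings_loss = nn.BCEWithLogitsLoss()       # applies sigmoid internally

e = torch.randn(8, topic_dim)                          # 8 sentence topic embeddings
y_topic = torch.randint(0, num_topics, (8,))           # one class index per sentence
y_heads = torch.zeros(8, heading_vocab).scatter_(      # a few heading words per sentence
    1, torch.randint(0, heading_vocab, (8, 3)), 1.0)

print(topics_loss(topics_head(e), y_topic).item())
print(headings_loss(headings_head(e), y_heads).item())
```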
In the final stage, we leverage the information encoded in the topic embedding and output layers to segment the document and classify each section.
As a simple baseline method, we use prior information from the text and split sections at newline characters (NL). Additionally, we merge two adjacent sections if they are assigned the same topic label after classification.
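A rough Python sketch of this baseline; the data layout (per-sentence newline flags and per-sentence labels) is a hypothetical representation chosen for illustration.

```python
from itertools import groupby

def newline_baseline(sentences, newline_after, labels):
    """Split at newline characters, then merge adjacent sections with equal labels.

    sentences     -- list of sentence strings
    newline_after -- newline_after[i] is True if a newline follows sentence i
    labels        -- labels[i] is the topic label predicted for sentence i
    """
    # 1) split wherever the raw text contained a newline
    sections, start = [], 0
    for i, nl in enumerate(newline_after):
        if nl:
            sections.append((labels[start], sentences[start:i + 1]))
            start = i + 1
    if start < len(sentences):
        sections.append((labels[start], sentences[start:]))

    # 2) merge adjacent sections that were assigned the same topic label
    merged = [(label, [s for _, sents in group for s in sents])
              for label, group in groupby(sections, key=lambda x: x[0])]
    return merged
```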
All information required to classify each sentence in a document is contained in our dense topic embedding matrix. We are now interested in the vector space movement of this embedding over the sequence of sentences. Therefore, we apply a number of transformations adapted from Laplacian-of-Gaussian edge detection on images to obtain the magnitude of embedding deviation (emd) per sentence.
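A simplified Python sketch of the idea, using Gaussian smoothing over the sentence axis followed by the magnitude of the first difference; the smoothing width and the peak-picking threshold are illustrative assumptions, and the paper's exact filtering may differ.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def embedding_deviation(topic_emb, sigma=2.0):
    """Magnitude of topic-embedding movement per sentence step (emd).

    topic_emb: (T, d) matrix of topic embeddings for T sentences.
    Smooth each dimension over the sentence axis, then measure how strongly
    the smoothed trajectory changes between consecutive sentences.
    """
    smoothed = gaussian_filter1d(topic_emb, sigma=sigma, axis=0)   # (T, d)
    step = np.diff(smoothed, axis=0)                               # (T-1, d)
    return np.linalg.norm(step, axis=1)                            # (T-1,)

def segment_boundaries(emd, threshold=None):
    """Place a boundary at local maxima of the deviation signal above a threshold."""
    if threshold is None:
        threshold = emd.mean()
    return [i + 1 for i in range(1, len(emd) - 1)
            if emd[i] > emd[i - 1] and emd[i] > emd[i + 1] and emd[i] > threshold]

topic_emb = np.random.randn(30, 128)      # 30 sentences, 128-dim topic embeddings
print(segment_boundaries(embedding_deviation(topic_emb)))  # sentence indices starting new sections
```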
We adopt the approach of Sheikh et al. (2017), who examine the difference between the forward and backward layers of an LSTM for segmentation. However, our approach focuses on the difference of left and right topic context over time steps k, which allows for a sharper distinction between sections.
After segmentation, we assign each segment the mean class distribution of all contained sentences.
We conduct three experiments to evaluate the segmentation and classification task.
- WikiSection-topics experiment comprises segmentation and classification of each section with a single topic label out of a small number of clean labels (25–30 topics)
- WikiSection-headings experiment extends the classification task to multi-label per section with a larger target vocabulary (1.0k–2.8k words)
- Experiment to see how SECTOR performs across existing segmentation datasets
We compare SECTOR to common text segmentation methods as baselines: C99, TopicTiling, and the state-of-the-art TextSeg segmenter.
We compare SECTOR to existing models for single and multi-label sentence classification. Because we are not aware of any existing method for combined segmentation and classification, we first compare all methods using given prior segmentation from newlines in the text and then additionally apply our own segmentation strategies for plain text input.
For the experiments, we train a Paragraph Vectors model using all sections of the training sets.
Evaluation measures:
- Probabilistic Pk error score: measures text segmentation at sentence level by calculating the probability of a false boundary in a window of size k (lower numbers mean better segmentation). As relevant section boundaries we consider all section breaks where the topic label changes (a minimal Pk sketch follows this list).
- Micro-averaged F1 score for single-label or Precision@1 for multi-label classification: measures classification performance on section level by comparing the topic labels of all ground-truth sections with predicted sections; the pairs are selected by matching their positions using maximum boundary overlap.
- Mean Average Precision (MAP): evaluates the average fraction of true labels ranked above a particular label.
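A minimal sketch of the Pk computation at sentence level (following Beeferman et al.'s definition); the window-size heuristic and the toy example are illustrative.

```python
def pk_score(reference, hypothesis, k=None):
    """Probabilistic Pk error: probability of a segmentation mistake in a window of size k.

    reference / hypothesis are per-sentence segment ids; a boundary lies between
    positions i and i+k when the ids differ. Lower values mean better segmentation.
    """
    n = len(reference)
    if k is None:
        # conventional choice: half the average reference segment length
        k = max(1, round(n / (len(set(reference)) * 2)))
    errors = 0
    for i in range(n - k):
        ref_boundary = reference[i] != reference[i + k]
        hyp_boundary = hypothesis[i] != hypothesis[i + k]
        errors += ref_boundary != hyp_boundary
    return errors / (n - k)

# toy example: 10 sentences, reference break after sentence 5,
# predicted break placed two sentences too early
ref = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
hyp = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
print(round(pk_score(ref, hyp), 3))
```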
Key results:
- SECTOR outperforms existing classifiers
- Topic embeddings improve segmentation
- Bloom filters on par with word embeddings
- Topic embeddings perform well on noisy data
- SECTOR captures latent topics from context
We see an exciting future application of SECTOR as a building block to extract and retrieve topical passages from unlabeled corpora, such as medical research articles or technical papers. One possible task is WikiPassageQA, a benchmark to retrieve passages as answers to non-factoid questions from long articles.