Erinnerungslücken im NSU-Untersuchungsausschuss

Translation: Memory gaps in the parliamentary inquiry committee investigating the NSU (National Socialist Underground)

A parliamentary inquiry committee was set up in 2016 with the aim of comprehensively investigating the significant failures and errors of the security authorities in connection with the right-wing extremist terrorist organization NSU. Particularly striking in these hearings is the considerable lack of memory that the summoned witnesses repeatedly claimed, which ultimately prevented a comprehensive investigation.

I have automatically captured a large number of such instances in which witnesses expressed their inability to remember. The results are visualized here. In this repository I describe the tools and methods I used.

1. Scraping PDFs and parsing content

As I could only find the transcript in PDF format, I had to scrape its content first. The process is straightforward: I use pdfminer to create an XML representation of each PDF page and then parse these XML files.
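A minimal sketch of this step, assuming pdfminer.six; the file name protokoll.pdf is a placeholder, and the actual parsing of the committee transcripts is more involved than shown here.

```python
# Sketch: convert a PDF to pdfminer's XML representation and pull out text lines.
import io
import xml.etree.ElementTree as ET

from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams


def pdf_to_xml(pdf_path: str) -> str:
    """Convert a PDF file into pdfminer's XML representation."""
    output = io.BytesIO()
    with open(pdf_path, "rb") as pdf_file:
        extract_text_to_fp(pdf_file, output, output_type="xml", laparams=LAParams())
    return output.getvalue().decode("utf-8")


def xml_to_lines(xml_string: str) -> list[str]:
    """Collect the text of each <textline> element, character by character."""
    root = ET.fromstring(xml_string)
    lines = []
    for textline in root.iter("textline"):
        # In pdfminer's XML output, every <text> element holds a single character.
        line = "".join(t.text or "" for t in textline.iter("text"))
        if line.strip():
            lines.append(line.strip())
    return lines


if __name__ == "__main__":
    xml = pdf_to_xml("protokoll.pdf")  # placeholder file name
    for line in xml_to_lines(xml):
        print(line)
```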

2. Semantic matching

2.1 With Regular Expressions:

In a first attempt I matched a few commonly used expressions with simple regular expression rules. Sentences like these:

  • Ich erinnere das nicht mehr ("I no longer remember that")
  • Ich kann mich nicht mehr erinnern ("I can no longer remember")
  • Das ist mir nicht erinnerlich ("I have no recollection of that")

can easily be matched with a rule like this:

.*?erinnere.*?nicht.*?|.*?nicht.*?erinner.*?

(.*? matches any sequence of characters, ensuring that small variations with the same sentence structure are captured too)

However, this approach can easily misclassify, as there is quite a high chance that the matched negation does not relate to the verb or noun in question. To avoid this as far as possible, I run the matching rules only on subordinate clauses, i.e. I use the start and end of a sentence, a comma, a semicolon or a colon as demarcations, as sketched below.
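The following sketch combines the pattern from above with the clause splitting just described; the function and variable names are my own and the clause delimiters follow my reading of the text.

```python
# Sketch: apply the "I don't remember" pattern to individual clauses only.
import re

MEMORY_PATTERN = re.compile(r".*?erinnere.*?nicht.*?|.*?nicht.*?erinner.*?")

# Demarcations: end of sentence, comma, semicolon or colon.
CLAUSE_DELIMITERS = re.compile(r"[.!?,;:]")


def matches_memory_gap(sentence: str) -> bool:
    """Return True if any clause of the sentence matches the pattern."""
    for clause in CLAUSE_DELIMITERS.split(sentence):
        if MEMORY_PATTERN.match(clause.lower()):
            return True
    return False


if __name__ == "__main__":
    print(matches_memory_gap("Ich kann mich nicht mehr erinnern."))  # True
    # Without clause splitting, the negation in the first clause and the verb
    # in the second clause would be matched together, producing a false positive.
    print(matches_memory_gap("Das ist nicht richtig, aber ich erinnere den Tag."))  # False
```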

By adding a variety of synonymous expressions that came to mind, I was able to match quite a few instances with this method. Yet it is easy to miss some, so I looked for a more generalized approach.

2.2 With BERT:

In a second attempt I trained a sentence classifier with BERT. BERT is a language representation model based on contextual word embeddings: it encodes words in a multidimensional vector space based on the contexts (the surrounding words) in which they appear in a sentence. This has the interesting effect that some level of semantic information is encoded in the vector space as well, as similar and synonymous words end up close to each other.
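To make that intuition concrete, here is a small sketch that mean-pools BERT's contextual embeddings into sentence vectors and compares two paraphrases; the model name bert-base-german-cased is an assumption, not necessarily the one used in the notebook.

```python
# Sketch: similar phrasings land close together in BERT's embedding space.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")  # assumed model
model = BertModel.from_pretrained("bert-base-german-cased")
model.eval()


def sentence_embedding(sentence: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single vector per sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0)


a = sentence_embedding("Ich kann mich nicht mehr erinnern.")
b = sentence_embedding("Das ist mir nicht erinnerlich.")
print(torch.cosine_similarity(a, b, dim=0))  # high similarity for the two paraphrases
```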

Using the regular expressions from above, I labeled a training dataset, which I then used to fine-tune the BERT model for this specific classification task. The task is analogous to a spam vs. not-spam problem, i.e. I-don't-remember vs. anything-else.
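A minimal sketch of the fine-tuning step with the Hugging Face transformers library; the model name, the hyperparameters and the two example sentences are placeholders, not the actual training setup from the notebook.

```python
# Sketch: fine-tune BERT as a binary sentence classifier.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")  # assumed model
model = BertForSequenceClassification.from_pretrained(
    "bert-base-german-cased", num_labels=2  # 0 = anything-else, 1 = I-don't-remember
)

# Sentences labelled via the regular expressions from above (placeholder data).
sentences = ["Ich kann mich nicht mehr erinnern.", "Der Termin war im Juni 2014."]
labels = [1, 0]

encodings = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")
dataset = TensorDataset(encodings["input_ids"], encodings["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
        outputs.loss.backward()
        optimizer.step()
```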

The Jupyter notebook can be found here or opened directly in Colab: Open In Colab. When opening it in Colab, a GPU has to be selected under Runtime -> Change runtime type.
