Multi-dimension Diversification in Legal Information Retrieval

This page is a companion for the WISE 2016 paper on [Multi-dimension Diversification in Legal Information Retrieval](http://dx.doi.org/10.1007/978-3-319-48740-3_12), written by Marios Koniaris (me), Ioannis Anagnostopoulos and Yannis Vassiliou. It hosts the complete dataset, ground-truth data, queries and relevance assessments used in the article. Our goal is to encourage progress on diversification in legal IR.

Dataset

CourtListener

Our corpus contains 63,742 precedential legal cases from the Supreme Court of the United States, originally downloaded from CourtListener. It includes all cases from the Court, covering more than two centuries of legal history from 1754 to 2015. From the text of each case we extracted the information needed by our feature selection framework, e.g. relationships to other documents and the date of judgment. Detailed instructions for the RESTful API can be found here.

Supreme Court Database

Since our corpus was initially unclassified, we acquired topical taxonomies from the Supreme Court Database, linking records through the shared unique identifier, the SCDB Case ID. The topical taxonomies in the Supreme Court Database are the outcome of a manual analysis and interpretation of the legal provisions considered in each case. An introduction to the Online Code Book can be found here, while download and use instructions can be found here.
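The linking step described above can be sketched as a simple key lookup. This is only an illustration: the field names `scdb_id` and `issue_area`, the function `attach_topics`, and the sample records are hypothetical and are not taken from the CourtListener or Supreme Court Database schemas.

```python
def attach_topics(cases, scdb_rows):
    """Attach a topical label to each case via a shared case identifier.

    Field names here are hypothetical placeholders, not the real schemas.
    """
    # Index the taxonomy rows by their case identifier for O(1) lookup.
    topic_by_id = {row["scdb_id"]: row["issue_area"] for row in scdb_rows}
    labelled = []
    for case in cases:
        # Cases with no matching identifier get None instead of a label.
        topic = topic_by_id.get(case.get("scdb_id"))
        labelled.append({**case, "issue_area": topic})
    return labelled


# Made-up records, for illustration only.
cases = [
    {"scdb_id": "X-001", "title": "Case one"},
    {"scdb_id": "X-999", "title": "Case two"},
]
scdb_rows = [{"scdb_id": "X-001", "issue_area": "Civil Rights"}]
labelled = attach_topics(cases, scdb_rows)
```

Cases whose identifier has no match stay in the output with an empty label, so they can be inspected or dropped in a later step.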

West Law Digest Topics

West Law Digest Topics is a taxonomy that identifies points of law from reported cases and organizes them by topic and key number. It is used to organize the entire body of American law.

We downloaded this list from WestLaw, processed it, and obtained a textual representation of it.

Original Topics/queries

Each topic was issued as a candidate query to our retrieval system. Outlier queries, whether too specific/rare or too general, were removed using the interquartile range: queries falling below Q1 or above Q3 were discarded, applied sequentially to the number of hits in the result set and to the score distribution of the hits, while also requiring a minimum coverage of min|N| results.
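As a rough illustration of the filtering described above, the following sketch keeps only queries whose value lies between Q1 and Q3 and whose hit count meets a minimum. It is one possible reading of the procedure, not the code used for the paper: the function name `iqr_filter`, the `min_results` parameter, the choice of inclusive quartiles, and the sample hit counts are all assumptions.

```python
from statistics import quantiles


def iqr_filter(value_per_query, hits_per_query, min_results):
    """Keep queries whose value lies in [Q1, Q3] and that have enough hits.

    Assumption: quartiles are computed with the 'inclusive' method.
    """
    q1, _, q3 = quantiles(list(value_per_query.values()), n=4, method="inclusive")
    return {
        query: value
        for query, value in value_per_query.items()
        if q1 <= value <= q3 and hits_per_query[query] >= min_results
    }


# Made-up hit counts per candidate query, for illustration only.
hits = {"a": 2, "b": 40, "c": 55, "d": 60, "e": 70, "f": 5000}

# First pass: filter on the number of hits.
kept = iqr_filter(hits, hits, min_results=10)
```

A second pass over a per-query score statistic (for example the mean retrieval score of the hits) would call the same function with that statistic as `value_per_query`, restricted to the queries kept by the first pass.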

Used Topics/queries

Our final list of user queries. In total, we kept 330 queries.

Query assessments and ground truth

For each topic/query we kept the top-n results. An LDA topic model, using an open-source implementation (MALLET), was trained on the top-n results for each query. From the resulting topic distribution of each document, with an acceptance threshold of 15%, we derive relevance judgments for each query, document and subtopic. In other words, we treat the topics produced by LDA as aspects of each query, and from the topic/document distribution we infer whether a document is relevant to an aspect. In total, we acquired 1,650 subtopics for the 330 queries. Our ground-truth data can be found:
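The thresholding step can be sketched as follows, assuming the per-document topic distributions have already been exported from the topic model. The function name `aspect_judgments`, the input layout, the use of `>=` at the 15% threshold, and the sample distributions are assumptions made for illustration; the actual file format of the ground-truth data is not reproduced here.

```python
def aspect_judgments(doc_topics, threshold=0.15):
    """Turn per-document topic distributions into aspect-level judgments.

    doc_topics maps a document id to a list of topic probabilities.
    A document is judged relevant to an aspect (topic) when its
    probability for that topic reaches the threshold (assumed inclusive).
    """
    judgments = []
    for doc_id, distribution in doc_topics.items():
        for aspect, probability in enumerate(distribution):
            if probability >= threshold:
                judgments.append((doc_id, aspect, 1))
    return judgments


# Made-up distributions over five topics for one query.
doc_topics = {
    "doc1": [0.70, 0.20, 0.05, 0.03, 0.02],
    "doc2": [0.10, 0.10, 0.60, 0.10, 0.10],
}
judgments = aspect_judgments(doc_topics)
```

In this example `doc1` is judged relevant to aspects 0 and 1, and `doc2` only to aspect 2. A document can therefore count toward several aspects of the same query, which is what diversity-aware evaluation measures rely on.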

Stop Words

Our stop word list can be found here.

Citing

If you use queries and relevance assessments utilized in this work in your research, please cite:

@inproceedings{KoniarisAV16,
author="Koniaris, Marios and Anagnostopoulos, Ioannis and Vassiliou, Yannis",
title="Multi-dimension Diversification in Legal Information Retrieval",
bookTitle="Web Information Systems Engineering - {WISE} 2016 - 17th International Conference, Shanghai, China, November 8-10, 2016, Proceedings, Part I",
year="2016",
publisher="Springer International Publishing",    
pages="174--189",    
doi="10.1007/978-3-319-48740-3_12",
url="http://dx.doi.org/10.1007/978-3-319-48740-3_12"
}

