
Columbia importer updated #2865

Open · wants to merge 127 commits into main
Conversation

quevon24
Member

@quevon24 quevon24 commented Jul 6, 2023

This PR contains the updated version of the Columbia importer. It includes many changes, such as:

  • Update the codebase to match Python 3.11 style
  • Replace deprecated functions
  • Add type hints
  • Remove the court regexes and use courts-db to find courts (we may need to update courts-db for the tests to pass, see PR 74)
  • Replace etree with BeautifulSoup to parse the XML files
  • Store opinions in the correct order
  • Store opinion footnotes
  • Find duplicates using the citation, docket number, case name, and opinion content
  • Add citations when a duplicate is found
  • Store the syllabus
  • Pass a CSV file path as an argument, containing absolute paths to the XML files
  • If we have a possible match, only log a message and abort the import of that file instead of adding data to the matched cluster, so that we can review the logs manually
  • Default XML directory: /opt/courtlistener/_columbia
  • Default CSV location: /opt/courtlistener/_columbia/columbia_import.csv
  • Log all messages to a file so they can be reviewed manually without needing to watch the container logs
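As a rough illustration of the BeautifulSoup-based parsing mentioned above, here is a minimal sketch (not the PR's actual code). The tag names follow the sample XML quoted later in this thread; `html.parser` is used only to keep the sketch dependency-light, and `extract_fields` is a hypothetical helper name.

```python
# Hypothetical sketch of parsing a Columbia XML file with BeautifulSoup.
from bs4 import BeautifulSoup

SAMPLE = (
    "<opinion>"
    "<docket><center>No. 05-06-01227-CV</center></docket>"
    "<court><center>Court of Appeals of Texas, Fifth District, Dallas.</center></court>"
    "<date><center>Opinion issued October 23, 2006.</center></date>"
    "</opinion>"
)

def extract_fields(xml_text: str) -> dict:
    # Pull the text of a few well-known tags; real parsing is far richer.
    soup = BeautifulSoup(xml_text, "html.parser")
    return {
        tag: soup.find(tag).get_text(strip=True)
        for tag in ("docket", "court", "date")
    }
```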

Based on the data in local_path in the Opinion model, roughly 1.2M files have to be imported. The real number could be lower, because some of the cases in this list are already imported from a different source.

Usage:

Import using a CSV file whose rows contain XML file paths relative to the mounted directory:
docker-compose -f docker/courtlistener/docker-compose.yml exec cl-django python manage.py import_columbia --csv /opt/courtlistener/cl/assets/media/testfile.csv

CSV example:

filepath
michigan/supreme_court_opinions/documents/d5a484f1bad20ba0.xml

Import specifying the mounted directory where the XML files are located:
docker-compose -f docker/courtlistener/docker-compose.yml exec cl-django python /opt/courtlistener/manage.py import_columbia --csv /opt/courtlistener/cl/assets/media/files_to_import.csv --xml-dir /opt/courtlistener/columbia_files

@quevon24 quevon24 requested a review from flooie July 6, 2023 18:37
@quevon24 quevon24 self-assigned this Jul 6, 2023
@quevon24 quevon24 marked this pull request as draft July 15, 2023 01:08
@quevon24 quevon24 requested a review from grossir May 22, 2024 17:32
@quevon24
Member Author

@grossir, when you have time, could you take a look?

This is a sample file to test the command:

random_sample_1.zip

To run the command, copy the zip contents to cl/assets/media:

docker-compose -f docker/courtlistener/docker-compose.yml exec cl-django python /opt/courtlistener/manage.py import_columbia --csv /opt/courtlistener/cl/assets/media/random_sample_1.csv --xml-dir /opt/courtlistener/cl/assets/media/random_sample_1

If you have any questions, I'll stay tuned.

Contributor

@grossir grossir left a comment


It ingested ~766 dockets/opinion clusters out of 1000 documents.
I ran the script 4 times and got 6 triplicated ingestions; maybe you can check in your environment too. I used this query to detect them:

select 
    local_path, html_columbia,  count(*)
FROM 
    search_docket sd 
inner join search_opinioncluster oc 
    on docket_id=sd.id 
inner join search_opinion 
    on search_opinion.cluster_id=oc.id 
where 
    sd.date_created::date = '2024-05-23'::date 
group by local_path , html_columbia
having count(*) > 1;

I left some comments, mostly ideas for improvements

I haven't really tested the matching algorithms beyond the most basic duplication. If you could send another sample file, sampled from the most recent opinions (so that it is easier to find them on the web pages / scrape them), that would help test those parts.


# Add date data into columbia dict
columbia_data.update(find_dates_in_xml(soup))

I think there are some missing FILED_TAGS strings in columbia_utils.py.
For example, for texas/court_opinions/documents/5c8dba31985162bf.xml in the sample, the following date is parsed: [[('opinion issued', datetime.date(2006, 10, 23))]], but it is not assigned to the "date_filed" key.

Looking at the raw text, I think it should be considered "date_filed":

<opinion>
<reporter_caption><center>IN RE LARREW, 05-06-01227-CV (Tex.App.-Dallas 10-23-2006)</center></reporter_caption>
<caption><center>IN RE STEPHEN JAMES LARREW, Relator.</center></caption>
<docket><center>No. 05-06-01227-CV</center></docket><court><center>Court of Appeals of Texas, Fifth District, Dallas.</center></court>
<date><center>Opinion issued October 23, 2006.</center>

I see that "opinion issued" is included in ARGUED_TAGS rather than FILED_TAGS; I'm not sure about the logic for this.


I collected all the documents that do have dates but no filed date. Some are obviously not OpinionCluster.date_filed, like "case announcements and administrative actions", but about some others I am not so sure:

{'texas/court_opinions/documents/5c8dba31985162bf.xml': 'opinion issued',
 'arkansas/court_opinions/documents/ae218d6345f5d320.xml': 'opinion delivered',
 'texas/court_opinions/documents/5f4fe3e1c4e72785.xml': 'opinion delivered',
 'arkansas/court_opinions/documents/96742836d45c4996.xml': 'opinion delivered',
 'arkansas/court_opinions/documents/d218fb45d4055bdd.xml': 'opinion delivered',
 'maryland/court_of_appeals_opinions/documents/61198adb4b840f4d.xml': 'denied',
 'arkansas/court_opinions/documents/5793152fb3e371a3.xml': 'opinion delivered',
 'maryland/court_of_appeals_opinions/documents/cda2e7a6c083f661.xml': 'denied',
 'texas/court_opinions/documents/f7f4eb4e0bb7e71a.xml': 'opinion delivered and filed',
 'connecticut/appellate_court_opinions/documents/e3e9aa07cc97f60f.xml': 'officially released',
 'arkansas/court_opinions/documents/0b027f05aa07c2af.xml': 'opinion delivered',
 'texas/court_opinions/documents/248981bf18493e9d.xml': 'opinion delivered',
 'arkansas/court_opinions/documents/a968b68353ffe980.xml': 'opinion delivered',
 'arkansas/court_opinions/documents/ebdc8da5b2ec8fe9.xml': 'opinion delivered',
 'michigan/supreme_court_opinions/documents/63efa26d555875ea.xml': 'leave to appeal denied',
 'texas/court_opinions/documents/d4a6653c3a7c08fe.xml': 'delivered',
 'maryland/court_of_appeals_opinions/documents/69cb6658d5b0324d.xml': 'granted',
 'texas/court_opinions/documents/0904c0a3016f8421.xml': 'delivered',
 'texas/court_opinions/documents/a43217e67bd08858.xml': 'opinion issued',
 'texas/court_opinions/documents/c48edff93471911d.xml': 'opinion issued',
 'ohio/court_opinions/documents/52a07db0c124634f.xml': 'case announcements and administrative actions',
 'arkansas/court_opinions/documents/e628b04ac0dcd6f1.xml': 'opinion delivered',
 'maryland/court_of_appeals_opinions/documents/9067dae3cd6d312e.xml': 'denied',
 'arkansas/court_opinions/documents/300ebbd01ba38398.xml': 'opinion delivered',
 'maryland/court_of_appeals_opinions/documents/217ae38fdf9869af.xml': 'denied',
 'arkansas/court_opinions/documents/9e3f71089f9d11dc.xml': 'opinion delivered',
 'maryland/court_of_appeals_opinions/documents/6600fe895d37d853.xml': 'denied',
 'arkansas/court_opinions/documents/2c71f85af35b9e0f.xml': 'opinion delivered',
 'texas/court_opinions/documents/cecfdd58268e8f07.xml': 'opinion issued',
 'texas/court_opinions/documents/60a231f3da6a421f.xml': 'memorandum opinion delivered and filed',
 'arkansas/court_opinions/documents/d37e7ba255a67a6d.xml': 'opinion delivered',
 'texas/court_opinions/documents/b78271984621969a.xml': 'opinion issued',
 'maryland/court_of_appeals_opinions/documents/4b97a5803331bb29.xml': 'denied',
 'maryland/court_of_appeals_opinions/documents/9eefe6f3e03131f7.xml': 'denied',
 'massachusetts/superior_court_opinions/documents/161739ca6ca6348b.xml': 'memorandum dated',
 'maryland/court_of_appeals_opinions/documents/6ac77c8a8002a723.xml': 'denied',
 'texas/court_opinions/documents/3bcee6268dd18a72.xml': 'opinion issued',
 'texas/court_opinions/documents/4d909c6b7d4de7e4.xml': 'opinion delivered',
 'arkansas/court_opinions/documents/8ddc4fe19662d9fb.xml': 'opinion delivered',
 'arkansas/court_opinions/documents/9d7c8e94e2c2b40f.xml': 'opinion delivered',
 'texas/court_opinions/documents/2db17b19d30d85df.xml': 'opinion issued',
 'arkansas/court_opinions/documents/3dd26fb70896c79b.xml': 'opinion delivered',
 'arkansas/court_opinions/documents/ab59ead0feee789f.xml': 'opinion delivered',
 'maryland/court_of_appeals_opinions/documents/9d65b83825eed85f.xml': 'denied',
 'michigan/supreme_court_opinions/documents/1287940f26660dfa.xml': 'summary dispositions',
 'arkansas/court_opinions/documents/6ae7cc75fc0cd311.xml': 'opinion delivered',
 'connecticut/appellate_court_opinions/documents/96cb6396c50954b6.xml': 'officially released',
 'connecticut/appellate_court_opinions/documents/4df558796ec8e60c.xml': 'decision released'}

cl/corpus_importer/management/commands/import_columbia.py (outdated review comment, resolved)
@quevon24
Member Author

quevon24 commented Jun 4, 2024

I implemented three small tweaks to reduce the number of duplicates:

  • Use the SHA1 of the XML file to find cases already imported into the system from the same source and skip them
  • When the opinion source is Harvard, the opinion sometimes ends with a <page_number> tag; removing that tag increases the accuracy of the opinion content match a little, enough to match Columbia's opinion content.
  • Sometimes in CL we have the same opinion content as in Columbia, but with some extra data (for example, when we match the XML file with a Lawbox opinion; Lawbox opinions include metadata in the opinion content, such as the citation, case name, court and docket number). One of the algorithms checks whether a given opinion is a subset of the other, and adding an extra condition helps overcome that issue.
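Two of the tweaks above could look roughly like the sketch below. The normalization rules and helper names are assumptions, not the PR's actual helpers: a SHA1 to skip files already imported from the same source, and a containment check that treats one opinion as a subset of the other.

```python
# Minimal sketches of the SHA1 dedup and the subset check; details assumed.
import hashlib
import re

def file_sha1(data: bytes) -> str:
    # Fingerprint the raw XML bytes to skip files already imported.
    return hashlib.sha1(data).hexdigest()

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase before comparing opinion texts.
    return re.sub(r"\s+", " ", text).strip().lower()

def is_subset(opinion_a: str, opinion_b: str) -> bool:
    # True when either normalized text is contained in the other.
    a, b = normalize(opinion_a), normalize(opinion_b)
    return a in b or b in a
```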

Besides that, I found that in some cases we can have a match (same filed date, citation, docket number and court) but the opinion content is largely different, for example with this file:
e32cc12d6481ddab.xml.zip
and the cluster: https://www.courtlistener.com/opinion/1599823/go/

in cl we have:

1 So.3d 181 (2009)
MORALES
v.
McNEIL.
No. 1D08-1497.
District Court of Appeal of Florida, First District.

January 21, 2009.
Decision without published opinion. Affirmed.

but in the xml file we have:

AFFIRMED.
BARFIELD, ALLEN, and THOMAS, JJ., CONCUR.
NOT FINAL UNTIL TIME EXPIRES TO FILE MOTION FOR REHEARING AND DISPOSITION THEREOF IF FILED.

Even if we analyze the Lawbox structure to remove the metadata, the opinion content is so different that the algorithms comparing the opinions will fail.
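One way to see why such a pair fails is a quick similarity ratio between normalized texts; the function name and any threshold are illustrative, not the PR's.

```python
# Quick-and-dirty similarity ratio between normalized opinion texts.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Collapse whitespace and lowercase, then compare.
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

cl_text = "Decision without published opinion. Affirmed."
xml_text = "AFFIRMED. BARFIELD, ALLEN, and THOMAS, JJ., CONCUR."
# A low ratio here would flag the pair as too different to merge automatically.
```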

I already discussed this with @flooie and he mentioned that there is no problem creating a few duplicates; there are already duplicates in the system, so later we will need to implement a specialized command to merge/eliminate them.

I used some random data from the columbia merger matches to try to refine the matching process as much as possible to reduce duplicates.

It can be tested by cloning these clusters, @grossir:

docker exec -it cl-django python /opt/courtlistener/manage.py clone_from_cl --type search.OpinionCluster --id 1053004 1067190 1164370 1170784 1275317 1296066 1397507 1549674 1580680 1584677 1599823 1642902 1709280 1731105 1737294 1755415 1759137 1769728 1820817 1919526 1920946 2064181 2066843 2076964 2081374 2101230 2103734 2122631 2132495 2133675 2153727 2161907 2174492 2183086 2200571 2213250 2345270 2381555 2402588 2623236 2634645 2879561 3128761 4911658 4970197 5058966 5229804 5241280 5277538 5332211 5346849 5408815 5468476 5471787 5492419 5509380 5509866 5527155 5546291 5553255 5558539 5569133 5682554 5748198 5816741 5852819 5899451 5956020 6014385 6054446 6071818 6096538 6200591 6390061 6500612 6568509 6671890 6867776 6894438 7161052 7260227 7263229 7384628 7495412 7512606 7513120 7575587 7590136 7611798 7612979 7623609 7635344 7648384 7650369 7668072 7760901 7777728 7847184 7912915 8885056

putting these files in cl/assets/media/random_sample_2
random_sample_2.zip

and then running the command with this file:
random_sample_2.csv

docker-compose -f docker/courtlistener/docker-compose.yml exec cl-django python /opt/courtlistener/manage.py import_columbia --csv /opt/courtlistener/cl/assets/media/random_sample_2.csv --xml-dir /opt/courtlistener/cl/assets/media/random_sample_2

Contributor

@grossir grossir left a comment


The importer is working, and the duplicates issue is gone.

As a general comment, I only got 5 new opinions ingested out of the 100 test cases; most of the others that do have matches were left aside for manual review with a message like "Match found with cluster id: 2161907 for columbia file: california/court_of_appeal_opinions/documents/59ca503825f41539.xml". But will it be feasible to manually check 95% of the dataset, if this sample is somewhat representative?

inner_opinion_tags = all_opinions_soup.find_all()
if inner_opinion_tags and inner_opinion_tags[-1].name == "page_number":

Some opinions have more than one <page_number> tag. Why not get rid of all of those tags instead of only the last one?
For example, arizona/court_opinions/documents/e574c2908e4b3f3d.xml has 3 such tags.
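That suggestion could be sketched like this; stdlib ElementTree is used only to keep the example self-contained (the PR itself parses with BeautifulSoup), and the simple version assumes the page_number tags carry no tail text.

```python
# Strip every <page_number> element, not only a trailing one.
import xml.etree.ElementTree as ET

def strip_page_numbers(xml_text: str) -> str:
    root = ET.fromstring(xml_text)
    # Materialize the element list first so removals don't upset iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "page_number":
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```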

@quevon24
Member Author

I have already reviewed the code in detail. With the latest changes, the number of new cases was reduced, so we need another sample to see whether the same thing happens; maybe the algorithm to find duplicates needs to be improved.

Gianfranco left several comments; some have already been addressed, but others still need work.

The most complex part of the code is parsing the XML and merging the floating opinions (those that do not have an author) correctly, since in the old importer some opinions were merged incorrectly (this can cause us not to find the case if the content of the opinion differs greatly).

An important part to improve is the list of allowed texts used to identify the date type (FILED_TAGS, DECIDED_TAGS, ARGUED_TAGS, etc.), specifically the filed date; this part was taken from the original command.

I need to keep working on this PR to see what can be improved.
