Columbia importer updated #2865
base: main
Conversation
- remove unused code; remove unused imports
- Reduce amount of files used; use match_based_text function from harvard to find duplicated content; typing added
- log message when court doesn't exist in courtlistener
- handle duplicate citations in xml
- Log message when case has no citations; handle single volume nominative reporters
- fix typing
@grossir when you have time available, could you take a look? This is a sample file to test the command. To run the command you need to copy the zip content to cl/assets/media.
If you have any questions, I'll stay tuned.
It ingested ~766 dockets/opinion clusters out of 1000 documents
I ran the script 4 times and got 6 triplicated ingestions; maybe you can check on your environment too. I used this query to detect them:
SELECT
    local_path, html_columbia, COUNT(*)
FROM
    search_docket sd
    INNER JOIN search_opinioncluster oc ON oc.docket_id = sd.id
    INNER JOIN search_opinion ON search_opinion.cluster_id = oc.id
WHERE
    sd.date_created::date = '2024-05-23'::date
GROUP BY local_path, html_columbia
HAVING COUNT(*) > 1;
I left some comments, mostly ideas for improvements
I haven't really tested the matching algorithms beyond the most basic duplication. If you could send another sample file, sampled from the most recent opinions (so that it is easier to search for them on the web pages / scrape them), it would help test those parts.
# Add date data into columbia dict
columbia_data.update(find_dates_in_xml(soup))
I think there are some missing "FILED_TAGS" strings in columbia_utils.py. For example, for texas/court_opinions/documents/5c8dba31985162bf.xml in the sample, the following date is parsed: [[('opinion issued', datetime.date(2006, 10, 23))]] but it is not assigned to the "date_filed" key.
Looking at the raw text I think it should be considered as "date_filed":
<opinion>
<reporter_caption><center>IN RE LARREW, 05-06-01227-CV (Tex.App.-Dallas 10-23-2006)</center></reporter_caption>
<caption><center>IN RE STEPHEN JAMES LARREW, Relator.</center></caption>
<docket><center>No. 05-06-01227-CV</center></docket><court><center>Court of Appeals of Texas, Fifth District, Dallas.</center></court>
<date><center>Opinion issued October 23, 2006.</center>
I see that "opinion issued" is included in ARGUED_TAGS rather than FILED_TAGS; not sure about the logic for this.
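To make the tag-to-field mapping concrete, here is a minimal sketch of how a classifier over (tag, date) pairs could fill the cluster date fields. The FILED_TAGS and ARGUED_TAGS lists below are hypothetical subsets for illustration only; the real lists in columbia_utils.py are longer, and "opinion issued" currently sits in ARGUED_TAGS rather than FILED_TAGS.

```python
import datetime

# Hypothetical subsets of the tag lists in columbia_utils.py, with
# "opinion issued" moved into FILED_TAGS as this comment suggests.
FILED_TAGS = ["filed", "opinion filed", "date", "opinion issued"]
ARGUED_TAGS = ["argued", "submitted"]

def classify_dates(dates):
    """Map (tag, date) pairs onto the cluster date fields they should fill.

    The first match wins, mirroring a "take the earliest plausible
    candidate" policy; this is an assumption, not the importer's logic.
    """
    result = {}
    for tag, date in dates:
        if tag in FILED_TAGS:
            result.setdefault("date_filed", date)
        elif tag in ARGUED_TAGS:
            result.setdefault("date_argued", date)
    return result

# With "opinion issued" in FILED_TAGS, the Texas example above resolves:
print(classify_dates([("opinion issued", datetime.date(2006, 10, 23))]))
# → {'date_filed': datetime.date(2006, 10, 23)}
```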
I collected all the documents that do have dates but no date filed. Some are obviously not OpinionCluster.date_filed, like "case announcements and administrative actions", but about some others I am not so sure:
{'texas/court_opinions/documents/5c8dba31985162bf.xml': 'opinion issued',
'arkansas/court_opinions/documents/ae218d6345f5d320.xml': 'opinion delivered',
'texas/court_opinions/documents/5f4fe3e1c4e72785.xml': 'opinion delivered',
'arkansas/court_opinions/documents/96742836d45c4996.xml': 'opinion delivered',
'arkansas/court_opinions/documents/d218fb45d4055bdd.xml': 'opinion delivered',
'maryland/court_of_appeals_opinions/documents/61198adb4b840f4d.xml': 'denied',
'arkansas/court_opinions/documents/5793152fb3e371a3.xml': 'opinion delivered',
'maryland/court_of_appeals_opinions/documents/cda2e7a6c083f661.xml': 'denied',
'texas/court_opinions/documents/f7f4eb4e0bb7e71a.xml': 'opinion delivered and filed',
'connecticut/appellate_court_opinions/documents/e3e9aa07cc97f60f.xml': 'officially released',
'arkansas/court_opinions/documents/0b027f05aa07c2af.xml': 'opinion delivered',
'texas/court_opinions/documents/248981bf18493e9d.xml': 'opinion delivered',
'arkansas/court_opinions/documents/a968b68353ffe980.xml': 'opinion delivered',
'arkansas/court_opinions/documents/ebdc8da5b2ec8fe9.xml': 'opinion delivered',
'michigan/supreme_court_opinions/documents/63efa26d555875ea.xml': 'leave to appeal denied',
'texas/court_opinions/documents/d4a6653c3a7c08fe.xml': 'delivered',
'maryland/court_of_appeals_opinions/documents/69cb6658d5b0324d.xml': 'granted',
'texas/court_opinions/documents/0904c0a3016f8421.xml': 'delivered',
'texas/court_opinions/documents/a43217e67bd08858.xml': 'opinion issued',
'texas/court_opinions/documents/c48edff93471911d.xml': 'opinion issued',
'ohio/court_opinions/documents/52a07db0c124634f.xml': 'case announcements and administrative actions',
'arkansas/court_opinions/documents/e628b04ac0dcd6f1.xml': 'opinion delivered',
'maryland/court_of_appeals_opinions/documents/9067dae3cd6d312e.xml': 'denied',
'arkansas/court_opinions/documents/300ebbd01ba38398.xml': 'opinion delivered',
'maryland/court_of_appeals_opinions/documents/217ae38fdf9869af.xml': 'denied',
'arkansas/court_opinions/documents/9e3f71089f9d11dc.xml': 'opinion delivered',
'maryland/court_of_appeals_opinions/documents/6600fe895d37d853.xml': 'denied',
'arkansas/court_opinions/documents/2c71f85af35b9e0f.xml': 'opinion delivered',
'texas/court_opinions/documents/cecfdd58268e8f07.xml': 'opinion issued',
'texas/court_opinions/documents/60a231f3da6a421f.xml': 'memorandum opinion delivered and filed',
'arkansas/court_opinions/documents/d37e7ba255a67a6d.xml': 'opinion delivered',
'texas/court_opinions/documents/b78271984621969a.xml': 'opinion issued',
'maryland/court_of_appeals_opinions/documents/4b97a5803331bb29.xml': 'denied',
'maryland/court_of_appeals_opinions/documents/9eefe6f3e03131f7.xml': 'denied',
'massachusetts/superior_court_opinions/documents/161739ca6ca6348b.xml': 'memorandum dated',
'maryland/court_of_appeals_opinions/documents/6ac77c8a8002a723.xml': 'denied',
'texas/court_opinions/documents/3bcee6268dd18a72.xml': 'opinion issued',
'texas/court_opinions/documents/4d909c6b7d4de7e4.xml': 'opinion delivered',
'arkansas/court_opinions/documents/8ddc4fe19662d9fb.xml': 'opinion delivered',
'arkansas/court_opinions/documents/9d7c8e94e2c2b40f.xml': 'opinion delivered',
'texas/court_opinions/documents/2db17b19d30d85df.xml': 'opinion issued',
'arkansas/court_opinions/documents/3dd26fb70896c79b.xml': 'opinion delivered',
'arkansas/court_opinions/documents/ab59ead0feee789f.xml': 'opinion delivered',
'maryland/court_of_appeals_opinions/documents/9d65b83825eed85f.xml': 'denied',
'michigan/supreme_court_opinions/documents/1287940f26660dfa.xml': 'summary dispositions',
'arkansas/court_opinions/documents/6ae7cc75fc0cd311.xml': 'opinion delivered',
'connecticut/appellate_court_opinions/documents/96cb6396c50954b6.xml': 'officially released',
'connecticut/appellate_court_opinions/documents/4df558796ec8e60c.xml': 'decision released'}
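A quick way to see which tags dominate the "dates but no date_filed" set is to tally the values of the mapping above; tags with high counts ("opinion delivered", "opinion issued", "denied") are the first candidates to triage into FILED_TAGS or an exclusion list. The dict below is a truncated copy of the one above, just to show the shape.

```python
from collections import Counter

# Truncated copy of the mapping collected above: xml path -> date tag
missing_date_filed = {
    "texas/court_opinions/documents/5c8dba31985162bf.xml": "opinion issued",
    "arkansas/court_opinions/documents/ae218d6345f5d320.xml": "opinion delivered",
    "arkansas/court_opinions/documents/96742836d45c4996.xml": "opinion delivered",
    "ohio/court_opinions/documents/52a07db0c124634f.xml":
        "case announcements and administrative actions",
}

# Count how often each unclassified tag appears, most frequent first.
tag_counts = Counter(missing_date_filed.values())
for tag, count in tag_counts.most_common():
    print(f"{count:3d}  {tag}")
```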
update comments and logging messages
…pdate-columbia-importer
I implemented three small tweaks to reduce the number of duplicates:
Besides that, I found that in some cases we can have a match (same filed date, citation, docket number, and court) but the opinion content is largely different. For example, with this file, in CL we have:
but in the xml file we have:
Even if we analyze the lawbox structure to remove metadata, when the opinion content is this different the algorithms that compare the opinions will fail. I already discussed this with @flooie and he mentioned that there is no problem creating a few duplicates; there are already duplicates in the system, so later we will need to implement a specialized command to merge/eliminate them. I used some random data from the columbia merger matches to try to refine the matching process as much as possible to reduce duplicates. It can be tested by cloning these clusters @grossir:
putting these files in cl/assets/media/random_sample_2 and then running the command with this file:
docker-compose -f docker/courtlistener/docker-compose.yml exec cl-django python /opt/courtlistener/manage.py import_columbia --csv /opt/courtlistener/cl/assets/media/random_sample_2.csv --xml-dir /opt/courtlistener/cl/assets/media/random_sample_2
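The failure mode described here (metadata matches but the bodies diverge) is why a content-similarity check has to back up the metadata match. The importer uses the match_based_text function borrowed from the Harvard merger; as a minimal stand-in to illustrate the idea, a stdlib difflib ratio check might look like this (the function name and 0.7 threshold are assumptions for illustration, not the importer's actual values):

```python
from difflib import SequenceMatcher

def opinions_probably_match(text_a: str, text_b: str, threshold: float = 0.7) -> bool:
    """Rough content-similarity check between two opinion texts.

    A ratio near 1.0 means near-identical text; when the bodies diverge
    as much as the lawbox vs. columbia example above, the ratio falls
    well below any reasonable threshold, so a metadata match alone is
    not enough to declare a duplicate.
    """
    ratio = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
    return ratio >= threshold

# Near-identical texts pass; unrelated texts do not.
a = "The judgment of the trial court is affirmed."
b = "The judgment of the trial court is affirmed"
c = "Leave to appeal denied."
print(opinions_probably_match(a, b))  # → True
print(opinions_probably_match(a, c))  # → False
```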
The importer is working, the duplicates issue is gone
As a general comment, I only got 5 new opinions ingested out of the 100 test cases, with most of the others that do have matches left aside for manual review with a message like "Match found with cluster id: 2161907 for columbia file: california/court_of_appeal_opinions/documents/59ca503825f41539.xml". But will it be feasible to manually check 95% of the dataset, if this sample is somewhat representative?
help="If set, will run through the directories and files in random "
"order.",
inner_opinion_tags = all_opinions_soup.find_all()
if inner_opinion_tags and inner_opinion_tags[-1].name == "page_number":
Some opinions have more than 1 <page_number> tag. Why not get rid of all of those tags instead of only the last one?
For example arizona/court_opinions/documents/e574c2908e4b3f3d.xml has 3 tags
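The suggestion above (strip every page_number tag rather than checking only the last child) can be sketched with the stdlib XML parser; the real code works on a BeautifulSoup tree and would call decompose() in the same loop, and the sample document below is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample with several <page_number> tags, like the Arizona file.
xml = """<opinion>
  <p>First page of text.</p>
  <page_number>Page 12</page_number>
  <p>More text.</p>
  <page_number>Page 13</page_number>
  <page_number>Page 14</page_number>
</opinion>"""

root = ET.fromstring(xml)

# Remove every <page_number> element anywhere in the tree, not only a
# trailing one. Snapshot the iterator before mutating the tree.
for parent in list(root.iter()):
    for pn in parent.findall("page_number"):
        parent.remove(pn)

print(ET.tostring(root, encoding="unicode"))
```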
# Conflicts: # cl/corpus_importer/import_columbia/html_test.py
# Conflicts: # cl/corpus_importer/management/commands/import_columbia.py
I have already reviewed the code in detail. In the last changes the number of new cases was reduced, so it is necessary to use another sample to see if the same thing happens; maybe the algorithm to find duplicates needs to be improved. Gianfranco left several comments; some have already been addressed, but others still need work. The most complex part of the code is parsing the xml and merging the floating opinions (those that do not have an author) correctly, since in the old importer some opinions were merged incorrectly (this can cause us not to find the case if the content of the opinion differs greatly). An important part to improve is the list of allowed texts to identify the date type (FILED_TAGS, DECIDED_TAGS, ARGUED_TAGS, etc.), specifically the date of filing; this part was taken from the original command. I need to keep working on this PR to see what can be improved.
This PR contains the updated version of the columbia importer; it contains many changes, like:
Based on some calculations, ~1.2M files have to be imported, judging by the data in local_path in the Opinion model; the number could be lower because some of the cases in this list of files are already imported, but from a different source.
Usage:
Import using a csv file with xml file path pointing to mounted directory and file path
docker-compose -f docker/courtlistener/docker-compose.yml exec cl-django python manage.py import_columbia --csv /opt/courtlistener/cl/assets/media/testfile.csv
CSV example:
Import specifying the mounted directory where the xml files are located
docker-compose -f docker/courtlistener/docker-compose.yml exec cl-django python /opt/courtlistener/manage.py import_columbia --csv /opt/courtlistener/cl/assets/media/files_to_import.csv --xml-dir /opt/courtlistener/columbia_files