feat(search): Rips out SOLR indexing pipeline #4791

Merged
merged 70 commits into from
Dec 11, 2024
Changes from 65 commits
Commits
b4c1cae
feat(scrapers): Removes Solr indexing for cloned items
ERosendo Dec 5, 2024
cb1b45d
feat(search): Removes command for indexing data to Solr
ERosendo Dec 5, 2024
563e15a
feat(admin): Remove custom Solr indexing methods in admin classes
ERosendo Dec 5, 2024
5e55798
feat(models): Removes Solr indexing from model classes
ERosendo Dec 5, 2024
3aa77f8
feat(corpus_importer): Removes Solr indexing logic from commands
ERosendo Dec 5, 2024
b210e93
feat(corpus_importer): Removes Solr indexing logic from utils
ERosendo Dec 5, 2024
6a8e2c9
feat(recap): Removes Solr indexing logic from commands
ERosendo Dec 5, 2024
61c9213
feat(recap): Removes Solr indexing logic from mergers
ERosendo Dec 5, 2024
29d5b44
feat(tasks): Remove Solr indexing logic from task execution
ERosendo Dec 5, 2024
7f9664b
feat(stats): Removes Solr health check
ERosendo Dec 5, 2024
df52d49
Merge branch 'main' into 4726-feat-rip-out-solr-indexing-pipeline
ERosendo Dec 9, 2024
2fe3193
feat(settings): Clean up SOLR env vars
ERosendo Dec 9, 2024
33f4181
build(deps): Removes scorched
ERosendo Dec 9, 2024
34566db
feat(corpus_importer): Removes unused import
ERosendo Dec 9, 2024
abfc40b
feat(recap): Remove add_to_solr argument from save_iquery_to_docket
ERosendo Dec 9, 2024
31f73fc
feat(audio): Simplifies model by removing custom save logic
ERosendo Dec 9, 2024
37baaf4
feat(audio): Remove as_search_dict method
ERosendo Dec 9, 2024
fb7f459
feat(people_db): Remove as_search_dict method
ERosendo Dec 9, 2024
8b68c39
feat(search): Removes as_search_dict method
ERosendo Dec 9, 2024
5f859b4
feat(lib): Clean up module by removing scorched utils
ERosendo Dec 9, 2024
559d657
feat(citations): Removes command to add parallel citations
ERosendo Dec 9, 2024
c3e2802
feat(people_db): Removes command to use FTM API
ERosendo Dec 10, 2024
3105545
feat(lib): Removes helpers related to SOLR administration
ERosendo Dec 10, 2024
8feb6c3
feat(lib): Remove unused normalize_search_dicts helper
ERosendo Dec 10, 2024
92b87c8
refactor(recap_rss): Simplify merge_rss_feed_contents
ERosendo Dec 10, 2024
7251526
feat(alerts): Simplify send_alerts_and_webhooks
ERosendo Dec 10, 2024
f782e78
feat(citations): Clean up Scorched imports
ERosendo Dec 10, 2024
7189f87
feat(citations): Remove unused SOLR date range helper
ERosendo Dec 10, 2024
2923445
feat(citations): Remove SOLR dependency from count_citations command
ERosendo Dec 10, 2024
117e532
feat(citations): Removes SOLR code from find_citations command
ERosendo Dec 10, 2024
fffdbb4
feat(corpus_importer): Simplify recap_document_into_opinions helper
ERosendo Dec 10, 2024
52df1b4
feat(corpus_import): Simplify anon_2020_import command
ERosendo Dec 10, 2024
45a1567
feat(corpus_importer): Simplify harvard_opinions command
ERosendo Dec 10, 2024
834d487
feat(corpus_importer): Tweaks recap_into_opinions command
ERosendo Dec 10, 2024
0736a8c
docs(corpus_importer): Improve get_docket_and_claims docstring
ERosendo Dec 10, 2024
f1930a0
feat(corpus_importer): Simplifies scrape_pacer_free_opinions
ERosendo Dec 10, 2024
7193269
feat(citations): Simplifies citation storage and parenthetical update
ERosendo Dec 10, 2024
b8ad666
feat(Opinion): Remove index argument from save method
ERosendo Dec 10, 2024
6f8b5fa
docs(api): Updates helper method docstring
ERosendo Dec 10, 2024
ee956ae
docs(search): Tweaks helper function comments
ERosendo Dec 10, 2024
6d2816b
docs(citations): Updates make_name_param comment
ERosendo Dec 10, 2024
4b698f3
feat(scrapers): Refines extract_doc_content method
ERosendo Dec 10, 2024
9cffcbb
docs(recap): Refines helper methods docstring
ERosendo Dec 10, 2024
df1a826
docs(lib): Updates add_depth_counts docstring
ERosendo Dec 10, 2024
d529104
feat(OpinionCluster): Remove indexing-related arguments from save method
ERosendo Dec 10, 2024
4d7c603
feat(people_db): Removes custom actions from PersonAdmin class
ERosendo Dec 10, 2024
eafd295
refactor(recap_rss): Remove index argument from reprocess_item
ERosendo Dec 10, 2024
0e76971
refactor(search): Remove scorched import
ERosendo Dec 10, 2024
4e7ad07
Refactor(lib): Rename helper function and improve docstring
ERosendo Dec 10, 2024
b4ef223
fix(audio): Removes index argument from postgeneration hook
ERosendo Dec 10, 2024
0126936
refactor(scraper): Tweaks save_everything helper in scrape_opinions
ERosendo Dec 10, 2024
ec065ee
feat(OpinionCluster): Removes index argument from async save
ERosendo Dec 10, 2024
ebcbf6c
refactor(scraper): Remove index argument from save method
ERosendo Dec 10, 2024
d569ed8
docs(alerts): Improve docstring for remove_stale_rt_items
ERosendo Dec 10, 2024
09a18de
fix(citations): Adjust find_citations command to match signature change
ERosendo Dec 10, 2024
2838d78
tests(search): Remove test class for update index command
ERosendo Dec 10, 2024
9a365dd
test(citations): Remove test class for parallel citation logic
ERosendo Dec 10, 2024
293727a
test(lib): Remove base test classes for SOLR
ERosendo Dec 10, 2024
88fd880
tests(recap): Remove old SOLR mocks
ERosendo Dec 10, 2024
05d233d
docs(alerts): Tweaks comment in send alert test
ERosendo Dec 10, 2024
1bd09b4
feat(corpus_importer): Removes import_columbia command
ERosendo Dec 10, 2024
b80f076
Refactor(search): Removes unused imports
ERosendo Dec 10, 2024
76e37f1
refactor(audio): Removes unused imports
ERosendo Dec 10, 2024
f0b2af7
refactor(people_db): Removes unused imports
ERosendo Dec 10, 2024
053307a
refactor(citations): Removes unused imports
ERosendo Dec 10, 2024
aacd061
feat(search): Removes SOLR templates to index records
ERosendo Dec 11, 2024
9e0e645
docs(citations): Refines comment in store_opinion_citations_and_updat…
ERosendo Dec 11, 2024
5530be0
feat(stats): Adds ES health check
ERosendo Dec 11, 2024
3a0c710
Merge branch 'main' into 4726-feat-rip-out-solr-indexing-pipeline
ERosendo Dec 11, 2024
cef871b
Merge branch 'main' into 4726-feat-rip-out-solr-indexing-pipeline
mlissner Dec 11, 2024
2 changes: 0 additions & 2 deletions cl/alerts/management/commands/cl_send_alerts.py
@@ -295,8 +295,6 @@ def clean_rt_queue(self):
     def remove_stale_rt_items(self, age=2):
         """Remove anything old from the RTQ.

-        This helps avoid issues with solr hitting the maxboolean clause errors.
-
         :param age: How many days old should items be before we start deleting
         them?
         """
21 changes: 7 additions & 14 deletions cl/alerts/tasks.py
@@ -373,27 +373,20 @@ def send_alert_and_webhook(


 @app.task(ignore_result=True)
-def send_alerts_and_webhooks(
-    data: Dict[str, Union[List[Tuple], List[int]]]
-) -> List[int]:
+def send_alerts_and_webhooks(data: list[tuple[int, datetime]]) -> List[int]:
     """Send many docket alerts at one time without making numerous calls
     to the send_alert_and_webhook function.

-    :param data: A dict with up to two keys:
+    :param data: A list of tuples. Each tuple contains the docket ID, and
+        a time. The time indicates that alerts should be sent for
+        items *after* that point.

-      d_pks_to_alert: A list of tuples. Each tuple contains the docket ID, and
-                      a time. The time indicates that alerts should be sent for
-                      items *after* that point.
-        rds_for_solr: A list of RECAPDocument ids that need to be sent to Solr
-                      to be made searchable.
-    :returns: Simply passes through the rds_for_solr list, in case it is
-    consumed by the next task. If rds_for_solr is not provided, returns an
-    empty list.
+    :returns: An empty list
     """
-    for args in data["d_pks_to_alert"]:
+    for args in data:
         send_alert_and_webhook(*args)

-    return cast(List[int], data.get("rds_for_solr", []))
+    return []


 @app.task(ignore_result=True)
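For callers of `send_alerts_and_webhooks`, the signature change in this file means the payload shrinks from a dict to the bare list of (docket ID, time) tuples. The following is only a rough sketch of the before and after payload shapes. It uses a stand-in `send_alert_and_webhook` that records its arguments instead of sending anything; the `sent` list, the parameter names `d_pk` and `since`, and the sample IDs are assumptions made for illustration and do not come from the PR.

```python
from datetime import datetime, timezone

# Stand-in for the real task: records calls instead of sending alerts.
# The parameter names here are hypothetical.
sent: list[tuple[int, datetime]] = []


def send_alert_and_webhook(d_pk: int, since: datetime) -> None:
    sent.append((d_pk, since))


def send_alerts_and_webhooks(data: list[tuple[int, datetime]]) -> list[int]:
    """Mirror of the new signature in the diff: a plain list of tuples."""
    for args in data:
        send_alert_and_webhook(*args)
    return []


since = datetime(2024, 12, 11, tzinfo=timezone.utc)

# Old payload shape (before this PR): a dict with up to two keys.
old_payload = {"d_pks_to_alert": [(1, since), (2, since)], "rds_for_solr": [10]}

# New payload shape: only the list that used to live under "d_pks_to_alert".
new_payload = old_payload["d_pks_to_alert"]

result = send_alerts_and_webhooks(new_payload)
```

Any call site that still builds the old dict would need to pass only its `d_pks_to_alert` value; the `rds_for_solr` half has nothing left to feed.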
2 changes: 1 addition & 1 deletion cl/alerts/tests/tests.py
@@ -1164,7 +1164,7 @@ def test_send_search_alert_webhooks_rates(self):
         ):
             # Monthly alerts cannot be run on the 29th, 30th or 31st.
             with time_machine.travel(self.mock_date, tick=False):
-                # Send Solr Alerts (Except OA)
+                # Send Alerts (Except OA)
                 call_command("cl_send_alerts", rate=rate)
                 # Send ES Alerts (Only OA for now)
                 call_command("cl_send_scheduled_alerts", rate=rate)
4 changes: 2 additions & 2 deletions cl/api/tasks.py
@@ -93,7 +93,7 @@ def send_es_search_alert_webhook(
     """Send a search alert webhook event containing search results from a
     search alert object.

-    :param results: The search results returned by SOLR for this alert.
+    :param results: The search results returned for this alert.
     :param webhook_pk: The webhook endpoint ID object to send the event to.
     :param alert: The search alert object.
     """
@@ -134,7 +134,7 @@ def send_search_alert_webhook_es(
     """Send a search alert webhook event containing search results from a
     search alert object.

-    :param results: The search results returned by SOLR for this alert.
+    :param results: The search results returned for this alert.
     :param webhook_pk: The webhook endpoint ID object to send the event to.
     :param alert_pk: The search alert ID.
     """
2 changes: 1 addition & 1 deletion cl/api/webhooks.py
@@ -166,7 +166,7 @@ def send_search_alert_webhook(
     """Send a search alert webhook event containing search results from a
     search alert object.

-    :param results: The search results returned by SOLR for this alert.
+    :param results: The search results returned for this alert.
     :param webhook: The webhook endpoint object to send the event to.
     :param alert: The search alert object.
     """
10 changes: 0 additions & 10 deletions cl/audio/factories.py
@@ -16,15 +16,6 @@ class Meta:
     sha1 = Faker("sha1")
     download_url = Faker("url")

-    @classmethod
-    def _create(cls, model_class, *args, **kwargs):
-        """Creates an instance of the model class without indexing."""
-        obj = model_class(*args, **kwargs)
-        # explicitly sets `index=False` to prevent it from being indexed in SOLR.
-        # Once Solr is removed, we can just remove this method completely.
-        obj.save(index=False)
-        return obj
-
     """
     These hooks are necessary to make this factory compatible with the
     `make_dev_command`. by delegating the file creation to the hooks, we prevent
@@ -60,7 +51,6 @@ def _after_postgeneration(cls, instance, create, results=None):
         if create and results:
             # Some post-generation hooks ran, and may have modified the instance.
             instance.save(
-                index=False,
                 update_fields=["local_path_mp3", "local_path_original_file"],
             )

105 changes: 1 addition & 104 deletions cl/audio/models.py
@@ -1,22 +1,11 @@
-from typing import Dict, List, Union
-
 import pghistory
 from django.db import models
-from django.template import loader
-from django.urls import NoReverseMatch, reverse
+from django.urls import reverse
 from model_utils import FieldTracker

-from cl.custom_filters.templatetags.text_filters import best_case_name
-from cl.lib.date_time import midnight_pt
 from cl.lib.model_helpers import make_upload_path
 from cl.lib.models import AbstractDateTimeModel, s3_warning_note
-from cl.lib.search_index_utils import (
-    InvalidDocumentError,
-    normalize_search_dicts,
-    null_map,
-)
 from cl.lib.storage import IncrementingAWSMediaStorage
-from cl.lib.utils import deepgetattr
 from cl.people_db.models import Person
 from cl.search.models import SOURCES, Docket

@@ -196,98 +185,6 @@ def __str__(self) -> str:
     def get_absolute_url(self) -> str:
         return reverse("view_audio_file", args=[self.pk, self.docket.slug])

-    def save(  # type: ignore[override]
-        self,
-        index: bool = True,
-        force_commit: bool = False,
-        *args: List,
-        **kwargs: Dict,
-    ) -> None:
-        """
-        Overrides the normal save method, but provides integration with the
-        bulk files and with Solr indexing.
-
-        :param index: Should the item be added to the Solr index?
-        :param force_commit: Should a commit be performed in solr after
-        indexing it?
-        """
-        super().save(*args, **kwargs)  # type: ignore
-        if index:
-            from cl.search.tasks import add_items_to_solr
-
-            add_items_to_solr([self.pk], "audio.Audio", force_commit)
-
-    def delete(  # type: ignore[override]
-        self,
-        *args: List,
-        **kwargs: Dict,
-    ) -> None:
-        """
-        Update the index as items are deleted.
-        """
-        id_cache = self.pk
-        super().delete(*args, **kwargs)  # type: ignore
-        from cl.search.tasks import delete_items
-
-        delete_items.delay([id_cache], "audio.Audio")
-
-    def as_search_dict(self) -> Dict[str, Union[int, List[int], str]]:
-        """Create a dict that can be ingested by Solr"""
-        # IDs
-        out = {
-            "id": self.pk,
-            "docket_id": self.docket_id,
-            "court_id": self.docket.court_id,
-        }
-
-        # Docket
-        docket = {"docketNumber": self.docket.docket_number}
-        if self.docket.date_argued is not None:
-            docket["dateArgued"] = midnight_pt(self.docket.date_argued)
-        if self.docket.date_reargued is not None:
-            docket["dateReargued"] = midnight_pt(self.docket.date_reargued)
-        if self.docket.date_reargument_denied is not None:
-            docket["dateReargumentDenied"] = midnight_pt(
-                self.docket.date_reargument_denied
-            )
-        out.update(docket)
-
-        # Court
-        out.update(
-            {
-                "court": self.docket.court.full_name,
-                "court_citation_string": self.docket.court.citation_string,
-                "court_exact": self.docket.court_id,  # For faceting
-            }
-        )
-
-        # Audio File
-        out.update(
-            {
-                "caseName": best_case_name(self),
-                "panel_ids": [judge.pk for judge in self.panel.all()],
-                "judge": self.judges,
-                "file_size_mp3": deepgetattr(
-                    self, "local_path_mp3.size", None
-                ),
-                "duration": self.duration,
-                "source": self.source,
-                "download_url": self.download_url,
-                "local_path": deepgetattr(self, "local_path_mp3.name", None),
-            }
-        )
-        try:
-            out["absolute_url"] = self.get_absolute_url()
-        except NoReverseMatch:
-            raise InvalidDocumentError(
-                f"Unable to save to index due to missing absolute_url: {self.pk}"
-            )
-
-        text_template = loader.get_template("indexes/audio_text.txt")
-        out["text"] = text_template.render({"item": self}).translate(null_map)
-
-        return normalize_search_dicts(out)
-

 @pghistory.track(
     pghistory.InsertEvent(), pghistory.DeleteEvent(), obj_field=None
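One practical consequence of dropping the custom `save()` override in cl/audio/models.py: a call site that still passes `index=False` should fail, since the inherited save method has no such keyword. Below is a rough sketch using plain Python stand-ins rather than Django models. `BaseModel`, `OldAudio`, and `NewAudio` are hypothetical names, and `BaseModel.save` only imitates a save method that takes no `index` argument; the counters are placeholders for real persistence and indexing.

```python
from typing import Optional


class BaseModel:
    """Stand-in for a base model whose save() has no `index` parameter."""

    def __init__(self) -> None:
        self.saved = 0

    def save(self, update_fields: Optional[list] = None) -> None:
        self.saved += 1


class OldAudio(BaseModel):
    """Pre-PR shape: save() accepted `index` and triggered indexing."""

    def __init__(self) -> None:
        super().__init__()
        self.indexed = 0

    def save(self, index: bool = True, **kwargs) -> None:
        super().save(**kwargs)
        if index:
            self.indexed += 1  # placeholder for the removed Solr call


class NewAudio(BaseModel):
    """Post-PR shape: no override, so the base save() is used directly."""


old = OldAudio()
old.save(index=False)  # accepted before the PR, skips indexing

new = NewAudio()
new.save(update_fields=["local_path_mp3"])  # still fine after the PR

try:
    new.save(index=False)  # stale call site that was not updated
    stale_call_failed = False
except TypeError:
    stale_call_failed = True
```

This is consistent with the same PR removing `index=False` from the `_after_postgeneration` hook and deleting the `_create` override in cl/audio/factories.py.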