Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

conformed cleaner to new parser json structure + unit test updates #163

Merged
merged 1 commit into from
Nov 9, 2024

Conversation

nicolassaw
Copy link
Contributor

After the new updates to the parser and cleaner separately, I conformed the cleaner to properly access fields in the new structure of json coming from the parser module.

Also, I fixed some unit tests to respond to the new structure differences. I also consolidated test and mock files and fixed mapping issues with some unit tests. Also commented out certain unit tests if they were too long or needed configuration.

@nicolassaw nicolassaw requested review from Matt343 and newswim November 2, 2024 21:43
"Case Metadata": {
"county": "hays"
},
"Defendant Information": {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 we should use snake_case on these keys.

if isinstance(data, dict):
# Remove 'judicial officer' if it exists in this dictionary
if "judicial officer" in data:
del data["judicial officer"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 we should add a doc comment for this method. I can't recall why we need to delete the JO information.

"parsing_date": dt.datetime.today().strftime("%Y-%m-%d"),
"html_hash": input_dict["html_hash"],
"Case Metadata": {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 just a reminder here to update the casing of these keys.

"Charge Information": [],
"Case Details": {
"earliest_charge_date": "",
"has_evidence_of_representation": False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should just omit case details and charge information from this initializing record, then append the values later. I wonder if there's a chance to get accidental false negatives here.

elif SECTION == "disposition":
print(f'printing row: {row}')
if len(row) >= 2:
if row[1] in ["Disposition", "Disposition:", "Amended Disposition"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the colon be there in "Disposition:"?

@@ -8,11 +8,16 @@
import tempfile
from bs4 import BeautifulSoup

sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..','..')))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was curious about these lines and ran some questions through Claude.ai -- it suggested we use pythonpath setting in pyproject.toml to accomplish the same thing (modifying the way Python resolves module paths).

It seems like there are also a few other ways (like using setup.py and a config file for our test suite).

I assume this is fine for now, but we should try to figure out the discrepancies between how these modules are getting resolved, because right now it seems to be OS-dependent.


@patch("os.listdir", return_value=["case1.json", "case2.json"])
def test_process_single_case(self):
county = "hays"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good that we were able to avoid the @patch-ing happening for this test 👍


def get_database_container(self):
#This loads the environment for interacting with CosmosDB #Dan: Should this be moved to the .env file?
load_dotenv()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, just looking at the docs and it seems like you may want to use dotenv_values(".env") here instead of loading the env, just to make this a little more isolated.

https://pypi.org/project/python-dotenv/#:~:text=Other%20Use%20Cases-,Load%20configuration%20without%20altering%20the%20environment,-The%20function%20dotenv_values

So the following lines could read..

env = dotenv_values(".env") # I assume `env` isn't a reserved word

URL = env["URL"]
URL = env("URL")
KEY = env("KEY")
DATA_BASE_NAME = env("DATA_BASE_NAME")
CONTAINER_NAME_CLEANED = env("CONTAINER_NAME_CLEANED")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also just move load_dotenv() to the top of this module, I think.

@nicolassaw
Copy link
Contributor Author

I agree across the board on these updates. Thanks for reading through and giving comments. I'm going to leave a note to come back to this PR and make tickets for these.

@nicolassaw nicolassaw merged commit cc08b6a into main Nov 9, 2024
2 checks passed
@nicolassaw nicolassaw deleted the fix-cleaner-to-parser-mismatch branch November 9, 2024 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants