Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lock cache file with python #168

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft

Lock cache file with python #168

wants to merge 1 commit into from

Conversation

kairstenfay
Copy link
Contributor

@kairstenfay kairstenfay commented Sep 17, 2020

Lock the cache file in python rather than relying on the user knowing to lock at runtime with shell's flock, etc.

  • Update backoffice cron jobs accordingly

Open questions:
How do we want to handle cache collision?
Right now if I run two REDCap DET ETL processes operating on the same cache, I sometimes get

_pickle.UnpicklingError: pickle data was truncated

Perhaps this means my implementation isn't working as expected?
Which command(s) do we want to use when creating a lock?

@kairstenfay kairstenfay requested a review from tsibley September 17, 2020 20:58
@@ -31,10 +31,10 @@
from smartystreets_python_sdk.us_street import Lookup
from smartystreets_python_sdk.us_extract import Lookup as ExtractLookup
from id3c.cli import cli
from id3c.cli.command import pickled_cache
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This edit can be dropped from this commit.

@kairstenfay
Copy link
Contributor Author

@kairstenfay kairstenfay changed the title Redcap cache Lock cache file with python Nov 17, 2020
@@ -125,7 +126,9 @@ def pickled_cache(filename: str = None) -> TTLCache:
LOG.info(f"Loading cache from «{filename}»")
try:
with open(filename, "rb") as file:
lock_file(file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment on the implementation of lock_file() yet, but I think the usage here isn't doing what you expect because it's not being called as a context manager. It should be called as:

with lock_file(file):
    …

@tsibley
Copy link
Member

tsibley commented Dec 21, 2020

I think ultimately we may want to switch the geocoding caching from our in-house solution to diskcache, which I used for the Husky Musher caching. When we first added the caching, I didn't know diskcache existed and, unlike cachetools, didn't uncover it during due diligence.

@joverlee521 asked in Slack:

What's the difference between the FanoutCache used for Musher and the TTLCache used for geocoding?

and I thought it'd be useful to copy my answer here as well:

There's a couple differences:

  1. TTLCache comes from cachetools, which implements lower-level cache primitives than FanoutCache from diskcache.

  2. The TTLCache we use is entirely in-memory during execution. On top of that, we layer our own serialization/deserialization to load/save the cache. In contrast, the FanoutCache (diskcache.Cache subclass) is transactionally-safe and is loaded/saved per access. FanoutCache is a further-specialized subclass of diskcache.Cache to avoid waiting on blocking writes.

  3. TTLCache requires a TTL (expiry). Expirations are optional with FanoutCache, and we don't use them.

I found diskcache last night when looking not to reinvent the wheel. It seems like a very nice package that's robust and well-thought out. We could replace our TTLCache usage with it to make the geocoding caching a lot simpler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants