Return S3 data links by default when in region #318

Merged
merged 4 commits into from
Oct 23, 2023

Conversation

jrbourbeau
Collaborator

I went to run the code referenced here #316 (comment) and got an error because it turns out granule.data_links(access="direct", in_region=True) returns an HTTPS link. This PR fixes that specific case and also makes granule.data_links(in_region=True) return S3 links by default (which seems like the expected behavior, but @betolink @MattF-NSIDC let me know if you think otherwise).

I'm testing this PR out now to make sure it in fact fixes things.
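
For readers skimming the thread, the precedence described above can be sketched as a small standalone function. This is an illustration only and not the earthaccess implementation: `pick_links`, `s3_links`, and `https_links` are hypothetical names, the defaults (`access=None`, `in_region=False`) are assumptions, and the sketch ignores granules that have no S3 links at all.

```python
from typing import List, Optional


def pick_links(
    s3_links: List[str],
    https_links: List[str],
    access: Optional[str] = None,
    in_region: bool = False,
) -> List[str]:
    """Hypothetical helper mirroring the precedence discussed in this PR."""
    # An explicit `access` value always wins over `in_region`.
    if access == "direct":
        return s3_links
    if access == "external":
        return https_links
    # No explicit `access`: fall back to the region hint.
    return s3_links if in_region else https_links


s3 = ["s3://example-bucket/granule.nc"]          # made-up link
https = ["https://example.com/granule.nc"]       # made-up link

# The case that motivated the PR: both kwargs given, `access` takes priority.
print(pick_links(s3, https, access="direct", in_region=True))
# The new default: in-region callers get S3 links without passing `access`.
print(pick_links(s3, https, in_region=True))
```

Under these assumptions the function satisfies the same combinations that the unit test quoted later in this conversation asserts.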

@github-actions

github-actions bot commented Oct 13, 2023

Binder 👈 Launch a binder notebook on this branch for commit 509e42c

I will automatically update this comment whenever this PR is modified

Binder 👈 Launch a binder notebook on this branch for commit eee3f3d

Binder 👈 Launch a binder notebook on this branch for commit 4db15a7

Binder 👈 Launch a binder notebook on this branch for commit 3bcb243

@jrbourbeau
Collaborator Author

jrbourbeau commented Oct 13, 2023

Okay, this seems to be working as expected. For reference, here's the testing code I'm using:

import os
import tempfile

import coiled
import earthaccess
import xarray as xr

earthaccess.login()
granules = earthaccess.search_data(
    short_name="SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205",
    temporal=("2020", "2022"),
    count=2,
)

# Processing function for each data file
@coiled.function(
    region="us-west-2",
    environ={
        "EARTHDATA_USERNAME": os.environ["EARTHDATA_USERNAME"],
        "EARTHDATA_PASSWORD": os.environ["EARTHDATA_PASSWORD"],
    },
    keepalive="20 minutes",
)
def process(granule):
    results = []
    earthaccess.login()
    with tempfile.TemporaryDirectory() as tmpdir:
        files = earthaccess.download([granule], tmpdir)
        for file in files:
            ds = xr.open_dataset(file)
            ds = ds.sel(Latitude=slice(23, 50), Longitude=slice(270, 330))
            ds = ds.SLA.where((ds.SLA >= 0) & (ds.SLA < 10))
            results.append(ds)
    return xr.concat(results, dim="Time")


# Run processing on all the data granules
chunks = process.map(granules)

# Combine and plot results
ds = xr.concat(chunks, dim="Time")
ds.std("Time").plot(figsize=(14, 6), x="Longitude", y="Latitude").figure.savefig("foo.png")

Comment on lines +12 to +21
assert g.data_links(access="direct")[0].startswith("s3://")
assert g.data_links(access="external")[0].startswith("https://")
# `in_region` specified
assert g.data_links(in_region=True)[0].startswith("s3://")
assert g.data_links(in_region=False)[0].startswith("https://")
# When `access` and `in_region` are both specified, `access` takes priority
assert g.data_links(access="direct", in_region=True)[0].startswith("s3://")
assert g.data_links(access="direct", in_region=False)[0].startswith("s3://")
assert g.data_links(access="external", in_region=True)[0].startswith("https://")
assert g.data_links(access="external", in_region=False)[0].startswith("https://")
Collaborator Author

I think this is the intended behavior we want, but let me know if I'm missing something.

As a side note, I'm not sure why we have separate access and in_region kwargs for determining if we want to use s3 or https urls. Is one kwarg sufficient?

Collaborator

> As a side note, I'm not sure why we have separate access and in_region kwargs for determining if we want to use s3 or https urls. Is one kwarg sufficient?

I can't answer the question directly, but these keywords also feel unintuitive to me. What about access="s3"? To me, "direct" and "external" don't mean anything without more context, but "s3" and "https" do.

Collaborator

I agree that two kwargs is clunky and direct/external aren't the most descriptive names, and we could likely handle it with a single kwarg.

That said, since they are the current interface, we should probably open an issue about refactoring it rather than block this PR.
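
To make the single-keyword idea floated above concrete, here is one shape it might take. This is purely a hypothetical sketch: `links_for` is not an earthaccess function, and the `"s3"` / `"https"` values are only the names suggested in this thread, not an agreed interface.

```python
from typing import List


def links_for(access: str, s3_links: List[str], https_links: List[str]) -> List[str]:
    """Hypothetical single-kwarg selector; not part of earthaccess."""
    choices = {"s3": s3_links, "https": https_links}
    if access not in choices:
        # Fail loudly on unknown values instead of silently picking a default.
        raise ValueError(f"access must be one of {sorted(choices)}, got {access!r}")
    return choices[access]


print(links_for("s3", ["s3://example-bucket/granule.nc"], ["https://example.com/granule.nc"]))
```

Naming the protocol directly removes the need for a second `in_region` flag, though whether region detection should still influence a default is a question for the follow-up issue.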

@@ -325,7 +325,6 @@ def data_links(
             else:
                 # we are not in us-west-2, even cloud collections have HTTPS links
                 return https_links
-        return https_links
Collaborator Author

This is just cosmetic (this line could never be reached, so I decided to remove it).

Collaborator

👍 I tend to go the other way and drop the else statement, but six-of-one...
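
The two styles being compared here behave identically. A minimal illustration, using hypothetical function names and a simplified condition rather than the actual `data_links` body:

```python
def with_else(in_region: bool) -> str:
    # Style kept in this PR: explicit else branch, no trailing return.
    if in_region:
        return "s3"
    else:
        return "https"


def early_return(in_region: bool) -> str:
    # Alternative style: drop the else and let the final return act as the fallback.
    if in_region:
        return "s3"
    return "https"


for flag in (True, False):
    print(flag, with_else(flag), early_return(flag))
```

Either way, a `return` placed after a complete if/else in which both branches return is unreachable, which is why removing it changes nothing.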

@jrbourbeau
Collaborator Author

@betolink @MattF-NSIDC do either of you have bandwidth to take a quick look at this PR? Totally fine if not -- I'm also okay just merging this (I think the actual changes here should be pretty uncontroversial) and owning any follow-up work if there is any.

@jrbourbeau jrbourbeau mentioned this pull request Oct 20, 2023
@mfisher87
Collaborator

mfisher87 commented Oct 20, 2023

Hey James, I've been refraining from speaking on your questions about intended behavior because I don't know :)

The change itself looks great, I love the addition of a unit test for this aspect of the interface. I'm 100% on board with merging and dealing with any unexpected results as they come! 🚀

@jrbourbeau
Collaborator Author

Sounds good, thanks @mfisher87!

It turns out that while I can approve / merge PRs, I don't have sufficient permissions to override the "Review required" check on GitHub. Would you (or someone else) mind approving?

@jrbourbeau
Collaborator Author

Thanks @mfisher87 @jhkennedy!

@jrbourbeau jrbourbeau merged commit cfc61a9 into nsidc:main Oct 23, 2023
7 checks passed
@jrbourbeau jrbourbeau deleted the data-links-fixup branch October 23, 2023 15:33
@jrbourbeau
Collaborator Author

@jhkennedy see #327 for the follow up issue on access / in_region kwargs

@mfisher87
Collaborator

> It turns out that while I can approve / merge PRs, I don't have sufficient permissions to override the "Review required" check on GitHub.

Shoot, let's fix that! My work laptop is put away, and I similarly don't have those permissions on this account, and also can't change branch protection rules (I'll do it on Wednesday if not resolved sooner).

We could make James and other trusted maintainers admins of the repo (I think this is the best way forward so maintainers can also maintain repo settings), or we could remove the branch protection rule or add James to the list of people that can bypass it. @betolink @jrbourbeau what do you think?

@MattF-NSIDC

@jrbourbeau you can bypass the protection rule now if you feel it's needed!
