
Overly restrictive S3 limitations #107

Open
benjwadams opened this issue Aug 23, 2023 · 3 comments

@benjwadams

I've recently been using ERDDAP version 2.23, via Axiom's Docker images, to load HF Radar files whose data sections are stored as CSV in S3.
The bucket is here and is currently publicly readable:

https://hfradar.s3.amazonaws.com/

I've run into several surprising behaviors.

  1. The ERDDAP docs do say to use a URL with a particular availability zone, e.g. https://hfradar.s3.us-east-1.amazonaws.com/SUBPATH/ instead of https://hfradar.s3.amazonaws.com/SUBPATH/. However, the zone-less form is usually what you get when copying from the AWS console, and it also supports multiple availability zones. As far as I can determine, both forms serve the same content except during downtime or error conditions, but the zone-less form has the advantage of multi-AZ support. ERDDAP requires the zoned form; if you don't provide it, you get an error saying that no files were found. The docs do say to use the form with the zone specified, but this is confusing behavior given that the content will be identical in well over 99.9% of cases. I don't understand why the zone-less form can't be supported, especially since its multi-AZ support is likely to make it more robust in a production environment.

  2. Even though the bucket and its contents are publicly readable, ERDDAP refuses to read the contents of the S3 bucket unless you pass along credentials. Why is this step necessary for something that is already readable without supplying AWS credentials?
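To make the distinction in point 1 concrete: the only difference between the two URL forms is whether a zone appears in the hostname. A minimal sketch (my own illustration, not ERDDAP's actual parsing code) of a parser that accepts both forms and simply reports the zone as absent when the zone-less endpoint is used:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class S3UrlParse {
    // Matches https://BUCKET.s3.amazonaws.com/...        (zone-less endpoint)
    // and     https://BUCKET.s3.ZONE.amazonaws.com/...   (zoned endpoint)
    private static final Pattern S3_URL = Pattern.compile(
        "https://([^.]+)\\.s3(?:\\.([a-z0-9-]+))?\\.amazonaws\\.com(/.*)?");

    /** Returns "bucket=..., region=..." or null if the URL is not an S3 URL. */
    static String parse(String url) {
        Matcher m = S3_URL.matcher(url);
        if (!m.matches()) return null;
        String region = m.group(2) == null ? "(none)" : m.group(2);
        return "bucket=" + m.group(1) + ", region=" + region;
    }

    public static void main(String[] args) {
        System.out.println(parse("https://hfradar.s3.us-east-1.amazonaws.com/"));
        System.out.println(parse("https://hfradar.s3.amazonaws.com/"));
    }
}
```

A tool that parses URLs this way could fall back to a default zone (or a lookup) when the group is absent, rather than reporting that no files were found.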


benjwadams commented Aug 23, 2023

Confirmation that both HTTPS URL forms yield the same result:

```shell
[badams@localhost]~% curl -s https://hfradar.s3.us-east-1.amazonaws.com/ | md5sum
e3d288cb10af4d93a3509f3776a45dec  -
[badams@localhost]~% curl -s https://hfradar.s3.amazonaws.com/ | md5sum
e3d288cb10af4d93a3509f3776a45dec  -
```

@BobSimons
Collaborator

  1. When I updated to v2 of the AWS Java SDK, the only way I could get it to work was to specify the zone, so I chose the form of the URL that includes the zone. (Note that when just using the URL, e.g. in a browser, I didn't need to specify the zone; the SDK works differently.) Also note that ERDDAP doesn't simply use this URL: it parses it and uses the pieces, including the zone. I didn't know about the multi-AZ option. Perhaps there is a way to support this with the SDK; that would be up to Chris to pursue if he chooses to.

Assuming it is even possible to use the SDK without specifying the zone, not specifying it when multiple zones are valid opens up potential problems with frequently changing datasets, where one zone has a given new file and another doesn't yet. That will cause annoying errors for users, and people will think ERDDAP is flaky and complain (or just grumble to themselves).

There is also a potential delay when accessing files without specifying the zone, since AWS has to determine which zone to work with. Delays that involve signals travelling long distances can be significant, and presumably the delay would occur each time the dataset or a file in it is accessed. The huge underlying problem with S3 is latency, and this just exacerbates it.

Specifying which zone to use is more straightforward and efficient. Even if not specifying the zone becomes an option, I would encourage admins to continue to specify the zone.

  2. Same basic issue: with v2 of the SDK, I was only able to get the system to work by including credentials. It may be (it seems likely) that simply using the SDK, even for public buckets, requires credentials; I don't know. Again, Chris can choose to pursue this if he wants.
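For what it's worth, the v2 SDK does appear to support unsigned requests via its AnonymousCredentialsProvider. A hedged sketch of listing a public bucket without credentials (untested here; it requires the `software.amazon.awssdk:s3` dependency on the classpath, and whether ERDDAP's code paths can use this is a separate question):

```java
import software.amazon.awssdk.auth.credentials.AnonymousCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

public class AnonymousS3List {
    public static void main(String[] args) {
        // With anonymous credentials, requests go out unsigned, which only
        // works for buckets/objects that are publicly readable.
        try (S3Client s3 = S3Client.builder()
                .region(Region.US_EAST_1)  // the SDK still requires a region
                .credentialsProvider(AnonymousCredentialsProvider.create())
                .build()) {
            s3.listObjectsV2Paginator(ListObjectsV2Request.builder()
                        .bucket("hfradar")
                        .maxKeys(5)
                        .build())
              .contents()
              .forEach(o -> System.out.println(o.key()));
        }
    }
}
```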

Both times above I said Chris can choose to pursue this if he wants, but you can also pursue it: search the AWS Java SDK v2 docs for information about these issues and post your findings here. The documentation may also include recommendations about best practices.

Best wishes.

@benjwadams
Author

  1. Looks like buckets are region-specific under the hood, so I guess this is a reasonable limitation.
