Upload recording feature #787
Conversation
Thank you for putting this together @KIRA009 ! Just left a few small comments, happy to chat about any of it if you like! 🙏 😄
openadapt/utils.py
Outdated
```python
with open(file_path, "rb") as file:
    files = {"file": (filename, file)}
    resp = requests.put(upload_url, files=files)
resp.raise_for_status()
```
What do you think about returning the response here?
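Returning the response would let callers inspect the upload result. A minimal sketch of that suggestion, assuming a helper named `upload_file` (the name and signature are illustrative, not the PR's actual code):

```python
import os

import requests


def upload_file(upload_url: str, file_path: str) -> requests.Response:
    """Upload the file at file_path to a presigned URL and return the response."""
    filename = os.path.basename(file_path)
    with open(file_path, "rb") as file:
        files = {"file": (filename, file)}
        resp = requests.put(upload_url, files=files)
    resp.raise_for_status()
    return resp
```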
```
@@ -0,0 +1,244 @@
# Created by https://www.gitignore.io/api/osx,linux,python,windows,pycharm,visualstudiocode
```
Interesting, can you please clarify why this is necessary / preferable to a minimal `.gitignore`? What did you need to ignore here? Why not keep it in the root `.gitignore`?
This is autogenerated from the template. You are right though, this isn't needed.
scripts/recording_uploader/README.md
Outdated
```markdown
## Deploy the application

There is a `deploy` script that creates the s3 bucket and deploys the application using the SAM CLI (included as part of the dev dependencies of this project). The bucket name is hardcoded in the script. The SAM CLI is set up to run in `guided` mode, which will prompt the user every time befor deploying, in case the user wants to change the default values.
```
Typo: `befor` -> `before`
Can we make the bucket name configurable, along with the aws region etc.?
```
@@ -0,0 +1 @@
"""Init file for the recording_uploader package."""
```
If this is a package, should we move it outside of `scripts`?
It should be named as a module instead 😅
```python
def get_presigned_url() -> dict:
    """Generate a presigned URL for uploading a recording to S3."""
    bucket = "openadapt"
    region_name = "us-east-1"
```
What do you think about putting these in `NAMED_CONSTANTS` at the top of this file, and passing them in as default keyword arguments to `get_presigned_url`?
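Under that suggestion, the hardcoded values would become module-level constants used as default keyword arguments. A sketch, with hypothetical constant names and the boto3 call elided for brevity:

```python
# Hypothetical constant names; the PR currently hardcodes these inside the function.
DEFAULT_BUCKET = "openadapt"
DEFAULT_REGION_NAME = "us-east-1"


def get_presigned_url(
    bucket: str = DEFAULT_BUCKET,
    region_name: str = DEFAULT_REGION_NAME,
) -> dict:
    """Generate a presigned URL for uploading a recording to S3."""
    # The boto3 client creation and generate_presigned_url call would go here;
    # returning the inputs keeps this sketch self-contained.
    return {"bucket": bucket, "region_name": region_name}
```

Callers keep the zero-argument form, while tests and self-hosters can override per call.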
scripts/recording_uploader/deploy.py
Outdated
```python
if guided:
    commands.append("--guided")
subprocess.run(commands, cwd=CURRENT_DIR, check=True)
print("Lambda function deployed successfully.")
```
What do you think about using `logger` here?
scripts/recording_uploader/deploy.py
Outdated
```python
        Bucket=bucket,
    )
except (s3.exceptions.BucketAlreadyExists, s3.exceptions.BucketAlreadyOwnedByYou):
    proceed = input(f"Bucket '{bucket}' already exists. Proceed? [y/N] ")
```
Is this necessary? What happens if the user proceeds if the bucket has already been created?
I thought this would be good to have in case the user doesn't want to overwrite existing buckets, but I realise that is quite a niche case
Will this remove existing data?
It's unlikely that it will, given we are generating random filenames, so yes, maybe we can remove this part.
I have removed this
```python
        "Bucket": bucket,
        "Key": key,
    },
    ExpiresIn=3600,
```
Can you please make this a named constant, e.g. `ONE_HOUR_IN_SECONDS = 60 * 60`?
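That is, something like the following, with the constant name taken from the review comment and the call site shown only as a comment:

```python
ONE_HOUR_IN_SECONDS = 60 * 60  # expiry for the presigned upload URL

# ...then in the generate_presigned_url(...) call:
#     ExpiresIn=ONE_HOUR_IN_SECONDS,
```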
```python
    region_name=region_name,
    endpoint_url=f"https://s3.{region_name}.amazonaws.com",
)
key = f"recordings/{uuid4()}.zip"
```
What do you think about adding the user's unique id to the path, e.g. `recordings/{user_id}/{upload_id}.zip`?
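A sketch of that user-scoped key; the helper name and the `user_id` parameter are hypothetical, and only the key format comes from the review comment:

```python
from uuid import uuid4


def build_recording_key(user_id: str) -> str:
    """Build an S3 key scoped by user, e.g. recordings/<user_id>/<upload_id>.zip."""
    upload_id = uuid4()
    return f"recordings/{user_id}/{upload_id}.zip"
```

Scoping by user id makes it easy to list or delete one user's uploads later without scanning the whole bucket.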
Makes sense
Once we start scaling we should consider supporting B2, e.g. app.py:
For now let's stick with s3.
Before this is merged, we need to set up the upload url and update it on …
scripts/recording_uploader/deploy.py
Outdated
```python
        region_name (str): The AWS region to deploy the Lambda function to.
        guided (bool): Whether to use the guided SAM deployment.
    """
    s3 = boto3.client(
```
@KIRA009 can you please modify this to use the credentials specified in `openadapt.config`? Specifically we should add `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` or similar.

Also, please add some documentation regarding the permissions required for this IAM user.
Do we want to add these keys to the config? They won't be used anywhere else in the project, and I think when you run the deploy script, if boto3 doesn't find appropriate keys in the place where it's looking, the user is notified of that.

For the access keys, I followed this without any explicit permissions - https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey. These keys are the ones that you need to run the deploy script, which then creates appropriate IAM users on its own. Once it's deployed, if you want you can delete the original access keys. Should I add this in the readme?
```
@@ -0,0 +1 @@
boto3==1.34.84
```
What do you think about moving all of this outside of `/scripts`, and into e.g. `/admin` or similar?
I think the `uploader` module has to remain inside the `recording_uploader` module. If you are talking about the `recording_uploader` module, we could, but I think `scripts` is also a good enough place for it to be in. Let me know.
Yes, I meant to move `recording_uploader` into a new directory, `/admin` or similar.
```markdown
If you want to self host the app, you should run the following scripts

**recording_uploader**
- Ensure that you have valid AWS credentials added in your environment
```
What do you think about loading the AWS credentials from `config.py`?
Force-pushed from bfffb05 to 3554bb4.
```python
        except Exception as exc:
            logger.exception(exc)

    Thread(target=_inner).start()
```
Should we add some error handling / retrying / reporting around this?
Added retries
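A retry wrapper around the background upload could look roughly like this; the retry count, delay, and the use of `print` instead of the project's logger are placeholder choices, not the PR's actual implementation:

```python
import time
from threading import Thread

MAX_RETRIES = 3            # hypothetical values
RETRY_DELAY_SECONDS = 5


def start_upload_with_retries(upload_fn, *args) -> Thread:
    """Run upload_fn in a background thread, retrying on any exception."""
    def _inner():
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                upload_fn(*args)
                return
            except Exception as exc:
                print(f"Upload attempt {attempt}/{MAX_RETRIES} failed: {exc}")
                if attempt < MAX_RETRIES:
                    time.sleep(RETRY_DELAY_SECONDS)

    thread = Thread(target=_inner)
    thread.start()
    return thread
```

Returning the thread lets callers `join()` it (e.g. in tests) instead of fire-and-forget.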
openadapt/utils.py
Outdated
```python
        file_path (str): The path to the file to upload.
        body (dict): The body of the request.
    """
    filename = os.path.basename(file_path)
```
Suggested change:
```diff
-filename = os.path.basename(file_path)
+file_name = os.path.basename(file_path)
```
What do you think about logging this path?
Done
Thank you for putting this together @KIRA009 !
What do you think about something like this:

```python
class Recording(db.Base):
    __tablename__ = "recording"
    ...
    s3_url = sa.Column(sa.String, nullable=True)
```

And perhaps display this value in the Dashboard.
What do you think about renaming the directory to simply …
@KIRA009 what do you think about https://www.cloudflare.com/en-ca/developer-platform/products/r2/:
```python
# check if aws credentials are set
if os.getenv("AWS_ACCESS_KEY_ID") is None:
    raise ValueError("AWS_ACCESS_KEY_ID is not set")
if os.getenv("AWS_SECRET_ACCESS_KEY") is None:
```
Why not read this from `config`?
This script is not supposed to be part of the OpenAdapt app; it's an admin script that needs to be run by the owner (you) on a machine that has the relevant aws creds in its environment. From the PR description:

> From the project root, run python -m scripts.recording_uploader.deploy (ensure that you have the necessary aws creds configured). Once the command completes, note the api url in the output, and paste that onto the config.py's RECORDING_UPLOAD_URL variable

Because config.py is more closely related to settings of the app, I didn't think it'd be useful to add these there.
I think we want to read from config.py.
How would you suggest a user override these settings? The default values will be empty in `config.py` and `config.defaults.json`, so the only ways are to manually edit the `config.json` file in the `data` folder, or we expose it in the dashboard settings page? A regular user won't be needing to edit this, and might get confused if it's in the dashboard settings.
```python
    region_name=region_name,
    endpoint_url=f"https://s3.{region_name}.amazonaws.com",
)
bucket = "openadapt"
```
What do you think about defining this in config.py?
The same reason as above; plus, this is hardcoded because this script will only be run once in a while (only when the lambda function is changed), and ideally we won't be changing bucket names between runs.
What kind of change does this PR introduce?
This PR addresses #724
Summary
This PR adds a script to deploy an app to AWS Lambda, which is then used by the OpenAdapt app to upload zip files of recordings from users.
Checklist
How can your code be run and tested?
From the project root, run `python -m scripts.recording_uploader.deploy` (ensure that you have the necessary aws creds configured). Once the command completes, note the api url in the output, and paste that into `config.py`'s `RECORDING_UPLOAD_URL` variable. Once that is done, start the app, navigate to a recording detail page, and click on the "Upload recording" button. Check the s3 bucket to confirm that the recording has been uploaded (in the form of a zip file).

Other information