Implementation of LakeFS #14

Draft · jameshod5 wants to merge 29 commits into main
Conversation

@jameshod5 (Collaborator) commented Sep 10, 2024

The main changes to the code base are the creation of a LakeFS workflow (src/lake_fs.py) and the relocation of the cleanup process, which was previously part of the ingestion workflow, into this LakeFS workflow. This takes advantage of the fact that the ingestion files are written to local disk for the whole of the ingestion process.

At the moment, I have a LakeFS server and a Postgres server running on the STFC machine. Then, on the machine where you want to run your ingestion/versioning code:

  • Set up an SSH tunnel into the STFC machine. With this, we can access the LakeFS setup UI when we first run the LakeFS server, which gives us the access key and secret key for the server; keep these somewhere safe. Create a repository through the UI and point it at the S3 storage we want to use, then import all of the existing data from the S3 storage through the big green "Import" button in the UI.
  • (Optional) If you want to create a new LakeFS repo that points to an S3 object store that already has a repo associated with it, you will first need to remove the "/data" and "_lakefs" directories from that S3 store using s5cmd.
  • You will need to install lakectl (download the .tar with wget and un-tar it) so that we can run `lakectl config` and input the access key and secret key that the LakeFS UI setup gave us earlier. This allows our machine to access the LakeFS server.
  • Run the command `python src/lake_fs.py --repo REPO-NAME --data-dir /LOCATION/OF/DATA/ --commit-message "Ingesting my data"` (a sketch of this entry point is shown after this list).
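
For reference, a minimal sketch of what the `src/lake_fs.py` entry point could look like. Only the three flags above come from this PR; the `upload_and_commit` helper is an assumption, sketched after the next paragraph:

```python
import argparse


def main():
    # CLI mirroring the command above; only --repo, --data-dir and
    # --commit-message are taken from this PR, the rest is assumed.
    parser = argparse.ArgumentParser(description="Version ingested data with LakeFS")
    parser.add_argument("--repo", required=True, help="target LakeFS repository name")
    parser.add_argument("--data-dir", required=True, help="local directory holding ingested files")
    parser.add_argument("--commit-message", required=True, help="commit message for this version")
    args = parser.parse_args()

    # upload_and_commit is a hypothetical helper, sketched below.
    upload_and_commit(args.repo, args.data_dir, args.commit_message)


if __name__ == "__main__":
    main()
```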

The idea is to run the ingestion workflow as normal, writing either locally or to S3, and then run the LakeFS versioning code whenever we want to version. This uploads all of the data that was written to 'data/local' (for example; it can be anywhere we want) and then removes those files, similar to the cleanup() process that used to be part of the ingestion workflow itself.
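
A minimal sketch of that upload-then-cleanup step, assuming `lakectl` has been configured as described above and shelling out to it (the `main` branch default and the helper name are assumptions, not fixed by this PR):

```python
import shutil
import subprocess


def upload_and_commit(repo: str, data_dir: str, message: str, branch: str = "main"):
    dest = f"lakefs://{repo}/{branch}/"
    # Recursively upload everything under data_dir to the branch...
    subprocess.run(["lakectl", "fs", "upload", "--recursive",
                    "--source", data_dir, dest], check=True)
    # ...commit it as a single version...
    subprocess.run(["lakectl", "commit", f"lakefs://{repo}/{branch}",
                    "-m", message], check=True)
    # ...then mirror the old cleanup() by removing the local copies.
    shutil.rmtree(data_dir)
```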

The reasoning for separating LakeFS from the ingestion workflow is the MPI processes that we use to speed up ingestion. I tried to work around this by only letting one MPI rank do the versioning task (roughly as sketched below) but found it cumbersome. I also think that keeping it separate allows us to pick and choose when we want to version our data. But perhaps there is a better way around this.
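
For context, the rank-0 gate I tried (and dropped) would look roughly like this with mpi4py; this is a sketch of the abandoned approach, not code in this PR, and `upload_and_commit` is the hypothetical helper from above:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

# ... parallel ingestion runs on every rank ...
# (repo, data_dir, commit_message are assumed to be defined earlier)

comm.Barrier()  # wait until all ranks have finished writing
if comm.Get_rank() == 0:
    # Only one rank talks to LakeFS, to avoid concurrent uploads/commits.
    upload_and_commit(repo, data_dir, commit_message)
```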

Output:

[screenshot: terminal output of the versioning run, part 1]

[screenshot: terminal output of the versioning run, part 2]

LakeFS:

[screenshot: LakeFS UI]

@jameshod5 added the enhancement (New feature or request) label on Sep 10, 2024