Implementation of LakeFS #14

Draft · jameshod5 wants to merge 29 commits into main
Conversation

@jameshod5 (Collaborator) commented Sep 10, 2024

The main changes to the code base are the creation of a LakeFS workflow (src/lake_fs.py) and the relocation of the cleanup process, which was previously part of the ingestion workflow, into this LakeFS workflow. This takes advantage of the fact that the ingestion files are written to local disk for the whole of the ingestion process.

At the moment, I have a LakeFS server and a Postgres server running on the STFC machine. Then, on the machine where you want to run your ingestion/versioning code:

  • Set up an SSH tunnel into the STFC machine. With this, we can access the LakeFS setup UI when we first run the LakeFS server, which gives us the access key and secret key for the server; keep these somewhere safe. Create a repository through the UI and point it at the S3 storage we want to use, then import all of the existing data from the S3 storage through the big green "Import" button in the UI.
  • (Optional) If you want to create a new LakeFS repo that points to an S3 object store that already has a repo associated with it, you will first need to remove the "/data" and "_lakefs" directories from that S3 store using s5cmd.
  • You will need to install lakectl (download the .tar with wget and un-tar it) so that we can run `lakectl config` and input the access key and secret key that the LakeFS UI setup gave us earlier. This allows our machine to access the LakeFS server.
  • Run the command `python src/lake_fs.py --repo REPO-NAME --data-dir /LOCATION/OF/DATA/ --commit-message "Ingesting my data"` (a sketch of this entry point is shown after this list).
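
For reference, a minimal sketch of what the `src/lake_fs.py` entry point could look like. Only the three flags above come from this PR; the `upload_and_commit` helper is an assumption, sketched after the next paragraph:

```python
import argparse


def main():
    # CLI mirroring the command above; only --repo, --data-dir and
    # --commit-message are taken from this PR, the rest is assumed.
    parser = argparse.ArgumentParser(description="Version ingested data with LakeFS")
    parser.add_argument("--repo", required=True, help="target LakeFS repository name")
    parser.add_argument("--data-dir", required=True, help="local directory holding ingested files")
    parser.add_argument("--commit-message", required=True, help="commit message for this version")
    args = parser.parse_args()

    # upload_and_commit is a hypothetical helper, sketched below.
    upload_and_commit(args.repo, args.data_dir, args.commit_message)


if __name__ == "__main__":
    main()
```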

The idea is to run the ingestion workflow as normal, writing either locally or to S3, and then run the LakeFS versioning code whenever we want to version. This uploads all of the data that was written to 'data/local' (for example; it can be anywhere we want) and then removes those files, similar to the cleanup() process that used to be part of the ingestion workflow itself.
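
A minimal sketch of that upload-then-cleanup step, assuming `lakectl` has been configured as described above and shelling out to it (the `main` branch default and the helper name are assumptions, not fixed by this PR):

```python
import shutil
import subprocess


def upload_and_commit(repo: str, data_dir: str, message: str, branch: str = "main"):
    dest = f"lakefs://{repo}/{branch}/"
    # Recursively upload everything under data_dir to the branch...
    subprocess.run(["lakectl", "fs", "upload", "--recursive",
                    "--source", data_dir, dest], check=True)
    # ...commit it as a single version...
    subprocess.run(["lakectl", "commit", f"lakefs://{repo}/{branch}",
                    "-m", message], check=True)
    # ...then mirror the old cleanup() by removing the local copies.
    shutil.rmtree(data_dir)
```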

The reasoning for separating LakeFS from the ingestion workflow is the MPI processes that we use to speed up ingestion. I tried to work around this by only letting one MPI rank do the versioning task (roughly as sketched below) but found it cumbersome. I also think that keeping it separate allows us to pick and choose when we want to version our data. But perhaps there is a better way around this.
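
For context, the rank-0 gate I tried (and dropped) would look roughly like this with mpi4py; this is a sketch of the abandoned approach, not code in this PR, and `upload_and_commit` is the hypothetical helper from above:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

# ... parallel ingestion runs on every rank ...
# (repo, data_dir, commit_message are assumed to be defined earlier)

comm.Barrier()  # wait until all ranks have finished writing
if comm.Get_rank() == 0:
    # Only one rank talks to LakeFS, to avoid concurrent uploads/commits.
    upload_and_commit(repo, data_dir, commit_message)
```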

Output:

[screenshot: terminal output of the versioning run, part 1]

[screenshot: terminal output of the versioning run, part 2]

LakeFS:

[screenshot: LakeFS UI]

@jameshod5 added the enhancement (New feature or request) label on Sep 10, 2024