treeverse/lakefs-iceberg

lakeFS Iceberg Catalog

lakeFS enriches your Iceberg tables with Git capabilities: create a branch and make your changes in isolation, without affecting other team members.

See the instructions below on how to use it, and check out the integration in action in the lakeFS samples repository.

Install

Use the following Maven dependency to install the lakeFS custom catalog:

<dependency>
  <groupId>io.lakefs</groupId>
  <artifactId>lakefs-iceberg</artifactId>
  <version>0.1.4</version>
</dependency>
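
When the job is launched from PySpark or spark-submit rather than a Maven build, the same artifact can usually be pulled in through Spark's `spark.jars.packages` setting, which takes Maven coordinates in `groupId:artifactId:version` form. A minimal Python sketch follows; the helper name `maven_coordinate` is hypothetical, and the Iceberg Spark runtime matching your Spark version is assumed to be added separately.

```python
def maven_coordinate(group_id: str, artifact_id: str, version: str) -> str:
    """Format a Maven dependency as the coordinate string Spark expects."""
    return f"{group_id}:{artifact_id}:{version}"


# Coordinates taken from the <dependency> block above.
LAKEFS_ICEBERG = maven_coordinate("io.lakefs", "lakefs-iceberg", "0.1.4")

# Value for spark.jars.packages; further coordinates can be appended,
# separated by commas, for any other packages the job needs.
packages_conf = {"spark.jars.packages": LAKEFS_ICEBERG}
print(packages_conf)
```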

Configure

Here is how to configure the lakeFS custom catalog in Spark:

conf.set("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog");
conf.set("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog");
conf.set("spark.sql.catalog.lakefs.warehouse", "lakefs://example-repo");
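
These three properties follow Spark's `spark.sql.catalog.<name>` convention, so the catalog name (`lakefs` here) becomes the first part of every table identifier. The Python sketch below builds the same settings for an arbitrary catalog name and repository. `lakefs_catalog_conf` is a hypothetical helper written for illustration, not part of any lakeFS library.

```python
def lakefs_catalog_conf(catalog: str, repo: str) -> dict:
    """Return Spark properties registering a lakeFS-backed Iceberg catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "io.lakefs.iceberg.LakeFSCatalog",
        f"{prefix}.warehouse": f"lakefs://{repo}",
    }


catalog_conf = lakefs_catalog_conf("lakefs", "example-repo")
for key, value in catalog_conf.items():
    print(key, "=", value)

# With PySpark installed, each pair could be passed to the session builder:
#   builder = SparkSession.builder
#   for key, value in catalog_conf.items():
#       builder = builder.config(key, value)
```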

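You will also need to configure the S3A Hadoop FileSystem to interact with lakeFS, using your lakeFS credentials and endpoint:

conf.set("fs.s3a.access.key", "AKIAlakefs12345EXAMPLE");
conf.set("fs.s3a.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY");
conf.set("fs.s3a.endpoint", "https://example-org.us-east-1.lakefscloud.io");
conf.set("fs.s3a.path.style.access", "true");

If conf is a SparkConf rather than a Hadoop Configuration, add the spark.hadoop. prefix to each key (for example, spark.hadoop.fs.s3a.endpoint) so that Spark forwards the setting to Hadoop.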
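
Hard-coded credentials are fine for a demo but not for shared code. The Python sketch below reads them from environment variables instead. The variable names `LAKEFS_ACCESS_KEY_ID`, `LAKEFS_SECRET_ACCESS_KEY`, and `LAKEFS_ENDPOINT`, and the helper `s3a_conf`, are assumptions made for illustration. The optional `spark.hadoop.` prefix reflects how Spark copies properties set on a SparkConf into the Hadoop configuration.

```python
import os


def s3a_conf(env=os.environ, spark_prefix: bool = True) -> dict:
    """Build S3A settings for lakeFS from environment-style variables.

    The variable names used here are illustrative assumptions.
    """
    settings = {
        "fs.s3a.access.key": env["LAKEFS_ACCESS_KEY_ID"],
        "fs.s3a.secret.key": env["LAKEFS_SECRET_ACCESS_KEY"],
        "fs.s3a.endpoint": env["LAKEFS_ENDPOINT"],
        "fs.s3a.path.style.access": "true",
    }
    # Spark copies "spark.hadoop.*" entries of a SparkConf into the Hadoop
    # configuration; omit the prefix when setting Hadoop properties directly.
    prefix = "spark.hadoop." if spark_prefix else ""
    return {prefix + key: value for key, value in settings.items()}


# Example with the placeholder values from this README instead of real
# environment variables.
demo_env = {
    "LAKEFS_ACCESS_KEY_ID": "AKIAlakefs12345EXAMPLE",
    "LAKEFS_SECRET_ACCESS_KEY": "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY",
    "LAKEFS_ENDPOINT": "https://example-org.us-east-1.lakefscloud.io",
}
demo_conf = s3a_conf(demo_env)
print(sorted(demo_conf))
```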

Create a table

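To create a table on your main branch, use the following syntax:

CREATE TABLE lakefs.main.table1 (id int, data string);

The query results later in this guide assume that the table already holds two rows on main, inserted before the first commit, for example:

INSERT INTO lakefs.main.table1 VALUES (1, 'data1'), (2, 'data2');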
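
The table name above has three parts: the catalog configured earlier (`lakefs`), the lakeFS branch (`main`), and the table itself (`table1`). A small Python sketch of that naming scheme, as used in this README, with `table_ident` as a hypothetical helper:

```python
def table_ident(catalog: str, branch: str, table: str) -> str:
    """Join catalog, branch and table into the dotted name used in SQL."""
    for part in (catalog, branch, table):
        if not part or "." in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return f"{catalog}.{branch}.{table}"


main_table = table_ident("lakefs", "main", "table1")
dev_table = table_ident("lakefs", "dev", "table1")
print(f"CREATE TABLE {main_table} (id int, data string);")
print(f"SELECT * FROM {dev_table};")
```

Switching branches is then only a matter of changing the middle part of the name.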

Create a branch

We can now commit the creation of the table to the main branch:

lakectl commit lakefs://example-repo/main -m "my first iceberg commit"

Then, create a branch:

lakectl branch create lakefs://example-repo/dev -s lakefs://example-repo/main
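
The two lakectl calls can also be scripted. The Python sketch below only assembles the argument lists, using the `-m` and `-s` flags shown above. Actually running them (the commented `subprocess.run` lines) assumes that `lakectl` is installed and configured. The helper names are hypothetical.

```python
import shlex


def commit_cmd(repo: str, branch: str, message: str) -> list:
    """Argument list for committing a branch with lakectl."""
    return ["lakectl", "commit", f"lakefs://{repo}/{branch}", "-m", message]


def create_branch_cmd(repo: str, new_branch: str, source: str) -> list:
    """Argument list for creating new_branch from source with lakectl."""
    return [
        "lakectl", "branch", "create", f"lakefs://{repo}/{new_branch}",
        "-s", f"lakefs://{repo}/{source}",
    ]


commit = commit_cmd("example-repo", "main", "my first iceberg commit")
branch = create_branch_cmd("example-repo", "dev", "main")
print(shlex.join(commit))
print(shlex.join(branch))

# To execute, commit first so the new branch starts from the committed table:
#   import subprocess
#   subprocess.run(commit, check=True)
#   subprocess.run(branch, check=True)
```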

Make changes on the branch

We can now make changes on the branch:

INSERT INTO lakefs.dev.table1 VALUES (3, 'data3');

Query the table

If we query the table on the branch, we will see the data we inserted:

SELECT * FROM lakefs.dev.table1;

Results in:

+----+-------+
| id | data  |
+----+-------+
| 1  | data1 |
| 2  | data2 |
| 3  | data3 |
+----+-------+

However, if we query the table on the main branch, we will not see the new changes:

SELECT * FROM lakefs.main.table1;

Results in:

+----+-------+
| id | data  |
+----+-------+
| 1  | data1 |
| 2  | data2 |
+----+-------+
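
One way to check this isolation programmatically is to compare the rows returned for the two branches. The Python sketch below works on plain tuples. With PySpark, the inputs would come from something like `spark.sql("SELECT * FROM lakefs.dev.table1").collect()`, which is an assumption about your environment. `rows_only_on_branch` is a hypothetical helper.

```python
def rows_only_on_branch(branch_rows, main_rows):
    """Return rows present on the branch but not on main, in branch order."""
    main_set = set(main_rows)
    return [row for row in branch_rows if row not in main_set]


# Rows copied from the two result tables above.
dev_rows = [(1, "data1"), (2, "data2"), (3, "data3")]
main_rows = [(1, "data1"), (2, "data2")]

added = rows_only_on_branch(dev_rows, main_rows)
print(added)  # [(3, 'data3')]
```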