Update DeltaLake documentation
IceS2 committed Jun 18, 2024
1 parent 8185882 commit c52425f
Showing 3 changed files with 349 additions and 50 deletions.
@@ -20,30 +20,57 @@ Configure and schedule Deltalake metadata and profiler workflows from the OpenMe
- [Metadata Ingestion](#metadata-ingestion)
- [dbt Integration](/connectors/ingestion/workflows/dbt)

{% partial file="/v1.5/connectors/ingestion-modes-tiles.md" variables={yamlPath: "/connectors/database/deltalake/yaml"} /%}
{% partial file="/v1.4/connectors/ingestion-modes-tiles.md" variables={yamlPath: "/connectors/database/deltalake/yaml"} /%}


## Requirements

Deltalake requires Python 3.8, 3.9, or 3.10 to run. We do not yet support the Delta connector
for Python 3.11.

The DeltaLake connector can extract metadata from a **metastore** or directly from the **storage**.

If extracting directly from the storage, some extra requirements apply depending on the storage type.

### S3 Permissions

To execute metadata extraction, the AWS account should have enough access to fetch the required data. The **Bucket Policy** in AWS requires at least these permissions:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<my bucket>",
                "arn:aws:s3:::<my bucket>/*"
            ]
        }
    ]
}
```
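
If you want to sanity-check that the credentials you plan to use actually grant these permissions before running the ingestion, a quick `boto3` probe can help. This is only an illustrative sketch — the bucket name is a placeholder and the script is not part of the connector:

```python
import boto3
from botocore.exceptions import ClientError

BUCKET = "<my bucket>"  # placeholder: your Delta Lake bucket

s3 = boto3.Session().client("s3")  # optionally: boto3.Session(profile_name="...").client("s3")

try:
    # Requires s3:ListBucket
    listing = s3.list_objects_v2(Bucket=BUCKET, MaxKeys=1)
    # Requires s3:GetObject (only attempted if the bucket has at least one object)
    for obj in listing.get("Contents", []):
        s3.get_object(Bucket=BUCKET, Key=obj["Key"])
    print("ListBucket and GetObject look OK")
except ClientError as err:
    print(f"Missing permission or misconfiguration: {err}")
```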

## Metadata Ingestion

{% partial
file="/v1.5/connectors/metadata-ingestion-ui.md"
{% partial
file="/v1.4/connectors/metadata-ingestion-ui.md"
variables={
connector: "Deltalake",
selectServicePath: "/images/v1.5/connectors/deltalake/select-service.png",
addNewServicePath: "/images/v1.5/connectors/deltalake/add-new-service.png",
serviceConnectionPath: "/images/v1.5/connectors/deltalake/service-connection.png",
}
connector: "Deltalake",
selectServicePath: "/images/v1.4/connectors/deltalake/select-service.png",
addNewServicePath: "/images/v1.4/connectors/deltalake/add-new-service.png",
serviceConnectionPath: "/images/v1.4/connectors/deltalake/service-connection.png",
}
/%}

{% stepsContainer %}
{% extraContent parentTagName="stepsContainer" %}

#### Connection Details
#### Connection Details for MetastoreConfig

- **Metastore Host Port**: Enter the Host & Port of the Hive Metastore Service to configure the Spark Session. One
of `metastoreHostPort`, `metastoreDb`, or `metastoreFilePath` is required.
@@ -61,7 +88,7 @@ We are internally running with `pyspark` 3.X and `delta-lake` 2.0.0. This means
When connecting to an External Metastore passing the parameter `Metastore Host Port`, we will be preparing a Spark Session with the configuration

```
.config("hive.metastore.uris", "thrift://{connection.metastoreHostPort}")
.config("hive.metastore.uris", "thrift://{connection.metastoreHostPort}")
```

Then, we will be using the `catalog` functions from the Spark Session to pick up the metadata exposed by the Hive Metastore.
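
As a rough illustration of what this looks like (not the connector's actual code — host, port, and app name below are placeholders), a Spark Session wired to an external metastore and a couple of `catalog` calls could be sketched as:

```python
from pyspark.sql import SparkSession

metastore_host_port = "localhost:9083"  # placeholder for connection.metastoreHostPort

spark = (
    SparkSession.builder.appName("OpenMetadataDeltalakeSketch")
    .config("hive.metastore.uris", f"thrift://{metastore_host_port}")
    .enableHiveSupport()
    .getOrCreate()
)

# The catalog API exposes the metadata registered in the Hive Metastore
for db in spark.catalog.listDatabases():
    print(db.name)
    for table in spark.catalog.listTables(db.name):
        print("  ", table.name, table.tableType)
```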
@@ -71,7 +98,7 @@ Then, we will be using the `catalog` functions from the Spark Session to pick up
If instead we use a local file path that contains the metastore information (e.g., for local testing with the default `metastore_db` directory), we will set

```
.config("spark.driver.extraJavaOptions", "-Dderby.system.home={connection.metastoreFilePath}")
.config("spark.driver.extraJavaOptions", "-Dderby.system.home={connection.metastoreFilePath}")
```

This updates the `Derby` information. You can find more about this in a great [SO thread](https://stackoverflow.com/questions/38377188/how-to-get-rid-of-derby-log-metastore-db-from-spark-shell).
@@ -88,19 +115,81 @@ Here, we will need to inform all the common database settings (url, username, pa

You will need to provide the driver to the ingestion image, and pass the `classpath` which will be used in the Spark Configuration under `spark.driver.extraClassPath`.
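
As a hedged sketch of what such a JDBC-backed metastore connection could look like at the Spark level (the property names are standard Hive/Spark settings, but the URL, credentials, and driver path are placeholders — not the connector's exact internals):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("OpenMetadataDeltalakeSketch")
    # Standard Hive metastore JDBC settings (placeholder values)
    .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:mysql://localhost:3306/demo_hive")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.mariadb.jdbc.Driver")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "username")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "password")
    # The JDBC driver jar must be reachable on the driver classpath
    .config("spark.driver.extraClassPath", "/some/path/mariadb-java-client.jar")
    .enableHiveSupport()
    .getOrCreate()
)
```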

{% partial file="/v1.5/connectors/database/advanced-configuration.md" /%}
#### Connection Details for StorageConfig - S3

- **AWS Access Key ID** & **AWS Secret Access Key**: When you interact with AWS, you specify your AWS security credentials to verify who you are and whether you have
permission to access the resources that you are requesting. AWS uses the security credentials to authenticate and
authorize your requests ([docs](https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html)).

Access keys consist of two parts: An **access key ID** (for example, `AKIAIOSFODNN7EXAMPLE`), and a **secret access key** (for example, `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`).

You must use both the access key ID and secret access key together to authenticate your requests.

You can find further information on how to manage your access keys [here](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html).

- **AWS Region**: Each AWS Region is a separate geographic area in which AWS clusters data centers ([docs](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.RegionsAndAvailabilityZones.html)).

As AWS can have instances in multiple regions, we need to know the region the service you want to reach belongs to.

Note that the AWS Region is the only required parameter when configuring a connection. When connecting to the
services programmatically, there are different ways in which we can extract and use the rest of the AWS configuration.

You can find further information about configuring your credentials [here](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#configuring-credentials).

- **AWS Session Token (optional)**: If you are using temporary credentials to access your services, you will need to provide the AWS Access Key ID
and AWS Secret Access Key. These temporary credentials will also include an AWS Session Token.

You can find more information on [Using temporary credentials with AWS resources](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_use-resources.html).

- **Endpoint URL (optional)**: To connect programmatically to an AWS service, you use an endpoint. An *endpoint* is the URL of the
entry point for an AWS web service. The AWS SDKs and the AWS Command Line Interface (AWS CLI) automatically use the
default endpoint for each service in an AWS Region. But you can specify an alternate endpoint for your API requests.

Find more information on [AWS service endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html).

- **Profile Name**: A named profile is a collection of settings and credentials that you can apply to an AWS CLI command.
When you specify a profile to run a command, the settings and credentials are used to run that command.
Multiple named profiles can be stored in the config and credentials files.

Set this field if you'd like to use a profile other than `default`.

Find more information about [Named profiles for the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html).

- **Assume Role Arn**: Typically, you use `AssumeRole` within your account or for cross-account access. In this field you'll set the
`ARN` (Amazon Resource Name) of the role in the other account.

A user who wants to access a role in a different account must also have permissions that are delegated from the account
administrator. The administrator must attach a policy that allows the user to call `AssumeRole` for the `ARN` of the role in the other account.

This is a required field if you'd like to `AssumeRole`.

Find more information on [AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html).

- **Assume Role Session Name**: An identifier for the assumed role session. Use the role session name to uniquely identify a session when the same role
is assumed by different principals or for different reasons.

By default, we'll use the name `OpenMetadataSession`.

Find more information about the [Role Session Name](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html#:~:text=An%20identifier%20for%20the%20assumed%20role%20session.).

- **Assume Role Source Identity**: The source identity specified by the principal that is calling the `AssumeRole` operation. You can use source identity
information in AWS CloudTrail logs to determine who took actions with a role.

Find more information about [Source Identity](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html#:~:text=Required%3A%20No-,SourceIdentity,-The%20source%20identity).
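
For orientation, the options above correspond to the storage connection's AWS security settings. A hedged YAML sketch is shown below — the field names follow the OpenMetadata AWS credentials schema as we understand it, and every value is a placeholder; include only the options you actually need:

```yaml
# Illustrative sketch of a StorageConfig security block (placeholder values)
configSource:
  connection:
    securityConfig:
      awsAccessKeyId: AKIAIOSFODNN7EXAMPLE
      awsSecretAccessKey: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      awsRegion: us-east-2
      # Optional settings
      awsSessionToken: <session token, only with temporary credentials>
      endPointURL: https://s3.us-east-2.amazonaws.com
      profileName: default
      assumeRoleArn: arn:aws:iam::123456789012:role/OpenMetadataRole
      assumeRoleSessionName: OpenMetadataSession
      assumeRoleSourceIdentity: <source identity>
```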


{% partial file="/v1.4/connectors/database/advanced-configuration.md" /%}

{% /extraContent %}

{% partial file="/v1.5/connectors/test-connection.md" /%}
{% partial file="/v1.4/connectors/test-connection.md" /%}

{% partial file="/v1.5/connectors/database/configure-ingestion.md" /%}
{% partial file="/v1.4/connectors/database/configure-ingestion.md" /%}

{% partial file="/v1.5/connectors/ingestion-schedule-and-deploy.md" /%}
{% partial file="/v1.4/connectors/ingestion-schedule-and-deploy.md" /%}

{% /stepsContainer %}

{% partial file="/v1.5/connectors/troubleshooting.md" /%}

{% partial file="/v1.5/connectors/database/related.md" /%}
{% partial file="/v1.4/connectors/troubleshooting.md" /%}

{% partial file="/v1.4/connectors/database/related.md" /%}
@@ -19,7 +19,7 @@ Configure and schedule Deltalake metadata and profiler workflows from the OpenMe
- [Metadata Ingestion](#metadata-ingestion)
- [dbt Integration](#dbt-integration)

{% partial file="/v1.5/connectors/external-ingestion-deployment.md" /%}
{% partial file="/v1.4/connectors/external-ingestion-deployment.md" /%}

## Requirements

@@ -28,14 +28,23 @@ for Python 3.11

### Python Requirements

{% partial file="/v1.5/connectors/python-requirements.md" /%}
{% partial file="/v1.4/connectors/python-requirements.md" /%}

To run the Deltalake ingestion, you will need to install:

- If extracting from a metastore

```bash
pip3 install "openmetadata-ingestion[deltalake-spark]"
```

- If extracting directly from the storage

```bash
pip3 install "openmetadata-ingestion[deltalake]"
pip3 install "openmetadata-ingestion[deltalake-storage]"
```


## Metadata Ingestion

All connectors are defined as JSON Schemas.
@@ -51,13 +60,13 @@ The workflow is modeled around the following

### 1. Define the YAML Config

This is a sample config for Deltalake:
#### Source Configuration - From Metastore

{% codePreview %}

{% codeInfoContainer %}

#### Source Configuration - Service Connection
##### Source Configuration - Service Connection

{% codeInfo srNumber=1 %}

@@ -78,22 +87,22 @@ This is a sample config for Deltalake:
We are internally running with `pyspark` 3.X and `delta-lake` 2.0.0. This means that we need to consider Spark
configuration options for 3.X.

##### Metastore Host Port
###### Metastore Host Port

When connecting to an External Metastore passing the parameter `Metastore Host Port`, we will be preparing a Spark Session with the configuration

```
.config("hive.metastore.uris", "thrift://{connection.metastoreHostPort}")
.config("hive.metastore.uris", "thrift://{connection.metastoreHostPort}")
```

Then, we will be using the `catalog` functions from the Spark Session to pick up the metadata exposed by the Hive Metastore.

##### Metastore File Path
###### Metastore File Path

If instead we use a local file path that contains the metastore information (e.g., for local testing with the default `metastore_db` directory), we will set

```
.config("spark.driver.extraJavaOptions", "-Dderby.system.home={connection.metastoreFilePath}")
.config("spark.driver.extraJavaOptions", "-Dderby.system.home={connection.metastoreFilePath}")
```

This updates the `Derby` information. You can find more about this in a great [SO thread](https://stackoverflow.com/questions/38377188/how-to-get-rid-of-derby-log-metastore-db-from-spark-shell).
@@ -104,7 +113,7 @@ To update the `Derby` information. More information about this in a great [SO th
Spark SQL [book](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-hive-metastore.html).


##### Metastore Database
###### Metastore Database

You can also connect to the metastore by directly pointing to the Hive Metastore db, e.g., `jdbc:mysql://localhost:3306/demo_hive`.

@@ -116,13 +125,13 @@ You will need to provide the driver to the ingestion image, and pass the `classp
{% /codeInfo %}


{% partial file="/v1.5/connectors/yaml/database/source-config-def.md" /%}
{% partial file="/v1.4/connectors/yaml/database/source-config-def.md" /%}

{% partial file="/v1.5/connectors/yaml/ingestion-sink-def.md" /%}
{% partial file="/v1.4/connectors/yaml/ingestion-sink-def.md" /%}

{% partial file="/v1.5/connectors/yaml/workflow-config-def.md" /%}
{% partial file="/v1.4/connectors/yaml/workflow-config-def.md" /%}

#### Advanced Configuration
##### Advanced Configuration

{% codeInfo srNumber=2 %}

@@ -151,19 +160,23 @@ source:
type: DeltaLake
```
```yaml {% srNumber=1 %}
metastoreConnection:
# Pick only one of the three
## 1. Hive Service Thrift Connection
metastoreHostPort: "<metastore host port>"
## 2. Hive Metastore db connection
# metastoreDb: jdbc:mysql://localhost:3306/demo_hive
# username: username
# password: password
# driverName: org.mariadb.jdbc.Driver
# jdbcDriverClassPath: /some/path/
## 3. Local file for Testing
# metastoreFilePath: "<path_to_metastore>/metastore_db"
appName: MyApp
configSource:
connection:
# Pick only one of these

## 1. Hive Service Thrift Connection
metastoreHostPort: "<metastore host port>"

## 2. Hive Metastore db connection
# metastoreDb: jdbc:mysql://localhost:3306/demo_hive
# username: username
# password: password
# driverName: org.mariadb.jdbc.Driver
# jdbcDriverClassPath: /some/path/

## 3. Local file for Testing
# metastoreFilePath: "<path_to_metastore>/metastore_db"
appName: MyApp
```
```yaml {% srNumber=2 %}
# connectionOptions:
@@ -174,17 +187,75 @@ source:
# key: value
```

{% partial file="/v1.5/connectors/yaml/database/source-config.md" /%}
{% partial file="/v1.4/connectors/yaml/database/source-config.md" /%}

{% partial file="/v1.4/connectors/yaml/ingestion-sink.md" /%}

{% partial file="/v1.4/connectors/yaml/workflow-config.md" /%}

{% /codeBlock %}

{% /codePreview %}

#### Source Configuration - From Storage - S3

{% codePreview %}

{% codeInfoContainer %}

##### Source Configuration - Service Connection

{% codeInfo srNumber=1 %}

* **awsAccessKeyId**: Enter your secure access key ID for your S3 connection. The specified key ID should be authorized to read all the data you want to include in the metadata ingestion workflow.
* **awsSecretAccessKey**: Enter the Secret Access Key (the passcode key pair to the key ID from above).
* **awsRegion**: Specify the region in which your bucket is located. This setting is required even if you have configured a local AWS profile.
* **schemaFilterPattern** and **tableFilterPattern**: Note that the `schemaFilterPattern` and `tableFilterPattern` both support regex as `include` or `exclude`, as in the sketch below.
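
For example, a hedged snippet of how such filters could look in the workflow's `sourceConfig` (the schema and table names are placeholders, and the `includes`/`excludes` keys follow the usual OpenMetadata filter-pattern layout as we understand it):

```yaml
sourceConfig:
  config:
    type: DatabaseMetadata
    schemaFilterPattern:
      includes:
        - sales_.*
      excludes:
        - .*_staging
    tableFilterPattern:
      includes:
        - customers
        - orders
```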

{% /codeInfo %}


{% partial file="/v1.4/connectors/yaml/database/source-config-def.md" /%}

{% partial file="/v1.4/connectors/yaml/ingestion-sink-def.md" /%}

{% partial file="/v1.4/connectors/yaml/workflow-config-def.md" /%}

{% /codeInfoContainer %}

{% codeBlock fileName="filename.yaml" %}

```yaml {% isCodeBlock=true %}
source:
type: deltalake
serviceName: <service_name>
serviceConnection:
config:
type: DeltaLake
```
```yaml {% srNumber=1 %}
configSource:
connection:
securityConfig:
awsAccessKeyId: aws access key id
awsSecretAccessKey: aws secret access key
awsRegion: aws region
bucketName: bucket name
prefix: prefix
```
{% partial file="/v1.4/connectors/yaml/database/source-config.md" /%}
{% partial file="/v1.5/connectors/yaml/ingestion-sink.md" /%}
{% partial file="/v1.4/connectors/yaml/ingestion-sink.md" /%}
{% partial file="/v1.5/connectors/yaml/workflow-config.md" /%}
{% partial file="/v1.4/connectors/yaml/workflow-config.md" /%}
{% /codeBlock %}
{% /codePreview %}
{% partial file="/v1.5/connectors/yaml/ingestion-cli.md" /%}
{% partial file="/v1.4/connectors/yaml/ingestion-cli.md" /%}
## dbt Integration