This library provides additional APIs to support testing frameworks for Spark Scala projects.
To use data-scalaxy-test-util in an existing SBT project with Scala 2.12 or a later version, add the following settings to your build.sbt:
ThisBuild / resolvers += "Github Repo" at "https://maven.pkg.github.com/teamclairvoyant/data-scalaxy-test-util/"

ThisBuild / credentials += Credentials(
  "GitHub Package Registry",
  "maven.pkg.github.com",
  System.getenv("GITHUB_USERNAME"),
  System.getenv("GITHUB_TOKEN")
)

ThisBuild / libraryDependencies += "com.clairvoyant.data.scalaxy" %% "test-util" % "1.0.0" % Test
Make sure you add GITHUB_USERNAME and GITHUB_TOKEN to your environment variables. GITHUB_TOKEN is a GitHub Personal Access Token with permission to read packages.
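If you prefer not to rely on environment variables, sbt can also load credentials from a file. Below is a minimal sketch of that approach; the file name .github_credentials is just an example, and the key=value entries follow sbt's standard credentials-file format:

// In build.sbt: load the GitHub credentials from a local file instead of environment variables
ThisBuild / credentials += Credentials(Path.userHome / ".sbt" / ".github_credentials")

// Contents of ~/.sbt/.github_credentials (one key=value pair per line):
// realm=GitHub Package Registry
// host=maven.pkg.github.com
// user=<your GitHub username>
// password=<your personal access token>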
This library exposes APIs that can be used to compare two Spark dataframes.
This feature is especially helpful when writing unit tests and integration tests for data-scalaxy projects where two dataframes need to be compared for equality.
The below validations are performed when comparing two dataframes:
- Validate Columns
- Validate Size
- Validate Schema
- Validate Rows
Consider a unit test where we have an actual dataframe and an expected dataframe. In order to compare these two dataframes, you need to use the API in the below manner.
For the use case where the two dataframes are exactly the same, the below test case will pass successfully:
import com.clairvoyant.data.scalaxy.test.util.matchers.DataFrameMatcher
import com.clairvoyant.data.scalaxy.test.util.readers.DataFrameReader

class DataFrameMatchersTest extends DataFrameMatcher with DataFrameReader {

  "matchExpectedDataFrame() - with 2 exact dataframes" should "compare two dataframes correctly" in {
    val df1 = readJSONFromText(
      """{
        | "col_A": "val_A",
        | "col_B": "val_B"
        |}""".stripMargin
    )

    val df2 = df1

    df1 should matchExpectedDataFrame(df2)
  }
}
For the use case where the two dataframes have different columns, the below test case will fail with the error message:
Content of data frame does not match expected data.
* Actual DF has different columns than Expected DF
Actual DF columns: col_A,col_B
Expected DF columns: col_A,col_C
Extra columns: col_B
Missing columns col_C
import com.clairvoyant.data.scalaxy.test.util.matchers.DataFrameMatcher
import com.clairvoyant.data.scalaxy.test.util.readers.DataFrameReader

class DataFrameMatchersTest extends DataFrameMatcher with DataFrameReader {

  "matchExpectedDataFrame() - with 2 dataframes having different columns" should "fail dataframes comparison" in {
    val df1 = readJSONFromText(
      """{
        | "col_A": "val_A",
        | "col_B": "val_B"
        |}""".stripMargin
    )

    val df2 = readJSONFromText(
      """{
        | "col_A": "val_A",
        | "col_C": "val_B"
        |}""".stripMargin
    )

    df1 should matchExpectedDataFrame(df2)
  }
}
Please refer to the examples for the various use cases where you can use this library to compare two dataframes.
This library provides the below APIs for reading dataframes from text or files:
- readJSONFromText
- readJSONFromFile
- readCSVFromText
- readCSVFromFile
- readXMLFromText
- readXMLFromFile
- readParquet
You can find the documentation for each API here.
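As a quick illustration, below is a minimal sketch of how a couple of these readers might be used together inside a test. It assumes readCSVFromText accepts an inline CSV string with a header row, analogous to the readJSONFromText usage shown above:

import com.clairvoyant.data.scalaxy.test.util.matchers.DataFrameMatcher
import com.clairvoyant.data.scalaxy.test.util.readers.DataFrameReader

class DataFrameReaderSketch extends DataFrameMatcher with DataFrameReader {

  "readCSVFromText()" should "build a dataframe from an inline CSV string" in {
    // Assumption: the reader takes raw CSV text whose first line is the header
    val csvDF = readCSVFromText(
      """col_A,col_B
        |val_A,val_B""".stripMargin
    )

    val expectedDF = readJSONFromText(
      """{
        | "col_A": "val_A",
        | "col_B": "val_B"
        |}""".stripMargin
    )

    csvDF should matchExpectedDataFrame(expectedDF)
  }
}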
This library provides an API to mock the AWS S3 service, allowing Spark to read data from and write data to a mocked S3 bucket.
Below is a usage example:
import com.clairvoyant.data.scalaxy.test.util.matchers.DataFrameMatcher
import com.clairvoyant.data.scalaxy.test.util.mock.S3BucketMock
import com.clairvoyant.data.scalaxy.test.util.readers.DataFrameReader

class S3BucketReaderSpec extends DataFrameReader with DataFrameMatcher with S3BucketMock {

  "read()" should "read a dataframe from the provided s3 path" in {
    val bucketName = "test-bucket"
    val path = "data"

    s3Client.createBucket(bucketName)

    val df = readJSONFromText(
      """|{
         | "col_A": "val_A1",
         | "col_B": "val_B1",
         | "col_C": "val_C1"
         |}""".stripMargin
    )

    // Write dataframe to mocked S3 bucket
    df.write.json(s"s3a://$bucketName/$path")

    // Read data from mocked S3 bucket
    val actualDF = readJSONFromFile(s"s3a://$bucketName/data")

    actualDF should matchExpectedDataFrame(df)
  }
}
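A similar pattern can be applied with the other file-based readers. Below is a hedged sketch that writes Parquet data to the mocked bucket and reads it back; it assumes readParquet accepts a path string in the same way readJSONFromFile does above:

import com.clairvoyant.data.scalaxy.test.util.matchers.DataFrameMatcher
import com.clairvoyant.data.scalaxy.test.util.mock.S3BucketMock
import com.clairvoyant.data.scalaxy.test.util.readers.DataFrameReader

class S3BucketParquetSketch extends DataFrameReader with DataFrameMatcher with S3BucketMock {

  "readParquet()" should "read back a parquet dataset written to the mocked bucket" in {
    val bucketName = "test-bucket"

    s3Client.createBucket(bucketName)

    val df = readJSONFromText(
      """{
        | "col_A": "val_A1"
        |}""".stripMargin
    )

    // Write the dataframe as parquet to the mocked S3 bucket
    df.write.parquet(s"s3a://$bucketName/parquet-data")

    // Assumption: readParquet takes a path string, analogous to readJSONFromFile
    val actualDF = readParquet(s"s3a://$bucketName/parquet-data")

    actualDF should matchExpectedDataFrame(df)
  }
}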