This library provides additional APIs to support testing frameworks for Spark Scala projects.
To use data-scalaxy-test-util in an existing SBT project with Scala 2.12 or a later version, add the following settings to your build.sbt:
ThisBuild / resolvers += "Github Repo" at "https://maven.pkg.github.com/teamclairvoyant/data-scalaxy-test-util/"

ThisBuild / credentials += Credentials(
  "GitHub Package Registry",
  "maven.pkg.github.com",
  System.getenv("GITHUB_USERNAME"),
  System.getenv("GITHUB_TOKEN")
)

ThisBuild / libraryDependencies += "com.clairvoyant.data.scalaxy" %% "test-util" % "1.0.0" % Test
Make sure you add GITHUB_USERNAME and GITHUB_TOKEN to your environment variables. GITHUB_TOKEN is a GitHub Personal Access Token with permission to read packages.
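If you prefer not to rely on environment variables, sbt can also load credentials from a file. Below is a minimal sketch of that approach; the file name .github_credentials is just an example, and the key=value entries follow sbt's standard credentials-file format:

// In build.sbt: load the GitHub credentials from a local file instead of environment variables
ThisBuild / credentials += Credentials(Path.userHome / ".sbt" / ".github_credentials")

// Contents of ~/.sbt/.github_credentials (one key=value pair per line):
// realm=GitHub Package Registry
// host=maven.pkg.github.com
// user=<your GitHub username>
// password=<your personal access token>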
This library exposes APIs that can be used to compare two Spark dataframes.
This feature is especially helpful when writing unit tests and integration tests for data-scalaxy projects where two dataframes need to be compared for equality.
The below validations are performed when comparing two dataframes:
- Validate Columns
- Validate Size
- Validate Schema
- Validate Rows
Consider a unit test where we have an actual dataframe and an expected dataframe. In order to compare these two dataframes, you need to use the API in the below manner.
For the use case where the two dataframes are exactly the same, the below test case will pass successfully:
import com.clairvoyant.data.scalaxy.test.util.matchers.DataFrameMatcher
import com.clairvoyant.data.scalaxy.test.util.readers.DataFrameReader

class DataFrameMatchersTest extends DataFrameMatcher with DataFrameReader {

  "matchExpectedDataFrame() - with 2 exact dataframes" should "compare two dataframes correctly" in {
    val df1 = readJSONFromText(
      """{
        | "col_A": "val_A",
        | "col_B": "val_B"
        |}""".stripMargin
    )

    val df2 = df1

    df1 should matchExpectedDataFrame(df2)
  }
}
For the use case where the two dataframes have different columns, the below test case will fail with the error message:
Content of data frame does not match expected data.
* Actual DF has different columns than Expected DF
Actual DF columns: col_A,col_B
Expected DF columns: col_A,col_C
Extra columns: col_B
Missing columns col_C
import com.clairvoyant.data.scalaxy.test.util.matchers.DataFrameMatcher
import com.clairvoyant.data.scalaxy.test.util.readers.DataFrameReader

class DataFrameMatchersTest extends DataFrameMatcher with DataFrameReader {

  "matchExpectedDataFrame() - with 2 dataframes having different columns" should "fail dataframes comparison" in {
    val df1 = readJSONFromText(
      """{
        | "col_A": "val_A",
        | "col_B": "val_B"
        |}""".stripMargin
    )

    val df2 = readJSONFromText(
      """{
        | "col_A": "val_A",
        | "col_C": "val_B"
        |}""".stripMargin
    )

    df1 should matchExpectedDataFrame(df2)
  }
}
Please refer to the examples for the various use cases where you can use this library to compare two dataframes.
This library provides the below APIs for reading dataframes from text or files:
- readJSONFromText
- readJSONFromFile
- readCSVFromText
- readCSVFromFile
- readXMLFromText
- readXMLFromFile
- readParquet
You can find the documentation for each API here.
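As a quick illustration, below is a minimal sketch of how a couple of these readers might be used together inside a test. It assumes readCSVFromText accepts an inline CSV string with a header row, analogous to the readJSONFromText usage shown above:

import com.clairvoyant.data.scalaxy.test.util.matchers.DataFrameMatcher
import com.clairvoyant.data.scalaxy.test.util.readers.DataFrameReader

class DataFrameReaderSketch extends DataFrameMatcher with DataFrameReader {

  "readCSVFromText()" should "build a dataframe from an inline CSV string" in {
    // Assumption: the reader takes raw CSV text whose first line is the header
    val csvDF = readCSVFromText(
      """col_A,col_B
        |val_A,val_B""".stripMargin
    )

    val expectedDF = readJSONFromText(
      """{
        | "col_A": "val_A",
        | "col_B": "val_B"
        |}""".stripMargin
    )

    csvDF should matchExpectedDataFrame(expectedDF)
  }
}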
This library provides an API to mock the AWS S3 service, allowing Spark to read data from and write data to a mocked S3 bucket.
Below is a usage example:
import com.clairvoyant.data.scalaxy.test.util.matchers.DataFrameMatcher
import com.clairvoyant.data.scalaxy.test.util.mock.S3BucketMock
import com.clairvoyant.data.scalaxy.test.util.readers.DataFrameReader

class S3BucketReaderSpec extends DataFrameReader with DataFrameMatcher with S3BucketMock {

  "read()" should "read a dataframe from the provided s3 path" in {
    val bucketName = "test-bucket"
    val path = "data"

    s3Client.createBucket(bucketName)

    val df = readJSONFromText(
      """|{
         | "col_A": "val_A1",
         | "col_B": "val_B1",
         | "col_C": "val_C1"
         |}""".stripMargin
    )

    // Write dataframe to mocked S3 bucket
    df.write.json(s"s3a://$bucketName/$path")

    // Read data from mocked S3 bucket
    val actualDF = readJSONFromFile(s"s3a://$bucketName/data")

    actualDF should matchExpectedDataFrame(df)
  }
}
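A similar pattern can be applied with the other file-based readers. Below is a hedged sketch that writes Parquet data to the mocked bucket and reads it back; it assumes readParquet accepts a path string in the same way readJSONFromFile does above:

import com.clairvoyant.data.scalaxy.test.util.matchers.DataFrameMatcher
import com.clairvoyant.data.scalaxy.test.util.mock.S3BucketMock
import com.clairvoyant.data.scalaxy.test.util.readers.DataFrameReader

class S3BucketParquetSketch extends DataFrameReader with DataFrameMatcher with S3BucketMock {

  "readParquet()" should "read back a parquet dataset written to the mocked bucket" in {
    val bucketName = "test-bucket"

    s3Client.createBucket(bucketName)

    val df = readJSONFromText(
      """{
        | "col_A": "val_A1"
        |}""".stripMargin
    )

    // Write the dataframe as parquet to the mocked S3 bucket
    df.write.parquet(s"s3a://$bucketName/parquet-data")

    // Assumption: readParquet takes a path string, analogous to readJSONFromFile
    val actualDF = readParquet(s"s3a://$bucketName/parquet-data")

    actualDF should matchExpectedDataFrame(df)
  }
}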