- Overview
- Installation
- Getting Started
- Usage
- Examples
- API Reference
- Built With
- Versioning
- Development
- TODO
A project that makes working with columnar Spark files easier.
<dependency>
    <groupId>com.jrgv89</groupId>
    <artifactId>SparkApp</artifactId>
    <version>1.0.0-SNAPSHOT</version>
</dependency>
git clone ...
cd SparkApp && mvn clean package
- Git
- Maven
- Spark 2.1
A Typesafe config file with the following structure:
sparkApp {
  config {
    name: ""
    spark: {}
    options: {
      hdfs: {}
      logger: {}
      debug: {}
    }
    defaultTable: {}
    tables: []
  }
  app: {}
}
Key | Type | Info | Required | Default |
---|---|---|---|---|
sparkApp.config.name | String | Spark job name | true | |
sparkApp.config.spark | Object[String,String] | Spark job options | true | |
sparkApp.config.options | Config | App options | false | |
sparkApp.config.options.hdfs | Config | HDFS configuration | false | |
sparkApp.config.options.hdfs.enable | Boolean | Whether HDFS support is enabled | false | true |
sparkApp.config.options.hdfs.properties | Object[String,String] | Properties for the HDFS configuration | false | {} (new Configuration()) |
sparkApp.config.options.logger | Config | Logger properties. Not used yet | false | |
sparkApp.config.options.logger.enable | Boolean | Whether the logger is active | true | true |
sparkApp.config.options.debug | Config | Debug options | false | |
sparkApp.config.options.debug.enable | Boolean | Whether debug is enabled | true | false |
sparkApp.config.defaultTable.path | String | File path | true | |
sparkApp.config.defaultTable.format | String | File format (parquet, avro, csv) | true | |
sparkApp.config.defaultTable.mode | String | Write mode (overwrite, append, ...) | true | |
sparkApp.config.defaultTable.pks | Seq[String] | Primary keys of the file | true | |
sparkApp.config.defaultTable.includePartitionsAsPk | Boolean | If true, partition columns are included as primary keys | true | |
sparkApp.config.defaultTable.sortPks | Boolean | If true, primary keys are sorted | true | |
sparkApp.config.defaultTable.partitionColumns | Seq[String] | Partition columns | true | |
sparkApp.config.defaultTable.properties | Object[String,Any] | Additional properties, mapped as [String, Any] | true | |
sparkApp.config.tables | Seq[Table] | List of tables (see Note) | true | |
sparkApp.app | Config | Free-form application config (see Note 2) | true | {} |
Note: a Table has two required properties: name and pks. All other properties are the same as for defaultTable; any property that is not set falls back to defaultTable's value. Note 2: the app block is accessible from the child class.
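For reference, a minimal configuration could look like the sketch below. The job name, paths, columns and table entries are placeholder values, not defaults shipped with the library.

sparkApp {
  config {
    name: "example-job"
    spark: { "spark.master": "local[*]" }
    defaultTable: {
      path: "/data/example"
      format: "parquet"
      mode: "overwrite"
      pks: ["id"]
      includePartitionsAsPk: false
      sortPks: true
      partitionColumns: ["date"]
      properties: {}
    }
    tables: [
      { name: "tableName", pks: ["id"] }   # only name and pks are required per table
    ]
  }
  app: {}
}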
import org.apache.spark.sql.SparkSession

class Test(path: String) extends SparkApp(path) {
  override def clazz: Class[_] = this.getClass

  // Job logic goes here (the Int return value is presumably a status code)
  override def execute(spark: SparkSession): Int = ???
}

object Test extends App {
  // Entry point: run the job with the given config file
  new Test("path/to/config/dummy.conf").start()
}
// Alternatively, extend DFOps to work with DataFrames directly
class Test extends DFOps {
  val spark = SparkSession ...
}
- Read a table
override def execute(spark: SparkSession): Int = {
  val df = tables("tableName").read(spark)
  0 // return status
}
- Write a table
override def execute(spark: SparkSession): Int = {
  tables("tableName").write(df)
  0
}
- Update a table
override def execute(spark: SparkSession): Int = {
  tables("tableName").update(df)
  0
}
NOTE: data whose partition value changes must be updated with update instead of updatePartition.
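To illustrate the note above, the sketch below rewrites the value of a partition column and therefore writes the result back with update; the table name and the date partition column are hypothetical.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

class Repartitioner(path: String) extends SparkApp(path) {
  override def clazz: Class[_] = this.getClass

  override def execute(spark: SparkSession): Int = {
    val df = tables("tableName").read(spark)
    // Rewriting the (hypothetical) partition column moves rows between partitions,
    // so the result is written back with update rather than updatePartition.
    val moved = df.withColumn("date", lit("2017-01-01"))
    tables("tableName").update(moved)
    0
  }
}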
- Read a table
class Test extends DFOps {
  val spark = SparkSession ...
  val df = readDF(spark, "path", "csv")
}
- Write a table
class Test extends DFOps {
  val spark = SparkSession ...
  writeDF(df, "path", "parquet", "overwrite")
}
- Update a table
class Test extends DFOps {
  val spark = SparkSession ...
  updateDF(df, "path", "avro")
}
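As a small end-to-end DFOps example, the sketch below reads a CSV file and writes it back as Parquet; the paths and the application name are placeholders.

import org.apache.spark.sql.SparkSession

class CsvToParquet extends DFOps {
  val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

  // Read a CSV file and persist it as Parquet (paths are hypothetical)
  val df = readDF(spark, "/data/in/example.csv", "csv")
  writeDF(df, "/data/out/example", "parquet", "overwrite")
}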
For more details, see the Javadoc.
- Maven - Dependency Management
We use SemVer for versioning. For the versions available, see the tags on this repository.
Feel free to open a pull request.
- Create Examples
- Improve Logger
- Unit Test
- Acceptance Test
- Add reader and writer options
- Review TODOs
- Add SQL Utils for DF and Tables
- More...