diff --git a/doc/dataframe.md b/doc/dataframe.md
index 514f69ab..2ac4e6ad 100644
--- a/doc/dataframe.md
+++ b/doc/dataframe.md
@@ -75,7 +75,7 @@ It is used by spark-redis internally when reading DataFrame back to Spark memory
 
 ### Specifying Redis key
 
-By default, spark-redis generates UUID identifier for each row to ensure
+By default, spark-redis generates a UUID identifier for each row to ensure
 their uniqueness. However, you can also provide your own column as a key. This is controlled with `key.column` option:
 
 ```scala
@@ -157,7 +157,7 @@ df.write
 
 ### Persistence model
 
-By default, DataFrames are persisted as Redis Hashes. It allows to write data with Spark and query from non-Spark environment.
+By default, DataFrames are persisted as Redis Hashes. This allows you to write data with Spark and query it from a non-Spark environment.
 It also enables projection query optimization when only a small subset of columns are selected.
 On the other hand, there is currently a limitation with Hash model - it doesn't support nested DataFrame schema. One option to overcome it is making your DataFrame schema flat.
 If it is not possible due to some constraints, you may consider using Binary persistence model.
diff --git a/doc/getting-started.md b/doc/getting-started.md
index 600ccf15..a6ebcdd9 100644
--- a/doc/getting-started.md
+++ b/doc/getting-started.md
@@ -20,25 +20,21 @@ cd spark-redis
 mvn clean package -DskipTests
 ```
 
-## Using the library
-Add Spark-Redis to Spark with the `--jars` command line option. For example, use it from spark-shell, include it in the following manner:
+### Using the library with spark-shell
+Add Spark-Redis to Spark with the `--jars` command line option. For example, to use it from spark-shell:
-```
+```bash
 $ bin/spark-shell --jars <path>/spark-redis-<version>-jar-with-dependencies.jar
+```
 
-Welcome to
-      ____              __
-     / __/__  ___ _____/ /__
-    _\ \/ _ \/ _ `/ __/ '_/
-   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
-      /_/
+By default, it connects to `localhost:6379` without any password. You can change the connection settings in the following manner:
 
-Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
+```bash
+$ bin/spark-shell --jars <path>/spark-redis-<version>-jar-with-dependencies.jar --conf "spark.redis.host=localhost" --conf "spark.redis.port=6379" --conf "spark.redis.auth=passwd"
 ```
 
-The following sections contain code snippets that demonstrate the use of Spark-Redis. To use the sample code, you'll need to replace `your.redis.server` and `6379` with your Redis database's IP address or hostname and port, respectively.
-### Configuring Connections to Redis using SparkConf
+### Configuring the connection to Redis in a self-contained application
 
 Below is an example configuration of SparkContext with redis configuration:
 
@@ -47,21 +43,33 @@ import com.redislabs.provider.redis._
 ...
 
-sc = new SparkContext(new SparkConf()
+val sc = new SparkContext(new SparkConf()
       .setMaster("local")
       .setAppName("myApp")
-      // initial redis host - can be any node in cluster mode
       .set("spark.redis.host", "localhost")
-      // initial redis port
       .set("spark.redis.port", "6379")
-      // optional redis AUTH password
-      .set("spark.redis.auth", "")
+      .set("spark.redis.auth", "passwd")
   )
 ```
 
+The SparkSession can be configured in a similar manner:
+
+```scala
+val spark = SparkSession
+  .builder()
+  .appName("myApp")
+  .master("local[*]")
+  .config("spark.redis.host", "localhost")
+  .config("spark.redis.port", "6379")
+  .config("spark.redis.auth", "passwd")
+  .getOrCreate()
+
+val sc = spark.sparkContext
+```
+
 ### Create RDD
 
 ```scala
@@ -83,6 +91,8 @@ df.write
 
 ### Create Stream
 
 ```scala
+import com.redislabs.provider.redis._
+
 val ssc = new StreamingContext(sc, Seconds(1))
 val redisStream = ssc.createRedisStream(Array("foo", "bar"),
   storageLevel = StorageLevel.MEMORY_AND_DISK_2)
diff --git a/doc/python.md b/doc/python.md
index e0138cf3..ff124bab 100644
--- a/doc/python.md
+++ b/doc/python.md
@@ -8,9 +8,16 @@ Here is an example:
 
 1. Run `pyspark` providing the spark-redis jar file
 
 ```bash
-$ ./bin/pyspark --jars /your/path/to/spark-redis-<version>-jar-with-dependencies.jar
+$ ./bin/pyspark --jars <path>/spark-redis-<version>-jar-with-dependencies.jar
 ```
 
+By default, it connects to `localhost:6379` without any password. You can change the connection settings in the following manner:
+
+```bash
+$ ./bin/pyspark --jars <path>/spark-redis-<version>-jar-with-dependencies.jar --conf "spark.redis.host=localhost" --conf "spark.redis.port=6379" --conf "spark.redis.auth=passwd"
+```
+
+
 2. Read DataFrame from json, write/read from Redis:
 ```python
 df = spark.read.json("examples/src/main/resources/people.json")
@@ -19,7 +26,7 @@ loadedDf = spark.read.format("org.apache.spark.sql.redis").option("table", "peop
 loadedDf.show()
 ```
 
-2. Check the data with redis-cli:
+3. Check the data with redis-cli:
 
 ```bash
 127.0.0.1:6379> hgetall people:Justin
@@ -29,3 +36,16 @@ loadedDf.show()
 1) "age"
 2) "19"
 3) "name"
 4) "Justin"
 ```
 
+The self-contained application can be configured in the following manner:
+
+```python
+spark = SparkSession \
+    .builder \
+    .appName("myApp") \
+    .config("spark.redis.host", "localhost") \
+    .config("spark.redis.port", "6379") \
+    .config("spark.redis.auth", "passwd") \
+    .getOrCreate()
+```
+