---
title: Use Apache Phoenix with HBase - Azure HDInsight | Microsoft Docs
description: Learn how to use Apache Phoenix in HDInsight.
services: hdinsight
documentationcenter: ''
author: mumian
manager: jhubbard
editor: cgronlun
ms.assetid: cda0f33b-a2e8-494c-972f-ae0bb482b818
ms.service: hdinsight
ms.custom: hdinsightactive
ms.devlang: na
ms.topic: article
ms.tgt_pltfrm: na
ms.workload: big-data
ms.date: 05/26/2017
ms.author: jgao
---
Apache Phoenix in HDInsight lets you query HBase by using SQL instead of raw scans. You use the SQLLine utility to submit SQL statements to Phoenix and view the results.
For more information about Phoenix, see Phoenix in 15 minutes or less, and see Phoenix in HDInsight for an overview.
For information about the Phoenix grammar, see Phoenix Grammar.
> [!NOTE]
> For the Phoenix version information in HDInsight, see What's new in the Hadoop cluster versions provided by HDInsight?.
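As a quick taste of the Phoenix grammar before the walkthrough below, the following sketch creates a table, upserts a row, and queries it from SQLLine. The inventory table is a hypothetical example and is not used later in this article.

```sql
-- Hypothetical example table; Phoenix maps the primary key to the HBase row key.
CREATE TABLE IF NOT EXISTS inventory (
    id VARCHAR PRIMARY KEY,
    name VARCHAR,
    stock INTEGER);

-- Phoenix uses UPSERT in place of separate INSERT and UPDATE statements.
UPSERT INTO inventory VALUES ('b001', 'widget', 42);

-- Standard SQL SELECTs compile down to HBase scans under the covers.
SELECT name, stock FROM inventory WHERE stock > 10;
```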
SQLLine is a command-line utility for executing SQL.
Before you can use SQLLine, you must have the following:
- An HBase cluster in HDInsight. For information on provisioning an HBase cluster, see Get started with Apache HBase in HDInsight.
- An SSH connection to the HBase cluster. For instructions, see [Connect to HDInsight using SSH][hdinsight-hadoop-linux-use-ssh-unix].
To query HBase, you SSH to the head node and run queries from there. These queries need to communicate with the ZooKeeper quorum, so you first need the hostname of a ZooKeeper node. To find it, follow these steps in the Ambari web UI, or use the REST API sketch after the list:
- Open Ambari by browsing to `https://<ClusterName>.azurehdinsight.net`.
- Enter the HTTP (cluster) username and password to log in.
- Click ZooKeeper from the left menu. You see three ZooKeeper Servers listed.
- Click one of the ZooKeeper Servers listed. On the Summary pane, find the Hostname. It is similar to `zk1-jdolehb.3lnng4rcvp5uzokyktxs4a5dhd.bx.internal.cloudapp.net`.
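If you prefer the command line, a sketch like the following uses the Ambari REST API to list the ZooKeeper Server hosts instead of clicking through the web UI. The admin account name, the exact JSON formatting that the grep pattern matches, and the placeholders are assumptions for illustration.

```bash
# Sketch: list ZooKeeper Server host names through the Ambari REST API.
# Replace CLUSTERNAME and PASSWORD with your cluster name and HTTP password.
curl -u admin:PASSWORD -sS \
    "https://CLUSTERNAME.azurehdinsight.net/api/v1/clusters/CLUSTERNAME/services/ZOOKEEPER/components/ZOOKEEPER_SERVER" \
    | grep -o '"host_name" : "[^"]*"'
```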
To show how you can use HDInsight HBase and Phoenix to analyze big data stores, this walkthrough uses Phoenix to do sales analysis for an imaginary online book store.
To begin, upload the sample data used in the remaining exercises. Read more about the various ways to upload data for Hadoop jobs in HDInsight.
- Browse to the HBase HDInsight cluster in the Azure portal.
- On the left-hand menu, select Storage accounts.
- Select your Azure Blob Storage or Azure Data Lake Store account listed in the Storage accounts pane.
- Browse to the `/example/data` folder. If you are using Azure Blob Storage, the `/example/data` folder is located within your cluster's container. In Azure Data Lake Store, use the data explorer to navigate to `/clusters/hbase/example/data`.
- Upload the sample weblogs.csv file to this location. (An Azure CLI alternative is sketched after this list.)
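If you would rather upload the file from a terminal than through the portal, a sketch like the following uses the Azure CLI. ACCOUNT and CONTAINER are placeholders for your storage account and the cluster's default container, and authentication options are omitted; this applies to Azure Blob Storage only.

```bash
# Sketch: upload the sample CSV to the cluster's default container.
# Authentication flags (for example, --account-key) are omitted here.
az storage blob upload \
    --account-name ACCOUNT \
    --container-name CONTAINER \
    --name example/data/weblogs.csv \
    --file ./weblogs.csv
```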
Phoenix provides two ways to bulk load data into HBase: the PSQL command-line utility, which is a single-threaded client loading tool, and a MapReduce bulk loading utility. PSQL is suited for gigabytes of data, whereas MapReduce is used for much larger data volumes.
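This walkthrough uses PSQL. For reference, a MapReduce invocation would look roughly like the following sketch, which uses Phoenix's CsvBulkLoadTool; the client jar path and classpath setup are assumptions that vary by Phoenix and HDP version, so treat this as an outline rather than a tested command.

```bash
# Sketch only: bulk load a CSV with Phoenix's MapReduce CsvBulkLoadTool.
# The jar path and HBase classpath below vary by Phoenix/HDP version.
HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar \
    /usr/hdp/current/phoenix-client/phoenix-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table WEBLOGS \
    --input /example/data/weblogs.csv
```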
To bulk upload data into HBase by using the Phoenix PSQL command-line tool, follow these steps:
- SSH into your HBase cluster. For more information, see Use SSH with HDInsight.

- From SSH, run the following commands to start SQLLine:

  ```bash
  cd /usr/hdp/current/phoenix-client/bin
  ./sqlline.py <ZooKeeperHostname>:2181:/hbase-unsecure
  ```

- Run the following command to create the weblogs table:

  ```sql
  CREATE TABLE weblogs (
      transactionid varchar(500) PRIMARY KEY,
      "t1".transactiondate DATE NULL,
      "t1".customerid varchar(50) NULL,
      "t1".bookid varchar(50) NULL,
      "t1".purchasetype varchar(50) NULL,
      "t1".orderid varchar(50) NULL,
      "t1".bookname varchar(50) NULL,
      "t1".categoryname varchar(50) NULL,
      "t1".invoicenumber varchar(50) NULL,
      "t1".invoicestatus varchar(50) NULL,
      "t1".city varchar(50) NULL,
      "t1".state varchar(50) NULL,
      "t2".paymentamount DOUBLE NULL,
      "t2".quantity INTEGER NULL,
      "t2".shippingamount DOUBLE NULL);
  ```
  Note: The preceding query creates a weblogs table with two column families, t1 and t2. Column families are stored in separate HFiles, so it makes sense to use a separate column family for data that is queried often. The paymentamount, quantity, and shippingamount columns will be queried often, so they are placed in a different column family.
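  To confirm that the table was created, you can optionally run SQLLine's built-in commands at the same prompt. This check is not part of the original steps.

  ```sql
  !tables
  !describe WEBLOGS
  ```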
- Open a new Hadoop command-line window (a second SSH connection to the cluster), leaving the SQLLine session open, and run the following command:

  ```bash
  cd /usr/hdp/current/phoenix-client/bin
  ```

- Next, copy the weblogs.csv file from your storage account to the local temp directory via `hdfs`:

  ```bash
  hdfs dfs -copyToLocal /example/data/weblogs.csv /tmp/
  ```
- Finally, use PSQL to bulk insert the rows into the new HBase table:

  ```bash
  ./psql.py -t WEBLOGS [The FQDN of one of the Zookeepers] /tmp/weblogs.csv
  ```

  This command runs the PSQL client utility to bulk upload data into the weblogs table. It takes three parameters: the table name (-t), the ZooKeeper FQDN, and the path of the CSV file to bulk load from.
Once the PSQL operation is complete, you should see output in your command window similar to the following:

```
CSV Upsert complete. 278972 rows upserted
Time: 64.549 sec(s)
```
Close the extra Hadoop command-line window that you opened to run the PSQL script, and switch back to the window running SQLLine. If you closed that window, follow the earlier steps to open SQLLine again.
Execute the following query to select the book name and quantity sold of books purchased between January and March:

```sql
SELECT bookname, SUM(quantity) AS QuantitySold FROM weblogs
WHERE purchasetype = 'Purchased' AND MONTH(transactiondate) >= 1 AND MONTH(transactiondate) <= 3
GROUP BY bookname;
```
You should see output similar to the following:

```
+---------------------------------------+---------------+
|               BOOKNAME                | QUANTITYSOLD  |
+---------------------------------------+---------------+
| Advances in school psychology         | 90233         |
| History of political economy          | 93772         |
| New Christian poetry                  | 90467         |
| Science in Dispute                    | 92221         |
| Space fact and fiction                | 95544         |
| THE BOOK OF WITNESSES                 | 93753         |
| The adventures of Arthur Conan Doyle  | 93710         |
| The voyages of Captain Cook           | 94207         |
| Understanding American politics       | 91119         |
+---------------------------------------+---------------+
9 rows selected (21.942 seconds)
```
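As a further exercise (a hypothetical follow-up, not part of the original walkthrough), you could aggregate over the t2 column family, for example totaling payment and shipping amounts by state, assuming the same weblogs schema created earlier:

```sql
-- Hypothetical follow-up query against the weblogs schema from this article.
SELECT state,
       SUM(paymentamount)  AS TotalPayment,
       SUM(shippingamount) AS TotalShipping
FROM weblogs
WHERE purchasetype = 'Purchased'
GROUP BY state
ORDER BY SUM(paymentamount) DESC;
```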
For more information, see the SQLLine manual and the Phoenix Grammar.
In this article, you have learned how to use Apache Phoenix in HDInsight. To learn more, see:
- HDInsight HBase overview: HBase is an open-source Apache NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semistructured data.
- Provision HBase clusters on Azure Virtual Network: With virtual network integration, HBase clusters can be deployed to the same virtual network as your applications so that applications can communicate with HBase directly.
- Configure HBase replication in HDInsight: Learn how to configure HBase replication across two Azure datacenters.
- Analyze Twitter sentiment with HBase in HDInsight: Learn how to do real-time sentiment analysis of big data by using HBase in a Hadoop cluster in HDInsight.