---
title: Use Apache Phoenix with HBase - Azure HDInsight | Microsoft Docs
description: Learn how to use Apache Phoenix in HDInsight.
services: hdinsight
documentationcenter: ''
author: mumian
manager: jhubbard
editor: cgronlun
ms.assetid: cda0f33b-a2e8-494c-972f-ae0bb482b818
ms.service: hdinsight
ms.custom: hdinsightactive
ms.devlang: na
ms.topic: article
ms.tgt_pltfrm: na
ms.workload: big-data
ms.date: 05/26/2017
ms.author: jgao
---

# Use Apache Phoenix with HBase clusters in HDInsight

To query HBase with SQL instead of raw scans, you can use Apache Phoenix in HDInsight. You use the SQLLine utility to submit SQL statements to Phoenix and view the results.

For more information about Phoenix, see Phoenix in 15 minutes or less and Phoenix in HDInsight for an overview.

For information on the Phoenix grammar, see Phoenix Grammar.

> [!NOTE]
> For Phoenix version information in HDInsight, see What's new in the Hadoop cluster versions provided by HDInsight?.

## Use SQLLine to create a new HBase table

SQLLine is a command-line utility to execute SQL.

### Prerequisites

Before you can use SQLLine, you must have the following:

* An HBase cluster in HDInsight. For information on provisioning an HBase cluster, see Get started with Apache HBase in HDInsight.
* An SSH connection to the HBase cluster. For instructions, see [Connect to HDInsight using SSH][hdinsight-hadoop-linux-use-ssh-unix].

To query HBase, you SSH into the head node and then run queries. These queries need to communicate with the ZooKeeper quorum.

### Locating your ZooKeeper host names

  1. Open Ambari by browsing to `https://<ClusterName>.azurehdinsight.net`.
  2. Enter the HTTP (cluster) user name and password to log in.
  3. Select **ZooKeeper** from the left menu. Three ZooKeeper servers are listed.
  4. Select one of the listed ZooKeeper servers. On the **Summary** pane, find the **Hostname**. It is similar to `zk1-jdolehb.3lnng4rcvp5uzokyktxs4a5dhd.bx.internal.cloudapp.net`.
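As an alternative to clicking through the UI, you can also pull the ZooKeeper host names from the Ambari REST API. The following is a minimal sketch, not part of the official walkthrough: the cluster name is a placeholder, and the remote call is guarded behind `RUN_REMOTE` so the snippet can be run without a live cluster.

```bash
# Sketch: list ZooKeeper server host names via the Ambari REST API.
# CLUSTERNAME is a placeholder -- substitute your own cluster name.
CLUSTERNAME="myhbasecluster"
URL="https://${CLUSTERNAME}.azurehdinsight.net/api/v1/clusters/${CLUSTERNAME}/services/ZOOKEEPER/components/ZOOKEEPER_SERVER"

# Guarded so the sketch can run without a live cluster; set RUN_REMOTE=1
# on a real cluster (curl prompts for the admin password).
if [ -n "${RUN_REMOTE:-}" ]; then
  curl -u admin -sS "$URL" | grep -o '"host_name" *: *"[^"]*"'
fi
echo "$URL"
```

The response lists one `host_name` entry per ZooKeeper server, matching what the Ambari Summary pane shows.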

To show how you can use HDInsight HBase and Phoenix to analyze big data stores, this walkthrough uses HBase and Phoenix to do sales analysis for an imaginary online bookstore.

## Upload sample data to HDInsight cluster storage account

To begin, upload the sample data used for the remaining exercises. Read more about various ways to upload data for Hadoop jobs in HDInsight.

  1. Browse to the HBase HDInsight cluster in the Azure portal.

  2. On the left-hand menu, select **Storage accounts**.

  3. In the **Storage accounts** pane, select your Azure Blob storage or Azure Data Lake Store account.

  4. Browse to the /example/data folder:

     * If you are using Azure Blob storage, the /example/data folder is located within your cluster's container.
     * If you are using Azure Data Lake Store, use the data explorer to navigate to /clusters/hbase/example/data.

  5. Upload the sample weblogs.csv file to this location.

## Bulk load data into HBase using Phoenix

Phoenix provides two ways to bulk load data into HBase: the PSQL command-line utility, a single-threaded client loading tool, and a MapReduce bulk load utility. PSQL is suited for gigabytes of data, whereas the MapReduce loader is used for much larger data volumes.
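For the larger-volume path, Phoenix ships the MapReduce loader as the class `org.apache.phoenix.mapreduce.CsvBulkLoadTool`. Below is a minimal sketch of an invocation, not a step in this walkthrough: the jar path assumes the HDP layout used elsewhere in this article, and the command is guarded behind `RUN_REMOTE` so the snippet can run without a cluster.

```bash
# Sketch: Phoenix MapReduce CSV bulk load (for data too large for psql.py).
# Jar path assumes the HDP layout; --table/--input match this walkthrough.
JAR="/usr/hdp/current/phoenix-client/phoenix-client.jar"
CMD="hadoop jar ${JAR} org.apache.phoenix.mapreduce.CsvBulkLoadTool --table WEBLOGS --input /example/data/weblogs.csv"

# Guarded so the sketch can run without a live cluster; set RUN_REMOTE=1
# from an SSH session on the cluster to actually submit the job.
if [ -n "${RUN_REMOTE:-}" ]; then
  $CMD
fi
echo "$CMD"
```

Unlike psql.py, the input path here is an HDFS path, so the weblogs.csv file uploaded earlier can be loaded without copying it to local disk first.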

To bulk upload data into HBase by using the Phoenix PSQL command-line tool, follow these steps:

  1. SSH into your HBase cluster. For more information, see Use SSH with HDInsight.

  2. From the SSH session, run the following commands to start SQLLine:

     ```bash
     cd /usr/hdp/current/phoenix-client/bin
     ./sqlline.py <ZooKeeperHostname>:2181:/hbase-unsecure
     ```
  3. Run the following command to create the weblogs table:

     ```sql
     CREATE TABLE weblogs (
       transactionid varchar(500) PRIMARY KEY,
       "t1".transactiondate Date NULL,
       "t1".customerid varchar(50) NULL,
       "t1".bookid varchar(50) NULL,
       "t1".purchasetype varchar(50) NULL,
       "t1".orderid varchar(50) NULL,
       "t1".bookname varchar(50) NULL,
       "t1".categoryname varchar(50) NULL,
       "t1".invoicenumber varchar(50) NULL,
       "t1".invoicestatus varchar(50) NULL,
       "t1".city varchar(50) NULL,
       "t1".state varchar(50) NULL,
       "t2".paymentamount DOUBLE NULL,
       "t2".quantity INTEGER NULL,
       "t2".shippingamount DOUBLE NULL);
     ```

     > [!NOTE]
     > This query creates a weblogs table with two column families, t1 and t2. Column families are stored in separate HFiles, so it makes sense to give frequently queried data its own column family. The paymentamount, quantity, and shippingamount columns are queried often, so they are placed in a separate column family.

  4. Run the following command in a new Hadoop command-line window:

     ```bash
     cd /usr/hdp/current/phoenix-client/bin
     ```

  5. Copy the weblogs.csv file from your storage account to the local temp directory by using HDFS:

     ```bash
     hdfs dfs -copyToLocal /example/data/weblogs.csv /tmp/
     ```

  6. Use PSQL to bulk insert the rows into the new HBase table:

     ```bash
     ./psql.py -t WEBLOGS [The FQDN of one of the Zookeepers] /tmp/weblogs.csv
     ```

The preceding command runs the PSQL client utility to bulk upload data into the weblogs table. It takes three parameters: the table name (-t), the ZooKeeper FQDN, and the path of the .csv file to load.

Once the PSQL operation is complete, you should see output in your command window similar to the following:

```
CSV Upsert complete. 278972 rows upserted
Time: 64.549 sec(s)
```
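To sanity-check the load, you can run a row count through SQLLine non-interactively: sqlline.py accepts a SQL file as a second argument. The following is a sketch only; the ZooKeeper host name is a placeholder, and the remote call is guarded behind `RUN_REMOTE` so the snippet can run without a cluster.

```bash
# Sketch: verify the bulk load by counting rows through SQLLine.
ZK="zk1-example.internal.cloudapp.net"   # placeholder ZooKeeper host name
echo 'SELECT COUNT(*) FROM weblogs;' > /tmp/rowcount.sql

# Guarded so the sketch can run without a live cluster.
if [ -n "${RUN_REMOTE:-}" ]; then
  /usr/hdp/current/phoenix-client/bin/sqlline.py "${ZK}:2181:/hbase-unsecure" /tmp/rowcount.sql
fi
```

If the load succeeded, the count should match the number of rows PSQL reported as upserted.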

## Perform data retrieval queries with SQLLine

Close the extra Hadoop command-line window you opened to run the PSQL script, and switch back to the one running SQLLine. If you closed it, repeat the earlier steps to open SQLLine again.

Execute the following query to select the book name and quantity sold of books purchased between January and March:

```sql
SELECT bookname, SUM(quantity) AS QuantitySold FROM weblogs
WHERE purchasetype = 'Purchased' AND MONTH(transactiondate) >= 1 AND MONTH(transactiondate) <= 3
GROUP BY bookname;
```

You should see output similar to the following:

```
+---------------------------------------+---------------+
|               BOOKNAME                | QUANTITYSOLD  |
+---------------------------------------+---------------+
| Advances in school psychology         | 90233         |
| History of political economy          | 93772         |
| New Christian poetry                  | 90467         |
| Science in Dispute                    | 92221         |
| Space  fact and fiction               | 95544         |
| THE BOOK OF WITNESSES                 | 93753         |
| The adventures of Arthur Conan Doyle  | 93710         |
| The voyages of Captain Cook           | 94207         |
| Understanding American politics       | 91119         |
+---------------------------------------+---------------+
9 rows selected (21.942 seconds)
```
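Other aggregations over the same table follow the same pattern. As one hedged example (column names as created earlier; the totals depend on your data), the following sketch writes a query that sums payment amounts by city to a SQL file and feeds it to sqlline.py non-interactively. The ZooKeeper host is a placeholder, and the remote call is guarded behind `RUN_REMOTE`.

```bash
# Sketch: a follow-up aggregation, run non-interactively through SQLLine.
ZK="zk1-example.internal.cloudapp.net"   # placeholder ZooKeeper host name
cat > /tmp/payments.sql <<'EOF'
SELECT city, SUM(paymentamount) AS TotalPayments
FROM weblogs
WHERE purchasetype = 'Purchased'
GROUP BY city;
EOF

# Guarded so the sketch can run without a live cluster.
if [ -n "${RUN_REMOTE:-}" ]; then
  /usr/hdp/current/phoenix-client/bin/sqlline.py "${ZK}:2181:/hbase-unsecure" /tmp/payments.sql
fi
```

Because paymentamount lives in its own column family (t2), queries like this one touch fewer HFiles than queries that also read the t1 columns.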

For more information, see SQLLine manual and Phoenix Grammar.

## Next steps

In this article, you have learned how to use Apache Phoenix in HDInsight. To learn more, see: