Integrating Pivotal HDB with Isilon

Pivotal HDB and HDP can be configured with EMC Isilon® Network Attached Storage for Hadoop HDFS services. This document provides instructions for configuring HAWQ with EMC Isilon running the OneFS operating system.

Overview of Using HAWQ with EMC Isilon

Combining Hadoop with the scalability and capacity of Isilon storage enables you to take advantage of big data analytics quickly and simply. Configuring HDP on EMC Isilon storage clusters greatly increases the amount of data you can use to create business insights.

The HDP-Isilon solution moves HDFS file storage to the Isilon cluster. This configuration makes it easier to run a large Hadoop cluster by simplifying data import and export and by separating disk management from the compute cluster. Using Isilon storage brings high availability of NameNode services to Hadoop by eliminating the HDFS NameNode as a single point of failure. Isilon’s remote replication and snapshot capabilities bring additional enterprise data management features to Hadoop.

Supported Configurations

The following components are required for using Pivotal HDB with EMC Isilon Network Attached Storage.

  • An Isilon storage system with at least 3 nodes, running OneFS 7.2.1.3 or higher
  • Pivotal HDP or Hortonworks HDP version 2.4.x
  • Pivotal HDB 2.0.x. (See the Pivotal HAWQ 1.3.x documentation for earlier integration instructions.)
  • Ambari 2.2.2
  • A licensed version of SmartConnect Advanced
  • DNS services configured to forward SmartConnect Zone name lookups to the Isilon cluster, per Isilon best practices
  • Administrator access to Isilon: privileged commands will be run from the OneFS CLI

See also Preparing the Isilon Configuration to make required changes to your Isilon deployment before installation.

Updated compatibility information may be available at Hadoop Distributions and Products Supported by OneFS.

Preparing the Isilon Configuration

In order to integrate Isilon storage with HDP and HAWQ, you must configure the storage zone that will be exposed via Isilon’s HDFS implementation. Perform these steps on the Isilon cluster before you begin installing the HDB cluster.

Access Pattern:

Set the access pattern for data in Isilon’s HDFS layer to Streaming.

If the HDFS root directory in your environment does not match /ifs/zone1/hadoop in the commands below, specify the correct root directory to set the access pattern.

Check the current state by executing the command:

isi get /ifs/zone1/hadoop

The PERFORMANCE column should read streaming. If streaming is not displayed, execute the following command using the Isilon CLI:

 isi set -R -a streaming -l streaming /ifs/zone1/hadoop
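
To verify the change, run the check again; the PERFORMANCE column should now read streaming:

isi get /ifs/zone1/hadoop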

Isilon HDFS Global Settings:

To review the current global HDFS settings, execute the following command using the Isilon CLI:

isi hdfs settings view

Thread Count: The Server Threads parameter must be set to auto. If it is not set to auto, execute the following command using the Isilon CLI:

 isi hdfs settings modify --server-threads=auto

Block Size: You must set the Isilon HDFS block size, the HAWQ block size, and the HDP block size to exactly the same value. For Isilon, you can use the following command to modify the block size:

 isi hdfs settings modify --default-block-size=128M
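
To confirm the new value, review the global HDFS settings again; the default block size should now report 128M:

isi hdfs settings view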

The HAWQ and HDP block sizes can be configured separately, as described in Post-Install Configuration.

Balanced Network Traffic:

Pivotal recommends that you create two address pools so that namenode traffic can be handled differently than datanode traffic.

In the Isilon Web administration UI, under the Network Configuration tab, create two IP pools on your Isilon cluster:

  • NameNode IP Pool Requirements

    1. Should have 1 IP address for each Isilon 10G Network Interface in the cluster.
    2. Should use the Round Robin connection policy.
    3. Should use static IP allocation.
    4. The SmartConnect zone name for this IP pool is used to point HDP to the Isilon HDFS service.
  • DataNode IP Pool Requirements

    1. Should have at least 3 IP addresses for each Isilon 10G Network Interface in the cluster.
    2. Should use the Round Robin connection policy.
    3. Should use dynamic IP allocation.
  • DataNode Rack Setup

    To separate the network traffic, set Isilon HDFS to use this IP pool as its rack. Execute the following command using the Isilon CLI:

    isi hdfs racks create --client-ip-ranges=<compute-env-ip-range> --ip-pools=<subnet>:<datanode-pool> --rack=/rack0

    For example:

    isi hdfs racks create --client-ip-ranges=192.0.2.1-192.0.2.254 --ip-pools=subnet0:datanodePool --rack=/rack0
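
    To verify the rack definition, you can list the racks now configured on the cluster; a quick optional check:

    isi hdfs racks list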

Assign NameNode & Ambari Server:

In order for Isilon to be recognized by the Ambari server, you must configure the Isilon Zone with the FQDN of your NameNode host and your Ambari Server host. The settings can be seen by viewing the configuration for a zone using the following command:

Note: Replace <zone> with the actual zone name in your environment. Replace <NameNode FQDN> with the SmartConnect FQDN of the NameNode IP Pool.

isi zone zones view <zone>

The two key settings are HDFS Ambari Server and HDFS Ambari Namenode. Set these to hostnames that the Isilon cluster can resolve; in particular, verify that the Isilon cluster can resolve the hostname assigned to the Ambari server.

  • Assign NameNode FQDN Use this command to assign the SmartConnect zone name for the NameNode IP Pool that was created in the previous step:

    isi zone zones modify --hdfs-ambari-namenode=<NameNode FQDN> <zone>
    
  • Assign Ambari Server FQDN Use this command to assign the FQDN of the Ambari server host that you will be using in your environment:

    isi zone zones modify --hdfs-ambari-server=<ambari server FQDN> <zone>
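
  • Verify the Assignments Use this command to confirm that both settings are now visible in the zone configuration (assuming your zone is named zone1):

    isi zone zones view zone1 | grep -i ambari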
    

Create Users and Directories: Isilon provides two scripts, isilon_create_users.sh and isilon_create_directories.sh, to help you create the users and directories that are required to use the Isilon HDFS layer with your Hadoop distribution:

  1. Download the two script files and place them in any directory under /ifs. Pivotal recommends that you use the directory /ifs/isiloncluster1/scripts, where isiloncluster1 is the name of your Isilon cluster. You can use wget to download these files, as shown in these commands:

    wget https://github.com/claudiofahey/isilon-hadoop-tools/raw/master/onefs/isilon_create_users.sh
    wget https://github.com/claudiofahey/isilon-hadoop-tools/raw/master/onefs/isilon_create_directories.sh
    
  2. Use scp to copy these files over to the Isilon cluster:

    scp isilon_create_users.sh root@isilon_node_ip_address:/ifs/isiloncluster1/scripts/
    scp isilon_create_directories.sh root@isilon_node_ip_address:/ifs/isiloncluster1/scripts/
    
  3. Execute the isilon_create_users.sh script to create all required users and groups for the Hadoop services and applications.

    Warning: The isilon_create_users.sh script creates local user and group accounts on your Isilon cluster for Hadoop services. If you are using a directory service such as Active Directory and you want these users and groups to be defined in your directory service, then DO NOT run this script. Instead, refer to the OneFS documentation and EMC Isilon Best Practices for Hadoop Data Storage document (http://www.emc.com/collateral/white-paper/h12877-wp-emc-isilon-hadoop-best-practices.pdf).

    Script Usage:

    isilon_create_users.sh --dist <DIST> [--startgid <GID>] [--startuid <UID>] [--zone <ZONE>]
    

    dist – Your Hadoop distribution. For example: phd3

    startgid – Group IDs begin with this value. For example: 501

    startuid – User IDs begin with this value. This is generally the same value as startgid. For example: 501

    zone – The Access Zone name that you will use. For example: zone1

    Create the local users and groups by using this command:

    bash /ifs/isiloncluster1/scripts/isilon_create_users.sh --dist phd3 --startgid 501 --startuid 501 --zone zone1
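
    To spot-check the result, you can list the local users that were created in the access zone. This is an optional check, and the option syntax may vary by OneFS version:

    isi auth users list --zone=zone1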
    
  4. Execute the isilon_create_directories.sh script to create all required directories with the appropriate ownership and permissions.

    Script Usage:

    isilon_create_directories.sh --dist <DIST> [--fixperm] [--zone <ZONE>]
    

    dist – Your Hadoop distribution. For example: phd3

    fixperm – If specified, ownership and permissions will be set on existing directories.

    zone – Access Zone name. For example: zone1

    Create the directories by using this command:

    bash /ifs/isiloncluster1/scripts/isilon_create_directories.sh --dist phd3 --fixperm --zone zone1
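
    To confirm the directories were created with the expected ownership and permissions, list the HDFS root from the OneFS CLI. A quick check, assuming your HDFS root is /ifs/zone1/hadoop:

    ls -l /ifs/zone1/hadoop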
    
  5. Map the hdfs user to the Isilon superuser. This allows the hdfs user to chown (change ownership of) all files.

    Warning: The commands below restart the HDFS service on your Isilon cluster to ensure that any cached user mapping rules are flushed. Restarting temporarily interrupts any HDFS connections to the Isilon cluster.

    isi zone zones modify --user-mapping-rules="hdfs=>root" --zone zone1
    isi services isi_hdfs_d disable ; isi services isi_hdfs_d enable
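
    After the service restarts, view the zone configuration again to confirm that the mapping rule is in place:

    isi zone zones view zone1 | grep -i mapping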
    

Installing HDP

Follow the instructions in the EMC Isilon Hadoop Starter Kit to install the Hortonworks HDP cluster. Continue with the Pivotal HDB installation instructions below after the HDP cluster is available.

Installing Pivotal HDB

Follow the complete instructions in Install Apache HAWQ using Ambari to install Pivotal HDB 2.0.x using Ambari. During the installation, remember these key points:

  • Node Selection: When using Isilon as the HDFS layer, you install the HAWQ segments and PXF services on all of the worker nodes, even though those nodes do not run DataNode services. HAWQ temporary files must be spread across each local drive available to the worker nodes. For example, if your compute nodes have 6 local drives available, you would use /data1/tmp through /data6/tmp as the spill space for each segment instance.
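
    If you need to adjust the segment spill directories after installation, they are stored as a comma-separated list in the hawq_segment_temp_directory parameter in hawq-site. A minimal sketch using the hawq config utility; the six /dataN/tmp paths are illustrative and must match your actual local drives:

    $ hawq config -c hawq_segment_temp_directory -v "/data1/tmp,/data2/tmp,/data3/tmp,/data4/tmp,/data5/tmp,/data6/tmp"
    $ hawq restart cluster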

  • Block Size: The Isilon HDFS block size, HAWQ block size, and HDP block size must be set to exactly the same value. If you need to change the default block size used by HAWQ after the installation, edit the property below in hdfs-client.xml on all HAWQ nodes in your cluster. For example, a value of 134217728 bytes corresponds to the 128M Isilon setting shown earlier:

    <property>
      <name>dfs.default.blocksize</name>
      <value>134217728</value>
    </property>
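
    After editing hdfs-client.xml on the master, the same change is needed on every HAWQ node. One way to push the file out is the hawq scp utility, followed by a cluster restart; a sketch assuming a text file named hawq_hosts that lists all HAWQ hostnames, and the default install path /usr/local/hawq/etc:

    $ hawq scp -f hawq_hosts /usr/local/hawq/etc/hdfs-client.xml =:/usr/local/hawq/etc/
    $ hawq restart cluster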
    

    To set the block size used by HDP, use the HDFS > Configs section of the Ambari Administration UI.

After installing HDB, perform these steps to validate and configure HAWQ and PXF for use with Isilon:

  1. Log in to the HAWQ master node and execute these commands to ensure that HAWQ is installed:

    $ source /usr/local/hawq/greenplum_path.sh
    $ psql -d postgres
    psql (8.2.15)
    Type "help" for help.
    postgres=#
    
  2. While still on the master node, configure Isilon integration using the commands:

    $ hawq config -c pxf_isilon -v 1 --skipvalidation
    $ hawq restart cluster
    $ psql -d postgres
    postgres=# show pxf_isilon;
    

    The output should now show that pxf_isilon is on:

    pxf_isilon
    ------------
    on
    (1 row)
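
As a final optional check, you can create and query a small table to confirm that HAWQ can write to and read from the Isilon-backed HDFS. A minimal smoke test; the table name is arbitrary:

    postgres=# CREATE TABLE isilon_smoke_test (id int);
    postgres=# INSERT INTO isilon_smoke_test SELECT * FROM generate_series(1, 1000);
    postgres=# SELECT count(*) FROM isilon_smoke_test;
    postgres=# DROP TABLE isilon_smoke_test;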