Install Apache HAWQ using Ambari
- Install a compatible version of HDP and Ambari, and ensure that your HDP system is fully functional. See the Release Notes for more information about HDFS compatibility.
- Select and prepare all host machines that will run the HAWQ and PXF services. See Select HAWQ Host Machines.
- Log in to the Ambari server host machine as the
Create a staging directory where you will download and extract the tarballs for HAWQ and the HAWQ Ambari plug-in. The staging directory and all the directories above it must be readable and executable by the system user that runs the httpd process (typically apache). Make the directory readable and executable by all users. For example:
$ mkdir /staging
$ chmod a+rx /staging
Note: Do not use the /tmp directory as a staging directory because files under /tmp can be removed at any time.
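The permission requirement above applies to every parent directory as well, so it can be worth checking the whole path before continuing. The following is a minimal sketch, not part of the HAWQ or Ambari tooling: check_world_rx is a hypothetical helper name, and the check assumes that ls -ld prints a conventional ten-character mode string.

```shell
# Sketch: confirm that a directory and each of its ancestors grant
# read and execute permission to "other" users.
# check_world_rx is a hypothetical helper, not a HAWQ or Ambari command.
check_world_rx() (
    dir=$1
    while :; do
        # The first ten characters of ls -ld are the mode, e.g. drwxr-xr-x
        mode=$(ls -ld "$dir" | cut -c1-10)
        case $mode in
            d??????r?[xt]) ;;   # "other" has read and execute (t = sticky + x)
            *) echo "not world r-x: $dir ($mode)"; exit 1 ;;
        esac
        # Stop once the top of the path has been checked
        case $dir in
            /|.) break ;;
        esac
        dir=$(dirname "$dir")
    done
    echo "ok"
)
```

For example, `check_world_rx /staging` prints ok when the whole path is accessible, or names the first directory that is not.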
Download the required Pivotal software tarball files from Pivotal Network, saving them into the staging directory that you just created. The required tarball files are:
| Stack Name | Filename | Description |
|---|---|---|
| HDB-126.96.36.199 | hdb-188.8.131.52-<build>.tar.gz | Pivotal HDB is a parallel SQL query engine that includes features from Apache HAWQ (Incubating) such as PXF. |
| HDB-AMBARI-PLUGIN-2.0.0 | hdb-ambari-plugin-2.0.0-hdp-<build>.tar.gz | The HAWQ plug-in provides Ambari installation and monitoring functionality for Apache HAWQ (Incubating). |
Extract each tarball file into the staging directory:
$ tar -xvzf /staging/hdb-184.108.40.206-*.tar.gz -C /staging/
$ tar -xvzf /staging/hdb-ambari-plugin-2.0.0-hdp-*.tar.gz -C /staging/
Install and/or run the HTTP service if it is not already running:
$ yum install httpd
$ service httpd start
Each tarball is an archived yum repository and has a setup_repo.sh script. The script creates a symlink from the document root of the httpd server (/var/www/html) to the directory where the tarball was extracted. On the host that will be used as a YUM repo, execute the setup_repo.sh script that is shipped as a part of each tarball file:
$ cd /staging/hdb*
$ ./setup_repo.sh
$ cd /staging/hdb-ambari-plugin*
$ ./setup_repo.sh
Update the Yum repo to install the HAWQ Ambari plug-in:
$ yum install hdb-ambari-plugin
Restart the Ambari server:
$ ambari-server restart
Note: You must restart the Ambari server before proceeding. The above command restarts only the Ambari server, but leaves other Hadoop services running.
If you have already installed an HDP cluster and are adding HDB to the existing cluster, execute the following script to add the HDB repository to the Ambari server. (This step is not required if you are installing a new HDP cluster and HDB at the same time.)
$ cd /var/lib/hawq
$ ./add_hdb_repo.py -u admin -p admin
Note: Substitute the correct Ambari administrator password for your system.
Access the Ambari web console at http://ambari.server.hostname:8080, and log in as the “admin” user. (The default password is also “admin”.) Verify that the HDB component is now available.
Select HDFS, then click the Configs tab.
Customize the HDFS configuration by following these steps:
- On the Settings tab, change the DataNode setting DataNode max data transfer threads (dfs.datanode.max.transfer.threads parameter) to 40960.
- Select the Advanced tab and expand DataNode. Ensure that the DataNode directories permission (dfs.datanode.data.dir.perm parameter) is set to 750.
- Expand the General tab and change the Access time precision (dfs.namenode.accesstime.precision parameter) to 0.
Expand Advanced hdfs-site. Set the following properties to their indicated values.
Note: If a property described below does not appear in the Ambari UI, select Custom hdfs-site and click Add property… to add the property definition and set it to the indicated value.
| Property | Setting |
|---|---|
| dfs.allow.truncate | true |
| dfs.block.access.token.enable | false for an unsecured HDFS cluster, or true for a secure cluster |
| dfs.block.local-path-access.user | gpadmin |
| HDFS Short-circuit read (dfs.client.read.shortcircuit) | true |
| dfs.client.socket-timeout | 300000000 |
| dfs.client.use.legacy.blockreader.local | false |
| dfs.datanode.handler.count | 60 |
| dfs.datanode.socket.write.timeout | 7200000 |
| dfs.namenode.handler.count | 600 |
| dfs.support.append | true |
Note: HAWQ requires that you enable dfs.allow.truncate. The HAWQ service will fail to start if dfs.allow.truncate is not set to “true.”
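After saving and restarting HDFS, the effective values can be compared against the hdfs-site settings listed for HAWQ. The sketch below assumes the hdfs command is on the PATH of the host where it runs and relies on `hdfs getconf -confKey`, which prints the value seen by the client configuration on that host. The names get_conf and check_hdfs_props are hypothetical helpers. dfs.block.access.token.enable is left out because its expected value depends on whether the cluster is secured.

```shell
# Sketch: compare effective HDFS settings with the values HAWQ expects.
# get_conf and check_hdfs_props are hypothetical helper names.

# Print the effective value of one configuration key (empty if unavailable).
get_conf() { hdfs getconf -confKey "$1" 2>/dev/null; }

check_hdfs_props() (
    status=0
    # Each line of the here-document is "<property> <expected value>"
    while read -r key want; do
        [ -z "$key" ] && continue
        have=$(get_conf "$key")
        if [ "$have" != "$want" ]; then
            echo "MISMATCH $key: expected $want, found ${have:-<unset>}"
            status=1
        fi
    done <<'EOF'
dfs.allow.truncate true
dfs.block.local-path-access.user gpadmin
dfs.client.read.shortcircuit true
dfs.client.socket-timeout 300000000
dfs.client.use.legacy.blockreader.local false
dfs.datanode.handler.count 60
dfs.datanode.socket.write.timeout 7200000
dfs.namenode.handler.count 600
dfs.support.append true
EOF
    exit $status
)
```

Running `check_hdfs_props` returns a non-zero status and lists each mismatched property, so it can gate a scripted installation. Server-side values on other hosts may differ from what the local client configuration reports.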
Expand Advanced core-site, then set the following properties to their indicated values:
Note: If a property described below does not appear in the Ambari UI, select Custom core-site and click Add property… to add the property definition and set it to the indicated value.
| Property | Setting |
|---|---|
| ipc.client.connection.maxidletime | 3600000 |
| ipc.client.connect.timeout | 300000 |
| ipc.server.listen.queue.size | 3300 |
Click Save and enter a name for the configuration change (for example, HAWQ prerequisites). Click Save again, then OK.
If Ambari indicates that a service must be restarted, click Restart and allow the service to restart before you continue.
Select Actions > Add Service on the home page.
Select both HAWQ and PXF from the list of services, then click Next to display the Assign Masters page.
Select the hosts that should run the HAWQ Master and HAWQ Standby Master, or accept the defaults. The HAWQ Master and HAWQ Standby Master must reside on separate hosts. Click Next to display the Assign Slaves and Clients page.
Note: Only the HAWQ Master and HAWQ Standby Master entries are configurable on this page. NameNode, SNameNode, ZooKeeper and others may be displayed for reference, but they are not configurable when adding the HAWQ and PXF services.
Note: The HAWQ Master component must not reside on the same host that is used for Hive Metastore if the Hive Metastore uses the new PostgreSQL database. This is because both services attempt to use port 5432. If it is required to co-locate these components on the same host, provision a PostgreSQL database beforehand on a port other than 5432 and choose the “Existing PostgreSQL Database” option for the Hive Metastore configuration. The same restriction applies to the admin host, because neither the HAWQ Master nor the Hive Metastore can run on the admin host where the Ambari Server is installed.
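Because the HAWQ Master and a new PostgreSQL-backed Hive Metastore both default to port 5432, it can help to check whether something is already listening on that port on a candidate host. The following is a sketch that assumes a Linux host with the ss utility from iproute2; port_listening is a hypothetical helper that reads `ss -ltn` output on standard input.

```shell
# Sketch: detect whether a TCP port appears as a listener in `ss -ltn` output.
# port_listening is a hypothetical helper name.
# It reads the listing on stdin and succeeds if the port is found.
port_listening() {
    awk -v port="$1" '
        # Column 4 is "Local Address:Port"; skip the header line
        NR > 1 && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}
```

A typical call is `ss -ltn | port_listening 5432 && echo "port 5432 is already in use"`, run on the host being considered for the HAWQ Master.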
On the Assign Slaves and Clients page, choose the hosts that will run HAWQ segments and PXF, or accept the defaults. The Add Service Wizard automatically selects hosts for the HAWQ and PXF services based on available Hadoop services.
Note: PXF must be installed on the HDFS NameNode, the Standby NameNode (if configured), and on each HDFS DataNode. A HAWQ segment must be installed on each HDFS DataNode.
Click Next to continue.
On the Customize Services page, the Settings tab configures basic properties of the HAWQ cluster. In most cases you can accept the default values provided on this page. Several configuration options may require attention depending on your deployment:
- HAWQ Master Directory, HAWQ Segment Directory: This specifies the base path for the HAWQ master or segment data directory.
- HAWQ Master Temp Directories, HAWQ Segment Temp Directories: HAWQ temporary directories are used for spill files. Enter one or more directories in which the HAWQ Master or a HAWQ segment stores these temporary files. Separate multiple directories with a comma. Any directories that you specify must already be available on all host machines. If you do not specify master or segment temporary directories, temporary files are stored in
As a best practice, use multiple master and segment temporary directories on separate, large disks (2TB or greater) to load balance writes to temporary files (for example, /disk1/tmp,/disk2/tmp). For a given query, HAWQ will use a separate temp directory (if available) for each virtual segment to store spill files. Multiple HAWQ sessions will also use separate temp directories where available to avoid disk contention. If you configure too few temp directories, or you place multiple temp directories on the same disk, you increase the risk of disk contention or running out of disk space when multiple virtual segments target the same disk. Each HAWQ segment node can have 6 virtual segments.
Resource Manager: Select the resource manager to use for allocating resources in your HAWQ cluster. If you choose Standalone, HAWQ exclusively uses resources from the whole cluster. If you choose YARN, HAWQ contacts the YARN resource manager to negotiate resources. You can change the resource manager type after the initial installation. If you choose YARN, you will also have to configure some YARN-related properties in a later step of this procedure. For more information on using YARN to manage resources, see Managing Resources.
Caution: If you are installing HAWQ in secure mode (Kerberos-enabled), then set Resource Manager to Standalone to avoid encountering a known installation issue. You can enable YARN mode post-installation if YARN resource management is desired in HAWQ.
VM Overcommit: Set this value according to the instructions in the System Requirements document.
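The temporary-directory guidance above (directories must already exist on every host, and should sit on separate disks) can be spot-checked on each host before you enter the list in Ambari. This is a sketch only: check_temp_dirs is a hypothetical helper, and it treats "same filesystem as reported by df -P" as a proxy for "same disk", so it will not notice two partitions that share one physical disk.

```shell
# Sketch: sanity-check a comma-separated temp directory list such as
# /disk1/tmp,/disk2/tmp on the local host.
# check_temp_dirs is a hypothetical helper name.
check_temp_dirs() (
    status=0
    seen=""
    old_ifs=$IFS
    IFS=,
    for dir in $1; do
        IFS=$old_ifs
        if [ ! -d "$dir" ]; then
            echo "missing: $dir"
            status=1
            continue
        fi
        # df -P prints the backing filesystem in column 1 of line 2
        fs=$(df -P "$dir" | awk 'NR == 2 { print $1 }')
        case " $seen " in
            *" $fs "*) echo "shared disk: $dir is on $fs"; status=1 ;;
        esac
        seen="$seen $fs"
    done
    exit $status
)
```

For example, `check_temp_dirs /disk1/tmp,/disk2/tmp` is silent and returns success when both directories exist on different filesystems.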
Click the Advanced tab and enter a HAWQ System User Password. Retype the password where indicated.
(Optional.) On the Advanced tab, you can change numerous configuration properties for the HAWQ cluster. Hover your mouse cursor over the entry field to display help for the associated property. Default values are generally acceptable for a new installation. The following properties are sometimes customized during installation:
| Property | Action |
|---|---|
| General > HAWQ DFS URL | The URL that HAWQ uses to access HDFS. |
| General > HAWQ Master Port | Enter the port to use for the HAWQ master host or accept the default, 5432. CAUTION: If you are installing HAWQ in a single-node environment (or when the Ambari server and HAWQ are installed on the same node), then you cannot accept the default port. Enter a unique port for the HAWQ master. |
| General > HAWQ Segment Port | The base port to use for HAWQ segment hosts. |
If you selected YARN as the Resource Manager, then you must configure several YARN properties for HAWQ. On the Advanced tab of HAWQ configuration, enter values for the following parameters:
| Property | Action |
|---|---|
| Advanced hawq-site > hawq_rm_yarn_address | Specify the address and port number of the YARN resource manager server (the value of yarn.resourcemanager.address). For example: rm1.example.com:8050 |
| Advanced hawq-site > hawq_rm_yarn_queue_name | Specify the YARN queue name to use for registering the HAWQ resource manager. For example: default. Note: If you specify a custom queue name other than default, you must configure the YARN scheduler and custom queue capacity, either through Ambari or directly in YARN’s configuration files. See Integrating YARN with HAWQ for more details. |
| Advanced hawq-site > hawq_rm_yarn_scheduler_address | Specify the address and port number of the YARN scheduler server (the value of yarn.resourcemanager.scheduler.address). For example: rm1.example.com:8030 |
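Both address properties take a value of the form host:port, as in rm1.example.com:8050. If you script the configuration, a small format check can catch a missing or non-numeric port before the value reaches Ambari. The sketch below uses a hypothetical helper name, is_host_port, and only validates the shape of the string. It does not contact YARN.

```shell
# Sketch: succeed only if the argument looks like host:port with a
# port between 1 and 65535.
# is_host_port is a hypothetical helper name.
is_host_port() (
    case $1 in
        *:*) ;;
        *) exit 1 ;;          # no colon at all
    esac
    host=${1%:*}              # text before the last colon
    port=${1##*:}             # text after the last colon
    case $port in
        ''|*[!0-9]*) exit 1 ;;   # empty or non-numeric port
    esac
    [ -n "$host" ] && [ "$port" -ge 1 ] && [ "$port" -le 65535 ]
)
```

For example, `is_host_port rm1.example.com:8030 || echo "bad scheduler address"`.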
If you have enabled high availability for YARN, then verify that the following values have been populated correctly in HAWQ:
| Property | Action |
|---|---|
| Custom yarn-client > yarn.resourcemanager.ha | Comma-delimited list of the fully qualified hostnames of the resource managers. When high availability is enabled, YARN ignores the value in hawq_rm_yarn_address and uses this property’s value instead. For example, |
| Custom yarn-client > yarn.resourcemanager.scheduler.ha | Comma-delimited list of fully qualified hostnames of the resource manager schedulers. When high availability is enabled, YARN ignores the value in hawq_rm_yarn_scheduler_address and uses this property’s value instead. For example, |
Click Next to continue the installation. (Depending on your cluster configuration, Ambari may recommend that you change other properties before proceeding.)
Review your configuration choices, then click Deploy to begin the installation. Ambari now begins to install, start, and test the HAWQ and PXF configuration. During this procedure, you can click on the Message links to view the console output of individual tasks.
Click Next after all tasks have completed. Review the summary of the install process, then click Complete. Ambari may indicate that components on the cluster need to be restarted. Choose Restart > Restart All Affected if necessary.
To verify that HAWQ is installed, log in to the HAWQ master as gpadmin and perform some basic commands:
Source the greenplum_path.sh file to set your environment for HAWQ:
$ source /usr/local/hawq/greenplum_path.sh
If you use a custom HAWQ master port number, set it in your environment. For example:
$ export PGPORT=5432
Start the psql interactive utility, connecting to the postgres database:
$ psql -d postgres
psql (8.2.15)
Type "help" for help.

postgres=#
Create a new database and connect to it:
postgres=# create database mytest;
CREATE DATABASE
postgres=# \c mytest
You are now connected to database "mytest" as user "*username*".
Create a new table and insert sample data:
mytest=# create table t (i int);
CREATE TABLE
mytest=# insert into t select generate_series(1,100);
Activate timing and perform a simple query:
mytest=# \timing
Timing is on.
mytest=# select count(*) from t;
 count
-------
   100
(1 row)

Time: 7.266 ms
In order to use the installed PXF service with HBase on an HDP cluster, you must manually add the path to the pxf-hbase.jar file to the HBASE_CLASSPATH environment variable and restart HBase.
If you are using Kerberos to secure Hive and HBase, you must configure proxy users, enable user impersonation, and configure PXF access to tables.
Follow this procedure to make the required changes:
Use either a text editor or the Ambari Web interface to edit the hbase-env.sh file, and add the line:
Note: Restart HBase after adding the PXF service in order to load the newly-installed PXF JAR file.
(Optional.) For secure Hive installations, use either a text editor or the Ambari Web interface to edit the hive-site.xml file, and add the property:
<property>
  <name>hive.server2.enable.impersonation</name>
  <description>Enable user impersonation for HiveServer2</description>
  <value>true</value>
</property>
(Optional.) For secure Hive and HBase installations, use either a text editor or the Ambari Web interface to edit the core-site.xml file, and add the properties:
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hbase.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hbase.groups</name>
  <value>*</value>
</property>
Restart both Hive and HBase to use the updated classpath and new properties.
In order to use PXF with HBase or Hive tables, you must grant the pxf user read permission on those tables:
For HBase, use the GRANT command for each table that you want to access with PXF. For example:
hbase(main):001:0> grant 'pxf', 'R', 'my_table'
Because Hive uses HDFS ACLs for access control, ensure that the pxf user has read permission on all of the HDFS directories that map to your databases, tables, and partitions.