Install Apache HAWQ using Ambari
Install a compatible version of HDP and Ambari, and ensure that your HDP system is fully functional. See the Release Notes for more information about HDFS compatibility.
Note: If you are using Ambari 2.4.0 or 2.4.1 and you want to install both HDP and HAWQ at the same time, see Installing HDP and HDB with Ambari 2.4.0 or 2.4.1 before you begin.
The Ambari plug-in for HAWQ is no longer a separate download; it is now included in the HDB software installation package.
Follow the instructions in Setting up HDB Repositories to set up local yum HDB repositories on the single system (call it repo_host) you choose to host the HDB software. This system must be accessible to all nodes in your HAWQ cluster. This system may be your Ambari server host if you choose.
Log in to the Ambari server host machine as the
Install the HAWQ Ambari plug-in; the plug-in will be installed from the HDB repository on repo_host that you set up in the previous step:
$ yum install -y hawq-ambari-plugin
Installing the Ambari HAWQ plug-in creates the directory /var/lib/hawq and installs required scripts and template files there.
Ensure that the Ambari server is running:
$ ambari-server status
Using python /usr/bin/python
Ambari-server status
Ambari Server running
If Ambari server is not running, execute:
$ ambari-server start
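The status check and conditional start can be combined in a small script. The following is a minimal sketch, not part of the HAWQ tooling: the function names are hypothetical, and it assumes that a running server is reported with the phrase "Ambari Server running", as in the sample output above.

```shell
# Returns 0 if the status text on stdin reports a running server.
# Assumption: the running state is reported with the phrase
# "Ambari Server running", as in the sample output in this guide.
ambari_is_running() {
    grep -q "Ambari Server running"
}

# Hypothetical helper: start the server only when it is not already running.
ensure_ambari_running() {
    if ambari-server status 2>&1 | ambari_is_running; then
        echo "Ambari server is already running"
    else
        ambari-server start
    fi
}
```

Call ensure_ambari_running on the Ambari server host, as the same user that ran the commands above.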
Run the add-hawq.py script to add the HDB repository to the Ambari server.
Note that you must provide different script options depending on whether you set up your repositories on the Ambari server host or on a different host:
If you set up repositories on the Ambari server host:
$ cd /var/lib/hawq
$ ./add-hawq.py --user <admin-username> --password <admin-password>
If you set up your repositories on a host other than the Ambari server host:
$ cd /var/lib/hawq
$ ./add-hawq.py --user <admin-username> --password <admin-password> --stack HDP-2.4 --hawqrepo <hdb-126.96.36.199-url> --addonsrepo <hdb-add-ons-188.8.131.52-url>
Note: Substitute the correct Ambari administrator username and password for your system. Also substitute the correct URL to the HAWQ repo and the HAWQ add-ons repo. For example:
$ ./add-hawq.py --user admin --password admin --stack HDP-2.4 --hawqrepo http://myserver.example.com/hdb-184.108.40.206 --addonsrepo http://myserver.example.com/hdb-add-ons-220.127.116.11
You must include the --stack option if you use any stack other than HDP 2.4. Execute add-hawq.py -h to view all available options for the script.
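The two invocation forms above differ only in the repository options. As an illustration, the sketch below prints the matching command line for either case so that you can review it before running it. The function name add_hawq_command is hypothetical, it uses only the options shown in this guide, and it always passes --stack in the remote-repository form, which is an assumption made for simplicity.

```shell
# Print the add-hawq.py command line for a given repository layout.
# Arguments: <user> <password> [<stack> <hawq-repo-url> <addons-repo-url>]
# With two arguments, the repositories are assumed to be on the Ambari server host.
add_hawq_command() {
    if [ "$#" -eq 2 ]; then
        printf './add-hawq.py --user %s --password %s\n' "$1" "$2"
    elif [ "$#" -eq 5 ]; then
        printf './add-hawq.py --user %s --password %s --stack %s --hawqrepo %s --addonsrepo %s\n' \
            "$1" "$2" "$3" "$4" "$5"
    else
        echo "usage: add_hawq_command <user> <password> [<stack> <hawq-repo-url> <addons-repo-url>]" >&2
        return 1
    fi
}
```

Run the printed command from /var/lib/hawq, substituting your own administrator credentials and repository URLs.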
Restart the Ambari server:
$ ambari-server restart
Note: You must restart the Ambari server before proceeding. The above command restarts only the Ambari server, but leaves other Hadoop services running.
Access the Ambari web console at http://ambari.server.hostname:8080, and log in as the “admin” user. (The default password is also “admin”.) Verify that the HDB component is now available.
Select HDFS, then click the Configs tab.
Customize the HDFS configuration as needed. If you are using Ambari 2.4, note that some of these parameters may already be configured for you:
- On the Settings tab, change the DataNode setting DataNode max data transfer threads (dfs.datanode.max.transfer.threads parameter) to 40960.
- Select the Advanced tab and expand DataNode. Ensure that the DataNode directories permission (dfs.datanode.data.dir.perm parameter) is set to 750.
- Expand the General tab and change the Access time precision (dfs.namenode.accesstime.precision parameter) to 0.
Expand Advanced hdfs-site. Set the following properties to their indicated values.
Note: If a property described below does not appear in the Ambari UI, select Custom hdfs-site and click Add property… to add the property definition and set it to the indicated value.
| Property | Setting |
|---|---|
| dfs.allow.truncate | true |
| dfs.block.access.token.enable | false for an unsecured HDFS cluster, or true for a secure cluster |
| dfs.block.local-path-access.user | gpadmin |
| HDFS Short-circuit read (dfs.client.read.shortcircuit) | true |
| dfs.client.socket-timeout | 300000000 |
| dfs.client.use.legacy.blockreader.local | false |
| dfs.datanode.handler.count | 60 |
| dfs.datanode.socket.write.timeout | 7200000 |
| dfs.namenode.handler.count | 600 |
| dfs.support.append | true |
Note: HAWQ requires that you enable dfs.allow.truncate. The HAWQ service will fail to start if dfs.allow.truncate is not set to “true.”
Expand Advanced core-site, then set the following properties to their indicated values:
Note: If a property described below does not appear in the Ambari UI, select Custom core-site and click Add property… to add the property definition and set it to the indicated value.
| Property | Setting |
|---|---|
| ipc.client.connection.maxidletime | 3600000 |
| ipc.client.connect.timeout | 300000 |
| ipc.server.listen.queue.size | 3300 |
Click Save and enter a name for the configuration change (for example, HAWQ prerequisites). Click Save again, then OK.
If Ambari indicates that a service must be restarted, click Restart and allow the service to restart before you continue.
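Once the configuration has been saved and any restarts have finished, you may want to confirm the values from a cluster host. The sketch below is one possible way to do this with the hdfs getconf -confKey command. The function name check_hdfs_conf is hypothetical, and the check reflects only the client configuration files on the host where it runs, so treat the result as a spot check rather than proof of the cluster-wide setting.

```shell
# Compare one HDFS or core-site setting with the value expected by this guide.
# Prints one line per key and returns non-zero on a mismatch.
check_hdfs_conf() {
    key=$1
    expected=$2
    actual=$(hdfs getconf -confKey "$key" 2>/dev/null)
    if [ "$actual" = "$expected" ]; then
        echo "OK $key=$actual"
    else
        echo "MISMATCH $key=$actual (expected $expected)"
        return 1
    fi
}

# Example calls, using values from the tables above:
#   check_hdfs_conf dfs.allow.truncate true
#   check_hdfs_conf dfs.support.append true
#   check_hdfs_conf ipc.server.listen.queue.size 3300
```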
Select Actions > Add Service on the home page.
Select both HAWQ and PXF from the list of services, then click Next to display the Assign Masters page.
Select the hosts that should run the HAWQ Master and HAWQ Standby Master, or accept the defaults. The HAWQ Master and HAWQ Standby Master must reside on separate hosts. Click Next to display the Assign Slaves and Clients page.
Note: Only the HAWQ Master and HAWQ Standby Master entries are configurable on this page. NameNode, SNameNode, ZooKeeper and others may be displayed for reference, but they are not configurable when adding the HAWQ and PXF services.
Note: The HAWQ Master component must not reside on the same host that is used for Hive Metastore if the Hive Metastore uses the new PostgreSQL database. This is because both services attempt to use port 5432. If it is required to co-locate these components on the same host, provision a PostgreSQL database beforehand on a port other than 5432 and choose the “Existing PostgreSQL Database” option for the Hive Metastore configuration. The same restriction applies to the admin host, because neither the HAWQ Master nor the Hive Metastore can run on the admin host where the Ambari Server is installed.
On the Assign Slaves and Clients page, choose the hosts that will run HAWQ segments and PXF, or accept the defaults. The Add Service Wizard automatically selects hosts for the HAWQ and PXF services based on available Hadoop services.
Note: PXF must be installed on the HDFS NameNode, the Standby NameNode (if configured), and on each HDFS DataNode. A HAWQ segment must be installed on each HDFS DataNode.
Click Next to continue.
On the Customize Services page, the Settings tab configures basic properties of the HAWQ cluster. In most cases you can accept the default values provided on this page. Several configuration options may require attention depending on your deployment:
- HAWQ Master Directory, HAWQ Segment Directory: This specifies the base path for the HAWQ master or segment data directory.
- HAWQ Master Temp Directories, HAWQ Segment Temp Directories: HAWQ temporary directories are used for spill files. Enter one or more directories in which the HAWQ Master or a HAWQ segment stores these temporary files. Separate multiple directories with a comma. Any directories that you specify must already be available on all host machines. If you do not specify master or segment temporary directories, temporary files are stored in
As a best practice, use multiple master and segment temporary directories on separate, large disks (2TB or greater) to load balance writes to temporary files (for example, /disk1/tmp,/disk2/tmp). For a given query, HAWQ will use a separate temp directory (if available) for each virtual segment to store spill files. Multiple HAWQ sessions will also use separate temp directories where available to avoid disk contention. If you configure too few temp directories, or you place multiple temp directories on the same disk, you increase the risk of disk contention or running out of disk space when multiple virtual segments target the same disk. Each HAWQ segment node can have 6 virtual segments.
Resource Manager: Select the resource manager to use for allocating resources in your HAWQ cluster. If you choose Standalone, HAWQ exclusively uses resources from the whole cluster. If you choose YARN, HAWQ contacts the YARN resource manager to negotiate resources. You can change the resource manager type after the initial installation. You will also have to configure some YARN-related properties in step 22. For more information on using YARN to manage resources, see Managing Resources.
Caution: If you are installing HAWQ in secure mode (Kerberos-enabled), then set Resource Manager to Standalone to avoid encountering a known installation issue. You can enable YARN mode post-installation if YARN resource management is desired in HAWQ.
VM Overcommit: Set this value according to the instructions in the System Requirements document.
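Because the temporary directories described above must already exist on every host and should sit on separate disks, it can help to check a candidate list before entering it in Ambari. The following sketch is an illustration only: check_temp_dirs is a hypothetical helper that takes the same comma-separated list and prints the filesystem that backs each directory, as reported by df.

```shell
# Hypothetical helper: takes a comma-separated directory list, as entered in Ambari.
# Prints "<directory> <filesystem>" for each existing directory and reports
# missing directories on stderr. Returns non-zero if any directory is missing.
check_temp_dirs() {
    status=0
    old_ifs=$IFS
    IFS=,
    for dir in $1; do
        if [ -d "$dir" ]; then
            fs=$(df -P "$dir" | awk 'NR == 2 { print $1 }')
            echo "$dir $fs"
        else
            echo "missing: $dir" >&2
            status=1
        fi
    done
    IFS=$old_ifs
    return "$status"
}

# Example: check_temp_dirs "/disk1/tmp,/disk2/tmp"
```

Two directories that print the same filesystem share that filesystem and probably the same disk. Run the check on each host that will run a HAWQ master or segment.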
Click the Advanced tab and enter a HAWQ System User Password. Retype the password where indicated.
Note: Be sure to make appropriate user and password system administrative changes in order to prevent operational disruption. For example, you may need to disable the password expiration policy for this HAWQ System User account.
(Optional.) On the Advanced tab, you can change numerous configuration properties for the HAWQ cluster. Hover your mouse cursor over the entry field to display help for the associated property. Default values are generally acceptable for a new installation. The following properties are sometimes customized during installation:
| Property | Action |
|---|---|
| General > HAWQ DFS URL | The URL that HAWQ uses to access HDFS. |
| General > HAWQ Master Port | Enter the port to use for the HAWQ master host or accept the default, 5432. CAUTION: If you are installing HAWQ in a single-node environment (or when the Ambari server and HAWQ are installed on the same node), you cannot accept the default port. Enter a unique port for the HAWQ master. |
| General > HAWQ Segment Port | The base port to use for HAWQ segment hosts. |
Note: Verify that all port numbers you select are unused and are available for use on your HAWQ machines.
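One way to check whether a port already has a listener is to inspect the output of ss -ltn on each host. The sketch below is a hedged example: port_is_listening is a hypothetical helper, and it assumes the usual ss column layout in which the first column is the state and the fourth column is the local address and port.

```shell
# Reads a socket listing (the format printed by `ss -ltn`) on stdin and
# returns 0 if some listener is bound to the given port, 1 otherwise.
# Assumption: column 1 is the state and column 4 is "address:port".
port_is_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Example:
#   if ss -ltn | port_is_listening 5432; then
#       echo "port 5432 is already in use on this host"
#   fi
```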
If you selected YARN as the Resource Manager, then you must configure several YARN properties for HAWQ. On the Advanced tab of HAWQ configuration, enter values for the following parameters:
| Property | Action |
|---|---|
| Advanced hawq-site > hawq_rm_yarn_address | Specify the address and port number of the YARN resource manager server (the value of yarn.resourcemanager.address). For example: rm1.example.com:8050 |
| Advanced hawq-site > hawq_rm_yarn_queue_name | Specify the YARN queue name to use for registering the HAWQ resource manager. For example: default. Note: If you specify a custom queue name other than default, you must configure the YARN scheduler and custom queue capacity, either through Ambari or directly in YARN’s configuration files. See Integrating YARN with HAWQ for more details. |
| Advanced hawq-site > hawq_rm_yarn_scheduler_address | Specify the address and port number of the YARN scheduler server (the value of yarn.resourcemanager.scheduler.address). For example: rm1.example.com:8030 |
If you have enabled high availability for YARN, then verify that the following values have been populated correctly in HAWQ:
| Property | Action |
|---|---|
| Custom yarn-client > yarn.resourcemanager.ha | Comma-delimited list of the fully qualified hostnames of the resource managers. When high availability is enabled, YARN ignores the value in hawq_rm_yarn_address and uses this property’s value instead. For example, |
| Custom yarn-client > yarn.resourcemanager.scheduler.ha | Comma-delimited list of fully qualified hostnames of the resource manager schedulers. When high availability is enabled, YARN ignores the value in hawq_rm_yarn_scheduler_address and uses this property’s value instead. For example, |
Click Next to continue the installation. (Depending on your cluster configuration, Ambari may recommend that you change other properties before proceeding.)
Review your configuration choices, then click Deploy to begin the installation. Ambari now begins to install, start, and test the HAWQ and PXF configuration. During this procedure, you can click on the Message links to view the console output of individual tasks.
Click Next after all tasks have completed. Review the summary of the install process, then click Complete. Ambari may indicate that components on the cluster need to be restarted. Choose Restart > Restart All Affected if necessary.
(Optional) If you enabled temporary password-based authentication while preparing/configuring your HAWQ host systems, turn off password-based authentication as described in Apache HAWQ System Requirements.
To verify that HAWQ is installed, log in to the HAWQ master as gpadmin and perform some basic commands:
Source the greenplum_path.sh file to set your environment for HAWQ:
$ source /usr/local/hawq/greenplum_path.sh
If you use a custom HAWQ master port number, set it in your environment. For example:
$ export PGPORT=5432
Setting PGPORT simplifies each psql invocation by providing a default for the -p (port) option. Add this setting to your .bash_profile to set the environment variable automatically each time you log in.
If you will routinely operate on a specific database, make this database the default by setting the PGDATABASE environment variable:
$ export PGDATABASE=databasename
Setting PGDATABASE simplifies each psql invocation by providing a default for the -d (database) option. Also add this setting to your .bash_profile to set the environment variable automatically each time you log in.
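To make these settings persistent without duplicating lines on repeated runs, you could append them to .bash_profile only when they are absent. The sketch below shows one way to do that. The function name add_export is hypothetical, and the port and database values in the example are placeholders.

```shell
# Append "export NAME=value" to a profile file unless a line
# that exports that variable is already present.
add_export() {
    profile=$1
    name=$2
    value=$3
    if grep -q "^export $name=" "$profile" 2>/dev/null; then
        echo "$name is already set in $profile"
    else
        printf 'export %s=%s\n' "$name" "$value" >> "$profile"
    fi
}

# Example (values are placeholders):
#   add_export "$HOME/.bash_profile" PGPORT 5432
#   add_export "$HOME/.bash_profile" PGDATABASE databasename
```

The helper does not update an existing line, so edit .bash_profile by hand if you later change the port or the default database.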
Start the psql interactive utility, connecting to the postgres database:
$ psql -d postgres
psql (8.2.15)
Type "help" for help.

postgres=#
Create a new database and connect to it:
postgres=# create database mytest;
CREATE DATABASE
postgres=# \c mytest
You are now connected to database "mytest" as user "*username*".
Create a new table and insert sample data:
mytest=# create table t (i int);
CREATE TABLE
mytest=# insert into t select generate_series(1,100);
Activate timing and perform a simple query:
mytest=# \timing
Timing is on.
mytest=# select count(*) from t;
 count
-------
   100
(1 row)

Time: 7.266 ms
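The same verification can be run non-interactively by piping the statements into psql. The following is a minimal sketch that reuses the statements from the steps above. The function name hawq_smoke_sql is hypothetical, and the script assumes that the mytest database does not exist yet, so a second run stops at the first statement.

```shell
# Print the verification statements from this guide as a psql script.
hawq_smoke_sql() {
    cat <<'SQL'
create database mytest;
\c mytest
create table t (i int);
insert into t select generate_series(1,100);
\timing
select count(*) from t;
SQL
}

# Example: stop at the first error instead of continuing.
#   hawq_smoke_sql | psql -d postgres -v ON_ERROR_STOP=1
```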
If you plan to access Hive or HBase with PXF, perform the post-install procedures identified in PXF Post-Installation Procedures for Hive/HBase to complete the installation and configuration of the associated PXF plug-ins.
If you employ Ambari 2.2.2 to manage your HAWQ cluster and plan to use the PXF JSON plug-in, you must explicitly add the JSON profile definition to the PXF service configuration:
Click on the PXF service in the left pane and select the
Scroll to the bottom of the pxf-profiles text block and copy/paste the following definition just above the </profiles> line (notice the
<profile>
    <name>Json</name>
    <description>
        Access JSON data either as:
        * one JSON record per line (default)
        * or multiline JSON records with an IDENTIFIER parameter indicating a member name used to determine the encapsulating json object to return
    </description>
    <plugins>
        <fragmenter>org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
        <accessor>org.apache.hawq.pxf.plugins.json.JsonAccessor</accessor>
        <resolver>org.apache.hawq.pxf.plugins.json.JsonResolver</resolver>
    </plugins>
</profile>
Save button, and add a note to the Save Configuration dialog if you choose.
Service Configuration Changes dialog.
You must restart the PXF service after adding a new profile. Select the now orange colored
Restart All Affected.
When PXF restart is complete, the JSON plug-in will be available for use in your Ambari-managed HDB 2.0.1 cluster.