Accessing Hive Data

This topic describes how to access Hive data using PXF. You have two options for querying data stored in Hive: you can create PXF external tables and query those tables, or you can query Hive tables directly through HAWQ and PXF’s integration with HCatalog. In the latter case, HAWQ accesses Hive table metadata stored in HCatalog.

Prerequisites

Check the following before using PXF to access Hive:

  • The PXF HDFS plug-in is installed on all cluster nodes.
  • The PXF Hive plug-in is installed on all cluster nodes.
  • The Hive JAR files and conf directory are installed on all cluster nodes.
  • You have tested PXF on HDFS before connecting to Hive or HBase.
  • You are running the Hive Metastore service on a machine in your cluster. 
  • You have set the hive.metastore.uris property in the hive-site.xml on the NameNode.

Hive Command Line

To start the Hive command line and work directly on a Hive table:

$ ${HIVE_HOME}/bin/hive

The following example creates a sample Hive table and loads data into it from an existing file.

hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) row format delimited fields terminated by ',';
hive> LOAD DATA local inpath '/local/path/data.txt' into table test;

Using PXF Tables to Query Hive

PXF external table definitions for Hive tables use the same syntax regardless of the underlying file storage format. The PXF Hive plug-ins automatically detect source tables in the following formats:

  • Text based
  • SequenceFile
  • RCFile
  • ORCFile
  • Parquet
  • Avro

The source table can also be a combination of these types. The PXF Hive plug-in uses this information to query the data at runtime.

Syntax Example

The following PXF table definition is valid for any Hive file storage type.

CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    ( column_name data_type [, ...] | LIKE other_table )
LOCATION ('pxf://host[:port]/hive-table-name?<pxf parameters>[&custom-option=value...]')
FORMAT 'CUSTOM' (formatter='pxfwritable_import')

where <pxf parameters> is:

   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class
 | PROFILE=profile-name

Note: While <host> may identify any PXF agent node, use the HDFS NameNode as it is guaranteed to be available in a running HDFS cluster. If HDFS High Availability is enabled, <host> must identify the HDFS NameService. <port> identifies the PXF port. If <port> is omitted, PXF assumes <host> identifies a High Availability HDFS NameService and connects to the port number designated by the pxf_service_port server configuration parameter value. The default is 51200.
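
For example, with HDFS High Availability enabled, the LOCATION clause omits the port and names the HDFS NameService; the nameservice name below is illustrative:

LOCATION ('pxf://hdfs-nameservice/hive-table-name?PROFILE=Hive')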

PXF has three built-in profiles for Hive tables:

  • Hive
  • HiveRC
  • HiveText

The Hive profile works with any Hive storage type. Use HiveRC and HiveText to query RC and Text formats respectively. The HiveRC and HiveText profiles are faster than the generic Hive profile. When using the HiveRC and HiveText profiles, you must specify a DELIMITER option in the LOCATION clause. See Using Profiles to Read and Write Data for more information on profiles.
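
As an illustration, a readable external table that uses the HiveText profile might look like the following; the Hive table name (sales_info), the column list, and the comma delimiter are assumptions. Note the DELIMITER option in the LOCATION clause (URL-encoded, %2C for a comma) and the matching delimiter in the FORMAT clause:

CREATE EXTERNAL TABLE pxf_hive_text(store_location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://namenode:51200/sales_info?PROFILE=HiveText&DELIMITER=%2C')
FORMAT 'TEXT' (delimiter=E',');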

The following example creates a readable external table for a Hive table named /user/eddie/test using the PXF Hive profile:

CREATE EXTERNAL TABLE hivetest(id int, newid int)
LOCATION ('pxf://namenode:51200/user/eddie/test?PROFILE=Hive')
FORMAT 'custom' (formatter='pxfwritable_import');

Hive Complex Types

PXF tables support Hive data types that are not primitive types. The supported Hive complex data types are array, struct, map, and union. This Hive CREATE TABLE statement, for example, creates a table with each of these complex data types:

CREATE TABLE sales_collections (
  item STRING,
  price FLOAT,
  properties ARRAY<STRING>,
  hash MAP<STRING,FLOAT>,
  delivery_address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>,
  others UNIONTYPE<FLOAT, BOOLEAN, STRING>
)  
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/local/path/<some data file>' INTO TABLE sales_collections;

You can use HAWQ functions or application code to extract the components of the complex data columns as needed.
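
For reference, here is a minimal external table sketch for the sales_collections table above; it maps each complex column to TEXT so that PXF returns the column’s serialized text form (the host and port are illustrative):

CREATE EXTERNAL TABLE pxf_sales_collections (
  item TEXT,
  price REAL,
  properties TEXT,
  hash TEXT,
  delivery_address TEXT,
  others TEXT
)
LOCATION ('pxf://namenode:51200/sales_collections?PROFILE=Hive')
FORMAT 'custom' (formatter='pxfwritable_import');

You can then parse the text values of properties, hash, delivery_address, and others with HAWQ string functions or in application code.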

Using PXF and HCatalog to Query Hive

Hive tables can be queried directly through HCatalog integration with HAWQ and PXF, regardless of the underlying file storage format.

Previously, in order to query Hive tables using HAWQ and PXF, you needed to create an external table in PXF that described the target table’s Hive metadata. Since HAWQ is now integrated with HCatalog, HAWQ can use metadata stored in HCatalog instead of external tables created for PXF. HCatalog is built on top of the Hive metastore and incorporates Hive’s DDL. This provides several advantages:

  • You do not need to know the table schema of your Hive tables
  • You do not need to manually enter information about Hive table location or format
  • If Hive table metadata changes, HCatalog provides updated metadata. This is in contrast to the use of static external PXF tables to define Hive table metadata for HAWQ.

HAWQ integrates with HCatalog to query Hive tables as follows:

  1. HAWQ retrieves table metadata from HCatalog using PXF.
  2. HAWQ creates in-memory catalog tables from the retrieved metadata. If a table is referenced multiple times in a transaction, HAWQ uses its in-memory metadata to reduce external calls to HCatalog.
  3. PXF queries Hive using table metadata that is stored in the HAWQ in-memory catalog tables. Table metadata is dropped at the end of the transaction.

Usage

To enable HCatalog query integration in HAWQ, perform the following steps:

  1. Make sure your deployment meets the requirements listed in Prerequisites.
  2. If necessary, set the pxf_service_address global configuration property to a hostname or IP address and port where you have installed the PXF Hive plug-in. By default, the value is set to localhost:51200.

    SET pxf_service_address TO "hivenode:51200";
    
  3. To query a Hive table with HCatalog integration, simply query HCatalog directly. The query syntax is:

    SELECT * FROM hcatalog.hive-db-name.hive-table-name;
    

    For example:

    SELECT * FROM hcatalog.default.sales;
    
  4. To obtain a description of a Hive table with HCatalog integration, you can use the psql client interface.

    • Within HAWQ, use either the \d hcatalog.hive-db-name.hive-table-name or \d+ hcatalog.hive-db-name.hive-table-name command to describe a single table. For example, from the psql client interface:

      $ psql -d postgres
      postgres=# \d hcatalog.default.test
      
      PXF Hive Table "default.test"
          Column    |  Type  
      --------------+--------
       name         | text
       type         | text
       supplier_key | int4
       full_price   | float8 
      
    • Use \d hcatalog.hive-db-name.* to describe the whole database schema. For example:

      postgres=# \d hcatalog.default.*
      
      PXF Hive Table "default.test"
          Column    |  Type  
      --------------+--------
       type         | text
       name         | text
       supplier_key | int4
       full_price   | float8
      
      PXF Hive Table "default.testabc"
       Column | Type 
      --------+------
       type   | text
       name   | text
      
    • Use \d hcatalog.*.* to describe the whole schema:

      postgres=# \d hcatalog.*.*
      
      PXF Hive Table "default.test"
          Column    |  Type  
      --------------+--------
       type         | text
       name         | text
       supplier_key | int4
       full_price   | float8
      
      PXF Hive Table "default.testabc"
       Column | Type 
      --------+------
       type   | text
       name   | text
      
      PXF Hive Table "userdb.test"
        Column  | Type 
      ----------+------
       address  | text
       username | text
      
      

    Note: When using the \d or \d+ commands in the psql HAWQ client, hcatalog will not be listed as a database. If you use other psql-compatible clients, hcatalog will be listed as a database with a size value of -1, since hcatalog is not a real database in HAWQ.

  5. Alternatively, you can use the pxf_get_item_fields user-defined function (UDF) to obtain Hive table descriptions from other client interfaces or third-party applications. The UDF takes a PXF profile and a table pattern string as its input parameters.

    Note: Currently the only supported input profile is 'Hive'.

    For example, the following statement returns a description of a specific table. The description includes path, itemname (table), fieldname, and fieldtype.

    postgres=# select * from pxf_get_item_fields('Hive','default.test');
    
      path   | itemname |  fieldname   | fieldtype 
    ---------+----------+--------------+-----------
     default | test     | name         | text
     default | test     | type         | text
     default | test     | supplier_key | int4
     default | test     | full_price   | float8
    (4 rows)
    

    The following statement returns table descriptions from the default database.

    postgres=# select * from pxf_get_item_fields('Hive','default.*');
    
      path   | itemname |  fieldname   | fieldtype 
    ---------+----------+--------------+-----------
     default | test     | name         | text
     default | test     | type         | text
     default | test     | supplier_key | int4
     default | test     | full_price   | float8
     default | testabc  | name         | text
     default | testabc  | type         | text
    (6 rows)
    

    The following statement returns a description of the entire schema.

    postgres=# select * from pxf_get_item_fields('Hive', '*.*');
    
      path   | itemname |  fieldname   | fieldtype 
    ---------+----------+--------------+-----------
     default | test     | name         | text
     default | test     | type         | text
     default | test     | supplier_key | int4
     default | test     | full_price   | float8
     default | testabc  | name         | text
     default | testabc  | type         | text
     userdb  | test     | username     | text
     userdb  | test     | address      | text
    (8 rows)
    

Limitations

HCatalog integration has the following limitations:

  • HCatalog integration queries and describe commands do not support complex types; only primitive types are supported. Use PXF external tables to query complex types in Hive instead. (See Hive Complex Types for an example.)
  • Even for primitive types, HCatalog metadata descriptions produced by \d and \d+ are converted to HAWQ types. For example, the Hive type tinyint is converted to HAWQ type int2. (See Data Types for a list of data types in HAWQ.)
  • HAWQ reserves the database name “hcatalog” for system use. You cannot connect to or alter the system “hcatalog” database.

Partition Filtering

The PXF Hive plug-in uses the Hive partitioning feature and directory structure. This enables partition exclusion on HDFS files that contain the Hive table. To use the partition filtering feature to reduce network traffic and I/O, run a PXF query using a WHERE clause that refers to a specific partition in the partitioned Hive table.

To take advantage of PXF partition filtering push-down, name the partition fields in the external table. These names must be the same as the names stored in the Hive table. Otherwise, PXF ignores partition filtering and the filtering is performed on the HAWQ side, impacting performance.

Note: The Hive plug-in only filters on partition columns, not on other table attributes.

Example

Create a Hive table sales_part with two partition columns, delivery_state and delivery_city:

CREATE TABLE sales_part (name STRING, type STRING, supplier_key INT, price DOUBLE)
PARTITIONED BY (delivery_state STRING, delivery_city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Load data into this Hive table and add some partitions:

LOAD DATA LOCAL INPATH '/local/path/data1.txt' INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'San Francisco');
LOAD DATA LOCAL INPATH '/local/path/data2.txt' INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Sacramento');
LOAD DATA LOCAL INPATH '/local/path/data3.txt' INTO TABLE sales_part PARTITION(delivery_state = 'NEVADA'    , delivery_city = 'Reno');
LOAD DATA LOCAL INPATH '/local/path/data4.txt' INTO TABLE sales_part PARTITION(delivery_state = 'NEVADA'    , delivery_city = 'Las Vegas');

The Hive storage directory should appear as follows:

/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city='San Francisco'/data1.txt
/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Sacramento/data2.txt
/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Reno/data3.txt
/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city='Las Vegas'/data4.txt

To define a PXF table that will read this Hive table and take advantage of partition filter push-down, define the fields corresponding to the Hive partition fields at the end of the attribute list. In HiveQL, a select * statement on a partitioned table shows the partition fields at the end of the record.

CREATE EXTERNAL TABLE pxf_sales_part(
  item_name TEXT, 
  item_type TEXT, 
  supplier_key INTEGER, 
  item_price DOUBLE PRECISION, 
  delivery_state TEXT, 
  delivery_city TEXT
)
LOCATION ('pxf://namenode_host:51200/sales_part?PROFILE=Hive')
FORMAT 'custom' (FORMATTER='pxfwritable_import');

SELECT * FROM pxf_sales_part;

Example

In the following example, the HAWQ query filters on the delivery_city partition value Sacramento. The filter on item_name is not pushed down, since it is not a partition column; it is performed on the HAWQ side after all of the data for Sacramento has been transferred for processing.

SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'shirt';

Example

The following HAWQ query reads all of the data under the delivery_state partition CALIFORNIA, regardless of the city.

SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';

Using PXF with Hive Default Partitions

This topic describes a difference in query results between Hive and PXF queries when Hive tables use a default partition. When dynamic partitioning is enabled in Hive, a partitioned table may store data in a default partition. Hive creates a default partition when the value of a partitioning column does not match the defined type of the column (for example, when a NULL value is used for any partitioning column). In Hive, any query that includes a filter on a partition column excludes any data that is stored in the table’s default partition.

Similar to Hive, PXF represents a table’s partitioning columns as columns that are appended to the end of the table. However, PXF translates any column value in a default partition to a NULL value. This means that a HAWQ query that includes an IS NULL filter on a partitioning column can return different results than the same Hive query.

Example

Consider a Hive partitioned table that is created with the statement:

CREATE TABLE sales (order_id bigint, order_amount float) PARTITIONED BY (date date);

The table is loaded with five rows that contain the following data:

1    1.0    1900-01-01
2    2.2    1994-04-14
3    3.3    2011-03-31
4    4.5    NULL
5    5.0    2013-12-06

In this case, the insertion of row 4 creates a Hive default partition, because the partition column “date” contains a null value.

In Hive, any query that filters on the partition column omits data in the default partition. For example, the following query returns no rows:

hive> select * from sales where date is null;

However, if you map this table as a PXF external table in HAWQ, all default partition values are translated into actual NULL values. In HAWQ, executing the same query against the PXF table returns row 4 as the result, because the filter matches the NULL value.
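
For reference, a minimal sketch of such a mapping and the corresponding HAWQ query follows; the host and port are illustrative, and the partition column appears at the end of the column list:

CREATE EXTERNAL TABLE pxf_sales (order_id bigint, order_amount real, "date" date)
LOCATION ('pxf://namenode:51200/sales?PROFILE=Hive')
FORMAT 'custom' (formatter='pxfwritable_import');

-- Unlike the Hive query above, this returns row 4, because PXF translates
-- the default partition value into an actual NULL:
SELECT * FROM pxf_sales WHERE "date" IS NULL;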

Keep this behavior in mind when executing IS NULL queries on Hive partitioned tables.

Accessing Hive Tables in Parquet Format

The PXF Hive profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, if a table is created in Hive using:

hive> create table hive_parquet_table (fname string, lname string, custid int, acctbalance double)
      stored as parquet;

Then you would define the HAWQ external table using:

create external table pxf_parquet_table (fname text, lname text, custid int, acctbalance double precision) 
    location ('pxf://localhost:51200/hive_parquet_table?profile=Hive') 
    format 'custom' (formatter='pxfwritable_import');
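
You can then query the external table like any other HAWQ table, for example:

select fname, lname, acctbalance from pxf_parquet_table;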