Using the ORC File Format (Beta)

HAWQ 2.1.2 includes enhancements to the beta Optimized Row Columnar (ORC) file format support first released in HAWQ 2.1.1. ORC is a columnar file format that provides a highly efficient way to both store and access HDFS data. ORC files offer improvements over text and RCFile formats in both compression and performance.

ORC support is provided by a new profile exposed in the PXF Hive plug-in, HiveORC. The HiveORC profile currently supports access to data stored in ORC format only through a Hive-mapped table.
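
In a HAWQ external table definition, you name the profile in the PXF LOCATION URI. The general shape of the URI, shown concretely in the examples later in this section, is:

    pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>?PROFILE=HiveORC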

Features

  • ORC file access - You can access Hive tables stored in ORC format using the HiveORC profile.

  • Enhanced query performance - Column projection information is leveraged to enhance query performance by reducing disk I/O and data payload.

  • Optimized predicate pushdown - Predicate pushdown is optimized for the following combinations (see the example query after this list):

    • the int2, int4, int8, float8, text, bpchar, and boolean data types with the =, >, <, >=, <=, IS NULL, and IS NOT NULL operators
    • the =, >, <, >=, <=, IS NULL, and IS NOT NULL operators in comparisons between the integer types
    • the =, >, <, >=, <=, IS NULL, and IS NOT NULL operators in comparisons between the float8 and float4 types
    • the IN operator on arrays of int2, int4, int8, boolean, and text
  • Complex type support - You can access Hive tables composed of array, map, struct, and union data types. PXF serializes each of these complex types to text.
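
For example, a query of the following shape combines several pushdown-eligible predicates. The table and column names (orders_orc, status, order_total, region, discount_code) are hypothetical and not part of the examples later in this section:

    postgres=> SELECT id, order_total FROM orders_orc
                 WHERE status = 'shipped'               -- text with =
                   AND order_total >= 100.0             -- float8 with >=
                   AND region IN ('emea', 'apac')       -- IN on a text array
                   AND discount_code IS NULL;           -- IS NULL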

ORC Known Issues and Limitations

  • Aggregate queries (count, min, max, etc.) are supported, but they have not yet been optimized to leverage ORC file- and stripe-level metadata; see the example following this list.

  • The HiveORC profile does not yet use the Vectorized Batch reader.
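
For example, an aggregate query such as the following (again using the hypothetical orders_orc table from above) runs correctly, but PXF reads the underlying column data rather than answering from ORC metadata:

    postgres=> SELECT count(*), max(order_total) FROM orders_orc;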

Enabling ORC Support

ORC file support requires HDB 2.1.2 and PXF 3.2.0. Before continuing, install or upgrade to these versions and verify that your HDB and PXF components match them.

ORC file support is provided via the HiveORC profile in the PXF Hive plug-in.

Procedure for Command-Line-Managed HAWQ Clusters

If you installed or upgraded HDB and manage your cluster from the command line, the PXF Hive plug-in HiveORC profile is already installed and configured. No additional steps are required.

Procedure for Ambari-Managed HAWQ Clusters

If you manage your cluster with Ambari, you must explicitly add or modify the HiveORC profile definition in the PXF service configuration:

  1. Start the Ambari UI.

  2. Click on the PXF service in the left pane and select the Configs tab.

  3. Expand the Advanced pxf-profiles field.

  4. Scroll to the bottom of the pxf-profiles text block. If you upgraded from an HDB 2.1.1 installation in which you had configured ORC access, ensure that the HiveORC profile definition matches the one below, including the outputFormat property setting. If you did not enable ORC access in your HDB 2.1.1 installation, or your HDB 2.1.2 deployment is a fresh install, copy and paste the following HiveORC profile definition just above the closing </profiles> line (note the plural s: </profiles>, not </profile>).

    <profile>
        <name>HiveORC</name>
        <description>This profile is suitable only for Hive tables stored in ORC files
            and serialized with either the ColumnarSerDe or the LazyBinaryColumnarSerDe.
            It is much faster than the general purpose Hive profile.
            Supports GPDBWritable output format, as specified in FORMAT header parameter.
        </description>
        <plugins>
            <fragmenter>org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter</fragmenter>
            <accessor>org.apache.hawq.pxf.plugins.hive.HiveORCAccessor</accessor>
            <resolver>org.apache.hawq.pxf.plugins.hive.HiveORCSerdeResolver</resolver>
            <metadata>org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher</metadata>
            <outputFormat>org.apache.hawq.pxf.service.io.GPDBWritable</outputFormat>
        </plugins>
    </profile>
    
  5. Click the Save button and, if you choose, add a note in the Save Configuration dialog. Click Save again, then OK in the Service Configuration Changes dialog.

  6. You must restart the PXF service after adding a new profile. Click the now-orange Restart button and select Restart All Affected.

  7. After the PXF service restarts, the HiveORC profile is available for use in your Ambari-managed HDB 2.1.2 cluster.

ORC File Format

The Optimized Row Columnar (ORC) file format is type-aware and specifically designed for Hadoop workloads. HAWQ/PXF supports ORC version 1.2.1.

ORC files store both the type of, and the encoding information for, the data in the file. All columns within a single group of row data (also known as a stripe) are stored together on disk. The columnar nature of the ORC format enables read projection, which avoids accessing unnecessary columns during a query.

ORC also supports predicate pushdown with built-in indexes at the file, stripe, and row levels, moving the filter operation to the data loading phase.
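
If you want to examine this metadata yourself, Hive includes an ORC file dump utility. For example, assuming a hypothetical warehouse path to one of the ORC files backing a table:

    $ hive --orcfiledump /user/hive/warehouse/sales_info_orc/000000_0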

Refer to the Apache ORC and the Apache Hive LanguageManual ORC websites for detailed information about the ORC file format.

HiveORC Profile

The PXF HiveORC profile is optimized to support column projection and predicate pushdown.

Example: Using the HiveORC Profile

Review the PXF Hive Plug-in introductory documentation, particularly the Prerequisites, Hive File Formats, Data Type Mapping, and Querying External Hive Data sections. You should also complete the example exercises in Sample Data Set, Hive Command Line, and Hive Profile.

After completing the example exercises, you will have created a text file with several rows of comma-separated value data. You will also have loaded this text file into a new Hive table named sales_info that is stored in textfile format.

In the following example, you will create a Hive table stored in ORC format and use the HiveORC profile to query this Hive table.

  1. Create a Hive table with ORC file format:

    $ HADOOP_USER_NAME=hdfs hive
    
    hive> CREATE TABLE sales_info_ORC (location string, month string,
            number_of_orders int, total_sales double)
          STORED AS ORC;
    
  2. Insert the data from the sales_info table into sales_info_ORC:

    hive> INSERT INTO TABLE sales_info_ORC SELECT * FROM sales_info;
    

    A copy of the sample data set is now stored in ORC format in sales_info_ORC.

  3. Perform a Hive query on sales_info_ORC to verify that the data was loaded successfully:

    hive> SELECT * FROM sales_info_ORC;
    
  4. Start the psql subsystem and turn on timing:

    $ psql -d postgres
    
    postgres=> \timing
    Timing is on.
    
  5. Use the PXF HiveORC profile to create a queryable HAWQ external table from the Hive table named sales_info_ORC you created in Step 1. The FORMAT clause must specify 'CUSTOM'. The HiveORC CUSTOM format supports only the built-in 'pxfwritable_import' formatter.

    postgres=> CREATE EXTERNAL TABLE salesinfo_hiveORCprofile(location text, month text, num_orders int, total_sales float8)
                 LOCATION ('pxf://namenode:51200/default.sales_info_ORC?PROFILE=HiveORC')
                 FORMAT 'CUSTOM' (formatter='pxfwritable_import');
    
  6. Query the external table:

    postgres=> SELECT * FROM salesinfo_hiveORCprofile;
    
       location    | month | num_orders | total_sales 
    ---------------+-------+------------+-------------
     Prague        | Jan   |        101 |     4875.33
     Rome          | Mar   |         87 |     1557.39
     Bangalore     | May   |        317 |     8936.99
     ...
    
    Time: 425.416 ms
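
    Because the HiveORC profile supports column projection and predicate pushdown, a query that selects a subset of columns and filters on a supported type and operator combination reduces the data PXF must read. For example:

    postgres=> SELECT location, total_sales FROM salesinfo_hiveORCprofile
                 WHERE num_orders > 100;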
    

Example: Using the HiveORC Profile with Complex Data Types

Review the PXF Hive Plug-in introductory documentation. You should also complete the Hive Profile Complex Data Type Example. After completing the example, you will have created a text file with several rows of comma-separated value data. This data includes an array of integers and a key/value map. You will also have loaded this text file into a new Hive table named table_complextypes that is stored in textfile format.

In the following example, you will create a Hive table stored in ORC format and use the HiveORC profile to query the complex types in the table_complextypes Hive table.

  1. Create a Hive table with ORC file format:

    $ HADOOP_USER_NAME=hdfs hive
    
    hive> CREATE TABLE table_complextypes_ORC (index int, name string, intarray ARRAY<int>, propmap MAP<string, string>)
            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
            COLLECTION ITEMS TERMINATED BY '%'
            MAP KEYS TERMINATED BY ':'
          STORED AS ORC;
    
  2. Insert the data from the table_complextypes table into table_complextypes_ORC:

    hive> INSERT INTO TABLE table_complextypes_ORC SELECT * FROM table_complextypes;
    

    A copy of the sample data set is now stored in ORC format in table_complextypes_ORC.

  3. Perform a Hive query on table_complextypes_ORC to verify that the data was loaded successfully:

    hive> SELECT * FROM table_complextypes_ORC;
    
    OK
    3       Prague       [1,2,3]    {"zone":"euro","status":"up"}
    89      Rome         [4,5,6]    {"zone":"euro"}
    400     Bangalore    [7,8,9]    {"zone":"apac","status":"pending"}
    ...
    
  4. Start the psql subsystem:

    $ psql -d postgres
    
  5. Use the PXF HiveORC profile to create a queryable HAWQ external table from the Hive table named table_complextypes_ORC you created in Step 1. The FORMAT clause must specify 'CUSTOM'. The HiveORC CUSTOM format supports only the built-in 'pxfwritable_import' formatter.

    postgres=> CREATE EXTERNAL TABLE complextypes_hiveorc(index int, name text, intarray text, propmap text)
                 LOCATION ('pxf://namenode:51200/default.table_complextypes_ORC?PROFILE=HiveORC')
                 FORMAT 'CUSTOM' (formatter='pxfwritable_import');
    

    Notice that the integer array and map complex types are mapped to type text.

  6. Query the external table:

    postgres=> SELECT * FROM complextypes_hiveorc;
    
     index |    name    | intarray |              propmap               
    -------+------------+----------+------------------------------------
         3 | Prague     | [1,2,3]  | {"zone":"euro","status":"up"}
        89 | Rome       | [4,5,6]  | {"zone":"euro"}
       400 | Bangalore  | [7,8,9]  | {"zone":"apac","status":"pending"}
     ...
    
    

    intarray and propmap are each serialized as text strings.
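
    Because HAWQ is PostgreSQL-based, you can post-process these serialized text values with standard string functions. A minimal sketch, assuming the bracketed, comma-separated array serialization shown above:

    postgres=> SELECT name, string_to_array(btrim(intarray, '[]'), ',') AS intarray_elements
                 FROM complextypes_hiveorc;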

Other Use Cases

You may already store large amounts of data in HDFS. To take advantage of both the storage efficiency and high-speed processing of ORC format files and the advanced querying capabilities of HAWQ (a sketch of this flow follows the list):

  1. Create a Hive external table that references the HDFS data.
  2. Create a Hive table stored in ORC format.
  3. Insert the data from the external Hive table into the ORC format Hive table.
  4. Create a HAWQ external table referencing the Hive table using the PXF HiveORC profile.
  5. Join the HAWQ PXF external ORC format table with HAWQ native tables and query away!
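
A minimal sketch of this flow, using hypothetical table names, columns, and HDFS paths:

    -- step 1: reference existing HDFS data from a Hive external table
    hive> CREATE EXTERNAL TABLE weblogs_ext (ip string, hits int)
            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
            LOCATION '/data/weblogs';

    -- step 2: create a Hive table stored in ORC format
    hive> CREATE TABLE weblogs_orc (ip string, hits int) STORED AS ORC;

    -- step 3: copy the data into the ORC table
    hive> INSERT INTO TABLE weblogs_orc SELECT * FROM weblogs_ext;

    -- step 4: expose the ORC table to HAWQ via the HiveORC profile
    postgres=> CREATE EXTERNAL TABLE weblogs_pxf (ip text, hits int)
                 LOCATION ('pxf://namenode:51200/default.weblogs_orc?PROFILE=HiveORC')
                 FORMAT 'CUSTOM' (formatter='pxfwritable_import');

    -- step 5: join the external table with a native HAWQ table (here, hosts)
    postgres=> SELECT h.hostname, w.hits
                 FROM weblogs_pxf w JOIN hosts h ON w.ip = h.ip;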