Using the ORC File Format (Beta)

HAWQ 2.1.1 includes a Beta release of Optimized Row Columnar (ORC) file format support. ORC is a columnar file format that provides a highly efficient way to both store and access HDFS data. ORC files offer improvements over text and RCFile formats in terms of both compression and performance.

ORC support is provided by a new profile, HiveORC, exposed in the PXF Hive plug-in. The HiveORC profile supports access to ORC-format data only through a Hive mapped table.

This Beta release includes the following features:
  • ORC file access - You can access Hive tables stored in ORC format using the HiveORC profile.

  • Enhanced query performance - Column projection information is leveraged to enhance query performance by reducing disk I/O and data payload.

  • Optimized predicate pushdown - Predicate pushdown is optimized for:

    • the int2, int4, int8, float8, text, bpchar, and boolean data types combined with the =, >, <, >=, <=, IS NULL, and IS NOT NULL operators
    • the =, >, <, >=, <=, IS NULL, and IS NOT NULL operators in comparisons between the integer types
    • the =, >, <, >=, <=, IS NULL, and IS NOT NULL operators in comparisons between the float8 and float4 types
    • the IN operator on arrays of int2, int4, int8, boolean, and text
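For example, the following HAWQ query against a hypothetical HiveORC external table (orders_ext, with columns of the supported types) uses only pushdown-eligible predicates:

```sql
-- orders_ext is a hypothetical external table created with PROFILE=HiveORC.
-- The > comparison (int4) and the IN test on text values both fall within
-- the supported type/operator combinations, so PXF can push these filters
-- down to the ORC reader instead of filtering after the data is loaded.
SELECT location, num_orders
FROM   orders_ext
WHERE  num_orders > 100
  AND  location IN ('Prague', 'Rome');
```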

ORC Known Issues and Limitations

  • The HiveORC profile does not yet support complex data types.

  • Aggregate queries (count, min, max, etc.) have not yet been optimized to leverage ORC file- and stripe-level metadata.

  • The HiveORC profile does not yet use the Vectorized Batch reader.

Enabling ORC Support

ORC file support requires HDB 2.1.1 and PXF 3.1.1. Before you continue, ensure that your HDB and PXF components are at these versions.
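As a quick check (assuming a typical RPM-based installation; adjust the commands for your packaging), you can confirm the installed component versions from the command line:

```shell
# Print the installed HAWQ (HDB) version.
hawq version

# List installed PXF packages; the version appears in each package name.
rpm -qa | grep -i pxf
```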

ORC file support is provided via a specific profile in the PXF Hive Plug-in.

Procedure for Command-Line-Managed HAWQ Clusters

If you installed or upgraded your cluster and manage it from the command line, the PXF Hive plug-in HiveORC profile is already installed and configured; no additional steps are required.

Procedure for Ambari-Managed HAWQ Clusters

If you manage your cluster with Ambari version 2.4.1 or earlier, you must explicitly add the HiveORC profile definition to the PXF service configuration:

  1. Start the Ambari UI.

  2. Click on the PXF service in the left pane and select the Configs tab.

  3. Expand the Advanced pxf-profiles field.

  4. Scroll to the bottom of the pxf-profiles text block and copy/paste the following profile definition just above the </profiles> line (note the s: this is the closing tag of the profile list, not of an individual profile):

        <profile>
            <name>HiveORC</name>
            <description>This profile is suitable only for Hive tables stored in ORC files
                and serialized with either the ColumnarSerDe or the LazyBinaryColumnarSerDe.
                It is much faster than the general purpose Hive profile.
            </description>
            <plugins>
                <fragmenter>org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter</fragmenter>
                <accessor>org.apache.hawq.pxf.plugins.hive.HiveORCAccessor</accessor>
                <resolver>org.apache.hawq.pxf.plugins.hive.HiveORCSerdeResolver</resolver>
                <metadata>org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher</metadata>
            </plugins>
        </profile>
  5. Press the Save button, optionally adding a note in the Save Configuration dialog. Click Save again, then OK to dismiss the Service Configuration Changes dialog.

  6. You must restart the PXF service after adding a new profile. Click the (now orange) Restart button and select Restart All Affected.

  7. When the PXF service restart completes, the HiveORC profile is available for use in your Ambari-managed HDB 2.1.1 cluster.

ORC File Format

The Optimized Row Columnar (ORC) file format is type-aware and specifically designed for Hadoop workloads. HAWQ/PXF supports ORC version 1.2.1.

ORC files store both the type of, and encoding information for, the data in the file. All columns within a single group of row data (known as a stripe) are stored together on disk. The columnar nature of the ORC format enables read projection, avoiding access to unnecessary columns during a query.

ORC also supports predicate pushdown with built-in indexes at the file, stripe, and row levels, moving the filter operation to the data loading phase.
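You can inspect these indexes for an existing file with Hive's ORC file dump utility. The HDFS path below is only an example; substitute the path to one of your own ORC files:

```shell
# Dump ORC metadata: file version, stripe boundaries, and per-column
# statistics (min/max values, null counts) that predicate pushdown relies on.
hive --orcfiledump /apps/hive/warehouse/sales_info_orc/000000_0
```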

Refer to the Apache ORC and the Apache Hive LanguageManual ORC websites for detailed information about the ORC file format.

HiveORC Profile

The PXF HiveORC profile is optimized to support column projection and predicate pushdown. The HiveORC profile is more performant than the Hive profile for the ORC file format type.

Example: Using the HiveORC Profile

Review the PXF Hive Plug-in introductory documentation, particularly the Prerequisites, Hive File Formats, Data Type Mapping, and Querying External Hive Data sections. You should also complete the example exercises in Sample Data Set, Hive Command Line, and Hive Profile.

After completing the example exercises, you will have created a text file with several rows of comma-separated value data. You will also have loaded this text file into a new Hive table named sales_info that is stored in textfile format.

In the following example, you will create a Hive table stored in ORC format and use the HiveORC profile to query this Hive table.

  1. Create a Hive table with ORC file format:

    $ HADOOP_USER_NAME=hdfs hive
    hive> CREATE TABLE sales_info_ORC (location string, month string,
            number_of_orders int, total_sales double)
          STORED AS ORC;
  2. Insert the data from the sales_info table into sales_info_ORC:

    hive> INSERT INTO TABLE sales_info_ORC SELECT * FROM sales_info;

    A copy of the sample data set is now stored in ORC format in sales_info_ORC.

  3. Perform a Hive query on sales_info_ORC to verify that the data was loaded successfully:

    hive> SELECT * FROM sales_info_ORC;
  4. Start the psql subsystem and turn on timing:

    $ psql -d postgres
    postgres=> \timing
    Timing is on.
  5. Use the PXF HiveORC profile to create a queryable HAWQ external table from the Hive table named sales_info_ORC you created in Step 1. The FORMAT clause must specify 'CUSTOM'. The HiveORC CUSTOM format supports only the built-in 'pxfwritable_import' formatter.

    postgres=> CREATE EXTERNAL TABLE salesinfo_hiveORCprofile(location text, month text, num_orders int, total_sales float8)
                 LOCATION ('pxf://namenode:51200/default.sales_info_ORC?PROFILE=HiveORC')
                 FORMAT 'CUSTOM' (formatter='pxfwritable_import');
  6. Query the external table:

    postgres=> SELECT * FROM salesinfo_hiveORCprofile;
        location    | month | num_orders | total_sales 
     ---------------+-------+------------+--------------
      Prague        | Jan   |        101 |      4875.33
      Rome          | Mar   |         87 |      1557.39
      Bangalore     | May   |        317 |      8936.99
     (3 rows)

     Time: 425.416 ms
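Because the HiveORC profile supports predicate pushdown, filters on supported types in queries against this external table can be applied while the ORC data is read. For example:

```sql
-- The > comparison on a float8 column is a pushdown-eligible predicate,
-- so matching rows can be filtered during the ORC scan rather than in HAWQ.
SELECT location, total_sales
FROM   salesinfo_hiveORCprofile
WHERE  total_sales > 2000;
```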

Other Use Cases

You may already store large amounts of data in HDFS. To take advantage of both the storage efficiency and fast processing of ORC format files and the advanced query capabilities of HAWQ:

  1. Create a Hive external table that references the HDFS data.
  2. Create a Hive table stored in ORC format.
  3. Insert the data from the external Hive table into the ORC format Hive table.
  4. Create a HAWQ external table referencing the Hive table using the PXF HiveORC profile.
  5. Join the HAWQ PXF external ORC format table with HAWQ native tables and query away!
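The steps above can be sketched as follows; the table names, column definitions, and HDFS path are placeholders for your own data:

```sql
-- Steps 1-3 (Hive): external table over existing HDFS data, then an
-- ORC-backed table populated from it. Schema and path are illustrative.
hive> CREATE EXTERNAL TABLE raw_data_ext (id int, payload string)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/data/existing';
hive> CREATE TABLE raw_data_orc (id int, payload string) STORED AS ORC;
hive> INSERT INTO TABLE raw_data_orc SELECT * FROM raw_data_ext;

-- Step 4 (HAWQ): external table over the ORC-backed Hive table.
postgres=> CREATE EXTERNAL TABLE raw_data_pxf (id int, payload text)
             LOCATION ('pxf://namenode:51200/default.raw_data_orc?PROFILE=HiveORC')
             FORMAT 'CUSTOM' (formatter='pxfwritable_import');

-- Step 5 (HAWQ): join with a native HAWQ table (native_table is a placeholder).
postgres=> SELECT n.*, r.payload
           FROM native_table n JOIN raw_data_pxf r ON n.id = r.id;
```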