Using the ORC File Format (Beta)
HAWQ 2.1.1 includes a Beta release of Optimized Row Columnar (ORC) file format support. ORC is a columnar file format that provides a highly efficient way to both store and access HDFS data. ORC files offer improvements over text and RCFile formats in terms of both compression and performance.
ORC support is provided by a new profile, HiveORC, exposed in the PXF Hive plug-in. The HiveORC profile supports access to data stored in ORC format only through a Hive mapped table. HiveORC provides:
- ORC file access - You can access Hive tables stored in ORC format using the HiveORC profile.
- Enhanced query performance - Column projection information is leveraged to enhance query performance by reducing disk I/O and data payload.
- Optimized predicate pushdown - Predicate pushdown is optimized for:
  - =, >, <, >=, <=, IS NULL, and IS NOT NULL operator combinations with the boolean data type and the other supported scalar types
  - =, >, <, >=, <=, IS NULL, and IS NOT NULL operators and comparisons between the integer types
  - =, >, <, >=, <=, IS NULL, and IS NOT NULL operators and comparisons between the float types
  - the IN operator on arrays of the supported types
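As an illustration of predicates that are candidates for pushdown, consider a query against a HiveORC-backed HAWQ external table (the table and column names below are hypothetical placeholders, not objects defined elsewhere in this topic):

```sql
-- Hypothetical HiveORC-backed external table; names are placeholders.
-- Comparison and IS [NOT] NULL predicates on supported types can be
-- evaluated against ORC file- and stripe-level statistics, allowing
-- non-matching data to be skipped before it is read.
SELECT location, total_sales
  FROM sales_orc_ext
 WHERE number_of_orders >= 100          -- integer comparison pushdown
   AND total_sales IS NOT NULL          -- IS NOT NULL pushdown
   AND location IN ('Prague', 'Rome');  -- IN operator pushdown
```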
The Beta release has the following limitations:
- The HiveORC profile does not yet support complex data types.
- Aggregate queries (count, min, max, etc.) have not yet been optimized to leverage ORC file- and stripe-level metadata.
- The HiveORC profile does not yet use the Vectorized Batch reader.
ORC file support requires HDB 2.1.1 and PXF 3.1.1. Before proceeding, ensure that your HDB and PXF components are at these versions.
ORC file support is provided via a specific profile in the PXF Hive Plug-in.
If you installed or upgraded HDB and manage your cluster from the command line, the PXF Hive plug-in HiveORC profile is already installed and configured; no additional steps are required.
If you manage your cluster with Ambari version 2.4.1 or earlier, you must explicitly add the HiveORC profile definition to the PXF service configuration:
Start the Ambari UI.
Click on the PXF service in the left pane and select the Configs tab.
Expand the Advanced pxf-profiles field.
Scroll to the bottom of the pxf-profiles text block and copy/paste the following profile definition just above the </profiles> line:

<profile>
    <name>HiveORC</name>
    <description>This profile is suitable only for Hive tables stored in ORC files
        and serialized with either the ColumnarSerDe or the LazyBinaryColumnarSerDe.
        It is much faster than the general purpose Hive profile.</description>
    <plugins>
        <fragmenter>org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter</fragmenter>
        <accessor>org.apache.hawq.pxf.plugins.hive.HiveORCAccessor</accessor>
        <resolver>org.apache.hawq.pxf.plugins.hive.HiveORCSerdeResolver</resolver>
        <metadata>org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher</metadata>
    </plugins>
</profile>
Press Save, optionally adding a note in the Save Configuration dialog. Save again and click OK in the Service Configuration Changes dialog.
You must restart the PXF service after adding a new profile. Click the now-orange Restart button and select Restart All Affected.
When the PXF service restart is complete, the HiveORC profile is available for use in your Ambari-managed HDB 2.1.1 cluster.
The Optimized Row Columnar (ORC) file format is type-aware and specifically designed for Hadoop workloads. HAWQ/PXF supports ORC version 1.2.1.
ORC files store both the type of, and the encoding information for, the data in the file. All columns within a single group of row data (also known as a stripe) are stored together on disk in ORC format files. The columnar nature of the ORC format enables read projection, helping avoid accessing unnecessary columns during a query.
ORC also supports predicate pushdown with built-in indexes at the file, stripe, and row levels, moving the filter operation to the data loading phase.
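As a hedged sketch of how these file-level characteristics are controlled, a Hive table stored as ORC can set standard ORC table properties at creation time (the table and columns below are hypothetical; orc.compress and orc.stripe.size are ORC-defined properties):

```sql
-- Hypothetical Hive table; the TBLPROPERTIES select the compression
-- codec (NONE, ZLIB, or SNAPPY) and the stripe size in bytes.
CREATE TABLE events_orc (
    event_id   bigint,
    event_time timestamp,
    payload    string
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB', 'orc.stripe.size' = '67108864');
```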
The HiveORC profile is optimized to support column projection and predicate pushdown, and is more performant than the general Hive profile for the ORC file format type.
Review the PXF Hive Plug-in introductory documentation, particularly the Prerequisites, Hive File Formats, Data Type Mapping, and Querying External Hive Data sections. You should also complete the example exercises in Sample Data Set, Hive Command Line, and Hive Profile.
After completing the example exercises, you will have created a text file with several rows of comma-separated value data, and you will have loaded this text file into a Hive table named sales_info that is stored in textfile format.
In the following example, you will create a Hive table stored in ORC format and use the HiveORC profile to query it.
Create a Hive table with ORC file format:
$ HADOOP_USER_NAME=hdfs hive
hive> CREATE TABLE sales_info_ORC (location string, month string, number_of_orders int, total_sales double) STORED AS ORC;
Insert the data from the sales_info table into sales_info_ORC:
hive> INSERT INTO TABLE sales_info_ORC SELECT * FROM sales_info;
A copy of the sample data set is now stored in ORC format in sales_info_ORC.
Perform a Hive query on sales_info_ORC to verify that the data was loaded successfully:
hive> SELECT * FROM sales_info_ORC;
Start the psql subsystem and turn on timing:
$ psql -d postgres
postgres=> \timing
Timing is on.
Use the PXF HiveORC profile to create a queryable HAWQ external table from the Hive table named sales_info_ORC that you created in Step 1. The FORMAT clause must specify 'CUSTOM'; the CUSTOM format supports only the built-in pxfwritable_import formatter:
postgres=> CREATE EXTERNAL TABLE salesinfo_hiveORCprofile (location text, month text, num_orders int, total_sales float8)
             LOCATION ('pxf://namenode:51200/default.sales_info_ORC?PROFILE=HiveORC')
             FORMAT 'CUSTOM' (formatter='pxfwritable_import');
Query the external table:
postgres=> SELECT * FROM salesinfo_hiveORCprofile;
   location   | month | num_orders | total_sales
--------------+-------+------------+-------------
 Prague       | Jan   |        101 |     4875.33
 Rome         | Mar   |         87 |     1557.39
 Bangalore    | May   |        317 |     8936.99
 ...

Time: 425.416 ms
You may already store large chunks of data in HDFS. To take advantage of both the storage efficiency and high speed processing capabilities provided by ORC format files and the advanced querying capabilities of HAWQ:
- Create a Hive external table that references the HDFS data.
- Create a Hive table stored in ORC format.
- Insert the data from the external Hive table into the ORC format Hive table.
- Create a HAWQ external table referencing the Hive table using the PXF HiveORC profile.
- Join the HAWQ PXF external ORC format table with HAWQ native tables and query away!
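The steps above can be sketched as follows. All table names, column layouts, the HDFS path (/data/sales), and the native HAWQ table region_lookup are hypothetical placeholders; the first three statements run in the hive shell, the last two in HAWQ via psql:

```sql
-- 1. In Hive: create an external table over the existing HDFS data
--    (assumes comma-delimited text files under the hypothetical path).
CREATE EXTERNAL TABLE sales_ext (location string, month string,
    number_of_orders int, total_sales double)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/data/sales';

-- 2. In Hive: create a table stored in ORC format.
CREATE TABLE sales_orc (location string, month string,
    number_of_orders int, total_sales double)
  STORED AS ORC;

-- 3. In Hive: copy the data into the ORC format table.
INSERT INTO TABLE sales_orc SELECT * FROM sales_ext;

-- 4. In HAWQ: create a PXF external table over the Hive ORC table,
--    using the HiveORC profile and the pxfwritable_import formatter.
CREATE EXTERNAL TABLE sales_pxf (location text, month text,
    number_of_orders int, total_sales float8)
  LOCATION ('pxf://namenode:51200/default.sales_orc?PROFILE=HiveORC')
  FORMAT 'CUSTOM' (formatter='pxfwritable_import');

-- 5. In HAWQ: join the PXF external table with a native HAWQ table
--    (region_lookup is a hypothetical native table).
SELECT s.location, s.total_sales, r.region
  FROM sales_pxf s
  JOIN region_lookup r ON s.location = r.city;
```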