Accessing HDFS File Data

Prerequisites

Before working with HDFS file data using HAWQ and PXF, you should perform the following operations:

  • Test PXF on HDFS before connecting to Hive or HBase.
  • Ensure that all HDFS users have read permissions to HDFS services and that write permissions have been limited to specific users.

Syntax

The syntax for creating an external HDFS file is as follows: 

CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    ( column_name data_type [, ...] | LIKE other_table )
LOCATION ('pxf://host[:port]/path-to-data?<pxf parameters>[&custom-option=value...]')
      FORMAT '[TEXT | CSV | CUSTOM]' (<formatting_properties>);

where <pxf parameters> is:

   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
 | PROFILE=profile-name

Note: While <host> may identify any PXF agent node, use the HDFS NameNode as it is guaranteed to be available in a running HDFS cluster. If HDFS High Availability is enabled, <host> must identify the HDFS NameService. <port> identfies the PXF port. If <port> is omitted, PXF assumes <host> identifies a High Availability HDFS Nameservice and connects to the port number designated by the pxf_service_port server configuration parameter value. Default is 51200.

Use an SQL SELECT statement to read from an HDFS READABLE table:

SELECT ... FROM table_name;

Use an SQL INSERT statement to add data to an HDFS WRITABLE table:

INSERT INTO table_name ...;

To read the data in the files or to write based on the existing format, use FORMAT, PROFILE, or one of the classes.

This topic describes the following:

  • FORMAT clause
  • Profile
  • Accessor
  • Resolver
  • Avro

Note: For more details about the API and classes, see PXF External Tables and API.

FORMAT clause

Use one of the following formats to read data with any PXF connector:

  • FORMAT 'TEXT': Use with plain delimited text files on HDFS.
  • FORMAT 'CSV': Use with comma-separated value files on HDFS.
  • FORMAT 'CUSTOM': Use with all other files, including Avro format and binary formats. Must always be used with the built-in formatter ’pxfwritable_import’ (for read) or ’pxfwritable_export’ (for write).

Note: When creating PXF external tables, you cannot use the HEADER option in your FORMAT specification.

Profile

For plain or comma-separated text files in HDFS use either the HdfsTextSimple or HdfsTextMulti Profile, or the classname org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter. Use the Avro profile for Avro files. See Using Profiles to Read and Write Data for more information.

Note: For read tables, you must include a Profile or a Fragmenter in the table definition.

Accessor

The choice of an Accessor depends on the HDFS data file type. 

Note: You must include either a Profile or an Accessor in the table definition.

File Type Accessor FORMAT clause Comments
Plain Text delimited org.apache.hawq.pxf.plugins. hdfs.LineBreakAccessor FORMAT ‘TEXT’ (format param list)  Read + Write

You cannot use the HEADER option.

Plain Text CSV  org.apache.hawq.pxf.plugins. hdfs.LineBreakAccessor FORMAT 'CSV’ (format param list

LineBreakAccessor is parallel and faster.

Use if each logical data row is a physical data line.

Read + Write 

You cannot use the HEADER option.

Plain Text CSV  org.apache.hawq.pxf.plugins. hdfs.QuotedLineBreakAccessor FORMAT 'CSV’ (format param list

QuotedLineBreakAccessor is slower and non-parallel.

Use if the data includes embedded (quoted) linefeed characters.

Read Only 

You cannot use the HEADER option.

SequenceFile org.apache.hawq.pxf.plugins. hdfs.SequenceFileAccessor FORMAT 'CUSTOM’ (formatter='pxfwritable_import’)  Read + Write (use formatter='pxfwritable_export’ for write)
AvroFile org.apache.hawq.pxf.plugins. hdfs.AvroFileAccessor FORMAT 'CUSTOM’ (formatter='pxfwritable_import’)  Read Only

Resolver

Choose the Resolver format if data records are serialized in the HDFS file. 

Note: You must include a Profile or a Resolver in the table definition.

Record Serialization Resolver Comments
Avro org.apache.hawq.pxf.plugins. hdfs.AvroResolver
  • Avro files include the record schema, Avro serialization can be used in other file types (e.g, Sequence File). 
  • For Avro serialized records outside of an Avro file, include a schema file name (.avsc) in the url under the optional Schema-Data option.
  • Deserialize Only (Read) .
Java Writable org.apache.hawq.pxf.plugins. hdfs.WritableResolver
  • Include the name of the Java class that uses Writable serialization in the URL under the optional Schema-Data.
  • The class file must exist in the public stage directory (or in Hadoop’s class path).
  • Deserialize and Serialize (Read + Write). 
  • See Customized Writable Schema File Guidelines.
None (plain text) org.apache.hawq.pxf.plugins. hdfs.StringPassResolver
  • Does not serialize plain text records. The database parses plain records. Passes records as they are.
  • Deserialize and Serialize (Read + Write).

Schema File Guidelines for WritableResolver

When using a WritableResolver, a schema file needs to be defined. The file needs to be a Java class file and must be on the class path of PXF.

The class file must follow the following requirements:

  1. Must implement org.apache.hadoop.io.Writable interface.
  2. WritableResolver uses reflection to recreate the schema and populate its fields (for both read and write). Then it uses the Writable interface functions to read/write. Therefore, fields must be public, to enable access to them. Private fields will be ignored.
  3. Fields are accessed and populated in the order in which they are declared in the class file.
  4. Supported field types:

    • boolean
    • byte array
    • double
    • float
    • int
    • long
    • short
    • string

    Arrays of any of the above types are supported, but the constructor must define the array size so the reflection will work.

Additional Options

Table 1. Additional PXF Options
Option Name Description
COLLECTION_DELIM (Avro or Hive profiles only.) The delimiter character(s) to place between entries in a top-level array, map, or record field when PXF maps a Hive or Avro complex data type to a text column. The default is a “,” character.
COMPRESSION_CODEC
  • Useful for WRITABLE PXF tables.
  • Specifies the compression codec class name for compressing the written data. The class must implement the org.apache.hadoop.io.compress.CompressionCodec interface.
  •  Some valid values are org.apache.hadoop.io.compress.DefaultCodec org.apache.hadoop.io.compress.GzipCodec org.apache.hadoop.io.compress.BZip2Codec.
  • Note: org.apache.hadoop.io.compress.BZip2Codec runs in a single thread and can be slow.
  • This option has no default value. 
  • When the option is not defined, no compression will be done.
COMPRESSION_TYPE
  • Useful WRITABLE PXF tables with SequenceFileAccessor.
  • Ignored when COMPRESSION_CODEC is not defined.
  • Specifies the compression type for sequence file.
  • Valid options are: 
    • RECORD - only the value part of each row is compressed.
    • BLOCK - both keys and values are collected in 'blocks’ separately and compressed.
  • Default value: RECORD.
MAPKEY_DELIM (Avro or Hive profiles only.) The delimiter character(s) to place between the key and value of a map entry when PXF maps a Hive or Avro complex data type to a text colum. The default is a “:” character.
RECORDKEY_DELIM (Avro profile only.) The delimiter character(s) to place between the field name and value of a record entry when PXF maps an Avro complex data type to a text colum. The default is a “:” character.
SCHEMA-DATA The data schema file used to create and read the HDFS file. For example, you could create an avsc (for Avro), or a Java class (for Writable Serialization) file. Make sure that you have added any JAR files containing the schema to pxf-public.classpath.

This option has no default value.

THREAD-SAFE Determines if the table query can run in multithread mode or not. When set to FALSE, requests will be handled in a single thread.

Should be set when a plug-in or other elements that are not thread safe are used (e.g. compression codec).

Allowed values: TRUE, FALSE. Default value is TRUE - requests can run in multithread mode.

 <custom> Any option added to the pxf URI string will be accepted and passed, along with its value, to the Fragmenter, Accessor, and Resolver implementations.

Accessing Data on a High Availability HDFS Cluster

To access data on a High Availability HDFS cluster, change the authority in the URI in the LOCATION. Use HA_nameservice instead of name_node_host:51200.

CREATE [READABLE|WRITABLE] EXTERNAL TABLE <tbl name> (<attr list>)
LOCATION ('pxf://<HA nameservice>/<path to file or directory>?Profile=profile[&<additional options>=<value>]')
FORMAT '[TEXT | CSV | CUSTOM]' (<formatting properties>);

The opposite is true when a highly available HDFS cluster is reverted to a single namenode configuration. In that case, any table definition that has the nameservice specified should use the <NN host>:<NN rest port> syntax. 

Using a Record Key with Key-Value File Formats

For sequence file and other file formats that store rows in a key-value format, the key value can be accessed through HAWQ by using the saved keyword ’recordkey’ as a field name.

The field type must correspond to the key type, much as the other fields must match the HDFS data. 

WritableResolver supports read and write of recordkey, which can be of the following Writable Hadoop types:

  • BooleanWritable
  • ByteWritable
  • DoubleWritable
  • FloatWritable
  • IntWritable
  • LongWritable
  • Text

If the recordkey field is not defined, the key is ignored in read, and a default value (segment id as LongWritable) is written in write.

Example

A data schema Babies.class contains three fields: (name text, birthday text, weight float). An external table must include these three fields, and can either include or ignore the recordkey.

-- writable table with recordkey
CREATE WRITABLE EXTERNAL TABLE babies_registry (recordkey int, name text, birthday text, weight float)
  LOCATION ('pxf://namenode_host:51200/babies_1940s'
            '?ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor'
            '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver'
            '&DATA-SCHEMA=Babies')
  FORMAT 'CUSTOM' (formatter='pxfwritable_export');
INSERT INTO babies_registry VALUES (123456, "James Paul McCartney", "June 18, 1942", 3.800);

-- writable table without recordkey
CREATE WRITABLE EXTERNAL TABLE babies_registry2 (name text, birthday text, weight float)
  LOCATION ('pxf://namenode_host:51200/babies_1940s'
            '?ACCESSOR=org.apache.hawq.pxf.plugins.SequenceFileAccessor'
            '&RESOLVER=org.apache.hawq.pxf.plugins.WritableResolver'
            '&DATA-SCHEMA=Babies')
  FORMAT 'CUSTOM' (formatter='pxfwritable_export');

-- this record's key will have some default value
INSERT INTO babies_registry VALUES ("Richard Starkey", "July 7, 1940", 4.0);

The same goes for reading data from an existing file with a key-value format, e.g. a Sequence file.

-- readable table with recordkey
CREATE EXTERNAL TABLE babies_1940 (recordkey int, name text, birthday text, weight float)
  LOCATION ('pxf://namenode_host:51200/babies_1940s'
            '?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter'
            '&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor'
            '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver'
            '&DATA-SCHEMA=Babies')
  FORMAT 'CUSTOM' (formatter='pxfwritable_import');
-- retrieve each record's key
SELECT * FROM babies_1940;

-- readable table without recordkey
CREATE EXTERNAL TABLE babies_1940_2 (name text, birthday text, weight float)
  LOCATION ('pxf://namenode_host:51200/babies_1940s'
            '?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter'
            '&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor'
            '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver'
            '&DATA-SCHEMA=Babies') 
  FORMAT 'CUSTOM' (formatter='pxfwritable_import');
-- ignores the records' key
SELECT * FROM babies_1940_2; 

Working with Avro Files

Avro files combine their data with a schema, and can support complex data types such as arrays, maps, records, enumerations, and fixed types. When you create a PXF external table to represent Avro data, map top-level fields in the schema that use a primitive data type to HAWQ columns of the same type. Map top-level fields that use a complex data type to a TEXT column in the external table. The PXF Avro profile automatically separates components of a complex type by inserting delimiters in the text column. You can then use functions or application code to further process components of the complex data.

The following table summarizes external table mapping rules for Avro data.

Table 2. Avro Data Type Mapping

Avro Data Type PXF Type
Primitive type (int, double, float, long, string, bytes, boolean) Corresponding HAWQ data type. See Data Types.
Complex type: Array, Map, Record, or Enum TEXT, with default delimiters inserted between collection items, mapped key-value pairs, and record data.
Complex type: Fixed BYTEA
Union Follows the above conventions for primitive or complex data types, depending on the union. Null values are supported in Unions.

For complex types, the PXF Avro profile inserts default delimiters between collection items and values. You can use non-default delimiter characters by including the COLLECTION_DELIM, MAPKEY_DELIM, and/or RECORDKEY_DELIM optional parameters on the Avro profile. See Additional PXF Options for a description of the parameters.

Example

The following example uses the Avro schema shown in Sample Avro Schema and the associated data file shown in Sample Avro Data (JSON).

Sample Avro Schema

{
  "type" : "record",
  "name" : "example_schema",
  "namespace" : "com.example",
  "fields" : [ {
    "name" : "id",
    "type" : "long",
    "doc" : "Id of the user account"
  }, {
    "name" : "username",
    "type" : "string",
    "doc" : "Name of the user account"
  }, {
    "name" : "followers",
    "type" : {"type": "array", "items": "string"},
    "doc" : "Users followers"
  }, {
    "name": "rank",
    "type": ["null", "int"],
    "default": null
  }, {
    "name": "fmap",
    "type": {"type": "map", "values": "long"}
  }, {
    "name": "address",
    "type": {
        "type": "record",
        "name": "addressRecord",
        "fields": [
            {"name":"number", "type":"int"},
            {"name":"street", "type":"string"},
            {"name":"city", "type":"string"}]
    }
  }, {
   "name": "relationship",
    "type": {
        "type": "enum",
        "name": "relationshipEnum",
        "symbols": ["MARRIED","LOVE","FRIEND","COLLEAGUE","STRANGER","ENEMY"]
    }
  }, {
    "name" : "md5",
    "type": {
        "type" : "fixed",
        "name" : "md5Fixed",
        "size" : 4
    }
  } ],
  "doc:" : "A basic schema for storing messages"
}

Sample Avro Data (JSON)

{"id":1, "username":"john","followers":["kate", "santosh"], "rank":null, "relationship": "FRIEND", "fmap": {"kate":10,"santosh":4},
"address":{"street":"renaissance drive", "number":1,"city":"san jose"}, "md5":\u3F00\u007A\u0073\u0074}

{"id":2, "username":"jim","followers":["john", "pam"], "rank":3, "relationship": "COLLEAGUE", "fmap": {"john":3,"pam":3}, 
"address":{"street":"deer creek", "number":9,"city":"palo alto"}, "md5":\u0010\u0021\u0003\u0004}

To map this Avro file to an external table, the top-level primitive fields (“id” of type long and “username” of type string) are mapped to their equivalent HAWQ types (bigint and text). The remaining complex fields are mapped to text columns:

gpadmin=# CREATE EXTERNAL TABLE avro_complex 
  (id bigint, 
  username text, 
  followers text, 
  rank int, 
  fmap text, 
  address text, 
  relationship text,
  md5 bytea) 
LOCATION ('pxf://namehost:51200/tmp/avro_complex?PROFILE=Avro')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

The above command uses default delimiters for separating components of the complex types. This command is equivalent to the one above, but it explicitly sets the delimiters using the Avro profile parameters:

gpadmin=# CREATE EXTERNAL TABLE avro_complex 
  (id bigint, 
  username text, 
  followers text, 
  rank int, 
  fmap text, 
  address text, 
  relationship text,
  md5 bytea) 
LOCATION ('pxf://localhost:51200/tmp/avro_complex?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

A simple query of the external table shows the components of the complex type data separated with delimiters:

gpadmin=# select * from avro_complex;
id | username |  followers  |    rank     |  fmap   |    address  |  relationship  |  md5
1| john | [kate,santosh] |   | {kate:10,santosh:4} | {number:1,street:renaissance drive,city:san jose} | FRIEND | ?zst
2| jim | [john,pam] | 3 | {john:3,pam:3} | {number:9,street:deer creek,city:palo alto} | COLLEAGUE | \020!\003\004

You can process the delimited components in the text columns as necessary for your application. For example, the following command uses the string_to_array function to convert entries in the “followers” field to a text array column in a new view. The view is then queried to filter rows based on whether a particular follower appears in the array:

gpadmin=# create view followers_view as 
  select username, address, string_to_array(substring(followers from 2 for (char_length(followers) - 2)), ',')::text[] 
    as followers 
  from avro_complex;

gpadmin=# select username, address 
from followers_view 
where john = ANY(followers);
username | address
jim | {number:9,street:deer creek,city:palo alto}