Acts as an interface to the external table parallel loading feature. Executes a load specification defined in a YAML-formatted control file to invoke the HAWQ parallel file server (
hawq load -f <control_file> [-l <log_file>] [--gpfdist_timeout <seconds>] [[-v | -V] [-q]] [-D] [<connection_options>] hawq load -? hawq load --version
<connection_options> = [-h <host>] [-p <port>] [-U <username>] [-d <database>] [-W]
The client machine where
hawq load is executed must have the following:
Python 2.6.2 or later,
pygresql(the Python interface to PostgreSQL), and
pyyaml. Note that Python and the required Python libraries are included with the HAWQ server installation, so if you have HAWQ installed on the machine where
hawq loadis running, you do not need a separate Python installation. Note: HAWQ Loaders for Windows supports only Python 2.5 (available from www.python.org).
The gpfdist parallel file distribution program installed and in your
$PATH. This program is located in
$GPHOME/binof your HAWQ server installation.
Network access to and from all hosts in your HAWQ array (master and segments).
Network access to and from the hosts where the data to be loaded resides (ETL servers).
hawq load is a data loading utility that acts as an interface to HAWQ’s external table parallel loading feature. Using a load specification defined in a YAML formatted control file,
hawq load executes a load by invoking the HAWQ parallel file server (gpfdist), creating an external table definition based on the source data defined, and executing an
INSERT operation to load the source data into the target table in the database.
The operation, including any SQL commands specified in the
SQL collection of the YAML control file (see Control File Format), are performed as a single transaction to prevent inconsistent data when performing multiple, simultaneous load operations on a target table.
gpfdistparallel file distribution program to send a response. Enter a value from
30seconds (entering “
0” to disables timeouts). Note that you might need to increase this value when operating on high-traffic networks.
~/hawq/Adminlogs/hawq_load_YYYYMMDD. For more information about the log file, see Log File Format.
$PGDATABASEor defaults to the current system user name.
$PGHOSTor defaults to
$PGPORTor defaults to 5432.
$PGUSERor defaults to the current system user name.
$PGPASSWORDor from a password file specified by
~/.pgpass. If these are not set, then
hawq loadwill prompt for a password even if
-Wis not supplied.
hawq load control file uses the YAML 1.1 document format and then implements its own schema for defining the various steps of a HAWQ load operation. The control file must be a valid YAML document.
hawq load program processes the control file document in order and uses indentation (spaces) to determine the document hierarchy and the relationships of the sections to one another. The use of white space is significant. White space should not be used simply for formatting purposes, and tabs should not be used at all.
The basic structure of a load control file is:
--- VERSION: 22.214.171.124 DATABASE: db_name USER: db_username HOST: master_hostname PORT: master_port GPLOAD: INPUT: - SOURCE: LOCAL_HOSTNAME: - hostname_or_ip PORT: http_port | PORT_RANGE: [start_port_range, end_port_range] FILE: - /path/to/input_file SSL: true | false CERTIFICATES_PATH: /path/to/certificates - COLUMNS: - field_name: data_type - TRANSFORM: 'transformation' - TRANSFORM_CONFIG: 'configuration-file-path' - MAX_LINE_LENGTH: integer - FORMAT: text | csv - DELIMITER: 'delimiter_character' - ESCAPE: 'escape_character' | 'OFF' - NULL_AS: 'null_string' - FORCE_NOT_NULL: true | false - QUOTE: 'csv_quote_character' - HEADER: true | false - ENCODING: database_encoding - ERROR_LIMIT: integer - ERROR_TABLE: schema.table_name OUTPUT: - TABLE: schema.table_name - MODE: insert | update | merge - MATCH_COLUMNS: - target_column_name - UPDATE_COLUMNS: - target_column_name - UPDATE_CONDITION: 'boolean_condition' - MAPPING: target_column_name: source_column_name | 'expression' PRELOAD: - TRUNCATE: true | false - REUSE_TABLES: true | false SQL: - BEFORE: "sql_command" - AFTER: "sql_command"
Control File Schema Elements
The control file contains the schema elements for:
- GPLOAD file
hawq loadcontrol file schema, for example: 126.96.36.199.
$PGDATABASEif set or the current system user name. You can also specify the database on the command line using the
$PGUSERif set. You can also specify the database role on the command line using the
If the user running
hawq load is not a HAWQ superuser, then the server configuration parameter
gp_external_grant_privileges must be set to
on for the load to be processed.
$PGHOSTif set. You can also specify the master host name on the command line using the
$PGPORTif set. You can also specify the master port on the command line using the
GPLOADspecification must have an
hawq loadwill start one or more instances of the gpfdist file distribution program on the current host and create the required external table definition(s) in HAWQ that point to the source data. Note that the host from which you run
hawq loadmust be accessible over the network by all HAWQ hosts (master and segments).
SOURCEblock of an
INPUTspecification defines the location of a source file. An
INPUTsection can have more than one
SOURCEblock defined. Each
SOURCEblock defined corresponds to one instance of the gpfdist file distribution program that will be started on the local machine. Each
SOURCEblock defined must have a
hawq loadis running. If this machine is configured with multiple network interface cards (NICs), you can specify the host name or IP of each individual NIC to allow network traffic to use all NICs simultaneously. The default is to use the local machine’s primary host name or IP only.
PORT_RANGEto select an available port from the specified range. If both
PORT_RANGEare defined, then
PORTtakes precedence. If neither
PORT_RANGEare defined, the default is to select an available port between 8000 and 9000.
If multiple host names are declared in
LOCAL_HOSTNAME, this port number is used for all hosts. This configuration is desired if you want to use all NICs to load the same file or set of files in a given directory location.
PORTto supply a range of port numbers from which
hawq loadcan choose an available port for this instance of the gpfdist file distribution program.
If the files are compressed using
bzip2 (have a
.bz2 file extension), the files will be uncompressed automatically (provided that
bunzip2 is in your path).
When specifying which source files to load, you can use the wildcard character (
*) or other C-style pattern matching to denote multiple files. The files specified are assumed to be relative to the current directory from which
hawq load is executed (or you can declare an absolute path).
true; cannot be specified when SSL is
falseor unspecified. The location specified in
CERTIFICATES_PATHmust contain the following files:
- The server certificate file,
- The server private key file,
- The trusted certificate authorities,
The root directory (
/) cannot be specified as
DELIMITERcharacter in the source file is what separates two data value fields (columns). A row is determined by a line feed character (
If the input
COLUMNS are not specified, then the schema of the output
TABLE is implied, meaning that the source data must have the same column order, number of columns, and data format as the target table.
The default source-to-target mapping is based on a match of column names as defined in this section and the column names in the target
TABLE. This default mapping can be overridden using the
hawq load. For more information about XML transformations, see “Loading and Unloading Data.”.
TEXT) or comma separated values (
CSV) format. Defaults to
TEXTif not specified. For more information about the format of the source data, see “Loading and Unloading Data” .
\014represents the shift out character..
\100, and so on) and for escaping data characters that might otherwise be taken as row or column delimiters. Make sure to choose an escape character that is not used anywhere in your actual column data. The default escape character is a \ (backslash) for text-formatted files and a
"(double quote) for csv-formatted files, however it is possible to specify another character to represent an escape. It is also possible to disable escaping in text-formatted files by specifying the value
'OFF'as the escape value. This is very useful for data such as text-formatted web log data that has many embedded backslashes that are not intended to be escapes.
TEXTmode, and an empty value with no quotations in
CSVmode. You might prefer an empty string even in
TEXTmode for cases where you do not want to distinguish nulls from empty strings. Any source data item that matches this string will be considered a null value.
CSV. Specifies the quotation character for
CSVmode. The default is double-quote (
'SQL_ASCII'), an integer encoding number, or
'DEFAULT'to use the default client encoding. If not specified, the default client encoding is used.
ERROR_TABLE. The value of ERROR_LIMIT must be 2 or greater.
ERROR_LIMITis declared. Specifies an error table where rows with formatting errors will be logged when running in single row error isolation mode. You can then examine this error table to see error rows that were not loaded (if any). If the
ERROR_TABLEspecified already exists, it will be used. If it does not exist, it will be automatically generated.
For more information about handling load errors, see “Loading and Unloading Data”.
INSERTif not specified. There are three available load modes:
INSERT INTO target_table SELECT * FROM input_data;
UPDATE_COLUMNSof the target table where the rows have
MATCH_COLUMNSattribute values equal to those of the input data, and the optional
UPDATE_COLUMNSof existing rows where
MATCH_COLUMNSattribute values are equal to those of the input data, and the optional
UPDATE_CONDITIONis true. New rows are identified when the
MATCH_COLUMNSvalue in the source data does not have a corresponding value in the existing data of the target table. In those cases, the entire row from the source file is inserted, not only the
UPDATEcolumns. If there are multiple new
MATCH_COLUMNSvalues that are the same, only one new row for that value will be inserted. Use
UPDATE_CONDITIONto filter out the rows to discard.
MERGE. Specifies the column(s) to use as the join condition for the update. The attribute value in the specified target column(s) must be equal to that of the corresponding source data column(s) in order for the row to be updated in the target table.
MERGE. Specifies the column(s) to update for the rows that meet the
MATCH_COLUMNScriteria and the optional
WHEREclause) that must be met for a row in the target table to be updated (or inserted in the case of a
COLUMNSsection and the column names of the target
TABLE. A mapping is specified as either:
Where <expression> is any expression that you would specify in the
SELECT list of a query, such as a constant value, a column reference, an operator invocation, a function call, and so on.
hawq loadwill remove all rows in the target table prior to loading it.
hawq loadwill not drop the external table objects and staging table objects it creates. These objects will be reused for future load operations that use the same load specifications. Reusing objects improves performance of trickle loads (ongoing small loads to the same target table).
AFTERcommands. List commands in the desired order of execution.
If your database object names were created using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the
hawq load control file. For example, if you create a table as follows:
CREATE TABLE "MyTable" ("MyColumn" text);
hawq load control file would refer to the above table and column names as follows:
- COLUMNS: - '"MyColumn"': text OUTPUT: - TABLE: public.'"MyTable"'
Log files output by
hawq load have the following format:
Where <timestamp> takes the form:
YYYY-MM-DD HH:MM:SS, <level> is one of
ERROR, and <message> is a normal text message.
INFO messages that may be of interest in the log files are (where # corresponds to the actual number of seconds, units of data, or failed rows):
INFO|running time: #.## seconds INFO|transferred #.# kB of #.# kB. INFO|hawq load succeeded INFO|hawq load succeeded with warnings INFO|hawq load failed INFO|1 bad row INFO|# bad rows
Run a load job as defined in
$ hawq load -f my_load.yml
Example load control file:
--- VERSION: 188.8.131.52 DATABASE: ops USER: gpadmin HOST: mdw-1 PORT: 5432 GPLOAD: INPUT: - SOURCE: LOCAL_HOSTNAME: - etl1-1 - etl1-2 - etl1-3 - etl1-4 PORT: 8081 FILE: - /var/load/data/* - COLUMNS: - name: text - amount: float4 - category: text - desc: text - date: date - FORMAT: text - DELIMITER: '|' - ERROR_LIMIT: 25 - ERROR_TABLE: payables.err_expenses OUTPUT: - TABLE: payables.expenses - MODE: INSERT SQL: - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)" - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"