Lesson 4 - Sample Data Set and HAWQ Schemas
The sample Retail demo data set used in the tutorial exercises models an online retail store operation. The store carries different categories of products. Customers order the products. The company delivers the products to the customers.
This and later exercises operate on this example data set. The data set is provided in a set of
.tsv (tab-separated values) text files. The exercises also reference scripts and other supporting files that operate on the data set.
In this section, you are introduced to the Retail demo data schema. You will download and examine the data set and work files. You will also load some of the data into HDFS.
Ensure that you have Created the HAWQ Tutorial Database and that your HAWQ cluster is up and running.
Perform the following steps to download the sample data set and scripts:
Open a terminal window and log in to the HAWQ master node as the
$ ssh gpadmin@<master>
Create a working directory for the data files and scripts:
gpadmin@master$ mkdir /tmp/hawq_getstart gpadmin@master$ cd /tmp/hawq_getstart
You may choose a different base work directory. If you do, ensure that all path components up to and including the
hawq_getstartdirectory have read and execute permissions for all.
Download the tutorial work and data files from
github, checking out the appropriate tag/branch:
gpadmin@master$ git clone https://github.com/pivotalsoftware/hawq-samples.git Cloning into 'hawq-samples'... remote: Counting objects: 42, done. remote: Total 42 (delta 0), reused 0 (delta 0), pack-reused 42 Unpacking objects: 100% (42/42), done. Checking out files: 100% (18/18), done. gpadmin@master$ cd hawq-samples gpadmin@master$ git checkout hawq2x_tutorial
Save the path to the work files base directory:
gpadmin@master$ export HAWQGSBASE=/tmp/hawq_getstart/hawq-samples
(If you chose a different base work directory, modify the command as appropriate.)
$HAWQGSBASEenvironment variable setting to your
Examine the tutorial files. Exercises in this guide reference data files and SQL and shell scripts residing in the
Directory Content datasets/retail/ Retail demo data set data files (
tutorials/getstart/ Getting Started with HAWQ guide work files tutorials/getstart/hawq/ SQL and shell scripts used by the HAWQ tables exercises tutorials/getstart/pxf/ SQL and shell scripts used by the PXF exercises
hawq-samplesrepository directories not mentioned in the table above are not used by the Getting Started with HAWQ exercises.)
A HAWQ schema is a namespace for a database. It contains named objects like tables, data types, functions, and operators. Access these objects by qualifying their name with the prefix
Perform the following steps to create the Retail demo data schema:
gpadmin@master$ psql hawqgsdb=#
You are connected to the
List the HAWQ schemas:
hawqgsdb=# \dn List of schemas Name | Owner --------------------+--------- hawq_toolkit | gpadmin information_schema | gpadmin pg_aoseg | gpadmin pg_bitmapindex | gpadmin pg_catalog | gpadmin pg_toast | gpadmin public | gpadmin (7 rows)
Every database includes a schema named
public. Database objects you create without specifying a schema are created in the default schema. The default HAWQ schema is the
publicschema, unless you explicitly set it to another schema. (More about this later.)
Display the tables in the
hawqgsdb=#\dt public.* List of relations Schema | Name | Type | Owner | Storage --------+-----------+-------+---------+------------- public | first_tbl | table | gpadmin | append only (1 row)
In Lesson 3, you created the
first_tbltable in the
Create a schema named
retail_demoto represent the Retail demo namespace:
hawqgsdb=# CREATE SCHEMA retail_demo; CREATE SCHEMA
search_pathserver configuration parameter identifies the order in which HAWQ should search or apply schemas for objects. Set the schema search path to include the new
hawqgsdb=# SET search_path TO retail_demo, public; SET
retail_demo, the first schema in your
search_path, becomes your default schema.
search_pathin this manner sets the parameter only for the current
psqlsession. You must re-set
Create another table named
hawqgsdb=# CREATE TABLE first_tbl( i int ); CREATE TABLE hawqgsdb=# INSERT INTO first_tbl SELECT generate_series(100,103); INSERT 0 4 hawqgsdb=# SELECT * FROM first_tbl; i ----- 100 101 102 103 (4 rows)
HAWQ creates this table named
first_tblin your default schema since no schema was explicitly identified for the table. Your default schema is
retail_demodue to your current
Verify that this
first_tblwas created in the
retail_demoschema by displaying the tables in this schema:
hawqgsdb=#\dt retail_demo.* List of relations Schema | Name | Type | Owner | Storage -------------+----------------------+-------+---------+------------- retail_demo | first_tbl | table | gpadmin | append only (1 row)
first_tbltable that you created in Lesson 3:
hawqgsdb=# SELECT * from public.first_tbl; i --- 1 2 3 4 5 (5 rows)
You must prepend the table name with
public.to explicitly identify the
first_tbltable in which you are interested.
The Retail demo data set includes the entities described in the table below. A fact table consists of business facts. Orders and order line items are fact tables. Dimension tables provide descriptive information for the measurements in a fact table. The other entities are represented in dimension tables.
|customers_dim||Customer data: first/last name, id, gender|
|customer_addresses_dim||Address and phone number of each customer|
|email_addresses_dim||Customer e-mail addresses|
|categories_dim||Product category name, id|
|products_dim||Product details including name, id, category, and price|
|date_dim||Date information including year, quarter, month, week, day of week|
|payment_methods||Payment method code, id|
|orders||Details of an order such as the id, payment method, billing address, day/time, and other fields. Each order is associated with a specific customer.|
|order_lineitems||Details of an order line item such as the id, item id, category, store, shipping address, and other fields. Each line item references a specific product from a specific order from a specific customer.|
Perform the following steps to load the Retail demo dimension data into HDFS for later consumption:
Navigate to the PXF script directory:
gpadmin@master$ cd $HAWQGSBASE/tutorials/getstart/pxf
Using the provided script, load the sample data files representing dimension data into an HDFS directory named
/retail_demo. The script removes any existing
/retail_demodirectory and contents before loading the data:
gpadmin@master$ ./load_data_to_HDFS.sh running: sudo -u hdfs hdfs -rm -r -f -skipTrash /retail_demo sudo -u hdfs hdfs dfs -mkdir /retail_demo/categories_dim sudo -u hdfs hdfs dfs -put /tmp/hawq_getstart/hawq-samples/datasets/retail/categories_dim.tsv.gz /retail_demo/categories_dim/ sudo -u hdfs hdfs dfs -mkdir /retail_demo/customer_addresses_dim sudo -u hdfs hdfs dfs -put /tmp/hawq_getstart/hawq-samples/datasets/retail/customer_addresses_dim.tsv.gz /retail_demo/customer_addresses_dim/ ...
load_to_HDFS.shloads the dimension data
.tsv.gzfiles directly into HDFS. Each file is loaded to its respective
View the contents of the HDFS
gpadmin@master$ sudo -u hdfs hdfs dfs -ls /retail_demo/* -rw-r--r-- 3 hdfs hdfs 590 2017-04-10 19:59 /retail_demo/categories_dim/categories_dim.tsv.gz Found 1 items -rw-r--r-- 3 hdfs hdfs 53995977 2017-04-10 19:59 /retail_demo/customer_addresses_dim/customer_addresses_dim.tsv.gz Found 1 items -rw-r--r-- 3 hdfs hdfs 4646775 2017-04-10 19:59 /retail_demo/customers_dim/customers_dim.tsv.gz Found 1 items ... Because the retail demo data exists only as `.tsv.gz` files in HDFS, you cannot immediately query the data using HAWQ. In the next lesson, you create HAWQ external tables that reference these data files, after which you can query them via PXF.
In this lesson, you downloaded the tutorial data set and work files, created the
retail_demo HAWQ schema, and loaded the Retail demo dimension data into HDFS.
In Lessons 5 and 6, you will create and query HAWQ internal and external tables in the
Lesson 5: HAWQ Tables