Abstract

This first article in the Easy IWA series details the regular requirement for a referentially complete database schema along with the capability to generate and load data during testing and POC exercises. This article provides information on implementing the TPCH database schema in IBM Informix, the dbgen data generation utility and loading the generated data.

Content

The TPCH database and dbgen data generation utility, courtesy of http://www.tpc.org, were developed to provide an approach to benchmarking and include:

  • The tpch Database structure
    • A referentially complete database schema
  • The tpch dbgen utility
    • A utility to populate the database with a specified amount of data (Scale Factor – SF)
  • The tpch benchmark queries – not detailed here
    • A set of pre-defined data warehouse queries to run against the database

This article details the creation of the tpch database and population using the dbgen utility to generate data; data population is detailed using flat files generated by dbgen and also pipes.



In essence, the schema consists of 8 tables, 8 explicit unique indexes supporting 8 primary keys and 9 explicit indexes supporting 9 foreign keys.
The tpch dbgen utility generates, by default, a set of flat files suitable for loading into the tpch schema with the size based on the “Scale Factor” argument; a scale factor of 1 produces a complete data set of approximately 1 GB, a scale factor of 10 produces a data set of approximately 10 GB etc.

Download the following zip file http://www.tpc.org/tpch/spec/tpch_2_17_0.zip to a temporary directory and unzip.

Go to the extracted tpch_2_17_0/dbgen directory and copy makefile.suite to Makefile; within the Makefile amend the following to suit your environment:

Copy to Clipboard

Then run make ensuring a clean compilation!

After making sure that there is adequate filesystem disk space available (i.e. more than 1 GB!), run ./dbgen –s 1

If the following files are produced, then dbgen has been successfully built:

Copy to Clipboard

For completeness and readability, perform the following:
  • Remove the just generated .tbl files under ../tpch_2_17_0/dbgen
  • Create a new directory, /home/Informix/dbgen_article (for example)
  • Copy ./tpch_2_17_0/dbgen/dists.dss to /home/Informix/dbgen_article/
  • Copy ./tpch_2_17_0/dbgen/dbgen to /home/Informix/dbgen_article/
The dbgen utility can be run with various options, some examples are detailed below:
  • ./dbgen —
    • Show complete usage
  • ./dbgen –s 1 –f
    • Force overwrite of existing files
  • ./dbgen –s 1 –T c
    • Generate just the customers (there are options for each table)
The following helper script generates all table data in parallel as flat files:

Copy to Clipboard

The generated data can also be placed, in parallel, on pipes with a slight amendment to the above script:

Copy to Clipboard

However, the data generation will not proceed until each pipe has started to be read; the following helper script can be used for flushing all data through the pipes:

Copy to Clipboard

Hint – wait until 100% displayed for each dbgen execution before executing this script.

With the above information, there are two approaches that can be followed to load data; one is loading the data from flat files and the second is loading the data from pipes.
Using the region table as an example, the following SQL details:
  • Creation of the database
  • Creation of the region table as raw
  • Creation of an external “disk” table region_ext “sameas” region
  • Insertion of data into the region table
  • Altering the region table to standard
  • The addition of a unique index and the primary key to the region table
Before running this SQL, the file /home/informix/dbgen_article/region.tbl should be created using ./dbgen -v -T r –s 1:
Copy to Clipboard
Hint – the DBDATE format is YMD4-
Hint – the database, jj_dbgen, cannot be dropped until a Level 0 archive is performed
A fake backup can be run using onbar –b –F
Alternatively, the individual tables can be dropped without a Level 0 archive
The only differences to load from a pipe as opposed to disk are:
  • Remove the flat file “region.tbl” rm region.tbl
  • Create the “region.tbl” as a pipe mknod region.tbl p
  • Amend the external table definition for region_ext as a “pipe” create external table “informix”.region_ext sameas region using ( datafiles(“pipe:/home/informix/dbgen_article/region.tbl”));
  • Prime the region pipe with data – note this will remain running ./dbgen –v –T r –s 1
Then run the preceding short SQL Hint – do not attempt to read from an external table based on pipes – IT WILL HANG – unless there is something to read. If a read is attempted then a simple echo “A” > tablename.tbl can be run and the SQL will complete.
Below is the complete SQL which is able to be run using dbaccess – jj_dbgen.sql after running jj_dbgen.sh to generate the data as flat files.

Modify the database creation statement to denote an appropriate dbspace.

In order to load the data from pipes, change the external table definitions from “disk” to “pipe” and modify the jj_dbgen_data.sh script to generate the data on pipes and run in the background ./jj_dbgen_data.sh &, then run dbaccess – jj_dbgen.sql.

It should be noted that the 12.10.xC4 feature of “NOVALIDATE” when creating foreign keys is being used; remove the “NOVALIDATE” if working with versions prior to 12.10.FC4

Copy to Clipboard

Conclusion

For testing and POC exercises, often what is required is a populated database of a specific size; this article provides enough information to implement the TPCH database with a data population of any size ranging from 1 GB to any size and demonstrates using IBM Informix external tables from “disk” or “pipe”.

Disclaimer

The code fix suggested above is provided “as is” without warranty of any kind, either express or implied, including without limitation any implied warranties of condition, uninterrupted use, merchantability, fitness for a particular purpose, or non-infringement.