
Hashed File Database Tables

Introduction

During conversion of Server jobs to Parallel jobs, Hashed File stages are replaced with Dynamic Relational (DRS) stages, and utility jobs are generated to enable the data previously stored in Hashed Files to be (optionally) migrated to DRS-compatible database tables. For the purposes of this page we’ll refer to these tables as ‘Hashed File database tables’. Mapping Hashed File capabilities onto database capabilities introduces a number of challenges:

  • Hashed Files can use null values in key columns when used in lookups.

  • Where they DO use nulls, Hashed Files treat null == null as a match, but the SQL standard does not (ANSI SQL uses three-valued logic).

  • Some databases (and by ‘some’, we mean Oracle) can’t differentiate between an empty string and null.

  • The solution requires a database table naming scheme which enables us to associate the table with its original Hashed File whilst avoiding naming conflicts.

  • The solution requires a transparent implementation (not relying on opaque custom functions, for example) so that users can easily understand what’s been delivered by S2PX.

This page describes how S2PX addresses these challenges.

Table Structure and Contents

Fields in a Hashed File are identified by a combination of column order and a key column ‘flag’. A row in a Hashed File can be interpreted differently by different Hashed File stages, and the same Hashed File can be interpreted as having a different number of columns, even within the same job! This is quite different from database tables, which have strict, consistent metadata definitions. To address this, S2PX uses database table definitions where columns are named generically based on their position and key flag: Key1, Key2 … KeyN for key columns and Data1, Data2 … DataN for non-key columns. All database columns are defined as unbounded varchar columns, and the DRS Connector maps these to the appropriate native types in your selected database platform.
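To make the resulting table shape concrete, here is a minimal Python sketch (not part of S2PX) that generates DDL for this generic layout; the bare VARCHAR stands in for whatever unbounded character type your platform provides:

def table_ddl(table_name: str, num_keys: int, num_data: int) -> str:
    """Build CREATE TABLE DDL using the generic Key1..KeyN / Data1..DataN layout."""
    columns = [f"Key{n} VARCHAR" for n in range(1, num_keys + 1)]
    columns += [f"Data{n} VARCHAR" for n in range(1, num_data + 1)]
    return f"CREATE TABLE {table_name} (\n  " + ",\n  ".join(columns) + "\n)"

# A table with three key columns and four data columns, as in the example below:
print(table_ddl("D_F123456_1B23456", 3, 4))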

The creation of each Hashed File database table in your target database depends upon the original configuration of your Hashed File:

  • If the Hashed File was configured to delete and create on each execution then the table will be created by the DRS Connector’s ‘Create table’ table action option, which S2PX sets to Yes.

  • For all other scenarios, S2PX assumes that the existing data will need to be migrated into the target database and so the utility migration jobs will drop and create your database table based on the schema used by the Hashed File being migrated.

Hashed File database tables store data in a way that enables traditional database tables to emulate the capabilities of traditional Server Hashed Files. As described earlier, there are some differences which we need to cater for explicitly (e.g. distinguishing between an empty string and a null value). To handle these situations S2PX formats your data when storing it in Hashed File database tables and removes this formatting when reading it. Specifically (a code sketch follows these points):

  • All non-null values are prefixed with a ! character. For example, Fred Flintstone becomes !Fred Flintstone, and an empty string simply becomes a ! character.

  • All null values are represented by a single @ character.

  • Non-integer numeric data is stored with 15 significant digits and up to 9 decimal places, to replicate the way Server jobs treat floating point numbers by default. (By way of example, the number 1234567890.123456789 will retain only 5 decimal places, as 10 significant digits are ‘used’ to the left of the decimal point: 1234567890.12345.)
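Below is a minimal Python sketch of these formatting rules. It is illustrative only (not the S2PX implementation), and the truncation behaviour in format_float is an assumption based on the worked example above:

def encode_value(value) -> str:
    """Format a value for storage: null becomes '@', everything else gains a '!' prefix."""
    if value is None:
        return "@"
    return "!" + str(value)        # an empty string becomes just "!"

def decode_value(stored: str):
    """Reverse the storage formatting when reading the table back."""
    if stored == "@":
        return None
    return stored[1:]              # strip the leading "!"

def format_float(text: str) -> str:
    """Reduce a decimal string to 15 significant digits with at most 9
    decimal places. Truncation (rather than rounding) reproduces the
    1234567890.12345 example above, but the exact rule is an assumption."""
    whole, _, frac = text.partition(".")
    significant_left = len(whole.lstrip("-0"))
    frac = frac[:max(0, min(9, 15 - significant_left))]
    return whole + ("." + frac if frac else "")

assert encode_value("Fred Flintstone") == "!Fred Flintstone"
assert encode_value("") == "!"
assert encode_value(None) == "@"
assert format_float("1234567890.123456789") == "1234567890.12345"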

A simple Hashed File database table might look like this:

KEY1    KEY2    KEY3    DATA1    DATA2    DATA3      DATA4
!1      !AA     !Y      !some    !data    !stored    !here
!2      !AB     @       !some    !data    @          !here
!3      !CA     @       !some    !data    !stored    !here

This ensures that database columns are treated purely positionally, replicating how Hashed Files work. Jobs may become less readable, but this is a necessary side-effect of mapping the flexible structure of a Hashed File onto the strict format of a database table. Users may be concerned that this data is difficult to query or process outside of the jobs generated by the S2PX solution; however, this was always the case with Hashed Files, so no capability has been lost. Furthermore, the S2PX implementation is simple and transparent, and easily adapted to different storage formats or technologies as required.

Note that S2PX also creates a set of utility jobs (one per Hashed File) which facilitate the loading of data from existing Hashed Files into your nominated database. These jobs will typically only need to be executed once, during Server to Parallel conversion.

Table Naming

When considering the creation and maintenance of database tables to hold data previously stored in Hashed Files we need a way of naming the database tables which…

  • helps us understand which database table relates to which Hashed File,

  • ensures that table names for each Hashed File are guaranteed to be unique,

  • uses names which fit within database-specific table name length limits, and

  • does not include database-specific reserved words that would prevent table creation.

A fully-qualified Hashed File reference comprises a location (which is either an Account or a filesystem Directory) and a file name. To achieve uniqueness when generating a database table name S2PX will generate two unique, limited-length hash values based on the Hashed File's location (Account or Directory) and the file name.

Unfortunately, Accounts and Directories can share the same names, potentially leading to a situation where even fully-qualified file references (of the form {account}_{filename} and {directory}_{filename}, for example) could deliver duplicate names. To avoid this, S2PX uses a prefix character to indicate the type of location (Account or Directory) being referenced:

Directory Hashed Files: D_{location-hash-value}_{filename-hash-value}

Account Hashed Files: A_{location-hash-value}_{filename-hash-value}

It is possible to configure a Hashed File stage with a blank Account. In this case an empty string is passed to the hashing algorithm, which returns the value FFFFFFFF, and the LOCATION column in the HASH_DICTIONARY table (see below) will hold a null value.
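By way of illustration, here is a Python sketch of this naming scheme. It assumes zlib’s CRC32 as a stand-in for DataStage’s CRC32 function (they evidently differ; zlib’s CRC32 of an empty string is 0, not FFFFFFFF) and that OConv(…, "MX") corresponds to uppercase hexadecimal:

import zlib

def hash_value(text: str) -> str:
    """Uppercase-hex CRC32, standing in for OConv(CRC32(text), "MX")."""
    return format(zlib.crc32(text.encode()), "X")

def table_name(location: str, filename: str, is_account: bool) -> str:
    """Derive a Hashed File database table name from its location and file name."""
    prefix = "A" if is_account else "D"
    return f"{prefix}_{hash_value(location)}_{hash_value(filename)}"

# A Directory-referenced Hashed File (hypothetical values):
print(table_name("/data/project/myproject/temp/", "MyCustomerData", is_account=False))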

The HASH_DICTIONARY Table

S2PX configures each generated DRS Connector with a Before SQL statement which acts to maintain a single reference table called HASH_DICTIONARY. This table describes which Hashed File database table is used to store the contents of which Server Hashed File.

This database table needs to be created manually by your Database Administrator using DDL provided here.
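As the linked DDL isn’t reproduced on this page, the following is a hypothetical sketch of what such a definition might look like; column types and sizes are assumptions, so use the DDL actually provided with S2PX:

# Hypothetical HASH_DICTIONARY DDL, held as a Python string for a DB-API
# tool to execute. Types and sizes are assumptions, not the S2PX-provided DDL.
HASH_DICTIONARY_DDL = """
CREATE TABLE HASH_DICTIONARY (
    TABLENAME  VARCHAR(64)  NOT NULL PRIMARY KEY,
    TYPE       VARCHAR(9)   NOT NULL,   -- 'ACCOUNT' or 'DIRECTORY'
    LOCATION   VARCHAR(255),            -- null when the Account was blank
    FILE_NAME  VARCHAR(255) NOT NULL
)
"""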

Here's a (self-explanatory) example of a HASH_DICTIONARY table:

TABLENAME           TYPE       LOCATION                      FILE_NAME
D_F123456_1B23456   DIRECTORY  /data/project/myproject/temp  Employee_File
A_F123456_1B23456   ACCOUNT    MYACCOUNT                     Employee_File
A_F123456_E345678   ACCOUNT    MYACCOUNT                     Transaction_File
D_F123456_E345678   DIRECTORY  /data/project/myproject/temp  Transaction_File
D_B834922_8364A9C   DIRECTORY  /home/developer/mydata        Developer_Test
A_FFFFFFFF_E345678  ACCOUNT    (null)                        Transaction_File

Your solution will require a single HASH_DICTIONARY table per schema, as it is accessed through the same connection used to populate the Hashed File database tables to which the dictionary refers. As noted above, the HASH_DICTIONARY table must be created manually by a database administrator before you can run any S2PX-generated Parallel jobs which use it.

Referencing Hashed File Database Tables in Jobs

The database table name needs to be provided as a property of the relevant DRS Connector(s) which read and/or write the relevant tables. Unfortunately, these properties are established at the point at which the job is invoked and cannot be derived using function calls during job runtime. For this reason each Hashed File database table name needs to be derived outside the job, and the DRS Connector needs to be provided with its table name via a job parameter. Below we describe how this process works.

Original Server Job Hashed File References

[Image: Server Hashed File stage properties]

This example Server Hashed File stage defines a file called MyCustomerData located within a Directory whose name is provided by Parameter Set value #pGlobal.TransientDir# (with a trailing slash added).

S2PX-Generated Parallel Job DRS Connectors

The S2PX-generated Parallel job will have parameters added to it to support referencing the Hashed File database table from within the DRS Connector:

For each distinct Hashed File name encountered in the Server job the Parallel job will have a parameter of the form SV_HashFileNameN added, where N is an increasing integer starting at one. For example, a Server job with three distinct Hashed Files will generate a Parallel job with three new parameters: SV_HashFileName1, SV_HashFileName2 and SV_HashFileName3.

Similarly, for each distinct Hashed File location encountered in the Server job the Parallel job will have a parameter of the form SV_HashDirectoryN (for Directory-referenced files) or SV_HashAccountN (for Account-referenced files) added, where N is an increasing integer starting at one.

For example, a Server job with two distinct Account references and one distinct Directory reference will generate a Parallel job which includes three new parameters: SV_HashAccount1, SV_HashAccount2 and SV_HashDirectory1.
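The following Python sketch illustrates how these parameter names might be allocated (illustrative only; the input values are hypothetical):

def assign_parameters(file_names, accounts, directories):
    """Allocate SV_* job parameter names, numbering each group of distinct
    values from one in order of first appearance."""
    params = {}
    for n, name in enumerate(dict.fromkeys(file_names), start=1):
        params[f"SV_HashFileName{n}"] = name
    for n, account in enumerate(dict.fromkeys(accounts), start=1):
        params[f"SV_HashAccount{n}"] = account
    for n, directory in enumerate(dict.fromkeys(directories), start=1):
        params[f"SV_HashDirectory{n}"] = directory
    return params

# Two distinct Account references and one distinct Directory reference,
# as in the example above (values are hypothetical):
print(assign_parameters(
    ["MyCustomerData", "MyOrderData"],
    ["MYACCOUNT", "OTHERACCOUNT"],
    ["/data/project/myproject/temp/"],
))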

Job Sequence File References

[Image: Job Sequence Job Activity stage]

The value of these parameters is calculated outside the job, in an S2PX-generated job sequence, and provided to the generated Parallel job at runtime. The expression to calculate the Hashed File database table name is visible in the respective Sequence’s Job Activity stage:

SV_HashDirectory1: "D_":OConv(CRC32(pGlobal.TransientDir:"/"),"MX")

SV_HashFileName1: OConv(CRC32("MyCustomerData"),"MX")

… and the expression used within the DRS Connector of the called jobs is:

drsMyCustomer: #SV_HashDirectory1#_#SV_HashFileName1#
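For illustration, the same derivation expressed in Python (reusing the CRC32 assumptions from the Table Naming section, with a hypothetical value for pGlobal.TransientDir):

import zlib

def hex32(text: str) -> str:
    return format(zlib.crc32(text.encode()), "X")   # stands in for OConv(CRC32(x), "MX")

transient_dir = "/data/transient"                   # hypothetical pGlobal.TransientDir value
SV_HashDirectory1 = "D_" + hex32(transient_dir + "/")
SV_HashFileName1 = hex32("MyCustomerData")

# The DRS Connector then resolves #SV_HashDirectory1#_#SV_HashFileName1# to:
print(f"{SV_HashDirectory1}_{SV_HashFileName1}")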

Using Hashed File Database Tables in Lookups

As a consequence of Hashed File data being stored in a custom format (using ! and @ prefix characters), S2PX will add a pre-DRS-write Transformer stage to prepare the data appropriately for the DRS Connector, and a post-DRS-read Transformer to remove that storage-specific formatting.

One special case is when we are using a Hashed File (and consequently a DRS Connector) as a lookup reference.

In this case the values being looked up using the DRS Connector (link custRef in the example above) will not match, due to the formatting applied to the DRS-stored data. To ensure the lookup functions as expected, S2PX introduces a Transformer (Prepare, above) on the primary (non-reference) link into the Lookup stage which generates one or more appropriately formatted ‘twins’ of the Lookup’s key column(s), using the same logic as the stored Hashed File data (i.e. the ! and @ prefix characters). The lookup logic is modified to use these temporarily-generated columns, and the output from the Lookup stage is connected to a Transformer stage which subsequently drops them.
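A minimal Python sketch of this ‘twin key’ preparation, assuming the per-value formatting described earlier (the _TWIN column suffix and function names are hypothetical, for illustration only):

def encode_value(value):
    """The storage formatting described earlier: @ for null, ! prefix otherwise."""
    return "@" if value is None else "!" + str(value)

def prepare_twin_keys(row: dict, key_columns: list) -> dict:
    """Add formatted 'twin' key columns to a primary-link row so its keys
    match the formatted values stored in the Hashed File database table."""
    twins = {f"{col}_TWIN": encode_value(row.get(col)) for col in key_columns}
    return {**row, **twins}

def drop_twin_keys(row: dict, key_columns: list) -> dict:
    """Remove the temporary twin columns after the Lookup stage."""
    return {k: v for k, v in row.items()
            if k not in {f"{col}_TWIN" for col in key_columns}}

# e.g. a primary-link row with one key column, CUST_ID:
row = prepare_twin_keys({"CUST_ID": None, "NAME": "Fred"}, ["CUST_ID"])
# row["CUST_ID_TWIN"] == "@" now matches the stored null representation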



© 2015-2024 Data Migrators Pty Ltd.