...

Introduction

MettleCI will handle the management and deployment of DataStage Assets, Project Environment Variables and Parameter Sets. However, most ETL processes consist of more than just DataStage jobs and will usually require additional file system assets in order to function correctly. Just like DataStage jobs and sequences, these file system assets should be checked into version control and automatically deployed and tested as part of your MettleCI deployment pipeline.

...

This guide will first explain how file system asset deployments are performed by MettleCI and then discuss some best practices intended to ensure your automated deployments are both repeatable and easy to maintain.

Deployment Process

Regardless of whether a deployment is being performed as part of Continuous Integration or Release Deployment, MettleCI will complete the following steps:

...

The parameters <datastage project> and <environment> refer to the name of the target DataStage project and the logical environment name (e.g. CI, TEST, PROD, etc.) respectively. Note that <environment> should ideally align with the related var.* override file and the Environment postfix.

Example

As a simple example of how filesystem deployments should work, consider a DataStage project that depends on the following directory structure in order to function:

...

📄 system_accounts.csv

📄 deploy.sh

Info

Note that we haven’t bothered to create the ‘input’, ‘output’ and ‘transient’ directories in Git. This is because 1) Git only version-controls files, not directories (since, strictly speaking, a directory is a property of a file) and doesn’t handle empty directories; and 2) the deploy.sh script creates these folders at deploy time, as the following instructions explain.

Finally, the content of the deploy.sh shell script (for a *nix-based target Engine host) would be:

...

  • The first half of deploy.sh contains boilerplate validation code. This can be re-used across all your deploy.sh scripts.

  • All required directories are created using mkdir -p, which succeeds silently if a directory already exists

  • Rather than copy each Project script or reference file individually, deploy.sh deletes the existing content of the scripts and reference directories and replaces it with the files being deployed. This keeps deploy.sh easy to maintain and understand, and it also means that the file system is automatically cleaned up when files are removed from Git.

  • The ${ENV_NAME} variable can be used to perform deployment steps specific to a given environment. In this example we clear the transient directory for Continuous Integration (CI); an explanation of why you may want to do this is covered in the Best Practices section (see the sketch below).
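The exact content of deploy.sh will vary from project to project, but the following is a minimal sketch of a script that follows the steps described above. It assumes the target project name and logical environment name are passed as the first and second arguments, and the /data/<project>/ directory layout is purely illustrative, so adjust the paths and validation to suit your own conventions.

#!/bin/bash
# Minimal illustrative deploy.sh - directory layout and argument handling are assumptions
set -e

PROJECT_NAME="$1"   # target DataStage project
ENV_NAME="$2"       # logical environment name, e.g. CI, TEST, PROD

# Boilerplate validation, re-usable across deploy.sh scripts
if [ -z "${PROJECT_NAME}" ] || [ -z "${ENV_NAME}" ]; then
    echo "Usage: $0 <datastage project> <environment>" >&2
    exit 1
fi

# Files checked out from Git sit alongside this script
SRC_DIR="$(cd "$(dirname "$0")" && pwd)"

# Target directory structure on the DataStage engine (illustrative)
BASE_DIR="/data/${PROJECT_NAME}"

# Create all required directories if they don't already exist
mkdir -p "${BASE_DIR}/scripts" "${BASE_DIR}/reference" \
         "${BASE_DIR}/input" "${BASE_DIR}/output" "${BASE_DIR}/transient"

# Clear and redeploy whole directories rather than individual files
rm -rf "${BASE_DIR}/scripts/"* "${BASE_DIR}/reference/"*
cp -r "${SRC_DIR}/scripts/." "${BASE_DIR}/scripts/"
cp -r "${SRC_DIR}/reference/." "${BASE_DIR}/reference/"

# Environment-specific step: clear transient files in CI so stale
# intermediate files cannot mask broken or renamed upstream jobs
if [ "${ENV_NAME}" = "CI" ]; then
    rm -rf "${BASE_DIR}/transient/"*
fi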

Best Practices

Keep Git file system directory structure as close to the deployed directory structure as possible

When the file system directory in Git does not match that of the DataStage engine, developers will need to inspect the deploy.sh script to figure out how one directory structure maps to the other. Keeping the Git and DataStage engine directory structure closely aligned keeps things simple and reduces debugging and maintenance effort.

Clear and deploy directories, not individual files

Rather than writing deploy.sh to copy each and every file individually, deploy entire directories in one go by first performing an rm -rf followed by a cp -r (see scripts and reference data in the previous example). This approach not only reduces how often the deploy.sh script needs to be changed, but it also ensures that old files are automatically removed as part of the deployment process.

For example, imagine your DataStage project contained a script called legacy_script.sh that was originally checked into the Git directory /filesystem/scripts and which was removed in a later revision of the project. If the deploy.sh script just had a long list of shell scripts to copy (including our legacy_script.sh), the legacy_script.sh file would never be removed when we deploy new versions. Worse still, any ETL job or sequence that still refers to the script would pass in testing, because legacy_script.sh would still exist on the file system. By clearing the scripts directory on the DataStage engine and then copying all scripts from /filesystem/scripts in Git, legacy_script.sh would be removed automatically as part of deployment and any ETL jobs or sequences still referring to it would (correctly) fail during testing.
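As a hypothetical illustration (the file and directory names here are invented for this example), compare the two approaches:

# Fragile: every file is listed individually, so removing legacy_script.sh
# from Git never removes it from the engine
cp /deploy/filesystem/scripts/copy_file.sh     /data/project1/scripts/
cp /deploy/filesystem/scripts/legacy_script.sh /data/project1/scripts/

# Robust: clear the directory, then deploy whatever Git currently contains
rm -rf /data/project1/scripts/*
cp -r /deploy/filesystem/scripts/. /data/project1/scripts/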

Don’t put files from multiple projects in one directory structure

When a single directory structure contains files from multiple projects, it is impossible to clear files in a directory during one project deployment without negatively affecting the other projects. Always ensure that a file can be identified as being “owned” by a particular project based on the directory structure or file naming convention.

...

By ensuring the DataStage engine directory structure is separated by project (e.g. /data/scripts/<project>/, /<project>/scripts/, /data/<project>/scripts/, etc.), it’s clear which project “owns” the scripts. Alternatively, a less conventional but equally effective approach is to copy scripts to a common directory but to prefix or suffix the file names with a project identifier. In this case, the /data/scripts/ directory could contain project1_copy_file.sh, project1_rename_files.sh, project2_wait_for_file.sh and project2_transfer_file.sh. The deploy.sh script could then clear the scripts directory by running rm -rf /data/scripts/<project>_*.
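For the prefix-based approach, a deploy.sh fragment along these lines (project1 and the /data/scripts/ and /deploy/filesystem/scripts/ paths are used purely as examples) would only ever touch the files owned by the deploying project:

# Remove only the files owned by this project in the shared directory
rm -f /data/scripts/project1_*

# Redeploy this project's scripts; the prefix keeps ownership obvious
cp /deploy/filesystem/scripts/project1_* /data/scripts/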

Clear transient files that are created and used in a single ETL batch

Most ETL processes are composed of multiple DataStage jobs which are executed in sequence and communicate by writing to and reading from files. When these files are transient and are only expected to live for the life of a single batch, it is good practice to remove them as part of the file system deployment process. Doing so ensures that problems introduced when renaming files or removing upstream jobs are quickly identified during test execution.

...

If transient files are not removed as part of the deployment process, then “Job C” would continue to run (invalidly) during testing because “File B” would still exist on the file system from previous executions. By deleting all transient files (File A and File B), running the ETL process in testing would (validly) fail when executing Job C, and the problem could be quickly detected and corrected. The same issue can occur if a developer renames files written or read by jobs without ensuring all other jobs are correctly updated to reflect the change.

Be careful when deleting Dataset files

As documented by IBM, the structure of a DataStage Dataset means that it cannot simply be deleted by removing just the descriptor file (often named with a *.ds extension). One solution to clearing file system directories that may contain Datasets is to use find -name "*.ds" | xargs -l orchadmin delete to clean up any Datasets before performing rm -rf on a directory. This is effective but can be quite slow.
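As a sketch of how that clean-up might be combined with a directory clear in deploy.sh (the paths are illustrative, and orchadmin must be available on the PATH, typically by sourcing the DataStage dsenv file first):

# Properly delete any Datasets (descriptor plus underlying data files)
# before clearing the directory
find /data/project1/transient -name "*.ds" | xargs -l orchadmin delete

# Now it is safe to remove the remaining files
rm -rf /data/project1/transient/*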

...