...
Introduction
MettleCI will handle the management and deployment of DataStage Assets, Project Environment Variables and Parameter Sets. However, most ETL processes consist of more than just DataStage jobs and will usually require additional file system assets in order to function correctly. Just like DataStage jobs and sequences, these file system assets should be checked into version control and automatically deployed and tested as part of your MettleCI deployment pipeline.
...
This guide will first explain how file system asset deployments are performed by MettleCI and then discuss some best practices intended to ensure your automated deployments are both repeatable and easy to maintain.
Deployment Process
Regardless of whether a deployment is being performed as part of Continuous Integration or Release Deployment, MettleCI will complete the following steps:
...
The parameters <datastage project> and <environment> refer to the name of the target DataStage project and the logical environment name (e.g. CI, TEST, PROD, etc.) respectively. Note that <environment> should ideally align with the related var.* override file and the Environment postfix.
Example
As a simple example of how filesystem deployments should work, consider a DataStage project that depends on the following directory structure in order to function:
...
📄 system_accounts.csv
📄 deploy.sh
Info: Note that we haven’t bothered to create the ‘input’, ‘output’ and ‘transient’ directories in Git. This is because 1) Git only version-controls files, not directories (since, strictly speaking, a directory is a property of a file); and 2) the deploy.sh script creates these folders at deploy time, as the following instructions explain.
Finally, the content of the deploy.sh shell script (for a *nix-based target Engine host) would be:
...
- The first half of deploy.sh contains boilerplate validation code, which can be re-used across all your deploy.sh scripts.
- All required directories are created, if they don’t already exist, using mkdir -p.
- Rather than copy each Project script or reference file individually, deploy.sh deletes the existing content of the scripts and reference directories and replaces it with the files being deployed. This keeps deploy.sh easy to maintain and understand, and it also means that the file system is automatically cleaned up when files are removed from Git.
- The ${ENV_NAME} variable can be used to perform deployment steps specific to a given environment. In this example we clear the transient directory for Continuous Integration (CI); an explanation of why you may want to do this is covered in the Best Practices section.
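Taken together, the points above can be sketched as a minimal deploy.sh. The argument order, directory names and the CI environment label below are assumptions based on the example project layout, not MettleCI-defined conventions:

```shell
#!/bin/sh
# Hypothetical deploy.sh sketch; names and argument order are illustrative.

deploy() {
    BASE_DIR="$1"    # root of the project file system on the engine
    ENV_NAME="$2"    # logical environment name (CI, TEST, PROD, ...)
    GIT_DIR="$3"     # the 'filesystem' directory from the Git checkout

    # Boilerplate validation: fail fast on missing arguments.
    if [ -z "$BASE_DIR" ] || [ -z "$ENV_NAME" ] || [ -z "$GIT_DIR" ]; then
        echo "Usage: deploy <base dir> <environment> <git dir>" >&2
        return 1
    fi

    # Create the directories that are not version-controlled in Git.
    mkdir -p "$BASE_DIR/input" "$BASE_DIR/output" "$BASE_DIR/transient"

    # Replace whole directories rather than copying files one by one,
    # so files deleted from Git are also deleted from the engine.
    rm -rf "$BASE_DIR/scripts" "$BASE_DIR/reference"
    cp -r "$GIT_DIR/scripts" "$BASE_DIR/scripts"
    cp -r "$GIT_DIR/reference" "$BASE_DIR/reference"

    # Environment-specific step: clear transient files for CI so stale
    # intermediate files cannot mask broken job dependencies.
    if [ "$ENV_NAME" = "CI" ]; then
        rm -rf "$BASE_DIR/transient"
        mkdir -p "$BASE_DIR/transient"
    fi
}
```

Because the logic is wrapped in a function here, the same file could also be invoked directly with `deploy "$@"` as its last line.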
Best Practices
Keep Git file system directory structure as close to the deployed directory structure as possible
When the file system directory in Git does not match that of the DataStage engine, developers will need to inspect the deploy.sh script to figure out how one directory structure maps to the other. Keeping the Git and DataStage engine directory structures closely aligned keeps things simple and reduces debugging and maintenance effort.
Clear and deploy directories, not individual files
Rather than writing deploy.sh to move each and every file individually, deploy entire directories in one go by first performing an rm -rf followed by a cp -r (see scripts and reference data in the previous example). This approach not only reduces how often the deploy.sh script needs to be changed, but it also ensures that old files are automatically removed as part of the deployment process.
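In shell terms, the replace-whole-directory pattern is just two commands. The paths here are illustrative placeholders, not fixed MettleCI locations:

```shell
# Replace a deployed directory's contents with the version from Git.
# Both paths are illustrative arguments supplied by the caller.
replace_dir() {
    GIT_DIR="$1"        # e.g. filesystem/scripts in the Git checkout
    DEPLOYED_DIR="$2"   # e.g. the scripts directory on the engine
    rm -rf "$DEPLOYED_DIR"
    cp -r "$GIT_DIR" "$DEPLOYED_DIR"
}
```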
For example, imagine your DataStage project contained a script called legacy_script.sh that was originally checked into the Git directory /filesystem/scripts and which was removed in a later revision of the project. If the deploy.sh script just had a long list of shell scripts to copy (including our legacy_script.sh), the legacy_script.sh file would never be removed when we deploy new versions. Worse still, any ETL job or sequence still referring to that script would still pass in testing, as legacy_script.sh may still exist on the file system. By clearing the scripts directory on the DataStage engine, then copying all scripts from /filesystem/scripts in Git, legacy_script.sh would be removed automatically as part of deployment and any ETL jobs or sequences still referring to it would (correctly) fail during testing.
Don’t put files from multiple projects in one directory structure
When a single directory structure contains files from multiple projects, it is impossible to clear files in a directory during one project deployment without negatively affecting the other projects. Always ensure that a file can be identified as being “owned” by a particular project based on the directory structure or file naming convention.
...
By ensuring the DataStage engine directory structure is separated by project (e.g. /data/scripts/<project>/, /<project>/scripts/, /data/<project>/scripts/, etc.), it’s clear which project “owns” the scripts. Otherwise, an approach which is less conventional but just as effective is to copy scripts to a common directory but to prefix or postfix the names with a project identifier. In this case, the /data/scripts/ directory could contain project1_copy_file.sh, project1_rename_files.sh, project2_wait_for_file.sh and project2_transfer_file.sh. The deploy.sh script could then clear the scripts directory by running rm -rf /data/scripts/<project>_*.
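The prefix-based clean-up described above can be sketched as a small helper. The function and parameter names are hypothetical:

```shell
# Remove one project's scripts from a shared directory by filename prefix,
# leaving other projects' files untouched. Names are illustrative.
clear_project_scripts() {
    SCRIPTS_DIR="$1"   # shared scripts directory, e.g. /data/scripts
    PROJECT="$2"       # project identifier used as the filename prefix
    rm -f "$SCRIPTS_DIR/${PROJECT}_"*
}
```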
Clear transient files that are created and used in a single ETL batch
Most ETL processes are composed of multiple DataStage jobs which are executed in sequence and communicate by writing to and reading from files. When these files are transient and are only expected to live for the life of a single batch, it is good practice to remove them as part of the file system deployment process. Doing so ensures that problems introduced when renaming files or removing upstream jobs are quickly identified during test execution.
...
If transient files are not removed as part of the deployment process, then “Job C” will continue to run (invalidly) during testing because “File B” would still exist on the file system from previous executions. By deleting all transient files (File A and File B), running the ETL process in testing would (validly) fail when executing Job C and the problem could be quickly detected and corrected. The same issue can occur if a developer renames files written or read by jobs without ensuring all other jobs are correctly updated to reflect the change.
Be careful when deleting Dataset files
As documented by IBM, the structure of DataStage Datasets means that they cannot simply be deleted by removing just the descriptor file (often named with a *.ds extension). One solution to clearing file system directories that may contain Datasets is to use find -name "*.ds" | xargs -l orchadmin delete to clean up any Datasets before performing rm -rf on a directory. This is effective but can be quite slow.
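A Dataset-aware clean-up along those lines might be sketched as follows, using a portable while-read loop in place of the find/xargs pipeline. This assumes orchadmin is on the PATH of the deploying user; the function name is illustrative:

```shell
# Remove a directory that may contain DataStage Datasets.
# Deleting only the *.ds descriptor would leave the Dataset's data files
# behind, so let orchadmin remove each Dataset properly first.
clean_directory() {
    TARGET_DIR="$1"
    find "$TARGET_DIR" -name "*.ds" | while read -r DS; do
        orchadmin delete "$DS"
    done
    rm -rf "$TARGET_DIR"
}
```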
...