Data Migrators strongly recommend the use of Trunk Based Development practices over Feature-driven branching strategies like GitFlow. This page describes the challenges of branching in DataStage as part of the rationale.

Introduction

Source code is a vital asset to any software development team, and DataStage development is no exception. Version Control Systems (VCS), such as Git, allow development teams to track changes over time and keep their source code in shape. A VCS provides developers with the ability to reliably recreate previous versions of software while continuing to work on future versions of the same software. Source code can also be duplicated to allow multiple versions of the software to be modified independently and in parallel; this is what a VCS refers to as a code branch. At a later point in time, code changes made to multiple branches can be integrated into a single version of software using a process called a merge:

This guide describes how to work with DataStage, Git, and Branches. The first three sections outline using DataStage with common Git activities:

Committing changes from DataStage
Creating a branch for DataStage development
Merging DataStage changes from multiple branches

Each of these sections will discuss considerations which are unique to DataStage and the constraints that they impose.

Committing changes from DataStage

Before diving into the process of committing changes from DataStage into Git, it is useful to review how Git is used with traditional software development languages like C/C++, Java, Python, etc. These programming languages are text based and each developer works on their own copy of source code called a working copy:

Without any additional tools, there is no way for developers to collaborate or share their work with other team members. Developers could use network storage to work on a shared working copy but, due to the nature of traditional programming languages, that wouldn’t be very productive as developers would continually interfere with each other and nobody would be able to compile the software. Using Git, developers work on changes to their own working copy and collaborate with other team members by regularly committing their work to a shared repository. Git is a distributed VCS, so the process of performing a commit is a little more involved than with an older, centralized VCS such as CVS or SVN. When using Git, every developer has a working copy and a local Git repository. Changes to the working copy are committed to the local repository, which is then synchronized with a common remote repository through operations called push and pull:

In contrast, DataStage developers log into a DataStage Project on a central server where they can make changes to source code. While DataStage does not allow assets such as Jobs to be modified by more than one developer at a time, it has been designed to allow multiple Developers to work concurrently on the same DataStage Project. Whenever a developer saves changes to a DataStage asset, the changes are visible to all other developers working in the same DataStage Project. From a VCS perspective, a DataStage Project is the common working copy which can be concurrently worked on by multiple developers:

Unfortunately, DataStage source code isn’t text based and is kept on the server in a proprietary format. In order to commit changes from a DataStage Project, the modified assets need to be exported to a file which can then be added to Git. This process is automatically handled by MettleCI Workbench:

A DataStage project can be connected to a remote Git repository by using the project registration administrator functions within Workbench. Since MettleCI exposes a DataStage Project as just another working copy, non-DataStage assets (such as shell scripts, SQL, etc) can be committed alongside DataStage assets using the standard Git commit process. In the example below, DataStage developers continue to update and commit changes made to the DataStage Project while two other developers update shell scripts and SQL by using the same workflow used by traditional, text based programming languages:

Considerations

Because of the shared nature of using a DataStage Project as a working copy, it is good development practice to ensure that changes to a DataStage Project do not remain uncommitted for long periods of time. Doing so can result in developers unwittingly committing overlapping changes.

For example, imagine Bill started work on TransformJobA and moved on to another task before committing his change. In the meantime, Mary also needs to make a change to TransformJobA, which she commits when finished. Unfortunately, Mary has just committed a version of TransformJobA which contains both her changes and Bill’s, and Bill’s may not have been complete.

Other than keeping DataStage jobs open (and therefore locked, preventing concurrent edits) until changes are committed, a simple solution is to create a copy of a job before working on it. As long as everyone follows the same naming convention (e.g. ${Job Name}_WIP), DataStage developers can easily identify when changes are in progress and by whom. Having a copy also makes it trivial to discard changes when necessary.

Returning to the previous example, Bill creates a copy of TransformJobA and calls it TransformJobA_WIP. He makes changes to this job and moves on to another task before committing his change. In the meantime, Mary needs to make a change to TransformJobA but can’t create a copy called TransformJobA_WIP because it already exists. She finds the existing TransformJobA_WIP job and checks the last modified by user property to find out who was working on it. Mary contacts Bill to understand the changes he was making and decide the best way to proceed.

DataStage Project categories are used to group DataStage assets and impose logical structure. Most developers consider categories to be the DataStage equivalent of folders on the filesystem. However, it is important to note that DataStage asset names for a given type (e.g. Jobs, Table Definitions, Routines, etc.) must be unique across the entire project rather than within a specific category. Therefore, using a common naming convention for work-in-progress jobs still alerts developers if they attempt concurrent work on the same DataStage asset, even when their copies are placed in different categories.
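The naming convention check described above can be sketched in a few lines. This is a minimal illustration only, assuming the list of existing job names has already been retrieved from the project by some means; the helper name `reserve_wip_name` and the `WIP_SUFFIX` constant are hypothetical and are not part of DataStage or MettleCI.

```python
WIP_SUFFIX = "_WIP"  # assumed team convention: ${Job Name}_WIP


def reserve_wip_name(job_name, existing_job_names):
    """Return the work-in-progress name for job_name.

    Raises ValueError if a WIP copy already exists, which signals that
    someone else may have changes in progress on the same job.
    """
    wip_name = job_name + WIP_SUFFIX
    # Job names are unique across the whole project, so a single
    # membership test is enough regardless of category.
    if wip_name in set(existing_job_names):
        raise ValueError(f"{wip_name} already exists; check who last modified it")
    return wip_name
```

In the example from this section, Bill's call would return `TransformJobA_WIP`, while Mary's later call would raise an error and prompt her to contact Bill.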

Creating a branch for DataStage development

By default, a Git repository will start with a single Branch, historically called master or more recently main. New branches can be created quickly and easily within Git by choosing the particular commit you’d like to branch from:

For developers who aren’t working with DataStage assets, performing a Git checkout of a branch will update the working copy to reflect the version of software represented by the branch. All changes to the working copy are always committed to the checked-out branch. To commit to a different branch, a developer must first check out that branch, which updates the working copy; any uncommitted changes that would conflict with it must be committed, stashed, or discarded beforehand. In the following example, a branch called my-branch is checked out from Git. Changes made to the working copy will always be committed to my-branch:

Using a DataStage Project as the working copy for a Git branch is the same, with the only difference being that the DataStage Project and Git branch are associated with each other through the MettleCI Workbench Project Registration interface, rather than being a function of Git’s checkout process:

The process of performing a Git checkout will not itself update a DataStage Project. Creating a Git branch with a DataStage Project as working copy requires the following steps:

1. Create the desired branch in Git (my-branch in the example above)
2. Checkout the desired branch using Git
3. Create a new Project in DataStage (e.g. my-branch) which will be used as the Working Copy
4. Register the new DataStage Project in MettleCI Workbench, ensuring the Git repository setting is configured with the desired branch (e.g. my-branch)
5. Import all DataStage ISX files which were checked out in Step 2 into the DataStage Project created in Step 3
6. Compile all imported Jobs

Steps 5 and 6 can be performed from the command line using MettleCI’s Deployment command or using the Information Server Manager and Multi Job Compile tools which are installed with the DataStage Client.

When creating a DataStage Project to use as a branch working copy, we recommend using a naming convention which makes it clear which repository and branch a DataStage project relates to.

For example, you could use ${Git Repo Name}_${branch name}_develop so you know the project is used for development and which repository/branch it is associated with.
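A naming convention like this is easy to apply consistently if it is generated rather than typed. The following is a minimal sketch; the function name `project_name_for` is hypothetical, and the character substitution is an assumption (branch names may contain characters such as "/" that are awkward in a project name), so check the naming rules of your own environment before relying on it.

```python
import re


def project_name_for(repo_name, branch_name, purpose="develop"):
    """Build a project name of the form <repo>_<branch>_<purpose>."""
    raw = f"{repo_name}_{branch_name}_{purpose}"
    # Assumption: replace anything other than letters, digits, underscore
    # and hyphen (for example the "/" in "release/1.2") with an underscore.
    return re.sub(r"[^A-Za-z0-9_-]", "_", raw)
```

For a repository called `etl-jobs` (an illustrative name) and the branch `my-branch`, this yields `etl-jobs_my-branch_develop`.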

Considerations

Developers who have used Git for development in text-based programming languages are used to a Git branch being created and available to work on within seconds. Unfortunately, this assumption does not hold true when working with DataStage. Depending on the DataStage Project size, the steps to initialize a new DataStage Project as a working copy could take over an hour. For example, depending on hardware, a typical DataStage Project with 500 assets might require approximately 1 hour and 15 minutes just to import and compile. For this reason it is worth considering how long a branch is intended to remain active before creating it. If a branch is expected to remain open for only a matter of minutes or hours, it may not be worth the overhead of creating the branch in the first place.
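The figure quoted above works out to roughly 9 seconds per asset (75 minutes for 500 assets). As a rough planning aid, that figure can be scaled to other project sizes. The sketch below assumes the time grows linearly with the number of assets, which is a simplification; real timings depend on hardware and on the complexity of the jobs, and the function name `estimate_init_minutes` is hypothetical.

```python
def estimate_init_minutes(asset_count, reference_assets=500, reference_minutes=75):
    """Scale the reference figure (500 assets in about 75 minutes) linearly.

    This is an order-of-magnitude estimate only; measure your own
    environment and pass your own reference figures where possible.
    """
    return asset_count * reference_minutes / reference_assets
```

Under this assumption, a project with 1,200 assets would need about three hours to import and compile, which is worth weighing against the expected lifetime of the branch.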

Merging DataStage changes from multiple branches

While branching allows developers to work on multiple streams of development in isolation and in parallel, at some point changes from branches will need to be integrated using a merge:

An old joke says that if you fall off a tall building, the falling isn't going to hurt you, but the landing will. So it is with source code: branching is easy, merging is harder.

Git is very good at merging text based files as used by conventional programming languages. However, conflicts can still occur and resolving them can range from the simple-but-time-consuming to problems which can be more difficult to identify and resolve. For example, if the name of a variable is changed in both main and my-branch, Git would detect a conflict and require human intervention to determine which variable name should be used as a result of the merge. A conflict like this is a pretty common textual conflict which is usually easy to resolve.

But what if the variable was renamed in my-branch but a new function which refers to the old variable name is added to main? In this case, there is no textual conflict but the merged code won’t work correctly. We’ll refer to this as a Semantic Conflict. This class of conflicts is particularly challenging, as they can go completely undetected and, depending on associated testing disciplines, can take a long time to track down when the code behaves unexpectedly.
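The scenario can be reproduced with a small, self-contained illustration. The source text below is a hypothetical merge result: the rename from `row_count` to `total_rows` came from my-branch, and the `report` function came from main. All names are invented for the example.

```python
# Hypothetical result of a merge that Git completed without any conflict.
MERGED_SOURCE = '''
total_rows = 10              # my-branch renamed row_count to total_rows

def report():                # added on main, still uses the old name
    return row_count * 2
'''


def run_merged():
    """Load the merged code, then call the function added on main."""
    namespace = {}
    exec(MERGED_SOURCE, namespace)      # loads without any error
    try:
        return namespace["report"]()
    except NameError as error:          # only surfaces when report() runs
        return f"NameError: {error}"
```

The merged code loads cleanly, and the problem appears only when `report()` is actually executed, which is why this kind of conflict tends to be found by tests or in production rather than at merge time.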

DataStage-Specific Constraints

Unlike traditional programming languages, DataStage source code is exposed as a set of proprietary-formatted exports rather than traditional text. DataStage assets can be exported in two possible formats:

ISX – Supports all Information Server asset types and is the format used by MettleCI. Compressed XML (binary) based format.
DSX – Older Information Server format that only supports a subset of all Information Server asset types. Text based format, not supported by MettleCI.

While the DSX format might be text based like traditional source code, the content of both DSX and ISX formats represents each DataStage job as an acyclic graph with complex relationships and extensive metadata properties. The graph-based nature of these files means that even though Git might be able to merge text based DSX files, the resulting output will contain a large number of both textual and semantic conflicts. To make matters worse, a developer could open the exported files in a text editor but, since these are not intended to be human readable, it would be virtually impossible to fully understand the DataStage asset they represent, making conflict resolution near impossible. For all practical purposes, DataStage exports should be considered binary files within the context of a VCS.

Git will still be able to perform a merge, but if two different versions of the same DataStage export are detected Git will report a conflict and the versions will need to be manually merged. Since DataStage does not include any tools for merging, the only way to resolve a conflict is for a developer to inspect each version in detail using the DataStage user interface and manually construct the merged version. This process is time consuming, tedious, and error prone.
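Because every export changed on both branches has to be merged by hand, it can help to list those exports before attempting a merge. The sketch below assumes you already have the paths changed on each branch since they diverged (for instance from the output of `git diff --name-only` against the merge base); the function name and the file paths in the usage note are hypothetical.

```python
def assets_needing_manual_merge(changed_on_main, changed_on_branch):
    """Return the exports modified on both branches.

    Each of these files would be reported by Git as a conflict and would
    need to be reconstructed manually in the DataStage user interface.
    """
    return sorted(set(changed_on_main) & set(changed_on_branch))
```

For example, if main changed `jobs/TransformJobA.isx` and `jobs/LoadJobB.isx` while my-branch changed `jobs/TransformJobA.isx` and `jobs/ExtractJobC.isx`, only `jobs/TransformJobA.isx` would be returned. The length of this list gives a rough measure of the manual effort a merge will require.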

Considerations

When asked to visualize several branches being merged on a regular basis, developers will usually draw a diagram like this:

However, as changes are made to branches over time, the software versions each branch represents will diverge. A more accurate representation would be:

As can be seen from the updated diagram, branches create ‘distance’ that increases every time a change is made over the lifetime of a branch. Branch distance raises the following risks when it’s time to merge:

conflicts which are difficult to resolve
unexpected defects as a result of merge
duplicated work which is not discovered until it's merged

The likelihood of these risks manifesting as issues grows with the distance between the branches being merged. The merge constraints inherent in DataStage assets sharply increase the impact of the first two risks as branch distance increases.

Summary

Using DataStage with Git branches is possible, but any attempt at merging runs the risk of discovering conflicts between different versions of DataStage exports. Resolving DataStage conflicts is a manual process which is extremely tedious, time consuming and prone to error. The best way to avoid this risk is not to branch in the first place. If you believe branching is unavoidable in your organization, follow these recommendations to constrain the branch creation effort and limit merge risks:

Limit the number of active branches. More branching results in more merging and associated risk.
Don’t create lots of short-lived branches. The effort expended initializing the associated DataStage Projects will outweigh the effort of making the changes themselves.
Avoid long-lived branches, to limit the “distance” between branches being merged.

It is worth noting that the last two recommendations are in conflict with each other. You want branches to be as short lived as possible without causing too much overhead due to creation of branches.

Our recommendation is to foster communication and collaboration between developers working on a single DataStage Project, rather than encouraging the isolated, concurrent development of features. Only use Git branching when you absolutely need to concurrently maintain two or more versions of your source code. For this reason, we strongly recommend the use of Trunk Based Development practices over Feature-driven branching (GitFlow being the most prominent example).
