Working with DataStage and Git Branches

Data Migrators strongly recommend the use of Trunk Based Development practices over Feature-driven branching strategies like GitFlow. This page describes the challenges of branching in DataStage as part of the rationale.

Introduction

Source code is a vital asset to any software development team, and DataStage development is no exception. Version Control Systems (VCS), such as Git, allow development teams to track changes over time and keep their source code managed in a consistent way. A VCS provides developers with the ability to reliably recreate previous versions of software while continuing to work on future versions of the same software. Source code can also be duplicated to allow multiple versions of the software to be modified independently and in parallel; this is what a VCS refers to as a code branch. At a later point in time, code changes made to multiple branches can be integrated into a single version of software using a process called a merge:

This guide describes how to work with DataStage, Git, and Branches. The first three sections outline using DataStage with common Git activities:

  1. Committing changes from DataStage

  2. Creating a branch for DataStage development

  3. Merging DataStage changes from multiple branches

Each of these sections will discuss considerations which are unique to DataStage and the constraints that they impose.

Committing changes from DataStage

Before diving into the process of committing changes from DataStage into Git, it is useful to review how Git is used with traditional software development languages like C/C++, Java, Python, etc. These programming languages are text based and each developer works on their own copy of source code called a working copy:

Without any additional tools, there is no way for developers to collaborate or share their work with other team members. Developers could use network storage to work on a shared working copy but, due to the nature of traditional programming languages, that wouldn’t be very productive as developers would continually interfere with each other and nobody would be able to compile the software. Using Git, developers work on changes to their own working copy and collaborate with other team members by regularly committing their work to a shared repository. Git is a distributed VCS, so the process of performing a commit is a little more involved than with an older, centralized VCS like CVS or SVN. When using Git, every developer has a working copy and a local Git repository. Changes to the working copy are committed to the local repository, which is then synchronized with a common remote repository through operations called push and pull:

In contrast, DataStage developers log into a DataStage Project on a central server where they can make changes to source code. While DataStage does not allow assets such as Jobs to be modified by more than one developer at a time, it has been designed to allow multiple Developers to work concurrently on the same DataStage Project. Whenever a developer saves changes to a DataStage asset, the changes are visible to all other developers working in the same DataStage Project. From a VCS perspective, a DataStage Project is the common working copy which can be concurrently worked on by multiple developers:

Unfortunately, DataStage source code isn’t text based and is kept on the server in a proprietary format. Therefore, in order to commit changes from a DataStage Project to a Git repository, the modified assets must be exported to a file first, then subsequently added to Git. This process is automatically handled by MettleCI Workbench:

A DataStage project can be connected to a remote Git repository by using the project registration administrator functions within Workbench. Since MettleCI exposes a DataStage Project as just another working copy, non-DataStage assets (such as shell scripts, SQL, etc) can be committed alongside DataStage assets using the standard Git commit process. In the example below, DataStage developers continue to update and commit changes made to the DataStage Project while two other developers update shell scripts and SQL (potentially within local repositories on their own laptops) by using the same workflow that applies to conventional, text-based programming languages:

Considerations

Because of the shared nature of using a DataStage Project as a working copy, it is good development practice to ensure that changes to a DataStage Project do not remain uncommitted for long periods of time. Doing so can result in developers unwittingly committing overlapping changes.

For example, imagine Bill started work on TransformJobA and moved on to another task before committing his change. In the meantime, Mary also needs to make a change to TransformJobA, which she commits when finished. Unfortunately, Mary has just committed a version of TransformJobA which contains both her changes and Bill’s, and Bill’s changes may not have been complete.

Other than keeping DataStage jobs open (and therefore locked, preventing concurrent edits) until changes are committed, a simple solution is to create a copy of a job before working on it. As long as everyone follows the same naming convention (e.g. ${Job Name}_WIP), DataStage developers can easily identify when changes are in progress and by whom. Having a copy also makes it trivial to discard changes when necessary.

Returning to the previous example, Bill creates a copy of TransformJobA and calls it TransformJobA_WIP. He makes changes to this job and moves on to another task before committing his change. In the meantime, Mary needs to make a change to TransformJobA but can’t create a copy called TransformJobA_WIP because it already exists. She finds the existing TransformJobA_WIP job and checks the last modified by user property to find out who was working on it. Mary contacts Bill to understand the changes he was making and decide the best way to proceed.

DataStage Project categories are used to group DataStage assets and impose logical structure. Most developers consider categories to be the DataStage equivalent of folders on the filesystem. However, it is important to note that DataStage asset names for a given type (e.g. Jobs, Table Definitions, Routines, etc.) must be unique across the entire project rather than within a specific category. Therefore, using a common naming convention for work-in-progress jobs still alerts developers if they attempt concurrent work on the same DataStage asset, even when they work in different categories.
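The work-in-progress convention and the project-wide uniqueness rule can be sketched together as a small in-memory model. This is only an illustration: the Project and Job classes and the start_work function are hypothetical names invented here, not part of DataStage or MettleCI.

```python
class Job:
    def __init__(self, name, category, last_modified_by):
        self.name = name
        self.category = category
        self.last_modified_by = last_modified_by


class Project:
    """Toy model of a DataStage Project (hypothetical, for illustration)."""

    def __init__(self):
        # Job names are unique across the whole project, not per category,
        # so a single dict keyed by name models that constraint.
        self.jobs = {}

    def add_job(self, name, category, user):
        existing = self.jobs.get(name)
        if existing is not None:
            raise ValueError(
                f"{name} already exists in category {existing.category}, "
                f"last modified by {existing.last_modified_by}"
            )
        self.jobs[name] = Job(name, category, user)

    def start_work(self, job_name, user, category=None):
        """Create the <job>_WIP copy; fails if someone already has one."""
        source = self.jobs[job_name]
        wip_name = f"{job_name}_WIP"
        self.add_job(wip_name, category or source.category, user)
        return wip_name
```

In this model, Bill's call to start_work succeeds, while Mary's later call for the same job raises an error naming Bill, even if she chooses a different category.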

Creating a branch for DataStage development

By default, a Git repository will start with a single Branch, historically called master and, increasingly, main. New branches can be created quickly and easily within Git by choosing the particular commit you’d like to branch from:

For developers who aren’t working with DataStage assets, performing a Git checkout from a branch will update their working copy to reflect the version of software represented by the branch. All changes to the working copy are always committed to the checked-out branch. It is not possible to commit to a different branch without first discarding uncommitted changes and performing a new checkout which updates the working copy. In the following example, a branch called my-branch is checked out from Git. Changes made to the working copy will always be committed to my-branch:

You can achieve the same state by creating and using a new, branch-specific DataStage Project as the working copy for a Git branch. The only difference is that the newly-created DataStage Project and Git branch are associated with each other through the MettleCI Workbench Project Registration interface, rather than being a function of Git’s checkout process which, of course, doesn’t handle the particular needs of DataStage development:

The process of performing a Git checkout will not itself update a DataStage Project. Creating a Git branch with a DataStage Project as working copy requires the following steps:

  1. Create the desired branch in Git (my-branch in the example above)

  2. Check out the desired branch using Git

  3. Create a new Project in DataStage (e.g. my-branch) which will be used as the Working Copy

  4. Register the new DataStage Project in MettleCI Workbench, ensuring the Git repository setting is configured with the desired branch (e.g. my-branch)

  5. Import all DataStage ISX files which were checked out in Step 2 into the DataStage Project created in Step 3

  6. Compile all imported Jobs

Steps 5 and 6 can be performed from the command line using MettleCI’s Deployment command or using the Information Server Manager and Multi Job Compile tools which are installed with the DataStage Client.
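As a rough sketch, Steps 1 and 2 map onto two standard Git commands, which the hypothetical helper below assembles and optionally runs. The function names, the default start point of main, and the MANUAL_STEPS list are assumptions made for illustration. Steps 3 to 6 are kept as reminders only, because their exact command lines depend on your DataStage and MettleCI installation.

```python
import subprocess


def git_branch_commands(branch, start_point="main"):
    """Git commands for Steps 1 and 2: create the branch, then check it out."""
    return [
        ["git", "branch", branch, start_point],
        ["git", "checkout", branch],
    ]


# Steps 3 to 6 rely on DataStage and MettleCI tooling, so they are listed
# as reminders rather than as invented command lines.
MANUAL_STEPS = [
    "Create a new DataStage Project to act as the working copy",
    "Register the project in MettleCI Workbench against the branch",
    "Import the checked-out ISX files into the new project",
    "Compile all imported Jobs",
]


def run_git_steps(branch, start_point="main", repo_dir="."):
    """Run Steps 1 and 2 inside repo_dir, stopping at the first failure."""
    for command in git_branch_commands(branch, start_point):
        subprocess.run(command, cwd=repo_dir, check=True)
```

Calling run_git_steps("my-branch") from inside a clone would create and check out the branch; the remaining steps still have to be carried out with the DataStage and MettleCI tools.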

When creating a DataStage Project to use as a branch working copy, we recommend using a naming convention which makes it clear which repository and branch a DataStage project relates to.

For example, you could use ${Git Repo Name}_${branch name}_develop so you know the project is used for development and which repository/branch it is associated with.
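A minimal sketch of that naming convention follows. The project_name function is hypothetical, and replacing characters such as "/" or spaces with hyphens is an assumption; check which characters your DataStage version permits in project names.

```python
import re


def project_name(repo, branch, purpose="develop"):
    """Build '<repo>_<branch>_<purpose>' for a branch working copy."""

    def clean(part):
        # Assumption: collapse anything other than letters and digits into
        # a hyphen, so branch names like 'feature/x' stay readable.
        return re.sub(r"[^A-Za-z0-9]+", "-", part).strip("-")

    return f"{clean(repo)}_{clean(branch)}_{purpose}"
```

With these rules, a repository called etl-repo and a branch called my-branch give the project name etl-repo_my-branch_develop.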

Considerations

Developers who have used Git for development in text-based programming languages are used to a Git branch being created and available to work on within seconds. Unfortunately, this assumption does not hold true when working with DataStage assets. Depending on the DataStage Project size, the steps to initialize a new DataStage Project as a working copy could take over an hour. For example, depending on hardware, a typical DataStage Project with 500 assets might require approximately 1 hour and 15 minutes just to import and compile. For this reason it is worth considering how long a branch is intended to remain active before creating it. If a DataStage project branch is expected to remain open for only a matter of minutes or hours, it may not be worth the overhead of creating the branch in the first place.
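To make the trade-off concrete, the figure quoted above (500 assets in about 75 minutes) works out to roughly 9 seconds per asset. The sketch below scales that linearly, which is an assumption: real import and compile times depend on hardware and job complexity. Both function names are hypothetical.

```python
REFERENCE_ASSETS = 500
REFERENCE_MINUTES = 75  # 1 hour and 15 minutes, from the example above


def estimated_init_minutes(asset_count):
    """Linear estimate of import-and-compile time for a new project."""
    return asset_count * REFERENCE_MINUTES / REFERENCE_ASSETS


def setup_overhead_ratio(asset_count, branch_lifetime_minutes):
    """Setup time expressed as a fraction of the branch's expected lifetime."""
    return estimated_init_minutes(asset_count) / branch_lifetime_minutes
```

Under this estimate, a branch expected to live for four hours on a 500-asset project spends about 31% of that lifetime (75 of 240 minutes) on setup alone.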

Merging DataStage changes from multiple branches

While branching allows developers to work on multiple streams of development in isolation and in parallel, at some point changes from branches will need to be integrated using a merge:

An old joke says that if you fall off a tall building, the falling probably won’t hurt you, but the landing definitely will. So it is with source code: branching is easy, merging is harder.

Git is very good at merging the text-based files used by conventional programming languages. However, conflicts can still occur, and resolving them can range from the simple-but-time-consuming to problems which are more difficult to identify and resolve. For example, if the name of a variable is changed in both main and my-branch, Git would detect a conflict and prompt for human intervention to determine which variable name should be used as a result of the merge. A conflict like this is a common textual conflict which is usually easy to resolve.

But what if the variable was renamed in my-branch but a new function which refers to the old variable name is added to main? In this case, there is no textual conflict but the merged code won't work correctly. We’ll refer to this as a semantic conflict. This class of conflicts is particularly challenging, as they can go completely undetected and, depending on associated testing disciplines, can take a long time to track down when the code behaves unexpectedly.
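The following sketch reproduces a semantic conflict in miniature. The variable and function names (max_retries, retry_limit, should_retry) are invented for illustration. The merged source stands for what a clean merge would produce when my-branch renames a variable and main adds a function that still uses the old name.

```python
# Result of a merge that Git reports as clean: the two changes touch
# different lines, so there is no textual conflict.
MERGED_SOURCE = '''
retry_limit = 3  # my-branch renamed max_retries to retry_limit

def should_retry(attempt):  # added on main, written against the old name
    return attempt < max_retries
'''

namespace = {}
exec(MERGED_SOURCE, namespace)  # the merged module loads without complaint

try:
    namespace["should_retry"](1)
    outcome = "ok"
except NameError as error:
    # The problem only surfaces when the new function is actually called.
    outcome = f"NameError: {error}"

print(outcome)
```

The failure appears only at call time, which is why this kind of conflict can go unnoticed until a test or a production run exercises the affected code path.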

DataStage-Specific Constraints

Unlike traditional programming languages, DataStage source code is exposed as one or more exported files in a proprietary format rather than as conventional text. DataStage assets can be exported in two possible formats:

  1. ISX – Supports all Information Server asset types and is the format used by MettleCI. A binary format based on compressed XML.

  2. DSX – Older Information Server format that only supports a subset of all Information Server asset types. A text-based format, not supported by MettleCI.

While the DSX format might be text-based like traditional source code, the content of both DSX and ISX files represents each DataStage job as an acyclic graph with complex relationships and extensive metadata properties. The graph-based nature of these files means that even though Git might attempt to merge text-based DSX files, the resulting output will contain a large number of both textual and semantic conflicts. To make matters worse, although a developer could open the exported files in a text editor, the files are not intended to be human readable, so it would be virtually impossible to fully understand the DataStage asset they represent, making conflict resolution nearly impossible. For all practical purposes, DataStage exports should be considered binary files within the context of a VCS.

Technically, Git will still be able to perform a merge, but if two different versions of the same DataStage export are detected, Git will report a conflict and the versions will need to be manually merged. Since DataStage does not include any tools to support the resolution of merge conflicts, the only way to resolve these is for a developer to inspect each version in detail using the DataStage user interface and manually construct the merged version. This process is time consuming, tedious, and error prone.
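Treating an export as an opaque binary file reduces a three-way merge to a whole-file decision, sketched below. This is a simplified model of that behaviour, not Git's implementation; the function names and the file names in the usage note are hypothetical.

```python
def merge_asset(base, ours, theirs):
    """Whole-file three-way merge: returns (status, merged_content)."""
    if ours == theirs:
        return "identical", ours       # both sides agree
    if ours == base:
        return "take_theirs", theirs   # only the other branch changed it
    if theirs == base:
        return "take_ours", ours       # only our branch changed it
    return "conflict", None            # both changed it: manual rebuild


def conflicting_assets(base, ours, theirs):
    """Names of assets changed differently on both branches.

    Each argument maps an export file name to its raw bytes; a missing
    key means the asset does not exist in that version.
    """
    names = sorted(set(base) | set(ours) | set(theirs))
    return [
        name
        for name in names
        if merge_asset(base.get(name), ours.get(name), theirs.get(name))[0]
        == "conflict"
    ]
```

If JobA.isx was changed on both branches and JobB.isx on only one, conflicting_assets reports just JobA.isx. Every name it returns represents a job that someone must rebuild by hand in the DataStage user interface.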

Considerations

When asked to visualize several branches being merged on a regular basis, developers will usually draw a diagram like this:

However, as changes are made to branches over time, the software versions each branch represents will diverge. A more accurate representation would be:

As can be seen from the updated diagram, branches create ‘distance’ that increases every time a change is made over the lifetime of a branch. Branch distance raises the following risks when it’s time to merge:

  1. conflicts which are difficult to resolve

  2. unexpected defects as a result of merge

  3. duplicated work which is not discovered until it's merged

The likelihood of these risks manifesting as issues is directly proportional to the distance between the branches being merged. The merge constraints inherent in DataStage assets increase the impact of the first two risks exponentially as branch distance increases.
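A toy probability model gives a feel for how the first risk grows with branch distance. It assumes each branch has modified a uniformly random set of assets since the common ancestor, which is a deliberate simplification (real changes cluster around related jobs), and the function name is hypothetical.

```python
from math import comb


def overlap_probability(total_assets, changed_on_a, changed_on_b):
    """Chance that two branches changed at least one asset in common."""
    if changed_on_a + changed_on_b > total_assets:
        return 1.0  # the two sets cannot avoid each other
    # Probability that all of B's changes fall outside A's changed set.
    no_overlap = comb(total_assets - changed_on_a, changed_on_b) / comb(
        total_assets, changed_on_b
    )
    return 1.0 - no_overlap


if __name__ == "__main__":
    for changes in (5, 20, 50):
        print(changes, round(overlap_probability(500, changes, changes), 3))
```

Running the script for a 500-asset project shows the chance of at least one overlapping asset climbing as each branch accumulates changes. Under the binary-file constraint described earlier, every overlap is a manual merge.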

Summary

Using DataStage with Git branches is possible, but any attempt at merging runs the risk of discovering conflicts between different versions of DataStage exports. Resolving DataStage conflicts is a manual process which is extremely tedious, time consuming and prone to error, undermining whatever benefits were sought by branching in the first place. The best way to avoid this risk is to avoid branching in the first place. If you believe branching is unavoidable in your organization, follow these recommendations to constrain the branch creation effort and limit merge risks:

  • Limit the number of active branches. More branching results in more merging and associated risk.

  • Don’t create lots of short-lived branches: the effort expended initializing the associated DataStage Projects will outweigh the cost of making the branch changes themselves.

  • Avoid long-lived branches to limit the “distance” between branches being merged.

It is worth noting that the last two recommendations are in tension with each other. You want branches to be as short-lived as possible without incurring too much overhead from creating them.

Our recommendation is to foster communication and collaboration between developers working on a single DataStage Project, rather than encouraging the isolated, concurrent development of features. Only use Git branching when you absolutely need to concurrently maintain two or more versions of your source code - which ought to be rare in a healthy DataStage solution. For this reason, we strongly recommend the use of Trunk Based Development practices over Feature-driven branching (GitFlow being the most prominent example).

© 2015-2024 Data Migrators Pty Ltd.