Data Migrators strongly recommend the use of Trunk Based Development practices over Feature-driven branching strategies like GitFlow. This page describes the challenges of branching in DataStage as part of the rationale.

Introduction

Source code is a vital asset to any software development team, and DataStage development is no exception. Version Control Systems (VCS), such as Git, allow development teams to track changes over time and keep their source code in shape. A VCS provides developers with the ability to reliably recreate previous versions of software while continuing to work on future versions of the same software. Source code can also be duplicated to allow multiple versions of the software to be modified independently and in parallel; this is what a VCS refers to as a code branch. At a later point in time, code changes made to multiple branches can be integrated into a single version of software using a process called a merge:

...

Developers who have used Git for development in text-based programming languages are used to a Git branch being created and available to work on within seconds. Unfortunately, this expectation does not hold true when working with DataStage. Depending on the DataStage Project size, the steps to initialize a new DataStage Project as a working copy could take over an hour. For example, depending on hardware, a typical DataStage Project with 500 assets might require approximately 1 hour and 15 minutes just to import and compile. For this reason it is worth considering how long a branch is intended to remain active before creating it. If a branch is expected to remain open for only a matter of minutes or hours, it may not be worth the overhead of creating the branch in the first place.
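To make the trade-off concrete, the following is a rough back-of-envelope sketch in Python. It takes the figure quoted above (about 500 assets in about 1 hour 15 minutes) as its only reference point and assumes, purely for illustration, that setup time scales linearly with asset count. Real timings depend on hardware and job complexity, and the function names are hypothetical.

```python
# Reference point taken from the text: roughly 500 assets take
# roughly 1 h 15 min to import and compile.
REFERENCE_ASSETS = 500
REFERENCE_MINUTES = 75.0


def estimated_setup_minutes(asset_count):
    """Estimate branch setup time.

    Assumption (not from the source): time scales linearly with asset count.
    """
    return asset_count * REFERENCE_MINUTES / REFERENCE_ASSETS


def setup_overhead_ratio(asset_count, branch_lifetime_hours):
    """Setup time expressed as a fraction of the branch's planned lifetime."""
    return estimated_setup_minutes(asset_count) / (branch_lifetime_hours * 60)


# Compare a short-lived branch with longer-lived ones for a 500-asset project.
for lifetime_hours in (2, 8, 40):
    ratio = setup_overhead_ratio(500, lifetime_hours)
    print(f"{lifetime_hours:>2} h branch: setup is {ratio:.0%} of its lifetime")
```

A ratio close to or above 1 suggests the branch would cost about as much to set up as the work it is meant to isolate.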

...

An old joke says that if you fall off a tall building, the falling isn't going to hurt you, but the landing will. So it is with source code: branching is easy, merging is harder.

Compared to older VCS, Git is very good at merging the text-based files used by conventional programming languages. However, conflicts can still occur, and resolving them can range from the simple-but-time-consuming to problems which are more difficult to identify and resolve. For example, if the name of a variable is changed in both main and my-branch, Git would detect a conflict and require human intervention to determine which variable name should be used as a result of the merge. A conflict like this is a fairly common textual conflict which is usually easy to resolve.

But what if the variable was renamed in my-branch but a new function which refers to the old variable name is added to main? In this case, there is no textual conflict, but the merged code won't work correctly. We'll refer to this as a Semantic Conflict. This class of conflict is particularly challenging, as it can go completely undetected and, depending on associated testing disciplines, can take a long time to track down when the code behaves unexpectedly.
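The scenario above can be sketched in a few lines of Python. The module contents and names (`row_limit`, `max_rows`, `is_over_limit`) are hypothetical, and the `MERGED` text is what a line-based merge would be expected to produce when the two edits touch separate regions of the file. The sketch only shows that the merged result loads without complaint and fails later, at the point of use.

```python
# Common ancestor of both branches (hypothetical module contents).
BASE = """\
row_limit = 100

def describe():
    return f"limit={row_limit}"

def banner():
    return "nightly load"
"""

# my-branch renames row_limit to max_rows everywhere it was used.
# main appends is_over_limit(), which still refers to row_limit.
# The edits touch different regions, so no textual conflict is expected.
MERGED = """\
max_rows = 100

def describe():
    return f"limit={max_rows}"

def banner():
    return "nightly load"

def is_over_limit(count):
    return count > row_limit
"""

namespace = {}
exec(MERGED, namespace)  # the merged module loads without any error
print(namespace["describe"]())

error_message = None
try:
    namespace["is_over_limit"](5)  # the stale name is only hit when called
except NameError as err:
    error_message = str(err)
    print("Semantic conflict surfaced at runtime:", error_message)
```

Nothing in the merge itself flags the problem; it appears only when the new function is exercised, which is why test coverage determines how quickly such conflicts are found.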

DataStage-Specific Constraints

Unlike traditional programming languages, DataStage source code is exposed as a set of proprietary-formatted exports rather than traditional text. DataStage assets can be exported in two possible formats:

...

While the DSX format might be text-based like traditional source code, the text inside both DSX and ISX formats represents each DataStage job as an acyclic graph with complex relationships and extensive metadata properties. The graph-based nature of these files means that even though Git might be able to merge text-based DSX files, the resulting output will contain a large number of both textual and semantic conflicts. To make matters worse, although a developer could open the exported files in a text editor, these files are not intended to be human-readable, so it would be virtually impossible to fully understand the DataStage asset they represent, making conflict resolution near impossible. For all practical purposes DataStage exports should be considered as binary files within the context of a VCS.
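One way to see why combining graph descriptions can go wrong is the toy Python sketch below. It is not a model of the DSX or ISX formats: stages and links are reduced to hypothetical name pairs, and the "merge" is simply the union of the links from both branches. Each branch on its own is a valid acyclic job, yet the combined result contains a cycle.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(links):
    """Return True if the (from_stage, to_stage) pairs form an acyclic graph."""
    sorter = TopologicalSorter()
    for source, target in links:
        sorter.add(target, source)  # target depends on source
    try:
        sorter.prepare()  # raises CycleError if a cycle exists
    except CycleError:
        return False
    return True


# Hypothetical job: Read -> Transform -> Write
base = {("Read", "Transform"), ("Transform", "Write")}

# Branch A adds an Audit stage fed from Write.
ours = base | {("Write", "Audit")}
# Branch B adds an Audit stage that feeds Transform.
theirs = base | {("Audit", "Transform")}

# A naive merge that keeps every link from both branches.
merged = ours | theirs

print("ours acyclic:  ", is_acyclic(ours))
print("theirs acyclic:", is_acyclic(theirs))
print("merged acyclic:", is_acyclic(merged))
```

A merge tool that works line by line has no knowledge of this kind of structural rule, so it cannot warn that the combined job is no longer valid.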

Git will still be able to perform a merge, but if two different versions of the same DataStage export are detected, Git will report a conflict and the versions will need to be manually merged. Since DataStage does not include any tools for merging two DataStage exports, the only way to resolve a conflict is for a developer to inspect each version in detail using the DataStage user interface and manually construct the merged version. This process is time-consuming, tedious, and error-prone.
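Because each such conflict has to be resolved by hand, it can help to know before merging which exports have been changed on both branches. The Python sketch below is one possible pre-merge check, not part of DataStage or any specific tooling. It relies on the standard `git diff --name-only A...B` form, which lists files changed on B since it diverged from A. The `.dsx` and `.isx` file extensions, the function names, and the example paths are assumptions made for illustration.

```python
import subprocess

# Assumed file extensions for DataStage exports stored in the repository.
EXPORT_SUFFIXES = (".dsx", ".isx")


def changed_since_fork(other, branch):
    """Paths changed on `branch` since it diverged from `other`.

    Must be run inside a Git working copy.
    """
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{other}...{branch}"],
        capture_output=True, text=True, check=True,
    )
    return {line for line in result.stdout.splitlines() if line}


def exports_at_risk(ours, theirs):
    """Exports changed on both sides, which may need a manual merge."""
    changed_on_both = ours & theirs
    return sorted(
        path for path in changed_on_both
        if path.lower().endswith(EXPORT_SUFFIXES)
    )


# Demonstration with literal path sets (hypothetical file names).
ours = {"jobs/LoadCustomers.isx", "jobs/LoadOrders.isx", "README.md"}
theirs = {"jobs/LoadOrders.isx", "jobs/LoadProducts.isx", "README.md"}
print(exports_at_risk(ours, theirs))
```

In a real working copy the two sets would come from `changed_since_fork("my-branch", "main")` and `changed_since_fork("main", "my-branch")`. A file appearing in both is only at risk; identical changes on both sides would not conflict.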

...

The likelihood of these risks manifesting as issues is directly proportional to the distance between the branches being merged. However, the merge constraints inherent in DataStage assets mean that the impact of the first two risks increases exponentially as branch distance increases.

...

Using DataStage with Git branches is possible, but any attempt at merging runs the risk of discovering conflicts between different versions of DataStage exports. Resolving DataStage conflicts is a manual process which is extremely tedious, time-consuming, and prone to error. The best way to avoid this risk is not to branch in the first place. If you believe branching is unavoidable in your organization, follow these recommendations to constrain the branch creation effort and limit merge risks:

...

Our recommendation is to foster communication and collaboration between developers working on a single DataStage Project, rather than encouraging the isolated, concurrent development of features. Only use Git branching when you absolutely need to concurrently maintain two or more versions of your source code. For this reason, we strongly recommend the use of Trunk Based Development practices over Feature-driven branching (GitFlow being the most prominent example).