...

Source code is a vital asset to any software development team, and DataStage development is no exception. Version Control Systems (VCS), such as Git, allow development teams to track changes over time and keep their source code managed in a consistent way. A VCS provides developers with the ability to reliably recreate previous versions of software while continuing to work on future versions of the same software. Source code can also be duplicated to allow multiple versions of the software to be modified independently and in parallel; this is what a VCS refers to as a code branch. At a later point in time, code changes made to multiple branches can be integrated into a single version of software using a process called a merge:

...

Unfortunately, DataStage source code isn’t text based and is kept on the server in a proprietary format. Therefore, in order to commit changes from a DataStage Project to a Git repository, the modified assets must first be exported to a file, which can then be added to Git. This process is automatically handled by MettleCI Workbench:

...

A DataStage project can be connected to a remote Git repository by using the project registration administrator functions within Workbench. Since MettleCI exposes a DataStage Project as just another working copy, non-DataStage assets (such as shell scripts, SQL, etc.) can be committed alongside DataStage assets using the standard Git commit process. In the example below, DataStage developers continue to update and commit changes made to the DataStage Project while two other developers update shell scripts and SQL (potentially within local repositories on their own laptops) by using the same workflow that applies to conventional, text-based programming languages:

...

By default, a Git repository will start with a single branch, historically called master and, increasingly, main. New branches can be created quickly and easily within Git by choosing the particular commit you’d like to branch from:

...

For developers who aren’t working with DataStage assets, performing a Git checkout from a branch will update their working copy to reflect the version of software represented by the branch. All changes to the working copy are always committed to the checked-out branch. It is not possible to commit to a different branch without first discarding uncommitted changes and performing a new checkout which updates the working copy. In the following example, a branch called my-branch is checked out from Git. Changes made to the working copy will always be committed to my-branch:

[Diagram: Branch Working Copy]
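For non-DataStage assets, the checkout-then-commit cycle described above can be sketched as follows. This is a minimal illustration rather than part of MettleCI: the helper names, the commit message, and the repository path are hypothetical, and only standard Git commands are used.

```python
import subprocess
from typing import List


def checkout_and_commit_commands(branch: str, message: str) -> List[List[str]]:
    """Return the Git commands for: check out a branch, stage changes, commit."""
    return [
        ["git", "checkout", branch],       # update the working copy to the branch
        ["git", "add", "--all"],           # stage every change in the working copy
        ["git", "commit", "-m", message],  # the commit lands on the checked-out branch
    ]


def run_in_repo(commands: List[List[str]], repo_dir: str) -> None:
    """Run each command inside repo_dir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Hypothetical branch name and message; print the commands instead of running them.
    for command in checkout_and_commit_commands("my-branch", "Update load script"):
        print(" ".join(command))
```

Calling `run_in_repo(commands, "/path/to/working/copy")` would execute the same sequence against a real working copy, with the path being a placeholder.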

You can achieve the same state by creating and using a new, branch-specific DataStage Project as the working copy for a Git branch. The only difference is that the newly-created DataStage Project and Git branch are associated with each other through the MettleCI Workbench Project Registration interface, rather than being a function of Git’s checkout process which, of course, doesn’t handle the particular needs of DataStage development:

[Diagram: DataStage Branch Working Copy]

...

Developers who have used Git for development in text-based programming languages are used to a Git branch being created and available to work on within seconds. Unfortunately, this assumption does not hold true when working with DataStage assets. Depending on the DataStage Project size, the steps to initialize a new DataStage Project as a working copy could take over an hour. For example, depending on hardware, a typical DataStage Project with 500 assets might require approximately 1 hour and 15 minutes just to import and compile. For this reason it is worth considering how long a branch is intended to remain active before creating it. If a DataStage project branch is expected to remain open for only a matter of minutes or hours, it may not be worth the overhead of creating the branch in the first place.
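As a rough back-of-envelope aid, the figure quoted above (500 assets taking about 75 minutes) can be turned into an estimate for other project sizes. The sketch below assumes that import and compile time scales linearly with asset count, which is a simplification and not a measured property of DataStage; the function names and the comparison rule are hypothetical.

```python
def estimate_setup_minutes(asset_count: int,
                           reference_assets: int = 500,
                           reference_minutes: float = 75.0) -> float:
    """Estimate import-and-compile time, assuming linear scaling (an assumption)."""
    if asset_count < 0:
        raise ValueError("asset_count must be non-negative")
    return asset_count * reference_minutes / reference_assets


def setup_dominates(asset_count: int, branch_lifetime_minutes: float) -> bool:
    """True when the estimated setup time is at least as long as the branch will live."""
    return estimate_setup_minutes(asset_count) >= branch_lifetime_minutes


if __name__ == "__main__":
    print(estimate_setup_minutes(1200))       # estimated minutes for 1200 assets
    print(setup_dominates(500, 60))           # branch expected to live one hour
    print(setup_dominates(500, 5 * 24 * 60))  # branch expected to live five days
```

Real timings depend heavily on hardware and job complexity, so measured values from your own environment should replace the reference figures where available.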

...

An old joke says that if you fall off a tall building, the falling probably won’t hurt you, but the landing definitely will. So it is with source code: branching is easy, merging is harder.

Git is very good at merging the text-based files used by conventional programming languages. However, conflicts can still occur, and resolving them can range from the simple but time-consuming to problems which are more difficult to identify and resolve. For example, if the name of a variable is changed in both main and my-branch, Git would detect a conflict and prompt for human intervention to determine which variable name should be used as a result of the merge. A conflict like this is a common textual conflict which is usually easy to resolve.

But what if the variable was renamed in my-branch but a new function which refers to the old variable name is added to main? In this case, there is no textual conflict but the merged code won't work correctly. We’ll refer to this as a semantic conflict. This class of conflicts is particularly challenging, as they can go completely undetected and, depending on associated testing disciplines, can take a long time to track down when the code behaves unexpectedly.
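The semantic conflict described above can be reproduced in a few lines. In this hypothetical sketch, `MERGED_SOURCE` stands for the result of a clean merge: my-branch renamed `total` to `row_count`, while main added a `report` function that still uses the old name. All names are invented for illustration.

```python
# Result of a merge that Git would complete without reporting any conflict.
MERGED_SOURCE = '''
row_count = 10          # my-branch renamed `total` to `row_count`

def report():           # main added this function, still using the old name
    return total * 2
'''


def run_merged(source: str) -> str:
    """Load the merged module, call report(), and describe what happens."""
    namespace = {}
    exec(source, namespace)  # loading succeeds: the text merged cleanly
    try:
        namespace["report"]()
        return "ok"
    except NameError as err:
        # The failure only appears when the new function is actually called.
        return f"semantic conflict: {err}"


if __name__ == "__main__":
    print(run_merged(MERGED_SOURCE))
```

The merged module loads without complaint, and the problem surfaces only when `report` is called, which is why conflicts of this kind depend on testing to be found.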

...

Unlike traditional programming languages, DataStage source code is exposed as one or more exported files with a proprietary format rather than conventional text. DataStage assets can be exported in two possible formats:

...

While the DSX format might be text-based like traditional source code, the text inside both DSX and ISX formats represents each DataStage job as an acyclic graph with complex relationships and extensive metadata properties. The graph-based nature of these files means that even though Git might attempt to merge text-based DSX files, the resulting output will contain a large number of both textual and semantic conflicts. To make matters worse, a developer could open the exported files in a text editor but, since these are not intended to be human readable, it would be virtually impossible to fully understand the DataStage asset they represent, making conflict resolution nearly impossible. For all practical purposes, DataStage exports should be considered as binary files within the context of a VCS.

Technically, Git will still be able to perform a merge, but if two different versions of the same DataStage export are detected, Git will report a conflict and the versions will need to be manually merged. Since DataStage does not include any tools to support the resolution of merge conflicts, the only way to resolve these is for a developer to inspect each version in detail using the DataStage user interface and manually construct the merged version. This process is time consuming, tedious, and error prone.
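One way to have Git itself treat exports as binary is the built-in `binary` attribute in a `.gitattributes` file, which stops Git from attempting a textual diff or merge on matching files. The sketch below is not taken from MettleCI documentation: it assumes exports are stored with `.dsx` and `.isx` file extensions, and the helper names are hypothetical.

```python
from pathlib import Path
from typing import List, Sequence

# Assumed file extensions for DataStage exports; adjust to match your repository.
EXPORT_PATTERNS = ("*.dsx", "*.isx")


def binary_attribute_lines(patterns: Sequence[str] = EXPORT_PATTERNS) -> List[str]:
    """Build .gitattributes lines that mark each pattern with Git's `binary` attribute."""
    return [f"{pattern} binary" for pattern in patterns]


def ensure_binary_attributes(repo_dir: str) -> Path:
    """Append any missing lines to <repo_dir>/.gitattributes, keeping existing content."""
    path = Path(repo_dir) / ".gitattributes"
    existing = path.read_text().splitlines() if path.exists() else []
    missing = [line for line in binary_attribute_lines() if line not in existing]
    if missing:
        path.write_text("\n".join(existing + missing) + "\n")
    return path
```

With these attributes in place, a merge involving two changed versions of the same export is reported as a conflict without conflict markers being written into the file, leaving the manual reconstruction described above as the remaining step.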

...

Using DataStage with Git branches is possible, but any attempt at merging runs the risk of discovering conflicts between different versions of DataStage exports. Resolving DataStage conflicts is a manual process which is extremely tedious, time consuming and prone to error, undermining whatever benefits were sought by branching in the first place. The best way to avoid this risk is to avoid branching in the first place. If you believe branching is unavoidable in your organization, follow these recommendations to constrain the branch creation effort and limit merge risks:

...

Our recommendation is to foster communication and collaboration between developers working on a single DataStage Project, rather than encouraging the isolated, concurrent development of features. Only use Git branching when you absolutely need to concurrently maintain two or more versions of your source code - which ought to be rare in a healthy DataStage solution. For this reason, we strongly recommend the use of Trunk Based Development practices over Feature-driven branching (GitFlow being the most prominent example).