Keeping track of file versions has been a long-standing problem in the world of software. This remains true whether you’re writing source code for an application, command-line scripts, or even authoring a book or documentation. Sure, you could just create .ZIP files with a date/time stamp naming convention, or even adopt some kind of server-based Source Control Management (SCM) system. However, there are a ton of issues that can occur with many of these solutions. These issues are the very reason Linus Torvalds (the creator of the Linux operating system) created the Git version control system. In fact, Linus created Git because the other solutions available had failed at adequately managing the Linux source code.
What is Git?
Git is a distributed version control system that is both Free and Open Source. It’s designed for speed and efficiency so it can handle anything from small to very large projects. Git has also become pretty much the standard version control system within the industry. Git is primarily used for source code management in the software industry, but it can track changes to any set of files. If you are editing files and need to keep track of versions, and possibly revert changes when necessary, then you can use Git within your workflow to manage those files.
Git Origin Story
Git was first released in 2005. There have been many different contributors to the Git project over the years, but the initial release was written by Linus Torvalds, the creator of the Linux operating system. He created Git to solve the issues the maintainers of the Linux kernel had with other source control management systems; in particular BitKeeper, a proprietary SCM that had formerly been used to manage the Linux kernel project.
Linus Torvalds created Git as a distributed source control versioning system that could be used similarly to BitKeeper, but with greater performance. The reason for the performance need was that the Linux kernel had grown to push the limits of BitKeeper’s performance, to the point that it could take 30 seconds or more to apply a single patch. This was especially a problem for the Linux kernel, as syncing source code with fellow maintainers on the project could require 250 of those actions at once. That could take 7,500 seconds, or over 2 hours, each time!
Git was built as a distributed version control system that is lightweight and fast. Its distributed nature enables multiple people to work on the same project simultaneously in a disconnected fashion, then sync up all their changes together when ready. It also implements a feature called Branching that enables a person to create a snapshot or copy of a repository at a point in time within Git, use that isolated copy to make changes, then merge those changes back into the “main” branch (may also be called “master”) when finished. Branching enables multiple people to effectively work on the same project simultaneously without accidentally breaking or overwriting each other’s work. And the distributed nature of Git enables all of that to be done on a local machine without Internet or corporate network connectivity until the changes need to be synced back up together again.
Before discussing how to use Git, it’s worth mentioning a few important details. Git has basically become the standard Version Control system used in the IT industry. It can be used across all Linux, Windows, and macOS computers, and has been adopted by many major corporations, including Microsoft, Google, Facebook, Netflix, LinkedIn, and many others, far beyond just being a tool used by the Open Source community.
[Image: a few of the corporations that have adopted Git. Screenshot taken from http://git-scm.com]
Why Central Repositories Fail
If you’re familiar with some of the “older” style of version control systems, like Microsoft Team Foundation Server (TFS), Visual SourceSafe (VSS), or Subversion (SVN), or even a network file share folder, then you’re familiar with the idea of having a central place for an individual or team to store the latest version of the files for a project. A file share may not track file versions (depending on the technology used), while systems like TFS and SVN do track versions. However, all of these have similarities in how they enable individuals or teams to contribute to the project being tracked. They are generally used with a single Repository that every person on the team contributes to, and that Repository resides solely on the server.
With a Central Repository, there’s a single source of truth for the current state and versioning of files in a project. This works great for teams that work in close proximity to each other in an environment where the network never goes down. In an environment like a corporate on-site office, everyone can “always” connect when they are in the office and get work done. However, even in a corporate, on-site environment this “always” connected idea is a fallacy. Every team has times when the network goes down, a server reboots or crashes, or something else prevents them from working for a period of time when using Central Repository style version control systems.
A Distributed Repository system is the answer to everything that’s wrong, or bad, with a Central Repository system. With a Distributed Repository, each team member has their own local copy of the Central Repository to work from. This eliminates any issues of not being able to get work done if the Central Repository is inaccessible for some reason. This becomes tremendously more important when team members are distributed across multiple corporate offices, countries, or continents, or work from home some or all of the time. With a Distributed Repository, team members can contribute back to the Central Repository and retrieve other team members’ changes to a project as necessary.
Simplified Git Flow
Git is a Distributed Repository system that solves the issues found in Central Repository systems, but it doesn’t stop there. There are additional features in Git that enable smooth workflow management of any sized project. These features are things like Repositories, Branching, Committing, and Merging. Obviously there are other features of Git, but these are the keys that enable everything.
Before getting into an explanation of a Simplified Git Flow, let’s lay the groundwork by defining these key features of Git:
- Repository – A Repository is a container for maintaining files. With a version control system, like Git, each file stored in this container is tracked to keep a record of the file history whenever changes are made.
- Branch – Each Repository must have at least 1 Branch. A Branch is sort of a sub-Container within the Repository for tracking file versioning. Each Branch has its own history that is maintained. A Branch can also be created at any time, and made as a “child” with another Branch as the “parent”. This means that you can start making changes to one or more files by spawning a new Branch to make those changes. A Branch provides an isolated environment to make file changes without affecting other Branches. This can be important for maintaining the integrity of the files or source code for different environment deployments, like Production, Testing, or others.
- Commit – When changes are made to one or more files, those are saved to a Branch as a Commit. To persist file changes and add them to the file version tracking of the Branch, you Commit them to the Branch. Each time a file change is added to the history of a Branch, a Commit is made. A single Commit can contain the changes to one or multiple files.
- Merge – After changes to a Branch are finished, they are Merged into another Branch. This enables you to essentially replay all the changes Committed to a Branch on another Branch to contribute those changes, generally by merging a “child” Branch back to its “parent” Branch.
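To make the four terms above concrete, here is a minimal sketch that drives the real `git` command line from Python. It assumes Git is installed and on the PATH. The `git()` helper, the file name `notes.txt`, the branch name `feature`, and the throwaway identity are all hypothetical choices for illustration, not part of the original article.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command inside `cwd` and return its trimmed output."""
    # Throwaway identity (an assumption) so commits work without any Git config.
    identity = ["-c", "user.name=Example", "-c", "user.email=example@example.com"]
    done = subprocess.run(["git", *identity, *args], cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def demo_branch_commit_merge():
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git("init", cwd=repo)                                        # Repository
        git("symbolic-ref", "HEAD", "refs/heads/master", cwd=repo)   # first Branch is "master"
        (repo / "notes.txt").write_text("first draft\n")
        git("add", "notes.txt", cwd=repo)
        git("commit", "-m", "Add notes", cwd=repo)                   # Commit on "master"

        git("checkout", "-b", "feature", cwd=repo)                   # child Branch
        (repo / "notes.txt").write_text("first draft\nsecond thoughts\n")
        git("commit", "-am", "Expand notes", cwd=repo)               # Commit on "feature"

        git("checkout", "master", cwd=repo)
        before = (repo / "notes.txt").read_text()                    # "master" is still untouched
        git("merge", "feature", cwd=repo)                            # Merge child into parent
        after = (repo / "notes.txt").read_text()
        history = git("log", "--format=%s", cwd=repo).splitlines()   # newest first
        return before, after, history
```

Calling `demo_branch_commit_merge()` returns the file content on “master” before and after the Merge, plus the commit subjects now recorded on “master”, which shows that work on the “feature” Branch stayed isolated until it was merged.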
Oh, and since Git is a Distributed Repository based version control system, there are a couple more terms that you’ll need to be familiar with to understand the workflow of using Git. These terms are used to distinguish between the server or computer location of the Git repositories and branches:
- Origin – Origin is the default name Git gives to the Repository that a local copy was cloned from, which is typically stored on a different computer or a Git server. Since this is the “source of truth” copy of the Repository that all team members copy down before beginning to make changes, it’s referred to as the Origin.
- Remote – A Remote is a named reference, kept in a local Repository, to a copy of the Repository that lives somewhere else; the Origin is the most common Remote. The local copy of the Repository and Branches on an individual team member’s computer is called a clone. Before making any file changes and subsequent Commits to a Repository or Branch, a clone needs to be created from the Origin to get started.
While working as an individual, you could essentially remove the idea of an Origin by working on a single computer. However, when working in teams of 2 or more individuals, the concepts of Remote and Origin become important. These concepts are how the Distributed nature of Git is implemented.
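As a rough illustration of how these terms show up in Git’s own command line, the sketch below creates a bare repository to stand in for the server, clones it, and asks the clone which remotes it knows about. It assumes Git is installed and on the PATH. The `git()` helper and the directory names `server.git` and `local` are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command inside `cwd` and return its trimmed output."""
    done = subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def demo_clone_and_remote():
    with tempfile.TemporaryDirectory() as tmp:
        server = Path(tmp) / "server.git"   # stands in for the shared copy on a Git server
        local = Path(tmp) / "local"         # a team member's own copy
        git("init", "--bare", str(server), cwd=tmp)
        git("clone", str(server), str(local), cwd=tmp)

        # The clone records where it came from under the name "origin".
        names = git("remote", cwd=local).splitlines()
        url = git("config", "--get", "remote.origin.url", cwd=local)
        return names, url
```

In this sketch `demo_clone_and_remote()` returns the list of remote names known to the local copy and the location that the name `origin` points at, which here is the stand-in server directory.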
Now, let’s take a look at a Simplified Git Flow. The following diagram illustrates this visually:
Looking at this diagram after reading the above definitions it may seem obvious how a Simplified Git Flow works. However, let’s lay out the steps to make sure you’re perfectly clear on the Simplified Git Flow.
Let’s take a team working on a Project that has 2 members, John and Sarah. Git is the version control system being used, and the Project is comprised of a single Repository with a Branch named “master” that is used to maintain the source code files for the Production deployment of the Project.
- The “master” branch is used to track the source code that gets deployed to the Production environment for the project
- When either John or Sarah begins working on a feature, they create a new branch to work within.
- When working on that feature, they Commit their changes into that Branch.
- Then when they’ve finished working on the feature, they test their changes by compiling the source code and running it locally as well as in a test environment.
- Then once testing is completed, they merge their changes from their “feature” Branch into the “master” branch to contribute it back up the chain.
- Then testing is completed with the new changes merged into “master”.
- Once testing and validation have completed, the latest code in “master” is deployed out to Production and the new feature becomes live in the project.
- After all this, the feature Branch is then deleted to clean up the repository and keep any branches from cluttering things up and confusing future development work later on. Also, changes made have been recorded in “master” so the history of the Branch is no longer necessary.
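The steps above can be sketched end to end with the real `git` command line, driven from Python. This is only an illustration under assumptions: Git is installed and on the PATH, a bare repository in a temporary directory stands in for the team’s server, and the `git()` helper, the file `app.txt`, the branch name `feature`, and the directory names are hypothetical. Testing and deployment are represented by comments only.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command inside `cwd` and return its trimmed output."""
    # Throwaway identity (an assumption) so commits work without any Git config.
    identity = ["-c", "user.name=Example", "-c", "user.email=example@example.com"]
    done = subprocess.run(["git", *identity, *args], cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def demo_simplified_flow():
    with tempfile.TemporaryDirectory() as tmp:
        server = Path(tmp) / "project.git"
        john = Path(tmp) / "john"
        sarah = Path(tmp) / "sarah"
        git("init", "--bare", str(server), cwd=tmp)

        # John sets up "master" with the current Production code and shares it.
        git("clone", str(server), str(john), cwd=tmp)
        git("symbolic-ref", "HEAD", "refs/heads/master", cwd=john)
        (john / "app.txt").write_text("v1\n")
        git("add", "app.txt", cwd=john)
        git("commit", "-m", "Initial production code", cwd=john)
        git("push", "origin", "master", cwd=john)

        # John creates a feature branch and commits his changes to it.
        git("checkout", "-b", "feature", cwd=john)
        (john / "app.txt").write_text("v1\nnew feature\n")
        git("commit", "-am", "Add new feature", cwd=john)
        # ... build and test the feature branch here ...

        # The tested feature is merged into "master" and shared with the team.
        git("checkout", "master", cwd=john)
        git("merge", "feature", cwd=john)
        git("push", "origin", "master", cwd=john)
        # ... test "master" again, then deploy it to Production ...

        # The feature branch is deleted once its changes live in "master".
        git("branch", "-d", "feature", cwd=john)
        branches = git("for-each-ref", "--format=%(refname:short)",
                       "refs/heads", cwd=john).splitlines()

        # Sarah clones afterwards and receives the merged feature.
        git("clone", "--branch", "master", str(server), str(sarah), cwd=tmp)
        server_log = git("log", "--format=%s", "master", cwd=server).splitlines()
        return branches, server_log, (sarah / "app.txt").read_text()
```

`demo_simplified_flow()` returns John’s remaining local branches, the commit subjects recorded in “master” on the stand-in server, and the file content Sarah sees after cloning.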
Keep in mind this is a Simplified Git Flow. There are obviously more complex ways to manage projects that involve maintaining Tags, Releases, and multiple deployment environments. Understanding the Simplified Git Flow lays the foundation for understanding more complex scenarios in Git.
You mean, Git and GitHub aren’t the same thing? No, they are not, and let’s take a look at what GitHub offers on top of what Git provides. Let’s take a look at where Git ends, and GitHub begins!
You may not be aware, but the cool thing everyone talks about called a “Pull Request” or “PR” isn’t actually a feature of the Git version control system. Pull Requests are something that GitHub has created in their system that implements and supports Git. By doing this, they’ve added another couple steps in the middle of the Simplified Git Flow, to create what is called the GitHub Flow.
The above diagram lays out the Simplified Git Flow, with the purple elements added that represent the GitHub specific stuff that turns it into the GitHub Flow. The main steps remain the same, except there are some changes to how a team member, or developer, works in their Branch when working on a new feature for the project.
When working in a Branch, a Pull Request gets created. This Pull Request (PR) is a place within GitHub, outside of Git itself, where the team member(s) working on a feature/enhancement/bug fix can get feedback from other team members along the way. They can then use this feedback to make further changes and Commits to the Branch before ultimately testing and finally Merging their changes back up to the “master” Branch.
Beyond the Simplified Git and GitHub Flows
The flows described above, the Simplified Git Flow and the GitHub Flow, are really the simplest ways to work with Git Repositories, Branches, Commits, and Merging. Things can get more complicated when multiple long-lived Branches are added to a Repository in addition to the “master” Branch. Generally it’s not recommended to create additional long-lived Branches, but some teams may deem it necessary.
One of the big reasons teams use long-lived Branches within a Git Repository is that they need a way to manage Releases of their project. Git has a built-in feature called Tags that can be utilized for this purpose. When a Release is made, the specific Commit for that Release version can be Tagged. This Tag can then be easily referenced later on to retrieve the snapshot of code that was released with that version of the project, such as a v2.0 release.
Then if there are bug fixes or modifications necessary for that release, you can create a new Branch from the Tag. This allows you to start with the snapshot of source code of the project for that specific Release in a reliable way. Then you can make commits and manage updates and fixes to that specific Release even as the team has moved on to a future Release’s work in the “master” branch. Then the fixes can be merged into “master” when complete to contribute them to the latest source code of the Project, in addition to being published for the Tagged Release the fix was needed for.
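The Tag-then-fix approach described above could look roughly like this when driven from Python. It assumes Git is installed and on the PATH. The `git()` helper, the tag name `v2.0` (borrowed from the example release above), the branch name `fix-2.0`, and the file names are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command inside `cwd` and return its trimmed output."""
    # Throwaway identity (an assumption) so commits and tags work without any Git config.
    identity = ["-c", "user.name=Example", "-c", "user.email=example@example.com"]
    done = subprocess.run(["git", *identity, *args], cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def demo_release_tag():
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git("init", cwd=repo)
        git("symbolic-ref", "HEAD", "refs/heads/master", cwd=repo)
        (repo / "app.txt").write_text("release 2.0\n")
        git("add", "app.txt", cwd=repo)
        git("commit", "-m", "Release 2.0", cwd=repo)
        git("tag", "-a", "v2.0", "-m", "Version 2.0", cwd=repo)   # Tag the released Commit

        # The team moves on to the next Release in "master".
        (repo / "next.txt").write_text("work for the next release\n")
        git("add", "next.txt", cwd=repo)
        git("commit", "-m", "Start next release", cwd=repo)

        # A fix Branch starts from the tagged snapshot, not from the newer "master".
        git("checkout", "-b", "fix-2.0", "v2.0", cwd=repo)
        snapshot_has_next = (repo / "next.txt").exists()
        (repo / "app.txt").write_text("release 2.0\nbug fix\n")
        git("commit", "-am", "Fix bug in 2.0", cwd=repo)

        # The fix is merged back so "master" also contains it.
        git("checkout", "master", cwd=repo)
        git("merge", "-m", "Merge fix-2.0 into master", "fix-2.0", cwd=repo)
        return snapshot_has_next, (repo / "app.txt").read_text(), (repo / "next.txt").exists()
```

`demo_release_tag()` reports whether the newer work was visible in the tagged snapshot, and what “master” contains after the fix is merged back.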
There are also methods of being able to update a Branch with the latest source code from the Branch that it was spawned from (its “parent”) if necessary to bring it up to speed with latest changes. This enables longer living Branches to be updated after they’ve been created without needing to Merge any pending changes in the current Branch with “master” or some other “parent” branch.
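One such method is to merge the “parent” Branch into the longer-lived Branch. The sketch below shows that single option and is not the only way to do it. It assumes Git is installed and on the PATH. The `git()` helper, the branch name `feature`, and the file names are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command inside `cwd` and return its trimmed output."""
    # Throwaway identity (an assumption) so commits work without any Git config.
    identity = ["-c", "user.name=Example", "-c", "user.email=example@example.com"]
    done = subprocess.run(["git", *identity, *args], cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def commit_file(repo, name, text, message):
    """Write one file and record it as a single commit on the current branch."""
    (repo / name).write_text(text)
    git("add", name, cwd=repo)
    git("commit", "-m", message, cwd=repo)


def tracked_files(repo):
    """List the plain files in the working directory (the .git folder is skipped)."""
    return sorted(p.name for p in repo.iterdir() if p.is_file())


def demo_update_from_parent():
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git("init", cwd=repo)
        git("symbolic-ref", "HEAD", "refs/heads/master", cwd=repo)
        commit_file(repo, "base.txt", "base\n", "Base")

        git("checkout", "-b", "feature", cwd=repo)
        commit_file(repo, "feature.txt", "in progress\n", "Feature work")

        # Meanwhile the parent branch receives a new change.
        git("checkout", "master", cwd=repo)
        commit_file(repo, "hotfix.txt", "urgent fix\n", "Hotfix on master")

        # Bring the feature branch up to date without touching "master".
        git("checkout", "feature", cwd=repo)
        git("merge", "-m", "Update feature from master", "master", cwd=repo)
        on_feature = tracked_files(repo)

        git("checkout", "master", cwd=repo)
        on_master = tracked_files(repo)
        return on_feature, on_master
```

After `demo_update_from_parent()` runs, the “feature” Branch contains the hotfix from “master”, while “master” still has none of the unfinished feature work.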
These are just some simple examples of using Git in scenarios more complex than the Simplified Git Flow and the GitHub Flow. It’s generally recommended to keep as close to these Flows as possible to keep your Projects simpler to maintain. However, there are times when a more complex Git version control strategy is necessary, but those topics will have to wait for a future post.
If you’re interested in more content around using Git to manage versioning and source control of projects, please post comments below asking what you’d like to learn about. Also, if you have comments and suggestions to add to what’s described in this article, please post those in the comments below as well. Thanks!
The definition of a remote doesn’t make any sense. It *would* make sense if you replaced “remote” with “clone”, though…
I do know what you mean, but the official terminology is to call it a “remote”.
I agree with Thomas. When you clone the central “server” repo, then from the view from your cloned local repo, the Remote is the central repo (to which you keep the tracking link). It makes little sense to look at it from the “server” side (then you would call it Remote in the sense of your article) – you have no idea what your remote is, you don’t even know how many there are.
Nice article, big thanks! It would be nice to mention the Git Workflow approach – a way to manage a more complex, “real-world” setup when you need to maintain QA, UAT branches, etc.
Thanks for the article, just wanted to point out few typos though:
1. 4th para of the ‘Git Origin Story’ – “It’s can be used across all Linux, Windows, …” –> “It can be…”.
2. 3rd last para of the ‘Beyond the Simplified Git and GitHub Flows’ – “…Merge any pending changes in the current Batch…” –> “… current Branch…”.
3. Same heading 2nd last para – “… to keep as closed to these Flows…” –> “…as close to these Flows…”.
4. Same heading 1nd last para – “… and source control of projects; please post …” –> “of projects, please post…”.
Oops! Thanks for letting me know. I’ve fixed these in the article. Thanks!
GitFlow is an actual thing though https://github.com/nvie/gitflow
Excellent article, as usual, Chris
Thank you very much!!
Thank you for a terrific explanation of GIT. I have been a TFS and VSS user in the corporate single on-site location you mentioned. As such the significance of the Git flow was lost on me. Since we had used only long running branches for quarterly releases the concept of continual branching was confusing. Now I need to figure out how this applies to Azure Data Factories and Logic Apps.
How to do GitFlow with Azure Data Factory?
Interesting question! I haven’t worked with it yet, but here’s a link for information about using Azure Data Factory with GitHub integration: https://azure.microsoft.com/en-us/blog/azure-data-factory-visual-tools-now-supports-github-integration/