What is version control and Git?

24/02/2021 By joanna.angel9251

Version control uses and definition

Version control is the name for any system that keeps a record of changes to a set of files (or sometimes a single file) over time. It is usually stand-alone software, although it may be built into other programs such as spreadsheets and word processors. These records can be used later to revert to a previous working version if something goes wrong, e.g. files have been corrupted or a bug has been introduced into the code. I say ‘code’ as version control is mostly used by software developers and designers, although it can be used on nearly any type of computer file.

Some other uses of version control include:

  • Comparing changes to the files over time
  • Seeing what was last modified to help track down the cause of a bug
  • Seeing who made what changes
  • Seeing the dates when specific changes were made
  • Settling disputes over code ownership and ideas

You may wonder why you cannot just do this manually by regularly copying the files into dated folders, maybe weekly or after each significant change. However, this is error-prone: it is incredibly easy to copy files into the wrong folder, overwrite important files, copy the wrong files or simply forget to do it in the first place.

Types of version control

There are three main types of version control systems (VCSs): local, centralised and distributed. Each has its own advantages and disadvantages.

Local

Local version control systems are where the developer of the files keeps a record of the file changes on their local system (normally the same computer used for the development). The records are kept in a database as patch sets, which are the differences between successive versions of a file. An example of this is RCS (Revision Control System), which is still distributed with many computers today. Using the database, RCS can recreate any given file as it was at any given date by adding up the recorded patches. A disadvantage of a local system, however, is that only one person has access to it, and if the hardware storing it fails and no backups have been kept, everything is lost, including all the file history.
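The idea of recreating a file by adding up patches can be sketched in a few lines of Python. This is only an illustration of the patch-set concept, not how RCS itself stores its data; the function names `make_patch`, `apply_patch` and `checkout` and the sample file contents are made up for the example.

```python
from difflib import SequenceMatcher

def make_patch(old, new):
    """Record only the regions where `new` differs from `old`."""
    ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    # Each entry says: replace old[i1:i2] with these lines.
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]

def apply_patch(old, patch):
    """Rebuild the newer version from the older one plus a patch."""
    result, cursor = [], 0
    for i1, i2, replacement in patch:
        result.extend(old[cursor:i1])  # unchanged lines before the edit
        result.extend(replacement)     # lines that replace old[i1:i2]
        cursor = i2
    result.extend(old[cursor:])        # unchanged lines after the last edit
    return result

def checkout(base, patches, revision):
    """Recreate revision N by applying the first N patches to the base file."""
    lines = base
    for patch in patches[:revision]:
        lines = apply_patch(lines, patch)
    return lines

# Three hypothetical versions of the same file, as lists of lines.
versions = [
    ["hello"],
    ["hello", "world"],
    ["hello", "there", "world"],
]
# The "database" keeps the first version plus one patch per later version.
patches = [make_patch(a, b) for a, b in zip(versions, versions[1:])]
latest = checkout(versions[0], patches, 2)
```

Only the first version is stored in full; every later version is recovered by replaying patches in order.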

[Figure: Localised version control]

Centralised

Centralised VCSs, also known as CVCSs, are version control systems that store the files and file history on a separate server. The advantages of this are that multiple developers have access to the files and, if necessary, admins can set up rules defining who is allowed to access and modify certain files. Also, as the system records who made what changes, everyone is aware of what other team members are working on. The disadvantages, however, include the fact that everyone relies on the same server to access the files. If the server goes down, nobody can do any work except on the few files they may have checked out (downloaded) to their local computer. Centralised version control also still carries the risk of losing everything if the files on the server are not backed up. Examples of CVCSs include Subversion, Perforce and CVS.

[Figure: Centralised version control]

Because centralised systems allow multiple developers to access the same files at the same time, they introduce the problem of file conflicts. A conflict occurs when the VCS is unable to merge two sets of changes automatically, typically because both touch the same lines. A person must then go back to the files, compare them manually and choose which of the modified lines to keep. This process is called resolving. If the code is complex and there are a lot of conflicts in the file(s), this can be a very difficult and time-consuming task. The resulting file may need to be tested again to ensure the correct changes have been kept.
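To make the idea of a conflict concrete, here is a deliberately simplified Python sketch. It assumes that both developers only edited lines in place, so all three versions have the same number of lines; real merge tools also handle inserted and deleted lines. The function name `merge_lines` and the sample contents are invented for the illustration.

```python
def merge_lines(base, ours, theirs):
    """Merge two edited copies of `base`, line by line.

    Returns the merged lines and the 1-based numbers of conflicting lines.
    A conflicting line is left as None for a person to resolve.
    """
    merged, conflicts = [], []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), start=1):
        if o == t:            # both sides agree (changed or not)
            merged.append(o)
        elif o == b:          # only "theirs" changed this line
            merged.append(t)
        elif t == b:          # only "ours" changed this line
            merged.append(o)
        else:                 # both changed it differently: conflict
            merged.append(None)
            conflicts.append(number)
    return merged, conflicts

base = ["title", "total = 1", "end"]
ours = ["title", "total = 2", "end"]
theirs = ["Title", "total = 3", "end"]
merged, conflicts = merge_lines(base, ours, theirs)
```

In this example the first line merges cleanly because only one side changed it, while the second line is reported as a conflict because both sides changed it in different ways.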

A way around this issue is what is known as ‘file locking’. This is where only one person can make changes to a particular file at a time. Others may access and read the file but not modify it. However, this can also cause problems. On large projects, locked files may get forgotten about. It is possible to get around this by doing a forced overwrite of the file, but then file conflicts could occur again.
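A toy model of file locking might look like the following. It is a sketch under simple assumptions (a single in-memory table, users identified by name) and does not mirror the locking commands of any particular VCS; the class `LockTable` and its methods are hypothetical.

```python
class LockTable:
    """Toy lock registry: at most one user may hold the lock on a file."""

    def __init__(self):
        self.holders = {}  # path -> user currently holding the lock

    def lock(self, path, user):
        """Try to take the lock; return True if `user` now holds it."""
        return self.holders.setdefault(path, user) == user

    def can_modify(self, path, user):
        """Anyone may read, but only the lock holder may modify."""
        return self.holders.get(path) == user

    def unlock(self, path, user, force=False):
        """Release the lock; `force` lets someone break a forgotten lock."""
        if force or self.holders.get(path) == user:
            self.holders.pop(path, None)
            return True
        return False

locks = LockTable()
alice_got_it = locks.lock("report.doc", "alice")        # alice takes the lock
bob_got_it = locks.lock("report.doc", "bob")            # bob is refused
bob_released = locks.unlock("report.doc", "bob")        # bob cannot release it
forced = locks.unlock("report.doc", "bob", force=True)  # forced override
bob_retry = locks.lock("report.doc", "bob")             # now bob can lock it
```

The forced override is what reintroduces the risk of conflicts: if alice was still editing, her changes and bob's can now collide.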

Distributed

You guessed it, distributed systems are also called DVCSs. The main difference between centralised and distributed systems is that when clients (developers) check out files from the database, they get not only the latest version but also the entire history of the files. This is much safer: should the files be lost from the central server, there are still copies of them and their history on the local machines of anyone using them, and these can be copied back to the server. The restored copy may not be the latest version, however. For example, if one developer had made a recent commit to the server and another developer’s local copy, without these new changes, was then used to restore the server, the first developer’s changes would be missing. They would still be retrievable from the first developer’s local system at a later time. Because of this, DVCSs have been standard practice among developers for many years.

[Figure: Distributed version control]

Because all the files, as well as the entire project history, are downloaded locally, there is very little you cannot do if the server goes offline. Operations are also very fast as there are no network slowdowns. Another advantage of a distributed version control system is that it supports workflows that are not possible on a centralised system, e.g. allowing developers to work on the same files at the same time with ease. Examples of DVCSs include Mercurial, Bazaar, Darcs and Git.

Git


Git has been around since 2005 and is a free, open-source version control system. Originally developed by Linus Torvalds, it is a very popular choice because of its efficiency and easy-to-use branching. Unlike other systems such as Perforce and Bazaar, which store their data as a list of changes to each file (called delta-based version control), Git takes a “snapshot” of the files each time there is a commit, capturing what they look like at that moment. Each file is identified by a checksum generated with the SHA-1 hash – a 40-character hexadecimal string calculated from the contents of the file. To improve efficiency, a file that has not changed is not stored again; the new snapshot simply links to the copy already stored. Because the checksum depends on the contents, any change to a file produces a different checksum, so in practice changed files are never missed.
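The checksum idea can be tried out in Python. The sketch below computes the ID Git gives to a file's contents (Git hashes a short header, the word "blob", the length and a null byte, followed by the contents) and then uses it in a toy snapshot store. The `snapshot` function, the `store` dictionary and the file names are assumptions made for the illustration; Git's real object database is more involved.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID Git assigns to a file with these contents."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def snapshot(files: dict, store: dict) -> dict:
    """Record a snapshot: map each path to its ID, storing only new contents."""
    tree = {}
    for path, content in files.items():
        blob_id = git_blob_id(content)
        store.setdefault(blob_id, content)  # unchanged contents are not stored twice
        tree[path] = blob_id
    return tree

store = {}
first = snapshot({"a.txt": b"one\n", "b.txt": b"two\n"}, store)
second = snapshot({"a.txt": b"one\n", "b.txt": b"two, edited\n"}, store)

empty_id = git_blob_id(b"")  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

After the two snapshots, `store` holds three entries rather than four: `a.txt` did not change, so both snapshots point at the same stored copy.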

[Figure: Git repository]

The diagram above illustrates this process – note that the underlines show the files that have changed at each commit. Git is mostly operated from the command line, but GUIs do exist. Almost all actions add data to the database, and removing data takes deliberate effort. This safety means you can experiment with your code with little risk of losing work that has been committed.

How Git works

Three states

In Git there are three states that a file can be in:

  • Modified – You have changed a file on your local system but have not yet staged or committed the change
  • Staged – The file you changed in the current version has been marked to go into the next commit (which when done will create a new version)
  • Committed – The data is stored in the Git repository

Workflow

There are three main sections of a Git project: the working tree, the staging area and the Git directory. Using Git goes a bit like this: First, you check out the files from the Git repository. This creates a working tree on your local computer. Next, you make your changes, e.g. a new feature or a bug fix. Then you select the changed files that you want in the next commit (you may have made some accidental changes to other files while you were working or testing) and put them into the staging area. Finally, you do a commit. This stores a snapshot of the staged files permanently in the Git directory.
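The three sections and the three file states can be modelled with a small Python class. This is a toy under stated simplifications: file contents are plain strings, the staging area is emptied after each commit, and a brand-new file is simply reported as "modified". The class `ToyRepo` and the file names are hypothetical and are not part of Git.

```python
class ToyRepo:
    def __init__(self):
        self.working_tree = {}  # path -> contents you are editing
        self.staging_area = {}  # path -> contents marked for the next commit
        self.commits = []       # each commit is a snapshot: path -> contents

    def head(self):
        """The latest snapshot, or an empty one before the first commit."""
        return self.commits[-1] if self.commits else {}

    def add(self, path):
        """Stage the current contents of one file."""
        self.staging_area[path] = self.working_tree[path]

    def commit(self):
        """Store a new snapshot: the previous one plus everything staged."""
        self.commits.append({**self.head(), **self.staging_area})
        self.staging_area = {}
        return len(self.commits) - 1

    def state(self, path):
        """Report which of the three states a working-tree file is in."""
        current = self.working_tree.get(path)
        if path in self.staging_area and self.staging_area[path] == current:
            return "staged"
        if self.head().get(path) == current:
            return "committed"
        return "modified"

repo = ToyRepo()
repo.working_tree["app.py"] = "print('v1')"
repo.working_tree["notes.txt"] = "scratch"  # an accidental change, never staged
states = [repo.state("app.py")]
repo.add("app.py")
states.append(repo.state("app.py"))
repo.commit()
states.append(repo.state("app.py"))
```

Here `states` walks through modified, staged and committed for `app.py`, while `notes.txt` stays out of the commit because it was never staged.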

[Figure: Git workflow]

Branching

Part of the reason why Git is so useful is its easy branching feature. A branch is a lightweight pointer to a commit, so making a new branch does not copy the files; it simply starts a new line of development from the latest version. It’s possible to give names to branches, e.g. BUGFIX806, HOTFIX023, so you can keep track of the purpose of each branch should you need to switch between multiple tasks.
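In Git a branch is stored as a small pointer to a commit rather than as a copy of the files, which is why creating one is cheap. The Python sketch below models that idea with two dictionaries. The commit IDs (`c1`, `c2`, ...) and the helper names are invented for the example; real Git identifies commits by their hashes.

```python
commits = {}   # commit id -> (parent id, message)
branches = {}  # branch name -> id of the commit the branch points at

def commit(branch, message):
    """Add a commit on top of `branch` and move the branch pointer to it."""
    parent = branches.get(branch)
    commit_id = f"c{len(commits) + 1}"
    commits[commit_id] = (parent, message)
    branches[branch] = commit_id
    return commit_id

def create_branch(name, start):
    """A new branch is just another name for the commit `start` points at."""
    branches[name] = branches[start]

def history(branch):
    """Follow parent links back from the branch tip, newest first."""
    messages, commit_id = [], branches[branch]
    while commit_id is not None:
        parent, message = commits[commit_id]
        messages.append(message)
        commit_id = parent
    return messages

commit("master", "initial version")
create_branch("BUGFIX806", "master")
commit("BUGFIX806", "fix bug 806")
commit("master", "add feature")
```

Both branches share the commit "initial version"; only the pointers differ, so switching tasks means following a different pointer rather than keeping a second copy of the project.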

[Figure: Branching]

Common commands

git init <directory>           Create a Git repository in the specified directory
git remote add <name> <url>    Create a connection to a remote repository. In future, the specified <name> can be used in place of the <url> for convenience
git add <directory/file>       Stage all changes in the stated file or directory ready for the next commit
git status                     List which files are modified, staged or untracked. Untracked files are files in the project directory that Git is not yet tracking; use “git add” to start tracking them
git diff                       Show the unstaged changes, i.e. the differences between your working directory and the staging area
git branch                     List all branches in your repository
git branch <branchName>        Create a new branch
git checkout <branchName>      Check out an existing branch
git merge <branchName>         Merge the specified branch into the current branch
git pull <remote>              Get the remote’s copy of the branch you are in and merge it with your local one
git push                       Push your current branch to the remote repository. For a branch that does not exist on the remote yet, name the remote and branch explicitly (e.g. git push -u <remote> <branchName>) and it will be created there
Table of common Git commands

Further security

For further security, GitHub is available. It allows you to sign up and store your Git repository online. GitHub offers paid plans, with the price depending on the amount of storage you need, the number of actions you need to perform, support and other features such as wikis, but free versions are also available. The free versions limit your storage and actions and may therefore not be suitable for large projects.

Sources