GitLab Geo vs. HA – Architectural Decisions in Enterprise Operations
Introduction
In many organizations, GitLab starts out as a pure version-control tool: a place for code, perhaps a few initial CI/CD pipelines, some automation. As the infrastructure matures, however, this role changes fundamentally.
GitLab becomes the central platform for development, deployment, and operational control. Decisions that were previously made manually are moved into pipelines. Configurations are versioned. Infrastructure is no longer just operated, but described.
In our setup, this development goes a step further.
GitLab is our central control layer for Infrastructure as Code, DNS as Code, Configuration as Code, CI/CD pipelines, and build and deployment processes.
This creates a new reality:
GitLab is not just a tool, but a critical component of the entire platform.
A GitLab outage does not just mean limited development. In the worst case, it also affects the ability to change infrastructure, provision systems, or fix errors.
For exactly this reason, a central question arises at some point:
How do you operate GitLab in a productive, distributed environment in such a way that you not only avoid outages, but can also react in a controlled way in case of failure?
When GitLab becomes critical infrastructure
In small environments, an outage of GitLab is usually manageable. Developers can continue working locally, and deployments can be performed manually in an emergency.
In mature environments, this looks different.
Deployments run automatically via pipelines. Infrastructure is created from repositories. DNS is controlled dynamically. Security checks are part of the build process. Changes are no longer made directly on systems, but exclusively via defined processes.
GitLab thus becomes a kind of control center.
And this is exactly where the real challenge arises: the more you rely on automation, the more critical the platform that controls this automation becomes.
The question is therefore no longer just:
“How do we keep GitLab available?”
But rather:
“How do we remain operational when GitLab is not available?”
The first impulse: High Availability
The obvious approach in many cases is a classic high availability setup. Multiple GitLab instances are operated behind a load balancer. The database is replicated, Redis is distributed, storage is shared. The goal is to transparently absorb the failure of individual components.
At first glance, this is a convincing solution: systems remain reachable, and ideally users notice no difference. This perspective, however, often falls short.
The reality of HA in complex environments
An HA setup is not a single system, but a complex interaction of many components.
Database, storage, cache, and network must work together consistently. Errors in one of these layers immediately affect the entire system.
Within a single region, this can be operated stably with sufficient effort. However, once multiple locations come into play, the situation changes significantly.
Latencies increase. Synchronous replication becomes more difficult. Networks become more susceptible to disruptions. The likelihood of inconsistent states increases.
A typical example is the so-called split-brain scenario. Due to network problems, multiple nodes may assume at the same time that they are active. Data diverges, and recovery becomes complex and error-prone.
The topic of storage is also often underestimated. Shared storage is a central component in many HA architectures and at the same time one of the biggest weaknesses.
Performance problems, locking behavior, and dependencies quickly lead to this component becoming a bottleneck or even a single point of failure.
The result is a system that is theoretically highly available, but in practice comes with high complexity and corresponding operational effort.
The alternative: GitLab Geo
GitLab Geo takes a different approach.
Instead of trying to build a globally consistent system, the architecture is deliberately simplified: a primary node handles all write operations, one or more secondary nodes replicate the data, and replication is asynchronous.
This asynchrony is not a disadvantage, but a deliberate decision. It enables a clear separation of responsibilities and reduces dependency on globally synchronized systems.
While HA tries to make failures invisible, Geo accepts that differences between locations can exist and uses this consciously to reduce complexity.
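In the Omnibus packages, this role split is expressed directly in the node configuration. A minimal sketch of the idea (simplified and illustrative; the exact settings and hostnames depend on your GitLab version and are documented in GitLab's Geo setup guide):

```ruby
# /etc/gitlab/gitlab.rb -- sketch, not a complete configuration

# On the primary node: the single place where writes happen.
roles ['geo_primary_role']
external_url 'https://gitlab.example.org'   # illustrative hostname

# On a secondary node, the role changes and everything else replicates:
#   roles ['geo_secondary_role']
#   external_url 'https://gitlab-secondary.example.org'
```

The asymmetry is visible in the configuration itself: there is exactly one writable site, and each secondary is explicitly declared as such.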
Our architecture: two locations, clear roles
In our environment, we operate GitLab across two locations: Germany as the primary and Austria as the secondary.
All write operations, authentications, and pipeline executions take place via the primary. The secondary continuously replicates and serves as an additional access point as well as the basis for disaster recovery.
This clear distribution of roles ensures predictable and controllable behavior.
There is no competition between multiple active write nodes. There is no complex global synchronization. Instead, there is a clearly defined place for changes and a replicated copy for failure scenarios.
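GitLab exposes the replication state of each site via its Geo status API, so the question "is the secondary usable for failover right now?" can be answered programmatically. The sketch below illustrates that decision in Python; the payload shape, including the `healthy` and `db_replication_lag_seconds` fields, is an assumption modeled on that API and should be verified against your GitLab version.

```python
# Sketch: decide whether a Geo secondary is fresh enough to be useful
# for disaster recovery. Field names are assumptions modeled on
# GitLab's Geo status API -- verify against your version.

def secondary_usable(status: dict, max_lag_seconds: int = 60) -> bool:
    """Return True if the secondary is healthy and within the lag budget."""
    if not status.get("healthy", False):
        return False
    lag = status.get("db_replication_lag_seconds")
    # A missing lag value means freshness cannot be proven -- treat as unusable.
    if lag is None:
        return False
    return lag <= max_lag_seconds

if __name__ == "__main__":
    fresh = {"healthy": True, "db_replication_lag_seconds": 4}
    stale = {"healthy": True, "db_replication_lag_seconds": 900}
    down = {"healthy": False, "db_replication_lag_seconds": 2}
    print(secondary_usable(fresh))  # True
    print(secondary_usable(stale))  # False
    print(secondary_usable(down))   # False
```

Because replication is asynchronous, "usable" is always a judgment about acceptable lag, not a binary property of the system.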
The decisive shift in perspective: recovery instead of perfect uptime
The most important decision in our architecture was not technical, but conceptual.
We deliberately decided against optimizing for perfect availability.
Instead, we optimize for fast recoverability.
Perfect uptime is a theoretical goal that, in practice, is paid for with ever-increasing complexity. The more you try to rule out failures entirely, the more complex, and therefore more error-prone, the system becomes.
Recovery, on the other hand, is based on a different assumption:
Failures happen. What matters is how quickly and in a controlled way you can respond to them.
A realistic scenario: site failure
Let us take a concrete example.
The primary site fails completely. Network problems, infrastructure errors, or a larger incident scenario make access impossible.
In a classic HA setup, the question of consistency, failover, and stability arises immediately.
In our setup, the focus shifts.
The secondary already contains a replicated version of all relevant data. Repositories are present, artifacts are available, and the basis for all further processes is given.
The next step is not to stabilize a complex system, but to restore the ability to work.
Pipelines can be started. Infrastructure can be rebuilt from code. Systems can be provisioned reproducibly.
GitLab is not just part of the solution, it is the foundation for the entire recovery process.
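In practice, the failover itself is a short, deliberate runbook rather than an automatic event. A sketch of the manual steps on the secondary (the command names follow GitLab's disaster-recovery documentation at the time of writing; verify them against your version before relying on them):

```shell
# On the secondary site, after confirming the primary is really gone:

# 1. Check that replication was healthy up to the failure.
sudo gitlab-rake gitlab:geo:check

# 2. Promote the secondary to become the new primary
#    (GitLab 14.5+; older releases use a different promotion command).
sudo gitlab-ctl geo promote

# 3. Point DNS / clients at the promoted site, then verify the instance.
sudo gitlab-rake gitlab:check
```

The point is that each step is explicit and human-triggered: there is no automatic failover that can misfire, and the operator decides when the roles change.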
GitLab as a single source of truth
This approach only works because GitLab plays a central role in our architecture. All relevant information about the state of the infrastructure is located in the system: configuration definitions, deployment logic, DNS zones, and automation processes.
This means that we are not dependent on individual system states, but are always able to recreate the desired state.
Infrastructure is therefore not just operated, but described. And it is precisely this description that makes it possible to restore it in case of failure.
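Because the desired state lives in repositories, the recovery path is the same pipeline that performs everyday changes. A minimal `.gitlab-ci.yml` sketch of such an Infrastructure-as-Code job (the image tag, stage layout, and use of Terraform are illustrative assumptions, not our actual pipeline):

```yaml
# .gitlab-ci.yml (sketch): the same pipeline drives routine changes
# and full rebuilds after a failure.
stages:
  - plan
  - apply

plan:
  stage: plan
  image: hashicorp/terraform:1.7   # illustrative image tag
  script:
    - terraform init -input=false
    - terraform plan -out=tfplan
  artifacts:
    paths: [tfplan]

apply:
  stage: apply
  image: hashicorp/terraform:1.7
  script:
    - terraform init -input=false
    - terraform apply -input=false tfplan
  when: manual          # a human confirms the change -- or the recovery
  dependencies: [plan]
```

Rebuilding after a site failure is then not a special procedure but an ordinary pipeline run against the replicated repositories.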
Limits and practical aspects
As convincing as the Geo approach is, it also comes with limitations.
Authentication is usually handled via the primary. The secondary is not designed to fully take over all functions.
Not all GitLab features are fully Geo-aware. Certain downloads or artifact accesses may still run via the primary.
These characteristics are not flaws, but part of the design. They can be planned for, but must be consciously considered in the architecture.
GitLab Runner and security as an integral component
Another important component of our setup is the GitLab Runner fleet.
The runners are operated in isolation, for example within Kubernetes environments, and are clearly separated from the GitLab instance itself. This allows workloads to be cleanly separated and scaled flexibly.
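With the official `gitlab-runner` Helm chart, this isolation boils down to a small values file. A hedged sketch (the key names follow the chart's documented values; the URL and namespace are placeholders):

```yaml
# values.yaml for the gitlab-runner Helm chart (sketch)
gitlabUrl: https://gitlab.example.org/   # placeholder URL

# Each CI job runs in its own short-lived pod in a dedicated namespace,
# cleanly separated from the GitLab instance itself.
runners:
  config: |
    [[runners]]
      executor = "kubernetes"
      [runners.kubernetes]
        namespace = "ci-jobs"
        privileged = false
```

Because jobs run in ephemeral pods, a misbehaving build cannot affect the control plane, and runner capacity scales independently of GitLab.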
In combination with GitLab Ultimate, an additional advantage arises: security mechanisms are directly integrated into the development processes.
Analyses such as SAST, DAST, or dependency scanning do not take place afterwards, but directly within the pipelines. Vulnerabilities are detected early, and security requirements are enforced automatically. Security thus does not become a separate process, but an integral part of the system.
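In GitLab Ultimate, wiring these scans into a pipeline is mostly a matter of including the maintained templates. The template paths below are the ones GitLab ships; check your version's documentation for the current names and required variables:

```yaml
# .gitlab-ci.yml (excerpt): security scanning as part of every pipeline
include:
  - template: Security/SAST.gitlab-ci.yml
  - template: Security/Dependency-Scanning.gitlab-ci.yml
  # DAST additionally needs a deployed target to scan:
  - template: Security/DAST.gitlab-ci.yml

variables:
  DAST_WEBSITE: https://staging.example.org   # illustrative target
```

The scans then run as ordinary jobs, so security findings appear in the same merge-request workflow as every other pipeline result.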
When HA makes sense and when it does not
GitLab Geo is not the right choice in every scenario. In environments with only one location, low latency, and clear requirements for immediate consistency, a classic HA setup can make sense.
However, as soon as multiple regions come into play, disaster recovery becomes more important and systems need to be built reproducibly, the evaluation shifts. This is where Geo shows its strengths.
Outlook: what happens when GitLab itself is not available?
Beyond the focus on architecture and operations, a further question arises:
What happens if not only our own infrastructure fails, but the platform itself that we rely on?
In a globally connected IT landscape, dependencies on individual providers are not only technical but also strategic risks. Political developments, regulatory requirements, or changes in business models can influence whether and how certain services can be used.
Especially with platforms like GitLab, which are deeply integrated into development and operational processes, this creates a new dimension of dependency.
For this reason, we generally consider critical systems from an additional perspective:
How replaceable is the system in case of emergency?
For GitLab, this means that we deliberately design our processes so that they are not inseparably tied to a single platform. Infrastructure as Code, standardized interfaces, and reproducible deployments allow us to reproduce central parts of our environment with alternative tools.
This includes, for example, alternative Git platforms, independent CI/CD systems, exportable configurations, and standardized deployment processes.
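The Git part of this replaceability is straightforward, because mirroring is built into Git itself. A minimal sketch using plain `git` commands, with local bare repositories standing in for the primary platform and an independent fallback remote (all paths are illustrative):

```shell
# Sketch: keep an independent mirror of a repository so the history
# survives the loss of the primary Git platform. Local bare repos
# stand in for the real remotes here.
set -e
WORK=$(mktemp -d)

# "Primary" platform repo and an independent fallback remote.
git init --bare "$WORK/primary.git"
git init --bare "$WORK/fallback.git"

# Normal development happens against the primary.
git clone "$WORK/primary.git" "$WORK/checkout"
cd "$WORK/checkout"
git -c user.name=ops -c user.email=ops@example.org \
    commit --allow-empty -m "initial"
git branch -M main
git push origin main

# Mirror all refs (branches and tags) to the fallback remote.
git push --mirror "$WORK/fallback.git"

# The fallback now holds the same commit as the primary.
git --git-dir="$WORK/fallback.git" rev-parse main
```

Run periodically (or as a pipeline job), `git push --mirror` keeps a byte-identical copy of every branch and tag outside the primary platform.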
The goal is not to replace GitLab in the short term. Rather, it is about remaining capable of acting in the long term.
Critical systems are therefore not only operated in a highly available manner, but are also designed in such a way that they can be replaced if necessary.
Conclusion
The decision between GitLab HA and GitLab Geo is not purely a technological one.
It is a question of architecture, priorities, and long-term strategy.
In our case, the focus was not on maximum availability at any cost, but on a system that remains controllable even in case of failure.
GitLab Geo is a central building block in this.
Through the combination of clear role distribution, continuous replication, and Infrastructure as Code, an environment is created that can not only be operated stably, but can also be restored quickly if necessary.
And that is often the decisive advantage in complex, distributed infrastructures.