More on MMO Architecture:
- MMO Architecture: Source of truth, Dataflows, I/O bottlenecks and how to solve them
- MMO Architecture: client connections, sockets, threads and connection-oriented servers
I'm writing this article within the last 2 hours of 2024, so I guess I'm forced to say Merry Christmas.
Thank you for your time, I hope you enjoy it and learn something new.
In modern MMO (Massively Multiplayer Online) games, server performance is as crucial as any other part of the game’s architecture. You want your servers to handle thousands of player connections seamlessly, process real-time game logic, and keep latency as low as possible.
A common technique in building such systems (and any other high performance socket-based software tbh, you already know my obsession with MMOs…) is to separate the networking thread from the main game logic thread(s) so the application doesn’t stall on big net transfers. This separation goes a long way to improving performance, maintainability, and scalability, especially in high-concurrency environments.
In this article, we will explore the reasons behind this design decision, discuss race conditions and how they can occur in multi-threaded code, and see why lockless queues are often a great choice for handling messages between threads.
Finally, we will present a working C++ implementation of a lockless SPMCQueue (Single Producer, Multiple Consumer Queue) and discuss the key concepts behind it. This queue allows a single producing thread to push data into the queue while multiple consumer threads can pop data concurrently.
Why should I separate the network thread from the game logic thread?
In many game server designs, you will encounter a classic question: “Should my main loop also handle all the networking events?” While that is possible in small or hobby projects, and most high-level languages offer a native way to do concurrency, large-scale MMO servers often opt for lower-level, peak-performance solutions and run the network code on a dedicated thread (or set of threads).
Multithreading generally offers far greater performance scalability than single-threaded concurrency
What about single-threaded concurrency?
Multithreading generally offers far greater performance scalability than single-threaded concurrency because it allows different tasks to run truly in parallel, taking advantage of modern multicore CPUs.
In a single-threaded model like the one NodeJS implements, concurrency is largely “simulated” via time-slicing—rapidly switching one thread among multiple tasks. This switching causes overhead, can leave CPU cores idle, and often leads to poor responsiveness under heavy load. That is good enough for general-purpose web applications or RESTful services, but it falls short for live real-time online games.
By contrast, with multiple threads running on separate CPU cores, workloads that can be divided into discrete, parallel subtasks (such as handling independent network connections or processing distinct game entities) can execute simultaneously, thereby decreasing latency and boosting throughput.
By placing the socket I/O operations on a different thread, you ensure that the main game logic thread never blocks on long or dropped calls. This results in better responsiveness since the game logic loop does not have to wait for network data to arrive or for data to be sent out.
Using a dedicated thread (or thread pool) for networking allows you to scale more easily
MMO servers typically handle large numbers of connections. Using a dedicated thread (or thread pool) for networking allows you to scale horizontally more easily. You can add additional network threads as your server grows (or until you find your throughput limit), without significantly changing your game logic code.
And by isolating the networking code in its own thread or layer, the rest of your code can focus on game logic as long as you maintain a common interface between layers. This separation of concerns makes your system more modular and easier to test. Network updates can be handled independently, and you can mock or simulate network traffic while testing your game logic.
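To make that boundary concrete, here is a minimal sketch of such an interface. The names IMessageSource, GameMessage and FakeMessageSource are hypothetical, not from any particular engine: the point is only that the game logic polls an abstraction instead of touching sockets, so tests can feed it scripted traffic.

#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical message handed from the networking layer to the game logic.
struct GameMessage {
    uint32_t connectionId;
    std::vector<uint8_t> payload;
};

// The game logic only depends on this interface...
class IMessageSource {
public:
    virtual ~IMessageSource() = default;
    // Returns the next pending message, or std::nullopt if none is ready.
    virtual std::optional<GameMessage> PollMessage() = 0;
};

// ...so tests can inject scripted traffic instead of opening real sockets.
class FakeMessageSource : public IMessageSource {
public:
    void Inject(GameMessage msg) { pending.push_back(std::move(msg)); }
    std::optional<GameMessage> PollMessage() override {
        if (pending.empty()) return std::nullopt;
        GameMessage msg = std::move(pending.front());
        pending.erase(pending.begin());
        return msg;
    }
private:
    std::vector<GameMessage> pending;
};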
Where does that performance boost come from?
Threads are independent units of execution that an operating system schedules to run on one or more CPU cores. Essentially, when you start a program, at least one main thread is created to run its instructions. As you create additional threads, children of the main one, they can share memory and other resources with the main thread, but each thread has its own execution flow and stack.
Splitting tasks between threads provides better parallel utilization
The operating system switches between these threads so that they appear to run in parallel (NodeJS mimics this OS-level concurrency in the application layer). On multicore processors, multiple threads can actually run at the same time, each on a different core.
CPU utilization can be distributed more evenly across cores. If you try to handle all network and game logic in one thread, you risk saturating that thread, while other CPU cores remain idle. Splitting tasks between threads provides better parallel utilization.
The High-Level Architecture
Having said that (long-ass explanation), a common approach used in such systems looks like this:
- Network Thread: Listens for incoming TCP or UDP packets, reads them, processes the initial I/O logic, and creates and enqueues meaningful messages from the net packets so they can be handled by the game logic.
- Game Logic Thread(s): Responsible for updating the state of the game world, running AI, performing collision detection, and so forth. This thread (or multiple threads if you are implementing parallel game logic) consumes the messages placed by the network thread and publishes any net update needed.
- Network Thread: Packs and sends any pending messages.
The key to making this architecture work smoothly is a non-blocking, fast and efficient mechanism to pass data between threads—enter the lockless queue.
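To make the shape of those loops concrete, here is a minimal sketch. Everything in it is illustrative: NetMessage is a made-up message type, the loop bodies are stubs, and the mutex-protected deque is only a stand-in hand-off mechanism that the rest of the article replaces with a lockless queue.

#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical message type produced by the network thread.
struct NetMessage {
    uint32_t connectionId;
    std::vector<uint8_t> payload;
};

// Stand-in hand-off queue for this sketch: a mutex-protected deque.
std::mutex queueMutex;
std::deque<NetMessage> messageQueue;

// Network thread: read sockets, decode packets, enqueue messages.
void NetworkThreadLoop() {
    while (true) {
        NetMessage msg{};  // in a real server: read + decode from a socket
        std::lock_guard<std::mutex> lock(queueMutex);
        messageQueue.push_back(std::move(msg));
    }
}

// Game logic thread: drain the queue, then advance the simulation.
void GameLogicThreadLoop() {
    while (true) {
        std::optional<NetMessage> msg;
        {
            std::lock_guard<std::mutex> lock(queueMutex);
            if (!messageQueue.empty()) {
                msg = std::move(messageQueue.front());
                messageQueue.pop_front();
            }
        }
        // if (msg) { apply it to the game world }
        // then run AI, physics, timers, ...
    }
}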
Race conditions in multi-threaded code
Before discussing lockless queues, let’s talk about race conditions. A race condition occurs when two or more threads access shared data at the same time, and the final outcome depends on the order of operations, which is not guaranteed. In other words, the result of a sequence of operations could differ from one execution to the next, making your program’s behavior unpredictable.
If multiple threads read and write a variable (e.g., a global counter or a shared buffer) simultaneously without synchronization, you will definitely end up with unexpected values and, since the threads are independent and scheduled by the OS, the behavior is absolutely unpredictable.
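A minimal way to see this in practice: two threads incrementing a plain, non-atomic counter routinely lose updates, because an increment is really a read, an add and a write back, and those steps interleave between threads.

#include <iostream>
#include <thread>

int counter = 0;  // shared and unsynchronized on purpose

void IncrementManyTimes() {
    for (int i = 0; i < 1'000'000; ++i) {
        ++counter;  // read, add 1, write back: three steps that can interleave
    }
}

int main() {
    std::thread a(IncrementManyTimes);
    std::thread b(IncrementManyTimes);
    a.join();
    b.join();
    // Expected 2000000, but the printed value is usually lower and changes
    // from run to run: a textbook race condition (formally, a data race).
    std::cout << counter << "\n";
}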
Lock-based solutions like std::mutex in C++ are usually used for synchronization and can effectively prevent race conditions, but those mechanisms come with their own drawbacks.
Locking and unlocking adds overhead and slows the procedure down, and if used incorrectly—double-locking, forgetting to lock in certain code paths, or unlocking prematurely—subtle, hard-to-find bugs can slip in.
So, if even the tools designed for thread synchronization might fall short for MMOs, do we even have a solution? Not really, but there is a good-enough approach used in many implementations: use well-defined data structures (like lockless queues), enforce strict rules for accessing shared state (e.g., single producer, multiple consumer patterns) and pray there are no corner cases, to limit the chance of race conditions as much as possible.
Locking vs. lockless strategies
A straightforward way to manage concurrency is with locks —which we already visited above—. For instance, if you have a shared queue, you can protect it with a mutex: before accessing the queue, you lock the mutex; when you’re done, you unlock it. This approach is simple and works fine for moderate levels of concurrency. However, in very high-performance applications—like an MMO server with thousands of messages per second—locks can become a bottleneck.
Whenever one thread holds a lock, other threads that need the same resource must wait, potentially causing thread context switches, which are historically expensive (imho almost negligible today for modern processors, compiler optimizations and the wheelbarrow of shit-code and libraries any standard project already has).
What is definitely true is that, if many threads frequently access the locked resource, the contention can significantly degrade performance, and the misuse of locks can lead to deadlocks, where two or more threads permanently wait on each other to release locks.
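For completeness, the classic deadlock shape mentioned above looks like this: two threads take the same two locks in opposite orders and end up waiting on each other forever. This is only a sketch; real deadlocks are usually buried much deeper in the call stack.

#include <mutex>
#include <thread>

std::mutex lockA;
std::mutex lockB;

void Thread1() {
    std::lock_guard<std::mutex> a(lockA);  // takes A first...
    std::lock_guard<std::mutex> b(lockB);  // ...then waits for B
}

void Thread2() {
    std::lock_guard<std::mutex> b(lockB);  // takes B first...
    std::lock_guard<std::mutex> a(lockA);  // ...then waits for A
}

int main() {
    std::thread t1(Thread1);
    std::thread t2(Thread2);
    t1.join();  // with unlucky timing, neither thread ever finishes
    t2.join();
}

(C++17's std::scoped_lock can acquire both mutexes at once with a deadlock-avoidance algorithm, which sidesteps this particular trap, but it does not remove the contention cost discussed above.)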
The case for lockless
Lockless (or lock-free) data structures aim to reduce or eliminate the need for locks, thereby minimizing blocking and contention. They rely heavily on atomic operations provided by modern CPUs (such as compare-and-swap).
when done right, lockless data structures can offer way higher throughput and low latency under heavy concurrency
Atomic operations are operations on memory guaranteed to complete as a single, indivisible step, preventing partial or interleaved reads and writes by other threads (In concept, something similar to an SQL transaction). In modern CPUs, these operations are typically supported at the hardware level (e.g., compare-and-swap or fetch-and-add instructions) to ensure that no other process or thread can observe or modify the affected memory while the operation is in progress.
By using them, lock-free data structure implementations can perform concurrent reads and writes without the overhead of acquiring and releasing locks.
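As a tiny illustration, here is the racy counter from earlier fixed with atomics, plus the same increment written as an explicit compare-and-swap (CAS) loop, which is the same pattern the queue's Pop() will use on its read index later in this article.

#include <atomic>
#include <thread>

std::atomic<int> counter{0};

// The simple way: fetch_add is a single atomic read-modify-write.
void IncrementSimple() {
    counter.fetch_add(1, std::memory_order_relaxed);
}

// The same increment as an explicit CAS loop: read the current value,
// compute the new one, and only publish it if nobody changed the value
// in between; otherwise retry with the freshly observed value.
void IncrementWithCas() {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_relaxed)) {
        // `expected` now holds the value another thread wrote; retry.
    }
}

int main() {
    std::thread a(IncrementWithCas);
    std::thread b(IncrementSimple);
    a.join();
    b.join();
    // counter is now exactly 2: no lost updates, and no locks involved.
}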
complexity grows rapidly with concurrency.
However, lockless data structures are more complex to write and reason about than their lock-based counterparts. A single mistake in memory ordering or atomic operation usage can cause very subtle bugs so they must be used carefully.
Despite the complexity, when done right, lockless data structures can offer way higher throughput and low latency under heavy concurrency.
A real example of a lockless SPMCQueue in C++
To fulfill the needs of a high-throughput multi-threaded networking system, we need a queue where messages can be stored for later processing.
In the simplest case, there is only one producer (the network thread) that pushes messages into a queue, and multiple consumers (one or more game logic threads) that pop messages for processing. The queue must allow fast, concurrent access without the overhead of locking.
Key features of SPMCQueue
- Single Producer
  - Only one thread calls Push(). This is usually the networking thread receiving data from clients. Since there is only one thread pushing, pushes can never race against each other, so the push path does not need a compare-and-swap loop.
- Multiple Consumers
  - Multiple threads call Pop(). Typically, these are worker threads or the main game logic thread that processes incoming messages. Here, on the other hand, we need to make sure every consumer gets a different message from the queue and none is skipped.
- Lockless Implementation
  - Atomic operations manage the read and write indices in the buffer, avoiding explicit locks.
An SPMCQueue typically uses a ring buffer, a queue whose head and tail are conceptually connected so that it wraps around. Such implementations usually keep track of two atomic counters, one for writing (w) and one for reading (r). Since only one thread calls Push(), contention on w is minimal. On the consumer side, multiple threads attempt to increment r atomically.
Deep dive into the C++ implementation
Below is the SPMCQueue class we will be using. It provides a simple lockless ring buffer structure suitable for single-producer, multiple-consumer scenarios.
#pragma once

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

template<typename T>
class SPMCQueue {
public:
    explicit SPMCQueue(size_t max) : size(max), buffer(max), r(0), w(0) {}

    // Called by the single producer thread only.
    bool Push(T& data) {
        auto write = w.load(std::memory_order_relaxed);
        auto read = r.load(std::memory_order_acquire);
        if (write - read >= size) return false;         // queue is full
        buffer[write % size] = std::move(data);         // fill the slot first
        w.store(write + 1, std::memory_order_release);  // then publish it
        return true;
    }

    // Called by any number of consumer threads. Returns nullptr if the queue
    // is empty. The returned pointer refers to a slot inside the ring buffer,
    // so copy the data out promptly: once the producer wraps around, the slot
    // can be reused.
    T* Pop() {
        while (true) {
            auto read = r.load(std::memory_order_relaxed);
            auto write = w.load(std::memory_order_acquire);
            if (read >= write) return nullptr;          // nothing to consume
            // Try to claim this slot; if another consumer beat us, retry.
            if (r.compare_exchange_weak(read, read + 1, std::memory_order_acquire,
                                        std::memory_order_relaxed)) {
                return &buffer[read % size];
            }
        }
    }

private:
    const size_t size;
    std::vector<T> buffer;
    std::atomic<uint64_t> r;  // read index, incremented by consumers
    std::atomic<uint64_t> w;  // write index, incremented by the producer
};
Let’s break down the key points of this implementation:
- Ring Buffer Storage
  - The queue is implemented as a ring buffer using std::vector<T> buffer. The maximum capacity is size, passed to the constructor.
- Atomic Counters
  - std::atomic<uint64_t> r tracks the read pointer position; std::atomic<uint64_t> w tracks the write pointer position.
  - Because we only have one producer, w is only incremented by a single thread. However, multiple consumers might try to increment r. Thus, both r and w are atomic.
- Push Operation
  - We load the current write index (write = w.load(std::memory_order_relaxed)) and the current read index (read = r.load(std::memory_order_acquire)).
  - We check if the buffer is full by seeing if (write - read) >= size. If it is full, we return false.
  - Otherwise, we move the data into the correct buffer slot (buffer[write % size] = std::move(data)).
  - We then store the updated write index with w.store(write + 1, std::memory_order_release).
- Pop Operation
  - The function enters a while (true) loop because we might need to attempt the pop multiple times if another consumer thread is competing.
  - We load the current read and write indices. If read >= write, the queue is empty; we return nullptr.
  - If the queue is not empty, we attempt to increment the read index using an atomic compare-exchange (r.compare_exchange_weak(...)), with std::memory_order_acquire on success and std::memory_order_relaxed on failure.
  - If the compare-exchange succeeds, it means we have effectively “claimed” the slot at read % size. We return the address of that element.
  - If the compare-exchange fails, another consumer got there first, so we try again.
Memory Ordering
Memory ordering refers to the rules that govern how read and write operations on shared variables can be reordered or observed by different threads in a concurrent system. In practical terms, memory ordering constraints define when updates made by one thread become visible to another, and in what sequence.
Proper use of memory ordering ensures that our lock-free queue executes correctly, preventing obscure data races.
This is, indeed, the hardest part (and it also depends on the CPU architecture).
In the example, we use std::memory_order_relaxed for loads on r and w in some cases because we do not always need a full barrier, and std::memory_order_acquire or std::memory_order_release where needed to ensure proper ordering around critical operations.
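If it helps, here is the canonical two-thread hand-off that release/acquire enables, stripped down to its bare bones. It is only a sketch (payload and ready are illustrative names), but it is the same relationship as writing into buffer and then publishing w in the queue above.

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // ordinary, non-atomic data
std::atomic<bool> ready{false};  // publication flag

void Producer() {
    payload = 42;                                  // 1. write the data
    ready.store(true, std::memory_order_release);  // 2. publish it
}

void Consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the producer publishes
    }
    // The acquire load synchronizes with the release store, so the
    // earlier write to payload is guaranteed to be visible here.
    assert(payload == 42);
}

int main() {
    std::thread c(Consumer);
    std::thread p(Producer);
    p.join();
    c.join();
}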
This pattern ensures that updates to the buffer are visible to consumers once w is incremented, and that changes to r are visible to all relevant threads.
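To see the queue in context, here is a minimal, hypothetical wiring of one producing network thread and two consuming worker threads. NetMessage, the header name and the message counts are all assumptions for the example, and the busy-wait loops are deliberate simplifications; note that each consumer copies the message out of the slot it claimed before doing any slow work with it.

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include "SPMCQueue.h"  // the class above (hypothetical header name)

// Hypothetical message type; kept small and trivially copyable on purpose.
struct NetMessage {
    uint32_t connectionId;
    uint32_t opcode;
};

int main() {
    SPMCQueue<NetMessage> queue(1024);
    constexpr uint32_t total = 10000;

    // Single producer: stands in for the network thread.
    std::thread producer([&] {
        for (uint32_t i = 0; i < total; ++i) {
            NetMessage msg{i, 42};
            while (!queue.Push(msg)) {
                // queue full: a real server would back off or batch here
            }
        }
    });

    // Multiple consumers: stand in for game logic / worker threads.
    std::atomic<uint32_t> consumed{0};
    auto worker = [&] {
        while (consumed.load() < total) {
            if (NetMessage* slot = queue.Pop()) {
                NetMessage msg = *slot;  // copy out of the claimed slot right away
                (void)msg;               // ...process the message here...
                consumed.fetch_add(1);
            }
        }
    };
    std::thread c1(worker), c2(worker);

    producer.join();
    c1.join();
    c2.join();
    std::printf("consumed %u messages\n", static_cast<unsigned>(consumed.load()));
}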
Complexity and afterthought
The SPMC (Single Producer, Multiple Consumer) and MPSC patterns are way simpler than MPMC (Multiple Producers, Multiple Consumers). If your design can be adjusted to reduce concurrency to 1:N or N:1, do it—your life will be much simpler.
Complex software can also be elegant
Lockless queues work best when transferring relatively small, self-contained messages. Large, complicated objects or shared pointers that require reference counting can reintroduce hidden contention or complexity that we should avoid.
Complex software can also be elegant by reducing unneeded edge cases and splitting the system into small, well-scoped pieces.
Conclusion
Designing a scalable and performant MMO server, just like any other high performance system, requires careful attention to concurrency and thread management. Splitting your server into a dedicated network thread and a separate game logic thread prevents I/O operations from blocking critical simulation tasks, enhancing both latency and throughput.
Yet, multi-threaded programming brings with it the possibility of unpredictable outcomes stemming from concurrent access to shared data. Traditional locks (e.g., std::mutex) can address these issues, but often at a performance cost, especially in high-frequency environments where thousands of messages might need processing every second.
This is where lockless queues step in. By relying on atomic operations and well-defined memory orderings, a lockless queue can deliver high throughput with minimal contention. The example shared above illustrates how a single producer can safely share data with multiple consumers without requiring a single global lock. Instead, each operation is carefully managed through atomic compare-and-swap (CAS) instructions on read and write indices in a ring buffer.
But, as always, lockless solutions are not free: when implementing lockless data structures we have to be careful to avoid unneeded complexity and obscure bugs.
Keeping the design as simple as possible and applying these patterns only at performance-critical points of the application is a must—complexity grows rapidly with concurrency.
Following these guidelines and thoroughly understanding the concepts behind concurrency, race conditions, and lockless programming helps us build robust MMO servers (or any other software, really). By isolating your networking layer, reducing blocking calls, and ensuring safe data passage with a lockless queue, you can scale your multiplayer game to support more players with lower latency and lower costs.