Playing 2048 with CUDA C++
This article explores how CUDA C++ is leveraged to accelerate an AI for the game 2048. The techniques and principles discussed can be applied to a wide range of tasks.
Introduction
The deceptively simple puzzle game 2048 is a fitting challenge for GPU-accelerated AI. Its vast search space makes perfect play nearly impossible. As a newcomer to GPU programming, I sought a practical project to solidify my understanding of CUDA C++. To that end, I developed an AI for 2048. The AI outperforms human players by leveraging the GPU's parallelism to explore over a billion moves per second.
Understanding the game
Can you beat a simple puzzle game? In 2048, you slide tiles across a 4×4 grid, and identical tiles merge on contact, doubling their value. The goal is to build a 2048 tile before running out of space. Test your skills by trying it in your browser.
Like chess, 2048 requires strategic thinking. However, random tile spawns introduce an element of chance that makes planning difficult. As you progress, space becomes increasingly limited, adding to the challenge. Together, these properties make higher tiles exponentially harder to reach.
Why 2048?
2048’s simple rules and complex problem space make it an ideal choice for testing AI algorithms. Its discrete state space makes it suitable for evaluating various techniques, such as tree searches and machine learning. As a teenager, I was fascinated by the game and wondered if AI could solve it.
The AI approach: Expectimax search
To conquer 2048, the AI needs strategy and foresight, so I made it explore the potential outcomes of each move. For each move, it considers all possible tile spawns and calculates the move's expected value across these outcomes. The AI then selects the move with the highest expected value.
To evaluate future board configurations, the AI repeats the process, exploring deeper into the game tree. However, the number of possibilities grows exponentially with depth, making it infeasible to explore everything.
To overcome this, the AI uses a heuristic function to estimate a board’s quality beyond a certain depth. The heuristic prioritizes board configurations that facilitate easy merges. By leveraging a heuristic, the AI can make informed decisions without exploring every possible sequence of moves.
Altogether, the AI employs a strategy known as expectimax search to explore the game tree and select the optimal move. It explores 5 moves ahead, searching up to ~10 billion board states. I chose this depth to balance accuracy with runtime.
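Here is a minimal CPU-side sketch of the idea. The `Board` type and the three helpers are illustrative stand-ins (stubbed out below), not the project's actual code:

```cpp
#include <algorithm>
#include <vector>

struct Board { /* 4x4 tile grid, omitted */ };
struct Spawn { Board board; double probability; };

// Stubs standing in for the real game rules and evaluation.
std::vector<Board> legal_moves(const Board&) { return {}; }  // boards after each valid slide
std::vector<Spawn> tile_spawns(const Board&) { return {}; }  // a 2 (90%) or a 4 (10%) in each empty cell
double heuristic(const Board&) { return 0.0; }               // static board evaluation

// Expected value of a position, searching `depth` moves ahead.
double expectimax(const Board& board, int depth) {
    if (depth == 0) return heuristic(board);
    double best = heuristic(board);  // fallback when no moves remain
    for (const Board& moved : legal_moves(board)) {
        // Chance node: average the value over every possible tile spawn.
        double expected = 0.0;
        for (const Spawn& spawn : tile_spawns(moved))
            expected += spawn.probability * expectimax(spawn.board, depth - 1);
        best = std::max(best, expected);  // max node: keep the best move
    }
    return best;
}
```

At the root, the AI simply plays the move whose chance node has the highest expected value.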
Performance
The AI routinely reaches the 4096 tile and has achieved the 16384 tile. For comparison, my personal best is significantly lower at the 2048 tile. It was incredibly satisfying to see the AI surpass my own abilities.
The AI explores approximately 1.2 billion moves per second on my RTX 3060, a mid-range consumer GPU.
GPU acceleration with CUDA C++
The AI is GPU-accelerated with CUDA C++. This enables the AI to explore more moves, leading to better decisions.
Key techniques include:
- Parallel search: The search tree is partitioned into subtrees, each assigned to a thread block.
- Dynamic parallelism: Kernels are launched dynamically to reduce CPU-GPU communication.
- Shared memory: Memory access latency is minimized using shared memory.
- Cooperative groups: Threads use cooperative groups to perform collective operations.
- Unified memory: Unified memory simplifies data transfer between the host and device.
- Modern C++: C++11 and later features are used to improve code clarity and efficiency.
- Template metaprogramming: Generic algorithms and data structures make code reusable.
- Batch processing: Batches of games are processed in parallel to maximize utilization.
- Unit testing: Unit tests ensure code correctness.
Game tree
The AI represents the possible game states as a tree structure. By partitioning the game tree into subtrees assigned to different thread blocks, the AI can explore many possibilities in parallel.
The game tree contains all board configurations and the transitions between them. Each node represents a board state, and each edge represents a move and the subsequent tile spawn. The AI searches this tree up to a maximum depth.
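For illustration, here is a compact board encoding commonly used in 2048 implementations, together with a hypothetical node type; the project's actual definitions may differ:

```cpp
#include <cstdint>

// Each of the 16 cells stores its tile's exponent in 4 bits (2048 = 2^11 is
// stored as 11, an empty cell as 0), so a whole board packs into one 64-bit word.
using Board = std::uint64_t;

__host__ __device__ unsigned cell_at(Board board, int row, int col) {
    return static_cast<unsigned>((board >> (4 * (4 * row + col))) & 0xF);
}

// One game-tree node: a board plus the value propagated up from its children
// (averaged over tile spawns at chance nodes, maximized over moves at max nodes).
struct Node {
    Board board;
    float value;
};
```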
Block parallelism
To parallelize the search, the game tree is divided into smaller subtrees. Each subtree is assigned to a different thread block for exploration. These blocks execute concurrently on the GPU’s streaming multiprocessors (SMs). The results of these parallel searches are then combined to determine the optimal move for the root node.
Each block performs a self-contained search, leveraging fast shared memory and avoiding slow global memory accesses. This approach results in high thread occupancy and arithmetic intensity, making efficient use of the GPU’s resources.
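A minimal sketch of this partitioning, reusing the `Board` encoding above and stubbing out the per-block search:

```cpp
struct Subtree { Board root; /* the move and spawn leading here, omitted */ };

// Stub standing in for the per-block search described in the next sections.
__device__ float explore_subtree(const Subtree&) { return 0.0f; }

__global__ void search_kernel(const Subtree* subtrees, float* values) {
    // blockIdx.x selects this block's subtree; all threads in the block
    // cooperate on it through shared memory.
    float value = explore_subtree(subtrees[blockIdx.x]);
    if (threadIdx.x == 0) values[blockIdx.x] = value;
}

// Host side: one block per subtree, e.g.
// search_kernel<<<num_subtrees, 256>>>(d_subtrees, d_values);
```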
Thread parallelism
Within each block, many threads cooperate to explore a branch of the game tree. Threads calculate all of a node’s children simultaneously. At the maximum search depth, the heuristic function is invoked in parallel for many board states. The results from these explorations are propagated back up the tree using parallel reductions.
To limit memory usage, nodes are processed and discarded as needed. Only one set of children is stored in memory at each depth, which still lets threads operate on different nodes. Once all of a node's children have been evaluated, their values are averaged to determine the parent node's value. The children are then discarded, and the next node's children are calculated.
The diagram above shows one step in the search process: green nodes have been assigned values, yellow nodes are being evaluated, and red nodes have not been evaluated yet.
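Since the reductions are performed with CUB (see the cooperative groups section below), the averaging step might look roughly like this sketch, where the child-value layout and block size are assumptions:

```cpp
#include <cub/block/block_reduce.cuh>

// Average the child values held one-per-thread across the block.
// Assumes 1 <= num_children <= BLOCK_SIZE.
template <int BLOCK_SIZE>
__device__ float average_children(const float* child_values, int num_children) {
    using BlockReduce = cub::BlockReduce<float, BLOCK_SIZE>;
    __shared__ typename BlockReduce::TempStorage temp;

    // Each thread contributes one child's value (0 if it has no child).
    float value = threadIdx.x < num_children ? child_values[threadIdx.x] : 0.0f;
    float sum = BlockReduce(temp).Sum(value);  // result is valid in thread 0 only

    // Broadcast the parent's value from thread 0 to the whole block.
    __shared__ float average;
    if (threadIdx.x == 0) average = sum / num_children;
    __syncthreads();
    return average;
}
```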
Game operations
Tile slides and spawns are processed in parallel. Groups of four threads cooperate to slide tiles across rows. Subsequently, individual threads handle tile spawns. Invalid board configurations, resulting from tile spawns on occupied cells, are discarded. A parallel scan is used to compact the valid board configurations, keeping them contiguous in memory.
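As an illustration of the per-row work, here is a sketch of a left slide on one 16-bit row under the 4-bit-exponent encoding from earlier; the project's real kernel and the other three directions are omitted:

```cpp
#include <cstdint>

// One row is 16 bits: four 4-bit exponents, lowest nibble = leftmost cell.
__device__ std::uint16_t slide_row_left(std::uint16_t row) {
    unsigned out[4] = {0, 0, 0, 0};
    int j = 0;
    bool just_merged = false;  // a tile may only merge once per slide
    for (int i = 0; i < 4; ++i) {
        unsigned cell = (row >> (4 * i)) & 0xF;
        if (cell == 0) continue;  // skip empty cells
        if (j > 0 && !just_merged && out[j - 1] == cell) {
            ++out[j - 1];  // merging doubles the tile: exponent + 1
            just_merged = true;
        } else {
            out[j++] = cell;
            just_merged = false;
        }
    }
    unsigned result = 0;
    for (int i = 0; i < 4; ++i) result |= out[i] << (4 * i);
    return static_cast<std::uint16_t>(result);
}
```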
Dynamic parallelism
To minimize the overhead of CPU-GPU communication, the entire game runs on the GPU. A single kernel runs the game, employing dynamic parallelism to launch search kernels before every move. This approach eliminates the need to synchronize with the CPU, keeping the GPU busy.
The game kernel cannot wait for its child kernels to complete. Thus, it maintains its state in global memory and tail-launches itself after the search completes. It repeats the process until the game ends. The final state, including the score, is then transferred back to host memory.
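A sketch of that loop, assuming CUDA 12's tail-launch stream, relocatable device code (compiled with -rdc=true), and hypothetical helpers:

```cpp
struct GameState { Board board; unsigned score; bool game_over; /* ... */ };

constexpr int NUM_SUBTREES = 64;  // placeholder launch dimensions
constexpr int BLOCK_SIZE = 256;

__device__ void apply_best_move(GameState*) { /* omitted; no-op on first launch */ }
__global__ void move_search_kernel(GameState*) { /* per-move expectimax, omitted */ }

__global__ void game_kernel(GameState* state) {
    apply_best_move(state);        // consume the previous search's result
    if (state->game_over) return;  // the host reads the final state afterwards

    // Search the current position; this child grid runs before the tail launch.
    move_search_kernel<<<NUM_SUBTREES, BLOCK_SIZE>>>(state);

    // Tail launch: re-run this kernel once all the work above has completed.
    game_kernel<<<1, 1, 0, cudaStreamTailLaunch>>>(state);
}
```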
Shared memory
Shared memory has high bandwidth and low latency, so the AI uses it wherever possible. However, shared memory is precious: the more a block consumes, the fewer blocks can be resident on each SM. To cut consumption, the AI reuses shared memory across operations, permitting more resident blocks per SM and improving utilization.
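One common way to reuse shared memory, sketched here with made-up scratch layouts, is a union of per-phase structures:

```cpp
// Illustrative scratch layouts for two phases of the per-block search.
struct SlideScratch  { Board boards[128]; };
struct ReduceScratch { float values[256]; };

__global__ void phased_search_kernel() {
    // The phases never run at the same time, so they can share the same bytes.
    __shared__ union {
        SlideScratch slide;    // phase 1: generating child boards
        ReduceScratch reduce;  // phase 2: averaging child values
    } scratch;

    // ... phase 1 writes scratch.slide ...
    __syncthreads();  // ensure phase 1 is finished before the bytes are reused
    // ... phase 2 reuses the same storage as scratch.reduce ...
}
```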
Cooperative groups
Cooperative groups simplify the code by explicitly defining thread groupings. This makes the code more readable and maintainable, as it eliminates the need to infer group membership and rank from thread indices.
The AI uses cooperative groups to coordinate the work of multiple threads. Collective functions take the group as an argument, making it clear which threads must participate. Block tiles are used for warp and sub-warp operations like shuffles.
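For example, a warp-level broadcast with an explicit 32-thread tile might look like this sketch:

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Every thread in the tile receives the value held by the tile's thread 0,
// with the participating threads spelled out by the group object.
__device__ float broadcast_from_leader(float value) {
    auto block = cg::this_thread_block();
    auto warp = cg::tiled_partition<32>(block);
    return warp.shfl(value, 0);
}
```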
The AI uses CUB for parallel reductions and scans instead of cooperative groups, since CUB performs much better; the gap appears to be a bug in CUDA's cooperative groups implementation.
Unified memory
Unified memory is employed to simplify data transfer between the host and device. This allows the AI to seamlessly access memory from both the CPU and GPU, eliminating the need for explicit device allocations and transfers. However, unified memory has limitations, particularly when using multiple GPUs.
If the GPUs do not support peer-to-peer (P2P) communication, unified memory can degrade performance, as all allocations reside in system memory. While the current implementation does not utilize multiple GPUs, this is an important consideration for the future.
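The single-GPU flow is straightforward; here is a minimal sketch reusing the hypothetical `GameState` and `game_kernel` from the dynamic parallelism section, with error checking omitted:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    GameState* state = nullptr;
    cudaMallocManaged(&state, sizeof(GameState));  // one pointer, visible to host and device

    state->score = 0;              // the host writes directly...
    state->game_over = false;
    game_kernel<<<1, 1>>>(state);  // ...and the device updates it in place
    cudaDeviceSynchronize();       // wait for device writes to become visible

    std::printf("final score: %u\n", state->score);
    cudaFree(state);
}
```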
Modern C++
To write clear and concise code, I leveraged modern C++. Features like `auto`, `const`, `constexpr`, `consteval`, `decltype`, initializer lists, structured bindings, and lambdas improved the code's readability and efficiency. Additionally, the CUDA C++ standard library offers modern C++ abstractions like `array`, `span`, `optional`, and `variant`. These provide safer alternatives to traditional C-style arrays and pointers.
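As a small illustration of how libcu++ types can tidy up device code; the row type and merge rule below are simplified stand-ins:

```cpp
#include <cuda/std/array>
#include <cuda/std/optional>

using Row = cuda::std::array<unsigned, 4>;  // one exponent per cell

// Returns the merged exponent if cells i and i+1 match (i in [0, 2]), or an
// empty optional otherwise: no sentinel values or raw pointers required.
__device__ cuda::std::optional<unsigned> try_merge(const Row& row, int i) {
    if (row[i] != 0 && row[i] == row[i + 1])
        return row[i] + 1;  // doubling a tile adds 1 to its exponent
    return cuda::std::nullopt;
}
```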
Template metaprogramming
The search algorithm is templated for modularity and reusability. Node operations are accepted as template arguments, enabling them to be implemented and tested separately. The compiler can still inline them, improving performance. The search logic can also be repurposed to make an AI for a game other than 2048.
Many device functions accept the block size as a template argument. This enables optimizations like loop unrolling since strides can be calculated at compile time. CUB also requires the block size for all its block operations.
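A sketch of the pattern, using the `Board` type from earlier; the function and its parameters are illustrative:

```cpp
// With BLOCK_SIZE and ITEMS_PER_THREAD fixed at compile time, every index
// below is a compile-time expression and the loop unrolls completely.
template <int BLOCK_SIZE, int ITEMS_PER_THREAD>
__device__ void load_boards(const Board* src, Board* dst) {
    #pragma unroll
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        dst[i * BLOCK_SIZE + threadIdx.x] = src[i * BLOCK_SIZE + threadIdx.x];
}

// Callers choose the block size once and it propagates everywhere:
// load_boards<256, 4>(global_boards, shared_boards);
```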
The AI’s extensive use of templates has necessitated putting everything in header files. A benefit is that the compiler can perform more optimizations because everything is in the same translation unit. Unfortunately, compile times have suffered. I have yet to find a solution.
Batching
To evaluate the AI's performance, I ran many games in parallel, processing them in batches of 32. This approach maximizes GPU utilization, especially during the later stages of a game when the search tree becomes smaller. By executing multiple games within a single kernel, scheduling overhead is minimized, further improving performance.
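In sketch form, with a hypothetical kernel and a simplified one-block-per-game layout:

```cpp
// One launch drives the whole batch; blockIdx.x selects the game.
__global__ void play_batch_kernel(GameState* states, int num_games) {
    if (blockIdx.x >= num_games) return;
    GameState* game = &states[blockIdx.x];  // this block's game
    // ... run this game's searches and moves on `game` ...
}

// Host side: 32 games per launch keep the GPU busy even as searches shrink.
// play_batch_kernel<<<32, BLOCK_SIZE>>>(d_states, 32);
```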
Testing
Unit tests are written using Catch2 to verify code correctness. Device functions are tested by wrapping them in kernel functions that are launched from the CPU. This approach is verbose, and there may be a better way. Assertions, omitted in release builds, are used liberally to verify preconditions.
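The wrapper pattern, sketched with Catch2 v3 and the hypothetical `slide_row_left` from the game operations section:

```cpp
#include <catch2/catch_test_macros.hpp>
#include <cstdint>
#include <cuda_runtime.h>

// Kernel wrapper so the device function can be invoked from a host-side test.
__global__ void slide_row_left_kernel(std::uint16_t row, std::uint16_t* out) {
    *out = slide_row_left(row);
}

TEST_CASE("two adjacent 2s merge into a 4") {
    std::uint16_t* result = nullptr;
    cudaMallocManaged(&result, sizeof(std::uint16_t));

    slide_row_left_kernel<<<1, 1>>>(0x0011, result);  // [2, 2, -, -] as exponents
    cudaDeviceSynchronize();

    CHECK(*result == 0x0002);  // [4, -, -, -]: one exponent-2 tile on the left
    cudaFree(result);
}
```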
Future improvements
Although the AI already outperforms humans, there are a few areas where it can be improved.
Machine learning
Machine learning can be incorporated to improve the AI’s decisions. By training a neural network, the AI can learn to evaluate board states more effectively. The learned heuristic, when combined with a traditional tree search, may guide the AI towards better moves. The neural network can be trained with evolutionary strategies or reinforcement learning. Fully fused neural networks can execute efficiently on the GPU.
Pruning
By employing pruning techniques, the search can focus on the most promising moves. This will allow the AI to explore deeper into the game tree, giving it more foresight.
Closing remarks
This project provided invaluable experience in GPU acceleration using CUDA C++. I’m excited to apply these skills to real-world applications, such as AI and machine learning. GPUs are revolutionizing computing, and I look forward to being a part of this exciting future.