There are many kinds of memory in a GPU, but in this section we are interested in RAM, the biggest chunk of memory, used for textures and data. The main characteristic used to describe RAM is its size: more RAM allows processing more data at a time. Its bandwidth, how fast data can be transferred to and from it, is another important aspect that affects performance.
The specs of the memory used in the GeForce RTX 4090 are given below.
| Spec | Value |
| --- | --- |
| Memory size | 24 GB |
| Memory type | GDDR6X |
| Memory interface width | 384-bit |
| Memory clock (data rate) | 21 Gbps |
| Memory bandwidth | 1008 GB/s |
| PCI Express interface | Gen 4 (x16: 31.508 GB/s) |
To understand these specs, we must understand how the CPU, the GPU, and the GPU memory are interconnected.
There is a link between the GPU and its RAM but there is also a link between the GPU and the CPU. The bandwidth of the PCI-e bus is for the transfer of data between the GPU and the CPU, while the memory bandwidth is between the GPU and its GDDR RAM.
The PCI-e bus is not only used to transfer data between the CPU and the GPU, but also to communicate with and program the GPU via MMIO registers.
The bandwidth of GDDR is usually much higher than the bandwidth of PCI-e.
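To put that gap in numbers, here is a quick back-of-the-envelope comparison using the RTX 4090 figures from the spec table above:

```python
# Compare GDDR6X bandwidth to PCI-e 4.0 x16 bandwidth (RTX 4090 figures).
gddr_bandwidth_gbs = 1008.0    # GB/s, GPU <-> GDDR6X
pcie_bandwidth_gbs = 31.508    # GB/s, CPU <-> GPU over PCI-e 4.0 x16

ratio = gddr_bandwidth_gbs / pcie_bandwidth_gbs
print(f"GDDR is ~{ratio:.0f}x faster than PCI-e")  # prints: GDDR is ~32x faster than PCI-e
```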
The type of RAM used in a GPU is usually GDDR SDRAM (Graphics Double Data Rate Synchronous Dynamic Random-Access Memory), which is similar to the DDR SDRAM used with the CPU but optimized for throughput instead of latency.
The DDR (Double Data Rate) technology essentially transfers data on both the rising and falling edges of the memory clock signal, effectively doubling the data rate (2 bits of data can be transferred during each clock cycle).
The GDDR6X used in the RTX 4090 is the 6th generation of GDDR and is able to transmit 4 bits of data per cycle thanks to PAM4 signaling (four-level pulse amplitude modulation, which encodes 2 bits per transfer instead of 1).
The memory interface of the GPU is N bits wide (384 bits in the case of the 4090, spread across several GDDR chips): each data pin transmits one bit stream in parallel with the others, at the memory data rate.
To put the bandwidth in perspective: with 1008 GB/s of bandwidth and 24 GB of GDDR, the GPU could read or write the whole RAM 42 times per second (1008 / 24 = 42). Bandwidth is therefore not about how much data fits in RAM, but about how many reads and writes per second the memory can sustain, in other words how fast the RAM is.
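As a quick sanity check of that back-of-the-envelope number:

```python
# How many times per second could the GPU read (or write) its entire RAM
# at full bandwidth? (RTX 4090 figures)
bandwidth_gbs = 1008  # GB/s
ram_gb = 24           # GB

full_passes_per_second = bandwidth_gbs / ram_gb
print(full_passes_per_second)  # prints: 42.0
```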
The memory bandwidth is the number of bytes that can be transferred per second. The theoretical maximum memory bandwidth can be computed as follows:
```
theoretical bandwidth = data rate per pin × interface width / 8
                      = 21 Gbps × 384 bits / 8 bits per byte
                      = 1008 GB/s
```
Note that the "memory clock (data rate)" spec of 21 Gbps is not a clock frequency in GHz: it is the effective data rate per pin, in gigabits per second. Since GDDR6X transmits 4 bits per cycle, the actual base clock is much lower than 21 GHz.
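As a sketch, the same computation in Python, using the numbers from the spec table:

```python
# Theoretical peak memory bandwidth from the spec sheet (RTX 4090).
data_rate_gbps = 21         # Gbit/s per pin (effective data rate)
interface_width_bits = 384  # total data pins across all GDDR chips

# Each of the 384 pins transfers 21 Gbit/s; divide by 8 to get bytes.
bandwidth_gbs = data_rate_gbps * interface_width_bits / 8
print(bandwidth_gbs)  # prints: 1008.0
```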
Note that this is much higher than the maximum PCI-e 4.0 bandwidth of 31.508 GB/s (using x16: 16 lanes). So the data transfer between the CPU and the GPU over PCI-e will often be the bottleneck.
For example, if a DDR memory's internal clock runs at 100 MHz, the effective rate is 200 MT/s (million transfers per second), because a clock signal running at 100 MHz has 100 million rising edges and 100 million falling edges per second, and DDR transfers on both.
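The same doubling, as a tiny sketch:

```python
# DDR transfers on both clock edges, so the transfer rate is twice the clock.
clock_mhz = 100
transfers_mts = clock_mhz * 2  # one transfer per rising + one per falling edge
print(f"{transfers_mts} MT/s")  # prints: 200 MT/s
```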
In practice we rarely reach the maximum memory bandwidth. To compute the effective memory bandwidth of a given CUDA kernel, we can use the kernel execution time:
```
effective bandwidth (GB/s) = (bytes read + bytes written) / (10^9 × execution time in seconds)
```
For our SAXPY example:
```
effective bandwidth = (N × 4 × 3) / (10^9 × execution time in seconds)
```

where each of the N elements requires one read of x, one read of y, and one write of y: 3 accesses of 4 bytes each, since x and y are arrays of 32-bit floats.
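As a sketch, the same computation in Python; the array size N and the kernel execution time used below are made-up illustrative values, not measurements:

```python
# Effective bandwidth of a SAXPY kernel (y = a*x + y).
# Per element: x is read, y is read, y is written -> 3 float accesses.
N = 1 << 20                 # number of elements (illustrative value)
bytes_per_float = 4
elapsed_ms = 0.05           # kernel execution time (made-up value)

bytes_moved = N * bytes_per_float * 3
effective_bw_gbs = bytes_moved / (elapsed_ms / 1000) / 1e9
print(f"{effective_bw_gbs:.1f} GB/s")  # prints: 251.7 GB/s
```

Comparing this number to the theoretical 1008 GB/s tells you how close the kernel gets to saturating the memory bus.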