The four-level hierarchy of modern parallel systems. Nodes have disjoint DRAM address spaces and communicate over a message-passing network in the CPU case, or over a shared PCI-Express network in the GPU case. Sockets within a node (only one is shown) share DRAM but have private caches: the L3 cache in CPU systems and the L2 cache in Fermi-class GPU systems. Similarly, Cores share access to the Socket-level cache but have private caches (L2 on CPUs, L1/scratchpad on GPUs). Vector-style parallelism within a core is exploited via Lanes: SSE-style SIMD instructions on CPUs, or SIMT-style execution on GPUs.
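As a rough illustration of the hierarchy described in this caption, the sketch below maps a flat work-item index onto (node, socket, core, lane) coordinates and reports the closest level that two work items share. The names `Shape`, `Coord`, `locate`, and `shared_level`, and the counts in the usage note, are hypothetical and chosen only for illustration; they do not come from the figure or from any particular machine.

```python
from typing import NamedTuple


class Shape(NamedTuple):
    """Hypothetical machine shape; the counts are illustrative assumptions."""
    nodes: int
    sockets_per_node: int
    cores_per_socket: int
    lanes_per_core: int


class Coord(NamedTuple):
    """Position of one work item in the four-level hierarchy."""
    node: int
    socket: int
    core: int
    lane: int


def locate(index: int, shape: Shape) -> Coord:
    """Split a flat work-item index into hierarchy coordinates (lane varies fastest)."""
    total = (shape.nodes * shape.sockets_per_node
             * shape.cores_per_socket * shape.lanes_per_core)
    if not 0 <= index < total:
        raise IndexError(f"index {index} outside 0..{total - 1}")
    # Peel off one level at a time, innermost (lane) first.
    index, lane = divmod(index, shape.lanes_per_core)
    index, core = divmod(index, shape.cores_per_socket)
    node, socket = divmod(index, shape.sockets_per_node)
    return Coord(node, socket, core, lane)


def shared_level(a: Coord, b: Coord) -> str:
    """Name the closest resource two work items share, following the caption."""
    if a.node != b.node:
        return "network"        # disjoint DRAM: must communicate between nodes
    if a.socket != b.socket:
        return "DRAM"           # sockets in one node share DRAM
    if a.core != b.core:
        return "socket cache"   # cores in one socket share the socket-level cache
    return "core cache"         # lanes in one core share the core-private cache
```

For example, with the assumed shape `Shape(nodes=2, sockets_per_node=2, cores_per_socket=4, lanes_per_core=4)`, work items 0 and 1 fall in the same core, 0 and 5 in different cores of one socket, 0 and 16 in different sockets of one node, and 0 and 32 in different nodes.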