HPC, CUDA, and OpenCL

[Crossword grid (17 columns) omitted]
Across
  4. Memory space that acts as an on-chip, software-managed cache used by a single work-group or block to avoid global memory latency.
  6. Network architecture extending a standard mesh by connecting boundary nodes to improve scalability and shorten communication paths.
  9. The smallest unit of concurrent execution in the OpenCL programming model.
  10. An open standard framework requiring significant boilerplate code to adapt dynamically to varied hardware environments.
  11. The physical barrier preventing processor clock speeds from increasing further due to excessive heat dissipation.
  13. A specialized read-only memory optimized for broadcasting the same data to multiple threads simultaneously.
  14. NVIDIA's proprietary framework in which the fundamental execution unit is the "thread" rather than the "work-item".
  16. GPUs are optimized to maximize this metric, unlike CPUs, which focus on minimizing latency.
  17. The N-dimensional index space mapped to physical hardware during OpenCL kernel execution.
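Several of the Across answers appear together in a few lines of CUDA. A minimal sketch (kernel and variable names are illustrative, and n is assumed to be a multiple of the block size) showing a block staging data in on-chip shared memory before reusing it:

```cuda
// Illustrative CUDA sketch: each block copies a tile of the input into
// __shared__ memory -- the on-chip, software-managed cache -- so the
// later read comes from fast shared memory instead of global memory.
__global__ void reverse_tile(const float *in, float *out, int n)
{
    __shared__ float tile[256];            // one tile per block, 256-thread blocks assumed

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];         // each thread loads one element

    __syncthreads();                       // block-wide barrier before reuse

    if (i < n)
        out[i] = tile[blockDim.x - 1 - threadIdx.x];  // read from shared, not global
}
```

The launch configuration (grid and block dimensions) is CUDA's counterpart of the OpenCL index space named in 17 Across.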
Down
  1. Occurs when threads in a single 32-thread SIMT group take different execution paths at a conditional statement, hurting performance.
  2. Network performance metric defining the maximum data transfer rate of an interconnect network.
  3. The architectural model that uses the CPU for serial regions of a program and the GPU for parallel regions.
  4. A software utility responsible for allocating nodes and managing resources for user job submissions in an HPC cluster.
  5. The fundamental design limitation behind the "memory wall," caused by a shared pathway for both data and instructions.
  7. The Linux-based commodity cluster that provided a cost-effective alternative to custom supercomputers in the 2000s.
  8. For maximum bandwidth efficiency, GPU global memory accesses must be this: adjacent threads access adjacent memory addresses.
  12. A group of 32 parallel threads managed and scheduled in lockstep by a CUDA multiprocessor.
  15. A C++ function marked with the __global__ specifier, executed millions of times in parallel on the device.
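Two of the Down answers can be contrasted in a single short kernel. A minimal sketch (names are illustrative, not from any real codebase) showing a coalesced access pattern next to a branch that splits a 32-thread group:

```cuda
// Illustrative CUDA sketch contrasting a friendly and an unfriendly pattern.
__global__ void divergence_demo(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Coalesced access: thread i reads element i, so adjacent threads in the
    // same 32-thread group touch adjacent addresses, maximizing bandwidth.
    float x = in[i];

    // Divergence: even and odd lanes of the same 32-thread group take
    // different branches, so the hardware serializes the two paths.
    if (threadIdx.x % 2 == 0)
        out[i] = x * 2.0f;
    else
        out[i] = x + 1.0f;
}
```

Restructuring the branch so that whole 32-thread groups take the same path (for example, branching on blockIdx rather than threadIdx) avoids the serialization.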