HPC, CUDA and OpenCL
Across
- 4. Memory space that acts as an on-chip, software-managed cache utilized by a single work-group or block to avoid global memory latency.
- 6. Network architecture extending a standard mesh by connecting boundary nodes to improve scalability and reduce communication paths.
- 9. The absolute smallest unit of concurrent execution in the OpenCL programming model.
- 10. An open standard framework requiring significant boilerplate code to adapt dynamically to various hardware environments.
- 11. The physical barrier preventing processor clock speeds from increasing further due to excessive heat dissipation.
- 13. A specialized read-only memory optimized for broadcasting the exact same data to multiple threads simultaneously.
- 14. NVIDIA's proprietary framework where the fundamental execution unit is the "thread" rather than the "work-item".
- 16. GPUs are strictly optimized to maximize this metric, unlike CPUs which focus on minimizing latency.
- 17. The N-dimensional index space mapped to physical hardware during OpenCL kernel execution.
Down
- 1. Occurs when threads in a single 32-thread SIMT group take varying execution paths due to a conditional statement, hurting performance.
- 2. This network performance metric defines the maximum data transfer rate in an interconnect network.
- 3. The architectural model that utilizes both the CPU for serial execution regions and the GPU for parallel computation regions.
- 4. A software utility responsible for allocating nodes and managing resources for user job submissions in an HPC cluster.
- 5. The fundamental design limitation causing the "memory wall," resulting from a shared pathway for both data and instructions.
- 7. The Linux-based commodity cluster that provided a cost-effective alternative to custom supercomputers in the 2000s.
- 8. For maximum bandwidth efficiency, GPU global memory accesses must be this, meaning adjacent threads access adjacent memory addresses.
- 12. A group of 32 parallel threads managed and scheduled synchronously by a CUDA multiprocessor.
- 15. A C++ function marked with the `__global__` specifier, executed millions of times in parallel on the device.