
HGPU group

@hgpu

High performance computing on graphics processing units (GPU): AMD, Nvidia, Intel, CUDA, OpenCL, OpenGL, HPC

88 Followers · 11 Following · 320 Posts · Joined 15.11.2024

Latest posts by HGPU group @hgpu

CONCUR: Benchmarking LLMs for Concurrent Code Generation Leveraging Large Language Models (LLMs) for code generation has increasingly emerged as a common practice in the domain of software engineering. Relevant benchmarks have been established to evaluat…

CONCUR: Benchmarking LLMs for Concurrent Code Generation

#CodeGeneration #LLM #Package

hgpu.org?p=30644

08.03.2026 16:34 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
RepoLaunch: Automating Build & Test Pipeline of Code Repositories on ANY Language and ANY Platform Building software repositories typically requires significant manual effort. Recent advances in large language model (LLM) agents have accelerated automation in software engineering (SWE). We intro…

RepoLaunch: Automating Build & Test Pipeline of Code Repositories on ANY Language and ANY Platform

#LLM #Package

hgpu.org?p=30643

08.03.2026 16:34 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Ray Tracing using HIP In this technical report, we introduce the basics of ray tracing and explain how to accelerate the computation of the rendering algorithm in HIP. We also show how to use a HIP ray tracing framework…

Ray Tracing using HIP

#HIP #AMD #Raytracing #Rendering #Package

hgpu.org?p=30642

08.03.2026 16:33 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Catalyst-Agent: Autonomous heterogeneous catalyst screening and optimization with an LLM Agent The discovery of novel catalysts tailored for particular applications is a major challenge for the twenty-first century. Traditional methods for this include time-consuming and expensive experiment…

Catalyst-Agent: Autonomous heterogeneous catalyst screening and optimization with an LLM Agent

#Chemistry #LLM #Catalyst

hgpu.org?p=30641

08.03.2026 16:32 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs Training large-scale Mixture-of-Experts (MoE) models is bottlenecked by activation memory and expert-parallel communication, yet FP4 training remains impractical on Hopper-class GPUs without native…

Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs

#CUDA #LLM #Hopper #FP4 #Precision #Package

hgpu.org?p=30640

08.03.2026 16:30 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
CUDABench: Benchmarking LLMs for Text-to-CUDA Generation Recent studies have demonstrated the potential of Large Language Models (LLMs) in generating GPU Kernels. Current benchmarks focus on the translation of high-level languages into CUDA, overlooking …

CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

#CUDA #LLM #Benchmarking #Package

hgpu.org?p=30630

04.03.2026 20:14 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging due to dependencies on both GPU kernel efficiency and host-side setti…

StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

#CUDA #CodeGeneration #LLM

hgpu.org?p=30629

04.03.2026 20:13 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large lang…

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

#CUDA #CodeGeneration #LLM #Package

hgpu.org?p=30628

04.03.2026 20:13 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundame…

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

#CodeGeneration #LLM #Package

hgpu.org?p=30620

01.03.2026 20:08 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
CL4SE: A Context Learning Benchmark For Software Engineering Tasks Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks, enabling performance gains at test time without…

CL4SE: A Context Learning Benchmark For Software Engineering Tasks

#CodeGeneration #LLM #Package

hgpu.org?p=30619

01.03.2026 20:07 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
From Prompts to Performance: Evaluating LLMs for Task-based Parallel Code Generation Large Language Models (LLMs) show strong abilities in code generation, but their skill in creating efficient parallel programs is less studied. This paper explores how LLMs generate task-based paral…

From Prompts to Performance: Evaluating LLMs for Task-based Parallel Code Generation

#OpenMP #LLM #CodeGeneration

hgpu.org?p=30618

01.03.2026 20:07 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
A Survey of Recent Developments in SYCL Compiler Implementations This survey discusses recent advancements in SYCL compiler implementations, one of the crucial aspects of compiler construction for heterogeneous computing systems. We explore the transition from t…

A Survey of Recent Developments in SYCL Compiler Implementations

#SYCL #Compilers

hgpu.org?p=30617

01.03.2026 20:06 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Joint Training on AMD and NVIDIA GPUs As large language models continue to scale, training demands on compute and system capacity grow rapidly, making single-vendor homogeneous clusters insufficient. This paper presents a technical sol…

Joint Training on AMD and NVIDIA GPUs

#CUDA #ROCm #LLM #NVIDIA #AMD

hgpu.org?p=30616

01.03.2026 20:05 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Fine-Tuning GPT-5 for GPU Kernel Generation Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertis…

Fine-Tuning GPT-5 for GPU Kernel Generation

#Triton #CUDA #LLM

hgpu.org?p=30592

22.02.2026 22:14 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning Optimizing CUDA code across multiple generations of GPU architectures is challenging, as achieving peak performance requires an extensive exploration of an increasingly complex, hardware-specific o…

KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning

#CUDA #LLM #Performance

hgpu.org?p=30591

22.02.2026 22:13 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
HPC++: An LLVM-Based Automatic Parallelization Framework with Heterogeneous CPU–GPU Execution We present HPC++, an automatic parallelization framework that transforms sequential C++ programs into efficient parallel implementations targeting both multi-core CPUs and OpenCL-capable GPUs. Oper…

HPC++: An LLVM-Based Automatic Parallelization Framework with Heterogeneous CPU–GPU Execution

#OpenCL #HPC #LLVM

hgpu.org?p=30590

22.02.2026 22:13 πŸ‘ 0 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0
OptiML: An End-to-End Framework for Program Synthesis and CUDA Kernel Optimization Generating high-performance CUDA kernels remains challenging due to the need to navigate a combinatorial space of low-level transformations under noisy and expensive hardware feedback. Although lar…

OptiML: An End-to-End Framework for Program Synthesis and CUDA Kernel Optimization

#CUDA #LLM #Performance

hgpu.org?p=30588

22.02.2026 22:12 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5 The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has driven major gains in reasoning, perception, and generation across language and vision, yet whet…

A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

#LLM #Security #Package

hgpu.org?p=30589

22.02.2026 22:12 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Execution-Centric Characterization of FP8 Matrix Cores, Asynchronous Execution, and Structured Sparsity on AMD MI300A The AMD MI300A APU integrates CDNA3 GPUs with high-bandwidth memory and advanced accelerator features: FP8 matrix cores, asynchronous compute engines (ACE), and 2:4 structured sparsity. These capab…

Execution-Centric Characterization of FP8 Matrix Cores, Asynchronous Execution, and Structured Sparsity on AMD MI300A

#AMD #HIP #ROCm

hgpu.org?p=30572

15.02.2026 22:11 πŸ‘ 1 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0
Improving Code Generation via Small Language Model-as-a-judge Large language models (LLMs) have shown remarkable capabilities in automated code generation. While effective for mainstream languages, they may underperform on less common or domain-specific langu…

Improving Code Generation via Small Language Model-as-a-judge

#CodeGeneration #LLM

hgpu.org?p=30571

15.02.2026 22:11 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Deep Kernel Fusion for Transformers Agentic LLM inference with long contexts is increasingly limited by memory bandwidth rather than compute. In this setting, SwiGLU MLP blocks, whose large weights exceed cache capacity, become a maj…

Deep Kernel Fusion for Transformers

#CUDA #LLM #Performance

hgpu.org?p=30570

15.02.2026 22:10 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly we…

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

#CUDA #LLM #CodeGeneration #Package

hgpu.org?p=30569

15.02.2026 22:09 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards Large language models (LLMs) have demonstrated strong code generation capabilities, yet the runtime performance of generated code is not guaranteed, and there have been few attempts to train LLMs u…

Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards

#CUDA #OpenMP #HPC #CodeGeneration #LLM

hgpu.org?p=30568

15.02.2026 22:09 πŸ‘ 0 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0
Inside VOLT: Designing an Open-Source GPU Compiler (Tool) Recent efforts in open-source GPU research are opening new avenues in a domain that has long been tightly coupled with a few commercial vendors. Emerging open GPU architectures define SIMT function…

Inside VOLT: Designing an Open-Source GPU Compiler (Tool)

#OpenCL #CUDA #FPGA #Compilers #Package

hgpu.org?p=30544

08.02.2026 21:44 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
HetCCL: Accelerating LLM Training with Heterogeneous GPUs The rapid growth of large language models is driving organizations to expand their GPU clusters, often with GPUs from multiple vendors. However, current deep learning frameworks lack support for co…

HetCCL: Accelerating LLM Training with Heterogeneous GPUs

#LLM #DeepLearning #DL

hgpu.org?p=30545

08.02.2026 21:42 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Just-in-Time Catching Test Generation at Meta We report on Just-in-Time catching test generation at Meta, designed to prevent bugs in large-scale backend systems of hundreds of millions of lines of code. Unlike traditional hardening tests, whic…

Just-in-Time Catching Test Generation at Meta

#LLM #Testing

hgpu.org?p=30543

08.02.2026 21:41 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Scaling GPU-to-CPU Migration for Efficient Distributed Execution on CPU Clusters The growing demand for GPU resources has led to widespread shortages in data centers, prompting the exploration of CPUs as an alternative for executing GPU programs. While prior research supports e…

Scaling GPU-to-CPU Migration for Efficient Distributed Execution on CPU Clusters

#CUDA #Triton #Compilers

hgpu.org?p=30542

08.02.2026 21:40 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Accelerating Scientific Research with Gemini: Case Studies and Common Techniques Recent advances in large language models (LLMs) have opened new avenues for accelerating scientific research. While models are increasingly capable of assisting with routine tasks, their ability to…

Accelerating Scientific Research with Gemini: Case Studies and Common Techniques

#LLM #Gemini #Review

hgpu.org?p=30541

08.02.2026 21:40 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations High-quality kernels are critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data, a robus…

Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

#CUDA #Triton #Package

hgpu.org?p=30540

08.02.2026 21:39 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
SciDef: Automating Definition Extraction from Academic Literature with Large Language Models Definitions are the foundation for any scientific work, but with a significant increase in publication numbers, gathering definitions relevant to any keyword has become challenging. We therefore in…

SciDef: Automating Definition Extraction from Academic Literature with Large Language Models

#LLM #NLP #Package

hgpu.org?p=30539

08.02.2026 21:39 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0