WhisperX ROCm Fork

This is a fork of m-bain/whisperX configured for AMD ROCm GPUs.

Tested Configuration

  • OS: Debian Sid (trixie/forky)
  • GPU: AMD Radeon RX 7700 XT (gfx1101, 12GB VRAM)
  • ROCm: 7.1.1
  • Python: 3.10
  • PyTorch: 2.11.0+rocm7.0 (nightly)
  • CTranslate2: 4.6.2 (ROCm build from paralin/ctranslate2-rocm)

Prerequisites

  1. ROCm 7.1+ installed at /opt/rocm
  2. CTranslate2 built with ROCm support (see paralin/ctranslate2-rocm)

Installation (Debian Sid)

1. Clone and setup

git clone https://github.com/paralin/whisperX-rocm.git ~/whisperx
cd ~/whisperx
git checkout rocm

2. Create venv with uv

uv venv
uv pip install -e .

3. Install ROCm PyTorch

uv pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.0

4. Install ROCm CTranslate2

First build CTranslate2 with ROCm support (see paralin/ctranslate2-rocm), then:

# Remove PyPI ctranslate2 (has CUDA binaries)
rm -rf .venv/lib/python3.10/site-packages/ctranslate2*

# Install ROCm build
export CTRANSLATE2_ROOT=/usr/local
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
uv pip install --reinstall pybind11 ~/ctranslate2/python

Environment Variables

These must be set before running whisperx:

export HSA_OVERRIDE_GFX_VERSION=11.0.1  # for gfx1101
export AMDGPU_TARGETS=gfx1101
export ROCM_PATH=/opt/rocm
export HIP_VISIBLE_DEVICES=0
export ROCR_VISIBLE_DEVICES=0
export LD_LIBRARY_PATH=/usr/local/lib:/opt/rocm/lib:/opt/rocm/lib/llvm/lib:$LD_LIBRARY_PATH
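When whisperx is launched from a script instead of an interactive shell, the same variables can be applied to a single process. The following is a minimal sketch, not part of this fork: `rocm_env` is a hypothetical helper, and its defaults are simply the values listed above for a gfx1101 card.

```python
from typing import Dict, Mapping

# Library directories from the export line above, highest priority first.
ROCM_LIB_DIRS = ["/usr/local/lib", "/opt/rocm/lib", "/opt/rocm/lib/llvm/lib"]


def rocm_env(
    base: Mapping[str, str],
    gfx_version: str = "11.0.1",
    gfx_target: str = "gfx1101",
    device_index: str = "0",
) -> Dict[str, str]:
    """Return a copy of `base` with the ROCm variables from this section applied.

    Hypothetical helper for illustration; change the defaults for other GPUs.
    """
    env = dict(base)
    env["HSA_OVERRIDE_GFX_VERSION"] = gfx_version
    env["AMDGPU_TARGETS"] = gfx_target
    env["ROCM_PATH"] = "/opt/rocm"
    env["HIP_VISIBLE_DEVICES"] = device_index
    env["ROCR_VISIBLE_DEVICES"] = device_index
    # Prepend the ROCm directories and keep any existing search path after them.
    existing = [p for p in base.get("LD_LIBRARY_PATH", "").split(":") if p]
    env["LD_LIBRARY_PATH"] = ":".join(ROCM_LIB_DIRS + existing)
    return env


if __name__ == "__main__":
    # Show the variables that would be set on top of an empty environment.
    for name, value in sorted(rocm_env({}).items()):
        print(f"{name}={value}")
```

A launcher could then pass the result to `subprocess.run([...], env=rocm_env(os.environ), check=True)`, which leaves the calling shell's environment untouched.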

Usage

# Set environment (add to ~/.bashrc for convenience)
export HSA_OVERRIDE_GFX_VERSION=11.0.1
export ROCM_PATH=/opt/rocm
export HIP_VISIBLE_DEVICES=0
export LD_LIBRARY_PATH=/usr/local/lib:/opt/rocm/lib:/opt/rocm/lib/llvm/lib:$LD_LIBRARY_PATH

cd ~/whisperx
uv run whisperx audio.wav \
  --language en \
  --model large-v3 \
  --compute_type float16 \
  --device cuda \
  --batch_size 8 \
  --vad_method silero \
  --output_dir ./output \
  --output_format all

Note: --device cuda is correct on AMD hardware. ROCm builds of PyTorch expose the GPU through the same torch.cuda API, and the HIP layer maps those CUDA-style calls onto the AMD runtime.
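To confirm which backend sits behind the `cuda` device name, the PyTorch build metadata can be inspected. This is a small sketch: `describe_backend` is a hypothetical helper, while `torch.version.hip` and `torch.version.cuda` are the attributes PyTorch sets for ROCm and CUDA builds respectively.

```python
from typing import Optional


def describe_backend(hip_version: Optional[str], cuda_version: Optional[str]) -> str:
    """Classify a PyTorch build from torch.version.hip and torch.version.cuda."""
    if hip_version is not None:
        return f"ROCm/HIP build (HIP {hip_version}); use device 'cuda'"
    if cuda_version is not None:
        return f"CUDA build (CUDA {cuda_version})"
    return "CPU-only build"


if __name__ == "__main__":
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print(describe_backend(torch.version.hip, torch.version.cuda))
```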

Verify Installation

# Check PyTorch sees the GPU
python -c "import torch; print('CUDA:', torch.cuda.is_available()); print('Device:', torch.cuda.get_device_name(0))"

# Check CTranslate2
python -c "import ctranslate2; print(ctranslate2.__version__); print(ctranslate2.get_supported_compute_types('cuda'))"

Expected output:

CUDA: True
Device: AMD Radeon RX 7700 XT

4.6.2
{'int8_float16', 'int8_bfloat16', 'bfloat16', 'int8_float32', 'int8', 'float16', 'float32'}
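The set printed by CTranslate2 can also drive the choice of --compute_type. The sketch below is illustrative only: `pick_compute_type` and its preference order are assumptions, not something whisperx does itself.

```python
from typing import Iterable, Sequence


def pick_compute_type(
    supported: Iterable[str],
    preferred: Sequence[str] = ("float16", "int8_float16", "float32"),
) -> str:
    """Return the first entry of `preferred` that appears in `supported`."""
    available = set(supported)
    for name in preferred:
        if name in available:
            return name
    raise ValueError(f"none of {list(preferred)} is supported: {sorted(available)}")


if __name__ == "__main__":
    # The set from the expected output above; on a real system this would come
    # from ctranslate2.get_supported_compute_types("cuda").
    reported = {"int8_float16", "int8_bfloat16", "bfloat16", "int8_float32",
                "int8", "float16", "float32"}
    print(pick_compute_type(reported))
```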

GPU Architecture

Set HSA_OVERRIDE_GFX_VERSION based on your GPU:

GPU                  Architecture   HSA_OVERRIDE_GFX_VERSION
RX 7900 XTX/XT       gfx1100        11.0.0
RX 7800 XT           gfx1101        11.0.1
RX 7700 XT           gfx1101        11.0.1
RX 7600              gfx1102        11.0.2
RX 6900/6800/6700    gfx1030        10.3.0
RX 6600              gfx1032        10.3.2
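In every row of the table, the override value is the architecture name split into major, minor, and stepping numbers. The sketch below encodes that pattern as inferred from the table alone; `gfx_to_override` is a hypothetical helper that only handles all-decimal names such as those listed here.

```python
import re


def gfx_to_override(arch: str) -> str:
    """Convert a name like 'gfx1101' to a version string like '11.0.1'.

    Assumes the last digit is the stepping, the one before it the minor
    version, and the remaining digits the major version.
    """
    match = re.fullmatch(r"gfx(\d+)(\d)(\d)", arch)
    if match is None:
        raise ValueError(f"unsupported architecture name: {arch!r}")
    major, minor, stepping = match.groups()
    return f"{int(major)}.{minor}.{stepping}"


if __name__ == "__main__":
    for arch in ("gfx1100", "gfx1101", "gfx1102", "gfx1030", "gfx1032"):
        print(f"{arch}: export HSA_OVERRIDE_GFX_VERSION={gfx_to_override(arch)}")
```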

Upstream

The upstream project is m-bain/whisperX.

Known Issues

Memory Access Fault on Long Audio

When transcribing longer audio files (>60s), you may encounter:

Memory access fault by GPU node-1 (Agent handle: 0x...) on address 0x...
Reason: Page not present or supervisor privilege.

Status: Under investigation. Clips of up to about 60s transcribe correctly, at roughly 28x realtime with the small model.

Workaround: Process audio in chunks, or use CPU mode for long files.

Working example (first 60s):

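from faster_whisper import WhisperModel

# ROCm builds still use device="cuda"
model = WhisperModel("small", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", language="en", clip_timestamps=[0, 60])
for segment in segments:  # segments is a lazy generator; iterating runs the transcription
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")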
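The chunking workaround could be sketched as follows. `chunk_bounds` and `transcribe_in_chunks` are hypothetical names, while `decode_audio` and `WhisperModel.transcribe` are faster-whisper calls. The 60-second window comes from the observation above and is not a verified safe limit, and fixed-length cuts can split a word across two chunks.

```python
from typing import Iterator, List, Tuple

SAMPLE_RATE = 16000  # rate requested from decode_audio below


def chunk_bounds(
    num_samples: int, chunk_seconds: float = 60.0, sample_rate: int = SAMPLE_RATE
) -> List[Tuple[int, int]]:
    """Split [0, num_samples) into consecutive ranges of at most chunk_seconds."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    step = max(1, int(chunk_seconds * sample_rate))
    return [(start, min(start + step, num_samples))
            for start in range(0, num_samples, step)]


def transcribe_in_chunks(
    path: str, model_name: str = "small", chunk_seconds: float = 60.0
) -> Iterator[Tuple[float, float, str]]:
    """Yield (start, end, text) with times relative to the whole file."""
    # Imported here so chunk_bounds can be used without faster-whisper installed.
    from faster_whisper import WhisperModel, decode_audio

    audio = decode_audio(path, sampling_rate=SAMPLE_RATE)
    model = WhisperModel(model_name, device="cuda", compute_type="float16")
    for start, end in chunk_bounds(len(audio), chunk_seconds):
        offset = start / SAMPLE_RATE  # shift chunk-relative times back to file time
        segments, _info = model.transcribe(audio[start:end], language="en")
        for segment in segments:
            yield offset + segment.start, offset + segment.end, segment.text


if __name__ == "__main__":
    # Sample ranges for a 150-second file, without touching the GPU.
    print(chunk_bounds(150 * SAMPLE_RATE))
```

Loading the model once and reusing it across chunks avoids paying the model load cost for every window.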

Search terms for updates:

  • "Memory access fault by GPU node" "Page not present or supervisor privilege" ROCm 7.1 PyTorch site:github.com
  • "Memory access fault" ROCm CTranslate2 faster-whisper gfx1101

This may be related to:

  • ROCm 7.1.1 + PyTorch nightly (2.11.0+rocm7.0) incompatibility
  • GPU memory fragmentation with longer sequences
  • HIP/ROCm memory management issues with certain operations