Hardware Optimization

Heretic detects the available hardware and automatically tunes performance settings, such as batch size, for your system. This guide covers both automatic and manual optimization techniques.

Automatic Batch Size Detection

By default, Heretic automatically determines the optimal batch size for your hardware:

config.toml

# Automatic batch size detection (default)
batch_size = 0  # 0 = auto-detect

When set to 0, Heretic will:

1. Benchmark different batch sizes: starting from batch size 1, it doubles the batch size after each test (1, 2, 4, 8, 16, …) until max_batch_size is reached or a run fails.
2. Measure throughput: for each batch size, it measures tokens/second after a warmup run.
3. Find the optimal size: it selects the batch size with the highest throughput among those that ran without an out-of-memory (OOM) error.
4. Use it throughout the session: it applies the chosen batch size to all subsequent operations.

How It Works

The automatic detection process:

# Pseudo-code from main.py:332-376
batch_size = 1
best_batch_size = -1
best_performance = -1

while batch_size <= max_batch_size:
    try:
        # `prompts` is assumed to hold `batch_size` prompts for this round

        # Warmup run to build computation graph (not timed)
        model.get_responses(prompts)

        # Benchmark run
        start_time = time.perf_counter()
        responses = model.get_responses(prompts)
        end_time = time.perf_counter()

        # Calculate throughput in tokens/second;
        # `total_tokens` is assumed to be the number of tokens in `responses`
        performance = total_tokens / (end_time - start_time)

        if performance > best_performance:
            best_batch_size = batch_size
            best_performance = performance

    except Exception:
        # OOM or other error - stop here
        break

    batch_size *= 2

Automatic detection typically adds 1-3 minutes to startup time, but the rest of the run then uses the fastest batch size found during the search.
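The loop above can be turned into a small, self-contained sketch. The `find_best_batch_size` helper and the `benchmark` callable are hypothetical names used for illustration, not Heretic's API; `benchmark(batch_size)` is assumed to perform the warmup and timed run and return throughput in tokens/second, raising an exception when the batch does not fit in memory. `fake_benchmark` uses made-up numbers.

```python
from typing import Callable


def find_best_batch_size(
    benchmark: Callable[[int], float],
    max_batch_size: int = 128,
) -> tuple[int, float]:
    """Double the batch size until a run fails or the limit is exceeded.

    Returns (best_batch_size, best_throughput) among the sizes that ran.
    """
    batch_size = 1
    best_batch_size = -1
    best_performance = -1.0

    while batch_size <= max_batch_size:
        try:
            performance = benchmark(batch_size)
        except Exception:
            if batch_size == 1:
                # Nothing fits at all; there is no smaller size to fall back to.
                raise
            break  # treat any failure (typically OOM) as the upper bound

        if performance > best_performance:
            best_batch_size = batch_size
            best_performance = performance

        batch_size *= 2

    return best_batch_size, best_performance


def fake_benchmark(batch_size: int) -> float:
    """Stand-in for a real warmup + timed generation run (made-up numbers)."""
    if batch_size > 32:
        raise MemoryError("simulated out-of-memory")
    # Throughput grows with batch size, then saturates and dips slightly.
    return {1: 40.0, 2: 78.0, 4: 150.0, 8: 270.0, 16: 410.0, 32: 395.0}[batch_size]


best_size, best_tps = find_best_batch_size(fake_benchmark)
print(f"best batch size: {best_size} ({best_tps:.0f} tokens/s)")
```

Unlike the pseudo-code, this sketch re-raises when even batch size 1 fails instead of leaving the result at -1. Note that the largest batch size that fits is not necessarily the fastest: in the simulated data, 16 beats 32.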

Manual Batch Size Tuning

For more control, you can set the batch size manually:

config.toml

# Set explicit batch size
batch_size = 16

Command line

heretic --batch-size 16 Qwen/Qwen3-4B-Instruct-2507

When to Use Manual Tuning

Reproducibility: ensure consistent behavior across multiple runs
Shared resources: control memory usage on multi-user systems
Known configuration: skip detection when you know the optimal value
Debugging: isolate issues by fixing the batch size

Maximum Batch Size Limit

Control the upper bound for automatic detection:

config.toml

# Prevent OOM during batch size detection
max_batch_size = 128  # default

Lower this value if automatic detection causes OOM errors or takes too long.
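Because the search doubles the batch size each round, the limit determines how many candidate sizes can be benchmarked. A minimal sketch of that relationship; `candidate_batch_sizes` is a hypothetical helper, not part of Heretic.

```python
def candidate_batch_sizes(max_batch_size: int) -> list[int]:
    """Powers of two from 1 up to and including max_batch_size."""
    sizes = []
    batch_size = 1
    while batch_size <= max_batch_size:
        sizes.append(batch_size)
        batch_size *= 2
    return sizes


for limit in (128, 32):
    sizes = candidate_batch_sizes(limit)
    print(f"max_batch_size={limit}: up to {len(sizes)} rounds {sizes}")
```

Lowering the limit from 128 to 32 removes only two rounds, but these are the rounds with the largest batches, which are usually the slowest and the most likely to run out of memory.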

Multi-GPU Configuration

Heretic automatically detects and utilizes multiple GPUs:

Detected 2 CUDA device(s) (49.14 GB total VRAM):
* GPU 0: NVIDIA RTX 3090 (24.57 GB)
* GPU 1: NVIDIA RTX 3090 (24.57 GB)
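A summary like this can be produced with PyTorch's device query functions. The following is a sketch, not Heretic's actual code: `format_device_summary` and `query_cuda_devices` are hypothetical helpers, and dividing by 1024**3 is an assumption about how sizes are reported.

```python
def format_device_summary(devices: list[tuple[str, int]]) -> str:
    """Format (name, total_memory_bytes) pairs as a device summary."""
    gib = 1024**3  # assumption: sizes are reported in binary gigabytes
    total = sum(memory for _, memory in devices)
    lines = [
        f"Detected {len(devices)} CUDA device(s) ({total / gib:.2f} GB total VRAM):"
    ]
    for index, (name, memory) in enumerate(devices):
        lines.append(f"* GPU {index}: {name} ({memory / gib:.2f} GB)")
    return "\n".join(lines)


def query_cuda_devices() -> list[tuple[str, int]]:
    """Collect (name, total_memory_bytes) for each visible CUDA device."""
    import torch  # imported lazily so the formatter works without PyTorch

    return [
        (
            torch.cuda.get_device_name(i),
            torch.cuda.get_device_properties(i).total_memory,
        )
        for i in range(torch.cuda.device_count())
    ]


# Example with made-up devices; on a real system, pass query_cuda_devices().
print(format_device_summary([("Example GPU", 24 * 1024**3)] * 2))
```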

Device Map Strategies

Control how the model is distributed across devices:

# Automatically distribute across all devices
device_map = "auto"

# Use only GPU 0
device_map = "cuda:0"

# Custom device map: requires setting in Python, not TOML

Per-Device Memory Limits

Set maximum memory allocation per device:

config.toml

# Limit memory usage per device
max_memory = {"0" = "20GB", "1" = "20GB", "cpu" = "64GB"}

This is useful for:

Sharing GPUs with other processes
Preventing a single model from consuming all VRAM
Forcing CPU offloading for memory-intensive layers

When using max_memory, make sure the total allocated memory is sufficient for your model. Overly restrictive limits will cause loading failures.

Performance on Different Hardware

From the README, here are typical processing times:

RTX 3090 Performance

Model: Llama-3.1-8B-Instruct
Configuration: Default settings (200 trials)
Duration: ~45 minutes

This includes:

Model loading
Batch size detection
200 optimization trials
Evaluation

Smaller models (4B-7B) typically complete in 20-40 minutes, while larger models (70B+) may take several hours even with quantization.

Duration Estimates by Model Size

Model Size	Hardware	Quantization	Estimated Time
4B-7B	RTX 3090	No	20-30 min
8B-13B	RTX 3090	No	40-60 min
27B-34B	RTX 3090	Yes	2-4 hours
70B+	RTX 3090	Yes	4-8 hours

These are rough estimates for 200 trials with default settings. Actual time varies based on model architecture and prompt datasets.

Advanced Memory Optimization

Expandable Segments

Heretic automatically enables PyTorch expandable segments to reduce memory fragmentation:

# Enabled automatically in main.py:133-137
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"

This is particularly beneficial for multi-GPU setups.
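If you launch PyTorch code yourself, the same allocator option can be set without overriding a value the user already chose. This is a sketch under assumptions: `enable_expandable_segments` is a hypothetical helper, and whether Heretic checks for existing values in the same way is not confirmed here. Older PyTorch releases document the CUDA-specific name `PYTORCH_CUDA_ALLOC_CONF`, so the sketch leaves that variable alone too.

```python
import os


def enable_expandable_segments(env=os.environ) -> bool:
    """Set the allocator option unless the user already configured one.

    Returns True if the variable was set by this call.
    """
    # PYTORCH_CUDA_ALLOC_CONF is the older, CUDA-specific variable name.
    if "PYTORCH_ALLOC_CONF" in env or "PYTORCH_CUDA_ALLOC_CONF" in env:
        return False
    env["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"
    return True


# Call this before the first CUDA allocation, ideally before importing torch.
changed = enable_expandable_segments()
print("PYTORCH_ALLOC_CONF =", os.environ.get("PYTORCH_ALLOC_CONF"), "| set by this call:", changed)
```

The allocator reads its configuration when it is initialized, so setting the variable after memory has already been allocated is not expected to have an effect.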

TorchDynamo Cache

The TorchDynamo compilation cache size limit is raised during batch size detection:

# From main.py:222
torch._dynamo.config.cache_size_limit = 64

This prevents errors from excessive recompilation during the batch size search.

Supported Accelerators

Heretic supports a wide range of hardware:

NVIDIA CUDA
Intel XPU
Apple Metal (MPS)
Other Accelerators

NVIDIA CUDA

# Automatic detection
Detected 1 CUDA device(s):
* GPU 0: NVIDIA RTX 4090 (24.00 GB)

Best supported, recommended for most users.

Intel XPU

# Automatic detection
Detected 1 XPU device(s):
* XPU 0: Intel Data Center GPU Max 1550

For Intel Data Center GPUs.

Apple Metal (MPS)

# Automatic detection
Detected 1 MPS device (Apple Metal)

For Apple Silicon Macs.

Optimization Best Practices

Start with Defaults

Let automatic batch size detection find the optimal setting

batch_size = 0
device_map = "auto"

Enable Quantization for Large Models

Use 4-bit quantization for models >13B on consumer GPUs

quantization = "bnb_4bit"

Monitor Memory Usage

Watch VRAM usage during processing. If it is near capacity, reduce max_batch_size.

nvidia-smi -l 1  # Monitor VRAM in real-time

Tune for Your Workload

If running many short sessions, fix batch size to skip detection

batch_size = 16  # from previous detection run

Configuration Examples

Single High-End GPU

config.toml

# RTX 4090 or similar (24 GB)
device_map = "auto"
batch_size = 0  # auto-detect
max_batch_size = 128
quantization = "none"  # no quantization

Consumer GPU with Limited VRAM

config.toml

# RTX 3060 or similar (12 GB)
device_map = "auto"
batch_size = 0
max_batch_size = 32  # limit exploration
quantization = "bnb_4bit"  # essential for larger models

Multi-GPU Server

config.toml

# 4x GPU setup
device_map = "auto"
batch_size = 0
max_batch_size = 256  # higher limit for more VRAM
quantization = "none"

# Optional: reserve some VRAM for other tasks
# max_memory = {"0" = "20GB", "1" = "20GB", "2" = "20GB", "3" = "20GB"}

CPU Offloading

config.toml

# When model doesn't fit in VRAM
device_map = "auto"
max_memory = {"0" = "20GB", "cpu" = "128GB"}
batch_size = 4  # smaller for slower CPU offload
quantization = "bnb_4bit"

Troubleshooting

Out of Memory (OOM)

Symptoms: RuntimeError: CUDA out of memory

Solutions:

Enable Quantization

quantization = "bnb_4bit"

Reduce Max Batch Size

max_batch_size = 32

Set Memory Limits

max_memory = {"0" = "22GB"}  # leave 2GB headroom on a 24 GB GPU

Use CPU Offloading

max_memory = {"0" = "20GB", "cpu" = "64GB"}

Slow Batch Size Detection

Symptoms: Detection takes >5 minutes

Solutions:

Lower max_batch_size to reduce search space
Set explicit batch_size based on previous runs
Use a smaller model for initial testing

Suboptimal Performance

Symptoms: Low tokens/second during processing

Solutions:

Verify automatic detection chose a reasonable batch size
Check if CPU offloading is active (slow)
Ensure model fits entirely in VRAM
Monitor GPU utilization with nvidia-smi
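For the last check, `nvidia-smi` can emit machine-readable output that is easy to log alongside a run. The sketch below parses the CSV form of a memory and utilization query; `parse_gpu_stats` and `read_gpu_stats` are hypothetical helper names, and the sample values are made up for illustration.

```python
import subprocess

QUERY = "index,memory.used,memory.total,utilization.gpu"


def parse_gpu_stats(csv_text: str) -> list[dict[str, int]]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output for QUERY."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, used, total, util = (int(field.strip()) for field in line.split(","))
        stats.append(
            {"index": index, "used_mib": used, "total_mib": total, "util_pct": util}
        )
    return stats


def read_gpu_stats() -> list[dict[str, int]]:
    """Run nvidia-smi once and return per-GPU memory and utilization."""
    output = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_stats(output)


# Made-up sample output; on a machine with NVIDIA drivers, call read_gpu_stats().
sample = "0, 22950, 24576, 97\n1, 1024, 24576, 3\n"
for gpu in parse_gpu_stats(sample):
    share = gpu["used_mib"] / gpu["total_mib"]
    print(f"GPU {gpu['index']}: {share:.0%} VRAM used, {gpu['util_pct']}% utilization")
```

A GPU with high memory use but low utilization may point to CPU offloading or a batch size that is too small. The parser assumes every field is numeric, so fields reported as unavailable would need extra handling.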

Related Topics

Quantization - Reduce VRAM requirements with 4-bit quantization
Configuration - Complete configuration reference
Model Upload - Upload optimized models to Hugging Face
