Heretic includes sophisticated hardware detection and optimization features that automatically tune performance for your system. This guide covers both automatic and manual optimization techniques.
Automatic Batch Size Detection
By default, Heretic automatically determines the optimal batch size for your hardware. When `batch_size` is set to 0 in config.toml, Heretic will:
- Benchmark Different Batch Sizes - Starting from batch size 1, it doubles the batch size and tests performance at each step (2, 4, 8, 16, …)
How It Works
The automatic detection process typically adds 1-3 minutes to startup time but ensures optimal performance throughout the entire run.
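The doubling search described above can be sketched as follows. This is a minimal illustration, not Heretic's actual implementation: `find_best_batch_size`, the `measure_throughput` callback, and `fake_benchmark` are hypothetical names, the throughput numbers are made up, and `MemoryError` stands in for whatever out-of-memory error the real backend raises.

```python
from typing import Callable


def find_best_batch_size(
    measure_throughput: Callable[[int], float],
    max_batch_size: int,
) -> int:
    """Try batch sizes 1, 2, 4, ... and return the one with the best throughput.

    `measure_throughput(n)` is a hypothetical callback that returns items per
    second for batch size `n` and raises MemoryError if the batch does not fit.
    """
    best_size = 0
    best_throughput = 0.0
    batch_size = 1
    while batch_size <= max_batch_size:
        try:
            throughput = measure_throughput(batch_size)
        except MemoryError:
            break  # larger batches need at least as much memory, so stop here
        if throughput > best_throughput:
            best_size, best_throughput = batch_size, throughput
        batch_size *= 2  # 1, 2, 4, 8, 16, ...
    if best_size == 0:
        raise RuntimeError("no batch size could be benchmarked")
    return best_size


def fake_benchmark(batch_size: int) -> float:
    """Stand-in for a real benchmark: made-up throughput, 'OOM' above 32."""
    if batch_size > 32:
        raise MemoryError("batch does not fit")
    return {1: 10.0, 2: 19.0, 4: 35.0, 8: 60.0, 16: 72.0, 32: 70.0}[batch_size]


print(find_best_batch_size(fake_benchmark, max_batch_size=128))  # 16
```

In this sketch the largest batch that fits (32) is not the fastest one, which is why the search compares measured throughput instead of simply picking the biggest size that avoids an out-of-memory error.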
Manual Batch Size Tuning
For more control, you can set the batch size manually with an explicit `batch_size` value.
When to Use Manual Tuning
- Reproducibility - Ensure consistent behavior across multiple runs
- Shared Resources - Control memory usage on multi-user systems
- Known Configuration - Skip detection when you know the optimal value
- Debugging - Isolate issues by fixing the batch size
Maximum Batch Size Limit
Control the upper bound for automatic detection by setting `max_batch_size` in config.toml.
Multi-GPU Configuration
Heretic automatically detects and utilizes multiple GPUs.
Device Map Strategies
Control how the model is distributed across devices.
Per-Device Memory Limits
Set the maximum memory allocation per device in config.toml. This is useful for:
- Sharing GPUs with other processes
- Preventing a single model from consuming all VRAM
- Forcing CPU offloading for memory-intensive layers
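As an illustration of per-device limits, the sketch below builds a memory map in the form that Hugging Face Transformers and Accelerate accept for the `max_memory` argument (integer GPU indices, a `"cpu"` key, and size strings such as `"20GiB"`). It assumes Heretic forwards its settings to that mechanism; the exact config.toml key names and value format are not shown here, and `build_max_memory` and `"some/model"` are hypothetical.

```python
from typing import Dict, Optional, Union


def build_max_memory(
    gpu_limits_gib: Dict[int, int],
    cpu_limit_gib: Optional[int] = None,
) -> Dict[Union[int, str], str]:
    """Build a per-device memory map: GPU index -> "NGiB", plus optional "cpu"."""
    max_memory: Dict[Union[int, str], str] = {
        gpu: f"{gib}GiB" for gpu, gib in gpu_limits_gib.items()
    }
    if cpu_limit_gib is not None:
        # Layers that do not fit within the GPU limits can be offloaded here.
        max_memory["cpu"] = f"{cpu_limit_gib}GiB"
    return max_memory


limits = build_max_memory({0: 20, 1: 20}, cpu_limit_gib=64)
print(limits)  # {0: '20GiB', 1: '20GiB', 'cpu': '64GiB'}

# Typical use with Transformers (not executed here; needs the library and a model):
# model = AutoModelForCausalLM.from_pretrained(
#     "some/model", device_map="auto", max_memory=limits
# )
```

Setting a GPU limit below the card's physical VRAM leaves headroom for other processes, at the cost of pushing more layers to slower devices.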
Performance on Different Hardware
From the README, here are typical processing times:
RTX 3090 Performance
- Model: Llama-3.1-8B-Instruct
- Configuration: Default settings (200 trials)
- Duration: ~45 minutes
This includes:
- Model loading
- Batch size detection
- 200 optimization trials
- Evaluation
Duration Estimates by Model Size
| Model Size | Hardware | Quantization | Estimated Time |
|---|---|---|---|
| 4B-7B | RTX 3090 | No | 20-30 min |
| 8B-13B | RTX 3090 | No | 40-60 min |
| 27B-34B | RTX 3090 | Yes | 2-4 hours |
| 70B+ | RTX 3090 | Yes | 4-8 hours |
These are rough estimates for 200 trials with default settings. Actual time varies based on model architecture and prompt datasets.
Advanced Memory Optimization
Expandable Segments
Heretic automatically enables PyTorch expandable segments to reduce memory fragmentation.
TorchDynamo Cache
The compilation cache is increased during batch size detection.
Supported Accelerators
Heretic supports a wide range of hardware:
- NVIDIA CUDA
- Intel XPU
- Apple Metal (MPS)
- Other Accelerators
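To check which of the backends listed above PyTorch can see on a given machine, a probe along the following lines may help. It is a sketch under assumptions: `pick_device` and `probe_backends` are hypothetical helpers, the preference order is arbitrary, and this is not necessarily how Heretic selects a device.

```python
from typing import Dict


def pick_device(available: Dict[str, bool]) -> str:
    """Return the first available backend in a fixed (assumed) order, else "cpu"."""
    for name in ("cuda", "xpu", "mps"):
        if available.get(name, False):
            return name
    return "cpu"


def probe_backends() -> Dict[str, bool]:
    """Ask PyTorch which backends are usable; all False if torch is not installed."""
    try:
        import torch
    except ImportError:
        return {"cuda": False, "xpu": False, "mps": False}
    return {
        "cuda": torch.cuda.is_available(),
        # Older PyTorch builds may lack these modules, hence the hasattr guards.
        "xpu": hasattr(torch, "xpu") and torch.xpu.is_available(),
        "mps": hasattr(torch.backends, "mps") and torch.backends.mps.is_available(),
    }


print(pick_device(probe_backends()))
```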
Optimization Best Practices
Configuration Examples
Single High-End GPU
config.toml
Consumer GPU with Limited VRAM
config.toml
Multi-GPU Server
config.toml
CPU Offloading
config.toml
Troubleshooting
Out of Memory (OOM)
Symptoms: `RuntimeError: CUDA out of memory`
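Solutions:
- Lower `max_batch_size` or set a smaller explicit `batch_size`
- Set per-device memory limits so that memory-intensive layers are offloaded to the CPU
- Reduce VRAM requirements with 4-bit quantization (see Quantization)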
Slow Batch Size Detection
Symptoms: Detection takes >5 minutes
Solutions:
- Lower `max_batch_size` to reduce search space
- Set explicit `batch_size` based on previous runs
- Use a smaller model for initial testing
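To see why a lower `max_batch_size` shortens detection, it helps to count the candidates a doubling search would visit. The sketch below assumes every power of two up to the limit is benchmarked and that the search does not stop early; `doubling_candidates` is a hypothetical helper.

```python
from typing import List


def doubling_candidates(max_batch_size: int) -> List[int]:
    """List the batch sizes a doubling search visits: 1, 2, 4, ... up to the limit."""
    sizes = []
    size = 1
    while size <= max_batch_size:
        sizes.append(size)
        size *= 2
    return sizes


print(len(doubling_candidates(128)))  # 8 candidates: 1, 2, 4, ..., 128
print(len(doubling_candidates(16)))   # 5 candidates: 1, 2, 4, 8, 16
```

The saving is larger than the candidate count suggests, because the skipped candidates are the largest batches, which generally take the longest to benchmark.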
Suboptimal Performance
Symptoms: Low tokens/second during processing
Solutions:
- Verify automatic detection chose a reasonable batch size
- Check if CPU offloading is active (slow)
- Ensure model fits entirely in VRAM
- Monitor GPU utilization with `nvidia-smi`
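For scripted monitoring, `nvidia-smi` can emit machine-readable CSV through its `--query-gpu` and `--format` options. The sketch below parses that output; `parse_gpu_stats` and `query_gpus` are hypothetical helpers, the sample numbers are made up, and the parser assumes every field is numeric (some GPUs report `[N/A]` for certain fields).

```python
import subprocess
from typing import Dict, List


def parse_gpu_stats(csv_text: str) -> List[Dict[str, int]]:
    """Parse 'utilization, memory used, memory total' CSV rows, one per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def query_gpus() -> List[Dict[str, int]]:
    """Run nvidia-smi once and return parsed stats (requires an NVIDIA driver)."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_stats(output)


# Made-up sample in the same format, so the parser can be tried without a GPU.
sample = "97, 21504, 24576\n12, 1024, 24576\n"
for gpu in parse_gpu_stats(sample):
    print(gpu)
```

Consistently low utilization alongside high memory use can hint that the GPU is waiting on offloaded layers or on a batch size that is too small.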
Related Topics
- Quantization - Reduce VRAM requirements with 4-bit quantization
- Configuration - Complete configuration reference
- Model Upload - Upload optimized models to Hugging Face
