This directory contains ready-to-use example scripts demonstrating best practices for deep learning on the Ruqola server's H200 GPUs.
pytorch_training.py - Complete PyTorch training example with ResNet
tensorflow_training.py - TensorFlow training with mixed precision and XLA
jax_training.py - JAX/Flax functional programming approach
transformers_finetuning.py - Hugging Face Transformers fine-tuning with LoRA support
transformers_inference.py - Optimized inference for large language models
lora_example.py - Parameter-efficient fine-tuning with LoRA
resnet_config.yaml - PyTorch training configuration
tf_config.json - TensorFlow training configuration
jax_config.py - JAX/Flax configuration
transformers_config.yaml - Transformers fine-tuning configuration
lora_config.yaml - LoRA fine-tuning configuration
submit_jobs.sh - Job submission examples and best practices
README.md - This documentation file

# Copy example files to your directory
cp examples/pytorch_training.py .
cp examples/resnet_config.yaml .
# Submit training job
gpuq submit \
--command "python pytorch_training.py --config resnet_config.yaml" \
--gpus 1 \
--memory 40 \
--time 8
# Copy TensorFlow files
cp examples/tensorflow_training.py .
cp examples/tf_config.json .
# Submit job
gpuq submit \
--command "python tensorflow_training.py --config tf_config.json" \
--gpus 1 \
--memory 60 \
--time 8
# Copy JAX files
cp examples/jax_training.py .
cp examples/jax_config.py .
# Submit job
gpuq submit \
--command "python jax_training.py --config jax_config.py" \
--gpus 1 \
--memory 50 \
--time 6
# Copy Transformers files
cp examples/transformers_finetuning.py .
cp examples/transformers_config.yaml .
# Submit job for large model fine-tuning
gpuq submit \
--command "python transformers_finetuning.py --config transformers_config.yaml" \
--gpus 2 \
--memory 100 \
--time 12
# Copy LoRA files
cp examples/lora_example.py .
cp examples/lora_config.yaml .
# Submit LoRA training job
gpuq submit \
--command "python lora_example.py --mode train --model microsoft/DialoGPT-medium --config lora_config.yaml" \
--gpus 1 \
--memory 40 \
--time 6
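For orientation, here is a minimal sketch of how a LoRA adapter is typically attached with the peft library; the hyperparameters are illustrative and not necessarily those used in lora_config.yaml.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

# Only the small low-rank adapter matrices are trained; the base model stays frozen
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()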
Key features demonstrated by the examples:
- PyTorch (pytorch_training.py)
- TensorFlow (tensorflow_training.py)
- JAX (jax_training.py): JIT compilation with the @jit decorator and gradient checkpointing with remat
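For orientation, here is a minimal JAX sketch of those two features (a standalone toy example, not an excerpt from jax_training.py): jax.jit compiles a function with XLA, and jax.remat recomputes intermediate activations during the backward pass instead of storing them.

import jax
import jax.numpy as jnp

# jax.remat (a.k.a. jax.checkpoint): recompute activations in the backward pass
@jax.remat
def mlp_block(w1, w2, x):
    return jnp.tanh(x @ w1) @ w2

# jax.jit: compile the whole loss function with XLA
@jax.jit
def loss_fn(w1, w2, x, y):
    pred = mlp_block(w1, w2, x)
    return jnp.mean((pred - y) ** 2)

# Gradients with respect to both weight matrices, also JIT-compiled
grad_fn = jax.jit(jax.grad(loss_fn, argnums=(0, 1)))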
~/projects/my_experiment/
├── train.py          # Your training script (based on examples)
├── config.yaml       # Configuration file
├── data/             # Dataset (use shared storage when possible)
├── checkpoints/      # Model checkpoints
├── logs/             # Training logs
├── results/          # Final results and plots
└── models/           # Saved models
# Efficient data loading for H200
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=128,           # Multiple of 8 for Tensor Cores
    num_workers=8,            # Match CPU cores
    pin_memory=True,          # Faster GPU transfer
    persistent_workers=True,  # Reduce worker restart overhead
    prefetch_factor=2,        # Batches prefetched per worker
)
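pin_memory=True pairs naturally with non-blocking host-to-device copies; a minimal sketch of that transfer loop (variable and device names are illustrative):

# Pinned host memory enables asynchronous host-to-device copies
for images, labels in dataloader:
    images = images.to("cuda", non_blocking=True)
    labels = labels.to("cuda", non_blocking=True)
    # ... forward / backward / optimizer step ...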
# Replace CIFAR-10 with your dataset
train_dataset = YourCustomDataset(...)
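As a starting point, here is a minimal sketch of what YourCustomDataset might look like (the NumPy file format and class internals are assumptions for illustration):

import numpy as np
import torch
from torch.utils.data import Dataset

class YourCustomDataset(Dataset):
    """Illustrative dataset that reads samples from a NumPy array on disk."""

    def __init__(self, data_path, transform=None):
        # mmap_mode="r" avoids loading the whole array into RAM
        self.data = np.load(data_path, mmap_mode="r")
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = torch.from_numpy(np.array(self.data[idx], dtype=np.float32))
        if self.transform is not None:
            x = self.transform(x)
        return x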
# Change model definition
model = YourCustomModel(...)
# In config file
training:
  batch_size: 64       # Adjust based on memory
  learning_rate: 0.01  # Tune for your problem
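A minimal sketch of reading such a YAML file in the training script (file and key names follow the snippet above; error handling omitted):

import yaml

with open("resnet_config.yaml") as f:
    cfg = yaml.safe_load(f)

batch_size = cfg["training"]["batch_size"]
learning_rate = cfg["training"]["learning_rate"]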
def custom_loss(predictions, targets):
    # Your loss implementation
    return loss
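For instance, a label-smoothed cross-entropy is one possible custom loss (purely illustrative; the example scripts do not prescribe it):

import torch.nn.functional as F

def custom_loss(predictions, targets, smoothing=0.1):
    # Cross-entropy with uniform label smoothing
    log_probs = F.log_softmax(predictions, dim=-1)
    nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(1)).squeeze(1)
    smooth = -log_probs.mean(dim=-1)
    return ((1.0 - smoothing) * nll + smoothing * smooth).mean()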
# Enable gradient checkpointing (Hugging Face models expose
# gradient_checkpointing_enable(); plain PyTorch models can wrap
# expensive blocks with torch.utils.checkpoint.checkpoint instead)
model.gradient_checkpointing_enable()

# Use mixed precision
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast():
    output = model(input)
    loss = criterion(output, target)
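# The scaler then drives the backward pass and optimizer step
# (standard AMP pattern; optimizer is assumed to be defined elsewhere)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()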
# Ensure tensor dimensions are optimal for H200
def make_divisible_by_8(x):
    return ((x + 7) // 8) * 8

batch_size = make_divisible_by_8(batch_size)
hidden_dim = make_divisible_by_8(hidden_dim)
# Use memory mapping for large datasets
import numpy as np

data = np.memmap('large_dataset.dat', dtype='float32', mode='r')

# Implement efficient transforms
def fast_transform(x):
    # Vectorized operations
    return x / 255.0  # Faster than individual pixel operations
# Monitor GPU usage
watch -n 5 nvidia-smi
# Monitor job progress
watch -n 10 'gpuq status | grep $USER'
# Check job logs
tail -f /tmp/gpu_queue/logs/job_XXXXX_stdout.log
# Check memory usage in Python
import torch

def print_memory_stats():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        print(f"GPU Memory - Allocated: {allocated:.1f}GB, Reserved: {reserved:.1f}GB")

# Call periodically during training
if step % 100 == 0:
    print_memory_stats()
# Quick fixes:
# 1. Reduce batch size
# 2. Enable gradient checkpointing
# 3. Use mixed precision
# 4. Clear cache periodically
# In Python:
torch.cuda.empty_cache()
# Check GPU utilization (should be >80%)
nvidia-smi
# Common causes:
# - Data loading bottleneck (increase num_workers)
# - Small batch size (increase if memory allows)
# - Inefficient model architecture
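One way to tell a data-loading bottleneck from slow compute is to time the two separately over a few hundred steps; a rough sketch (the loop body stands in for your actual training step):

import time
import torch

data_time, compute_time = 0.0, 0.0
end = time.time()
for step, (x, y) in enumerate(dataloader, start=1):
    data_time += time.time() - end        # time spent waiting for the next batch
    start = time.time()
    # ... forward / backward / optimizer step ...
    torch.cuda.synchronize()              # make GPU work visible to the wall clock
    compute_time += time.time() - start
    end = time.time()
    if step % 100 == 0:
        print(f"data: {data_time/step:.3f}s/step, compute: {compute_time/step:.3f}s/step")

If the per-step data time is close to the compute time, increasing num_workers (or moving the dataset to faster storage) is the first thing to try.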
# Check queue status
gpuq status
# Common issues:
# - All GPUs busy (wait or reduce resource request)
# - Requesting too much memory
# - Syntax error in command
To add new examples or improve existing ones:
If you encounter issues with these examples:
- Check the job logs: /tmp/gpu_queue/logs/job_XXXXX_*.log
- Check the output of nvidia-smi and gpuq status
Happy training on the H200s!