# Brain Rotting Multiple GPU Workflow for Dummies
## Problem: Working with Multiple GPUs Without Race Conditions
Running multiple training processes on different GPUs can lead to:
- Output directory conflicts
- Checkpoint corruption
- Resource contention
- Difficult debugging and tracking
This guide gives you dead simple solutions using only basic scripts.
## Directory Structure for Sanity
First, set up a clean directory structure to keep runs separate:
```
project/
├── scripts/
│   ├── train_gpu0.sh
│   ├── train_gpu1.sh
│   └── monitor_gpus.sh
├── src/
│   └── train.py
└── runs/
    ├── gpu0/              # Training runs on GPU 0
    │   ├── checkpoints/
    │   └── logs/
    └── gpu1/              # Training runs on GPU 1
        ├── checkpoints/
        └── logs/
```
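The per-run subdirectories are created by the training scripts themselves, so only the top-level skeleton needs to exist up front. A minimal sketch to set it up:

```bash
# Create the top-level project skeleton; run directories are made at launch time
mkdir -p scripts src runs/gpu0 runs/gpu1
```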
## Simple Shell Scripts for GPU Management
### 1. Dedicated GPU Training Script (train_gpu0.sh)
```bash
#!/bin/bash
# Assign this process to GPU 0 only
export CUDA_VISIBLE_DEVICES=0

# Create timestamped run directory
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
OUTPUT_DIR="runs/gpu0/${TIMESTAMP}"
mkdir -p "$OUTPUT_DIR/checkpoints"
mkdir -p "$OUTPUT_DIR/logs"

# Run with output redirected to a log file
python src/train.py \
    --output_dir "$OUTPUT_DIR" \
    --batch_size 32 \
    --learning_rate 1e-4 \
    > "$OUTPUT_DIR/logs/training.log" 2>&1
```
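Make the script executable, and if you launch it over SSH, wrap it in `nohup` so the run survives a disconnect. The script already redirects its own output to a log file, so discarding `nohup`'s output is fine:

```bash
chmod +x scripts/train_gpu0.sh

# Run in the background and keep it alive after logout
nohup ./scripts/train_gpu0.sh > /dev/null 2>&1 &
echo "Started training on GPU 0 with PID $!"
```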
### 2. Second GPU Script (train_gpu1.sh)
```bash
#!/bin/bash
# Assign this process to GPU 1 only
export CUDA_VISIBLE_DEVICES=1

# Create timestamped run directory
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
OUTPUT_DIR="runs/gpu1/${TIMESTAMP}"
mkdir -p "$OUTPUT_DIR/checkpoints"
mkdir -p "$OUTPUT_DIR/logs"

# Run with output redirected to a log file
python src/train.py \
    --output_dir "$OUTPUT_DIR" \
    --batch_size 32 \
    --learning_rate 1e-4 \
    > "$OUTPUT_DIR/logs/training.log" 2>&1
```
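The two scripts differ only in the GPU index, so you may prefer a single parameterized launcher. Here is a minimal sketch; the name `train_gpu.sh` and its argument handling are assumptions, not part of the layout above:

```bash
#!/bin/bash
# train_gpu.sh (hypothetical generic launcher): ./scripts/train_gpu.sh <gpu_index>
GPU_ID="${1:?Usage: $0 <gpu_index>}"
export CUDA_VISIBLE_DEVICES="$GPU_ID"

TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
OUTPUT_DIR="runs/gpu${GPU_ID}/${TIMESTAMP}"
mkdir -p "$OUTPUT_DIR/checkpoints" "$OUTPUT_DIR/logs"

python src/train.py \
    --output_dir "$OUTPUT_DIR" \
    --batch_size 32 \
    --learning_rate 1e-4 \
    > "$OUTPUT_DIR/logs/training.log" 2>&1
```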
### 3. Simple GPU Monitoring Script (monitor_gpus.sh)
```bash
#!/bin/bash
# Simple GPU monitoring loop with timestamps
while true; do
    clear
    echo "======== $(date) ========"
    nvidia-smi
    sleep 5
done
```
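If `watch` is available on your machine, a one-liner gives roughly the same rolling view without a custom loop:

```bash
# Refresh nvidia-smi every 5 seconds (Ctrl+C to stop)
watch -n 5 nvidia-smi
```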
## Checkpoint Management Without Race Conditions
In your `train.py`, implement safe checkpoint saving:
```python
import os
import shutil
import time
from pathlib import Path

import torch


def save_checkpoint(model, optimizer, epoch, step, args):
    """Save a checkpoint safely, without race conditions."""
    # Get process-specific info for uniqueness
    pid = os.getpid()
    timestamp = time.strftime("%Y%m%d_%H%M%S")

    # Create a temporary directory with a unique name
    checkpoint_dir = Path(args.output_dir) / "checkpoints"
    checkpoint_dir.mkdir(exist_ok=True)
    temp_dir = checkpoint_dir / f"temp_{pid}_{timestamp}"
    temp_dir.mkdir(exist_ok=True)

    # Save to the temporary location first
    checkpoint_path = temp_dir / "checkpoint.pt"
    torch.save({
        'epoch': epoch,
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, checkpoint_path)

    # Rename the whole directory into place (a single rename is safer than copying files)
    final_dir = checkpoint_dir / f"checkpoint_epoch{epoch}_step{step}"
    shutil.move(str(temp_dir), str(final_dir))

    # Clean up old checkpoints (keep only the last 5).
    # Sort by modification time, not name, so epoch10 doesn't sort before epoch2.
    checkpoints = sorted(
        [d for d in checkpoint_dir.iterdir()
         if d.is_dir() and d.name.startswith("checkpoint_")],
        key=lambda d: d.stat().st_mtime,
    )
    for old_checkpoint in checkpoints[:-5]:
        shutil.rmtree(old_checkpoint)

    print(f"Saved checkpoint to {final_dir}")
    return final_dir
```
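A matching loader, sketched under the same directory layout, picks the newest completed checkpoint directory and ignores any leftover `temp_*` directories from interrupted runs. The name `load_latest_checkpoint` is an illustration, not something `train.py` already defines:

```python
from pathlib import Path

import torch


def load_latest_checkpoint(output_dir, model, optimizer, device="cpu"):
    """Hypothetical helper: restore the newest finished checkpoint, if any."""
    checkpoint_dir = Path(output_dir) / "checkpoints"
    finished = [d for d in checkpoint_dir.glob("checkpoint_*") if d.is_dir()]
    if not finished:
        return None  # nothing to resume from

    latest = max(finished, key=lambda d: d.stat().st_mtime)
    state = torch.load(latest / "checkpoint.pt", map_location=device)
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    return {"epoch": state["epoch"], "step": state["step"], "path": latest}
```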
## Running Multiple Training Jobs with Different Parameters
Create a parameter sweep script that launches multiple runs with different configs:
```bash
#!/bin/bash
# param_sweep.sh

# Define parameter grid
LEARNING_RATES=("1e-3" "5e-4" "1e-4")
BATCH_SIZES=(16 32 64)

# Loop through parameters and assign runs to GPUs
GPU=0
for lr in "${LEARNING_RATES[@]}"; do
    for bs in "${BATCH_SIZES[@]}"; do
        # Select GPU using modulo to cycle through available GPUs
        SELECTED_GPU=$((GPU % 2))  # Assuming 2 GPUs (0 and 1)
        GPU=$((GPU + 1))

        # Create run directory
        TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
        RUN_NAME="lr${lr}_bs${bs}"
        OUTPUT_DIR="runs/gpu${SELECTED_GPU}/${RUN_NAME}_${TIMESTAMP}"
        mkdir -p "$OUTPUT_DIR/checkpoints"
        mkdir -p "$OUTPUT_DIR/logs"

        # Launch training in the background
        echo "Starting run on GPU ${SELECTED_GPU}: lr=${lr}, bs=${bs}"
        CUDA_VISIBLE_DEVICES=$SELECTED_GPU python src/train.py \
            --output_dir "$OUTPUT_DIR" \
            --batch_size "$bs" \
            --learning_rate "$lr" \
            > "$OUTPUT_DIR/logs/training.log" 2>&1 &

        # Wait a bit to stagger the starts
        sleep 10
    done
done

echo "All jobs launched. Monitor with './scripts/monitor_gpus.sh'"
```
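Note that this launches every combination in the grid at once, several per GPU. If you want the script to block until everything finishes (for example, before running the aggregation script below), appending a `wait` is enough:

```bash
# At the end of param_sweep.sh: block until all background jobs exit
wait
echo "All sweep runs finished."
```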
## Dead Simple Experiment Tracking Without MLflow
Create a simple CSV logger in your training script:
```python
import csv
from pathlib import Path


class SimpleLogger:
    def __init__(self, log_dir):
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(exist_ok=True, parents=True)

        # Metrics are appended to a CSV file in the run's log directory
        self.metrics_file = self.log_dir / "metrics.csv"

        # Keep track of the best value seen for each metric
        self.best_metrics = {}

    def log_metrics(self, metrics, step):
        """Log metrics to the CSV file and update the best-metrics summary."""
        metrics["step"] = step

        # Create or append to the CSV
        write_header = not self.metrics_file.exists()
        with open(self.metrics_file, mode='a', newline='') as file:
            writer = csv.DictWriter(file, fieldnames=metrics.keys())
            if write_header:
                writer.writeheader()
            writer.writerow(metrics)

        # Update best metrics (higher is better for accuracy-like metrics,
        # lower is better for everything else, e.g. loss)
        for key, value in metrics.items():
            if key == "step":
                continue
            higher_is_better = "acc" in key.lower()
            best = self.best_metrics.get(key)
            if (best is None
                    or (higher_is_better and value > best["value"])
                    or (not higher_is_better and value < best["value"])):
                self.best_metrics[key] = {"value": value, "step": step}

        # Write best metrics summary
        with open(self.log_dir / "best_metrics.txt", 'w') as f:
            for key, data in self.best_metrics.items():
                f.write(f"Best {key}: {data['value']} (step {data['step']})\n")
```
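In `train.py`, the wiring looks roughly like this. The metric names `loss` and `accuracy` are what the aggregation and plotting scripts below expect, and pointing the logger at the run's root output directory keeps `metrics.csv` and `best_metrics.txt` where those scripts look for them; `train_step` is a placeholder for your actual training step:

```python
import argparse

# Hypothetical wiring inside train.py
parser = argparse.ArgumentParser()
parser.add_argument("--output_dir", required=True)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--learning_rate", type=float, default=1e-4)
args = parser.parse_args()

logger = SimpleLogger(args.output_dir)

for step in range(1000):
    loss, accuracy = train_step()  # placeholder for your actual training step
    if step % 100 == 0:
        logger.log_metrics({"loss": float(loss), "accuracy": float(accuracy)}, step=step)
```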
## Finding and Comparing Results
Create a simple results aggregation script:
```bash
#!/bin/bash
# aggregate_results.sh
echo "Run Directory,Best Loss,Best Accuracy,Training Time"

find runs/ -name "best_metrics.txt" | while read -r metrics_file; do
    run_dir=$(dirname "$metrics_file")
    best_loss=$(grep "Best loss" "$metrics_file" | cut -d' ' -f3)
    best_acc=$(grep "Best accuracy" "$metrics_file" | cut -d' ' -f3)

    # Get training start/end times from the log
    log_file="$run_dir/logs/training.log"
    start_time=$(head -n 1 "$log_file" | grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}')
    end_time=$(tail -n 10 "$log_file" | grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}' | tail -n 1)

    echo "$run_dir,$best_loss,$best_acc,$start_time-$end_time"
done | sort -t',' -k2n
```
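To eyeball the summary as an aligned table in the terminal, `column` works on the generated CSV:

```bash
./scripts/aggregate_results.sh | column -t -s','
```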
## Simple Visualization Without External Tools
Create a basic plotting script using matplotlib:
```python
# plot_results.py
import glob
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

# Find all metrics.csv files
metrics_files = glob.glob("runs/**/metrics.csv", recursive=True)

# Plot loss for each run
plt.figure(figsize=(12, 8))
for metrics_file in metrics_files:
    run_name = Path(metrics_file).parent.name
    df = pd.read_csv(metrics_file)
    if 'loss' in df.columns:
        plt.plot(df['step'], df['loss'], label=f"{run_name} - loss")

plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('Training Loss Comparison')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig('loss_comparison.png')
plt.close()

# Create accuracy plot if the metric is available
plt.figure(figsize=(12, 8))
for metrics_file in metrics_files:
    run_name = Path(metrics_file).parent.name
    df = pd.read_csv(metrics_file)
    if 'accuracy' in df.columns:
        plt.plot(df['step'], df['accuracy'], label=f"{run_name} - accuracy")

plt.xlabel('Step')
plt.ylabel('Accuracy')
plt.title('Training Accuracy Comparison')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig('accuracy_comparison.png')
```
## Process Management and GPU Allocation
Create a script to check GPU usage and allocate new jobs:
```bash
#!/bin/bash
# allocate_gpu.sh
# Checks GPU usage and prints the index of the least utilized GPU
LEAST_BUSY_GPU=$(nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits |
    sort -t',' -k2n |
    head -n 1 |
    cut -d',' -f1)

echo "$LEAST_BUSY_GPU"
```
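Wiring this into a launch looks something like the following sketch. Keep in mind the utilization figure is an instantaneous reading, so two jobs launched back-to-back can still land on the same GPU unless you stagger them:

```bash
# Pick the least-busy GPU and launch a run pinned to it
BEST_GPU=$(./scripts/allocate_gpu.sh)
echo "Launching on GPU ${BEST_GPU}"

OUTPUT_DIR="runs/gpu${BEST_GPU}/$(date +"%Y%m%d_%H%M%S")"
mkdir -p "$OUTPUT_DIR/checkpoints" "$OUTPUT_DIR/logs"

CUDA_VISIBLE_DEVICES=$BEST_GPU python src/train.py \
    --output_dir "$OUTPUT_DIR" \
    --batch_size 32 \
    --learning_rate 1e-4 \
    > "$OUTPUT_DIR/logs/training.log" 2>&1 &
```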
## Tips for Avoiding Race Conditions
1. **Always use unique output directories for each run**:
   - Include the timestamp, GPU ID, and PID in directory names (see the sketch after this list)
   - Never share output directories between processes
2. **For checkpoint saving**:
   - Save to a temporary directory first
   - Use atomic operations like `shutil.move()` for final placement
   - Don't depend on file locks (often unreliable on network filesystems)
3. **For data loading**:
   - Use a different random seed per process
   - Set `num_workers` appropriately (2-4 per GPU usually works well)
   - Give each process its own data cache or prefetch buffer to avoid filesystem contention
4. **For logging**:
   - Each process should write to its own log file
   - Use timestamps in log entries
   - Include the GPU ID and PID in log messages
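
As a concrete illustration of the first and third tips, a run directory name and per-process seed could be derived like this inside `train.py`. This is a sketch; the helper names and the exact naming scheme are assumptions, not part of the scripts above:

```python
import os
import time

import torch


def make_run_dir(base_dir="runs"):
    """Hypothetical helper: build a collision-free run directory name."""
    gpu_id = os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0]
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    run_dir = os.path.join(base_dir, f"gpu{gpu_id}", f"{timestamp}_pid{os.getpid()}")
    os.makedirs(os.path.join(run_dir, "checkpoints"), exist_ok=True)
    os.makedirs(os.path.join(run_dir, "logs"), exist_ok=True)
    return run_dir


def seed_everything(base_seed=42):
    """Give each process its own seed so data shuffling differs per process."""
    seed = base_seed + os.getpid() % 10000
    torch.manual_seed(seed)
    return seed
```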
## Quick Commands Reference
```bash
# Start training on GPU 0
./scripts/train_gpu0.sh
# Start training on GPU 1
./scripts/train_gpu1.sh
# Run parameter sweep across GPUs
./scripts/param_sweep.sh
# Monitor GPU usage
./scripts/monitor_gpus.sh
# Find GPU with lowest utilization
BEST_GPU=$(./scripts/allocate_gpu.sh)
echo "Least busy GPU is: $BEST_GPU"
# Aggregate results into CSV
./scripts/aggregate_results.sh > results_summary.csv
# Generate comparison plots
python scripts/plot_results.py
```
Remember: The simplest solution is usually the most maintainable. Keep your scripts straightforward, make each run independent, and use filesystem organization to avoid conflicts.
**TODO:** Replace print statements with loguru logging for better debugging and log file management.