# Brain Rotting Multiple GPU Workflow for Dummies

## Problem: Working with Multiple GPUs Without Race Conditions

Running multiple training processes on different GPUs can lead to:

- Output directory conflicts
- Checkpoint corruption
- Resource contention
- Difficult debugging and tracking

This guide gives you dead-simple solutions using only basic scripts.

## Directory Structure for Sanity

First, set up a clean directory structure to keep runs separate:

```
project/
├── scripts/
│   ├── train_gpu0.sh
│   ├── train_gpu1.sh
│   └── monitor_gpus.sh
├── src/
│   └── train.py
└── runs/
    ├── gpu0/                # Training on GPU 0
    │   ├── checkpoints/
    │   └── logs/
    └── gpu1/                # Training on GPU 1
        ├── checkpoints/
        └── logs/
```
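
If you want to create this skeleton in one go, a single `mkdir -p` call covers it (the per-run subdirectories are recreated by the training scripts below anyway):

```bash
# One-off scaffold for the layout above (run from the project root)
mkdir -p scripts src runs/gpu0/{checkpoints,logs} runs/gpu1/{checkpoints,logs}
```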

## Simple Shell Scripts for GPU Management

### 1. Dedicated GPU Training Script (train_gpu0.sh)

```bash
#!/bin/bash

# Assign this process to GPU 0 only
export CUDA_VISIBLE_DEVICES=0

# Create timestamped run directory
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
OUTPUT_DIR="runs/gpu0/${TIMESTAMP}"
mkdir -p "$OUTPUT_DIR/checkpoints"
mkdir -p "$OUTPUT_DIR/logs"

# Run with output redirected to a log file
python src/train.py \
    --output_dir "$OUTPUT_DIR" \
    --batch_size 32 \
    --learning_rate 1e-4 \
    > "$OUTPUT_DIR/logs/training.log" 2>&1
```
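
One way to launch it (a sketch, assuming the script lives at `scripts/train_gpu0.sh`): make it executable once, then start it with `nohup` so the run survives a closed terminal:

```bash
chmod +x scripts/train_gpu0.sh

# nohup + & keeps the run alive after you log out;
# the script itself already redirects training output to its log file
nohup ./scripts/train_gpu0.sh > /dev/null 2>&1 &
echo "Started GPU 0 run with PID $!"
```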

### 2. Second GPU Script (train_gpu1.sh)

```bash
#!/bin/bash

# Assign this process to GPU 1 only
export CUDA_VISIBLE_DEVICES=1

# Create timestamped run directory
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
OUTPUT_DIR="runs/gpu1/${TIMESTAMP}"
mkdir -p "$OUTPUT_DIR/checkpoints"
mkdir -p "$OUTPUT_DIR/logs"

# Run with output redirected to a log file
python src/train.py \
    --output_dir "$OUTPUT_DIR" \
    --batch_size 32 \
    --learning_rate 1e-4 \
    > "$OUTPUT_DIR/logs/training.log" 2>&1
```
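
The two scripts differ only in the GPU index, so if you prefer, a single parameterized launcher can replace both. This is just a sketch — `scripts/train_on_gpu.sh` is a made-up name, not part of the layout above:

```bash
#!/bin/bash
# Usage: ./scripts/train_on_gpu.sh <gpu_index>
GPU_ID=${1:?Usage: $0 <gpu_index>}

export CUDA_VISIBLE_DEVICES=$GPU_ID

TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
OUTPUT_DIR="runs/gpu${GPU_ID}/${TIMESTAMP}"
mkdir -p "$OUTPUT_DIR/checkpoints" "$OUTPUT_DIR/logs"

python src/train.py \
    --output_dir "$OUTPUT_DIR" \
    --batch_size 32 \
    --learning_rate 1e-4 \
    > "$OUTPUT_DIR/logs/training.log" 2>&1
```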

### 3. Simple GPU Monitoring Script (monitor_gpus.sh)

```bash
#!/bin/bash

# Simple GPU monitoring loop with timestamps
while true; do
    clear
    echo "======== $(date) ========"
    nvidia-smi
    sleep 5
done
```

## Checkpoint Management Without Race Conditions

In your `train.py`, implement safe checkpoint saving:

```python
import os
import shutil
import time
from pathlib import Path

import torch


def save_checkpoint(model, optimizer, epoch, step, args):
    """Save a checkpoint safely, without race conditions."""
    # Get process-specific info for uniqueness
    pid = os.getpid()
    timestamp = time.strftime("%Y%m%d_%H%M%S")

    # Create temporary directory with unique name
    checkpoint_dir = Path(args.output_dir) / "checkpoints"
    checkpoint_dir.mkdir(exist_ok=True)

    temp_dir = checkpoint_dir / f"temp_{pid}_{timestamp}"
    temp_dir.mkdir(exist_ok=True)

    # Save to temporary location first
    checkpoint_path = temp_dir / "checkpoint.pt"
    torch.save({
        'epoch': epoch,
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, checkpoint_path)

    # Create final directory name
    final_dir = checkpoint_dir / f"checkpoint_epoch{epoch}_step{step}"

    # Rename into place (atomic on the same filesystem, safer than copying files)
    shutil.move(str(temp_dir), str(final_dir))

    # Clean up old checkpoints (keep only the 5 most recent, by modification time)
    checkpoints = sorted(
        [d for d in checkpoint_dir.iterdir()
         if d.is_dir() and d.name.startswith("checkpoint_")],
        key=lambda d: d.stat().st_mtime,
    )
    for old_checkpoint in checkpoints[:-5]:
        shutil.rmtree(old_checkpoint)

    print(f"Saved checkpoint to {final_dir}")
    return final_dir
```
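
If you also want to resume runs, a matching loader could look like this. It is a sketch that assumes the directory naming used by `save_checkpoint` above and sits next to it in `train.py`, so it reuses the same imports:

```python
def load_latest_checkpoint(model, optimizer, args):
    """Load the most recent checkpoint from this run's directory, if any."""
    checkpoint_dir = Path(args.output_dir) / "checkpoints"
    if not checkpoint_dir.exists():
        return None

    # The most recently modified checkpoint_* directory wins
    candidates = sorted(
        (d for d in checkpoint_dir.iterdir()
         if d.is_dir() and d.name.startswith("checkpoint_")),
        key=lambda d: d.stat().st_mtime,
    )
    if not candidates:
        return None

    state = torch.load(candidates[-1] / "checkpoint.pt", map_location="cpu")
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    return {"epoch": state["epoch"], "step": state["step"]}
```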

## Running Multiple Training Jobs with Different Parameters

Create a parameter sweep script that launches multiple runs with different configs:

```bash
#!/bin/bash
# param_sweep.sh

# Define parameter grid
LEARNING_RATES=("1e-3" "5e-4" "1e-4")
BATCH_SIZES=(16 32 64)

# Loop through parameters and assign to GPUs
GPU=0
for lr in "${LEARNING_RATES[@]}"; do
    for bs in "${BATCH_SIZES[@]}"; do
        # Select GPU using modulo to cycle through available GPUs
        SELECTED_GPU=$((GPU % 2))  # Assuming 2 GPUs (0 and 1)
        GPU=$((GPU + 1))

        # Create run directory
        TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
        RUN_NAME="lr${lr}_bs${bs}"
        OUTPUT_DIR="runs/gpu${SELECTED_GPU}/${RUN_NAME}_${TIMESTAMP}"
        mkdir -p "$OUTPUT_DIR/checkpoints"
        mkdir -p "$OUTPUT_DIR/logs"

        # Launch training in the background
        echo "Starting run on GPU ${SELECTED_GPU}: lr=${lr}, bs=${bs}"
        CUDA_VISIBLE_DEVICES=$SELECTED_GPU python src/train.py \
            --output_dir "$OUTPUT_DIR" \
            --batch_size "$bs" \
            --learning_rate "$lr" \
            > "$OUTPUT_DIR/logs/training.log" 2>&1 &

        # Wait a bit to stagger the starts
        sleep 10
    done
done

echo "All jobs launched. Monitor with './scripts/monitor_gpus.sh'"
```
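
Note that this launches every combination at once, so several jobs end up sharing each GPU. If you would rather run at most a couple of jobs at a time, one option (a sketch, assuming bash 4.3+ for `wait -n`) is a small helper you call right before each background launch:

```bash
MAX_JOBS=2   # e.g. one job per GPU

# Block until fewer than MAX_JOBS background jobs are running
wait_for_free_slot() {
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
        wait -n   # bash 4.3+: return as soon as any background job exits
    done
}
```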

## Dead Simple Experiment Tracking Without MLflow

Create a simple CSV logger in your training script:

```python
import csv
from pathlib import Path


class SimpleLogger:
    def __init__(self, log_dir):
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(exist_ok=True, parents=True)

        # Metrics CSV for this run
        self.metrics_file = self.log_dir / "metrics.csv"

        # Keep track of the best value seen for each metric
        self.best_metrics = {}

    def log_metrics(self, metrics, step):
        """Log metrics to the CSV file and update the best-metrics summary."""
        metrics["step"] = step

        # Create or append to CSV
        write_header = not self.metrics_file.exists()

        with open(self.metrics_file, mode='a', newline='') as file:
            writer = csv.DictWriter(file, fieldnames=metrics.keys())
            if write_header:
                writer.writeheader()
            writer.writerow(metrics)

        # Update best metrics (lower is better, except for accuracy-like metrics)
        for key, value in metrics.items():
            if key == "step":
                continue
            higher_is_better = any(s in key.lower() for s in ("acc", "score"))
            best = self.best_metrics.get(key)
            improved = (
                best is None
                or (value > best["value"] if higher_is_better else value < best["value"])
            )
            if improved:
                self.best_metrics[key] = {"value": value, "step": step}

        # Write best metrics summary
        with open(self.log_dir / "best_metrics.txt", 'w') as f:
            for key, data in self.best_metrics.items():
                f.write(f"Best {key}: {data['value']} (step {data['step']})\n")
```
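
Wired into the training loop, usage could look roughly like this (`train_step`, `batch`, and `total_steps` are placeholders for whatever your loop actually does). Pointing the logger at the run directory means `metrics.csv` and `best_metrics.txt` land where the aggregation and plotting scripts below look for them:

```python
# In train.py, after parsing args
logger = SimpleLogger(args.output_dir)

for step in range(total_steps):
    loss, accuracy = train_step(batch)   # whatever your loop computes
    if step % 100 == 0:
        logger.log_metrics({"loss": loss, "accuracy": accuracy}, step=step)
```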

## Finding and Comparing Results

Create a simple results aggregation script:

```bash
#!/bin/bash
# aggregate_results.sh

echo "Run Directory,Best Loss,Best Accuracy,Training Time"

find runs/ -name "best_metrics.txt" | while read -r metrics_file; do
    run_dir=$(dirname "$metrics_file")
    best_loss=$(grep "Best loss" "$metrics_file" | cut -d' ' -f3)
    best_acc=$(grep "Best accuracy" "$metrics_file" | cut -d' ' -f3)

    # Get training time from the log
    log_file="$run_dir/logs/training.log"
    start_time=$(head -n 1 "$log_file" | grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}')
    end_time=$(tail -n 10 "$log_file" | grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}' | tail -n 1)

    echo "$run_dir,$best_loss,$best_acc,$start_time-$end_time"
done | sort -t',' -k2n
```
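
To eyeball the summary in a terminal, `column` lines the CSV up nicely:

```bash
./scripts/aggregate_results.sh | column -t -s','
```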

## Simple Visualization Without External Tools

Create a basic plotting script using matplotlib:

```python
# plot_results.py
import glob
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

# Find all metrics.csv files
metrics_files = glob.glob("runs/**/metrics.csv", recursive=True)

# Plot the loss curve for each run
plt.figure(figsize=(12, 8))
for metrics_file in metrics_files:
    run_name = Path(metrics_file).parent.name
    df = pd.read_csv(metrics_file)

    if 'loss' in df.columns:
        plt.plot(df['step'], df['loss'], label=f"{run_name} - loss")

plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('Training Loss Comparison')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig('loss_comparison.png')
plt.close()

# Create an accuracy plot if the metric is available
plt.figure(figsize=(12, 8))
for metrics_file in metrics_files:
    run_name = Path(metrics_file).parent.name
    df = pd.read_csv(metrics_file)

    if 'accuracy' in df.columns:
        plt.plot(df['step'], df['accuracy'], label=f"{run_name} - accuracy")

plt.xlabel('Step')
plt.ylabel('Accuracy')
plt.title('Training Accuracy Comparison')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig('accuracy_comparison.png')
```

## Process Management and GPU Allocation

Create a script to check GPU usage and allocate new jobs:

```bash
#!/bin/bash
# allocate_gpu.sh

# This script checks GPU usage and returns the index of the least utilized GPU
LEAST_BUSY_GPU=$(nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits |
    sort -t',' -k2n |
    head -n 1 |
    cut -d',' -f1)

echo "$LEAST_BUSY_GPU"
```
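
One way to use it when kicking off an ad-hoc run (a sketch, assuming the script has been made executable; it simply feeds the returned index into `CUDA_VISIBLE_DEVICES`):

```bash
GPU=$(./scripts/allocate_gpu.sh)
OUTPUT_DIR="runs/gpu${GPU}/$(date +"%Y%m%d_%H%M%S")"
mkdir -p "$OUTPUT_DIR/checkpoints" "$OUTPUT_DIR/logs"

CUDA_VISIBLE_DEVICES=$GPU python src/train.py \
    --output_dir "$OUTPUT_DIR" \
    > "$OUTPUT_DIR/logs/training.log" 2>&1 &
```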

## Tips for Avoiding Race Conditions

1. **Always use unique output directories for each run**:
   - Include timestamp, GPU ID, and PID in directory names
   - Never share output directories between processes

2. **For checkpoint saving**:
   - Save to a temp directory first
   - Use atomic operations like `shutil.move()` for final placement
   - Don't depend on file locks (often unreliable on network filesystems)

3. **For data loading**:
   - Use a different random seed per process (see the sketch after this list)
   - Set `num_workers` appropriately (2-4 per GPU usually works well)
   - Add a process-specific buffer (e.g. a per-process cache directory) to avoid filesystem contention

4. **For logging**:
   - Each process should write to its own log file
   - Use timestamps in log entries
   - Include GPU ID and PID in log messages
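
To make tips 3 and 4 concrete, here is a minimal sketch of per-process seeding and per-process log naming. It assumes plain PyTorch plus the standard library, and the toy `TensorDataset` stands in for your real data:

```python
import logging
import os

import torch
from torch.utils.data import DataLoader, TensorDataset

# Derive a per-process seed so runs on different GPUs don't shuffle identically
gpu_id = os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0]
pid = os.getpid()
torch.manual_seed(42 + 1000 * int(gpu_id) + pid % 1000)

# Each process gets its own log file, tagged with GPU ID and PID
logging.basicConfig(
    filename=f"train_gpu{gpu_id}_pid{pid}.log",
    format=f"%(asctime)s [GPU {gpu_id}] [PID {pid}] %(message)s",
    level=logging.INFO,
)
logging.info("Process started")

# Toy dataset; 2-4 DataLoader workers per GPU is usually plenty
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)
```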

## Quick Commands Reference

```bash
# Start training on GPU 0
./scripts/train_gpu0.sh

# Start training on GPU 1
./scripts/train_gpu1.sh

# Run the parameter sweep across GPUs
./scripts/param_sweep.sh

# Monitor GPU usage
./scripts/monitor_gpus.sh

# Find the GPU with the lowest utilization
BEST_GPU=$(./scripts/allocate_gpu.sh)
echo "Least busy GPU is: $BEST_GPU"

# Aggregate results into a CSV
./scripts/aggregate_results.sh > results_summary.csv

# Generate comparison plots
python scripts/plot_results.py
```

Remember: The simplest solution is usually the most maintainable. Keep your scripts straightforward, make each run independent, and use filesystem organization to avoid conflicts.

# TODO: Replace print statements with loguru logging for better debugging and log file management