Brain Rotting Multiple GPU Workflow for Dummies

Problem: Working with Multiple GPUs Without Race Conditions

Running multiple training processes on different GPUs can lead to:

  • Output directory conflicts
  • Checkpoint corruption
  • Resource contention
  • Difficult debugging and tracking

This guide gives you dead simple solutions using only basic scripts.

Directory Structure for Sanity

First, set up a clean directory structure to keep runs separate:

project/
├── scripts/
│   ├── train_gpu0.sh
│   ├── train_gpu1.sh 
│   └── monitor_gpus.sh
├── src/
│   └── train.py
└── runs/
    ├── gpu0/  # Training on GPU 0
    │   ├── checkpoints/
    │   └── logs/
    └── gpu1/  # Training on GPU 1
        ├── checkpoints/
        └── logs/

Simple Shell Scripts for GPU Management

1. Dedicated GPU Training Script (train_gpu0.sh)

#!/bin/bash

# Assign this process to GPU 0 only
export CUDA_VISIBLE_DEVICES=0

# Create timestamped run directory
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
OUTPUT_DIR="runs/gpu0/${TIMESTAMP}"
mkdir -p $OUTPUT_DIR/checkpoints
mkdir -p $OUTPUT_DIR/logs

# Run with output redirect to log file
python src/train.py \
  --output_dir $OUTPUT_DIR \
  --batch_size 32 \
  --learning_rate 1e-4 \
  > $OUTPUT_DIR/logs/training.log 2>&1

2. Second GPU Script (train_gpu1.sh)

#!/bin/bash

# Assign this process to GPU 1 only
export CUDA_VISIBLE_DEVICES=1

# Create timestamped run directory
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
OUTPUT_DIR="runs/gpu1/${TIMESTAMP}"
mkdir -p $OUTPUT_DIR/checkpoints
mkdir -p $OUTPUT_DIR/logs

# Run with output redirect to log file
python src/train.py \
  --output_dir $OUTPUT_DIR \
  --batch_size 32 \
  --learning_rate 1e-4 \
  > $OUTPUT_DIR/logs/training.log 2>&1
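
Both launch scripts assume that src/train.py accepts the three flags shown above. Here is a minimal sketch of that interface; the real train.py likely takes more arguments, so treat this flag list as an assumption:

# src/train.py (argument-parsing sketch; only the flags used by the launch scripts)
import argparse
import os

def parse_args():
    parser = argparse.ArgumentParser(description="Single-GPU training entry point")
    parser.add_argument("--output_dir", type=str, required=True,
                        help="Run directory created by the launch script")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"[PID {os.getpid()}] writing outputs to {args.output_dir}")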

3. Simple GPU Monitoring Script (monitor_gpus.sh)

#!/bin/bash

# Simple GPU monitoring loop with timestamps
while true; do
  clear
  echo "======== $(date) ========"
  nvidia-smi
  sleep 5
done
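
The loop above is for watching live. If you also want a persistent record of utilization per machine, a small Python poller can append nvidia-smi's CSV query output to a file; the log path and polling interval below are arbitrary choices, not part of the original scripts:

# log_gpu_stats.py - append GPU utilization/memory samples to a CSV file
import subprocess
import time
from datetime import datetime

LOG_FILE = "runs/gpu_stats.csv"   # arbitrary location; pick whatever fits your layout
INTERVAL_SECONDS = 30

QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

while True:
    timestamp = datetime.now().isoformat(timespec="seconds")
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    with open(LOG_FILE, "a") as f:
        for line in result.stdout.strip().splitlines():
            f.write(f"{timestamp},{line}\n")
    time.sleep(INTERVAL_SECONDS)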

Checkpoint Management Without Race Conditions

In your train.py, implement safe checkpoint saving:

import os
import time
import torch
import shutil
from pathlib import Path

def save_checkpoint(model, optimizer, epoch, step, args):
    """Save checkpoint safely without race conditions"""
    # Get process-specific info for uniqueness
    pid = os.getpid()
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    
    # Create temporary directory with unique name
    checkpoint_dir = Path(args.output_dir) / "checkpoints"
    checkpoint_dir.mkdir(exist_ok=True)
    
    temp_dir = checkpoint_dir / f"temp_{pid}_{timestamp}"
    temp_dir.mkdir(exist_ok=True)
    
    # Save to temporary location first
    checkpoint_path = temp_dir / "checkpoint.pt"
    torch.save({
        'epoch': epoch,
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, checkpoint_path)
    
    # Create final directory name
    final_dir = checkpoint_dir / f"checkpoint_epoch{epoch}_step{step}"
    
    # Rename into place (atomic on the same filesystem, safer than copying files)
    shutil.move(str(temp_dir), str(final_dir))
    
    # Clean up old checkpoints (keep only the 5 most recent).
    # Sort by modification time; sorting by name would put epoch10 before epoch2.
    checkpoints = sorted([d for d in checkpoint_dir.iterdir()
                          if d.is_dir() and d.name.startswith("checkpoint_")],
                         key=lambda d: d.stat().st_mtime)
    for old_checkpoint in checkpoints[:-5]:
        shutil.rmtree(old_checkpoint)
    
    print(f"Saved checkpoint to {final_dir}")
    return final_dir
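
To resume training, the loading side just mirrors the naming scheme above. A minimal sketch, assuming the same checkpoints/ layout and checkpoint.pt filename that save_checkpoint uses:

def load_latest_checkpoint(model, optimizer, args, device="cpu"):
    """Load the most recent checkpoint written by save_checkpoint, if any."""
    checkpoint_dir = Path(args.output_dir) / "checkpoints"
    if not checkpoint_dir.exists():
        return None
    candidates = sorted([d for d in checkpoint_dir.iterdir()
                         if d.is_dir() and d.name.startswith("checkpoint_")],
                        key=lambda d: d.stat().st_mtime)
    if not candidates:
        return None
    state = torch.load(candidates[-1] / "checkpoint.pt", map_location=device)
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    return {"epoch": state["epoch"], "step": state["step"], "path": candidates[-1]}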

Running Multiple Training Jobs with Different Parameters

Create a parameter sweep script that launches multiple runs with different configs. Note that every combination is launched in the background, so with more combinations than GPUs several runs will share a GPU; make sure they fit in memory together, or launch fewer combinations at a time:

#!/bin/bash
# param_sweep.sh

# Define parameter grid
LEARNING_RATES=("1e-3" "5e-4" "1e-4")
BATCH_SIZES=(16 32 64)

# Loop through parameters and assign to GPUs
GPU=0
for lr in "${LEARNING_RATES[@]}"; do
  for bs in "${BATCH_SIZES[@]}"; do
    # Select GPU using modulo to cycle through available GPUs
    SELECTED_GPU=$(($GPU % 2)) # Assuming 2 GPUs (0 and 1)
    GPU=$((GPU + 1))
    
    # Create run directory
    TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
    RUN_NAME="lr${lr}_bs${bs}"
    OUTPUT_DIR="runs/gpu${SELECTED_GPU}/${RUN_NAME}_${TIMESTAMP}"
    mkdir -p $OUTPUT_DIR/checkpoints
    mkdir -p $OUTPUT_DIR/logs
    
    # Launch training in background
    echo "Starting run on GPU ${SELECTED_GPU}: lr=${lr}, bs=${bs}"
    CUDA_VISIBLE_DEVICES=$SELECTED_GPU python src/train.py \
      --output_dir $OUTPUT_DIR \
      --batch_size $bs \
      --learning_rate $lr \
      > $OUTPUT_DIR/logs/training.log 2>&1 &
    
    # Wait a bit to stagger the starts
    sleep 10
  done
done

echo "All jobs launched. Monitor with './scripts/monitor_gpus.sh'"

Dead Simple Experiment Tracking Without MLflow

Create a simple CSV logger in your training script:

import csv
from pathlib import Path

class SimpleLogger:
    def __init__(self, log_dir):
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(exist_ok=True, parents=True)
        
        # Metrics CSV (the header is written on the first log_metrics call)
        self.metrics_file = self.log_dir / "metrics.csv"
        
        # Keep track of best metrics
        self.best_metrics = {}
    
    def log_metrics(self, metrics, step):
        """Append metrics to the CSV file and update the best-metrics summary."""
        metrics = {**metrics, "step": step}  # copy so the caller's dict isn't mutated
        
        # Create or append to CSV
        write_header = not self.metrics_file.exists()
        
        with open(self.metrics_file, mode='a', newline='') as file:
            writer = csv.DictWriter(file, fieldnames=metrics.keys())
            if write_header:
                writer.writeheader()
            writer.writerow(metrics)
        
        # Update best metrics (assume lower is better, except accuracy-style metrics)
        for key, value in metrics.items():
            if key == "step":
                continue
            higher_is_better = "acc" in key.lower()
            best = self.best_metrics.get(key)
            if (best is None
                    or (higher_is_better and value > best["value"])
                    or (not higher_is_better and value < best["value"])):
                self.best_metrics[key] = {"value": value, "step": step}
        
        # Write best metrics summary
        with open(self.log_dir / "best_metrics.txt", 'w') as f:
            for key, data in self.best_metrics.items():
                f.write(f"Best {key}: {data['value']} (step {data['step']})\n")

Finding and Comparing Results

Create a simple results aggregation script:

#!/bin/bash
# aggregate_results.sh

echo "Run Directory,Best Loss,Best Accuracy,Training Time"

find runs/ -name "best_metrics.txt" | while read metrics_file; do
    run_dir=$(dirname "$metrics_file")
    best_loss=$(grep "Best loss" "$metrics_file" | cut -d' ' -f3)
    best_acc=$(grep "Best accuracy" "$metrics_file" | cut -d' ' -f3)
    
    # Get start/end times from the log (assumes log lines contain HH:MM:SS timestamps)
    log_file="$run_dir/logs/training.log"
    start_time=$(head -n 1 "$log_file" | grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}')
    end_time=$(tail -n 10 "$log_file" | grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}' | tail -n 1)
    
    echo "$run_dir,$best_loss,$best_acc,$start_time-$end_time"
done | sort -t',' -k2n

Simple Visualization Without External Tools

Create a basic plotting script using matplotlib:

# plot_results.py
import glob
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

# Find all metrics.csv files
metrics_files = glob.glob("runs/**/metrics.csv", recursive=True)

plt.figure(figsize=(12, 8))

# Plot each run
for metrics_file in metrics_files:
    run_name = Path(metrics_file).parent.name
    df = pd.read_csv(metrics_file)
    
    plt.plot(df['step'], df['loss'], label=f"{run_name} - loss")
    
plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('Training Loss Comparison')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig('loss_comparison.png')
plt.close()

# Create accuracy plot if available
plt.figure(figsize=(12, 8))
for metrics_file in metrics_files:
    run_name = Path(metrics_file).parent.name
    df = pd.read_csv(metrics_file)
    
    if 'accuracy' in df.columns:
        plt.plot(df['step'], df['accuracy'], label=f"{run_name} - accuracy")
    
plt.xlabel('Step')
plt.ylabel('Accuracy')
plt.title('Training Accuracy Comparison')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig('accuracy_comparison.png')
plt.close()

Process Management and GPU Allocation

Create a script that checks GPU usage and prints the index of the least-utilized GPU, so new jobs can be pointed at it:

#!/bin/bash
# allocate_gpu.sh

# This script checks GPU usage and returns the index of the least utilized GPU
LEAST_BUSY_GPU=$(nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits | 
                 sort -t',' -k2n | 
                 head -n 1 | 
                 cut -d',' -f1)

echo $LEAST_BUSY_GPU

Tips for Avoiding Race Conditions

  1. Always use unique output directories for each run:
    • Include timestamp, GPU ID, and PID in directory names
    • Never share output directories between processes
  2. For checkpoint saving:
    • Save to a temp directory first
    • Use atomic operations like shutil.move() for final placement
    • Don't depend on file locks (often unreliable on network filesystems)
  3. For data loading:
    • Use a different random seed per process (see the seeding sketch after this list)
    • Set num_workers appropriately (2-4 per GPU usually works well)
    • Use process-specific cache/scratch directories to avoid filesystem contention
  4. For logging:
    • Each process should write to its own log file
    • Use timestamps in log entries
    • Include GPU ID and PID in log messages
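
For the per-process seeding mentioned in tip 3, here is a minimal sketch. The PID-based offset is just one convenient way to make seeds differ; you could instead pass the GPU ID in from the launch script:

import os
import random

import numpy as np
import torch

def seed_everything(base_seed):
    """Derive a process-specific seed so parallel runs don't sample data identically."""
    seed = base_seed + os.getpid() % 1000    # offset by PID so each process differs
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                  # seeds CPU (and CUDA on recent PyTorch)
    torch.cuda.manual_seed_all(seed)         # explicit, for all visible devices
    return seed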

Quick Commands Reference

# Start training on GPU 0
./scripts/train_gpu0.sh

# Start training on GPU 1
./scripts/train_gpu1.sh

# Run parameter sweep across GPUs
./scripts/param_sweep.sh

# Monitor GPU usage
./scripts/monitor_gpus.sh

# Find GPU with lowest utilization
BEST_GPU=$(./scripts/allocate_gpu.sh)
echo "Least busy GPU is: $BEST_GPU"

# Aggregate results into CSV
./scripts/aggregate_results.sh > results_summary.csv

# Generate comparison plots
python scripts/plot_results.py

Remember: The simplest solution is usually the most maintainable. Keep your scripts straightforward, make each run independent, and use filesystem organization to avoid conflicts.

TODO: Replace print statements with loguru logging for better debugging and log file management