# Brain Rotting Multiple GPU Workflow for Dummies
## Problem: Working with Multiple GPUs Without Race Conditions
Running multiple training processes on different GPUs can lead to:
- Output directory conflicts
- Checkpoint corruption
- Resource contention
- Difficult debugging and tracking
This guide gives you dead simple solutions using only basic scripts.
## Directory Structure for Sanity
First, set up a clean directory structure to keep runs separate:
```
project/
├── scripts/
│   ├── train_gpu0.sh
│   ├── train_gpu1.sh
│   └── monitor_gpus.sh
├── src/
│   └── train.py
└── runs/
    ├── gpu0/              # Training runs on GPU 0
    │   ├── checkpoints/
    │   └── logs/
    └── gpu1/              # Training runs on GPU 1
        ├── checkpoints/
        └── logs/
```
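The per-run subdirectories are created by the training scripts themselves, so only the top-level skeleton needs to exist up front. A minimal sketch to set it up:

```bash
# Create the top-level project skeleton; run directories are made at launch time
mkdir -p scripts src runs/gpu0 runs/gpu1
```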
## Simple Shell Scripts for GPU Management
### 1. Dedicated GPU Training Script (train_gpu0.sh)
```bash
#!/bin/bash
# Assign this process to GPU 0 only
export CUDA_VISIBLE_DEVICES=0

# Create timestamped run directory
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
OUTPUT_DIR="runs/gpu0/${TIMESTAMP}"
mkdir -p "$OUTPUT_DIR/checkpoints"
mkdir -p "$OUTPUT_DIR/logs"

# Run with output redirected to a log file
python src/train.py \
    --output_dir "$OUTPUT_DIR" \
    --batch_size 32 \
    --learning_rate 1e-4 \
    > "$OUTPUT_DIR/logs/training.log" 2>&1
```
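Make the script executable, and if you launch it over SSH, wrap it in `nohup` so the run survives a disconnect. The script already redirects its own output to a log file, so discarding `nohup`'s output is fine:

```bash
chmod +x scripts/train_gpu0.sh

# Run in the background and keep it alive after logout
nohup ./scripts/train_gpu0.sh > /dev/null 2>&1 &
echo "Started training on GPU 0 with PID $!"
```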
### 2. Second GPU Script (train_gpu1.sh)
```bash
#!/bin/bash
# Assign this process to GPU 1 only
export CUDA_VISIBLE_DEVICES=1

# Create timestamped run directory
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
OUTPUT_DIR="runs/gpu1/${TIMESTAMP}"
mkdir -p "$OUTPUT_DIR/checkpoints"
mkdir -p "$OUTPUT_DIR/logs"

# Run with output redirected to a log file
python src/train.py \
    --output_dir "$OUTPUT_DIR" \
    --batch_size 32 \
    --learning_rate 1e-4 \
    > "$OUTPUT_DIR/logs/training.log" 2>&1
```
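The two scripts differ only in the GPU index, so you may prefer a single parameterized launcher. Here is a minimal sketch; the name `train_gpu.sh` and its argument handling are assumptions, not part of the layout above:

```bash
#!/bin/bash
# train_gpu.sh (hypothetical generic launcher): ./scripts/train_gpu.sh <gpu_index>
GPU_ID="${1:?Usage: $0 <gpu_index>}"
export CUDA_VISIBLE_DEVICES="$GPU_ID"

TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
OUTPUT_DIR="runs/gpu${GPU_ID}/${TIMESTAMP}"
mkdir -p "$OUTPUT_DIR/checkpoints" "$OUTPUT_DIR/logs"

python src/train.py \
    --output_dir "$OUTPUT_DIR" \
    --batch_size 32 \
    --learning_rate 1e-4 \
    > "$OUTPUT_DIR/logs/training.log" 2>&1
```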
### 3. Simple GPU Monitoring Script (monitor_gpus.sh)
```bash
#!/bin/bash
# Simple GPU monitoring loop with timestamps
while true; do
    clear
    echo "======== $(date) ========"
    nvidia-smi
    sleep 5
done
```
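If `watch` is available on your machine, a one-liner gives roughly the same rolling view without a custom loop:

```bash
# Refresh nvidia-smi every 5 seconds (Ctrl+C to stop)
watch -n 5 nvidia-smi
```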
## Checkpoint Management Without Race Conditions
In your `train.py`, implement safe checkpoint saving:
```python
import os
import shutil
import time
from pathlib import Path

import torch


def save_checkpoint(model, optimizer, epoch, step, args):
    """Save a checkpoint safely, without race conditions."""
    # Get process-specific info for uniqueness
    pid = os.getpid()
    timestamp = time.strftime("%Y%m%d_%H%M%S")

    # Create a temporary directory with a unique name
    checkpoint_dir = Path(args.output_dir) / "checkpoints"
    checkpoint_dir.mkdir(exist_ok=True)
    temp_dir = checkpoint_dir / f"temp_{pid}_{timestamp}"
    temp_dir.mkdir(exist_ok=True)

    # Save to the temporary location first
    checkpoint_path = temp_dir / "checkpoint.pt"
    torch.save({
        'epoch': epoch,
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, checkpoint_path)

    # Rename the whole directory into place (a single rename is safer than copying files)
    final_dir = checkpoint_dir / f"checkpoint_epoch{epoch}_step{step}"
    shutil.move(str(temp_dir), str(final_dir))

    # Clean up old checkpoints (keep only the last 5).
    # Sort by modification time, not name, so epoch10 doesn't sort before epoch2.
    checkpoints = sorted(
        [d for d in checkpoint_dir.iterdir()
         if d.is_dir() and d.name.startswith("checkpoint_")],
        key=lambda d: d.stat().st_mtime,
    )
    for old_checkpoint in checkpoints[:-5]:
        shutil.rmtree(old_checkpoint)

    print(f"Saved checkpoint to {final_dir}")
    return final_dir
```
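A matching loader, sketched under the same directory layout, picks the newest completed checkpoint directory and ignores any leftover `temp_*` directories from interrupted runs. The name `load_latest_checkpoint` is an illustration, not something `train.py` already defines:

```python
from pathlib import Path

import torch


def load_latest_checkpoint(output_dir, model, optimizer, device="cpu"):
    """Hypothetical helper: restore the newest finished checkpoint, if any."""
    checkpoint_dir = Path(output_dir) / "checkpoints"
    finished = [d for d in checkpoint_dir.glob("checkpoint_*") if d.is_dir()]
    if not finished:
        return None  # nothing to resume from

    latest = max(finished, key=lambda d: d.stat().st_mtime)
    state = torch.load(latest / "checkpoint.pt", map_location=device)
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    return {"epoch": state["epoch"], "step": state["step"], "path": latest}
```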
## Running Multiple Training Jobs with Different Parameters
Create a parameter sweep script that launches multiple runs with different configs:
```bash
#!/bin/bash
# param_sweep.sh

# Define parameter grid
LEARNING_RATES=("1e-3" "5e-4" "1e-4")
BATCH_SIZES=(16 32 64)

# Loop through parameters and assign runs to GPUs
GPU=0
for lr in "${LEARNING_RATES[@]}"; do
    for bs in "${BATCH_SIZES[@]}"; do
        # Select GPU using modulo to cycle through available GPUs
        SELECTED_GPU=$((GPU % 2))  # Assuming 2 GPUs (0 and 1)
        GPU=$((GPU + 1))

        # Create run directory
        TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
        RUN_NAME="lr${lr}_bs${bs}"
        OUTPUT_DIR="runs/gpu${SELECTED_GPU}/${RUN_NAME}_${TIMESTAMP}"
        mkdir -p "$OUTPUT_DIR/checkpoints"
        mkdir -p "$OUTPUT_DIR/logs"

        # Launch training in the background
        echo "Starting run on GPU ${SELECTED_GPU}: lr=${lr}, bs=${bs}"
        CUDA_VISIBLE_DEVICES=$SELECTED_GPU python src/train.py \
            --output_dir "$OUTPUT_DIR" \
            --batch_size "$bs" \
            --learning_rate "$lr" \
            > "$OUTPUT_DIR/logs/training.log" 2>&1 &

        # Wait a bit to stagger the starts
        sleep 10
    done
done

echo "All jobs launched. Monitor with './scripts/monitor_gpus.sh'"
```
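Note that this launches every combination in the grid at once, several per GPU. If you want the script to block until everything finishes (for example, before running the aggregation script below), appending a `wait` is enough:

```bash
# At the end of param_sweep.sh: block until all background jobs exit
wait
echo "All sweep runs finished."
```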
## Dead Simple Experiment Tracking Without MLflow
Create a simple CSV logger in your training script:
```python
import csv
from pathlib import Path


class SimpleLogger:
    def __init__(self, log_dir):
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(exist_ok=True, parents=True)

        # Metrics are appended to a CSV file in the run's log directory
        self.metrics_file = self.log_dir / "metrics.csv"

        # Keep track of the best value seen for each metric
        self.best_metrics = {}

    def log_metrics(self, metrics, step):
        """Log metrics to the CSV file and update the best-metrics summary."""
        metrics["step"] = step

        # Create or append to the CSV
        write_header = not self.metrics_file.exists()
        with open(self.metrics_file, mode='a', newline='') as file:
            writer = csv.DictWriter(file, fieldnames=metrics.keys())
            if write_header:
                writer.writeheader()
            writer.writerow(metrics)

        # Update best metrics (higher is better for accuracy-like metrics,
        # lower is better for everything else, e.g. loss)
        for key, value in metrics.items():
            if key == "step":
                continue
            higher_is_better = "acc" in key.lower()
            best = self.best_metrics.get(key)
            if (best is None
                    or (higher_is_better and value > best["value"])
                    or (not higher_is_better and value < best["value"])):
                self.best_metrics[key] = {"value": value, "step": step}

        # Write best metrics summary
        with open(self.log_dir / "best_metrics.txt", 'w') as f:
            for key, data in self.best_metrics.items():
                f.write(f"Best {key}: {data['value']} (step {data['step']})\n")
```
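In `train.py`, the wiring looks roughly like this. The metric names `loss` and `accuracy` are what the aggregation and plotting scripts below expect, and pointing the logger at the run's root output directory keeps `metrics.csv` and `best_metrics.txt` where those scripts look for them; `train_step` is a placeholder for your actual training step:

```python
import argparse

# Hypothetical wiring inside train.py
parser = argparse.ArgumentParser()
parser.add_argument("--output_dir", required=True)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--learning_rate", type=float, default=1e-4)
args = parser.parse_args()

logger = SimpleLogger(args.output_dir)

for step in range(1000):
    loss, accuracy = train_step()  # placeholder for your actual training step
    if step % 100 == 0:
        logger.log_metrics({"loss": float(loss), "accuracy": float(accuracy)}, step=step)
```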
## Finding and Comparing Results
Create a simple results aggregation script:
```bash
#!/bin/bash
# aggregate_results.sh
echo "Run Directory,Best Loss,Best Accuracy,Training Time"

find runs/ -name "best_metrics.txt" | while read -r metrics_file; do
    run_dir=$(dirname "$metrics_file")
    best_loss=$(grep "Best loss" "$metrics_file" | cut -d' ' -f3)
    best_acc=$(grep "Best accuracy" "$metrics_file" | cut -d' ' -f3)

    # Get training start/end times from the log
    log_file="$run_dir/logs/training.log"
    start_time=$(head -n 1 "$log_file" | grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}')
    end_time=$(tail -n 10 "$log_file" | grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}' | tail -n 1)

    echo "$run_dir,$best_loss,$best_acc,$start_time-$end_time"
done | sort -t',' -k2n
```
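To eyeball the summary as an aligned table in the terminal, `column` works on the generated CSV:

```bash
./scripts/aggregate_results.sh | column -t -s','
```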
## Simple Visualization Without External Tools
Create a basic plotting script using matplotlib:
```python
# plot_results.py
import glob
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

# Find all metrics.csv files
metrics_files = glob.glob("runs/**/metrics.csv", recursive=True)

# Plot loss for each run
plt.figure(figsize=(12, 8))
for metrics_file in metrics_files:
    run_name = Path(metrics_file).parent.name
    df = pd.read_csv(metrics_file)
    if 'loss' in df.columns:
        plt.plot(df['step'], df['loss'], label=f"{run_name} - loss")

plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('Training Loss Comparison')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig('loss_comparison.png')
plt.close()

# Create accuracy plot if the metric is available
plt.figure(figsize=(12, 8))
for metrics_file in metrics_files:
    run_name = Path(metrics_file).parent.name
    df = pd.read_csv(metrics_file)
    if 'accuracy' in df.columns:
        plt.plot(df['step'], df['accuracy'], label=f"{run_name} - accuracy")

plt.xlabel('Step')
plt.ylabel('Accuracy')
plt.title('Training Accuracy Comparison')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig('accuracy_comparison.png')
```
## Process Management and GPU Allocation
Create a script to check GPU usage and allocate new jobs:
```bash
#!/bin/bash
# allocate_gpu.sh
# Checks GPU usage and prints the index of the least utilized GPU
LEAST_BUSY_GPU=$(nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits |
    sort -t',' -k2n |
    head -n 1 |
    cut -d',' -f1)

echo "$LEAST_BUSY_GPU"
```
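Wiring this into a launch looks something like the following sketch. Keep in mind the utilization figure is an instantaneous reading, so two jobs launched back-to-back can still land on the same GPU unless you stagger them:

```bash
# Pick the least-busy GPU and launch a run pinned to it
BEST_GPU=$(./scripts/allocate_gpu.sh)
echo "Launching on GPU ${BEST_GPU}"

OUTPUT_DIR="runs/gpu${BEST_GPU}/$(date +"%Y%m%d_%H%M%S")"
mkdir -p "$OUTPUT_DIR/checkpoints" "$OUTPUT_DIR/logs"

CUDA_VISIBLE_DEVICES=$BEST_GPU python src/train.py \
    --output_dir "$OUTPUT_DIR" \
    --batch_size 32 \
    --learning_rate 1e-4 \
    > "$OUTPUT_DIR/logs/training.log" 2>&1 &
```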
## Tips for Avoiding Race Conditions
1. **Always use unique output directories for each run**:
   - Include the timestamp, GPU ID, and PID in directory names (see the sketch after this list)
   - Never share output directories between processes
2. **For checkpoint saving**:
   - Save to a temporary directory first
   - Use atomic operations like `shutil.move()` for final placement
   - Don't depend on file locks (often unreliable on network filesystems)
3. **For data loading**:
   - Use a different random seed per process
   - Set `num_workers` appropriately (2-4 per GPU usually works well)
   - Give each process its own data cache or prefetch buffer to avoid filesystem contention
4. **For logging**:
   - Each process should write to its own log file
   - Use timestamps in log entries
   - Include the GPU ID and PID in log messages
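
As a concrete illustration of the first and third tips, a run directory name and per-process seed could be derived like this inside `train.py`. This is a sketch; the helper names and the exact naming scheme are assumptions, not part of the scripts above:

```python
import os
import time

import torch


def make_run_dir(base_dir="runs"):
    """Hypothetical helper: build a collision-free run directory name."""
    gpu_id = os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0]
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    run_dir = os.path.join(base_dir, f"gpu{gpu_id}", f"{timestamp}_pid{os.getpid()}")
    os.makedirs(os.path.join(run_dir, "checkpoints"), exist_ok=True)
    os.makedirs(os.path.join(run_dir, "logs"), exist_ok=True)
    return run_dir


def seed_everything(base_seed=42):
    """Give each process its own seed so data shuffling differs per process."""
    seed = base_seed + os.getpid() % 10000
    torch.manual_seed(seed)
    return seed
```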
## Quick Commands Reference
```bash
# Start training on GPU 0
./scripts/train_gpu0.sh
# Start training on GPU 1
./scripts/train_gpu1.sh
# Run parameter sweep across GPUs
./scripts/param_sweep.sh
# Monitor GPU usage
./scripts/monitor_gpus.sh
# Find GPU with lowest utilization
BEST_GPU=$(./scripts/allocate_gpu.sh)
echo "Least busy GPU is: $BEST_GPU"
# Aggregate results into CSV
./scripts/aggregate_results.sh > results_summary.csv
# Generate comparison plots
python scripts/plot_results.py
```
Remember: The simplest solution is usually the most maintainable. Keep your scripts straightforward, make each run independent, and use filesystem organization to avoid conflicts.
**TODO:** Replace print statements with loguru logging for better debugging and log file management.