This subfolder contains the custom script implementing the GPU queuing and monitoring system set up on the Mjölnir server. The script is contained in gpu_queue.py. The instructions below show how the script was set up on the server (and how it can be set up again if needed), along with basic usage examples and other useful information for GPU monitoring on the server.
Copy gpu_queue.py to the server and make it executable, then install it system-wide as the gpuq command:
chmod +x gpu_queue.py
sudo cp gpu_queue.py /usr/local/bin/gpuq
pip install requests # for Slack notifications
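To check that the installation worked, confirm that the command is on the PATH (the status subcommand will only return useful output once the queue daemon, set up below, is running):
which gpuq
gpuq status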
Configure the system:
gpuq config
{
"max_job_time_hours": 24,
"max_memory_per_gpu_gb": 70,
"notification_email": {
"enabled": true,
"smtp_server": "smtp.gmail.com",
"smtp_port": 587,
"username": "your-email@gmail.com",
"password": "your-app-password",
"admin_email": "admin@yourlab.com"
},
"slack": {
"enabled": true,
"webhook_url": "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK",
"channel": "#gpu-alerts"
},
"user_emails": {
"alice": "alice@yourlab.com",
"bob": "bob@yourlab.com"
}
}
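If you edit the configuration by hand, it is easy to break the JSON syntax. A quick sanity check (the file path below is a placeholder; use wherever gpu_queue.py actually stores its configuration):
python3 -m json.tool /path/to/gpu_queue_config.json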
Create a systemd service to run the queue daemon:
sudo tee /etc/systemd/system/gpu-queue.service << EOF
[Unit]
Description=GPU Queue Manager
After=network.target
[Service]
Type=simple
User=nobody
ExecStart=/usr/local/bin/gpuq daemon
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable gpu-queue
sudo systemctl start gpu-queue
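After starting the service, check that the daemon came up cleanly and stays up:
sudo systemctl status gpu-queue      # should report "active (running)"
sudo journalctl -u gpu-queue -f      # follow the daemon's log output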
When using the gpuq tool, always remember to activate your virtual environment first, so that the libraries required by your experiments are available to the command you submit:
conda activate $your_environment                  # if you use conda
source venvs/$your_environment/bin/activate       # if you use a plain virtualenv
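Whether a job launched by the queue daemon inherits your interactive environment depends on how gpu_queue.py spawns processes; if submitted jobs fail with missing packages, a more robust approach is to activate the environment inside the submitted command itself (the environment name below is a placeholder):
gpuq submit --command "source venvs/$your_environment/bin/activate && python train.py --epochs 100"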
# Submit a simple training job
gpuq submit --command "python train.py --epochs 100"
# Request 2 GPUs with 40GB memory each for 12 hours
gpuq submit --command "python big_model.py" --gpus 2 --memory 40 --time 12
# Submit with email notification
gpuq submit --command "python experiment.py" --email "user@lab.com"
# Check current status
gpuq status
# Monitor in real-time
watch -n 5 gpuq status
# Kill a specific job
gpuq kill --job-id 12345
# Submit interactive job (for Jupyter notebooks)
gpuq submit --command "jupyter notebook --ip=0.0.0.0 --port=8888" --time 4
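Since Jupyter runs on the server, you will usually reach it from your own machine through an SSH tunnel (username and hostname below are placeholders; match the port to the one passed to --port):
ssh -L 8888:localhost:8888 your_user@mjolnir
# then open http://localhost:8888 in your local browser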
Create convenient wrapper scripts for common tasks:
/usr/local/bin/gpu-train - Training wrapper:
#!/bin/bash
# Wrapper for training jobs
if [ $# -eq 0 ]; then
    echo "Usage: gpu-train <script.py> [--gpus N] [--time H]"
    exit 1
fi
SCRIPT=$1
shift
gpuq submit --command "python $SCRIPT" "$@"
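For example (any extra flags are passed straight through to gpuq submit):
gpu-train train.py --gpus 2 --time 12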
/usr/local/bin/gpu-jupyter - Jupyter wrapper:
#!/bin/bash
# Start Jupyter with GPU access
PORT=${1:-8888}
gpuq submit --command "jupyter notebook --ip=0.0.0.0 --port=$PORT --no-browser" --time 8
echo "Jupyter will start soon. Check 'gpuq status' for details."
Some day-to-day usage tips:
- Check gpuq status regularly to see what is running and queued.
- Use gpuq kill to terminate jobs you no longer need.
- Job logs are written to /tmp/gpu_queue/logs/.
To set up Slack notifications, create an incoming webhook for your workspace and enter its URL in the webhook_url field of the configuration above. The system will send notifications for job events through the configured email and Slack channels. For Gmail, you'll need to create an app password and use it in the notification_email section (a regular account password will not work for SMTP). At the moment we use a dedicated Gmail account, mjolnirruqola@gmail.com, to send email notifications via SMTP; all information related to this account (password, app password, etc.) is securely stored in the OneDrive folder named mjolnir.
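An incoming webhook can be tested independently of gpuq with a plain HTTP POST (replace the URL with the real webhook from the configuration):
curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "GPU queue test message"}' \
  https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK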
The system automatically enforces the limits defined in the configuration above (maximum job time and memory per GPU). Useful commands for monitoring GPU usage by hand:
# Real-time GPU monitoring
nvidia-smi -l 1
# Check who's using what
gpuq status | grep -A 10 "Running Jobs"
# View job logs
tail -f /tmp/gpu_queue/logs/job_12345_stdout.log
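All job logs end up in the same directory, so the most recent ones are easy to find:
# list log files, newest first
ls -lt /tmp/gpu_queue/logs/ | head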