ruqola-server-deploy

Mjölnir User Documentation

Welcome to the Ruqola project NTU research server (aka Mjölnir)! This documentation provides comprehensive guidance for using our shared GPU computing resources effectively.

You can access a Jekyll version of this documentation here.

🖥️ Server Specifications

GPUs: 3x NVIDIA H200 (141GB HBM3e each)
Total GPU Memory: 423GB
Custom Queue Management: Fair resource allocation system
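
To confirm these figures on the server itself, the per-GPU memory reported by the driver can be summed. This is a minimal sketch: `sum_gpu_memory_mib` is a hypothetical helper name, not part of the server tooling, and the usage line assumes `nvidia-smi` is on your `PATH`.

```shell
# Hypothetical helper (not part of the server tooling): sums one number per
# input line, e.g. the MiB values that nvidia-smi prints with "nounits".
sum_gpu_memory_mib() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Usage on the server (assumes nvidia-smi is on PATH):
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | sum_gpu_memory_mib
```

`nvidia-smi` reports memory in MiB, so the sum will not match the rounded GB figures digit for digit.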

📚 Documentation Structure

For New Users

Bash Basics - Essential command line skills for server usage
Server Best Practices - Guidelines for respectful resource sharing

Server Users Creation and Deletion

Create/Delete Users - Explains the two minimal scripts used to create and delete users, either individually or in bulk from a CSV file.

File System and Folders Structure

Users Quota - Explains the quota system that limits each user's disk space, along with the essential bash commands for checking and managing quotas.
Scratch Folder - Guidelines for using the scratch folder to store large datasets and program artifacts.

GPU Queue System

GPU Queue User Guide - Comprehensive guide to job submission and monitoring
H200 GPU Specifications - Technical details and capabilities
Custom Queue Setup - Technical setup and administration of the queue system

Deep Learning Frameworks

PyTorch with H200 - Optimized PyTorch usage and examples
TensorFlow/Keras with H200 - TensorFlow setup and best practices
JAX/Flax with H200 - JAX configuration and usage patterns
Transformers with H200 - Hugging Face Transformers for LLMs and fine-tuning

Examples and Scripts

Example Scripts - Ready-to-use scripts for common workflows
Troubleshooting - Common issues and solutions

🚀 Quick Start

First Time Setup: Read Bash Basics
Familiarise yourself with the file and folder structure: Read Users Quota and Scratch Folder
Submit Your First Job: Check the GPU Queue User Guide
Choose Your Framework: Select from the PyTorch, TensorFlow, or JAX guides
Optimize Your Code: Review Server Best Practices

⚡ Quick Commands

# Check GPU availability
gpuq status

# Submit a training job (replace your_environment with the name of your conda environment)
conda activate your_environment
gpuq submit --command "python train.py" --gpus 1 --time 8

# Monitor GPUs in real-time
nvidia-smi -l 1

# Check your running jobs
gpuq status | grep $USER
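
The submit command above can be wrapped in a small function that checks the training script exists before queuing it. This is a minimal sketch that assumes only the `gpuq submit` flags shown in this README (`--command`, `--gpus`, `--time`); `submit_job` and the `DRY_RUN` variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical wrapper around the `gpuq submit` call shown in this README.
# Assumption: only the flags used above (--command, --gpus, --time) are relied on.
# Plain assignments (no `local`, no arrays) keep it portable across sh and bash.
submit_job() {
    script="$1"
    gpus="${2:-1}"          # defaults mirror the README example
    time_limit="${3:-8}"

    # Refuse to queue a job whose entry point is missing.
    if [ ! -f "$script" ]; then
        echo "error: $script not found" >&2
        return 1
    fi

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        printf '%s\n' "gpuq submit --command \"python $script\" --gpus $gpus --time $time_limit"
    else
        gpuq submit --command "python $script" --gpus "$gpus" --time "$time_limit"
    fi
}
```

Running `DRY_RUN=1 submit_job train.py 2 4` prints the command without submitting it, which is a cheap way to check the arguments before taking a queue slot.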

📞 Getting Help

Technical Issues: Contact your server administrator
Documentation Updates: Submit suggestions or corrections
Queue System: Check gpuq/README.md for technical details

Last updated: August 2025