Training Configuration

Configure your AI model training in one place

🎯 ESSENTIALS

Make your core choices - we'll handle the optimization

📊 Optimal Batch Size

-- total batch
0% GPU Memory Used
✅ Ready to proceed!

💻 Batch Size Calculator

Ready
$ GPU: --
$ Model: --
$ Precision: FP16 | Optimizer: AdamW
$ Calculating optimal batch size...
$ 📊 Optimal Batch Size: --
$ 💾 GPU Utilization: --

Requirements

PyTorch + Transformers Compatibility & Package Management

Auto-Generated Dependencies

Just click "Generate Requirements" and we'll create 66+ optimized packages with perfect compatibility. All conflicts are automatically resolved and installation order is optimized!

🔗 PyTorch + Transformers Compatibility

Select the best combination for your setup

📦 Dependencies & Libraries

Required software packages and versions

Click "Generate Requirements" to create your dependency list.

🔑 API Keys & Access

Manage all your API keys and tokens in one place

🔐 Centralized Secure Storage

All API keys are encrypted and stored securely. Configure them once here, and they'll sync automatically to the Cloud Deploy section.

🤗 HuggingFace Token

Not configured

Your HuggingFace token is used to download and upload models to your account.

Get your token →

🐙 GitHub Token (Optional)

Not configured

GitHub token enables discovery of models from research repositories and increases API rate limits.

Create token →

🌐 Vast.ai API Key

Not configured

Your Vast.ai API key is used to browse and rent GPUs from their marketplace.

Get your API key →

🎮 RunPod API Key

Not configured

Your RunPod API key is used to create and manage GPU pods.

Get your API key →

💡 Request a Cloud Provider

Want us to add support for another GPU cloud provider? Let us know! We're actively expanding our integrations.

✅ Vast.ai ✅ RunPod 🔜 Lambda Labs 🔜 CoreWeave
💡

API Keys Sync Across Platform

Configure your API keys once here, and they'll automatically populate in:

  • ☁️ Cloud Deploy - Full GPU browsing and management
  • Quick Deploy - Instant model training

Security: All keys are encrypted at rest and in transit. We never log or expose your API keys.

📚

Multi-Repository Support

EzEpoch searches multiple AI model repositories to give you access to the latest models:

  • 🤗 HuggingFace Hub - 500K+ models (requires token for gated models)
  • 🐙 GitHub Repositories - Research models and implementations (optional token)
  • 📄 Papers with Code - State-of-the-art research models (no token needed)
  • 🔥 PyTorch Hub - Native PyTorch models (no token needed)

AI Auto-Analysis: All discovered models get optimized settings generated automatically!

🤖 AI Model Library

Browse, search, and manage AI models from HuggingFace

🔍 100,000+ Models Available

Browse and search from the world's largest AI model library. Use filters to find models compatible with your GPU and training goals. Your favorites and recent models are automatically saved!

🔐 Repository Access Status

Connect to repositories to discover and access their models

🤗

HuggingFace Hub

500K+ models, gated access available

🔴 Not Connected
🐙

GitHub Repositories

Research models and implementations

🔴 Not Connected
📄

Papers with Code

State-of-the-art research models

🟢 Available
🔥

PyTorch Hub

Native PyTorch models

🟢 Available

📚 Browse Models

Select a repository and browse available models

Select a repository to view available models...

👤 User Profile

Manage your account information and preferences

🔐 Account Security

Update your profile information, manage your subscription, and configure notification preferences. All changes are saved automatically.

🚀

Founding Beta Program

Limited to 50 spots — refer friends who subscribe to keep your discount

Join the first builders shaping EzEpoch. Earn exclusive discounts by referring friends who actually subscribe — real users, real savings.

🔑 All tiers include access to 3 platforms: EzEpoch · MyEZSetup · DataLabPro
🏆 FOUNDING (1-25)
2 mo free + 30% off for life
2 paid referrals in 30 days
⚡ EARLY ACCESS (26-50)
1 mo free + 25% off 12 mo
1 paid referral in 30 days
🎯 ALL BETA SIGNUPS
25% off first 3 months
No referral needed
🎁 Bonus: every friend who subscribes through your link = +1 free month (up to 3)

Apply for Beta Access

Loading...

📝 Personal Information

📞 Contact Information

🤖 AI Monitoring Preferences

🔐 API Hash & Sessions

This hash links your training sessions to your account

📦 Build Package

Create a complete training package for cloud deployment

📦 One-Click Package Creation

Everything is ready! Just click "Create Package" and we'll generate a complete training package with all dependencies, scripts, monitoring tools, and your data folders. Ready for any cloud GPU platform!

📁 Package Contents

File Status Size Description
requirements.txt ⏳ Pending ~20KB Python dependencies (~22 selective packages, EzSetup validated)
main.py ✅ Complete ~5KB Primary training script with repo-specific model loading
all.env ✅ Complete ~600B API keys, model repos, and training configuration
setup.sh ✅ Complete ~750B Environment setup script with EzSetup ordering + venv creation
README.md ✅ Complete ~1KB Setup instructions and quick-start guide for RunPod/Vast.ai

🚀 Package Creation

☁️ Cloud GPU Deployment

Select a package and deploy to Vast.ai or RunPod with one click

🚀 One-Click Cloud Deployment

Connect your Vast.ai or RunPod account, browse available GPUs, and deploy your training package directly to the cloud.

📦
(Auto-set from package)
🌐

Vast.ai Auto-Deploy

Searches marketplace for best matching GPU

Strategy: Try up to 15 offers
Sorted by: Price (cheapest first)
🎮

RunPod Auto-Deploy

Creates pod with your exact specifications

💡 Higher CPU/RAM = faster data loading & preprocessing

☁️ Your Running Instances

⏱️
Just created an instance?
New instances take 15-30 seconds to appear. Please refresh if needed.
🔌
SSH Connection Info:
Cloud providers can take 30-90 seconds for SSH to be fully ready after "running" status. Our system automatically retries with intelligent backoff if the initial connection fails. This is normal!
Loading...
📋 Full Requirements Details
📦 Package: --
Model: --
Memory: --
GPUs: --
Cost: --

Connect your API keys above to browse available GPUs

📊 Training Dashboard

Monitor your active training sessions in real-time

📁 Your Training Sessions

💳 Subscription & Billing

Manage your plan, sessions, and billing

♾️ Unlimited Training - Netflix Model

All paid plans include unlimited training sessions. Train as much as you need - we don't host GPUs, you do. Our job is to make sure your training works the first time.

📊 Current Plan

Loading...
Plan: Loading...
Monthly Cost: Loading...
Sessions This Month: 0
Sessions Available: Loading...
Next Billing: Loading...

💰 Why EzEpoch Pays for Itself

Train unlimited with confidence - one saved failure pays for months of subscription

❌ Without EzEpoch

  • OOM errors waste $20-50+ in GPU time
  • Config mistakes require restart from scratch
  • Crashes lose hours of training progress
  • 40% of training jobs fail industry-wide

✅ With EzEpoch

  • AI calculates perfect settings every time
  • Crash prevention catches issues before failure
  • Auto-recovery resumes from checkpoints
  • 99% success rate - guaranteed

🚀 Available Plans

Choose the plan that fits your training needs

🆓 Free Trial
$0
✅ 1 free training session
✅ Up to 8B models
✅ Full platform access
✅ AI Guardian monitoring
🚀 BETA PRICING
⚡ Essential
$59/month (was $79.99)
♾️ UNLIMITED Training Sessions
✅ Up to 20B models
✅ AI Guardian Auto-Pilot
✅ Crash prevention & recovery
🎁 EzSetup Included FREE
✅ Email support

🛒 Add-On Products & One-Time Purchases

Enhance your workflow with standalone tools or buy sessions as needed

🎯 Single Session
$9.99/one-time
✅ 1 training session
✅ Up to 70B models
✅ Full AI monitoring
✅ No subscription required
💡 Perfect for one-off training
🎯 Large Model Session
$19.99/one-time
✅ 1 training session
✅ Up to 200B models
✅ Full AI monitoring
✅ No subscription required
💡 For the biggest models
📊 DataLab Pro
$9.99/month
✅ Smart data cleaning & chunking
✅ AI-powered training pair generation
✅ Model quantization (8-bit, 4-bit, GGUF)
✅ RAG database builder
✅ Format converter (JSON/JSONL/CSV)
✅ Quality analysis with before/after scores
✅ Image & audio linking for multimodal
✅ Windows/Mac/Linux desktop app
💡 FREE with Pro+
🔧 EzSetup v1.0
$4.99/month
✅ AI dependency analysis & resolution
✅ Auto-fix version conflicts
✅ Correct install order detection
✅ Platform-specific handling
✅ PyTorch/CUDA compatibility checks
✅ Manual install flagging (flash-attn, etc)
❌ No API access (basic)
💡 FREE with Essential+
🔧 EzSetup API v1.0
$9.99/month
✅ Everything in EzSetup
✅ Full REST API access
✅ CI/CD pipeline integration
✅ Automated dependency fixing
✅ Bulk requirements processing
💡 FREE with Pro

💡 We Want You To Succeed!

Unlimited training means unlimited support - we're here to help you every step of the way

🔄 Crash Recovery

AI Guardian automatically analyzes crashes and restarts training with corrected settings. Resumes from last checkpoint - no progress lost!

🔁 Unlimited Restarts

Training keeps going until YOU succeed. No limits on restart attempts - our goal is your successful model.

🎯 Direct Developer Support

Stuck? You get direct access to the developer who built this platform. I'll personally help debug your setup, review your settings, and guide you to success.

📊 Dashboard Control

Monitor live metrics, adjust settings on-the-fly, and let AI Guardian optimize every 100 steps automatically.

🚀 So Easy, Anyone Can Do It!

Industry Standard: 40% of AI training jobs fail. With EzEpoch: 99% success rate because our AI calculates perfect settings and auto-recovers from crashes.

👨‍💻 Built by One Developer. For Real People.

The story behind EzEpoch

🧑‍💻

Hi, I'm Wil Hurley

EzEpoch is a one-person operation. I designed, built, and maintain every line of code — the AI training engine, the crash prevention system, the dependency resolver, DataLab Pro, and everything in between.

I built EzEpoch because I was frustrated watching people waste hundreds of dollars on failed training runs due to wrong configs, OOM crashes, and dependency nightmares. There had to be a better way.

When you subscribe, you're not paying a faceless corporation — you're supporting an indie developer who personally reads every support message and ships updates weekly. I care about every single user's success because your success is my success.

🛠️
100,000+ lines of code
🤖
AI Guardian patent pending
📊
3 products — 1 developer
💬
Direct support from the creator

© 2025-2026 Wil Hurley / ARC Technologies LLC — Patent Pending

📚 EzEpoch Help & Guide

Step-by-step visual guide to every feature — from first login to live training

🚀 Quick Start 🔐 Login 🔑 API Keys & HF Token ⚙️ Training Config 🔬 Advanced / Expert 📋 Requirements 📦 Package Builder ☁️ Cloud Deploy 🧙 Deployment Wizard 📊 Dashboard 🤗 AI Models 📖 Settings Dictionary

🚀 Quick Start — 5 Steps to Your First Training Run

From zero to a deployable training package in minutes

1️⃣ Log In

Create your account or sign in. Your session is remembered so you stay logged in.

2️⃣ Add API Keys

Add your HuggingFace token (required), plus Vast.ai or RunPod key for cloud GPU access.

3️⃣ Pick a Model & GPU

Choose your model. GPUs with a ⭐ star have enough memory for full fine-tuning.

4️⃣ Configure & Generate

Review the Training Config tab. Click Create Package — EzEpoch resolves all conflicts automatically.

5️⃣ Deploy to Cloud

Use the Deployment Wizard to rent a GPU, upload your package, and start training — all from the browser.

🔐 Login & Account

Sign in to access your saved configs, API keys, and training history

Login page

What to know

  • Use your email + password to sign in.
  • Forgot password? Use the reset link on the login page.
  • All your API keys and training configs are saved to your account and encrypted at rest.
  • You can stay logged in across sessions — your data is always waiting for you.

🔑 API Keys & HuggingFace Token

Connect your cloud accounts — all keys are encrypted and stored securely

API Keys page HuggingFace token type selection

The 4 HuggingFace token permission levels

🤗 How to create a HuggingFace token (required)

  1. Go to huggingface.co/settings/tokens
  2. Click + Create new token
  3. Give it a name (e.g. "EzEpoch")
  4. Choose a permission level:
    Read — download public & gated models (recommended for most users)
    Write — upload fine-tuned models back to HF Hub
    Fine-grained — custom per-repo permissions
    Legacy — older token format (avoid for new tokens)
  5. Click Generate token and copy it (starts with hf_)
  6. Paste it into the HuggingFace Token field in EzEpoch and click Test & Save
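If you'd rather use the token from a script, the huggingface_hub client accepts it directly. A minimal sketch, assuming the token is exported as HF_TOKEN (the environment-variable name is our assumption, not an EzEpoch setting):

```python
import os

def hf_login_from_env():
    """Log in to the HuggingFace Hub using a token from the environment."""
    token = os.environ.get("HF_TOKEN", "")
    if not token.startswith("hf_"):
        # User access tokens always start with hf_
        raise ValueError("HF_TOKEN missing or malformed (tokens start with hf_)")
    # Imported lazily so the validation above works even without the package
    from huggingface_hub import login  # pip install huggingface_hub
    login(token=token)
```

A Read-level token is enough to download gated models; Write is needed only to push fine-tuned weights back to the Hub.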

Other API keys

  • Vast.ai — get your key from cloud.vast.ai/account
  • RunPod — get your key from runpod.io → Settings
  • GitHub — optional, for repo access. Use a token with repo scope.
  • All keys are encrypted before storage. Use 👁 to reveal, or Clear to remove.

⚙️ Training Configuration

Select your model, GPU, dataset, and method — EzEpoch auto-optimises the rest

Default view — clean and simple

Training config default view

All settings expanded for full control

Training config expanded view
⭐ GPU Star System

Stars appear on GPUs with enough VRAM for full fine-tuning with AdamW. No star = use LoRA/QLoRA.

🤖 Auto-Optimization

Incompatible settings are greyed out automatically. The system picks the best defaults for your model + GPU combo.

📁 Dataset Upload

Upload a JSONL file or paste a HuggingFace dataset path. The system validates format automatically.

🔢 Batch Size

Set manually or use Auto — the AI probes your GPU and finds the largest safe batch size.
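Auto mode probes the GPU at runtime. For intuition, a first-order offline estimate looks roughly like this; the per-token activation constant and 10% headroom are our assumptions, not EzEpoch's actual heuristic:

```python
GB = 1024 ** 3

def estimate_batch_size(vram_gb, params_b, seq_len=2048,
                        precision="bf16", optimizer="adamw",
                        act_bytes_per_token=800_000):
    """Rough largest batch size that fits a VRAM budget.

    params_b is model size in billions of parameters.
    act_bytes_per_token is a hand-tuned illustrative constant.
    """
    bytes_per_param = {"fp32": 4, "bf16": 2, "fp16": 2}[precision]
    weights = params_b * 1e9 * bytes_per_param
    grads = params_b * 1e9 * bytes_per_param
    # Mixed-precision AdamW keeps fp32 master weights + two moments (~12 B/param)
    opt = params_b * 1e9 * 12 if optimizer == "adamw" else 0
    free = vram_gb * GB * 0.9 - weights - grads - opt  # keep 10% headroom
    per_sample = seq_len * act_bytes_per_token          # activation cost per sample
    return max(int(free // per_sample), 0)
```

Under these assumptions a 1B model in bf16 fits a small batch on a 24 GB card, while a 7B full fine-tune with AdamW returns 0 even on 80 GB, which is why the star system points smaller GPUs at LoRA/QLoRA.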

🔬 Advanced & Expert Settings

Fine-tune precision, optimizer, scheduler, and low-level hyperparameters

Advanced tab — precision & optimizer settings

Advanced settings tab

Expert tab — scheduler, warmup, weight decay

Expert settings tab
🎯 Precision

bf16 (recommended on Ampere and newer GPUs), fp16, or fp32. BF16 halves memory versus FP32 and avoids FP16's overflow issues, with minimal impact on quality.

⚡ Optimizer

AdamW (default), 8-bit Adam (saves memory), Adafactor (very low memory), or SGD.

📉 Scheduler

Cosine, linear, constant, or warmup variants. Cosine with warmup is recommended for most fine-tuning.

💾 Gradient Checkpointing

Reduces VRAM by recomputing activations during backward pass. Slightly slower but essential for large models.
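Taken together, the four cards above map onto a handful of HuggingFace TrainingArguments fields. The values below are illustrative defaults for a sketch, not the ones EzEpoch computes for your setup:

```python
from transformers import TrainingArguments  # pip install transformers

args = TrainingArguments(
    output_dir="out",
    bf16=True,                      # Precision: bf16 on Ampere+ (fp16=True on older GPUs)
    optim="adamw_torch",            # Optimizer: or "adamw_bnb_8bit", "adafactor", "sgd"
    lr_scheduler_type="cosine",     # Scheduler: cosine decay
    warmup_ratio=0.03,              # Warmup: first 3% of steps
    weight_decay=0.01,
    gradient_checkpointing=True,    # Recompute activations to save VRAM
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # Effective batch per GPU = 4 x 8 = 32
    num_train_epochs=3,
)
```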

📋 Requirements Tab

Auto-generated requirements.txt — conflict-free and CUDA-aware

Requirements tab

What EzEpoch does for you

  • Generates a complete requirements.txt matched to your model + training method
  • Resolves version conflicts between torch, transformers, peft, and other packages
  • Handles CUDA 11.8 / 12.1 / CPU-only variants automatically
  • Flags packages requiring special install steps (flash-attn, xformers, bitsandbytes)
  • Add extra packages — the system validates them before including
  • All dependencies install in the correct order in your deploy script

📦 Package Builder

One click generates a complete, ready-to-run training package

Package builder

What's inside every package

  • train.py — full fine-tuning script with your exact config
  • requirements.txt — conflict-free, versioned dependencies
  • install.sh — installs packages in the correct order
  • run.sh — launch script with all env vars set
  • config.json — all training hyperparameters captured
✅ Conflict-free guarantee: Every package is validated before download. If there's a conflict, the system tells you exactly what to change.

☁️ Cloud Deployment Tab

Manage your Vast.ai / RunPod instances from inside EzEpoch

Cloud deployment tab

What you can do here

  • Browse available GPUs on Vast.ai and RunPod with live pricing
  • Filter by VRAM, cost/hr, location, and CUDA version
  • Rent a GPU instance directly — no separate website needed
  • See your active instances: status, cost, start time
  • SSH into a running instance for manual access
  • Destroy instances when training completes to stop billing

🧙 Deployment Wizard — 3-Step Auto-Deploy

Rent GPU → Upload package → Start training. Fully automated.

Step 1 — Select GPU & provider

Deployment wizard step 1

Step 2 — Review & confirm

Deployment wizard step 2

Step 3 — Launch & monitor

Deployment wizard step 3
Step 1 — Choose GPU

Pick from Vast.ai or RunPod. Filter by VRAM required, price per hour, and CUDA version.

Step 2 — Review & Confirm

See the full cost estimate, instance specs, and package contents before committing. Estimated training time shown.

Step 3 — Auto-Launch

EzEpoch rents the GPU, uploads your package, and starts training. Live progress streams to your Dashboard.

📊 Training Dashboard

Real-time training metrics, live logs, and GPU stats — embedded or standalone

Embedded dashboard — inside EzEpoch

App embedded dashboard

Standalone dashboard — dashboard.ezepoch.com

Standalone dashboard
📈 Live Loss Curve

Training and validation loss plotted in real time. Divergence is flagged automatically.

🖥️ GPU Stats

VRAM usage, GPU utilisation %, temperature, and power draw — all streaming live.

📜 Live Logs

Full training stdout/stderr streamed to the browser. Click any line to jump to that epoch.

🌐 Standalone

Visit dashboard.ezepoch.com to monitor from any device.

🤗 AI Models & Repos

Browse, search, and manage HuggingFace models & your saved repos

AI Models and Repos tab

What you can do here

  • Search HuggingFace Hub for models by name, task, or architecture
  • See model size, license, downloads, and VRAM requirements at a glance
  • Save favourite models to your account for quick access
  • Link HuggingFace repos to auto-push fine-tuned weights after training
  • View gated model access status — EzEpoch verifies that your HF token has the required permission

📖 Settings Dictionary

Quick reference for every training parameter

Core Settings
Epochs
Full passes through your training dataset
Batch size
Samples processed per gradient update step
Learning rate
Step size for weight updates (1e-4 to 1e-5 typical for fine-tuning)
Max seq len
Maximum token length per sample — longer = more VRAM
Grad accum
Accumulate gradients over N steps before updating — simulates a larger batch size
Training Methods
Full Fine-tune
All model weights updated. Best quality, needs most VRAM.
LoRA
Trains small adapters alongside a frozen base model. ~10-30% of full fine-tune VRAM.
QLoRA
LoRA on a 4-bit quantised base. Fine-tune 30B-class models on a single 24GB GPU.
DoRA
Direction + magnitude LoRA variant. Often better than standard LoRA.
Multi-GPU Strategies
DDP
Distributed Data Parallel. Each GPU holds a full model copy. Simplest setup, best throughput.
ZeRO-2
Shards optimizer states + gradients. Cuts VRAM ~50% vs DDP.
ZeRO-3
Also shards model weights. Required for very large models across GPUs.
FSDP
Fully Sharded Data Parallel, PyTorch's native sharding. Good for models too large for a single GPU.
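The ZeRO stages are usually configured through a DeepSpeed config file handed to the trainer. A minimal ZeRO-2 sketch (field names follow DeepSpeed's JSON schema; "auto" defers the value to the trainer, and the filename is our choice):

```python
import json

# ZeRO stage 2 shards optimizer states and gradients across GPUs;
# stage 3 would also shard the model weights themselves.
ds_config = {
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Written to disk so it can be passed as TrainingArguments(deepspeed="ds_zero2.json")
with open("ds_zero2.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Multi-GPU runs are then launched with a distributed launcher, e.g. `torchrun --nproc_per_node=4 train.py` (script name illustrative).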
LoRA Parameters
Rank (r)
Adapter size. Higher = more capacity but more VRAM. 8-64 typical.
Alpha
Scales the LoRA update. Usually set equal to rank or 2x rank.
Dropout
Regularisation for adapter layers. 0.05-0.1 typical.
Target modules
Which layers get adapters. EzEpoch auto-selects the optimal set.
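The four parameters above correspond one-to-one to fields on peft's LoraConfig. A hedged sketch; the target modules shown are typical for Llama-style attention layers, not EzEpoch's auto-selected set:

```python
from peft import LoraConfig  # pip install peft

lora = LoraConfig(
    r=16,                      # Rank: adapter capacity
    lora_alpha=32,             # Alpha: commonly rank or 2x rank
    lora_dropout=0.05,         # Dropout: light regularisation
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# Apply with: model = get_peft_model(base_model, lora)
```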