Show HN: InferX – an AI-native OS for running 50 LLMs per GPU with hot swapping

Hey folks, we’ve been building InferX, an AI-native runtime that snapshots the full GPU execution state of LLMs (weights, KV cache, CUDA context) and restores it in under 2s. This lets us hot-swap models like threads: no reloading, no cold starts.

We treat each model as a lightweight, resumable process, like an OS for LLM inference.
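
To give a feel for the programming model, here's a simplified Python sketch of what a snapshot/restore interface along these lines could look like. The names (SnapshotRuntime, GpuSnapshot, etc.) are illustrative only, not our actual API:

    # Hypothetical sketch only: names are illustrative, not the real InferX API.
    # The idea: a model's full GPU state (weights, KV cache, CUDA context) is
    # captured once, then restored on demand instead of being reloaded from disk.

    from dataclasses import dataclass

    @dataclass
    class GpuSnapshot:
        """Opaque handle to a captured GPU execution state."""
        model_id: str
        blob_ref: str  # e.g. pinned host memory or fast local storage

    class SnapshotRuntime:
        def __init__(self) -> None:
            self._snapshots: dict[str, GpuSnapshot] = {}
            self._resident: str | None = None  # model currently live on the GPU

        def snapshot(self, model_id: str) -> GpuSnapshot:
            # Capture the model's weights, KV cache, and CUDA context.
            snap = GpuSnapshot(model_id, blob_ref=f"/snapshots/{model_id}")
            self._snapshots[model_id] = snap
            return snap

        def restore(self, model_id: str) -> None:
            # Swap the saved state back onto the GPU (target: well under ~2s).
            if self._resident == model_id:
                return  # already hot, nothing to do
            assert model_id in self._snapshots, "snapshot() must run first"
            self._resident = model_id  # real work: copy blob back, rebind context

        def generate(self, model_id: str, prompt: str) -> str:
            self.restore(model_id)  # hot-swap if needed, no cold start
            return f"[{model_id}] response to: {prompt}"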

Why it matters:

• Run 50+ LLMs per GPU (7B–13B range)

• 90% GPU utilization (vs ~30–40% with conventional setups)

• Avoids cold starts by snapshotting and restoring directly on GPU

• Designed for agentic workflows, toolchains, and multi-tenant use cases (toy usage sketch after this list)

• Helpful for Codex CLI-style orchestration or bursty multi-model apps
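
Here's a toy request loop, continuing the illustrative sketch above, for the bursty multi-tenant case: the runtime only swaps GPU state when the requested model changes, so back-to-back calls to the same model stay warm.

    # Toy driver for bursty, multi-model traffic on one GPU (hypothetical names,
    # reusing the SnapshotRuntime sketch above).

    rt = SnapshotRuntime()
    for model in ("codellama-7b", "mistral-7b"):
        rt.snapshot(model)  # one-time capture per model

    requests = [
        ("tenant-a", "codellama-7b", "write a sed one-liner"),
        ("tenant-b", "mistral-7b",   "summarize this log"),
        ("tenant-a", "codellama-7b", "now explain it"),
    ]

    for tenant, model, prompt in requests:
        print(tenant, rt.generate(model, prompt))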

Still early, but we’re seeing strong interest from builders and infra folks. Would love thoughts, feedback, or edge cases you’d want to see tested.

Demo: https://inferx.net
X: @InferXai

3 points | by pveldandi 1 day ago

2 comments

  • sauravt 1 day ago
    Very interesting. How would memory (or previous chat context awareness) work in the case of hot swapping, when multiple users are hot-swapping models like threads?
  • precompute 14 hours ago
    Wow, that's really cool!