SYS_STATUS: OPERATIONAL · NEXUS: v0.2.0 NemoClaw Edition · INFERENCE: vLLM · Nemotron Nano 30B · STAGE: RTX 3090 Validation · PILOTS: 10 OPEN SLOTS
NEW — NEMOCLAW EDITION BETA · RTX 3090 VALIDATION ACTIVE

EVIOX NEXUS
Self-Driving
Supercomputer Layer

v0.2.0 · NemoClaw Edition · RTX 3090 stage active · DGX GA follows benchmark sign-off

Three NemoClaw-sandboxed OpenClaw agents — scheduler, healer, optimizer — running always-on inside OpenShell on your HPC cluster. Nemotron reasoning routed through the OpenShell gateway. Strict network policy enforced. No cluster data leaves the premises. No rip-and-replace.

+50%
GPU Utilisation Target
<4 min
Fault MTTR Target
35%
Energy-per-Token Reduction Target
3
Autonomous NemoClaw Agents
131K
Nemotron Context Window
NXS-001 · System Architecture
Seven layers.
One intelligent cluster.

Nexus wraps your existing Slurm, InfiniBand, and parallel storage with a NemoClaw agent layer. Each agent runs sandboxed inside OpenShell. Inference is intercepted and routed transparently — the agent never calls Nemotron directly. A minimal code sketch of the core event loop appears after the diagram.

OPERATOR INTERFACE · NXS-DASH · NXS-API
Nexus Dashboard · REST API · WebSocket · Prometheus · Grafana
Real-time event stream · live telemetry · /status · /agents · /events · /ws/events · DCGM dashboards
↕ /events/publish · HTTP POST/GET · WebSocket fan-out
NXS-CORE · FastAPI · asyncio event bus · NVML power applier
nexus-core v0.2.0 — Sandbox Manager · Event Bus · pynvml Cap Applier
Manages NemoClaw sandbox lifecycle via openclaw CLI · fans events to WebSocket clients · applies power caps from optimizer optimize_action events
↕ openclaw nemoclaw launch · sandbox lifecycle · stdout JSON event bridge
NemoClaw Sandbox
nexus-scheduler
OpenClaw agent · Slurm REST · Nemotron placement · 15s poll
NemoClaw Sandbox
nexus-healer
OpenClaw agent · DCGM/Prometheus · fault playbook · IPMI · 10s poll
NemoClaw Sandbox
nexus-optimizer
OpenClaw agent · NVML telemetry · workload classifier · power caps · 5s poll
↕ all inference calls intercepted · never leave sandbox directly
NVIDIA OpenShell Gateway · openclaw-sandbox.yaml enforced
OpenShell Gateway — Inference Router · Network Policy · Landlock LSM + seccomp Filesystem Isolation
Blocks all endpoints not in policy · operator approval via TUI for blocked requests · routes to configured provider — zero agent code change on profile switch
↕ routes to active inference provider
● ACTIVE · vllm profile
vLLM — RTX 3090
nemotron-3-nano-30b · host.openshell.internal:8000 · 131K ctx
○ DGX stage · nim-local
NVIDIA NIM Service
nemotron-3-super-120b · nim-service.local:8000 · 131K ctx
○ fallback · default
NVIDIA Cloud
build.nvidia.com · super-120b · API key required
↕ agents POST events via HTTP · allowed by network policy
Scheduler Target
Slurm REST API
slurmrestd :6820 · JWT
Healer Target
Prometheus + DCGM
:9090 · :9400 · ECC · temp
Optimizer Target
NVML / IPMI
pynvml · power caps · BMC
Storage
VAST · WEKA · GPFS
Lustre · NVMe · BeeGFS
Cluster Hardware
RTX 3090 (Stage 1) → DGX H100 / B200 / GB200 NVL72 (Stage 2) · InfiniBand NDR 400G · VAST · WEKA · GPFS · Lustre
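
To make the diagram concrete, here is a minimal sketch of the NXS-CORE layer: agents POST to /events/publish, WebSocket clients receive the fan-out, and optimize_action events become NVML power caps. This is an illustrative sketch, not the shipped nexus-core; the event fields (type, gpu_index, power_cap_w) are assumptions, and changing power limits requires root on the host.

    # Illustrative sketch of the nexus-core event bus and cap applier.
    # Event field names (type, gpu_index, power_cap_w) are assumptions.
    import pynvml
    from fastapi import FastAPI, WebSocket, WebSocketDisconnect

    app = FastAPI()
    clients = set()  # connected dashboard WebSocket clients

    pynvml.nvmlInit()  # needs the NVIDIA driver; cap changes need root

    @app.post("/events/publish")
    async def publish(event: dict):
        # Optimizer decisions arrive as optimize_action events; apply
        # the requested cap via NVML (which takes milliwatts).
        if event.get("type") == "optimize_action":
            handle = pynvml.nvmlDeviceGetHandleByIndex(event["gpu_index"])
            pynvml.nvmlDeviceSetPowerManagementLimit(
                handle, int(event["power_cap_w"]) * 1000)
        # Fan the event out to every connected dashboard client.
        for ws in list(clients):
            try:
                await ws.send_json(event)
            except Exception:
                clients.discard(ws)
        return {"ok": True}

    @app.websocket("/ws/events")
    async def ws_events(ws: WebSocket):
        await ws.accept()
        clients.add(ws)
        try:
            while True:
                await ws.receive_text()  # keep-alive; input is ignored
        except WebSocketDisconnect:
            clients.discard(ws)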
NXS-002 · NemoClaw Integration
Sandboxed agents.
Transparent inference.

NemoClaw is NVIDIA's OpenClaw plugin for OpenShell. It runs OpenClaw agents inside isolated sandboxes with NVIDIA inference routing, strict network policy, and operator-controlled egress approval. A sketch of the agent-side call path follows the flow below.

AGENT (sandbox)
OpenClaw Agent
nexus-scheduler / healer / optimizer · makes inference call with cluster context prompt
OPENSHELL GATEWAY
Policy Enforcement
openclaw-sandbox.yaml · blocks unlisted endpoints · logs all attempts for audit
INFERENCE PROVIDER
Nemotron
vLLM local (RTX) / NIM (DGX) / NVIDIA cloud · 131K context window · zero code change on switch
EVENT PUBLISH
nexus-core bus
HTTP POST /events/publish · allowed by policy · WebSocket fan-out to dashboard
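
A sketch of one agent iteration along this path, assuming only the endpoints in the policy table below. The prompt text and event fields are illustrative, not the agents' actual schema.

    # Sketch of one sandboxed-agent iteration: inference through the
    # gateway endpoint, result published to the nexus-core event bus.
    import requests

    GATEWAY = "http://host.openshell.internal:8000"  # allowed by policy
    NEXUS_CORE = "http://nexus-core:8000"            # allowed by policy

    def reason(prompt: str) -> str:
        # vLLM behind the gateway speaks the OpenAI-compatible API.
        resp = requests.post(
            f"{GATEWAY}/v1/chat/completions",
            json={"model": "nvidia/nemotron-3-nano-30b-a3b",
                  "messages": [{"role": "user", "content": prompt}]},
            timeout=60)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    def publish(event: dict) -> None:
        requests.post(f"{NEXUS_CORE}/events/publish", json=event, timeout=5)

    decision = reason("Pending jobs and cluster state: ... Propose placements.")
    publish({"agent": "nexus-scheduler", "type": "placement_decision",
             "detail": decision})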
Inference Profiles
PROFILE | STAGE | MODEL | ENDPOINT | CONTEXT | STATUS
vllm | RTX 3090 | nvidia/nemotron-3-nano-30b-a3b | host.openshell.internal:8000 | 131,072 | ● ACTIVE
nim-local | DGX Stage | nvidia/nemotron-3-super-120b-a12b | nim-service.local:8000 | 131,072 | ○ NEXT
default | Fallback | nvidia/nemotron-3-super-120b-a12b | integrate.api.nvidia.com | 131,072 | ○ CLOUD
Switch at runtime (no restart): openshell inference set --provider nim-local --model nvidia/nemotron-3-super-120b-a12b
Network Policy · openclaw-sandbox.yaml
POLICY | ENDPOINTS ALLOWED | BINARIES | METHODS
vllm_inference | host.openshell.internal:8000 | openclaw | POST /v1/chat/completions
nexus_core | nexus-core:8000 · localhost:8000 | openclaw · python3 | GET /health · POST /events/*
slurm_rest | localhost:6820 · host.docker.internal:6820 | openclaw · python3 | GET · POST /slurm/*
prometheus | prometheus:9090 · localhost:9090 | openclaw · python3 | GET /api/v1/*
dcgm_exporter | localhost:9400 · host.docker.internal:9400 | openclaw · python3 | GET /metrics
default_action: BLOCK — all other endpoints blocked · surfaced in OpenShell TUI for operator approval · approved endpoints persist for session only
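
Conceptually, enforcement is an allow-list lookup with a BLOCK default. The sketch below models that semantics only; the real gateway, policy schema, and TUI approval flow are NemoClaw internals.

    # Conceptual model of policy enforcement: allow a request only if
    # some policy matches endpoint, binary, and method; else BLOCK.
    POLICIES = {
        "vllm_inference": {
            "endpoints": {"host.openshell.internal:8000"},
            "binaries": {"openclaw"},
            "methods": {"POST /v1/chat/completions"}},
        "nexus_core": {
            "endpoints": {"nexus-core:8000", "localhost:8000"},
            "binaries": {"openclaw", "python3"},
            "methods": {"GET /health", "POST /events/*"}},
    }

    def allowed(endpoint: str, binary: str, method: str) -> bool:
        for p in POLICIES.values():
            if (endpoint in p["endpoints"] and binary in p["binaries"]
                    and any(method == m or (m.endswith("*")
                            and method.startswith(m[:-1]))
                            for m in p["methods"])):
                return True
        return False  # default_action: BLOCK, surfaced in the TUI

    assert allowed("nexus-core:8000", "openclaw", "POST /events/publish")
    assert not allowed("api.example.com:443", "openclaw", "GET /")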
NXS-003 · Agent Capabilities
Three agents.
One autonomous system.

Each agent runs as a sandboxed OpenClaw process, governed by its system prompt and the OpenShell network policy. Behaviour is transparent and auditable — every decision is logged to the Nexus event bus.

NXS-003.1 · NemoClaw Sandbox
Scheduler Agent — nexus-scheduler
● TESTING
Polls Slurm REST API every 15 seconds. Classifies pending jobs by urgency and GPU requirements. Submits structured placement reasoning to Nemotron via the OpenShell gateway. Applies priority boosts to starved jobs (wait >300s). Heuristic greedy best-fit fallback when Nemotron is unavailable — zero-downtime scheduling guaranteed. A sketch of the fallback appears below.
UTILISATION TARGET: +30–50% vs vanilla Slurm
STARVATION THRESHOLD: >300s wait time
STACK: Slurm · slurmrestd · OpenShell · Nemotron
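
A minimal sketch of that fallback, using hypothetical job and node shapes: boost anything waiting past the starvation threshold, then greedily best-fit each job onto the tightest node that still satisfies it.

    # Sketch of the heuristic fallback scheduler. Job and node dict
    # shapes are hypothetical, not the Slurm REST payloads.
    STARVATION_S = 300

    def plan(jobs, nodes):
        # Starved jobs (wait > 300s) jump the queue, then by priority.
        jobs = sorted(jobs, key=lambda j: (j["wait_s"] <= STARVATION_S,
                                           -j["priority"]))
        placements = []
        for job in jobs:
            fits = [n for n in nodes if n["free_gpus"] >= job["gpus"]]
            if not fits:
                continue  # stays pending until the next 15s poll
            node = min(fits, key=lambda n: n["free_gpus"])  # best fit
            node["free_gpus"] -= job["gpus"]
            placements.append((job["id"], node["name"]))
        return placements

    print(plan(
        [{"id": "j1", "gpus": 2, "wait_s": 420, "priority": 1},
         {"id": "j2", "gpus": 4, "wait_s": 30, "priority": 9}],
        [{"name": "gpu01", "free_gpus": 4},
         {"name": "gpu02", "free_gpus": 2}]))
    # j1 is starved, so it places first: [('j1', 'gpu02'), ('j2', 'gpu01')]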
NXS-003.2 · NemoClaw Sandbox
Fault Healer Agent — nexus-healer
● TESTING
Monitors DCGM + Prometheus every 10 seconds. Detects ECC storms (≥5 errors), GPU overheat (≥88°C), node unreachable, and GPU hang. Executes a 4-step remediation playbook: migrate jobs → drain node → IPMI power cycle (severe faults only) → verify recovery. Undrains and records MTTR when health checks pass. A sketch of the playbook appears below.
MTTR TARGET: <4 minutes
FAULT TYPES: ECC · overheat · hang · unreachable
STACK: DCGM · Prometheus · IPMI · slurmrestd
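
The playbook reads as a straight-line sequence with a severity gate before the power cycle. A sketch follows, with the Slurm, IPMI, and health-check helpers stubbed out as hypothetical stand-ins.

    # Sketch of the 4-step remediation playbook. The helpers below are
    # hypothetical stand-ins for slurmrestd, ipmitool, and health checks.
    import time

    def migrate_jobs(node): ...        # requeue work off the node
    def drain_node(node): ...          # drain via Slurm REST
    def ipmi_power_cycle(node): ...    # BMC power cycle
    def undrain_node(node): ...        # return node to service
    def health_checks_pass(node): return True
    def publish(event): ...            # POST to /events/publish

    def heal(node: str, severity: str) -> float:
        t0 = time.monotonic()
        migrate_jobs(node)                   # step 1
        drain_node(node)                     # step 2
        if severity == "severe":             # step 3: severe faults only
            ipmi_power_cycle(node)
        while not health_checks_pass(node):  # step 4: verify recovery
            time.sleep(10)                   # matches the 10s poll
        undrain_node(node)
        mttr = time.monotonic() - t0
        publish({"agent": "nexus-healer", "type": "fault_resolved",
                 "node": node, "mttr_s": round(mttr, 1)})
        return mttr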
NXS-003.3 · NemoClaw Sandbox
Power Optimizer Agent — nexus-optimizer
● TESTING
Monitors GPU telemetry every 5 seconds. Classifies workload type from DCGM metrics and applies the appropriate power cap: 320W for training, 250W for inference, 150W for idle. Thermal derate of −10W per degree above 85°C. Emergency 300W cap on spike detection. Publishes optimize_action events — nexus-core applies pynvml caps on the host. A sketch of the cap computation appears below.
ENERGY TARGET: 20–35% lower energy-per-token
CAPS: 320W train · 250W infer · 150W idle
STACK: DCGM · NVML · pynvml · OpenShell
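
The cap arithmetic is simple enough to state directly. A minimal sketch, assuming a workload class and temperature already derived from DCGM metrics; the classifier itself is the agent's own logic.

    # Sketch of the cap computation: base cap per workload class,
    # thermal derate of 10 W per °C above 85 °C, 300 W emergency clamp.
    BASE_CAP_W = {"training": 320, "inference": 250, "idle": 150}
    EMERGENCY_CAP_W = 300

    def choose_cap(workload: str, temp_c: float, spike: bool) -> int:
        cap = BASE_CAP_W[workload]
        if temp_c > 85:
            cap -= 10 * int(temp_c - 85)     # thermal derate
        if spike:
            cap = min(cap, EMERGENCY_CAP_W)  # emergency clamp
        return cap

    # The agent publishes an optimize_action event with this value;
    # nexus-core applies it on the host via pynvml (milliwatts):
    #   pynvml.nvmlDeviceSetPowerManagementLimit(handle, cap_w * 1000)
    print(choose_cap("training", temp_c=88.0, spike=False))  # 290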
NXS-003.4 · Planned
Experiment Designer Agent
○ PLANNED
Goal-driven autonomous experiment iteration. Provide a research objective — "optimize seismic RTM parameters" or "find best nf-core pipeline for this WGS cohort" — and the agent spins parallel Slurm job arrays, analyzes results with Nemotron, and iterates fully autonomously. MLflow experiment tracking. Targets genomics and O&G seismic verticals.
VERTICALS: Genomics · Seismic · AI/ML training
STACK: Nextflow · nf-core · MLflow · Slurm arrays
NXS-004 · Performance Targets
Five metrics.
All under validation.

Active validation is running on an internal RTX 3090 cluster. All figures are targets. Full benchmark methodology and results will be published at eviox.tech/nexus when validation completes in Q2 2026.

GPU_UTILISATION_DELTA
+50%
higher utilisation vs vanilla Slurm
METHOD: DCGM avg · 72h window · same workload mix
ENERGY_PER_TOKEN_REDUCTION
−35%
lower than unconstrained TDP · inference serving
METHOD: GPU watt-hours / tokens generated · 6h window · worked sketch below
FAULT_MTTR_TARGET
<4 min
mean time to recovery · 20 fault injection runs
METHOD: fault detect → node re-enabled in Slurm
SCHEDULER_POLL_LATENCY
<500ms
poll-to-action roundtrip including Nemotron
METHOD: timestamp delta · publish to action confirm
API_HEALTH_LATENCY
<50ms
nexus-core /health endpoint p99
METHOD: k6 load test · 100 concurrent connections
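
For reference, the energy-per-token method reduces to integrating sampled GPU power over the window and dividing by tokens generated. The numbers below are hypothetical placeholders, not benchmark results.

    # Worked sketch of energy-per-token: integrate power samples into
    # watt-hours, divide by tokens. Values are hypothetical.
    samples_w = [310, 295, 305, 300]  # power draw, one sample per minute
    interval_h = 1 / 60               # sampling interval in hours
    tokens = 120_000                  # tokens generated in the window

    watt_hours = sum(samples_w) * interval_h              # ≈ 20.17 Wh
    print(f"{watt_hours / tokens * 1000:.3f} mWh/token")  # ≈ 0.168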
NXS-005 · Deployment Roadmap
RTX 3090 first.
DGX when the numbers hold.

Nexus is validated on RTX 3090 before any customer deployment. DGX migration is triggered by benchmark sign-off — it requires only a profile switch, no code changes.

● ACTIVE NOW
Stage 1 — RTX 3090 Internal Validation
Internal Eviox cluster. Full NemoClaw sandbox stack running against real genomics and ML workloads. vLLM serving nemotron-nano-30b on the same GPU node. OpenShell network policy enforced. Results ETA Q2 2026.
HARDWARE: RTX 3090 · 24 GB VRAM
AGENT RUNTIME: NemoClaw + OpenShell
INFERENCE: vLLM · nemotron-nano-30b
STORAGE: GPFS + NVMe local
NETWORKING: 10 GbE (IB NDR in Stage 2)
RESULTS ETA: Q2 2026
NXS-006 · Installation Guide
Ten steps.
Self-driving cluster.

Prerequisites: Ubuntu 22.04 LTS · Docker 24+ · NVIDIA Container Toolkit · DCGM exporter on host (:9400) · Prometheus node exporter (:9100) · Slurm 23.x with slurmrestd (optional — mock data used without it).

01
Clone the Nexus repository
Clone the Eviox Nexus repository from GitHub into your working directory and navigate into it.
github.com/tushu1232/nexus
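For example:
git clone https://github.com/tushu1232/nexus.git && cd nexus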
02
Install NVIDIA OpenShell
Clone the NVIDIA OpenShell repository and run the installer script. Confirm OpenShell v0.1.0 or later is present and the CLI is accessible from your PATH.
github.com/NVIDIA/OpenShell
03
Install NemoClaw
Clone the NVIDIA NemoClaw repository and run its installer script. Verify that both the OpenClaw CLI and the NemoClaw plugin are present and correctly registered.
github.com/NVIDIA/NemoClaw
04
Configure environment variables
Copy the provided environment template and fill in your values: the inference profile (vllm for RTX 3090), a Hugging Face token for the Nemotron model download, your Slurm JWT token, and IPMI credentials for hardware fault remediation.
05
Apply NemoClaw network policy
Apply the Nexus network policy file using the OpenShell CLI. This enforces strict egress rules: only the vLLM inference endpoint, Nexus Core event bus, Slurm REST API, Prometheus, and DCGM exporter are permitted. All other connections are blocked by default.
nemoclaw-policies/openclaw-sandbox.yaml
06
Start the infrastructure stack
Start the full Docker Compose stack. This brings up Nexus Core, the vLLM inference service, Prometheus, Grafana, and the Nginx-served operator dashboard simultaneously.
deploy/docker-compose.yml
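For example, using the compose file above:
docker compose -f deploy/docker-compose.yml up -d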
07
Wait for the Nemotron model to load
On first run, vLLM downloads the Nemotron model weights — approximately 5 to 15 minutes depending on your connection. Monitor the vLLM container logs and wait until the server reports it has started successfully, then verify the health endpoint on port 8001.
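For example (the vllm service name is an assumption about the compose stack):
docker compose -f deploy/docker-compose.yml logs -f vllm
curl http://localhost:8001/health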
08
Point the OpenShell gateway at vLLM
Configure the OpenShell inference gateway to route all agent inference calls to the local vLLM service running Nemotron Nano 30B. This can be switched to NIM or NVIDIA cloud at any time without restarting agent sandboxes.
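Mirroring the profile-switch command shown in NXS-002, this is likely:
openshell inference set --provider vllm --model nvidia/nemotron-3-nano-30b-a3b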
09
Launch the three NemoClaw agent sandboxes
Launch each Nexus agent — scheduler, healer, and optimizer — as an isolated NemoClaw sandbox using the vllm inference profile. Each sandbox starts an OpenClaw agent governed by its system prompt and the network policy applied in step 5. Alternatively, run the setup script to automate steps 2 through 9.
bash setup.sh — automates steps 2–9
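A per-agent launch might look like the following; the exact flags are illustrative, and setup.sh wraps the real invocation:
openclaw nemoclaw launch nexus-scheduler --profile vllm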
10
Verify the system and access the dashboard
Confirm all three agent sandboxes are running and healthy via the Nexus Core status API. Open the operator dashboard on port 8080 for live telemetry, event log, and agent health. Grafana is available on port 3000 for full DCGM and power metrics. Mock telemetry streams automatically until a live Slurm JWT token is configured.
Dashboard :8080 · API :8000 · Grafana :3000
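For example, against the status API:
curl http://localhost:8000/status
curl http://localhost:8000/agents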
NXS-007 · Technology Stack
Open. No lock-in.
Drops in on existing infrastructure.

Built on NVIDIA NemoClaw + OpenShell with vLLM inference. Runs alongside your existing Slurm scheduler, InfiniBand fabric, and parallel storage. DGX migration is one openshell inference set command.

NVIDIA NemoClaw · NVIDIA OpenShell · Nemotron nano-30b / super-120b · vLLM · NVIDIA NIM · NVIDIA DCGM · GPUDirect RDMA
Slurm 23.x / LSF / Kubernetes · slurmrestd · JWT auth · FastAPI · asyncio · WebSocket · pynvml · IPMI · ipmitool · Prometheus · Grafana · Landlock LSM · seccomp
VAST Data · WEKA · IBM GPFS / Spectrum Scale · Lustre · BeeGFS · NVMe-oF · NFS/RDMA · InfiniBand NDR 400G · RoCEv2 / RDMA
AWS ParallelCluster · EFA · Ansible · Terraform · Warewulf · Nextflow · nf-core · MLflow Experiment Tracking
Air-gap · On-prem Secure Mode · RBAC Agent Permissions · Full Audit Logs
NXS-008 · Pricing
Usage-based.
No annual commitments.

Billed only when NemoClaw agents are active. No idle charges. No upfront commitment. INR invoicing with GST compliance for India-based customers.

NEXUS_CORE
$0.015
per GPU-hour · agents-active only
  • Scheduler + fault healer agents
  • NemoClaw vllm inference profile
  • DCGM / Prometheus integration
  • Any Slurm / K8s cluster
  • Community support channel
  • OpenShell network policy included
MANAGED_BUNDLE
+Nexus
add to any Eviox cluster contract
  • 20% off hardware OpEx
  • Bundled NemoClaw agent runtime
  • DGX-ready NIM config out of box
  • Onboarding + migration support
  • Existing customer priority access
PILOT OFFER: First 10 pilots → 90 days free NemoClaw agent runtime · 64-GPU proof-of-concept deployed in under 48 hours · Beta open for genomics · seismic · AI/ML · research workloads · contact@eviox.tech

Ready to make your cluster
self-driving?

RTX 3090 benchmark results published Q2 2026. DGX NIM GA follows benchmark sign-off. NemoClaw agent sandboxes running in under 48 hours on your existing infrastructure.

contact@eviox.tech · +91 862 493 5477 · Main Site →
EVIOX TECH · HPC DIVISION · PUNE, INDIA · ALL TIMEZONES · RESPONSE WITHIN 1 BUSINESS DAY