Roadmap

Milestone 1 — Core Orchestration (MVP)

Status: Done

Deploy containers across multiple servers using a familiar YAML manifest.

Status: Done

Per-container health status, logs, and visibility from the CLI.

Agent monitors container health after deployment (running, exited, restarting)
Agent reports per-container status back to Engine via gRPC
banyan-cli status shows per-service and per-container status (not just aggregate)
CLI command to stream container logs from agents (via engine gRPC proxy)
Detect and surface failed containers (e.g. exited immediately after start)
banyan-cli down command to stop and remove all containers for a deployment
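
The per-service view described above can be derived from per-container states. The following is a minimal sketch, not Banyan's actual implementation: the health labels (`healthy`, `degraded`, `failed`), the function names, and the 5-second grace period are assumptions made for illustration; only the three container states come from the list above.

```python
from collections import Counter

# Container states listed in the roadmap; the constant names are hypothetical.
RUNNING, EXITED, RESTARTING = "running", "exited", "restarting"


def summarize_service(container_states):
    """Collapse per-container states into one service-level summary."""
    counts = Counter(container_states)
    total = len(container_states)
    if total == 0:
        health = "unknown"
    elif counts[RUNNING] == total:
        health = "healthy"
    elif counts[RUNNING] == 0:
        health = "failed"
    else:
        health = "degraded"
    return {"health": health, "running": counts[RUNNING], "total": total}


def failed_on_start(state, uptime_seconds, grace_seconds=5.0):
    """Flag a container that exited within a short grace period after start."""
    return state == EXITED and uptime_seconds < grace_seconds
```

Reporting both the summary and the raw per-container states lets the CLI show more than an aggregate, which is the goal stated for this milestone.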

Status: Done

Secure gRPC communication between CLI, Engine, and Agents.

All inter-component communication uses gRPC with password authentication
Agent → Engine: password in gRPC metadata on every call
CLI → Engine: password in gRPC metadata on every call
Engine → Agent: session token authentication for log streaming
Config file at /etc/banyan/banyan.yaml with sections: security, engine, agent, cli
The init commands for the engine, agent, and CLI each prompt for credentials and connection info
Three separate binaries: banyan-engine, banyan-agent, banyan-cli
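
Checking a password carried in call metadata might look like the sketch below. It is an illustration under stated assumptions: the metadata key `x-banyan-password` and the function name are hypothetical, and the roadmap does not say how the Engine actually compares credentials. Metadata is modeled as a list of (key, value) pairs, the shape gRPC servers commonly expose.

```python
import hmac

# Hypothetical metadata key; the roadmap does not name the real one.
PASSWORD_KEY = "x-banyan-password"


def is_authorized(metadata, expected_password):
    """Return True when the call metadata carries the expected password.

    `metadata` is an iterable of (key, value) string pairs.
    """
    for key, value in metadata:
        if key.lower() == PASSWORD_KEY:
            # compare_digest runs in time independent of where the strings
            # differ, which avoids leaking information through timing.
            return hmac.compare_digest(
                value.encode("utf-8"), expected_password.encode("utf-8")
            )
    return False
```

Because the password travels on every call, the check would typically sit in a server-side interceptor so that no handler can skip it.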

Collect and expose resource metrics from every node and container in Prometheus-compatible format.

Prometheus-compatible metrics: Expose a /metrics endpoint in the Prometheus text exposition format (no custom metric format)
Agent-side metric collection: CPU, memory, disk usage per container
Container-level metrics: per-container CPU%, memory usage, restart count
Node-level metrics: total CPU, memory, disk usage per agent
Service-level metrics: request throughput, error rate per service
CLI monitoring interface: Terminal UI dashboard, similar to htop/pm2, showing live metrics directly in the CLI (built with https://github.com/charmbracelet/bubbletea)
Metric storage in etcd for short-term retention
Metric retrieval API for other components to consume
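
To show what a /metrics response in the Prometheus text format could contain, here is a small rendering sketch. The metric name `banyan_container_memory_bytes` and the label names are assumptions, not names defined by this roadmap; a real agent would more likely use an existing Prometheus client library than hand-rolled formatting.

```python
def escape_label(value):
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge family; `samples` is a list of (labels_dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label(str(val))}"'
            for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_gauge(
        "banyan_container_memory_bytes",  # hypothetical metric name
        "Memory used by a container.",
        [({"container": "web-1", "node": "a1"}, 1048576)],
    ))
```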

Smarter task distribution based on node resources instead of simple round-robin.

Agent reports node resource usage (CPU, memory, disk) to Engine via etcd
Engine selects the node with the most available resources when scheduling new tasks
Resource requests in banyan.yaml: services can declare CPU and memory requirements (e.g. cpus: 2, memory: 4g)
Default resource requests: Services without explicit resource requirements get default values (512 MB RAM, 1 CPU), configurable via engine flags
Engine validates that target node has sufficient resources before assigning a task
Engine rejects deployments that exceed total cluster capacity
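
The scheduling rules above can be sketched as follows. This is an illustration, not the Engine's actual algorithm: the `Node` type, the function names, and the ranking by free memory then free CPU are assumptions, and "cluster capacity" is simplified here to the sum of free resources across nodes. The defaults (1 CPU, 512 MB) and the `4g` memory notation come from the list above.

```python
from dataclasses import dataclass

# Defaults stated in the roadmap for services without explicit requests.
DEFAULT_CPUS = 1.0
DEFAULT_MEMORY_MB = 512


@dataclass
class Node:
    name: str
    free_cpus: float
    free_memory_mb: int


def parse_memory_mb(text):
    """Convert strings like '4g' or '512m' to megabytes."""
    units = {"m": 1, "g": 1024}
    return int(float(text[:-1]) * units[text[-1].lower()])


def pick_node(nodes, cpus=DEFAULT_CPUS, memory_mb=DEFAULT_MEMORY_MB):
    """Return the fitting node with the most headroom, or None if none fits."""
    fitting = [
        n for n in nodes
        if n.free_cpus >= cpus and n.free_memory_mb >= memory_mb
    ]
    if not fitting:
        return None
    # One possible reading of "most available resources": memory, then CPU.
    return max(fitting, key=lambda n: (n.free_memory_mb, n.free_cpus))


def exceeds_cluster_capacity(nodes, requests):
    """True when summed (cpus, memory_mb) requests exceed total free resources."""
    need_cpus = sum(cpus for cpus, _ in requests)
    need_mem = sum(mem for _, mem in requests)
    have_cpus = sum(n.free_cpus for n in nodes)
    have_mem = sum(n.free_memory_mb for n in nodes)
    return need_cpus > have_cpus or need_mem > have_mem
```

For example, with nodes `a` (2 CPUs, 1024 MB free) and `b` (4 CPUs, 8192 MB free), a service requesting `cpus: 2, memory: 4g` fits only on `b`, and two such services together would still be within the cluster total.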

Multiple active engine nodes share workload for high availability and horizontal scaling.

Active-active engines: Any engine can handle CLI requests and schedule tasks
etcd coordination: Task claiming via Compare-And-Swap to prevent duplication
Distributed registry: Index-based lookup so agents pull images from the correct engine
Optimistic locking: Concurrent deployment updates are serialized
Session state in etcd: Agents can reconnect to any engine
Client load balancing: CLI connects to any available engine

See Multi-Engine HA Design for detailed architecture.
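
Task claiming via Compare-And-Swap can be illustrated with an in-memory stand-in for etcd. Everything named here is hypothetical: the `KVStore` class, the `put_if_absent` method, and the `/tasks/<id>/owner` key layout are assumptions for the sketch. With real etcd, the same effect would typically be achieved with a transaction that writes the key only if it does not yet exist.

```python
import threading


class KVStore:
    """In-memory stand-in for etcd exposing one atomic primitive."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, value):
        """Atomically write `value` only if `key` does not exist yet."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def get(self, key):
        with self._lock:
            return self._data.get(key)


def claim_task(store, task_id, engine_id):
    """Try to claim a task; at most one engine succeeds per task_id."""
    return store.put_if_absent(f"/tasks/{task_id}/owner", engine_id)
```

An engine that loses the claim simply moves on to the next task, so no task is scheduled twice even when several engines see it at the same time.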

Scale services based on metrics and support zero-downtime updates.

Auto-scaling: Define scaling rules in the manifest (min/max replicas, target thresholds)
Auto-scaling: Engine evaluates metrics against rules and adjusts replica count
Auto-scaling: Graceful scale-down (drain before stopping)
Redeployment: Rolling update when service image or config changes
Redeployment: Health check between rollout steps
Redeployment: Automatic rollback on failure
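
One way to evaluate a scaling rule is the proportional formula used by the Kubernetes Horizontal Pod Autoscaler, shown below as a sketch. The roadmap does not specify Banyan's formula, so treating it as proportional is an assumption, as are the function and parameter names.

```python
import math


def desired_replicas(current, metric_value, target_value,
                     min_replicas, max_replicas):
    """Scale the replica count in proportion to metric/target, then clamp."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current * metric_value / target_value)
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas at 90% CPU against a 60% target, the rule asks for 6 replicas; the min/max bounds from the manifest then cap the result in either direction.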

Web-based dashboard for cluster visualization and monitoring.

Note: The CLI monitoring interface (terminal UI) is delivered in Milestone 4.

Stronger authentication model for production environments.

Automatically redistribute services across nodes based on actual resource usage and node capacity.

Resource monitoring: Engine tracks actual CPU/memory usage per container (from metrics collected in Milestone 4)
Capacity detection: Identify over-utilized nodes (above 80% resource usage) and under-utilized nodes
Service migration: Gracefully move containers from crowded nodes to nodes with available capacity
Migration strategy: Drain-and-restart for stateless services (stop on source, start on destination)
Stateful handling: Exclude databases and stateful services from auto-migration (manual rebalancing only)
Threshold configuration: Configurable triggers (e.g., migrate when node >90% full or container is resource-starved)
Safety checks: Verify destination node has sufficient capacity before migration
Rollback support: Revert failed migrations back to original node
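
The detection, stateful exclusion, and safety check above could combine into a migration planner along these lines. This is a simplified sketch under stated assumptions: it considers memory only, uses a single threshold for both source and destination, and all type and function names (`Container`, `NodeLoad`, `plan_migrations`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    name: str
    memory_mb: int
    stateful: bool = False


@dataclass
class NodeLoad:
    name: str
    capacity_mb: int
    containers: list = field(default_factory=list)

    def used_mb(self):
        return sum(c.memory_mb for c in self.containers)


def plan_migrations(nodes, threshold=0.8):
    """Return (container, source, destination) moves for over-utilized nodes."""
    moves = []
    # Track planned usage so several moves cannot overfill one destination.
    planned = {n.name: n.used_mb() for n in nodes}
    for src in nodes:
        if src.used_mb() / src.capacity_mb <= threshold:
            continue
        # Smallest stateless containers move first; stateful ones stay put.
        movable = sorted(
            (c for c in src.containers if not c.stateful),
            key=lambda c: c.memory_mb,
        )
        for c in movable:
            if planned[src.name] / src.capacity_mb <= threshold:
                break
            # Safety check: the destination must stay under the threshold.
            candidates = [
                n for n in nodes
                if n is not src
                and (planned[n.name] + c.memory_mb) / n.capacity_mb <= threshold
            ]
            if not candidates:
                continue
            dst = min(candidates, key=lambda n: planned[n.name] / n.capacity_mb)
            planned[src.name] -= c.memory_mb
            planned[dst.name] += c.memory_mb
            moves.append((c.name, src.name, dst.name))
    return moves
```

The planner only produces a list of moves; executing each one as drain-and-restart and reverting on failure would be handled separately.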

This milestone enables the cluster to self-optimize: services that need more resources are automatically moved to nodes with spare capacity.

Deeper observability and richer operational tooling.