Running AI Locally: Lessons from The Coe Lab
Practical insights from running a self-hosted AI assistant, including hardware challenges, memory architecture, and when to choose local vs cloud AI.
After months of running OpenClaw as my personal AI assistant, I've learned a lot about what works—and what doesn't—when hosting AI infrastructure locally. Here are the key lessons.
Hardware Reality Check
The biggest misconception about local AI is that you can just run any model on any hardware. You can't. I learned this the hard way when I tried to run a 70B parameter model on my home server and watched it crash repeatedly.
The reality: Local inference requires serious hardware. For anything beyond small models (7B-13B parameters), you need:
- GPU with 24GB+ VRAM for decent-sized models
- 64GB+ system RAM as a fallback
- Fast NVMe storage for model loading
My solution? Use cloud models via Ollama Cloud. The local Ollama server acts as a relay, but the actual inference happens on cloud GPUs. This gives me access to powerful models like Qwen3.5 (397B parameters) without needing a $10,000 GPU setup.
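The nice part of this relay setup is that the client code doesn't change: requests go to the local Ollama endpoint either way, and the model name decides where inference actually runs. A minimal sketch against Ollama's standard /api/chat endpoint (the model name here is illustrative, not a specific recommendation):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # local relay endpoint

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an Ollama /api/chat request body. Whether inference runs
    locally or on Ollama Cloud depends only on the model name."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(model: str, prompt: str) -> str:
    """POST the request to the local Ollama server and return the reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Point the same `chat()` call at a small local model or a large cloud one; nothing else in the pipeline needs to know the difference.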
Memory Architecture Matters
An AI assistant without memory is just a chatbot. The key insight: memory should be external to the model. I use Neo4j as a graph database to store:
- People, projects, and relationships
- Goals, tasks, and their status
- Services, infrastructure, and configurations
- Events, notes, and decisions
This allows the AI to have contextual conversations. When I ask "How's the media server doing?", it can query Neo4j and tell me about Radarr, Sonarr, and any stalled downloads—not just give a generic response.
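The queries behind that kind of answer are plain Cypher. Here is a sketch of two of them, one to upsert a service node and one to find stalled downloads; the labels, relationship types, and properties are illustrative of the schema described above, not the exact production schema:

```python
def upsert_service(name: str, status: str) -> tuple[str, dict]:
    """Cypher to MERGE a service node and refresh its status.
    Returns the query plus its parameter map."""
    query = (
        "MERGE (s:Service {name: $name}) "
        "SET s.status = $status, s.updated = datetime() "
        "RETURN s"
    )
    return query, {"name": name, "status": status}

def stalled_downloads_query() -> str:
    """Cypher the assistant can run to answer 'how's the media server?'"""
    return (
        "MATCH (s:Service)-[:MANAGES]->(d:Download) "
        "WHERE d.status = 'stalled' "
        "RETURN s.name AS service, d.title AS title, d.age_hours AS age"
    )
```

Because the memory lives outside the model, the same graph survives model swaps: upgrade the LLM and the assistant still knows who Radarr is.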
When to Go Local vs Cloud
After experimenting with both, here's my framework:
Use Local Models When:
- Privacy is critical (sensitive data never leaves your network)
- Latency matters (no round-trip to cloud APIs)
- You're doing simple tasks (classification, extraction, basic Q&A)
- Cost is a concern (local inference has no per-token API fees)
Use Cloud Models When:
- You need advanced reasoning (complex problem-solving)
- Vision capabilities are required (image analysis)
- You want the latest models (cloud providers update frequently)
- Hardware is a limitation (most home setups)
The Hybrid Approach
My current setup uses both:
- Ministral 3 (8B) for heartbeats and simple cron jobs—fast, cheap, runs locally
- GLM-5 (cloud) for heavy tasks like blog writing and complex analysis
- Qwen3.5 (397B, cloud) as the primary model—handles vision, reasoning, and general tasks
This balances cost, performance, and capability. Simple tasks stay local; complex work goes to the cloud.
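The routing behind that split is simple enough to sketch. The model names mirror the setup above, but the task attributes and selection rules here are a simplified illustration, not the actual routing code:

```python
from dataclasses import dataclass

@dataclass
class Task:
    is_heartbeat: bool = False      # status pings, simple cron work
    is_heavy_writing: bool = False  # blog posts, complex analysis

def pick_model(task: Task) -> str:
    """Route a task to the cheapest model that can handle it."""
    if task.is_heartbeat:
        return "ministral-3"  # local: fast, cheap
    if task.is_heavy_writing:
        return "glm-5"        # cloud: heavy writing and analysis
    return "qwen3.5"          # cloud primary: vision, reasoning, general
```

The order of the checks encodes the cost preference: fall through to the expensive model only when nothing cheaper qualifies.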
Automation is Where It Shines
The real power of a self-hosted AI isn't chat—it's automation. OpenClaw runs scheduled jobs that:
- Post hourly status updates to Discord
- Monitor media downloads and clean up stalled torrents
- Check infrastructure health and reboot offline nodes
- Generate daily reports and flush memory before reboots
These aren't scripted commands—they're AI-driven decisions. The media manager, for example, analyzes torrent health, decides what's dead, removes it, and triggers re-searches. All autonomously.
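The "decide what's dead" step boils down to a judgment over a few signals per torrent. A simplified stand-in for that decision (the thresholds and field names are illustrative, not the production values):

```python
from dataclasses import dataclass

@dataclass
class Torrent:
    name: str
    progress: float       # completed fraction, 0.0-1.0
    hours_stalled: float  # hours since any new data arrived

# Illustrative thresholds: little progress plus a long stall means dead.
STALL_HOURS = 12
MIN_PROGRESS = 0.05

def is_dead(t: Torrent) -> bool:
    """Flag torrents that barely started and then stopped moving."""
    return t.hours_stalled >= STALL_HOURS and t.progress < MIN_PROGRESS

def triage(torrents: list[Torrent]) -> list[str]:
    """Return the names of torrents to remove and re-search."""
    return [t.name for t in torrents if is_dead(t)]
```

In the real system the AI supplies the judgment and this kind of logic supplies the guardrails, so a healthy-but-slow download never gets nuked by an overeager model.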
Lessons Learned
- Start simple. Don't try to build everything at once. Begin with one use case (like status updates) and expand.
- Invest in memory. An AI without context is frustrating; a graph database makes people, projects, and services queryable, which is what turns chat into meaningful conversation.
- Embrace hybrid. Local + cloud gives you the best of both worlds.
- Automation > Chat. The most valuable use case isn't answering questions—it's doing things autonomously.
- Monitor everything. If your AI assistant goes down, you want to know. Health checks and alerts are essential.
What's Next
The roadmap includes expanding automation (email triage, calendar management), improving memory (better entity extraction from conversations), and exploring multi-agent workflows (specialized AI assistants for different domains).
If you're building your own AI infrastructure, I'm happy to share what I've learned. Reach out via the contact page or connect on Discord.