Bruno Marcuche

Site Reliability Engineer, AIOps

Boulder, CO 80301 · 561-284-2441 · bmarcuche@gmail.com · linkedin.com/in/bruno-marcuche · resume.mindtunnel.org · github.com/bmarcuche

Summary

Site Reliability Engineer and technical leader who builds the systems other teams run on. I've scaled infrastructure across on-prem, hybrid, and cloud, and automated deployment and operations for thousands of Linux and Windows instances. Most recently I architected an internal AI agent platform, a custom semantic router orchestrating 16 specialized LLM agents, that has handled 10,000+ ops and engineering tasks and cut change lead time by ~89%. I lead ops and SRE teams, drive observability with OpenTelemetry and PagerDuty, and turn slow, manual operations into fast, repeatable automation. Open to both leadership and senior IC roles.

Professional Experience

03/2025 to present
Berwyn, PA

Operations Team Lead

AssetWorks

  • Architected and built an internal AI agent platform on the Model Context Protocol: 16 specialized LLM agents coordinated by a custom semantic router (fine-tuned sentence-transformer embeddings with pgvector knowledge retrieval). It handled 10,000+ routed ops and engineering tasks in its first 12 weeks across deployment, cloud, CI/CD, and incident response.
  • Cut FA-EAM change lead time (DORA lead time for changes) by ~89% after launch. Customer upgrades dropped from ~27 days to under 3 days and provisioning of new environments from ~12 days to ~1.5 days, with lead time declining every month after go-live.
  • Delivered 235 Ansible and CI/CD pipelines built by the agents and automated 213 customer upgrade deployments, eliminating ~426 hours of manual deploy work, across a fleet of 350+ servers serving 150+ government clients at 99.99% uptime.
  • Designed multi-agent incident workflows that investigate and remediate across the fleet in a single session. They caught a bug that was deleting configuration on 32 servers and identified the root cause of a Windows Update and API regression that would otherwise have taken hours of manual log correlation.
  • Lead a team of five and mentor engineers through 1:1s, training, and knowledge sharing; drove an observability rollout (OpenTelemetry and Observe) to reduce MTTR.

12/2022 to 01/2025
Boulder, CO

Backend Developer, Founder

EdventureTrek

  • Founded and led development of an educational game focused on outdoor exploration and biodiversity.
  • Designed custom taxonomy GPTs for plant and animal classification.
  • Built Python/FastAPI backend with MySQL and event logging.
  • Managed CI/CD on GCP with GitHub Actions and internal tooling.

07/2022 to 12/2022
Atlanta, GA

Site Reliability Engineering Manager

AnswerRocket

  • Led remote SRE team (4 reports); ran weekly syncs and architecture reviews.
  • Expanded Ansible coverage across AWS, cutting manual deploy time by 15%.
  • Supported SOC 2 audit by automating cloud environment validation.

03/2016 to 07/2022
Alpharetta, GA

Site Reliability Architect

OfficeSpace Software

  • Led SRE hiring, onboarding, and 1:1s for a 3-person team.
  • Built Slackbot enabling teams to deploy customer instances in under 10 minutes.
  • Reduced deploy times over 60% via CI pipeline (CircleCI, Puppet, Docker, Terraform).
  • Migrated infrastructure from Rackspace to GCP, saving $60K annually.
  • Owned production and staging infrastructure on GCP; managed OS patching, config management, release packaging, and automation with Puppet and Python.

2009 to 2016
São Paulo, Brazil

Sr. Technical Consultant / Team Lead

Hewlett Packard

  • Delivered Tier 3 support for HP Server Automation; mentored junior engineers.
  • Automated workflows with Python scripts built on the HPSA API.
  • Ranked #1 on the team for customer satisfaction.

Education

10/2001 to 12/2004
Ft. Lauderdale, Florida

Information Systems | Bachelor's Degree

ITT Technical Institute

(Honors Graduate)

Strengths

Leadership · AI-Led Ops · Cloud Infrastructure · Automation

Key Skills

Cloud & Automation

GCP · Azure Cloud · Cloud Run · Terraform · Ansible · Docker

CI/CD & Delivery

GitHub Actions · Azure DevOps · Jenkins

Observability

OpenTelemetry · Prometheus · PagerDuty · Observe

AI & Agents

LLM Agents · Model Context Protocol · pgvector · Hugging Face · Claude

Development & Data

Python · Redis · Postgres

Systems & Serving

Linux · Nginx

Hobbies

  • Exploring AI tooling (LLMs, MCP)
  • SRE meetups
  • Home lab experimentation with Docker

Volunteering

07/2020 to 11/2024
Boulder, Colorado, USA

Delivery Driver / Wellness Check

Meals on Wheels Boulder

As a Meals on Wheels delivery driver, I got to enjoy great conversations with some of Boulder's greatest citizens.

mowboulder.org

// STRENGTHS

Core Strengths

The areas I lead with, from incident command to building the automation other teams run on.

Leadership
AI-Led Ops
Cloud Infrastructure
Automation

// SELECTED WORK

Things I've Built

A few systems I designed and shipped end to end, from semantic routing for AI agents to fleet discovery and just-in-time access.

Architect & Developer · 12/2024 to present

Internal AI Agent Platform

Semantic Router & Gateway

A self-hosted multi-agent orchestrator with a custom two-stage semantic router, built as an advanced routing layer on top of Amazon Kiro's multi-agent architecture. It classifies every request in under 100ms and dispatches to the right specialist agent across 19 domain experts.

<100ms to classify & route
0.81 top-1 routing accuracy
19 specialist agents
3,400+ prompts in 12 weeks
  • A pre-classification gateway intercepts every prompt before it reaches the LLM and injects tier decisions, retrieved knowledge, and entity context, so 95%+ of requests route with no LLM reasoning.
  • A two-stage neural pipeline: a fine-tuned bi-encoder retrieves candidate patterns from a pgvector store, then a fine-tuned cross-encoder reranks them for the final classification, all on CPU.
  • A closed-loop learning system captures real routing outcomes and human corrections as gold labels and gates every retrain on holdout accuracy. Fallback to LLM reasoning dropped from 22% to under 5% over 5 retrain iterations.
  • Routes across 19 specialist domains with 268 scoped tools (ops, infrastructure, pipelines, credentials, databases, monitoring, and more), using a keyword fast-path for deterministic patterns and the neural pipeline for ambiguous queries.
Python · Rust (candle) · pgvector · PostgreSQL · sentence-transformers · Kiro CLI · MCP · systemd · WinRM · Azure DevOps · Ansible
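
The routing flow described above (keyword fast-path, then retrieval, then reranking, with fallback to LLM reasoning) can be sketched roughly as follows. This is a minimal illustration, not the production code: the patterns, agent names, and the `min_score` threshold are hypothetical, a bag-of-words cosine stands in for the fine-tuned bi-encoder, a Jaccard overlap stands in for the fine-tuned cross-encoder, and an in-memory list stands in for the pgvector store.

```python
import math
import re
from collections import Counter

# Hypothetical keyword fast-path: deterministic patterns map straight to an agent.
FAST_PATH = {
    r"\brestart\b.*\bservice\b": "ops",
    r"\brotate\b.*\bcredential": "credentials",
}

# Hypothetical pattern store (stand-in for the pgvector table): (example prompt, agent).
PATTERNS = [
    ("deploy the latest build to a customer environment", "pipelines"),
    ("why is the database query slow", "databases"),
    ("alert fired for high cpu on a server", "monitoring"),
    ("provision a new virtual machine", "infrastructure"),
]


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy bag-of-words vector; a stand-in for the bi-encoder embedding."""
    return Counter(tokens(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rerank_score(query, candidate):
    """Toy Jaccard overlap; a stand-in for the cross-encoder score."""
    q, c = set(tokens(query)), set(tokens(candidate))
    return len(q & c) / len(q | c) if q | c else 0.0


def route(prompt, top_k=2, min_score=0.2):
    """Return (agent, path); agent is None when the router defers to LLM reasoning."""
    # Fast-path: deterministic keyword patterns skip the neural stages entirely.
    for pattern, agent in FAST_PATH.items():
        if re.search(pattern, prompt.lower()):
            return agent, "fast-path"
    # Stage 1: retrieve the top_k most similar stored patterns.
    query_vec = embed(prompt)
    candidates = sorted(
        PATTERNS, key=lambda p: cosine(query_vec, embed(p[0])), reverse=True
    )[:top_k]
    # Stage 2: rerank the candidates and keep the best one.
    best_text, best_agent = max(candidates, key=lambda p: rerank_score(prompt, p[0]))
    if rerank_score(prompt, best_text) < min_score:
        return None, "fallback"
    return best_agent, "neural"
```

In this sketch, `route("please restart the nginx service")` takes the fast-path, `route("the database query is slow today")` goes through both neural stages, and a prompt with no close pattern returns `(None, "fallback")`.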

Architect & Developer

Hosted Environment Navigator

Fleet system of record (HEN)

A system of record for a Windows server estate running FA/EAM, with continuous discovery across the fleet.

20 blueprints
224 routes
25-table schema
245 tests
  • WinRM auto-discovery continuously inventories every install (version, config, services, IIS, databases, certificates) into a JSONB Postgres store, surfaced through a searchable dashboard.
  • Integrates Zendesk, Azure DevOps, GitHub, DigiCert, and Azure Bastion, with Celery workers refreshing fleet state on a rolling schedule.
  • Secured with bcrypt password hashing, RBAC, API tokens, CSRF protection, rate limiting, and audit logging.
Python 3.9 · Flask · Gunicorn · gevent · PostgreSQL (JSONB) · Redis · Celery · WinRM · Ansible · Azure

Architect & Developer

Hosted Access Manager

Just-in-time privileged access (HAM)

A just-in-time, time-boxed service for privileged access to Oracle databases across the fleet.

0 permanent credentials
250+ sessions
350+ Oracle SIDs managed
  • Requests validated through CAB or Jira unlock accounts and automatically lock again on expiry, with credentials stored in Azure Key Vault and a full audit trail.
  • Self-service onboarding registers servers automatically after a token check and SSH verification, with a single console to navigate every server and SID.
  • A scheduler continuously reconciles state, locking expired sessions and orphaned accounts.
  • Replaced standing credentials and manual DBA grants across the Oracle fleet.
Python 3.12 · FastAPI · PostgreSQL · psycopg2 · APScheduler · Paramiko · Azure Key Vault · nginx · systemd
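
The time-boxed grant and the reconciliation pass described above might look something like the sketch below. It is an illustration under assumptions, not the service itself: the `Session` class, the `grant` and `reconcile` helpers, and the account names are all hypothetical, ticket validation is assumed to have happened upstream, and the real database lock/unlock calls are reduced to returning the list of accounts that should be locked.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Session:
    account: str
    expires_at: datetime
    locked: bool = False  # True once the session has been closed out


def grant(account, minutes, now):
    """Open a time-boxed session (assumes the request was already validated)."""
    return Session(account=account, expires_at=now + timedelta(minutes=minutes))


def reconcile(sessions, unlocked_accounts, now):
    """Close expired sessions and return the accounts that should be locked.

    An account is locked when it is currently unlocked but has no active
    session, which covers both expired sessions and orphaned accounts.
    """
    active = set()
    for session in sessions:
        if session.locked:
            continue
        if session.expires_at <= now:
            session.locked = True  # expired: close the session
        else:
            active.add(session.account)
    return sorted(a for a in unlocked_accounts if a not in active)


# Usage: one session expires, one is still active, one account is orphaned.
start = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
short = grant("APP_DBA", 30, start)
long_running = grant("REPORT_RO", 120, start)
to_lock = reconcile(
    [short, long_running],
    {"APP_DBA", "REPORT_RO", "STALE_ADMIN"},
    start + timedelta(minutes=45),
)
```

Running the pass on a schedule, rather than relying only on per-session timers, means an account left unlocked by a crash or a manual change is still caught on the next cycle.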

// SKILLS & TOOLBOX

Core Technologies

My core skill set: the platforms, languages, and tools I work with day to day.

// WORKSTATION

Current Setup

My daily development environment and tools.

Pop!_OS
Alacritty
tmux (custom keybinds)
Neovim
Git
Zsh
Amazon Kiro CLI
Claude Code
Azure DevOps
Azure CLI
gcloud CLI
GitHub CLI