Job description

Site Reliability Engineer (AI Infrastructure - DevTool Start-Up)
$150,000 - $220,000 + Equity + Benefits + PTO
San Francisco, CA

Are you passionate about keeping production AI infrastructure fast, reliable, and self-healing? Do you thrive in environments where you directly own the systems that millions of LLM requests flow through every day?

This is an opportunity to join a fast-growing, profitable startup at the forefront of AI infrastructure, building the reliability layer that powers how real customers deploy and use language models in production. Backed by top-tier investors and trusted by major enterprises, the team has built a unified LLM gateway used as a critical proxy by engineering teams worldwide. Now, they're looking for a founding SRE to own the reliability, performance, and observability of that proxy in production.

As a founding member of the engineering team, you'll own the systems that keep the core proxy alive under load: debugging OOMs, resolving database connection exhaustion, fixing race conditions, and making the platform resilient when dependencies go down. You'll work directly with senior leadership, engage with a large open source community, and ensure that when customers put their entire AI stack behind this gateway, it never lets them down.

If you're looking for a role where you can combine deep systems debugging with real customer impact and directly influence the infrastructure that underpins modern AI applications, this is an outstanding opportunity.

The Role:

Own and resolve production reliability issues including OOMs, deadlocks, connection pool exhaustion, and race conditions
Optimize performance across hot paths including spend tracking, database writes, and health checks
Improve Redis and in-memory cache reliability across multi-pod deployments
Make the proxy self-healing with graceful degradation, retry logic, and proper health checks when DB or Redis is unavailable
Build and maintain Prometheus metrics, alerting, and observability for production deployments
Collaborate directly with customers and the open source community to turn real-world issues into platform improvements

The Person:

1-4 years running Python services in production at scale
Experience debugging OOMs, memory leaks, race conditions, and deadlocks in live environments
Strong familiarity with PostgreSQL, Redis, and Kubernetes in production
Comfortable owning production systems and debugging customer-facing incidents
Solid understanding of distributed systems, connection pooling, and caching layers
Excited to work in an early-stage, high-ownership, fast-shipping environment

Rise Technical Recruitment Inc of 1011 Centre Rd, Suite 322, Wilmington, DE 19805 act as an employer-paid private personnel agency.

The salary advertised is the bracket available for this position. The actual salary paid will depend on your level of experience, qualifications, and skill set and will be decided by our client, the employer. Rise are not responsible or liable for any hiring decisions made by the end client.

We are an equal opportunities company and welcome applications from all suitable candidates.

Consultant

Luca Browning

Recruitment Consultant
