Job description
Site Reliability Engineer (AI Infrastructure - DevTool Start-Up)
$150,000 - $220,000 + Equity + Benefits + PTO
San Francisco, CA
Are you passionate about keeping production AI infrastructure fast, reliable, and self-healing? Do you thrive in environments where you directly own the systems that millions of LLM requests flow through every day?
This is an opportunity to join a fast-growing, profitable startup at the forefront of AI infrastructure, building the reliability layer that powers how real customers deploy and use language models in production. Backed by top-tier investors and trusted by major enterprises, the team has built a unified LLM gateway used as a critical proxy by engineering teams worldwide. Now, they're looking for a founding SRE to own the reliability, performance, and observability of that proxy in production.
As a founding member of the engineering team, you'll own the systems that keep the core proxy alive under load: debugging OOMs, resolving database connection exhaustion, fixing race conditions, and making the platform resilient when dependencies go down. You'll work directly with senior leadership, engage with a large open source community, and ensure that when customers put their entire AI stack behind this gateway, it never lets them down.
If you're looking for a role where you can combine deep systems debugging with real customer impact and directly influence the infrastructure that underpins modern AI applications, this is an outstanding opportunity.
The Role:
- Own and resolve production reliability issues including OOMs, deadlocks, connection pool exhaustion, and race conditions
- Optimize performance across hot paths including spend tracking, database writes, and health checks
- Improve Redis and in-memory cache reliability across multi-pod deployments
- Make the proxy self-healing with graceful degradation, retry logic, and proper health checks when DB or Redis is unavailable
- Build and maintain Prometheus metrics, alerting, and observability for production deployments
- Collaborate directly with customers and the open source community to turn real-world issues into platform improvements
The Person:
- 1-4 years of experience running Python services in production at scale
- Experience debugging OOMs, memory leaks, race conditions, and deadlocks in live environments
- Strong familiarity with PostgreSQL, Redis, and Kubernetes in production
- Comfortable owning production systems and debugging customer-facing incidents
- Solid understanding of distributed systems, connection pooling, and caching layers
- Excited to work in an early-stage, high-ownership, fast-shipping environment
Rise Technical Recruitment Inc of 1011 Centre Rd, Suite 322, Wilmington, DE 19805 act as an employer-paid private personnel agency.
The salary advertised is the bracket available for this position. The actual salary paid will be dependent on your level of experience, qualifications and skill set and will be decided by our client, the employer. Rise are not responsible or liable for any hiring decisions made by the end client.
We are an equal opportunities company and welcome applications from all suitable candidates.
