Location: SF Bay Area / Remote.
About Kedara
Millions of families struggle to coordinate elder care. We're building an AI-powered Care Coordination System with a personalized voice-AI Assistant that becomes a trusted partner for caregivers, helping them manage the care coordination burden and reducing burnout.
Our founders scaled Speech AI and NLP at MindMeld and BabbleLabs (both acquired by Cisco). We're a small team of product, AI/ML, and senior care experts.
The Role
Design and build the core backend infrastructure for the application and real-time AI inference, powering our app and its voice-AI conversational assistant. You will own the overall system architecture behind the frontend and the AI orchestration layer, optimizing for latency, cost, security, reliability, and human-like conversational fluidity in the assistant. You will work across GCP and AWS (using each for what it does best) and manage infrastructure costs across our GCP, AWS, and Nvidia Inception credit pools.
Core Infrastructure
- Partner with frontend, product, data and AI/ML leads to translate requirements into end-to-end system architecture
- Own end-to-end delivery from design through production, ensuring reliability and scalability. Manage cloud credits wisely across GCP and AWS, architecting for cost and performance
- Design scalable APIs for user onboarding, the data layer, identity, and role-based access control
- Build security and compliance: HIPAA-compliant data handling, encryption, access controls, PHI protection
AI Inference Infrastructure
- Architect real-time media transport layers (e.g., WebRTC or optimized WebSockets) to stream raw audio, video, and image data between the mobile client and the inference engine, for ultra-low-latency in-app voice and video interactions
- Integrate and orchestrate different models (ASR, TTS, LLMs/SLMs, OCR, etc.), a multi-modal pipeline (audio/voice, video, images), and an orchestration layer for “stateful agents” to coordinate complex care tasks
- Route inference between on-device models (Speech-to-Text, quick responses) and cloud LLMs (reasoning, RAG) based on latency and cost trade-offs
- Establish observability for inference quality: latency, token usage, cost per request, hallucination detection
Requirements
- 5-8+ years building distributed backend systems at scale, especially for consumer apps, preferably from 0 to 1 in high-ambiguity, high-velocity environments
- AI-native and hands-on with tools like Claude Code to build rapidly and expedite delivery, while keeping your thinking hat on to ensure code quality, security, and reliability for the platform
- Strong proficiency in a systems language (e.g., Python) for rapid AI experimentation and implementation, with a focus on high performance and conversational fluidity