As a trusted advisor, leader, and collaborator, Rohit brings strong problem-solving, analytical, and operational skills to every initiative, developing strategic requirements and solution analyses through all stages of the project life cycle, from product readiness to execution.
Rohit excels in designing scalable cloud microservice architectures with Spring Boot and Netflix OSS technologies on AWS and Google Cloud. As a Security Ninja, Rohit looks for ways to resolve application security vulnerabilities using ethical hacking and threat modeling. Rohit is excited about architecting cloud solutions with Docker, Redis, NGINX, RightScale, RabbitMQ, Apigee, Azul Zing, Actuate BIRT reporting, Chef, Splunk, REST Assured, SoapUI, Dynatrace, and EnterpriseDB. In addition, Rohit has developed lambda-architecture solutions using Apache Spark, Cassandra, and Camel for real-time analytics and integration projects.
Rohit holds an MBA in Corporate Entrepreneurship from Babson College and master's degrees in Computer Science from Boston University and Harvard University. Rohit is a regular speaker at No Fluff Just Stuff, UberConf, RichWeb, GIDS, and other international conferences.
Rohit loves to connect at http://www.productivecloudinnovation.com, on LinkedIn at http://linkedin.com/in/rohit-bhardwaj-cloud, or on Twitter at rbhardwaj1.
Modern system design has entered a new era. It’s no longer enough to optimize for uptime and latency — today’s systems must also be AI-ready, token-efficient, trustworthy, and resilient. Whether building global-scale apps, powering recommendation engines, or integrating GenAI agents, architects need new skills and playbooks to design for scale, speed, and reliability.
This full-day workshop blends classic distributed systems knowledge with AI-native thinking. Through case studies, frameworks, and hands-on design sessions, you’ll learn to design systems that balance performance, cost, resilience, and truthfulness — and walk away with reusable templates you can apply to interviews and real-world architectures.
Target Audience
Enterprise & Cloud Architects → building large-scale, AI-ready systems.
Backend Engineers & Tech Leads → leveling up to system design mastery.
AI/ML & Data Engineers → extending beyond pipelines to full-stack AI systems.
FAANG & Big Tech Interview Candidates → preparing for system design interviews with an AI twist.
Engineering Managers & CTO-track Leaders → guiding teams through AI adoption.
Startup Founders & Builders → scaling AI products without burning money.
Learning Outcomes
By the end of the workshop, participants will be able to:
Apply a 7-step system design framework extended for AI workloads.
Design systems that scale for both requests and tokens.
Architect multi-provider failover and graceful degradation ladders.
Engineer RAG 2.0 pipelines with hybrid search, GraphRAG, and semantic caching.
Implement AI trust & security with guardrails, sandboxing, and red-teaming.
Build observability dashboards for hallucination rate, drift, and token costs.
Reimagine real-world platforms (Uber, Netflix, Twitter, Instagram) with AI integration.
Practice mock interviews & chaos drills to defend trade-offs under pressure.
Take home reusable templates (AI System Design Canvas, RAG Checklist, Chaos Runbook).
Gain the confidence to lead AI-era system design in interviews, enterprises, or startups.
Workshop Agenda (Full-Day, 8 Hours)
Session 1 – Foundations of Modern System Design (60 min)
The new era: Why classic design is no longer enough.
Architecture KPIs in the AI age: latency, tokens, hallucination %, cost.
Group activity: brainstorm new KPIs.
Session 2 – Frameworks & Mindset (75 min)
The 7-Step System Design Framework (AI-extended).
Scaling for humans vs. tokens.
Token capacity planning exercise.
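The capacity-planning idea behind this exercise can be sketched in a few lines. The request rates, token counts, provider limit, and 20% headroom below are illustrative assumptions for the exercise, not real vendor quotas:

```python
import math

def tokens_per_minute(requests_per_second, avg_prompt_tokens, avg_completion_tokens):
    """Estimate sustained token demand for an LLM-backed service."""
    return requests_per_second * 60 * (avg_prompt_tokens + avg_completion_tokens)

def quota_units_needed(demand_tpm, provider_tpm_limit, headroom=0.8):
    """Quota units (keys/deployments) to cover demand, reserving burst headroom."""
    usable = provider_tpm_limit * headroom
    return math.ceil(demand_tpm / usable)

demand = tokens_per_minute(50, 1200, 300)      # 50 RPS, ~1.5k tokens/request
print(demand)                                   # 4,500,000 tokens/minute
print(quota_units_needed(demand, 2_000_000))    # units needed at a 2M-TPM limit
```

The same arithmetic extends to cost planning: multiply the tokens-per-minute figure by a per-token price to get a burn rate.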
Session 3 – Retrieval & Resilience (75 min)
RAG 2.0 patterns: chunking, hybrid retrieval, GraphRAG, semantic cache.
Multi-provider resilience + graceful degradation ladders.
Whiteboard lab: design a resilient RAG pipeline.
Session 4 – Security & Observability (60 min)
Threats: prompt injection, data exfiltration, abuse.
Guardrails, sandboxing, red-teaming.
Observability for LLMs: traces, cost dashboards, drift monitoring.
Activity: STRIDE threat-modeling for an LLM endpoint.
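The LLM observability metrics above (hallucination rate, token cost) reduce to simple aggregations over per-request trace records. The records and per-1k-token prices below are made-up illustrations, not any vendor's pricing:

```python
records = [
    {"tokens_in": 900,  "tokens_out": 250, "hallucinated": False},
    {"tokens_in": 1100, "tokens_out": 400, "hallucinated": True},
    {"tokens_in": 800,  "tokens_out": 300, "hallucinated": False},
]

PRICE_PER_1K_IN, PRICE_PER_1K_OUT = 0.0005, 0.0015  # assumed prices (USD)

def dashboard(recs):
    """Roll per-request traces up into dashboard-level metrics."""
    n = len(recs)
    halluc_rate = sum(r["hallucinated"] for r in recs) / n
    cost = sum(r["tokens_in"] / 1000 * PRICE_PER_1K_IN +
               r["tokens_out"] / 1000 * PRICE_PER_1K_OUT for r in recs)
    return {"hallucination_rate": round(halluc_rate, 3),
            "total_cost_usd": round(cost, 6)}

print(dashboard(records))
```

Drift monitoring follows the same shape: log an embedding or score per request, then compare windowed distributions over time.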
Session 5 – Real-World System Patterns (90 min)
Uber, Netflix, Instagram, Twitter, Search, Fraud detection, Chatbot.
AI-enhanced vs classic system designs.
Breakout lab: redesign a system with AI augmentation.
Session 6 – Interviews & Chaos Drills (75 min)
Mock interview challenges: travel assistant, vector store sharding.
Peer review of trade-offs, diagrams, storytelling.
Chaos drills: provider outage, token overruns, fallback runbooks.
Closing (15 min)
Recap: the three secrets (scaling tokens, RAG as index, resilient degradation).
Templates & takeaways: AI System Design Canvas, RAG Checklist, Chaos Runbook.
Q&A + networking.
Takeaways for Participants
AI System Design Canvas (framework for interviews & real-world reviews).
RAG 2.0 Checklist (end-to-end retrieval playbook).
Chaos Runbook Template (resilience drill starter kit).
AI SLO Dashboard template for observability + FinOps.
Confidence to design and defend AI-ready architectures in both career and enterprise contexts.
AI inference is no longer a simple model call—it is a multi-hop DAG of planners, retrievers, vector searches, large models, tools, and agent loops. With this complexity comes new failure modes: tail-latency blowups, silent retry storms, vector store cold partitions, GPU queue saturation, exponential cost curves, and unmeasured carbon impact.
In this talk, we unveil ROCS-Loop, a practical architecture designed to close the four critical loops of enterprise AI:
• Reliability (predictable latency, controlled queues, resilient routing)
• Observability (full DAG tracing, prompt spans, vector metrics, GPU queue depth)
• Cost-Awareness (token budgets, model tiering, cost attribution, spot/preemptible strategies)
• Sustainability (SCI metrics, carbon-aware routing, efficient hardware, eliminating unnecessary work)
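As a hedged illustration of the Cost-Awareness loop, token-budgeted model tiering might look like the sketch below. The tier names, prices, and difficulty score are assumptions for illustration, not part of the ROCS-Loop specification:

```python
# Tiers ordered cheapest first; each handles requests up to a difficulty score.
TIERS = [
    {"name": "small",  "max_difficulty": 0.3, "usd_per_1k_tokens": 0.0002},
    {"name": "medium", "max_difficulty": 0.7, "usd_per_1k_tokens": 0.001},
    {"name": "large",  "max_difficulty": 1.0, "usd_per_1k_tokens": 0.01},
]

def route(difficulty, remaining_budget_usd, est_tokens):
    """Pick the cheapest capable tier whose cost fits the remaining budget."""
    for tier in TIERS:
        if difficulty <= tier["max_difficulty"]:
            cost = est_tokens / 1000 * tier["usd_per_1k_tokens"]
            if cost <= remaining_budget_usd:
                return tier["name"], cost
    return "reject", 0.0  # budget exhausted: shed, queue, or degrade the request

print(route(0.5, 1.00, 2000))   # medium tier handles it within budget
print(route(0.9, 0.001, 2000))  # needs large, too costly -> reject
```

Routing cheapest-capable-first is also what prevents retry storms from compounding cost: a rejected request never escalates to a pricier tier.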
KEY TAKEAWAYS
• Understand the four forces behind AI outages (latency, visibility, cost, carbon).
• Learn the ROCS-Loop framework for enterprise-grade AI reliability.
• Apply 19 practical patterns to reduce P99, prevent retry storms, and control GPU spend.
• Gain a clear view of vector store + agent observability and GPU queue metrics.
• Learn how ROCS-Loop maps to GCP, Azure, Databricks, FinOps & SCI.
• Leave with a 30-day action plan to stabilize your AI workloads.
AGENDA
1. The Quiet Outage: why AI inference fails
2. X-Ray of the inference pipeline (RAG, agents, vector stores, GPUs)
3. Introducing the ROCS-Loop framework
4. 19 patterns for Reliability, Observability, FinOps & GreenOps
5. Cross-cloud mapping (GCP, Azure, Databricks)
6. Hands-on: diagnose an outage with ROCS
7. Your 30-day ROCS stabilization plan
8. Closing: becoming a ROCS AI Architect
Dynamic Programming (DP) intimidates even seasoned engineers. With the right lens, it’s just optimal substructure + overlapping subproblems turned into code. In this talk, we start from a brute-force recursive baseline, surface the recurrence, convert it to memoization and tabulation, and connect it to real systems (resource allocation, routing, caching). Along the way you’ll see how to use AI tools (ChatGPT, Copilot) to propose recurrences, generate edge cases, and draft tests—while you retain ownership of correctness and complexity. Expect pragmatic patterns you can reuse in interviews and production.
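The brute-force → memoization → tabulation progression the talk walks through can be previewed on the classic minimum-coin-change problem:

```python
from functools import lru_cache

def min_coins_brute(coins, amount):
    """Exponential recursion: re-solves overlapping subproblems."""
    if amount == 0:
        return 0
    best = float("inf")
    for c in coins:
        if c <= amount:
            best = min(best, 1 + min_coins_brute(coins, amount - c))
    return best

def min_coins_memo(coins, amount):
    """Same recurrence, cached: O(amount * len(coins)) subproblem solves."""
    @lru_cache(maxsize=None)
    def solve(a):
        if a == 0:
            return 0
        return min((1 + solve(a - c) for c in coins if c <= a),
                   default=float("inf"))
    return solve(amount)

def min_coins_tab(coins, amount):
    """Bottom-up tabulation of the same recurrence, no recursion."""
    dp = [0] + [float("inf")] * amount
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a:
                dp[a] = min(dp[a], 1 + dp[a - c])
    return dp[amount]

print(min_coins_tab([1, 5, 11], 15))  # 3 (5 + 5 + 5 beats greedy's 11 + 1*4)
```

All three functions encode the identical recurrence; only the evaluation order changes, which is the central lens of the talk.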
Why Now
Key Framework
Core Content
Learning Outcomes