His main areas of expertise are warehouse-scale computing, Linux containers, highly available systems, and distributed persistence.
Netflix has a complex microservices architecture that runs in an active-active manner from multiple geographies on top of AWS. Application deployment and management are critical to running services this way at scale. We have developed Titan to make cluster management, application deployment, and process supervision more robust, and to use CPU and memory more efficiently across all of our servers in different geographies.
Titan is built on top of Apache Mesos and schedules application processes on AWS EC2. In this talk we will discuss the design of our Mesos framework and its scheduler. We will focus on the following aspects of the scheduler: bin-packing algorithms, scaling clusters in and out, fault tolerance of processes via reconciliation and the processing of life-cycle events, and multi-geography/cross-data-center redundancy.
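To make the bin-packing idea concrete, here is a minimal sketch of one common heuristic, first-fit decreasing, applied to CPU and memory. It is an illustration only and not Titan's actual algorithm, which the abstract does not describe; the `Agent` class, the `first_fit_decreasing` function, and the task tuples are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical stand-in for a server offering CPU and memory."""
    name: str
    cpus: float
    mem_mb: int
    tasks: list = field(default_factory=list)

    def fits(self, cpus, mem_mb):
        # A task fits only if both remaining resources are sufficient.
        return self.cpus >= cpus and self.mem_mb >= mem_mb

    def place(self, task_name, cpus, mem_mb):
        # Deduct the task's resources from what the agent still has free.
        self.cpus -= cpus
        self.mem_mb -= mem_mb
        self.tasks.append(task_name)


def first_fit_decreasing(tasks, agents):
    """Place (name, cpus, mem_mb) tasks on agents, largest first.

    Returns the names of tasks that could not be placed anywhere.
    """
    unplaced = []
    # Sort by CPU, then memory, descending, so big tasks are placed first.
    ordered = sorted(tasks, key=lambda t: (t[1], t[2]), reverse=True)
    for name, cpus, mem_mb in ordered:
        for agent in agents:
            if agent.fits(cpus, mem_mb):
                agent.place(name, cpus, mem_mb)
                break
        else:
            # No agent had room; a real scheduler might scale the cluster out.
            unplaced.append(name)
    return unplaced
```

Packing tasks tightly onto the first agent that has room tends to leave other agents empty, which is what makes scaling a cluster in practical. A production scheduler would also weigh constraints this sketch ignores, such as placement across zones.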
As Netflix grows, so does the complexity of our application deployments. To lower the cost of entry for developers while increasing reliability, we have begun to re-evaluate how applications live in our production environment. Amazon gives us the flexibility to tap into massive amounts of resources, but how we use and manage those resources is a constantly evolving and ever-growing task.
Titan, a combination of Apache Mesos and infrastructure tooling, gives us the ability to use Linux containers and shift our developers' focus back to their applications, while maintaining the level of insight we have come to expect in our ecosystem. By combining Apache Mesos and Docker, we have built an application infrastructure that gives us a highly resilient PaaS. It reduces the time and pain our developers once felt when launching applications within our increasingly complex infrastructure, and it lets us make changes at the IaaS layer without impacting our engineers or sacrificing our insight.