Seattle Conference on Scalability: YouTube Scalability

Notes

  • Apache isn't that great at serving static content for a large number of requests vs. NetScaler load balancing
  • Python is fast enough
    • There are many other bottlenecks such as waiting for calls from DB, cache, etc. Python web speed is "fast enough" and usually not the biggest bottleneck
    • Development speed can be faster with Python and that is more critical
    • psyco for Python -> C compiling
  • Each video is hosted by a mini-cluster
    • A cluster of machines that serve the exact same video
    • Offers availability via backups
    • Served with Apache at first. Lasted about 3 months. High load, high context switching (whenever the operating system stops a thread from running in the CPU and puts another thread to run in its place).
    • Moved to lighttpd to move from single-process to single-thread architecture, open source. Allows pulling a large number of file descriptors efficiently.
  • CDN for most popular content
    • Less popular content go to regular Youtube servers
    • Youtube servers have to be optimized very differently vs. the CDNs. The CDN data mostly comes from memory. For Youtube servers, commodity hardware was used, with not too little or too much memory. Expensive hardware wasn't worth it. Memory had to be tweaked for random access and to avoid cache thrashing.
  • Thumbnails
    • Served them with Apache at first with a unified pool (same hosts that serve traffic also serve static thumbnails). Eventually moved to separate pools. High load, low performance
    • EXT3 partition => too many files in one directory (thumbnails) caused an issue where no more video uploads can be made
    • Switched from Apache to Squid (reverse proxy), far more efficient with CPU. However the performance would degrade (start at 300 TPS, then eventually down to 20 TPS)
    • However, there were still too many images. It would take 24-48 hours just to copy images from one machine to another machine. rsync would never finish because it would exhaust all the memory in the machine.
    • Moved onto Big Table to fix the issue, i.e. Google's BTFE.
  • Databases
    • Used MySQL because its fast if you know how to tune correctly
    • Stored metadata, i.e. username, password, vid descriptions, titles, messages
    • Started with one main DB, one backup
    • Did ok until more people started using Youtube
    • MySQL db was swapping, bringing down the site to a crawl. Turned out that early Linux kernel, it treats page cache more important than your application
    • As Youtube started scaling up, they did DB/SQL replication to spread reads across multiple machines.
    • Replication lag, too many read replicas. Users complaining about weird unreproducible bugs.
    • Split read replicas to video watching pools and the general pool. Stopgap measure - not a real fix.
    • DB RAID tweaking increased throughput by 20%. Linux sees 5 volumes instead of 1 logical volume, allowing to more aggresively schedule disk I/O. Same hardware.
    • Eventually did Database Partitions to spread writes AND read, partition by user. Much better cache locality => less I/O. Got rid of read replicas.