Elon Musk and Twitter’s System Design
System design interview buzzwords are now surfacing on Twitter arguments
In just a few short weeks, Elon Musk has Tweeted about a wide variety of software engineering topics. He wrote a Tweet about microservices, which is a pretty complex topic, prompting an argument about technical debt, which is a pretty broad topic, further branching into discussions of server-side rendering, network latency, and GraphQL, all of which are fairly broad topics themselves. You are welcome to go down the rabbit hole of arguments, in all their Twitter glory, but I had a different angle in mind.
In this talk from 10 years ago, the director of Twitter Application Services detailed how the read and write paths worked in Twitter. He mentioned their ability to deliver 300,000 Tweets a second and described the importance this service had during the 2011 Tsunami Disaster in Japan. This, he said, was a big motivator.
It served as a reminder of what Twitter is: a service capable of accommodating 400 million users, sometimes in moments of great crisis. It has also released a number of valuable open source projects over the years, including Twitter Bootstrap, Gizzard, and Pelikan, Twitter’s unified cache backend. Twitter is also a data miner’s paradise; I did some proof-of-concept work with the Twitter API many years ago, back when I was a technical writer intern at Splunk. The basis for this work was a research paper from Johns Hopkins called You Are What You Tweet, a scientific study showing that Twitter data could be used to detect disease trends that were consistent with the CDC’s, and that arrived at a much faster rate.
System Design Interviews: Twitter
If you want to see “Toy Twitter,” you are more than welcome to look at the hypothetical design on System Design Primer. I also have an Educative.io link, created by Design Gurus, but if you want to access “Educative.io Unlimited” you need to pay something like $20 a month to get a course that really isn’t very different from the free GitHub equivalent.
Hypothetical designs of Twitter may differ because they are just that, but here are a few things they tend to have in common:
- They emphasize the read-heavy nature of Twitter
- They emphasize the importance of availability over consistency
- They introduce a cache; System Design Primer explicitly names Redis
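The read-heavy, cache-first shape these hypothetical designs share can be illustrated with a minimal cache-aside sketch. This is not Twitter’s implementation; a plain dict stands in for Redis, and `load_timeline_from_store` is a hypothetical stand-in for the slow, durable path:

```python
# Minimal cache-aside sketch of a read-heavy timeline service.
# A plain dict stands in for Redis; another dict plays the database.

database = {"alice": ["tweet-1", "tweet-2"]}  # slow, durable store
cache = {}                                    # fast, volatile cache

def load_timeline_from_store(user):
    # Simulates the slow path: rebuilding a timeline from the database.
    return database.get(user, [])

def get_timeline(user):
    # Fast path: serve straight from cache when possible.
    if user in cache:
        return cache[user]
    # Slow path: rebuild from the store, then populate the cache.
    timeline = load_timeline_from_store(user)
    cache[user] = timeline
    return timeline

print(get_timeline("alice"))  # cache miss: hits the store
print(get_timeline("alice"))  # cache hit: served from memory
```

The point of the pattern is that the second read never touches the database, which is why these designs can favor availability and speed on the read path.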
Actual Twitter
I have been wanting to write about system design interviews for quite some time, and now Elon Musk is presenting us with system design information in real time. He is being very open about how Twitter is designed, going as far as to have arguments with his own engineers on the very public forum they are restructuring.
For the record, and as some of Elon’s critics have pointed out, Twitter as a whole was already pretty transparent about design details in talks, technical blog posts, and some limited open source projects. Naturally, news about Elon Musk tends to be very political, but this blog has a corgi-themed name and a picture of a dog that is more than 50% Maltese; it is way too stupid to get into a smart topic like politics.
Real Time Twitter Delivery
Real-Time Twitter Delivery is a one-hour talk about Twitter architecture. Its speaker, Raffi, had actually studied architecture and cited A Pattern Language as a source of inspiration. That physical architecture book also inspired the Gang of Four (no, not the band; the software one).
To be clear, this talk is also from ten years ago.
A few notes from his talk:
- Twitter has (or at least had, in 2012) the ability to stream a new Tweet onto the home timeline in 100 ms
- The Write API accepts a Tweet, then uses a “fanout” service to deliver it to everyone who follows the author. Home timelines are kept in Redis instances, with each instance holding the timelines for some set of users; the “fanout” service inserts the Tweet into follower timelines in batches of 4,000 users at a time
- When a user’s timeline is “out of cache,” retrieving Tweets may take 2–3 seconds; retrieval from cache is much faster. In other words, Twitter loads much more quickly for daily users than it does for people who have not used Twitter in months
- Writes are relatively slow, particularly for very popular users with millions of followers; reads, in contrast, are “insanely fast”
- Batch analytics are calculated nightly, with Hadoop
- When someone really popular/famous Tweets, “fanout” is actually skipped; instead, their Tweets are merged in on the read path
- There are 18 million deliveries a minute
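The write path described above, including the 4,000-user batches and the celebrity exception, can be sketched in a few lines. The batch size and the “skip fanout for huge accounts” behavior come from the talk; everything else here (the names, the follower threshold, the in-memory dicts standing in for Redis) is invented for illustration:

```python
# Hedged sketch of write-path fanout, loosely following the talk.
from collections import defaultdict

FANOUT_BATCH = 4000              # the talk cites batches of 4,000 users
CELEBRITY_THRESHOLD = 1_000_000  # assumed cutoff for skipping fanout

followers = defaultdict(list)    # user -> list of follower ids
timelines = defaultdict(list)    # user -> cached home timeline (tweet ids)

def batched(seq, size):
    # Yield the follower list in fixed-size chunks.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def post_tweet(author, tweet_id):
    fans = followers[author]
    if len(fans) >= CELEBRITY_THRESHOLD:
        # Very popular users skip fanout entirely; their Tweets are
        # merged into follower timelines on the read path instead.
        return "deferred-to-read-path"
    for batch in batched(fans, FANOUT_BATCH):
        for follower in batch:
            # Newest Tweet goes to the front of each follower's timeline.
            timelines[follower].insert(0, tweet_id)
    return "fanned-out"

followers["bob"] = ["carol", "dave"]
print(post_tweet("bob", "tweet-42"))  # "fanned-out"
print(timelines["carol"])             # ["tweet-42"]
```

This also makes the cost asymmetry in the notes above concrete: one write becomes millions of timeline insertions for a popular account, while a read is a single cache lookup.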
A few quotes from “High Scalability”
Source, which only opens in Firefox for me for some reason:
Outliers, those with huge follower lists, are becoming a common case. Sending a tweet from a user with a lot of followers, that is with a large fanout, can be slow. Twitter tries to do it under 5 seconds, but it doesn’t always work, especially when celebrities tweet and tweet each other, which is happening more and more. One of the consequences is replies can arrive before the original tweet is received. Twitter is changing from doing all the work on writes to doing more work on reads for high value users.
— From “High Scalability”

Blender creates the search timeline. It has to scatter-gather across the datacenter. It queries every Early Bird shard and asks do you have content that matches this query? If you ask for “New York Times” all shards are queried, the results are returned, sorted, merged, and reranked. Rerank is by social proof, which means looking at the number of retweets, favorites, and replies.
— From “High Scalability”
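The Blender quote above describes a classic scatter-gather pattern: query every shard, merge, then rerank by “social proof.” A toy sketch of that flow, with shard contents and the scoring function invented for illustration:

```python
# Toy scatter-gather search, following the Blender description above.
# Each "shard" is just a list of tweets; real Early Bird shards are
# search indexes, and this scoring function is a made-up stand-in.

shards = [
    [{"text": "New York Times piece", "retweets": 50,
      "favorites": 10, "replies": 5}],
    [{"text": "New York Times scoop", "retweets": 200,
      "favorites": 80, "replies": 30}],
]

def social_proof(tweet):
    # Rerank signal: retweets, favorites, and replies, per the quote.
    return tweet["retweets"] + tweet["favorites"] + tweet["replies"]

def search(query):
    # Scatter: ask every shard whether it has matching content.
    hits = []
    for shard in shards:
        hits.extend(t for t in shard if query in t["text"])
    # Gather: merge the results and rerank by social proof, best first.
    return sorted(hits, key=social_proof, reverse=True)

results = search("New York Times")
print(results[0]["text"])  # the higher-engagement match ranks first
```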
This is neither here nor there, but there is one thing from the Twitter Timeline talk that I found entertaining. The speaker mentioned that undergrads would often suggest using a massive SELECT statement, then says, “Trust me, we have tried it.” It makes me imagine a senior Twitter developer with a junior one at 2AM, trying to make the system work.
Junior Developer: Okay, why not just use a massive SELECT statement?
Senior Developer: Hey, let’s try it!
Junior Developer: ….
Senior Developer: ….
Junior Developer: Did it work?
Senior Developer: Where is the fire extinguisher?
Tying It All Together (Sort Of)
That chart Elon Musk drew above (or had drawn; I am not sure which) is the Read Path. It leaves out a lot of detail.
Naturally, I think people were zeroing in on the “>1000 poorly batched RPCs just to render a home timeline” quote. I am not sure what to make of it. The source for the High Scalability article was a Twitter talk called “Timelines At Scale,” so that would probably be a good place to start.
Another good source, I think, would be Yao Yue, who worked at Twitter until a few weeks ago. She created something called Pelikan and was an expert in caching.
She is also explicitly mentioned on the Twitter Blog, in this infrastructure post, but the person who wrote it (a former VP of infrastructure and operations) also either quit or was fired.
Closing Thoughts
Please comment below with your extremely strong opinions for or against Elon Musk, or with something that is related to the tech stack.