Twitter has reworked the way its
search works, at least from an architectural standpoint.
Most end users shouldn’t notice any
differences just yet, but Twitter’s search should now scale better, index more
tweets per second, and use less of Twitter’s system resources. All this
newfound scalability and headroom will give Twitter’s developers the ability to
build cool new search features in the near future (we’re hoping for an older
back catalog of tweets to show up in search results, but nothing like that has
been confirmed yet).
So, whatever happened to Summize?
Apparently, this early-stage acquisition from 2008 has all but
disappeared; Twitter’s real-time search engine is no longer based on Summize’s
technology.
The search architecture is also no
longer based on MySQL, the scaling of which, Twitter dev Michael Busch noted, “had become
increasingly challenging.”
Around six months ago, Busch (a
Lucene committer) and team decided to make the switch to Lucene, a 10-year-old open-source information
retrieval library. The team then spent some quality (and quantity) time
hacking Lucene to suit Twitter’s unique needs. And of course, since Lucene is
open-source, the modifications are being contributed back to Lucene, particularly its
real-time branch.
“We rewrote big parts of the core
in-memory data structures,” said Busch, “especially the posting lists, while
still supporting Lucene’s standard APIs.” The team also improved garbage
collection, added lock-free data structures and algorithms, and included a few other
niceties.
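For readers unfamiliar with the term, a posting list is the per-term list of documents that contain that term; it is the core structure of an inverted index like the one Lucene maintains. The sketch below is a toy illustration in Python, not Twitter's or Lucene's actual code. The class name `InMemoryIndex` and its methods are hypothetical, and it ignores the concurrency, compression, and garbage-collection concerns Busch describes.

```python
from collections import defaultdict


class InMemoryIndex:
    """Toy inverted index: each term maps to an append-only posting list."""

    def __init__(self):
        # term -> list of document IDs, in the order the documents arrived
        self.postings = defaultdict(list)
        self.docs = []

    def add(self, text):
        """Index one document (e.g. a tweet) and return its ID."""
        doc_id = len(self.docs)
        self.docs.append(text)
        # Deduplicate terms so each ID appears once per posting list.
        for term in set(text.lower().split()):
            self.postings[term].append(doc_id)
        return doc_id

    def search(self, query, limit=10):
        """Return IDs of documents containing every query term, newest first."""
        terms = query.lower().split()
        if not terms:
            return []
        # .get() avoids creating empty posting lists for unknown terms.
        lists = [self.postings.get(term, []) for term in terms]
        # Intersect starting from the shortest list to keep the sets small.
        lists.sort(key=len)
        candidates = set(lists[0])
        for plist in lists[1:]:
            candidates &= set(plist)
        # Higher IDs were added later, so reverse order means newest first.
        return sorted(candidates, reverse=True)[:limit]
```

Because IDs are assigned in arrival order, each posting list is already sorted by time, which is why a real-time search engine can return the newest matches without a separate sort by date. A production index would walk the lists directly rather than building sets, as this sketch does for brevity.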
We’re impressed that we
hadn’t noticed any odd behavior or downtime in Twitter search along the way.
But we’d like to know more about the problems Twitter’s engineers were having
with scaling MySQL. You may not have noticed, but database
scaling has been something of a recurring theme around here
lately.