Building a complete Tweet index

 

Today, we are happy to announce that Twitter now indexes every public Tweet since 2006.

Since that first simple Tweet over eight years ago, hundreds of billions of Tweets have captured everyday human experiences and major historical events. Our search engine excelled at surfacing breaking news and events in real time, and our search index infrastructure reflected this strong emphasis on recency. But our long-standing goal has been to let people search through every Tweet ever published.

This new infrastructure enables many use cases, providing comprehensive results for entire TV and sports seasons, conferences (#TEDGlobal), industry discussions (#MobilePayments), places, businesses and long-lived hashtag conversations across topics, such as #JapanEarthquake, #Election2012, #ScotlandDecides, #HongKong, #Ferguson and many more. This change will be rolling out to users over the next few days.

In this post, we describe how we built a search service that efficiently indexes roughly half a trillion documents and serves queries with an average latency of under 100ms.

The most important factors in our design were:

Modularity: Twitter already had a real-time index (an inverted index containing about a week's worth of recent Tweets). We shared source code and tests between the two indices where possible, which created a cleaner system in less time.
Scalability: The full index is more than 100 times larger than our real-time index and grows by several billion Tweets a week. Our fixed-size real-time index clusters are non-trivial to expand; adding capacity requires re-partitioning and significant operational overhead. We needed a system that expands in place gracefully.
Cost effectiveness: Our real-time index is fully stored in RAM for low latency and fast updates. However, using the same RAM technology for the full index would have been prohibitively expensive.
Simple interface: Partitioning is unavoidable at this scale. But we wanted a simple interface that hides the underlying partitions so that internal clients can treat the cluster as a single endpoint.
Incremental development: The goal of "indexing every Tweet" was not achieved in one quarter. The full index builds on previous foundational projects. In 2012, we built a small historical index of roughly two billion top Tweets, developing an offline data aggregation and preprocessing pipeline. In 2013, we expanded that index by an order of magnitude, evaluating and tuning SSD performance. In 2014, we built the full index with a multi-tier architecture, focusing on scalability and operability.
Overview
The system consists of four main parts: a batched data aggregation and preprocessing pipeline; an inverted index builder; Earlybird shards; and Earlybird roots. Read on for a high-level overview of each component.

Batched data aggregation and preprocessing
The ingestion pipeline for our real-time index processes individual Tweets one at a time. In contrast, the full index uses a batch processing pipeline, where each batch is a day of Tweets. We wanted our offline batch jobs to share as much code as possible with our real-time ingestion pipeline, while still remaining efficient.

To do this, we packaged the relevant real-time ingestion code into Pig User-Defined Functions so that we could reuse it in Pig jobs (soon, moving to Scalding), and created a pipeline of Hadoop jobs to aggregate data and preprocess Tweets on Hadoop. The pipeline is shown in the diagram below.
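As a rough illustration of that packaging (the class and the trivial tokenizer below are hypothetical stand-ins, not Twitter's actual shared ingestion code), a UDF wrapping a shared tokenization step might look like this:

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical sketch: expose a shared ingestion step (tokenization) as a Pig UDF
// so the same code path can run in both the real-time and the batch pipelines.
public class TokenizeTweetText extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    String rawText = (String) input.get(0);
    // In the real pipeline this would delegate to the shared tokenizer used by the
    // real-time ingestion code; here we simply lowercase and split on whitespace.
    return String.join(" ", rawText.toLowerCase().split("\\s+"));
  }
}
```

In the Pig job itself, the UDF is registered and applied to each Tweet record, so the batch pipeline exercises the same ingestion code as the real-time pipeline.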


The daily data aggregation and preprocessing pipeline consists of these components:

Engagement aggregator: Counts the number of engagements for each Tweet in a given day. These engagement counts are used later as an input in scoring each Tweet.
Aggregation: Joins multiple data sources together based on Tweet ID.
Ingestion: Performs different types of preprocessing – language identification, tokenization, text feature extraction, URL resolution and more.
Scorer: Computes a score based on features extracted during ingestion. For the smaller historical indices, this score determined which Tweets were selected into the index.
Partitioner: Divides the data into smaller chunks through our hashing algorithm. The final output is stored in HDFS (a minimal sketch of this routing step follows this list).
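The exact hashing scheme is not described in this post, so the snippet below is only an assumed stand-in showing the shape of that routing decision: map a Tweet ID to one of a fixed number of partitions.

```java
// Hypothetical sketch of the partitioning decision: map a Tweet ID to one of
// numPartitions buckets. The production hashing scheme is not shown in this post.
public final class TweetPartitioner {
  private TweetPartitioner() {}

  public static int partitionFor(long tweetId, int numPartitions) {
    // Fold the high and low 32 bits together and mask off the sign bit
    // before taking the modulus.
    long hashed = Long.hashCode(tweetId) & 0x7fffffffL;
    return (int) (hashed % numPartitions);
  }
}
```

Each day's per-partition output is written to HDFS, so a downstream builder can pick up any (day, partition) batch independently.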
This pipeline was designed to run against a single day of Tweets. We set up the pipeline to run every day to process data incrementally. This setup had two main benefits. It allowed us to incrementally update the index with new data without having to fully rebuild it too frequently. And because processing for each day is set up to be fully independent, the pipeline can be massively parallelized on Hadoop. This allowed us to efficiently rebuild the full index periodically (e.g. to add new indexed fields or change tokenization).

Inverted index building
The daily data aggregation and preprocessing job outputs one record per Tweet. That output is already tokenized, but not yet inverted. So our next step was to set up single-threaded, stateless inverted index builders that run on Mesos.



The inverted index builder consists of the following components:

Segment partitioner: Groups multiple batches of preprocessed daily Tweet data from the same partition into bundles. We call these bundles "segments."
Segment indexer: Inverts each Tweet in a segment, builds an inverted index and stores the inverted index into HDFS (a toy sketch of the inversion step follows this list).
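As a toy illustration of what "inverting" means here (this is not the actual Earlybird segment format), the core data structure maps each term to the list of Tweet IDs containing it:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy sketch of the inversion step: turn per-Tweet token lists into a mapping from
// term to the IDs of Tweets containing it (a posting list). A real segment indexer
// also handles positions, per-field indexes and the on-disk layout.
public class ToyInverter {
  private final Map<String, List<Long>> postings = new TreeMap<>();

  public void add(long tweetId, List<String> tokens) {
    for (String token : tokens) {
      postings.computeIfAbsent(token, t -> new ArrayList<>()).add(tweetId);
    }
  }

  public List<Long> postingList(String term) {
    return postings.getOrDefault(term, Collections.emptyList());
  }
}
```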
The beauty of these inverted index builders is that they are very simple. They are single-threaded and stateless, and these small builders can be massively parallelized on Mesos (we have launched well over 1,000 parallel builders in some cases). These inverted index builders coordinate with one another by placing locks on ZooKeeper, which ensures that two builders don't build the same segment. Using this approach, we rebuilt inverted indices for nearly half a trillion Tweets in only about two days (fun fact: our bottleneck is actually the Hadoop namenode).
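For illustration, the coordination can be as simple as each builder trying to create an ephemeral ZooKeeper node named after the segment it wants to build; the sketch below assumes an already-connected ZooKeeper client and a hypothetical lock path layout.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch: claim a segment by creating an ephemeral lock node.
// If another builder already holds the node, skip the segment and move on.
public class SegmentLock {
  private final ZooKeeper zk;

  public SegmentLock(ZooKeeper zk) {
    this.zk = zk;
  }

  /** Returns true if this builder won the right to build the given segment. */
  public boolean tryClaim(String segmentName) throws InterruptedException, KeeperException {
    String lockPath = "/index-builders/locks/" + segmentName;  // assumed path layout
    try {
      zk.create(lockPath, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      return true;   // we hold the claim for the lifetime of our session
    } catch (KeeperException.NodeExistsException e) {
      return false;  // another builder is already working on this segment
    }
  }
}
```

In this sketch the lock node is ephemeral, so a builder that dies mid-build releases its claim automatically when its session expires and another builder can pick the segment up.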

Earlybird shards
The inverted index builders produced many inverted index segments. These segments were then distributed to machines called Earlybirds. Since each Earlybird machine could only serve a small portion of the full Tweet corpus, we had to introduce sharding.

In the past, we distributed segments across different hosts using a hash function. This works well with our real-time index, which remains a constant size over time. However, our full index clusters needed to grow continuously.

With simple hash partitioning, expanding clusters in place involves a non-trivial amount of operational work – data needs to be shuffled around as the number of hash partitions increases. Instead, we created a two-dimensional sharding scheme to distribute index segments onto serving Earlybirds. With this two-dimensional sharding, we can expand our cluster without modifying existing hosts in the cluster:

Temporal sharding: The Tweet corpus was first divided into multiple time tiers.
Hash partitioning: Within each time tier, data was divided into partitions based on a hash function.
Earlybird: Within each hash partition, data was further divided into chunks called Segments. Segments were grouped together based on how many could fit on each Earlybird machine.
Replicas: Each Earlybird machine is replicated to increase serving capacity and resilience.
The sharding is shown in this diagram:

This setup makes cluster expansion simple:

To grow data capacity over time, we will add time tiers. Existing time tiers will remain unchanged. This allows us to expand the cluster in place.
To grow serving capacity (QPS) over time, we can add more replicas.
This setup allowed us to avoid adding hash partitions, which is non-trivial if we want to perform data shuffling without taking the cluster offline.
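To make that concrete, the sketch below (hypothetical types and names, not Twitter's actual code) models a layout addressed by (time tier, hash partition); expanding the index appends a new tier and never re-hashes existing ones.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of two-dimensional sharding: the first dimension is a time
// tier, the second is a hash partition within that tier. Growing the index appends
// a new tier; existing tiers (and the hosts serving them) are left untouched.
public class ShardLayout {
  static class TimeTier {
    final LocalDate startInclusive;
    final LocalDate endExclusive;
    final int numHashPartitions;

    TimeTier(LocalDate startInclusive, LocalDate endExclusive, int numHashPartitions) {
      this.startInclusive = startInclusive;
      this.endExclusive = endExclusive;
      this.numHashPartitions = numHashPartitions;
    }

    boolean covers(LocalDate day) {
      return !day.isBefore(startInclusive) && day.isBefore(endExclusive);
    }
  }

  private final List<TimeTier> tiers = new ArrayList<>();

  // Expansion in place: just append a tier for the new time range.
  public void addTier(TimeTier tier) {
    tiers.add(tier);
  }

  // Route a Tweet to (tier index, hash partition) by its creation day and ID.
  public int[] shardFor(LocalDate tweetDay, long tweetId) {
    for (int i = 0; i < tiers.size(); i++) {
      TimeTier tier = tiers.get(i);
      if (tier.covers(tweetDay)) {
        int partition = (int) ((Long.hashCode(tweetId) & 0x7fffffffL) % tier.numHashPartitions);
        return new int[] {i, partition};
      }
    }
    throw new IllegalArgumentException("No tier covers " + tweetDay);
  }
}
```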

A larger number of Earlybird machines per cluster translates to more operational overhead. We reduced cluster size by:

Packing more segments onto each Earlybird (reducing hash partition count).
Increasing the amount of QPS each Earlybird could serve (reducing replicas).
In order to pack more segments onto each Earlybird, we needed to find a different storage medium. RAM was too expensive. Even worse, our ability to plug large amounts of RAM into each machine would have been physically limited by the number of DIMM slots per machine. SSDs were significantly less expensive ($/terabyte) than RAM. SSDs also provided much higher read/write performance compared to regular spindle disks.

However, SSDs were still orders of magnitude slower than RAM. When we switched from RAM to SSD, our Earlybird QPS capacity took a major hit. To increase serving capacity, we made multiple optimizations such as tuning kernel parameters to optimize SSD performance, packing multiple DocValues fields together to reduce SSD random access, loading frequently accessed fields directly in-process and more. These optimizations are not covered in detail in this blog post.

Earlybird roots
This two-dimensional sharding addressed cluster scaling and expansion. However, we did not want API clients to have to scatter-gather from the hash partitions and time tiers in order to serve a single query. To keep the client API simple, we introduced roots to abstract away the internal details of tiering and partitioning in the full index.

The roots perform a two-level scatter-gather (shown in the diagram below), merging search results and term statistics histograms. This results in a simple API, and it appears to our clients that they are hitting a single endpoint. In addition, this two-level merging setup allows us to perform additional optimizations, such as avoiding forwarding requests to time tiers not relevant to the search query.
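As a rough sketch of that fan-out (the interfaces and names are hypothetical and the merge step is deliberately simplified), a root first selects only the tiers overlapping the query's time range, then scatters to each hash partition within those tiers and merges the partial results:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the two-level scatter-gather performed by a root:
// level 1 fans out across relevant time tiers, level 2 across hash partitions.
public class SearchRoot {
  interface PartitionClient {
    List<Long> search(String query);   // returns matching Tweet IDs
  }

  interface Tier {
    boolean overlaps(long sinceMillis, long untilMillis);
    List<PartitionClient> partitions();
  }

  private final List<Tier> tiers;

  public SearchRoot(List<Tier> tiers) {
    this.tiers = tiers;
  }

  public List<Long> search(String query, long sinceMillis, long untilMillis, int limit) {
    List<Long> merged = new ArrayList<>();
    for (Tier tier : tiers) {
      // Optimization from the post: skip tiers outside the query's time range.
      if (!tier.overlaps(sinceMillis, untilMillis)) {
        continue;
      }
      for (PartitionClient partition : tier.partitions()) {
        merged.addAll(partition.search(query));   // in practice these calls run in parallel
      }
    }
    // Trivial merge: newest first (Tweet IDs are roughly time-ordered). The real roots
    // rank results and also merge term statistics histograms.
    merged.sort(Comparator.reverseOrder());
    return merged.subList(0, Math.min(limit, merged.size()));
  }
}
```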


Looking ahead
For now, complete results from the full index will appear in the "All" tab of search results on the Twitter web client and the Twitter for iOS and Twitter for Android apps. Over time, you'll see more Tweets from this index appearing in the "Top" tab of search results and in new product experiences powered by this index. Try it out: you can search for the first Tweets about New Years between Dec. 30, 2006 and Jan. 2, 2007.

The full index is a major infrastructure investment and part of ongoing improvements to the search and discovery experience on Twitter. There is still more exciting work ahead, such as optimizations for smart caching. If this project sounds interesting to you, we could use your help – join the flock!

Acknowledgments
The full index project described in this post was led by Yi Zhuang and Paul Burstein. However, it builds on multiple years of related work. Many thanks to the team members that made this project possible.
