A collection of real-time applications built with Apache Storm.
The classic example of a big data application, word count was extracted from storm-starter. It is composed of one bolt that splits sentences into words and another that counts the occurrences of each word in a hash map.
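The split-and-count pipeline can be sketched in plain Python, without any Storm dependency. The function names `split_sentence` and `count_words` are hypothetical, and lowercasing and whitespace tokenization are assumptions made for this illustration.

```python
from collections import Counter


def split_sentence(sentence):
    """Splitter stage: break a sentence into lowercase words (assumed tokenization)."""
    return sentence.lower().split()


def count_words(sentences):
    """Counter stage: keep a running count per word in a hash map."""
    counts = Counter()
    for sentence in sentences:
        for word in split_sentence(sentence):
            counts[word] += 1
    return counts


if __name__ == "__main__":
    counts = count_words(["the cow jumped over the moon", "the moon is bright"])
    for word, count in counts.most_common(3):
        print(word, count)
```

In a real topology the two stages would run as separate bolts, with words grouped by value so that each word always reaches the same counter instance.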
Also taken from storm-starter, this application extracts hashtags from tweets and tracks their number of occurrences in a rolling counter. The full counts are emitted periodically and ranked by number of occurrences, and the hashtags with the highest counts are finally emitted as the trending topics.
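A minimal sketch of the rolling counter and ranking steps, in plain Python. The class `RollingCounter`, the slot-based window, and the hashtag regular expression are assumptions for illustration and do not mirror the storm-starter implementation.

```python
import re
from collections import Counter, deque

HASHTAG = re.compile(r"#(\w+)")  # assumed hashtag pattern


def extract_hashtags(tweet):
    """Return the lowercase hashtags found in a tweet."""
    return [tag.lower() for tag in HASHTAG.findall(tweet)]


class RollingCounter:
    """Counts keys over the most recent `num_slots` time slots."""

    def __init__(self, num_slots):
        # deque drops the oldest slot automatically once the window is full
        self.slots = deque([Counter()], maxlen=num_slots)

    def add(self, key):
        self.slots[-1][key] += 1

    def advance(self):
        """Open a new slot, called once per emit period."""
        self.slots.append(Counter())

    def totals(self):
        total = Counter()
        for slot in self.slots:
            total.update(slot)
        return total


def trending(counter, n):
    """Rank the windowed counts and keep the n most frequent hashtags."""
    return counter.totals().most_common(n)
```

Calling `advance()` on a timer and `trending(counter, n)` right before it gives the periodic, ranked emission described above.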
This application was taken from papers about System S (IBM InfoSphere Streams). First, the VWAP (Volume Weighted Average Price) is calculated from a stream of trades. Another bolt then receives both the VWAP and a stream of quotes and calculates a bargain index, which indicates whether the offered quote is worth buying and how good a bargain it is.
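The two steps can be sketched as follows. The VWAP is the sum of price times volume divided by the total volume. The exponential form of the bargain index is an assumption for this illustration, since the text above does not give the formula, and the cumulative (non-windowed) VWAP is a simplification. The names `Vwap` and `bargain_index` are hypothetical.

```python
import math


class Vwap:
    """Cumulative volume-weighted average price over a stream of trades."""

    def __init__(self):
        self.price_volume = 0.0
        self.volume = 0.0

    def update(self, price, volume):
        """Add one trade (volume assumed positive) and return the new VWAP."""
        self.price_volume += price * volume
        self.volume += volume
        return self.price_volume / self.volume


def bargain_index(vwap, ask_price, ask_size):
    """Assumed formulation: positive only when the ask is below the VWAP."""
    if ask_price < vwap:
        return math.exp(vwap - ask_price) * ask_size
    return 0.0
```

A quote with an index of zero is not a bargain. Under this assumed formula, larger values mean a bigger discount relative to the VWAP, a larger offered size, or both.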
Tracks measurements from a set of sensor devices, calculates the moving average of these measurements, and checks whether the current reading exceeds a certain threshold relative to the moving average. If it does, an alert is emitted.
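A minimal sketch of the moving-average check for a single device. The class name `SpikeDetector`, the default window size and threshold, and the choice to include the current reading in the average are all assumptions for this illustration.

```python
from collections import deque


class SpikeDetector:
    """Flags readings that exceed the moving average by a relative threshold."""

    def __init__(self, window_size=10, threshold=0.03):
        self.window = deque(maxlen=window_size)  # keeps the latest readings only
        self.threshold = threshold

    def observe(self, value):
        """Return (moving_average, is_spike) for a new reading.

        Readings are assumed to be positive; the average includes the
        current reading.
        """
        self.window.append(value)
        average = sum(self.window) / len(self.window)
        is_spike = (value - average) > self.threshold * average
        return average, is_spike
```

For several devices, one detector per device id (for example in a dictionary) keeps the averages separate.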
Calculates the sentiment score for each tweet and produces a summary per state. Uses a very basic algorithm that counts occurrences of good and bad words in the message to calculate the score.
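The word-counting approach can be sketched as below. The word lists, the score definition (good words minus bad words), and the function names are assumptions for illustration only.

```python
from collections import defaultdict

# Tiny illustrative word lists (assumed, not the application's own)
GOOD_WORDS = {"good", "great", "love", "happy"}
BAD_WORDS = {"bad", "awful", "hate", "sad"}


def sentiment_score(text):
    """Score = number of good words minus number of bad words."""
    words = [word.strip(".,!?") for word in text.lower().split()]
    good = sum(word in GOOD_WORDS for word in words)
    bad = sum(word in BAD_WORDS for word in words)
    return good - bad


def summarize_by_state(tweets):
    """Sum the scores of (state, text) pairs per state."""
    totals = defaultdict(int)
    for state, text in tweets:
        totals[state] += sentiment_score(text)
    return dict(totals)
```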
VoIPSTREAM is an application composed of a set of filters and modules that are used to detect telemarketing spam in Call Detail Records (CDRs). A detailed description of the application can be found in the paper that describes an on-demand time-decaying bloom filter.
Calculates the current Click-Through Rate (CTR) for each query and ad pair. Predicts the probability of a given ad being clicked based on a set of features, such as the query, the position of the ad, and the advertiser.
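The CTR part amounts to dividing clicks by impressions for each (query, ad) pair, as in the sketch below. The class `CtrTracker` and its method names are hypothetical, and the prediction step is not covered here.

```python
from collections import Counter


class CtrTracker:
    """Tracks impressions and clicks per (query, ad) pair."""

    def __init__(self):
        self.impressions = Counter()
        self.clicks = Counter()

    def record_impression(self, query, ad):
        self.impressions[(query, ad)] += 1

    def record_click(self, query, ad):
        self.clicks[(query, ad)] += 1

    def ctr(self, query, ad):
        """Clicks divided by impressions; 0.0 when the ad was never shown."""
        shown = self.impressions[(query, ad)]
        if shown == 0:
            return 0.0
        return self.clicks[(query, ad)] / shown
```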
Reinforcement learning in the context of ads can be employed to maximize the CTR by choosing the ad or ads with the highest profit. As time goes by, an ad may be replaced by other ads in response to a decreasing CTR.
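One simple way to illustrate this idea is an epsilon-greedy selector, sketched below. Epsilon-greedy is a swapped-in technique chosen for brevity, not necessarily the algorithm used by the application, and the estimated CTR stands in for profit. All names here are hypothetical.

```python
import random


class EpsilonGreedyAdSelector:
    """Mostly shows the ad with the best estimated CTR, sometimes explores."""

    def __init__(self, ads, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.shown = {ad: 0 for ad in ads}
        self.clicked = {ad: 0 for ad in ads}

    def estimate(self, ad):
        """Observed CTR of an ad, or 0.0 if it has not been shown yet."""
        if self.shown[ad] == 0:
            return 0.0
        return self.clicked[ad] / self.shown[ad]

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.shown))
        return max(self.shown, key=self.estimate)

    def update(self, ad, was_clicked):
        """Record one impression of `ad` and whether it was clicked."""
        self.shown[ad] += 1
        self.clicked[ad] += int(was_clicked)
```

As an ad's observed CTR drops, `select()` starts favoring another ad, which matches the replacement behavior described above.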
Application | Source | Size |
---|---|---|
WC | Project Gutenberg | ~8GB |
BI | Yahoo Finance, Google Finance | — |
SD | Intel Berkeley Research Lab | 150MB |
TT, SA | Twitter Streaming | — |
MO | Google Cluster Traces | 36GB (compressed) |
CA, LP | 1998 World Cup Web Site Http Logs | 104GB |
SF | TREC 2007 Public Spam Corpus | 547MB (labeled) |
SF | SPAM Archive by Bruce Guenter | ~1.2GB (spam only) |
SF | Enron Email Dataset | 2.6GB (raw) |
SF | Enron Spam Dataset | 50MB (labeled) |