
A collection of real-time applications built with Apache Storm.


positivepsycho/storm-applications

 
 


storm-applications

A collection of real-time applications built with Apache Storm.

Applications

Wordcount (WC)

The classic example of a big data application, the wordcount application was taken from storm-starter. It is composed of one bolt that splits sentences into words and another that counts the occurrences of each word in a hash map.
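The split-then-count logic can be sketched in a few lines. This is a minimal illustration in plain Python, not the Storm topology itself: the function names `split_sentence` and `count_words` are hypothetical, and splitting on whitespace is an assumption about how sentences are tokenized.

```python
from collections import Counter


def split_sentence(sentence):
    """Splitter step: break one sentence into words (whitespace split assumed)."""
    return sentence.split()


def count_words(sentences):
    """Counter step: keep a running count per word in a hash map."""
    counts = Counter()
    for sentence in sentences:
        for word in split_sentence(sentence):
            counts[word] += 1
    return counts


counts = count_words(["the cow jumped over the moon", "the moon is full"])
```

In a real topology the two steps would run as separate bolts, with words typically routed by value so that each word is always counted by the same task.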

Trending Topics (TT)

Also taken from storm-starter, this application extracts hashtags from tweets and tracks their number of occurrences in a rolling counter. The full counts are emitted periodically and ranked by number of occurrences, and the hashtags with the highest counts are emitted at the end as the trending topics.
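A rolling counter can be approximated with a fixed number of time slots, where the oldest slot is dropped each time the window advances. The sketch below is a simplified stand-in written in plain Python, not the storm-starter implementation: `RollingCounter`, `extract_hashtags`, and `top_n` are hypothetical names, and lowercasing hashtags is an assumption.

```python
import re
from collections import Counter, deque

HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(tweet):
    """Return the hashtags in a tweet, lowercased (an assumption)."""
    return [tag.lower() for tag in HASHTAG.findall(tweet)]


class RollingCounter:
    """Counts keys over the most recent `num_slots` time slots."""

    def __init__(self, num_slots):
        # A bounded deque drops the oldest slot once the window is full.
        self.slots = deque([Counter()], maxlen=num_slots)

    def add(self, key):
        self.slots[-1][key] += 1

    def advance(self):
        """Open a new slot for the next time interval."""
        self.slots.append(Counter())

    def totals(self):
        total = Counter()
        for slot in self.slots:
            total.update(slot)
        return total


def top_n(counter, n):
    """Rank the current window totals and keep the n highest."""
    return counter.totals().most_common(n)


counter = RollingCounter(num_slots=2)
for tweet in ["#storm is fast", "learning #Storm and #kafka"]:
    for tag in extract_hashtags(tweet):
        counter.add(tag)
counter.advance()  # the window moves on to the next interval
for tag in extract_hashtags("#kafka #kafka #flink"):
    counter.add(tag)
trending = top_n(counter, 2)
```

Calling `advance()` once more pushes the first interval out of the two-slot window, so hashtags seen only in that interval stop contributing to the ranking.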

Bargain Index (BI)

This application was taken from papers about System S (IBM InfoSphere Streams). First, the VWAP (Volume Weighted Average Price) is calculated from a stream of trades. Another bolt then receives both the VWAP and a stream of quotes and calculates a bargain index, which indicates whether the quote being offered is worth buying and how good a deal it is.
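As a rough sketch of the two steps, the VWAP of a symbol is the sum of price times volume over its trades divided by the total traded volume. The scoring rule in `bargain_index` below is an illustrative assumption (a quote scores above zero only when its ask price is below the VWAP), not the formula from the System S papers. The names `VwapTracker` and `bargain_index` are hypothetical, and the VWAP here is cumulative rather than windowed.

```python
class VwapTracker:
    """Keeps a cumulative VWAP per symbol (no time window, an assumption)."""

    def __init__(self):
        self.price_volume = {}  # symbol -> running sum of price * volume
        self.volume = {}        # symbol -> running sum of volume

    def on_trade(self, symbol, price, volume):
        """Fold one trade in and return the updated VWAP (volume must be > 0)."""
        self.price_volume[symbol] = self.price_volume.get(symbol, 0.0) + price * volume
        self.volume[symbol] = self.volume.get(symbol, 0) + volume
        return self.price_volume[symbol] / self.volume[symbol]


def bargain_index(vwap, ask_price, ask_size):
    """Illustrative score: zero unless the ask is below the VWAP."""
    if ask_price >= vwap:
        return 0.0
    return (vwap - ask_price) * ask_size


tracker = VwapTracker()
tracker.on_trade("IBM", 100.0, 10)
vwap = tracker.on_trade("IBM", 110.0, 30)   # (1000 + 3300) / 40
good_quote = bargain_index(vwap, ask_price=105.0, ask_size=200)
bad_quote = bargain_index(vwap, ask_price=108.0, ask_size=200)
```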

Fraud Detection in Credit Card Transactions (FD)

Outlier Detection in Computer Network (MO)

Spike Detection in Sensor Network (SD)

Tracks measurements from a set of sensor devices and calculates the moving average of these measurements. If the current reading is above a certain threshold relative to the moving average, an alert is emitted.
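The check can be sketched with one fixed-size window per device. This is a minimal sketch under stated assumptions: the class name `SpikeDetector` is hypothetical, the current reading is included in the moving average, and "above a threshold" is read as exceeding the average by more than `threshold` times the average.

```python
from collections import defaultdict, deque


class SpikeDetector:
    def __init__(self, window_size, threshold):
        self.threshold = threshold
        # One bounded window of recent readings per device.
        self.windows = defaultdict(lambda: deque(maxlen=window_size))

    def on_reading(self, device_id, value):
        """Return (moving_average, is_spike) for the new reading."""
        window = self.windows[device_id]
        window.append(value)
        moving_avg = sum(window) / len(window)
        is_spike = (value - moving_avg) > self.threshold * moving_avg
        return moving_avg, is_spike


detector = SpikeDetector(window_size=3, threshold=0.5)
first = detector.on_reading("sensor-1", 10.0)
second = detector.on_reading("sensor-1", 10.0)
third = detector.on_reading("sensor-1", 40.0)   # window is now [10, 10, 40]
```

The last reading exceeds the moving average of 20 by more than 50 percent of that average, so it is flagged, while the two steady readings are not.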

Sentiment Analysis for Twitter (SA)

Calculates the sentiment score for each tweet and produces a summary per state. Uses a very basic algorithm that counts occurrences of good and bad words in the message to calculate the score.
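A word-counting scorer of this kind might look like the sketch below. The word lists are tiny placeholders rather than the lists the application uses, the function names are hypothetical, and punctuation is not stripped, which a real tokenizer would need to handle.

```python
from collections import defaultdict

# Placeholder word lists for illustration only.
GOOD_WORDS = {"good", "great", "love", "happy"}
BAD_WORDS = {"bad", "awful", "hate", "sad"}


def sentiment_score(text):
    """Score = number of good words minus number of bad words."""
    words = text.lower().split()
    good = sum(1 for word in words if word in GOOD_WORDS)
    bad = sum(1 for word in words if word in BAD_WORDS)
    return good - bad


def summarize_by_state(tweets):
    """Sum the scores of (state, text) pairs per state."""
    totals = defaultdict(int)
    for state, text in tweets:
        totals[state] += sentiment_score(text)
    return dict(totals)


summary = summarize_by_state([
    ("CA", "love it"),
    ("CA", "so sad"),
    ("NY", "great great"),
])
```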

VoIPSTREAM (Spam Detection in VoIP) (VS)

VoIPSTREAM is an application composed of a set of filters and modules that are used to detect telemarketing spam in Call Detail Records (CDRs). A detailed description of the application can be found in the paper that describes an on-demand time-decaying bloom filter.

Ads Analytics

Calculates the current Click-Through Rate (CTR) for each pair of query and ad. It also predicts the probability of a given ad being clicked from a set of features, such as the query, the position of the ad, and the advertiser.
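The CTR part reduces to clicks divided by impressions for each (query, ad) pair. The sketch below covers only that calculation, not the click prediction model. `CtrTracker` and its method names are hypothetical, and returning 0.0 for a pair with no impressions is an assumption.

```python
from collections import defaultdict


class CtrTracker:
    def __init__(self):
        self.impressions = defaultdict(int)  # (query, ad_id) -> times shown
        self.clicks = defaultdict(int)       # (query, ad_id) -> times clicked

    def on_impression(self, query, ad_id):
        self.impressions[(query, ad_id)] += 1

    def on_click(self, query, ad_id):
        self.clicks[(query, ad_id)] += 1

    def ctr(self, query, ad_id):
        """Clicks / impressions, or 0.0 if the pair was never shown."""
        key = (query, ad_id)
        shown = self.impressions.get(key, 0)
        if shown == 0:
            return 0.0
        return self.clicks.get(key, 0) / shown


tracker = CtrTracker()
for _ in range(4):
    tracker.on_impression("running shoes", "ad-42")
tracker.on_click("running shoes", "ad-42")
current_ctr = tracker.ctr("running shoes", "ad-42")
```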

Reinforcement Learner (RL)

Reinforcement learning in the context of ads can be employed to maximize the CTR by choosing the ad or ads with the highest profit. As time goes by, an ad may be replaced by other ads in response to a decreasing CTR.
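One common way to frame this is as a multi-armed bandit. The sketch below uses epsilon-greedy selection, chosen here only for illustration, since the source does not say which algorithm the application uses. It treats the observed CTR as the reward, and the class and method names are hypothetical.

```python
import random


class EpsilonGreedyAdSelector:
    def __init__(self, ad_ids, epsilon=0.1, seed=None):
        self.epsilon = epsilon              # probability of exploring
        self.rng = random.Random(seed)
        self.shown = {ad: 0 for ad in ad_ids}
        self.clicked = {ad: 0 for ad in ad_ids}

    def ctr(self, ad_id):
        shown = self.shown[ad_id]
        return self.clicked[ad_id] / shown if shown else 0.0

    def choose(self):
        """Explore a random ad with probability epsilon, else pick the best CTR."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.shown))
        return max(self.shown, key=self.ctr)

    def record(self, ad_id, clicked):
        """Update the statistics after an ad was displayed."""
        self.shown[ad_id] += 1
        if clicked:
            self.clicked[ad_id] += 1


# epsilon=0.0 disables exploration so the example is deterministic.
selector = EpsilonGreedyAdSelector(["ad-a", "ad-b"], epsilon=0.0)
selector.record("ad-a", clicked=True)
selector.record("ad-b", clicked=False)
first_choice = selector.choose()

# ad-a stops being clicked while ad-b picks up a click.
for _ in range(3):
    selector.record("ad-a", clicked=False)
selector.record("ad-b", clicked=True)
second_choice = selector.choose()
```

The second call to `choose()` switches ads because the first ad's CTR has dropped below the second's, which mirrors the replacement behavior described above.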

Spam Filter for Emails (SF)

Log Processing (LP)

Click Analytics (CA)

Datasets

| Application | Source | Size |
|---|---|---|
| WC | Project Gutenberg | ~8GB |
| BI | Yahoo Finance, Google Finance | |
| SD | Intel Berkeley Research Lab | 150MB |
| TT, SA | Twitter Streaming | |
| MO | Google Cluster Traces | 36GB (compressed) |
| CA, LP | 1998 World Cup Web Site Http Logs | 104GB |
| SF | TREC 2007 Public Spam Corpus | 547MB (labeled) |
| | SPAM Archive by Bruce Guenter | ~1.2GB (spam only) |
| | Enron Email Dataset | 2.6GB (raw) |
| | Enron Spam Dataset | 50MB (labeled) |
