Skip to Content

tag

Tag Archives: Cassandra

post

Analysis of twitter streams with Kafka and Storm

Following my last post, I will present a real-time processing sample with Kafka and Storm using the Twitter Streaming API.

Overview

Twitter Streaming

The solution consists of the following:

  • twitter-kafka-producer: A very basic producer that reads tweets from the Twitter Streaming API and stores them in Kafka.
  • twitter-storm-topology: A Storm topology that reads tweets from Kafka and, after applying filtering and sanitization, process the messages in parallel for:
    • Sentiment Analysis: Using a sentiment analysis algorithm to classify the tweet into a positive or negative feeling.
    • Top Hashtags: Calculates the top 20 hashtags using a sliding window.

 

Storm Topology

Twitter Topology

The Storm topology consist of the following elements:

  • Kafka Spout: The spout implementation to read messages from Kafka.
  • Filtering: Filtering out all non-english language tweets.
  • Sanitization: Text normalization in order to be processed properly by the sentiment analysis algorithm.
  • Sentiment Analysis: The algorithm that analyses word by word the text of the tweet, giving a value between -1 to 1.
  • Sentiment Analysis to Cassandra: Stores the tweets and its sentiment value in Cassandra.
  • Hashtag Splitter: Splits the different hashtags appearing in a tweet.
  • Hashtag Counter: Counts hashtag occurrences.
  • Top Hashtag: Does a ranking of the top 20 hashtags given a sliding windows (using the Tick Tuple feature from Storm).
  • Top Hashtag to Cassandra: Stores the top 20 hashtags in Cassandra.

 

Summary

In this post we have seen the benefits of using Apache Kafka & Apache Storm to ingest and process streams of data, on next posts will look at the implementation details and will provide some analytical insight from the data stored in Cassandra.

The sample can be found on Github: https://github.com/mserrate/twitter-streaming-app,

post

Cassandra on Azure CentOS VM

Having some fun with Cassandra lately I wanted to figure out how to setup a working environment on the new Windows Azure VM roles, so I decided to give a try and install a Cassandra cluster on CentOS.

Although it’s on Ubuntu, the following article is a good guide that helped me to configure a Linux cluster: https://www.windowsazure.com/en-us/manage/linux/other-resources/how-to-run-cassandra-with-linux/

We create the 1st VM assigning a pem certificate in order to get access by ssh:

Create VM

Read more »