Demo project with Apache Spark

Several years ago I was involved in a project that revolved heavily around data. I was challenged to build a scalable system for data processing (mostly enriching data and matching two or more data sets). What a shame Spark was not available at that time :). I recall that back then (2007) there were almost no sensible tools available for the job, so I was forced to do a lot of low-level work and build my own framework for parallel data processing (hello, sockets).

I have come up with a small project that will let me use Spark and gain a deeper understanding of the tool. I want to perform real-time analysis of how recent movies are received by Twitter users. To do that I'll need some training data (movie reviews from IMDB should be a good starting point). The project will require performing some RDD operations and using Spark Streaming and Spark ML.

Edit: Source code for the project is available on my GitHub: https://github.com/bamatosi/movie-reviews-classifier.

Basic concepts behind Apache Spark

Apache Spark is a cluster computing system. It provides APIs for Java and Scala, among other languages. It also ships with a neat set of tools for graph processing, structured data processing, machine learning, and data streaming (for instance from Kafka).

The basic data abstraction in Spark is the Resilient Distributed Dataset (RDD), an immutable, partitioned collection of elements that can be operated on in parallel. There are several ways to create an RDD: parallelizing an existing collection in the driver program, referencing a dataset in external storage (such as a text file), or transforming an existing RDD.
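To make this concrete, here is a minimal Scala sketch of the three creation paths, run against a local master; the file path is a hypothetical placeholder, not part of the actual project:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RDDDemo {
  def main(args: Array[String]): Unit = {
    // Local master for a quick demo; on a real cluster this would
    // point at the cluster manager instead
    val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1. Parallelize a collection that already lives in the driver
    val numbers = sc.parallelize(1 to 100)

    // 2. Reference a dataset in external storage (hypothetical path)
    val reviews = sc.textFile("data/imdb-reviews.txt")

    // 3. Transform an existing RDD into a new one (RDDs are immutable,
    //    so filter returns a fresh RDD rather than mutating numbers)
    val evens = numbers.filter(_ % 2 == 0)

    println(s"Even count: ${evens.count()}")
    sc.stop()
  }
}
```

Note that transformations like filter are lazy; nothing is computed until an action such as count forces evaluation.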