Demo project with Apache Spark

Several years ago I was involved in a project that revolved heavily around data. I was challenged to create a scalable system for data processing (mostly to enrich data and match two or more data sets). What a shame Spark was not available back then :). I recall that at that time (2007) there were almost no sensible tools for the job, so I was forced to do a lot of low-level work and build my own framework for parallel data processing (hello sockets).

I have come up with a small project where I can use Spark to gain a deeper understanding of the tool. I want to perform a real-time analysis of how Twitter users are receiving recent movies. To do that I'll need some training data (movie reviews from IMDb should be a good starting point). The project will require some RDD operations, Spark Streaming and Spark ML.
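
To make the plan a bit more concrete, here is a minimal sketch of the train-then-classify flow in plain Python. It is not Spark code and not taken from the project: the tiny training set, the `tweet_stream` generator and the Naive Bayes style scorer are all assumptions made for illustration, standing in for the IMDb reviews, the Twitter stream and a Spark ML model respectively.

```python
import math
from collections import Counter

# Hypothetical toy data standing in for labeled IMDb reviews.
TRAINING_REVIEWS = [
    ("a wonderful moving film with great acting", "pos"),
    ("great story and wonderful cast", "pos"),
    ("boring plot and terrible acting", "neg"),
    ("a terrible boring waste of time", "neg"),
]


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def train(reviews):
    """Count word occurrences per label, the kind of aggregation
    a map/reduce-by-key step would perform in a distributed setting."""
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in reviews:
        counts[label].update(tokenize(text))
    return counts


def classify(counts, text):
    """Score text per label with add-one smoothed log-likelihoods.
    Assumption: class priors are equal, so they are left out."""
    vocab = set(counts["pos"]) | set(counts["neg"])
    scores = {}
    for label, words in counts.items():
        total = sum(words.values())
        scores[label] = sum(
            math.log((words[token] + 1) / (total + len(vocab)))
            for token in tokenize(text)
        )
    return max(scores, key=scores.get)


def tweet_stream():
    """Hypothetical generator standing in for a live Twitter feed."""
    yield "great film wonderful acting"
    yield "terrible and boring"


model = train(TRAINING_REVIEWS)
results = [(tweet, classify(model, tweet)) for tweet in tweet_stream()]
for tweet, label in results:
    print(label, "<-", tweet)
```

In the real project the counting step would map onto RDD operations, the generator onto a streaming source, and the scorer onto a trained Spark ML classifier; the sketch only shows how the three pieces fit together.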

Edit: Source code for the project is available on my GitHub: https://github.com/bamatosi/movie-reviews-classifier.
