Demo project with Apache Spark

Several years ago I was involved in a project that revolved heavily around data. I was challenged to build a scalable system for data processing (mostly enriching data and matching two or more data sets). What a shame Spark was not available at that time :). I recall that back then (2007) there were almost no sensible tools available for the job, so I was forced to do a lot of low-level work and build my own framework for parallel data processing (hello, sockets).

I have come up with a small project that will let me use Spark and gain a deeper understanding of the tool. I want to perform real-time analysis of how recent movies are received by Twitter users. To do that I'll need some training data (movie reviews from IMDB should be a good starting point). The project will require performing some RDD operations and using Spark Streaming and Spark ML.

Edit: Source code for the project is available on my GitHub: https://github.com/bamatosi/movie-reviews-classifier.

Basic concepts behind Apache Spark

Apache Spark is a cluster computing system. It provides APIs for Java and Scala, among other languages. It also ships with a neat set of tools for graph processing, structured data processing, machine learning, and data streaming (for instance from Kafka).

The basic data abstraction in Spark is the Resilient Distributed Dataset (RDD), an immutable, partitioned collection of elements that can be operated on in parallel. There are several ways to create an RDD: parallelizing an existing collection in the driver program, referencing a dataset in external storage (such as a text file), or transforming an existing RDD.
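To make this concrete, here is a minimal Scala sketch of the three creation paths, run against a local master; the file path is a hypothetical placeholder, not part of the actual project:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RDDDemo {
  def main(args: Array[String]): Unit = {
    // Local master for a quick demo; on a real cluster this would
    // point at the cluster manager instead
    val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1. Parallelize a collection that already lives in the driver
    val numbers = sc.parallelize(1 to 100)

    // 2. Reference a dataset in external storage (hypothetical path)
    val reviews = sc.textFile("data/imdb-reviews.txt")

    // 3. Transform an existing RDD into a new one (RDDs are immutable,
    //    so filter returns a fresh RDD rather than mutating numbers)
    val evens = numbers.filter(_ % 2 == 0)

    println(s"Even count: ${evens.count()}")
    sc.stop()
  }
}
```

Note that transformations like filter are lazy; nothing is computed until an action such as count forces evaluation.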