Building facets navigation

Facets navigation is a very efficient way of organising navigation, filtering and searching whenever you have a lot of items in your dataset. I am sure you have already seen and used facets navigation, for instance when looking for a specific item while shopping on the web. A picture is worth more than words, so just look at the provided example (image: e-commerce facets).
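To make the idea concrete, here is a minimal sketch in plain Python of how facet counts can be computed for a filtered result set. The product catalogue, its field names and the helper functions `apply_filters` and `facet_counts` are all made up for illustration; a real shop would normally delegate this to a search engine or database.

```python
from collections import Counter

# Hypothetical product catalogue; names and fields are illustrative only.
PRODUCTS = [
    {"name": "Trail Runner", "brand": "Acme", "color": "red", "size": 42},
    {"name": "City Sneaker", "brand": "Acme", "color": "blue", "size": 42},
    {"name": "Hiking Boot", "brand": "Peak", "color": "red", "size": 44},
    {"name": "Sandal", "brand": "Peak", "color": "blue", "size": 42},
]


def apply_filters(items, selected):
    """Keep only the items that match every selected facet value."""
    return [
        item for item in items
        if all(item.get(field) == value for field, value in selected.items())
    ]


def facet_counts(items, fields):
    """Count how many items carry each value of each facet field."""
    return {field: Counter(item[field] for item in items) for field in fields}


if __name__ == "__main__":
    # The user has clicked the "red" value of the "color" facet.
    matching = apply_filters(PRODUCTS, {"color": "red"})
    # Counts shown next to the remaining facets for the narrowed result set.
    for field, values in facet_counts(matching, ["brand", "size"]).items():
        print(field, dict(values))
```

The counts are recomputed on the already filtered items, which is what lets the navigation show how many results each further choice would leave.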

Demo project with Apache Spark

Several years ago I was involved in a project that revolved heavily around data. I was challenged to create a scalable system for data processing (mostly to enrich data and match two or more data sets). What a shame Spark was not available back then :). I recall that at the time (2007) there were almost no sensible tools for the job, so I was forced to do a lot of low-level work and build my own framework for parallel data processing (hello sockets).

I have come up with a small project where I can use Spark to gain a deeper understanding of the tool. I want to perform a real-time analysis of how recent movies are received by Twitter users. To do that I’ll need some training data (movie reviews from IMDB should be a good starting point). The project will require some RDD operations, Spark Streaming and Spark ML.

Edit: Source code for the project is available on my GitHub: https://github.com/bamatosi/movie-reviews-classifier.

Basic concepts behind Apache Spark

Apache Spark is a cluster computing system. It provides APIs for Java and Scala (among others). It also provides a neat set of tools for graph and structured data processing, machine learning and data streaming (for instance from Kafka).

The basic data abstraction in Spark is the Resilient Distributed Dataset (RDD), an immutable, partitioned collection of elements that can be operated on in parallel. There are several ways to create an RDD.
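The "immutable, partitioned, processed in parallel" idea can be illustrated without Spark itself. The sketch below is a plain-Python analogy and not Spark's API: the helpers `partition`, `map_partitions` and `collect` are hypothetical names, threads stand in for cluster workers, and nothing here is distributed or resilient.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, num_partitions):
    """Split items into up to num_partitions contiguous, immutable chunks."""
    items = tuple(items)
    # Ceiling division, with a floor of 1 so an empty input does not give step 0.
    size = max(1, -(-len(items) // num_partitions))
    return tuple(items[i:i + size] for i in range(0, len(items), size))


def map_partitions(partitions, func):
    """Apply func to every element, one worker per partition.

    The input partitions are never modified; a new tuple of tuples is returned.
    """
    def run(part):
        return tuple(func(x) for x in part)

    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        return tuple(pool.map(run, partitions))


def collect(partitions):
    """Gather all partitions back into a single list, in order."""
    return [x for part in partitions for x in part]


if __name__ == "__main__":
    numbers = partition(range(10), 3)
    squares = map_partitions(numbers, lambda x: x * x)
    print(collect(squares))
    print(collect(numbers))  # the original collection is unchanged
```

Because every transformation returns a new collection instead of mutating the old one, each partition can be processed independently, which is the property that allows the work to be spread across workers.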

Coursera Data Mining specialization

Coursera is now offering a Data Mining specialization, which consists of six courses including a final capstone project. The first course has already finished, but Coursera may rerun it in the future, so I hope it will still be possible to complete the whole track.

For now I have signed up for the first available course, Text Retrieval and Search Engines, and I’m really excited about it. The textbook for the course is available at http://books.google.pl/books?id=FwtPRUBqUuQC&printsec=frontcover#v=onepage&q&f=false