Basic concepts behind Apache Spark

Apache Spark is a cluster computing system. It provides APIs for Java and Scala, among other languages. It also ships with a set of tools for graph processing, structured data processing, machine learning, and data streaming (for instance, from Kafka).

The basic data abstraction in Spark is the Resilient Distributed Dataset (RDD): an immutable, partitioned collection of elements that can be operated on in parallel. There are several ways to create an RDD, two of which are sketched below.
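A minimal Scala sketch of two common approaches: parallelizing an in-memory collection and loading data from external storage. The object name and the HDFS path are illustrative; everything else uses the standard Spark Core API.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for demonstration; on a real cluster the master
    // URL would point at the cluster manager instead of "local[*]".
    val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1. Parallelize an existing in-memory collection.
    val numbers = sc.parallelize(1 to 100)

    // 2. Load a dataset from external storage (path is hypothetical).
    // val lines = sc.textFile("hdfs:///data/input.txt")

    // RDDs are immutable: transformations such as map return a new RDD,
    // and the work is distributed across the RDD's partitions.
    val squares = numbers.map(n => n * n)

    // Actions such as reduce trigger the actual computation.
    println(squares.reduce(_ + _))

    sc.stop()
  }
}
```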