Sharing variables on Apache Spark

All functions passed to Spark operations (like map or reduce) and executed on remote machines use their own local copies of the variables they reference. It is important to notice that no updates to these copies are propagated back to the driver program.
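For example, a naive attempt to accumulate into a driver-side variable from within a task silently does nothing on the driver (a minimal sketch, assuming sc is a JavaSparkContext; the single-element array is only a workaround for Java's effectively-final capture rule):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4));

// The array is serialized into the task closure: each executor gets
// its own deserialized copy and mutates that copy, not the driver's.
int[] counter = {0};
rdd.foreach(x -> counter[0] += x);

System.out.println(counter[0]); // still 0 on the driver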

There are two ways to share variables:

The first is broadcasting a read-only variable to every node rather than shipping a copy of it with each task. The broadcast data is cached on each machine in serialized form and deserialized before each task runs.

Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});
broadcastVar.value(); // Always access the data through value() in any function run on the cluster
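As an illustration, a lookup table built once on the driver can be broadcast and then referenced inside transformations (a minimal sketch; the countryNames map and codes RDD are hypothetical names):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;

// Lookup table built once on the driver.
Map<String, String> countryNames = new HashMap<>();
countryNames.put("PL", "Poland");
countryNames.put("DE", "Germany");
Broadcast<Map<String, String>> namesBc = sc.broadcast(countryNames);

JavaRDD<String> codes = sc.parallelize(Arrays.asList("PL", "DE", "PL"));
// Executors read the cached copy through value() instead of
// receiving the map with every task.
JavaRDD<String> names = codes.map(code -> namesBc.value().get(code));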

The second way is using accumulators. These are especially useful for implementing counters: tasks running on the nodes cannot read an accumulator's value, they can only add to it; only the driver program can read it. Accumulators are updated only when an action actually executes the operation.

Accumulator<Integer> accum = sc.accumulator(0);
sc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x -> accum.add(x));
accum.value(); // returns 10

Accumulator<Integer> accum2 = sc.accumulator(0);
// `data` is a JavaRDD<Integer>; `f` is an arbitrary function.
data.map(x -> { accum2.add(x); return f(x); });
// accum2 is still 0 here: map is lazy, and no action has forced it to be computed.
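Calling an action on the result forces the computation, after which the driver can read the merged updates (a sketch continuing the snippet above):

// count() is an action: it executes the map, and each task's
// accumulator updates are sent back and merged on the driver.
data.map(x -> { accum2.add(x); return f(x); }).count();
accum2.value(); // now equal to the sum of the elements of data

Note that Spark guarantees each task's update is applied exactly once only for accumulator updates performed inside actions such as foreach; for updates inside transformations like map, an update may be applied more than once if a task or stage is re-executed.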

You can use a custom type in an accumulator by implementing the AccumulatorParam interface:

Accumulator<MyClass> vecAccum = sc.accumulator(new MyClass(...), new MyClassAccumulatorParam());
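A minimal sketch of what such an implementation could look like, following the Spark 1.x API used throughout this post (the MyClass definition below is hypothetical, standing in for whatever custom type you accumulate):

import org.apache.spark.AccumulatorParam;

// Hypothetical value type: a fixed-size vector of doubles.
class MyClass implements java.io.Serializable {
    final double[] values;
    MyClass(double... values) { this.values = values; }
}

class MyClassAccumulatorParam implements AccumulatorParam<MyClass> {
    // The identity element: a zero vector of the same length as the initial value.
    public MyClass zero(MyClass initial) {
        return new MyClass(new double[initial.values.length]);
    }
    // Merge two partial results; must be commutative and associative.
    public MyClass addInPlace(MyClass a, MyClass b) {
        for (int i = 0; i < a.values.length; i++) a.values[i] += b.values[i];
        return a;
    }
    // Add a single value produced by a task into the accumulator
    // (inherited from AccumulableParam; delegating to addInPlace is the usual choice).
    public MyClass addAccumulator(MyClass acc, MyClass value) {
        return addInPlace(acc, value);
    }
}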
