With more than 13 years in IT, Sergey is a CTO, engineer, and expert in data management platforms, highly loaded systems, and systems integration; he also has significant experience in investment banking (Troika Dialog/Sberbank CIB). At CleverDATA he is in charge of the technological vision and development of the 1DMP and 1DMC products. He regularly speaks at professional conferences on these topics.
Lie to Me… or Demystifying Spark Accumulators
Nowadays, the techniques for optimizing the processing speed of your data pipelines are fairly well known and understood. Here are a few of them:
1) Scaling vertically or horizontally, i.e. adding more servers, more RAM, more CPUs, more GPUs, or more network bandwidth, aka “brute force”.
2) Reading only the data you actually need, to minimize disk IO, e.g.:
– laying out the data according to your data access patterns using sharding, partitioning, bucketing, etc.
– using columnar data formats such as Parquet or ORC to avoid reading entire rows when you don’t have to
3) Minimizing network IO:
– preventing unnecessary shuffles by co-partitioning your datasets in advance.
From this talk you will learn a thing or two about Spark accumulators: how to update them as a side effect of your data processing pipelines (speeding up your jobs accordingly) while having each update applied exactly once. The Spark documentation states that this guarantee holds only in actions, and we will check whether that is always true … or maybe not.
Date: October 11, 2018