To Spark or not to Spark

A brief reflection on when I would use Spark


Last night I joined a Codemotion meetup about AI & Data Intelligence, and the second of the two speakers discussed PySpark for processing large amounts of data.

The talk had an introductory flavour, but I was struck by a question that came up during the Q&A session: “when is PySpark the right tool for the job?”.

The answer to this question might have seemed trivial when the only alternative for data processing was Pandas, but nowadays tools such as Polars and DuckDB should make you question whether Spark is the right tool for the job.

Spark can handle datasets larger than a single machine’s main memory by spreading the work across multiple machines: it partitions the data and distributes the transformations over a set of nodes (workers), each running one or more processes (executors).
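As a minimal sketch (the Parquet path and column names below are made up for illustration), a PySpark aggregation looks like this; the same code runs unchanged whether the session points at a local machine or a multi-node cluster:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a session; on a cluster this is the driver,
# which schedules tasks on executors running on the worker nodes.
spark = SparkSession.builder.appName("events-rollup").getOrCreate()

# The Parquet files are read in parallel: each executor processes its own
# partitions, and only the aggregated result needs to be shuffled together.
events = spark.read.parquet("s3://my-bucket/events/")  # hypothetical path

daily_counts = (
    events
    .groupBy("event_date", "event_type")  # hypothetical columns
    .agg(F.count("*").alias("n_events"))
)

daily_counts.write.mode("overwrite").parquet("s3://my-bucket/daily_counts/")
```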

There is a price to pay for this setup: the network bandwidth and latency needed to transfer data and coordinate work across nodes, the additional complexity of debugging distributed jobs, and the poor ergonomics of local development.

More often than not, the price is not worth paying if the dataset fits into main memory, or if you are dealing with one-off data processing jobs (e.g. for research projects).
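For comparison, here is a sketch of the same kind of rollup with DuckDB on a single machine (again, the file path and column names are invented): there is no cluster to provision, and the result comes back as an in-process dataframe.

```python
import duckdb

# DuckDB scans the Parquet files directly from disk; for many queries it can
# also spill intermediate results to disk, so the data does not necessarily
# have to fit entirely in RAM.
daily_counts = duckdb.sql(
    """
    SELECT event_date, event_type, COUNT(*) AS n_events
    FROM read_parquet('data/events/*.parquet')  -- hypothetical path
    GROUP BY event_date, event_type
    """
).df()  # materialise as a Pandas DataFrame
```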

Even if the dataset is larger than memory, it is worth evaluating whether “scaling up vertically” (i.e. renting a single machine with as much main memory and as many CPU cores as possible) is more cost-effective than “scaling out horizontally” (i.e. distributing data across multiple nodes).