Introduction
Apache Spark, a framework for parallel distributed data processing, has become a popular choice for building streaming applications, data lake houses and big data extract-transform-load data processing (ETL). It is horizontally scalable, fault-tolerant, and performs well at high scale. Historically however, managing and scaling Spark jobs running on Apache Hadoop clusters could be challenging and often time-consuming for many reasons, but surely at least due to the availability of physical systems and configuring the Kerberos security protocol that Hadoop uses. But there is a new kid in town – Kubernetes – as an alternative to Apache Hadoop. Kubernetes is an open-source platform for deployment and management of nearly any type of containerized application. In this article we’ll walk through the process of deploying Apache Spark on Amazon EKS with Canonical’s Charmed Spark solution.
Kubernetes provides a robust foundation platform for Spark based data processing…