Edge Node in a Big Data Ecosystem or Apache Spark Cluster

An edge node in a big data ecosystem is a computer or device that sits at the "edge" of a network.

Typically it sits closer to the data source and is responsible for collecting, pre-processing, and forwarding data to other nodes in the system for further processing or storage.

Edge nodes are used in big data systems to reduce the amount of data that the more powerful, centralized nodes must transmit and process, improving performance and reducing costs.

In an Apache Spark cluster, an edge node is a node that sits at the edge of the cluster and coordinates with external systems and clients. The edge node typically runs Spark's driver program (when jobs are submitted in client deploy mode) and communicates with the cluster manager (such as YARN or Mesos) to request resources for Spark jobs. It also acts as a client to the cluster, submitting jobs and querying their status.
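As a concrete sketch, submitting a job from an edge node with `spark-submit` might look like the following; the resource sizes and the application script name are assumptions for illustration, not values from this document:

```shell
# Submit a Spark application from the edge node.
# In client deploy mode the driver runs on the edge node itself,
# while the cluster manager (YARN here) allocates executors on
# the worker nodes.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 4 \
  --executor-memory 4G \
  my_job.py   # hypothetical application script
```

With `--deploy-mode cluster` instead, the driver would run inside the cluster rather than on the edge node, which changes where driver logs appear and which machine must stay up for the job's lifetime.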

In addition, the edge node acts as an entry point for external systems to interact with the Spark cluster: for example, a data ingestion system can push data to the edge node, which then distributes it to the worker nodes for processing. The edge node can also provide access to the cluster's web UIs, file systems, and other services, and act as a gateway for external clients to submit Spark jobs and queries to the cluster.
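For instance, pushing ingested data into the cluster's file system and checking on running jobs are both typically done from the edge node. A minimal sketch, assuming an HDFS-backed cluster and hypothetical paths:

```shell
# Copy a local file from the edge node into HDFS, where the
# worker nodes can read it (both paths are hypothetical):
hdfs dfs -put /data/incoming/events.log /user/etl/raw/

# List applications currently known to the YARN resource manager,
# e.g. to check the status of a submitted Spark job:
yarn application -list
```

Because these commands run on the edge node, external systems never need direct network access to the worker or master nodes.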

The edge node separates client-facing functionality from the rest of the cluster. This makes the cluster more secure and performant, since the worker and master nodes can be isolated from external clients and systems.