Apache Spark Architecture
Python Automation and Machine Learning for ICs: An Online Book by Yougui Liao (http://www.globalsino.com/ICs/)



=================================================================================

Apache Spark is a unified analytics engine for large-scale data processing. It is designed to handle both batch and streaming data efficiently. Here’s an overview of its architecture:

  • Key components of Spark architecture:
    • Driver Program: This is the main program of your application that runs the user-defined main() function. It converts the user program into tasks and schedules them to run on the cluster.
    • Cluster Manager: This is responsible for managing the cluster resources. Spark can run over a variety of cluster managers, including its own standalone cluster manager, Apache Mesos, Kubernetes, and Hadoop YARN.
    • Executors: Executors are distributed agents responsible for executing the tasks assigned to them by the driver program. Each executor runs multiple tasks in separate threads. Executors also provide in-memory storage for RDDs (Resilient Distributed Datasets) that are cached by user programs through Block Managers within each executor.
    • Task: A unit of work that is sent to the executor. Each task applies its unit of computation to a partition of the RDD.
  • Spark Core and Resilient Distributed Datasets (RDDs):
    • Spark Core: This is the fundamental part of the system that provides distributed task dispatching, scheduling, and basic I/O functionalities. It supports the RDD abstraction, which is the primary data structure of Spark.
    • Resilient Distributed Datasets (RDDs): RDDs are collections of data items distributed across the compute nodes of the cluster that can be processed in parallel. RDDs are designed to be fault-tolerant, capable of recomputing data in the event of a node failure.
  • Cluster Manager:
    • Spark can run over a variety of cluster managers (which allocate resources across applications). The most common are the Standalone Cluster Manager (native Spark cluster), Apache Hadoop YARN, Apache Mesos, and Kubernetes.
    • Here’s how Spark Cluster Manager works in detail:
      • Cluster Manager Role: The Cluster Manager is responsible for managing the resources of a cluster, such as allocating and releasing resources (like CPU, memory, and storage) to various applications running on the cluster. It functions as an intermediary between Spark and the underlying cluster infrastructure.
      • Operation: When a Spark application is submitted, the Cluster Manager is tasked with allocating resources to run the application. This involves negotiating with the cluster (like Hadoop YARN, Mesos, or Kubernetes) to obtain necessary resources (executors) that actually run the tasks of the Spark application.
      • Service Nature: The Cluster Manager runs as a separate service outside of the Spark application. This design allows the Cluster Manager to remain agnostic of application specifics and to focus purely on resource management.
      • Abstraction: One of the key benefits of this arrangement is the abstraction it provides. Spark itself doesn’t need to know the details of the cluster it is running on. Whether the underlying cluster is based on YARN, Mesos, or Kubernetes, Spark interacts with it through the Cluster Manager, which handles all the specifics of resource allocation and management on that particular type of cluster.
    • Apache Spark supports several types of cluster managers, each of which can manage the distribution and allocation of resources to Spark applications. The choice of cluster manager can depend on the specific requirements of the environment, such as the existing infrastructure, ease of setup, scalability needs, and integration with other tools. Here are the main types of cluster managers that Spark can run on:
      • Standalone Cluster Manager: This is the simplest cluster manager, included with Spark itself. It’s easy to set up and use, particularly good for simpler, smaller deployments or for development purposes. It manages Spark jobs within its own system.
      • Apache Hadoop YARN (Yet Another Resource Negotiator): YARN is a popular cluster manager originally designed for Hadoop that also supports Spark. It allows Spark to share a common cluster and resources with other big data tools and technologies, making it a good choice for environments where multiple data processing frameworks are in use.
      • Apache Mesos: Mesos is a general-purpose cluster manager that can also run Hadoop MapReduce and service applications. It provides more fine-grained resource sharing and scheduling capabilities compared to YARN, making it suitable for mixed workload environments and for running other distributed applications alongside Spark.
      • Kubernetes: As a modern container orchestration system, Kubernetes can manage Spark using its native capabilities for handling containerized applications. It is particularly useful for deploying Spark in cloud environments and for dynamic resource allocation and scaling. Kubernetes support in Spark is a relatively new addition but has been growing in popularity due to the robust tooling and scalability it offers.
      • Others: While not as common, Spark can also be adapted to work with other resource managers or cluster managers depending on specific needs, though this might require more custom setup and integration.
      Each of these cluster managers has its strengths and is suited to different deployment scenarios; the choice usually comes down to the requirements of the task, the existing infrastructure, and the other tools and processes already in place within an organization.
    • Choosing a Cluster Manager:
      • The choice of cluster manager can depend on various factors. If you're already using a Hadoop ecosystem, YARN might be the natural choice. For organizations that are adopting modern containerized environments, Kubernetes could be more appropriate. Mesos is suitable for mixed workloads, whereas the standalone mode could be preferred for Spark-focused environments without the need for additional setup (a short sketch of the corresponding master URLs follows this overview).
      • Other considerations might include the ease of setup and management, compatibility with existing systems, scalability needs, and specific features offered by each cluster manager like monitoring, security, and resource allocation.
  • Distributed Data Storage:
    • Spark does not have its own file management system, relying instead on external storage systems like Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, HBase, and many others.
  • Computational Models:
    • Batch Processing: Spark's core functionality revolves around batch processing, where data is divided into chunks and operations are executed over these chunks.
    • Stream Processing: Apache Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
  • APIs and Libraries:
    • Spark provides APIs in Scala, Java, Python, and R, making it accessible to a wide range of data engineers and scientists. The high-level APIs allow developers to focus on the computation, abstracting away much of the underlying complexity.
    • Libraries: On top of the core processing engine, Spark comes with several libraries including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing (short Spark SQL and Structured Streaming sketches follow this overview).
  • Driver and Executors:
    • Driver: The driver is the central coordinator of the Spark application. It converts the user program into tasks and schedules them on the executors.
    • Executors: Executors are distributed agents responsible for executing the tasks assigned to them by the driver. They also report back the state of computation and data to the driver.
  • Local Mode: This mode runs Spark on a single JVM on a single machine, and it is primarily used for development, testing, and debugging. When you run Spark in local mode, there is no need for a cluster manager since Spark runs as a single process. This mode is also useful for small data processing tasks that don't require distributed processing power (a minimal local-mode example follows this overview).
  • Spark UI:
    • Spark provides a web UI for monitoring jobs, stages, task execution, and resource usage across the cluster.

This architecture makes Spark highly efficient for a wide range of data processing tasks, from simple data load and SQL queries to complex machine learning and real-time stream processing.
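
As a concrete illustration of the components above (driver, SparkContext, executors, RDDs, tasks, and local mode), here is a minimal PySpark sketch that runs entirely in local mode; the application name and data are arbitrary placeholders.

  from pyspark.sql import SparkSession

  # The driver program: builds a SparkSession (which wraps a SparkContext).
  # "local[*]" runs Spark in local mode on all cores of this machine,
  # so no external cluster manager is needed.
  spark = SparkSession.builder \
      .master("local[*]") \
      .appName("ArchitectureDemo") \
      .getOrCreate()
  sc = spark.sparkContext

  # Create an RDD split into 4 partitions; each partition is processed by a
  # separate task on an executor (here, threads inside the single local JVM).
  rdd = sc.parallelize(range(1, 1001), numSlices=4)

  # Transformations are lazy; nothing executes until an action is called.
  squared = rdd.map(lambda x: x * x).cache()  # cache in executor memory

  # sum() is an action: the driver schedules tasks on the executors and
  # collects the result.
  print("sum of squares:", squared.sum())

  spark.stop()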
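
The choice of cluster manager is expressed mainly through the master URL, normally supplied with spark-submit --master but also settable in code. In the sketch below, the host names, port, and API-server address are placeholders; only "local[*]" works without a real cluster.

  from pyspark.sql import SparkSession

  # Pick ONE master URL, depending on the cluster manager in use:
  #   "local[*]"                       - local mode, no cluster manager
  #   "spark://master-host:7077"       - Spark standalone cluster manager
  #   "yarn"                           - Hadoop YARN (requires HADOOP_CONF_DIR)
  #   "k8s://https://api-server:6443"  - Kubernetes
  spark = SparkSession.builder \
      .master("local[*]") \
      .appName("ClusterManagerDemo") \
      .getOrCreate()

  print(spark.sparkContext.master)  # confirms which master URL is in effect
  spark.stop()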
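
Because Spark relies on external storage rather than its own file system, a typical batch job reads from a source such as HDFS or S3 and processes it with Spark SQL. The path, column names, and schema below are hypothetical; substitute a location that exists in your environment.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("SqlDemo").getOrCreate()

  # Hypothetical input path; it could equally be "s3a://bucket/events.csv"
  # or a local file when running in local mode.
  df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

  # Register the DataFrame as a temporary view and query it with SQL.
  df.createOrReplaceTempView("events")
  top = spark.sql("""
      SELECT device, COUNT(*) AS n
      FROM events
      GROUP BY device
      ORDER BY n DESC
      LIMIT 10
  """)
  top.show()

  spark.stop()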
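
For the streaming side, the sketch below uses Structured Streaming with the built-in rate source, which needs no external system: it emits timestamped rows at a fixed rate, and the query prints windowed counts to the console. The row rate and window size are arbitrary example values.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("StreamingDemo").getOrCreate()

  # The built-in "rate" source generates (timestamp, value) rows for testing.
  stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

  # Incremental computation: count rows per 10-second window.
  counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

  query = (counts.writeStream
           .outputMode("complete")
           .format("console")
           .trigger(processingTime="10 seconds")
           .start())

  query.awaitTermination(60)  # run for about a minute, then shut down
  spark.stop()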

Some additional details:

  • Driver: The driver process runs the main() function of your application and is the heart of a Spark job. It is responsible for translating the user's program into tasks, scheduling these tasks on executors, and managing their execution. The driver also maintains relevant information during the life of the Spark application.
  • Driver Program Modes: The mode in which the driver program runs can significantly influence the behavior and performance of a Spark application:
    • Client Mode: In client mode, the driver is launched in the same process as the client that submits the application, typically on your local machine or a gateway machine that has access to the cluster. This setup is beneficial for interactive and debugging purposes because it provides more direct control over the job and easier access to job logs. However, if the client machine is far from the worker nodes or has network limitations, it can increase the latency and reduce the overall performance.
    • Cluster Mode: In cluster mode, the Spark driver runs on one of the nodes inside the cluster. This can be beneficial for production jobs as it reduces the network latency between the driver and the executors and can help improve the overall performance of the application. It also allows the driver to benefit from the high availability features of the cluster management system. However, debugging can be more challenging in this mode since the driver logs are now on the cluster nodes.
  • SparkContext: This is the main entry point for Spark functionality. It represents the connection to a Spark cluster, and it is used by the Spark driver to establish and manage Spark jobs. SparkContext coordinates with the cluster manager (YARN, Mesos, Kubernetes, or Spark's own standalone manager) to allocate resources.
  • Static configuration in Apache Spark, set through SparkConf before the SparkContext is created, is used for application-level properties such as the application name (a short sketch follows this list).
  • Executors: Executors are processes running on the worker nodes that execute the tasks assigned to them by the driver. Each executor can run multiple tasks concurrently in separate threads. Executors also provide in-memory storage for RDDs (Resilient Distributed Datasets) that are cached by user programs, via the Block Manager within each executor.
  • For Java and Scala-based applications in Apache Spark, the best way to provide access to the application project for both the driver and the cluster executor processes is by creating an uber-JAR. An uber-JAR, also known as a fat JAR, includes all of your application's code along with all its dependencies, packaged into a single JAR file. This simplifies deployment and ensures that all necessary code and libraries are available to both the driver and the executors across the cluster (example spark-submit invocations for both deploy modes follow this list).
  • Shuffles: Shuffles are one of the most performance-impacting operations within a Spark job. A shuffle occurs when data needs to be redistributed across different executors or even across machines, which happens during operations like groupBy, reduceByKey, or join. During a shuffle, data is serialized and written to disk, and then it must be transferred over the network to other executors, which also involves deserialization and often additional disk I/O when the data is received. Because of the disk and network I/O, plus the cost of serialization and deserialization, shuffles can greatly affect the performance and scalability of Spark applications.
  • Jobs, Stages, and Tasks:
    • Jobs: A job corresponds to a Spark action (like save, collect) in your code. Each job results from an action in your Spark application.
    • Stages: Jobs are divided into stages. Stages are split based on transformations that require shuffling the data, like reduceByKey or join. A new stage starts with a new set of computations where the data needs to be shuffled across the executors.
    • Tasks: Each stage consists of tasks, where a task is the smallest unit of work that is sent to the executor. Each task corresponds to a combination of data and computation on that data. Tasks within the same stage can be performed in parallel (a small example showing a shuffle-induced stage boundary follows this list).
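
As a small illustration of static configuration, the SparkConf below sets application-level properties before the SparkContext is created; the memory and core values are arbitrary examples.

  from pyspark import SparkConf, SparkContext

  # Static configuration is fixed for the lifetime of the application and
  # must be set before the SparkContext is created.
  conf = (SparkConf()
          .setAppName("StaticConfigDemo")
          .setMaster("local[*]")
          .set("spark.executor.memory", "2g")   # example value
          .set("spark.executor.cores", "2"))    # example value

  sc = SparkContext(conf=conf)
  print(sc.appName, sc.master)
  sc.stop()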
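
The deploy mode and the application artifact (a Python script or an uber-JAR) are supplied to spark-submit on the command line. The commands below are illustrative only; the master URL, file names, and class name are placeholders.

  # Client mode: the driver runs in the submitting process
  # (convenient for interactive work and debugging).
  spark-submit --master yarn --deploy-mode client my_app.py

  # Cluster mode: the driver runs on a node inside the cluster
  # (typical for production jobs).
  spark-submit --master yarn --deploy-mode cluster my_app.py

  # For a Java/Scala application, ship an uber-JAR containing the application
  # code and all of its dependencies, and name the main class explicitly.
  spark-submit --master yarn --deploy-mode cluster \
      --class com.example.MyApp my-app-uber.jar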
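
The relationship between actions, shuffles, jobs, stages, and tasks can be seen in a tiny word-count style example; the input data is hard-coded purely for illustration.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.master("local[*]").appName("ShuffleDemo").getOrCreate()
  sc = spark.sparkContext

  words = sc.parallelize(["spark", "yarn", "spark", "mesos", "spark"], 2)

  # map is a narrow transformation: no data movement, so it stays in the
  # same stage as the parallelize step.
  pairs = words.map(lambda w: (w, 1))

  # reduceByKey requires a shuffle: records with the same key must be brought
  # together on one executor, so Spark inserts a stage boundary here.
  counts = pairs.reduceByKey(lambda a, b: a + b)

  # collect() is an action: it triggers one job, split into two stages
  # (before and after the shuffle), each made of one task per partition.
  print(counts.collect())

  spark.stop()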

=================================================================================