RDD (Resilient Distributed Dataset)

=================================================================================

RDD, or Resilient Distributed Dataset, is the fundamental data structure of Apache Spark. An RDD represents an immutable, distributed collection of objects that can be processed in parallel, and RDDs are the backbone of many Spark operations. They can be created from pre-existing data in storage systems such as HDFS and manipulated through various transformations and actions, including filtering, mapping, and aggregating. The "resilient" attribute refers to their capacity to recover from node failures, while the "distributed" attribute refers to their spread across many machines in a cluster, which enables parallel, fault-tolerant processing at scale. The Storage tab in the Apache Spark user interface displays details about cached RDDs.

Some key features of RDDs are:

  • Immutability and Partitioning: RDDs are immutable, meaning once they are created, they cannot be changed. This property simplifies programming because you don't have to manage state or worry about the inconsistencies that can arise in a distributed environment. Data in an RDD is split into logical partitions, which may be computed on different nodes of the cluster, enabling parallel processing.
  • Fault Tolerance: RDDs are resilient to failures. If any partition of an RDD is lost due to node failure, Spark can recompute the RDD from a lineage graph that describes the series of transformations applied to initial input data to build the RDD.
  • Lazy Evaluation: RDD operations are lazy, meaning that no computation happens until an action is performed that requires Spark to return a result to the driver program. This helps optimize the overall data processing pipeline by allowing Spark to run operations more efficiently.
  • Operations: RDDs support two types of operations: transformations, which create a new RDD from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. Common transformations include "map", "filter", and "groupBy", while typical actions include "count", "collect", and "reduce".
  • Creation: RDDs can be created through various methods, including loading external datasets, distributing a collection of objects (like a list or array), or transforming an existing RDD, as sketched below.
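
For illustration, here is a minimal sketch of the three creation routes listed above. The file path is hypothetical, and a local Spark installation is assumed:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RDDCreation").getOrCreate()
    sc = spark.sparkContext

    # 1. Distributing an in-memory collection of objects.
    rdd_from_list = sc.parallelize([10, 20, 30])

    # 2. Loading an external dataset (hypothetical path on HDFS or local disk).
    rdd_from_file = sc.textFile("hdfs:///data/sample.txt")

    # 3. Transforming an existing RDD into a new one.
    rdd_transformed = rdd_from_list.map(lambda x: x + 1)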

RDD actions trigger the execution of the transformations that have been built up in Spark. They compute a result and return it to the driver program. For instance, the reduce action aggregates the elements of an RDD and delivers the combined result to the driver program.
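
As a minimal sketch of the reduce action (assuming an existing SparkContext named sc, as in the creation example above):

    rdd = sc.parallelize([1, 2, 3, 4])

    # reduce is an action: it aggregates the elements pairwise and
    # returns the final result (here, the sum 10) to the driver program.
    total = rdd.reduce(lambda a, b: a + b)
    print(total)  # 10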

RDD transformations create a new RDD from an existing one. In Spark, transformations are considered lazy because they do not compute results immediately; instead, results are computed only when an action requires them. For instance, a map transformation applies a function to each element of a dataset, producing a new RDD.
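
The laziness is easy to see in a short sketch (again assuming an existing SparkContext named sc): the map line below only records the computation step, and nothing runs until collect is called:

    rdd = sc.parallelize([1, 2, 3])

    # map is a transformation: this line builds a new RDD but computes nothing.
    doubled = rdd.map(lambda x: x * 2)

    # collect is an action: only now does Spark run the map and
    # return [2, 4, 6] to the driver program.
    print(doubled.collect())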

In Apache Spark, the sequence of RDD transformation and action evaluation is:

  • Spark creates a Directed Acyclic Graph (DAG) during the creation of a Resilient Distributed Dataset (RDD). When an RDD is created, Spark builds a DAG of the operations leading to it. This step involves defining the transformations that will be applied to the data.
  • The DAG is associated with the new RDD. Each RDD maintains a pointer to its DAG, representing dependencies and transformations applied to it or its parent RDDs.
  • When an action is invoked, the driver program evaluates the DAG. Actions are operations that trigger the execution of the RDD computation; once an action is called, the driver program schedules the computation by evaluating the DAG.
  • Control and results return to the Spark driver program. As transformations and actions are executed, the control flow and the computed results come back to the Spark driver, which initiated the action.
  • Spark uses the DAG Scheduler to execute the transformations and updates the DAG accordingly. The DAG Scheduler is the component that translates RDD operations into stages that can be executed on the cluster; it optimizes the execution plan and handles failures and re-computations.

The script below is a minimal sketch of the sequence described above. It assumes a local Spark installation and a small, hypothetical input list:
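
    from pyspark.sql import SparkSession

    # Assumes a local Spark installation; the input list is hypothetical.
    spark = SparkSession.builder.appName("RDDExample").getOrCreate()
    data = [1, 2, 3, 4, 5]

    # Creating the RDD: Spark starts building the DAG.
    rdd = spark.sparkContext.parallelize(data)

    # Transformations: each one adds a node to the DAG; nothing runs yet.
    squared_rdd = rdd.map(lambda x: x ** 2)
    filtered_rdd = squared_rdd.filter(lambda x: x % 2 == 0)

    # Action: triggers evaluation of the DAG and returns the result
    # (the even squares) to the driver program.
    result = filtered_rdd.collect()
    print(result)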

Output (with the assumed input list):

[4, 16]

In this code:

  • Creating the RDD: This initial step begins the construction of the DAG. When we create an RDD using spark.sparkContext.parallelize(data), Spark starts tracking the operations and dependencies associated with this RDD.
    • rdd = spark.sparkContext.parallelize(data): Initializes the RDD and starts the DAG.
  • Applying Transformations: Each transformation applied to the RDD, such as rdd.map(lambda x: x ** 2) and squared_rdd.filter(lambda x: x % 2 == 0), adds nodes and edges to the DAG. These transformations define the computation steps and their dependencies but don't execute them.
    • squared_rdd = rdd.map(lambda x: x ** 2): Adds a transformation node to the DAG.
    • filtered_rdd = squared_rdd.filter(lambda x: x % 2 == 0): Further extends the DAG with another transformation node.
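  • Triggering the Action: filtered_rdd.collect() is the action in the sketch above. Calling it makes Spark evaluate the DAG, execute the queued transformations, and return the result to the driver program.
    • result = filtered_rdd.collect(): Evaluates the DAG and delivers the even squares to the driver.

Continuing from the same script, the lineage that Spark has recorded can be inspected with toDebugString (in PySpark it returns bytes, so the sketch decodes it before printing):

    # Print the dependency chain (the recorded DAG) for filtered_rdd.
    print(filtered_rdd.toDebugString().decode("utf-8"))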

=================================================================================