Data Storage in Hadoop (HDFS, HBase and YARN)
=================================================================================

Hadoop is a popular framework for handling large datasets in a distributed computing environment. Here are the key components and concepts related to data storage in Hadoop:

  • Hadoop Distributed File System (HDFS)

    • Purpose: HDFS is designed to store large data sets reliably and to stream those data sets at high bandwidth to user applications.

    • Architecture: It uses a master/slave architecture. A typical deployment has a single NameNode (master) and multiple DataNodes (slaves).
    • NameNode: Manages the file system namespace, maintains the file system tree and metadata for all the files and directories. It does not store the actual data.
    • DataNode: Stores the actual data as blocks. Each file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
    • Block Size: By default, the HDFS block size is 128 MB, configurable per file. This large block size amortizes the cost of seek operations: seek time becomes small relative to transfer time, so sequential reads run close to the disk's transfer speed.
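
To make this concrete, here is a minimal sketch that inspects HDFS metadata from Python over WebHDFS, using the third-party hdfs (hdfscli) package. The NameNode host, port (9870 is the Hadoop 3.x default web port), user, and file path are assumptions to adapt to your cluster:

    from hdfs import InsecureClient

    # Connect to the NameNode's WebHDFS endpoint (host/port/user are assumptions).
    client = InsecureClient('http://namenode.example.com:9870', user='hadoop')

    # Directory listings are answered from the NameNode's in-memory namespace.
    print(client.list('/'))

    # File metadata also comes from the NameNode; the data blocks themselves
    # live on DataNodes.
    status = client.status('/data/example.txt')  # hypothetical path
    print(status['blockSize'])    # e.g., 134217728 (128 MB) by default
    print(status['replication'])  # e.g., 3
    print(status['length'])       # file size in bytes
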
  • HBase is a distributed, scalable big-data store modeled after Google's Bigtable and part of the Apache Hadoop ecosystem. It runs on top of the Hadoop Distributed File System (HDFS) and provides Bigtable-like capabilities for Hadoop. Here is a detailed look at its key components, its architecture, and how it manages data:
    • Architecture Overview
      • Column-Oriented: Unlike traditional relational databases, HBase is a column-oriented database, which makes it particularly well suited to sparse data sets in which many columns hold null values.
      • Key-Value Store: Data is stored as rows, and each row is uniquely identified by a row key. Rows are composed of columns, and columns are grouped into column families.
    • Key Components
      • HMaster: Oversees the HBase cluster by assigning regions to RegionServers and managing the cluster metadata. It also handles DDL (Data Definition Language) operations like creating and deleting tables.
      • RegionServer: Manages regions, handling read and write requests for the data (regions) it holds. Each server runs in a JVM instance and can serve thousands of regions depending on configuration and the types of loads.
      • Region: A region is a contiguous range of rows stored together. The table is horizontally partitioned into regions, and each RegionServer serves a subset of the table’s regions.
      • ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by HBase.
      • HFile: The storage format for HBase. HFiles are binary files designed for high-speed lookups and are stored in HDFS.
    • Data Storage and Access
      • Column Families: Data is stored in columns grouped into column families, which must be declared at schema-design time. All members of a column family are stored together on disk, so grouping columns that are read together into the same family has a significant effect on HBase performance.
      • Cells: The intersection of a row and a column, where the data itself is stored. Each cell is versioned; the version is a timestamp assigned by the system or specified by the client (see the sketch after this sub-list).
      • Write-Ahead Log (WAL): Ensures data durability. Before any changes are made to an HFile, they are logged in the WAL. In case of a failure, the WAL is used to recover data.
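
The cell-versioning behavior described above can be observed from Python through HBase's Thrift gateway with the third-party happybase package. A minimal sketch; the host, table, and column names are assumptions, and the column family is assumed to have been created with VERSIONS >= 3:

    import happybase

    # Connect to an HBase Thrift server (9090 is its default port; the host is an assumption).
    connection = happybase.Connection('hbase-thrift.example.com', port=9090)
    table = connection.table('sensor_readings')  # hypothetical table

    # Two writes to the same cell create two versions, distinguished by timestamp.
    table.put(b'sensor-42', {b'cf:temp': b'21.5'})
    table.put(b'sensor-42', {b'cf:temp': b'22.1'})

    # Read back up to 3 versions of the cell, newest first, with their timestamps.
    for value, ts in table.cells(b'sensor-42', b'cf:temp',
                                 versions=3, include_timestamp=True):
        print(ts, value)
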
    • High Availability and Fault Tolerance
      • Automatic Failover: HBase uses HDFS for its underlying storage, which provides robustness through block replication. Its integration with ZooKeeper enables leader election when the active HMaster fails and assists in the recovery of failed RegionServers.
      • Region Replication: Region replication can be used to enhance availability; if one region server goes down, another can serve the same region’s data.
      • Load Balancing: HMaster automatically rebalances regions across RegionServers to distribute load evenly.
    • Scalability
      • Horizontal Scaling: HBase can scale out by adding more nodes to the cluster. New RegionServers can be added without downtime, and HBase can redistribute data across the new servers.
    • Use Cases
      • Real-time Query and Analysis: Due to its low latency data access, HBase is ideal for real-time querying of big data.
      • Time Series Data: Its column-oriented structure and its ability to store rows spanning a wide range of timestamps make it well suited to time-series analysis; a common row-key pattern is sketched below.
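
For time-series workloads, a common row-key pattern is to prefix the key with the entity identifier and append a fixed-width, reverse-ordered timestamp, so the newest rows for each entity sort first in a scan. A small design sketch in plain Python (the key layout is illustrative, not an HBase API):

    import time

    def time_series_row_key(sensor_id, epoch_ms=None):
        # Reverse the timestamp so that lexicographic order = newest first.
        if epoch_ms is None:
            epoch_ms = int(time.time() * 1000)
        reverse_ts = (2**63 - 1) - epoch_ms
        # Zero-pad to fixed width so keys compare correctly as bytes.
        return f'{sensor_id}#{reverse_ts:020d}'.encode()

    print(time_series_row_key('sensor-42'))  # e.g., b'sensor-42#09223370...'
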
    • Tools and Interfaces
      • Shell: HBase ships with an interactive command-line shell for working with tables and data directly.
      • APIs: A native Java API is available for programmatic access, and REST, Avro, and Thrift gateways serve non-Java front ends; a Python sketch via Thrift follows below.
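
Here is a sketch of the Thrift route from Python using the happybase package; the host, port, and table/column names are assumptions:

    import happybase

    connection = happybase.Connection('hbase-thrift.example.com', port=9090)

    # Column families are fixed at schema time; columns inside them are not.
    connection.create_table('users', {'cf': dict(max_versions=1)})

    table = connection.table('users')
    table.put(b'user-1001', {b'cf:name': b'Ada', b'cf:email': b'ada@example.com'})

    # Point lookup by row key ...
    print(table.row(b'user-1001'))

    # ... and a range scan over a contiguous slice of row keys.
    for key, data in table.scan(row_start=b'user-1000', row_stop=b'user-2000'):
        print(key, data)
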
  • Data Replication 
    • Purpose: Ensures reliability and high availability by replicating the data blocks across multiple machines.
    • Replication Policy: The default replication factor is 3, and it is configurable per file. With the default rack-aware policy, the first replica is placed on the writer's local node, the second on a node in a different rack, and the third on a different node in that same remote rack.
    • Rebalancing and Replication Management: Handled automatically by the NameNode, which periodically receives a heartbeat and a block report from each DataNode in the cluster.
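
Because the replication factor is per file, it can also be changed after the fact. A hedged sketch using the same hdfscli client as above (the path is hypothetical):

    from hdfs import InsecureClient

    client = InsecureClient('http://namenode.example.com:9870', user='hadoop')

    # Drop a cold file's replication factor from 3 to 2 to reclaim space;
    # the NameNode schedules removal of the surplus replicas.
    client.set_replication('/archive/old_logs.txt', replication=2)
    print(client.status('/archive/old_logs.txt')['replication'])
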
  • Write and Read Operations
    • Write Operation: When writing data, the client asks the NameNode to allocate new blocks for a file. The NameNode returns a list of DataNodes to host the block replicas, and the client streams each block to the first DataNode, which forwards it to the second, which forwards it to the third, forming a replication pipeline.
    • Read Operation: When reading data, the client queries the NameNode for the block locations. The client then accesses the DataNodes directly to retrieve the blocks.
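
The same flow from Python, sketched with hdfscli over WebHDFS (host, user, and path are assumptions): the client obtains block allocations or locations from the NameNode, then streams the bytes to or from DataNodes directly.

    from hdfs import InsecureClient

    client = InsecureClient('http://namenode.example.com:9870', user='hadoop')

    # Write: the NameNode allocates blocks and a DataNode pipeline; data streams there.
    with client.write('/tmp/demo.txt', overwrite=True) as writer:
        writer.write(b'hello hdfs\n')

    # Read: the NameNode returns block locations; the client reads from DataNodes.
    with client.read('/tmp/demo.txt') as reader:
        print(reader.read())
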
  • YARN (Yet Another Resource Negotiator)
    • Purpose: While not directly involved in storage, YARN is responsible for managing computing resources in clusters and using them for scheduling users' applications.
    • Components:
      • ResourceManager: Manages the use of resources across the cluster.
      • NodeManager: Manages resources and workflow on a single machine.
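
The ResourceManager exposes a REST API that is convenient for monitoring from Python. A sketch using requests (the host is an assumption; 8088 is the ResourceManager's default web port):

    import requests

    rm = 'http://resourcemanager.example.com:8088'

    # Cluster-wide counters: active nodes, free memory, running applications.
    metrics = requests.get(f'{rm}/ws/v1/cluster/metrics').json()['clusterMetrics']
    print(metrics['activeNodes'], metrics['availableMB'], metrics['appsRunning'])

    # Applications currently running; the 'apps' field is null when there are none.
    apps = requests.get(f'{rm}/ws/v1/cluster/apps',
                        params={'states': 'RUNNING'}).json()
    for app in ((apps.get('apps') or {}).get('app') or []):
        print(app['id'], app['name'], app['state'])
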
  • High Availability and Fault Tolerance
    • NameNode High Availability: Hadoop can be configured with an active/standby pair of NameNodes, so that if the active NameNode fails, the standby takes over and HDFS remains continuously available. (This standby is distinct from the legacy Secondary NameNode, which only performs checkpointing.)
    • DataNode Failure Handling: If a DataNode fails, the system automatically replicates the data blocks hosted on that node to other nodes to maintain the configured replication factor.
  • Tools and Utilities
    • HDFS Commands: For interaction with HDFS, administrators and users can use HDFS shell commands.
    • Web Interface: HDFS also provides a web UI to browse the file system, view the status of files, and manage the file system.
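
In keeping with the automation theme of this book, those shell commands can also be driven from Python via subprocess. A minimal sketch (the hdfs binary must be on PATH, and hdfs dfsadmin -report typically requires administrative privileges):

    import subprocess

    for cmd in (['hdfs', 'dfs', '-ls', '/'],           # list the root directory
                ['hdfs', 'dfs', '-du', '-h', '/user'],  # human-readable space usage
                ['hdfs', 'dfsadmin', '-report']):       # DataNode/cluster health summary
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        print(result.stdout)
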

=================================================================================