Hive in Hadoop
- An Online Book: Python Automation and Machine Learning for ICs by Yougui Liao -
Python Automation and Machine Learning for ICs                                                           http://www.globalsino.com/ICs/        



=================================================================================

Hive in Hadoop is a data warehousing tool designed to facilitate querying and managing large datasets residing in distributed storage. It is built on top of Apache Hadoop, a popular framework for processing large datasets using distributed computing techniques. Initially developed for Hadoop, Hive provides a SQL-like interface for running queries on large datasets stored in HDFS, and it is commonly used in data warehousing scenarios to manage and query structured data. Hive's key features and functions are:

  • SQL-like Language (HiveQL): Hive provides a SQL-like query language called HiveQL (HQL), which lets users who are already familiar with SQL write queries easily. This makes Hive accessible to data analysts who may not be skilled in Java, the language in which Hadoop and its MapReduce jobs are typically written.
  • Data Warehousing Components: Hive organizes data into tables, making it suitable for data warehousing tasks. It supports various data formats and can integrate with Hadoop input/output formats.
  • Metadata Storage: Hive stores metadata (the metastore) in a relational database, recording the structure of each table along with its column data types and other properties. This metadata is used during data serialization and deserialization.
  • Execution Engine: While Hive queries have a SQL-like syntax, they are converted into MapReduce, Tez, or Spark jobs under the hood to be executed across a Hadoop cluster. This enables handling of large-scale data across multiple machines.
  • Optimization: Hive also allows for some query optimizations, including query rewrites and other techniques that can improve performance over standard MapReduce jobs.
  • Connectivity: JDBC drivers allow Java applications to connect to Hive, and ODBC drivers provide the same for ODBC-based clients.
  • Extensibility: Users can extend Hive's capabilities by writing custom user-defined functions (UDFs) for tasks that are not covered by built-in functions.
  • File Formats: Hive supports several file formats, including:
    • Flat files (e.g., plain text files)
    • SequenceFile (a binary key-value format)
    • Record columnar formats such as RCFile, ORC (Optimized Row Columnar), and Parquet, which store data column by column for efficient analytical queries
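Because HiveQL statements are ordinary strings, they can be assembled in Python before being submitted through a client such as Beeline or PyHive. A minimal sketch of this idea — the `build_create_table` helper and the `wafer_defects` table are hypothetical names chosen only to show HiveQL DDL and the STORED AS clause that selects a file format:

```python
def build_create_table(table, columns, stored_as="ORC"):
    """Render a HiveQL CREATE TABLE statement for the given column spec.

    columns is a list of (name, hive_type) pairs; stored_as selects the
    file format clause (e.g., ORC, PARQUET, SEQUENCEFILE, TEXTFILE).
    """
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        f"  {cols}\n"
        f") STORED AS {stored_as};"
    )

# Hypothetical IC-inspection table stored in the Parquet columnar format.
ddl = build_create_table(
    "wafer_defects",
    [("wafer_id", "STRING"), ("defect_count", "INT")],
    stored_as="PARQUET",
)
print(ddl)
```

The rendered string would then be passed to a Hive client for execution; nothing here talks to a cluster.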
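Besides Java UDFs, Hive's extensibility also covers HiveQL's TRANSFORM clause, which streams each query row through an external script as one tab-separated line on stdin and reads transformed rows back from stdout. A minimal sketch of such a script in Python — the two-column layout (an ID string followed by a numeric measurement) is a made-up example:

```python
import io

def transform_line(line):
    """Parse one tab-separated input row and return the transformed row.

    Assumes two columns: an ID string and a numeric measurement.
    """
    key, value = line.rstrip("\n").split("\t")
    # Example transformation: upper-case the ID and round the measurement.
    return f"{key.upper()}\t{round(float(value), 2)}"

def run(stdin, stdout):
    # Hive pipes one row per line into stdin and reads rows back from stdout.
    for line in stdin:
        stdout.write(transform_line(line) + "\n")

# Local demonstration with in-memory streams; under Hive the script would
# instead end with run(sys.stdin, sys.stdout).
out = io.StringIO()
run(io.StringIO("w01\t3.14159\nw02\t2.71828\n"), out)
print(out.getvalue(), end="")
```

In HiveQL such a script would first be shipped with ADD FILE and then invoked along the lines of SELECT TRANSFORM(id, value) USING 'python transform.py' AS (id, value) FROM some_table.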

===========================================

=================================================================================