Hive

Hive is an SQL language that processes and analyzes data in Hadoop. It does not require knowledge of any programming language. Hive is not suitable for OLTP, it is designed for analyzing big data.

Hadoop in not a database, it is an ecosystem of tools that enables the features we require and desire when dealing with big data. Hadoop runs on HDFS and its native language is MapReduce. Hive converts your SQL commands to MapReduce. Hive also supports workflow integration with other tools such as Excel or Cognos.

Compared to working directly in MapReduce, Hive is a big improvement in user-friendliness but Hive is slow, batch-oriented, and does not fully support ANSI-SQL.

Hive is not the only good option when working with Hadoop. Apache Spark is a great product. It supports Python, Scala, and SQL.

How Does Data Flow in Hive

Data is collected from various data sources and stored in HDFS. The format of the data can be SQL, CSV, text documents, or anything else. Ath this point the data is raw, unprocessed, unstructured, and partitioned (distributed across servers). When the data moves from HDFS to Hive warehouse based on your queries, it is processed, structured, and aggregated. But it still remains partitioned. Beyond Hive warehouse, the data will contain the final data and data format desired by the end user.

Hive Data Units

Hive data is organized into databases, tables, partitions, and buckets. A database is simply a namespace. A table contains homogenous units of data with the same schema. SQL queries run against the tables. Tables can be managed or external. Managed tables move data from its current location in HDFS to Hive warehouse, which is a specific location on HDFS. External tables point to data in HDFS. Managed tables are more reliable because the location of data in HDFS can change and suddenly your Hive queries would generate errors.

HUE - Hive User Experience

We use Hive through Hue. It is a user-friendly interface that allows us to perform different functions for Hive. It makes managing data and workflow very easy. It comes with data browsers and query editors. You only need to click around for 30 minutes to understand the interface.

Importing data from CSV files

  1. Data Browsers > Metastore Tables > Create table (icon)
  2. Upload the file
  3. Define delimiter and verify results.

If your csv is not processed correctly, modify the csv delimiter or download and use a jar file that processes the csv correctly.

Fetching Data from Hive

We use select statements to fetch data from Hive. For selecting, Hive supports SELECT, FROM, JOIN, WHERE, GROUP BY, HAVING, ORDER BY, AND LIMIT. Aliases and functions such as count(), min(), max(), etc. are also supported.

select * from sales limit 100;

Complex Data Types

A row of table can contain array, map or struct. A map is a collection of key/value pairs, associative array. struct is a collection of name value fields:

-- struct('phone','home','123-456-7890')
select u.id, u.phone.home from user u;
-- key/value address('home','city',...)
select u.id, u.address['home'].city from user u;
-- array  subscriptions('nytimes','wsj','nat geo')
select u.id, u.subscrptions[0] from user u;

explode parses complex data structures to columns so you don't need to use . operators to reach your data

select u.id, sub.subscription
from u.user
lateral view explode(u.subscriptions) s and sub

Aggregating Data

Simple: sum(), min, max, avg, count advanced: stdev()

select sum(sales) as total
from sales
where year = 2016

Grouping set return multiple levels of aggregation in one statement.

select margin, sale
from sales
group by margin, sale
grouping sets (margin, sale)

the last line is union of (margin, sale) with margin and (margin, sale) with sale.

Cube returns all possible aggregate combinations

select margin, sale
from sales
group by margin, sale
with cube

Rollup returns hierarchical combinations

select margin, sale
from sales
group by margin, sale
with rollup

Joins are supported and use ANSI-SQL syntax.

Hive Functions

Hive Functions extend Hive. There are string functions such as concat and substr(), math functions such as abs() and sqrt(), date functions such as to_date(), and conditional functions such as if() and coalesce(). Their usage is pretty much how they are used in an OLTP SQL query.

Partitioned Tables

Hive tables often contain massive amounts of data simply because we use Hadoop for big data. Using conditional logic, you can divide this data into smaller tables called partitioned tables. For example, you can divide sales data by geography or fiscal year. In such case, each year or location will be in a different partition table.

First we create the table from csv. Go to Query Editors > Hive.

create table sales (sid int, location string, orderdate date, sale float)
partitioned by (year int)
stored as textfile;

Add subfolders under the sales directory for each partition year. Add csv file for each year.

Create partitions and add them to the table

alter table sales
add partition (year = 2016)
location '2016/';

Conclusion

Hive is pretty easy to work with. Simply log into HUE interface and write SQL queries. There are a few differences as you have seen here. After Hive, you should also look into Apache Spark.