Hive
Hive is a "pluggable software" running on top of YARN/HDFS. Its primary goal is to facilitate reading, writing, and managing large datasets residing in distributed storage using SQL.
With Hive, it is possible to project a structure onto data already in storage and to write Hive Query Language (HQL) statements that are similar to standard SQL statements. Of course, Hive implements only a limited subset of SQL commands, but is still quite helpful.
Under the hood, HQL statement are translated into MapReduce jobs and executed across a Hadoop cluster.
Resources
This tutorial is heavily inspired from the HortonWorks tutorial and is illustrated in command-line.
Data Preparation
Here, we will use a CSV file with batting records, i.e. statistics about baseball players for each year.
The headers are:
playerID, yearID, stint, teamID, lgID, G ,G_batting, AB, R, H, 2B, 3B, HR, RBI, SB, CS, BB, SO, IBB, HBP, SH, SF, GIDP, G_old
which stand for:
G=games, AB=at bats, R=runs, H=hits, 2B=doubles, 3B=triples, HR=dinger, RBI=runs batted in, SB=stolen base, CS=caught stealing, BB=base on balls, SO=strikeout, IBB=intentional walks, HBP=hit by pitch, SH=sacrifice hits, SF=sacrifice flys, GIDP=ground into double play
In this program, we will simply output the maximum number of runs made by one player for each year.
- Load the csv file Batting.csv into your home folder in HDFS:
hdfs dfs -copyFromLocal Batting.csv /user/$(whoami)
Create a database in Hive
Note: In order to have this tutorial to work for everybody, let's create a database prefixed by your username (${env:USER}
inside hive).
Type the "hive" command to start the hive shell, and continue as follows:
$ hive create database ${env:USER}_test;
Create a temp table to load the file into
$ hive use ${env:USER}_test; create table temp_batting (col_value STRING); LOAD DATA INPATH '/user/${env:USER}/Batting.csv' OVERWRITE INTO TABLE temp_batting;
Create the final table and insert the data into
$ hive use ${env:USER}_test; create table batting (player_id STRING, year INT, runs INT); insert overwrite table batting SELECT regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id, regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year, regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run from temp_batting;
Run your first query
$ hive --database ${USER}_test SELECT year, max(runs) FROM batting GROUP BY year;
Run a more complex query
$ hive --database ${USER}_test SELECT a.year, a.player_id, a.runs from batting a JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs) ;
Hive external table
As you might have noticed, with hive "normal" tables, you need to upload the data in HDFS, create a temp table, load the data into this temp table, create another, final, table and eventually copy and format the data from the temp table to the final one.
Another idea is to rely on hive external table, which will read the data directly from the CSV file.
Let's create a table mapping the CSV file and run a query on top of it.
Note: the external table location (where the data is physically stored) MUST be a folder. We'll change the directory structure to accommodate this requirement.
hdfs dfs -mkdir /user/$(whoami)/Batting hdfs dfs -copyFromLocal Batting.csv /user/$(whoami)/Batting # note here that the file needs to be re-uploaded because the LOAD DATA consumed (and removed the file)
And then create the external table
$ hive --database ${USER}_test create EXTERNAL table batting_ext (player_id STRING, year INT, stint STRING, team STRING, lg STRING, G STRING, G_batting STRING, AB STRING, runs INT, H STRING, x2B STRING, x3B STRING, HR STRING, RBI STRING, SB STRING, CS STRING, BB STRING, SO STRING, IBB STRING, HBP STRING, SH STRING, SF STRING, GIDP STRING, G_Old STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/${env:USER}/Batting/';
Let's run the two queries
$ hive --database ${USER}_test SELECT year, max(runs) FROM batting_ext GROUP BY year; SELECT a.year, a.player_id, a.runs from batting_ext a JOIN (SELECT year, max(runs) runs FROM batting_ext GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs) ;
Cleanup
Once you're done, you can cleanup your environment, deleting the tables and the database.
$ hive drop table temp_batting; drop table batting; drop table batting_ext; drop database ${env:USER}_test;