Hive Research Notes

A while ago I was looking at Hive (https://hive.apache.org/) for a potential big data project. Below are some of my research notes.

I generated the following test data sets:

  • Users – approximately 500k users with fake (generated) phone numbers and emails
  • Actions – 1 to 50 historical actions (check-ins, social media activities) per user, linked by phone number
  • Profiles – simulated “dynamic” profile data, including a general profile and gender-specific profile data
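The generator I actually used isn't shown here, but a minimal sketch of producing a users CSV in the comma-delimited ID,phone,email shape loaded below might look like this (the fake phone/email logic and file name are purely illustrative):

```python
import csv
import random

def generate_users(path, n):
    """Write n fake users as ID,phone,email rows (no header),
    matching the comma-delimited schema of the `users` table."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for user_id in range(1, n + 1):
            phone = "555%07d" % random.randint(0, 9999999)  # fake phone number
            email = "user%d@example.com" % user_id           # fake email
            writer.writerow([user_id, phone, email])

generate_users("users_10.csv", 10)  # small sample; the real file had ~500k rows
```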

Data can be loaded from a flat file. The file format is specified by the table schema.

hive> CREATE TABLE users (ID INT, phone STRING, email STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH '/home/bm/RND/Hive/users_500000.csv' INTO TABLE users;

  • A table in Hive is in fact a file in HDFS
  • Hive allows you to load data using Java, PHP, and Python APIs
  • Hive accepts connections over JDBC

You operate on the loaded data using HiveQL (HQL):

hive> select count(*) from users;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/bm/bm_20121025112727_a272beb0-1e81-4b63-a83c-c0257e559041.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2012-10-25 11:27:19,504 null map = 0%,  reduce = 0%
2012-10-25 11:27:25,523 null map = 100%,  reduce = 0%
2012-10-25 11:27:28,529 null map = 100%,  reduce = 100%
Ended Job = job_local_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
200
Time taken: 20.824 seconds

As you can see from the example, Hive translates every query into Map-Reduce jobs executed on the Hadoop cluster.
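For intuition, the `count(*)` above corresponds roughly to a map phase that emits a 1 per input row and a reduce phase that sums the emitted values. A toy Python sketch of that shape (an illustration only, not the job Hive actually generates):

```python
from functools import reduce

# A few sample rows standing in for the `users` table.
rows = [
    (1, "5550000001", "user1@example.com"),
    (2, "5550000002", "user2@example.com"),
    (3, "5550000003", "user3@example.com"),
]

# Map: emit a 1 for every input row.
mapped = [1 for _ in rows]

# Reduce: sum the emitted values to get the row count.
total = reduce(lambda a, b: a + b, mapped, 0)

print(total)  # analogous to SELECT count(*) over the sample rows
```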

You can store the results in a table:

hive> CREATE TABLE user_attribs (ID INT, phone STRING, email STRING, sex STRING, city STRING, education STRING, q1 STRING, q2 INT, q3 STRING, q4 STRING, q5 STRING);

hive> INSERT OVERWRITE TABLE user_attribs SELECT u.ID, u.phone, u.email, g.sex, g.city, g.education, m.q1, m.q2, m.q3, m.q4, w.q5 FROM profile_general g JOIN profile_man m ON (g.email = m.email) JOIN users u ON (g.email = u.email) JOIN profile_woman w ON (w.ID = u.ID);
hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/bm/Dev/KP_TEST_CASES_500000/hive_out' SELECT * FROM user_attribs;

If you don’t specify an output table, the result is written to a temporary table and printed to stdout.
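One gotcha with the directory export above: by default Hive writes its text output with the Ctrl-A character (`\001`) as the field delimiter, not commas. A small sketch of reading such a file back in Python (the sample file and its contents are made up for illustration):

```python
def read_hive_output(path, delimiter="\x01"):
    """Parse a Hive-exported text file; Hive's default field
    separator for directory exports is Ctrl-A (\x01)."""
    with open(path) as f:
        return [line.rstrip("\n").split(delimiter) for line in f]

# Build a tiny sample file in the same format for illustration.
with open("hive_out_sample.txt", "w") as f:
    f.write("1\x015550000001\x01user1@example.com\n")
    f.write("2\x015550000002\x01user2@example.com\n")

parsed = read_hive_output("hive_out_sample.txt")
print(parsed[0])  # ['1', '5550000001', 'user1@example.com']
```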
