A while ago I was looking at Hive (https://hive.apache.org/) for a potential big data project. Below are some of my research notes.
I generated the following test data sets:
- Users – approximately 500k user records with fake (generated) phone numbers and emails
- Actions – 1 to 50 historical actions (check-ins, social media activities) per user, linked by phone number
- Profiles – simulated “dynamic” profile data, including a general profile and gender-specific profile data
Data can be loaded from flat files; the file format is specified in the table schema.
```
hive> CREATE TABLE users (ID INT, phone STRING, email STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH '/home/bm/RND/Hive/users_500000.csv' INTO TABLE users;
```
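The other data sets can be loaded the same way. The DDL below is an illustrative sketch based on the descriptions above — the column names are my guesses, not the schemas I actually used:

```
-- hypothetical schemas for the remaining data sets
hive> CREATE TABLE actions (phone STRING, action_type STRING, action_ts STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> CREATE TABLE profile_general (email STRING, sex STRING, city STRING, education STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```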
A table in Hive is in fact a file (or a directory of files) in HDFS.
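You can check the backing HDFS location of a table from the Hive shell. The output includes a `location:` field; by default it points somewhere under the warehouse directory (the exact path depends on your configuration):

```
hive> DESCRIBE EXTENDED users;
-- the "Detailed Table Information" line includes the table's HDFS location,
-- typically under /user/hive/warehouse by default
```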
Hive lets you load and query data through Java, PHP, and Python APIs (via its Thrift service).
Hive also accepts connections over JDBC.
You operate on loaded data using HQL, Hive's SQL-like query language.
```
hive> select count(*) from users;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/bm/bm_20121025112727_a272beb0-1e81-4b63-a83c-c0257e559041.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2012-10-25 11:27:19,504 null map = 0%, reduce = 0%
2012-10-25 11:27:25,523 null map = 100%, reduce = 0%
2012-10-25 11:27:28,529 null map = 100%, reduce = 100%
Ended Job = job_local_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
200
Time taken: 20.824 seconds
```
As you can see from the example, Hive translates every query into Map-Reduce jobs executed on the Hadoop cluster.
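More complex HQL, such as a grouped aggregation, is compiled to Map-Reduce jobs in exactly the same way. A sketch against the actions data set (the `action_type` column name is my assumption, not part of the original schema):

```
hive> SELECT action_type, COUNT(*) AS cnt
    > FROM actions
    > GROUP BY action_type
    > ORDER BY cnt DESC;
```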
You can store the results in a table:
```
hive> CREATE TABLE user_attribs (ID INT, phone STRING, email STRING, sex STRING,
    >   city STRING, education STRING, q1 STRING, q2 INT, q3 STRING, q4 STRING, q5 STRING);
hive> INSERT OVERWRITE TABLE user_attribs
    > SELECT u.ID, u.phone, u.email, g.sex, g.city, g.education, m.q1, m.q2, m.q3, m.q4, w.q5
    > FROM profile_general g
    > JOIN profile_man m ON (g.email = m.email)
    > JOIN users u ON (g.email = u.email)
    > JOIN profile_woman w ON (w.ID = u.ID);
hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/bm/Dev/KP_TEST_CASES_500000/hive_out'
    > SELECT * FROM user_attribs;
```
If you don’t specify an output table, the result is dumped into a temporary table and printed to stdout.
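For instance, running a join directly, without any INSERT clause, just prints the rows to the console (a sketch reusing the tables above):

```
hive> SELECT u.ID, g.sex, g.city
    > FROM users u JOIN profile_general g ON (u.email = g.email)
    > LIMIT 10;
```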