Split View: Apache Hive Query example (MovieLens)
Apache Hive Query example (MovieLens)
Overview
일반적으로 HDFS의 데이터를 다루는 법은, Mapreduce 프로그램을 작성하는 것이다. Mapreduce는 자유도는 높지만, 제대로 사용하기 위해서는 많은 공부와 시행착오가 필요하다. 그리고 가장 큰 단점은, 어렵다. 따라서 생산성이 낮다. Hive는 개발자에게 친숙한 SQL을 작성하면, 이를 Mapreduce 잡으로 변경해준다. (최근에는 Hive의 엔진으로 Mapreduce가 아니라 훨씬 빠른 Spark 또는 Tez 를 사용한다고 한다.)위에서 따라서 SQL만 아는 개발자여도 HDFS상의 데이터를 저장하거나, 조회하거나 유용한 정보를 추출할 수 있게된다. Movie Lens에서 제공하는 데이터를 이용해 Hive Query로 데이터를 다뤄보며, Hive가 얼마나 유용한지 살펴보자.
데이터 준비
다운로드
그룹렌즈라는 사이트에서 제공하는 https://grouplens.org/datasets/movielens/에서 제공하는 100k 데이터를 사용한다.

wget https://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
hadoop fs -cp ml-100k/u.data /tmp
Hive 테이블 생성
hive shell을 실행해 주고, 아래와 같은 명령어로 u_data 테이블을 생성한다.
use default
CREATE TABLE IF NOT EXISTS u_data(
userid INT,
movieid INT,
rating INT,
unixtime TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA INPATH 'hdfs:///tmp/u.data' overwrite into table u_data;
u_data 테이블이 잘 생성되었는지 확인한다.
hive> show databases;
OK
default
userdb
Time taken: 0.312 seconds, Fetched: 2 row(s)
hive> use default;
OK
Time taken: 0.023 seconds
hive> show tables;
OK
u_data
Time taken: 0.044 seconds, Fetched: 1 row(s)
hdfs 상에서도 해당 테이블이 잘 생성되었는지 확인한다.
root@ubuntu01:~# hdfs dfs -ls /user/hive/warehouse
Found 2 items
drwxr-xr-x - root supergroup 0 2022-11-22 18:52 /user/hive/warehouse/u_data
drwxr-xr-x - root supergroup 0 2022-11-22 18:43 /user/hive/warehouse/userdb.db
Query
table schema
hive> describe u_data;
OK
userid int
movieid int
rating int
unixtime timestamp
Time taken: 0.087 seconds, Fetched: 4 row(s)
row count
u_data 라는 테이블에 총 몇 개의 row 가 있는지 출력하는 쿼리를 날려보자.
hive> select count(*) from u_data;
OK
100000
Time taken: 20.914 seconds, Fetched: 1 row(s)
u_data 테이블에는 100k(10만개의) row가 있음을 확인할 수 있다.
distinct movieid
hive> select count(distinct movieid) from u_data;
OK
1682
영화 종류
distinct명령어를 통해 movieid의 갯수를 구할 수 있다.
hive> select count(distinct movieid) from u_data;
OK
1682
영화별 평점
영화별 평점 평균을 구하기 위해서는 아래와 같은 명령어를 사용할 수 있다.
GROUP BY명령어를 통해 집계 연산을 수행할 수 있다.
hive> select movieid, avg(rating) from u_data group by movieid;
영화별 평점 순위
SORT BY my_column (ASC|DESC) 명령어를 사용하면 my_column 컬럼 기준으로 내림차순 정렬 및 출력이 가능하다.
hive> select movieid, avg(rating) as avg_rating from u_data group by movieid sort by avg_rating DESC;
Reference
Apache Hive Query Example (MovieLens)
Overview
The typical way to work with data in HDFS is to write a MapReduce program. MapReduce offers high flexibility, but requires significant study and trial-and-error to use properly. And the biggest drawback is that it is difficult, resulting in low productivity. Hive converts developer-friendly SQL into MapReduce jobs. (Recently, it is said that Hive uses much faster engines like Spark or Tez instead of MapReduce.) Therefore, even developers who only know SQL can store, query, and extract useful information from data on HDFS. Let's explore how useful Hive is by working with data provided by MovieLens using Hive queries.
Data Preparation
Download
We will use the 100k dataset provided by https://grouplens.org/datasets/movielens/ from the GroupLens site.

wget https://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
hadoop fs -cp ml-100k/u.data /tmp
Creating a Hive Table
Launch the hive shell and create the u_data table using the following command.
use default
CREATE TABLE IF NOT EXISTS u_data(
userid INT,
movieid INT,
rating INT,
unixtime TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA INPATH 'hdfs:///tmp/u.data' overwrite into table u_data;
Verify that the u_data table was created successfully.
hive> show databases;
OK
default
userdb
Time taken: 0.312 seconds, Fetched: 2 row(s)
hive> use default;
OK
Time taken: 0.023 seconds
hive> show tables;
OK
u_data
Time taken: 0.044 seconds, Fetched: 1 row(s)
Also verify that the table was created properly on HDFS.
root@ubuntu01:~# hdfs dfs -ls /user/hive/warehouse
Found 2 items
drwxr-xr-x - root supergroup 0 2022-11-22 18:52 /user/hive/warehouse/u_data
drwxr-xr-x - root supergroup 0 2022-11-22 18:43 /user/hive/warehouse/userdb.db
Query
table schema
hive> describe u_data;
OK
userid int
movieid int
rating int
unixtime timestamp
Time taken: 0.087 seconds, Fetched: 4 row(s)
row count
Let's run a query to print the total number of rows in the u_data table.
hive> select count(*) from u_data;
OK
100000
Time taken: 20.914 seconds, Fetched: 1 row(s)
We can confirm that the u_data table has 100k (100,000) rows.
distinct movieid
hive> select count(distinct movieid) from u_data;
OK
1682
Number of Movie Types
Using the distinct command, we can count the number of movieids.
hive> select count(distinct movieid) from u_data;
OK
1682
Average Rating per Movie
To calculate the average rating per movie, you can use the following command.
The GROUP BY command allows you to perform aggregate operations.
hive> select movieid, avg(rating) from u_data group by movieid;
Movie Rating Rankings
Using the SORT BY my_column (ASC|DESC) command, you can sort and display results in descending order based on the my_column column.
hive> select movieid, avg(rating) as avg_rating from u_data group by movieid sort by avg_rating DESC;
Reference
-
https://cloudxlab.com/assessment/displayslide/326/hive-movielens-assignment
-
https://github.com/mionisation/BI_BigData_2_HiveDatasetAnalysis/blob/master/queries/hive-query_1.hql
Quiz
Q1: What is the main topic covered in "Apache Hive Query Example (MovieLens)"?
Learn how to use Apache Hive queries with the MovieLens dataset
Q2: What is Download?
We will use the 100k dataset provided by https://grouplens.org/datasets/movielens/ from the
GroupLens site.
Q3: Explain the core concept of Creating a Hive Table.
Launch the hive shell and create the u_data table using the following command. Verify that the
u_data table was created successfully. Also verify that the table was created properly on HDFS.
Query
Q4: What are the key aspects of row count?
Let's run a query to print the total number of rows in the u_data table. We can confirm that the
u_data table has 100k (100,000) rows.
Q5: How does Number of Movie Types work?
Using the distinct command, we can count the number of movieids.