Split View: Apache Hive Query example (MovieLens)

Apache Hive Query example (MovieLens)

Overview

일반적으로 HDFS의 데이터를 다루는 법은, Mapreduce 프로그램을 작성하는 것이다. Mapreduce는 자유도는 높지만, 제대로 사용하기 위해서는 많은 공부와 시행착오가 필요하다. 그리고 가장 큰 단점은, 어렵다. 따라서 생산성이 낮다. Hive는 개발자에게 친숙한 SQL을 작성하면, 이를 Mapreduce 잡으로 변경해준다. (최근에는 Hive의 엔진으로 Mapreduce가 아니라 훨씬 빠른 Spark 또는 Tez 를 사용한다고 한다.)위에서 따라서 SQL만 아는 개발자여도 HDFS상의 데이터를 저장하거나, 조회하거나 유용한 정보를 추출할 수 있게된다. Movie Lens에서 제공하는 데이터를 이용해 Hive Query로 데이터를 다뤄보며, Hive가 얼마나 유용한지 살펴보자.

데이터 준비
- 다운로드
- Hive 테이블 생성
Query
Reference

데이터 준비

다운로드

그룹렌즈라는 사이트에서 제공하는 https://grouplens.org/datasets/movielens/에서 제공하는 100k 데이터를 사용한다.

wget https://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip

copy-to-hdfs

hadoop fs -cp ml-100k/u.data /tmp

Hive 테이블 생성

hive shell을 실행해 주고, 아래와 같은 명령어로 u_data 테이블을 생성한다.

create-table

use default

CREATE TABLE IF NOT EXISTS u_data(
    userid INT,
    movieid INT,
    rating INT,
    unixtime TIMESTAMP
    )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH 'hdfs:///tmp/u.data' overwrite into table u_data;

u_data 테이블이 잘 생성되었는지 확인한다.

hive> show databases;
OK
default
userdb
Time taken: 0.312 seconds, Fetched: 2 row(s)
hive> use default;
OK
Time taken: 0.023 seconds
hive> show tables;
OK
u_data
Time taken: 0.044 seconds, Fetched: 1 row(s)

hdfs 상에서도 해당 테이블이 잘 생성되었는지 확인한다.

root@ubuntu01:~# hdfs dfs -ls /user/hive/warehouse
Found 2 items
drwxr-xr-x   - root supergroup          0 2022-11-22 18:52 /user/hive/warehouse/u_data
drwxr-xr-x   - root supergroup          0 2022-11-22 18:43 /user/hive/warehouse/userdb.db

Query

table schema

u_data

hive> describe  u_data;
OK
userid              	int
movieid             	int
rating              	int
unixtime            	timestamp
Time taken: 0.087 seconds, Fetched: 4 row(s)

row count

u_data 라는 테이블에 총 몇 개의 row 가 있는지 출력하는 쿼리를 날려보자.

hive> select count(*) from u_data;
OK
100000
Time taken: 20.914 seconds, Fetched: 1 row(s)

u_data 테이블에는 100k(10만개의) row가 있음을 확인할 수 있다.

distinct movieid

hive> select count(distinct movieid) from u_data;
OK
1682

영화 종류

distinct명령어를 통해 movieid의 갯수를 구할 수 있다.

hive> select count(distinct movieid) from u_data;
OK
1682

영화별 평점

영화별 평점 평균을 구하기 위해서는 아래와 같은 명령어를 사용할 수 있다. GROUP BY명령어를 통해 집계 연산을 수행할 수 있다.

hive> select movieid, avg(rating) from u_data group by movieid;

영화별 평점 순위

SORT BY my_column (ASC|DESC) 명령어를 사용하면 my_column 컬럼 기준으로 내림차순 정렬 및 출력이 가능하다.

hive> select movieid, avg(rating) as avg_rating from u_data group by movieid sort by avg_rating DESC;

Reference

Apache Hive Query Example (MovieLens)

Overview

The typical way to work with data in HDFS is to write a MapReduce program. MapReduce offers high flexibility, but requires significant study and trial-and-error to use properly. And the biggest drawback is that it is difficult, resulting in low productivity. Hive converts developer-friendly SQL into MapReduce jobs. (Recently, it is said that Hive uses much faster engines like Spark or Tez instead of MapReduce.) Therefore, even developers who only know SQL can store, query, and extract useful information from data on HDFS. Let's explore how useful Hive is by working with data provided by MovieLens using Hive queries.

Data Preparation
- Download
- Creating a Hive Table
Query
Reference
- Quiz

Data Preparation

Download

We will use the 100k dataset provided by https://grouplens.org/datasets/movielens/ from the GroupLens site.

wget https://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip

copy-to-hdfs

hadoop fs -cp ml-100k/u.data /tmp

Creating a Hive Table

Launch the hive shell and create the u_data table using the following command.

create-table

use default

CREATE TABLE IF NOT EXISTS u_data(
    userid INT,
    movieid INT,
    rating INT,
    unixtime TIMESTAMP
    )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH 'hdfs:///tmp/u.data' overwrite into table u_data;

Verify that the u_data table was created successfully.

hive> show databases;
OK
default
userdb
Time taken: 0.312 seconds, Fetched: 2 row(s)
hive> use default;
OK
Time taken: 0.023 seconds
hive> show tables;
OK
u_data
Time taken: 0.044 seconds, Fetched: 1 row(s)

Also verify that the table was created properly on HDFS.

root@ubuntu01:~# hdfs dfs -ls /user/hive/warehouse
Found 2 items
drwxr-xr-x   - root supergroup          0 2022-11-22 18:52 /user/hive/warehouse/u_data
drwxr-xr-x   - root supergroup          0 2022-11-22 18:43 /user/hive/warehouse/userdb.db

Query

table schema

u_data

hive> describe  u_data;
OK
userid              	int
movieid             	int
rating              	int
unixtime            	timestamp
Time taken: 0.087 seconds, Fetched: 4 row(s)

row count

Let's run a query to print the total number of rows in the u_data table.

hive> select count(*) from u_data;
OK
100000
Time taken: 20.914 seconds, Fetched: 1 row(s)

We can confirm that the u_data table has 100k (100,000) rows.

distinct movieid

hive> select count(distinct movieid) from u_data;
OK
1682

Number of Movie Types

Using the distinct command, we can count the number of movieids.

hive> select count(distinct movieid) from u_data;
OK
1682

Average Rating per Movie

To calculate the average rating per movie, you can use the following command. The GROUP BY command allows you to perform aggregate operations.

hive> select movieid, avg(rating) from u_data group by movieid;

Movie Rating Rankings

Using the SORT BY my_column (ASC|DESC) command, you can sort and display results in descending order based on the my_column column.

hive> select movieid, avg(rating) as avg_rating from u_data group by movieid sort by avg_rating DESC;

Reference

Quiz

Q1: What is the main topic covered in "Apache Hive Query Example (MovieLens)"?

Learn how to use Apache Hive queries with the MovieLens dataset

Q2: What is Download?

We will use the 100k dataset provided by https://grouplens.org/datasets/movielens/ from the GroupLens site.

Q3: Explain the core concept of Creating a Hive Table.

Launch the hive shell and create the u_data table using the following command. Verify that the u_data table was created successfully. Also verify that the table was created properly on HDFS. Query

Q4: What are the key aspects of row count?

Let's run a query to print the total number of rows in the u_data table. We can confirm that the u_data table has 100k (100,000) rows.

Q5: How does Number of Movie Types work?

Using the distinct command, we can count the number of movieids.