버전 비교

  • 이 줄이 추가되었습니다.
  • 이 줄이 삭제되었습니다.
  • 서식이 변경되었습니다.

...

코드 블럭
languagetext
linenumberstrue
refresh log_type;
select * from log_type limit 100;
+--------+--------+--------+------+-------+-----+-------+
| field1 | field2 | field3 | year | month | day | host  |
+--------+--------+--------+------+-------+-----+-------+
| bar    | baz    | bletch | 2013 | 07    | 28  | host1 |
| bar    | baz    | bletch | 2013 | 08    | 01  | host1 |
| bar    | baz    | bletch | 2013 | 07    | 29  | host1 |
| bar    | baz    | bletch | 2013 | 07    | 28  | host2 |
+--------+--------+--------+------+-------+-----+-------+

Parquet 파일의 스키마에서 테이블 생성하기

이미 Parquet 파일이 있는 경우 동일한 스키마를 가진 새로운 테이블을 생성하려면 다음의 쿼리문을 실행합니다.

코드 블럭
languagesql
linenumberstrue
CREATE DATABASE airlines_data;

USE airlines_data;

CREATE EXTERNAL TABLE airlines_external
LIKE PARQUET 'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
STORED AS PARQUET LOCATION 'hdfs:staging/airlines'

DESCRIBE/SHOW 활용하기

코드 블럭
languagetext
linenumberstrue
> SHOW TABLE STATS airlines_external;
+-------+--------+--------+--------------+-------------------+---------+-------------------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats |
+-------+--------+--------+--------------+-------------------+---------+-------------------+
| -1    | 8      | 1.34GB | NOT CACHED   | NOT CACHED        | PARQUET | false             |
+-------+--------+--------+--------------+-------------------+---------+-------------------+

> SHOW FILES IN airlines_external;
+----------------------------------------------------------------------------------------+----------+-----------+
| path                                                                                   | size     | partition |
+----------------------------------------------------------------------------------------+----------+-----------+
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq | 252.99MB |           |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq | 13.43MB  |           |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq  | 252.84MB |           |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq  | 63.92MB  |           |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq | 183.64MB |           |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq | 240.04MB |           |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq | 211.35MB |           |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq | 151.46MB |           |
+----------------------------------------------------------------------------------------+----------+-----------+

> DESCRIBE airlines_external;
+---------------------+--------+---------------------------------------------------+
| name                | type   | comment                                           |
+---------------------+--------+---------------------------------------------------+
| year                | int    | Inferred from Parquet file.                       |
| month               | int    | Inferred from Parquet file.                       |
| day                 | int    | Inferred from Parquet file.                       |
| dayofweek           | int    | Inferred from Parquet file.                       |
| dep_time            | int    | Inferred from Parquet file.                        |
| crs_dep_time        | int    | Inferred from Parquet file.                       |
| arr_time            | int    | Inferred from Parquet file.                       |
| crs_arr_time        | int    | Inferred from Parquet file.                       |
| carrier             | string | Inferred from Parquet file.                       |
| flight_num          | int    | Inferred from Parquet file.                       |
| tail_num            | int    | Inferred from Parquet file.                       |
| actual_elapsed_time | int    | Inferred from Parquet file.                       |
| crs_elapsed_time    | int    | Inferred from Parquet file.                       |
| airtime             | int    | Inferred from Parquet file.                       |
| arrdelay            | int    | Inferred from Parquet file.                       |
| depdelay            | int    | Inferred from Parquet file.                       |
| origin              | string | Inferred from Parquet file.                       |
| dest                | string | Inferred from Parquet file.                       |
| distance            | int    | Inferred from Parquet file.                       |
| taxi_in             | int    | Inferred from Parquet file.                       |
| taxi_out            | int    | Inferred from Parquet file.                       |
| cancelled           | int    | Inferred from Parquet file.                       |
| cancellation_code   | string | Inferred from Parquet file.                       |
| diverted            | int    | Inferred from Parquet file.                       |
| carrier_delay       | int    | Inferred from Parquet file.                       |
| weather_delay       | int    | Inferred from Parquet file.                       |
| nas_delay           | int    | Inferred from Parquet file.                       |
| security_delay      | int    | Inferred from Parquet file.                       |
| late_aircraft_delay | int    | Inferred from Parquet file.                       |
+---------------------+--------+---------------------------------------------------+

> DESCRIBE FORMATTED airlines_external;
+------------------------------+-------------------------------
| name                         | type
+------------------------------+-------------------------------
...
| # Detailed Table Information | NULL
| Database:                    | airlines_data
| Owner:                       | impala
...
| Location:                    | /user/impala/staging/airlines
| Table Type:                  | EXTERNAL_TABLE
...
| # Storage Information        | NULL
| SerDe Library:               | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
| InputFormat:                 | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputForma
| OutputFormat:                | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
...

Sub Query를 위한 With 사용하기

코드 블럭
languagetext
linenumberstrue
> SELECT COUNT(*) AS 'rows', COUNT(tail_num) AS 'non-null tail numbers'
  FROM airlines_external;
+-----------+-----------------------+
| rows      | non-null tail numbers |
+-----------+-----------------------+
| 123534969 | 412968                |
+-----------+-----------------------+

> WITH t1 AS
  (SELECT COUNT(*) AS 'rows', COUNT(tail_num) AS 'nonnull'
  FROM airlines_external)
SELECT `rows`, `nonnull`, `rows` - `nonnull` AS 'nulls',
  (`nonnull` / `rows`) * 100 AS 'percentage non-null'
FROM t1;
+-----------+---------+-----------+---------------------+
| rows      | nonnull | nulls     | percentage non-null |
+-----------+---------+-----------+---------------------+
| 123534969 | 412968  | 123122001 | 0.3342923897119365  |
+-----------+---------+-----------+---------------------+

테이블의 CREATE 문 확인하기

코드 블럭
> SHOW CREATE TABLE airlines_external;
"CREATE EXTERNAL TABLE airlines_data.airlines_external (
  year INT COMMENT 'inferred from: optional int32 year',
  month INT COMMENT 'inferred from: optional int32 month',
  day INT COMMENT 'inferred from: optional int32 day',
  dayofweek INT COMMENT 'inferred from: optional int32 dayofweek',
  dep_time INT COMMENT 'inferred from: optional int32 dep_time',
  crs_dep_time INT COMMENT 'inferred from: optional int32 crs_dep_time',
  arr_time INT COMMENT 'inferred from: optional int32 arr_time',
  crs_arr_time INT COMMENT 'inferred from: optional int32 crs_arr_time',
  carrier STRING COMMENT 'inferred from: optional binary carrier',
  flight_num INT COMMENT 'inferred from: optional int32 flight_num',
  tail_num INT COMMENT 'inferred from: optional int32 tail_num',
  actual_elapsed_time INT COMMENT 'inferred from: optional int32 actual_elapsed_time',
  crs_elapsed_time INT COMMENT 'inferred from: optional int32 crs_elapsed_time',
  airtime INT COMMENT 'inferred from: optional int32 airtime',
  arrdelay INT COMMENT 'inferred from: optional int32 arrdelay',
  depdelay INT COMMENT 'inferred from: optional int32 depdelay',
  origin STRING COMMENT 'inferred from: optional binary origin',
  dest STRING COMMENT 'inferred from: optional binary dest',
  distance INT COMMENT 'inferred from: optional int32 distance',
  taxi_in INT COMMENT 'inferred from: optional int32 taxi_in',
  taxi_out INT COMMENT 'inferred from: optional int32 taxi_out',
  cancelled INT COMMENT 'inferred from: optional int32 cancelled',
  cancellation_code STRING COMMENT 'inferred from: optional binary cancellation_code',
  diverted INT COMMENT 'inferred from: optional int32 diverted',
  carrier_delay INT COMMENT 'inferred from: optional int32 carrier_delay',
  weather_delay INT COMMENT 'inferred from: optional int32 weather_delay',
  nas_delay INT COMMENT 'inferred from: optional int32 nas_delay',
  security_delay INT COMMENT 'inferred from: optional int32 security_delay',
  late_aircraft_delay INT COMMENT 'inferred from: optional int32 late_aircraft_delay'
)
STORED AS PARQUET
LOCATION 'hdfs://a1730.example.com:8020/user/impala/staging/airlines'
TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false',
  'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0',
  'rawDataSize'='-1')"

INSERT INTO & SELECT 하기

특정 테이블의 데이터를 로딩하여 다시 INSERT를 수행할때 다음의 쿼리를 실행할 수 있습니다.

코드 블럭
languagetext
linenumberstrue
> INSERT INTO airlines_data.airlines
  PARTITION (year)
  SELECT
    month,
    day,
    dayofweek,
    dep_time,
    crs_dep_time,
    arr_time,
    crs_arr_time,
    carrier,
    flight_num,
    actual_elapsed_time,
    crs_elapsed_time,
    airtime,
    arrdelay,
    depdelay,
    origin,
    dest,
    distance,
    taxi_in,
    taxi_out,
    cancelled,
    cancellation_code,
    diverted,
    carrier_delay,
    weather_delay,
    nas_delay,
    security_delay,
    late_aircraft_delay,
    year
  FROM airlines_data.airlines_external;

COMPUTE STATS

코드 블럭
languagetext
linenumberstrue
> COMPUTE INCREMENTAL STATS airlines;
+-------------------------------------------+
| summary                                   |
+-------------------------------------------+
| Updated 22 partition(s) and 27 column(s). |
+-------------------------------------------+

> SHOW TABLE STATS airlines;
+-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+
| year  | #Rows     | #Files | Size     | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                                                 |
+-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+
| 1987  | 1311826   | 1      | 11.75MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1987 |
| 1988  | 5202096   | 1      | 44.04MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1988 |
| 1989  | 5041200   | 1      | 46.07MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1989 |
| 1990  | 5270893   | 1      | 46.25MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1990 |
| 1991  | 5076925   | 1      | 46.77MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1991 |
| 1992  | 5092157   | 1      | 48.21MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1992 |
| 1993  | 5070501   | 1      | 47.46MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1993 |
| 1994  | 5180048   | 1      | 47.47MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1994 |
| 1995  | 5327435   | 1      | 62.40MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1995 |
| 1996  | 5351983   | 1      | 62.93MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1996 |
| 1997  | 5411843   | 1      | 65.05MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1997 |
| 1998  | 5384721   | 1      | 62.21MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1998 |
| 1999  | 5527884   | 1      | 65.10MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1999 |
| 2000  | 5683047   | 1      | 67.68MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2000 |
| 2001  | 5967780   | 1      | 74.03MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2001 |
| 2002  | 5271359   | 1      | 74.00MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2002 |
| 2003  | 6488540   | 1      | 99.35MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2003 |
| 2004  | 7129270   | 1      | 123.29MB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2004 |
| 2005  | 7140596   | 1      | 120.72MB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2005 |
| 2006  | 7141922   | 1      | 121.88MB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2006 |
| 2007  | 7453215   | 1      | 130.87MB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2007 |
| 2008  | 7009728   | 1      | 123.14MB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2008 |
| Total | 123534969 | 22     | 1.55GB   | 0B           |                   |         |                   |                                                                                                          |
+-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+