...
파일에 쿼리를 작성하고 파일을 로딩하여 쿼리를 실행하기 위해서 다음의 커맨드를 사용할 수 있습니다.
코드 블럭 | ||||
---|---|---|---|---|
| ||||
# impala-shell.sh -i impala-host -f query.sql |
...
코드 블럭 | ||||
---|---|---|---|---|
| ||||
refresh log_type; selectsdt * from log_type limit 100; +--------+--------+--------+------+-------+-----+-------+ | field1 | field2 | field3 | year | month | day | host | +--------+--------+--------+------+-------+-----+-------+ | bar | baz | bletch | 2013 | 07 | 28 | host1 | | bar | baz | bletch | 2013 | 08 | 01 | host1 | | bar | baz | bletch | 2013 | 07 | 29 | host1 | | bar | baz | bletch | 2013 | 07 | 28 | host2 | +--------+--------+--------+------+-------+-----+-------+ |
Parquet 파일의 스키마에서 테이블 생성하기
이미 Parquet 파일이 있는 경우 동일한 스키마를 가진 새로운 테이블을 생성하려면 다음의 쿼리문을 실행합니다.
코드 블럭 | ||||
---|---|---|---|---|
| ||||
CREATE DATABASE airlines_data;
USE airlines_data;
CREATE EXTERNAL TABLE airlines_external
LIKE PARQUET 'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
STORED AS PARQUET LOCATION 'hdfs:staging/airlines' |
DESCRIBE/SHOW 활용하기
코드 블럭 | ||||
---|---|---|---|---|
| ||||
> SHOW TABLE STATS airlines_external;
+-------+--------+--------+--------------+-------------------+---------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+--------+--------+--------------+-------------------+---------+-------------------+
| -1 | 8 | 1.34GB | NOT CACHED | NOT CACHED | PARQUET | false |
+-------+--------+--------+--------------+-------------------+---------+-------------------+
> SHOW FILES IN airlines_external;
+----------------------------------------------------------------------------------------+----------+-----------+
| path | size | partition |
+----------------------------------------------------------------------------------------+----------+-----------+
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq | 252.99MB | |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq | 13.43MB | |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq | 252.84MB | |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq | 63.92MB | |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq | 183.64MB | |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq | 240.04MB | |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq | 211.35MB | |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq | 151.46MB | |
+----------------------------------------------------------------------------------------+----------+-----------+
> DESCRIBE airlines_external;
+---------------------+--------+---------------------------------------------------+
| name | type | comment |
+---------------------+--------+---------------------------------------------------+
| year | int | Inferred from Parquet file. |
| month | int | Inferred from Parquet file. |
| day | int | Inferred from Parquet file. |
| dayofweek | int | Inferred from Parquet file. |
| dep_time | int | Inferred from Parquet file. |
| crs_dep_time | int | Inferred from Parquet file. |
| arr_time | int | Inferred from Parquet file. |
| crs_arr_time | int | Inferred from Parquet file. |
| carrier | string | Inferred from Parquet file. |
| flight_num | int | Inferred from Parquet file. |
| tail_num | int | Inferred from Parquet file. |
| actual_elapsed_time | int | Inferred from Parquet file. |
| crs_elapsed_time | int | Inferred from Parquet file. |
| airtime | int | Inferred from Parquet file. |
| arrdelay | int | Inferred from Parquet file. |
| depdelay | int | Inferred from Parquet file. |
| origin | string | Inferred from Parquet file. |
| dest | string | Inferred from Parquet file. |
| distance | int | Inferred from Parquet file. |
| taxi_in | int | Inferred from Parquet file. |
| taxi_out | int | Inferred from Parquet file. |
| cancelled | int | Inferred from Parquet file. |
| cancellation_code | string | Inferred from Parquet file. |
| diverted | int | Inferred from Parquet file. |
| carrier_delay | int | Inferred from Parquet file. |
| weather_delay | int | Inferred from Parquet file. |
| nas_delay | int | Inferred from Parquet file. |
| security_delay | int | Inferred from Parquet file. |
| late_aircraft_delay | int | Inferred from Parquet file. |
+---------------------+--------+---------------------------------------------------+
> DESCRIBE FORMATTED airlines_external;
+------------------------------+-------------------------------
| name | type
+------------------------------+-------------------------------
...
| # Detailed Table Information | NULL
| Database: | airlines_data
| Owner: | impala
...
| Location: | /user/impala/staging/airlines
| Table Type: | EXTERNAL_TABLE
...
| # Storage Information | NULL
| SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
| InputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputForma
| OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
... |
Sub Query를 위한 With 사용하기
코드 블럭 | ||||
---|---|---|---|---|
| ||||
> SELECT COUNT(*) AS 'rows', COUNT(tail_num) AS 'non-null tail numbers'
FROM airlines_external;
+-----------+-----------------------+
| rows | non-null tail numbers |
+-----------+-----------------------+
| 123534969 | 412968 |
+-----------+-----------------------+
> WITH t1 AS
(SELECT COUNT(*) AS 'rows', COUNT(tail_num) AS 'nonnull'
FROM airlines_external)
SELECT `rows`, `nonnull`, `rows` - `nonnull` AS 'nulls',
(`nonnull` / `rows`) * 100 AS 'percentage non-null'
FROM t1;
+-----------+---------+-----------+---------------------+
| rows | nonnull | nulls | percentage non-null |
+-----------+---------+-----------+---------------------+
| 123534969 | 412968 | 123122001 | 0.3342923897119365 |
+-----------+---------+-----------+---------------------+ |
테이블의 CREATE 문 확인하기
코드 블럭 | ||||
---|---|---|---|---|
| ||||
> SHOW CREATE TABLE airlines_external;
"CREATE EXTERNAL TABLE airlines_data.airlines_external (
year INT COMMENT 'inferred from: optional int32 year',
month INT COMMENT 'inferred from: optional int32 month',
day INT COMMENT 'inferred from: optional int32 day',
dayofweek INT COMMENT 'inferred from: optional int32 dayofweek',
dep_time INT COMMENT 'inferred from: optional int32 dep_time',
crs_dep_time INT COMMENT 'inferred from: optional int32 crs_dep_time',
arr_time INT COMMENT 'inferred from: optional int32 arr_time',
crs_arr_time INT COMMENT 'inferred from: optional int32 crs_arr_time',
carrier STRING COMMENT 'inferred from: optional binary carrier',
flight_num INT COMMENT 'inferred from: optional int32 flight_num',
tail_num INT COMMENT 'inferred from: optional int32 tail_num',
actual_elapsed_time INT COMMENT 'inferred from: optional int32 actual_elapsed_time',
crs_elapsed_time INT COMMENT 'inferred from: optional int32 crs_elapsed_time',
airtime INT COMMENT 'inferred from: optional int32 airtime',
arrdelay INT COMMENT 'inferred from: optional int32 arrdelay',
depdelay INT COMMENT 'inferred from: optional int32 depdelay',
origin STRING COMMENT 'inferred from: optional binary origin',
dest STRING COMMENT 'inferred from: optional binary dest',
distance INT COMMENT 'inferred from: optional int32 distance',
taxi_in INT COMMENT 'inferred from: optional int32 taxi_in',
taxi_out INT COMMENT 'inferred from: optional int32 taxi_out',
cancelled INT COMMENT 'inferred from: optional int32 cancelled',
cancellation_code STRING COMMENT 'inferred from: optional binary cancellation_code',
diverted INT COMMENT 'inferred from: optional int32 diverted',
carrier_delay INT COMMENT 'inferred from: optional int32 carrier_delay',
weather_delay INT COMMENT 'inferred from: optional int32 weather_delay',
nas_delay INT COMMENT 'inferred from: optional int32 nas_delay',
security_delay INT COMMENT 'inferred from: optional int32 security_delay',
late_aircraft_delay INT COMMENT 'inferred from: optional int32 late_aircraft_delay'
)
STORED AS PARQUET
LOCATION 'hdfs://a1730.example.com:8020/user/impala/staging/airlines'
TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false',
'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0',
'rawDataSize'='-1')" |
INSERT INTO & SELECT 하기
특정 테이블의 데이터를 로딩하여 다시 INSERT를 수행할때 다음의 쿼리를 실행할 수 있습니다.
코드 블럭 | ||||
---|---|---|---|---|
| ||||
> INSERT INTO airlines_data.airlines
PARTITION (year)
SELECT
month,
day,
dayofweek,
dep_time,
crs_dep_time,
arr_time,
crs_arr_time,
carrier,
flight_num,
actual_elapsed_time,
crs_elapsed_time,
airtime,
arrdelay,
depdelay,
origin,
dest,
distance,
taxi_in,
taxi_out,
cancelled,
cancellation_code,
diverted,
carrier_delay,
weather_delay,
nas_delay,
security_delay,
late_aircraft_delay,
year
FROM airlines_data.airlines_external; |
COMPUTE STATS
코드 블럭 | ||||
---|---|---|---|---|
| ||||
> COMPUTE INCREMENTAL STATS airlines; +-------------------------------------------+ | summary | +-------------------------------------------+ | Updated 22 partition(s) and 27 column(s). | +-------------------------------------------+ > SHOW TABLE STATS airlines; +-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+ | year | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+ | 1987 | 1311826 | 1 | 11.75MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1987 | | 1988 | 5202096 | 1 | 44.04MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1988 | | 1989 | 5041200 | 1 | 46.07MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1989 | | 1990 | 5270893 | 1 | 46.25MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1990 | | 1991 | 5076925 | 1 | 46.77MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1991 | | 1992 | 5092157 | 1 | 48.21MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1992 | | 1993 | 5070501 | 1 | 47.46MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1993 | | 1994 | 5180048 | 1 | 47.47MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1994 | | 1995 | 5327435 | 1 | 62.40MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1995 | | 1996 | 5351983 | 1 | 62.93MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1996 | | 1997 | 5411843 | 1 | 65.05MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1997 | | 1998 | 5384721 | 1 | 62.21MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1998 | | 1999 | 5527884 | 1 | 65.10MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1999 | | 2000 | 5683047 | 1 | 67.68MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2000 | | 2001 | 5967780 | 1 | 74.03MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2001 | | 2002 | 5271359 | 1 | 74.00MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2002 | | 2003 | 6488540 | 1 | 99.35MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2003 | | 2004 | 7129270 | 1 | 123.29MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2004 | | 2005 | 7140596 | 1 | 120.72MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2005 | | 2006 | 7141922 | 1 | 121.88MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2006 | | 2007 | 7453215 | 1 | 130.87MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2007 | | 2008 | 7009728 | 1 | 123.14MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2008 | | Total | 123534969 | 22 | 1.55GB | 0B | | | | | +-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+ |