HDFS Cache

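To improve Impala query performance, you can use the HDFS caching feature, and you can define more than one cache pool. First, create an HDFS cache pool as follows (this example creates a pool named four_gig_pool, owned by impala, with a limit of 4,000,000,000 bytes).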

# hdfs cacheadmin -addPool four_gig_pool -owner impala -limit 4000000000

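You can assign a cache pool to a table either when you create the table or afterwards. In the statements below, 'pool_name' is a placeholder for the pool you created, for example 'four_gig_pool'.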

-- Cache the entire table (all partitions).
alter table census set cached in 'pool_name';

-- Remove the entire table from the cache.
alter table census set uncached;

-- Cache a portion of the table (a single partition).
-- If the table is partitioned by multiple columns (such as year, month, day),
-- the ALTER TABLE command must specify values for all those columns.
alter table census partition (year=1960) set cached in 'pool_name';

-- Cache the data from one partition on up to 4 hosts, to minimize CPU load on any
-- single host when the same data block is processed multiple times.
alter table census partition (year=1970)
  set cached in 'pool_name' with replication = 4;

-- At each stage, check the volume of cached data.
-- For large tables or partitions, the background loading might take some time,
-- so you might have to wait and reissue the statement until all the data
-- has finished being loaded into the cache.
show table stats census;
+-------+-------+--------+------+--------------+--------+
| year  | #Rows | #Files | Size | Bytes Cached | Format |
+-------+-------+--------+------+--------------+--------+
| 1900  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| 1940  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| 1960  | -1    | 1      | 11B  | 11B          | TEXT   |
| 1970  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| Total | -1    | 4      | 44B  | 11B          |        |
+-------+-------+--------+------+--------------+--------+
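
The statements above all act on a table that already exists. For the creation-time case, a minimal sketch might look like the following. The table name census_cached and its column list are hypothetical and used only for illustration; the pool is assumed to be the four_gig_pool created earlier.

-- Hypothetical table: the name and columns are illustrative only.
-- Assumes the four_gig_pool cache pool already exists.
create table census_cached (name string, zip string)
  partitioned by (year smallint)
  cached in 'four_gig_pool';

-- Confirm the caching status once partitions and data have been added.
show table stats census_cached;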

For more details, see HDFS Caching with Impala.

Data Cache for Remote Reads

Add the following to the Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve) field. Specify the quota as a size with a unit, such as 1TB.

--data_cache=dir1,dir2,dir3,...:quota
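
As an illustration only, a filled-in value might look like the following. The two directory paths are hypothetical placeholders, not recommended locations, and the 1TB quota simply follows the format mentioned above.

--data_cache=/data/0/impala/datacache,/data/1/impala/datacache:1TB

As far as the Impala documentation describes it, the quota is applied to each listed directory rather than to the total, so check the available space on every disk before choosing a value.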

For more details, see Data Cache for Remote Reads.