Impala with HDFS
Although Impala typically works well with many large files in an HDFS storage system, there are times when you might perform some file cleanup to reclaim space, or advise developers on techniques to minimize space consumption and file duplication.
- Use compact binary file formats where practical. Numeric and
time-based data in particular can be stored in more compact form in
binary data files. Depending on the file format, various compression and
encoding features can reduce file size even further. You can specify the
STORED AS clause as part of the CREATE TABLE statement, or use
ALTER TABLE with the SET FILEFORMAT clause for an existing table or partition within a partitioned table.
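As a sketch of both approaches (the table and partition names here are illustrative):

```sql
-- Create a table stored as Parquet, a compact binary columnar format.
CREATE TABLE census_parquet (name STRING, zip INT, birth_date TIMESTAMP)
  STORED AS PARQUET;

-- Switch an existing table, or a single partition, to Parquet.
-- Existing data files are not converted; files added afterward use the new format.
ALTER TABLE census_text SET FILEFORMAT PARQUET;
ALTER TABLE logs PARTITION (year = 2015) SET FILEFORMAT PARQUET;
```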
You manage underlying data files differently depending on whether the corresponding Impala table is defined as an internal or external table:
- Use the DESCRIBE FORMATTED statement to check whether a particular table is internal (managed by Impala) or external, and to see the physical location of the data files in HDFS.
- For Impala-managed (“internal”) tables, use DROP TABLE statements to remove data files.
- For tables not managed by Impala (“external” tables), use
appropriate HDFS-related commands, such as hdfs dfs or
distcp, to create, move, copy, or delete files within HDFS directories that are accessible to the
impala user. Issue a REFRESH table_name statement after adding or removing any files from the data directory of an external table.
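The three cases above can be sketched as follows (table names and the HDFS path are hypothetical):

```sql
-- Check whether a table is internal or external, and find its HDFS
-- location (see the Table Type and Location fields in the output).
DESCRIBE FORMATTED sales_data;

-- Internal table: dropping it removes the underlying data files.
DROP TABLE sales_staging;

-- External table: manage the files with HDFS commands, then tell Impala
-- to rescan the data directory, for example after:
--   hdfs dfs -rm /user/impala/external/sales_data/old_batch.parquet
REFRESH sales_data;
```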
- Use external tables to reference HDFS data files in their original location. With this technique, you avoid copying the files, and you can map more than one Impala table to the same set of data files. When you drop the Impala table, the data files are left undisturbed.
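For example, a minimal sketch of mapping an external table onto existing files (the column layout and LOCATION path are assumptions for illustration):

```sql
-- Reference files already in HDFS; no data is copied.
CREATE EXTERNAL TABLE web_logs (ip STRING, url STRING, ts TIMESTAMP)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/etl/incoming/web_logs';

-- Dropping the table leaves the files under /user/etl/incoming/web_logs intact.
DROP TABLE web_logs;
```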
- Use the LOAD DATA statement to move HDFS files into the data directory for an Impala table from inside Impala, without the need to specify the HDFS path of the destination directory. This technique works for both internal and external tables.
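For example (the staging paths and table names are hypothetical):

```sql
-- Move files from an HDFS staging directory into the table's data
-- directory, without naming the destination path.
LOAD DATA INPATH '/user/etl/staging/batch_2015_06' INTO TABLE sales_data;

-- For a partitioned table, name the target partition explicitly.
LOAD DATA INPATH '/user/etl/staging/june'
  INTO TABLE logs PARTITION (year = 2015, month = 6);
```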
- Make sure that the HDFS trashcan is configured correctly. When you remove files from HDFS, the space might not be reclaimed for use by other files until sometime later, when the trashcan is emptied.
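The trashcan retention period is controlled by the Hadoop fs.trash.interval property in core-site.xml; a sketch of enabling it (the 1440-minute value is an example, not a recommendation):

```xml
<!-- core-site.xml: keep deleted files in the trashcan for 24 hours
     (1440 minutes) before the space is permanently reclaimed.
     A value of 0 disables the trashcan entirely. -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```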
- Drop all tables in a database before dropping the database itself.
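For example (the analytics database and its tables are hypothetical):

```sql
-- Remove each table first, then the now-empty database.
DROP TABLE analytics.page_views;
DROP TABLE analytics.sessions;
DROP DATABASE analytics;
```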
- If an INSERT statement encounters an error, and you see a directory named .impala_insert_staging or _impala_insert_staging left behind in the data directory for the table, it might contain temporary data files taking up space in HDFS. You might be able to salvage these data files; otherwise, delete those files through commands such as
hdfs dfs to reclaim space before re-trying the INSERT. Issue a
DESCRIBE FORMATTED table_name statement to see the HDFS path where you can check for temporary files.
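A sketch of the cleanup (the warehouse path is an assumption; substitute the Location value reported by DESCRIBE FORMATTED):

```shell
# Inspect the leftover staging directory, then remove it to reclaim space.
hdfs dfs -ls /user/hive/warehouse/sales.db/sales_data/_impala_insert_staging
hdfs dfs -rm -r /user/hive/warehouse/sales.db/sales_data/_impala_insert_staging
```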
- Specify a different location in Cloudera Manager, in the
Impala Daemon Scratch Directories field, to
customize the intermediate directory used during large sort, join,
aggregation, or analytic function operations.
- By default, intermediate files are stored in the directory /tmp/impala-scratch.
- You can specify a single directory, or a comma-separated list of directories.
- The scratch directories must be on the local filesystem, not in HDFS.
- You might specify different directory paths for different hosts, depending on the capacity and speed of the available storage devices.
- If there is less than 1 GB free on the filesystem where that directory resides, Impala still runs, but writes a warning message to its log.
- If Impala encounters an error reading or writing files in a scratch directory during a query, Impala logs the error and the query fails.
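The setting ultimately becomes a startup flag on the impalad daemon; a sketch of a multi-disk configuration (the directory paths are examples):

```shell
# Equivalent impalad startup flag to the Impala Daemon Scratch Directories
# field: a comma-separated list of directories on the local filesystem,
# ideally on separate fast storage devices.
--scratch_dirs=/data1/impala-scratch,/data2/impala-scratch
```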