Cloudera Documentation
Apache Impala SQL ReferencePDF version

Virtual column

You can include a virtual column in a standard SELECT statement, select INPUT__FILE__NAME from <tablename>, to retrieve the data file name that stores the actual row in a table.

You can use this virtual column INPUT__FILE__NAME in multiple ways.

  • To view the file names corresponding to each record.
  • To extract the metadata using virtual columns if external source files ingested to HDFS contain metadata like a timestamp or source system identifier, and if you prefer to use it in subsequent operations.
  • To extract the timestamp of the data derived from the fully qualified file name.
  • To forward the technical timeline in a data warehouse scenario (for example, the "LoadDateTS" in a hybrid Data Vault architecture).

You can perform ceratin operations on this virtual column. The following list contains some of the supported operations:

  • Invoke “Functions”
  • Filter rows
  • "GROUP BY" operation

This example provides the file name corresponding to each record.

select INPUT__FILE__NAME, id, name from users where dt='20140210' group by
          INPUT__FILE__NAME;

Result

The above query returns results similar to the following

hdfs://localhost.localdomain:8020/user/hive/warehouse/users/users1.txt 2 user2
hdfs://localhost.localdomain:8020/user/hive/warehouse/users/users2.txt 42 john.doe

You can use the supported string functions to extract information from the URL.

We want your opinion

How can we improve this page?

What kind of feedback do you have?