Virtual column

You can include a virtual column in a standard SELECT statement, select INPUT__FILE__NAME from <tablename>, to retrieve the data file name that stores the actual row in a table.

List of use cases

You can use this virtual column INPUT__FILE__NAME in multiple ways.

  • To view the file names corresponding to each record.
  • To extract the metadata using virtual columns if external source files ingested to HDFS contain metadata like a timestamp or source system identifier, and if you prefer to use it in subsequent operations.
  • To extract the timestamp of the data derived from the fully qualified file name.
  • To forward the technical timeline in a data warehouse scenario (for example, the "LoadDateTS" in a hybrid Data Vault architecture).

Supported operations

You can perform ceratin operations on this virtual column. The following list contains some of the supported operations:

  • Invoke “Functions”
  • Filter rows
  • "GROUP BY" operation

Example

This example provides the file name corresponding to each record.

select INPUT__FILE__NAME, id, name from users where dt='20140210' group by
          INPUT__FILE__NAME;

Result

The above query returns results similar to the following

hdfs://localhost.localdomain:8020/user/hive/warehouse/users/users1.txt 2 user2
hdfs://localhost.localdomain:8020/user/hive/warehouse/users/users2.txt 42 john.doe

You can use the supported string functions to extract information from the URL.