You can include a virtual column in a standard SELECT statement, select INPUT__FILE__NAME from <tablename>, to retrieve the data file name that stores the actual row in a table.
List of use cases
You can use this virtual column
INPUT__FILE__NAME in multiple ways.
- To view the file names corresponding to each record.
- To extract the metadata using virtual columns if external source files ingested to HDFS contain metadata like a timestamp or source system identifier, and if you prefer to use it in subsequent operations.
- To extract the timestamp of the data derived from the fully qualified file name.
- To forward the technical timeline in a data warehouse scenario (for example, the "LoadDateTS" in a hybrid Data Vault architecture).
You can perform ceratin operations on this virtual column. The following list contains some of the supported operations:
- Invoke “Functions”
- Filter rows
- "GROUP BY" operation
This example provides the file name corresponding to each record.
select INPUT__FILE__NAME, id, name from users where dt='20140210' group by INPUT__FILE__NAME;
The above query returns results similar to the following
hdfs://localhost.localdomain:8020/user/hive/warehouse/users/users1.txt 2 user2
hdfs://localhost.localdomain:8020/user/hive/warehouse/users/users2.txt 42 john.doe
You can use the supported string functions to extract information from the URL.