Convert an HDFS file to ORC

In CDP Data Center, to use Hive to query data in HDFS, you apply a schema to the data and then store data in ORC format.

To convert data stored in HDFS into the recommended format for querying in Hive, you create a schema for the HDFS data by creating a Hive external table, and then create a Hive-managed table to convert and query the data in ORC format. The conversion is a parallel and distributed action, and no standalone ORC conversion tool is necessary. Suppose you have the following CSV file that contains a header line that describes the fields and subsequent lines that contain the following data:
Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
"chevrolet chevelle malibu",18,8,307,130,3504,12,1970-01-01,A
"buick skylark 320",15,8,350,165,3693,11.5,1970-01-01,A
"plymouth satellite",18,8,318,150,3436,11,1970-01-01,A
"amc rebel sst",16,8,304,150,3433,12,1970-01-01,A
"ford torino",17,8,302,140,3449,10.5,1970-01-01,A
You removed the header from the CSV file.
  1. Create an external table:
    CREATE EXTERNAL TABLE IF NOT EXISTS Cars(
        Name STRING, 
        Miles_per_Gallon INT,
        Cylinders INT,
        Displacement INT,
        Horsepower INT, 
        Weight_in_lbs INT,
        Acceleration DECIMAL,
        Year DATE,
        Origin CHAR(1))
    COMMENT 'Data about cars from a public database'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    location '/user/<username>/visdata';
  2. Create a Hive-managed table to convert the data to ORC.
    CREATE TABLE IF NOT EXISTS mycars(
    Name STRING, 
    Miles_per_Gallon INT,
    Cylinders INT,
    Displacement INT,
    Horsepower INT, 
    Weight_in_lbs INT,
    Acceleration DECIMAL,
    Year DATE,
    Origin CHAR(1))
    COMMENT 'Data about cars from a public database'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS ORC;
  3. Insert the data from the external table into the Hive-managed table.
    INSERT OVERWRITE TABLE mycars SELECT * FROM cars;
  4. Verify that you imported the data into the ORC-formatted table correctly:
    hive> SELECT * FROM mycars LIMIT 3;
    OK
    "chevrolet chevelle malibu"	18	8	307	130	3504	12	1970-01-01	A
    "buick skylark 320"	15	8	350	165	3693	12	1970-01-01	A
    "plymouth satellite"	18	8	318	150	3436	11	1970-01-01	A
    Time taken: 0.144 seconds, Fetched: 3 row(s)