InferAvroSchema

Description:

Examines the contents of the incoming FlowFile to infer an Avro schema. The processor uses the Kite SDK to attempt to generate an Avro schema automatically from the incoming content. Only CSV and JSON content is currently supported.

When inferring the schema from JSON data, the key names are used as the field names in the resulting Avro schema definition. When inferring from CSV data, a "header definition" must be present, either as the first line of the incoming data or set explicitly in the property "CSV Header Definition". A "header definition" is a single comma-separated line that names each column; it is required in order to determine the name given to each field in the resulting Avro definition.

When inferring data types, the higher-order data type is always used if there is ambiguity. For example, when examining numerical values the type may be set to "long" instead of "int", since a long can safely hold the value of any int.

The type of content present in the incoming FlowFile is set with the property "Input Content Type". The property can be set explicitly to "csv" or "json", or to "use mime.type value", which examines the value of the mime.type attribute on the incoming FlowFile to determine the type of content present.
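The "higher-order type wins" rule can be illustrated with a small sketch. This is not the Kite SDK's algorithm; the names `type_of`, `widen`, and `infer_schema` and the widening order int, long, double are assumptions chosen only to show how ambiguous numeric fields end up with the wider type. Nulls and nested records are deliberately not handled.

```python
import json

# Assumed widening order: each entry can hold every value of the entries before it.
NUMERIC_ORDER = ["int", "long", "double"]


def type_of(value):
    """Map a parsed JSON value to an Avro primitive type name (hypothetical rule)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int" if -2**31 <= value < 2**31 else "long"
    if isinstance(value, float):
        return "double"
    return "string"


def widen(a, b):
    """Return the wider of two type names; fall back to string if they do not mix."""
    if a == b:
        return a
    if a in NUMERIC_ORDER and b in NUMERIC_ORDER:
        return NUMERIC_ORDER[max(NUMERIC_ORDER.index(a), NUMERIC_ORDER.index(b))]
    return "string"


def infer_schema(records, record_name):
    """Build an Avro-style record schema dict from a list of flat JSON objects."""
    field_types = {}
    for record in records:
        for key, value in record.items():
            observed = type_of(value)
            if key in field_types:
                field_types[key] = widen(field_types[key], observed)
            else:
                field_types[key] = observed
    return {
        "type": "record",
        "name": record_name,
        "fields": [{"name": k, "type": t} for k, t in field_types.items()],
    }


if __name__ == "__main__":
    lines = ['{"id": 1, "price": 3}', '{"id": 5000000000, "price": 2.5}']
    schema = infer_schema([json.loads(line) for line in lines], "example")
    print(json.dumps(schema, indent=2))
```

In this example both fields start as "int" and are widened once a larger or fractional value is seen, which mirrors the behaviour described above without claiming to reproduce it.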

Tags:

kite, avro, infer, schema, csv, json

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

Display Name | API Name | Default Value | Allowable Values | Description
Schema Output Destination | Schema Output Destination | flowfile-content | flowfile-attribute, flowfile-content | Controls whether the Avro schema is written to a new FlowFile attribute 'inferred.avro.schema' or to the FlowFile content. Writing to the FlowFile content overwrites any existing content.
Input Content Type | Input Content Type | use mime.type value | use mime.type value, json, csv | Content type of the data in the incoming FlowFile's content. Only "json" and "csv" are supported. If this value is set to "use mime.type value", the incoming FlowFile's "mime.type" attribute is used to determine the content type.
CSV Header Definition | CSV Header Definition | | | This property only applies to the CSV content type. Comma-separated string defining the column names expected in the CSV data, e.g. "fname,lname,zip,address". The elements in this string should be in the same order as the underlying data. Setting this property causes the value of "Get CSV Header Definition From Data" to be ignored in favor of this value. Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
Get CSV Header Definition From Data | Get CSV Header Definition From Data | true | true, false | This property only applies to the CSV content type. If "true", the processor attempts to read the CSV header definition from the first line of the input data.
CSV Header Line Skip Count | CSV Header Line Skip Count | 0 | | This property only applies to the CSV content type. Specifies the number of lines to skip when reading the CSV data. A value of 0 means the entire contents of the file are read. If "Get CSV Header Definition From Data" is set, the first line of the CSV file is read and treated as the CSV header definition. Because this already removes the header line from the data, make sure "CSV Header Line Skip Count" is set to 0 so that no data is skipped. Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
CSV delimiter | CSV delimiter | , | | Delimiter character for CSV records. Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
CSV Escape String | CSV Escape String | \ | | This property only applies to the CSV content type. String that represents an escape sequence in the CSV FlowFile content data. Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
CSV Quote String | CSV Quote String | ' | | This property only applies to the CSV content type. String that represents a literal quote character in the CSV FlowFile content data. Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
Pretty Avro Output | Pretty Avro Output | true | true, false | If true, the Avro output will be formatted.
Avro Record Name | Avro Record Name | | | Value to be placed in the Avro record schema "name" field. The value must adhere to the Avro naming rules for fullname. If Expression Language is present, the evaluated value must adhere to the Avro naming rules. Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
Number Of Records To Analyze | Number Of Records To Analyze | 10 | | This property only applies to the JSON content type. The number of JSON records that should be examined to determine the Avro schema. The higher the value, the better the chance Kite has of detecting the appropriate type. However, the default value of 10 is almost always enough. Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
Charset | Charset | UTF-8 | | Character encoding of CSV data. Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
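The interaction between "CSV Header Definition", "Get CSV Header Definition From Data", and "CSV Header Line Skip Count" can be sketched as follows. This is an illustration under stated assumptions, not the processor's code: the function name `resolve_csv_header` is hypothetical, and the sketch assumes that lines are skipped first and that an explicit header definition takes precedence over a header read from the data.

```python
def resolve_csv_header(lines, header_definition=None, header_from_data=True,
                       skip_count=0, delimiter=","):
    """Return (column_names, data_lines) under the assumed precedence rules.

    lines             -- CSV content already split into lines
    header_definition -- value of "CSV Header Definition" (comma separated), if set
    header_from_data  -- value of "Get CSV Header Definition From Data"
    skip_count        -- value of "CSV Header Line Skip Count"
    delimiter         -- value of "CSV delimiter"
    """
    # Assumption: skipped lines are dropped before anything else is read.
    remaining = lines[skip_count:]

    if header_definition:
        # An explicit definition wins; the "from data" flag is ignored.
        return header_definition.split(","), remaining

    if header_from_data:
        if not remaining:
            raise ValueError("no line left to use as a header definition")
        # The first remaining line names the columns and is removed from the data.
        return remaining[0].split(delimiter), remaining[1:]

    raise ValueError("no header definition available for CSV content")


if __name__ == "__main__":
    content = ["fname,lname,zip", "Ada,Lovelace,12345"]
    print(resolve_csv_header(content))
    # Explicit header while the file also carries one: skip the in-file header line.
    print(resolve_csv_header(content, header_definition="first,last,postcode",
                             skip_count=1))
```

Under these assumptions, leaving the skip count at 0 when the header comes from the data keeps every data row, which matches the caution in the property description.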

Relationships:

Name | Description
unsupported content | The content found in the flowfile content is not of the required format.
success | Successfully created Avro schema from data.
failure | Failed to create Avro schema from data.
original | Original incoming FlowFile data

Reads Attributes:

Name | Description
mime.type | If "Input Content Type" is set to "use mime.type value", this attribute is used to determine what sort of content should be inferred from the incoming FlowFile content.
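A minimal sketch of how "Input Content Type" and the mime.type attribute might combine is shown below. The function name `resolve_content_type` is hypothetical, and the MIME strings in the mapping are assumptions for illustration; this page does not list which mime.type values the processor recognises.

```python
# Assumed mapping from mime.type values to supported content types.
# These MIME strings are illustrative and are not taken from this page.
MIME_TO_CONTENT_TYPE = {
    "text/csv": "csv",
    "application/json": "json",
}


def resolve_content_type(input_content_type, attributes):
    """Return "csv", "json", or None when the content type cannot be determined.

    input_content_type -- value of the "Input Content Type" property
    attributes         -- dict of FlowFile attributes
    """
    if input_content_type in ("csv", "json"):
        # An explicit setting is used as-is; mime.type is not consulted.
        return input_content_type
    if input_content_type == "use mime.type value":
        return MIME_TO_CONTENT_TYPE.get(attributes.get("mime.type", ""))
    return None


if __name__ == "__main__":
    print(resolve_content_type("use mime.type value", {"mime.type": "text/csv"}))
    print(resolve_content_type("json", {"mime.type": "text/csv"}))
    print(resolve_content_type("use mime.type value", {}))
```

A `None` result here stands for the case where a FlowFile would presumably be routed to the "unsupported content" relationship.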

Writes Attributes:

Name | Description
inferred.avro.schema | If "Schema Output Destination" is set to "flowfile-attribute", this attribute holds the Avro schema inferred from the incoming FlowFile content.
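The effect of "Schema Output Destination" can be sketched as a function that returns the updated attributes and content. The name `write_schema` is hypothetical, and the sketch assumes that "Pretty Avro Output" applies to both destinations, which this page does not state.

```python
import json


def write_schema(schema, destination, attributes, content, pretty=True):
    """Return (attributes, content) after placing the schema at the destination.

    schema      -- the inferred schema as a dict
    destination -- "flowfile-attribute" or "flowfile-content"
    attributes  -- dict of existing FlowFile attributes
    content     -- existing FlowFile content as bytes
    pretty      -- value of "Pretty Avro Output" (assumed to apply to both cases)
    """
    text = json.dumps(schema, indent=2 if pretty else None)
    if destination == "flowfile-attribute":
        # Content is left untouched; the schema is added as a new attribute.
        return {**attributes, "inferred.avro.schema": text}, content
    if destination == "flowfile-content":
        # Existing content is overwritten by the schema.
        return dict(attributes), text.encode("utf-8")
    raise ValueError(f"unknown destination: {destination}")


if __name__ == "__main__":
    schema = {"type": "record", "name": "example", "fields": []}
    attrs, body = write_schema(schema, "flowfile-attribute", {"mime.type": "text/csv"}, b"a,b\n1,2\n")
    print(attrs["inferred.avro.schema"])
```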

State management:

This component does not store state.

Restricted:

This component is not restricted.

Input requirement:

This component requires an incoming relationship.

System Resource Considerations:

None specified.