Source Configuration Values
Table 6.1. Apache Kafka
Configuration Field | Description, requirements, tips for configuration |
Cluster Name | Mandatory. Specifies the service pool defined in SAM from which to get metadata about the Kafka cluster. |
Security Protocol | Mandatory. Specifies the protocol used to communicate with Kafka brokers, such as PLAINTEXT. Protocols supported by the Kafka service for the selected cluster name are suggested automatically. If you select a protocol with SSL or SASL, you must complete the related configuration fields. |
Bootstrap Servers | Mandatory. A comma-separated string of host:port values representing Kafka broker listeners. A list of options is suggested automatically, based on the selected security protocol. |
Kafka topic | Mandatory. The Kafka topic from which to read data. You must ensure that the corresponding topic schema is defined in Schema Registry. |
Consumer Group Id | Mandatory. A unique string that identifies the consumer group to which this consumer belongs. Used to keep track of consumer offsets. |
Reader schema version | Optional. The version of the schema for the topic to read from. The default value is the version used by the producer to write data to the topic. |
Kerberos client principal | Optional (Mandatory for SASL). Client principal used to connect to brokers with the SASL GSSAPI mechanism for Kerberos (used when the security protocol is SASL_PLAINTEXT or SASL_SSL). |
Kerberos keytab file | Optional (Mandatory for SASL). Keytab file location on the worker node containing the secret key for the client principal, used with the SASL GSSAPI mechanism for Kerberos (used when the security protocol is SASL_PLAINTEXT or SASL_SSL). |
Kafka service name | Optional (Mandatory for SASL). Service name under which the Kafka broker is running (used when the security protocol is SASL_PLAINTEXT or SASL_SSL). |
Fetch minimum bytes | Optional. The minimum number of bytes the broker should return for a fetch request. Default value is 1. |
Maximum fetch bytes per partition | Optional. The maximum amount of data per partition that the broker can return. Default value is 1048576. |
Maximum records per poll | Optional. The maximum number of records a poll can return. Default value is 500. |
Poll timeout(ms) | Optional. Time, in milliseconds, spent waiting in poll if data is not available. Default value is 200. |
Offset commit period(ms) | Optional. Period, in milliseconds, after which offsets are committed. Default value is 30000. |
Maximum uncommitted offsets | Optional. Defines the maximum number of polled records that can be pending commit before another poll can take place. Default value is 10000000. The appropriate value depends on the size of each message in Kafka and the memory available to the worker JVM process. |
First poll offset strategy | Optional. Offset used by the Kafka spout in the first poll to the Kafka broker. You must choose one of "EARLIEST", "LATEST", "UNCOMMITTED_EARLIEST", or "UNCOMMITTED_LATEST". Default value is UNCOMMITTED_EARLIEST, which means that, by default, it starts from the earliest uncommitted offset for the consumer group ID. |
Partition refresh period(ms) | Optional. Period, in milliseconds, after which Kafka is polled for new topics or partitions. Default value is 2000. |
Emit null tuples? | Optional. A flag to indicate if null tuples should be emitted to downstream components or not. Default value is false. |
First retry delay(ms) | Optional. Interval delay, in milliseconds, for first retry of a failed Kafka spout message. Default value is 0. |
Retry delay period(ms) | Optional. Retry delay period, in milliseconds, for the second and subsequent retries of a failed Kafka spout message. Delays follow a geometric progression. Default value is 2. |
Maximum retries | Optional. Maximum number of times a failed message is retried before it is acked and committed. Default value is 2147483647. |
Maximum retry delay(ms) | Optional. Maximum interval, in milliseconds, to wait before successive retries for a failed Kafka spout message. Default value is 10000. |
Consumer startup delay(ms) | Optional. Delay, in milliseconds, after which Kafka is polled for records. This delay is intended to ensure that all executors are active before they are polled, so that partitions are well balanced among executors. It also helps prevent onPartitionsRevoked and onPartitionsAssigned events from occurring and causing duplicate tuples. Default value is 60000. |
SSL keystore location | Optional. The location of the key store file. Used when Kafka client connectivity is over SSL. |
SSL keystore password | Optional. The store password for the key store file. |
SSL key password | Optional. The password of the private key in the key store file. |
SSL truststore location | Optional (Mandatory for SSL). The location of the trust store file. |
SSL truststore password | Optional (Mandatory for SSL). The password for the trust store file. |
SSL enabled protocols | Optional. Comma-separated list of protocols enabled for SSL connections. |
SSL keystore type | Optional. File format of the keystore file. Default value is JKS. |
SSL truststore type | Optional. File format of the truststore file. Default value is JKS. |
SSL protocol | Optional. SSL protocol used to generate SSLContext. Default value is TLS. |
SSL provider | Optional. Security provider used for SSL connections. Default value is the default security provider for the JVM. |
SSL cipher suites | Optional. Comma-separated list of cipher suites. A cipher suite is a named combination of authentication, encryption, MAC, and key exchange algorithms used to negotiate the security settings for a network connection using the TLS or SSL network protocol. By default, all available cipher suites are supported. |
SSL endpoint identification algorithm | Optional. The endpoint identification algorithm used to validate the server host name against the server certificate. |
SSL key manager algorithm | Optional. The algorithm used by key manager factory for SSL connections. Default value is SunX509. |
SSL secure random implementation | Optional. The SecureRandom PRNG implementation to use for SSL cryptographic operations. |
SSL trust manager algorithm | Optional. The algorithm used by trust manager factory for SSL connections. Default value is PKIX, the trust manager factory algorithm configured for the Java Virtual Machine. |
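
The conditional requirements in Table 6.1 can be checked before a source is saved. The following is a minimal sketch, not SAM code: the dictionary keys simply reuse the field labels from the table, the helper names `parse_bootstrap_servers` and `missing_fields` are hypothetical, and treating SASL_SSL as requiring both the SASL and the SSL fields is an assumption based on the "protocol with SSL or SASL" wording above.

```python
# Illustrative only: keys mirror the table's field labels, not a real SAM API.
ALWAYS_REQUIRED = [
    "Cluster Name",
    "Security Protocol",
    "Bootstrap Servers",
    "Kafka topic",
    "Consumer Group Id",
]
SASL_REQUIRED = [
    "Kerberos client principal",
    "Kerberos keytab file",
    "Kafka service name",
]
SSL_REQUIRED = ["SSL truststore location", "SSL truststore password"]

SASL_PROTOCOLS = {"SASL_PLAINTEXT", "SASL_SSL"}
# Assumption: SASL_SSL also needs the SSL truststore fields.
SSL_PROTOCOLS = {"SSL", "SASL_SSL"}


def parse_bootstrap_servers(value):
    """Split a comma-separated host:port string into (host, port) pairs."""
    servers = []
    for entry in value.split(","):
        host, sep, port = entry.strip().rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        servers.append((host, int(port)))
    return servers


def missing_fields(config):
    """Return the mandatory fields that are absent or empty in config."""
    required = list(ALWAYS_REQUIRED)
    protocol = config.get("Security Protocol", "")
    if protocol in SASL_PROTOCOLS:
        required += SASL_REQUIRED
    if protocol in SSL_PROTOCOLS:
        required += SSL_REQUIRED
    return [name for name in required if not config.get(name)]
```

Calling `missing_fields` with a SASL_PLAINTEXT configuration that omits the Kerberos settings returns the three Kerberos-related field names, which can then be reported back to the user.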
Table 6.2. Event Hubs
Configuration Field | Description, requirements, tips for configuration |
Username | The Event Hubs user name (policy name in Event Hubs Portal) |
Password | The Event Hubs password (shared access key in Event Hubs Portal) |
Namespace | The Event Hubs namespace |
Entity Path | The Event Hubs entity path |
Partition Count | The number of partitions in the Event Hub
ZooKeeper Connection String | The ZooKeeper connection string |
Checkpoint Interval | The frequency at which offsets are checkpointed |
Receiver Credits | Receiver credits |
Max Pending Messages Per Partition | The max pending messages per partition |
Enqueue Time Filter | The enqueue time filter |
Consumer Group Name | The consumer group name |
Table 6.3. HDFS
Configuration Field | Description, requirements, tips for configuration |
Cluster Name | Service pool defined in SAM from which to get metadata about the HDFS cluster.
HDFS URL | The HDFS NameNode URL.
Input File Format | The format of the file being consumed dictates the type of reader used to read the file. Currently, only com.hortonworks.streamline.streams.runtime.storm.spout.JsonFileReader is supported. |
Source Dir | The HDFS directory from which to read the files. |
Archive Dir | The HDFS location to which files from the source directory are moved after being completely read.
Bad Files Dir | The HDFS location to which files from the source directory are moved if an error is encountered while processing them.
Lock Dir | Location in which lock files (used to synchronize multiple reader instances) are created. Defaults to a '.lock' subdirectory under the source directory.
Commit Frequency Count | If not set to 0, records progress in the lock file after the specified number of records are processed. |
Commit Frequency Secs | The number of seconds after which progress in the lock file is recorded. |
Max Outstanding | Limits the number of unACKed tuples by pausing tuple generation (if ACKers are used in the topology). |
Lock Timeout Seconds | Duration of inactivity after which a lock file is considered abandoned and ready for another spout to take ownership. |
Ignore Suffix | File names with this suffix in the source directory are not processed. |
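
The commit, lock timeout, and ignore suffix settings in Table 6.3 interact in a way that is easier to see in code. The sketch below is an illustration under stated assumptions, not the spout's actual implementation: the function names are hypothetical, the count-based and time-based triggers are assumed to fire independently, and the exact comparison boundaries (`>=` versus `>`) are guesses.

```python
def should_commit(records_since_commit, seconds_since_commit,
                  commit_frequency_count, commit_frequency_secs):
    """Return True when progress should be recorded in the lock file.

    A commit_frequency_count of 0 disables the count-based trigger,
    leaving only the time-based one (assumed behavior).
    """
    count_due = (commit_frequency_count != 0
                 and records_since_commit >= commit_frequency_count)
    time_due = seconds_since_commit >= commit_frequency_secs
    return count_due or time_due


def is_lock_abandoned(seconds_since_last_update, lock_timeout_seconds):
    """A lock file idle for longer than the timeout can be taken over."""
    return seconds_since_last_update > lock_timeout_seconds


def files_to_process(file_names, ignore_suffix):
    """Drop file names that end with the configured ignore suffix."""
    if not ignore_suffix:
        return list(file_names)
    return [name for name in file_names if not name.endswith(ignore_suffix)]
```

For example, with Commit Frequency Count set to 0 and Commit Frequency Secs set to 10, `should_commit` ignores the record count entirely and reports a commit as due only once 10 seconds have elapsed.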