Source Configuration Values
Table 6.1. Kafka
Configuration Field | Description, requirements, tips for configuration |
Cluster Name | Mandatory. Service pool defined in SAM to get metadata information about Kafka cluster |
Security Protocol | Mandatory. Protocol used to communicate with Kafka brokers, e.g. PLAINTEXT. Auto-suggests the list of protocols supported by the Kafka service, based on the cluster name selected. If you select a protocol with SSL or SASL, make sure to fill out the related configuration fields |
Bootstrap Servers | Mandatory. A comma separated string of host:port representing Kafka broker listeners. Auto suggest with a list of options based on security protocol selected above |
Kafka topic | Mandatory. Kafka topic to read data from. Make sure that corresponding schema for topic is defined in Schema Registry |
Consumer Group Id | Mandatory. A unique string that identifies the consumer group it belongs to. Used to keep track of consumer offsets |
Reader schema version | Optional. Version of schema for topic to read from. Default value is the version used by producer to write data to topic |
Kerberos client principal | Optional (Mandatory for SASL). Client principal used to connect to brokers when using the SASL GSSAPI mechanism for Kerberos (used when the security protocol is SASL_PLAINTEXT or SASL_SSL) |
Kerberos keytab file | Optional (Mandatory for SASL). Keytab file location on the worker node containing the secret key for the client principal when using the SASL GSSAPI mechanism for Kerberos (used when the security protocol is SASL_PLAINTEXT or SASL_SSL) |
Kafka service name | Optional (Mandatory for SASL). Service name that the Kafka broker is running as (used when the security protocol is SASL_PLAINTEXT or SASL_SSL) |
Fetch minimum bytes | Optional. The minimum number of bytes the broker should return for a fetch request. Default value is 1 |
Maximum fetch bytes per partition | Optional. The maximum amount of data per-partition the broker will return. Default value is 1048576 |
Maximum records per poll | Optional. The maximum number of records a poll will return. Default value is 500 |
Poll timeout(ms) | Optional. Time in milliseconds spent waiting in poll if data is not available. Default value is 200 |
Offset commit period(ms) | Optional. Period in milliseconds at which offsets are committed. Default value is 30000 |
Maximum uncommitted offsets | Optional. Defines the maximum number of polled records that can be pending commit before another poll can take place. Default value is 10000000. This value should depend on the size of each message in Kafka and the memory available to the worker JVM process |
First poll offset strategy | Optional. Offset used by the Kafka spout in the first poll to the Kafka broker. Pick one of the enum values ["EARLIEST", "LATEST", "UNCOMMITTED_EARLIEST", "UNCOMMITTED_LATEST"]. Default value is UNCOMMITTED_EARLIEST, which means that by default the spout starts from the earliest uncommitted offset for the consumer group id provided above |
Partition refresh period(ms) | Optional. Period in milliseconds at which Kafka will be polled for new topics and/or partitions. Default value is 2000 |
Emit null tuples? | Optional. A flag to indicate if null tuples should be emitted to downstream components or not. Default value is false |
First retry delay(ms) | Optional. Interval delay in milliseconds for first retry for a failed Kafka spout message. Default value is 0 |
Retry delay period(ms) | Optional. Retry delay period(geometric progression) in milliseconds for second and subsequent retries for a failed Kafka spout message. Default value is 2 |
Maximum retries | Optional. Maximum number of times a failed message is retried before it is acked and committed. Default value is 2147483647 |
Maximum retry delay(ms) | Optional. Maximum interval in milliseconds to wait before successive retries for a failed Kafka spout message. Default value is 10000 |
Consumer startup delay(ms) | Optional. Delay in milliseconds after which Kafka will be polled for records. This delay ensures that all executors come up before the first poll from each executor happens, so that partitions are well balanced among executors. Otherwise onPartitionsRevoked and onPartitionsAssigned may be called later, causing duplicate tuples to be emitted. Default value is 60000 |
SSL keystore location | Optional. The location of the key store file. Used when Kafka client connectivity is over SSL |
SSL keystore password | Optional. The store password for the key store file |
SSL key password | Optional. The password of the private key in the key store file |
SSL truststore location | Optional (Mandatory for SSL). The location of the trust store file |
SSL truststore password | Optional (Mandatory for SSL). The password for the trust store file |
SSL enabled protocols | Optional. Comma separated list of protocols enabled for SSL connections |
SSL keystore type | Optional. File format of keystore file. Default value is JKS |
SSL truststore type | Optional. File format of truststore file. Default value is JKS |
SSL protocol | Optional. SSL protocol used to generate SSLContext. Default value is TLS |
SSL provider | Optional. Security provider used for SSL connections. Default value is default security provider for JVM |
SSL cipher suites | Optional. Comma separated list of cipher suites. This is a named combination of authentication, encryption, MAC and key exchange algorithm used to negotiate the security settings for a network connection using TLS or SSL network protocol. By default all the available cipher suites are supported |
SSL endpoint identification algorithm | Optional. The endpoint identification algorithm to validate server hostname using server certificate |
SSL key manager algorithm | Optional. The algorithm used by key manager factory for SSL connections. Default value is SunX509 |
SSL secure random implementation | Optional. The SecureRandom PRNG implementation to use for SSL cryptographic operations |
SSL trust manager algorithm | Optional. The algorithm used by trust manager factory for SSL connections. Default value is the trust manager factory algorithm configured for the Java Virtual Machine, typically PKIX |
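The conditional requirements in Table 6.1 (fields marked "Mandatory for SASL" or "Mandatory for SSL") can be checked before a source is saved. The following is a minimal sketch, not part of SAM: the field keys such as `kafka_topic` and the function `missing_fields` are hypothetical names chosen for illustration, and it assumes that SASL_SSL counts as both a SASL and an SSL protocol.

```python
# Hypothetical field keys; the real keys used by SAM are not given in this section.
ALWAYS_REQUIRED = [
    "cluster_name",
    "security_protocol",
    "bootstrap_servers",
    "kafka_topic",
    "consumer_group_id",
]
SASL_REQUIRED = [
    "kerberos_client_principal",
    "kerberos_keytab_file",
    "kafka_service_name",
]
SSL_REQUIRED = ["ssl_truststore_location", "ssl_truststore_password"]

# Assumption: SASL_SSL triggers both the SASL and the SSL requirements.
SASL_PROTOCOLS = {"SASL_PLAINTEXT", "SASL_SSL"}
SSL_PROTOCOLS = {"SSL", "SASL_SSL"}


def missing_fields(config):
    """Return the required field names that are absent or empty in config."""
    required = list(ALWAYS_REQUIRED)
    protocol = config.get("security_protocol", "")
    if protocol in SASL_PROTOCOLS:
        required += SASL_REQUIRED
    if protocol in SSL_PROTOCOLS:
        required += SSL_REQUIRED
    return [name for name in required if not config.get(name)]
```

Calling `missing_fields` on a dictionary of the values entered for the source returns an empty list when every applicable mandatory field is filled in, and otherwise lists the fields still to be supplied.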
Table 6.2. Event Hubs
Configuration Field | Description, requirements, tips for configuration |
Username | The Event Hubs user name (policy name in Event Hubs Portal) |
Password | The Event Hubs password (shared access key in Event Hubs Portal) |
Namespace | The Event Hubs namespace |
Entity Path | The Event Hubs entity path |
Partition Count | The number of partitions in the Event Hub |
ZooKeeper Connection String | The ZooKeeper connection string |
Checkpoint Interval | The frequency at which offsets are checkpointed |
Receiver Credits | Receiver credits |
Max Pending Messages Per Partition | The max pending messages per partition |
Enqueue Time Filter | The enqueue time filter |
Consumer Group Name | The consumer group name |
Table 6.3. HDFS
Configuration Field | Description, requirements, tips for configuration |
Cluster Name | Service pool defined in SAM to get metadata information about HDFS cluster |
HDFS URL | HDFS namenode URL |
Input File Format | The format of the file being consumed dictates the type of reader used to read the file. Currently only ‘com.hortonworks.streamline.streams.runtime.storm.spout.JsonFileReader’ is supported |
Source Dir | The HDFS directory from which to read the files. |
Archive Dir | Files from source dir will be moved to this HDFS location after being completely read. |
Bad Files Dir | Files from Source Dir will be moved to this HDFS location if there is an error encountered while processing them. |
Lock Dir | Lock files (used to synchronize multiple reader instances) will be created in this location. Defaults to a '.lock' subdirectory under the source directory. |
Commit Frequency Count | Records progress in the lock file after the specified number of records are processed. Setting it to 0 disables this. |
Commit Frequency Secs | Records progress in the lock file after the specified number of seconds has elapsed. Must be greater than 0. |
Max Outstanding | Limits the number of unACKed tuples by pausing tuple generation (if ACKers are used in the topology). |
Lock Timeout Seconds | Duration of inactivity after which a lock file is considered to be abandoned and ready for another spout to take ownership. |
Ignore Suffix | File names with this suffix in the source dir will not be processed. |
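The interaction between Commit Frequency Count and Commit Frequency Secs in Table 6.3 can be illustrated with a small sketch. This is one reading of the descriptions above, not the spout's actual implementation: the class `CommitTracker` and its method names are hypothetical, and it assumes progress is recorded when either threshold is reached and that both counters reset afterwards.

```python
import time


class CommitTracker:
    """Decides when reader progress should be written to the lock file."""

    def __init__(self, commit_frequency_count, commit_frequency_secs,
                 clock=time.monotonic):
        if commit_frequency_count < 0:
            raise ValueError("Commit Frequency Count must not be negative")
        if commit_frequency_secs <= 0:
            raise ValueError("Commit Frequency Secs must be greater than 0")
        self._count = commit_frequency_count
        self._secs = commit_frequency_secs
        self._clock = clock
        self._records_since_commit = 0
        self._last_commit = clock()

    def record_processed(self):
        """Count one processed record; return True if progress is due."""
        self._records_since_commit += 1
        # A count of 0 disables the record-based trigger.
        count_due = self._count > 0 and self._records_since_commit >= self._count
        time_due = self._clock() - self._last_commit >= self._secs
        if count_due or time_due:
            # Assumption: both triggers restart after progress is recorded.
            self._records_since_commit = 0
            self._last_commit = self._clock()
            return True
        return False
```

With a count of 0, only the time-based trigger remains active, which is why the seconds value must be greater than 0.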