Source, operator and sink in DataStream API
A DataStream represents the data records and the operators. There are pre-implemented sources and sinks for Flink, and you can also use custom defined connectors to maintain the dataflow with other functions.
DataStream<String> source = env.addSource(consumer) .name("Kafka Source") .uid("Kafka Source") .map(record -> record.getId() + "," + record.getName() + "," + record.getDescription()) .name("ToOutputString"); StreamingFileSink<String> sink = StreamingFileSink .forRowFormat(new Path(params.getRequired(K_HDFS_OUTPUT)), new SimpleStringEncoder<String>("UTF-8")) .build(); source.addSink(sink) .name("FS Sink") .uid("FS Sink"); source.print();
- Sources are where your program reads its input from. You can attach a source to your
program by using
StreamExecutionEnvironment.addSource(sourceFunction). Flink comes with a number of pre-implemented source functions. For the list of sources, see the Apache Flink documentation.
- Streaming Analytics in Cloudera supports the following sources:
- Operators transform one or more DataStreams into a new DataStream. When choosing the
operator, you need to decide what type of transformation you need on your data. The
following are some basic transformation:
- MapTakes one element and produces one element.
- FlatMapTakes one element and produces zero, one, or more elements.
- FilterEvaluates a boolean function for each element and retains those for which the function returns true.
- KeyByLogically partitions a stream into disjoint partitions. All records with the same key are assigned to the same partition. This transformation returns a
dataStream.keyBy() // Key by field "someKey" dataStream.keyBy() // Key by the first element of a Tuple
- WindowWindows can be defined on already partitioned KeyedStreams. Windows group the data in each key according to some characteristic (e.g., the data that arrived within the last 5 seconds).
dataStream.keyBy().window(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data
For the full list of operators, see the Apache Flink documentation.
- Data sinks consume DataStreams and forward them to files, sockets, external systems, or print them. Flink comes with a variety of built-in output formats that are encapsulated behind operations on the DataStreams. For the list of sources, see the Apache Flink documentation.
- Streaming Analytics in Cloudera supports the following sinks: