kite-morphlines-protobuf

This Maven module contains morphline commands for reading, extracting, and transforming Protocol Buffers (protobuf) objects.

readProtobuf

The readProtobuf command (source code) parses an InputStream or byte array that contains protobuf data. For each protobuf object, the command emits a morphline record that carries the top-level object as an attachment in the field _attachment_body.

The input stream or byte array is read from the first attachment of the input record.

The command provides the following configuration options:

| Property Name | Default | Description |
|---------------|---------|-------------|
| protobufClass | [] | The fully qualified name of a Java class that was generated by the protoc compiler. This Java class contains protobuf message definitions. |
| outputClass | [] | The name of an inner Java class (within protobufClass) to deserialize the data into. |
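
The two properties together identify one generated message class. The following is a minimal sketch of how such a lookup could work, assuming the inner class is resolved by its JVM binary name (outer class, '$', inner class) and parsed through a static parseFrom(InputStream) method, which protoc-generated Java message classes provide. DemoProtos and ReadProtobufSketch are hypothetical stand-ins so the example runs without the protobuf library; the command's actual implementation may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Method;

final class ReadProtobufSketch {
    // Resolves protobufClass + "$" + outputClass (the binary name of a nested
    // class) and invokes its static parseFrom(InputStream) via reflection.
    static Object parse(String protobufClass, String outputClass, InputStream in)
            throws ReflectiveOperationException {
        Class<?> messageClass = Class.forName(protobufClass + "$" + outputClass);
        Method parseFrom = messageClass.getMethod("parseFrom", InputStream.class);
        return parseFrom.invoke(null, in); // null receiver: the method is static
    }

    public static void main(String[] args) throws Exception {
        byte[] payload = {0x08, 0x02};
        Object message = parse(DemoProtos.class.getName(), "RepeatedLongs",
                new ByteArrayInputStream(payload));
        System.out.println(message.getClass().getName());
    }
}

// Hypothetical stand-in for a protoc-generated outer class (assumption:
// it is NOT real generated code and does not decode protobuf).
final class DemoProtos {
    static final class RepeatedLongs {
        final int byteCount;

        private RepeatedLongs(int byteCount) {
            this.byteCount = byteCount;
        }

        // Stand-in parser: only counts the input bytes.
        public static RepeatedLongs parseFrom(InputStream in) throws IOException {
            return new RepeatedLongs(in.readAllBytes().length);
        }
    }
}
```

With a real generated class on the classpath, the same lookup would use the values from the configuration, for example org.kitesdk.morphline.protobuf.Protos and RepeatedLongs.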

Example usage:

readProtobuf { 
  protobufClass : org.kitesdk.morphline.protobuf.Protos 
  outputClass : RepeatedLongs 
}

And protobuf schema for protoc:

option java_package = "org.kitesdk.morphline.protobuf";
option java_outer_classname = "Protos";
option java_generate_equals_and_hash = true;
option optimize_for = SPEED;

message RepeatedLongs {
  repeated sint64 longVal = 1;
}

message Complex {

  message Name {
    optional uint32 intVal = 1;
    optional uint64 longVal = 2;
    optional double doubleVal = 3;
    optional float floatVal = 4;
    repeated string stringVal = 5;
    optional RepeatedLongs repeatedLong = 6;
  }

  message Link {
    repeated string language = 1;
    required string url = 2;
  }

  enum Type {
    QUERY = 1;
    UPDATE = 2;
  }

  required sint32 docId = 1;
  required Name name = 2;
  repeated Link link = 3;
  required Type type = 4;
}

extractProtobufPaths

The extractProtobufPaths command (source code) extracts specific values from a protobuf object, akin to a simple form of XPath. The command uses zero or more path expressions to extract values from a protobuf instance object.

The protobuf input object is expected to be contained in the field _attachment_body, and is typically placed there by an upstream readProtobuf command.

Each path expression consists of a record output field name (to the left of the colon ':') and zero or more path steps (to the right), with the steps separated by a '/' slash. Repeated values (lists) are traversed with the '[]' notation.

The result of a path expression is a list of objects, each of which is added to the given record output field. The command calls the has<PropertyName>() method to check whether a property is set in the protobuf message. If the property is not set, the path expression yields no result, so the output field is not passed to the next command.
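
The traversal rules above can be illustrated with a small sketch. It assumes nested Maps and Lists as a stand-in for a protobuf message, and treats a missing map key as an unset property; PathSketch and its extract method are hypothetical names, and this is not the command's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class PathSketch {
    // Evaluates a path such as "/name/stringVal[]" against nested Maps and
    // Lists that stand in for a protobuf message (hypothetical model).
    static List<Object> extract(Object root, String path) {
        List<Object> current = List.of(root);
        for (String step : path.split("/")) {
            if (step.isEmpty()) {
                continue; // skip the empty token before the leading '/'
            }
            boolean repeated = step.endsWith("[]");
            String field = repeated ? step.substring(0, step.length() - 2) : step;
            List<Object> next = new ArrayList<>();
            for (Object node : current) {
                if (!(node instanceof Map)) {
                    continue; // scalar values have no fields to descend into
                }
                Object value = ((Map<?, ?>) node).get(field);
                if (value == null) {
                    continue; // unset property: contributes no result
                }
                if (repeated && value instanceof List) {
                    next.addAll((List<?>) value); // '[]' visits every element
                } else {
                    next.add(value);
                }
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.of(
                "docId", 7,
                "name", Map.of("stringVal", List.of("a", "b")),
                "link", List.of(
                        Map.of("url", "http://example.com/1"),
                        Map.of("url", "http://example.com/2")));
        System.out.println(extract(doc, "/name/stringVal[]")); // [a, b]
        System.out.println(extract(doc, "/link[]/url"));
        System.out.println(extract(doc, "/name/intVal"));      // [] (unset)
    }
}
```

An empty result list corresponds to the case where the output field is omitted from the record.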

The command provides the following configuration options:

| Property Name | Default | Description |
|---------------|---------|-------------|
| objectExtractMethod | toByteArray | Java method that is called on the protobuf object to get a value to pass to the next command if the type of value on a path is a protobuf object. Options are: toByteArray - the "toByteArray()" method is called to get serialized bytes from the protobuf object. toString - the "toString()" method is called to get a String representation of the protobuf object. none - no method is called and the whole protobuf object is passed to the next command. |
| enumExtractMethod | name | Java method that is called to get a value to pass to the next command if the type of value on a path is an enum object. Options are: name - the "name()" method is called to get a String representation of the enum object. getNumber - the "getNumber()" method is called to get an int representation of the enum object. none - no method is called and the whole enum object is passed to the next command. |
| paths | [] | Zero or more protobuf path expressions. |
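
As a rough illustration of the three enumExtractMethod options, the sketch below maps each option to the value it would produce. The Type enum here is a hand-written, hypothetical stand-in for the generated Complex.Type enum (protoc-generated Java enums expose getNumber() alongside the built-in name()), and EnumExtractSketch is an assumed name, not part of the module.

```java
final class EnumExtractSketch {
    // Hypothetical stand-in for the generated Complex.Type enum.
    enum Type {
        QUERY(1), UPDATE(2);

        private final int number;

        Type(int number) {
            this.number = number;
        }

        int getNumber() {
            return number;
        }
    }

    // Returns the value that each enumExtractMethod option would pass on.
    static Object extract(Type value, String enumExtractMethod) {
        switch (enumExtractMethod) {
            case "name":
                return value.name();      // String representation
            case "getNumber":
                return value.getNumber(); // int representation (boxed here)
            case "none":
                return value;             // the enum object itself
            default:
                throw new IllegalArgumentException(
                        "Unknown enumExtractMethod: " + enumExtractMethod);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(Type.UPDATE, "name"));      // UPDATE
        System.out.println(extract(Type.UPDATE, "getNumber")); // 2
    }
}
```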

Example usage:

extractProtobufPaths {
  objectExtractMethod : toByteArray
  enumExtractMethod : name
  paths : { 
    "docId" : "/docId"
    "name" : "/name"
    "intVal" : "/name/intVal"
    "longVal" : "/name/longVal"
    "doubleVal" : "/name/doubleVal"
    "floatVal" : "/name/floatVal"
    "stringVals" : "/name/stringVal[]"
    "longVals" : "/name/repeatedLong/longVal[]"
    "links" : "/link[]"
    "languages" : "/link[]/language"
    "urls" : "/link[]/url"
    "type" : "/type"
  } 
}