kite-morphlines-saxon
This Maven module contains morphline commands for reading, extracting and transforming XML and HTML with XPath, XQuery and XSLT.
convertHTML
The convertHTML command (source code) converts any HTML to XHTML, using the TagSoup Java library.
Instead of parsing well-formed or valid XML, this command parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup (and hence this command) is designed for people who have to process this stuff using some semblance of a rational application design. The converter allows standard XML tools to be applied to even the worst malformed HTML.
The command reads an InputStream or byte array from the first attachment (field _attachment_body) of the input record, parses it as HTML and replaces the field with UTF-8 encoded XHTML.
The command provides the following configuration options:
Property Name | Default | Description |
---|---|---|
supportedMimeTypes | null | Optionally, require the input record to match one of the MIME types in this list. |
charset | null | The character encoding to use for parsing input, for example, UTF-8. If none is specified the charset specified in the _attachment_charset input field is used instead. |
noNamespaces | true | A value of false indicates namespace URIs and unprefixed local names for element and attribute names will be available. |
noCDATA | false | A value of true indicates that the parser will treat CDATA elements specially. |
noBogons | false | A value of true indicates that the parser will ignore unknown elements. |
emptyBogons | false | A value of true indicates that the parser will give unknown elements a content model of EMPTY; a value of false, a content model of ANY. |
noRootBogons | false | A value of true indicates that the parser will allow unknown elements to be the root element. |
noDefaultAttributes | false | A value of true indicates that the parser will return default attribute values for missing attributes that have default values. |
noColons | false | A value of true indicates that the parser will translate colons into underscores in names. |
noRestart | false | A value of true indicates that the parser will attempt to restart the restartable elements. |
suppressIgnorableWhitespace | true | A value of false indicates that the parser will transmit whitespace in element-only content via the SAX ignorableWhitespace callback. |
Example usage:
convertHTML {
  charset : UTF-8
}
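Because the output is XHTML, convertHTML can be followed directly by the xquery command described in the next section. The following is a minimal sketch rather than an example taken from the project: it chains the two commands to emit one record per hyperlink. It assumes the default noNamespaces setting, so that elements can be matched by their unprefixed names. The query and the linkText field name are illustrative.

```hocon
commands : [
  {
    # Convert real-world HTML in _attachment_body to UTF-8 encoded XHTML
    convertHTML {
      charset : UTF-8
    }
  }
  {
    # Illustrative query: emit one record per <a> element
    xquery {
      fragments : [
        {
          fragmentPath : "/"
          queryString : """
            for $a in //a
            return
              <record>
                {$a/@href}
                <linkText>{string($a)}</linkText>
              </record>
          """
        }
      ]
    }
  }
]
```

Under the record mapping rules of the xquery command, each emitted record should carry an href field (when the attribute is present) and a linkText field.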
xquery
The xquery command (source code) parses an InputStream that contains an XML document and runs the given W3C XQuery over the XML document, using the Saxon Java library. For each item in the query result sequence, the command emits a corresponding morphline record.
The command reads an InputStream or byte array from the first attachment (field _attachment_body) of the input record.
Per the W3C specs, every valid XPath (e.g. //tweets/tweet[@color='blue']) is also a valid XQuery. If you are comfortable with XPath you are already almost there.
An XQuery result sequence contains zero or more items such as element nodes, attribute nodes, text nodes, atomic values, etc. For each item in the query result sequence, the morphline command converts the item to a record and pipes that record to the next morphline command. For an attribute node, the attribute's XPath string value is filled into the record field named after the attribute. For an element node, each attribute and each child of the element is handled the same way: its XPath string value is filled into the record field named after that attribute or child.
For example, in order to generate two morphline records, the first morphline record with a firstName field that contains Joe, as well as a lastName field that contains Bubblegum, and the second morphline record with a firstName field that contains Alice, as well as a lastName field that contains Pellegrino, your xquery command should be formulated such that it outputs two XML fragments like this:
<record>
  <firstName>Joe</firstName>
  <lastName>Bubblegum</lastName>
</record>
<record>
  <firstName>Alice</firstName>
  <lastName>Pellegrino</lastName>
</record>
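As a sketch of how a query might produce such fragments, the command below assumes a hypothetical input document of the form `<people><person><first>Joe</first><last>Bubblegum</last></person>...</people>`. The people, person, first and last element names are invented for illustration and do not come from the project.

```hocon
xquery {
  fragments : [
    {
      fragmentPath : "/"
      queryString : """
        (: Hypothetical input: /people/person with first and last children :)
        for $p in /people/person
        return
          <record>
            <firstName>{string($p/first)}</firstName>
            <lastName>{string($p/last)}</lastName>
          </record>
      """
    }
  ]
}
```

Each `<record>` element in the result sequence becomes one morphline record, with its child elements mapped to fields of the same name.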
The xquery command provides the following configuration options:
Property Name | Default | Description |
---|---|---|
supportedMimeTypes | null | Optionally, require the input record to match one of the MIME types in this list. |
languageVersion | "1.0" | Must be "1.0" for XQuery 1.0 or "3.0" for XQuery 3.0. |
features | null | An optional JSON object containing zero or more name-value pairs that represent configuration properties for Saxon features. |
extensionFunctions | [] | An optional list of Java class names that implement custom Saxon extension functions. Each such Java class must implement net.sf.saxon.s9api.ExtensionFunction as described in the Saxon documentation. |
fragments | n/a | An array containing exactly one fragment JSON object, as described below. |
Each fragment provides the following configuration options: | ||
fragmentPath | n/a | Currently must be "/" |
externalVariables | null | An optional JSON object containing zero or more name-value pairs that are bound and passed in as external variables to the query. Example: myVar : "hello world" |
externalFileVariables | null | An optional JSON object containing zero or more name-path pairs that refer to XML files on the local file system, and are bound and passed in as external variables to the query. These files are loaded once on program startup and subsequently remain memory resident across queries. This can be used for efficient joins where the join table is static and fits into main memory. Example: myDoc : src/test/resources/testdocuments/helloworld.xml |
queryFile | null | A relative or absolute path of a local file from which to load the query. |
queryString | null | An inline string from which to load the query. One of queryFile or queryString must be present, but not both. Example: """/tweets/tweet""" |
Example usage:
xquery {
  fragments : [
    {
      fragmentPath : "/"
      externalVariables : {
        myVariable : "hello world"
      }
      queryString : """
        (: Example test xquery :)
        declare variable $myVariable as xs:string external;
        for $tweet in /tweets/tweet
        return
          <record>
            {$tweet/@id}
            {$tweet/user/@screen_name}
            <myStatusCounts>{string($tweet/user/@statuses_count)}</myStatusCounts>
            <text>{$tweet/text}</text>
            <greeting>{$myVariable}</greeting>
          </record>
      """
    }
  ]
}
Here is an example output record for the query above:
id:11111112
screen_name:fake_user1
myStatusCounts:11111
text:Come, see new hot tub under Redwood Tree!
greeting:hello world
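The same query can be kept in a separate file and referenced with queryFile instead of queryString. The fragment below is a sketch under assumptions: the path conf/tweets.xq is hypothetical and would hold the query text shown above, and languageVersion is set only to show where top-level options go.

```hocon
xquery {
  # Top-level option from the table above; "1.0" is the default
  languageVersion : "3.0"
  fragments : [
    {
      fragmentPath : "/"
      externalVariables : {
        myVariable : "hello world"
      }
      # Hypothetical path to a file containing the query; do not also set queryString
      queryFile : "conf/tweets.xq"
    }
  ]
}
```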
More example usage:
xquery {
  fragments : [
    {
      fragmentPath : "/"
      queryString : """
        (: Example xquery :)
        for $req in /request
        return
          <record>
            <date> { string($req/data/agreementDate) } </date>
            <tradeId> { string($req/trade/@tradeId) } </tradeId>
            <partyId>
              {
                for $keyword in $req/trade/keyWords/keyword
                where $keyword/name = "memberId"
                return string($keyword/value)
              }
            </partyId>
            <fullText> { $req } </fullText>
          </record>
      """
    }
  ]
}
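The externalFileVariables option can support a join against a static lookup document. The following is a rough sketch, not an example from the project. It assumes a hypothetical file conf/users.xml of the form `<users><user screen_name="fake_user1" country="..."/></users>`, and it assumes the variable is bound to the parsed document so that path expressions can be applied to it. The users variable, the file path and the country attribute are all invented for illustration.

```hocon
xquery {
  fragments : [
    {
      fragmentPath : "/"
      externalFileVariables : {
        # Hypothetical lookup file, loaded once at startup and kept in memory
        users : conf/users.xml
      }
      queryString : """
        declare variable $users external;
        for $tweet in /tweets/tweet
        (: Look up the tweet's author in the in-memory lookup document :)
        let $user := $users//user[@screen_name = $tweet/user/@screen_name]
        return
          <record>
            {$tweet/@id}
            <country>{string($user[1]/@country)}</country>
          </record>
      """
    }
  ]
}
```

Taking `$user[1]` keeps the query from failing when the lookup file contains duplicate entries. A tweet with no matching user yields an empty country value.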
More examples can be found in the unit tests.
Here is an example extension function along with a corresponding example xquery.
For more background, see resources such as the XQuery Primer and XQuery FLWOR Tutorial and XQuery: A Guided Tour and Wikipedia.
xslt
The xslt command (source code) parses an InputStream that contains an XML document and runs the given W3C XSL Transform over the XML document, using the Saxon Java library. For each item in the query result sequence, the command emits a corresponding morphline record.
The command reads an InputStream or byte array from the first attachment (field _attachment_body) of the input record.
An XSLT result sequence contains zero or more items such as element nodes, attribute nodes, text nodes, atomic values, etc. For each item in the result sequence, the morphline command converts the item to a record and pipes that record to the next morphline command. For an attribute node, the attribute's XPath string value is filled into the record field named after the attribute. For an element node, each attribute and each child of the element is handled the same way: its XPath string value is filled into the record field named after that attribute or child.
For example, in order to generate two morphline records, the first morphline record with a firstName field that contains Joe, as well as a lastName field that contains Bubblegum, and the second morphline record with a firstName field that contains Alice, as well as a lastName field that contains Pellegrino, your xslt command should be formulated such that it outputs two XML fragments like this:
<record>
  <firstName>Joe</firstName>
  <lastName>Bubblegum</lastName>
</record>
<record>
  <firstName>Alice</firstName>
  <lastName>Pellegrino</lastName>
</record>
The command provides the following configuration options:
Property Name | Default | Description |
---|---|---|
supportedMimeTypes | null | Optionally, require the input record to match one of the MIME types in this list. |
features | null | An optional JSON object containing zero or more name-value pairs that represent configuration properties for Saxon features. |
extensionFunctions | [] | An optional list of Java class names that implement custom Saxon extension functions. Each such Java class must implement net.sf.saxon.s9api.ExtensionFunction as described in the Saxon documentation. |
fragments | n/a | An array containing exactly one fragment JSON object, as described below. |
Each fragment provides the following configuration options: | ||
fragmentPath | n/a | Currently must be "/" |
parameters | null | An optional JSON object containing zero or more name-value pairs that are bound and passed in as XSLT parameters to the query. Example: myVar : "hello world" |
fileParameters | null | An optional JSON object containing zero or more name-path pairs that refer to XML files on the local file system, and are bound and passed in as XSLT parameters to the query. These files are loaded once on program startup and subsequently remain memory resident across queries. This can be used for efficient joins where the join table is static and fits into main memory. Example: myDoc : src/test/resources/testdocuments/helloworld.xml |
queryFile | null | A relative or absolute path of a local file from which to load the query. |
queryString | null | An inline string from which to load the query. One of queryFile or queryString must be present, but not both. |
Example usage:
xslt {
  fragments : [
    {
      fragmentPath : "/"
      parameters : {
        myVariable : "hello world"
      }
      queryString : """
        <!-- Example XSLT identity transformation -->
        <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
          <xsl:template match="@*|node()">
            <xsl:copy>
              <xsl:apply-templates select="@*|node()"/>
            </xsl:copy>
          </xsl:template>
        </xsl:stylesheet>
      """
    }
  ]
}
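The identity transformation above does not read the myVariable parameter. As a sketch of how a stylesheet might consume a parameter and emit `<record>` fragments, the following variant assumes the same /tweets/tweet input shape used in the xquery section. The stylesheet is illustrative and not taken from the project's tests.

```hocon
xslt {
  fragments : [
    {
      fragmentPath : "/"
      parameters : {
        myVariable : "hello world"
      }
      queryString : """
        <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
          <!-- Declared so that the value bound under parameters is visible -->
          <xsl:param name="myVariable"/>
          <xsl:template match="/">
            <xsl:for-each select="/tweets/tweet">
              <record>
                <xsl:copy-of select="@id"/>
                <text><xsl:value-of select="text"/></text>
                <greeting><xsl:value-of select="$myVariable"/></greeting>
              </record>
            </xsl:for-each>
          </xsl:template>
        </xsl:stylesheet>
      """
    }
  ]
}
```

Each `<record>` produced by the template would then map to one morphline record with id, text and greeting fields, following the rules described above.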
More examples can be found in the unit tests.
For more background, see resources such as the XSLT Tutorial and Wikipedia.