Impala String Functions
- All the functions that accept STRING arguments also accept the VARCHAR and CHAR types introduced in Impala 2.0.
- Whenever VARCHAR or CHAR values are passed to a function that returns a string value, the return type is normalized to STRING. For example, a call to concat() with a mix of STRING, VARCHAR, and CHAR arguments produces a STRING result.
Related information:
The string functions operate mainly on these data types: STRING Data Type, VARCHAR Data Type (CDH 5.2 or higher only), and CHAR Data Type (CDH 5.2 or higher only).
Function reference:
Impala supports the following string functions:
- ascii(string str)
- Purpose: Returns the numeric ASCII code of the first character of the argument.
Return type: int
- char_length(string a), character_length(string a)
- Purpose: Returns the length in characters of the argument string. Aliases for the length() function.
Return type: int
- concat(string a, string b...)
- Purpose: Returns a single string representing all the argument values joined together.
Return type: string
Usage notes: concat() and concat_ws() are appropriate for concatenating the values of multiple columns within the same row, while group_concat() joins together values from different rows.
- concat_ws(string sep, string a, string b...)
- Purpose: Returns a single string representing the second and following argument values joined together, delimited by a specified separator.
Return type: string
Usage notes: concat() and concat_ws() are appropriate for concatenating the values of multiple columns within the same row, while group_concat() joins together values from different rows.
- find_in_set(string str, string strList)
- Purpose: Returns the position (starting from 1) of the first occurrence of a specified string within a comma-separated string. Returns NULL if either argument is NULL, 0 if the search string is not found, or 0 if the search string contains a comma.
Return type: int
- group_concat(string s [, string sep])
- Purpose: Returns a single string representing the argument value concatenated together for each row of the result set. If the optional separator string is
specified, the separator is added between each pair of concatenated values.
Return type: string
Usage notes: concat() and concat_ws() are appropriate for concatenating the values of multiple columns within the same row, while group_concat() joins together values from different rows.
By default, returns a single string covering the whole result set. To include other columns or values in the result set, or to produce multiple concatenated strings for subsets of rows, include a GROUP BY clause in the query.
Strictly speaking, group_concat() is an aggregate function, not a scalar function like the others in this list. For additional details and examples, see GROUP_CONCAT Function.
- initcap(string str)
- Purpose: Returns the input string with the first letter capitalized.
Return type: string
- instr(string str, string substr)
- Purpose: Returns the position (starting from 1) of the first occurrence of a substring within a longer string.
Return type: int
- length(string a)
- Purpose: Returns the length in characters of the argument string.
Return type: int
- locate(string substr, string str[, int pos])
- Purpose: Returns the position (starting from 1) of the first occurrence of a substring within a longer string, optionally after a particular position.
Return type: int
- lower(string a), lcase(string a)
- Purpose: Returns the argument string converted to all-lowercase.
Return type: string
- lpad(string str, int len, string pad)
- Purpose: Returns a string of a specified length, based on the first argument string. If the specified string is too short, it is padded on the left with a
repeating sequence of the characters from the pad string. If the specified string is too long, it is truncated on the right.
Return type: string
- ltrim(string a)
- Purpose: Returns the argument string with any leading spaces removed from the left side.
Return type: string
- parse_url(string urlString, string partToExtract [, string keyToExtract])
- Purpose: Returns the portion of a URL corresponding to a specified part. The part argument can be 'PROTOCOL', 'HOST', 'PATH', 'REF', 'AUTHORITY', 'FILE', 'USERINFO', or 'QUERY'. Uppercase is required for these literal values. When requesting the QUERY portion of the URL,
you can optionally specify a key to retrieve just the associated value from the key-value pairs in the query string.
Return type: string
Usage notes: This function is important for the traditional Hadoop use case of interpreting web logs. For example, if the web traffic data features raw URLs not divided into separate table columns, you can count visitors to a particular page by extracting the 'PATH' or 'FILE' field, or analyze search terms by extracting the corresponding key from the 'QUERY' field.
- regexp_extract(string subject, string pattern, int index)
- Purpose: Returns the specified () group from a string based on a regular expression pattern. Group 0 refers to the entire extracted string, while group 1,
2, and so on refers to the first, second, and so on (...) portion.
Return type: string
In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX Extended Regular Expression syntax used by the Google RE2 library. For details, see the RE2 documentation. It has most idioms familiar from regular expressions in Perl, Python, and so on, including .*? for non-greedy matches.
In Impala 2.0 and later, a change in the underlying regular expression library could cause changes in the way regular expressions are interpreted by this function. Test any queries that use regular expressions and adjust the expression patterns if necessary. See Incompatible Changes Introduced in Cloudera Impala 2.0.0 / CDH 5.2.0 for details.
Because the impala-shell interpreter uses the \ character for escaping, use \\ to represent the regular expression escape character in any regular expressions that you submit through impala-shell . You might prefer to use the equivalent character class names, such as [[:digit:]] instead of \d which you would have to escape as \\d.
Examples:
This example shows how group 0 matches the full pattern string, including the portion outside any () group:
[localhost:21000] > select regexp_extract('abcdef123ghi456jkl','.*?(\\d+)',0); +------------------------------------------------------+ | regexp_extract('abcdef123ghi456jkl', '.*?(\\d+)', 0) | +------------------------------------------------------+ | abcdef123ghi456 | +------------------------------------------------------+ Returned 1 row(s) in 0.11s
This example shows how group 1 matches just the contents inside the first () group in the pattern string:
[localhost:21000] > select regexp_extract('abcdef123ghi456jkl','.*?(\\d+)',1); +------------------------------------------------------+ | regexp_extract('abcdef123ghi456jkl', '.*?(\\d+)', 1) | +------------------------------------------------------+ | 456 | +------------------------------------------------------+ Returned 1 row(s) in 0.11s
Unlike in earlier Impala releases, the regular expression library used in Impala 2.0 and later supports the .*? idiom for non-greedy matches. This example shows how a pattern string starting with .*? matches the shortest possible portion of the source string, returning the rightmost set of lowercase letters. A pattern string both starting and ending with .*? finds two potential matches of equal length, and returns the first one found (the leftmost set of lowercase letters).
[localhost:21000] > select regexp_extract('AbcdBCdefGHI','.*?([[:lower:]]+)',1); +--------------------------------------------------------+ | regexp_extract('abcdbcdefghi', '.*?([[:lower:]]+)', 1) | +--------------------------------------------------------+ | def | +--------------------------------------------------------+ [localhost:21000] > select regexp_extract('AbcdBCdefGHI','.*?([[:lower:]]+).*?',1); +-----------------------------------------------------------+ | regexp_extract('abcdbcdefghi', '.*?([[:lower:]]+).*?', 1) | +-----------------------------------------------------------+ | bcd | +-----------------------------------------------------------+
- regexp_replace(string initial, string pattern, string replacement)
- Purpose: Returns the initial argument with the regular expression pattern replaced by the final argument string.
Return type: string
In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX Extended Regular Expression syntax used by the Google RE2 library. For details, see the RE2 documentation. It has most idioms familiar from regular expressions in Perl, Python, and so on, including .*? for non-greedy matches.
In Impala 2.0 and later, a change in the underlying regular expression library could cause changes in the way regular expressions are interpreted by this function. Test any queries that use regular expressions and adjust the expression patterns if necessary. See Incompatible Changes Introduced in Cloudera Impala 2.0.0 / CDH 5.2.0 for details.
Because the impala-shell interpreter uses the \ character for escaping, use \\ to represent the regular expression escape character in any regular expressions that you submit through impala-shell . You might prefer to use the equivalent character class names, such as [[:digit:]] instead of \d which you would have to escape as \\d.
Examples:
These examples show how you can replace parts of a string matching a pattern with replacement text, which can include backreferences to any () groups in the pattern string. The backreference numbers start at 1, and any \ characters must be escaped as \\.
Replace a character pattern with new text:
[localhost:21000] > select regexp_replace('aaabbbaaa','b+','xyz'); +------------------------------------------+ | regexp_replace('aaabbbaaa', 'b+', 'xyz') | +------------------------------------------+ | aaaxyzaaa | +------------------------------------------+ Returned 1 row(s) in 0.11s
Replace a character pattern with substitution text that includes the original matching text:
[localhost:21000] > select regexp_replace('aaabbbaaa','(b+)','<\\1>'); +----------------------------------------------+ | regexp_replace('aaabbbaaa', '(b+)', '<\\1>') | +----------------------------------------------+ | aaa<bbb>aaa | +----------------------------------------------+ Returned 1 row(s) in 0.11s
Remove all characters that are not digits:
[localhost:21000] > select regexp_replace('123-456-789','[^[:digit:]]',''); +---------------------------------------------------+ | regexp_replace('123-456-789', '[^[:digit:]]', '') | +---------------------------------------------------+ | 123456789 | +---------------------------------------------------+ Returned 1 row(s) in 0.12s
- repeat(string str, int n)
- Purpose: Returns the argument string repeated a specified number of times.
Return type: string
- reverse(string a)
- Purpose: Returns the argument string with characters in reversed order.
Return type: string
- rpad(string str, int len, string pad)
- Purpose: Returns a string of a specified length, based on the first argument string. If the specified string is too short, it is padded on the right with
a repeating sequence of the characters from the pad string. If the specified string is too long, it is truncated on the right.
Return type: string
- rtrim(string a)
- Purpose: Returns the argument string with any trailing spaces removed from the right side.
Return type: string
- space(int n)
- Purpose: Returns a concatenated string of the specified number of spaces. Shorthand for repeat(' ',n).
Return type: string
- strleft(string a, int num_chars)
- Purpose: Returns the leftmost characters of the string. Shorthand for a call to substr() with 2 arguments.
Return type: string
- strright(string a, int num_chars)
- Purpose: Returns the rightmost characters of the string. Shorthand for a call to substr() with 2 arguments.
Return type: string
- substr(string a, int start [, int len]), substring(string a, int start [, int len])
- Purpose: Returns the portion of the string starting at a specified point, optionally with a specified maximum length. The characters in the string are
indexed starting at 1.
Return type: string
- translate(string input, string from, string to)
- Purpose: Returns the input string with a set of characters replaced by another set of characters.
Return type: string
- trim(string a)
- Purpose: Returns the input string with both leading and trailing spaces removed. The same as passing the string through both ltrim() and rtrim().
Return type: string
- upper(string a), ucase(string a)
- Purpose: Returns the argument string converted to all-uppercase.
Return type: string