In order to find syntactic and text units of a certain length, i.e. words, chunks or sentences, one can use the function length().

Syntax

length(min_length, [max_length,] [term,…])

The first two arguments min_length and max_length are used to set minimum and maximum length (in characters), respectively. The argument min_length is required, while the max_length is optional. If max_length is not specified, the function matches arguments that consist of at least min_length characters.

Example

length(2, 5, .) matches all words from 2 to 5 characters (the dot symbol without quotes matches any word);

length(5, partofspeech(noun)) matches all nouns that are 5 characters or more.

In order to specify a unit of length, users may use the optional named parameter count:= that takes the following values:

Value

Explanation

character

length is calculated in symbols (default value)

token

length is calculated in tokens (both words and punctuation)

word

length is calculated in words

punctuation

sets the number of punctuation signs

Example

length(2,3, keyword()) matches keywords that consist of 2 or 3 characters, e.g., "oil", "gas", "CEO";

length(2,3, keyword(), count:=token) matches keywords that consist of 2 or 3 tokens, e.g., "reference book", "high price".

length(2,2, count:=token, entity(Companies)) matches 2-token company names, e.g. "Abbott Laboratories", "CNS Therapeutics", "Stryker Corporation";

length(3,4, count:=word, sentence()) matches sentences that consist of 3 or 4 words, ignoring punctuation, e.g. "I love cakes", "There are common challenges".

Note: for more information about the entity() and keyword() functions used in the example above see Search for objects extracted by other text analysis nodes.

Task Example: Find USA Postal Codes

phrase(0, length(2,2, case(upper, char(alpha))), length(5, 5, char(digit))) matches two letters in upper case followed by 5 digits, that are often postal codes in the USA (CA 92108, NY 10004, TX 77002, etc.).

pdl length 1