Regular Expression Support

The regex() function finds text fragments that match a regular expression pattern.

Syntax

regex(regular_expression)

The function takes a regular expression enclosed in quotation marks as its argument.

The regular expression must follow Perl regular expression syntax (http://perldoc.perl.org/perlre.html).

The function regex() accepts the following optional named parameters:

Named parameter                        Comments

scope:=word/sentence/paragraph/text    limits the scope of the expression to a word, a sentence, a paragraph, or the whole text;

casesens:=yes/no                       switches case sensitivity on/off;

ignore_ws:=yes/no                      ignores/keeps white space in the regular expression;

wholeword:=yes/no                      requires/does not require the matched fragment to begin and end at a word boundary.

Example

regex("[a-z]{3,3}\d{3,3}") matches a sequence of three letters followed by three digits. This query may be used to match license plate numbers.
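
The pattern itself can be tried outside the query language. The following is a minimal sketch in Python, whose re module shares these constructs with Perl syntax. It assumes the documented defaults of regex() (whole tokens, case insensitive) and approximates them with re.fullmatch and re.IGNORECASE; the name looks_like_plate is hypothetical and not part of the query language.

```python
import re

# Same pattern as in the example; {3,3} means "exactly three".
PLATE = re.compile(r"[a-z]{3,3}\d{3,3}", re.IGNORECASE)

def looks_like_plate(token: str) -> bool:
    """Return True if the whole token is three letters followed by three digits."""
    return PLATE.fullmatch(token) is not None

for token in ["abc123", "XYZ789", "ab1234", "abcd123"]:
    print(token, looks_like_plate(token))
```

Only the first two tokens are accepted; the other two have the wrong number of letters.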

Note

1) By default, the scope parameter is set to word: the function regex() matches separate tokens of the text that correspond to the expression.

If users also wish to find word fragments corresponding to a regular expression, they should add the parameter wholeword:=no.

Example

regex("a.*t") matches "aircraft", "about", but does not match "imaginative" or "statement".

regex("a.*t", wholeword:=no) matches "aircraft", "about", "imaginative", and "statement".
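
The difference between the two queries can be imitated in Python. This is only a sketch of the behaviour described above, not the engine's implementation: it assumes that the default word scope behaves like re.fullmatch on a token and that wholeword:=no behaves like re.search. The helper token_matches is hypothetical.

```python
import re

def token_matches(pattern: str, token: str, wholeword: bool = True) -> bool:
    """Approximate regex(pattern) applied to a single token.

    wholeword=True  -> the pattern must cover the whole token
    wholeword=False -> the pattern may match any fragment of the token
    """
    matcher = re.fullmatch if wholeword else re.search
    return matcher(pattern, token, re.IGNORECASE) is not None

words = ["aircraft", "about", "imaginative", "statement"]
print([w for w in words if token_matches("a.*t", w)])
print([w for w in words if token_matches("a.*t", w, wholeword=False)])
```

The first list contains only "aircraft" and "about"; the second contains all four words.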

2) To find a text fragment that is composed of more than one token and corresponds to a regular expression, users should specify the optional parameter scope:=sentence, scope:=paragraph, or scope:=text.

Example

regex("a.+t"), which is equivalent to regex("a.+t", scope:=word), matches "aircraft", "about", "affect", but does not match "answered that", because the latter is made up of more than one token.

regex("a.+t", scope:=sentence) matches "aircraft", "about", "affect", and "answered that".
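
The effect of the scope parameter can be sketched in Python as follows. The sketch assumes that tokens are separated by white space and that a sentence is searched as one string; the real tokenizer may differ. The function find_matches is hypothetical.

```python
import re

def find_matches(pattern: str, text: str, scope: str = "word") -> list[str]:
    """Approximate the scope parameter for one sentence of text.

    scope="word"     -> each white-space separated token must match as a whole
    scope="sentence" -> the pattern may span several tokens
    """
    if scope == "word":
        return [t for t in text.split() if re.fullmatch(pattern, t, re.IGNORECASE)]
    return [m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE)]

print(find_matches("a.+t", "she answered that"))                    # no token matches
print(find_matches("a.+t", "she answered that", scope="sentence"))  # spans two tokens
```

With the word scope neither "answered" nor "that" matches on its own; with the sentence scope the fragment "answered that" is found.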

By default, if scope:=sentence/paragraph/text, a text fragment corresponding to a regular expression does not have to begin or end at a word boundary. To change this behaviour, users should add wholeword:=yes to the query expression.

Example

regex("a.*b", scope:=sentence) matches "WAS WEARING BOOTS" and "Ann Webb".

regex("a.*b", scope:=sentence, wholeword:=yes) matches "Ann Webb", but does not match "WAS WEARING BOOTS".
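
A rough Python equivalent of these two queries wraps the pattern in \b word-boundary assertions when wholeword:=yes is requested. This is an assumption about how the option behaves, made only for illustration; sentence_match is a hypothetical helper.

```python
import re

def sentence_match(pattern: str, sentence: str, wholeword: bool = False):
    """Approximate regex(pattern, scope:=sentence) with optional wholeword:=yes."""
    if wholeword:
        # Require a word boundary on both sides of the matched fragment.
        pattern = r"\b(?:" + pattern + r")\b"
    m = re.search(pattern, sentence, re.IGNORECASE)
    return m.group(0) if m else None

print(sentence_match("a.*b", "WAS WEARING BOOTS"))                  # fragment inside words
print(sentence_match("a.*b", "WAS WEARING BOOTS", wholeword=True))  # None
print(sentence_match("a.*b", "Ann Webb", wholeword=True))           # whole words
```

In "WAS WEARING BOOTS" the match starts inside "WAS" and ends inside "BOOTS", so it disappears once word boundaries are required.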

3) By default, the casesens parameter is set to no (the function regex() does not take case into consideration).

To make the expression case sensitive, users should add the named parameter casesens:=yes.

Example

regex("A.+t", casesens:=yes) matches "Agreement", "Airport", but does not match "AGREEMENT" or "airport".
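
In Python terms, the casesens parameter roughly corresponds to the presence or absence of the re.IGNORECASE flag. The sketch below assumes whole-token matching as in the default word scope; case_match is a hypothetical name.

```python
import re

def case_match(pattern: str, token: str, casesens: bool = False) -> bool:
    """Approximate regex(pattern, casesens:=yes/no) for a single token."""
    flags = 0 if casesens else re.IGNORECASE
    return re.fullmatch(pattern, token, flags) is not None

tokens = ["Agreement", "Airport", "AGREEMENT", "airport"]
print([t for t in tokens if case_match("A.+t", t, casesens=True)])
print([t for t in tokens if case_match("A.+t", t)])
```

With casesens=True only "Agreement" and "Airport" are kept; with the default setting all four tokens match.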

4) By default, ignore_ws:=yes, i.e., white space within a regular expression is ignored. Users can change this behaviour either by specifying ignore_ws:=no or by using the white-space special character \s in the expression.

Example

regex("a b", scope:=text), which is equivalent to regex("ab", scope:=text), matches "absent", but does not match "a basketball".

regex("a b", scope:=text, ignore_ws:=no) matches "a basketball", but does not match "absent".
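
The closest Python counterpart of ignore_ws:=yes is the re.VERBOSE flag, which also ignores white space written in the pattern. It is used here only as an approximation (re.VERBOSE additionally treats "#" as the start of a comment); text_search is a hypothetical helper.

```python
import re

def text_search(pattern: str, text: str, ignore_ws: bool = True):
    """Approximate regex(pattern, scope:=text, ignore_ws:=yes/no)."""
    flags = re.IGNORECASE
    if ignore_ws:
        flags |= re.VERBOSE  # white space in the pattern is skipped
    m = re.search(pattern, text, flags)
    return m.group(0) if m else None

print(text_search("a b", "absent"))                         # pattern is read as "ab"
print(text_search("a b", "a basketball"))                   # None
print(text_search("a b", "a basketball", ignore_ws=False))  # the space is literal
print(text_search(r"a\sb", "a basketball"))                 # \s still matches a space
```

The last call shows the alternative mentioned above: \s matches white space even when literal spaces in the pattern are ignored.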

Task example 1: Find telephone numbers

To match telephone numbers in the formats +X-XXX-XXX-XXXX, X-XXX-XXX-XXXX, and XXX-XXX-XXXX, users can write the following query expression:

regex("\+?\d?\-?\d{3,3}\-?\d{3,3}\-?\d{4,4}", scope:=sentence)

This expression does not work without the scope:=sentence parameter, because the text fragment being searched for is made up of several tokens.
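
The regular expression can be checked against sample numbers with a short Python sketch. Searching the whole sentence string stands in for scope:=sentence; the sample numbers and the name find_phone_numbers are made up for illustration.

```python
import re

# Same pattern as in the query above.
PHONE = re.compile(r"\+?\d?\-?\d{3,3}\-?\d{3,3}\-?\d{4,4}")

def find_phone_numbers(sentence: str) -> list[str]:
    """Return every fragment of the sentence that matches the phone pattern."""
    return PHONE.findall(sentence)

sample = "Call +1-555-123-4567, 1-555-123-4567 or 555-123-4567."
print(find_phone_numbers(sample))
```

Because every hyphen in the pattern is optional, a run of ten digits such as 5551234567 is matched as well.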

Task example 2: Find e-mail addresses

To search for e-mail addresses, users can write the following query:

regex("[a-z]{1,20}@[a-z]{1,20}\.[a-z]{1,3}", scope:=sentence)

As in the example above, the query does not work without the scope:=sentence parameter, because the text fragment being searched for is composed of several tokens.

Addresses that contain punctuation marks ("." or "-") before the "@" sign can be found by adding these characters to the first character class:

regex("[a-z.-]{1,20}@[a-z]{1,20}\.[a-z]{1,3}", scope:=sentence)
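
The difference between the two e-mail patterns can be seen in a small Python sketch. It searches one sentence string case-insensitively, which is assumed to approximate scope:=sentence with the default casesens:=no; the address and the name first_match are made up for illustration.

```python
import re

BASIC = r"[a-z]{1,20}@[a-z]{1,20}\.[a-z]{1,3}"
WITH_PUNCT = r"[a-z.-]{1,20}@[a-z]{1,20}\.[a-z]{1,3}"

def first_match(pattern: str, sentence: str):
    """Return the first fragment of the sentence matching the pattern, or None."""
    m = re.search(pattern, sentence, re.IGNORECASE)
    return m.group(0) if m else None

sentence = "Write to john.smith@example.com today."
print(first_match(BASIC, sentence))       # stops at the dot before "smith"
print(first_match(WITH_PUNCT, sentence))  # covers the whole address
```

The first pattern finds only "smith@example.com", while the second finds the complete address. Both patterns limit the final part to three letters, so longer top-level domains are matched only partially.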
