Search within a given range

The near() function finds its arguments in any order within a specified distance of each other.

near() accepts words and phrases in any form, as well as other word or sequence search functions, as arguments.

Syntax

near([distance,] term_1, term_2,..)

The optional parameter distance is used to set the maximum allowed difference between the positions of the first and the last argument of the sequence. Its default value is 1.

The function also supports the following optional named parameters:

Parameter | Explanation
distance | Sets the exact difference in positions between the first and the last argument of the sequence.
min_distance | Sets the minimum difference in positions between the first and the last argument of the sequence.
max_distance | Sets the maximum difference in positions between the first and the last argument of the sequence. Synonymous with the optional positional parameter distance.
range | Sets the exact length of the sequence in tokens.
min_range | Sets the minimum length of the sequence in tokens.
max_range | Sets the maximum length of the sequence in tokens.
gap | Sets the exact number of tokens between arguments of the sequence.
min_gap | Sets the minimum number of tokens between arguments of the sequence.
max_gap | Sets the maximum number of tokens between arguments of the sequence.
interval | Sets the exact difference in positions between arguments of the sequence.
min_interval | Sets the minimum difference in positions between arguments of the sequence.
max_interval | Sets the maximum difference in positions between arguments of the sequence.

The optional named parameters allow_punct and allow_space control whether punctuation marks and spaces, respectively, are allowed within the sequence. Both parameters take the following values:

Value | Explanation
yes (default value) | Punctuation marks or spaces are allowed within the sequence.
no | Punctuation marks or spaces are not allowed within the sequence.

Note

Characters such as number sign (#), ampersand (&), commercial at sign (@) and percent sign (%) are considered special symbols in PolyAnalyst rather than punctuation. Thus, they are included in the distance count.

Examples

near(document, number()) matches «document: 110», but not «document # 110».

near(2, document, number()) matches «document: 110» and «document # 110».
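The two examples above can be reproduced with a small Python model. It is only an illustrative sketch, not PolyAnalyst code: the helper name counted_distance is hypothetical, the token lists are assumed tokenizations, and the rule that ordinary punctuation does not add to the distance is inferred from the examples rather than stated by the documentation.

```python
import string

# Symbols that the note above says are treated as special symbols, not punctuation.
SPECIAL_SYMBOLS = {"#", "&", "@", "%"}
# Assumption: every other ASCII punctuation character is skipped when counting.
PUNCTUATION = set(string.punctuation) - SPECIAL_SYMBOLS


def counted_distance(tokens, first, last):
    """Position difference between two token indices, not counting punctuation tokens."""
    lo, hi = sorted((first, last))
    skipped = sum(1 for tok in tokens[lo + 1:hi] if tok in PUNCTUATION)
    return (hi - lo) - skipped


print(counted_distance(["document", ":", "110"], 0, 2))  # 1
print(counted_distance(["document", "#", "110"], 0, 2))  # 2
```

Under this model the colon leaves the distance at 1, which fits the default limit, while the number sign raises it to 2, which only the query with distance 2 accepts.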

The optional named parameter match regulates the volume of text extracted by the function. The parameter takes on the following values:

Value | Explanation
arguments (default value) | Only the arguments listed inside the function are extracted.
range | The whole fragment of text between the first and the last argument is extracted.

Examples

near(company, ship) = near(1, company, ship) matches «company’s shipping business» and «a large shipping company».

near(3, company, produce) = near(company, produce, max_range:=4) = near(company, produce, max_gap:=2) matches «the company also produces coking and thermal coal», «the deal will produce the fourth-most-valuable company in the country» and «[…​] for the purpose of producing revenue. The Company considers historical performance […​]».

near(company, loss, profit, max_interval:=10, match:=range) matches «During the fourth quarter of 2001 the Company had a net profit of $163,000 compared to a net loss of $650,000 in 2000.».
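The equivalence near(3, company, produce) = near(company, produce, max_range:=4) = near(company, produce, max_gap:=2) follows from how the measures relate to each other. The Python sketch below is an illustrative model rather than PolyAnalyst code: the helper name span_measures is hypothetical, positions are assumed to be token indices, and gaps and intervals are assumed to be measured between neighbouring arguments.

```python
def span_measures(positions):
    """Derive distance, range, intervals and gaps from argument token positions."""
    ordered = sorted(positions)
    distance = ordered[-1] - ordered[0]   # first-to-last position difference
    token_range = distance + 1            # length of the covered span in tokens
    # Assumption: interval and gap are taken between neighbouring arguments.
    intervals = [b - a for a, b in zip(ordered, ordered[1:])]
    gaps = [step - 1 for step in intervals]  # tokens strictly between neighbours
    return {
        "distance": distance,
        "range": token_range,
        "intervals": intervals,
        "gaps": gaps,
    }


# Two arguments three positions apart: distance 3, range 4, gap 2.
print(span_measures([1, 4]))
```

For two arguments the model gives range = distance + 1 and gap = distance - 1, which is consistent with the three equivalent queries above.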

Notes

  1. The function snear(), synonymous with sentence(near()), finds sequences of arguments in any order within one sentence.

  2. The function fnear() finds sequences of arguments in the specified order within a specified distance.

Both functions support the same named parameters as the near() function.

Examples

snear(3, company, produce) matches «the company also produces coking and thermal coal», «the deal will produce the fourth-most-valuable company in the country», but not «[…​] for the purpose of producing revenue. The Company considers historical performance […​]».

fnear(company, project, max_gap:=2) matches «company’s bauxite project», but not «best for the project and the company».
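The difference between unordered and ordered search can be sketched in Python. This is a toy model under stated assumptions, not the PolyAnalyst matcher: the function find_near and its ordered flag are hypothetical, text is split on whitespace, and prefix matching is used as a crude stand-in for matching words in any form.

```python
from itertools import product


def matches(token, term):
    # Crude stand-in for word-form matching: the token starts with the term.
    return token.lower().startswith(term.lower())


def find_near(tokens, terms, max_distance=1, ordered=False):
    """True if each term matches a distinct token and all hits fit within max_distance."""
    candidates = [
        [i for i, tok in enumerate(tokens) if matches(tok, term)] for term in terms
    ]
    for combo in product(*candidates):
        if len(set(combo)) != len(combo):
            continue  # the same token cannot satisfy two terms
        if max(combo) - min(combo) > max_distance:
            continue  # hits are too far apart
        if ordered and list(combo) != sorted(combo):
            continue  # ordered search requires the terms in query order
        return True
    return False


phrase_a = "company's bauxite project".split()
phrase_b = "best for the project and the company".split()

# max_gap:=2 for two arguments corresponds to a position difference of at most 3.
print(find_near(phrase_a, ["company", "project"], max_distance=3, ordered=True))   # True
print(find_near(phrase_b, ["company", "project"], max_distance=3, ordered=True))   # False
print(find_near(phrase_b, ["company", "project"], max_distance=3, ordered=False))  # True
```

In the second phrase the terms are close enough but appear in reverse order, so only the unordered search accepts it.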

Search within a sentence

The function sentence() finds sequences of arguments in any order within a specified number of sentences.

sentence() accepts words and phrases in any form, as well as other word or sequence search functions, as arguments.

Syntax

sentence([distance,] [term_1,] [term_2,..])

This function has no required arguments. When called without arguments, it matches all sentences. The optional parameter distance sets the maximum number of sentences within which the arguments must be found. Its default value is 1.

The optional named parameter match regulates the volume of text extracted by the function. The parameter takes on the following values:

Value | Explanation
arguments (default value) | Only the arguments listed inside the function are extracted.
range | The whole fragment of text between the first and the last argument is extracted.

The function also supports the optional named parameter whole:=yes, which extracts only sentences made up entirely of the arguments listed in the query.

Examples

sentence(company, project) matches «The company also is moving forward with an enterprise data warehouse project.» and «The cost of the project is estimated at $3.6 billion, the companies said».

sentence(teva, cut, job) matches «Teva cuts 1,000 jobs across US and Europe» and «Teva cuts jobs».

sentence(teva, cut, job, whole:=yes) matches «Teva cuts jobs».
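The effect of whole:=yes in these examples can be modelled with a short Python sketch. It is a simplification under stated assumptions, not the PolyAnalyst implementation: the function sentence_match is hypothetical, the sentence is split on whitespace, terms are given in lowercase, and prefix matching stands in for matching words in any form.

```python
def sentence_match(sentence, terms, whole=False):
    """Toy check: every term occurs in the sentence; with whole=True, nothing else does."""
    tokens = sentence.lower().split()

    def matches(token, term):
        # Crude stand-in for word-form matching: "cuts" starts with "cut".
        return token.startswith(term)

    all_found = all(any(matches(tok, term) for tok in tokens) for term in terms)
    if not whole:
        return all_found
    # whole: every token of the sentence must be accounted for by some term.
    only_terms = all(any(matches(tok, term) for term in terms) for tok in tokens)
    return all_found and only_terms


terms = ["teva", "cut", "job"]
print(sentence_match("Teva cuts 1,000 jobs across US and Europe", terms))              # True
print(sentence_match("Teva cuts 1,000 jobs across US and Europe", terms, whole=True))  # False
print(sentence_match("Teva cuts jobs", terms, whole=True))                             # True
```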

Notes

  1. The function sfollow(), synonymous with sentence(follow()) or follow(1,), matches sequences of arguments in the specified order within one sentence.

  2. The function snear(), synonymous with sentence(near()), finds sequences of arguments in any order within one sentence.

Search within a line

The function line() finds sequences of arguments in any order within a specified number of lines.

line() accepts words and phrases in any form, as well as other word or sequence search functions, as arguments.

Syntax

line([lines_number,] [term_1] [, term_2])

This function has no required arguments. When called without arguments, the function matches all lines.

The optional parameter lines_number sets the maximum number of lines within which the arguments must be found. Its default value is 1.

The function also supports the following optional named parameters:

Parameter | Explanation
match:=arguments (default value) | Only the arguments listed inside the function are extracted.
match:=range | The whole fragment of text between the first and the last argument is extracted.
whole:=yes | Only whole lines made up entirely of the arguments listed in the query are extracted.
min_length | Sets the minimum line length in tokens.
max_length | Sets the maximum line length in tokens.

Note

The line() function ignores empty lines between arguments.

Example

line(2, entity(People), entity(Phone Numbers)) matches all mentions of people and their telephone numbers within two lines of text, ignoring any empty lines between them.
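How empty lines affect the line count can be illustrated with a minimal Python sketch. It is an assumption-laden model, not PolyAnalyst code: the helper lines_spanned is hypothetical, the sample text is invented for illustration, and a line is treated as empty when it contains only whitespace.

```python
def lines_spanned(lines, first, last):
    """Count the non-empty lines from index first to index last, inclusive."""
    lo, hi = sorted((first, last))
    return sum(1 for line in lines[lo:hi + 1] if line.strip())


text = ["John Smith", "", "+1 555 0100"]  # invented sample with a blank line in between
print(lines_spanned(text, 0, 2))  # 2
```

Because the blank line is not counted, the name and the phone number span two lines and fall within the limit of line(2, ...).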

Search within a paragraph

The function paragraph() finds sequences of arguments in any order within a specified number of paragraphs.

paragraph() accepts words and phrases in any form, as well as other word or sequence search functions, as arguments.

Syntax

paragraph([paragraphs_number,] [term_1] [, term_2])

This function has no required arguments. When called without arguments, the function matches all paragraphs.

The optional parameter paragraphs_number sets the maximum number of paragraphs within which the arguments must be found. Its default value is 1.

The optional named parameter match regulates the volume of text extracted by the function. The parameter takes on the following values:

Value | Explanation
arguments (default value) | Only the arguments listed inside the function are extracted.
range | The whole fragment of text between the first and the last argument is extracted.

The function also supports the optional named parameter whole:=yes, which extracts only paragraphs made up entirely of the arguments listed in the query.

Note

Results of splitting the text into paragraphs can be seen in the Index or Text Tagger nodes.

Task example: Searching for id and title of document

To find mentions of the id and title of a text in the corpus, the following query can be used:

paragraph(2, "document id", title)

This query searches for its arguments within two consecutive paragraphs, because they might be separated by an empty line.

Search within a dataset

The function document() is used to search within a dataset.

Syntax

document([min,][max,][term_1,][term_2,…​])

This function has no required arguments. When called without arguments, it matches all words in the document.

Optional parameters min and max specify the minimal and maximal document number within a dataset. When they are omitted, the function searches within the whole dataset.

All arguments to search for must be within the same document.

The function also supports the following optional named parameters:

Parameter | Explanation
match:=range | The whole fragment of text between the first and the last argument is extracted.
match:=arguments | Only the arguments listed inside the function are extracted (default value).
whole:=yes/no | Controls whether to extract only sentences made up entirely of the arguments listed in the query (no by default).
allow_punct:=yes/no | Controls whether punctuation marks are allowed within the sequence (yes by default).
allow_space:=yes/no | Controls whether spaces are allowed within the sequence (yes by default).
min_doc:=<numeral> | Specifies the minimum document number within a dataset.
max_doc:=<numeral> | Specifies the maximum document number within a dataset.
mode:=forward/backward | Specifies whether document positions are counted from the beginning (forward) or from the end (backward) of the dataset.

Note

If the first and/or the second argument is a number, it is interpreted as the min_doc or max_doc optional parameter, respectively.

When both the leading numerical arguments min and max and the optional named parameters min_doc and max_doc are specified, the named parameters take priority.

Examples

document(1, 2) matches the first two documents of the dataset.

document(1, 2, mode:=backward) matches the last two documents of the dataset.
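The way the numeric bounds, the named parameters and mode interact can be sketched in Python. This is an illustrative model only, not PolyAnalyst code: the function select_documents is hypothetical, and document numbers are assumed to be 1-based and inclusive, which is inferred from the example document(1, 2) matching the first two documents.

```python
def select_documents(docs, min_pos=None, max_pos=None, *,
                     min_doc=None, max_doc=None, mode="forward"):
    """Pick documents by 1-based inclusive position, counted from the start or the end."""
    # Named parameters take priority over the leading positional numbers.
    if min_doc is not None:
        low = min_doc
    elif min_pos is not None:
        low = min_pos
    else:
        low = 1
    if max_doc is not None:
        high = max_doc
    elif max_pos is not None:
        high = max_pos
    else:
        high = len(docs)
    # In backward mode positions are counted from the end of the dataset.
    ordered = docs if mode == "forward" else docs[::-1]
    selected = ordered[low - 1:high]
    # Return the selection in original dataset order.
    return selected if mode == "forward" else selected[::-1]


dataset = ["d1", "d2", "d3", "d4", "d5"]
print(select_documents(dataset, 1, 2))                        # ['d1', 'd2']
print(select_documents(dataset, 1, 2, mode="backward"))       # ['d4', 'd5']
print(select_documents(dataset, 1, 2, min_doc=2, max_doc=3))  # ['d2', 'd3']
```

The last call shows the priority rule: the named bounds 2 and 3 override the positional bounds 1 and 2.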

The function document() may be combined with functions like case(), length(), lemma(), etc.

Examples

case(upper, document()) matches all documents written in uppercase.

case(upper, document(abc)) matches documents containing ABC in uppercase.

length(2, document(), count:=word) matches documents containing two words or more.

lemma(noun, document()) matches documents containing only nouns.

Task example: Searching for vaccine name mentions in the last 50 documents of a dataset

To find mentions of vaccine names within the last 50 documents of a dataset, the following query can be used:

document(1, 50, orn(Pfizer, Sputnik, Moderna), mode:=backward)
