Rule Confidence and Conflict Resolution

When one or several rules have overlapping matches, XPDL determines the best match according to the following strategy:

1) If one match is entirely contained within another, only the longest match is returned.

Example

In Figure 1, the upper-level rule ("geo_university") matches university names that consist of a word from Geoadministrative dictionary followed by a word "university" (e.g. "Tokyo University"). The nested rule matches longer phrases, such as "Tokyo University of Technology".

Figure 1. Example ruleset extracting university names

Rule fragment

rule: geo_university
{
 /* for example, "Tokyo University" */
 query: {phrase(dictword(geoadministrative), university)}:univ

 result: Match = $univ

 rule: university_of
 {
 /* for example, "Tokyo University of Technology" */
 query: {phrase(0, $univ, of, case(title))}:univ_of

 result: Match = $univ_of
 }
}

The ruleset is run on the following text:

The rule "geo_university" matches "Osaka University" while the rule "university_of" matches larger phrase "Osaka University of Economics". The match "Osaka University" is filtered out because it is a part of the larger match "Osaka University of Economics". As a result, the ruleset returns only "Osaka University of Economics" as shown in Figure 2.

Figure 2. Output for the ruleset in Figure 1

2) If several rules match partially overlapping sequences, both matches are returned.

Example

In Figure 3 the rule "geo_university" matches university names that consist of a word from Geoadministrative dictionary followed by the word "university" (e.g. "Oxford University"). The rule "context_university" retrieves short names of the universities using the contexts like "graduate from" or "student at" (e.g. "graduate from Oxford").

Figure 3. Example ruleset extracting university names and short names

Rule fragment

 rule: geo_university
 {
 /* for example, "Oxford University" */
 query: {phrase(dictword(geoadministrative), university)}:univ

 result: Match = $univ
 }

 rule: context_university
 {
 /* for example, "graduate from Oxford" */
 query: {phrase(graduate or master, from, case(title))}:univ

 result: Match = $univ
 }

The ruleset is run on the following text:

The first rule matches the phrase "Yale University", while the second rule matches the partially overlapping phrase "graduated from Yale". As seen in Figure 4, both results are returned in the output.

Figure 4. Output for the ruleset in Figure 1

3) If several rules match exactly the same phrase, the rule with the highest confidence value is preferred. The confidence section is optional.

The syntax is as follows:

confidence: <confidence_value>

where <confidence_value> is any number ranging from 0 (the lowest confidence) to 1 (the highest confidence).

Confidence value is used to indicate whether the rule should have priority over other rules in cases where several rules match the same sequence. The default confidence value is 1. If the confidence value is not set, the rule inherits the confidence value from its parent rule.

Note that the confidence value is not statistically calculated. It’s a manually assigned value which reflects the analyst’s subjective assessment of the rule’s ability to capture relevant matches.

If several rules have the same confidence, the rule that outputs more attributes is preferred. If the number of attributes is the same, the rule declared earlier is preferred.

Example

In Figure 5, the rule "geo_university" matches university names that consist of a word from the Geoadministrative dictionary followed by the word "university" (e.g. "Oxford University"). The rule also outputs the university location in a separate column. The rule "title_university" matches university names that consist of a title-cased word followed by the word "university" (e.g. "International University").

Figure 5. Example ruleset extracting university names and locations

Rule fragment

 rule: geo_university
 {
 /* for example, "Oxford University" */
 query: {phrase({dictword(geoadministrative)}:loc, university)}:univ

 confidence: 0.8

 result: Match = $univ
    attribute: Location = $loc
 }

 rule: title_university
 {
 /* for example, "International University" */
 query: {phrase(case(title), university)}:univ

 confidence: 0.8

 result: Match = $univ

 }

The ruleset is run on the following text:

Both rules match exactly the same phrase "Osaka University". In this case the output of the rule "title_university" is preferred because of higher confidence value, so the "Location" column is left empty, as shown in Figure 6.

Figure 6. Output for the ruleset in Figure 5

If the rule "geo_university" had higher or equal confidence value, the output would look like the result shown in Figure 7.

Figure 7. Output for the ruleset in Figure 5 with "geo_university" rule confidence increased

4) If overlapping results have different result names, all of them are returned (even if the matches are identical).

Example

The ruleset in Figure 8, but the rules have equal confidence values and output results under different names ("Match1" and "Match2" respectively).

Figure 8. Example ruleset extracting university names

Rule fragment

 rule: geo_university
 {
 /* for example, "Oxford University" */
 query: {phrase(dictword(geoadministrative), university)}:univ

 result: Match1 = $univ
 }

 rule: title_university
 {
 /* for example, "International University" */
 query: {phrase(case(title), university)}:univ

 result: Match2 = $univ
 }

The ruleset is run on the following text:

Both rules match exactly the same phrase "Osaka University", but as seen in Figure 9, both results are returned because of different names.

Figure 9. Output for the ruleset in Figure 8