Xml and More: Java Annotation Patterns Engine (JAPE)

In this article, we will introduce the Java Annotation Patterns Engine (JAPE), a component of the open-source General Architecture for Text Engineering (GATE) platform.

Example of JAPE Grammar

See below for an example of JAPE grammar (by convention, file type of jape). And see the detailed descriptions of its components in later sections. The context under which this grammar will be executed includes:

doc —Implements gate.Document interface. JAPE rules always work on an input annotation set associated with a specific document.
inputAS —Implements gate.AnnotationSet interface and represents the input annotation set.
outputAS - Implements gate.AnnotationSet interface and represents the output annotation set.
ontology - Implements gate.creole.ontology.Ontology interface, which can be used in language analyzing.

// Valentin Tablan, 29/06/2001
// $id$


Phase:postprocess
Input: Token SpaceToken
Options: control = appelt debug = true

// CR+LF | CR |LF+CR -> One single SpaceToken
Rule: NewLine
(
({SpaceToken.string=="\n"}) |
({SpaceToken.string=="\r"}) |
({SpaceToken.string=="\n"}{SpaceToken.string=="\r"}) |
({SpaceToken.string=="\r"}{SpaceToken.string=="\n"})
):left
-->
{
gate.AnnotationSet toRemove = (gate.AnnotationSet)bindings.get("left");
outputAS.removeAll(toRemove);
//get the tokens
java.util.ArrayList tokens = new java.util.ArrayList(toRemove);
//define a comparator for annotations by start offset
Collections.sort(tokens, new gate.util.OffsetComparator());
String text = "";
Iterator tokIter = tokens.iterator();
while(tokIter.hasNext())
text += (String)((Annotation)tokIter.next()).getFeatures().get("string");

gate.FeatureMap features = Factory.newFeatureMap();
features.put("kind", "control");
features.put("string", text);
features.put("length", Integer.toString(text.length()));
outputAS.add(toRemove.firstNode(), toRemove.lastNode(), "SpaceToken", features);
}

What's JAPE?

JAPE is a finite state transducer that operates over annotations based on regular expressions. Thus it is useful for pattern-matching, semantic extraction, and many other operations over syntactic trees such as those produced by natural language parsers.

A JAPE grammar consists of a set of phases, each of which consists of a set of pattern/action rules. The phases run sequentially and constitute a cascade of finite state transducers over annotations. One of the main reasons for using a sequence of phases is that a pattern can only be used once in each phase, but it can be reused in a later phase.

Input annotations could be specified at the beginning of the grammar. This specifies what types of annotations will be processed by the rules of grammar. Other not-mentioned types will be ignored. By default the transducer will include Token, SpaceToken and Lookup.

Options specification defines the method of rule matching (i.e., control) and debug flag for the rules (i.e., debug) in the grammar. There are five control styles:

brill — When more than one rule matches the same region of the document, they all are fired.
all — Similar to brill, in that it will also execute all matching rules, but the matching will continue from the next offset to the current one.
first — With the first style, a rule fires for the first match that‘s found.
once — Once a rule has fired, the whole JAPE phase exits after the first match.
appelt — Only one rule can be fired for the same region of text, according to a set of priority rules. The appelt control style is the most appropriate for named entity recognition as under appelt only one rule can fire for the same pattern.

If debug is set to true, any rule-firing conflicts will be displayed in the messages window if the grammar is running in appelt mode and there is more than one possible match.

Following the declaration portions of grammar, a list of rules are specified. Each rule consists of a left-hand-side (LHS) and a right-hand-side (RHS). The LHS of the rules consists of an annotation pattern description. The RHS consists of annotation manipulation statements. Annotations matched on the LHS of a rule may be referred to on the RHS by means of labels that are attached to pattern elements.

LHS

LHS contains annotation pattern that may contain regular expression operators e.g. ("+", "?" , "*"). However, you should avoid the use of “*” based regular expressions for better performance.

There are 3 main ways in which the pattern can be specified:

As a string of text
- e.g., {Token.string == "Oracle"}
- This pattern matches a string of text with the value of "Oracle".
As the attributes (and values) of an annotation
- e.g., ({Token.kind == word, Token.category == NNP, Token.orth == upperInitial})?
- The above pattern uses the Part of Speech (POS) annotation where kind=word, category=NNP and orth=upperInitial.
As an annotation previously assigned from a gazetteer, tokeniser, or other module
- ({Company})?:c1 ({Positives}):v ({Company})?:c2 ({Split}|{CC})?
- The above pattern matches annotations of Company type, followed by annotations of Positives type, etc. The first-matched pattern element is labeled as c1, the second-matched pattern element is labeled as v, etc.

There are different kind of operators supported:

Equality operators (“==” and “!=”)
- {Token.kind == "number"}, {Token.length != 4}
Comparison operators (“<”, “<=”, “>=” and “>”)
- {Token.string > "aardvark"}, {Token.length < 10}
Regular expression operators (“=~”, “==~”, “!~” and “!=~”)
- {Token.string =~ "[Dd]ogs"}, {Token.string !~ "(?i)hello"}
- ==~ and !=~ are also provided, for whole-string matching
{X contains Y} and {X within Y} for checking annotations within the context of other annotations

You can even define custom operators by implementing gate.jape.constraint.ConstraintPredicate.

RHS

The right-hand-side (RHS) consists of annotation manipulation statements. For example, you can add/remove/update annotations associated with a document. Alternatively, RHS can contain Java code to create or manipulate annotations. In this article, we will focus only on RHS' implemented in Java code.

On the RHS, Java code can reference the following variables (which are passed as parameters to the RHS action):

doc
bindings
annotations
inputAS
outputAS
ontology

annotations is provided for backward compatibility and should not be used for new implementations. inputAS and outputAS represent the input and output annotation set. Normally, these would be the same (by default when using ANNIE, these will be the “Default” annotation set) . However, the user is at liberty to change the input and output annotation sets in the parameters of the JAPE transducer at runtime. Therefore, it cannot be guaranteed that the input and output annotation sets will be the same, and we should specify the annotation set we are referring to.

Annotations matched on the LHS of a rule may be referred to on the RHS by means of labels that are attached to pattern elements. They can be retrieved by using bindings as follows:

gate.AnnotationSet toRemove = (gate.AnnotationSet)bindings.get("c1");

This returns a temporary annotation set which holds all the annotations matched on the LHS that have been labeled as "c1."

In the following discussions, we assume you use ANNIE and both inputAS and outputAS points to the same annotation set.

On the RHS, you can do any of the following:

Remove annotations from document's annotation set(s)
Update annotations in document's annotation set(s)
Add new annotations to document's annotation set(s)

However, if you try to remove annotations while using the same iterator for other tasks at the same time, you may see:

java.util.ConcurrentModificationException

The solution is to collect all to-be-removed annotations on a list and process them at the end.

RhsAction

To understand the RHS action, you need to know how a JAPE rule is translated to its executable binary in Java. For example, rule ConjunctionIdentifier2

Rule: ConjunctionIdentifier2
(
({Token.category=="CC"}):conj2
)
-->
:conj2
{

gate.AnnotationSet matchedAnns= (gate.AnnotationSet) bindings.get("conj2");
gate.FeatureMap newFeatures= Factory.newFeatureMap();
newFeatures.put("rule","ConjunctionIdentifierr21");
outputAS.add(matchedAnns.firstNode(),matchedAnns.lastNode(),"CC", newFeatures);
}

will be translated into:

1  // ConjunctionIdentifierConjunctionIdentifier2ActionClass14
2  package japeactionclasses;
3  import java.io.*;
4  import java.util.*;
5  import gate.*;
6  import gate.jape.*;
7  import gate.creole.ontology.*;
8  import gate.annotation.*;
9  import gate.util.*;
10
11  public class ConjunctionIdentifierConjunctionIdentifier2ActionClass14
12  implements java.io.Serializable, RhsAction {
13    public void doit(gate.Document doc,
14                     java.util.Map bindings,
15                     gate.AnnotationSet annotations,
16                     gate.AnnotationSet inputAS, gate.AnnotationSet outputAS,
17                     gate.creole.ontology.Ontology ontology) throws gate.jape.JapeException {
18      gate.AnnotationSet conj2Annots = bindings.get("conj2");
19      if(conj2Annots != null && conj2Annots.size() != 0) {
20
21
22      gate.AnnotationSet matchedAnns= (gate.AnnotationSet) bindings.get("conj2");
23      gate.FeatureMap newFeatures= Factory.newFeatureMap();
24      newFeatures.put.("rule","ConjunctionIdentifierr21");
25      outputAS.add(matchedAnns.firstNode(),matchedAnns.lastNode(),"CC", newFeatures);
26
27      }
28    }
29  }

Notice that RHS of the rule is wrapped into doit method which has the following signature:

public void doit(gate.Document doc,
java.util.Map bindings,
gate.AnnotationSet annotations,
gate.AnnotationSet inputAS, gate.AnnotationSet outputAS,
gate.creole.ontology.Ontology ontology)

That's why you can reference:

doc
bindings
annotations
inputAS
outputAS
ontology

without declaring them.

Also notice that the return type of doit method is void. It means that you can exit from the middle of action execution by issuing a return statement:

if (annotation.getFeatures().get("tagged") != null)
return;

References

Sunday, May 29, 2011

Java Annotation Patterns Engine (JAPE)

1 comment: