TEITOK Help Pages

Standoff Annotations

The biggest drawback of using XML for linguistic annotation is that XML is obligatorily hierarchical: tags cannot overlap. That is why in TEI pages are not marked as blocks; only the beginning of a page is marked by an empty tag, with an implicit ending. Otherwise, paragraphs that span a page break would end up breaking the XML. And that is why in TEITOK, XML elements are split up when they cut across a token (or sentence) boundary. However, such a solution can only solve so much. Whenever the corpus needs to encode information that spans multiple words and that can overlap with other types of information, there is no way to encode such information inside the XML file. The only way to encode it is what is called stand-off annotation: the information is stored in a separate file that is linked to the TEI document.

Stand-Off in TEITOK

In TEITOK, stand-off files are themselves also XML files, stored in the Annotations folder. Each type of stand-off annotation is kept in a separate file, so for error annotation there can be a file Annotations/error.xml. In principle, all error annotations for all XML files are kept in a single file (although they can be physically kept in separate files using XML inclusion). The XML file consists of three parts: the first part contains a description of the annotation type, the second part defines which tags are used in the annotation, and the third part contains the actual annotations as a list of annotations per XML file.
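To make the overlap problem concrete, here is a minimal Python sketch. It is not TEITOK code and does not reproduce the TEITOK file format; the `Span` type, the `crosses` helper, and the token-index representation are all hypothetical names chosen for illustration. It shows two spans over the same tokens that cross each other, so they cannot both be written as nested XML elements, and how a stand-off record can still describe each of them by pointing at tokens.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A hypothetical annotation over a run of tokens (indices are inclusive)."""
    label: str
    start: int
    end: int


def crosses(a: Span, b: Span) -> bool:
    """Return True if a and b overlap while neither contains the other.

    Crossing spans are exactly the ones that cannot be nested as XML elements.
    """
    overlap = a.start <= b.end and b.start <= a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return overlap and not (a_in_b or b_in_a)


def covered_text(tokens: list[str], span: Span) -> str:
    """Join the tokens a span points at."""
    return " ".join(tokens[span.start:span.end + 1])


tokens = ["National", "Bank", "of", "Scotland"]
name1 = Span("name", 0, 1)  # National Bank
name2 = Span("name", 1, 3)  # Bank of Scotland

# Stand-off idea: keep the annotations outside the text, each one
# referring to the tokens it covers (here simply by index).
standoff = [
    {"label": s.label, "tokens": list(range(s.start, s.end + 1))}
    for s in (name1, name2)
]

if __name__ == "__main__":
    print(crosses(name1, name2))
    for s in (name1, name2):
        print(covered_text(tokens, s))
```

Because each record only refers to tokens, any number of crossing annotations can coexist without touching the well-formedness of the document they describe.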
Say we have a file FILE001.xml in our xmlfiles, and we want to mark names in that (and other) files, where our names can overlap, so
we cannot just add them to our XML. Say in National Bank of Scotland we want to be able to mark both National Bank and
Bank of Scotland as a name, which would lead to ill-formed XML if we marked both inline. The way to do that is to keep a file
called
But the fact that stand-off files are kept separately means that the normal way of editing does not work: our FILE001.xml does not even
know there is a names annotation for it. Instead, there is a dedicated module for stand-off annotation, which you call by referring both
to the file you want to annotate and to the type of annotation you want to edit/view:
In order to work with our stand-off annotation, we need to define what we want to annotate. We do this in a file
With this, the interface will allow us to click on a selection of tokens, and a pop-up will appear that lets us type in a UID
and select a Type. We furthermore specify that our types are the relevant subclassification, and the interface will display a button
for each type, and clicking that will highlight all the person names, or all the company names in the text. If we mark out the
National Bank, our
To make it easier to work with annotations, we furthermore need to tell the system about them, which we do in the settings. So for our names, we can add the following section to our settings.xml:
This will create a link at the bottom of each XML file to jump to our names annotation, either by creating a new file or by reading the existing annotation.

Stand-Off in CQP

Stand-off annotations can be exported to CQP by defining which annotations to export in the cqp definitions. Due to the nature of CQP, not all annotations can be fully exported. In order to export our entire annotation, we would add the settings below, which will export our names as name sattributes, with a drop-down for the type just as in the case of other sattributes in TEITOK/CQP:
With that we can search for
But stand-off annotations form one of the main reasons why TEITOK does not use a VRT format: there would be no way to export
our overlapping names using VRT, since it would be impossible to say which name we are closing with a </name>. Instead,
tt-cwb-encode writes CQP files directly, and generates overlapping sattributes if our annotations contain them. It is important
to note that any sattribute overlapping with an existing one is completely ignored by CQP; and even tt-cqp, which does allow
overlapping sattributes, still does not fully support them, since
Another thing is that in TEITOK, annotations can be discontinuous: