GECEG Manual: POS Tagging


1. General Points


A general introduction
base, affix, lemma / subcat, inflection
scheme


2. Base Labels


general introduction to POS
base labels are the innermost part of POS; correspond to root, stem, base.
general principles: "Label part of speech tags by form Principle" examples, "one", D, ONE,? always same, function differentiates usage. Relatively clear cut, (e.g. the blind is (ADJ blind)). Some tricky cases though, where category unstable and changes.
exceptions: (1) SBNJ, even though often like ADV or P. Reason: impossible to decide. So, new category based on function. a list of POS and mnemonic
a more detailed description of each POS
base labels are 2-4 characters long

Base Label Overview List

The GeCeG uses the following 24 basic part of speech labels.

Base Label Mnemonic
ADJ adjective
ADV adverb
COMP complementizer
CONJ coordinating conjunction
DD complex demonstrative
DS simple demonstrative
FW foreign word
NCO common noun
NEG negative particle
NPR proper name
NUM numeral
ONE one
PREP preposition
PRO pronoun
PRO$ possessive pronoun
SBNJ subordinating conjunction
TO non-finite marker to
VBA active or present participle
VBF finite verb
VBI imperative
VBN infinitive
VBP passive or past participle
WADV interrogative adverb
WPRO interrogative pronoun


2.1. Inflection


sequence of inflection for each POS
list of inflections, which POS are they marked on
always attached by default, but some exceptions, e.g. Case, Number, Gender on attributive adjectives, but not on predicative ones, unless marked (earliest texts).

Guidelines Regarding Syncretism

Syncretism refers to the fact that cells in an inflectional paradigm can be formally identical. In early German, this is the case in particular for the inflectional category 'gender', but it can occur with other features as well.
Where syncretism cannot be resolved, the relevant feature is indicated as ^X in the fixed sequence of inflectional categories attached to the part-of-speech base labels. Syncretism is interpreted as any ambiguous value for a category, even if only a subset of possible feature values is actually applicable. For example, since the fixed sequence of inflectional features on common nouns is ^case^number^gender, a label like NCO^ACC^SG^X will represent an accusative, singular noun whose value in the gender slot is syncretic. That means that the gender value is not unambiguous; it could be syncretic only between some possible feature values though, for example, only between masculine and neuter.
However, syncretism is avoided wherever possible. It should be conceived of as a last-resort annotation mechanism that is applied only where all other attempts at disambiguation fail and that would not exist in a perfect world.
Potentially syncretic forms are disambiguated wherever possible in the following ways:
  • (1) Agreement patterns in a specific syntactic environment. A word form may permit several feature structures in isolation but be unambiguous in a given syntactic context. For example, the simple demonstrative dien is dative plural but syncretic for gender. If, however, it occurs as the antecedent of a subject-extracted relative clause, dien die ... 'those who ...', the relative pronoun, which is unambiguously masculine in Old High German, disambiguates the simple demonstrative, because antecedent and relative pronoun must agree in phi features. The word dien would then be annotated accordingly as 'masculine.'
  • (2) Dictionary citations and Modern German intuitions of gender on nouns. Nouns are rarely syncretic for gender as this feature can be looked up in dictionaries or be intuited to some degree from native Modern German speakers. Schützeichel’s (1987) dictionary was used for disambiguating potentially syncretic gender on nouns. The same guideline applies to borrowed words, for example from Latin.
  • (3) Neuter default on demonstratives in extraposition and preposition + determiner constructions. Unambiguous cases show that demonstratives are overwhelmingly neuter in gender when they occur as heads in extraposition structures like "That_i is nice [that you came]_i" or as complements in grammaticalized prepositional phrases like "for that" meaning 'therefore.' Syncretic simple demonstratives are therefore always annotated as neuter in such structures.
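The ^X convention described above can be checked mechanically. The following is a minimal sketch, not part of the GeCeG tooling: the function names are hypothetical, and it assumes a fully inflected common noun label with exactly the three slots ^case^number^gender.

```python
# Fixed slot order for common nouns, as stated in the guidelines above.
NCO_SLOTS = ("case", "number", "gender")


def parse_nco(label):
    """Split a label such as 'NCO^ACC^SG^X' into a slot -> value mapping."""
    base, *values = label.split("^")
    if base != "NCO" or len(values) != len(NCO_SLOTS):
        raise ValueError(f"not a fully inflected NCO label: {label!r}")
    return dict(zip(NCO_SLOTS, values))


def syncretic_slots(label):
    """Return the names of the slots annotated as unresolved syncretism (X)."""
    return [slot for slot, value in parse_nco(label).items() if value == "X"]


print(syncretic_slots("NCO^ACC^SG^X"))  # ['gender']
```

A query of this kind could be used to list all remaining syncretic annotations, for instance to revisit them with the disambiguation strategies (1) to (3).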


Inflection Overview List

The GeCeG annotates words for the following 8 inflectional features.

Inflectional feature Feature value Mnemonic
case NOM nominative
case GEN genitive
case DAT dative
case ACC accusative
case INS instrumental
degree R comparative
degree S superlative
gender NEU neuter
gender MSC masculine
gender FEM feminine
mood OPT optative
mood IND indicative
number SG singular
number PL plural
person 2 second
person 3 third
person 1 first
strength STRONG strong
strength WEAK weak
tense PRES present
tense PAST past
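Because no feature value in the overview list is shared between two features, a value can be mapped back to its feature. The sketch below illustrates this with a plain dictionary; the names FEATURES, VALUE_TO_FEATURE and describe are hypothetical, and the treatment of X (returned with no feature, since it can fill any slot) is an assumption of this example.

```python
# Feature inventory copied from the overview list above.
FEATURES = {
    "case": ["NOM", "GEN", "DAT", "ACC", "INS"],
    "degree": ["R", "S"],
    "gender": ["NEU", "MSC", "FEM"],
    "mood": ["OPT", "IND"],
    "number": ["SG", "PL"],
    "person": ["1", "2", "3"],
    "strength": ["STRONG", "WEAK"],
    "tense": ["PRES", "PAST"],
}

# Reverse lookup: every value belongs to exactly one feature.
VALUE_TO_FEATURE = {v: f for f, values in FEATURES.items() for v in values}


def describe(values):
    """Pair each value with its feature; X or unknown values map to None."""
    return [(VALUE_TO_FEATURE.get(v), v) for v in values]


print(describe(["ACC", "SG", "X"]))
```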


Inflection Details

case


Case is annotated on nominal categories.
The inflectional feature case is marked on the following base labels: ADJ DD DS NCO NPR NUM ONE PRO PRO$ VBA VBN VBP WPRO

degree


something about degree
The inflectional feature degree is marked on the following base labels: ADJ ADV

gender


something about gender
The inflectional feature gender is marked on the following base labels: ADJ DD DS NCO NPR NUM ONE PRO PRO$ VBA VBP

mood


something about mood
The inflectional feature mood is marked on the following base labels: VBF

number


something about number
The inflectional feature number is marked on the following base labels: ADJ DD DS INF NCO NPR PRO PRO$ VBA VBF VBI VBP

person


something about person
The inflectional feature person is marked on the following base labels: PRO VBF VBI

strength


Something about strength
The inflectional feature strength is marked on the following base labels: ADJ ONE VBA VBP

tense


Tense is a formal category of finite verbs.
The inflectional feature tense is marked on the following base labels: VBF
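The per-feature lists above can be inverted to answer the opposite question, namely which features a given base label carries. This is only an illustrative sketch; MARKED_ON and features_for are hypothetical names, and the lists are copied as they stand (including the entry INF in the number list).

```python
# Feature -> base labels, copied from the "Inflection Details" lists above.
MARKED_ON = {
    "case": "ADJ DD DS NCO NPR NUM ONE PRO PRO$ VBA VBN VBP WPRO",
    "degree": "ADJ ADV",
    "gender": "ADJ DD DS NCO NPR NUM ONE PRO PRO$ VBA VBP",
    "mood": "VBF",
    "number": "ADJ DD DS INF NCO NPR PRO PRO$ VBA VBF VBI VBP",  # INF as listed
    "person": "PRO VBF VBI",
    "strength": "ADJ ONE VBA VBP",
    "tense": "VBF",
}


def features_for(base_label):
    """Return the features marked on a base label, in alphabetical order."""
    return sorted(
        feature
        for feature, labels in MARKED_ON.items()
        if base_label in labels.split()
    )


print(features_for("VBF"))  # ['mood', 'number', 'person', 'tense']
```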

2.2. Word Formation


general description of derivational prefix, suffix, compounding
list of prefixes and suffixes included
order of elements in order of actual string

Compounding

Compounds consist of combinations of two or more part of speech labels. Virtually any combination is conceivable. The individual labels are strung together with + signs in the order of the actual string. Inflectional features are indicated only on the last label (i.e., on the head). Lemmas and subcategories, however, are added to all base labels. Part-of-speech labels are only joined with + signs into compounds when they are spelled together or hyphenated in the text edition. In other words, compounds are never created artificially by stringing word forms together that are not united in the text edition. Instead, cases of potential compounding are handled with explicit syntactic annotation, most importantly modification functions, MOD.
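The rule that inflectional features appear only on the last label of a compound lends itself to a simple consistency check. The sketch below assumes that inflections are always introduced by ^ and that + never occurs inside a single label; split_compound is a hypothetical helper, not a GeCeG tool.

```python
def split_compound(tag):
    """Split a compound tag on '+' and check that only the head is inflected."""
    parts = tag.split("+")
    for part in parts[:-1]:
        if "^" in part:
            raise ValueError(f"inflection on non-final element {part!r} in {tag!r}")
    return parts


print(split_compound("ADV~LOC+PREP"))  # ['ADV~LOC', 'PREP']
print(split_compound("NEG+NCO~NOT"))   # ['NEG', 'NCO~NOT']
```

A tag such as NCO^NOM^SG^MSC+NCO (an invented, ill-formed example) would raise an error, because the inflection sits on the first element rather than on the head.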

Examples

The following instances of compounding are particularly noteworthy:

  • (1) R-Proforms are compounds consisting of a German cognate of one of the locative adverbs here, there or where combined with a preposition. They are named for the final r-sound at the end of all the locative adverbs. Examples are German equivalents of words like hereafter, therein or whereby. R-Proforms are labelled ADV~LOC+PREP with a possible extension on the preposition for temporal or locative semantics.

  • Examples


Affix Details

GE


The prefix ge is very versatile in German. It is most regular on verbs, where it often indicates perfectivity. However, the affix assumes numerous other, less regular meanings as well. Wherever a form of the affix appears overtly, it is indicated as GE+ irrespective of its semantic transparency.
The prefix GE can occur with the following base labels: ADJ NCO VBF

IPX


IPX is mnemonic for 'inseparable prefix.' It is the label for a class of verbal prefixes that are always fused to the verbal root. The following prefixes are included: be, ver, zer. The affix GE is also an inseparable prefix but has its own affix label as it is very common. In order to be tagged as IPX, the prefix must not be included in bold print of the citation form of the relevant verbs in Schützeichel's dictionary.
The prefix IPX can occur with the following base labels: VBA VBF VBP

SPX


SPX is mnemonic for '(potentially) separable prefix.' It is the label for a class of verbal prefixes that are sometimes fused to the verbal root but can also be separated from it. Note that there are verbs that include a SPX prefix but are not in actual fact separable. SPX differs from IPX in that the latter are necessarily fused to the root.
The following prefixes are included: under, ubar. In order to be tagged as SPX, Schützeichel's dictionary must not include the prefix in bold print of the citation form of the relevant verbs and it must list the prefix in an independent entry as a preposition or adverb.
The prefix SPX can occur with the following base labels: VBF

UN


The prefix un is a negative prefix. It reverses the meaning of the base it attaches to.
The prefix UN can occur with the following base labels: ADJ

2.3. Lemmas, Subcategories


list of special lemmata / subcats for each POS
an overview list might be good because there are so many now...
only one per base label. in rare cases where perhaps more than one is applicable? the more general one wins out.

Lemmas and Subcategories Overview List

The GeCeG annotates words for the following 23 lemmas and subcategories.

Lemma / Subcategory Mnemonic
BE be
CNT conjunctive
DIN your (sg.)
DO do
FRGN foreign
HV have
IRA her
IRO their
IUW your (pl.)
LOC locative
MIN my
MORE more
NOT not
OTHER other
PRPR preterite-presentia
QNT quantificational
RFLX reflexive
SELF self
SIN his, its
SO so
TMP temporal
UNS our
WRD werdan


Lemmas and Subcategories Details

a) lemmas

~BE


The verb be. The paradigm of this verb is formed from two different verbal roots, wesan and sin. There is some overlap between these roots, but generally they are functionally split: the former normally provides the past, non-finite and imperative forms, the latter the present forms (Braune & Eggers 1987: §378, 379).
The lemma BE is indicated on the following base labels: VBF VBN

~DIN


Second person singular possessor of possessive pronouns, 'your' (singular).
The lemma DIN is indicated on the following base label: PRO$

~DO


The verb do.
The lemma DO is indicated on the following base labels: VBF VBN

~HV


The verb have.
The lemma HV is indicated on the following base labels: VBF VBN

~IRA


Third person singular feminine possessor of possessive pronouns, 'her.'
The lemma IRA is indicated on the following base label: PRO$

~IRO


Third person plural possessor of possessive pronouns, 'their.'
The lemma IRO is indicated on the following base label: PRO$

~IUW


Second person plural possessor of possessive pronouns, 'your' (plural).
The lemma IUW is indicated on the following base label: PRO$

~MIN


First person singular possessor of possessive pronouns, 'my.'
The lemma MIN is indicated on the following base label: PRO$

~MORE


The word more. It always types a comparative adverb.
The lemma MORE is indicated on the following base label: ADV

~OTHER


The word other, an adjectival lemma.
The lemma OTHER is indicated on the following base label: ADJ

~RFLX


In addition to first, second and gendered third person pronouns, early German also has designated third person (non-nominative) reflexive pronouns, which express co-referentiality with a third person subject of any gender. There are only two special reflexive forms: sih (accusative singular and accusative plural) and sin (genitive singular), 'herself, himself, itself.' Only these two forms are typed with the lemma ~RFLX for reflexivity. For all other persons and cases, appropriate personal pronouns are used reflexively. For example, mih 'me' is an accusative first person singular pronoun, but can also be used reflexively, 'myself.' Where other third person pronouns are used reflexively, they must in addition agree in gender with their subjects.
The lemma RFLX is indicated on the following base label: PRO

~SELF


Forms of the word self are typed as an adjectival lemma. Its early German spellings are selb, selbo, selp etc. It can be uninflected.
The lemma SELF is indicated on the following base label: ADJ

~SIN


Third person singular masculine or neuter possessor of possessive pronouns, 'his, its.'
The lemma SIN is indicated on the following base label: PRO$

~UNS


First person plural possessor of possessive pronouns, 'our.'
The lemma UNS is indicated on the following base label: PRO$

~WRD


The verb werdan 'become, happen.' It is a common copula and develops into a marker of futurity.
The lemma WRD is indicated on the following base label: VBF

b) subcategories

~CNT


The subcategory CNT is tagged on a small, closed class of adverbs with the following properties: (1) Their meaning is often contrastive (however, but), conjunctive (also, additionally) or consequential (hence, therefore). The tag name, CNT, is mnemonic for these three types. (2) They frequently modify elements other than the event structure of the clause, like initial topics or subordinating conjunctions. Furthermore, they may form correlative conjunction structures with other conjunctions. (3) They often show clitic-like behavior in that they may appear in a fixed syntactic place, namely in second position in clauses where the verb appears late. The subcategory CNT is added to the adverbs afur 'but', ouh 'also'.
The subcategory CNT is indicated on the following base label: ADV

~FRGN


Words that have clearly non-native characteristics, like peculiar inflectional endings, and that do not occur in uninterrupted sequences of at least three words are typed as ~FRGN, 'foreign.' Other than that, they are annotated as if they were native German words. See the foreign language function and foreign words for more details.
The subcategory FRGN is indicated on the following base labels: ADJ NCO NPR

~LOC


The subcategory ~LOC designates locatives. The category should be conceptualised as an inherent lexical feature on particular lexical items. For example, the adverb here has inherently locative semantics. However, some words are interpreted as locative only in certain contexts and they are only typed ~LOC in such cases. For example, the preposition in can have a locative reading, as in the phrase in Bavaria, but it can also be used for other roles, as in the phrase in this way. The range of possible thematic micro-roles annotated ~LOC is wide: not only stative locations, but also sources, goals, paths, directions etc. are annotated as locatives.
Words are also typed as ~LOC where the locative interpretation arises only metaphorically. For example, the semantic source domain of a verb may not literally express motion (e.g. aspire), but in connection with a prepositional phrase it is clear that a goal, source or place interpretation is implied (e.g. to a position). The prepositional head would then be typed as locative.

Comparison with other corpora

CorpusSearch corpora like the PPCME or YCOE mark locative information either exclusively or additionally with syntactic extension labels (e.g. ADVP-LOC for a locative adverb phrase). In contrast, the GeCeG marks locative features exclusively on lexical items. Furthermore, the meaning of this feature is wider in the GeCeG than in the other corpora, as locatives comprise stative locations as well as directions etc.

The subcategory LOC is indicated on the following base labels: ADV PREP SBNJ WADV

~NOT


The subcategory NOT is used for the negative indefinite pronoun meaning 'nothing' as well as the negative adverb meaning 'not.' These words derive historically from the negative particle ni fused to a common noun, wiht, 'thing', niowht > nieht > (Modern German) nicht / nichts. All forms of not and nothing are typed NOT irrespective of the stage that they have reached in the grammaticalization process. Specifically, if the word wiht is still discernible so that the negative particle and the head noun are transparent, the tag is NEG+NCO~NOT. If the head noun has become formally opaque because w and/or h are not present, the tag is NCO~NOT. There is no special base label for indefinite pronouns. If the parse makes more sense with a verbal negation reading (sometimes this is difficult to decide), the tag is ADV~NOT. Examples:
(NEG+NCO~NOT neowiht) 'nothing'
(NCO~NOT nieht) 'nothing'
(ADV~NOT niot) 'not'
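Bracketed examples of the form (TAG word), as used above, are easy to read back into a structured form. The following sketch assumes that tag and word never contain spaces or parentheses and that the subcategory follows the head label after ~; the names TOKEN and parse_tokens are hypothetical.

```python
import re

# One bracketed token: "(" TAG " " WORD ")", with no spaces or brackets inside.
TOKEN = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def parse_tokens(text):
    """Return word, head base label and subcategory for each bracketed token."""
    result = []
    for tag, word in TOKEN.findall(text):
        head = tag.split("+")[-1]            # last element of a compound
        base, _, subcat = head.partition("~")
        result.append({"word": word, "head": base, "subcat": subcat or None})
    return result


examples = "(NEG+NCO~NOT neowiht) (NCO~NOT nieht) (ADV~NOT niot)"
for token in parse_tokens(examples):
    print(token)
```

All three tokens come back with the subcategory NOT, while the head label distinguishes the nominal readings (NCO) from the verbal negation reading (ADV).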

The subcategory NOT is indicated on the following base label: NCO

~PRPR


PRPR stands for 'preterite-present verbs.' They are a small group of anomalous verbs in the Germanic languages, in which the present tense derives historically from (strong) past endings, and the past tense is formed with new (weak) past endings with a dental suffix. The subcategory ~PRPR comprises the verbs listed in Braune & Eggers (1987: §371 - 376), an.
The subcategory PRPR is indicated on the following base label: VBF

~QNT


The subcategory QNT comprises the quantifiers all, many. Quantifiers do not have their own morphological category in German. Instead, they behave like adjectives with inflections for all declension features and strength. Quantifiers are therefore tagged on ADJ base labels. Quantification is handled as a function of the syntactic annotation. However, an explicit QNT type is useful in case the quantifier functions as a nominal head.
The subcategory QNT is indicated on the following base label: ADJ

~SO


The subcategory SO comprises several common words meaning roughly 'so' used in a large number of grammatical constructions.
The subcategory SO is indicated on the following base labels: ADV SBNJ

~TMP


The subcategory ~TMP stands for temporal information. The category should be conceptualised as an inherent lexical feature on particular lexical items. For example, the adverb then has inherently temporal semantics. However, some words are interpreted as temporal only in certain contexts and they are only typed ~TMP in such cases. For example, the preposition on can have a temporal reading, as in the phrase on Monday, but it can also be used for other roles, as in the phrase on my own. The range of possible thematic micro-roles annotated ~TMP is wide: not only points in time, but also durations, intervals, and extents of time are annotated as temporal.

Comparison with other corpora

CorpusSearch corpora like the PPCME or YCOE mark temporal information either exclusively or additionally with syntactic extension labels (e.g. ADVP-TMP for a temporal adverb phrase). In contrast, the GeCeG marks temporal features exclusively on lexical items. Furthermore, the meaning of this feature is somewhat wider in the GeCeG than in some other corpora, as temporal information does not only answer the question when but also how long.

The subcategory TMP is indicated on the following base labels: ADV PREP SBNJ

2.4. Details and Examples


examples of use of POSTag, inflections and their order, possible prefixes and suffixes (word formation), and possible lemmata and subcategories

ADJ (adjective)


Adjectives have two main functions. Firstly, they can modify or quantify nominal heads or be nominal heads themselves. In this function, they are marked for all declension features as well as strength. Secondly, they can be clausal predicates. In this function, they usually occur without any overt inflections and no inflectional categories are indicated on the base label.
In addition, adjectives can be inflected for comparative or superlative degree. Such adjectives are always declined weak and hence strength is not indicated on their base label.
The category 'adjective' is tagged for the inflectional categories: case degree gender number strength. They occur in the order: ADJ^case^number^gender^degree^strength.
The base label ADJ can occur with the following affixes: GE UN.
The base label ADJ can be typed for the following lemmas or subcategories: FRGN OTHER QNT SELF.
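The interaction of the ADJ slots can be illustrated with a small label builder. This is a sketch under assumptions: adj_label is a hypothetical helper, and it assumes that slots which are not indicated are simply left out of the label rather than filled with a placeholder.

```python
def adj_label(case=None, number=None, gender=None, degree=None, strength=None):
    """Assemble an ADJ label in the order ADJ^case^number^gender^degree^strength."""
    if degree is not None:
        # Comparatives and superlatives are always declined weak,
        # so strength is not indicated on their label.
        strength = None
    slots = [case, number, gender, degree, strength]
    return "^".join(["ADJ"] + [s for s in slots if s is not None])


print(adj_label())                                          # ADJ
print(adj_label("NOM", "SG", "MSC", strength="STRONG"))     # ADJ^NOM^SG^MSC^STRONG
print(adj_label("DAT", "PL", "FEM", degree="R", strength="WEAK"))  # ADJ^DAT^PL^FEM^R
```

The first call corresponds to an uninflected predicative adjective, the second to an attributive adjective, and the third to a comparative, where the strength value is dropped.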

Examples

ADV (adverb)


Adverbs provide modifying information on events or properties. Normally, they are completely uninflected.
However, adverbs can sometimes occur in the comparative or superlative. In such cases only, adverbs are marked for the inflectional category degree.
The category 'adverb' is tagged for the inflectional categories: degree. They occur in the order: ADV^degree.
The base label ADV is not associated with derivational affixes.
The base label ADV can be typed for the following lemmas or subcategories: CNT LOC MORE SO TMP.

Examples

COMP (complementizer)


Complementizers are heads of all finite complement, modifier and adjunct clauses; they always take a finite subordinate clause (SUB) as their complement. The most common overt complementizer in early German is t(h)az 'that' and its spelling variants, which introduces complement or consecutive adverbial clauses. Frequently, however, there is no overt complementizer but a subordinating conjunction, relative pronoun, wh-expression etc. instead. In such cases, an empty complementizer is introduced into the syntactic structure, whose form is *cpz*.
The base label COMP does not have inflections.
The base label COMP is not associated with derivational affixes.
The base label COMP is never typed for particular lemmas or subcategories.

Examples

CONJ (coordinating conjunction)


Coordinating conjunctions (or just 'conjunctions' for short) join together any number of constituents (2, 3, 4, ...) of any lengths (words, phrases, sentences) into a larger coordinate structure. The individual conjuncts are more or less alike in syntactic category, hierarchically equivalent (rather than super- or sub-ordinate to each other) and the whole coordinate structure is of the same category as its individual conjuncts as well. Typical conjunctions are and, but, or etc. In cases where a coordination structure needs to be projected but an overt conjunction is not available, an empty conjunction, *cnj*, is introduced.
The base label CONJ does not have inflections.
The base label CONJ is not associated with derivational affixes.
The base label CONJ is never typed for particular lemmas or subcategories.

Examples

DD (complex demonstrative)


The complex demonstrative, cognate with Modern English this, these, derives historically from a combination of the simple demonstrative der with an indeclinable morpheme se < *sa, *si. It comprises all variants of the forms in the paradigm below (Braune & Eggers 1987: §288).
[Figure: paradigm of the complex demonstrative]
For syncretic forms, the usual guidelines regarding syncretism apply.
The category 'complex demonstrative' is tagged for the inflectional categories: case gender number. They occur in the order: DD^case^number^gender.
The base label DD is not associated with derivational affixes.
The base label DD is never typed for particular lemmas or subcategories.

Examples

DS (simple demonstrative)


The simple demonstrative pronoun is used as a definite article, relative pronoun and medial deictic demonstrative. It comprises all variants of the forms in the below paradigm (Braune & Eggers 1987: §287).
[Figure: simple demonstrative paradigm]
For syncretic forms, the usual guidelines regarding syncretism apply.
The category 'simple demonstrative' is tagged for the inflectional categories: case gender number. They occur in the order: DS^case^number^gender.
The base label DS is not associated with derivational affixes.
The base label DS is never typed for particular lemmas or subcategories.

Examples

FW (foreign word)


Foreign words are used exclusively for unanalyzed sequences of at least three foreign words in titles, excipits, technical citations etc. Foreign words are avoided otherwise. See the syntactic function FOREIGN for more details. Single or pairs of foreign words are annotated as if they were native German words, both for their base labels and their extensions. However, they are also typed for the subcategory ~FRGN, 'foreign', to indicate their non-native characteristics.
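The length criterion described above (FW for runs of at least three foreign words, ~FRGN typing for single words or pairs) can be sketched as follows. The function classify_foreign and its boolean input are hypothetical; the sketch models only the run length and ignores the further condition that the sequence is unanalyzed material such as a title or citation.

```python
from itertools import groupby


def classify_foreign(is_foreign_flags):
    """Label each word as 'native', '~FRGN' (run of 1-2) or 'FW' (run of 3+)."""
    decisions = []
    for is_foreign, run in groupby(is_foreign_flags):
        length = len(list(run))
        if not is_foreign:
            label = "native"
        elif length >= 3:
            label = "FW"
        else:
            label = "~FRGN"
        decisions.extend([label] * length)
    return decisions


flags = [False, True, True, False, True, True, True]
print(classify_foreign(flags))
# ['native', '~FRGN', '~FRGN', 'native', 'FW', 'FW', 'FW']
```

Words marked ~FRGN would still receive an ordinary native base label (ADJ, NCO or NPR) with its inflections; only the FW run is left unanalyzed.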

Comparison with other corpora

CorpusSearch corpora like the YCOE or PPCME behave just like the GeCeG in that they use FW for foreign words in sequences. However, they also allow this label more liberally elsewhere, for instance as heads of noun phrases, while the GeCeG assigns to all individual or paired foreign words a native word class and merely types them as foreign.

The base label FW does not have inflections.
The base label FW is not associated with derivational affixes.
The base label FW is never typed for particular lemmas or subcategories.

Examples

NCO (common noun)


Common nouns are used to refer to classes of entities or non-unique instances of classes. As long as the inflectional categories can be identified, even foreign common nouns are tagged as NCO, not as FW.
The category 'common noun' is tagged for the inflectional categories: case gender number. They occur in the order: NCO^case^number^gender.
The base label NCO can occur with the following affixes: GE.
The base label NCO can be typed for the following lemmas or subcategories: FRGN NOT.

Examples

NEG (negative particle)


Early German has a particle specifically used for negation, ne, ni etc. NEG is used as a special part-of-speech label just for this word. Other forms of negation, most importantly the negative adverb nieht etc. 'not', are tagged differently.
The base label NEG does not have inflections.
The base label NEG is not associated with derivational affixes.
The base label NEG is never typed for particular lemmas or subcategories.

Examples

NPR (proper name)


Proper names are used to refer to unique entities in the world, such as famous individuals, locations, God etc. Proper names may take inflectional endings not found on common nouns (e.g. accusative singular ending in -an (Braune & Eggers 1987: §195)). As long as the inflectional categories can be identified, even foreign proper names are tagged as NPR, not as FW.
The category 'proper name' is tagged for the inflectional categories: case gender number. They occur in the order: NPR^case^number^gender.
The base label NPR is not associated with derivational affixes.
The base label NPR can be typed for the following lemmas or subcategories: FRGN.

Examples

NUM (numeral)


Numerals are cardinal numbers from two on. The lemma one has its own part of speech label, ONE. Numerals do not inflect for number and hence this feature is never indicated. The numerals two and three are always marked for case and gender. The numerals four to nineteen are tagged for case and gender only if these inflections are overtly present, usually when the numeral is a nominal head or occurs post-nominally.
The category 'numeral' is tagged for the inflectional categories: case gender. They occur in the order: NUM^case^gender.
The base label NUM is not associated with derivational affixes.
The base label NUM is never typed for particular lemmas or subcategories.

Examples

ONE (one)


The lemma one can be used in German as a numeral, an indefinite pronoun or an indefinite article. In all of these uses, it is tagged ONE. The word is always marked for case and gender. It can only occur in the singular and so number is not indicated to avoid redundancy. It can take strong and weak inflectional endings like adjectives.
The category 'one' is tagged for the inflectional categories: case gender strength. They occur in the order: ONE^case^gender^strength.
The base label ONE is not associated with derivational affixes.
The base label ONE is never typed for particular lemmas or subcategories.

Examples

PREP (preposition)


Prepositions occur with noun phrase complements and govern their case. Prepositions have a wide range of possible, though prototypically spatial or temporal, meanings or mark grammatical relations.
The base label PREP does not have inflections.
The base label PREP is not associated with derivational affixes.
The base label PREP can be typed for the following lemmas or subcategories: LOC TMP.

Examples

PRO (pronoun)


Personal pronouns are pronouns that can be used to refer to actors involved in the speech situation (speaker, addressees). They are marked for case, number and person. In addition, third person pronouns are tagged for gender.
The category 'pronoun' is tagged for the inflectional categories: case gender number person. They occur in the order: PRO^case^person^number^gender.
The base label PRO is not associated with derivational affixes.
The base label PRO can be typed for the following lemmas or subcategories: RFLX.

Examples

PRO$ (possessive pronoun)


Possessive pronouns are pronouns that express a dependency, prototypically possession, towards their head noun.
They comprise the following seven stems, depending on person, number and gender of the possessor.
[Figure: the seven possessive pronoun stems]
These stems are derived from the genitive forms of personal pronouns with the exception of sīn, which comes from the reflexive. The stems are indicated as lemmas on the base label PRO$ (~MIN 'my', ~UNS 'our' etc.).
Possessive pronouns are inflected with the normal declension features, case, number and gender, depending on the possessed noun and syntactic environment of the whole noun phrase. They always inflect like strong adjectives even if they occur after demonstratives. Since it is thus redundant, strength is not indicated as an inflectional feature on PRO$. Third person singular feminine (ira) and third person plural possessive pronouns (iro) are not normally overtly inflected but gradually develop their strong adjectival declension endings from the 12th century on (Braune & Eggers 1987: §284, Anm. 1). Furthermore, nominative singular possessive pronouns often do not show overt inflectional endings (e.g. mīn instead of mīnēr 'my' (nominative singular masculine)). Irrespective of the actual overt inflectional features, case, number and gender are always marked on the base label. The normal guidelines regarding syncretism apply.
The category 'possessive pronoun' is tagged for the inflectional categories: case gender number. They occur in the order: PRO$^case^number^gender.
The base label PRO$ is not associated with derivational affixes.
The base label PRO$ can be typed for the following lemmas or subcategories: DIN IRA IRO IUW MIN SIN UNS.
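The choice among the seven possessive lemmas depends only on the person, number and (for the third person singular) gender of the possessor, so it can be written as a lookup. The sketch below uses the lemma descriptions given in this manual; possessor_lemma and the argument encoding are hypothetical.

```python
# Possessor (person, number) -> lemma, for the cases without a gender split.
POSSESSOR_LEMMA = {
    (1, "SG"): "MIN",
    (2, "SG"): "DIN",
    (1, "PL"): "UNS",
    (2, "PL"): "IUW",
    (3, "PL"): "IRO",
}


def possessor_lemma(person, number, gender=None):
    """Return the PRO$ lemma for a possessor with the given features."""
    if (person, number) == (3, "SG"):
        if gender == "FEM":
            return "IRA"
        if gender in ("MSC", "NEU"):
            return "SIN"
        raise ValueError("third person singular possessors need a gender")
    return POSSESSOR_LEMMA[(person, number)]


print(possessor_lemma(1, "SG"))         # MIN
print(possessor_lemma(3, "SG", "FEM"))  # IRA
print(possessor_lemma(3, "PL"))         # IRO
```

Note that these possessor features determine the lemma only; the inflectional slots ^case^number^gender on PRO$ reflect the possessed noun.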

Examples

SBNJ (subordinating conjunction)


Subordinating conjunctions occur with subordinate clauses and embed them into a higher clause. They can be polysemous with prepositions (e.g., after in "after he came" vs. "after the game") or adverbs (e.g., before as in "I saw him before" vs. "I saw him before he left") or be formally unambiguous (e.g., although). Irrespective of such potential category overlaps, the GeCeG tags all words that are not complementizers and introduce subordinate clauses as SBNJ.
The base label SBNJ does not have inflections.
The base label SBNJ is not associated with derivational affixes.
The base label SBNJ can be typed for the following lemmas or subcategories: LOC SO TMP.

Examples

TO (non-finite marker to)


Early German has a non-finite marker, ze, zi, zu etc., which occurs with infinitives and is cognate with English to.
The base label TO does not have inflections.
The base label TO is not associated with derivational affixes.
The base label TO is never typed for particular lemmas or subcategories.

Examples

VBA (active or present participle)


Active (or present) participles are non-finite verbs as they do not occur with a complete set of the conjugation features, i.e. subject agreement, tense and mood. They function similarly to adjectives. Hence, they can occur as modifiers or heads in the nominal domain and are then fully inflected for case, number, gender and strength. Alternatively, they can occur as main verb predicates on the clausal level, for instance in progressive constructions. In this usage, they do not normally occur with any inflectional endings at all. Inflections are marked only if an overt ending is present. If an overt ending is present, all inflectional categories are indicated. The normal guidelines regarding syncretism apply. Active (or present) participles of particular verbs, like the copula or auxiliary be and others, are marked with their own lemma.
The category 'active or present participle' is tagged for the inflectional categories: case gender number strength. They occur in the order: VBA^case^number^gender^strength.
The base label VBA can occur with the following affixes: IPX.
The base label VBA is never typed for particular lemmas or subcategories.
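
The marking rule for active participles can be sketched in a few lines of Python. This is a minimal illustration, not part of the GeCeG tooling: the function name vba_label and the convention that a missing dictionary entry stands for an unresolved (syncretic) value are assumptions made here. The value codes ACC and SG are taken from the manual's own NCO example; the slot order and the ^X notation come from the text.

```python
# Fixed slot order for active participles, as stated in the text:
# VBA^case^number^gender^strength
VBA_SLOTS = ("case", "number", "gender", "strength")


def vba_label(features=None):
    """Build a VBA label (hypothetical helper, for illustration only).

    features is None when the participle has no overt ending. Otherwise it
    is a dict of resolved values; slots missing from the dict are treated
    as unresolved syncretism and written as X.
    """
    if features is None:
        # No overt ending: no inflectional categories are marked.
        return "VBA"
    unknown = set(features) - set(VBA_SLOTS)
    if unknown:
        raise ValueError(f"unexpected categories: {sorted(unknown)}")
    # Overt ending: all four categories are indicated, in the fixed order.
    values = [features.get(slot, "X") for slot in VBA_SLOTS]
    return "^".join(["VBA", *values])
```

Under these assumptions, vba_label() gives the bare label VBA, while vba_label({"case": "ACC", "number": "SG"}) gives VBA^ACC^SG^X^X, with gender and strength left syncretic.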

Examples

VBF (finite verb)


Finite verbs are marked for tense, person, number and mood, which distinguishes them from non-finite verb forms like infinitives and participles. Particular verbs, like the copula or auxiliary be and others, are marked with their own lemma.
The category 'finite verb' is tagged for the inflectional categories: mood number person tense. They occur in the order: VBF^person^number^tense^mood.
The base label VBF can occur with the following affixes: GE IPX NEG SPX.
The base label VBF can be typed for the following lemmas or subcategories: BE DO HV PRPR WRD.

Examples

VBI (imperative)


Imperatives are expressed with designated affixes on present verb stems. There are only three possible forms: second person singular imperatives, which are usually expressed with the bare verbal stem; second person plural imperatives, with endings in -et, -at, -ent etc.; and first person plural imperatives (adhortatives), with endings in -ames, -emes, -en etc. The latter two are frequently syncretic with finite present tense verbs of the same person and number. Since all imperatives are based on present indicative stems, tense and mood are not indicated on imperatives.
The category 'imperative' is tagged for the inflectional categories: number person. They occur in the order: VBI^person^number.
The base label VBI is not associated with derivational affixes.
The base label VBI is never typed for particular lemmas or subcategories.
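
The restriction to three imperative forms lends itself to a simple consistency check. The sketch below is only an illustration: the function name is hypothetical, and the person codes 1 and 2 and the number code PL are assumed here (only SG is attested in the manual's own NCO example). It does not handle syncretic X values.

```python
# The three person/number combinations the text allows for imperatives.
# The value codes "1", "2", "SG", "PL" are assumptions for illustration.
IMPERATIVE_FORMS = {("2", "SG"), ("2", "PL"), ("1", "PL")}


def is_possible_imperative(label):
    """Return True if label has the shape VBI^person^number and names one
    of the three imperative forms described in the text."""
    parts = label.split("^")
    if len(parts) != 3 or parts[0] != "VBI":
        return False
    person, number = parts[1], parts[2]
    return (person, number) in IMPERATIVE_FORMS
```

With these assumed codes, VBI^2^SG passes the check, while VBI^1^SG (no such form) and VBI^2^SG^PRS (tense is not indicated on imperatives) do not.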

Examples

VBN (infinitive)


Infinitives are non-finite verbs as they do not occur with a complete set of the conjugation features, i.e. subject agreement, tense and mood. Infinitives occur with special suffixes, like -an, -ōn, -ēn, -en etc.
Normally, infinitives are not inflected. However, they can occur as gerunds with dative (rarely genitive or instrumental) endings. Inflected infinitives appear frequently after the non-finite marker to. If there are no overt case endings, infinitives are not marked for any inflectional categories. If an infinitive has an overt gerund ending, case is indicated as an inflectional feature on the base label. Infinitives of particular verbs, like the copula or auxiliary be and others, are marked with their own lemma.

Comparison with other corpora The label VBN may cause some confusion for users familiar with CorpusSearch corpora like the YCOE or PPCME. In the GeCeG, the label is used for infinitives. It is a mnemonic formed from VB, for verb, plus N, for the characteristic German infinitival ending -an, -en, -on etc. In the YCOE and PPCME, on the other hand, VBN is used for past participles (in the PPCME, specifically perfect participles), and N is mnemonic for the Modern English participle ending -en (as in take – taken, drive – driven etc.).

The category 'infinitive' is tagged for the inflectional categories: case. They occur in the order: VBN^case.
The base label VBN is not associated with derivational affixes.
The base label VBN can be typed for the following lemmas or subcategories: BE DO HV.

Examples

VBP (passive or past participle)


Passive (or past) participles are non-finite verbs as they do not occur with a complete set of the conjugation features, i.e. subject agreement, tense and mood. They function similarly to adjectives. Hence, they can occur as modifiers or heads in the nominal domain and are then fully inflected for case, number, gender and strength. Alternatively, they can occur as main verb predicates on the clausal level, for instance in perfect or passive constructions. In this usage, they do not normally occur with any inflectional endings at all. Inflections are marked only if an overt ending is present. If an overt ending is present, all inflectional categories are indicated. The normal guidelines regarding syncretism apply. Passive (or past) participles of particular verbs, like the copula or auxiliary be and others, are marked with their own lemma.
The category 'passive or past participle' is tagged for the inflectional categories: case gender number strength. They occur in the order: VBP^case^number^gender^strength.
The base label VBP can occur with the following affixes: IPX.
The base label VBP is never typed for particular lemmas or subcategories.

Examples

WADV (interrogative adverb)


Interrogative adverbs are words such as where or why, which are used to refer to questioned adjuncts. They are uninflected.
The base label WADV does not have inflections.
The base label WADV is not associated with derivational affixes.
The base label WADV can be typed for the following lemmas or subcategories: LOC.

Examples

WPRO (interrogative pronoun)


Interrogative pronouns are the words (h)wer 'who' and (h)waz 'what' when they are used to refer to questioned noun phrases. Some grammar books assign masculine or feminine gender to the former and neuter gender to the latter word (Braune & Eggers 1987: §291). The GeCeG does not follow this convention since either word can be used to refer to noun phrases of any grammatical gender. The difference is rather about animate vs. inanimate referents. Interrogative pronouns only occur in the singular so that indicating number would be redundant. Consequently, interrogative pronouns are marked only for case.
The category 'interrogative pronoun' is tagged for the inflectional categories: case. They occur in the order: WPRO^case.
The base label WPRO is not associated with derivational affixes.
The base label WPRO is never typed for particular lemmas or subcategories.
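
The fixed inflection orders stated for the verbal and interrogative labels above can be collected into a small lookup table and used to split a label into named slots. The following Python sketch is an illustration under stated assumptions, not a description of any existing GeCeG tool: the names SLOT_ORDER, parse_label and syncretic_slots are hypothetical, and the sketch ignores affixes and lemma typing, whose notation is not shown in this section. A bare label without inflections is returned with an empty feature dictionary.

```python
# Slot orders copied from the descriptions of the individual base labels.
SLOT_ORDER = {
    "VBA": ("case", "number", "gender", "strength"),
    "VBF": ("person", "number", "tense", "mood"),
    "VBI": ("person", "number"),
    "VBN": ("case",),
    "VBP": ("case", "number", "gender", "strength"),
    "WPRO": ("case",),
}


def parse_label(label):
    """Split a label like VBP^ACC^SG^X^X into (base, {slot: value})."""
    base, *values = label.split("^")
    if base not in SLOT_ORDER:
        raise ValueError(f"base label not covered by this sketch: {base}")
    slots = SLOT_ORDER[base]
    # Either no inflection is marked at all, or every slot is filled.
    if values and len(values) != len(slots):
        raise ValueError(
            f"{base} expects {len(slots)} inflectional values, got {len(values)}"
        )
    return base, dict(zip(slots, values))


def syncretic_slots(label):
    """List the slots whose value is the syncretism marker X."""
    _, features = parse_label(label)
    return [slot for slot, value in features.items() if value == "X"]
```

For instance, parse_label("VBP^ACC^SG^X^X") yields the base VBP with case ACC and number SG, and syncretic_slots reports gender and strength as unresolved. The all-or-nothing length check mirrors the rule that, once an overt ending is present, all inflectional categories are indicated.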

Examples

3. Quasi-Base Labels


The GeCeG includes three labels that appear in the same contexts as regular part-of-speech tags, but behave fundamentally differently in that they do not actually indicate a basic category of words. These are: CODE, EC and TAG. They are described in more detail below.

CODE (CorpusSearch Comments)


CODE has the following four uses in the GeCeG:

  • Meta-linguistic comments

  • Meta-linguistic comments are labelled CODE. Comments give additional information that is not immediately relevant for the parsed tree. They are enclosed in curly brackets with the introductory strings EDITION:, for comments on a remarkable layout in the text edition used, or COMMENT:, for any other comment. The words of the comment itself are separated by underscores. Comments are put directly into the syntactic tree where they are relevant.

    Examples

  • Word-Splitting

  • Words that belong to distinct base labels are frequently spelled together. Such complex words are normally treated as compounds. However, in some instances, it is necessary to split up the distinct base labels. That is the case when the individual components belong to different syntactic functions that are not embedded in each other. For example, two pronouns could be strung together, one functioning as a direct, the other as an indirect object. To indicate that a complex word has been changed to two or more simple words, a CODE label is introduced between the syntactic function labels that the components belong to. The form of this CODE label is +. The forms of the split words themselves are not altered at all.
    Word splits are required in the following cases: (1) negation fused to verbs.

    Comparison with other corpora Other CorpusSearch corpora handle word splitting differently. The PPCME and YCOE indicate emendations of any kind, including word splits, by the addition of a dollar sign to the beginning of every word. Additionally, the original form is repeated as a {TEXT:...} comment, labelled CODE.
    Furthermore, the contexts that require word splits are slightly different in the GeCeG and other CorpusSearch corpora. For example, fused negation is split off the verb in the GeCeG but treated as a compound in the PPCME and YCOE.


    Examples

  • Label for Gloss

  • CODE is used as the label for Modern English glosses of early German sentences. See the syntactic annotation under Glosses for more details.
  • Label for Latin

  • CODE is the label for Latin source material in text files that include it. See the syntactic annotation under Latin for more details.
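
For the first of these uses, the shape of a meta-linguistic comment can be illustrated with a short Python helper. This is a sketch only: the function name format_comment is hypothetical, the sample wording is invented for illustration, and it is assumed here that no space follows the colon of the introductory string.

```python
# The two introductory strings named in the text (without the colon).
COMMENT_KINDS = ("EDITION", "COMMENT")


def format_comment(kind, text):
    """Render a comment as {KIND:word_word_word} (assumed exact layout)."""
    if kind not in COMMENT_KINDS:
        raise ValueError(f"unknown comment kind: {kind}")
    words = text.split()
    if not words:
        raise ValueError("a comment needs at least one word")
    # Words of the comment are joined by underscores, then wrapped in
    # curly brackets after the introductory string.
    return "{" + kind + ":" + "_".join(words) + "}"
```

Under these assumptions, format_comment("COMMENT", "scribal correction in margin") returns {COMMENT:scribal_correction_in_margin}, which would then be attached to a CODE label in the tree.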


EC (empty category)


TAG-# (Tag)


TAG is a unification marker that is used in the GeCeG to handle filler-gap / movement structures. It is the only label in the entire corpus that occurs with a numerical index, e.g. TAG-1. It is not enclosed in asterisks. The form of TAG is always 0 and all 0s occur as the form of TAG. For more details, see the syntactic annotation under filler-gap / movement dependencies.
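
The two-way constraint between the label TAG and the form 0 can be checked mechanically. The Python sketch below assumes, purely for illustration, that each leaf of a tree is available as a (label, form) pair and that the numerical index on TAG is obligatory; the function name is hypothetical.

```python
import re

# TAG followed by a hyphen and a numerical index, e.g. TAG-1.
TAG_LABEL = re.compile(r"TAG-\d+")


def leaf_respects_tag_rule(label, form):
    """Check one (label, form) leaf against the TAG conventions.

    - The form of TAG is always 0.
    - Every 0 occurs as the form of TAG.
    """
    is_tag = TAG_LABEL.fullmatch(label) is not None
    if is_tag:
        return form == "0"
    return form != "0"
```

A leaf such as ("TAG-1", "0") satisfies the rule, whereas ("TAG-1", "waz") and ("WPRO^ACC", "0") both violate it.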