C58: the Greek Discourse Relations Corpus
Papers
Special issue of the Bulletin of Scientific Terminology and Neologisms on
Aspects of Corpus Linguistics:
Principles, implementations, challenges
(submitted)
Logistic regression and linear discriminant analysis for Elaboration and Commentary in the discourse relations' corpus C58
Elaboration and Commentary are two discourse relations that seem to play a crucial role in discourse. Although they are semantically close, since they both describe parts of the textual content, there is a clear difference in pragmatic terms. Commentary describes the relation between a communicative act and its agent, whereas Elaboration relates parts of events within the main storyline. Reese et al. (2007), Adam and Vergez-Couret (2010) and Versley and Gastel (2013) emphasize the role of lexical or phrasal clues that help us recognize the main distinguishing features of the two relations. This paper addresses the question of whether lexical semantics, argument structure, subject type and the distance between two utterances play a significant role as distinguishing features of Elaboration and Commentary. I will attempt to answer the question by implementing two classification models, a logistic regression model and a linear discriminant analysis model, for predicting the two relations in a test set of pairs of related utterances. The construction of prediction models for textual structure is one of the most essential parts of dialogue systems, question-answering systems and other automated systems in Natural Language Processing and Computational Linguistics. The second aim of the paper is to present some key linguistic aspects of two related utterances that help us unveil where the author’s perspectivization on a described event starts (so that the relation is more likely to be a Commentary) and where it ends (so that the relation is probably an Elaboration on an objective fact). Last but not least, this paper presents the main characteristics of the Discourse Relations’ Corpus C58, the first annotated corpus for discourse relations in Greek.
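The kind of binary classifier described in this abstract can be illustrated with a minimal sketch. It is not the model from the paper: the three features, their encoding and the toy data below are hypothetical and are not taken from C58, and only a plain logistic regression trained by gradient descent is shown (the linear discriminant analysis model is omitted).

```python
import math

# Hypothetical toy data, NOT taken from C58. Each item encodes one pair of
# related utterances as (communication_verb, first_person_subject, distance):
# two 0/1 flags and the distance between the utterances counted in segments.
# Label 1 = Commentary, 0 = Elaboration.
TOY_PAIRS = [
    ((1, 1, 1), 1),
    ((1, 1, 2), 1),
    ((1, 0, 1), 1),
    ((0, 1, 1), 1),
    ((0, 0, 1), 0),
    ((0, 0, 2), 0),
    ((0, 0, 1), 0),
    ((0, 0, 3), 0),
]


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, x):
    """Linear score of one feature vector."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def train_logistic(pairs, lr=0.5, epochs=5000):
    """Fit weights and bias by full-batch gradient descent on the log loss."""
    n_features = len(pairs[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in pairs:
            err = sigmoid(score(weights, bias, x)) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(pairs)
    return weights, bias


def predict(weights, bias, x):
    """Return the predicted relation and the estimated P(Commentary)."""
    p = sigmoid(score(weights, bias, x))
    return ("Commentary" if p >= 0.5 else "Elaboration"), p


weights, bias = train_logistic(TOY_PAIRS)
label, prob = predict(weights, bias, (1, 1, 1))
print(label, round(prob, 3))
```

With real annotations, the same feature vectors could be passed to a library implementation of both models, and performance would be measured on a held-out test set rather than on the training pairs.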
ICGL12 (submitted)
Corpus C58 and a data-analytic approach to the interface between intra- and inter-sentential linguistic information
Over the last two decades there has been a strong tendency in CL to use linguistically annotated corpora in order to build classification systems that mine texts in a smarter manner (cf. Pustejovsky and Stubbs (2012), Marcu and Echihabi (2002)). However, the current state of the art in corpus annotation and exploitation remains at the simplest levels of linguistic description and relies mostly on lexically or phrasally annotated textual data. Setting aside the still partly unresolved problem of defining grammatical categories in linguistic theory, the vast majority of existing linguistically annotated resources stop at the morphological or syntactic annotation level, and semantically annotated resources are rarely created or exploited. Therefore, current efforts to integrate more linguistic information into CL systems ought to focus on two dimensions:
1. the semantics of utterances,
2. the annotation of further levels of linguistic description, such as discourse.
Clearly, the focus on morphological and syntactic annotation is largely explained by its usefulness for current NL applications, since natural language offers clear formal clues as to what the unit of description and its parts are. Moreover, the subtle nature of semantics and the difficulty of pinning down its interface with pragmatics have held back research efforts to build relevant annotated language resources.
Discourse annotation remains an unresolved and challenging task for any system to undertake. The present paper aims to point out the need to consider the formal semantic and pragmatic machinery developed within the theoretical linguistic tradition in order to deal with challenging issues of linguistic annotation at the textual level, which could be beneficial for ML-inspired classification or other CL systems, such as question-answering or summarization systems. In this paper we present some aspects of C58, the first manually discourse-annotated corpus for Modern Greek, which brings out the complexity of discourse annotation, and we set out the main corpus design practices and choices made in compiling it. The second part of the paper presents the first data-driven research results, which unveil the close and intricate relationship between the inter- and intra-sentential levels of linguistic analysis.
The next section briefly describes SDRT, the formal discourse semantic theory that underlies C58’s annotation guidelines. SDRT provides a rich toolset for describing utterance interdependencies, based on formal criteria for defining discourse relations and on graph-based representations. Sections 4 and 5 describe the main features of C58, while the following section presents one of the most important challenges any discourse relations’ corpus needs to face, namely discourse segmentation. Sections 7 and 8 present the first results and the conclusions drawn from C58’s annotations.
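As a rough illustration of what a graph-based representation of discourse relations can look like, the sketch below stores segments as labelled nodes and relations as labelled, directed edges. It is only an assumption-laden toy: the class and function names are hypothetical, the English segments are invented (not from C58), and the relation inventory is limited to three relations, with Elaboration and Commentary treated as subordinating and Narration as coordinating.

```python
from dataclasses import dataclass

# Relation inventory for this sketch only; a real annotation scheme defines more.
SUBORDINATING = {"Elaboration", "Commentary"}
COORDINATING = {"Narration"}


@dataclass(frozen=True)
class RelationEdge:
    """A labelled, directed edge between two discourse segments."""
    source: str
    target: str
    relation: str

    @property
    def kind(self) -> str:
        if self.relation in SUBORDINATING:
            return "subordinating"
        if self.relation in COORDINATING:
            return "coordinating"
        return "unknown"


# Hypothetical English segments, invented for illustration (not from C58).
SEGMENTS = {
    "pi1": "The minister presented the new budget.",
    "pi2": "It raises spending on schools.",
    "pi3": "Then she took questions from the press.",
    "pi4": "Frankly, the increase was overdue.",
}

EDGES = [
    RelationEdge("pi1", "pi2", "Elaboration"),
    RelationEdge("pi1", "pi3", "Narration"),
    RelationEdge("pi2", "pi4", "Commentary"),
]


def subordinated_to(label, edges):
    """Return the segments attached below `label` by a subordinating relation."""
    return [e.target for e in edges
            if e.source == label and e.kind == "subordinating"]


for edge in EDGES:
    print(f"{edge.relation}({edge.source}, {edge.target}) is {edge.kind}")
```

Keeping the relation type on the edge rather than on the node means that the same segment can take part in several relations, which is what a graph (as opposed to a strict tree) representation allows.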
Relevant Sources & Literature
- Brat annotation tool: http://brat.nlplab.org/about.html
- Verbnet's database: http://verbs.colorado.edu/~mpalmer/projects/verbnet.html
- Kipper, K., M. Palmer and O. Rambow 2002. Extending PropBank with VerbNet Semantic Predicates. Workshop on Applied Interlinguas, held in conjunction with AMTA-2002. Tiburon, CA.
- Levin, B. and M. Rappaport Hovav 1996. Lexical Semantics and Syntactic Structure. In S. Lappin (ed.), The Handbook of Contemporary Semantic Theory, Blackwell, Oxford, 487-507.
- Rappaport Hovav, M. and B. Levin 1998. Morphology and Lexical Semantics. In A. Zwicky and A. Spencer (eds), Handbook of Morphology, Blackwell, Oxford, 248-271.
- Levin, B. 2014. Semantic Roles. In M. Aronoff (ed.), Oxford Bibliographies in Linguistics, Oxford University Press, New York.
- Asher, N. and A. Lascarides 2003. Logics of Conversation. Cambridge University Press.
- Lascarides, A. and N. Asher 2007. Segmented Discourse Representation Theory: Dynamic Semantics with Discourse Structure. In H. Bunt and R. Muskens (eds), Computing Meaning: Volume 3, Springer Verlag, 87-124.
- Asher, N. and L. Vieu 2005. Subordinating and coordinating discourse relations. Lingua 115: 591-610.