Sanskrit Word Segmentation and Morphological Parsing Hackathon Tasks

Word Segmentation & Morphological Parsing for Sanskrit

Task Submissions

Participants can submit their results for the tasks at the following link: WSMP Submissions

Task 1 Word Segmentation for Sanskrit

Most Sanskrit texts found in ancient manuscripts were written in sandhied (joined) form, so word segmentation is the preliminary task of sentential analysis. Word segmentation is the task of identifying the words in a given character sequence (sentence). It is challenging in a language like Sanskrit, where word boundaries are often obscured by Sandhi.

For instance, the sentence rāmaḥ vanam gacchati, when sandhied, results in rāmovanaṅgacchati. Three individual words combine at their boundaries to form a single character sequence. Resolving this is necessary to obtain the individual segments, which are then analysed in the later stages of sentential analysis (morphological parsing, dependency parsing, semantic analysis, etc.).
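The joining step in this example can be imitated with a toy sketch. The Python snippet below hard-codes only the two boundary changes needed for this one sentence (final -aḥ becoming -o before a voiced consonant, and final -m becoming ṅ before a velar stop). The names join_pair, join_sentence and VOICED are illustrative assumptions, and real Sandhi involves many more rules than these.

```python
# Toy external-sandhi joiner: covers only the two rules needed for the example.
# VOICED lists the first characters of voiced consonants in IAST (assumption:
# the input uses precomposed Unicode characters).
VOICED = set("gjḍdbṅñṇnmyrlvh")


def join_pair(left: str, right: str) -> str:
    """Join two words, applying a tiny hand-picked subset of sandhi rules."""
    first = right[0]
    # Rule 1: final -aḥ before a voiced consonant becomes -o.
    if left.endswith("aḥ") and first in VOICED:
        return left[:-2] + "o" + right
    # Rule 2: final -m before a velar stop assimilates to the velar nasal ṅ.
    if left.endswith("m") and first in "kg":
        return left[:-1] + "ṅ" + right
    # No rule applies: plain concatenation.
    return left + right


def join_sentence(words):
    """Fold join_pair over the words from left to right."""
    joined = words[0]
    for word in words[1:]:
        joined = join_pair(joined, word)
    return joined


print(join_sentence(["rāmaḥ", "vanam", "gacchati"]))  # rāmovanaṅgacchati
```

Word segmentation is the inverse of this mapping: given only the joined string, recover the original words.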

There is another kind of Sandhi, found between the components of compounds, where two or more stems join to form a compound word. For instance, rāmālayaḥ is a compound word formed from the components rāma and ālayaḥ. Resolving Sandhi within compounds is essential, as knowledge of the individual components of a compound also helps its analysis in downstream tasks [Krishna et al., 2021]. Since Sanskrit is rich in multi-component compound constructions, analysing a sentence largely requires analysing the individual components of the compounds in it.

Another major challenge for this task is the number of segmentations possible for a given sentence, which introduces non-determinism right at the start of linguistic analysis. According to the Sanskrit Heritage Reader (SHR), there are 4 possible segmentations for the sequence rāmovanaṅgacchati, and the two-component compound rāmālayaḥ has 20 possible segmentations. This is mainly because different phoneme combinations can yield the same surface sequence, and additionally because a given word can have multiple morphological analyses.
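A minimal sketch of why several segmentations arise: the Python snippet below undoes a few boundary changes and keeps every split whose pieces appear in a lexicon. REVERSE_RULES, TOY_LEXICON and the function segmentations are hypothetical and hand-picked for illustration. This is not how SHR is implemented, and with such a tiny lexicon it finds far fewer candidates than the counts SHR reports.

```python
from typing import List, Set, Tuple

# Each rule: (surface string at the boundary,
#             ending restored on the left word,
#             beginning restored on the right word).
# A hand-picked toy subset, not a complete rule set.
REVERSE_RULES: List[Tuple[str, str, str]] = [
    ("", "", ""),     # plain concatenation, no sound change
    ("o", "aḥ", ""),  # -aḥ became -o before a voiced sound
    ("ṅ", "m", ""),   # -m became ṅ before a velar
    ("ā", "a", "a"),  # a + a -> ā
    ("ā", "a", "ā"),  # a + ā -> ā
    ("ā", "ā", "a"),  # ā + a -> ā
    ("ā", "ā", "ā"),  # ā + ā -> ā
]

# Hypothetical toy lexicon of inflected forms and compound components.
TOY_LEXICON: Set[str] = {
    "rāmaḥ", "vanam", "gacchati", "rāma", "rāmā", "ālayaḥ", "layaḥ",
}


def segmentations(text: str, lexicon: Set[str]) -> List[List[str]]:
    """Return every way to split text into lexicon entries under the toy rules."""
    results = [[text]] if text in lexicon else []
    for i in range(1, len(text)):
        for surface, left_end, right_start in REVERSE_RULES:
            if not text.startswith(surface, i):
                continue
            left = text[:i] + left_end
            rest = right_start + text[i + len(surface):]
            if left in lexicon and rest:
                for tail in segmentations(rest, lexicon):
                    results.append([left] + tail)
    return results


for candidate in segmentations("rāmālayaḥ", TOY_LEXICON):
    print(" + ".join(candidate))
```

Even this toy setup returns more than one candidate for rāmālayaḥ, because the surface ā can be read as the result of several different vowel combinations. Choosing among such candidates is the actual prediction problem.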

Thus, the requirement of this task is to predict the correct segmentation of a given sequence. Participants will be provided with unsegmented sentences (at least 100,000) along with their corresponding ground truth segmentations and morphological tags. The sentences have been taken from the Digital Corpus of Sanskrit (DCS).

Related Literature:

Krishna, A., Santra, B., Gupta, A., Satuluri, P., & Goyal, P. (2021). A Graph-Based Framework for Structured Prediction Tasks in Sanskrit. Computational Linguistics, 46(4), 785–845. https://doi.org/10.1162/coli_a_00390

Krishna, A., Santra, B., Satuluri, P., Bandaru, S. P., Faldu, B., Singh, Y., & Goyal, P. (2016). Word Segmentation in Sanskrit using Path Constrained Random Walks. The COLING 2016 Organizing Committee, 494–504. https://aclanthology.org/C16-1048

Reddy, V., Krishna, A., Santra, B., Gupta, P., Satuluri, P., Vineeth, M. R., Sharma, V., & Goyal, P. (2018). Building a Word Segmenter for Sanskrit Overnight. European Language Resources Association (ELRA). https://aclanthology.org/L18-1264

SHR is a lexicon-driven shallow parser that encodes all the rules of Sandhi as per traditional Sanskrit grammar and can exhaustively enumerate all lexically valid segmentations of a given sequence.

DCS is an annotated corpus having more than 650,000 sentences.

Task 2 Morphological Parsing for Sanskrit

Morphological parsing is the task of identifying the morphemes constituting the words in a sentence. Sanskrit, a morphologically rich language with free word order, poses a series of challenges for sentential analysis. Its morphological richness introduces ambiguity and results in multiple analyses for a particular word form.

Since there is little restriction on the movement of words within a sentence, positional information does not help in sentential analysis. Prose texts mostly follow a common canonical order, but in verse the word order is shuffled to fit poetic meters. A large number of texts are written in verse or, to complicate things further, as a combination of prose and verse. We therefore need to look at other features to determine the role played by each word in the sentence. The words themselves encode their grammatical role. The grammatical categories in a morphological class indicate the syntactic roles of, and the morphological agreement between, the words in a construction. With the help of the morphological markers, present mostly as affixes, we can arrive at the possible morphological analyses of the words.

This task focuses on obtaining the correct stem and the morphological tag of the inflected forms in a sentence. Given a sentence of segmented words, each word should be annotated with the following parameters:
a. stem
b. morphological tag (this is obtained from the list of tags proposed by the Sanskrit Heritage Reader)
A comparison with other available tagsets, along with the denotation of each tag, is available here.

Morphological parsing in Sanskrit is challenging primarily because of two factors. First, Sanskrit has a rich tagset of about 1,635 possible tags. Second, an inflected form in Sanskrit may have multiple morphological analyses due to syncretism and homonymy. Sanskrit is a fusional language, where a single morpheme encodes multiple grammatical categories. Homonymy is where the same word form or stem denotes multiple senses or meanings; it poses ambiguity at the level of semantics. Syncretism is where the same word form denotes different morphological categories of the same stem. Handling these two is challenging, as the analysis has to draw on semantics and also take the context into consideration.
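To make the ambiguity concrete, here is a minimal Python sketch of a lookup that returns several candidate analyses per word. The Analysis class, the TOY_ANALYSES table and the tag strings are assumptions made for illustration; the tag strings are placeholder labels, not the official SHR tagset. The form vanam shows syncretism (nominative and accusative singular of the neuter stem vana coincide), while gacchati can be read either as a finite verb or as the locative singular of the present participle gacchat.

```python
from dataclasses import dataclass
from math import prod
from typing import Dict, List


@dataclass(frozen=True)
class Analysis:
    stem: str
    tag: str  # placeholder label, NOT an official SHR tag string


# Hypothetical hand-written table of candidate analyses per word form.
TOY_ANALYSES: Dict[str, List[Analysis]] = {
    "rāmaḥ": [Analysis("rāma", "m. sg. nom.")],
    "vanam": [
        Analysis("vana", "n. sg. nom."),
        Analysis("vana", "n. sg. acc."),
    ],
    "gacchati": [
        Analysis("gam", "pr. 3 sg."),
        Analysis("gacchat", "pr. participle, sg. loc."),
    ],
}


def candidate_analyses(words: List[str]) -> List[List[Analysis]]:
    """Look up every candidate analysis for each segmented word."""
    return [TOY_ANALYSES.get(word, []) for word in words]


def reading_count(words: List[str]) -> int:
    """Number of tag combinations for the sentence if words are treated independently."""
    return prod(len(options) for options in candidate_analyses(words))


print(reading_count(["rāmaḥ", "vanam", "gacchati"]))  # 4
```

A tagger has to pick one analysis per word, and the right choice depends on the rest of the sentence, which is why context matters for this task.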

The requirement of this task is to predict the stems and morphological tags given the segmented words. Participants will be provided with the morphological tags for each sentence as training data.

Related Literature:

Krishna, A., Gupta, A., Hellwig, O., & Goyal, P. (2020). Evaluating Neural Morphological Taggers for Sanskrit. Association for Computational Linguistics, 198–203. https://aclanthology.org/2020.sigmorphon-1.23/

Krishna, A., Santra, B., Satuluri, P., Bandaru, S. P., Faldu, B., Singh, Y., & Goyal, P. (2018). Free as in Free Word Order: An Energy Based Model for Word Segmentation and Morphological Tagging in Sanskrit. Association for Computational Linguistics, 2550–2561. https://aclanthology.org/D18-1276/

Hellwig, O. (2016). Improving the Morphological Analysis of Classical Sanskrit. The COLING 2016 Organizing Committee, 142–151. https://aclanthology.org/W16-3715/

Task 3 Combined Word Segmentation and Morphological Parsing for Sanskrit

This task focuses on jointly solving the word segmentation and morphological tagging tasks for Sanskrit. Being a free word order language, Sanskrit lends itself to poetic constructions, where words are arranged primarily to satisfy metrical constraints. Hence, it is desirable to have the whole input context available when making each prediction. Having the whole input and using a joint method for the two tasks lets the system consider the segmented forms and the morphological tags together, which can make it more effective.

The dependency between the two tasks is evident from the way word segmentation and morphological parsing have been handled in recent years. Requiring each segment to be lexically and semantically valid during word segmentation resolves the ambiguity among multiple possibilities only up to a point, beyond which the semantics and context of the sentence have to be taken into consideration. Conversely, morphological parsing requires the segmented forms for its analysis.

This task can be achieved in two ways:
a. As a pipeline of Task 1 and Task 2. The prediction from word segmentation is fed as input to morphological parsing, which predicts the corresponding tags for each word.
b. As a joint system, where the word forms are considered along with all their tags during training, and both the segmented word forms and the tags are predicted at once.
Participants are again provided with the joint and segmented sentences, along with the ground truth morphological analysis for each sentence.

Related Literature:

Krishna, A., Santra, B., Gupta, A., Satuluri, P., & Goyal, P. (2020). A Graph-Based Framework for Structured Prediction Tasks in Sanskrit. Computational Linguistics, 46(4), 785–845. https://aclanthology.org/2020.cl-4.4