Sanskrit Word Segmentation and Morphological Parsing Hackathon

In association with the Forum for Information Retrieval Evaluation (FIRE 2021)

Recent Updates

Nov 22 Hackathon Completed! Check Results here
Nov 20 We are live! Check it out here
Nov 20 Test Data can be accessed here
Nov 1 The Discord Server is live here
Oct 15 Online Portal is open and can be accessed here
Sep 10 Dataset Released. Register here.
Aug 30 Website for WSMP launched
Aug 13 Hackathon for WSMP with FIRE conference

Nov 22

The Hackathon has officially ended. The leaderboard is available here

About the Hackathon

Sentential analysis in Sanskrit poses various challenges in each of the stages of word segmentation, morphological parsing, dependency parsing, etc. This hackathon focuses on developing methodologies to handle Word Segmentation and Morphological Parsing in Sanskrit.

There are three tasks: Word Segmentation, Morphological Parsing, and Combined Word Segmentation & Morphological Parsing. The online portal for the competition will be hosted on CodaLab. Participants are requested to register for the dataset and can try out their models on the online portal using the training and development datasets.

The top 3 performers in the competition will be awarded prizes. The details of the prizes will be updated soon. The details of the tasks are given below:

Task 1

Word Segmentation: The writings in Sanskrit follow a structured scheme where the words often undergo phonetic transformations at the juncture of their boundaries, thus modifying the phonemes at these boundaries and also obscuring the original boundaries. This process of euphonic assimilation or joining of the words is known as Sandhi and the splitting or segmentation of such joint word forms is known as Sandhi-Viccheda. While Sandhi is deterministic, Sandhi-Viccheda is not. It is desirable to identify the individual words in a sentence and obtain the semantically most valid split of the sentence for subsequent processing in downstream tasks.
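To see why joining is deterministic while splitting is not, here is a minimal sketch. It assumes the SLP1 transliteration scheme and a tiny hand-picked subset of vowel sandhi rules; the rule table, the function names `join` and `split_candidates`, and the three-word lexicon are illustrative assumptions, not part of the hackathon data or tooling.

```python
# Toy vowel-sandhi rules in SLP1 transliteration (A = long a).
# This is a small illustrative subset, not a complete rule set.
SANDHI_RULES = {
    ("a", "a"): "A",
    ("a", "A"): "A",
    ("A", "a"): "A",
    ("A", "A"): "A",
    ("a", "i"): "e",
    ("a", "u"): "o",
}


def join(left, right):
    """Apply sandhi at the word boundary: one input pair, one output."""
    key = (left[-1], right[0])
    if key in SANDHI_RULES:
        return left[:-1] + SANDHI_RULES[key] + right[1:]
    return left + right


def split_candidates(joined, lexicon):
    """Undo sandhi: every (left, right) pair of lexicon words that joins to `joined`."""
    candidates = set()
    # Splits where the boundary left the phonemes unchanged.
    for i in range(1, len(joined)):
        left, right = joined[:i], joined[i:]
        if left in lexicon and right in lexicon and join(left, right) == joined:
            candidates.add((left, right))
    # Splits where a single phoneme is the merged result of a rule.
    for i, ch in enumerate(joined):
        for (a, b), out in SANDHI_RULES.items():
            if out == ch:
                left, right = joined[:i] + a, b + joined[i + 1:]
                if left in lexicon and right in lexicon:
                    candidates.add((left, right))
    return sorted(candidates)


# Hypothetical toy lexicon (SLP1).
lexicon = {"na", "nA", "asti"}
print(join("na", "asti"))
print(split_candidates("nAsti", lexicon))
```

In this toy setting `join` always returns a single surface form, whereas splitting `nAsti` yields two candidates (`na` + `asti` and `nA` + `asti`). Choosing between them needs sentence context, which is what the task asks a segmenter to supply.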

Task 2

Morphological Parsing: Sanskrit is a morphologically rich fusional language, and morphology plays a crucial role in carrying the grammatical information encoded in a sentence. Morphological parsing is challenging, however, because syncretism and homonymy are prevalent: a single word form can correspond to several morphological analyses. A word is formed by the combination of preverb(s), stem(s) and suffixes. The suffixes denote the morphological category of the word, and the grammatical role played by a word in a sentence is latent in that category. This task focuses on predicting the morphological tags for each of the words in a given sentence.
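The effect of syncretism can be illustrated with a minimal sketch. It assumes SLP1 transliteration and covers only four endings of masculine a-stem nouns; the `ENDINGS` table, the `analyse` function and the (case, number) tag labels are hypothetical and do not reflect the tag set used in the hackathon dataset.

```python
# A few endings of masculine a-stem nouns in SLP1 (O = au, B = bh).
# Each ending replaces the final "a" of the stem.
ENDINGS = {
    "aH": [("Nom", "Sg")],
    "am": [("Acc", "Sg")],
    "O": [("Nom", "Du"), ("Acc", "Du")],
    "AByAm": [("Ins", "Du"), ("Dat", "Du"), ("Abl", "Du")],
}


def analyse(word, stems):
    """Return every (stem, case, number) analysis licensed by the toy table."""
    analyses = []
    for ending, tags in ENDINGS.items():
        if word.endswith(ending):
            stem = word[: -len(ending)] + "a"
            if stem in stems:
                analyses.extend((stem, case, number) for case, number in tags)
    return analyses


stems = {"deva"}  # hypothetical one-word stem list
for form in ("devam", "devO", "devAByAm"):
    print(form, analyse(form, stems))
```

Here `devam` receives one analysis, but `devO` receives two and `devAByAm` three, so a lookup table alone cannot decide the tag. A morphological parser has to pick the analysis that fits the rest of the sentence.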

Task 3

Combined Word Segmentation and Morphological Parsing: Given the sequential dependency between the two tasks above, we encourage a joint or pipeline-based formulation that combines Tasks 1 and 2. If a pipeline structure is deployed, the segmented form of the sentence produced by the Word Segmentation step should be fed to the Morphological Parser. Alternatively, a participating team may model both tasks jointly. The two tasks are interdependent, and joint modelling of such related tasks is preferable to a pipeline-based approach. In either case, both the segmented forms and the morphological tags are to be predicted.
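The pipeline formulation can be sketched as a simple composition of two components. Everything named here is an assumption made for illustration: `run_pipeline`, the lookup-based `toy_segmenter` and `toy_tagger`, and the tag strings are hypothetical stand-ins for trained models and for the dataset's actual tag format.

```python
from typing import Callable, Dict, List, Tuple

Segmenter = Callable[[str], List[str]]
Tagger = Callable[[List[str]], List[str]]


def run_pipeline(sentence: str, segmenter: Segmenter, tagger: Tagger) -> List[Tuple[str, str]]:
    """Segment first, then tag the segmented words (Task 1 feeding Task 2)."""
    words = segmenter(sentence)
    tags = tagger(words)
    if len(words) != len(tags):
        raise ValueError("tagger must return exactly one tag per word")
    return list(zip(words, tags))


# Hypothetical lookup tables standing in for trained models (SLP1).
TOY_SPLITS: Dict[str, List[str]] = {"nAsti": ["na", "asti"]}
TOY_TAGS: Dict[str, str] = {"na": "Indeclinable", "asti": "Verb|Pres|3|Sg"}


def toy_segmenter(sentence: str) -> List[str]:
    words: List[str] = []
    for chunk in sentence.split():
        words.extend(TOY_SPLITS.get(chunk, [chunk]))
    return words


def toy_tagger(words: List[str]) -> List[str]:
    return [TOY_TAGS.get(word, "Unknown") for word in words]


print(run_pipeline("nAsti", toy_segmenter, toy_tagger))
```

Because the tagger only ever sees the segmenter's single output, a wrong split cannot be corrected downstream. A joint model would instead score candidate splits and their tags together, which is one reason to prefer it when the two tasks are interdependent.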

Online Portal

The competition is conducted on the following platform:
WSMP

Important Dates

10 September 2021: Training Dataset Release
15 October 2021: Online Portal for Registration
20-21 November 2021: Test Data Release and Date of Hackathon
13-17 December 2021: FIRE '21 and Declaration of Results
