Sanskrit Word Segmentation and Morphological Parsing Hackathon Dataset

Dataset

General Instructions:

1. The instructions and explanations for the dataset and the three tasks are provided in an IPython notebook (WSMP_Instructions); a .pdf version of the notebook is also provided.

2. The dataset bundle contains 10 files. Make sure that all of these files are placed in the same directory.

3. The hackathon platform will be "Codalab". It has been released and can be accessed here: https://competitions.codalab.org/competitions/35744

4. The event will be a two-day hackathon (20-21 November).

5. A team with at most 5 members may register for the event.

6. All participants must fill out a Google form to register for use of the dataset.

7. The link for the form is available here:

REGISTER FOR WSMP DATASET

Overview of the dataset:

(a) Joint Sentences: 90,000 for training and 10,000 for development

(b) The ground truth segmentation for each of the sentences in (a).

(c) The ground truth morphological analysis (stem and morphological category) for each of the sentences.

(d) The ground truth segmentation and morphological analysis (segmented word, stem and morphological category) for each of the sentences.

(e) Additionally, auxiliary files (in .graphml format) will be provided for all the sentences of (a). Each of these files contains a graph whose nodes are all the possible segments of the sentence, with the morphological tags as node attributes. An edge between two segments indicates that they are compatible, that is, that both can be present in a single segmentation. The ground truth segmentation is also encoded in this graph.
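
The graph structure described in (e) can be explored with a few lines of Python. The following is only a minimal sketch using the standard library: the attribute names "word" and "morph", the node ids, the placeholder values and the tiny inline sample are all hypothetical and are not taken from the actual dataset files. The helper names load_segment_graph and is_compatible are likewise illustrative. The sketch also assumes that a candidate segmentation is acceptable only if every pair of its segments is joined by a compatibility edge.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Standard GraphML XML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical miniature graph; the real files may use other attribute names.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="word" attr.type="string"/>
  <key id="d1" for="node" attr.name="morph" attr.type="string"/>
  <graph id="G" edgedefault="undirected">
    <node id="n0"><data key="d0">seg_a</data><data key="d1">tag_a</data></node>
    <node id="n1"><data key="d0">seg_b</data><data key="d1">tag_b</data></node>
    <node id="n2"><data key="d0">seg_c</data><data key="d1">tag_c</data></node>
    <edge source="n0" target="n1"/>
    <edge source="n0" target="n2"/>
  </graph>
</graphml>"""


def load_segment_graph(xml_text):
    """Return (nodes, edges): node id -> attribute dict, and a set of node pairs."""
    root = ET.fromstring(xml_text)
    # Map GraphML key ids (e.g. "d0") to their human-readable attribute names.
    key_names = {k.get("id"): k.get("attr.name") for k in root.findall("g:key", NS)}
    nodes, edges = {}, set()
    for graph in root.findall("g:graph", NS):
        for node in graph.findall("g:node", NS):
            nodes[node.get("id")] = {
                key_names.get(d.get("key"), d.get("key")): d.text
                for d in node.findall("g:data", NS)
            }
        for edge in graph.findall("g:edge", NS):
            # Edges are treated as undirected compatibility links (an assumption).
            edges.add(frozenset((edge.get("source"), edge.get("target"))))
    return nodes, edges


def is_compatible(selection, edges):
    """True if every pair of selected segments shares a compatibility edge."""
    return all(frozenset(pair) in edges for pair in combinations(selection, 2))


nodes, edges = load_segment_graph(SAMPLE)
print(nodes["n0"])
print(is_compatible(["n0", "n1"], edges))
print(is_compatible(["n1", "n2"], edges))
```

To read a real file, pass its text content to load_segment_graph and inspect the returned node attributes first, since the actual attribute names should be taken from the files and the instructions notebook rather than from this sketch.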
