Senseval-3

Task: Catalan Lexical Sample


General description

Catalan Lexical Sample Task
We propose a "Lexical-Sample" task for Catalan to evaluate supervised and semi-supervised learning systems for word sense disambiguation (WSD). For each of 27 words, participants will be provided with a relatively small set of labeled examples (two thirds of 75+15*#senses) and a much larger set of unlabeled examples (ten times more, when possible). The test set will comprise the other third of 75+15*#senses. We target two types of participants: supervised systems (which do not use the unlabeled data) and semi-supervised systems (which take advantage of it), but unsupervised systems are, of course, also welcome to participate. The MiniDir sense inventory, developed specifically for this task, is manually linked to WordNet 1.5 (automatic links to WordNet 1.6/1.7 will also be provided). This task will be coordinated with the other lexical-sample tasks (Basque, Spanish, English, Italian, Romanian) so that around 10 of the target words are shared.
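
As a rough illustration of the sizes implied by the formula above, the following Python sketch computes the per-word number of training, test, and unlabeled examples from the number of senses. The function name and the returned keys are hypothetical, and the unlabeled figure assumes that "ten times more" is measured against the labeled training set; the description only promises this "when possible", so it should be read as a target rather than a guarantee.

```python
def example_counts(num_senses: int) -> dict:
    """Per-word example counts implied by the 75+15*#senses formula.

    Hypothetical helper, not part of the task distribution.
    """
    if num_senses < 1:
        raise ValueError("a word needs at least one sense")
    total = 75 + 15 * num_senses      # labeled examples per word
    train = 2 * total // 3            # two thirds (total is always divisible by 3)
    test = total - train              # the other third
    # Assumption: "ten times more" refers to the labeled training set.
    unlabeled_target = 10 * train
    return {
        "total": total,
        "train": train,
        "test": test,
        "unlabeled_target": unlabeled_target,
    }


if __name__ == "__main__":
    for senses in (2, 3, 5):
        print(senses, example_counts(senses))
```

For a word with 3 senses this gives 120 labeled examples in total, split into 80 for training and 40 for testing, with a target of up to 800 unlabeled examples.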


Datasets and task general information

All data sets and complementary information are organized in the following files:

README                       - contains information about the Catalan Lexical Sample Task and the datasets provided
words.info                       - information about the 27 words treated
MiniDir.xml                   - sense inventory used to annotate examples

Catalan-samples.train.raw.xml            - training examples for all 27 words
Catalan-samples.train.tagged.xml       - training examples for all 27 words: lemmatized and POS tagged version
Catalan-samples.unlab.raw.xml           - unlabeled examples for all 27 words
Catalan-samples.unlab.tagged.xml      - unlabeled examples for all 27 words: lemmatized and POS tagged version
Catalan-samples.test.raw.xml              - test examples for all 27 words
Catalan-samples.test.tagged.xml         - test examples for all 27 words: lemmatized and POS tagged version

All these files are grouped into the following 3 gzipped tar packages for download:

- LexicalSample.ca.train.raw.tgz (11.1Mb):
   contains informative files, MiniDir.xml, and the "raw" version of the training examples, both labeled and unlabeled.

- LexicalSample.ca.train.tagged.tgz (39.2Mb):
   contains the "tagged" versions of the training examples, both labeled and unlabeled.

- LexicalSample.ca.test.tgz (4.5Mb):
   contains test datasets, both "raw" and "tagged"
 

   * To download and use these materials, please register as a participant at the Senseval-3 Web site and follow the downloading instructions *
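
Once a package has been downloaded, its contents can be inspected before unpacking. The sketch below uses Python's standard tarfile module and relies only on the file naming shown above (Catalan-samples.<split>.<version>.xml). The helper names describe and list_package are hypothetical, and the internal directory layout of the packages is not documented here, so the code looks only at the final component of each member name.

```python
import tarfile
from pathlib import PurePosixPath


def describe(member_name: str):
    """Return (split, version) for names like Catalan-samples.train.raw.xml.

    Returns None for any other file (README, words.info, MiniDir.xml, ...).
    """
    parts = PurePosixPath(member_name).name.split(".")
    if len(parts) == 4 and parts[0] == "Catalan-samples" and parts[3] == "xml":
        return parts[1], parts[2]     # e.g. ("train", "raw")
    return None


def list_package(path: str) -> dict:
    """Map every member of a gzipped tar package to its (split, version)."""
    with tarfile.open(path, "r:gz") as tar:
        return {name: describe(name) for name in tar.getnames()}


# Example call, assuming the package sits in the current directory:
#     list_package("LexicalSample.ca.test.tgz")
```

Listing the members first makes it easy to check that a package contains the expected train, unlab, or test files in their raw or tagged versions before extracting anything.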

News

March 10, 2004:    Training/test datasets became available at the Senseval-3 Web site.
February 27, 2004: This page has been set up.


Task organizers

Lluís Màrquez
TALP Research Center
Software Department, Technical University of Catalonia
(lluism@lsi.upc.es)

M. Antònia Martí
CLiC, Universitat de Barcelona
(amarti@ub.edu)

Mariona Taulé
CLiC, Universitat de Barcelona
(mtaule@uoc.edu)



Last update: March 10, 2004.