Senseval-3

Task: Spanish Lexical Sample


General description

Spanish Lexical Sample Task
We propose a "Lexical-Sample" task for Spanish to evaluate supervised and semi-supervised learning systems for word sense disambiguation (WSD). For each of 46 words, participants will be provided with a relatively small set of labeled examples (two thirds of 75+15*#senses, where #senses is the number of senses of the word) and a much larger set of unlabeled examples (ten times more, when possible). The test set will comprise one third of the 75+15*#senses examples. The task targets two types of participants: supervised systems (which do not use the unlabeled data) and semi-supervised systems (which exploit the unlabeled data); unsupervised systems are also welcome to participate. The MiniDir sense inventory, developed specially for this task, is manually linked to WordNet 1.5 (automatic links to WordNet 1.6/1.7 will also be provided). This task is coordinated with the other lexical-sample tasks (Basque, Catalan, English, Italian, Romanian) so that around 10 of the target words are shared.
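The sizing rule above can be worked through with a small sketch. The function name and the returned keys are illustrative only, and the unlabeled figure assumes that "ten times more" refers to ten times the labeled training set, which the description does not state explicitly.

```python
def sample_sizes(num_senses: int) -> dict:
    """Per-word example counts implied by the 75+15*#senses rule.

    Hypothetical helper for illustration; not part of the task distribution.
    """
    total = 75 + 15 * num_senses      # labeled examples for the word
    # 75 + 15*n = 15*(5 + n) is always a multiple of 3, so the split is exact.
    train = 2 * total // 3            # two thirds go to the training set
    test = total // 3                 # one third goes to the test set
    # Assumption: the unlabeled target is ten times the labeled training set.
    unlabeled_target = 10 * train
    return {
        "total": total,
        "train": train,
        "test": test,
        "unlabeled_target": unlabeled_target,
    }


if __name__ == "__main__":
    # A word with 3 senses: 120 labeled examples, 80 for training, 40 for test.
    print(sample_sizes(3))
```

For a word with 3 senses this gives 120 labeled examples in total, of which 80 are for training and 40 for testing.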


Datasets and task general information

All data sets and complementary information are organized in the following files:

README                         - contains information about the Spanish Lexical Sample Task and the datasets provided
words.info                        - information about the 46 words treated
MiniDir.xml                   - sense inventory used to annotate examples

Spanish-samples.train.raw.xml            - training examples for all 46 words
Spanish-samples.train.tagged.xml        - training examples for all 46 words: lemmatized and POS tagged version
Spanish-samples.unlab.raw.xml           - unlabeled examples for all 46 words
Spanish-samples.unlab.tagged.xml      - unlabeled examples for all 46 words: lemmatized and POS tagged version
Spanish-samples.test.raw.xml               - test examples for all 46 words
Spanish-samples.test.tagged.xml           - test examples for all 46 words: lemmatized and POS tagged version

All these files are grouped into the following 3 gzipped tar packages for downloading:

- LexicalSample.es.train.raw.tgz (19.0Mb):
   contains informative files, MiniDir.xml, and the "raw" version of the training examples, both labeled and unlabeled.

- LexicalSample.es.train.tagged.tgz (63.2Mb):
   contains the "tagged" versions of the training examples, both labeled and unlabeled.

- LexicalSample.es.test.tgz (5.4Mb):
   contains test datasets, both "raw" and "tagged"
 

   * To download and use these materials, please register as a participant at the Senseval-3 Web site and follow the download instructions * 
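Once a package has been downloaded, its contents can be checked against the file list above before unpacking. The following is a minimal sketch using Python's standard tarfile module; the helper name list_package is hypothetical, and the path passed to it is wherever the package was saved locally.

```python
import tarfile


def list_package(path: str) -> list[str]:
    """Return the member names of a gzipped tar package.

    Hypothetical helper for illustration; 'path' points to a local .tgz file.
    """
    # "r:gz" opens the archive for reading with gzip decompression.
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()


if __name__ == "__main__":
    import sys

    # Usage: python list_package.py LexicalSample.es.test.tgz
    for name in list_package(sys.argv[1]):
        print(name)
```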

News

March 18, 2004:       The official scorer2 program does not work properly on the "Spanish lexical sample" datasets because of the initial accent in the word "órgano". Please remove accents from the "key" and "output" files before running the evaluation program. Sorry for the inconvenience.

March 01, 2004:       Training/test datasets became available at the Senseval-3 Web site
February 27, 2004:   This page has been set up.
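
The accent-removal workaround from the March 18 item can be sketched as follows. This is one possible approach, not an official tool: the function names are hypothetical, only the acute accent is removed (so "ñ" is left untouched, which is an assumption about what scorer2 tolerates), and the file encoding must be supplied by the user because it is not stated here.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent


def strip_acute_accents(text: str) -> str:
    """Remove acute accents, e.g. turn "órgano" into "organo".

    Assumption: only the acute accent needs removing; other marks are kept.
    """
    # Decompose so that accented letters become base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    without_acute = decomposed.replace(ACUTE, "")
    # Recompose so that remaining marked letters (e.g. "ñ") are single characters again.
    return unicodedata.normalize("NFC", without_acute)


def strip_accents_in_file(src: str, dst: str, encoding: str) -> None:
    """Copy src to dst with acute accents removed.

    The encoding is an assumption the caller must supply; it is not given
    in the task description.
    """
    with open(src, encoding=encoding) as fin, \
            open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            fout.write(strip_acute_accents(line))
```

The same function would be applied to both the "key" file and the system "output" file so that the word identifiers still match after the accents are removed.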


Task organizers

Lluís Màrquez
TALP Research Center
Software Department, Technical University of Catalonia
(lluism@lsi.upc.es)

M. Antònia Martí
CLiC, Universitat de Barcelona
(amarti@ub.edu)

Mariona Taulé
CLiC, Universitat de Barcelona
(mtaule@uoc.edu)



Last update: February 29, 2004.