Senseval-3
Task: Spanish Lexical Sample
General description
Spanish Lexical Sample Task
We propose a "Lexical-Sample" task for Spanish in order to evaluate
supervised and semi-supervised learning systems for WSD. Each participant
will be provided with a relatively small set of labeled examples (two
thirds of 75+15*#senses) and a much larger set of unlabeled examples
(ten times more, when possible) for 46 words. The test set will comprise
one third of 75+15*#senses. We target two types of participants:
supervised systems (which do not use the unlabeled data) and
semi-supervised systems (which take advantage of the unlabeled data),
although unsupervised systems can of course also participate. The MiniDir
sense inventory, developed specially for the task, is manually linked to
WordNet 1.5 (automatic links to WordNet 1.6/1.7 will also be provided).
This task is coordinated with the other lexical-sample tasks (Basque,
Catalan, English, Italian, Rumanian) in order to share around 10 of the
target words.
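As a rough illustration of the sizing rule above, the following sketch computes how many examples a word would receive under the 75+15*#senses formula, split two thirds for training and one third for test. The function name `dataset_sizes` is hypothetical, and the assumption that the formula applies per word is our reading of the description, not a statement from the organizers.

```python
def dataset_sizes(num_senses):
    """Return (total, train, test) example counts for one target word.

    Assumes the 75+15*#senses rule applies per word (our reading of the
    task description). The total is always a multiple of 15, so the
    two-thirds / one-third split is exact.
    """
    total = 75 + 15 * num_senses
    train = 2 * total // 3   # labeled training examples
    test = total - train     # held-out test examples
    return total, train, test


for senses in (2, 3, 5):
    total, train, test = dataset_sizes(senses)
    print(f"{senses} senses: {total} total, {train} training, {test} test")
```

Under this reading, a word with 3 senses would have 120 examples in all, 80 for training and 40 for test.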
Datasets and task general information
All data sets and complementary information are organized in the
following
files:
README
- contains information about the Spanish Lexical Sample Task and the
datasets
provided
words.info
- information about the 46 words treated
MiniDir.xml
- sense inventory used to annotate examples
Spanish-samples.train.raw.xml
- training examples for all 46 words
Spanish-samples.train.tagged.xml
- training examples for all 46 words: lemmatized and POS tagged version
Spanish-samples.unlab.raw.xml
- unlabeled examples for all 46 words
Spanish-samples.unlab.tagged.xml
- unlabeled examples for all 46 words: lemmatized and POS tagged version
Spanish-samples.test.raw.xml
- test examples for all 46 words
Spanish-samples.test.tagged.xml
- test examples for all 46 words: lemmatized and POS tagged version
All these files are grouped in the following 3 gzipped tar packages
for downloading:
- LexicalSample.es.train.raw.tgz (19.0Mb):
contains informative files, MiniDir.xml, and the "raw"
version of the training examples, both labeled and unlabeled.
- LexicalSample.es.train.tagged.tgz (63.2Mb):
contains the "tagged" versions of the training examples,
both labeled and unlabeled.
- LexicalSample.es.test.tgz (5.4Mb):
contains test datasets, both "raw" and "tagged"
* To download and use these materials, please register as a participant
at the Senseval-3 Web site and follow the downloading instructions *
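Once a package has been downloaded, its contents can be checked against the file list above before unpacking. The sketch below uses Python's standard `tarfile` module; the helper name `list_package` is hypothetical, and the internal layout of the packages (for example, whether files sit inside a subdirectory) is not stated on this page, so the returned names should be inspected rather than assumed.

```python
import tarfile


def list_package(path):
    """Return the sorted names of the regular files in a gzipped tar package."""
    with tarfile.open(path, "r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())
```

For example, `list_package("LexicalSample.es.test.tgz")` would be expected to include the two test files listed above, assuming the package has been saved in the current directory.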
News
March 18, 2004: The official scorer2 program does not work properly on
the "Spanish lexical sample" datasets due to the initial accent on the
word "órgano". Please remove accents from the "key" and "output" files
before running the evaluation program. Sorry for the inconvenience.
March 01, 2004: Training/test datasets become available at the
Senseval-3 Web site.
February 27, 2004: This page has been set up.
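The accent-removal step requested in the March 18 note can be done in several ways. The sketch below is one possibility using Python's standard `unicodedata` module: it decomposes each character and drops the combining marks. The function names are hypothetical, and the `latin-1` default encoding is an assumption, since this page does not state how the key and output files are encoded. Note that this approach also maps "ñ" to "n", and that the same transformation must be applied to both the key file and the system output so that their labels still match.

```python
import unicodedata


def strip_accents(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_accents_in_file(src_path, dst_path, encoding="latin-1"):
    """Write an accent-free copy of src_path to dst_path.

    The default encoding is an assumption; adjust it to match the files.
    """
    with open(src_path, encoding=encoding) as src:
        cleaned = strip_accents(src.read())
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(cleaned)


print(strip_accents("órgano"))
```

The final line prints the accent-free form of the word mentioned in the note; the file helper would be called once for the key file and once for the output file before running scorer2.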
Task organizers
Lluís Màrquez
TALP Research Center
Software Department, Technical University of Catalonia
(lluism@lsi.upc.es)
M. Antònia Martí
CLiC, Universitat de Barcelona
(amarti@ub.edu)
Mariona Taulé
CLiC, Universitat de Barcelona
(mtaule@uoc.edu)
Last update: February 29, 2004.