Senseval-3
Task: Catalan Lexical Sample
General description
Catalan Lexical Sample Task
We propose a "Lexical-Sample" task for Catalan in order to evaluate
supervised and semi-supervised learning systems for word sense
disambiguation (WSD). Each participant will be provided with a relatively
small set of labeled examples (two thirds of 75+15*#senses examples per
word) and a comparatively very large set of unlabeled examples (ten times
as many, when possible) for 27 words. The test set will comprise the
remaining third of the 75+15*#senses examples.
We target two types of participants: supervised systems (not using
unlabeled data) and semi-supervised systems (those taking advantage of
the unlabeled data), but unsupervised systems can of course also
participate. The MiniDir sense inventory, which was developed specifically
for the task, is manually linked to WordNet 1.5 (automatic links to
WordNet 1.6/1.7 will also be provided). This task will be coordinated
with other lexical-sample tasks (Basque, Spanish, English, Italian,
Romanian) in order to share around 10 of the target words.
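As a rough illustration of the sizing rule 75+15*#senses, the following sketch computes the approximate number of labeled training, test, and unlabeled examples for a word with a given number of senses. The function name and the reading of "ten times" as ten times the labeled training set are assumptions made for this example, not part of the task definition.

```python
def sample_sizes(num_senses: int, unlabeled_factor: int = 10) -> dict:
    """Approximate per-word example counts under the rule 75 + 15 * #senses.

    Hypothetical helper: the interpretation of the unlabeled target as
    `unlabeled_factor` times the labeled training set is an assumption.
    """
    total = 75 + 15 * num_senses      # always a multiple of 15, hence of 3
    train = 2 * total // 3            # two thirds go to the labeled training set
    test = total - train              # the remaining third is the test set
    return {
        "total": total,
        "train": train,
        "test": test,
        "unlabeled_target": unlabeled_factor * train,
    }


if __name__ == "__main__":
    # Example: a word with 3 senses in the inventory
    print(sample_sizes(3))
```

For a word with 3 senses this gives 120 examples in total, split into 80 for training and 40 for testing. The actual counts in the released data may differ, since the unlabeled target applies only "when possible".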
Datasets and task general information
All data sets and complementary information are organized in the
following files:
README
- contains information about the Catalan Lexical Sample Task and the
datasets provided
words.info
- information about the 27 words treated
MiniDir.xml
- sense inventory used to annotate examples
Catalan-samples.train.raw.xml
- training examples for all 27 words
Catalan-samples.train.tagged.xml
- training examples for all 27 words: lemmatized and POS tagged version
Catalan-samples.unlab.raw.xml
- unlabeled examples for all 27 words
Catalan-samples.unlab.tagged.xml
- unlabeled examples for all 27 words: lemmatized and POS tagged version
Catalan-samples.test.raw.xml
- test examples for all 27 words
Catalan-samples.test.tagged.xml
- test examples for all 27 words: lemmatized and POS tagged version
All these files are grouped in the following 3 gzipped tar packages
for downloading:
- LexicalSample.ca.train.raw.tgz (11.1Mb):
contains the informative files, MiniDir.xml, and the "raw"
version of the training examples, both labeled and unlabeled.
- LexicalSample.ca.train.tagged.tgz (39.2Mb):
contains the "tagged" versions of the training examples,
both labeled and unlabeled.
- LexicalSample.ca.test.tgz (4.5Mb):
contains test datasets, both "raw" and "tagged"
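Once downloaded, the three packages are ordinary gzipped tar archives. A minimal sketch for unpacking them with Python's standard library follows. The helper name and the destination directory are hypothetical, and the internal layout of the archives is not described here, so the function simply reports whatever member names it finds.

```python
import tarfile
from pathlib import Path


def unpack_packages(archive_paths, dest_dir):
    """Extract each gzipped tar archive into dest_dir and return member names.

    Hypothetical helper for illustration; it makes no assumption about
    how the files are arranged inside each package.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where the running Python has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    extracted = []
    for archive in archive_paths:
        with tarfile.open(archive, mode="r:gz") as tar:
            extracted.extend(tar.getnames())
            tar.extractall(dest, **kwargs)
    return extracted


if __name__ == "__main__":
    packages = [
        "LexicalSample.ca.train.raw.tgz",
        "LexicalSample.ca.train.tagged.tgz",
        "LexicalSample.ca.test.tgz",
    ]
    # Only unpack the packages that are actually present locally.
    present = [p for p in packages if Path(p).exists()]
    for name in unpack_packages(present, "catalan_lexical_sample"):
        print(name)
```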
* For downloading and using all these materials, please register as a
participant at the Senseval-3 Web site and follow the downloading
instructions *
News
March 10, 2004: Training/test datasets become available at the Senseval-3 Web site
February 27, 2004: This page has been set up.
Task organizers
Lluís Màrquez
TALP Research Center
Software Department, Technical University of Catalonia
(lluism@lsi.upc.es)
M. Antònia Martí
CLiC, Universitat de Barcelona
(amarti@ub.edu)
Mariona Taulé
CLiC, Universitat de Barcelona
(mtaule@uoc.edu)
Last update: March 10, 2004.