The semi-automatic creation of MultiWordNet
In the construction of the Italian
component of MultiWordNet, we have developed
techniques for the semi-automatic acquisition
of lexical information, in order to speed up
both the construction of the corresponding Italian
synsets and the detection of lexical divergences
between English and Italian. These techniques
rely on various sources, including Princeton
WordNet (PWN) and the Collins English/Italian bilingual
dictionary.
Two main procedures have been
developed, called the Assign-procedure and
the Lexical Gaps-procedure.
The Assign-procedure exploits the information
on translation equivalents contained in the
Collins dictionary to build Italian synsets
that correspond to the synsets already existing
in Princeton WordNet. A mapping algorithm
takes as input an Italian word sense, with all
the related information, and tries to assign
the sense to a synset of the English WordNet.
The algorithm applies a number of rules,
each of which considers a particular
kind of information, such as the presence
of a semantic code in the Italian sense
(e.g. the label CULINARY for one of the
three senses of the word "pizza").
Each rule contributes to the assignment with
a partial score. The output of the algorithm
is either an assignment of the Italian sense
to a certain English synset, when the global
score (given by the sum of the partial scores
contributed by the individual rules) reaches a fixed
threshold, or a failure to assign the sense,
when the global score does not reach the threshold.
The set of assignments produced by examining
all the senses of the Italian words in the dictionary
constitutes an automatically created Italian
WordNet aligned with PWN. The data of the automatic
version are then tested against manually acquired
data, with the aim of incrementally improving
the precision of the algorithm.
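The scoring scheme described above can be illustrated with a
minimal sketch. It is not the actual MultiWordNet implementation:
the two rules, their weights, the threshold value, the toy synset
identifiers and the choice of the best-scoring candidate are all
assumptions made for illustration. Only the general idea (partial
scores from independent rules, summed and compared with a fixed
threshold) comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ItalianSense:
    """One sense of an Italian headword in the bilingual dictionary (toy model)."""
    lemma: str
    translations: Tuple[str, ...]        # English translation equivalents
    semantic_code: Optional[str] = None  # e.g. "CULINARY"


@dataclass(frozen=True)
class Synset:
    """An English synset (toy model; the ids and domains below are invented)."""
    synset_id: str
    lemmas: Tuple[str, ...]
    domain: Optional[str] = None


# A rule inspects one kind of information and returns a partial score.
Rule = Callable[[ItalianSense, Synset], float]


def translation_rule(sense: ItalianSense, synset: Synset) -> float:
    """Hypothetical rule: one point per translation equivalent found in the synset."""
    return float(len(set(sense.translations) & set(synset.lemmas)))


def semantic_code_rule(sense: ItalianSense, synset: Synset) -> float:
    """Hypothetical rule: one point if the semantic code matches the synset domain."""
    if sense.semantic_code is not None and sense.semantic_code == synset.domain:
        return 1.0
    return 0.0


RULES: Tuple[Rule, ...] = (translation_rule, semantic_code_rule)
THRESHOLD = 2.0  # assumed value, chosen only for this example


def global_score(sense: ItalianSense, synset: Synset,
                 rules: Sequence[Rule] = RULES) -> float:
    """Global score: the sum of the partial scores contributed by the rules."""
    return sum(rule(sense, synset) for rule in rules)


def assign(sense: ItalianSense, candidates: Sequence[Synset],
           rules: Sequence[Rule] = RULES,
           threshold: float = THRESHOLD) -> Optional[Synset]:
    """Return the best-scoring synset if it reaches the threshold, else None."""
    if not candidates:
        return None
    best = max(candidates, key=lambda s: global_score(sense, s, rules))
    return best if global_score(sense, best, rules) >= threshold else None


# Toy data: invented synsets and two invented dictionary senses.
food = Synset("toy-001", ("pizza", "pizza_pie"), domain="CULINARY")
other = Synset("toy-002", ("reel",), domain=None)
candidates = (food, other)

culinary = ItalianSense("pizza", ("pizza",), semantic_code="CULINARY")
uncoded = ItalianSense("pizza", ("pizza",))

print(assign(culinary, candidates))  # assigned to the toy food synset
print(assign(uncoded, candidates))   # None: the global score stays below the threshold
```

In this sketch the sense carrying the CULINARY code collects one
point from each rule and reaches the threshold, while the same
translation without a semantic code collects only one point and
is left unassigned. Comparing such output with manually acquired
data is what allows rules and threshold to be tuned for precision.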
The Lexical Gaps-procedure identifies
lexical gaps in a semi-automatic way. The procedure
classifies translation equivalents into two main
groups: idioms and restricted collocations on
the one hand, and free combinations of words
(which imply gaps) on the other. Knowledge
contained in dictionaries and structural regularities
exhibited by idioms, restricted collocations
and gaps can be exploited to automatically distinguish
them from each other with a certain degree of
confidence. The procedure is able to detect
lexical gaps both from English to Italian and
from Italian to English.
For details about the procedures see [Pianta
et al. 2002].
A final manual check is performed
on all automatically acquired data, in order
to guarantee the reliability of the resource.