The MultiWordNet project

    What's in MultiWordNet

    MultiWordNet is a multilingual lexical database that includes information about English and Italian words. It is an extension of WordNet 1.6, a lexical database for English developed at Princeton University. MultiWordNet contains information about the following aspects of the English and Italian lexica:

    • lexical relations between words;
    • semantic relations between lexical concepts (synsets);
    • correspondences between Italian and English lexical concepts;
    • semantic fields (domains).

    The basic lexical relationship in MultiWordNet is lexical synonymy. Groups of synonyms are used to identify lexical concepts, which are called synsets. Here is an example of an Italian synset:

    {elaboratore, computer, cervello_elettronico, calcolatore}

    Synsets are the most important units in MultiWordNet. Different types of semantic relationships can be attached to them. For example, the above synset has three different semantic relationships:

    has_hypernym {macchina}
    has_hyponym {calcolatore_analogico}, {calcolatore_digitale}, etc.
    has_part {microchip, chip}, etc.
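
    The synset and its relations above can be pictured as a small data structure. The following is only a minimal sketch: the `Synset` class and its method names are assumptions made for illustration and are not part of the MultiWordNet distribution.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A lexical concept, identified by its group of synonymous lemmas."""
    lemmas: tuple
    # Maps a relation name (e.g. "has_hypernym") to the list of target synsets.
    relations: dict = field(default_factory=dict)

    def add_relation(self, name, target):
        """Attach a semantic relation from this synset to another synset."""
        self.relations.setdefault(name, []).append(target)

    def related(self, name):
        """Return the synsets reached through the given relation (possibly none)."""
        return self.relations.get(name, [])


# The Italian synsets from the example above.
computer = Synset(("elaboratore", "computer", "cervello_elettronico", "calcolatore"))
machine = Synset(("macchina",))
analog = Synset(("calcolatore_analogico",))
digital = Synset(("calcolatore_digitale",))
chip = Synset(("microchip", "chip"))

computer.add_relation("has_hypernym", machine)
computer.add_relation("has_hyponym", analog)
computer.add_relation("has_hyponym", digital)
computer.add_relation("has_part", chip)

# List the lemmas of every hyponym of the "computer" synset.
for target in computer.related("has_hyponym"):
    print(target.lemmas)
```

    Relations are stored on the synset rather than on individual words, which mirrors the distinction between semantic relations (between synsets) and lexical relations (between words).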

    As a result of the approach followed in the construction of MultiWordNet, cross-language correspondence is defined between synsets as well:

    {elaboratore, computer, cervello_elettronico, calcolatore}
    {computer, data_processor, electronic_computer, information_processing_system}
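
    One way to picture this correspondence is to store the synsets of each language under a shared concept identifier. The sketch below assumes such an identifier; the value "concept-0001", the `synsets` table and the `translate` function are hypothetical names used only for illustration.

```python
# Hypothetical concept identifier shared by the aligned synsets.
CONCEPT_ID = "concept-0001"

# One table per language, keyed by the shared concept identifier.
synsets = {
    "italian": {
        CONCEPT_ID: ["elaboratore", "computer", "cervello_elettronico", "calcolatore"],
    },
    "english": {
        CONCEPT_ID: ["computer", "data_processor", "electronic_computer",
                     "information_processing_system"],
    },
}


def translate(lemma, source, target):
    """Return the target-language lemmas aligned with any synset containing `lemma`."""
    result = []
    for concept_id, lemmas in synsets[source].items():
        if lemma in lemmas:
            # A missing entry in the target language yields no candidates.
            result.extend(synsets[target].get(concept_id, []))
    return result


print(translate("calcolatore", "italian", "english"))
```

    Because the alignment is defined between synsets, every lemma of the Italian synset maps to the whole English synset, not to a single English word.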

    MultiWordNet also contains domain information. Each synset has been annotated with at least one domain label, selected from a hierarchically organized set of about two hundred labels (see WordNet Domains for further information). In our example, the synset is labeled with the "Computer Science" semantic field.

    The latest version of MultiWordNet (1.39) contains around 58,000 Italian word senses and 41,500 lemmas organized into 32,700 synsets, which are aligned whenever possible with Princeton WordNet (PWN) English synsets. The following table reports all the details.

                                                              Nouns    Verbs      Adj  Adverbs    Total
    Word senses                                              43,449    8,271    4,425    1,789   57,934
    Lemmas                                                   31,525    4,431    4,130    1,405   41,491
    Total number of synsets                                  25,043    4,170    2,454    1,006   32,673
    New Italian synsets (with no correspondent in PWN)        2,768       31       26        0    2,825
    English-to-Italian gaps                                     370      142      232       26      770
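
    The figures in the table can be checked and compared with a few lines of code. This is only a sketch: the `counts` dictionary and the helper names are assumptions introduced here, while the numbers are copied from the table above.

```python
# Counts copied from the table above, per part of speech.
counts = {
    "word_senses": {"nouns": 43449, "verbs": 8271, "adj": 4425, "adverbs": 1789},
    "lemmas": {"nouns": 31525, "verbs": 4431, "adj": 4130, "adverbs": 1405},
    "synsets": {"nouns": 25043, "verbs": 4170, "adj": 2454, "adverbs": 1006},
}


def total(row):
    """Sum one row of the table over all parts of speech."""
    return sum(counts[row].values())


def senses_per_lemma(pos):
    """Average number of word senses per lemma for one part of speech."""
    return counts["word_senses"][pos] / counts["lemmas"][pos]


for row in counts:
    print(row, total(row))

for pos in counts["lemmas"]:
    print(pos, round(senses_per_lemma(pos), 2))
```

    The senses-per-lemma ratio gives a rough indication of how polysemous each part of speech is in the Italian data.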

    All Princeton WordNet relations are represented in MultiWordNet. In addition, a new NEAREST relation has been added. NEAREST is an intralinguistic semantic relation that connects a synset which is a gap to its semantically nearest synset (usually a hyponym or a hypernym). It is typically used to manage denotation differences, i.e. cases in which a lexical concept in one language has no synonymous correspondent in the other language: a translation equivalent exists, but it is more general or more specific.
    As an example, the Italian word "abbronzante" has no single synonymous translation equivalent in English, but two more specific translation equivalents, "suntan cream" and "suntan oil". The English synset corresponding to the Italian "abbronzante" is empty, as it is a gap, but it is connected through a NEAREST relation to its semantically nearest synsets, "suntan cream" and "suntan oil".
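
    The handling of a gap through NEAREST can be sketched as a lookup with a fallback. The identifiers ("gap-abbronzante", "suntan_cream", "suntan_oil"), the table layout and the function name below are hypothetical and serve only to illustrate the "abbronzante" example.

```python
# English side: the synset aligned with "abbronzante" is an empty gap
# that points to its nearest synsets through the NEAREST relation.
english = {
    "gap-abbronzante": {"lemmas": [], "nearest": ["suntan_cream", "suntan_oil"]},
    "suntan_cream": {"lemmas": ["suntan cream"], "nearest": []},
    "suntan_oil": {"lemmas": ["suntan oil"], "nearest": []},
}

# Italian side: lemmas stored under the same (hypothetical) identifier.
italian = {"gap-abbronzante": ["abbronzante"]}


def english_equivalents(italian_lemma):
    """Return (lemma, kind) pairs, where kind is "synonym" or "nearest"."""
    equivalents = []
    for concept_id, lemmas in italian.items():
        if italian_lemma not in lemmas:
            continue
        aligned = english[concept_id]
        if aligned["lemmas"]:
            # A regular aligned synset: its lemmas are synonymous equivalents.
            equivalents.extend((lemma, "synonym") for lemma in aligned["lemmas"])
        else:
            # A gap: fall back to the synsets reached through NEAREST.
            for nearest_id in aligned["nearest"]:
                equivalents.extend(
                    (lemma, "nearest") for lemma in english[nearest_id]["lemmas"]
                )
    return equivalents


print(english_equivalents("abbronzante"))
```

    Tagging each result with its kind keeps the distinction between a true synonymous equivalent and a more general or more specific one visible to the caller.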

    A total of 53,002 relations are represented in MultiWordNet. Semantic relations hold for both languages while only English lexical relations are represented in the currently available version of MultiWordNet. All the details are given in the following table.

    Semantic Relations              Lexical Relations
    HAS_HYPERONYM     *33,195       ANTONYM          3,266
    HAS-PART            4,925       PERTAINS-TO      2,335
    ENTAILMENT            213
    ATTRIBUTE             743
    CAUSES                117
    NEAREST                40
    Total              45,593       Total            7,509

    * 30,323 of these relations refer to synsets that already exist in Princeton WordNet; 2,872 refer to new synsets.

    MultiWordNet is available for both research and commercial purposes. See the "Obtain a licence" page for details.

