OMW Documentation on LMF
This page provides some guidelines on how to prepare LMF for the Open Multilingual Wordnet.
Lexical Markup Framework (LMF; ISO 24613:2008) is the International Organization for Standardization (ISO/TC 37) standard for natural language processing (NLP) and machine-readable dictionary (MRD) lexicons. The LMF variant that we use here (GWA-LMF) is inspired by Wordnet-LMF. The schema is hosted on GitHub, with documentation.
Guidelines for preparing the LMF
Here are more details on how to prepare the file.
Wordnet Metadata
Each lexicon must have correct metadata (see here for more detail). Extra properties from Dublin Core may be included.
- id: A short name for the resource
  e.g. pwn; bahasa
- label: The full name for the resource
  e.g. Princeton WordNet; Wordnet Bahasa
- language: Please follow BCP-47, i.e., use a two-letter code if available, else a three-letter code
  e.g. en; id, zsm
- email: Please give a contact email address
- license: The license of your resource (please provide URL)
  e.g. https://opensource.org/licenses/MIT
  Currently we recommend:
  - Wordnet: wordnet
  - CC BY: http://opendefinition.org/licenses/cc-by/
  - ODC BY: http://opendefinition.org/licenses/odc-by/
  - CC BY SA: http://opendefinition.org/licenses/cc-by-sa/
- version: A string identifying this version (following major.minor format)
  e.g. 3.0; 1.3
- url: A URL for your project homepage
  e.g. http://wordnet.princeton.edu/; http://wn-msa.sourceforge.net/
- citation: The paper to cite for this resource
- status: The status of the resource, e.g., "valid", "checked", "unchecked"
- confidenceScore: A numeric value between 0 and 1 giving the confidence in the correctness of the element. Only entries with a value of 1 will be considered for the ILI.
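The metadata list above can be checked mechanically before a file is submitted. The following is a minimal sketch in Python using only the standard library. The sample lexicon, its values, and the helper names `missing_metadata` and `check_confidence` are hypothetical, and treating the first six attributes as required is an assumption of this sketch; the schema remains the authority.

```python
import xml.etree.ElementTree as ET

# Assumption of this sketch: these attributes are treated as required.
# Consult the schema for the authoritative required/optional split.
REQUIRED = ("id", "label", "language", "email", "license", "version")

# Hypothetical lexicon header; all values are placeholders.
SAMPLE = """
<LexicalResource>
  <Lexicon id="example-en"
           label="Example Wordnet"
           language="en"
           email="maintainer@example.org"
           license="http://opendefinition.org/licenses/cc-by/"
           version="1.0"
           url="http://example.org/wordnet/"
           confidenceScore="1.0"/>
</LexicalResource>
"""


def missing_metadata(lexicon):
    """Return the required attribute names that are absent or empty."""
    return [name for name in REQUIRED if not lexicon.get(name, "").strip()]


def check_confidence(lexicon):
    """Return True if confidenceScore is absent or a number between 0 and 1."""
    raw = lexicon.get("confidenceScore")
    if raw is None:
        return True
    try:
        return 0.0 <= float(raw) <= 1.0
    except ValueError:
        return False


root = ET.fromstring(SAMPLE)
for lexicon in root.iter("Lexicon"):
    print(lexicon.get("id"), missing_metadata(lexicon), check_confidence(lexicon))
```

A lexicon that passes reports an empty list of missing attributes; anything else should be fixed in the file header before upload.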
Notes on the entries
There is extensive documentation with the schemas. Here we include a few tips that are not covered there.
Definitions
If you want to include a definition from somewhere else (such as the Princeton wordnet), or in a language other than that of the wordnet, please note it explicitly:
<Definition language="ja">辞書の編集者または筆者</Definition>
<Definition dc:source="pwn-3.0" language="en">a compiler or writer of a dictionary</Definition>
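To illustrate how a consumer of the file might tell native definitions from borrowed ones, here is a small Python sketch that reads the language and dc:source of each definition. The enclosing Synset element, its id, the namespace URI bound to the dc prefix, and the helper name `definitions` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the dc: prefix; use whatever your file declares.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical synset wrapping the two definitions from the example above.
SYNSET = f"""
<Synset xmlns:dc="{DC}" id="example-ja-1234-n" ili="">
  <Definition language="ja">辞書の編集者または筆者</Definition>
  <Definition dc:source="pwn-3.0" language="en">a compiler or writer of a dictionary</Definition>
</Synset>
"""


def definitions(synset, default_language):
    """List each definition with its language, source (or None) and text."""
    rows = []
    for definition in synset.findall("Definition"):
        rows.append({
            # A definition without a language attribute is taken to be in
            # the lexicon's own language (assumption of this sketch).
            "language": definition.get("language", default_language),
            "source": definition.get(f"{{{DC}}}source"),
            "text": (definition.text or "").strip(),
        })
    return rows


for row in definitions(ET.fromstring(SYNSET), default_language="ja"):
    print(row)
```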
Semantic Relations
If you have a relation type not included in the list we have, please use other and give your more explicit type as dc:type. Or, if your type is a more specific subclass of an existing type, you can use the supertype and mark the specific type with dc:type.
<SynsetRelation relType="other" dc:type="emotion" target="example-en-1234-n"/>
<SynsetRelation relType="antonym" dc:type="gradable antonym" target="example-en-1234-n"/>
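The choice between a listed type, a supertype with dc:type, and other with dc:type can be sketched as a small helper. This is only an illustration: `KNOWN_REL_TYPES` is a deliberately tiny stand-in for the list in the schema, the namespace URI is an assumption, and `make_relation` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the dc: prefix; use whatever your file declares.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

# Illustrative subset only. The authoritative list is
# SynsetRelation.relType in the schema.
KNOWN_REL_TYPES = {"hypernym", "hyponym", "antonym"}


def make_relation(target, rel_type, supertype=None):
    """Build a SynsetRelation element following the guideline above."""
    if rel_type in KNOWN_REL_TYPES:
        # The type is in the list: use it directly.
        attrs = {"relType": rel_type}
    elif supertype in KNOWN_REL_TYPES:
        # More specific subclass of a listed type: keep the supertype
        # and record the specific type in dc:type.
        attrs = {"relType": supertype, f"{{{DC}}}type": rel_type}
    else:
        # Not covered by the list: fall back to "other" plus dc:type.
        attrs = {"relType": "other", f"{{{DC}}}type": rel_type}
    attrs["target"] = target
    return ET.Element("SynsetRelation", attrs)


for relation in (
    make_relation("example-en-1234-n", "emotion"),
    make_relation("example-en-1234-n", "gradable antonym", supertype="antonym"),
):
    print(ET.tostring(relation, encoding="unicode"))
```

When serialized on its own, each element carries its own xmlns:dc declaration; inside a full document that declaration would normally sit on the root element.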
Variants
You can add variations of lemmas, including orthographic variations and transliterations, as shown below. You can have various classes of transliteration, and if they are automatically generated, you can give them a confidence score.
<LexicalEntry id="w613347">
  <Lemma writtenForm="动物沟通" partOfSpeech="n" script="Hans"/>
  <Form writtenForm="dòngwùgōutōng" script="Latn-pinyin">
    <Tag category="transliteration">pīnyīn</Tag>
    <Tag category="confidence">0.77</Tag>
  </Form>
  <Form writtenForm="dong4wu4gou1tong1" script="Latn-pinyin">
    <Tag category="transliteration">pin1yin1</Tag>
    <Tag category="confidence">0.77</Tag>
  </Form>
  <Form writtenForm="dongwugoutong" script="Latn-pinyin">
    <Tag category="transliteration">pinyin</Tag>
    <Tag category="confidence">0.77</Tag>
  </Form>
</LexicalEntry>
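As a rough sketch of how the variant forms and their confidence tags might be consumed, the Python snippet below reads an abridged copy of the entry above (two of the three forms) and filters by a confidence threshold. The helper name `variants` and the rule that a form without a confidence tag counts as fully trusted are assumptions of this example.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the entry above.
ENTRY = """
<LexicalEntry id="w613347">
  <Lemma writtenForm="动物沟通" partOfSpeech="n" script="Hans"/>
  <Form writtenForm="dòngwùgōutōng" script="Latn-pinyin">
    <Tag category="transliteration">pīnyīn</Tag>
    <Tag category="confidence">0.77</Tag>
  </Form>
  <Form writtenForm="dongwugoutong" script="Latn-pinyin">
    <Tag category="transliteration">pinyin</Tag>
    <Tag category="confidence">0.77</Tag>
  </Form>
</LexicalEntry>
"""


def variants(entry, min_confidence=0.0):
    """Return (writtenForm, script, transliteration, confidence) per Form."""
    result = []
    for form in entry.findall("Form"):
        # Collect the Tag children as a category -> text mapping.
        tags = {tag.get("category"): (tag.text or "") for tag in form.findall("Tag")}
        # Assumption: a form without a confidence tag is fully trusted.
        confidence = float(tags.get("confidence", "1.0"))
        if confidence >= min_confidence:
            result.append((
                form.get("writtenForm"),
                form.get("script"),
                tags.get("transliteration"),
                confidence,
            ))
    return result


entry = ET.fromstring(ENTRY)
for row in variants(entry, min_confidence=0.5):
    print(row)
```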
Synset Identifiers and adding Synsets to CILI
- You can also add new synsets to the Collaborative Interlingual Index (CILI).
- Synset ids in your LMF file should consist of the project id, a hyphen, and then the original id (e.g. pwn-00001740-n for original id 00001740-n in the lexicon with id pwn). This is because XML ids cannot start with numbers.
- Synsets in your LMF file must make an explicit reference to their ILI status: an ILI id preceded by the letter 'i' (e.g. i78871), indicating full equivalence; the string 'in', indicating the suggestion of a new concept to the ILI; or the empty string, indicating that the concept is only used internally by this wordnet.
- All new ILI candidates must have been hand checked by a human;
- Newly suggested concepts must provide a unique English definition within the ILI repository, with at least 20 characters or 5 words;
- By uploading your Wordnet LMF to the OMW, you agree to release the English definitions accompanying new ILI candidates under a CC BY 4.0 license or later version;
- New concepts must be linked, directly or indirectly (through new synsets), to an existing ILI concept.
- The relations that can be used are listed under SynsetRelation.relType in the DTD linked above, excluding see_also;
- Your LMF file must include all targets of relations to be valid.
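Several of the rules above can be checked locally before upload. The Python sketch below covers the id prefix, the form of the ili attribute, the length of the English definition for new candidates, the link to an existing ILI concept, and missing relation targets. The sample lexicon and the helper names are hypothetical. Reading the English definition from a Definition element, and following only outgoing relations when looking for a link, are simplifying assumptions. Uniqueness within the ILI repository and hand checking cannot be verified this way.

```python
import re
import xml.etree.ElementTree as ET

ILI_ID = re.compile(r"i\d+")  # an existing ILI id, e.g. i78871

# Hypothetical lexicon; the last synset breaks several rules on purpose.
SAMPLE = """
<Lexicon id="example-en" language="en">
  <Synset id="example-en-1234-n" ili="i78871">
    <Definition>an existing concept, used here only as a link target</Definition>
  </Synset>
  <Synset id="example-en-5678-n" ili="in">
    <Definition>someone who compiles specialised glossaries</Definition>
    <SynsetRelation relType="hypernym" target="example-en-1234-n"/>
  </Synset>
  <Synset id="9999-n" ili="in">
    <Definition>too short</Definition>
    <SynsetRelation relType="see_also" target="example-en-0000-n"/>
  </Synset>
</Lexicon>
"""


def english_definition(synset, lexicon_language):
    """Return the first English definition text, or an empty string."""
    for definition in synset.findall("Definition"):
        if definition.get("language", lexicon_language) == "en":
            return (definition.text or "").strip()
    return ""


def linked_to_ili(sid, synsets):
    """Follow outgoing relations (except see_also) to an existing ILI concept."""
    seen, stack = set(), [sid]
    while stack:
        current = stack.pop()
        if current in seen or current not in synsets:
            continue
        seen.add(current)
        if ILI_ID.fullmatch(synsets[current].get("ili", "")):
            return True
        for rel in synsets[current].findall("SynsetRelation"):
            if rel.get("relType") != "see_also":
                stack.append(rel.get("target"))
    return False


def check_synsets(lexicon):
    """Return a list of (synset id, problem description) pairs."""
    problems = []
    prefix = lexicon.get("id") + "-"
    language = lexicon.get("language")
    synsets = {s.get("id"): s for s in lexicon.iter("Synset")}
    for sid, synset in synsets.items():
        if not sid.startswith(prefix):
            problems.append((sid, "id does not start with the project id"))
        ili = synset.get("ili")
        if ili is None or not (ili in ("", "in") or ILI_ID.fullmatch(ili)):
            problems.append((sid, "ili must be an ILI id, 'in' or empty"))
        if ili == "in":
            text = english_definition(synset, language)
            # At least 20 characters or 5 words is enough.
            if len(text) < 20 and len(text.split()) < 5:
                problems.append((sid, "English definition is too short"))
            if not linked_to_ili(sid, synsets):
                problems.append((sid, "not linked to an existing ILI concept"))
        for rel in synset.findall("SynsetRelation"):
            if rel.get("target") not in synsets:
                problems.append((sid, "relation target is not in the file"))
    return problems


for sid, message in check_synsets(ET.fromstring(SAMPLE)):
    print(sid, message)
```

In this sample only the synset with id 9999-n is reported; the other two satisfy every rule the sketch checks.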
Tools for constructing GWA-LMF
- We have a script for converting the simple tsv used in OMW 1.0 to GWA-LMF.
- WordnetLoom can export to GWA-LMF.
- An interconverter between formats is also available (external tool).
References
The basic structure of the OMW and CILI is described here (this web page is more up-to-date):
- Piek Vossen, Francis Bond and John P. McCrae (2016). Toward a truly multilingual Global Wordnet Grid. In Eighth meeting of the Global WordNet Conference (GWC 2016), Bucharest.
- Piek Vossen, Francis Bond, John P. McCrae and Christiane Fellbaum (2016). CILI: the Collaborative Interlingual Index. In Eighth meeting of the Global WordNet Conference (GWC 2016), Bucharest.