
<!DOCTYPE html>
<html lang="en">
	<!-- todo: validate, e.g., using https://validator.w3.org/ -->
  <head>
    <title>The Ontolex Module for Frequency, Attestation and Corpus Information</title>
    <meta charset='utf-8'>
    <script src='http://www.w3.org/Tools/respec/respec-w3c-common'
            async class='remove'></script>
	<link rel="stylesheet" href="stylesheets/codemirror.css">


	<!-- style taken from the ontolex report -->
	<style>
h2 a {
    color:#444;
    text-decoration:none;
}
h2 :link {
    color:#444;
}
h2 :visited {
    color:#444;
}
h1 a {
    color:#444;
    text-decoration:none;
}
h1 :link {
    color:#444;
}
h1 :visited {
    color:#444;
}
.entity { border:1px solid #000080; width:80%; max-width:80%; margin-left:auto; margin-right:auto; margin-bottom:30px; margin-top:30px; padding: 10px; }
.entity h3 { margin-top:3px;padding-bottom:5px;border-bottom:1px solid #000080; }
.description { border-top:1px dashed #808080; border-bottom:1px dashed #808080; margin-top:5px; padding-bottom:5px; }
img.example { max-width:100%; margin-left:auto;margin-right:auto;display:block;}
.beispiel { border: 1px dashed #808080; width:80%; margin-left:auto; margin-right:auto; margin-bottom:30px; margin-top:30px; overflow:hidden;}
.beispiel img { text-align:center; margin: 0px; border: 1px solid #000080;}
.beispiel pre { border:none; font-size:130%;}
.beispiel a { display:block;  margin: 20px; padding:0px;}
.caption {display:none;}
.tn img {max-width:100%;}

</style>


	<script src="javascripts/codemirror-compressed.js"></script>
	<script src="http://codemirror.net/mode/sparql/sparql.js"></script>
	<script src="http://codemirror.net/addon/runmode/runmode.js"></script>
	<script src="http://codemirror.net/addon/runmode/colorize.js"></script>

    <script class='remove'>
      var respecConfig = {

          specStatus: "CG-DRAFT",
          doRDFa: "1.1",
          shortName:  "ontolex-frac",
		  publishDate:  "2018-11-09",
          editors: [
                {   name:       "Christian Chiarcos",
                    url:        "http://acoli.informatik.uni-frankfurt.de/",
                    company:    "Applied Computational Linguistics, Goethe Universität Frankfurt, Germany",
                    companyURL: "http://informatik.uni-frankfurt.de" },

				{   name:       "editor2"},
				{	name:		"editor3"}
          ],
		  authors: [
                {   name:       "author1"},
				 {   name:       "author2"},
				  {   name:       "..."},

          ],
		  previousMaturity: "CG-DRAFT",
	      previousPublishDate:  "",
          wg:           "Ontology Lexica",
          wgURI:        "http://www.w3.org/community/ontolex/",
          wgPublicList: "http://lists.w3.org/Archives/Public/public-ontolex/",
//          wgPatentURI:  "http://www.w3.org/2004/01/pp-impl/424242/status",
      };
    </script>
    <script src="javascripts/codemirror.js"></script>
  </head>
  <body>
    <section id='abstract'>
      <p>
       This document describes the <em>module for frequency, attestation and corpus information</em> of the Lexicon Model for Ontologies (<em>lemon</em>) as a result of the work of the Ontology Lexica community group (OntoLex). The module is targeted at complementing dictionaries and other linguistic resources containing lexicographic data with a vocabulary to express</p>

	   <ul>
	   <li> corpus-derived statistics (frequency and cooccurrence information, collocations),</li>
	   <li> pointers from lexical resources to corpora and other collections of text (attestations),</li>
	   <li> the annotation of corpora and other language resources with lexical information (lemmatization against a dictionary), and</li>
	   <li> distributional semantics (collocation vectors, word embeddings, sense embeddings, concept embeddings).</li>
	   </ul>

	   <p>
	   The module tackles use cases in corpus-based lexicography, corpus linguistics and natural language processing, and operates in combination with the <em>lemon</em> core module, referred to as <em>OntoLex</em>, as well as with other <em>lemon</em> modules.
	  </p>
    </section>

    <section id='sotd'>
      <p>This document is a working draft for a module for frequency, attestation and corpus data of the OntoLex specifications.
       It is not a W3C Standard nor is it on the W3C Standards Track.</p>
      <p>There are a number of ways that one may participate in the development of this report:</p>
      <ul>
      <li>Mailing list: <a href="http://lists.w3.org/Archives/Public/public-ontolex/">public-ontolex@w3.org</a>
      <li>Wiki: <a href="https://www.w3.org/community/ontolex/wiki/Main_Page">Main page</a>
      <li>More information about meetings of the ONTOLEX group can be obtained
        <a href="https://www.w3.org/community/ontolex/wiki/Main_Page#Meetings">here</a></li>

	<li><a href="https://github.com/acoli-repo/ontolex-frac">Source code</a>
	     for this document can be found on Github.</li>
   </ul>
   
   <p>Disclaimer: This draft closely follows the structure and design of <a href="https://jogracia.github.io/ontolex-lexicog/">The Ontolex Lexicography Module. Draft Community Group Report 28 October 2018</a>, edited by Julia Bosque-Gil and Jorge Gracia. In particular, parts of the motivational and introductory text are adapted from that report without being marked as quotes. These passages are to be replaced by original text before publication.
   </p>
    </section>

   <section>
      <h2>Introduction</h2>

	  <section>
	  <h3> Background and Motivation </h3>
      <p> The <a href="https://www.w3.org/2016/05/ontolex/"><em>lemon</em> model</a> provides a <a href="https://www.w3.org/2016/05/ontolex/#core">core</a> vocabulary (OntoLex) to represent <em>linguistic information</em> associated with ontology and vocabulary elements. The model follows the principle of <em>semantics by reference</em> in the sense that the semantics of a <a href="https://www.w3.org/2016/05/ontolex/#LexicalEntry">lexical entry</a> is expressed by reference to an individual, class or property defined in an ontology. </p>
	  
	  <p> The current version of <em>lemon</em> (as an outcome of the OntoLex group, sometimes referred to as OntoLex-lemon in the literature) as well as its previous version (<a href="https://lemon-model.net/">lemon</a> [<cite><a href="#bib-lemon_paper">1</a></cite>]) have been increasingly used in the context of dictionaries and lexicographical data to convert existing lexicographic information into the standards and formats of the Semantic Web. As a consequence, a dedicated <em>lemon</em> <a href="">module for lexicography</a> (<em>lexicog</em>) has been designed, with applications in monolingual [<cite><a href="#bib-klimek-kdict">2</a></cite>], bilingual [<cite><a href="#bib-gracia-apertium">3</a></cite>], and multilingual [<cite><a href="#bib-bosque-kdict">4</a></cite>] dictionaries, as well as diachronic [<cite><a href="#bib-kahn-diachronic">5</a></cite>], dialectal [<cite><a href="#bib-declerck-dialectal">6</a></cite>], and etymological ones [<cite><a href="#bib-abromeit-etymological">7</a></cite>], among others.
	  This module is partially motivated by requirements of corpus-based lexicography (frequency and collocation information) and digital philology (linking lexical resources with corpus data).</p>
	  
	  <p> A second motivation for a <em>lemon</em> model for corpus-based information comes from natural language processing. With the rise of distributional semantics since the early 1990s, lexical semantics has been complemented by corpus-based co-occurrence statistics (KEYNESS-REFERENCE???), collocation vectors (Schütze 1993), word embeddings (Collobert et al. 2012) and sense embeddings (??? and Schütze, 2017). With the proposed module, <em>lemon</em> can serve as a community standard to encode, store and exchange vector representations (embeddings) along with the lexical concepts, senses, lemmas or words that they represent. The processing of word embeddings is beyond the scope of this module. Embeddings are thus represented as literals ("BLOB").</p>
	  
	  <p> The added value of using linked data technologies to represent such information is an increased level of interoperability and integration between different types of lexical resources, the textual data they pertain to, as well as distributional representations of words, lexical senses and lexical concepts. Creating a <em>lemon</em> module in the OntoLex CG is a suitable means for establishing a vocabulary on a broad consensus that takes into account all use cases identified above in an adequate fashion.	 
	<!--	From lexicog:  
	<p> After analysing the literature, the proposers of this module perceived an obvious need for reaching some agreement that allows for a better and more inter-operable migration of existing dictionaries into linked data [<cite><a href="#bib-bosque-module">10</a></cite>]. For illustration, the OGL ontology [<cite><a href="#bib-parvizi-oxford">11</a></cite>] has its own notion of dictionary entry  materialised in the <tt>ogl:Entry</tt> class, while in [<cite><a href="#bib-bosque-kdict">4</a></cite>] the ad-hoc  <tt>kd:dictionaryEntry</tt> relation was introduced in the conversion of the KD Multilingual Global Series dictionaries, i.e, different researchers introduced their own modelling solutions to account for similar notions. Being interoperability a key issue in linked data technologies, building a common space in which these concepts can be agreed on and commonly defined comes as a logical step.--> 
	The OntoLex community is the natural forum to accomplish this for several reasons: </p>

	   <ol>
		<li> the extended use of <em>lemon</em> to support digital lexicography,
		<li> the improved application and applicability of <em>lemon</em> in natural language processing,
		<li> the convergence of the lexicography, AI and human language technology communities and their respective resources, and
		<li> the possibility of reusing mechanisms already available in <em>lemon</em>, which keeps researchers from "re-inventing the wheel".
	   </ol>

	  </section>
	  
	  <section> 
	  <h3> Aim and Scope </h3>

	  <p>
		The goal of this module is to complement <em>lemon</em> core elements with a vocabulary layer to represent lexicographical and semantic information derived from or defined with reference to corpora and external resources in a way that (a) <i>generalizes</i> over use cases from digital lexicography, natural language processing, artificial intelligence, computational philology and corpus linguistics, that (b) facilitates <i>exchange, storage and re-usability</i> of such data along with lexical information, 
		and that (c) <i>minimizes information loss</i>.
	  </p>

	<p> The scope of the model is three-fold:
		<ol>
			<li> extending the <em>OntoLex-lexicog</em> model with corpus information to support existing challenges in corpus-driven lexicography,</li>
			<li> modelling <em>existing</em> lexical and distributional-semantic resources (corpus-based dictionaries, collocation dictionaries, embeddings) as linked data, to allow their conjoint publication and inter-operation by Semantic Web standards, and
			<li> providing a conceptual / abstract model of relevant concepts in <em>distributional semantics</em> that facilitates building linked data-based applications that consume and combine both lexical and distributional information.
		</ol>
	
	<div class="note"><p>
		<em>Corpus</em> as used throughout this document is understood in its traditional, broader sense as a structured data collection -- or material suitable for being included into such a collection, such as manuscripts or other works.
		We do not intend to limit the use of the term to corpora in a linguistic or NLP sense. Language resources of any kind (web documents, dictionaries, plain text, unannotated corpora, etc.) are considered "corpus data" and a collection of such information as a "corpus" in this sense. Any information drawn from or pertaining to such information is considered "corpus-based".
	</p>
	</div>
	
	<!--
	lexicog:
	<div class="note"><p>
	In terms of applying the module, we propose the following best practice or "rule of thumb" ... :
		<ol>
			<li> As long as the entities in OntoLex and the other  <em>lemon</em> modules, together with those of catalogues of linguistic categories (e.g. LexInfo), suffice to represent the information encoded in the lexicographic resource (e.g., lexical entry, part of speech, translation, ...), the OntoLex lexicography module should not be instantiated.
			<li> In case that there is some lexicographic information that cannot be modelled by using either OntoLex or any of the other <em>lemon</em> modules (e.g., to denote sense ordering), then the model should be instantiated but avoiding duplicities and keeping extra information to the minimum.
		</ol>
	The reason behind this is that this module adds some complexity by providing additional description capabilities to the purely lexical description accounted by OntoLex. If this information is not needed for a specific conversion, i.e, if the lexicographical view is not key, reusing <em>lemon</em> would allow to keep the representation simpler but still sufficient.

	</p></div-->


	  </section>

	  <section>
	  <h3> Namespaces </h3>

	  This is a list of relevant namespaces that will be used in the rest of this document:

	  <p> OntoLex module for frequency, attestation and corpus information
	   <pre><code class="cm">
		@prefix frac: &lt;http://www.w3.org/ns/lemon/frac#&gt; .
		 </code>
		</pre>
	  </p>

	  <p> OntoLex (core) model and other <em>lemon</em> modules:
	   <pre><code class="cm">
		@prefix ontolex: &lt;http://www.w3.org/ns/lemon/ontolex#&gt; .
		@prefix synsem: &lt;http://www.w3.org/ns/lemon/synsem#&gt; .
		@prefix decomp: &lt;http://www.w3.org/ns/lemon/decomp#&gt; .
		@prefix vartrans: &lt;http://www.w3.org/ns/lemon/vartrans#&gt; .
		@prefix lime: &lt;http://www.w3.org/ns/lemon/lime#&gt; .
		@prefix lexicog: &lt;http://www.w3.org/ns/lemon/lexicog#&gt; .
		 </code>
		</pre>
	  </p>

		<p> Other models [TO REVIEW]:
	   <pre><code class="cm">
		@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;.
		@prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt;.
		@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;.
		@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt;.
		@prefix dbr: &lt;http://dbpedia.org/resource/&gt;.
		@prefix dbo: &lt;http://dbpedia.org/ontology/&gt;.
		@prefix void: &lt;http://rdfs.org/ns/void#&gt;.
		@prefix lexinfo: &lt;http://www.lexinfo.net/ontology/2.0/lexinfo#&gt;.
		@prefix dct: &lt;http://purl.org/dc/terms/&gt;.
		@prefix provo: &lt;http://www.w3.org/ns/prov#&gt;.
		@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;.
		@prefix oa: &lt;http://www.w3.org/ns/oa#&gt;.
		@prefix aat: &lt;http://vocab.getty.edu/aat/>.</code>
  	   </pre>
	  </p>


	  </section>
	  <section> 
	  <h3> ontolex:Element </h3>

	  	<p>
		We consider all <em>lemon</em> core concepts as being countable, annotatable/attestable and suitable for a numerical representation by means of a vector (embedding). For this reason, we define the rdfs:domain of all properties that link lexical and corpus information by means of ontolex:Element, an abstract superclass of 
		ontolex:Form (for word frequency and plain word/phrase embeddings),
		ontolex:LexicalEntry (for lemma frequency and lemma-based word/phrase embeddings), 
		ontolex:LexicalSense (for sense frequency and sense embeddings), and
		ontolex:LexicalConcept (for concept frequency and concept embeddings).
				
	    <figure>
        <img src="img/ontolex-element.png" 
			 title="ontolex:Element" 
			 alt="ontolex-element.png" width="80%"><figcaption>ontolex:Element as a superclass of ontolex:LexicalEntry, ontolex:Form, ontolex:LexicalSense and ontolex:LexicalConcept</figcaption>
		</figure>

		<div class="note"><p>
		Such a top-level concept used to exist in <em>Monnet-lemon</em>, but has been abandoned in the 2016 edition of <em>lemon</em>.
		If this concept is not provided by a future revision of the <em>lemon</em> core vocabulary, it will be introduced by this module.
		Note that the introduction of ontolex:Element has no effect on <em>lemon</em> core other than facilitating vocabulary organization, as ontolex:Element is not to be used for data modeling.</p></div>
		
	</p>
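	<p>The class hierarchy described in this section can be sketched in Turtle as follows. This is a minimal illustration under the assumption that <tt>ontolex:Element</tt> is declared as an OWL class; it is not an excerpt from a published ontology file, and, as noted above, the class is meant for vocabulary organization only.</p>

	<div class='beispiel'>
		<div>
		<pre><code>
# sketch only: ontolex:Element as an abstract superclass
# (assumed declaration; not part of the 2016 edition of lemon core)
ontolex:Element a owl:Class .

ontolex:Form           rdfs:subClassOf ontolex:Element .  # word frequency, word/phrase embeddings
ontolex:LexicalEntry   rdfs:subClassOf ontolex:Element .  # lemma frequency, lemma-based embeddings
ontolex:LexicalSense   rdfs:subClassOf ontolex:Element .  # sense frequency, sense embeddings
ontolex:LexicalConcept rdfs:subClassOf ontolex:Element .  # concept frequency, concept embeddings</code></pre>
		</div>
	</div>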


   </section>
</section>

    <section>
      <h2>Overview</h2>

	  The following diagram depicts the OntoLex module for frequency, attestation and corpus information (<i>frac</i>). Boxes represent classes of the model. Arrows with filled heads represent object properties. Arrows with empty heads represent rdfs:subClassOf. 
	  Vocabulary elements introduced by this module are shaded grey (classes) or set in <i>italics</i>.

    <figure>
        <!--img src="img/ontolex-frac-2018-11.png" title="ontolex-frac-2018-11.png" alt="ontolex-frac-2018-11.png" width="80%"><figcaption>Module for Frequency, Attestation and Corpus Information (<i>frac</i>), overview</figcaption-->
		<img src="img/ontolex-frac-2019-03.png" title="ontolex-frac-2019-03.png" alt="ontolex-frac-2019-03.png" width="80%"><figcaption>Module for Frequency, Attestation and Corpus Information (<i>frac</i>), overview</figcaption>
    </figure>
	
	<div class="note"><p>DISCUSSION:
		Looks more complicated than it is. Shall we drop inferrable information ? (rdf:rest, rdf:first are available vocabulary elements because ContextualRelation is a subclass of rdf:List, subclasses of ontolex:Element should be dropped once ontolex:Element is introduced.)
		
		Keep rdf:List elements only if preserved in other ontolex modules.
	</p>
	</div>
	</section>

	<section>
	<h2>Definitions</h2>

	<section>
	<h3>Frequency</h3>
	
	<p> Frequency information is a crucial component in human language technology. Corpus-based lexicography originates with Francis and Kucera (1958), and subsequently, the analysis of frequency distributions of word forms, lemmas and other linguistic elements has become a standard technique in lexicography and philology, and given rise to the field of corpus linguistics.
	At its core, this means that lexicographers use corpus frequency and distribution information while compiling lexical entries (also see the section on collocations and similarity below). 
	As a qualitative assessment, frequency can be expressed with <a href="http://www.lexinfo.net/ontology/2.0/lexinfo#frequency">lexinfo:frequency</a>, "[t]he relative commonness with which a term occurs". However, this is an object property with possible values lexinfo:commonlyUsed, lexinfo:infrequentlyUsed, lexinfo:rarelyUsed, while absolute counts over a particular resource (corpus) require novel vocabulary elements.
	</p>
	<p>
	Absolute frequencies are used in computational lexicography (e.g., the <a href="http://oracc.museum.upenn.edu/epsd2/">Electronic Penn Sumerian Dictionary</a>), and they are an essential piece of information for NLP and corpus linguistics.
	To avoid confusion with lexinfo:Frequency, frequency in this module is always defined with reference to a particular dataset, a corpus.
	</p>
	
	<p><div class='entity'>
       <h3>frequency (ObjectProperty)</h3>

           <div>
           <p><strong>URI:</strong> <a href="http://www.w3.org/ns/lemon/frac#frequency" class="uri">http://www.w3.org/ns/lemon/frac#frequency</a></p>
    		</div>
    		<div class='comment'>
    			<p>The property <strong>frequency</strong> assigns a particular ontolex:Element a frac:CorpusFrequency.</p>
    		</div>
    		<div class='description'>
    			<p><strong>rdfs:domain</strong> ontolex:Element</p>
    			<p><strong>rdfs:range</strong> frac:CorpusFrequency</p>
    		</div>
    </div></p>
	
	<p><div class='entity'>
       <h3>CorpusFrequency (Class)</h3>

           <div>
           <p><strong>URI:</strong> <a href="http://www.w3.org/ns/lemon/frac#CorpusFrequency" class="uri">http://www.w3.org/ns/lemon/frac#CorpusFrequency</a></p>
    		</div>
    		<div class='comment'>
    			<p><strong>Corpus frequency</strong> provides the absolute number of attestations (rdf:value) of a particular ontolex:Element (see frac:frequency) in a particular language resource (dct:source).</p>
    		</div>
    		<div class='description'>
    			<p><strong>SubClassOf:</strong> rdf:value exactly 1 xsd:int, dct:source min 1</p>
    		</div>
    </div></p>
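	<p>Read as OWL axioms, the two definitions above could be serialized roughly as follows. The sketch assumes that the expression <tt>rdf:value exactly 1 xsd:int</tt> is meant as a qualified cardinality restriction, and that the subject of <tt>frac:frequency</tt> is the ontolex:Element and its value the frac:CorpusFrequency, as stated in the property description. The normative axioms of the module may differ.</p>

	<div class='beispiel'>
		<div>
		<pre><code>
# sketch only: one possible OWL reading of the definitions above
frac:frequency a owl:ObjectProperty ;
  rdfs:domain ontolex:Element ;
  rdfs:range  frac:CorpusFrequency .

frac:CorpusFrequency a owl:Class ;
  rdfs:subClassOf
    # exactly one absolute count, typed as xsd:int
    [ a owl:Restriction ;
      owl:onProperty rdf:value ;
      owl:qualifiedCardinality "1"^^xsd:nonNegativeInteger ;
      owl:onDataRange xsd:int ] ,
    # at least one language resource the count was taken from
    [ a owl:Restriction ;
      owl:onProperty dct:source ;
      owl:minCardinality "1"^^xsd:nonNegativeInteger ] .</code></pre>
		</div>
	</div>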

	<div class="note">
	<p>If information from multiple language resources is aggregated (also cf. the section on embeddings below), multiple <tt>dct:source</tt> statements should be provided, one for each resource. The cardinality of <tt>dct:source</tt> is thus 1 or higher.
	</p>
	</div>
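	<p>A count aggregated over two resources might then look as follows. The entry, both corpus URIs and the count are hypothetical placeholders chosen for illustration; they do not describe an existing dataset.</p>

	<div class='beispiel'>
		<div>
		<pre><code>
# sketch only: hypothetical entry, hypothetical corpora, invented count
:water_n a ontolex:LexicalEntry ;
  frac:frequency [
    a frac:CorpusFrequency ;
    rdf:value "120"^^xsd:int ;                        # placeholder value
    dct:source &lt;http://example.org/corpus/A> ,        # one dct:source statement
               &lt;http://example.org/corpus/B> ] .      # per aggregated resource</code></pre>
		</div>
	</div>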
	
	<div class="note">
	<p>QUESTION: better alternative to dct:source?</p>
	</div>
	
	<p>
	The following example illustrates word and form frequencies for the Sumerian word <i>a</i> (n.) "water" from the <a href="http://oracc.museum.upenn.edu/epsd2/sux">Electronic Penn Sumerian Dictionary</a> and the frequencies of the underlying corpus.
	</p>
	
	  <div class='beispiel'>
		<!--p><a href='Examples/example1.png' class='tn'/><img src='Examples/example1.png'/></a></p-->
		<div>
		<pre><code>
# word frequency, over all form variants
epsd:a_water_n a ontolex:LexicalEntry;
 frac:frequency [ 
  a frac:CorpusFrequency;
  rdf:value "4683"^^xsd:int;
  dct:source &lt;http://oracc.museum.upenn.edu/epsd2/pager> ] .

# form frequency for individual orthographical variants
epsd:a_water_n ontolex:canonicalForm [
 ontolex:writtenRep "𒀀"@sux-Xsux, "a"@sux-Latn;
 frac:frequency [
  a frac:CorpusFrequency;
  rdf:value "4656"^^xsd:int;
  dct:source &lt;http://oracc.museum.upenn.edu/epsd2/pager> ] ] .

epsd:a_water_n ontolex:otherForm [
 ontolex:writtenRep "𒀉"@sux-Xsux, "a2"@sux-Latn;
 frac:frequency [
  a frac:CorpusFrequency;
  rdf:value "1"^^xsd:int;
  dct:source &lt;http://oracc.museum.upenn.edu/epsd2/pager> ] ] .

epsd:a_water_n ontolex:otherForm [
 ontolex:writtenRep "𒂊"@sux-Xsux, "e"@sux-Latn;
 frac:frequency [
  a frac:CorpusFrequency;
  rdf:value "24"^^xsd:int;
  dct:source &lt;http://oracc.museum.upenn.edu/epsd2/pager> ] ].</code></pre>
		</div>
	  </div>
	  
	  <p>
	  The example shows orthographic variation (in the original writing system, Sumerian Cuneiform sux-Xsux, and its Latin transcription sux-Latn). It is slightly simplified in that the ePSD2 provides individual counts for different periods and only three of six orthographical variants are given here. Note that these are orthographical variants, not morphological variants (which are not given in the dictionary).
	  </p>
	  
	  <div class="note">
	  <p>It is necessary to provide the link to the underlying corpus <em>for every frequency assessment</em> because the same element may receive different counts over different corpora. For data modelling, it is recommended to define a corpus- or collection-specific subclass of frac:CorpusFrequency with a fixed dct:source value. This leads to more compact data and avoids potential difficulties with the Open World Assumption (interpretability of incomplete data).

	  <div class='beispiel'>
		<div>
			<pre>
				<code>
# Corpus Frequency in the EPSD corpus
:EPSDFrequency rdfs:subClassOf frac:CorpusFrequency.
:EPSDFrequency rdfs:subClassOf
 [ a owl:Restriction ;
   owl:onProperty dct:source ;
   owl:hasValue &lt;http://oracc.museum.upenn.edu/epsd2/pager> ] .

# frequency assessment
epsd:a_water_n frac:frequency [ 
  a :EPSDFrequency;
  rdf:value "4683"^^xsd:int ].</code>
	</pre>
	</div>
	</div>
	</div>
	
    <div class="note">
	<p>frac:CorpusFrequency can be extended with additional filter conditions to define sub-corpora. For example, we can restrict the subcorpus to a particular time period, e.g., the Neo-Sumerian Ur III period:
	
	<div class="beispiel">
		<div>
			<pre>
				<code>
# EPSD frequency for the Ur-III period (aat:300019910)
:EPSDFrequency_UrIII 
 rdfs:subClassOf :EPSDFrequency;
 rdfs:subClassOf
   [ a owl:Restriction ;
     owl:onProperty dct:temporal ;
     owl:hasValue aat:300019910 ] .

# frequency assessment for sub-corpus
epsd:a_water_n frac:frequency [ 
  a :EPSDFrequency_UrIII;
  rdf:value "2299"^^xsd:int ].
  </code></pre></div></div></div>

  </section>
	 
	 
	 <section>
	 <h3>Attestation</h3>
	 <div class="note">
		<p>This is an attempt at a consensus model based on Depuydt and de Does (2018) and Khan and Boschetti (2018). We focus on data structures; the following aspects are not covered: datatype properties regarding confidence (assumed to be in lexinfo), bibliographical details (subject to other vocabularies), and details of resource linking (subject to other vocabularies).</p>
		
	    <figure>
        <img src="img/attestations-lexcit.png" 
			 title="Depuydt and de Does (2018)" 
			 alt="img/attestations-lexcit.png" width="80%"/><figcaption>Attestation module following Depuydt and de Does (2018)</figcaption>
		</figure>

	    <figure>
        <img src="img/attestations-khan-boschetti.png" 
			 title="Khan and Boschetti (2018)" 
			 alt="img/attestations-khan-boschetti.png" width="80%"/><figcaption>Attestation module following Khan and Boschetti (2018)</figcaption>
		</figure>
	 </div>
	 
	 <p>"Lexicographers use examples to support their analysis of the headword. The examples can either be
	authentic (exact quotations), adapted (modified versions of authentic examples) or invented examples. 
	Authentic examples are attributed quotations (citations), which not only elucidate
	meaning and illustrate features of the headword (spelling, syntax, collocation, register etc.), but also
	function as attestations and are used provide evidence of the existence of a headword.
	We therefore call these examples “attestations”." (Depuydt and de Does 2018)
	</p>
	
		<p><div class='entity'>
       <h3>attestation (ObjectProperty)</h3>

           <div>
           <p><strong>URI:</strong> <a href="http://www.w3.org/ns/lemon/frac#attestation" class="uri">http://www.w3.org/ns/lemon/frac#attestation</a></p>
    		</div>
    		<div class='comment'>
    			<p>The property <strong>attestation</strong> assigns a particular ontolex:Element a frac:Attestation.</p>
    		</div>
    		<div class='description'>
    			<p><strong>rdfs:domain</strong> ontolex:Element</p>
    			<p><strong>rdfs:range</strong> frac:Attestation</p>
    		</div>
    </div></p>
	
	<p><div class='entity'>
       <h3>Attestation (Class)</h3>

           <div>
           <p><strong>URI:</strong> <a href="http://www.w3.org/ns/lemon/frac#Attestation" class="uri">http://www.w3.org/ns/lemon/frac#Attestation</a></p>
    		</div>
    		<div class='comment'>
    			<p>An <strong>Attestation</strong> is normally an exact or normalized quotation or excerpt from a source document that illustrates a particular form, sense or lexeme in authentic data.
				Attestations should be accompanied by a <tt>Citation</tt> or the URI of a digital edition of the respective locus (<tt>dct:source</tt>). This URI can be externally defined (e.g., as an <tt>oa:Annotation</tt> or as a NIF URI), and can refer either to the entire work or to the exact location of the attestation within this source.</p>
    		</div>
    		<div class='description'>
    			<p><strong>SubClassOf:</strong> rdf:quotation exactly 1 xsd:string</p>
    		</div>
    </div></p>
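	<p>An attestation with a pointer to its source could be encoded along the following lines. The entry URI <tt>:frak</tt> is a hypothetical placeholder. The quotation and its source are the Wikiquote sample that this document also uses in the section on embeddings, and the quotation property is written exactly as it appears in the class description above.</p>

	<div class='beispiel'>
		<div>
		<pre><code>
# sketch only: :frak is a hypothetical lexical entry
:frak a ontolex:LexicalEntry ;
  frac:attestation [
    a frac:Attestation ;
    # quotation property named as in the class description above
    rdf:quotation "It's in the frakking ship!" ;
    # URI of the work from which the quotation is taken
    dct:source &lt;https://en.wikiquote.org/wiki/Battlestar_Galactica_(2003)> ] .</code></pre>
		</div>
	</div>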

	<p><div class='entity'>
       <h3>citation (ObjectProperty)</h3>

           <div>
           <p><strong>URI:</strong> <a href="http://www.w3.org/ns/lemon/frac#citation" class="uri">http://www.w3.org/ns/lemon/frac#citation</a></p>
    		</div>
    		<div class='comment'>
    			<p>The property <strong>citation</strong> assigns a particular ontolex:Element a frac:Citation.</p>
    		</div>
    		<div class='description'>
    			<p><strong>rdfs:domain</strong> ontolex:Element</p>
    			<p><strong>rdfs:range</strong> frac:Citation</p>
    		</div>
    </div></p>

	<p><div class='entity'>
       <h3>Citation (Class)</h3>

           <div>
           <p><strong>URI:</strong> <a href="http://www.w3.org/ns/lemon/frac#Citation" class="uri">http://www.w3.org/ns/lemon/frac#Citation</a></p>
    		</div>
    		<div class='comment'>
    			<p>A <strong>Citation</strong> is a bibliographical reference to a source for the definition or illustration of a particular sense, form or lexeme. A citation <i>can</i> provide an attestation, but can also stand on its own.
    		</div>
    </div></p>
	
	<div class="note"><p>Details of bibliographical references are beyond the scope of the current proposal. Several designated vocabularies exist, e.g., FaBiO and CiTO,
		<!-- Peroni, S., & Shotton, D. (2012). FaBiO and CiTO: Ontologies for describing bibliographic resources and citations. Web Semantics: Science, Services and Agents on the World Wide Web, 17: 33–43. DOI: 10.1016/j.websem.2012.08.001 -->
		Bibo,
		<!-- D'Arcus, B., & Giasson, F. (2009). Bibliographic Ontology Specification. Specification Document, 4 November 2009. Retrieved April 9, 2014, from http://bibliontology.com/ -->
		the Open Citation Corpus,
		<!-- Shotton, D. (2013). Publishing: Open citations. Nature, 502(7471): 295–297. DOI: 10.1038/502295a -->
		SpringerNature SciGraph,
		<!-- https://scigraph.springernature.com/explorer -->
		BiRO and C4O.
		<!-- Di Iorio, A., Nuzzolese, A. G., Peroni, S., Shotton, D. M., & Vitali, F. (2014, May). Describing bibliographic references in RDF. In SePublica. -->
	</p>
	</div>
	
	<p><div class='entity'>
       <h3>makeAttestation (ObjectProperty)</h3>

           <div>
           <p><strong>URI:</strong> <a href="http://www.w3.org/ns/lemon/frac#makeAttestation" class="uri">http://www.w3.org/ns/lemon/frac#makeAttestation</a></p>
    		</div>
    		<div class='comment'>
    			<p>The property <strong>makeAttestation</strong> assigns a particular Citation a frac:Attestation.</p>
    		</div>
    		<div class='description'>
    			<p><strong>rdfs:domain</strong> frac:Citation</p>
    			<p><strong>rdfs:range</strong> frac:Attestation</p>
    		</div>
    </div></p>
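	<p>The interplay of <tt>citation</tt> and <tt>makeAttestation</tt> might be sketched as follows, reading <tt>makeAttestation</tt> as a link from a citation to the attestation it provides. The URIs <tt>:frak</tt> and <tt>:cit1</tt> are hypothetical, and the use of <tt>dct:bibliographicCitation</tt> for the reference string is an assumption made for this illustration only, as bibliographical details are outside the scope of this module.</p>

	<div class='beispiel'>
		<div>
		<pre><code>
# sketch only: :frak and :cit1 are hypothetical URIs
:frak a ontolex:LexicalEntry ;
  frac:citation :cit1 .

:cit1 a frac:Citation ;
  # assumed property for the bibliographical reference string
  dct:bibliographicCitation "Battlestar Galactica (2003), as quoted on Wikiquote" ;
  # the citation provides an attestation
  frac:makeAttestation [
    a frac:Attestation ;
    rdf:quotation "It's a frakking Cylon." ;
    dct:source &lt;https://en.wikiquote.org/wiki/Battlestar_Galactica_(2003)> ] .</code></pre>
		</div>
	</div>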
	
	<div class="note"><p>CC: The naming follows Khan and Boschetti (2018). I am not too happy with the name, though: it is so close to <tt>attestation</tt> that the two will likely be confused.</p></div>
	
	 </section>
	 
	 <section>
	 <h3>Embeddings</h3>
	 
	 <p>In distributional semantics, the contexts in which a word is attested are taken to define its meaning. Contextual similarity is thus a correlate of semantic similarity. Different representations of context are possible; the most prominent one to date is the vector.
	 A word vector can be created, for example, by means of a reference list of vocabulary items, where every reference word is associated with a fixed position, e.g., <i>ship</i> with position 1, <i>ocean</i> with 2, <i>sky</i> with 3, etc. 
	 Given a corpus (and a selection criterion for collocates, e.g., occurrence within the same sentence), every word in the corpus can then be described by the frequencies with which the reference words occur as its collocates.
	 Assume we want to define the meaning of <i>frak</i>, with (exactly) the following attestations in our sample corpus (random samples from <a href="https://en.wikiquote.org/wiki/Battlestar_Galactica_(2003)">wikiquote</a>):
	 
	 <ul>
	 <li><i>It's in the frakking ship!</i></li>
	 <li><i>Have you lost your frakkin' mind?</i></li>
	 <li><i>Oh, for frak's sake, let me see if I can make heads or tails of it.</i></li>
	 <li><i>It's a frakking Cylon.</i></li>
	 <li><i>Our job isn't to be careful, it's to shoot Cylons out of the frakking sky!</i></li>
	 </ul>
	 
	 With the following list of reference words: <tt>(ship, ocean, lose, find, brain, mind, head, sky, Cylon, ...)</tt>, we obtain the vector <tt>(1,0,1,0,0,1,1,1,2,...)</tt> for the lemma (lexical entry) <i>frak</i>. For practical applications, these vectors are projected into lower-dimensional spaces, e.g., by means of statistical (Schütze 1993) or neural methods (Socher et al. 2011). 
	 <!-- Socher, R., Huang, E. H., Pennin, J., Manning, C. D., & Ng, A. Y. (2011). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in neural information processing systems (pp. 801-809). --> 
	 The process of mapping a word to a numerical vector and its result are referred to as "word embedding". Aside from collocation counts, other methods for creating word embeddings do exist, but they are always defined relative to a corpus.
	 </p>
	 
	 <p>Embeddings have become a dominant paradigm in natural language processing and machine learning, but, if compiled from large corpora, they require long training periods and thus tend to be re-used. 
	 However, published embeddings often use tool-specific binary formats (cf. <a href="https://radimrehurek.com/gensim/models/word2vec.html">Gensim</a>), and thus a portability problem arises. 
	 CSV and related formats (cf. <a href="https://github.com/baojie/senna/tree/master/embeddings">SENNA embeddings</a>) are a better alternative, but their application to sense and concept embeddings (as provided, for example, by Rothe and Schütze 2017) 
	 <!-- Rothe, S., & Schütze, H. (2017). Autoextend: Combining word embeddings with semantic resources. Computational Linguistics, 43(3), 593-617. -->
	 is problematic if the embeddings are distributed separately from the underlying sense and concept definitions.
	 With frac, Ontolex-lemon provides a vocabulary for the conjoint publication and sharing of embeddings and lexical information at all levels: non-lemmatized words (ontolex:Form), lemmatized words (ontolex:LexicalEntry), phrases (ontolex:MultiwordExpression), lexical senses (ontolex:LexicalSense) and lexical concepts (ontolex:LexicalConcept).</p>

	 <div class="note">
	 <p>We focus on <em>publishing and sharing</em> embeddings, not on their processing by means of Semantic Web formalisms, and thus, embeddings are represented as untyped or string literals with whitespace-separated numbers. If necessary, more elaborate representations, e.g., using rdf:List, may subsequently be generated from these literals.</p>
	 </div>
	 
	 <p>Lexicalized embeddings provide their data via <tt>rdf:value</tt>, and should be published together with their metadata, most importantly
		<ul>
		<li>procedure/method (<tt>dct:description</tt> with free text, e.g., "CBOW", "SKIP-GRAM", "collocation counts")</li>
		<li>corpus (<tt>dct:source</tt>)</li>
		<li>dimensionality (<tt>dct:extent</tt>)</li>
		</ul>
	 </p>

	<p><div class='entity'>
       <h3>embedding (ObjectProperty)</h3>

           <div>
           <p><strong>URI:</strong> <a href="http://www.w3.org/ns/lemon/frac#embedding" class="uri">http://www.w3.org/ns/lemon/frac#embedding</a></p>
    		</div>
    		<div class='comment'>
    			<p>The property <strong>embedding</strong> assigns a particular ontolex:Element a frac:Embedding.</p>
    		</div>
    		<div class='description'>
    			<p><strong>rdfs:domain</strong> ontolex:Element</p>
    			<p><strong>rdfs:range</strong> frac:Embedding</p>
    		</div>
    </div></p>
	
	<p><div class='entity'>
       <h3>Embedding (Class)</h3>

           <div>
           <p><strong>URI:</strong> <a href="http://www.w3.org/ns/lemon/frac#Embedding" class="uri">http://www.w3.org/ns/lemon/frac#Embedding</a></p>
    		</div>
    		<div class='comment'>
    			<p>An <strong>Embedding</strong> provides a numerical vector (the string of <tt>rdf:value</tt>) for a given ontolex:Element (see <tt>frac:embedding</tt>). It is defined by the methodology used for creating it (<tt>dct:description</tt>), the URI of the corpus or language resource from which it was created (<tt>dct:source</tt>), and its dimensionality (length of the vector, <tt>dct:extent</tt>).</p>
    		</div>
    		<div class='description'>
    			<p><strong>SubClassOf:</strong> rdf:value exactly 1 xsd:string, dct:source min 1, dct:description min 1</p>
    		</div>
    </div></p>
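
	<p>
	The class description above could be rendered in OWL roughly as follows. This is a sketch under the assumption that the cardinality constraints are expressed as <tt>owl:Restriction</tt>s; the ontology file of the module may express them differently.
	</p>

	<div class='beispiel'>
		<div>
			<pre>
				<code>
# sketch only, not the normative definition
frac:Embedding a owl:Class;
  rdfs:subClassOf
    # rdf:value exactly 1 xsd:string
    [ a owl:Restriction;
      owl:onProperty rdf:value;
      owl:qualifiedCardinality "1"^^xsd:nonNegativeInteger;
      owl:onDataRange xsd:string ],
    # dct:source min 1
    [ a owl:Restriction;
      owl:onProperty dct:source;
      owl:minCardinality "1"^^xsd:nonNegativeInteger ],
    # dct:description min 1
    [ a owl:Restriction;
      owl:onProperty dct:description;
      owl:minCardinality "1"^^xsd:nonNegativeInteger ].</code>
	</pre>
	</div>
	</div>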

	<div class="note">
	 <p>Question: Rename "Embedding" (the concept, not the property) to "Vector" ?</p>
	 </div>
	 
	 <div class="note">
	 <p>For embeddings, we recommend using whitespace-separated numbers as their <tt>rdf:value</tt>. In particular, commas as separators are discouraged because they might be confused with the decimal point, depending on the locale of the user. We recommend the following regular expression for parsing embedding values (example in Perl):</p>
	 <p>
	 <code>split(/[^0-9\.,\-]+/, $value)</code></p>
	 <p>This means that doubles should be provided in the conventional format, not using the exponent notation.
	 </p>
	 </div>

	 <p>
	 The 50-dimensional
	 <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> 6B (Wikipedia 2014+Gigaword 5) embedding for <i>frak</i> is given below:
	 </p>
	 <p><tt>frak 0.015246 -0.30472 0.68107 -0.59727 -0.95368 -1.0931 0.58783 -0.19128 0.49108 0.61215 -0.14967 0.68197 0.22723 0.38514 -0.54721 -0.71187 0.21832 0.59857 0.1076 -0.23619 -0.86604 -0.91168 0.26087 -0.42067 0.60649 0.80644 -1.0477 0.67461 0.34154 -0.072511 -1.01 0.35331 -0.35636 0.9764 -0.62665 -0.29075 0.50797 -1.3538 0.18744 0.27852 -0.22557 -1.187 -0.11523 -0.078265 0.29849 0.22993 -0.12354 0.2829 1.0697 0.015366</tt></p>
	 
	 <p>
	 As a lemma (LexicalEntry) embedding, this can be represented as follows:
	 </p>
	 
	  <div class='beispiel'>
		<div>
			<pre>
				<code>
:frak a ontolex:LexicalEntry;
  ontolex:canonicalForm [ ontolex:writtenRep "frak"@en ];
  frac:embedding [ 
    a frac:Embedding;
	rdf:value "0.015246 -0.30472 0.68107 ...";
	dct:source 
	  &lt;http://dumps.wikimedia.org/enwiki/20140102/>,
	  &lt;https://catalog.ldc.upenn.edu/LDC2011T07>;
	dct:extent "50"^^xsd:int;
	dct:description "GloVe v.1.1, documented in Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation, see https://nlp.stanford.edu/projects/glove/; uncased"@en ].</code>
	</pre>
	</div>
	</div>
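
	<p>
	The count-based vector for <i>frak</i> from the introduction of this section can be represented in the same way. The following is a minimal sketch: the identifier <tt>:frak</tt> and the wording of the <tt>dct:description</tt> are illustrative assumptions, and <tt>dct:extent</tt> is omitted because the reference word list is not given in full.
	</p>

	<div class='beispiel'>
		<div>
			<pre>
				<code>
# sketch only: collocation counts for "frak" in the five sample sentences
:frak a ontolex:LexicalEntry;
  frac:embedding [
    a frac:Embedding;
    # positions: ship, ocean, lose, find, brain, mind, head, sky, Cylon, ...
    rdf:value "1 0 1 0 0 1 1 1 2 ...";
    dct:source &lt;https://en.wikiquote.org/wiki/Battlestar_Galactica_(2003)>;
    dct:description "collocation counts, collocates within the same sentence"@en ].</code>
	</pre>
	</div>
	</div>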
	
	<div class="note">
		<p>As with <tt>frac:Frequency</tt>, we recommend defining resource-specific subclasses of <tt>frac:Embedding</tt> in order to reduce redundancy in the data:</p>
		
		<div class="beispiel">
		<div>
		<pre>
		<code>
# resource-specific embedding class
:GloVe6BEmbedding_50d rdfs:subClassOf frac:Embedding;
  rdfs:subClassOf 
    [ a owl:Restriction;
	  owl:onProperty dct:source;
	  owl:hasValue 
		  &lt;http://dumps.wikimedia.org/enwiki/20140102/>,
		  &lt;https://catalog.ldc.upenn.edu/LDC2011T07> ],
	[ a owl:Restriction;
	  owl:onProperty dct:extent;
	  owl:hasValue "50"^^xsd:int ],
	[ a owl:Restriction;
	  owl:onProperty dct:description;
	  owl:hasValue "GloVe v.1.1, documented in Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation, see https://nlp.stanford.edu/projects/glove/; uncased"@en ].

# embedding assignment
:frak a ontolex:LexicalEntry;
  ontolex:canonicalForm [ ontolex:writtenRep "frak"@en ];
  frac:embedding [ 
    a :GloVe6BEmbedding_50d;
	rdf:value "0.015246 -0.30472 0.68107 ..." ].</code></pre>
	</div></div></div>
	
	<div class="note">
	<p>Examples of non-word embeddings:
		<ul>
		<li> <a href="http://www.cis.lmu.de/~sascha/AutoExtend/">AutoExtend</a>: (a method to build) synset and lexeme embeddings, data <a href="http://www.cis.lmu.de/~sascha/AutoExtend/embeddings.zip">here</a></li>
		<li> <a href="https://github.com/uhh-lt/sensegram">SenseGram</a>: sense embeddings, data <a href="http://ltdata1.informatik.uni-hamburg.de/sensegram/">here</a></li>
		<li> <a href="http://tudarmstadt-lt.github.io/vec2synset/">Vec2Synset</a>: (a method to build) WordNet synset (= LexicalConcept) embeddings</li>
		<li> <a href="https://minimaxir.com/2017/04/char-embeddings/">Character embeddings</a> are probably beyond the scope of OntoLex, unless characters are regarded as LexicalEntries. (Which they could be, certainly for languages such as Chinese or Sumerian, but also for Western languages, given that character-level pseudo entries are sometimes used in dictionaries to describe the phonology and orthography of a language. This is the case, for example, for Grimm's <a href="http://woerterbuchnetz.de/cgi-bin/WBNetz/wbgui_py?sigle=DWB">Deutsches Wörterbuch</a>.)</li>
		</ul>
	</p>
	</div>
	 
	</section>
	
	<section><h2>Collocations</h2>
	
	<div class="note"><p>CC: this is a part I am less certain about, mostly because of the rdf:List modelling (which is inspired by lexicog). Alternative suggestions welcome.</p></div>
	
	<p>Collocation analysis is an important tool for lexicographical research and instrumental for modern NLP techniques. It has been the mainstay of 1990s corpus linguistics and continues to be an area of active research in computational philology. ... (MORE MOTIVATION AND EXAMPLES)</p>
	
	<p>Collocations are usually defined on surface-oriented criteria, i.e., as a relation between forms or lemmas (lexical entries), not between senses, but they can be analyzed on the level of word senses (the sense that gave rise to the idiom or collocation). Indeed, collocations often contain a variable part, which can be represented by an <tt>ontolex:LexicalConcept</tt>.</p>
	
	<p>Collocations can involve two or more words; they are thus modelled as an <tt>rdf:List</tt> of <tt>ontolex:Element</tt>s.
	Collocations may have a fixed or a variable word order. By default, we assume variable word order. Where a fixed word order is required, the collocation must be assigned <tt>lexinfo:termType lexinfo:idiom</tt>.</p>
	
	<p>Collocations obtained by quantitative methods are characterized by their method of creation (<tt>dct:description</tt>), their collocation strength (<tt>rdf:value</tt>), and the corpus used to create them (<tt>dct:source</tt>). Collocations share these characteristics with other types of contextual relations (see below), and thus, these are inherited from the abstract <tt>frac:ContextualRelation</tt> class.</p>
	
	<p><div class='entity'>
       <h3>ContextualRelation (Class)</h3>

           <div>
           <p><strong>URI:</strong> <a href="http://www.w3.org/ns/lemon/frac#ContextualRelation" class="uri">http://www.w3.org/ns/lemon/frac#ContextualRelation</a></p>
    		</div>
    		<div class='comment'>
    			<p><strong>ContextualRelation</strong> provides a relation between two or more lexical elements, characterized by a <tt>dct:description</tt> of the nature of relation, a corpus (<tt>dct:source</tt>) from which this relation was inferred, and a weight or probability assessment (<tt>rdf:value</tt>).</p>
    		</div>
    		<div class='description'>
    			<p><strong>SubClassOf:</strong> rdf:List; rdf:value exactly 1 xsd:double, dct:source min 1, dct:description min 1 xsd:string</p>
    		</div>
    </div></p>
	
	<p>We distinguish two primary contextual relations: syntagmatic (between co-occurring elements) and paradigmatic (between elements that can be substituted for each other). Syntagmatic contextual relations are formalized with <tt>frac:Collocation</tt>.</p>
	
	<p><div class="entity">
		<h3>Collocation (Class)</h3>
		<div>
		<p><strong>URI:</strong> <a href="http://www.w3.org/ns/lemon/frac#Collocation" class="uri">http://www.w3.org/ns/lemon/frac#Collocation</a></p>
    		</div>
    		<div class='comment'>
    			<p>A <strong>Collocation</strong> is a <tt>frac:ContextualRelation</tt> that holds between two or more <tt>ontolex:Element</tt>s based on their co-occurrence within the same utterance and characterized by their collocation weight (<tt>rdf:value</tt>) in one or multiple source corpora (<tt>dct:source</tt>).</p>
    		</div>
    		<div class='description'>
    			<p><strong>SubClassOf:</strong> <tt>frac:ContextualRelation</tt></p>
				<p><strong>rdf:first:</strong> only <tt>ontolex:Element</tt></p>
				<p><strong>rdf:rest*/rdf:first:</strong> only <tt>ontolex:Element</tt></p>
			</div>
		</div>
	</p>
	
	<p>
	Collocations are lists of <tt>ontolex:Element</tt>s, formalized as <tt>rdf:List</tt>. Collocation elements can thus be accessed by <tt>rdf:first</tt> (first element), <tt>rdf:rest/rdf:first</tt> (second element), etc. The property <tt>rdf:rest</tt> returns an <tt>rdf:List</tt> of <tt>ontolex:Element</tt>s, but not a <tt>frac:Collocation</tt>.</p>
	
	<p>By default, <tt>frac:Collocation</tt> is insensitive to word order. If a collocation is word order sensitive, it should be characterized by an appropriate <tt>dct:description</tt>, as well as by having <tt>lexinfo:termType lexinfo:idiom</tt>.</p>

	<div class="note">
	<p><tt>lexinfo:idiom</tt> is "[a] group of words in a fixed order that have a particular meaning that is different from the meanings of each word understood on its own." In application to automatically generated collocations, the criterion of having 'a particular meaning' is necessarily replaced by 'a particular distribution pattern', as reflected by the collocation weight (<tt>rdf:value</tt>). <i>Idioms</i> in the narrower sense of lexicalized multi-word expressions should not be modelled as <tt>frac:Collocation</tt>s, but as <tt>ontolex:MultiwordExpression</tt>s.
	[TO BE DISCUSSED]
	</p>
	</div>
	
	<p>The most elementary level of a collocation is an n-gram, as provided, for example, by <a href="http://storage.googleapis.com/books/ngrams/books/datasetsv2.html">Google Books</a>, which provides n-gram frequencies per publication year as tab-separated values (n-gram, year, match count, volume count). For 2008, the 2012 edition provides the following statistics for the bigram <i>kill</i> + <i>switch</i>:
	</p>

	<div class='beispiel'>
		<div>
			<pre>
				<code>
# form-form bigrams
kill	switch	2008	199	121

# form-lexeme bigrams
kill	switch_NOUN	2008	187	115
kill	switch_VERB	2008	8	8
				
# lexeme-form bigrams
kill_ADJ	switch	2008	70	48
kill_NOUN	switch	2008	89	64
kill_VERB	switch	2008	40	30
				
# lexeme-lexeme bigrams
kill_VERB	switch_VERB	2008	2	2
kill_NOUN	switch_NOUN	2008	83	61
kill_VERB	switch_NOUN	2008	35	26
kill_ADJ	switch_NOUN	2008	69	48
kill_NOUN	switch_VERB	2008	6	6
</code></pre></div></div>

	<p>In this example, forms are string values (cf. <tt>ontolex:Form</tt>), and lexemes are string values with parts of speech (cf. <tt>ontolex:LexicalEntry</tt>). A partial ontolex-frac representation is given below:
	</p>
	
	<div class='beispiel'>
		<div>
			<pre>
				<code>
# kill (verb)
:kill_v a ontolex:LexicalEntry;
  lexinfo:partOfSpeech lexinfo:verb;
  ontolex:canonicalForm :kill_cf.

# kill (canonical form)
:kill_cf ontolex:writtenRep "kill"@en.

# switch (noun)
:switch_n a ontolex:LexicalEntry;
  lexinfo:partOfSpeech lexinfo:noun;
  ontolex:canonicalForm :switch_cf.

# switch (canonical form)
:switch_cf ontolex:writtenRep "switch"@en.

# form-form bigrams
(:kill_cf :switch_cf) a frac:Collocation;
  rdf:value "199";
  dct:description "2-grams, English Version 20120701, word frequency";
  dct:source &lt;https://books.google.com/ngrams>;
  dct:temporal "2008"^^xsd:gYear;
  lexinfo:termType lexinfo:idiom.

(:kill_cf :switch_cf) a frac:Collocation;
  rdf:value "121";
  dct:description "2-grams, English Version 20120701, document frequency";
  dct:source &lt;https://books.google.com/ngrams>;
  dct:temporal "2008"^^xsd:gYear;
  lexinfo:termType lexinfo:idiom.
  
# form-lexeme bigrams
(:kill_cf :switch_n) a frac:Collocation;
  rdf:value "187";
  dct:description "2-grams, English Version 20120701, word frequency";
  dct:source &lt;https://books.google.com/ngrams>;
  dct:temporal "2008"^^xsd:gYear;
  lexinfo:termType lexinfo:idiom.

(:kill_cf :switch_n) a frac:Collocation;
  rdf:value "115";
  dct:description "2-grams, English Version 20120701, document frequency";
  dct:source &lt;https://books.google.com/ngrams>;
  dct:temporal "2008"^^xsd:gYear;
  lexinfo:termType lexinfo:idiom.
</code></pre></div></div>

	<div class="note"><p>Question: can canonical forms be shared across different lexical entries? For the case of plain word n-grams, this is presupposed here.</p></div>
	
	<p>The second example illustrates more complex types of collocation, as provided by the <a href="http://corpora.uni-leipzig.de/en/res?corpusId=eng_news_2012">Wortschatz</a> portal (scores and definitions as provided for <a href="http://corpora.uni-leipzig.de/en/res?corpusId=eng_news_2012&word=beans">beans</a>, <a href="http://corpora.uni-leipzig.de/en/res?corpusId=eng_news_2012&word=spill+the+beans">spill the beans</a>, etc.).
	</p>
	
	<div class='beispiel'>
		<div>
			<pre>
				<code>
@prefix wsen: &lt;http://corpora.uni-leipzig.de/en/res?corpusId=eng_news_2012&word=> .

# selected lexical entries
# (we assume that every Wortschatz word is an independent lexical entry)
wsen:beans a ontolex:LexicalEntry;
  ontolex:canonicalForm [ ontolex:writtenRep "beans"@en ].
wsen:spill a ontolex:LexicalEntry;
  ontolex:canonicalForm [ ontolex:writtenRep "spill"@en ].
wsen:green a ontolex:LexicalEntry;
  ontolex:canonicalForm [ ontolex:writtenRep "green"@en ].
wsen:about a ontolex:LexicalEntry;
  ontolex:canonicalForm [ ontolex:writtenRep "about"@en ].

# collocations, non-lexicalized
(wsen:spill wsen:beans) a frac:Collocation;
  rdf:value "182";
  dct:description "cooccurrences in the same sentence, unordered";
  dct:source &lt;http://corpora.uni-leipzig.de/en/res?corpusId=eng_news_2012>.

(wsen:green wsen:beans) a frac:Collocation;
  rdf:value "778";
  dct:description "left neighbor cooccurrence";
  dct:source &lt;http://corpora.uni-leipzig.de/en/res?corpusId=eng_news_2012>;
  lexinfo:termType lexinfo:idiom.
  
(wsen:beans wsen:about) a frac:Collocation;
  rdf:value "35";
  dct:description "right neighbor cooccurrence";
  dct:source &lt;http://corpora.uni-leipzig.de/en/res?corpusId=eng_news_2012>;
  lexinfo:termType lexinfo:idiom.
  
# multi-word expression, lexicalized (!)
wsen:spill\+the\+beans a ontolex:MultiwordExpression;
  ontolex:canonicalForm [ ontolex:writtenRep "spill the beans"@en ].

(wsen:beans wsen:spill\+the\+beans) a frac:Collocation;
  rdf:value "401";
  dct:description "cooccurrences in the same sentence, unordered";
  dct:source &lt;http://corpora.uni-leipzig.de/en/res?corpusId=eng_news_2012>.
</code></pre></div></div>
  
	<!--p>More examples https://www.sketchengine.eu/documentation/statistics-used-in-sketch-engine/</p--> 
	
	<div class="note"><p>Again, it is recommended to define resource-specific subclasses of <tt>frac:Collocation</tt> with default values for <tt>dct:description</tt>, <tt>dct:source</tt>, and (where applicable) <tt>lexinfo:termType</tt>.
	</p>
	</div>
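
	<p>
	For the Wortschatz data above, such a subclass might look as follows. This is a sketch only: the class name <tt>:WortschatzSentenceCooccurrence</tt> is hypothetical, and the <tt>wsen:</tt> prefix is assumed to be declared as in the previous example.
	</p>

	<div class='beispiel'>
		<div>
			<pre>
				<code>
# resource-specific collocation class (hypothetical name)
:WortschatzSentenceCooccurrence rdfs:subClassOf frac:Collocation;
  rdfs:subClassOf
    [ a owl:Restriction;
      owl:onProperty dct:source;
      owl:hasValue &lt;http://corpora.uni-leipzig.de/en/res?corpusId=eng_news_2012> ],
    [ a owl:Restriction;
      owl:onProperty dct:description;
      owl:hasValue "cooccurrences in the same sentence, unordered" ].

# collocation assignment: only the score remains to be stated
(wsen:spill wsen:beans) a :WortschatzSentenceCooccurrence;
  rdf:value "182".</code></pre></div></div>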

	
	</section>
	
	<section>
	<h2>Similarity</h2>
	
	<p>Similarity is a paradigmatic contextual relation between elements that can replace each other in the same context. In distributional semantics, a quantitative assessment of the similarity of two forms, lexemes, phrases, word senses or concepts is thus grounded in numerical representations of their respective contexts, i.e., their embeddings.
	In a broader sense of 'embedding', bags of words also fall under the scope of <tt>frac:Embedding</tt>, see the usage note below.
	</p>
	
	<p>Similarity is characterized by a similarity score (<tt>rdf:value</tt>), e.g., the number of shared dimensions/collocates (in a bag-of-word model) or the cosine distance between two word vectors (for fixed-size embeddings), the corpora which we used to generate this score (<tt>dct:source</tt>), and the method used for calculating the score (<tt>dct:description</tt>).</p>
	
	<p>Similarity is symmetric. The order of similes is irrelevant.</p>
	
	<p>Like <tt>frac:Collocation</tt>, quantitative similarity relations are modelled as a subclass of <tt>frac:ContextualRelation</tt> (and thus, as an <tt>rdf:List</tt>).</p>

	
	<p><div class="entity">
		<h3>Similarity (Class)</h3>
		<div>
		<p><strong>URI:</strong> <a href="http://www.w3.org/ns/lemon/frac#Similarity" class="uri">http://www.w3.org/ns/lemon/frac#Similarity</a></p>
    		</div>
    		<div class='comment'>
    			<p><strong>Similarity</strong> is a <tt>frac:ContextualRelation</tt> that holds between two or more <tt>frac:Embedding</tt>s, and is characterized by a similarity score (<tt>rdf:value</tt>) in one or multiple source corpora (<tt>dct:source</tt>) and a <tt>dct:description</tt> that explains the method of comparison.</p>
    		</div>
    		<div class='description'>
    			<p><strong>SubClassOf:</strong> <tt>frac:ContextualRelation</tt></p>
				<p><strong>rdf:first:</strong> only <tt>frac:Embedding</tt></p>
				<p><strong>rdf:rest*/rdf:first:</strong> only <tt>frac:Embedding</tt></p>
			</div>
		</div>
	</p>

		<p>
	<tt>frac:Similarity</tt> applies to two different use cases: The specific similarity between (exactly) two words, and similarity clusters (synonym groups obtained from clustering quantitatively obtained synonym candidates according to their distributional semantics in a particular corpus) that can contain an arbitrary number of words. 
	Both differ in the semantics of <tt>rdf:value</tt>: 
	Quantitatively obtained similarity <i>relations</i> normally provide a different score for every pair of similes. 
	Within a similarity <i>cluster</i>, a generalization over these pair-wise scores must be provided. 
	This could be the minimal similarity between all cluster members or a score produced by the clustering algorithm (e.g., depth or size of cluster).
	This must be explained in <tt>dct:description</tt>.
	</p>
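
	<p>
	Both use cases can be sketched as follows. This is an illustration only: the embedding identifiers, the choice of words and the wording of the descriptions are assumptions, cluster membership is not an empirical result, and vectors and scores are elided ("...") rather than measured. The class <tt>:GloVe6BEmbedding_50d</tt> is the resource-specific embedding class from the note on <tt>frac:Embedding</tt> subclasses.
	</p>

	<div class='beispiel'>
		<div>
			<pre>
				<code>
# embeddings to be compared (hypothetical identifiers)
:frak_emb a :GloVe6BEmbedding_50d; rdf:value "0.015246 -0.30472 0.68107 ...".
:ship_emb a :GloVe6BEmbedding_50d; rdf:value "...".
:sky_emb  a :GloVe6BEmbedding_50d; rdf:value "...".

# similarity relation between exactly two embeddings
(:frak_emb :ship_emb) a frac:Similarity;
  rdf:value "...";
  dct:source
    &lt;http://dumps.wikimedia.org/enwiki/20140102/>,
    &lt;https://catalog.ldc.upenn.edu/LDC2011T07>;
  dct:description "cosine similarity between GloVe embeddings"@en.

# similarity cluster: one score generalizes over all pair-wise scores
(:frak_emb :ship_emb :sky_emb) a frac:Similarity;
  rdf:value "...";
  dct:source
    &lt;http://dumps.wikimedia.org/enwiki/20140102/>,
    &lt;https://catalog.ldc.upenn.edu/LDC2011T07>;
  dct:description "similarity cluster; score: minimal pair-wise cosine similarity between cluster members"@en.</code></pre></div></div>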
	
	<div class="note">
		<p>
	Similarity clusters are typical outcomes of <a href="https://www.cs.york.ac.uk/semeval2010_WSI/datasets.html">Word Sense Induction</a> techniques or <a href="http://www.aclweb.org/anthology/D10-1056">unsupervised POS tagging</a>. Classical sample data are Brown clusters, e.g., <a href="https://github.com/Derekkk/Brown-Word-Clustering-and-word-similarity/blob/master/results-brown.txt">here</a> or <a href="https://s3-eu-west-1.amazonaws.com/downloads.gate.ac.uk/resources/derczynski-chester-boegh-brownpaths.tar.bz2">here</a>.
		</p>
	</div>
	
	<div class="note">
		  <p><tt>Similarity</tt> is defined as a property of embeddings, not between <tt>ontolex:Element</tt>s. 
	  This excludes at least two important use cases: </p>
	  <ul>
	  <li>manual similarity assessments as used for evaluating similarity assessments, and as created, for example, as part of psycholinguistic association or priming experiments (also cf. WordNet synsets, which provide, however, detailed lexicographic information in addition to similarity, and which are thus to be represented as <tt>ontolex:LexicalConcept</tt>),</li>
	  <li>similarity assessments obtained by other means than embeddings, e.g., by means of a traditional bag of words.
	  </li>
	  </ul>
	  </p>
	  <p>
	  In both (and similar) cases, the recommendation is nevertheless to make use of (a resource-specific subclass of) <tt>frac:Embedding</tt>, and to document the specifics of the similarity relation and/or the embeddings in the <tt>dct:description</tt> of these embeddings. For the first use case, this approach can be justified by assuming that embeddings are correlated with a psycholinguistically 'real' phenomenon. For the second use case, any bag of words can be interpreted as an infinite-size binary vector for which an embedding provides a fixed-size approximation.
	  </p>
	  </div>
	  
  	<div class="note">
	<p>
	As with frequency and embeddings, a resource-specific similarity type can be defined analogously. In particular, this is required if directed (asymmetric) similarity assessments are to be provided.
	</p>
	</div>
	</section>

	</section>

	<section>
	<h2>Corpus Annotation (non-normative)</h2>
	
	<div class="note"><p>The Ontolex Module for Frequency, Attestation and Corpus Information does not specify a vocabulary for annotating corpora or other data with lexical information, as this is provided by the <a href="https://www.w3.org/TR/annotation-vocab/">Web Annotation Vocabulary</a>. The following description is non-normative, as Web Annotation is defined in a separate W3C recommendation. The definitions below are reproduced from that recommendation; only the domain and range declarations have been refined for our use case.</p>
	</div>
	
	<p>In Web Annotation terminology, the annotated element is the 'target', the content of the annotation is the 'body', and the process and provenance of the annotation is expressed by properties of <tt>oa:Annotation</tt>.</p>
	
	<div class="entity">
		<h3>oa:Annotation (Class)</h3>
		<div>
		<p><strong>IRI:</strong> <a href="http://www.w3.org/ns/oa#Annotation" class="uri">http://www.w3.org/ns/oa#Annotation</a></p>
    		</div>
    		<div class='description'>
				<p><strong>Required Predicates:</strong> <a href="#hastarget">oa:hasTarget</a>, <a href="#rdf-type">rdf:type</a>, <a href="#hasbody">oa:hasBody</a></p>
            <p><strong>Recommended Predicates:</strong> <a href="#motivatedby">oa:motivatedBy</a>, <a href="#dcterms-creator">dcterms:creator</a>, <a href="#dcterms-created">dcterms:created</a></p>
            <p><strong>Other Predicates:</strong> <a href="#styledby">oa:styledBy</a>, <a href="#dcterms-issued">dcterms:issued</a>, <a href="#as-generator">as:generator</a> </p>
			</div>
		</div>

        <div class="diagram_img">
          <img src="https://www.w3.org/TR/annotation-vocab/images/examples/annotation.png" alt="oa:Annotation with properties" longdesc="#example_anno">
        </div>

	<div class="entity">
		<h3>oa:hasBody (Object Property)</h3>
		<div>
		<p><strong>IRI:</strong> <a href="http://www.w3.org/ns/oa#hasBody" class="uri">http://www.w3.org/ns/oa#hasBody</a></p>
		</div>
		<div class="comment">The object of the relationship is a resource that is a body of the Annotation. In the context of <em>lemon</em>, the body is an <tt>ontolex:Element</tt></div>
		<div class="description">
			<p><strong>Domain:</strong> oa:Annotation</p>
			<p><strong>Range:</strong> ontolex:Element</p>
		</div>
		<div class="diagram_img">
		<img src="https://www.w3.org/TR/annotation-vocab/images/examples/hasBody.png" alt="oa:hasBody"/>
		</div>
	</div>
	
	<div class="entity">
		<h3>oa:hasTarget (Object Property)</h3>
		<div>
		<p><strong>IRI:</strong> <a href="http://www.w3.org/ns/oa#hasTarget" class="uri">http://www.w3.org/ns/oa#hasTarget</a></p>
		</div>
		<div class="comment">The relationship between an Annotation and its Target.</div>
		<div class="description">
			<p><strong>Domain:</strong> oa:Annotation</p>
		</div>
	</div>
	
	<p>The Web Annotation Vocabulary supports different ways to define targets. This includes:
	
	<ul>
	<li> plain URI: The target can be a URI defined within the corpus (e.g., if corpus data is provided as native RDF, or by means of the <tt>@about</tt> attribute in an <a href="https://www.w3.org/TR/rdfa-primer/">HTML/XML+RDFa</a> document, or by means of <tt>@xml:id</tt> in a <a href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.global.html">TEI/XML</a> document).</li>
	<li> string URI: String URIs provide the possibility to point directly to a text fragment in a web document, using the URI schemes provided by <a href="https://tools.ietf.org/html/rfc5147">RFC5147</a> (text files only) or <a href="http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/nif-core.html#introduction">NIF</a> (all text-based formats).</li>
	<li> <a href="https://www.w3.org/TR/annotation-vocab/#textpositionselector">oa:TextPositionSelector</a>: a range of text defined by the start and end positions of the selection in the stream</li>
	<li><a href="https://www.w3.org/TR/annotation-vocab/#datapositionselector">oa:DataPositionSelector</a>: a range of data by recording the start and end positions of the selection in the stream</li>
	<li><a href="https://www.w3.org/TR/annotation-vocab/#textquoteselector">oa:TextQuoteSelector</a>: The TextQuoteSelector describes a range of text by copying it. The TextQuoteSelector can include some of the text immediately before (a prefix) and after (a suffix) it to distinguish between multiple copies of the same sequence of characters. If this does not suffice for disambiguation, all matching text fragments in the document are annotated.</li>
	<li><a href="https://www.w3.org/TR/annotation-vocab/#xpathselector">
	oa:XPathSelector</a>: select elements and content within a resource that supports the Document Object Model via a specified XPath value.</li>
	<li> <a href="https://www.w3.org/TR/annotation-vocab/#rangeselector">oa:RangeSelector</a>: identify the beginning and the end of the selection by using other Selectors.</li>
	</ul>
	</p>
	
	<div class="note">
	<p><tt>oa:Annotation</tt> explicitly allows <i>n:m</i> relations between <tt>ontolex:Element</tt>s and elements of the annotated corpus. It is thus sufficient for every <tt>ontolex:Element</tt> to appear in one <tt>oa:hasBody</tt> statement in order to produce a full annotation of the corpus.</p>
	</div>
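
	<p>
	As an illustration, one attestation of <i>frak</i> in the sample sentences of the Embeddings section could be annotated as follows. This is a sketch: the identifier <tt>:frak</tt> is assumed to denote the lexical entry, and the choice of <tt>oa:TextQuoteSelector</tt> is only one of the targeting options listed above.
	</p>

	<div class='beispiel'>
		<div>
			<pre>
				<code>
# sketch only: "It's in the frakking ship!" as an attestation of :frak
[] a oa:Annotation;
  oa:hasBody :frak;                 # an ontolex:Element
  oa:hasTarget [
    a oa:SpecificResource;
    oa:hasSource &lt;https://en.wikiquote.org/wiki/Battlestar_Galactica_(2003)>;
    oa:hasSelector [
      a oa:TextQuoteSelector;
      oa:exact "frakking";
      oa:prefix "It's in the ";
      oa:suffix " ship!" ] ].</code></pre></div></div>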
	
	<div class="note">
	<p>As for frequency, embeddings, etc., resource-specific annotation classes can be defined by <tt>owl:Restriction</tt> so that modelling effort and verbosity are reduced. These should follow the same conventions.</p>
	</div>
	
	</section>
	
	<section>
	  <h2>Usage guidelines</h2>

	  <section>
	  <h3>Resource-specific subclasses of frac concepts</h3>
		<p>As corpus-derived information requires provenance and other metadata, the frac module uses reification (class-based modelling) for concepts such as frequency or embeddings. In a data set, this information will recur, and to reduce redundancy, we recommend providing resource-specific subclasses of frac concepts that supply the metadata by means of <tt>owl:Restriction</tt>s stating the values of the respective properties. This was illustrated above for the relevant frac classes.</p>
		
		<p>
		As a rule of best practice, we recommend for such cases to provide (a copy of) the OWL definitions of resource-specific classes <em>in the same graph (and file) as the data</em>.
		Within the graph containing the data, the following SPARQL 1.1 query <em>must</em> return the full frac definition of all instances of, say, <tt>:EPSDFrequency</tt> (see examples above):
		</p>
		
		<p><div class="beispiel"><div><pre><code>
CONSTRUCT {
	?data a ?class, ?sourceClass; ?property ?value.
} WHERE {
  ?data a ?sourceClass.	                  # e.g., [] a :EPSDFrequency
  ?sourceClass (rdfs:subClassOf|owl:equivalentClass)* ?class.
  FILTER(strstarts(str(?class),'http://www.w3.org/ns/lemon/frac#'))
    # ?class: all superclasses of ?sourceClass which are in the <em>frac</em> namespace
  { # return all value restrictions
	  ?sourceClass (rdfs:subClassOf|owl:equivalentClass)* ?restriction.
	  ?restriction a owl:Restriction.
	  ?restriction owl:onProperty ?property.
	  ?restriction owl:hasValue ?value.
  } UNION {
    # return all directly expressed values
	  ?data ?property ?value.
	  FILTER(?property in (dct:source,rdf:value))
	  # TODO: update list of properties
  }
}</code></pre></div></div></p>

	<p>This query can be used as a test for <em>frac</em> compliance, and for property 'inference'. Note that it does not support <tt>owl:intersectionOf</tt>, <tt>owl:unionOf</tt>, or <tt>owl:sameAs</tt>.</p>
  
	<div class="note">
	<p>We use the OWL2/DL vocabulary for modelling restrictions. However, <em>lemon</em> is only partially compatible with OWL2/DL, in that several modules use <tt>rdf:List</tt>, which is a reserved construct in OWL2. Therefore, the primary means of accessing and manipulating <em>lemon</em> and <em>ontolex-frac</em> data is SPARQL, i.e., RDF (rather than OWL) technology. In particular, we neither guarantee nor require that OWL2/DL inferences can be used for validating or querying <em>lemon</em> and <em>ontolex-frac</em> data.
	</p>
	</div>
		
	  </section>
	  
	  <section>
	  <h3>RDF Serializations and CSV</h3>
	  
	  <p>Usually, numerical information drawn from corpora is  distributed and shared as comma-separated values (CSV), e.g., ngram lists or embeddings.
	  Ontolex-frac as an RDF vocabulary is agnostic about its serialization (RDF/TTL, RDF/XML, JSON-LD, etc.), but in particular, it is compliant with CSV and related tabular formats by means of W3C recommendations such as <a href="https://www.w3.org/TR/csv2rdf/">CSV2RDF</a>, <a href="https://www.w3.org/TR/rdb-direct-mapping/">RDB Direct Mapping</a>
	  and the <a href="https://www.w3.org/TR/r2rml/">RDB to RDF Mapping Language</a>. For corpus-derived lexical-semantic information which is typically distributed in CSV, the best practice is to continue to do so, but to provide a mapping to Ontolex-frac as this provides a vocabulary for their interpretation as Linked Data, and thus establishes an interoperability layer over the raw data without creating additional overhead.</p>
	  
	  <div class="note">
		<p>Ontolex-frac is compatible with CSV formats, but its handling of structured information constrains the CSV layout. In particular, the individual dimensions of an embedding must not use a comma as separator, so that the whole vector can be mapped to a single literal. For the example embedding of <i>frak</i> above, the first column (containing the word) should be comma-separated, and the following values (containing the embedding) should be whitespace-separated.
		</p>
	  </div>
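
	  <p>The layout described in the note can be sketched as follows. This is a minimal illustration only: the file name <tt>embeddings.csv</tt>, the <tt>example.org</tt> URI pattern and the column names <tt>word</tt> and <tt>vector</tt> are assumptions made for this example, and the numbers are invented placeholders, not the actual embedding of <i>frak</i>. A row of such a file would look like this:</p>

	  <div class="beispiel"><div><pre><code>frak,0.1 -0.2 0.3</code></pre></div></div>

	  <p>Under the same assumptions, a CSVW metadata file for <a href="https://www.w3.org/TR/csv2rdf/">CSV2RDF</a> could map the second column to a single <tt>rdf:value</tt> literal:</p>

	  <div class="beispiel"><div><pre><code>{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "embeddings.csv",
  "dialect": { "header": false, "delimiter": "," },
  "tableSchema": {
    "aboutUrl": "http://example.org/embedding/{word}",
    "columns": [
      { "name": "word", "datatype": "string", "suppressOutput": true },
      { "name": "vector", "datatype": "string",
        "propertyUrl": "http://www.w3.org/1999/02/22-rdf-syntax-ns#value" }
    ]
  }
}</code></pre></div></div>

	  <p>Because only the first separator is a comma, a CSV2RDF processor should treat the whole whitespace-separated vector as one cell and emit it as one string literal. Linking the resulting resource to its lexical entry and adding provenance such as <tt>dct:source</tt> are deliberately left out of this sketch.</p>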
	  </section>
	</section>

	<h2>Acknowledgements</h2>

	  TBC


   <h2>References</h2>
   
   <div class="note"><p>from lexicog, to be revised</p></div>

<dt id="bib-mccrae-lemon">[1]</dt>
<dd>J. McCrae, G. Aguado-de Cea, P. Buitelaar, P. Cimiano, T. Declerck, A. Gómez-Pérez, J. Gracia, L. Hollink, E. Montiel-Ponsoda, D. Spohr, and T. Wunner, <a href="http://dx.doi.org/10.1007/s10579-012-9182-3"> <cite>"Interchanging lexical resources on the Semantic Web" </cite></a>.  Language Resources and Evaluation, vol. 46, 2012.

<dt id="bib-klimek-kdict">[2]</dt>
<dd> B. Klimek and M. Brümmer, <cite>"Enhancing lexicography with semantic language databases"</cite> Kernerman Dictionary News, 23, 5-10. 2015.

<dt id="bib-gracia-apertium">[3]</dt>
<dd>J. Gracia, M. Villegas, A. Gómez-Pérez, and N. Bel, <cite> "The apertium bilingual dictionaries on the web of data"</cite> Semantic Web Journal, vol. 9, no. 2, pp. 231-240, Jan. 2018.

<dt id="bib-bosque-kdict">[4]</dt>
<dd> J. Bosque-Gil, J. Gracia, E. Montiel-Ponsoda, and G. Aguado-de Cea, <cite>"Modelling multilingual lexicographic resources for the web of data: the k dictionaries case"</cite> in Proc. of GLOBALEX'16 workshop at LREC'16, Portorož, Slovenia, May 2016.

<dt id="bib-kahn-diachronic">[5]</dt>
<dd> F. Khan, J. E. Díaz-Vera, and M. Monachini, <cite>"Representing Polysemy and Diachronic Lexico-Semantic Data on the Semantic Web"</cite> In SWASH at ESWC (2016)

<dt id="bib-declerck-dialectal">[6]</dt>
<dd> T. Declerck and E. Wandl-Vogt, <cite> "Cross-linking Austrian dialectal Dictionaries through formalized Meanings"</cite> In Proceedings of the XVI EURALEX International
Congress, pp. 329–343. 2014.

<dt id="bib-abromeit-etymological">[7]</dt>
<dd> F. Abromeit, C. Chiarcos, C. Fäth and M. Ionov, <cite>"Linking the Tower of Babel: Modelling a Massive Set of Etymological Dictionaries as RDF"</cite> In LDL 2016 5th Workshop on Linked Data in Linguistics: Managing, Building and Using Linked Language Resources (p. 11). May 2016.


<dt id="bib-bosque-lexicography">[8]</dt>
<dd> J. Bosque-Gil, J. Gracia, and A. Gómez-Pérez, <cite>"Linked data in lexicography"</cite> Kernerman Dictionary News, pp. 19-24, Jul. 2016.

<dt id="bib-declerck-paneuropean">[9]</dt>
<dd> T. Declerck, E. Wandl-Vogt, and K. Mörth, <cite>"Towards a Pan European Lexicography by Means of Linked (Open) Data"</cite> In Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference (pp. 342-355), 2015.


<dt id="bib-bosque-module">[10]</dt>
<dd> J. Bosque-Gil, J. Gracia, and E. Montiel-Ponsoda, <a href="http://ceur-ws.org/Vol-1899/OntoLex_2017_paper_5.pdf"> <cite>  "Towards a module for lexicography in OntoLex" </cite></a> in Proc. of the LDK workshops: OntoLex, TIAD and Challenges for Wordnets at 1st Language Data and Knowledge conference (LDK 2017), Galway, Ireland, vol. 1899.    CEUR-WS, pp. 74-84, Jun 2017.

<dt id="bib-parvizi-oxford">[11]</dt>
<dd> A. Parvizi, M. Kohl, M. González, R. Saurí, <cite>"Towards a Linguistic Ontology with an Emphasis on Reasoning and Knowledge Reuse"</cite> Language Resources and Evaluation Conference (LREC), May 2016.

<dt id="bib-gracia-native">[12]</dt>
<dd> J. Gracia, I. Kernerman, and J. Bosque-Gil, <a href="https://elex.link/elex2017/wp-content/uploads/2017/09/paper33.pdf"> <cite>"Toward linked data-native dictionaries"</cite> </a> in Proc. of eLex 2017 conference (Electronic lexicography in the 21st century), Leiden, Netherlands. Lexical Computing CZ s.r.o., pp. 550-559, Sep. 2017.


<dt id="bib-stolk-onomasiological">[13]</dt>
<dd> S. Stolk, <cite>"OntoLex and Onomasiological Ordering: Supporting Topical Thesauri"</cite> in Proc. of the LDK2017 Workshops, NUI Galway, Ireland, 18 June (pp. 60–67), 2017.

<dt id="bib-elmaarouf-verbs">[14]</dt>
<dd> I. El Maarouf, J. Bradbury, and P. Hanks, <cite>"PDEV-lemon: a Linked Data implementation of the Pattern Dictionary of English Verbs based on the Lemon model"</cite>. In 3rd Workshop on Linked Data in Linguistics: Multilingual Knowledge Resources and Natural Language Processing (p. 88). 2014.

<dt id="bib-kahn-citations">[15]</dt>
<dd> F. Khan and F. Boschetti, <cite>"Towards a Representation of Citations in Linked Data Lexical Resources"</cite> In proc. of the XVIII EURALEX International Congress (EURALEX 2018). 2018

<dt id="american-heritage-dict">[16]</dt>
<dd><cite>animal. American Heritage Dictionary. Houghton Mifflin Harcourt, 1994.</cite> Last accessed 28.10.18.

<dt id="RAE-dict">[17]</dt>
<dd><cite>blanco. Diccionario de la Lengua Española (DLE). Versión electrónica de la 23. Edición. December 2017.</cite> Last accessed 28.10.18.

<dt id="OED-dict-air">[18]</dt>
<dd><cite>air. Oxford English Living Dictionaries Online. </cite> Last accessed 01.11.18. https://en.oxforddictionaries.com/definition/air


<script>
 setTimeout(function(){CodeMirror.colorize();}, 20);
</script>
</body>

</html>