This is a first version of the Wikimedia project etytree. The aim of the project is to visualize in an interactive web page the etymological tree (i.e., the etymology of a word in the form of a tree, with ancestors, cognate words, derived words, etc.) of any word in any language using data extracted from Wiktionary.
This project has been inspired by my interest in etymology, in open source collaborative projects and in interactive visualizations.
If you have comments on the project please write on its talk page.
The master branch is for development and for local installs. The webpack-branch is used in production.
Etytree uses data extracted from an XML dump of the English Wiktionary using an algorithm implemented in dbnary_etymology. The extracted data is kept in sync with Wiktionary each time a new dump is generated (the dump currently used dates back to September 28th, 2017). Data extracted with dbnary_etymology has been loaded into a Virtuoso DBMS which can be accessed at wmflabs etytree-virtuoso sparql endpoint and explored with a faceted browser.
The list of languages and ISO codes can be found at resources/data and are imported from Wiktionary and periodically updated (the current files date back to September 22nd, 2017). File etymology-only_languages.csv has been created from Wiktionary data with a lua module available here. File iso-639-3.tab has been downloaded from this link (the first line has been removed). File list_of_languages.csv has been downloaded from Wiktionary.
I have defined an ontology for etymologies here. In particular I have defined properties etymologicallyRelatedTo, etymologicallyDerivesFrom and etymologicallyEquivalentTo. This ontology needs improvements.
Property http://www.w3.org/2000/01/rdf-schema#seeAlso is used to link etymological entries to the Wiktionary pages they have been extracted from.
Besides etymological relationships, the database also contain POS-s, definitions, senses and more as extracted by dbnary. The ontology for dbnary is defined here.
The code is distributed under MIT licence and the data is distributed under Creative Commons Attribution-ShareAlike 3.0.
The site's html files are contained in the repo root. The main page is index.html. To view the site you just need to navigate to the root of the repo.
This code queries the wmflabs etytree-virtuoso sparql endpoint which I have set up and populated with data (RDF) produced with dbnary_etymology.
An example query to the sparql endpoint follows:
PREFIX eng: <http://etytree-virtuoso.wmflabs.org/dbnary/eng/>
SELECT ?p ?o {
eng:__ee_door ?p ?o
}
If you want to find all entries containing string "door":
SELECT DISTINCT ?s {
?s rdfs:label ?label .
?label bif:contains "door" .
}
If you want to find ancestors of "door":
PREFIX dbetym: <http://etytree-virtuoso.wmflabs.org//dbnaryetymology#>
PREFIX eng: <http://etytree-virtuoso.wmflabs.org/dbnary/eng/>
SELECT DISTINCT ?o {
eng:__ee_1_door dbetym:etymologicallyRelatedTo+ ?o .
}
INSTALL jsdoc-to-markdown
You would to have sudo privileges
npm install -g jsdoc-to-markdown
mkdir ./docs
cd ./resources/js/
jsdoc2md -f app.js datamodel.js data.js etytree.js liveTour.js graph.js > ../../docs/test.md
The RDF database of etymological relationships is periodically extracted when a new dump of the English Wiktionary is released. The code used to extract the data is available at dbnary_etymology.
dbnary_etymology is a Maven project (use java 8 and maven3).
Let's assume you cloned the repository in your home:
cd ~/dbnary_etymology/
mvn site
mvn javadoc:jar
First you need an XML dump of English Wiktionary. Then you need to convert it into UTF-8 format (using iconv for example):
VERSION=20170920
DATA_DIR=/srv/datasets/dumps/$VERSION/ #output data folder
tmp_dump=/public/dumps/public/enwiktionary/$VERSION/enwiktionary-$VERSION-pages-articles.xml.bz2 #path to the dump
mkdir ${DATA_DIR}
dump=${DATA_DIR}/enwiktionary-$VERSION-pages-articles.utf-16.xml
bzcat ${tmp_dump} |iconv -f UTF-8 -t UTF-16 > $dump #This operation takes approximately 7 minutes.
With the following code you can extract data relative to English words:
OUT_DIR=/srv/datasets/dbnary/$VERSION/ #output folder
LOG_DIR=/srv/datasets/dbnary/$VERSION/logs/
EXECUTABLE=~/dbnary_etymology/dbnary-extractor/target/dbnary-extractor-2.0e-SNAPSHOT-jar-with-dependencies.jar
mkdir ${OUT_DIR}
mkdir ${LOG_DIR}
PREFIX=http://etytree-virtuoso.wmflabs.org/dbnary
LOG_FILE=${LOG_DIR}/enwkt-$VERSION.ttl.log
OUT_FILE=${OUT_DIR}/enwkt-$VERSION.ttl
ETY_FILE=${OUT_DIR}/enwkt-$VERSION.etymology.ttl
rm ${LOG_FILE}
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug -cp $EXECUTABLE org.getalp.dbnary.cli.ExtractWiktionary -l en --prefix $PREFIX -E ${ETY_FILE} -o ${OUT_FILE} $dump test 3>&1 1>>${LOG_FILE} 2>&1 #This operation takes approximately 45 minutes
#compress the output if needed
gzip ${OUT_FILE}
gzip ${ETY_FILE}
#after inspecting the log file, I usually only keep the last few lines
tail ${LOG_FILE} > ${LOG_DIR}/tmp
mv ${LOG_DIR}/tmp ${LOG_FILE}
For memory reasons I only process a subset of the full data set at a time (from page 0 to page 1800000 - which takes approximately 100 minutes, from page 1899999 to page 3600000 which takes approximately 50 minutes, from page 3600000 to page 6000000 which takes approximately 100 minutes). Note that 24G are needed to process the data.
fpage=0
tpage=1800000
LOG_FILE=${LOG_DIR}/enwkt-$VERSION_x_${fpage}_${tpage}.ttl.log
OUT_FILE=${OUT_DIR}/enwkt-$VERSION_x_${fpage}_${tpage}.ttl
ETY_FILE=${OUT_DIR}/enwkt-$VERSION_x_${fpage}_${tpage}.etymology.ttl
rm ${LOG_FILE}
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug -cp $EXECUTABLE org.getalp.dbnary.cli.ExtractWiktionary -l en --prefix $PREFIX -x --frompage $fpage --topage $tpage -E ${ETY_FILE} -o ${OUT_FILE} $dump test 3>&1 1>>${LOG_FILE} 2>&1
gzip ${OUT_FILE}
gzip ${ETY_FILE}
#after inspecting the log file, I usually only keep the last few lines
tail ${LOG_FILE} > ${LOG_DIR}/tmp
mv ${LOG_DIR}/tmp ${LOG_FILE}
fpage=1800000
tpage=3600000
LOG_FILE=${LOG_DIR}/enwkt-$VERSION_x_${fpage}_${tpage}.ttl.log
OUT_FILE=${OUT_DIR}/enwkt-$VERSION_x_${fpage}_${tpage}.ttl
ETY_FILE=${OUT_DIR}/enwkt-$VERSION_x_${fpage}_${tpage}.etymology.ttl
rm ${LOG_FILE}
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug -cp $EXECUTABLE org.getalp.dbnary.cli.ExtractWiktionary -l en --prefix $PREFIX -x --frompage $fpage --topage $tpage -E ${ETY_FILE} -o ${OUT_FILE} $dump test 3>&1 1>>${LOG_FILE} 2>&1
gzip ${OUT_FILE}
gzip ${ETY_FILE}
#after inspecting the log file, I usually only keep the last few lines
tail ${LOG_FILE} > ${LOG_DIR}/tmp
mv ${LOG_DIR}/tmp ${LOG_FILE}
fpage=3600000
tpage=6000000
LOG_FILE=${LOG_DIR}/enwkt-$VERSION_x_${fpage}_${tpage}.ttl.log
OUT_FILE=${OUT_DIR}/enwkt-$VERSION_x_${fpage}_${tpage}.ttl ETY_FILE=${OUT_DIR}/enwkt-$VERSION_x_${fpage}_${tpage}.etymology.ttl
rm ${LOG_FILE}
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug -cp $EXECUTABLE org.getalp.dbnary.cli.ExtractWiktionary -l en --prefix $PREFIX -x --frompage $fpage --topage $tpage -E ${ETY_FILE} -o ${OUT_FILE} $dump test 3>&1 1>>${LOG_FILE} 2>&1
gzip ${OUT_FILE}
gzip ${ETY_FILE}
#after inspecting the log file, I usually only keep the last few lines
tail ${LOG_FILE} > ${LOG_DIR}/tmp
mv ${LOG_DIR}/tmp ${LOG_FILE}
WORD="door"
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary.eng=debug -cp $EXECUTABLE org.getalp.dbnary.cli.GetExtractedSemnet -x -l en --etymology testfile $dump $WORD
For VERSION=20170920:
cp ~/dbnary_etymology/dbnary-ontology/src/main/resources/org/getalp/dbnary/dbnary_etymology.owl /srv/datasets/dbnary/$VERSION/
cp ~/dbnary_etymology/dbnary-ontology/src/main/resources/org/getalp/dbnary/dbnary.owl /srv/datasets/dbnary/$VERSION/
From isql execute the following steps (step A):
SPARQL CLEAR GRAPH <http://etytree-virtuoso.wmflabs.org/dbnary>;
SPARQL CLEAR GRAPH <http://etytree-virtuoso.wmflabs.org/dbnaryetymology>;
ld_dir ('/srv/datasets/dbnary/20170920/', '*.ttl.gz','http://etytree-virtuoso.wmflabs.org/dbnary');
ld_dir ('/srv/datasets/dbnary/20170920/', '*.owl','http://etytree-virtuoso.wmflabs.org/dbnaryetymology');
-- do the following to see which files were registered to be added:
SELECT * FROM DB.DBA.LOAD_LIST;
-- if unsatisfied use:
-- delete from DB.DBA.LOAD_LIST;
rdf_loader_run(); ----- 1378390 msec.
-- do nothing too heavy while data is loading
checkpoint; ----- 50851 msec.
commit WORK; ----- 1417 msec.
checkpoint;
EXIT;
In case an error occurs:
12:00:44 PL LOG: File /srv/datasets/dbnary/20170920//enwkt-0_1800000.etymology.ttl.gz error 37000 SP029: TURTLE RDF loader, line 10636983: syntax error processed pending to here.
12:06:09 PL LOG: File /srv/datasets/dbnary/20170920//enwkt-1800000_3600000.etymology.ttl.gz error 37000 SP029: TURTLE RDF loader, line 4772623: syntax error processed pending to here.
edit files manually:
zcat /srv/datasets/dbnary/20170920//enwkt-0_1800000.etymology.ttl.gz > /srv/datasets/dbnary/20170920//enwkt-0_1800000.etymology.ttl
emacs -nw /srv/datasets/dbnary/20170920//enwkt-0_1800000.etymology.ttl.gz #goto-line 10636983
#change line
gzip /srv/datasets/dbnary/20170920//enwkt-0_1800000.etymology.ttl
Go to step A above and repeat. Then run the following command from the terminal
isql 1111 dba password /opt/virtuoso/db/bootstrap.sql
After dealing with errors relaunch the server.
From isql:
sparql SELECT COUNT(*) WHERE { ?s ?p ?o } ;
sparql SELECT ?g COUNT(*) { GRAPH ?g {?s ?p ?o.} } GROUP BY ?g ORDER BY DESC 2;
-- Build Full Text Indexes by running the following commands using the Virtuoso isql program
RDF_OBJ_FT_RULE_ADD (null, null, 'All');
VT_INC_INDEX_DB_DBA_RDF_OBJ ();
-- Run the following procedure using the Virtuoso isql program to populate label lookup tables periodically and activate the Label text box of the Entity Label Lookup tab:
urilbl_ac_init_db();
-- Run the following procedure using the Virtuoso isql program to calculate the IRI ranks. Note this should be run periodically as the data grows to re-rank the IRIs.
s_rank();
The following link will help you set up CORS for Virtuoso: http://vos.openlinksw.com/owiki/wiki/VOS/VirtTipsAndTricksCORsEnableSPARQLURLs
To start:
cd /opt/virtuoso/db
virtuoso-t -f
To stop:
cd /opt/virtuoso-opensource/bin
isql 1111 dba password
SQL> shutdown();
- Add qualifiers to links between nodes: inherited word (template inherited), borrowed word (template borrowed), named from people, developed from initialism, surface analysis, long detailed etymology (propose a new template?), invented word/coined expression (coined by), back-formation (e.g.: burglar -> burgle, play the tamburine -> tambour, i.e. remove a morpheme, real or perceived) (template back-form), compound (template compound), initialism, acronym, abbreviation, clipping, blend/portmanteau (template blend), calque/loan translation, year template (propose a new template?), cognates (I actually plan to ignore this).
- Parse glosses in templates
- Parse nested templates
- Add zoom to tooltip
- Add etymology controversies.
- Add alternative etymologies.
- Parse diacritics.
- Maybe consider Dialects:
Module:da:Dialects ?
Module:en:Dialects This module provides labels to {{alter}}, which is used in the Alternative forms section.
Module:grc:Dialects This module translates from dialect codes to dialect names for templates such as {{alter}}. (e.g. aio -> link = 'Aeolic Greek', display = 'Aeolic')
Module:he:Dialects
Module:hy:Dialects ?
Module:la:Dialects (e.g.: aug -> link = Late Latin#Late and post-classical Latin, display = post-Augustan)
- Maybe consider additional modules:
Module:families/data mapping language code -> language name (e.g.: aav -> canonicalName = "Austro-Asiatic",otherNames = {"Austroasiatic"}