Python code to produce an SQLite database, ready to offer lemma search on the web for EpiDoc XML Greek documents.
This code is at a very early stage and is not ready for distribution. It works, at least for the developers. File an issue above if you want to work with the developers to make it work for your corpus.
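To make the idea of "lemma search" concrete, here is a minimal sketch of a token table indexed by lemma, using the standard sqlite3 module. The table name `tok`, its columns, the file name `iliad.xml` and the helper `search_lemma` are all hypothetical, chosen for illustration; the database actually produced by this code may be organized quite differently.

```python
import sqlite3

# Hypothetical schema, for illustration only: not the project's real tables.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tok (doc TEXT, pos INTEGER, form TEXT, lemma TEXT)")
# An index on the lemma column keeps lookups fast on a large corpus.
con.execute("CREATE INDEX tok_lemma ON tok (lemma)")

# A few hand-written sample tokens (document, position, surface form, lemma).
rows = [
    ("iliad.xml", 0, "μῆνιν", "μῆνις"),
    ("iliad.xml", 1, "ἄειδε", "ἀείδω"),
    ("iliad.xml", 2, "θεά", "θεά"),
]
con.executemany("INSERT INTO tok VALUES (?, ?, ?, ?)", rows)


def search_lemma(con, lemma):
    """Return (doc, pos, form) for every token carrying the given lemma."""
    cur = con.execute(
        "SELECT doc, pos, form FROM tok WHERE lemma = ? ORDER BY doc, pos",
        (lemma,),
    )
    return cur.fetchall()


# Look up every occurrence of the lemma of the first sample token.
print(search_lemma(con, rows[0][3]))
```

Storing one row per token lets a web front end answer a lemma query with a single indexed SELECT, whatever inflected form appears in the text.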
- A corpus of Greek texts conforming to the tei-epidoc.rng schema. Example:
…/myprojects$git clone https://github.com/OpenGreekAndLatin/First1KGreek.git
- A Python 3 installation, >= 3.6, < 3.10 (as of 2022-01)
ubuntu.21.10:…$python3 -V
Python 3.9.7
- The pip package installer
- lxml, the Python wrapper around libxml2 and libxslt, for XSLT transformations
ubuntu.21.10:…$sudo pip3 install lxml
- pie_extended, the lemmatizer from Thibault Clérice, with the Greek model. Installation takes a while and can land you in dependency hell if some required packages are already installed in versions other than those pie expects. The following scenario has worked (Cython allows scikit to recompile itself):
ubuntu.21.10:…$sudo pip3 install Cython
ubuntu.21.10:…$sudo pip3 install pie-extended
ubuntu.21.10:…$pie-extended download grc
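The Python version constraint listed above (>= 3.6, < 3.10, as of 2022-01) can also be checked from Python itself before installing anything. This is a small sketch; the function name `python_is_supported` and the two constants are illustrative and not part of this project.

```python
import sys

MIN_VERSION = (3, 6)   # inclusive lower bound, from the requirements above
MAX_VERSION = (3, 10)  # exclusive upper bound, valid as of 2022-01


def python_is_supported(version=None):
    """Return True if (major, minor) lies in the range [3.6, 3.10)."""
    major, minor = tuple(version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) < MAX_VERSION


if __name__ == "__main__":
    running = sys.version.split()[0]
    verdict = "supported" if python_is_supported() else "outside the tested range"
    print(f"Python {running}: {verdict}")
```

Comparing (major, minor) tuples avoids the classic mistake of comparing version strings, where "3.10" would sort before "3.6".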
Not stable for now.
For faster lemmatisation, if you have an Nvidia graphics card, you can use it for work (and not only for gaming). Install the latest Nvidia drivers and the CUDA toolkit to use the processors of your graphics card, and install the Python library
ubuntu.21.10:~$ sudo apt install nvidia-cuda-toolkit
Installation for Windows
- Install the Nvidia CUDA drivers
- Install PyTorch 1.7.1. Lemmatization with papie 0.3.9 requires torch<=1.7.1,>=1.3.1; choose the torch version according to your CUDA driver version
- (The full lemmatization of the Iliad and the Odyssey takes about 5 minutes with CUDA on an RTX 3060 Ti and about 13 minutes without; the speedup factor of 2.6 is about the same with a much larger corpus.)
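Once the drivers and PyTorch are installed, a quick way to see whether the GPU is actually usable is to ask PyTorch. The sketch below falls back to the CPU when torch is missing or sees no CUDA device; the helpers `pick_device` and `speedup` are illustrative names, not part of pie-extended or of this project.

```python
def pick_device():
    """Return "cuda" if PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # optional: only present once PyTorch is installed
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def speedup(cpu_minutes, gpu_minutes):
    """Ratio between the CPU and GPU run times."""
    return cpu_minutes / gpu_minutes


if __name__ == "__main__":
    print("device:", pick_device())
    # Timings quoted above for the Iliad and the Odyssey.
    print("speedup:", speedup(13, 5))
```

If this prints `cpu` on a machine with an Nvidia card, the usual suspects are a torch build that does not match the installed CUDA version, or a CPU-only torch wheel.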
A Python package usually assumes that you already have a working Python installation. If you do not, and you are on Windows, the system will not guide you toward good choices the way Linux does. Here are some hints that may save you time, at least as of 2022-01.
- Install Python 3.8; don’t try to be newer than others. Verbapy is a Digital Humanities library and requires research libraries. Researchers are not paid to discover new bugs in new versions of Python. In the installer, tick "Add Python 3.8 to PATH" and pip NOW (much easier than fixing it afterwards).
- Don’t try to install Python globally on Windows (e.g. C:\Program Files\Python38). This good practice for a Linux admin will land you in "deps hell" on Windows.
- Verify these commands in your preferred console
win10>python -V
Python 3.8.10
win10>where python
C:\Users\{YOU}\AppData\Local\Programs\Python\Python38\python.exe
- Update pip (the Python package installer)
win10>pip install --user --upgrade pip
(--user should not be required, but sometimes it seems to be)
- You should now have a working Python. Try installing an important requirement
win10>pip install lxml
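The checks above can be repeated from inside Python, which is handy when several interpreters are installed. This sketch reports whether the running interpreter sits under a "Program Files" directory (the global install advised against above) and whether lxml can be imported. The function names are illustrative and not part of this project.

```python
import importlib.util
import sys
from pathlib import PureWindowsPath


def is_global_windows_install(executable):
    """True if the interpreter path lies under a "Program Files" directory."""
    parts = [part.lower() for part in PureWindowsPath(executable).parts]
    return any(part.startswith("program files") for part in parts)


def missing_modules(names=("lxml",)):
    """Return the top-level modules in `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("global install:", is_global_windows_install(sys.executable))
    print("missing modules:", missing_modules())
```

An empty "missing modules" list together with a per-user interpreter path means the Windows setup matches the hints given here.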