`make lists -j32` doesn't seem to be honoring the thread count. (Also happens when calling `make training -j32`) #382

ipaqmaster · 2024-04-01T07:15:51Z

Hi team,

I'm training a model on a font with START_MODEL=eng. The resulting .traineddata recognizes most text in that font correctly, but a few things still trip it up. It was trained on only a couple thousand lines.

To lazily work around this, I'm trying again with far more training lines than before (160k; overkill, and likely a lot of pointless training cycles).

During `make training` I've noticed that many preparation steps run in parallel, but the lists step seems to call `tesseract data/font-ground-truth/abc_00001.tif data/font-ground-truth/abc_00001 --psm 13 lstm.train` on one .tif at a time.

In my limited experience with this software, this looks like a step that could run concurrently. That would speed up the initial data preparation and get to the actual training sooner, without having to resort to scripting.

Is it possible to make this training preparation step run in parallel with multiple -jxx jobs?


ipaqmaster · 2024-04-01T08:34:24Z

Worked around this with the below bash scripting to speed things up:

cd tesstrain

# Generate the .box files: print one command per image that has no .box yet, then let GNU parallel run them.
find data/*ground-truth/ -type f -name '*.tif' | while IFS= read -r line ; do base="${line%.tif}" ; [ ! -f "${base}.box" ] && echo "PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i \"${line}\" -t \"${base}.gt.txt\" > \"${base}.box\"" ; done | parallel -j"$(nproc)"

# Generate the .lstmf files: only for images that already have a .box and do not have a .lstmf yet.
find data/*ground-truth/ -type f -name '*.tif' | while IFS= read -r line ; do base="${line%.tif}" ; [ ! -f "${base}" ] && [ -f "${base}.box" ] && [ ! -f "${base}.lstmf" ] && echo "tesseract \"${line}\" \"${base}\" --psm 13 lstm.train" ; done | parallel -j"$(nproc)"

stweil · 2024-04-10T11:14:46Z

I always used `make -j` for parallel builds of box and lstmf files, and it worked fine (with png images instead of tiff, but that should not matter). Meanwhile I have an even better alternative which no longer requires box and lstmf files.

yaofuzhou · 2024-05-23T20:07:14Z

Hi - not necessarily the answer you were looking for, but Tesstrain is essentially a wrapper to help you run a sequence of Tesseract binaries with hopefully the correct parameters. Here is my way to significantly speed up the development process -

1. Use GPT/Claude to decompose the Tesstrain makefile into a series of components:
   - A master Makefile
   - A config.mk to store all parameters, which the other components can include
   - unicharset.mk for `make unicharset`
   - lists.mk for `make lists`
   - training.mk for `make training`
   - and perhaps a misc.mk for the rest
2. Understand what each component does. Ask GPT/Claude to explain it to you if needed.
3. Translate the core task of each .mk component to Python, which GPT/Claude is much better at. Have, say, unicharset.mk call a Python unicharset.py that executes the same tasks.
4. Identify which tasks in each .py are parallelizable, and ask GPT/Claude to modify the code to use multithreading or multiprocessing.
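As a rough illustration of that last step, here is a minimal sketch of what a parallel lists step might look like in Python. It is not part of Tesstrain: the function names and the `runner` parameter are hypothetical, and the directory `data/font-ground-truth` is simply the one from the original report. The only thing taken from the thread is the `tesseract <image> <base> --psm 13 lstm.train` call and the rule of skipping images that lack a .box file or already have a .lstmf file.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pending_images(ground_truth_dir):
    """Return the .tif files that have a .box file but no .lstmf file yet."""
    pending = []
    for tif in sorted(Path(ground_truth_dir).glob("*.tif")):
        has_box = tif.with_suffix(".box").is_file()
        has_lstmf = tif.with_suffix(".lstmf").is_file()
        if has_box and not has_lstmf:
            pending.append(tif)
    return pending


def lstmf_command(tif):
    """Build the tesseract call quoted in the thread for one line image."""
    base = tif.with_suffix("")  # same path without the .tif extension
    return ["tesseract", str(tif), str(base), "--psm", "13", "lstm.train"]


def run_all(ground_truth_dir, jobs=None, runner=subprocess.run):
    """Run one command per pending image, at most `jobs` at a time.

    `runner` is a hypothetical hook so the function can be tried
    without tesseract installed; by default it is subprocess.run.
    """
    commands = [lstmf_command(tif) for tif in pending_images(ground_truth_dir)]
    # Threads are sufficient here: each worker only waits on an external process.
    with ThreadPoolExecutor(max_workers=jobs or os.cpu_count()) as pool:
        return list(pool.map(runner, commands))


if __name__ == "__main__":
    # Directory name taken from the original report; adjust to your model name.
    run_all("data/font-ground-truth")
```

Because the default runner does not check exit codes, a real version would probably want to inspect the returned `CompletedProcess` objects and report any image for which tesseract failed.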
