Project to build an AI training dataset for printed Tamil character recognition (OCR)

This project generates labeled data sets of Tamil letters (Uyir, Mei, and Uyirmei) covering 347 letter forms, so that classifiers can be trained on this data.

Tamil alphabet

This project is open-sourced under the MIT License.

Keras Models - building deep learning models

The program tfkeras_demo.py builds a model with the Keras/TensorFlow library. It trains a simple 2-layer CNN that reaches 92% accuracy on the training data and 82% on the test data.

Outliers - some faulty data images

Some images in the data set are malformed. That is, the letter shape does not fit inside the 28x28 square: it is clipped, and looks as if it has "spilled" outside the frame. Such an image is not a valid training image. These images should be treated as outliers and removed from the training data.
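One way to flag such images automatically is to check whether any inked pixel sits on the outer edge of the 28x28 frame. The sketch below is only an illustration under stated assumptions: images are plain lists of rows with 0 as background, and the helper names `touches_border` and `make_blank` are hypothetical, not part of this repository.

```python
def touches_border(image, ink_threshold=0):
    """Return True if any inked pixel lies on the outer edge of the frame.

    Assumption for this sketch: `image` is a list of rows of grayscale
    values where 0 is background and anything above `ink_threshold` is ink.
    """
    top, bottom = image[0], image[-1]
    left = [row[0] for row in image]
    right = [row[-1] for row in image]
    return any(v > ink_threshold
               for edge in (top, bottom, left, right)
               for v in edge)


def make_blank(size=28):
    """Create an empty size x size image (all background)."""
    return [[0] * size for _ in range(size)]


# A glyph that sits fully inside the frame.
centered = make_blank()
for r in range(8, 20):
    for c in range(8, 20):
        centered[r][c] = 255

# A glyph that runs off the right-hand edge of the frame.
clipped = make_blank()
for r in range(8, 20):
    for c in range(18, 28):
        clipped[r][c] = 255

# Keep only the images that do not touch the border.
samples = {"centered": centered, "clipped": clipped}
kept = [name for name, img in samples.items() if not touches_border(img)]
print(kept)
```

A border test like this is a heuristic: it can miss glyphs that are clipped but re-padded, so a manual review of flagged images is still useful.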

Font Resources:

Thamizha Tamil fonts: https://github.com/thamizha/tamil-fonts
Tamil fonts for various encodings: http://tamilnation.co/digital/Tamil%20Fonts%20%26%20Software.htm#Unicode_Fonts
Apple Mac OS-X: https://support.apple.com/en-us/HT206872#download

MNIST data set meta-data

The MNIST data set is a 60000x784 array of image data with a 60000x1 array of labels.
Spreading 60000 letter images across the 13 letters of Tamil Uyir + Ayudham gives about 4616 images per letter.
If we use 50 fonts, each font needs about 93 modified sets of the same 13 letters.
We need to make a font list available for use as a config file.
The target is therefore 93 sets of 13 letters per font. The 93 sets come from transformations: scale (use 2 fonts for this), translate, and rotate. Since 93/4 is about 23, this could be 23 translations and 23 rotations, or alternatively 30 rotations and 16 translations.
Finally, a PR version of the data set is shown as a 16x13 composite of 28x28 px images, which is 448x364 px in size.
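The figures above follow from simple ceiling division. The sketch below reproduces them; the constant names are chosen here for illustration only.

```python
import math

TOTAL_IMAGES = 60000   # MNIST-sized data set
NUM_LETTERS = 13       # Tamil Uyir + Ayudham
NUM_FONTS = 50
TILE_PX = 28           # each letter image is 28x28 px

# Images needed per letter, rounded up so the total reaches 60000.
images_per_letter = math.ceil(TOTAL_IMAGES / NUM_LETTERS)

# Modified (transformed) sets each font must contribute per letter.
sets_per_font = math.ceil(images_per_letter / NUM_FONTS)

# Pixel size of a 16x13 composite of 28x28 tiles.
composite_px = (16 * TILE_PX, 13 * TILE_PX)

print(images_per_letter, sets_per_font, composite_px)
```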

Alternate algorithm using available Resources

The algorithm can use the existing fonts to build a 20k set with rotation, a 20k set with translation, and a 20k set with both rotation and translation applied in any order.
It uses a round-robin queue based method to produce the training data.
The round-robin method provides about 4616 samples of each label, which is close to uniform.
Only 35 of the fonts I collected are suitable for Unicode processing. TAM/TAB fonts are nice but carry significant overhead at this time.
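A minimal sketch of the round-robin idea, assuming labels are numbered 0 to 12 and that the three 20k blocks are generated one after another. The names `build_plan`, `MODES`, and `BLOCK` are hypothetical and not taken from the repository; the sketch only plans which label and transformation each sample gets, without rendering any images.

```python
from collections import Counter
from itertools import cycle, islice

NUM_LABELS = 13      # Tamil Uyir + Ayudham
TOTAL = 60000        # samples to generate
BLOCK = 20000        # samples per transformation mode
MODES = ("rotate", "translate", "rotate+translate")


def build_plan(total=TOTAL, num_labels=NUM_LABELS):
    """Return a list of (label, mode) pairs, cycling labels round-robin."""
    labels = islice(cycle(range(num_labels)), total)
    plan = []
    for i, label in enumerate(labels):
        # Clamp the index so totals above 3 * BLOCK reuse the last mode.
        mode = MODES[min(i // BLOCK, len(MODES) - 1)]
        plan.append((label, mode))
    return plan


plan = build_plan()
label_counts = Counter(label for label, _ in plan)
mode_counts = Counter(mode for _, mode in plan)

print(min(label_counts.values()), max(label_counts.values()))
print(dict(mode_counts))
```

Because 60000 is not an exact multiple of 13, some labels get one more sample than others, which is why the per-label count is only approximately uniform.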

Memory capabilities

A 60000x784 matrix of floats loads easily into RAM, taking about 324MB.
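For comparison, the raw array size can be estimated from the element count and the float width. The sketch below computes it for 4-byte and 8-byte floats; it makes no claim about how the figure quoted above was measured, which may depend on the data type and on process overhead.

```python
ROWS, COLS = 60000, 784

elements = ROWS * COLS      # total number of pixel values
bytes_f32 = elements * 4    # 32-bit floats
bytes_f64 = elements * 8    # 64-bit floats

print(f"float32: {bytes_f32 / 1e6:.2f} MB")
print(f"float64: {bytes_f64 / 1e6:.2f} MB")
```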

