Traditional Chinese Handwriting Dataset

繁體中文手寫資料集

Preface

On the road of data science, we believe every scholar and scientist has heard of the MNIST dataset of handwritten digits, and many have played with Fashion MNIST. As users of traditional Chinese, we couldn't help but wonder: can machine learning and neural networks also learn to recognize handwritten traditional Chinese characters? Let's take up the challenge!

Description

The original dataset was generated with Tegaki, an open-source package. It contains 13,065 distinct Chinese characters, with an average of 50 samples per character.
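
The image totals reported in this README (684,677 images for 13,065 characters in the whole dataset; 250,712 images for 4,803 characters in the common-character dataset) can be used to cross-check the stated average. The short sketch below does that arithmetic; it uses only figures quoted in this README, and the names `releases` and `average_samples` are illustrative.

```python
# Counts taken from this README: (characters, images) per release.
releases = {
    "whole": (13_065, 684_677),
    "common": (4_803, 250_712),
}


def average_samples(characters, images):
    """Mean number of images per character."""
    return images / characters


for name, (characters, images) in releases.items():
    print(f"{name}: {average_samples(characters, images):.1f} images per character")
```

Both releases come out at roughly 52 images per character, so the figure of 50 is best read as a rounded value.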

Updates

2020.04.21 Added dataset deployment examples (thanks to Dr. Yen-Lin for the enthusiastic contribution)
2020.04.20 Uploaded the first dataset (4,803 common characters; image size: 50x50 pixels; 250,712 images in total) (the Ministry of Education list has 4,808 commonly used characters)
2020.09.03 Released the whole dataset (13,065 characters; image size: 300x300 pixels; 684,677 images in total)

Data samples

Whole dataset - one folder per character sample
Handwritten examples of "自由" ("freedom")
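
Given a folder-per-character layout like the one pictured for the whole dataset, the images can be indexed by label with a few lines of Python. This is a minimal sketch under two assumptions that the README does not state explicitly: each character has its own subfolder whose name is the label, and the images are PNG files. `index_by_character` and `summarize` are illustrative names, not part of the dataset's tooling.

```python
from collections import defaultdict
from pathlib import Path


def index_by_character(root, pattern="*.png"):
    """Map each character label to the sorted list of its image paths."""
    index = defaultdict(list)
    for path in sorted(Path(root).rglob(pattern)):
        # Assumption: the parent folder name is the character label.
        index[path.parent.name].append(path)
    return dict(index)


def summarize(index):
    """Return (number of characters, number of images, mean images per character)."""
    images = sum(len(paths) for paths in index.values())
    average = images / len(index) if index else 0.0
    return len(index), images, average
```

Calling `summarize(index_by_character("cleaned_data"))` after extraction gives a quick sanity check against the character and image counts quoted in this README.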

Usage

1. Whole dataset (13,065 characters)

git clone https://github.com/chenkenanalytic/handwritting_data_all.git

cat (file_path)/all_data.zip.* > (file_path)/all_data.zip

unzip (file_path)/all_data.zip -d (output_path)

※ Replace (file_path) and (output_path) with your actual paths. The extracted folder is named cleaned_data and contains 684,677 images.
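
Where `cat` and `unzip` are not available (for example on Windows), the same two steps can be sketched in Python with the standard library. The sketch assumes the cloned repository holds numbered parts (such as all_data.zip.001) whose suffixes sort in the correct order; `join_parts` and `extract` are hypothetical helper names.

```python
import shutil
import zipfile
from pathlib import Path


def join_parts(parts_dir, joined_name="all_data.zip"):
    """Concatenate the numbered parts of joined_name into one archive."""
    parts_dir = Path(parts_dir)
    # Match only the numbered parts so an already joined archive is not re-read.
    parts = sorted(parts_dir.glob(joined_name + ".*"))
    if not parts:
        raise FileNotFoundError(f"no {joined_name}.* parts in {parts_dir}")
    joined = parts_dir / joined_name
    with open(joined, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return joined


def extract(archive, output_path):
    """Extract the joined archive and return the output directory."""
    output_path = Path(output_path)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(output_path)
    return output_path
```

A typical call would be `extract(join_parts("handwritting_data_all"), "output")`, with both paths adjusted to your own setup.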

Whole dataset - deployment

Colab reference code

2. Common-character dataset (4,803 characters)

git clone https://github.com/AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset.git

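※ After downloading the common-character dataset, extract the four archives in the data folder. The extracted folder is named cleaned_data(50_50) and contains 250,712 images.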
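
The extraction step can also be scripted. The sketch below extracts every archive found in the data folder and counts the resulting images so the total can be compared with the 250,712 stated above. It assumes the four files are ZIP archives and the images are PNG files, neither of which this README states explicitly; `extract_archives` and `count_images` are illustrative names.

```python
import zipfile
from pathlib import Path


def extract_archives(data_dir, output_path, pattern="*.zip"):
    """Extract every archive in data_dir matching pattern into output_path."""
    output_path = Path(output_path)
    archives = sorted(Path(data_dir).glob(pattern))
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(output_path)
    return archives


def count_images(root, suffix=".png"):
    """Count files under root whose extension equals suffix (case-insensitive)."""
    return sum(
        1
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() == suffix
    )
```

If the count after extraction differs from the stated total, one of the archives was probably skipped or only partially downloaded.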

Common-character dataset - deployment (thanks to Dr. Yen-Lin for the enthusiastic contribution)

Colab reference code

Local reference code

Issues

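Because the common-character dataset was downscaled to 50x50 pixels, some images have unclear or overlapping strokes. (The whole dataset, at 300x300 pixels, is largely free of this problem.)
When the whole dataset is extracted on Colab with the deployment example, the Chinese-character file names come out garbled.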
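
One possible cause of garbled file names, offered here as an assumption since this README does not diagnose the problem, is that the archive stores names in a legacy encoding without the ZIP UTF-8 flag. Python's `zipfile` decodes such names as CP437, so the original bytes can be recovered and decoded again. The sketch below does this; the default of `big5` is a guess and may need to be changed (for example to `utf-8` or `cp950`), and all function names are illustrative.

```python
import shutil
import zipfile
from pathlib import Path

UTF8_FLAG = 0x800  # general-purpose bit 11: the stored name is UTF-8


def fix_name(name, encoding):
    """Undo zipfile's CP437 fallback decoding; keep the name if it does not round-trip."""
    try:
        return name.encode("cp437").decode(encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name


def fixed_members(zf, encoding):
    """Yield (ZipInfo, corrected name) for every member of an open ZipFile."""
    for info in zf.infolist():
        if info.flag_bits & UTF8_FLAG:
            yield info, info.filename  # already decoded correctly
        else:
            yield info, fix_name(info.filename, encoding)


def extract_with_fixed_names(archive, output_path, encoding="big5"):
    """Extract a trusted archive, writing each member under its corrected name."""
    output_path = Path(output_path)
    with zipfile.ZipFile(archive) as zf:
        for info, name in fixed_members(zf, encoding):
            target = output_path / name
            if info.is_dir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
    return output_path
```

Listing a few names with `fixed_members` before extracting is a cheap way to check whether the chosen encoding produces readable characters. The extraction helper does not guard against unsafe member paths, so it should only be used on archives you trust.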

License

(CC BY-NC-SA 4.0)
This dataset is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International license.

※ When using, adapting, or sharing the dataset, please include the following information:

This dataset was adapted and developed by AI . FREE Team from [STUST EECS_Chinese MNIST(總集)]. If you use, modify, or share it, please cite the source and this message.
(source: https://github.com/AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset )

Citing

@misc{AI.FREE2020,
  author = {Po-Chuan Chen},
  title = {Traditional Chinese Handwriting Dataset},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset}},
}

Source

Original dataset: https://scidm.nchc.org.tw/dataset/stusteecs_chinese_mnist

Introduction video: https://www.youtube.com/watch?v=eJy1BtkqHX4

Description: This dataset was developed from the Chinese handwriting dataset provided by Dept. EECS, Southern Taiwan University of Science and Technology.

