Evaluate the quality of the content itself using machine learning.
Get a Qiita API token and set it as an environment variable (only the read_qiita scope is required).
$ export QiitaToken=xxx
Then use the Dockerfile and run!
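As a minimal sketch of how a script might pick up the token exported above, the helper below reads QiitaToken from the environment and builds a request header. The function name build_auth_headers is hypothetical and not part of this repository; the Bearer scheme is what Qiita API v2 uses for access tokens.

```python
import os


def build_auth_headers(env_var: str = "QiitaToken") -> dict:
    """Return HTTP headers carrying the Qiita token read from the environment.

    build_auth_headers is a hypothetical helper, not a function of this repository.
    """
    token = os.environ.get(env_var)
    if not token:
        # Fail early with a clear message instead of sending an unauthenticated request.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    # Qiita API v2 expects the access token as a Bearer token.
    return {"Authorization": f"Bearer {token}"}
```

Failing early when the variable is missing makes a forgotten export easier to notice than a later authentication error.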
- Place the Qiita posts in data/raw/items.
  - You can get Qiita posts via the Qiita API.
  - Each post is one JSON file named after its post id (like 0a0000aa0a0000a00aa0.json).
- Place the annotated file labeled_qiita_posts.csv in data/raw.
  - Its format is No, url, Title, and annotator1, annotator2, ... (column names are as you like).
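The layout described above (one JSON file per post, named after the post id) could be produced roughly as follows. This is only a sketch under assumptions: fetch_items and save_items are hypothetical helpers, not scripts shipped with this repository, and fetch_items needs network access and a valid token.

```python
import json
import urllib.request
from pathlib import Path

# Qiita API v2 endpoint that lists posts.
API_URL = "https://qiita.com/api/v2/items"


def fetch_items(token: str, page: int = 1, per_page: int = 20) -> list:
    """Fetch one page of posts from the Qiita API (requires network access)."""
    request = urllib.request.Request(
        f"{API_URL}?page={page}&per_page={per_page}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def save_items(items: list, out_dir: str = "data/raw/items") -> list:
    """Write each post to <out_dir>/<post id>.json and return the written paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for item in items:
        path = target / f"{item['id']}.json"
        # ensure_ascii=False keeps Japanese text readable in the saved file.
        path.write_text(json.dumps(item, ensure_ascii=False), encoding="utf-8")
        paths.append(path)
    return paths
```

A typical call would be save_items(fetch_items(token)), with token taken from the QiitaToken environment variable.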
Run the following script.
python scripts/data/make_data.py
The labeled JSON output is then stored in data/processed/items.
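To illustrate the kind of join such a step involves, here is a rough sketch that maps each annotated row to its post id. It is not the actual make_data.py: read_labels and post_id_from_url are hypothetical names, and the column names No, url, and Title are assumed (the annotated file may use others). It relies on Qiita post URLs ending with the post id.

```python
import csv


def post_id_from_url(url: str) -> str:
    """Return the last path segment of a Qiita post URL, which is the post id."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def read_labels(csv_file, fixed_columns=("No", "url", "Title")) -> dict:
    """Map post id -> {annotator column: label} from an open annotated CSV file.

    Every column outside fixed_columns is treated as an annotator column;
    the fixed column names are an assumption for this sketch.
    """
    labels = {}
    for row in csv.DictReader(csv_file):
        annotations = {k: v for k, v in row.items() if k not in fixed_columns}
        labels[post_id_from_url(row["url"])] = annotations
    return labels
```

The resulting dictionary can be looked up by the file name (without .json) of each post in data/raw/items to attach the annotations to that post.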
Next, execute preprocessing.
python scripts/data/preprocessing.py
posts.json will be created in data/processed/.
posts.json includes the split tokens of each post. You can use this to get the words in the posts.
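As a hedged example of using the tokens, the sketch below counts word frequencies across posts. The structure of posts.json is not documented here, so the code assumes a JSON array of objects that each carry a list under a "tokens" key; both that key and the helper names load_posts and count_words are hypothetical and may need adjusting to the real file.

```python
import json
from collections import Counter


def load_posts(path: str = "data/processed/posts.json") -> list:
    """Load the preprocessed posts (assumed to be a JSON array of objects)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_words(posts: list, token_key: str = "tokens") -> Counter:
    """Count how often each token appears across all posts.

    token_key is an assumed field name; posts without it contribute nothing.
    """
    counter = Counter()
    for post in posts:
        counter.update(post.get(token_key, []))
    return counter
```

For instance, count_words(load_posts()).most_common(20) would list the twenty most frequent words under these assumptions.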