liao961120/dcard-corpus

Dcard post data

This repo hosts post data retrieved from the Dcard API, collected for the purpose of building a small corpus. The posts come from the 100 most popular forums on Dcard, and each post is at least 100 characters long.

The post data were segmented and PoS tagged using ckiplab/ckiptagger.

Files

Concordancer

The quickest way to query KWIC concordances in this corpus is to run the concordancer with Docker.

Download image:

docker pull liao961120/dcard

Run server:

docker run -it -p 127.0.0.1:1420:80 liao961120/dcard

When you see Corpus Loaded printed on the command line, you can visit https://kwic.yongfu.name to use the app.

The source code of the concordancer can be found in liao961120/kwic and liao961120/kwic-backend. Read more about the concordancer in this post.
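
For readers unfamiliar with KWIC (keyword in context) output, the idea can be sketched in a few lines of Python. This is only an illustration, not the implementation used in liao961120/kwic or liao961120/kwic-backend: the `kwic` function, its `window` parameter, and the sample sentence are all hypothetical, and the real concordancer also lets you query PoS tags.

```python
from typing import List, Tuple


def kwic(tokens: List[str], keyword: str, window: int = 3) -> List[Tuple[List[str], str, List[str]]]:
    """Return (left context, keyword, right context) for every exact match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]     # up to `window` tokens before the match
            right = tokens[i + 1:i + 1 + window]    # up to `window` tokens after the match
            hits.append((left, tok, right))
    return hits


# Hypothetical segmented post (not taken from the corpus).
tokens = ["我", "真的", "覺得", "大家", "都", "很", "喜歡", "這樣", "的", "時候"]

for left, kw, right in kwic(tokens, "覺得", window=2):
    print(" ".join(left), f"[{kw}]", " ".join(right))
```

Each hit keeps the keyword in the middle with a fixed number of tokens on either side, which is what makes concordance lines easy to scan and sort.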

Corpus Stats

  • Number of tokens: 5292615
  • Number of posts: 19224
    • Female author: 12007 (62.46%)
    • Male author: 7217 (37.54%)
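
The percentages above follow directly from the raw counts. The short check below recomputes them and also derives the mean number of tokens per post, a figure not stated in the original stats. The variable names are illustrative, and the token count presumably includes punctuation and whitespace tokens, since those appear in the word list.

```python
# Counts copied from the stats above.
n_tokens = 5292615
n_posts = 19224
n_female = 12007
n_male = 7217

# The two author groups should account for every post.
assert n_female + n_male == n_posts

female_share = n_female / n_posts * 100
male_share = n_male / n_posts * 100
tokens_per_post = n_tokens / n_posts  # derived here, not reported in the original

print(f"Female author: {female_share:.2f}%")
print(f"Male author: {male_share:.2f}%")
print(f"Mean tokens per post: {tokens_per_post:.1f}")
```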

Word List (100 most frequent)

rank token pos count
1 DE 219170
2 COMMACATEGORY 214385
3 Nh 110994
4 SHI 87591
5 V_2 57263
6 WHITESPACE 54959
7 PERIODCATEGORY 53661
8 D 49773
9 <URL> FW 46867
10 Neu 46811
11 Di 45339
12 . PERIODCATEGORY 43362
13 P 42562
14 D 40836
15 Nf 38905
16 D 36364
17 Nep 34835
18 D 34730
19 Dfa 31142
20 PAUSECATEGORY 28877
21 D 28162
22 EXCLAMATIONCATEGORY 26118
23 Nh 25930
24 Na 23626
25 QUESTIONCATEGORY 22487
26 Nh 22130
27 COLONCATEGORY 21278
28 VE 20998
29 Cbb 20839
30 D 20698
31 VE 19552
32 T 17949
33 PARENTHESISCATEGORY 17945
34 PARENTHESISCATEGORY 17902
35 PARENTHESISCATEGORY 17717
36 D 17090
37 自己 Nh 16816
38 可以 D 16785
39 ( PARENTHESISCATEGORY 16666
40 DASHCATEGORY 16400
41 PARENTHESISCATEGORY 16037
42 ) PARENTHESISCATEGORY 15383
43 P 14860
44 Di 14313
45 Nh 14053
46 因為 Cbb 13466
47 Nf 13462
48 大家 Nh 13311
49 VH 13173
50 真的 D 12655
51 VC 12612
52 T 12334
53 Nep 11579
54 Ncd 11502
55 知道 VK 11115
56 覺得 VK 11043
57 所以 Cbb 11017
58 P 10927
59 我們 Nh 10806
60 T 10511
61 VL 10505
62 D 10353
63 D 9750
64 什麼 Nep 9458
65 Ng 9090
66 D 8849
67 D 8535
68 Neu 8447
69 Di 8369
70 Nf 8249
71 Da 8117
72 D 8076
73 Da 8011
74 喜歡 VK 8001
75 D 7873
76 Nf 7870
77 還是 D 7831
78 Dfa 7646
79 VC 7582
80 Nes 7542
81 時候 Na 7460
82 ETCCATEGORY 7374
83 VC 7307
84 P 7278
85 如果 Cbb 7265
86 P 7013
87 這樣 VH 6938
88 VH 6930
89 P 6923
90 看到 VE 6879
91 沒有 VJ 6842
92 T 6571
93 Dfa 6539
94 時間 Na 6467
95 P 6467
96 VH 6439
97 比較 Dfa 6409
98 一下 Nd 6376
99 然後 D 6307
100 Caa 6291
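
A frequency list of this shape can be produced by counting (token, pos) pairs across all posts. The sketch below assumes each post is already available as a list of (token, pos) tuples; the `posts` data and the `word_list` helper are hypothetical and do not reflect the repo's actual file format or scripts.

```python
from collections import Counter

# Hypothetical tagged posts: each post is a list of (token, pos) pairs.
posts = [
    [("自己", "Nh"), ("覺得", "VK"), ("大家", "Nh"), ("真的", "D"), ("喜歡", "VK")],
    [("大家", "Nh"), ("知道", "VK"), ("自己", "Nh"), ("喜歡", "VK"), ("什麼", "Nep")],
]


def word_list(posts, top_n=100):
    """Rank (token, pos) pairs by frequency, most frequent first."""
    counts = Counter(pair for post in posts for pair in post)
    return [
        (rank, token, pos, count)
        for rank, ((token, pos), count) in enumerate(counts.most_common(top_n), start=1)
    ]


for rank, token, pos, count in word_list(posts, top_n=3):
    print(rank, token, pos, count)
```

Counting the pair rather than the token alone keeps homographs with different tags apart, which would explain why the same tag can occur on many rows of the list.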