\cleardoublepage

# (APPENDIX) Appendix {-}

# Data sources {#data-source}

## Literature search strategy

Using the two topic terms "gut" and "microbiota", we searched the WOS database for literature published since 2000 and retrieved 56,445 records (as of 2020-1-3).
Of these, 2,135 are highly cited papers (Highly Cited in Field).

- Database: Web of Science Core Collection, including SCI-EXPANDED, SSCI, CPCI-S, CPCI-SSH, CCR-EXPANDED, and IC.
- Time span: 2000 to 2020.
- Search query: `TS=(gut OR intestine OR bowel OR intestinal OR colon OR colorectal OR gastrointestine OR gastrointestinal) AND TS=(microbiome OR microbiota OR flora OR microbe OR microbes OR commensal OR symbiont OR pathobiont OR mycobiome OR virome OR metagenome OR meta-genome)`

**How Web of Science defines a highly cited paper**

> As of September/October 2019, this highly cited paper received enough citations to place it in the top 1% of the academic field of Agricultural Sciences based on a highly cited threshold for the field and publication year.
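
## Downloading the complete gut microbiota research data

Download the complete data (Full Record and Cited References) and save it as plain-text files.
Web of Science limits each export to 500 records, so the data must be saved in several batches. Once every batch has been downloaded, the files can be merged into one.
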
```{sh eval=F}
cat export.*.txt > all.text
```
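
As a rough planning aid, the batch boundaries implied by the 500-record export limit can be worked out in advance. The sketch below is only an illustration: it uses the record count reported above, and the variable names are made up for this example.

```{r eval=FALSE}
n_records  <- 56445   # record count reported in the search section
batch_size <- 500     # Web of Science export limit per batch

# First and last record number of each export batch
first <- seq(1, n_records, by = batch_size)
last  <- pmin(first + batch_size - 1, n_records)
batches <- data.frame(batch = seq_along(first), first = first, last = last)

nrow(batches)      # number of export files to expect
tail(batches, 1)   # the final, partially filled batch
```

Comparing `nrow(batches)` with the number of `export.*.txt` files before merging is a cheap way to notice a skipped batch.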
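
# Literature screening {#core-article}

## How do we identify the core literature?

At the outset our analyses relied on all 50,000-plus records, but we soon found that this volume made the more complex analyses computationally expensive, and that complete coverage was not necessary anyway. A method for identifying the core literature was therefore needed. Highly cited papers are one candidate, but that label is assigned relative to broad academic fields and may not reflect a paper's importance to gut microbiota research. Moreover, none of the records published before 2009 is marked as highly cited, so some classic microbiota papers would be missed. We therefore screened the literature by **local citation score**: for each year, the papers whose local citation score ranked in the top 5% were treated as key papers and analyzed in greater depth.

Let us look at what the local citation score means.

**Local citation score** (LCS) is the number of times a paper is cited within the current dataset.

**Global citation score** (GCS) is the number of times a paper is cited by all documents in the WOS database, that is, the citation count shown on the Web of Science website.

A high GCS means that a paper has attracted attention from scientists worldwide. If GCS is high but LCS is low, however, most of that attention comes from scientists outside your field, and the paper may be of limited relevance to you. A high LCS indicates that the paper is closely related to the field covered by your dataset. LCS can therefore be used to quickly locate the classic papers of a field.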
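
To make the distinction concrete, here is a minimal sketch that counts LCS on a toy dataset. All records, links, and GCS values are made up, and the column names `citing` and `cited` are assumptions for this example; this is not how the LCS values in this book were computed.

```{r eval=FALSE}
# Toy data (made up): four papers in the dataset and their global counts
papers <- data.frame(
  SR  = c("A 2010", "B 2012", "C 2015", "D 2018"),
  GCS = c(900, 40, 300, 5),
  stringsAsFactors = FALSE
)

# Toy citation links: `citing` cites `cited`; "X 2005" is outside the dataset
links <- data.frame(
  citing = c("B 2012", "C 2015", "D 2018", "D 2018", "D 2018"),
  cited  = c("A 2010", "B 2012", "B 2012", "C 2015", "X 2005"),
  stringsAsFactors = FALSE
)

# LCS counts only citations where both papers belong to the dataset
local_links <- links[links$citing %in% papers$SR & links$cited %in% papers$SR, ]
papers$LCS <- as.integer(table(factor(local_links$cited, levels = papers$SR)))

# Sort by LCS: the paper with the highest GCS is not the one with the highest LCS
papers[order(-papers$LCS), ]
```

In this toy case "A 2010" has by far the highest GCS, yet "B 2012" is the paper most cited inside the dataset.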
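
Notably, screening by the top 5% of LCS yields `r nrow(MC)` papers. These cover about two thirds of the WoS highly cited papers, leaving out more than 700 of them, while adding roughly 800 papers that are not highly cited (Figure \@ref(fig:HC-vs-MC)).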
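
The code that derives the `CORE` flag is not shown in this appendix, so the following is only one plausible way to flag the top 5% of LCS within each publication year. The helper `pct_rank()` is a made-up base-R function intended to mirror the `percent_rank()` call used elsewhere in this appendix, and the data are random toy values.

```{r eval=FALSE}
# Toy data (made up): 40 papers in each of two years with distinct LCS values
set.seed(1)
toy <- data.frame(
  PY  = rep(c(2018, 2019), each = 40),
  LCS = c(sample(0:200, 40), sample(0:200, 40))
)

# Percent rank within a group: 0 for the lowest value, 1 for the highest
# (a group with a single paper would give NaN)
pct_rank <- function(x) (rank(x, ties.method = "min") - 1) / (length(x) - 1)

# Rank within each publication year, then keep the top 5%
toy$rank_in_year <- ave(as.numeric(toy$LCS), toy$PY, FUN = pct_rank)
toy$CORE <- toy$rank_in_year >= 0.95

table(toy$PY, toy$CORE)
```

Ranking within each year rather than across the whole dataset keeps recent papers, which have had less time to accumulate citations, from being crowded out by older ones.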

Figure \@ref(fig:HC-vs-MC-year) shows in which years these differences are concentrated.
```{r HC-vs-MC-year, fig.cap="Differences between highly cited papers and core papers"}
# Papers found in only one of the two sets
HC_specific <- setdiff(highly_cited$SR, MC$SR)
MC_specific <- setdiff(MC$SR, highly_cited$SR)
# Negative counts draw the "Missing" bars on the opposite side of the axis
df1 <- M %>% filter(SR %in% HC_specific) %>% group_by(PY, DT) %>% summarise(count = -n(), type = "Missing")
df2 <- M %>% filter(SR %in% MC_specific) %>% group_by(PY, DT) %>% summarise(count = n(), type = "Added")
p <- rbind(df1, df2) %>%
  ggplot(aes(PY, count, fill = DT, color = type,
             text = paste0(type, ": ", DT, ", ", abs(count), " papers"))) +
  geom_col() +
  coord_flip() +
  scale_color_brewer(palette = "Set1") +
  labs(x = "", y = "") +
  guides(fill = guide_legend(title = "Document type", byrow = TRUE, ncol = 2),
         color = "none") +
  theme(legend.position = c(.8, .2))
# ggplotly() converts the ggplot object into an interactive plotly figure
ggplotly(p) %>%
  layout(showlegend = FALSE)
```
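
The set operations behind this comparison can be tried on a small example. The identifiers below are made up and merely stand in for the `SR` column; the counts have nothing to do with the real data.

```{r eval=FALSE}
# Toy identifiers (made up) standing in for the SR column
highly_cited_ids <- c("P1", "P2", "P3", "P4", "P5", "P6")
core_ids         <- c("P3", "P4", "P5", "P6", "P7", "P8", "P9")

shared    <- intersect(core_ids, highly_cited_ids)  # in both sets
hc_only   <- setdiff(highly_cited_ids, core_ids)    # highly cited only ("missing")
core_only <- setdiff(core_ids, highly_cited_ids)    # core only ("added")

c(shared = length(shared), hc_only = length(hc_only), core_only = length(core_only))

# Share of the highly cited papers that the core set covers
length(shared) / length(highly_cited_ids)
```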

### Article lists

```{r}
# Articles that are LCS core papers but not WoS highly cited
MC_specific_article <- M %>%
  filter(CORE == TRUE, HC == FALSE) %>%
  mutate(title = str_to_title(TI),
         source = paste0("<a href=\"https://doi.org/", DI, "\">Full text</a>")) %>%
  tibble::column_to_rownames("SR")
# Articles that are WoS highly cited but not LCS core papers
HC_specific_article <- M %>%
  filter(CORE == FALSE, HC == TRUE) %>%
  mutate(title = str_to_title(TI),
         source = paste0("<a href=\"https://doi.org/", DI, "\">Full text</a>")) %>%
  tibble::column_to_rownames("SR")
# Articles that belong to both sets
MC_HC_article <- M %>%
  filter(CORE == TRUE, HC == TRUE) %>%
  mutate(title = str_to_title(TI),
         source = paste0("<a href=\"https://doi.org/", DI, "\">Full text</a>")) %>%
  tibble::column_to_rownames("SR")
```
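
For the differences shown in Figure \@ref(fig:HC-vs-MC), we compiled separate lists of the articles that appear only among the LCS core papers and only among the highly cited papers.

The table below lists the articles that appear only among the LCS core papers (Figure \@ref(fig:MC-specific-article))[^data-table].

[^data-table]: The table in this figure is interactive: click a column header to sort, use the search box to search, filter the rows with the boxes under the headers, or click the buttons to export the data as an Excel file.
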
```{r eval=F}
# List elements are written to individual worksheets, using list names as sheet names if available
article_lists <- list("LCS-specific" = MC_specific_article,
                      "HighlyCited-specific" = HC_specific_article,
                      "LCS-and-HighlyCited" = MC_HC_article)
# Keep the same columns in every sheet; lapply() preserves the list names
l <- lapply(article_lists, function(x) {
  x[, c("title", "PY", "DT", "ID", "impact_factor", "LCS", "TC", "source")]
})
openxlsx::write.xlsx(l, file = "MC-specific-articles.xlsx", asTable = TRUE, rowNames = TRUE)
```
```{r MC-specific-article, fig.cap="Articles that appear only among the LCS core papers"}
library(DT)
datatable(MC_specific_article[, c("PY", "impact_factor", "LCS", "TC", "title", "source")],
          colnames = c("Year", "IF", "LCS", "GCS", "Title", "Full text"),
          escape = FALSE,
          filter = "top",
          caption = "",
          extensions = c("Buttons"),
          width = "95%",
          options = list(dom = 'Bfrtip',
                         pageLength = 20,
                         buttons = list(
                           'pageLength',
                           list(extend = 'copy'),
                           list(extend = "excel",
                                filename = "export",
                                header = TRUE,
                                title = "")
                         ),
                         columnDefs = list(
                           list(width = "35%", targets = 5)
                         ),
                         lengthMenu = list(c(20, 50, 100, -1),
                                           c("20", "50", "100", "All"))))
```

The table below lists the articles that appear only among the WoS highly cited papers (Figure \@ref(fig:HC-specific-article)).

```{r HC-specific-article,fig.cap="Articles that appear only among the WoS highly cited papers"}
datatable(HC_specific_article[, c("PY", "impact_factor", "LCS", "TC", "title", "source")],
          colnames = c("Year", "IF", "LCS", "GCS", "Title", "Full text"),
          escape = FALSE,
          filter = "top",
          caption = "",
          extensions = c("Buttons"),
          width = "95%",
          options = list(dom = 'Bfrtip',
                         pageLength = 20,
                         buttons = list(
                           'pageLength',
                           list(extend = 'copy'),
                           list(extend = "excel",
                                filename = "export",
                                header = TRUE,
                                title = "")
                         ),
                         columnDefs = list(
                           list(width = "35%", targets = 5)
                         ),
                         lengthMenu = list(c(20, 50, 100, -1),
                                           c("20", "50", "100", "All"))))
```

The table below lists the articles that are both WoS highly cited and LCS core papers (Figure \@ref(fig:MC-HC-article)).

```{r MC-HC-article,fig.cap="Articles that are both WoS highly cited and LCS core papers"}
datatable(MC_HC_article[, c("PY", "impact_factor", "LCS", "TC", "title", "source")],
          colnames = c("Year", "IF", "LCS", "GCS", "Title", "Full text"),
          escape = FALSE,
          filter = "top",
          caption = "",
          extensions = c("Buttons"),
          width = "95%",
          options = list(dom = 'Bfrtip',
                         pageLength = 20,
                         buttons = list(
                           'pageLength',
                           list(extend = 'copy'),
                           list(extend = "excel",
                                filename = "export",
                                header = TRUE,
                                title = "")
                         ),
                         columnDefs = list(
                           list(width = "35%", targets = 5)
                         ),
                         lengthMenu = list(c(20, 50, 100, -1),
                                           c("20", "50", "100", "All"))))
```
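
### Research progress in the last two years

Here we pay particular attention to the latest research of the last two years.

The table below lists the research articles from the last two years selected by LCS (Figure \@ref(fig:recent-LCS-research-article)).
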
```{r recent-LCS-research-article,fig.cap="Important research articles from the last two years (highly cited by LCS)"}
# Task note: export the important research and perspective papers of the last two years
# (highly cited papers plus gut microbiota studies with an impact factor of 20 or above
# covered by the daily digest) as a table. The complete list of keyword changes and the
# top 50 high-frequency keywords per year are to follow.
db_cache <- "E:/Spring_Work/Corporate_Bussiness/C40_Data/cache/"
papers <- readRDS(paste0(db_cache, "papers.RDS")) %>% as_tibble()
fragments <- readRDS(paste0(db_cache, "fragments.RDS")) %>% as_tibble()
# Audited digest entries joined with their fragments
real_paper <- papers %>%
  filter(status == 3 & classify == "audit" & journal_id == "1") %>%
  select(uuid, title, share_title, summary, remark, fragment_id) %>%
  left_join(fragments %>% rename("fragment_id" = id) %>% select(-uuid, -title, -remark))
# Upper-case DOIs so they can be matched against the DI field
mc <- real_paper %>% mutate(DI = toupper(doi)) %>% select(DI, title, summary, uuid, share_title)
recent_research_article <- MC %>%
  left_join(mc) %>%
  filter(PY >= 2018, DT != "REVIEW") %>%
  mutate(DT = as_factor(DT),
         source = ifelse(str_detect(DI, "10"),
                         mrgut_permanent_link(base_url = "https://doi.org/",
                                              type = "html",
                                              uuid = DI,
                                              title = J9,
                                              alt = ""),
                         ""),
         daily = ifelse(is.na(uuid),
                        "",
                        mrgut_permanent_link(type = "html",
                                             uuid = uuid,
                                             title = title,
                                             alt = share_title)),
         Title = str_to_title(TI)) %>%
  select(Title, PY, DT, impact_factor, source, daily, TC, LCS)
DT::datatable(recent_research_article,
              escape = FALSE,
              rownames = FALSE,
              filter = "top",
              width = "95%",
              caption = "",
              extensions = c("Buttons"),
              options = list(dom = 'Bfrtip',
                             pageLength = 20,
                             buttons = list(
                               'pageLength',
                               list(extend = 'copy'),
                               list(extend = "excel",
                                    filename = "export",
                                    header = TRUE,
                                    title = "")
                             ),
                             columnDefs = list(
                               list(width = "35%", targets = 0),
                               list(width = "35%", targets = 5)
                             ),
                             lengthMenu = list(c(20, 50, 100, -1),
                                               c("20", "50", "100", "All"))))
```
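
The table below lists the research articles published in the last two years with an impact factor above 20 that were covered by 《热心肠日报》 (Figure \@ref(fig:high-IF-research-article-in-daily)).
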
```{r high-IF-research-article-in-daily, fig.cap="High-level research articles with an impact factor > 20 covered by 《热心肠日报》"}
high_IF_daily_research <- real_paper %>%
  mutate(DI = toupper(doi),
         year = year(time)) %>%
  select(DI, title, year, type, periodical, summary, uuid, share_title) %>%
  left_join(MC) %>%
  filter(type %in% c("Article", "Communication", "Letter", "Perspective"),
         impact_factor >= 20,
         year >= 2018) %>%
  mutate(DT = as_factor(DT),
         source = ifelse(str_detect(DI, "10"),
                         mrgut_permanent_link(base_url = "https://doi.org/",
                                              type = "html",
                                              uuid = DI,
                                              title = periodical,
                                              alt = ""),
                         ""),
         daily = ifelse(is.na(uuid),
                        "",
                        mrgut_permanent_link(type = "html",
                                             uuid = uuid,
                                             title = title,
                                             alt = share_title)),
         Title = str_to_title(title)) %>%
  select(Title, year, DT, impact_factor, source, daily, TC, LCS)
DT::datatable(high_IF_daily_research,
              escape = FALSE,
              rownames = FALSE,
              filter = "top",
              width = "95%",
              caption = "",
              extensions = c("Buttons"),
              options = list(dom = 'Bfrtip',
                             pageLength = 20,
                             buttons = list(
                               'pageLength',
                               list(extend = 'copy'),
                               list(extend = "excel",
                                    filename = "export",
                                    header = TRUE,
                                    title = "")
                             ),
                             columnDefs = list(
                               list(width = "35%", targets = 0),
                               list(width = "35%", targets = 5)
                             ),
                             lengthMenu = list(c(20, 50, 100, -1),
                                               c("20", "50", "100", "All"))))
```
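
### Key papers

Key papers are selected by considering both citations and impact factor, that is, the LCS highly cited papers plus each year's high-impact papers (the top 5% ranked by IF).

1. List of papers (by document type)
2. Clustering by keyword: the list of high-frequency keywords (top 50) and the articles associated with each keyword
    1. Research articles only
    2. Reviews only
    3. All document types
3. Clustering by keyword co-occurrence: the list of high-frequency co-occurring keywords (top 50) and the articles associated with each co-occurring group
    1. Research articles only
    2. Reviews only
    3. All document types
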
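
Item 2 of this outline, a keyword frequency list plus the articles behind each keyword, can be sketched in base R as follows. The records are made up; the only assumption carried over from the data is that `ID` holds keywords separated by `;`.

```{r eval=FALSE}
# Toy records (made up); ID holds ';'-separated keywords
toy <- data.frame(
  SR = c("A 2018", "B 2019", "C 2019"),
  ID = c("GUT MICROBIOTA; OBESITY; INFLAMMATION",
         "OBESITY; BILE ACIDS",
         "INFLAMMATION; OBESITY"),
  stringsAsFactors = FALSE
)

# One row per (article, keyword) pair
kw_list <- lapply(strsplit(toy$ID, ";", fixed = TRUE), trimws)
pairs <- data.frame(
  SR      = rep(toy$SR, lengths(kw_list)),
  keyword = unlist(kw_list),
  stringsAsFactors = FALSE
)

# Keyword frequencies, most frequent first; head() keeps at most the top 50
freq <- sort(table(pairs$keyword), decreasing = TRUE)
top_keywords <- head(freq, 50)

# Articles associated with each keyword
articles_by_keyword <- split(pairs$SR, pairs$keyword)
articles_by_keyword[["OBESITY"]]
```

Restricting `toy` to one document type before building `pairs` would give the per-type variants listed in the outline.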
```{r fig.cap="Key papers from the last three years"}
# Note: percent_rank() is evaluated over all rows of M here (assuming M is ungrouped),
# not separately within each year
recent_research_article_2019 <- M %>%
  filter(PY >= 2017, percent_rank(impact_factor) >= 0.95 | CORE == TRUE) %>%
  left_join(mc) %>%
  mutate(DT = as_factor(DT),
         source = ifelse(str_detect(DI, "10"),
                         mrgut_permanent_link(base_url = "https://doi.org/",
                                              type = "html",
                                              uuid = DI,
                                              title = J9,
                                              alt = ""),
                         ""),
         daily = ifelse(is.na(uuid),
                        "",
                        mrgut_permanent_link(type = "html",
                                             uuid = uuid,
                                             title = title,
                                             alt = share_title)),
         Title = str_to_title(TI))
DT::datatable(recent_research_article_2019 %>%
                select(Title, DT, PY, impact_factor, source, daily, TC, LCS, ID) %>%
                mutate(ID = str_to_title(ID),
                       PY = as_factor(PY)),
              escape = FALSE,
              rownames = FALSE,
              filter = "top",
              width = "95%",
              caption = "",
              extensions = c("Buttons"),
              options = list(dom = 'Bfrtip',
                             pageLength = 10,
                             buttons = list(
                               'pageLength',
                               list(extend = 'copy'),
                               list(extend = "excel",
                                    header = TRUE,
                                    title = "")
                             ),
                             columnDefs = list(
                               list(width = "25%", targets = 0),
                               list(width = "6em", targets = 2),
                               list(width = "25%", targets = 5),
                               list(width = "15%", targets = 8)
                             ),
                             lengthMenu = list(c(10, 20, 50, 100, -1),
                                               c("10", "20", "50", "100", "All"))))
```
```{r keyword-tf-idf-2019, fig.cap="Mining the keywords of 2019"}
# Redefine tableTag() (bibliometrix has a function of the same name):
# tabulate the entries of a tagged field
tableTag <- function(M, Tag = "CR", sep = ";") {
  if (Tag %in% c("AB", "TI")) {
    M <- termExtraction(M, Field = Tag, stemming = FALSE, verbose = FALSE)
    i <- dim(M)[2]
  } else {
    i <- which(names(M) == Tag)
  }
  if (Tag == "C1") {
    # Drop the bracketed author lists that precede each address
    M$C1 <- gsub("\\[.+?]", "", M$C1)
  }
  # M[[i]] returns a plain vector for data frames and tibbles alike
  Tab <- unlist(strsplit(as.character(M[[i]]), sep))
  Tab <- trimws(trimES(Tab))
  Tab <- Tab[Tab != ""]
  Tab <- sort(table(Tab), decreasing = TRUE)
  return(Tab)
}
# Keyword frequencies, excluding the terms used in the search query
ID <- tableTag(recent_research_article_2019, Tag = "ID") %>%
  data.frame() %>%
  filter(!str_detect(Tab, search_keyword_regex)) %>%
  mutate(Keyword = as_factor(str_to_title(Tab))) %>%
  ungroup() %>%
  select(Keyword, Freq)
```
```{r eval=F}
# Keyword co-occurrence network
mat <- biblioNetwork(recent_research_article_2019, analysis = "co-occurrences", network = "keywords")
net <- networkPlot(NetMatrix = mat,
                   # normalize = "jaccard",
                   weighted = TRUE,
                   n = 100,
                   Title = year,
                   verbose = FALSE)
g <- net$graph_pajek
# Simplify the graph: drop edges with weight > 2,
# then remove the vertices that match the search keywords (common words)
g <- delete.edges(g, E(g)[edge_attr(g)$weight > 2])
g <- delete.vertices(g, V(g)[str_detect(vertex.attributes(g)$id, search_keyword_regex)])
# Cluster
cluster <- cluster_fast_greedy(g)
vertex_attr(g)$group <- membership(cluster)
# size by deg
vertex_attr(g)$size <- vertex_attr(g)$deg/15
# Visualize
data <- toVisNetworkData(g)
nodes <- data$nodes %>%
  mutate(size = log(deg))
edges <- data$edges
visNetwork(nodes, edges, width = 1000, height = 800) %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE, selectedBy = "group") %>%
  visNodes(size = "size") %>%
  visIgraphLayout()
```
```{r eval=FALSE}
# Attach each keyword's degree and cluster from the network nodes
ID <- ID %>% mutate(id = tolower(Keyword)) %>%
  left_join(nodes) %>%
  select(Keyword, Freq, deg, group) %>%
  mutate(group = as_factor(group))
DT::datatable(ID,
              rownames = FALSE,
              filter = "top",
              width = "95%",
              extensions = c("Buttons"),
              options = list(dom = 'Bfrtip',
                             pageLength = 20,
                             buttons = list(
                               'pageLength',
                               list(extend = 'copy'),
                               list(extend = "excel",
                                    header = TRUE,
                                    title = "")
                             ),
                             columnDefs = list(
                               list(width = "75%", targets = 0)
                             ),
                             lengthMenu = list(c(10, 20, 50, 100, -1),
                                               c("10", "20", "50", "100", "All"))))
```
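
## Cases where database records are incomplete

Country information is incomplete for `r nrow(df)` articles. Most of these cases stem from incomplete data collection in the database (Table \@ref(tab:AU-CO-NA)) rather than from errors in the software itself, and they cannot be resolved for now.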
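
The chunks that follow find such records with the pattern `\\bNA\\b`. A small sketch on made-up values shows why the word boundaries matter; `countries` merely stands in for the `AU_CO_NR` column.

```{r eval=FALSE}
# Toy values (made up) standing in for the AU_CO_NR column
countries <- c("USA;CHINA", "NA;USA", "CANADA", "NA", "NAMIBIA;FRANCE")

# \\b matches a word boundary, so only a standalone NA token is flagged;
# "CANADA" and "NAMIBIA" contain the letters NA but are left alone
has_missing <- grepl("\\bNA\\b", countries, perl = TRUE)

data.frame(countries, has_missing)
sum(has_missing)
```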
```{r}
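# Records with faulty country information, counted by document type
M %>% filter(str_detect(AU_CO_NR, "\\bNA\\b")) %>%
  select(AU, SO, DT, C1, AU_CO_NR) %>%
  data.frame() %>%
  tableTag(Tag = "DT") %>% # tableTag() seems to fail for some rows; this happens when no element contains the separator.
  barplot()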
```
```{r AU-CO-NA}
# Show the first ten records whose country field contains NA
df <- M %>% filter(str_detect(AU_CO_NR, "\\bNA\\b")) %>%
  select(SR, C1, AU_CO_NR) %>%
  head(10)
kable(df, caption = "Records with missing country information")
```
```{r}
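# Extract author affiliations with a custom function:
# keep the first comma-separated field of every address in C1
AU_UN_wos <- function(C1, sep = ";") {
  # Remove the bracketed author lists that precede each address
  AFF <- trim(gsub("\\[.*?\\]", "", C1))
  listAFF <- strsplit(AFF, sep, fixed = TRUE)
  AFFL <- lapply(listAFF, function(l) {
    affL <- strsplit(l, ",", fixed = TRUE)
    lapply(affL, function(x) {
      return(trim(x[[1]]))
    })
  })
  # Re-join the institutions of each record with the separator
  AFF <- sapply(AFFL, function(x) paste0(x, collapse = sep))
  AFF <- gsub("\\&", "AND", AFF)
  return(AFF)
}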
```
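
`bibliometrix` extracts institution information by matching a set of control tags. Some institution names contain none of the tags listed below, so extraction fails for them. We therefore use a custom function to improve the extraction.
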
```{r, echo=T,eval=F}
uTags=c("UNIV","COLL","SCH","INST","ACAD","ECOLE","CTR","SCI","CENTRE","CENTER","CENTRO","HOSP","ASSOC","COUNCIL",
"FONDAZ","FOUNDAT","ISTIT","LAB","TECH","RES","CNR","ARCH","SCUOLA","PATENT OFF","CENT LIB","HEALTH","NATL",
"LIBRAR","CLIN","FDN","OECD","FAC","WORLD BANK","POLITECN","INT MONETARY FUND","CLIMA","METEOR","OFFICE","ENVIR",
"CONSORTIUM","OBSERVAT","AGRI", "MIT ", "INFN", "SUNY ")
```
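
The improved extraction is more accurate. Table \@ref(tab:AU-UN-NA) compares the institution fields produced by the old and new methods: `C1` is the author address field, `AU_UN` is the institution information extracted by `bibliometrix`, and `AU_UN2` is the institution information obtained with the improved method. The comparison shows that the new method yields more accurate and complete information.
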
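```{r AU-UN-NA}
# Compare both extraction methods on a reproducible random sample of 100 records
set.seed(20200213)
a <- M %>% sample_n(100)
a <- metaTagExtraction(a, Field = "AU_UN")
a$AU_UN2 <- AU_UN_wos(a$C1)
a %>% filter(AU_UN != AU_UN2) %>%
  select(SR, AU_UN, AU_UN2, C1) %>%
  DT::datatable(caption = "Differences in institution information obtained by the two methods")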
```
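
The idea behind `AU_UN_wos()`, taking the first comma-separated field of each address, can be tried on a single string. The sketch below is a simplified, self-contained variant written for illustration: it uses base `trimws()` instead of the `trim()` helper, omits the `&` replacement, and the address string and the function name `first_address_field()` are made up.

```{r eval=FALSE}
# Simplified stand-in for AU_UN_wos(): first comma-separated field per address
first_address_field <- function(C1, sep = ";") {
  # Remove the bracketed author lists, which may themselves contain ';'
  aff <- trimws(gsub("\\[.*?\\]", "", C1))
  vapply(strsplit(aff, sep, fixed = TRUE), function(addresses) {
    fields <- strsplit(addresses, ",", fixed = TRUE)
    first  <- vapply(fields, function(x) trimws(x[[1]]), character(1))
    paste0(first, collapse = sep)
  }, character(1))
}

# Made-up address string in the style of the C1 field
c1 <- "[ZHANG, SAN; LI, SI] CHINESE ACAD SCI, INST MICROBIOL, BEIJING, PEOPLES R CHINA; [SMITH, J] BROAD INST, CAMBRIDGE, MA USA"
first_address_field(c1)
```

Because the function relies only on the position of the field, it does not depend on the institution name containing any particular tag.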
```{r}
# Tabulate institutions from the bibliometrix field and from the custom field
M <- metaTagExtraction(M, Field = "AU_UN")
original <- tableTag(M, "AU_UN")
M$AU_UN2 <- AU_UN_wos(M$C1)
modified <- tableTag(M, "AU_UN2")
df <- data.frame(original) %>% left_join(data.frame(modified), by = c("Tab" = "Tab"))
colnames(df) <- c("AFF", "original", "modified")
# The 1000 institutions with the largest combined counts
top_AFF <- df %>% group_by(AFF) %>% summarise(total = original + modified) %>%
  arrange(desc(total)) %>% head(1000) %>% pull(AFF)
df <- df %>% mutate(diff = modified - original, AFF = as_factor(AFF)) %>%
  pivot_longer(cols = c("original", "modified"), names_to = "type", values_to = "count")
# Label the institutions whose counts changed the most
df_label <- df %>% filter(diff != 0) %>% filter(percent_rank(abs(diff)) > 0.99) %>%
  select(AFF, diff) %>%
  mutate(color = ifelse(diff > 0, "red", "blue")) %>% unique()
p <- df %>% filter(AFF %in% top_AFF) %>%
  ggplot(aes(AFF, count, fill = type, color = type, group = type)) +
  geom_point() +
  geom_line() +
  geom_area(alpha = 1/3, position = "identity") +
  # geom_line(aes(AFF,diff,color=I(color)),inherit.aes = FALSE,data = df_label) +
  geom_point(aes(AFF, diff, color = I(color)), inherit.aes = FALSE, data = df_label) +
  ggrepel::geom_text_repel(aes(AFF, diff, label = AFF, color = I(color)),
                           data = df_label %>% filter(AFF %in% top_AFF), inherit.aes = FALSE) +
  theme(axis.text.x = element_blank())
plotly::ggplotly(p)
```
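
# Evaluation by impact factor {#impact-factor}

Our analysis makes little use of the impact factor and instead relies mainly on citation counts to assess the importance of an article. Compared with the impact factor, citation counts better reflect the importance of an individual article. Citations are themselves the basis of the impact factor, which is obtained by dividing the citations received by a journal's papers by the number of papers it published. Yet citation counts vary enormously even among papers published in the same journal, so an average is poorly suited to identifying the most important papers. In addition, the global and local citation scores of the same article can differ markedly. Since our focus is gut microbiota research, papers with a higher local citation score are in principle more important.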
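
A toy calculation illustrates why a journal-level average says little about individual papers. The citation counts below are made up, and the plain mean is only an analogy for how an impact factor summarizes a journal.

```{r eval=FALSE}
# Toy citation counts (made up) for ten papers published in one journal
citations <- c(0, 1, 1, 2, 3, 3, 5, 8, 27, 450)

# Journal-level average: total citations divided by the number of papers
mean(citations)

# The median and the share of papers below the mean show how skewed the counts are
median(citations)
mean(citations < mean(citations))
```

Here a single heavily cited paper lifts the average to 50, while nine of the ten papers fall below it.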
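
Among the highly cited papers, those with an impact factor below 3 are the fewest, while the 3-5, 5-10, 10-20, and above-20 groups contain roughly equal numbers of papers. About 50% of the highly cited papers have an impact factor above 10, and the other 50% are below 10 (Figure \@ref(fig:HC-article-group-by-IF)).
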
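```{r HC-article-group-by-IF, fig.cap="Impact factor distribution of WoS highly cited papers"}
# Share of each impact factor group among the highly cited papers, per year
df <- M %>% filter(HC == TRUE) %>% group_by(PY, group) %>%
  summarise(n = n()) %>%
  group_by(PY) %>%
  mutate(proportion = n / sum(n))
plot_by_IF(df)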
```
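
The `group` column used above is not defined in this appendix. One way such impact factor groups could be derived is with `cut()`, as in the sketch below. The bin edges follow the groups named in the text, but whether a boundary value such as 3 or 20 belongs to the lower or the upper group is an assumption here, and the data are made up.

```{r eval=FALSE}
# Toy data (made up): publication year and journal impact factor
toy <- data.frame(
  PY = c(2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019),
  impact_factor = c(2.1, 4.0, 12.5, 41.8, 3.0, 7.7, 9.9, 20.0)
)

# Left-closed bins (an assumption): [0,3), [3,5), [5,10), [10,20), [20,Inf)
toy$group <- cut(toy$impact_factor,
                 breaks = c(0, 3, 5, 10, 20, Inf),
                 labels = c("<3", "3-5", "5-10", "10-20", ">=20"),
                 right = FALSE)

# Share of each bin within a publication year (each row sums to 1)
counts <- table(toy$PY, toy$group)
proportion <- prop.table(counts, margin = 1)
round(proportion, 2)
```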
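
The core papers selected by LCS show a similar distribution across impact factor groups: papers with an impact factor below 3 are again the fewest, and the 3-5, 5-10, 10-20, and above-20 groups contain roughly equal numbers (Figure \@ref(fig:LCS-core-article-group-by-IF)).
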
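```{r LCS-core-article-group-by-IF, fig.cap="Impact factor distribution of LCS core papers"}
# Share of each impact factor group among the LCS core papers, per year
df <- M %>% filter(CORE == TRUE) %>%
  group_by(PY, group) %>%
  summarise(n = n()) %>%
  group_by(PY) %>%
  mutate(proportion = n / sum(n))
plot_by_IF(df)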
```
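
In Section \@ref(core-article) we listed the LCS core papers and the WoS highly cited papers. We now look at the impact factor distribution of these groups.

Figure \@ref(fig:core-vs-HC-boxplot) shows that the papers specific to the WoS highly cited set have lower quartiles and a lower median impact factor than the papers specific to the LCS core set. From the impact factor perspective, then, the LCS core papers overall have somewhat higher impact factors than the WoS highly cited papers.
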
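```{r core-vs-HC-boxplot, fig.cap="Differences in impact factor distribution between LCS core papers and WoS highly cited papers",fig.align="center",fig.width=6}
article_lists <- list("LCS-specific" = MC_specific_article,
                      "HighlyCited-specific" = HC_specific_article,
                      "LCS-and-HighlyCited" = MC_HC_article)
group_names <- names(article_lists)
# One data frame per group, holding the group label and the impact factors
l <- lapply(seq_along(article_lists), function(i) {
  data.frame(group = group_names[[i]], IF = article_lists[[i]]$impact_factor)
})
df <- do.call("rbind", l)
# Limit the visible y range to 0-50 without dropping data from the boxplot statistics
p <- ggplot(df, aes(group, IF)) + geom_boxplot() + coord_cartesian(ylim = c(0, 50))
ggplotly(p)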
```