90-appendix.Rmd

\cleardoublepage 

# (APPENDIX) 附录 {-}


# 数据来源 {#data-source}


## 文献检索方式

使用“肠道”和“菌群”两个主题词在 WOS 数据库中搜索 2000 年以来的文献数据，共检得56445条记录（2020-1-3)。

其中，包括 2135 条高被引论文（Highly Cited in Field）。

- 数据库
  
  Web of Science Core Collection;
  
  包括：SCI-EXPANDED, SSCI, CPCI-S, CPCI-SSH, CCR-EXPANDED, IC.

- 检索年限
  
  从 2000 年至 2020 年；

- 具体使用的检索式是：

  "TS=(gut OR intestine OR bowel OR intestinal OR colon OR colorectal OR gastrointestine OR gastrointestinal) AND TS=(microbiome OR microbiota OR flora OR microbe OR microbes OR commensal OR symbiont OR pathobiont OR mycobiome OR virome OR metagenome OR meta-genome)"。


**Web of Science 关于高被引论文的定义**

> As of September/October 2019, this highly cited paper received enough citations to place it in the top 1% of the academic field of Agricultural Sciences based on a highly cited threshold for the field and publication year.

## 下载“肠道菌群”研究完整数据

下载完整数据（Full Record and Cited References），保存为文本文件。

Web of Science 数据导出每次限制 500 条数据，需要分多次保存，全部下载完毕后，可以将所有文件合并到一起。

```{sh eval=F}
cat export.*.txt > all.text
```


# 文献筛选 {#core-article}

## 如何确定核心文献？

在最初阶段我们的分析依赖于全部的5万多篇文献，但是我们很快发现这个数据量在进行较复杂的分析时，一方面会导致计算量比较大，另一方面也没有必要完整覆盖。因此寻找一个方法确定核心文献十分必要。高被引论文虽然是一个备选，但是高被引论文针对的是大的学科领域，可能并不能凸显论文对道菌群研究的重要性。且我们也可以发现2009年前是没有高被引论文在这5万多篇文献中存在的，因此对于一些特别经典的菌群研究论文有所遗漏。因此我们使用**本地被引频次**这个参数对文献做了一定的筛选，即将每年本地被引频次排在前 5% 的文献作为关键文献，进行了更为充分的分析。

我们看一下本地被引频次的含义。
**本地被引频次**，即 Local Citation Score（LCS），它表示这篇文章在当前数据集中被引用的次数。
**全局引用次数**，即 Global Citation Score （GCS），它表示这篇文章被整个WOS数据库中所有文献引用的次数，也就是在 Web of Science 网站上看到的引用次数。
一篇文章GCS很高，说明被全球科学家关注较多。但是如果一篇GCS很高，而LCS很小，说明这种关注主要来自与你不是同一领域的科学家。此时，这篇文献对你的参考意义可能不大。如果 LCS 很高，则说明这篇文献与你数据集中关注的领域十分相关。所以，使用 LCS 可以快速定位一个领域的经典文献。

值得一提的是，我们按照 LCS 前 5% 筛选后得到的文献共有 `r nrow(MC)` 篇，其中覆盖了约三分之二的WoS高被引论文，比WoS高被引论文少了700多篇。不过，另外新增了约800多篇新文献（图 \@ref(fig:HC-vs-MC)）。


图 \@ref(fig:HC-vs-MC-year) 显示这些论文的差别主要在哪些年份。

```{r HC-vs-MC-year, fig.cap="高被引和核心论文的差别"}

HC_specific <- setdiff(highly_cited$SR,MC$SR)
MC_specific <- setdiff(MC$SR,highly_cited$SR)

df1 <- M %>% filter(SR  %in% HC_specific ) %>% group_by(PY,DT) %>% summarise(count=-n(),type="缺少")
df2 <- M %>% filter(SR  %in% MC_specific ) %>% group_by(PY,DT) %>% summarise(count=n(),type="新增")

p <- rbind(df1,df2) %>% ggplot(aes(PY,count,fill=DT,color=type,text=paste0(type,DT,abs(count),"篇"))) + 
  geom_col() + 
  coord_flip() +
  scale_color_brewer(palette = "Set1") +
  labs(x="",y="") + 
  guides(fill=guide_legend(title = "文献类型",byrow = TRUE,ncol = 2),
         color="none") + 
  theme(legend.position = c(.8,.2)) 
plot.ly(p) %>%
  layout(showlegend=FALSE)
```

### 文献清单

```{r}
MC_specific_article <- M %>%
  filter(CORE==TRUE, HC==FALSE) %>%
  mutate(title=str_to_title(TI),
         source=paste0("<a href=\"https://doi.org/",DI,"\">原文</a>")) %>%
  tibble::column_to_rownames("SR")
  
HC_specific_article <- M %>%
  filter(CORE==FALSE, HC==TRUE) %>%
  mutate(title=str_to_title(TI),
         source=paste0("<a href=\"https://doi.org/",DI,"\">原文</a>")) %>%
  tibble::column_to_rownames("SR")

MC_HC_article <- M %>%
  filter(CORE==TRUE, HC==TRUE) %>%
  mutate(title=str_to_title(TI),
         source=paste0("<a href=\"https://doi.org/",DI,"\">原文</a>")) %>%
  tibble::column_to_rownames("SR")
```

针对于图 \@ref(fig:HC-vs-MC) 中的差异，我们分别统计了仅在LCS核心论文中出现和仅在高被引论文中出现的文章列表。

下表显示仅在LCS核心论文中出现的文章（图 \@ref(fig:HC-specific-article)）[^data-table]。

[^data-table]: 这幅图中显示的表格支持交互操作，可以点击标题排序、在搜索框搜索和按不同的行过滤，也可以点击按钮导出数据为 Excel 文件。

```{r eval=F}
# Lists elements are written to individual worksheets, using list names as sheet names if available
list <- list("LCS-specific"=MC_specific_article,
          "HighlyCited-specific"=HC_specific_article,
          "LCS-and-HighlyCited" = MC_HC_article)
l <- lapply(list, function(x){
            x[,c("title","PY","DT","ID","impact_factor","LCS","TC","source")]
          })
names(l) <- names(list)
openxlsx::write.xlsx(l,file = "MC-specific-articles.xlsx",asTable=TRUE,rowNames=TRUE)

```


```{r MC-specific-article, fig.cap="仅在LCS核心论文中出现的文章"}
library(DT)
datatable(MC_specific_article[,c("PY","impact_factor","LCS","TC","title","source")],
          colnames = c("年份","IF","LCS","GCS","标题","原文"),
          escape = FALSE,
          filter = "top",
          caption = "",
          extensions = c("Buttons"),
          width = "95%",
          options=list(dom = 'Bfrtip',
                       pageLength = 20,
                       buttons=list(
                         'pageLength',
                         list(extend='copy'),
                         list(extend="excel",
                              filename="export",
                              header=TRUE,
                              title="")
                         ),
                       columnDefs=list(
                         list(width="35%",targets=5)
                       ),
                       lengthMenu=list(c(20,50,100,-1),
                                       c("20","50","100","All"))))
```

下面的表格列出了仅在WoS高被引论文中出现的文章（图 \@ref(fig:HC-specific-article)）。

```{r HC-specific-article,fig.cap="仅在WoS高被引论文中出现的文章"}
datatable(HC_specific_article[,c("PY","impact_factor","LCS","TC","title","source")],
          colnames = c("年份","IF","LCS","GCS","标题","原文"),
          escape = FALSE,
          filter = "top",
          caption = "",
          extensions = c("Buttons"),
          width = "95%",
          options=list(dom = 'Bfrtip',
                       pageLength = 20,
                       buttons=list(
                         'pageLength',
                         list(extend='copy'),
                         list(extend="excel",
                              filename="export",
                              header=TRUE,
                              title="")
                         ),
                       columnDefs=list(
                         list(width="35%",targets=5)
                       ),
                       lengthMenu=list(c(20,50,100,-1),
                                       c("20","50","100","All"))))
```

下面的表格列出了既为WoS高被引又为LCS核心论文的文章（图 \@ref(fig:MC-HC-article)）。


```{r MC-HC-article,fig.cap="既为WoS高被引又为LCS核心论文的文章"}
datatable(MC_HC_article[,c("PY","impact_factor","LCS","TC","title","source")],
          colnames = c("年份","IF","LCS","GCS","标题","原文"),
          escape = FALSE,
          filter = "top",
          caption = "",
          extensions = c("Buttons"),
          width = "95%",
          options=list(dom = 'Bfrtip',
                       pageLength = 20,
                       buttons=list(
                         'pageLength',
                         list(extend='copy'),
                         list(extend="excel",
                              filename="export",
                              header=TRUE,
                              title="")
                         ),
                       columnDefs=list(
                         list(width="35%",targets=5)
                       ),
                       lengthMenu=list(c(20,50,100,-1),
                                       c("20","50","100","All"))))
```


### 近两年的研究进展


特别关注近两年的最新研究进展。
下面是依据LCS筛选得到的最近两年研究论文（图 \@ref(fig:recent-LCS-research-article)）。


```{r recent-LCS-research-article,fig.cap="近两年的重要研究论文（LCS高被引）"}
# 导出这两年的重要研究/观点论文（高被引+日报收录的20分以上的肠道菌群研究），
# 辛苦你周一列个表给我。完整的关键词变化列表+每年高频关键词（前50）
# 下周也优先做一下吧，多谢~
db_cache <- "E:/Spring_Work/Corporate_Bussiness/C40_Data/cache/"
papers <- readRDS(paste0(db_cache,"papers.RDS")) %>% as_tibble()
fragments <- readRDS(paste0(db_cache,"fragments.RDS")) %>% as_tibble()

real_paper <- papers %>% 
  filter(status==3 & classify=="audit" & journal_id=="1") %>%
  select(uuid,title,share_title,summary,remark,fragment_id) %>% 
  left_join(fragments %>% rename("fragment_id"=id) %>% select(-uuid,-title,-remark))

mc <- real_paper %>% mutate(DI=toupper(doi)) %>% select(DI,title,summary,uuid,share_title)
recent_research_article <- MC %>% 
  left_join(mc) %>%
  filter(PY>=2018,DT != "REVIEW") %>%
  mutate(DT = as_factor(DT),
         source=ifelse(str_detect(DI,"10"),
                       mrgut_permanent_link(base_url = "https://doi.org/",
                                     type = "html",
                                     uuid = DI,
                                     title = J9,
                                     alt = ""),
                       ""),
         daily=ifelse(is.na(uuid),
                      "",
                      mrgut_permanent_link(type="html",
                                    uuid = uuid,
                                    title = title,
                                    alt = share_title)
                      ),
         Title=str_to_title(TI)) %>%
  select(Title,PY,DT,impact_factor,source,daily,TC,LCS)


DT::datatable(recent_research_article, 
          escape = FALSE,
          rownames = FALSE,
          filter = "top",
          width = "95%",
          caption = "",
          extensions = c("Buttons"),
          options=list(dom = 'Bfrtip',
                       pageLength = 20,
                       buttons=list(
                         'pageLength',
                         list(extend='copy'),
                         list(extend="excel",
                              filename="export",
                              header=TRUE,
                              title="")
                         ),
                       columnDefs=list(
                         list(width="35%",targets=0),
                         list(width="35%", targets=5)
                       ),
                       lengthMenu=list(c(20,50,100,-1),
                                       c("20","50","100","All"))))

```

下面是《热心肠日报》中收录的最近两年发表的影响因子大于20的研究论文（图 \@ref(fig:high-IF-research-article-in-daily)）。


```{r high-IF-research-article-in-daily, fig.cap="日报中收录的影响因子＞20的高水平研究论文"}
high_IF_daily_research <- real_paper %>%
  mutate(DI=toupper(doi),
         year=year(time)) %>% 
  select(DI,title,year,type,periodical,summary,uuid,share_title) %>%
  left_join(MC) %>%
  filter(type %in% c("Article","Communication","Letter","Perspective"),
         impact_factor >= 20,
         year >= 2018) %>%
  mutate(DT = as_factor(DT),
         source=ifelse(str_detect(DI,"10"),
                       mrgut_permanent_link(base_url = "https://doi.org/",
                                     type = "html",
                                     uuid = DI,
                                     title = periodical,
                                     alt = ""),
                       ""),
         daily=ifelse(is.na(uuid),
                      "",
                      mrgut_permanent_link(type="html",
                                    uuid = uuid,
                                    title = title,
                                    alt = share_title)
                      ),
         Title=str_to_title(title)) %>%
  select(Title,year,DT,impact_factor,source,daily,TC,LCS)

DT::datatable(high_IF_daily_research, 
          escape = FALSE,
          rownames = FALSE,
          filter = "top",
          width = "95%",
          caption = "",
          extensions = c("Buttons"),
          options=list(dom = 'Bfrtip',
                       pageLength = 20,
                       buttons=list(
                         'pageLength',
                         list(extend='copy'),
                         list(extend="excel",
                              filename="export",
                              header=TRUE,
                              title="")
                         ),
                       columnDefs=list(
                         list(width="35%",targets=0),
                         list(width="35%", targets=5)
                       ),
                       lengthMenu=list(c(20,50,100,-1),
                                       c("20","50","100","All"))))
```

### 重点论文

重要文章，综合考虑引用和影响因子，即LCS高被引文章+当年高分文章（按IF排序取前5%）

1. 论文列表（区分文章类型）
2. 按关键词聚类：高频关键词列表（前50），以及每个关键词对应的文章列表
  ① 仅分析研究论文
  ② 仅分析综述
  ③ 全部文章类型
3. 按关键词共现聚类：高频共现关键词列表（取前50），以及每组共现关键词对应的文章列表
  ① 仅分析研究论文
  ② 仅分析综述
  ③ 全部文章类型

```{r fig.cap="近三年的重点论文"}

recent_research_article_2019 <- M %>% 
  filter(PY>=2017,percent_rank(impact_factor)>=0.95 | CORE==TRUE) %>%
  left_join(mc) %>%
  mutate(DT = as_factor(DT),
         source=ifelse(str_detect(DI,"10"),
                       mrgut_permanent_link(base_url = "https://doi.org/",
                                     type = "html",
                                     uuid = DI,
                                     title = J9,
                                     alt = ""),
                       ""),
         daily=ifelse(is.na(uuid),
                      "",
                      mrgut_permanent_link(type="html",
                                    uuid = uuid,
                                    title = title,
                                    alt = share_title)
                      ),
         Title=str_to_title(TI))


DT::datatable(recent_research_article_2019 %>%
  select(Title,DT,PY,impact_factor,source,daily,TC,LCS,ID) %>%
    mutate(ID = str_to_title(ID),
           PY = as_factor(PY)), 
          escape = FALSE,
          rownames = FALSE,
          filter = "top",
          width = "95%",
          caption = "",
          extensions = c("Buttons"),
          options=list(dom = 'Bfrtip',
                       pageLength = 10,
                       buttons=list(
                         'pageLength',
                         list(extend='copy'),
                         list(extend="excel",
                              header=TRUE,
                              title="")
                         ),
                       columnDefs=list(
                         list(width="25%",targets=0),
                         list(width="6em",targets=2),
                         list(width="25%", targets=5),
                         list(width="15%",targets=8)
                       ),
                       lengthMenu=list(c(10,20,50,100,-1),
                                       c("10","20","50","100","All"))))
```

```{r keyword-tf-idf-2019, fig.cap="对2019年的关键词进行挖掘"}
tableTag <- function (M, Tag = "CR", sep = ";") 
{
    if (Tag %in% c("AB", "TI")) {
        M = termExtraction(M, Field = Tag, stemming = F, verbose = FALSE)
        i = dim(M)[2]
    }
    else {
        i <- which(names(M) == Tag)
    }
    if (Tag == "C1") {
        M$C1 = gsub("\\[.+?]", "", M$C1)
    }
    Tab <- unlist(strsplit(as.character(M[, i]), sep))
    Tab <- trimws(trimES(Tab))
    Tab <- Tab[Tab != ""]
    Tab <- sort(table(Tab), decreasing = TRUE)
    return(Tab)
}

ID <- tableTag(recent_research_article_2019,Tag ="ID") %>%
  data.frame() %>%
  filter(!str_detect(Tab,search_keyword_regex)) %>%
  mutate(Keyword=as_factor(str_to_title(Tab))) %>%
  ungroup() %>%
  select(Keyword,Freq)
```


```{r eval=F}
mat <- biblioNetwork(recent_research_article_2019, analysis = "co-occurrences", network = "keywords")
net <- networkPlot(NetMatrix = mat, 
                     # normalize = "jaccard", 
                     weighted = TRUE,
                     n=100, 
                     Title = year, 
                     verbose = FALSE)

g <- net$graph_pajek

# 简化图
# 删掉常用词
g <- delete.edges(g, E(g)[edge_attr(g)$weight > 2])
g <- delete.vertices(g, V(g)[str_detect(vertex.attributes(g)$id, search_keyword_regex)])

# 聚类
cluster <- cluster_fast_greedy(g)
vertex_attr(g)$group <- membership(cluster)

# size by deg
vertex_attr(g)$size <- vertex_attr(g)$deg/15

# 可视化
data <- toVisNetworkData(g)

nodes <- data$nodes %>%
  mutate(size=log(deg))
edges <- data$edges 

visNetwork(nodes,edges,width = 1000,height = 800) %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE,selectedBy = "group") %>%
  visNodes(size = "size") %>%
  visIgraphLayout()
```

```{r eval=FALSE}
ID <- ID %>% mutate(id=tolower(Keyword)) %>%
  left_join(nodes) %>%
  select(Keyword,Freq,deg,group) %>%
  mutate(group=as_factor(group))


DT::datatable(ID, 
          rownames = FALSE,
          filter = "top",
          width = "95%",
          extensions = c("Buttons"),
          options=list(dom = 'Bfrtip',
                       pageLength = 20,
                       buttons=list(
                         'pageLength',
                         list(extend='copy'),
                         list(extend="excel",
                              header=TRUE,
                              title="")
                         ),
                       columnDefs=list(
                         list(width="75%",targets=0)
                       ),
                       lengthMenu=list(c(10,20,50,100,-1),
                                       c("10","20","50","100","All"))))

```


## 数据库资料信息不全的情形

国家信息不全的文章有 `r nrow(df)` 篇，大多属于信息采集不完整的情况（表\@ref(tab:AU-CO-NA)），并非软件本身存在的错误所致。数据采集不完整的情况暂时无法解决。


```{r}
# 国家信息错误的情况
M %>% filter(str_detect(AU_CO_NR,"\\bNA\\b")) %>% 
  select(AU,SO,DT,C1,AU_CO_NR)  %>% 
  data.frame() %>%
  tableTag(Tag = "DT") %>%   # tableTag() 函数针对有些行失效了？如果每个元素都没有分隔符，则会出现失效的情况。
  barplot()
```


```{r AU-CO-NA}
df <- M %>% filter(str_detect(AU_CO_NR,"\\bNA\\b")) %>% 
  select(SR,C1,AU_CO_NR) %>% 
  head(10)
kable(df,caption = "国家信息缺失的文献信息")
```


```{r}
# 使用自定义函数提取作者机构信息
AU_UN_wos <- function(C1,sep=";"){
  AFF <- trim(gsub("\\[.*?\\]","",C1))
  listAFF <- strsplit(AFF,sep,fixed = TRUE)
  AFFL <- lapply(listAFF,function(l){
    affL <- strsplit(l,",",fixed = TRUE)
    lapply(affL, function(x){
      return(trim(x[[1]]))
    })
  })
  AFF <- sapply(AFFL,function(x) paste0(x,collapse = sep))
  AFF <- gsub("\\&","AND",AFF)
  return(AFF)
}
```


`bibliometrix` 使用一个控制字段提取机构信息，一些机构名称不含下列控制字段，因此导致机构信息提取失败。为此，我们拟使用自定义函数改进提取机制。

```{r, echo=T,eval=F}
uTags=c("UNIV","COLL","SCH","INST","ACAD","ECOLE","CTR","SCI","CENTRE","CENTER","CENTRO","HOSP","ASSOC","COUNCIL",
          "FONDAZ","FOUNDAT","ISTIT","LAB","TECH","RES","CNR","ARCH","SCUOLA","PATENT OFF","CENT LIB","HEALTH","NATL",
          "LIBRAR","CLIN","FDN","OECD","FAC","WORLD BANK","POLITECN","INT MONETARY FUND","CLIMA","METEOR","OFFICE","ENVIR",
          "CONSORTIUM","OBSERVAT","AGRI", "MIT ", "INFN", "SUNY ")
```

改进后的机构信息提取机制更加准确。表 \@ref(tab:AU-UN-NA) 表示新旧方法得出不同机构信息字段的比较。`C1` 是作者信息字段，`AU_UN` 是 `bibliometrix` 软件提取的机构信息，`AU-UN2` 是改进提取方法后得出的机构信息。经过比较，可以发现新方法提取的信息更加准确和完整。

```{r AU-UN-NA}
set.seed(20200213)
a <- M %>% sample_n(100)
a <- metaTagExtraction(a,Field = "AU_UN")
a$AU_UN2 <- AU_UN_wos(a$C1)
a %>% filter(AU_UN != AU_UN2) %>%
  select(SR,AU_UN,AU_UN2,C1) %>%
  DT::datatable(caption = "不同方式获取机构信息的差异")
```

```{r}
M <- metaTagExtraction(M,Field = "AU_UN")
original <- tableTag(M, "AU_UN")
M$AU_UN2 <- AU_UN_wos(M$C1)
modified <- tableTag(M, "AU_UN2")
df <- data.frame(original) %>% left_join(data.frame(modified), by=c("Tab"="Tab"))
colnames(df) <- c("AFF","original","modified")
top_AFF <- df %>% group_by(AFF) %>% summarise(total=original+modified) %>% arrange(desc(total)) %>% head(1000) %>% pull(AFF)

df <- df %>% mutate(diff = modified - original, AFF = as_factor(AFF)) %>%
  pivot_longer(cols = c("original","modified"), names_to = "type", values_to = "count")
df_label <- df %>% filter(diff!=0) %>% filter(percent_rank(abs(diff))>0.99) %>% select(AFF, diff) %>%
  mutate(color=ifelse(diff >0, "red","blue")) %>% unique()
p <- df %>% filter(AFF %in% top_AFF) %>%
  ggplot(aes(AFF,count,fill=type,color=type,group=type)) + geom_point() + geom_line() + geom_area(alpha=1/3,position = "identity") +
  # geom_line(aes(AFF,diff,color=I(color)),inherit.aes = FALSE,data = df_label) +
  geom_point(aes(AFF,diff,color=I(color)),inherit.aes = FALSE,data = df_label) +
  ggrepel::geom_text_repel(aes(AFF, diff, label=AFF,color=I(color)),data = df_label %>% filter(AFF %in% top_AFF), inherit.aes = FALSE) +
  theme(axis.text.x = element_blank())
plotly::ggplotly(p)
```


# 使用影响因子评价 {#impact-factor}

在我们的分析中，并没有对影响因子这一指标进行过多涉及，而主要是基于引用次数来评价文章的重要性。相对于影响因子，引用次数更能反映文章的重要性。引用次数本身就是影响因子的基础，是期刊所有论文引用次数除以发文量之后得到的一个指标。然而，即便是同一本期刊上发表论文的引用次数事实上也存在巨大差异，用一个平均指标不利于发现那些最重要的文献。与此同时，同一篇文章的全局被引频次和本地被引频次也会存在明显差异，考虑到我们主要立足点是“肠道菌群”研究，那些本地被引频次更高的文献理论上更加重要。

而在高被引论文中，影响因子小于3的论文数量最少，同时3-5,5-10,10-20之间和20以上的论文数目大体相当。大体上有50%的高被引论文影响因子在10分以上，另外有50%的高被引论文在10分以下（图 \@ref(fig:HC-article-group-by-IF)）。

```{r HC-article-group-by-IF, fig.cap="WoS高被引论文影响因子的分布情况"}
df <- M %>% filter(HC==TRUE)  %>% group_by(PY,group) %>% 
  summarise(n=n()) %>%
  group_by(PY) %>%
  mutate(proportion=n/sum(n))


plot_by_IF(df)

```

基于 LCS 得到的核心文献中，与高被引论文的影响因子分组情况较为相似，同样可以发现影响因子小于3的论文同样最少，同时3-5,5-10,10-20之间和20以上的论文数目大体相当（图 \@ref(fig:LCS-core-article-group-by-IF)）。

```{r LCS-core-article-group-by-IF, fig.cap="LCS核心论文影响因子的分布情况"}
df <- M %>% filter(CORE==TRUE) %>% 
  group_by(PY,group) %>% 
  summarise(n=n()) %>%
  group_by(PY) %>%
  mutate(proportion=n/sum(n))


plot_by_IF(df)
```

在前面 \@ref(core-article) 我们曾经列出了 LCS 核心文献和 WoS 高被引论文的清单，现在我们再来看一下这些不同类型文献的影响因子分布情况。
如图 \@ref(fig:core-vs-HC-boxplot) 中，可以发现 WoS 高被引特有的文献较 LCS 核心特有的文献具有更小的四分位数和中位数影响因子。这说明如果从影响因子的角度考虑，总体上 LCS 核心文献的影响因子较 WoS 高被引论文还更大一些。

```{r core-vs-HC-boxplot, fig.cap="LCS核心文献和WoS高被引文献的影响因子分布差异",fig.align="center",fig.width=6}
list <- list("LCS-specific"=MC_specific_article,
          "HighlyCited-specific"=HC_specific_article,
          "LCS-and-HighlyCited" = MC_HC_article)
names <- names(list)
l <- lapply(seq_along(list),function(i){
  data.frame(group=names[[i]], IF=list[[i]]$impact_factor)
})
df <- do.call("rbind",l)

p <- ggplot(df,aes(group,IF)) + geom_boxplot() + coord_cartesian(ylim = c(0,50))
ggplotly(p)
```