---
title: "Data exploration and filling NA values"
author: "liuc"
date: '2022-05-05'
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Data Explore

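Exploring how the data and its missing values are distributed is a routine step, and a good R package speeds it up. These are notes on exploratory operations with a few such packages.

Sometimes the reason values are missing is simple to state, for example incomplete records or various accidents. Establishing that explanation may still be hard, and can take more time than developing the exploratory analysis and the models themselves.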


```{r}
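library(dplyr)   # %>%, group_by() and summarise_at() are used below
library(ggplot2) # ggplot() and the geoms are used below
library(dlookr) # data exploration
# library(SmartEDA)
library(DataExplorer) # has not been updated in a long time
# library(explore)
library(visdat)
library(naniar) # summaries and plots of NA values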
```


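### `naniar` visualizes NA values and provides functions that summarize how many there are and where they occur

It is still only a visualization tool, but its ability to show the distribution of other variables split by whether a value is missing is very practical. Modeling the missingness itself is another way to analyze NA values, although it is puzzling that the package emits so many warnings that have not been fixed.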

```{r}
# Both calls give the same result; the two packages share an author
naniar::vis_miss(airquality)

visdat::vis_miss(airquality)
```

A handy feature of `naniar` is that its ggplot2 geom does not drop NA values when plotting:

```{r}
ggplot(airquality, 
       aes(x = Solar.R, 
           y = Ozone)) + 
  naniar::geom_miss_point()
```

Show the number of missing values per variable:

```{r}
gg_miss_var(airquality)
```
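
The counts behind this plot can also be checked without any plotting package. The following is a minimal base-R sketch; the object names `na_count`, `na_pct` and `na_summary` are illustrative and not part of `naniar`.

```{r}
# Count and percentage of missing values per column, base R only
na_count <- colSums(is.na(airquality))
na_pct <- round(100 * colMeans(is.na(airquality)), 1)

# Sort so that the variables with the most missing values come first
na_summary <- data.frame(n_miss = na_count, pct_miss = na_pct)
na_summary[order(-na_summary$n_miss), ]
```

In `airquality` only `Ozone` and `Solar.R` contain missing values, so the remaining rows of the table are all zero.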

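The NA shadow that `naniar` provides is a good data source for visualization when the data set is not large. It also combines with imputation results from the `simputation` package: the two point colours in the plot separate imputed values from observed ones.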

```{r}
airquality %>% 
  naniar::nabular() %>% 
  as.data.frame() %>% 
  simputation::impute_lm(Ozone ~ Temp + Wind) %>% 
  ggplot(aes(x = Temp,
             y = Ozone,
             colour = Ozone_NA)) + 
  geom_point()
```
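
A shadow column is just a factor recording whether the original value was missing. As a rough illustration, one such column can be built by hand in base R. The names `aq_shadow` and the single `Ozone_NA` column are assumptions made for this sketch; `nabular()` creates a shadow column for every variable.

```{r}
# Build a one-column "shadow" by hand: a factor flagging whether Ozone is missing
aq_shadow <- airquality
aq_shadow$Ozone_NA <- factor(ifelse(is.na(aq_shadow$Ozone), "NA", "!NA"),
                             levels = c("!NA", "NA"))

# Compare another variable between the observed and the missing group
aggregate(Temp ~ Ozone_NA, data = aq_shadow, FUN = mean)
```

Because the flag is created before imputation, it still identifies the originally missing rows after the `Ozone` column has been filled.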

*Modeling NA values:*

```{r}
library(rpart)
library(rpart.plot)

airquality %>%
  add_prop_miss() %>%
  rpart(prop_miss_all ~ ., data = .) %>%
  prp(type = 4, extra = 101, prefix = "Prop. Miss = ")
```
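
A tree is one option for modeling missingness. A plain logistic regression on a missingness indicator is another. The sketch below is not from `naniar`; the choice of predictors and the name `miss_fit` are assumptions for illustration only.

```{r}
# Logistic regression: does the chance that Ozone is missing depend on other variables?
miss_fit <- glm(is.na(Ozone) ~ Temp + Wind + factor(Month),
                family = binomial, data = airquality)

# Coefficients on the log-odds scale; a clearly non-zero coefficient suggests
# that missingness is related to that variable (evidence against MCAR)
summary(miss_fit)$coefficients
```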


### `dlookr` is a package that helps with data exploration.

It provides convenient functions for both data diagnosis and data exploration.

```{r}
library(nycflights13)


dlookr::diagnose(flights)

# Besides the various diagnose commands, the outlier diagnostics are quite useful
dlookr::diagnose_outlier(flights)


dlookr::describe(flights)

```

```{r}
dlookr::normality(flights, dep_time, arr_time)

dlookr::plot_normality(flights, dep_time, arr_time)
```


Analyzing NA values with dlookr:

- impute:
    - predictor is a numerical variable
        - "mean": arithmetic mean
        - "median": median
        - "mode": mode
        - "knn": K-nearest neighbors
            - target variable must be specified
        - "rpart": Recursive Partitioning and Regression Trees
            - target variable must be specified
        - "mice": Multivariate Imputation by Chained Equations
            - target variable must be specified
            - random seed must be set
    - predictor is a categorical variable
        - "mode": mode
        - "rpart": Recursive Partitioning and Regression Trees
            - target variable must be specified
        - "mice": Multivariate Imputation by Chained Equations
            - target variable must be specified
            - random seed must be set
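
To make the three simplest options concrete, here is a hand-rolled base-R sketch of mean, median and mode imputation for a single numeric vector. The helper `impute_simple()` is hypothetical and is not a `dlookr` function.

```{r}
# Hand-rolled mean / median / mode imputation for one numeric vector
impute_simple <- function(x, method = c("mean", "median", "mode")) {
  method <- match.arg(method)
  obs <- x[!is.na(x)]
  fill <- switch(method,
    mean   = mean(obs),
    median = median(obs),
    mode   = as.numeric(names(which.max(table(obs))))  # most frequent observed value
  )
  x[is.na(x)] <- fill
  x
}

ozone_mean <- impute_simple(airquality$Ozone, "mean")
ozone_median <- impute_simple(airquality$Ozone, "median")

# Filling every gap with one value shrinks the spread of the variable
c(observed_sd = sd(airquality$Ozone, na.rm = TRUE), imputed_sd = sd(ozone_mean))
```

The comparison in the last line shows why single-value imputation is usually a last resort: the mean is preserved, but the variance is understated.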

```{r}
dep_time <- dlookr::imputate_na(flights,
                                xvar = dep_time,
                                method = 'mean'
                                )

# Outliers can be imputed in a similar way (see dlookr::imputate_outlier)


find_skewness(flights)


```


```{r}
gg_miss_var(airquality)


# Summarize Solar.R separately for rows where Ozone is observed and where it is missing
airquality %>%
  bind_shadow() %>%
  group_by(Ozone_NA) %>%
  summarise_at(
    .vars = "Solar.R",
    .funs = c("mean", "sd", "var", "min", "max"),
    na.rm = TRUE
  )

airquality %>%
  naniar::nabular() %>%
  ggplot(
    .,
    aes(
      x = Temp,
      colour = Ozone_NA
    )
  ) +
  geom_density()
```


When exploring the NA values in a data set, the main task is to understand why they arose.

```{r}

```


## fill NA

> https://data.library.virginia.edu/getting-started-with-multiple-imputation-in-r/

### Missing Data Assumptions

Rubin (1976) classified missing data into three categories: MCAR, MAR, and MNAR.

MCAR: Missing Completely at Random. The missingness of data points is random, meaning that the pattern of missing values is uncorrelated with the structure of the data. An example would be a random sample taken from the population: data on some people will be missing, but it will be at random since everyone had the same chance of being included in the sample.

MAR: Missing at Random. The missingness is not completely random, but the propensity to be missing depends on the observed data, not on the missing values themselves. An example would be a survey respondent who declines to answer a question on income because they believe in the privacy of personal information. In this case, the missing value for income can be predicted from the answers to the personal information question.

MNAR: Missing Not at Random. The missingness is not random; it correlates with unobservable characteristics unknown to the researcher. An example would be social desirability bias in surveys, where respondents with certain characteristics we can’t observe systematically shy away from answering questions on racial issues.

All multiple imputation techniques start with the MAR assumption. While MCAR is desirable, in general it is unrealistic for the data. Thus, researchers assume that missing values can be replaced by predictions derived from the observed portion of the data set. This is a fundamental assumption to make, otherwise we wouldn’t be able to predict plausible values of missing data points from the observed data.
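
The three mechanisms are easier to tell apart in a small simulation. The sketch below uses made-up variables (`age`, `income`) and arbitrary missingness probabilities, all of which are assumptions for illustration and not part of the cited article.

```{r}
set.seed(42)
n <- 5000
age <- rnorm(n, mean = 45, sd = 12)
income <- 20 + 0.8 * age + rnorm(n, sd = 10)  # simulated, fully observed "truth"

# MCAR: every value has the same 30% chance of being missing
mcar <- ifelse(runif(n) < 0.3, NA, income)
# MAR: missingness depends only on the observed variable age
mar <- ifelse(runif(n) < plogis((age - 45) / 6), NA, income)
# MNAR: missingness depends on the unobserved income value itself
mnar <- ifelse(runif(n) < plogis((income - 56) / 8), NA, income)

# Mean of the observed values under each mechanism, next to the full-data mean
round(c(full = mean(income),
        mcar = mean(mcar, na.rm = TRUE),
        mar  = mean(mar, na.rm = TRUE),
        mnar = mean(mnar, na.rm = TRUE)), 1)
```

Under MCAR the observed mean stays close to the full-data mean. Under MAR and MNAR it is biased, but only in the MAR case can the bias be corrected using the observed `age`.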

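Imputing NA values: choose a method that suits the data set.

1. For variables governed by known logic, fill the values according to the specific situation.
2. When very few values are missing and you know the data set reasonably well, simple imputation is acceptable. In general, though, simple imputation is avoided because it introduces many problems.
3. For smaller data sets, say fewer than 500 to 1000 observations, kNN imputation is an option; for somewhat larger data sets, random forest imputation can be used.
4. When the question involves statistical inference, use multiple imputation, which seems to be applicable in nearly every case.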


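`Hmisc` provides several simple imputation methods.
The notes below cover the `mice` package for multiple imputation. When you believe the data are MCAR or MAR and the missing-data problem is complex, multiple imputation is a very practical approach.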


```{r, include=FALSE}
library(mde)
library(mice)
library(ggmice)
# library(miceadds)
library(VIM) # visualization of missing and imputed values
library(simputation) # simplifies the imputation workflow: one uniform syntax, many common imputation methods, works with the %>% pipe

```


### VIM

Use the VIM package for plotting.

```{r}
data(sleep, package="VIM")


mice::md.pattern(sleep)
```
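Interpretation: a 0 means the variable is missing in that pattern, and a 1 means it is observed. The first row describes the pattern with no missing values (all entries are 1). The second row describes the pattern with no missing values except for Span. The first column gives the number of cases in each missing-data pattern, and the last column gives the number of variables with missing values in each pattern. Here, 42 cases have no missing values and only 2 cases are missing Span alone. Nine cases are missing both NonD and Dream. The data set contains a total of (42×0)+(2×1)+…+(1×3)=38 missing values. The last row gives the number of missing values in each variable.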
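
The same kind of pattern table can be rebuilt by hand, which makes the 0/1 coding explicit. This is a base-R sketch on the built-in `airquality` data rather than `sleep`, so that it runs without extra packages; the object names are illustrative.

```{r}
# Tabulate missingness patterns by hand (1 = observed, 0 = missing), as md.pattern() does
obs_flag <- ifelse(is.na(airquality), 0L, 1L)
pattern <- apply(obs_flag, 1, paste, collapse = "")
pattern_count <- sort(table(pattern), decreasing = TRUE)
pattern_count

# Total number of missing cells, and missing cells per variable
sum(obs_flag == 0)
colSums(obs_flag == 0)
```

Each name in `pattern_count` is one row of the pattern matrix, read left to right in column order, and each count is the number of cases that share that pattern.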


```{r}
VIM::aggr(sleep, prop=TRUE, numbers=TRUE)
```
```{r}
VIM::barMiss(sleep)
```
```{r}
VIM::matrixplot(sleep)
```
```{r}
marginplot(sleep[c("Gest","Dream")], pch=c(20),
col=c("darkgray", "red", "blue"))
```


### kNN & missForest

Notes on these two commonly used imputation methods.

```{r}
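# Keep the *_imp indicator columns (imp_var = TRUE) so marginplot() can mark imputed values
x_imputed <- VIM::kNN(sleep, imp_var = TRUE)

marginplot(x_imputed[, c("Gest", "Dream", "Gest_imp", "Dream_imp")], delimiter = "_imp")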
```
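
To show what kNN imputation does in principle, here is a deliberately simplified base-R sketch. It uses Euclidean distance on standardized, fully observed predictors and fills each gap with the median of the k nearest donors. These choices and the helper name `knn_impute()` are assumptions for illustration; `VIM::kNN` is more general and, as far as I understand, uses a Gower-type distance.

```{r}
# Simplified kNN imputation of one target column from fully observed predictors
knn_impute <- function(data, target, predictors, k = 5) {
  x <- scale(as.matrix(data[predictors]))   # standardize so no predictor dominates
  miss <- is.na(data[[target]])
  donors <- which(!miss)
  for (i in which(miss)) {
    # Euclidean distance from row i to every donor row
    d <- sqrt(colSums((t(x[donors, , drop = FALSE]) - x[i, ])^2))
    nearest <- donors[order(d)[seq_len(k)]]
    data[[target]][i] <- median(data[[target]][nearest])
  }
  data
}

aq_knn <- knn_impute(airquality, "Ozone", c("Temp", "Wind"), k = 5)
summary(aq_knn$Ozone)
```

Only originally observed rows act as donors here, so imputed values are never reused to impute other rows.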

```{r}
missForest::missForest(sleep)$ximp
```


### The `mice` package provides multiple imputation

See its documentation for usage details and the underlying concepts. The 8 vignettes are thorough.

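The imputation analysis workflow in mice:

    library(mice)
    imp <- mice(data, m)
    fit <- with(imp, analysis)
    pooled <- pool(fit)
    summary(pooled)

where:

- `data` is a matrix or data frame containing missing values.
- `imp` is a list object holding the m imputed data sets together with information on how the imputation was carried out. By default m is 5.
- `analysis` is an expression that specifies the statistical analysis to apply to each of the m imputed data sets. Options include `lm()` for linear regression models, `glm()` for generalized linear models, `gam()` for generalized additive models, and `MASS::glm.nb()` for negative binomial models. Inside the function's parentheses, the response variable goes on the left of `~` and the predictors, separated by `+`, on the right.
- `fit` is a list object containing the m separate analysis results.
- `pooled` is a list object containing the pooled (averaged) results of these m analyses.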
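
The pooling step combines the m results with Rubin's rules: the pooled estimate is the mean of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The base-R sketch below shows this for a single coefficient. The helper `pool_rubin()` and the input numbers are hypothetical, and `mice::pool()` additionally computes degrees of freedom and related diagnostics.

```{r}
# Pool m estimates of one coefficient with Rubin's rules
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  q_bar <- mean(estimates)          # pooled point estimate
  u_bar <- mean(variances)          # within-imputation variance
  b <- var(estimates)               # between-imputation variance
  total <- u_bar + (1 + 1 / m) * b  # total variance
  c(estimate = q_bar, se = sqrt(total))
}

# Hypothetical estimates and squared standard errors from m = 3 imputed data sets
pool_rubin(estimates = c(1.0, 2.0, 3.0), variances = c(0.5, 0.5, 0.5))
```

The pooled standard error is larger than any single within-imputation standard error because it also reflects the uncertainty caused by the missing data.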


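Multiple imputation applied to the sleep data set is shown below. With m = 1 the data are imputed only once, which should be consistent with the single imputation that `simputation` provides further down.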

```{r}
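# Multiple imputation
# ?mice describes the imputation methods it supports
imp <- mice(sleep, seed = 42, m = 5, method = "pmm",
            printFlag = FALSE
            )
# attributes(imp)

imp_data <- mice::complete(imp, action = 1L) # first completed data set; see ?complete

fit <- with(imp, lm(Dream ~ Span + Gest)) # run the analysis you want on each imputed data set
pooled <- pool(fit) # combine the m model fits into one pooled result

summary(pooled)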

```


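By default, all other variables are used to impute each variable that has NA values.
How can the predictor columns be chosen based on prior knowledge?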
```{r}
imp$predictorMatrix

ini <- mice(nhanes, maxit = 0, printFlag = FALSE)
pred <- ini$predictorMatrix
pred
pred[, "hyp"] <- 0 # do not use hyp as a predictor for any variable
imp <- mice(nhanes, predictorMatrix = pred, printFlag = FALSE)
```

```{r}
plot(imp)
```

Change the imputation method

This syntax is rather cumbersome.

```{r}
ini <- mice(nhanes2, maxit = 0)
meth <- ini$meth
meth

meth["bmi"] <- "norm"
meth

imp <- mice(nhanes2, meth = meth, print=F)
```


```{r}
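# D3() is a likelihood-ratio test between two nested models fitted on the imputed data
fit1 <- with(imp, lm(bmi ~ age + chl))
fit0 <- with(imp, lm(bmi ~ age))
mice::D3(fit1, fit0)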
```


```{r}
ggmice::ggmice(boys, aes(age, bmi)) +
  geom_point()

ggmice(imp, aes(age, bmi)) + geom_point()
```


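### `simputation` provides easy-to-use imputation methods; it is unclear whether it supports multiple imputation.

Ad hoc imputation methods

`simputation` streamlines the imputation workflow. The methods it covers:

- Model-based:
    - linear regression
    - robust linear regression
    - ridge / elastic net / lasso regression
    - CART models (decision trees)
    - random forest
- Multivariate imputation:
    - imputation based on the expectation-maximization (EM) algorithm
    - missForest
- Donor imputation (including various donor pool specifications):
    - k-nearest neighbours
    - sequential hot deck (LOCF, NOCB)
    - random hot deck
    - predictive mean matching
- Other:
    - median imputation
    - proxy imputation: copy the value of another column or use a simple transformation of it
    - apply trained models for imputation purposes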


*Functions provided:*

- `impute_rlm`: robust linear model
- `impute_en`: ridge / elastic net / lasso
- `impute_cart`: CART
- `impute_rf`: random forest
- `impute_rhd`: random hot deck
- `impute_shd`: sequential hot deck
- `impute_knn`: k nearest neighbours
- `impute_mf`: missForest
- `impute_em`: multivariate normal (EM)
- `impute_const`: impute a fixed value
- `impute_lm`: linear regression
- `impute_pmm`: predictive mean matching (a form of hot-deck imputation)
- `impute_median`: median imputation
- `impute_proxy`: impute with a user-defined formula, for example a mean

```{r}
# Impute Dream by linear regression on Span and Gest

dat1 <- simputation::impute_lm(sleep, 
                               Dream ~ Span + Gest)

```

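The 4th value of Dream is still NA here. This is because Span is also NA in that row, and linear regression cannot predict a value when one of its predictors is missing. Other methods can be used to impute such cases.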
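
The same behaviour can be reproduced with plain `lm()` and `predict()`. This base-R sketch uses `airquality` instead of `sleep` so that it needs no extra packages; the object names are illustrative.

```{r}
# Regression imputation by hand: predict Ozone from Solar.R where Ozone is missing
aq <- airquality
ozone_fit <- lm(Ozone ~ Solar.R, data = aq)
to_fill <- is.na(aq$Ozone)
aq$Ozone[to_fill] <- predict(ozone_fit, newdata = aq[to_fill, ])

# Rows where the predictor Solar.R is also missing cannot be predicted and stay NA
aq[is.na(aq$Ozone), c("Ozone", "Solar.R")]
```

Rows like these need a second pass with a method that does not rely on the missing predictor, for example a model on other columns or a donor-based method.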


Impute with a fixed value:
```{r}
impute_const(sleep, Dream ~ 7)
```