Ignore missing values #488
Comments
In Julia
julia> sum([1,2, missing])
missing
As with |
This isn't quite right. Even though, in Julia

Going further, I conducted a survey of Julia, R, Python, and Octave (because I don't have a Matlab installation) to better understand how each library treats missing values. I created a dataset in Julia of ages and wages for individuals. The

As you can see, in all languages and plotting libraries,
Note that this is true even in languages where

Also note that in all examples I show, I use the closest possible value to

Turning back to the earlier fact that

Below are my implementations:

Julia data generation
using CSV, DataFrames
N = 1000
γ = .05
df = DataFrame(age = rand(20:65, N))
df.wage = map(df.age) do a
w = a * γ + rand() * 10
rand() < .1 ? missing : w
end
CSV.write("data/wages.csv", df)

Julia plotting
using GLMakie, Makie, AlgebraOfGraphics
using CSV, DataFrames
using Plots: Plots
df = CSV.read("data/wages.csv", DataFrame)
# Makie.jl (GLMakie.jl) ##############################################
p = GLMakie.plot(df.age, df.wage)
save("out/julia/makie.png", p)
# Plots.jl ###########################################################
p = Plots.scatter(df.age, df.wage)
Plots.savefig(p, "out/julia/plots.png") # Plots.jl's own save function
# AlgebraOfGraphics.jl ###############################################
# This one is very messed up
p = data(df) * mapping(:age, :wage) |> draw
save("out/julia/algebraofgraphics.png", p)
# Mean behavior ######################################################
using Statistics # provides mean
m = mean(df.wage) # missing

R plotting
library(tidyverse)
df = read_csv("data/wages.csv")
# Base R #############################################################
png("out/R/baser.png")
plot(df$age, df$wage)
dev.off()
# ggplot #############################################################
p = df |>
ggplot(aes(x = age, y = wage)) +
geom_point()
ggsave("out/R/ggplot.png", p)
# Mean behavior ######################################################
mean(df$wage) # NA

Octave plotting
df = csvread('data/wages.csv', "emptyvalue", NA)
# Plotting ###########################################################
p = scatter(df(:, 1), df(:, 2))
saveas(p, "out/octave/octave.png", "png")
# Mean ###############################################################
m = mean(df(:, 2)) # NA

Python plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("data/wages.csv")
df = df.fillna(pd.NA)
# Pandas graphing ####################################################
ax = df.plot.scatter(x='age', y='wage')
ax.figure.savefig('out/python/pandas.png')
# Matplotlib graphing ################################################
age = np.array(df.loc[:, "age"])
wage = np.array(df.loc[:, "wage"])
p = plt.scatter(age, wage)
p.figure.savefig('out/python/pyplot.png')
# Pandas mean ########################################################
m = df.loc[:, "wage"].mean() # a real number (pandas skips NaN by default)
# Numpy mean #########################################################
x = np.array(df.loc[:, "wage"])
x.mean() # nan |
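The two mean behaviors in the Python listing above (pandas skipping missing values, NumPy propagating them) can be sketched with the standard library alone. This is only an illustration: the names `mean_propagate` and `mean_skip` are made up here and are not part of pandas or NumPy.

```python
import math


def mean_propagate(values):
    """Mean that propagates missing values: any NaN makes the result NaN."""
    return sum(values) / len(values)


def mean_skip(values):
    """Mean over the non-NaN values only; NaN if nothing is left."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


wages = [10.0, math.nan, 14.0]
print(mean_propagate(wages))  # nan
print(mean_skip(wages))       # 12.0
```

The skipping variant silently changes the denominator from 3 to 2, which is the trade-off the thread is arguing about.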
That doesn't address my point. Your example is about how to handle |
Matplotlib definitely doesn't just "ignore" nans; compare how the 2nd and 3rd lines here look in the plot:

import matplotlib.pyplot as plt
import numpy as np
plt.plot([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])
plt.plot([1, 2, 4, 5], np.array([2, 1, 5, 4]) + 1) # drop obs #3
plt.plot([1, 2, np.nan, 4, 5], np.array([2, 1, 3, 5, 4]) + 2) # replace obs #3 with nan

Also, even if/when it does ignore nans, this behavior can be used as a consistency argument for handling

TBH, I'm not that worried about this specific case of a plotting library, just concerned about the general direction towards silently ignoring a part of the data without the user explicitly asking for it or knowing about it. In plots, the best default would likely be to show missing values separately whenever possible, e.g. as a separate bin in the histogram, as a small inset in a scatterplot, etc. This would be a nice selling point of Julia and its plotting ecosystem, like
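The visible difference between dropping an observation and replacing it with nan can be sketched as follows. This is not Matplotlib's implementation, only a stdlib model of the observable result, and `split_at_nan` is a hypothetical helper: a nan breaks the line into separate segments, while a dropped observation lets its neighbors be joined.

```python
import math


def split_at_nan(xs, ys):
    """Split (x, y) points into runs of consecutive non-NaN points."""
    segments, current = [], []
    for x, y in zip(xs, ys):
        if math.isnan(x) or math.isnan(y):
            # A NaN ends the current run, leaving a gap in the line.
            if current:
                segments.append(current)
            current = []
        else:
            current.append((x, y))
    if current:
        segments.append(current)
    return segments


dropped = split_at_nan([1, 2, 4, 5], [3, 2, 6, 5])
with_nan = split_at_nan([1, 2, math.nan, 4, 5], [4, 3, 5, 7, 6])
print(len(dropped))   # 1
print(len(with_nan))  # 2
```

With the observation dropped there is one continuous line through four points; with nan in its place there are two disconnected two-point lines.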
|
@aplavin I would still say "ignore" is the correct word. Whether there is a gap in the line or it is filled in is not super material to me. The point is that the graph "works", the code runs, and the lines are still treated as continuous variables. Please look at how AlgebraOfGraphics.jl handles this scenario. The current behavior is very clearly a bug and could not reasonably be the intention of anyone making a plot.
This solution would make my life no better than the current behavior, as I would have to turn off these extras any time I wanted to make a plot. I will work on a PR to change the current behavior to match other plotting libraries. |
That attitude easily leads to silent correctness issues. Agree with @jariji, developing convenient but explicit and opt-in ways to handle missings is a much better solution. |
I want to re-emphasize that I explored the behavior of R, Python, Octave and other Julia libraries and all of them seamlessly create graphs with missing values. AlgebraOfGraphics.jl is the odd one out. @SimonDanisch hopefully you can help with a PR when I open one. @aplavin @jariji I will not be continuing this conversation, as it is not useful to go back and forth. |
Hey, I'm commenting because I need a simple way to tell AlgebraOfGraphics that my data is indeed continuous and not categorical, just because it contains

Maybe introducing something like

As a workaround for people in my situation, |
Missing values are now passed to Makie and do not signal that data is categorical anymore. |
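One way to picture the change described here is a scale-inference step that looks only at the non-missing entries of a column. The sketch below is hypothetical and is not AlgebraOfGraphics code: `infer_scale` is an invented name, and Python's `None` stands in for Julia's `missing`.

```python
from numbers import Real


def infer_scale(column):
    """Classify a column as 'continuous' or 'categorical', ignoring None."""
    present = [v for v in column if v is not None]
    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in present
    )
    # A column with no non-missing values falls back to categorical.
    return "continuous" if present and numeric else "categorical"


print(infer_scale([1.5, None, 3.0]))  # continuous
print(infer_scale(["a", None, "b"]))  # categorical
```

The missing entries themselves are left in the column, so whatever draws the plot still decides how to show them.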
It would be really nice if AlgebraOfGraphics.jl ignored missing values better.
Current behavior is unintuitive and likely not desired by anyone: it treats the entire column as categorical.
The desired behavior would be to drop pairs (x, y) where either x or y is missing, similar to how AlgebraOfGraphics treats NaN.
I'm happy to have a larger discussion about the semantics of missing for various edge cases that come up, but I deal with lots of missing data all the time and the current behavior makes it hard to use missing and iterate quickly to make plots.