Ignore missing values #488

Closed
pdeffebach opened this issue Jan 23, 2024 · 10 comments

@pdeffebach

It would be really nice if AlgebraOfGraphics.jl ignored missing values better.

The current behavior is unintuitive and likely not desired by anyone: a single missing value causes the entire column to be treated as categorical.

The desired behavior would be to drop every (x, y) pair where either x or y is missing, similar to how AlgebraOfGraphics treats NaN.

I'm happy to have a larger discussion about the semantics of missing for various edge cases that come up, but I deal with lots of missing data all the time and the current behavior makes it hard to use missing and iterate quickly to make plots.
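
To make the requested semantics concrete, here is a minimal sketch of pairwise dropping using only Base Julia. The data and variable names are made up for illustration, and this is not how AlgebraOfGraphics.jl implements anything; it only shows the behavior being asked for.

```julia
# Illustrative data: x and y each contain a missing value at a different position.
x = [1.0, 2.0, missing, 4.0]
y = [10.0, missing, 30.0, 40.0]

# Keep only the indices where both coordinates are present.
keep = [!ismissing(a) && !ismissing(b) for (a, b) in zip(x, y)]

# After filtering, no missing values remain, so the vectors can be narrowed to Float64.
x_clean = Float64.(x[keep])
y_clean = Float64.(y[keep])

println(x_clean)  # [1.0, 4.0]
println(y_clean)  # [10.0, 40.0]
```

Only the first and last observations survive, because each of the middle two has one missing coordinate.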

@jariji

jariji commented Feb 2, 2024

In Julia missing doesn't generally get dropped implicitly because the presence of a missing can be important information.

julia> sum([1,2, missing])
missing

As with sum, I prefer plots to reflect the existence of missings in the data. Otherwise I am at risk of massively misinterpreting my data. Having a simple way of skipping missing values could be nice, but I would like it to be opt-in.
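
For reference, Base Julia already provides this kind of explicit opt-in for reductions through skipmissing. A small sketch (the variable names are illustrative):

```julia
xs = [1, 2, missing]

# Default: the missing value propagates through the reduction.
total = sum(xs)

# Opt-in: skipmissing wraps the vector in a lazy iterator that omits missing entries.
total_skipped = sum(skipmissing(xs))

println(total)          # missing
println(total_skipped)  # 3
```

An opt-in for plotting could follow the same pattern: nothing is dropped unless the user asks for it.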

@pdeffebach
Author

@jariji

This isn't quite right. Even though sum([1, 2, missing]) returns missing in Julia, existing plotting libraries don't treat missing this strictly. Both Plots.jl and Makie.jl omit missing values in arguments before plotting (similar to how GLM omits missing before running a regression).

Going further, I conducted a survey of Julia, R, Python, and Octave (because I don't have a Matlab installation) to better understand how each library treats missing values. I created a dataset in Julia of ages and wages for individuals. The :wage variable is missing for roughly 10% of the population. For each language I assessed (1) how missing is treated in mean(x) where x contains missing values, and more importantly (2) how missing values are handled during plotting.

As you can see, in every language and plotting library except AlgebraOfGraphics.jl, missing values are dropped before plotting. So if AlgebraOfGraphics.jl were to require separate handling of missing values, it would break with existing standards and expectations for plotting libraries.

| Language | Plotting package | Missing treatment in mean | Missing treatment in plotting | Notes |
|---|---|---|---|---|
| Julia | Plots.jl | Returns missing | Ignores missing | |
| Julia | Makie.jl | Returns missing | Ignores missing | |
| Julia | AlgebraOfGraphics.jl | Returns missing | Converts to categorical | |
| R | ggplot | Returns missing | Ignores missing | |
| R | Base R | Returns missing | Ignores missing | |
| Python | Pandas | Ignores missing | Ignores missing | Using pd.NA |
| Python | Numpy + Matplotlib | Returns missing | Ignores missing | Using nan |
| Octave | Base | Returns missing | Ignores missing | Using NA |

Note that this is true even in languages where missing values are propagated in mean(df.wage), as Julia does. (The only framework which ignores missing values in mean is pandas.)

Also note that in all the examples I show, I use the closest possible value to missing. In R I use NA, in Pandas I use pd.NA, and in Octave I use NA. The only framework I test that has no NA is Numpy, which only supports NaN.

Turning back to the earlier fact that missing is omitted in Makie.jl, my guess is that the current behavior could probably be considered a bug, and is the result of an unnecessary <:Real dispatch somewhere in AlgebraOfGraphics.jl. However, I can't find it at the moment. @SimonDanisch, do you know where AlgebraOfGraphics.jl might be treating a Union{Missing, <:Real} vector as categorical?
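
To illustrate the kind of dispatch I have in mind, here is a purely hypothetical sketch. The function names is_continuous_strict and is_continuous_lenient are invented for this example and are not taken from the AlgebraOfGraphics.jl source; the point is only that a <:Real method does not match a vector whose element type includes Missing.

```julia
# Hypothetical classifier, NOT taken from the AlgebraOfGraphics.jl source.
# A vector only counts as continuous if its element type is a subtype of Real.
is_continuous_strict(::AbstractVector{<:Real}) = true
is_continuous_strict(::AbstractVector) = false

# Variant that strips Missing from the element type before checking.
is_continuous_lenient(v::AbstractVector) = nonmissingtype(eltype(v)) <: Real

wage = [1.5, missing, 3.0]   # eltype is Union{Missing, Float64}

println(is_continuous_strict(wage))   # false
println(is_continuous_lenient(wage))  # true
```

Union{Missing, Float64} is not a subtype of Real, so the strict method falls through to the generic fallback, which would explain a column being classified as categorical.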

Below are my implementations:

Julia data generation
using CSV, DataFrames

N = 1000
γ = .05

df = DataFrame(age = rand(20:65, N))
df.wage = map(df.age) do a
    w = a * γ + rand() * 10
    rand() < .1 ? missing : w
end

CSV.write("data/wages.csv", df)

Julia plotting
using GLMakie, Makie, AlgebraOfGraphics
using CSV, DataFrames
using Statistics  # provides mean, used at the end of this script

using Plots: Plots

df = CSV.read("data/wages.csv", DataFrame)

# Makie.jl (GLMakie.jl) ##############################################
p = GLMakie.plot(df.age, df.wage)
save("out/julia/makie.png", p)

# Plots.jl ###########################################################
p = Plots.scatter(df.age, df.wage)
Plots.savefig(p, "out/julia/plots.png")  # Plots.jl figures are saved with savefig, not Makie's save

# AlgebraOfGraphics.jl ###############################################
# This one is very messed up
p = data(df) * mapping(:age, :wage) |> draw
save("out/julia/algebraofgraphics.png", p)

# Mean behavior ######################################################
m = mean(df.wage) # missing

R plotting
library(tidyverse)

df = read_csv("data/wages.csv")

# Base R #############################################################
png("out/R/baser.png")
plot(df$age, df$wage)
dev.off()

# ggplot #############################################################
p = df |>
  ggplot(aes(x = age, y = wage)) +
  geom_point()

ggsave("out/R/ggplot.png", p)

# Mean behavior ######################################################
mean(df$wage) # NA

Octave plotting
df = csvread('data/wages.csv', "emptyvalue", NA)

# Plotting ###########################################################
p = scatter(df(:, 1), df(:, 2))
saveas(p, "out/octave/octave.png", "png")

# Mean ###############################################################
m = mean(df(:, 2)) # NA

Python plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("data/wages.csv")
df = df.fillna(pd.NA)

# Pandas graphing ####################################################
ax = df.plot.scatter(x='age', y='wage')
ax.figure.savefig('out/python/pandas.png')

# Matplotlib graphing ################################################
age = np.array(df.loc[:, "age"])
wage = np.array(df.loc[:, "wage"])
p = plt.scatter(age, wage)
p.figure.savefig('out/python/pyplot.png')

# Pandas mean ########################################################
m = df.loc[:, "wage"].mean() # A real

# Numpy mean #########################################################
x = np.array(df.loc[:, "wage"])
x.mean() # nan

@jariji

jariji commented Feb 4, 2024

ggplot2 does make an NA bar

> data.frame(a = c('one', 'two', 'two', NA)) %>% 
  ggplot(aes(a)) +
  geom_bar()

But I don't think prior art is the way to go here. A plot is a data summary, just like mean is, and there isn't really a principled justification for removing unknown values for plotting specifically. The consistency argument favors retaining them; the argument for dropping data is about pragmatics. I think we should find a simple way to satisfy users who want missing data hidden, but that shouldn't compromise the expectations of users who don't.

@pdeffebach
Author

That doesn't address my point. Your example is about how to handle missing when it is part of a categorical vector. I am talking about how to make a plot where I want a column to be treated as a continuous vector, but it contains missing.

@aplavin
Contributor

aplavin commented Feb 4, 2024

Matplotlib definitely doesn't just "ignore" nans. Compare how the 2nd and 3rd lines here look in the plot:

import matplotlib.pyplot as plt
import numpy as np

plt.plot([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])
plt.plot([1, 2, 4, 5], np.array([2, 1, 5, 4]) + 1)  # drop obs #3
plt.plot([1, 2, np.nan, 4, 5], np.array([2, 1, 3, 5, 4]) + 2)  # replace obs #3 with nan

Also, even if/when it does ignore nans, this behavior can be used as a consistency argument for handling NaN in Julia, not missing. And indeed, for numeric values, NaN in Julia is the most well-specified, widely propagated, type-stable, performant, interoperable object to put when the actual value is not available.

TBH, I'm not that worried about this specific case of a plotting library. I'm just concerned about the general direction towards silently ignoring part of the data without the user explicitly asking for it or knowing about it.

In plots, the best default would likely be to show missing values separately whenever possible. Eg, as a separate bin in the histogram, as a small inset in a scatterplot, etc. This would be a nice selling point of Julia and its plotting ecosystem, like

Julia: language where data handling libraries have safe defaults 🌈

@pdeffebach
Author

@aplavin I would still say "ignore" is the correct word. Whether there is a gap in the line or it is filled in is not super material to me. The point is that the graph "works", the code runs, and the lines are still treated as continuous variables.

Please look at how AlgebraOfGraphics.jl handles this scenario. The current behavior is very clearly a bug and could not reasonably be the intention of anyone making a plot.

> In plots, the best default would likely be to show missing values separately whenever possible. Eg, as a separate bin in the histogram, as a small inset in a scatterplot, etc. This would be a nice selling point of Julia and its plotting ecosystem, like

This solution would not make my life any better than the current behavior, as I would have to turn off these extras any time I wanted to make a plot.

I will work on a PR to change current behavior to match other plotting libraries.

@aplavin
Contributor

aplavin commented Feb 4, 2024

> The point is that the graph "works", the code runs

That attitude easily leads to silent correctness issues. I agree with @jariji: developing convenient but explicit, opt-in ways to handle missings is a much better solution.
Luckily, lots of Julia functions return nothing to indicate "no value", which is not susceptible to such issues. Still, it would be nice to have an alternative (such as missing) that always propagates whenever possible, but is still never silently ignored.
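
A quick sketch of that difference, using only Base Julia (the variable names are illustrative): nothing makes downstream arithmetic fail loudly, while missing propagates through it.

```julia
idx = findfirst(==(5), [1, 2, 3])   # no match, so idx is nothing

# Arithmetic on nothing throws instead of propagating or being skipped.
nothing_errors = try
    idx + 1
    false
catch err
    err isa MethodError
end

# missing, by contrast, propagates through the same operation.
propagated = missing + 1

println(idx === nothing)        # true
println(nothing_errors)         # true
println(ismissing(propagated))  # true
```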

@pdeffebach
Author

I want to re-emphasize that I explored the behavior of R, Python, Octave and other Julia libraries and all of them seamlessly create graphs with missing values. AlgebraOfGraphics.jl is the odd one out.

@SimonDanisch hopefully you can help with a PR when I open one.

@aplavin @jariji I will not be continuing this conversation, as it is not useful to go back and forth.

@laikq

laikq commented Feb 22, 2024

Hey, I'm commenting because I need a simple way to tell AlgebraOfGraphics that my data is continuous rather than categorical even though it contains missing values. I agree that silently ignoring missing values could lead to misleading graphics, but I cannot imagine a case where treating Union{Missing, Float64} as categorical is actually what someone needs.

Maybe introducing a counterpart to nonnumeric would help? For example, mapping(:a, :b => continuous) could mark column :b as explicitly continuous.

As a workaround for people in my situation, mapping(:a, :b => (x -> coalesce(x, NaN))) works.
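
To show what the workaround does to the data, here is a small sketch outside of AlgebraOfGraphics, using only Base Julia on a made-up column. It applies the same coalesce call to a whole vector via broadcasting:

```julia
b = [1.5, missing, 3.0]   # Vector{Union{Missing, Float64}}

# coalesce returns its first non-missing argument, so each missing becomes NaN.
b_nan = coalesce.(b, NaN)

println(eltype(b))        # Union{Missing, Float64}
println(eltype(b_nan))    # Float64
println(isnan(b_nan[2]))  # true
```

The result is a plain Float64 vector, so it is no longer mistaken for categorical data. The trade-off is that the plot can no longer distinguish values that were missing from values that were NaN to begin with.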

@jkrumbiegel
Member

Missing values are now passed to Makie and do not signal that data is categorical anymore.
