Ignore missing values #488

pdeffebach · 2024-01-23T22:16:19Z

It would be really nice if AlgebraOfGraphics.jl ignored missing values better.

The current behavior is unintuitive and likely not desired by anyone: it treats the entire column as categorical.

The desired behavior would be to drop (x, y) pairs where either x or y is missing, similar to how AlgebraOfGraphics treats NaN.

I'm happy to have a larger discussion about the semantics of missing for various edge cases that come up, but I deal with lots of missing data all the time and the current behavior makes it hard to use missing and iterate quickly to make plots.

jariji · 2024-02-02T22:43:25Z

In Julia missing doesn't generally get dropped implicitly because the presence of a missing can be important information.

julia> sum([1,2, missing])
missing

As with sum, I prefer plots to reflect the existence of missings in the data. Otherwise I am at risk of massively misinterpreting my data. Having a simple way of skipping missing values could be nice, but I would like it to be opt-in.

pdeffebach · 2024-02-04T19:55:29Z

@jariji

This isn't quite right. Even though sum([1, 2, missing]) returns missing in Julia, existing libraries don't treat missing strictly. Both Plots.jl and Makie.jl omit missing values in arguments before plotting (similar to how GLM omits missing values before running a regression).

Going further I conducted a survey of Julia, R, Python, and Octave (because I don't have a Matlab installation) to better understand how each library treats missing values. I created a dataset in Julia of ages and wages for individuals. The :wage variable is missing for 10% of the population. For each language I assessed (1) how missing is treated in mean(x) where x contains missing values, and more importantly (2) how missing values are handled during plotting.

As you can see, in every language and plotting library other than AlgebraOfGraphics.jl, missing values are dropped before plotting. So if AlgebraOfGraphics.jl were to require separate handling of missing values, it would break with existing standards and expectations for plotting libraries.

Language	Plotting package	Missing treatment in mean	Missing treatment in plotting	Notes
Julia	Plots.jl	Returns missing	Ignores missing
Julia	Makie.jl	Returns missing	Ignores missing
Julia	AlgebraOfGraphics.jl	Returns missing	Converts to categorical
R	ggplot	Returns missing	Ignores missing
R	Base R	Returns missing	Ignores missing
Python	Pandas	Ignores missing	Ignores missing	Using pd.NA
Python	Numpy + Matplotlib	Returns missing	Ignores missing	Using nan
Octave	Base	Returns missing	Ignores missing	Using NA

Note that this is true even in languages where missing values are propagated in mean(df.wage), as Julia does. (The only framework which ignores missing values in mean is pandas.)

Also note that in all the examples I show, I use the closest possible value to missing. In R I use NA, in Pandas I use pd.NA, and in Octave I use NA. The one framework I test that has no NA value is NumPy, which only supports NaN.
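To make the difference between the two Python rows of the table concrete, here is a minimal sketch with made-up wage values (not the survey data). It assumes only that Series.mean skips NaN by default (skipna=True), that ndarray.mean propagates NaN, and that np.nanmean is the explicit opt-in to skipping.

```python
import numpy as np
import pandas as pd

# Toy wage column with one missing entry (values invented for illustration).
wage = pd.Series([10.0, 12.0, np.nan, 14.0])

pandas_mean = wage.mean()                  # skipna=True by default: NaN is skipped
pandas_strict = wage.mean(skipna=False)    # opt out of skipping: NaN propagates
numpy_mean = wage.to_numpy().mean()        # plain NumPy propagates NaN
numpy_skip = np.nanmean(wage.to_numpy())   # explicit opt-in to skipping NaN

print(pandas_mean, pandas_strict, numpy_mean, numpy_skip)
```

So pandas is the only one of the two that skips by default, and both offer a way to switch to the other behavior.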

Turning back to the earlier fact that missing is omitted in Makie.jl, my guess is that the current behavior could probably be considered a bug, and is the result of an unnecessary <:Real dispatch somewhere in AlgebraOfGraphics.jl. However, I can't find it at the moment. @SimonDanisch, do you know where AlgebraOfGraphics.jl might be treating a Union{Missing, <:Real} vector as categorical?

Below are my implementations:

Julia data generation

using CSV, DataFrames

N = 1000
γ = .05

df = DataFrame(age = rand(20:65, N))
df.wage = map(df.age) do a
    w = a * γ + rand() * 10
    rand() < .1 ? missing : w
end

CSV.write("data/wages.csv", df)

Julia plotting

using GLMakie, Makie, AlgebraOfGraphics
using CSV, DataFrames, Statistics

using Plots: Plots

df = CSV.read("data/wages.csv", DataFrame)

# Makie.jl (GLMakie.jl) ##############################################
p = GLMakie.plot(df.age, df.wage)
save("out/julia/makie.png", p)

# Plots.jl ###########################################################
p = Plots.scatter(df.age, df.wage)
Plots.savefig(p, "out/julia/plots.png")

# AlgebraOfGraphics.jl ###############################################
# This one is very messed up
p = data(df) * mapping(:age, :wage) |> draw
save("out/julia/algebraofgraphics.png", p)

# Mean behavior ######################################################
m = mean(df.wage) # missing

R plotting

library(tidyverse)

df = read_csv("data/wages.csv")

# Base R #############################################################
png("out/R/baser.png")
plot(df$age, df$wage)
dev.off()

# ggplot #############################################################
p = df |>
  ggplot(aes(x = age, y = wage)) +
  geom_point()

ggsave("out/R/ggplot.png", p)

# Mean behavior ######################################################
mean(df$wage) # NA

Octave plotting

df = csvread('data/wages.csv', "emptyvalue", NA)

# Plotting ###########################################################
p = scatter(df(:, 1), df(:, 2))
saveas(p, "out/octave/octave.png", "png")

# Mean ###############################################################
m = mean(df(:, 2)) # NA

Python plotting

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("data/wages.csv")
df = df.fillna(pd.NA)

# Pandas graphing ####################################################
ax = df.plot.scatter(x='age', y='wage')
ax.figure.savefig('out/python/pandas.png')

# Matplotlib graphing ################################################
age = np.array(df.loc[:, "age"])
wage = np.array(df.loc[:, "wage"])
fig, ax = plt.subplots()  # new figure, so points are not drawn onto the pandas axes
ax.scatter(age, wage)
fig.savefig('out/python/pyplot.png')

# Pandas mean ########################################################
m = df.loc[:, "wage"].mean() # a real number (missing values are skipped)

# Numpy mean #########################################################
x = np.array(df.loc[:, "wage"])
x.mean() # nan

jariji · 2024-02-04T20:33:28Z

ggplot2 does make an NA bar

> data.frame(a = c('one', 'two', 'two', NA)) %>% 
  ggplot(aes(a)) +
  geom_bar()

But I don't think prior art is the way to go here. A plot is a data summary, just like mean is, and there isn't really a principled justification for removing unknown values for plotting specifically. The consistency argument favors retaining them; the argument for dropping data is about pragmatics. I think we should find a simple way to satisfy users who want missing data hidden, but that shouldn't compromise the expectations of users who don't.

pdeffebach · 2024-02-04T20:44:07Z

That doesn't address my point. Your example is about how to handle missing when it is part of a categorical vector. I am talking about how to make a plot where I want a column to be treated as a continuous vector, but it contains missing.

aplavin · 2024-02-04T21:40:08Z

Matplotlib definitely doesn't just "ignore" nans, compare how the 2nd and 3rd lines here look in the plot:

import matplotlib.pyplot as plt
import numpy as np

plt.plot([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])
plt.plot([1, 2, 4, 5], np.array([2, 1, 5, 4]) + 1)  # drop obs #3
plt.plot([1, 2, np.nan, 4, 5], np.array([2, 1, 3, 5, 4]) + 2)  # replace obs #3 with nan

Also, even if/when it does ignore nans, this behavior can be used as a consistency argument for handling NaN in Julia, not missing. And indeed, for numeric values, NaN in Julia is the most well-specified, widely propagated, type-stable, performant, interoperable object to put when the actual value is not available.

TBH, I'm not that worried about this specific case of a plotting library; I'm just concerned about the general direction towards silently ignoring part of the data without the user explicitly asking for it or knowing about it.
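For reference, dropping incomplete pairs explicitly, rather than relying on the plotting library to do it, can be sketched as follows. The helper name drop_incomplete_pairs is hypothetical and not part of NumPy or Matplotlib; it also reports how many observations were removed, so the drop is not silent.

```python
import numpy as np

def drop_incomplete_pairs(x, y):
    """Keep only the (x, y) pairs where neither value is NaN.

    Returns the filtered arrays and the number of pairs removed.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))
    return x[keep], y[keep], int((~keep).sum())

# Same data as the third line in the Matplotlib example: obs #3 has x = nan.
x = [1, 2, np.nan, 4, 5]
y = [2, 1, 3, 5, 4]
x_clean, y_clean, n_dropped = drop_incomplete_pairs(x, y)
print(f"dropped {n_dropped} of {len(x)} observations")
```

The cleaned arrays can then be passed to plt.plot or plt.scatter, and the count makes it visible how much data was left out.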

In plots, the best default would likely be to show missing values separately whenever possible. E.g., as a separate bin in a histogram, as a small inset in a scatterplot, etc. This would be a nice selling point of Julia and its plotting ecosystem, like

Julia: language where data handling libraries have safe defaults 🌈

pdeffebach · 2024-02-04T21:59:06Z

@aplavin I would still say "ignore" is the correct word. Whether there is a gap in the line or it is filled in is not super material to me. The point is that the graph "works", the code runs, and the lines are still treated as continuous variables.

Please look at how AlgebraOfGraphics.jl handles this scenario. The current behavior is very clearly a bug and could not reasonably be the intention of anyone making a plot.

> In plots, the best default would likely be to show missing values separately whenever possible. E.g., as a separate bin in a histogram, as a small inset in a scatterplot, etc. This would be a nice selling point of Julia and its plotting ecosystem, like

This solution would not make my life any better than the current behavior, as I would have to turn off these extras any time I wanted to make a plot.

I will work on a PR to change current behavior to match other plotting libraries.

aplavin · 2024-02-04T22:19:11Z

> The point is that the graph "works", the code runs

That attitude easily leads to silent correctness issues. Agree with @jariji, developing convenient but explicit and opt-in ways to handle missings is a much better solution.
Luckily, lots of Julia functions return nothing to indicate "no value", which is not susceptible to such issues. Still, would be nice to have an alternative (such as missing) that always propagates whenever possible – but still is never silently ignored.

pdeffebach · 2024-02-04T22:23:12Z

I want to re-emphasize that I explored the behavior of R, Python, Octave and other Julia libraries and all of them seamlessly create graphs with missing values. AlgebraOfGraphics.jl is the odd one out.

@SimonDanisch hopefully you can help with a PR when I open one.

@aplavin @jariji I will not be continuing this conversation, as it is not useful to go back and forth.

laikq · 2024-02-22T14:56:43Z

Hey, I'm commenting because I need a simple way to tell AlgebraOfGraphics that my data is indeed continuous and not categorical just because it contains missing values. I agree that silently ignoring missing values could lead to misleading graphics, but I cannot imagine a case where treating Union{Missing, Float64} as categorical is actually what someone needs.

Maybe introducing a counterpart to nonnumeric would help? For example, mapping(:a, :b => continuous) could mark column :b as explicitly continuous.

As a workaround for people in my situation, mapping(:a, :b => (x -> coalesce(x, NaN))) works.

jkrumbiegel · 2024-08-22T08:34:42Z

Missing values are now passed to Makie and do not signal that data is categorical anymore.
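As a rough illustration of the idea only (a hypothetical Python sketch, not AlgebraOfGraphics.jl's actual implementation), a continuous-versus-categorical check can look past missing entries and classify a column by its non-missing values. Here None stands in for Julia's missing, and is_continuous is an assumed name.

```python
import numbers

def is_continuous(column):
    """Treat a column as continuous if every non-missing entry is a real number.

    None plays the role of a missing value and is ignored when classifying.
    A column with no non-missing entries is not considered continuous.
    """
    present = [v for v in column if v is not None]
    return len(present) > 0 and all(
        isinstance(v, numbers.Real) and not isinstance(v, bool) for v in present
    )

wage = [3.5, None, 7.25, 4.0]       # numeric data with a missing entry
group = ["a", None, "b", "a"]       # genuinely categorical data

print(is_continuous(wage), is_continuous(group))
```

Under this rule, the presence of a missing entry alone never turns a numeric column into a categorical one.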

jkrumbiegel mentioned this issue Jul 25, 2024

Don't treat continuous data with possible missings as categorical #514

Merged

jkrumbiegel closed this as completed Aug 22, 2024
