This issue was moved to a discussion.

Box-Plot with outlier jitter #3148

Closed
buhtz opened this issue Nov 18, 2022 · 16 comments

Comments

@buhtz

buhtz commented Nov 18, 2022

image

What you see in that picture is a workaround for what I really would like to have. When searching the web, you often find the combine-boxplot-with-swarmplot solution. It would IMHO improve seaborn if this could be done via seaborn without a workaround.

The problems with that example are

  1. The outliers are drawn twice (green and red circles). Only the jittered outliers (the green ones) should be drawn.
  2. The non-outliers are also drawn. There is no need for them.

This is an MWE to reproduce that picture.

#!/usr/bin/env python3
import random
import pandas
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()

random.seed(0)

df = pandas.DataFrame({
    'Vals': random.choices(range(200), k=200)})
df_outliers = pandas.DataFrame({
    'Vals': random.choices(range(400, 700), k=20)})

df = pandas.concat([df, df_outliers], axis=0)

flierprops = {
    'marker': 'o',
    'markeredgecolor': 'red',
    'markerfacecolor': 'none'
}

# Usual boxplot
ax = sns.boxplot(y='Vals', data=df, flierprops=flierprops)

# Add jitter with the swarmplot function
ax = sns.swarmplot(y='Vals', data=df, linewidth=.75, color='none', edgecolor='green')
plt.show()
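
A minimal sketch of one way to address both problems listed above, assuming the usual 1.5 × IQR whisker rule: hide the box plot's own fliers with `showfliers=False` (a matplotlib option that `sns.boxplot` forwards) and overlay a jittered strip plot of the outliers only. The helper names `outlier_mask` and `box_with_jittered_outliers` are hypothetical and not part of seaborn.

```python
import numpy as np


def outlier_mask(values, whis=1.5):
    """Return True for points lying beyond whis * IQR from the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - whis * iqr) | (values > q3 + whis * iqr)


def box_with_jittered_outliers(df, col, jitter=0.1):
    """Draw a box without fliers, then overlay only the outliers, jittered."""
    # Imported here so that outlier_mask can be used without a plotting stack.
    import seaborn as sns

    outliers = df[outlier_mask(df[col])]
    # showfliers=False suppresses the un-jittered outlier markers of the box.
    ax = sns.boxplot(y=col, data=df, showfliers=False)
    # Only the outlier rows are passed on, so non-outliers are never drawn.
    sns.stripplot(y=col, data=outliers, jitter=jitter, color='green', ax=ax)
    return ax
```

With the `df` from the MWE above, `box_with_jittered_outliers(df, 'Vals')` followed by `plt.show()` would draw each outlier exactly once.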

@mwaskom
Owner

mwaskom commented Nov 18, 2022

This is something you can accomplish with a little post-hoc artist manipulation:

import numpy as np
import seaborn as sns

tips = sns.load_dataset("tips")
ax = sns.boxplot(data=tips, y="day", x="total_bill", whis=.2)
for artist in ax.lines:
    # Fliers are drawn as markers without a connecting line
    if artist.get_linestyle() == "None":
        pos = artist.get_ydata()
        artist.set_ydata(pos + np.random.uniform(-.05, .05, len(pos)))

image

Personally, I think this looks a little messy, but YMMV.

@buhtz
Author

buhtz commented Nov 18, 2022

Thanks a lot for that approach. I will use this as a workaround.

But please take this issue as a feature request for Seaborn. Jittered outliers are IMHO a common, frequently used case; e.g., ggplot can do this by parameter without workarounds.

@buhtz
Author

buhtz commented Nov 18, 2022

Please let me explain a bit more about the shortcoming of that (and every other) workaround.

image
I added some more outliers with nearly the same value. But they are all drawn nearly around the same place. This shouldn't happen.
If there are more values, they should be spread more widely.

@mwaskom
Owner

mwaskom commented Nov 18, 2022

ggplot can do this by parameter without workarounds.

"ggplot parity" is an explicit non-goal of seaborn, but nevertheless this open feature request suggests that might not be the case? tidyverse/ggplot2#4480

I added some more outliers with nearly the same value. But they are all drawn nearly around the same place. This shouldn't happen.

I have no idea what this means. Those points look jittered to me. Please supply a reproducible example if you're going to claim that something doesn't work.

@buhtz
Author

buhtz commented Nov 19, 2022

I have no idea what this means. Those points look jittered to me. Please supply a reproducible example

An MWE is not possible because it is not implemented yet. That is why the Issue exists.

Look in the figure at "Vals" round about 600. There are multiple outliers. There are many more outliers there than between 400 and 500 or at 700. The outliers there should be jittered a bit more from left to right.

I don't mean to draw all data points side by side like it is done in a swarm plot. A bit of "chaos" is totally OK, but the viewer should get an idea of how "big" the chaos is.

Technically explained:
In your example you calculate a jitter with np.random.uniform(-.05, .05, len(pos)). The "factor" here is 0.05. But for the Vals around 600 the factor should be "a bit more". The factor should depend on the frequency of data points at (nearly) the same position.
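
The idea of a jitter "factor" that grows with the local frequency of points can be sketched as follows. This is only an illustration under stated assumptions, not seaborn behaviour: the function name `density_scaled_jitter`, the neighbour window `bandwidth`, and the linear scaling up to `max_width` are all hypothetical choices.

```python
import numpy as np


def density_scaled_jitter(values, max_width=0.3, bandwidth=10.0, seed=None):
    """Return one offset per value; values in crowded regions spread wider.

    bandwidth is the distance (in data units) within which two values
    count as neighbours; max_width is the largest jitter half-width.
    """
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return np.zeros(0)
    # Number of points within `bandwidth` of each point (including itself).
    distances = np.abs(values[:, None] - values[None, :])
    counts = (distances <= bandwidth).sum(axis=1)
    # Linear scaling: an isolated point gets width 0, the most crowded
    # point gets max_width.
    width = max_width * (counts - 1) / max(counts.max() - 1, 1)
    rng = np.random.default_rng(seed)
    return rng.uniform(-1.0, 1.0, size=values.size) * width
```

In the flier loop of the workaround, the fixed `np.random.uniform(-.05, .05, len(pos))` term could then be replaced by `density_scaled_jitter(line.get_ydata())`, so that the cluster around 600 is spread more than the isolated points.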

Sorry, my English is not good enough to explain it better.

@mwaskom
Owner

mwaskom commented Nov 19, 2022

An MWE is not possible because it is not implemented yet. That is why the Issue exists.

???

What is the code that you used to create the image that you claim has a problem?

@buhtz
Author

buhtz commented Nov 19, 2022

What is the code that you used to create the image that you claim has a problem?

My initial MWE, including your workaround:

#!/usr/bin/env python3
import random
import pandas
import numpy
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()

random.seed(0)

df = pandas.DataFrame({
    'Vals': random.choices(range(200), k=200)})
df_outliers = pandas.DataFrame({
    'Vals': random.choices(range(400, 700), k=20)})

df = pandas.concat([df, df_outliers], axis=0)
for i in range(20):
    df = pandas.concat([df, pandas.DataFrame({'Vals': [600+i]})])

flierprops = {
    'marker': 'o',
    'markeredgecolor': 'red',
    'markerfacecolor': 'none'
}

# Usual boxplot
ax = sns.boxplot(y='Vals', data=df, flierprops=flierprops)

for line in ax.lines:
    if line.get_linestyle() == 'None':
        pos = line.get_xdata()
        pos = pos + numpy.random.uniform(-.05, .05, len(pos))
        line.set_xdata(pos)

plt.show()

It produces this:

image

@mwaskom
Owner

mwaskom commented Nov 19, 2022

The code I shared is a simple recipe. If you need something more complicated, feel free to expand it.

@jhncls

jhncls commented Nov 23, 2022

Here is a hacky way to work with a swarmplot instead of a stripplot for the outliers:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme()

df = pd.DataFrame({'Vals': np.concatenate([np.random.randint(0, 200, size=1000),
                                           np.random.randint(400, 700, size=100),
                                           np.arange(600, 620)])})
df['x'] = np.random.randint(0, 3, len(df))

ax = sns.boxplot(x='x', y='Vals', data=df, orient='v')

xpos = np.array([])
ypos = np.array([])
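for line in list(ax.lines):  # iterate over a copy because lines are removed below
    if line.get_linestyle() == 'None':  # fliers: markers without a connecting line
        xpos = np.append(xpos, line.get_xdata())
        ypos = np.append(ypos, line.get_ydata())
        line.remove()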
sns.swarmplot(x=xpos, y=ypos, ax=ax, color='red', orient='v')

plt.tight_layout()
plt.show()

image

@mwaskom
Owner

mwaskom commented Nov 23, 2022

Good thinking, you can actually make this even simpler by just plotting multiple swarms on the fly:

ax = sns.boxplot(x='x', y='Vals', data=df, orient='v', fliersize=0)
for line in ax.lines:
    if line.get_linestyle() == 'None':
        sns.swarmplot(x=line.get_xdata(), y=line.get_ydata())

@buhtz
Author

buhtz commented Nov 24, 2022

That was exactly what I was looking for. Thanks for the workaround.

The question now is whether this is a candidate for a new feature, no matter when it would be implemented.
If not, then we can close this issue.

@mwaskom
Owner

mwaskom commented Nov 24, 2022

-1 on this, there's already way too many options stuffed into the boxplot API

@buhtz
Author

buhtz commented Nov 24, 2022

Maybe this shouldn't be in Seaborn but in one of the underlying packages? plotly, matplotlib, or whatever does the magic?

@jhncls

jhncls commented Nov 24, 2022

For magic, you're in Seaborn territory. Seaborn extends matplotlib, Plotly is a very different beast.
I agree with Michael that the boxplot API is already quite complex. Adding too many options might confuse casual users even more.
A swarmplot works nicely in some situations, and not so well in others. For the moment, the code can exist as the hacky example in this thread. If a lot of people insist on it being added to the official code, it can be considered in the future.

@jcmkk3

jcmkk3 commented Nov 24, 2022

It hasn't really been mentioned, but I think that this will probably be relatively easy to do in the future with the seaborn objects interface. Maybe I'm missing something that would make that a challenge, though. The big missing pieces at the moment are happening in these PRs: #3127 and #3120.

@mwaskom
Owner

mwaskom commented Nov 26, 2022

Yes, you'd also need a stat transform that filters to/out outliers. (And a swarm mark since that's apparently what's actually desired here, not jitter).
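
As a rough illustration of what a transform that filters to or out of outliers would need to compute, the per-group filtering can be sketched in pandas before any plotting. This is not a seaborn API: the helper name `is_outlier` is hypothetical, and the 1.5 × IQR rule within each group is an assumption.

```python
import pandas as pd


def is_outlier(df, value_col, group_col, whis=1.5):
    """Boolean Series: True where a value lies beyond whis * IQR in its group."""
    # Work on floats so the broadcast quartiles are not cast back to integers.
    values = df[value_col].astype(float)
    grouped = values.groupby(df[group_col])
    # transform broadcasts each group's quartile back onto that group's rows.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (values < q1 - whis * iqr) | (values > q3 + whis * iqr)
```

Given `mask = is_outlier(df, 'Vals', 'x')`, `df[mask]` keeps only the outliers (filter "to") and `df[~mask]` drops them (filter "out").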

Boxplots are annoying in that they're a "standard" plot type but they're actually quite complicated to make and open the door to all sorts of API complexity.

Repository owner locked and limited conversation to collaborators Nov 27, 2022
@mwaskom mwaskom converted this issue into discussion #3161 Nov 27, 2022
