Incorrect transform pipeline generated when encoding references calculated column with binning inside layout #9354

Open
1 task done
jonmmease opened this issue May 20, 2024 · 0 comments
Bug Description

This appears to be the root cause of the problem originally reported against Vega-Altair in vega/altair#3423 (comment).

There seems to be a Vega-Lite issue when a new column is generated by a calculate transform and then referenced in an encoding channel with bin: true, and the mark is nested inside a layout (a layer, in the example below).

Here is a working example spec that calculates a new column using joinaggregate and calculate and then uses this column as the x encoding with binning enabled:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.17.0.json",
  "config": {"view": {"continuousWidth": 300, "continuousHeight": 300}},
  "data": {
    "url": "https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json"
  },
  "mark": {"type": "bar"},
  "encoding": {
    "x": {
      "bin": true,
      "field": "Rating_std",
      "type": "quantitative"
    },
    "y": {"aggregate": "count", "type": "quantitative"}
  },
  "transform": [
    {
      "joinaggregate": [
        {"op": "mean", "field": "IMDB_Rating", "as": "mean_val"},
        {"op": "stdev", "field": "IMDB_Rating", "as": "std_val"}
      ]
    },
    {
      "calculate": "(datum.IMDB_Rating - datum.mean_val) / datum.std_val",
      "as": "Rating_std"
    }
  ],
  "width": 400
}


This generates the following correct Vega transform pipeline:

      "transform": [
        {
          "type": "joinaggregate",
          "as": ["mean_val", "std_val"],
          "ops": ["mean", "stdev"],
          "fields": ["IMDB_Rating", "IMDB_Rating"]
        },
        {
          "type": "formula",
          "expr": "(datum.IMDB_Rating - datum.mean_val) / datum.std_val",
          "as": "Rating_std"
        },
        {
          "type": "extent",
          "field": "Rating_std",
          "signal": "bin_maxbins_10_Rating_std_extent"
        },
        {
          "type": "bin",
          "field": "Rating_std",
          "as": ["bin_maxbins_10_Rating_std", "bin_maxbins_10_Rating_std_end"],
          "signal": "bin_maxbins_10_Rating_std_bins",
          "extent": {"signal": "bin_maxbins_10_Rating_std_extent"},
          "maxbins": 10
        },
        ...
      ]

Note that the joinaggregate and formula transforms that create the Rating_std column come before the extent and bin transforms that reference this column.

Now nest this same chart, by itself, inside a layer:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.17.0.json",
  "config": {"view": {"continuousWidth": 300, "continuousHeight": 300}},
  "data": {
    "url": "https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json"
  },
  "layer": [
    {
      "mark": {"type": "bar"},
      "encoding": {
        "x": {
          "bin": true,
          "field": "Rating_std",
          "type": "quantitative"
        },
        "y": {"aggregate": "count", "type": "quantitative"}
      },
      "transform": [
        {
          "joinaggregate": [
            {"op": "mean", "field": "IMDB_Rating", "as": "mean_val"},
            {"op": "stdev", "field": "IMDB_Rating", "as": "std_val"}
          ]
        },
        {
          "calculate": "(datum.IMDB_Rating - datum.mean_val) / datum.std_val",
          "as": "Rating_std"
        }
      ]
    }
  ],
  "width": 400
}
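The step from the first spec to the layered one is mechanical, and can be sketched in Python for anyone who wants to reproduce it on other specs. This is only an illustration: `wrap_in_layer` and `UNIT_KEYS` are hypothetical names, not part of Vega-Lite or Altair, and the helper assumes that only `mark`, `encoding`, and `transform` need to move into the layer, as in the two specs above.

```python
import copy

# Keys that move from the top level into the single layer entry
# (an assumption based on the two specs in this issue).
UNIT_KEYS = ("mark", "encoding", "transform")


def wrap_in_layer(unit_spec):
    """Return a copy of unit_spec with its unit-level keys nested in a one-item layer."""
    spec = copy.deepcopy(unit_spec)  # leave the caller's spec untouched
    inner = {key: spec.pop(key) for key in UNIT_KEYS if key in spec}
    spec["layer"] = [inner]
    return spec


# Abbreviated version of the working spec from this issue.
unit = {
    "data": {"url": "https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json"},
    "mark": {"type": "bar"},
    "encoding": {
        "x": {"bin": True, "field": "Rating_std", "type": "quantitative"},
        "y": {"aggregate": "count", "type": "quantitative"},
    },
    "transform": [
        {"calculate": "(datum.IMDB_Rating - datum.mean_val) / datum.std_val", "as": "Rating_std"}
    ],
    "width": 400,
}

layered = wrap_in_layer(unit)
```

Here `data` and `width` stay at the top level, matching the layered spec above.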

Now the rect marks fail to display, because the following transform pipeline is generated:

      "transform": [
        {
          "type": "extent",
          "field": "Rating_std",
          "signal": "layer_0_bin_maxbins_10_Rating_std_extent"
        },
        {
          "type": "bin",
          "field": "Rating_std",
          "as": ["bin_maxbins_10_Rating_std", "bin_maxbins_10_Rating_std_end"],
          "signal": "layer_0_bin_maxbins_10_Rating_std_bins",
          "extent": {"signal": "layer_0_bin_maxbins_10_Rating_std_extent"},
          "maxbins": 10
        },
        {
          "type": "joinaggregate",
          "as": ["mean_val", "std_val"],
          "ops": ["mean", "stdev"],
          "fields": ["IMDB_Rating", "IMDB_Rating"]
        },
        {
          "type": "formula",
          "expr": "(datum.IMDB_Rating - datum.mean_val) / datum.std_val",
          "as": "Rating_std"
        },
        ...
      ]

Note that the extent and bin transforms now come before the joinaggregate and formula transforms that create the Rating_std column, so they reference a field that does not exist yet.
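To make the ordering problem easy to spot in other compiled specs, here is a rough Python sketch that walks a Vega transform list and flags any transform that reads a field before it has been produced. It is not part of Vega or Vega-Lite: `find_order_violations` is a hypothetical helper, the set of source fields is supplied by hand, only the four transform types shown above are handled, and formula inputs are found with a simple `datum.<name>` regex rather than a real expression parser.

```python
import re


def find_order_violations(transforms, source_fields):
    """Return (index, type, field) for every field that is read before it exists."""
    available = set(source_fields)
    violations = []
    for index, transform in enumerate(transforms):
        kind = transform["type"]
        # Collect the fields this transform reads.
        if kind in ("extent", "bin"):
            reads = [transform["field"]]
        elif kind == "joinaggregate":
            reads = [f for f in transform.get("fields", []) if f is not None]
        elif kind == "formula":
            # Simplification: treat every `datum.<name>` as a field reference.
            reads = re.findall(r"datum\.(\w+)", transform["expr"])
        else:
            reads = []  # other transform types are not modelled in this sketch
        for field in reads:
            if field not in available:
                violations.append((index, kind, field))
        # Record the fields this transform writes.
        written = transform.get("as", [])
        available.update([written] if isinstance(written, str) else written)
    return violations


join = {"type": "joinaggregate", "as": ["mean_val", "std_val"],
        "ops": ["mean", "stdev"], "fields": ["IMDB_Rating", "IMDB_Rating"]}
formula = {"type": "formula", "as": "Rating_std",
           "expr": "(datum.IMDB_Rating - datum.mean_val) / datum.std_val"}
extent = {"type": "extent", "field": "Rating_std"}
binning = {"type": "bin", "field": "Rating_std", "maxbins": 10,
           "as": ["bin_maxbins_10_Rating_std", "bin_maxbins_10_Rating_std_end"]}

correct = [join, formula, extent, binning]  # order from the unit spec
broken = [extent, binning, join, formula]   # order from the layered spec

print(find_order_violations(correct, {"IMDB_Rating"}))  # []
print(find_order_violations(broken, {"IMDB_Rating"}))
# [(0, 'extent', 'Rating_std'), (1, 'bin', 'Rating_std')]
```

The signal names are left out of the sample transforms because the check only depends on field names.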

Does anyone have any initial thoughts on what might trigger the difference in behavior when the chart is nested in a layout? I think this is probably a pretty common scenario (e.g. the original author wanted to overlay a rule on top of a histogram of a calculated column), so I'd like to make some progress on getting to the bottom of it.

Checklist

  • I checked for duplicate issues (though this is a little hard to search for)