Incorrect transform pipeline generated when encoding references calculated column with binning inside layout #9354

jonmmease · 2024-05-20T11:20:53Z

Bug Description

This is the root cause of the problem originally reported for Vega-Altair in vega/altair#3423 (comment).

There seems to be a Vega-Lite issue when a new column is generated by a calculate transform and then referenced in an encoding channel with bin: true, and the mark is nested inside a layout (a layer, in the example below).

Here is a working example spec that calculates a new column using joinaggregate and calculate and then uses this column as the x encoding with binning enabled:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.17.0.json",
  "config": {"view": {"continuousWidth": 300, "continuousHeight": 300}},
  "data": {
    "url": "https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json"
  },
  "mark": {"type": "bar"},
  "encoding": {
    "x": {
      "bin": true,
      "field": "Rating_std",
      "type": "quantitative"
    },
    "y": {"aggregate": "count", "type": "quantitative"}
  },
  "transform": [
    {
      "joinaggregate": [
        {"op": "mean", "field": "IMDB_Rating", "as": "mean_val"},
        {"op": "stdev", "field": "IMDB_Rating", "as": "std_val"}
      ]
    },
    {
      "calculate": "(datum.IMDB_Rating - datum.mean_val) / datum.std_val",
      "as": "Rating_std"
    }
  ],
  "width": 400
}

This generates the following correct Vega transform pipeline:

      "transform": [
        {
          "type": "joinaggregate",
          "as": ["mean_val", "std_val"],
          "ops": ["mean", "stdev"],
          "fields": ["IMDB_Rating", "IMDB_Rating"]
        },
        {
          "type": "formula",
          "expr": "(datum.IMDB_Rating - datum.mean_val) / datum.std_val",
          "as": "Rating_std"
        },
        {
          "type": "extent",
          "field": "Rating_std",
          "signal": "bin_maxbins_10_Rating_std_extent"
        },
        {
          "type": "bin",
          "field": "Rating_std",
          "as": ["bin_maxbins_10_Rating_std", "bin_maxbins_10_Rating_std_end"],
          "signal": "bin_maxbins_10_Rating_std_bins",
          "extent": {"signal": "bin_maxbins_10_Rating_std_extent"},
          "maxbins": 10
        },
        ...
      ]

Note that the joinaggregate and formula transforms that create the Rating_std column come before the extent and bin transforms that reference this column.

Now nest this same chart, by itself, inside a layer:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.17.0.json",
  "config": {"view": {"continuousWidth": 300, "continuousHeight": 300}},
  "data": {
    "url": "https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json"
  },
  "layer": [
    {
      "mark": {"type": "bar"},
      "encoding": {
        "x": {
          "bin": true,
          "field": "Rating_std",
          "type": "quantitative"
        },
        "y": {"aggregate": "count", "type": "quantitative"}
      },
      "transform": [
        {
          "joinaggregate": [
            {"op": "mean", "field": "IMDB_Rating", "as": "mean_val"},
            {"op": "stdev", "field": "IMDB_Rating", "as": "std_val"}
          ]
        },
        {
          "calculate": "(datum.IMDB_Rating - datum.mean_val) / datum.std_val",
          "as": "Rating_std"
        }
      ]
    }
  ],
  "width": 400
}
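
The second spec differs from the first only in where mark, encoding, and transform live. A minimal Python sketch of that rewrite, assuming the spec has been loaded as a plain dict; `wrap_in_layer` is a hypothetical helper name used for illustration, not part of Vega-Lite or Altair:

```python
import copy

# Keys that move into the single layer entry; everything else stays at the top level.
UNIT_KEYS = ("mark", "encoding", "transform")


def wrap_in_layer(unit_spec):
    """Return a copy of a unit spec with its mark, encoding, and transform
    nested inside a one-element "layer" array."""
    spec = copy.deepcopy(unit_spec)  # leave the caller's dict untouched
    layer_entry = {key: spec.pop(key) for key in UNIT_KEYS if key in spec}
    spec["layer"] = [layer_entry]
    return spec
```

Applying this to the first spec should give a spec equivalent to the layered one above (key order aside), which makes it easy to compile both variants and compare the generated pipelines.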

Now the rect marks fail to display because this is the transform pipeline generated:

      "transform": [
        {
          "type": "extent",
          "field": "Rating_std",
          "signal": "layer_0_bin_maxbins_10_Rating_std_extent"
        },
        {
          "type": "bin",
          "field": "Rating_std",
          "as": ["bin_maxbins_10_Rating_std", "bin_maxbins_10_Rating_std_end"],
          "signal": "layer_0_bin_maxbins_10_Rating_std_bins",
          "extent": {"signal": "layer_0_bin_maxbins_10_Rating_std_extent"},
          "maxbins": 10
        },
        {
          "type": "joinaggregate",
          "as": ["mean_val", "std_val"],
          "ops": ["mean", "stdev"],
          "fields": ["IMDB_Rating", "IMDB_Rating"]
        },
        {
          "type": "formula",
          "expr": "(datum.IMDB_Rating - datum.mean_val) / datum.std_val",
          "as": "Rating_std"
        },
        ...
      ]

Note that the extent and bin transforms now come before the joinaggregate and formula transforms that create the Rating_std column, so they read a field that does not exist yet.
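
To spot this kind of ordering problem in compiled output, a rough check like the following may help. It is only a sketch under stated assumptions: `fields_used_before_defined` is a hypothetical helper, it handles only the four transform types that appear in this issue, and it does not parse formula expressions.

```python
def fields_used_before_defined(transforms, source_fields):
    """Return (index, transform type, field) for each transform that reads a
    field which neither the source data nor an earlier transform provides."""
    available = set(source_fields)
    problems = []
    for index, transform in enumerate(transforms):
        kind = transform.get("type")
        if kind in ("extent", "bin"):
            reads = [transform["field"]]
        elif kind == "joinaggregate":
            reads = list(transform.get("fields", []))
        else:
            reads = []  # formula expressions are not parsed in this sketch
        for field in reads:
            if field not in available:
                problems.append((index, kind, field))
        # Record the fields this transform writes ("as" is a string or a list).
        produced = transform.get("as", [])
        if isinstance(produced, str):
            produced = [produced]
        available.update(produced)
    return problems
```

Called with the layered pipeline above (minus the elided entries) and `source_fields={"IMDB_Rating"}`, this flags the extent and bin transforms at indices 0 and 1 as reading Rating_std too early; for the non-layered pipeline it returns an empty list.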

Does anyone have any initial thoughts on what might trigger the difference in behavior when the chart is nested in a layout? I think this is probably a pretty common scenario (e.g. the original author wanted to overlay a rule on top of a histogram of a calculated column), so I'd like to make some progress on getting to the bottom of it.

Checklist

I checked for duplicate issues (though this is a little hard to search for)


jonmmease added the Bug 🐛 label May 20, 2024

jonmmease mentioned this issue May 20, 2024

Can't overlay a calculated chart and mark_rule vega/altair#3423

Open
