Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML table for the view of Grouped Dataset/GatherBy Dataset in Jupyter Notebook #83

Open
Kexi002 opened this issue Sep 7, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@Kexi002
Copy link
Contributor

Kexi002 commented Sep 7, 2022

Hi,

I found there is no HTML table display for the view Grouped Dataset/GatherBy Dataset, which is generated by groupby() or gatherby(). These views are still displayed in plain text, like:

image

So I think it will be a good idea to add an HTML show function for GroupBy and GatherBy. Since GroupBy and GatherBy do not belong to AbstractDataset, we cannot simply modify the show function of AbstractDataset, but its code could be reused.

It will be easy to fix this issue by passing the view(gds.parent, gds.perm, :) into an AbstractDataset show function's copy, whose summary message is modified. This is my improvement:

image

And it also works well with large Dataset:

image
(omitted rows)
image

@Kexi002
Copy link
Contributor Author

Kexi002 commented Sep 7, 2022

I'm checking this solution for some details. In order to modify the displayed summary message, I need to add some new show functions. Should I place them in io.jl or groupby.jl/gatherby.jl?

@sl-solution
Copy link
Owner

sl-solution commented Sep 7, 2022

It should be good to fix this issue.

I think, one way to fix this issue is to modify _show(io::IO, ::MIME"text/html", ds::AbstractDataset;...) in such away that it can accept a keyword argument which is the permutation of observations. By default, it can be 1:n (n is the number of rows), however, for GroupBy and GatherBy you can pass the observations permutation.

@sl-solution sl-solution added the enhancement New feature or request label Sep 7, 2022
@Kexi002
Copy link
Contributor Author

Kexi002 commented Sep 7, 2022

Just note that for very long ds view(ds, perm, :) is very slow (it is PrettyTables.jl issue).

You mean Dataset with many rows? I made some test for large Dataset:

10,000,000 rows * 10 columns:
image
(omitted rows)
image

1,000,000 rows * 100 columns:
image
(omitted rows)
image

100,000 rows * 1,000 columns:
image
(omitted rows)
image

It seems like these large Dataset are displayed normally (By default displaing 30 rows). But once the entries of the Dataset exceeds 100,000,000, my program will crash. I'm not sure if my computer is running out of memory or it's a display issue.

I think, to fix this issue, you may modify _show(io::IO, ::MIME"text/html", ds::AbstractDataset;...) in such away that it can accept a keyword argument which is the permutation of observations. By default, it can be 1:n (n is the number of rows), however, for GroupBy and GatherBy you can pass the observations permutation.

Add a permutation argument might be a good idea, so we don't need to add additional show functions. I will try that.

My question is that I am worried that modifying the existing _show() function will affect many places. But if just copy a _show() function specifically for GroupBy, it would be less risky, but might add more code. How to make a trade-off?

@Kexi002
Copy link
Contributor Author

Kexi002 commented Sep 7, 2022

I just thought of another problem. If we pass a permutation argument, we can judge that this is a GroupBy/GatherBy, not an AbstractDataset, and print the correct summary message.

However, how can we distinguish GroupBy and GatherBy if we only pass a permutation argument? This would be a problem unless they all print the same summary message.

@sl-solution
Copy link
Owner

adding a keyword argument is one way, another one way can be modifying _show to work on permutations of observations. We have an internal function _get_perms that you can use for this purpose.

However, I think changing _show is better compared to adding a new similar function.

@Kexi002
Copy link
Contributor Author

Kexi002 commented Sep 13, 2022

Hi, I'm trying to add some functions for GroupBy/GatherBy, so the _show() function could be reused to output the HTML table for these two types.

I'm not sure how to implement eachcol() for GroupBy/GatherBy will be reasonable. The implementation of AbstractDataset eachcol() create a DatasetColumns for that data set, which just a structure contains the data set.

Since the permutation will affect the max widths of the first N rows of a column, I can't just use something like eachcol(ds.parent) to solve this problem. Could you give me any suggestions? @sl-solution

@Kexi002
Copy link
Contributor Author

Kexi002 commented Sep 14, 2022

Another question is, should we be able to access a GroupBy entry by gds[!, :x1] (exclamation mark)? Because actually GroupBy doesn't have a real original Dataset.

Should we be able to modify its parent Dataset by itself?

@sl-solution
Copy link
Owner

We should not define eachcol or getindex or setindex! for GroupBy/GatherBy. GroupBy and GatherBy are supposed to be consumed immediately and never being saved or modified, because, small changes on them invalidate their properties. So we like to define as less functions as possible for them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants