Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about original and new SanitizationFilter #406

Open
digitalmoksha opened this issue Jun 4, 2024 · 2 comments
Open

Question about original and new SanitizationFilter #406

digitalmoksha opened this issue Jun 4, 2024 · 2 comments

Comments

@digitalmoksha
Copy link
Contributor

In the original SanitizationFilter, which uses Sanitize, you had two transformers

        transformers: [
          # Top-level <li> elements are removed because they can break out of
          # containing markup.
          lambda { |env|
            name = env[:node_name]
            node = env[:node]
            if name == LIST_ITEM && node.ancestors.none? { |n| LISTS.include?(n.name) }
              node.replace(node.children)
            end
          },

          # Table child elements that are not contained by a <table> are removed.
          lambda { |env|
            name = env[:node_name]
            node = env[:node]
            if (TABLE_SECTIONS.include?(name) || TABLE_ITEMS.include?(name)) && node.ancestors.none? { |n| n.name == TABLE }
              node.replace(node.children)
            end
          }
        ].freeze

I don't see these in the new version based on Selma. Are these transformations no longer needed?

I assume that any transformations would need to be written as Selma handlers. I don't see a way to add those to the sanitization config, so I'm guessing these would need to be node filters.

@gjtorikian
Copy link
Owner

If I remember correctly, at the time, I thought that this seemed unnecessary, since as you said the same action can be accomplished with node filters. As I look at this again, though, I could see a use case for some kind of config to simplify node addition/removal. Something like (spitballing):

remove_unsemantic: ["li", Config::TABLE_CHILD_ELEMENTS]

Since I can imagine some people would use this gem to create well-formed markup, perhaps the gem should just make it much easier to indicate whether you want parents or children of a certain node moved/replaced/deleted/etc.

Does that answer your question, or did I misunderstand?

@digitalmoksha
Copy link
Contributor Author

I guess initially I was curious if you thought those conditions still needed to be sanitized. If so, it might make sense to include those in the filter, as handlers passed into Selma::Rewriter.new(sanitizer: sanitization_config).

Since I can imagine some people would use this gem to create well-formed markup, perhaps the gem should just make it much easier to indicate whether you want parents or children of a certain node moved/replaced/deleted/etc.

I can see the usefulness of that. Though currently my transformers are more along the lines of remove the class unless the class == 'anchor', or remove the id unless it's a footnote or other parser generated id. I think I will need handlers for those.

For now I plan on using Selma directly for sanitization until I can do the much heavier lift of upgrading html-pipeline

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants