Extracting embedded JSON objects with rogue elements #2805
What problem are you trying to solve?
Does -- or could -- Nokogiri have a built-in way of extracting inline JSON objects in HTML, including handling of rogue elements that appear in the wild? ( Fixing
Please show your code!
It would be nice to be able to do something like this to get the JSON object:
Environment
Replies: 2 comments 1 reply
Hi, @forthrin, thanks for asking this question!

Nokogiri specifically wraps XML and HTML parsers, and so unfortunately has no capability to parse JavaScript.

That said, you may want to look at something like https://github.com/nene/rkelly-remix which is a pure-Ruby JavaScript parser! You should be able to use it to examine the contents of a <script> tag:

    #! /usr/bin/env ruby
    require "bundler/inline"

    # Install and load the gems inline so the script is self-contained.
    gemfile do
      source "https://rubygems.org"
      gem "rkelly-remix"
      gem "nokogiri"
    end

    require "rkelly"

    markup = '<script>foo={"a": 1, "b": undefined, "c": function(){}};</script><script>bar={"baz": 2};</script>'

    # Parse the HTML and grab the first <script> element.
    html_doc = Nokogiri.HTML5(markup)
    script = html_doc.at_css("script")
    script.content # => "foo={\"a\": 1, \"b\": undefined, \"c\": function(){}};"

    # Parse the script body as JavaScript and print each top-level S-expression.
    RKelly::Parser.new.parse(script.content).to_sexp.each do |k, v|
      pp [k, v]
    end
    # >> [:expression,
    # >>  [:op_equal,
    # >>   [:resolve, "foo"],
    # >>   [:object,
    # >>    [[:property, :"\"a\"", [:lit, 1]],
    # >>     [:property, :"\"b\"", [:resolve, "undefined"]],
    # >>     [:property, :"\"c\"", [:func_expr, "function", [], [:func_body, []]]]]]]]

Hope this helps?
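If the goal is a plain Ruby hash with the rogue elements dropped, the S-expression above could be walked with something like the following. This is only a minimal sketch: it assumes the node shapes shown in that output (`:expression`, `:op_equal`, `:object`, `:property`, `:lit`, `:resolve`, `:func_expr`) and treats every other node type as rogue. The helper names `sexp_to_ruby` and `extract_assignments` are hypothetical and not part of rkelly-remix or Nokogiri, and the S-expression is hard-coded here so the snippet runs with the standard library alone.

```ruby
require "json"

# S-expression copied from the output above (the assignment to `foo`).
SEXP = [
  [:expression,
   [:op_equal,
    [:resolve, "foo"],
    [:object,
     [[:property, :"\"a\"", [:lit, 1]],
      [:property, :"\"b\"", [:resolve, "undefined"]],
      [:property, :"\"c\"", [:func_expr, "function", [], [:func_body, []]]]]]]]
].freeze

# Sentinel for values that have no JSON equivalent (undefined, functions, ...).
ROGUE = Object.new.freeze

# Hypothetical helper: convert a value node into a Ruby object.
# Only :lit and :object are understood; anything else is reported as ROGUE.
def sexp_to_ruby(node)
  type, *rest = node
  case type
  when :lit
    rest.first
  when :object
    rest.first.each_with_object({}) do |(_, key, value), hash|
      converted = sexp_to_ruby(value)
      next if converted.equal?(ROGUE) # drop undefined, functions, etc.

      # Keys arrive as symbols that still carry their double quotes.
      hash[key.to_s.delete_prefix('"').delete_suffix('"')] = converted
    end
  else
    ROGUE
  end
end

# Hypothetical helper: collect top-level `name = value` assignments.
def extract_assignments(sexp)
  sexp.each_with_object({}) do |(type, body), result|
    next unless type == :expression && body.is_a?(Array) && body.first == :op_equal

    _, target, value = body
    next unless target.first == :resolve

    converted = sexp_to_ruby(value)
    result[target.last] = converted unless converted.equal?(ROGUE)
  end
end

data = extract_assignments(SEXP)
puts JSON.generate(data)
```

With a real document you would pass `RKelly::Parser.new.parse(script.content).to_sexp` instead of the hard-coded `SEXP`. Strings, arrays, and other literals would need their own `when` branches once you have checked how they appear in the S-expression.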
Thanks for your prompt and helpful reply. I agree that identifying inline (and semi-invalid) JSON, however common, is ostensibly out of scope for an HTML parser. It looks like your suggested library does the job. However, two questions: