Memory management question #2936
-
TL;DR: is there a pragmatic way to select some elements from a Document, keep the Elements in memory, and drop the larger Document to free its memory?

Problem:
I have a program which parses HTML in a somewhat memory-constrained environment. One issue I've run into is that if the HTML is sufficiently large, the parsed Nokogiri Document can have a heavy memory footprint. Typically I need to query a small number of elements out of the Document, but depending on the use case I may not be able to quickly extract what I need from the elements and return the values, dropping the document from scope and freeing up the memory.

Attempted Solution:
My initial attempt was to add a helper function which parses the HTML, queries the document with a given query, and returns the resulting elements. In my mind this meant only the elements would remain in scope/memory while the document itself was freed. However, I realized that this may not actually be the case:

```ruby
require "nokogiri"

# Parses the HTML and returns only the nodes matching the CSS query.
def query_html(html, query)
  n = Nokogiri::HTML(html)
  n.css(query)
end
```

Question:
Am I correct that this wrapper doesn't have the memory benefits I initially expected? And if it does not, is there a way to accomplish what I'm attempting?
Replies: 2 comments 1 reply
-
You're correct that this should keep the whole document in memory.

You could try creating a new DocumentFragment and inserting the results of the query into the fragment. I think that should move all of the matched nodes from the original Document into the fragment. Then you should be able to drop the original document.

I'm not at a computer to try right now, however.
-
An alternative might be to parse the document as HTML4 using one of libxml2's streaming APIs: the SAX parser or the Reader, for example. This will increase code complexity significantly, but may be worth it for the memory tradeoff if you're in a highly constrained environment.

For a sense of how complex the code ends up being, and for hints on how to structure your code, check out https://github.com/flavorjones/fairy-wing-throwdown/blob/master/lib/flavorjones.rb