Memory management question #2936

aThorp96 · 2023-07-20T15:19:08Z

TL;DR: is there a pragmatic way to select some elements from a Document, keep the Elements in memory and drop the larger Document to free its memory?

Problem:

I have a program which parses HTML in a somewhat memory-constrained environment. One issue I've run into is that if the HTML is sufficiently large, the parsed Nokogiri Document can have a large memory footprint. Typically I only need to query a small number of elements out of the Document, but depending on the use case I may not be able to quickly extract the values I need from those elements and return them, which would let the document drop out of scope and free its memory.

Attempted Solution:

My initial attempt was to add a helper function which parses the HTML, queries the document with a given query, and returns the resulting elements. In my mind this meant that only the elements remain in scope/memory while the document itself is freed. However, I realized that Element#parent implies that the rest of the document may remain in memory, even if I only hold a direct reference to the elements I care about.

require "nokogiri"

# Parse the HTML and return only the nodes matching the CSS query.
def query_html(html, query)
  doc = Nokogiri::HTML(html)
  doc.css(query)
end

Question:

My question then is: am I correct that this wrapper doesn't have the memory benefits I initially expected? And if it does not, is there a way to accomplish what I'm attempting?

Answered by stevecheckoway

stevecheckoway · 2023-07-20T17:43:50Z

Maintainer

You're correct that this should keep the whole document in memory.

You could try creating a new DocumentFragment and inserting the results of the query into it. I think that should move all of the matched nodes out of the original Document and into the fragment. Then you should be able to drop the original document.

I'm not at a computer to try right now, however.

1 reply

flavorjones Jul 22, 2023
Maintainer

Yes, this is the approach I would recommend. Copying nodes to a new document or document fragment will allow the original document to be garbage collected.

flavorjones · 2023-07-22T15:18:48Z

Maintainer

An alternative might be to parse the document as HTML4 using one of libxml2's streaming APIs: the SAX parser or the Reader, for example. This will increase code complexity significantly, but may be worth it for the memory tradeoff if you're in a highly constrained environment.

For a sense of how complex the code ends up being, and for hints on how to structure your code, check out https://github.com/flavorjones/fairy-wing-throwdown/blob/master/lib/flavorjones.rb
