🐛 Line endings for read with limit are missing #4

kylewelsby · 2024-04-18T15:24:53Z

When decrypting streaming and splitting files per line appears to loose \r\n or \n line endings when using Reader#read(1024).

Using without the limit doesn't lose the line endings.

Affected code

Bzip2::FFI::Reader.open(io_or_path) do |reader|
  while buffer = reader.read(1024) do
    line = buffer.split("\n")
    # process uncompressed line from the buffer
  end
end

Work around

Bzip2::FFI::Reader.open(io_or_path) do |reader|
  while buffer = reader.read do
    line = buffer.split("\n")
    # process uncompressed line from the buffer
  end
end

The issue might be somewhere around the code paths for default 4kb vs limit branch of read.
https://github.com/philr/bzip2-ffi/blob/7202391117e6709bc67404c1f55d4a3f5d3bc791/lib/bzip2/ffi/reader.rb#L311C1-L321C57

The text was updated successfully, but these errors were encountered:

philr · 2024-04-18T20:08:17Z

I'm not sure exactly what you mean by losing the line endings? Perhaps an example file with the output you're seeing would help.

Calling #read without a length will decompress the entire stream in one go. The loop is not necessary in the second 'Work around' example.

Note that your two examples are not doing the same thing. In the first 'Affected code' example, if a line spans two 1,024 byte buffers, it will be split into two lines. I see identical output from both approaches up to the line spans the first buffer:

Bzip2::FFI::Reader.open('test/fixtures/lorem-4096-bytes-compressed.txt.bz2') do |reader|
  buffer = reader.read(1024)
  lines = buffer.split("\n")
  lines.map { |l| l.force_encoding('UTF-8') }
end
=>
["Lorém ipsúm dòlòr sìt amét, vix cu alìa póstulant, pri ea odio falli ",
 "mnesárchum. Regiónè eripuit maluisset has ut, at luptatum accusamus qúi, né ",
 "hinc salutàtus éum. Eos no solum còtìdiéque. Àt atqùì latìnè ",
 "tincidùnt pri, pro àn nihil légère. Errem adipiscíng èx èòs, íllud ",
 "saperet nusqúam mea ét. ",
 "",
 "Diċo graecis ei meǣ, fugit liber tincidūnÞ est ex, ðuo āltēra volūmus ",
 "dēfiniÞionēs ut! Sed ut hærum moveÞ assueverīt, no modo ƿonumes deleniÞi ",
 "pri. Seǣ ex sāle noster complectitur, ei qui porro aliqūam taċīmætes. ",
 "Cæse falli denique āt sed? MutæÞ mǽzim tāntæs ēos eu, cum cu ælteræ ",
 "ċeteros expētēƿdæ! ",
 "",
 "Ašsum labores érroribus quo ňe. Án ůnům pařtem při, súmmo ščripta ",
 "singůlís sed éi, eů nůllam aliquip nec. Faceťě ádmodum scřipserit íd ",
 "ňéc, ei pri conguě nůllam ačcůsam. Ěst ťe áliqúam atomórúm, áň ",
 "natum movet détraxiť mel! Eos úť erřor rěcťeque, át usu legére ",
 "ádversáríum ďefinitiónes, ad f"]

Bzip2::FFI::Reader.open('test/fixtures/lorem-4096-bytes-compressed.txt.bz2') do |reader|
  buffer = reader.read
  lines = buffer.split("\n")
  lines.map { |l| l.force_encoding('UTF-8') }
end
=>
["Lorém ipsúm dòlòr sìt amét, vix cu alìa póstulant, pri ea odio falli ",
 "mnesárchum. Regiónè eripuit maluisset has ut, at luptatum accusamus qúi, né ",
 "hinc salutàtus éum. Eos no solum còtìdiéque. Àt atqùì latìnè ",
 "tincidùnt pri, pro àn nihil légère. Errem adipiscíng èx èòs, íllud ",
 "saperet nusqúam mea ét. ",
 "",
 "Diċo graecis ei meǣ, fugit liber tincidūnÞ est ex, ðuo āltēra volūmus ",
 "dēfiniÞionēs ut! Sed ut hærum moveÞ assueverīt, no modo ƿonumes deleniÞi ",
 "pri. Seǣ ex sāle noster complectitur, ei qui porro aliqūam taċīmætes. ",
 "Cæse falli denique āt sed? MutæÞ mǽzim tāntæs ēos eu, cum cu ælteræ ",
 "ċeteros expētēƿdæ! ",
 "",
 "Ašsum labores érroribus quo ňe. Án ůnům pařtem při, súmmo ščripta ",
 "singůlís sed éi, eů nůllam aliquip nec. Faceťě ádmodum scřipserit íd ",
 "ňéc, ei pri conguě nůllam ačcůsam. Ěst ťe áliqúam atomórúm, áň ",
 "natum movet détraxiť mel! Eos úť erřor rěcťeque, át usu legére ",
 "ádversáríum ďefinitiónes, ad fačer mažim němóřé sea. ",
...
]

The force_encoding call is just there to make things easier to see what's going on (#read returns a binary string). I've truncated the output from the second example.

If I modify the first example to take account of the line splitting, I get the same result from both approaches:

all_lines = Bzip2::FFI::Reader.open('test/fixtures/lorem-4096-bytes-compressed.txt.bz2') do |reader|
  [].tap do |lines|
    last = nil
    while buffer = reader.read(1024) do
      split = buffer.split("\n")
      split[0] = last + split[0] if last
      lines.push(*split[0..-2])
      last = split[-1]
    end
    lines << last if last
  end
end

all_lines2 = Bzip2::FFI::Reader.open('test/fixtures/lorem-4096-bytes-compressed.txt.bz2') do |reader|
  buffer = reader.read
  buffer.split("\n")
end

all_lines == all_lines2
=> true

kylewelsby · 2024-04-19T11:28:59Z

I found a specific example, but I haven't been able to replicate in isolation. The data we're using contains PII, so I can't share the exact examples.

Changing from the limited read to the default 4096 read, resolved the issue. Even using the limited with 4096 defined triggered the same error.

Once I've found an isolated way to replicate the issue I'll update the findings here, and even better, apply a patch to the code.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🐛 Line endings for read with limit are missing #4

🐛 Line endings for read with limit are missing #4

kylewelsby commented Apr 18, 2024 •

edited

Loading

philr commented Apr 18, 2024

kylewelsby commented Apr 19, 2024

🐛 Line endings for read with limit are missing #4

🐛 Line endings for read with limit are missing #4

Comments

kylewelsby commented Apr 18, 2024 • edited Loading

philr commented Apr 18, 2024

kylewelsby commented Apr 19, 2024

kylewelsby commented Apr 18, 2024 •

edited

Loading