Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Line endings for read with limit are missing #4

Open
kylewelsby opened this issue Apr 18, 2024 · 2 comments
Open

🐛 Line endings for read with limit are missing #4

kylewelsby opened this issue Apr 18, 2024 · 2 comments

Comments

@kylewelsby
Copy link

kylewelsby commented Apr 18, 2024

When decrypting streaming and splitting files per line appears to loose \r\n or \n line endings when using Reader#read(1024).

Using without the limit doesn't lose the line endings.

Affected code

Bzip2::FFI::Reader.open(io_or_path) do |reader|
  while buffer = reader.read(1024) do
    line = buffer.split("\n")
    # process uncompressed line from the buffer
  end
end

Work around

Bzip2::FFI::Reader.open(io_or_path) do |reader|
  while buffer = reader.read do
    line = buffer.split("\n")
    # process uncompressed line from the buffer
  end
end

The issue might be somewhere around the code paths for default 4kb vs limit branch of read.
https://github.com/philr/bzip2-ffi/blob/7202391117e6709bc67404c1f55d4a3f5d3bc791/lib/bzip2/ffi/reader.rb#L311C1-L321C57

@philr
Copy link
Owner

philr commented Apr 18, 2024

I'm not sure exactly what you mean by losing the line endings? Perhaps an example file with the output you're seeing would help.

Calling #read without a length will decompress the entire stream in one go. The loop is not necessary in the second 'Work around' example.

Note that your two examples are not doing the same thing. In the first 'Affected code' example, if a line spans two 1,024 byte buffers, it will be split into two lines. I see identical output from both approaches up to the line spans the first buffer:

Bzip2::FFI::Reader.open('test/fixtures/lorem-4096-bytes-compressed.txt.bz2') do |reader|
  buffer = reader.read(1024)
  lines = buffer.split("\n")
  lines.map { |l| l.force_encoding('UTF-8') }
end
=>
["Lorém ipsúm dòlòr sìt amét, vix cu alìa póstulant, pri ea odio falli ",
 "mnesárchum. Regiónè eripuit maluisset has ut, at luptatum accusamus qúi, né ",
 "hinc salutàtus éum. Eos no solum còtìdiéque. Àt atqùì latìnè ",
 "tincidùnt pri, pro àn nihil légère. Errem adipiscíng èx èòs, íllud ",
 "saperet nusqúam mea ét. ",
 "",
 "Diċo graecis ei meǣ, fugit liber tincidūnÞ est ex, ðuo āltēra volūmus ",
 "dēfiniÞionēs ut! Sed ut hærum moveÞ assueverīt, no modo ƿonumes deleniÞi ",
 "pri. Seǣ ex sāle noster complectitur, ei qui porro aliqūam taċīmætes. ",
 "Cæse falli denique āt sed? MutæÞ mǽzim tāntæs ēos eu, cum cu ælteræ ",
 "ċeteros expētēƿdæ! ",
 "",
 "Ašsum labores érroribus quo ňe. Án ůnům pařtem při, súmmo ščripta ",
 "singůlís sed éi, eů nůllam aliquip nec. Faceťě ádmodum scřipserit íd ",
 "ňéc, ei pri conguě nůllam ačcůsam. Ěst ťe áliqúam atomórúm, áň ",
 "natum movet détraxiť mel! Eos úť erřor rěcťeque, át usu legére ",
 "ádversáríum ďefinitiónes, ad f"]
Bzip2::FFI::Reader.open('test/fixtures/lorem-4096-bytes-compressed.txt.bz2') do |reader|
  buffer = reader.read
  lines = buffer.split("\n")
  lines.map { |l| l.force_encoding('UTF-8') }
end
=>
["Lorém ipsúm dòlòr sìt amét, vix cu alìa póstulant, pri ea odio falli ",
 "mnesárchum. Regiónè eripuit maluisset has ut, at luptatum accusamus qúi, né ",
 "hinc salutàtus éum. Eos no solum còtìdiéque. Àt atqùì latìnè ",
 "tincidùnt pri, pro àn nihil légère. Errem adipiscíng èx èòs, íllud ",
 "saperet nusqúam mea ét. ",
 "",
 "Diċo graecis ei meǣ, fugit liber tincidūnÞ est ex, ðuo āltēra volūmus ",
 "dēfiniÞionēs ut! Sed ut hærum moveÞ assueverīt, no modo ƿonumes deleniÞi ",
 "pri. Seǣ ex sāle noster complectitur, ei qui porro aliqūam taċīmætes. ",
 "Cæse falli denique āt sed? MutæÞ mǽzim tāntæs ēos eu, cum cu ælteræ ",
 "ċeteros expētēƿdæ! ",
 "",
 "Ašsum labores érroribus quo ňe. Án ůnům pařtem při, súmmo ščripta ",
 "singůlís sed éi, eů nůllam aliquip nec. Faceťě ádmodum scřipserit íd ",
 "ňéc, ei pri conguě nůllam ačcůsam. Ěst ťe áliqúam atomórúm, áň ",
 "natum movet détraxiť mel! Eos úť erřor rěcťeque, át usu legére ",
 "ádversáríum ďefinitiónes, ad fačer mažim němóřé sea. ",
...
]

The force_encoding call is just there to make things easier to see what's going on (#read returns a binary string). I've truncated the output from the second example.

If I modify the first example to take account of the line splitting, I get the same result from both approaches:

all_lines = Bzip2::FFI::Reader.open('test/fixtures/lorem-4096-bytes-compressed.txt.bz2') do |reader|
  [].tap do |lines|
    last = nil
    while buffer = reader.read(1024) do
      split = buffer.split("\n")
      split[0] = last + split[0] if last
      lines.push(*split[0..-2])
      last = split[-1]
    end
    lines << last if last
  end
end

all_lines2 = Bzip2::FFI::Reader.open('test/fixtures/lorem-4096-bytes-compressed.txt.bz2') do |reader|
  buffer = reader.read
  buffer.split("\n")
end

all_lines == all_lines2
=> true

@kylewelsby
Copy link
Author

I found a specific example, but I haven't been able to replicate in isolation. The data we're using contains PII, so I can't share the exact examples.

Changing from the limited read to the default 4096 read, resolved the issue. Even using the limited with 4096 defined triggered the same error.

Once I've found an isolated way to replicate the issue I'll update the findings here, and even better, apply a patch to the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants