🐛 Line endings for read with limit are missing #4
Comments
I'm not sure exactly what you mean by losing the line endings. Perhaps an example file with the output you're seeing would help.

Note that your two examples are not doing the same thing. In the first 'Affected code' example, if a line spans two 1,024 byte buffers, it will be split into two lines.

I see identical output from both approaches up to the line that spans the end of the first buffer:

Bzip2::FFI::Reader.open('test/fixtures/lorem-4096-bytes-compressed.txt.bz2') do |reader|
  buffer = reader.read(1024)
  lines = buffer.split("\n")
  lines.map { |l| l.force_encoding('UTF-8') }
end
=>
["Lorém ipsúm dòlòr sìt amét, vix cu alìa póstulant, pri ea odio falli ",
"mnesárchum. Regiónè eripuit maluisset has ut, at luptatum accusamus qúi, né ",
"hinc salutàtus éum. Eos no solum còtìdiéque. Àt atqùì latìnè ",
"tincidùnt pri, pro àn nihil légère. Errem adipiscíng èx èòs, íllud ",
"saperet nusqúam mea ét. ",
"",
"Diċo graecis ei meǣ, fugit liber tincidūnÞ est ex, ðuo āltēra volūmus ",
"dēfiniÞionēs ut! Sed ut hærum moveÞ assueverīt, no modo ƿonumes deleniÞi ",
"pri. Seǣ ex sāle noster complectitur, ei qui porro aliqūam taċīmætes. ",
"Cæse falli denique āt sed? MutæÞ mǽzim tāntæs ēos eu, cum cu ælteræ ",
"ċeteros expētēƿdæ! ",
"",
"Ašsum labores érroribus quo ňe. Án ůnům pařtem při, súmmo ščripta ",
"singůlís sed éi, eů nůllam aliquip nec. Faceťě ádmodum scřipserit íd ",
"ňéc, ei pri conguě nůllam ačcůsam. Ěst ťe áliqúam atomórúm, áň ",
"natum movet détraxiť mel! Eos úť erřor rěcťeque, át usu legére ",
"ádversáríum ďefinitiónes, ad f"]

Bzip2::FFI::Reader.open('test/fixtures/lorem-4096-bytes-compressed.txt.bz2') do |reader|
  buffer = reader.read
  lines = buffer.split("\n")
  lines.map { |l| l.force_encoding('UTF-8') }
end
=>
["Lorém ipsúm dòlòr sìt amét, vix cu alìa póstulant, pri ea odio falli ",
"mnesárchum. Regiónè eripuit maluisset has ut, at luptatum accusamus qúi, né ",
"hinc salutàtus éum. Eos no solum còtìdiéque. Àt atqùì latìnè ",
"tincidùnt pri, pro àn nihil légère. Errem adipiscíng èx èòs, íllud ",
"saperet nusqúam mea ét. ",
"",
"Diċo graecis ei meǣ, fugit liber tincidūnÞ est ex, ðuo āltēra volūmus ",
"dēfiniÞionēs ut! Sed ut hærum moveÞ assueverīt, no modo ƿonumes deleniÞi ",
"pri. Seǣ ex sāle noster complectitur, ei qui porro aliqūam taċīmætes. ",
"Cæse falli denique āt sed? MutæÞ mǽzim tāntæs ēos eu, cum cu ælteræ ",
"ċeteros expētēƿdæ! ",
"",
"Ašsum labores érroribus quo ňe. Án ůnům pařtem při, súmmo ščripta ",
"singůlís sed éi, eů nůllam aliquip nec. Faceťě ádmodum scřipserit íd ",
"ňéc, ei pri conguě nůllam ačcůsam. Ěst ťe áliqúam atomórúm, áň ",
"natum movet détraxiť mel! Eos úť erřor rěcťeque, át usu legére ",
"ádversáríum ďefinitiónes, ad fačer mažim němóřé sea. ",
...
]

If I modify the first example to take account of the line splitting, I get the same result from both approaches:

all_lines = Bzip2::FFI::Reader.open('test/fixtures/lorem-4096-bytes-compressed.txt.bz2') do |reader|
  [].tap do |lines|
    last = nil
    while buffer = reader.read(1024) do
      split = buffer.split("\n")
      split[0] = last + split[0] if last
      lines.push(*split[0..-2])
      last = split[-1]
    end
    lines << last if last
  end
end
all_lines2 = Bzip2::FFI::Reader.open('test/fixtures/lorem-4096-bytes-compressed.txt.bz2') do |reader|
  buffer = reader.read
  buffer.split("\n")
end
all_lines == all_lines2
=> true
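
The carry-over idea can also be written so that line terminators are kept and a chunk that ends exactly on a newline is handled. This is only a minimal sketch, not part of bzip2-ffi: it assumes the reader responds to read(length) and returns nil at end of stream, as IO#read does. StringIO stands in for Bzip2::FFI::Reader so the example runs without the gem, and each_line_chunked is a hypothetical helper name.

```ruby
require 'stringio'

# Hypothetical helper: yields complete lines (terminators included) from any
# object whose read(length) returns nil at end of stream.
def each_line_chunked(reader, chunk_size = 1024)
  return enum_for(:each_line_chunked, reader, chunk_size) unless block_given?

  carry = String.new(encoding: Encoding::BINARY)
  while (chunk = reader.read(chunk_size))
    carry << chunk
    # Emit every complete line; keep the unterminated tail for the next chunk.
    while (idx = carry.index("\n"))
      yield carry.slice!(0..idx)
    end
  end
  # Whatever is left is a final line without a trailing newline.
  yield carry unless carry.empty?
end

data = "first line\r\nsecond line\nthird"
io = StringIO.new(data)

# A deliberately tiny chunk size forces every line to cross a chunk boundary.
# The encoding is applied only after a line has been reassembled, so a
# multi-byte character cut in half by a chunk boundary is whole again by then.
lines = each_line_chunked(io, 4).map { |l| l.force_encoding('UTF-8') }
puts lines.inspect
```

Because the terminators stay attached to each line, joining the result reproduces the original data, which makes it straightforward to compare against an unlimited read.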
I found a specific example, but I haven't been able to replicate it in isolation. The data we're using contains PII, so I can't share the exact examples. Changing from the limited read to the default 4096-byte read resolved the issue. Even using the limited read with 4096 specified triggered the same error. Once I've found an isolated way to replicate the issue, I'll update the findings here and, even better, submit a patch to the code.
When decrypting a stream and splitting the file into lines, the \r\n or \n line endings appear to be lost when using Reader#read(1024). Reading without the limit doesn't lose the line endings.
Affected code
Work around
The issue might be somewhere in the code paths for the default 4 KB read versus the length-limited branch of read.
https://github.com/philr/bzip2-ffi/blob/7202391117e6709bc67404c1f55d4a3f5d3bc791/lib/bzip2/ffi/reader.rb#L311C1-L321C57
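
One way to check whether the behaviour comes from splitting each chunk independently, rather than from the reader itself, is to reproduce the pattern on plain in-memory data. The sketch below is an assumption about what the affected code does, since the original snippet is not shown here. StringIO stands in for Bzip2::FFI::Reader, and the sample string and the 8-byte chunk size are made up for illustration.

```ruby
require 'stringio'

data = "alpha\r\nbravo\r\ncharlie\r\n"

# Per-chunk splitting: each 8-byte chunk is split on its own, so a line that
# crosses a chunk boundary comes out as two separate fragments.
io = StringIO.new(data)
per_chunk = []
while (chunk = io.read(8))
  per_chunk.concat(chunk.split("\r\n"))
end

# Whole-buffer splitting: the full content is read first, then split once.
whole = StringIO.new(data).read.split("\r\n")

puts per_chunk.inspect
puts whole.inspect
```

Here the per-chunk result contains fragments such as "b" and "ravo" where the whole-buffer result has "bravo". No bytes are missing, but a line boundary has been introduced where the data has none, which can look like a lost or misplaced line ending downstream.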