Fix bug in recovering invalid lines in JSONL inputs #17098
Conversation
cpp/src/io/json/read_json.cu
if (last_char != '\n') {
  last_char = '\n';
This is hardcoded to `\n`; should it be the delimiter specified in the JSON reader options instead?
Question: does this force the reader to do an extra copy?
Thank you, I've changed the hardcoded `\n` to the reader options delimiter, and fixed it in a few other places as well.
Yes, we now have two extra copies between host and device, each of size 1 byte, and we also perform a stream sync between the copies. I'll run the JSON benchmarks to see what the impact of this change is.
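For illustration, here is a minimal host-side sketch of the idea (not the actual `read_json.cu` code): the trailing-character check uses the delimiter configured in the reader options rather than a hard-coded `\n`. The `ensure_trailing_delimiter` helper and the plain `std::string` buffer are hypothetical stand-ins for the real device-side logic.

```cpp
#include <string>

// Hypothetical host-side stand-in for the fix being discussed: terminate the
// buffer with the delimiter taken from the reader options instead of a
// hard-coded '\n'.
void ensure_trailing_delimiter(std::string& buffer, char delimiter)
{
  if (buffer.empty() || buffer.back() != delimiter) { buffer.push_back(delimiter); }
}
```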
Once we merge #17028, we need to check the last character in the buffer only when both `nullify_empty_rows` and `recover_with_null` are enabled. Otherwise, I think we can always add a delimiter, since empty rows are ignored anyway.
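A rough sketch of the control flow proposed above, using hypothetical option flags that mirror the names mentioned in the comment (the real cudf logic may differ):

```cpp
#include <string>

// Hypothetical illustration: only inspect the last character in the buffer
// when both options are enabled; otherwise append the delimiter
// unconditionally, since empty rows are dropped in that case anyway.
void terminate_last_record(std::string& buffer,
                           char delimiter,
                           bool nullify_empty_rows,
                           bool recover_with_null)
{
  if (nullify_empty_rows && recover_with_null) {
    // An extra delimiter here would create an empty row that gets nullified,
    // so append only if the buffer is not already terminated.
    if (buffer.empty() || buffer.back() != delimiter) { buffer.push_back(delimiter); }
  } else {
    // Empty rows are ignored, so an extra trailing delimiter is harmless.
    buffer.push_back(delimiter);
  }
}
```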
/ok to test
auto const shift_for_nonzero_offset = std::min<std::int64_t>(chunk_offset, 1);
auto const first_delim_pos =
  chunk_offset == 0 ? 0 : find_first_delimiter(readbufspan, '\n', stream);
Does this mean it was a long-standing bug until now, given that we have supported customized delimiters for a long time?
Yes, this has been a bug until now. I suspect that when we enable `recover_with_null`, the FST that removes excess characters after the delimiter in each line masks the error in partial lines read due to the hard-coded `\n` delimiter, so we never encounter an error. But I think this bug would have caused lines in the input that span byte ranges to be skipped.
Also, if the size of the input file is less than 2GB and we always read the whole file, i.e. not in byte ranges, then again we would not encounter this bug.
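To make the byte-range issue concrete, here is a small standalone illustration (not cudf code; `find_first_delimiter` here is a host-side stand-in for the device function of the same name): when records are separated by a custom delimiter such as `|`, searching for a hard-coded `\n` fails to locate the start of the next record in a chunk.

```cpp
#include <cstdint>
#include <iostream>
#include <string_view>

// Host-side stand-in: return the offset of the first delimiter in a chunk,
// or -1 if the chunk does not contain it.
std::int64_t find_first_delimiter(std::string_view chunk, char delimiter)
{
  auto const pos = chunk.find(delimiter);
  return pos == std::string_view::npos ? -1 : static_cast<std::int64_t>(pos);
}

int main()
{
  // A byte-range chunk that starts in the middle of a record; records are
  // separated by the custom delimiter '|'.
  std::string_view chunk = R"(:1}|{"a":2}|{"a":3}|)";

  std::cout << find_first_delimiter(chunk, '|') << '\n';   // 3: true start of the next record
  std::cout << find_first_delimiter(chunk, '\n') << '\n';  // -1: hard-coded '\n' finds nothing
}
```

With `\n` hard-coded, the delimiter search cannot locate where the next complete record begins, so the reader cannot correctly skip the partial leading record of the chunk, which is consistent with the skipped-lines behavior described above.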
…n/cudf into enh-json_nullify_empty_lines
Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>
/ok to test
…n/cudf into enh-json_nullify_empty_lines
…into json-quote-char-parsing-fix
/ok to test
…into json-quote-char-parsing-fix
/ok to test
/merge
Description
Addresses #16999
Checklist