Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add binrw support for pdb files #64

Merged
merged 19 commits into from
Apr 14, 2022
Merged

Add binrw support for pdb files #64

merged 19 commits into from
Apr 14, 2022

Conversation

Holzhaus
Copy link
Owner

@Holzhaus Holzhaus commented Apr 3, 2022

Fixes #45. Based on #66 and #67.

@Holzhaus Holzhaus force-pushed the pdb-binrw branch 4 times, most recently from b628ef4 to 5800304 Compare April 8, 2022 08:26
@Holzhaus Holzhaus marked this pull request as ready for review April 11, 2022 18:55
@Holzhaus
Copy link
Owner Author

Holzhaus commented Apr 11, 2022

I think this is ready to review now. @Swiftb0y Do you want to take a look?

Copy link
Contributor

@Swiftb0y Swiftb0y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pinging me that this is ready.
General issue still: We use the binrw attribute everywhere, but writing obviously produces garbage databases (even though thats not obvious from the code). So I think we should either make that clear at compile time (by using #[binread] on the read-only datastructures) or if that requires too many changes, document that writing won't produce valid databases!

src/util.rs Outdated Show resolved Hide resolved
src/pdb/mod.rs Outdated Show resolved Hide resolved
src/pdb/mod.rs Show resolved Hide resolved
src/pdb/mod.rs Outdated Show resolved Hide resolved
src/pdb/mod.rs Show resolved Hide resolved
Comment on lines +606 to 607
#[br(offset = base_offset, parse_with = FilePtr16::parse)]
isrc: DeviceSQLString,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you opt for the parse_with approach instead?

Suggested change
#[br(offset = base_offset, parse_with = FilePtr16::parse)]
isrc: DeviceSQLString,
#[br(offset = base_offset)]
isrc: FilePtr16<DeviceSQLString>,

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the actual offset is an implementation detail IMHO. When we add support for writing, I don't think we want to save the offsets in the struct, otherwise all later offsets of a row become wrong when an earlier string changes length.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mhmm, that makes sense, but since FilePtr does have BinWrite yet, it doesn't make much sense to discuss the ergonomics of a foreign, still unimplemented API.

src/pdb/mod.rs Outdated
Comment on lines 297 to 311
// Calculate number of rows in last row group
let mut num_rows_in_last_row_group = num_rows % RowGroup::MAX_ROW_COUNT;
if num_rows_in_last_row_group == 0 {
num_rows_in_last_row_group = RowGroup::MAX_ROW_COUNT;
}

// Read last row group
let row_group = RowGroup::read_options(reader, ro, (num_rows_in_last_row_group,))?;
row_groups.push(row_group);

// Read remaining row groups
for _ in 1..num_row_groups {
let row_group = RowGroup::read_options(reader, ro, (RowGroup::MAX_ROW_COUNT,))?;
row_groups.insert(0, row_group);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike the fact that this is kinda brittle in regards to ordering and so forth. Also the repeated insertion at the front is not ideal but I don't see a good alternative either.

Copy link
Owner Author

@Holzhaus Holzhaus Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I contemplated whether I should use a VecDeque instead, but then decided against it. Didn't want to spend too much time on premature optimization, we can always fine-tune the code later.

Comment on lines +36 to +48
let abs_offset: u64 = page_offset
+ u64::try_from(Page::HEADER_SIZE).unwrap()
+ u64::from(row_offset);
reader
.seek(SeekFrom::Start(abs_offset))
.expect("failed to seek to row offset");
let row = Row::read_options(
&mut reader,
&ReadOptions::default(),
(page.page_type.clone(),),
)
.expect("failed to parse row");
println!(" {:?}", row);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having to manually mess with the offset is highly unsafe and not something I would make part of the API in any shape. First of all because a lot could go wrong when the consumer is manually in charge of juggling all these types, and the resulting boilerplate required to make this work is also less than ideal. Basically, its easy to misuse and hard to use correctly. https://www.oreilly.com/library/view/97-things-every/9780596809515/ch55.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, making binrw part of the public API is not good either.

Copy link
Owner Author

@Holzhaus Holzhaus Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having to manually mess with the offset is highly unsafe and not something I would make part of the API in any shape. First of all because a lot could go wrong when the consumer is manually in charge of juggling all these types, and the resulting boilerplate required to make this work is also less than ideal. Basically, its easy to misuse and hard to use correctly. https://www.oreilly.com/library/view/97-things-every/9780596809515/ch55.html

I agree and already have some code that gets rid of this. The downside is that it just parse all rows in a page at once, not lazily. Didn't want to put it into this PR because it's already a pretty large diff.

Also, making binrw part of the public API is not good either.

There is only alternative I can think of (other than to not use binrw at all) would be to copy all parsed data into another struct that does not implement BinRead/BinWrite. If the trait and the struct is pub, so is the impl. We can document that using binrw is an implementation detail and depending on it is not recommended though.

Anyway, I wouldn't worry too much about the public API yet. We are still in the experimentation phase, and hiding binrw stuff just makes the code complicated without much gain. Later on, we need to think how to make serialization possible without in-depth knowledge of the format without creating struct instances by hand anyway. Whoever uses this library probably doesn't care about the various string types, page size and how many rows there are in a row group.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't get me wrong, I agree that completely eliminating binrw from our public API is not worth the effort, but the amount of boilerplate and types needed for the basic act of just reading out all rows is not good. If we only require the API consumer to make use of BinRead/BinWrite, thats fine IMO.

I agree and already have some code that gets rid of this. The downside is that it just parse all rows in a page at once, not lazily. Didn't want to put it into this PR because it's already a pretty large diff.

Ok, then I won't dwell on this issue in this PR any longer.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the commit btw: 88c87c8

This allows to make the `u32` member private and treat it as an
implementation detail.
The `BinWrite` trait for `Page`/`Row` does not make sense because it
looks like it's possible to serialize them, but the output is actually
invalid because proper serialization is not implemented yet.
@Holzhaus
Copy link
Owner Author

Thanks for reviewing!

General issue still: We use the binrw attribute everywhere, but writing obviously produces garbage databases (even though thats not obvious from the code). So I think we should either make that clear at compile time (by using #[binread] on the read-only datastructures) or if that requires too many changes, document that writing won't produce valid databases!

Agreed. Fixed in 076efb0.

@Holzhaus Holzhaus mentioned this pull request Apr 12, 2022
Comment on lines +377 to +378
/// Apparently this is not always zero, so it might also be something different.
unknown: u16,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we still assert that its zero so we can more easily catch occurrences of this?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that makes tests fail:

thread 'pdb_demo_tracks_PIONEER_rekordbox_export_pdb' panicked at 'called `Result::unwrap()` on an `Err` value:
 ╺━━━━━━━━━━━━━━━━━━━━┅ Backtrace ┅━━━━━━━━━━━━━━━━━━━━╸

 0: Error: unknown == 0 at 0x2fee
           While parsing field 'row_groups' in Page
     at src/pdb/mod.rs:282

 ╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸

', /home/jan/Projects/rekordcrate/target/debug/build/rekordcrate-2debded95e6c6d26/out/tests_pdb.rs:23:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mhmm, I thought this might not be that common. Then I guess we leave the assert out and try to decipher the meaning of these fields some other time.

src/pdb/mod.rs Show resolved Hide resolved
src/pdb/mod.rs Show resolved Hide resolved
@Holzhaus Holzhaus merged commit 6375d10 into main Apr 14, 2022
@Holzhaus Holzhaus deleted the pdb-binrw branch October 6, 2022 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Port pdb.rs from nom to binrw
2 participants