Add binrw support for pdb files #64

Holzhaus · 2022-04-03T22:19:15Z

Fixes #45. Based on #66 and #67.

This is the first step to solve issue #45.

The binrw-based parser is now feature complete and can replace the nom-based parser. Fixes #45.

We completely transitioned to binrw! 🎉

Holzhaus · 2022-04-11T19:08:25Z

I think this is ready to review now. @Swiftb0y Do you want to take a look?

Swiftb0y

Thanks for pinging me that this is ready.
General issue still: We use the binrw attribute everywhere, but writing obviously produces garbage databases (even though that's not obvious from the code). So I think we should either make that clear at compile time (by using #[binread] on the read-only data structures) or, if that requires too many changes, document that writing won't produce valid databases!

Swiftb0y · 2022-04-11T21:34:28Z

src/pdb/mod.rs

+        #[br(offset = base_offset, parse_with = FilePtr16::parse)]
        isrc: DeviceSQLString,


Why did you opt for the parse_with approach instead?

Suggested change

-        #[br(offset = base_offset, parse_with = FilePtr16::parse)]
-        isrc: DeviceSQLString,
+        #[br(offset = base_offset)]
+        isrc: FilePtr16<DeviceSQLString>,

Because the actual offset is an implementation detail IMHO. When we add support for writing, I don't think we want to save the offsets in the struct, otherwise all later offsets of a row become wrong when an earlier string changes length.

Mhmm, that makes sense, but since FilePtr does not implement BinWrite yet, it doesn't make much sense to discuss the ergonomics of a foreign, still unimplemented API.
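
For readers unfamiliar with the pattern under discussion: following a 16-bit file pointer means reading an offset, jumping to `base_offset + offset`, parsing the value stored there, and then returning to continue with the next field. The std-only sketch below illustrates that idea while handing back only the parsed value, so the offset never ends up in the struct. It does not use binrw. The function name `read_string_via_offset` and the one-byte length prefix are assumptions made for illustration, not the real `DeviceSQLString` encoding.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Reads a little-endian u16 offset at the current position, parses a
/// length-prefixed string at `base_offset + offset`, and then restores the
/// reader to the byte right after the offset field.
///
/// Hypothetical helper: the string layout (1-byte length, then UTF-8 bytes)
/// is a placeholder, not the actual DeviceSQL string format.
fn read_string_via_offset<R: Read + Seek>(reader: &mut R, base_offset: u64) -> io::Result<String> {
    // Read the 16-bit pointer itself.
    let mut buf = [0u8; 2];
    reader.read_exact(&mut buf)?;
    let offset = u64::from(u16::from_le_bytes(buf));
    let resume_at = reader.stream_position()?;

    // Follow the pointer and parse the value it points at.
    reader.seek(SeekFrom::Start(base_offset + offset))?;
    let mut len = [0u8; 1];
    reader.read_exact(&mut len)?;
    let mut bytes = vec![0u8; usize::from(len[0])];
    reader.read_exact(&mut bytes)?;

    // Go back so that the next field can be read sequentially.
    reader.seek(SeekFrom::Start(resume_at))?;
    String::from_utf8(bytes).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Because only the `String` is returned, a later writer would be free to recompute every offset from the actual string lengths.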

Swiftb0y · 2022-04-11T21:41:24Z

src/pdb/mod.rs

+        // Calculate number of rows in last row group
+        let mut num_rows_in_last_row_group = num_rows % RowGroup::MAX_ROW_COUNT;
+        if num_rows_in_last_row_group == 0 {
+            num_rows_in_last_row_group = RowGroup::MAX_ROW_COUNT;
+        }
+
+        // Read last row group
+        let row_group = RowGroup::read_options(reader, ro, (num_rows_in_last_row_group,))?;
+        row_groups.push(row_group);
+
+        // Read remaining row groups
+        for _ in 1..num_row_groups {
+            let row_group = RowGroup::read_options(reader, ro, (RowGroup::MAX_ROW_COUNT,))?;
+            row_groups.insert(0, row_group);
+        }


I dislike the fact that this is kinda brittle with regard to ordering and so forth. Also, the repeated insertion at the front is not ideal, but I don't see a good alternative either.

I contemplated whether I should use a VecDeque instead, but then decided against it. Didn't want to spend too much time on premature optimization, we can always fine-tune the code later.
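
As a rough illustration of the `VecDeque` alternative mentioned here, the following sketch keeps the same read order (the partially filled last group first, then the full groups) but uses `push_front`, which avoids shifting every element on each insertion. It is not the code from this PR: `read_row_groups` and the `read_group` closure are hypothetical stand-ins for `RowGroup::read_options`, and the value 16 for `MAX_ROW_COUNT` is an assumption.

```rust
use std::collections::VecDeque;

/// Assumed stand-in for `RowGroup::MAX_ROW_COUNT`.
const MAX_ROW_COUNT: u16 = 16;

/// Reads all row groups in on-disk order and returns them in logical order.
/// `read_group` receives the number of rows in the group to be read next.
fn read_row_groups<T>(num_rows: u16, mut read_group: impl FnMut(u16) -> T) -> Vec<T> {
    // Assumption of this sketch: a page without rows has no row groups.
    if num_rows == 0 {
        return Vec::new();
    }

    // Ceiling division without risking overflow for large `num_rows`.
    let num_row_groups = (num_rows - 1) / MAX_ROW_COUNT + 1;

    // The last group holds the remainder, or a full group if it divides evenly.
    let mut rows_in_last_group = num_rows % MAX_ROW_COUNT;
    if rows_in_last_group == 0 {
        rows_in_last_group = MAX_ROW_COUNT;
    }

    let mut groups = VecDeque::with_capacity(usize::from(num_row_groups));
    // The last (possibly partial) group comes first in the stream.
    groups.push_front(read_group(rows_in_last_group));
    // All remaining groups are full; each one goes in front of the previous.
    for _ in 1..num_row_groups {
        groups.push_front(read_group(MAX_ROW_COUNT));
    }
    Vec::from(groups)
}
```

Pushing onto a plain `Vec` and calling `reverse()` once at the end would work equally well; either way the ordering logic lives in one place.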

Swiftb0y · 2022-04-11T21:52:47Z

src/bin/rekordcrate-pdb.rs

+                    let abs_offset: u64 = page_offset
+                        + u64::try_from(Page::HEADER_SIZE).unwrap()
+                        + u64::from(row_offset);
+                    reader
+                        .seek(SeekFrom::Start(abs_offset))
+                        .expect("failed to seek to row offset");
+                    let row = Row::read_options(
+                        &mut reader,
+                        &ReadOptions::default(),
+                        (page.page_type.clone(),),
+                    )
+                    .expect("failed to parse row");
+                    println!("      {:?}", row);


Having to manually mess with the offset is highly unsafe and not something I would make part of the API in any shape. First of all, a lot could go wrong when the consumer is manually in charge of juggling all these types, and the resulting boilerplate required to make this work is also less than ideal. Basically, it's easy to misuse and hard to use correctly. https://www.oreilly.com/library/view/97-things-every/9780596809515/ch55.html

Also, making binrw part of the public API is not good either.

> Having to manually mess with the offset is highly unsafe and not something I would make part of the API in any shape. First of all, a lot could go wrong when the consumer is manually in charge of juggling all these types, and the resulting boilerplate required to make this work is also less than ideal. Basically, it's easy to misuse and hard to use correctly. https://www.oreilly.com/library/view/97-things-every/9780596809515/ch55.html

I agree and already have some code that gets rid of this. The downside is that it just parses all rows in a page at once, not lazily. Didn't want to put it into this PR because it's already a pretty large diff.

> Also, making binrw part of the public API is not good either.

The only alternative I can think of (other than not using binrw at all) would be to copy all parsed data into another struct that does not implement BinRead/BinWrite. If the trait and the struct are pub, so is the impl. We can document that using binrw is an implementation detail and depending on it is not recommended though.

Anyway, I wouldn't worry too much about the public API yet. We are still in the experimentation phase, and hiding binrw stuff just makes the code complicated without much gain. Later on, we need to think about how to make serialization possible without in-depth knowledge of the format and without creating struct instances by hand anyway. Whoever uses this library probably doesn't care about the various string types, page size and how many rows there are in a row group.

Don't get me wrong, I agree that completely eliminating binrw from our public API is not worth the effort, but the amount of boilerplate and types needed for the basic act of just reading out all rows is not good. If we only require the API consumer to make use of BinRead/BinWrite, that's fine IMO.

> I agree and already have some code that gets rid of this. The downside is that it just parses all rows in a page at once, not lazily. Didn't want to put it into this PR because it's already a pretty large diff.

Ok, then I won't dwell on this issue in this PR any longer.

Here's the commit btw: 88c87c8
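
To make the concern about manual offset juggling concrete, here is a minimal std-only sketch of how the arithmetic and the seek could be hidden behind a single helper. It is not the code from the commit mentioned above: `row_position`, `read_row_at`, and the `parse` callback are hypothetical names, and the `HEADER_SIZE` value is a placeholder standing in for `Page::HEADER_SIZE`.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Placeholder standing in for `Page::HEADER_SIZE` (value assumed for this sketch).
const HEADER_SIZE: u64 = 0x28;

/// Computes the absolute position of a row, returning `None` on overflow
/// instead of panicking or wrapping around.
fn row_position(page_offset: u64, row_offset: u16) -> Option<u64> {
    page_offset
        .checked_add(HEADER_SIZE)?
        .checked_add(u64::from(row_offset))
}

/// Seeks to the row and hands the positioned reader to `parse`, so callers
/// never have to compute or seek to raw offsets themselves.
fn read_row_at<R, T>(
    reader: &mut R,
    page_offset: u64,
    row_offset: u16,
    parse: impl FnOnce(&mut R) -> io::Result<T>,
) -> io::Result<T>
where
    R: Read + Seek,
{
    let pos = row_position(page_offset, row_offset)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "row offset overflows u64"))?;
    reader.seek(SeekFrom::Start(pos))?;
    parse(reader)
}
```

With a helper like this, the caller in `rekordcrate-pdb` would only pass the page offset, the row offset, and a parsing callback, and errors would propagate as `io::Result` rather than through `unwrap`/`expect`.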

Holzhaus · 2022-04-12T07:19:45Z

Thanks for reviewing!

> General issue still: We use the binrw attribute everywhere, but writing obviously produces garbage databases (even though that's not obvious from the code). So I think we should either make that clear at compile time (by using #[binread] on the read-only data structures) or, if that requires too many changes, document that writing won't produce valid databases!

Agreed. Fixed in 076efb0.

Swiftb0y · 2022-04-12T11:20:09Z

src/pdb/mod.rs

+    /// Apparently this is not always zero, so it might also be something different.
+    unknown: u16,


Can we still assert that it's zero so we can more easily catch occurrences of this?

No, that makes tests fail:

thread 'pdb_demo_tracks_PIONEER_rekordbox_export_pdb' panicked at 'called `Result::unwrap()` on an `Err` value:
╺━━━━━━━━━━━━━━━━━━━━┅ Backtrace ┅━━━━━━━━━━━━━━━━━━━━╸
 0: Error: unknown == 0 at 0x2fee
    While parsing field 'row_groups' in Page
    at src/pdb/mod.rs:282
╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸
', /home/jan/Projects/rekordcrate/target/debug/build/rekordcrate-2debded95e6c6d26/out/tests_pdb.rs:23:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Mhmm, I thought this might not be that common. Then I guess we leave the assert out and try to decipher the meaning of these fields some other time.

Holzhaus force-pushed the pdb-binrw branch 4 times, most recently from b628ef4 to 5800304 Compare April 8, 2022 08:26

This was referenced Apr 8, 2022

Serialization Support for PDB files #68

Open

DeviceSQLString Unification #67

Merged

Holzhaus added 9 commits April 11, 2022 20:49

feat(pdb): Add preliminary support for parsing *.pdb files with binrw

589602d

This is the first step to solve issue #45.

refactor(pdb): Add support for parsing Row enum using binrw

3877b4f

refactor(pdb): Add support for parsing Page struct using binrw

1d7d32b

refactor(pdb): Add support for parsing RowGroup struct with binrw

87f1d5a

refactor(pdb): Replace nom with binrw in most of rekordcrate-pdb

93e1302

refactor(pdb): Add support for parsing page indices with binrw

2996a3b

test: Update PDB tests to use binrw implementation instead of nom

dbdcd38

refactor(pdb): Remove nom parser implementation in favor of binrw

7061e7b

The binrw-based parser is now feature complete and can replace the nom-based parser. Fixes #45.

build: Remove nom dependency from package

5e7cb91

We completely transitioned to binrw! 🎉

Holzhaus force-pushed the pdb-binrw branch from 5800304 to 5e7cb91 Compare April 11, 2022 18:55

Holzhaus marked this pull request as ready for review April 11, 2022 18:55

Swiftb0y suggested changes Apr 11, 2022

Holzhaus added 5 commits April 12, 2022 08:04

refactor(pdb): Add PageIndex::offset() method

3f2a6d1

This allows to make the `u32` member private and treat it as an implementation detail.

refactor(pdb): Mark Page struct and Row enum as not serializable

076efb0

The `BinWrite` trait for `Page`/`Row` does not make sense because it looks like it's possible to serialize them, but the output is actually invalid because proper serialization is not implemented yet.

refactor(util): Remove remaining ColorIndex code that uses u16

96eaee9

This slipped through in f052a09 (#66).

chore(pdb): Remove possibly panicking expect from current_offset

0092c61

chore(pdb): Remove possibly panicking expect from Header::read_pages

8a739ec

Holzhaus force-pushed the pdb-binrw branch from c4f79bb to 8a739ec Compare April 12, 2022 06:40

docs(pdb): Add TODO comment regarding usage of u16::div_ceil

359db9d

Holzhaus added 3 commits April 12, 2022 12:08

fix(pdb): Fix offset read logic for name in Artist rows

ebdb726

chore(pdb): Remove bw attribute for binread-only enum struct

58f916b

chore(pdb): Don't mark the row group's padding field as temp

a14dc76

docs(pdb): Add documentation comments for temporary fields, too

480bf8c

Holzhaus mentioned this pull request Apr 12, 2022

Pdb Improvements #69

Merged

Swiftb0y suggested changes Apr 12, 2022

Swiftb0y approved these changes Apr 14, 2022

Holzhaus merged commit 6375d10 into main Apr 14, 2022

Holzhaus deleted the pdb-binrw branch October 6, 2022 16:27
