Align filter field expression with PICA Path #579
I guess the terms path and filter (matcher) are used in different ways. In pica-rs a path expression always points to subfield values (eg. $ pica filter "044.?" DUMP.dat | pica count
$ pica filter "044[HK]?" DUMP.dat | pica count
$ pica filter "1.../*?" DUMP.dat | pica count
$ pica filter "1.../*.[3-8]?" DUMP.dat | pica count In order to get your first examples to work, just a the
I think the second part is not a problem for pica-rs. First, in pica-rs there is no (NB: There is only one little exception, that an occurrence of |
I re-evaluated the current implementation when writing #588: use of patterns in tags is already supported as specified as PICA Path (by the way I might rename it to "PICA Path Expression" if this helps). The following remains:
I think all missing points are possible to implement and I suggest
starting a "mini-project" to organize the steps needed. If this is okay
for you, I would like to implement the "hard" part myself. But
before this, I would like to write a set of unit tests for the PICA Path
specification in order to identify the missing parts.
One question regarding occurrences as single digits: Should `012A/1`
match `012A/001`?
…On Tue Jan 31, 2023 at 6:41 PM CET, Jakob Voß wrote:
I re-evaluated the current implementation when writing #588: use of patterns in tags is already supported as specified as PICA Path (by the way I might rename it to "PICA Path Expression" if this helps). The following remains:
- Occurrences as single digit (e.g. `045R/1` should match `045R/01` and `045R/0-2` should match `045R`, `045R/01` and `045R/02`)
- Default occurrence `/*` when field tag starts with `2` (e.g. `209C` is equivalent to `209C/*`)
- xtags (see [examples below](https://format.k10plus.de/k10plushelp.pl?cmd=pplist&katalog=Standard#exemplar)) e.g. `209Ax01` is `209A{x=='1'}` is `209A/*{x=='1'}` and `209Ax0-1` is `209A{x in ['0','1']}` is `209A/*{x in ['0','1']}` . The sequence-syntax is not included in current PICA Path specification but used in K10plus.
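The first two points of the list above (single-digit occurrences, ranges, and the `/*` default on level 2) can be sketched as numeric comparison instead of string comparison. This is a minimal illustration, not pica-rs code: the function names `parse_occurrence` and `occurrence_matches` are hypothetical, tags are assumed to be plain (no patterns), and the rule that a path without occurrence on level 0 or 1 only matches fields without occurrence (treated as zero) is an assumption.

```python
import re


def parse_occurrence(spec):
    """Parse the occurrence part of a path: None, '*', '1', '01' or '0-2'."""
    if spec is None or spec == "*":
        return spec
    m = re.fullmatch(r"(\d{1,3})(?:-(\d{1,3}))?", spec)
    if not m:
        raise ValueError(f"invalid occurrence: {spec!r}")
    low = int(m.group(1))
    high = int(m.group(2)) if m.group(2) else low
    return (low, high)


def occurrence_matches(tag, spec, field_occurrence):
    """Check a field occurrence against the occurrence given in a path.

    tag              -- plain field tag of the path, e.g. '045R'
    spec             -- occurrence of the path, e.g. '1', '0-2', '*' or None
    field_occurrence -- occurrence of the field as string, or None if absent
    """
    parsed = parse_occurrence(spec)
    if parsed is None:
        # Default occurrence: '/*' on level 2; otherwise zero (assumption).
        parsed = "*" if tag.startswith("2") else (0, 0)
    if parsed == "*":
        return True
    # Compare as numbers so that '1' and '01' are the same occurrence;
    # a missing occurrence counts as zero.
    value = int(field_occurrence) if field_occurrence else 0
    low, high = parsed
    return low <= value <= high
```

With this sketch, `occurrence_matches("045R", "1", "01")` and `occurrence_matches("045R", "0-2", None)` both hold, while `occurrence_matches("045R", "0-2", "03")` does not. xtags are left out on purpose.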
On level 2 the occurrence can be two or three digits so See https://unapi.k10plus.de/?&format=pp&id=opac-de-627:ppn:129435686 for an example of a record with occurrence on level 2 exceeding 99: it includes the field This should be matched by any of:
but not by By the way the K10plus format documentation uses the xtag-syntax Sorry, I did not invent PICA+ format! |
I took some time to look at the current PICA Path specification and I think this is not the way pica-rs should follow. Apart from the fact that the specification has some errors, it allows too many valid expressions whose semantics are not clear. Just a few examples: Is there a difference between But most importantly, implementing the missing features and extensions would make the parser code more complicated and less maintainable. And even more importantly, I want to provide my colleagues with a clear syntax and clear semantics, oriented towards use-cases and real data. It simply does not make sense to allow numbers with more than three digits in a range expression when occurrences can have only two or three digits. I see two options to proceed: pica-rs uses a term other than "path", or the PICA Path specification gets revised and a new, independent version 2.0 is released. This new version must specify a minimal core set with multiple extensions and a clear description of the syntax and semantics.
Please don't follow xkcd 927. The current specification is based on use-cases working with real PICA+ data for more than a decade. Its complexity and apparent errors arise from the need to support multiple viewpoints and applications. pica-rs is not the first of these applications and it will not be the last, because people continuously create ad-hoc implementations based on what they already know and what is easy for them to implement.
So let's go for option 2. I agree that the current document at https://format.gbv.de/query/picapath is far from clear, finished and easy to understand. First of all, please ignore all extensions. pica-rs will certainly support its own extensions of PICA Path, and these don't need to be named extensions or "path" at all. Just make sure that pica-rs accepts and interprets every core PICA Path Expression, as a full subset, to ensure basic interoperability between applications. The absolute minimum baseline is the subset described here (including this restriction) to reference fields with a common definition in cataloging rules:
With the additional rule that occurrence zero ( This can be simplified and extended, e.g.
With the additional rule that occurrence is set to Finally add minimum support of subfields:
That's all. Everything beyond, e.g. to include the subfield indicator
Good spot. I'm not happy with xtags at all, but that's how PICA is used. My colleagues introduced
This is just a (possibly buggy) picadata extension of PICA Path, please ignore.
This is another PICA Path extension I don't like either, please ignore. Sorry for not being clearer about what actually makes up the core of PICA Path and what is an optional extension to satisfy independent applications. Let's keep it simple but ensure basic interoperability.
As mentioned in the initial issue description, support of xtags could be postponed beyond version 1.0 and discussed in another issue. For release 1.0 of pica-rs the only change needed for compatibility with other PICA tools is the default occurrence of level 2 tags if no occurrence is specified: e.g.
As part of the stabilization I simplified and cleaned up the syntax of pica-rs path expressions (ex. removed lazy- and
How to proceed with this issue?
I'll discuss this issue with our DNB user group (next meeting 28.07.) and let you know what they think and how we proceed.
Apart from the goal of compatibility across institutions and tools for PICA data, the following arguments may help colleagues from DNB to decide:
#858 reintroduces the minimum support of subfields according to the PICA Path specification. |
Working on #458 I realized that the command `filter` does not fully conform to PICA Path yet. What's missing:
- To support patterns in tags, a tag should be allowed to be given as any of `[012.] [0-9.] [0-9.] [A-Z@.]`, for instance:
- Default occurrence for field expressions of level 2 should be `/*` because on level 2 the occurrence has a different role (counter). So these two should be identical:
Note that these are different on purpose:
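The tag pattern `[012.] [0-9.] [0-9.] [A-Z@.]` from the issue description can be sketched as a positional check. This is only an illustration under stated assumptions, not the pica-rs parser: the function name `tag_matches` is hypothetical, `.` is assumed to stand for any character allowed at its position, and bracket classes such as `044[HK]` are not covered.

```python
import re

# A tag pattern: each of the four positions is a literal or the wildcard '.'.
TAG_PATTERN = re.compile(r"[012.][0-9.][0-9.][A-Z@.]")
# A concrete tag as it occurs in a record, without wildcards.
TAG = re.compile(r"[012][0-9][0-9][A-Z@]")


def tag_matches(pattern, tag):
    """Return True if the concrete field tag matches the tag pattern."""
    if not TAG_PATTERN.fullmatch(pattern):
        raise ValueError(f"invalid tag pattern: {pattern!r}")
    if not TAG.fullmatch(tag):
        raise ValueError(f"invalid tag: {tag!r}")
    # After validation the pattern only contains digits, uppercase letters,
    # '@' and '.', so it can be used directly as a regular expression.
    return re.fullmatch(pattern, tag) is not None
```

For example, `tag_matches("044.", "044H")` and `tag_matches("1...", "101@")` hold, `tag_matches("044.", "045R")` does not, and a pattern such as `3...` is rejected with a `ValueError`.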