Discussion: Should REUSE.toml support more complex globbing? #98

Open
silverhook opened this issue Nov 20, 2024 · 2 comments


@silverhook
Contributor

Intro / example use case

I recently got into the situation where I needed to glob two types of files with the same extension, but treat them differently license-wise.

Let me show you this with an example (only the relevant parts):

# config files are not copyrightable
[[annotations]]
path = ["**/.*config", "**/*.properties"]
SPDX-License-Identifier = "CC0-1.0"

# i18n and l10n
[[annotations]]
path = ["**/bundle*.properties","**/Language.properties" , "**/Language_??.properties"]
SPDX-License-Identifier = "GPL-3.0-only"

When I check which license the reuse tool assigns to a file matching **/Language.properties (using this workaround until fsfe/reuse-tool#1106 gets done):

reuse spdx > reuse.spdx
grep "Language.properties" reuse.spdx --after-context=6 --before-context=1 | grep LicenseInfo

I (rightly) expect it to say GPL-3.0-only, which is also what happens:

LicenseInfoInFile: GPL-3.0-only

But when I check which license the reuse tool assigns to a file matching **/Language_??.properties with:

reuse spdx > reuse.spdx
grep "Language_en.properties" reuse.spdx --after-context=6 --before-context=1 | grep LicenseInfo

I (wrongly) expect it to override again, but instead I get the following result:

LicenseInfoInFile: CC0-1.0
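
A minimal sketch of why this result is plausible, under two assumptions not stated above: that ? is treated as a literal character rather than a wildcard, and that when several [[annotations]] tables match a path, the last one wins. Only the patterns relevant here are modelled. The regexes are hand-written stand-ins for those patterns, and resolve_license is a hypothetical helper, not part of the reuse tool.

```python
import re

# Hand-written stand-ins for the relevant patterns, in REUSE.toml order.
# Assumption: "?" is a literal character, so it is escaped in the regex.
ANNOTATIONS = [
    (re.compile(r"^(.*\.properties)$"), "CC0-1.0"),
    (re.compile(r"^(.*Language\.properties)$"), "GPL-3.0-only"),
    (re.compile(r"^(.*Language_\?\?\.properties)$"), "GPL-3.0-only"),
]

# Same entries, but with "Language_*.properties" replacing "Language_??.properties".
WORKAROUND = ANNOTATIONS[:2] + [
    (re.compile(r"^(.*Language_[^/]*\.properties)$"), "GPL-3.0-only"),
]


def resolve_license(path, annotations=ANNOTATIONS):
    """Return the license of the last matching entry (assumed precedence)."""
    result = None
    for regex, license_id in annotations:
        if regex.match(path):
            result = license_id
    return result


print(resolve_license("src/Language.properties"))                 # GPL-3.0-only
print(resolve_license("src/Language_en.properties"))              # CC0-1.0
print(resolve_license("src/Language_en.properties", WORKAROUND))  # GPL-3.0-only
```

Under these assumptions, Language_en.properties only ever matches the generic **/*.properties entry, because the ?? pattern would require two literal question marks in the file name.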

Discussion / actual question

So, the question is, are * and ** enough for globbing, or do we need something more flexible?

If we need something more flexible, is ? enough, or do we need to go further (e.g. [a-z], [0-9], [a,f,v])? Maybe even a full globbing system?

Personally, I’m undecided right now. I can resolve the above issue with *, but I will see if I run into an unsolvable situation while I REUSE-ify the behemoth that is the Liferay Portal code base.

I present my example here more as an anecdotal, potential symptom to start the discussion.

@carmenbianca
Member

carmenbianca commented Nov 21, 2024

Hi @silverhook. I'm going to give a structured response to this.

A little history

Skipping over some steps, the initial implementation of REUSE.toml for Specification 3.2 used Python's standard library fnmatch, then later the glob module of the third-party library wcmatch. Both of these modules implement bash-like globbing, but fnmatch was insufficient for technical reasons (only works on filenames, not full paths). However, adopting wcmatch was tricky for two main reasons:

  • I want to keep the number of third-party dependencies low. I do not want to put too much weight on this, though, so this soft requirement can be discarded.
  • I struggled to document wcmatch's behaviour in the specification.

For these reasons, I adopted a much simpler implementation. The discussion can be found here. I will talk more about the simple implementation later.
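
One way to see the fnmatch limitation mentioned above is that Python's fnmatch does not treat the path separator as special, so * also matches across directory boundaries. The following is a small sketch using only the standard library; the paths are made up for illustration.

```python
from fnmatch import fnmatchcase

# "*" is meant to match within one directory level here...
same_level = fnmatchcase("src/a.properties", "src/*.properties")

# ...but fnmatch lets it run across "/" as well, so deeper files match too.
crosses_slash = fnmatchcase("src/deep/er/a.properties", "src/*.properties")

# A bare file-name pattern therefore also matches a full path.
bare_pattern = fnmatchcase("src/a.properties", "*.properties")

print(same_level, crosses_slash, bare_pattern)  # True True True
```

Because * and ** cannot be told apart this way, a path-aware matcher is needed if the two are supposed to behave differently.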

REUSE is unambiguous

One of the goals of REUSE is to be unambiguous. For that reason, I needed to accurately describe the glob behaviour in the specification. The problem is that the documentation of the fnmatch and wcmatch libraries is simply not adequate, and at the time, I could not find other good documentation to point to. I would need to dedicate a lot more words than I want to defining how globbing works in the spec.

However, revisiting this now, I have found Pattern Matching in the bash documentation. This documentation is more than adequate. We could, if we wanted to, write something akin to 'globbing works as defined in Pattern Matching, using the globstar and dotglob options, and with the C locale' (or something to that effect). I would be content that this sufficiently documents how globbing works in REUSE; the only subsequent challenge is making sure that our code actually adheres precisely to what Pattern Matching describes. I think wcmatch can do that, but I'd need to double-check.

(I also considered writing 'globbing works like the behaviour of wcmatch.glob with such-and-so options', but I discarded this idea because it would marry our specification to an implementation detail---a small third-party Python-only library.)

REUSE is machine-readable

But even if we solve the problem of ambiguity in the specification, there is another challenge. Another goal is to be machine-readable, and I'm anxious that having an advanced globbing algorithm prevents third-party software from making inferences about REUSE.toml files. Specifically, third-party software would need to either:

  • be a Python program that depends on wcmatch, and use wcmatch.glob with the same options as we do;
  • or be any program that does not depend on wcmatch that can produce the exact same bash-like (with dotglob and globstar) globbing results.

If the program is unable to replicate the exact behaviour of wcmatch.glob, then that program produces erroneous inferences about REUSE.toml, which breaks unambiguity and machine-readability.

REUSE is easy

This is a minor concern, but another goal of REUSE is to be easy. It was my fear that advanced globbing would increase the difficulty for humans to parse REUSE.toml files. Especially legal experts, who may not be familiar with POSIX-y stuff. And to be frank, some bash-like globbing patterns are incredibly arcane.

What I settled on

Wanting to reduce complexity, I made an issue against wcmatch to disable some features, specifically sequence matching, here. The maintainer closed my issue, which was perfectly fair and reasonable. So I had my own go at it, and was able to implement the feature relatively easily. Here is the current state of the tool's globbing code:

def __attrs_post_init__(self) -> None:
    def translate(path: str) -> str:
        # pylint: disable=too-many-branches
        blocks = []
        escaping = False
        globstar = False
        prev_char = ""
        for char in path:
            if char == "\\":
                if prev_char == "\\" and escaping:
                    escaping = False
                    blocks.append("\\\\")
                else:
                    escaping = True
            elif char == "*":
                if escaping:
                    blocks.append(re.escape("*"))
                    escaping = False
                elif prev_char == "*" and not globstar:
                    globstar = True
                    blocks.append(r".*")
            elif char == "/":
                if not globstar:
                    if prev_char == "*":
                        blocks.append("[^/]*")
                    blocks.append("/")
                escaping = False
            else:
                if prev_char == "*" and not globstar:
                    blocks.append(r"[^/]*")
                blocks.append(re.escape(char))
                globstar = False
                escaping = False
            prev_char = char
        if prev_char == "*" and not globstar:
            blocks.append(r"[^/]*")
        result = "".join(blocks)
        return f"^({result})$"

    self._paths_regex = re.compile(
        "|".join(translate(path) for path in self.paths)
    )

The thought process here is simple: by thoroughly reducing the complexity of the globbing algorithm, third-party software can easily (a.) copy the above code, (b.) re-implement the above code, or (c.) write some code that does what the specification defines, using the REUSE tool's test suite as reference. (Important note: the REUSE Specification documents the globbing behaviour in full.)
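
As a rough illustration of options (a) and (c), here is a standalone sketch that lifts the translation logic above out of the class unchanged and checks it against a handful of example vectors. The compile_paths and check_vectors helpers and the VECTORS list are hypothetical and written for this example; they are not taken from the REUSE tool's test suite.

```python
import re


def translate(path: str) -> str:
    """Translate one REUSE.toml path into an anchored regex (logic as above)."""
    blocks = []
    escaping = False
    globstar = False
    prev_char = ""
    for char in path:
        if char == "\\":
            if prev_char == "\\" and escaping:
                escaping = False
                blocks.append("\\\\")
            else:
                escaping = True
        elif char == "*":
            if escaping:
                blocks.append(re.escape("*"))
                escaping = False
            elif prev_char == "*" and not globstar:
                globstar = True
                blocks.append(r".*")
        elif char == "/":
            if not globstar:
                if prev_char == "*":
                    blocks.append("[^/]*")
                blocks.append("/")
            escaping = False
        else:
            if prev_char == "*" and not globstar:
                blocks.append(r"[^/]*")
            blocks.append(re.escape(char))
            globstar = False
            escaping = False
        prev_char = char
    if prev_char == "*" and not globstar:
        blocks.append(r"[^/]*")
    result = "".join(blocks)
    return f"^({result})$"


def compile_paths(paths):
    """Hypothetical helper: one regex that matches any of the given paths."""
    return re.compile("|".join(translate(path) for path in paths))


# Hypothetical vectors: (pattern, path, expected match).
VECTORS = [
    ("src/*.py", "src/a.py", True),
    ("src/*.py", "src/sub/a.py", False),  # "*" stops at "/"
    ("src/**", "src/sub/a.py", True),     # "**" crosses "/"
    ("**/Language_??.properties", "src/Language_en.properties", False),
    ("**/Language_*.properties", "src/Language_en.properties", True),
]


def check_vectors(vectors=VECTORS):
    """Return True if every vector behaves as expected."""
    return all(
        bool(compile_paths([pattern]).match(path)) is expected
        for pattern, path, expected in vectors
    )


print(check_vectors())  # True
```

A third-party implementation in another language could run the same kind of vector table against its own matcher to check that it agrees with the tool.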

Of course, this choice depends on the assumption that no advanced features are needed---that we can get away exclusively with *s and **s. Which is exactly what this issue is about. If we decide that, no, this feature set is not enough, then we will need to reconsider our approach.

Standardisation

There is one last point in favour of implementing full-featured bash-like globbing, regardless of whether we actually need it or not.

Our implementation of globbing is incredibly custom, exclusive to us. Furthermore, although I wrote a heap of tests, the code is brittle, and there could be unknown broken corner cases. One such bug shipped in v4.0.0. By sticking closer to bash globbing, we avoid all the pitfalls of having a custom solution. I won't name the advantages of standardisation here.

However, we then loop back to the problem outlined earlier: can third-party software easily re-implement our exact bash-like globbing feature set? I had a search for JavaScript, which has node-glob. But for Ruby and Rust, I was unable to find anything.

Alternatively, we could implement fnmatch-like globbing instead, which is in most standard libraries, is incredibly close to bash-like globbing, but which does not support globstar (**). (Although, frustratingly, Ruby's fnmatch implementation does implement globstar without an option to disable it, which is immensely unhelpful.)

New tech to the rescue

One final note. Since REUSE Specification 3.2 was released, Python 3.13 has also been released. And it comes with features that make wcmatch superfluous! glob.translate and pathlib.PurePath.full_match come with fnmatch-like globbing that also (optionally) implements globstar (and dotglob). This is further documented here in a section called Pattern language.
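
For a feel of what that looks like, here is a small sketch using pathlib.PurePath.full_match. It assumes Python 3.13 or newer; on older versions the hypothetical full_match_or_none helper returns None instead of failing. The file names are borrowed from the example at the top of this issue.

```python
from pathlib import PurePosixPath


def full_match_or_none(path, pattern):
    """Return PurePath.full_match(pattern), or None if it is unavailable.

    full_match was added in Python 3.13; this wrapper is hypothetical and
    only exists so the sketch also runs on older interpreters.
    """
    pure_path = PurePosixPath(path)
    if not hasattr(pure_path, "full_match"):
        return None
    return pure_path.full_match(pattern)


# "?" is a single-character wildcard in this pattern language.
print(full_match_or_none("src/Language_en.properties", "**/Language_??.properties"))
print(full_match_or_none("src/Language_eng.properties", "**/Language_??.properties"))
```

On Python 3.13 the first call should match and the second should not, since ?? stands for exactly two characters there.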

But this, too, has challenges as described in the above section, and as described when discussing wcmatch. And furthermore, it'd be quite a break from our current (unwritten) policy to jump to Python 3.13. We could depend on wcmatch for Python ≤3.12 and use the Python standard library for Python ≥3.13, but I do not know if there are any obvious discrepancies. I also do not know if Python 3.13's implementation matches the Pattern Matching bash documentation, which reintroduces the problem of documenting this properly in the specification, unless we link directly to the Pattern language Python documentation section (and decide that that section is sufficient).


Anyroad, that's all. A lot of problems and caveats and 'uuuuuugh I wish this were simpler'.

Terrible summary:

  • It's hard to unambiguously document our globbing behaviour.
  • It's hard for third parties to re-implement the exact globbing behaviour we settle on.
  • Implementing fnmatch-like or bash-like globbing with dotglob and globstar options requires a third-party Python library (wcmatch) or Python ≥3.13, and we would need to do more research to ascertain whether either of these actually fully conforms to the documented behaviour that we are targeting.
  • Our current globbing behaviour is much simpler than POSIX-y globbing, but also fairly custom to us.

@silverhook
Contributor Author

Thank you, @carmenbianca, for both this wonderful recap/explanation and your hard work in REUSE.

I think you pointed out really well why the situation is the way it is and why we should make any changes to it only if it turns out it is really needed.

Of course, this choice depends on the assumption that no advanced features are needed---that we can get away exclusively with *s and **s. Which is exactly what this issue is about. If we decide that, no, this feature set is not enough, then we will need to reconsider our approach.

This is the crux of it, it seems, yes.

So, let us keep this “issue” open for discussion for whenever such an occasion arises. But at least on my side, that day has not yet come.
