Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement builtin glob function #488

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

hdwalters
Copy link
Contributor

@hdwalters hdwalters commented Sep 22, 2024

Fixes #463, by creating a builtin glob function.

build.rs Outdated
@@ -1,3 +1,4 @@
fn main() {
build_helper::rerun_if_changed("src/tests/**");
// See [https://doc.rust-lang.org/cargo/reference/build-scripts.html].
build_helper::rerun_if_changed("src/tests");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "rerun-if-changed" directive takes a directory name, not a "**" extended glob; this was rebuilding Amber even if nothing had changed in "src/tests".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know this was commented. Good job!

src/main.rs Outdated
@@ -175,15 +175,13 @@ fn handle_docs(cli: Cli) -> Result<(), Box<dyn Error>> {
}
}

/*
#[cfg(target_os = "windows")]
#[cfg(windows)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have a conditional build for Windows (even if not officially supported), surely it shouldn't be commented out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #490.

src/modules/expression/expr.rs Outdated Show resolved Hide resolved
format!("{quote}{}{quote}", translate_interpolated_region(self.strings.clone(), interps, true))
let strings = translate_interpolated_region(self.strings.clone(), interps, true);
if self.unquoted {
strings
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary so we generate Bash for item in file* not for item in "file*".

token(meta, "glob")?;
token(meta, "(")?;
let mut new_is_unquoted_ctx = true;
swap(&mut new_is_unquoted_ctx, &mut meta.context.is_unquoted_ctx);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern is dangerous, because the context value is not reverted to its previous value if the function does an early return, as it will if syntax(meta, &mut arg)? fails below. However, the source code is full of instances of this, so we need a comprehensive fix. I will create a separate issue for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #489.

@@ -58,8 +58,7 @@ impl SyntaxModule<ParserMetadata> for FunctionInvocation {
// Get the function name
let tok = meta.get_current_token();
if let Some(ref tok) = tok {
self.line = tok.pos.0;
self.col = tok.pos.1;
(self.line, self.col) = tok.pos;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's generally more readable to use tuple destructuring than explicit indices.

import * from "std/text"

// Output
// ===
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These separators are necessary because if the script creates no output (as in this case) the unit test actually behaves as if we had written:

// Output
// Succeded

format!("{quote}{}{quote}", translate_interpolated_region(self.strings.clone(), interps, true))
let strings = translate_interpolated_region(self.strings.clone(), interps, true);
if self.unquoted {
strings
Copy link
Contributor Author

@hdwalters hdwalters Sep 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still allows for injection attacks, but I have pushed another commit to prevent this, by escaping spaces and semicolons:

if self.escaped {
    strings.replace(" ", "\\ ").replace(";", "\\;")
} else {
    let quote = meta.gen_quote();
    format!("{quote}{strings}{quote}")
}

[
format!("for {name} in {expr}"),
"do".to_string(),
format!("if {conditional}"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note, this will only work with string literals:

#!/usr/bin/env amber

echo "[works]"
loop index, item in glob("*.md") {
    echo "{index}: {item}"
}

echo "[fails]"
let wildcard = "*.md"
loop index, item in glob(wildcard) {
    echo "{index}: {item}"
}

I do not see any way around this (short of calling find, which has its own problems) because Bash glob expansion does not work with quoted globs.

@@ -13,5 +12,5 @@ main {
} else {
echo "File {tmpdir}/amber-symbolic not exists"
}
unsafe $rm {tmpdir}/amber-symbolic$
unsafe $rm -fr {tmpdir}$
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were several places in the standard library unit tests where we failed to clean up after ourselves, so /tmp/ filled up with temporary directories. Fixed this while I was here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

}

echo "==="
loop file in glob("{tmpdir}/*with spaces*") {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The glob builtin does not expect the calling code to escape spaces in its own string literals; it has to do this itself, to prevent injection attacks.

@@ -114,6 +114,8 @@ pub struct Context {
pub is_main_ctx: bool,
/// Determines if the context is in an unsafe block
pub is_unsafe_ctx: bool,
/// Determines if text should be escaped e.g. for file globs
pub is_escaped_ctx: bool,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mte90
Copy link
Member

Mte90 commented Sep 23, 2024

A Bash generated:

for file in missing* file* *.csv; do
    if [ -e "${file}" ]; then
        echo "${file}"
    fi
done

I like this approach using the bash for.

@hdwalters
Copy link
Contributor Author

A Bash generated:

for file in missing* file* *.csv; do
    if [ -e "${file}" ]; then
        echo "${file}"
    fi
done

I like this approach using the bash for.

Please bear in mind that the glob builtin as currently written can only support string literals:

for file in glob("*.md") {
    echo file
}

Not:

let files = "*.md"
for file in glob(files) {
    echo file
}

Or:

let files = [ "*.md" ]
for file in glob(files) {
    echo file
}

Copy link
Member

@Ph0enixKM Ph0enixKM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the syntax perspective, since glob is a built-in and echo is a built-in and glob requires parentheses but echo does not - does it mean we introduce a new builtin type?

I'd keep the new built-in just like the others: glob "value"

Copy link
Member

@b1ek b1ek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is there so many changes, unrelated to this PR? mainly the refactoring ones.

can you cherry pick the commits with the refactoring changes to a separate branch and make another PR for it? and leave only the relevant ones in this one

@hdwalters
Copy link
Contributor Author

why is there so many changes, unrelated to this PR? mainly the refactoring ones.

can you cherry pick the commits with the refactoring changes to a separate branch and make another PR for it? and leave only the relevant ones in this one

Yeah, fair enough. I tend to make changes as I go (though I do try to limit them to the vicinity of the functional changes). However, I would like to get to the bottom of the "are we ok with builtin functions that behave like standard library functions" question, before I do any more work in this branch.

@hdwalters
Copy link
Contributor Author

hdwalters commented Sep 30, 2024

I've created a new pull request #495 for this purpose, including all the refactoring from this PR, and some other general tidying I've spotted while working on other branches. If we can get that merged, I'll rebase this one, and hopefully it will become a lot easier to review.

@hdwalters
Copy link
Contributor Author

Yes, I see there are merge conflicts (which I expected) following the merge of #495. I will rebase this branch when we've resolved the questions raised in #277.

@hdwalters hdwalters force-pushed the implement-builtin-glob-function branch from f8e8de8 to d23ca60 Compare October 2, 2024 21:26
@hdwalters
Copy link
Contributor Author

Now that I've rebased the branch onto the latest "master", can I get another review please?

@hdwalters hdwalters changed the title Implement builtin glob function Implement builtin glob expression Oct 2, 2024
@hdwalters hdwalters changed the title Implement builtin glob expression Implement builtin glob function Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Loop files/directories (glob)
4 participants