JS/TS ESM version (modern JS module system vs global scope pollution) #183
Comments
There was also #123 which I closed as abandoned (it broke CI so it needed more work before it could be merged). The current structure is probably partly due to my inexperience with modern javascript and partly due to the code's history. Snowball's javascript backend was originally a JSX backend (not the JSX you're likely thinking of, but https://jsx.github.io/ which was "a faster, safer, easier JavaScript" that transpiled to javascript). The JSX project is inactive, so I morphed it into producing javascript code directly, with the primary personal motivation being to have a demo on the website which could work without any server-side support. That was also more than 5 years ago. If we could provide a single output without real drawbacks, that would be simpler. Does ESM work with all browsers? What older node versions don't support it?
I'm guessing from the name that selects a stemmer based on the current locale setting? That doesn't seem like a useful thing to do as you need to use the same stemming algorithm when searching that you did at index time and that's typically a per-database thing, whereas the locale is per user (and can typically be changed by the user too). |
ESM is supported by all modern browsers (no IE or Opera Mini, of which IE is officially dead and Opera Mini has patchy support for JavaScript in general). The upgrade for in-browser usage relying on the old global behavior would typically look something like this:
- <script src="./path/to/base-stemmer.js"></script>
- <script src="./path/to/english-stemmer.js"></script>
+ <script type="module">
+ import StemmerEnglish from "./path/to/stemmer-english.mjs"
+ globalThis.StemmerEnglish = StemmerEnglish
+ </script>
Or for NodeJS (ESM) usage relying on the global behavior:
- import "./path/to/base-stemmer.js"
- import "./path/to/stemmer-english.mjs"
+ import StemmerEnglish from "./path/to/stemmer-english.mjs"
+ globalThis.StemmerEnglish = StemmerEnglish
In NodeJS, experimental support for ESM was added in 8.5.0 (Sep 2017), with unflagged support added in 13.2.0 (Nov 2019). Current latest is 20.5.1, LTS is 18.17.1. End-of-life for 12.x (most recent LTS without unflagged support) was Apr 2022. However, newer versions of NodeJS continue to support the old style of "CJS" modules. Interestingly, NodeJS ESM imports based on the "implicit global pollution" pattern don't throw an error (as would be expected in strict-mode); my guess is that this is a deviation from the spec to help with porting codebases from CJS.
Not exactly — it takes a locale as a parameter, rather than using the current locale setting (though you could also import one based on current locale setting by passing it as a param). My use case is that I want to use any relevant stemmer based on multilingual, user-supplied files, of which the locale is known but varies per file. |
As a starter, can we prefix the declaration with |
If you make that change and nothing else, it makes the variable local to the module, so nothing is exported (importing the module becomes a no-op). I think the most minimal working ESM/strict-mode-compatible version that doesn't pollute the global scope could be generated something like this:
# printf is used instead of echo so that \n expands to a newline in every shell
mkdir -p out
cat <(printf 'let BaseStemmer\nexport default ') ./javascript/base-stemmer.js > ./out/base-stemmer.mjs
./snowball algorithms/english.sbl -js -o out/stemmer-english \
  -n "import BaseStemmer from './base-stemmer.mjs'; let StemmerEnglish; export default StemmerEnglish"
mv ./out/stemmer-english.js ./out/stemmer-english.mjs
Works in both NodeJS and Deno:
printf "import StemmerEnglish from './stemmer-english.mjs'\nconsole.log(new StemmerEnglish().stemWord('location'))\n" > ./out/main.mjs
echo "result from node: $(node ./out/main.mjs)"
# result from node: locat
echo "result from deno: $(deno run ./out/main.mjs)"
# result from deno: locat |
It seems almost all users should be happy with ESM output (and many probably happier), but it's conceivable a few might still want the current output. Maybe nobody would, but unfortunately it's hard to survey more than a very limited number of users. For anyone still relying on pre-ESM support, it's probably really unhelpful to have a dependency force the timing of when they drop support for older javascript implementations. People in this situation can choose to stay with an older Snowball version, but then they don't get other changes from newer versions which they may want. So I think I'm leaning towards trying to keep the existing output as an option, and if that proves complicated we can reassess. Thinking about command line option naming, we don't have many target-language-specific options. There are a few like With the motivation of trying to keep it clearer what options are relevant to a particular target language, I wonder if we should follow Go's lead of prefixing with the language name, e.g. |
I've pushed changes to add support for I haven't tried to merge in the changes between I didn't try to generate code with the various style changes you'd made - they're likely possible but I'd like evidence that there's firm consensus in the JS world about such points or at least a strong argument for why we should produce code that looks a particular way, rather than these just being matters of personal preference. A data point is you've reformatted to use tabs for indenting, but we changed to using spaces for indentation here due to #123. I really don't want to get sucked into acting as a mediator between warring Javascript style factions! Re: locales, I see from the code you really just mean a way to create a stemmer object given an ISO-639 language code string - that seems fine (and e.g. the C libstemmer can create a stemmer for a given language name or ISO-639 language code string). |
Nice! I'll give it a try with Deno when I have time.
I'm not sure if this will be possible without global scope pollution, though there might be a way to check and only pollute global scope if the context isn't ESM.
Nor me — I just formatted that way on my fork as that's what I use as config for my auto-formatter. I wouldn't presume to advocate for that change in the official repo. |
One version which dynamically works for both is not the only approach - we could generate one from the other, or generate both variants from some sort of template. That would probably happen at the same stage where we run snowball to generate the JS sources.
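The "generate both variants from some sort of template" idea could look roughly like the sketch below. It is only an illustration, not Snowball's build logic: `wrapVariant`, the `StemmerDemo` name and the `./base-stemmer.*` file names are hypothetical, the bare source is assumed to declare `const <name> = ...` and to refer to `BaseStemmer` by name, and `globalThis` is used for brevity even though a true legacy build might need something older.

```javascript
// Hypothetical template step: wrap one "bare" stemmer source in the
// boilerplate needed for each target module system.
function wrapVariant(bareSource, name, variant) {
  switch (variant) {
    case "esm":
      // ES module: explicit import and default export, nothing touches globals.
      return `import BaseStemmer from "./base-stemmer.mjs";\n${bareSource}\nexport default ${name};\n`;
    case "cjs":
      // CommonJS: require() the base class and export via module.exports.
      return `const BaseStemmer = require("./base-stemmer.js");\n${bareSource}\nmodule.exports = ${name};\n`;
    case "global":
      // Legacy mode: BaseStemmer is expected to already be a global;
      // the stemmer is attached to the global object explicitly.
      return `(function () {\n${bareSource}\nglobalThis.${name} = ${name};\n})();\n`;
    default:
      throw new RangeError(`Unknown variant: ${variant}`);
  }
}

// Example: print the ESM wrapper for a stand-in stemmer body.
const demoSource = "const StemmerDemo = function () {};";
console.log(wrapVariant(demoSource, "StemmerDemo", "esm"));
```

Keeping the three wrappers next to each other makes it easy to see that only the boilerplate differs, which is what would let CI test each variant from the same generated core.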
OK, cool. To be clear I'm OK with people proposing such changes where there's a good reason. If we're expecting people to use the generated code as-is in the browser, formatting which makes it smaller is arguably helpful as it's currently 952K of JS for all of them. Simply However at least with the current size I'd expect people to be running it through a minifier tool - for our own website I'm using closure-compiler which reduces that to 272K. (Actually all the size numbers above include Perhaps we should include a minification step. I don't really do enough JS to know if that's actually helpful though. |
IMO it's not really necessary, as the library doesn't provide any pre-built version. It makes sense to leave minification up to consumers (or downstream providers of pre-built versions), as they already need to produce their own build output. Perhaps stick a note in the README about a suggested way to minify - e.g. something like this would work in Deno (with output being consumable by various environments):
import { transform } from "https://deno.land/x/swc@0.2.1/mod.ts";
const dirName = "javascript";
for await (const f of Deno.readDir(dirName)) {
  const segs = f.name.split(".");
  // Skip directories, already-minified files, and anything that isn't .js/.mjs
  if (!f.isFile || segs.includes("min") || !segs.at(-1).endsWith("js")) continue;
  // Write foo.js as foo.min.js (and foo.mjs as foo.min.mjs) alongside the original
  await Deno.writeTextFile(
    `${dirName}/${[...segs.slice(0, -1), "min", segs.at(-1)].join(".")}`,
    transform(await Deno.readTextFile(`${dirName}/${f.name}`), { minify: true }).code,
  );
}
From looking into it more, it seems there's no way of properly feature-checking for ESM contexts. I noticed in your recent changes you added Having thought on it a bit more, here's an outline for build logic I think should work for all 3 of ESM, CJS, and legacy/compatibility global mode:
* "Bare version": just the ** "Relevant boilerplate":
|
I pretty much just resolved the issues with the parts of #123 which didn't get applied before because they broke CI, then merged them. If you're only suggesting we support CJS because I merged that, that's not a compelling reason to (I really have little idea what's current in JS, and that PR was originally from 2019). If it's actually still useful that seems reasonable though. I should be able to easily sort out the codegen tweaks, but it'd be helpful if someone with more idea than me could set up suitable additional CI builds to actually test the three variants. I don't know how feasible it is, but it'd also be useful if we could have a CI job testing the JS code running inside a browser for the variants where that makes sense. |
Results from GitHub code search:
Not completely scientific as
In-browser CI can be done with tools like Selenium, Cypress, or Puppeteer, but IMO it's massive overkill for this use case (usually those tools are reserved for things like end-to-end testing of UI interactions). ESM and global versions should work identically in browser, and CJS isn't relevant to browsers. |
Also repos for projects that are no longer actively maintained generally persist rather than get removed which will tend to skew towards older ways of doing things, but it does seem like that couldn't reasonably explain all of it.
OK, that seems reasonable then - we can always revisit if we hit a situation which wasn't caught by CI but would have been with in-browser testing. If I have things straight, CI testing currently essentially tests a kind of hybrid which dynamically supports CJS or global, so I can probably manage to split that into a job for each. If someone can set up CI jobs which test ESM with javascript and typescript that would be helpful, probably using deno since that appears to have a stricter implementation of ESM than node (from comments above). |
The current JS build script outputs files with the following shape (example for English):
In non-strict-mode (e.g. root-level non-modularized JavaScript), this automatically adds StemmerEnglish to the global scope, which is less-than-ideal but generally doesn't cause problems. However, in strict-mode (e.g. a JavaScript module), this throws Uncaught ReferenceError: StemmerEnglish is not defined. This makes it impossible to run on Deno, which always uses strict mode.
In addition, the JSDoc types added aren't picked up by TypeScript.
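The difference between the two modes can be reproduced in isolation. This is a minimal sketch rather than the generated stemmer code: `DemoStemmer` and `OtherDemoStemmer` are made-up names, and `new Function` is used only because functions created that way run in non-strict mode unless their body opts in with "use strict".

```javascript
// Non-strict mode: assigning to an undeclared name silently creates a global.
const sloppyResult = new Function(
  "DemoStemmer = function () {}; return typeof globalThis.DemoStemmer;"
)();

// Strict mode (always on inside ES modules): the same kind of assignment throws.
let strictError = null;
try {
  new Function('"use strict"; OtherDemoStemmer = function () {};')();
} catch (err) {
  strictError = err;
}

console.log(sloppyResult);     // "function"
console.log(strictError.name); // "ReferenceError"
```

Declaring the variable (for example with `const`) avoids the error, but then the module also has to export it explicitly for importers to see it.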
Is there interest in adding an ESM build, either as a replacement for or in addition to the current JS one? I've managed to hack one together in my fork (README, compilation logic, dynamic imports by locale code, tests), and it works nicely in Deno, newer versions of Node, and in-browser HTTP imports, with TypeScript-available types out of the box. I could probably work it into a PR with some effort (I don't know much C), but I'd need a steer on a couple of things:
* Should this be -esm as an additional flag along with -js, or -esm being a completely separate flag from -js?
* Would getStemmerByLocale.mjs be considered valuable to include in the build, or should that be supplied by userland code? The advantage of including it in the build is that hard-coded dynamic imports are visible to static analysis, giving the performance advantage of lazily-loaded stemmers without causing the imports to be removed from bundles by overly-zealous downstream build tools. This is primarily useful for multilingual use cases. Requiring that logic to be supplied by userland code would require the consumer to add their own build step, based on the files in ./algorithms and the locale name mapping.
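To make the "hard-coded dynamic imports" point concrete, here is a rough sketch of what such a lookup module might contain. It is not the code from the fork: the loader table, `baseLanguage`, and every module path other than ./stemmer-english.mjs are assumptions for illustration, and a real version would be generated from the full list of algorithms.

```javascript
// Each import() uses a literal path, so bundlers can see every possible
// target statically, while nothing is loaded until it is requested.
const stemmerLoaders = new Map([
  ["en", () => import("./stemmer-english.mjs")],
  ["fr", () => import("./stemmer-french.mjs")], // assumed file name
  ["de", () => import("./stemmer-german.mjs")], // assumed file name
]);

// Reduce a locale such as "en-GB" or "fr_CA" to its language code.
function baseLanguage(locale) {
  return String(locale).trim().toLowerCase().split(/[-_]/)[0];
}

// Resolves to a new stemmer instance, or null when no stemmer is listed.
async function getStemmerByLocale(locale) {
  const load = stemmerLoaders.get(baseLanguage(locale));
  if (!load) return null;
  const { default: Stemmer } = await load();
  return new Stemmer();
}
```

With the generated modules in place, a caller would write something like `const stemmer = await getStemmerByLocale("en-GB")` and check for null before calling `stemmer.stemWord(...)`.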