-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Jdk19 regexp fix #11016
base: master
Are you sure you want to change the base?
Jdk19 regexp fix #11016
Conversation
WalkthroughThe pull request includes significant changes across multiple files in the LanguageTool project. Key modifications involve the deletion of the Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 7
🧹 Outside diff range and nitpick comments (18)
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (1)
79458-79458
: Consider optimizing similar abbreviation patternsThe patterns for basic abbreviations (s.o., d.h., m.E., z.B.) follow a consistent structure. Consider combining them into a more efficient pattern using alternation.
-<regexp case_sensitive='yes'>(s|S)\.[ \t]?(o|u)\.</regexp> -<regexp case_sensitive='yes'>(d|D)\.[ \t]?h\.</regexp> -<regexp case_sensitive='yes'>(m|M)\.[ \t]?E\.</regexp> -<regexp case_sensitive='yes'>(z|Z)\.[ \t]?(B|K|T|Zt|Hd?)\.</regexp> +<regexp case_sensitive='yes'>(?:([sdmz]|[SDMZ])\.[ \t]?(?:o|u|h|E|(?:B|K|T|Zt|Hd?))\.</regexp>Also applies to: 79467-79467, 79475-79475, 79491-79491
languagetool-core/src/main/java/org/languagetool/rules/patterns/RegexAntiPatternFilter.java (1)
Line range hint
44-53
: Consider caching compiled patterns for better performance.The current implementation compiles patterns inside the loop for each iteration. Since these patterns are constant, they could be pre-compiled and cached.
Here's a suggested optimization:
public class RegexAntiPatternFilter extends RegexRuleFilter { + private final Map<String, Pattern> patternCache = new HashMap<>(); + private Pattern getOrCompilePattern(String antiPattern) { + return patternCache.computeIfAbsent(antiPattern, + pattern -> Pattern.compile(pattern, Pattern.UNICODE_CHARACTER_CLASS)); + } @Nullable @Override public RuleMatch acceptRuleMatch(RuleMatch match, Map<String, String> arguments, AnalyzedSentence sentenceObj, Matcher patternMatcher) { // ... existing code ... for (String antiPattern : antiPatterns) { - Pattern p = Pattern.compile(antiPattern, Pattern.UNICODE_CHARACTER_CLASS); + Pattern p = getOrCompilePattern(antiPattern); Matcher matcher = p.matcher(sentenceObj.getText()); // ... rest of the code ... } } }languagetool-language-modules/es/src/main/java/org/languagetool/language/Spanish.java (4)
Line range hint
391-391
: Consider using a more specific exception typeWrapping IOException in RuntimeException loses the original exception context. Consider creating a custom checked exception or using a more specific runtime exception.
-throw new RuntimeException(e); +throw new LanguageToolException("Error while tagging suggestion: " + suggestion, e);
Line range hint
379-408
: Consider breaking down the complex filtering logicThe method handles multiple filtering scenarios which makes it hard to maintain and test. Consider extracting each filtering logic into separate private methods.
Here's a suggested refactor:
private List<RuleMatch> filterRuleMatches(List<RuleMatch> ruleMatches, AnnotatedText text, Set<String> enabledRules) { List<RuleMatch> results = new ArrayList<>(); for (RuleMatch ruleMatch : ruleMatches) { List<String> suggestions = ruleMatch.getSuggestedReplacements(); if (suggestions.size()==1 && ruleMatch.getRule().getId().startsWith("AI_ES_GGEC")) { String suggestion = suggestions.get(0); - // ignore adding punctuation at the sentence end - if (ruleMatch.getRule().getId().equals("AI_ES_GGEC_MISSING_PUNCTUATION") && suggestion.endsWith(".")) { - if (ruleMatch.getSentence().getText().replaceAll("\\s+$", "").endsWith(suggestion.substring(0,suggestion.length()-1))) { - continue; - } - } - // avoid obsolete diacritics - if (suggestionsToAvoid.contains(suggestion.toLowerCase())) { - continue; - } - // avoid lowercase at the sentence start - if (ruleMatch.getSentence().getText().trim().startsWith(StringTools.uppercaseFirstChar(suggestion))) { - continue; - } - // avoid voseo forms in suggestions - List<AnalyzedTokenReadings> atr; - try { - atr = this.getTagger().tag(Arrays.asList(suggestion)); - } catch (IOException e) { - throw new RuntimeException(e); - } - if (atr != null && atr.size()>0) { - if (atr.get(0).matchesPosTagRegex(voseoPostagPatern)) { - continue; - } - } + if (shouldSkipPunctuationSuggestion(ruleMatch, suggestion) || + isObsoleteDiacritic(suggestion) || + isLowercaseAtSentenceStart(ruleMatch, suggestion) || + isVoseoForm(suggestion)) { + continue; + } ruleMatch.setOriginalErrorStr(); if (suggestion.equalsIgnoreCase(ruleMatch.getOriginalErrorStr())) { updateRuleMatchForCasingChange(ruleMatch); } } results.add(ruleMatch); } return results; } +private boolean shouldSkipPunctuationSuggestion(RuleMatch ruleMatch, String suggestion) { + return ruleMatch.getRule().getId().equals("AI_ES_GGEC_MISSING_PUNCTUATION") && + suggestion.endsWith(".") && + ruleMatch.getSentence().getText().replaceAll("\\s+$", "") + .endsWith(suggestion.substring(0, suggestion.length()-1)); +} +private boolean isObsoleteDiacritic(String suggestion) { + return suggestionsToAvoid.contains(suggestion.toLowerCase()); +} +private boolean isLowercaseAtSentenceStart(RuleMatch ruleMatch, String suggestion) { + return ruleMatch.getSentence().getText().trim() + .startsWith(StringTools.uppercaseFirstChar(suggestion)); +} +private boolean isVoseoForm(String suggestion) { + try { + List<AnalyzedTokenReadings> atr = this.getTagger().tag(Arrays.asList(suggestion)); + return atr != null && !atr.isEmpty() && + atr.get(0).matchesPosTagRegex(voseoPostagPatern); + } catch (IOException e) { + throw new LanguageToolException("Error checking voseo form: " + suggestion, e); + } +} +private void updateRuleMatchForCasingChange(RuleMatch ruleMatch) { + ruleMatch.setMessage("Mayúsculas y minúsculas recomendadas."); + ruleMatch.setShortMessage("Mayúsculas y minúsculas"); + ruleMatch.getRule().setLocQualityIssueType(ITSIssueType.Typographical); + ruleMatch.getRule().setCategory( + Categories.CASING.getCategory(ResourceBundleTools.getMessageBundle(this)) + ); + ruleMatch.setSpecificRuleId( + ruleMatch.getRule().getId().replace("ORTHOGRAPHY", "CASING") + ); +}
Line range hint
377-377
: Add null check for ruleMatches parameterThe method should validate the input parameter to prevent NullPointerException.
public List<RuleMatch> filterRuleMatches(List<RuleMatch> ruleMatches, AnnotatedText text, Set<String> enabledRules) { + if (ruleMatches == null) { + return Collections.emptyList(); + } List<RuleMatch> results = new ArrayList<>();
Line range hint
378-378
: Consider using ArrayList with initial capacityInitialize ArrayList with expected size to avoid resizing operations.
-List<RuleMatch> results = new ArrayList<>(); +List<RuleMatch> results = new ArrayList<>(ruleMatches.size());languagetool-language-modules/ca/src/main/java/org/languagetool/language/Catalan.java (3)
348-348
: Consider using Pattern.compile with flags parameterThe pattern combines flags using the bitwise OR operator. For better readability and maintainability, consider using
Pattern.compile(String regex, int flags)
.-private static final Pattern CA_OLD_DIACRITICS = compile(".*\\b(sóc|dóna|dónes|vénen|véns|fóra)\\b.*",Pattern.CASE_INSENSITIVE|Pattern.UNICODE_CHARACTER_CLASS); +private static final Pattern CA_OLD_DIACRITICS = compile( + ".*\\b(sóc|dóna|dónes|vénen|véns|fóra)\\b.*", + Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS +);
425-425
: Consider extracting pattern strings as constantsThe regex patterns are complex and used in multiple places. Consider extracting them as named constants for better maintainability and documentation.
+private static final String APOSTROPHES2_REGEX = "\\b([mtlsn])['']([^1haeiouáàèéíòóúA-ZÀÈÉÍÒÓÚ"«\"])"; +private static final String APOSTROPHES3_REGEX = "\\be?([mtsldn])e? (h?[aeiouàèéíòóú])"; // ... more constants ... -private static final Pattern CA_APOSTROPHES2 = compile("\\b([mtlsn])['']([^1haeiouáàèéíòóúA-ZÀÈÉÍÒÓÚ"«\"])", Pattern.UNICODE_CHARACTER_CLASS); +private static final Pattern CA_APOSTROPHES2 = compile(APOSTROPHES2_REGEX, Pattern.UNICODE_CHARACTER_CLASS); -private static final Pattern CA_APOSTROPHES3 = compile("\\be?([mtsldn])e? (h?[aeiouàèéíòóú])", - Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS); +private static final Pattern CA_APOSTROPHES3 = compile(APOSTROPHES3_REGEX, + Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);Also applies to: 428-428, 430-430, 432-432, 434-434, 436-436, 438-438, 440-440, 442-442
422-442
: Document the regex patternsThese patterns handle complex linguistic rules for Catalan. Consider adding documentation to explain what each pattern matches and why certain character classes or groups are included/excluded.
+/** + * Pattern for handling Catalan contractions. + * Matches: "a els" -> "als", "de els" -> "dels" + */ private static final Pattern CA_CONTRACTIONS = compile("\\b([Aa]|[Dd]e) e(ls?)\\b", Pattern.UNICODE_CHARACTER_CLASS); +/** + * Pattern for handling Catalan apostrophes before words. + * Matches: "l' home" -> "l'home" + */ private static final Pattern CA_APOSTROPHES1 = compile("\\b([LDNSTMldnstm]['']) ", Pattern.UNICODE_CHARACTER_CLASS);languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/entities/hyphenised.ent (2)
2-2
: Consider improving maintainability of the regex pattern.The current pattern is extremely long and complex, making it difficult to maintain and debug. Consider:
- Breaking down the pattern into smaller, named sub-patterns
- Adding comments to document the word categories
- Using XML entity references for common sub-patterns
Example structure:
- <!ENTITY hyphenised_expressions "(?U)\b(?!feijão(?:ões)-frade)((?:c(?:a... + <!-- Common word endings --> + <!ENTITY common_endings "(?:eiro|eira|adores|ador)"> + <!-- Animal related terms --> + <!ENTITY animal_terms "(?:c(?:achorro|avalo|obra)|p(?:assaro|eixe))"> + <!-- Main pattern --> + <!ENTITY hyphenised_expressions "(?U)\b(?!feijão(?:ões)-frade) + (?: + <!-- Animal compounds --> + (?:&animal_terms;)-(?:&common_endings;) | + <!-- Plant compounds --> + ... + ) + ">
2-2
: Consider performance implications of the complex regex.Such a large regex pattern might impact performance. Consider:
- Using a regex profiler to identify potential bottlenecks
- Breaking down the pattern into multiple smaller patterns that can be matched separately
- Using a trie-based approach for fixed strings
languagetool-core/src/main/java/org/languagetool/rules/patterns/XMLRuleHandler.java (2)
Line range hint
464-482
: Clarify the exception message insetRuleFilter
In the
setRuleFilter
method, the exception message uses concatenation for variables:throw new RuntimeException("Rule " + rule.getFullId() + " of type " + rule.getClass() + " cannot have a filter (" + filterClassName + ")");Consider using
String.format
for better readability:- throw new RuntimeException("Rule " + rule.getFullId() + " of type " + rule.getClass() + " cannot have a filter (" + filterClassName + ")"); + throw new RuntimeException(String.format("Rule %s of type %s cannot have a filter (%s)", rule.getFullId(), rule.getClass(), filterClassName));
Line range hint
385-392
: Avoid potential confusion with variable namingIn
checkNumber
method, the variablepos
is reassigned within the loop which may lead to confusion. Consider using a different variable for the updated position to enhance readability.languagetool-commandline/src/test/java/org/languagetool/commandline/MainTest.java (3)
125-131
: Handle possible issues with command arguments inProcessBuilder
Using
System.getProperty("java.class.path")
directly inProcessBuilder
arguments may cause issues if the classpath contains spaces or special characters. It's recommended to construct the command using aList<String>
to ensure proper handling of arguments.Apply the following diff to use a
List<String>
for command arguments:Process process = new ProcessBuilder( - "java", "-cp", System.getProperty("java.class.path"), "org.languagetool.commandline.Main", "-h" + Arrays.asList( + "java", + "-cp", + System.getProperty("java.class.path"), + "org.languagetool.commandline.Main", + "-h" + ) ).start();This approach helps avoid issues with spaces in file paths and ensures the command arguments are properly handled.
125-144
: Consider capturing and asserting on error output in test methodsIn the test methods where
ProcessBuilder
is used, only the standard output is captured and asserted upon. If the process writes important information to the error stream, it may be missed. Consider capturing the error output and including assertions to verify that no unexpected errors occur.Modify the test methods to read the error output:
Process process = new ProcessBuilder( // existing command ).start(); int exitCode = process.waitFor(); String output = readProcessOutput(process); +String errorOutput = readProcessError(process); assertEquals("Exit status", 1, exitCode); +assertTrue("Error output should be empty", errorOutput.isEmpty());And add a new method to read the error stream:
private String readProcessError(Process process) throws IOException { StringBuilder errorOutput = new StringBuilder(); try (BufferedReader stdErrReader = new BufferedReader(new InputStreamReader(process.getErrorStream()))) { String line; while ((line = stdErrReader.readLine()) != null) { errorOutput.append(line).append(System.lineSeparator()); } } return errorOutput.toString(); }
125-131
: UseProcessBuilder.inheritIO()
for simpler output handlingAlternatively, if capturing the output is not needed for assertions, you can simplify the process execution by using
ProcessBuilder.inheritIO()
, which redirects the subprocess's output and error streams to the current process. This approach reduces complexity when output capturing isn't necessary.Example:
Process process = new ProcessBuilder( "java", "-cp", System.getProperty("java.class.path"), "org.languagetool.commandline.Main", "-h" +).inheritIO().start(); -int exitCode = process.waitFor(); -String output = readProcessOutput(process); -assertTrue(output.contains("Usage: java -jar languagetool-commandline.jar")); +int exitCode = process.waitFor(); +assertEquals("Exit status", 1, exitCode);languagetool-core/src/main/resources/org/languagetool/resource/segment.srx (2)
5774-5774
: Potential Over-Matching with Abbreviation PatternsThe regex on line 5774:
<beforebreak>\b\p{L}{1,2}\.</beforebreak> <afterbreak>\p{L}{1,2}\.</afterbreak>This pattern matches one or two letters followed by a period, which might be too broad and match unintended abbreviations or initials, leading to incorrect sentence boundaries.
Consider adding context or constraints to the pattern to prevent over-matching. For example, specify common abbreviations or ensure it's not followed by a lowercase letter.
5868-5868
: Clarify the Patterns for 'п.' and 'д.' AbbreviationsIn the regex on line 5868:
<beforebreak>(?<!т\.[\h\v]?)\b[пд]\.[\h\v]*</beforebreak> <afterbreak></afterbreak>The abbreviations
п.
andд.
can have multiple meanings ('пан', 'день', etc.). Ensure that the negative lookbehind forт.
correctly handles contexts to prevent misinterpretation.Consider providing additional context in the regex to avoid false positives.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (20)
languagetool-commandline/src/test/java/org/languagetool/commandline/AbstractSecurityTestCase.java
(0 hunks)languagetool-commandline/src/test/java/org/languagetool/commandline/MainTest.java
(5 hunks)languagetool-core/src/main/java/org/languagetool/rules/AbstractUnitConversionRule.java
(2 hunks)languagetool-core/src/main/java/org/languagetool/rules/patterns/PatternRuleHandler.java
(1 hunks)languagetool-core/src/main/java/org/languagetool/rules/patterns/RegexAntiPatternFilter.java
(1 hunks)languagetool-core/src/main/java/org/languagetool/rules/patterns/XMLRuleHandler.java
(1 hunks)languagetool-core/src/main/java/org/languagetool/tokenizers/SrxTools.java
(2 hunks)languagetool-core/src/main/resources/org/languagetool/resource/segment.srx
(10 hunks)languagetool-language-modules/ca/src/main/java/org/languagetool/language/Catalan.java
(3 hunks)languagetool-language-modules/ca/src/main/java/org/languagetool/rules/ca/PronomsFeblesHelper.java
(1 hunks)languagetool-language-modules/de/src/main/java/org/languagetool/language/German.java
(1 hunks)languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml
(9 hunks)languagetool-language-modules/es/src/main/java/org/languagetool/language/Spanish.java
(1 hunks)languagetool-language-modules/es/src/main/java/org/languagetool/tokenizers/es/SpanishWordTokenizer.java
(1 hunks)languagetool-language-modules/fr/src/main/java/org/languagetool/language/French.java
(1 hunks)languagetool-language-modules/fr/src/main/java/org/languagetool/rules/fr/MakeContractionsFilter.java
(1 hunks)languagetool-language-modules/fr/src/main/java/org/languagetool/tokenizers/fr/FrenchWordTokenizer.java
(1 hunks)languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/entities/hyphenised.ent
(1 hunks)languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/pt.sor
(1 hunks)pom.xml
(1 hunks)
💤 Files with no reviewable changes (1)
- languagetool-commandline/src/test/java/org/languagetool/commandline/AbstractSecurityTestCase.java
✅ Files skipped from review due to trivial changes (1)
- languagetool-language-modules/de/src/main/java/org/languagetool/language/German.java
🔇 Additional comments (31)
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (1)
Line range hint 79458-79590
: Verify Unicode character handling in patterns
The AI summary mentions these changes are part of broader Unicode improvements. However, the patterns don't explicitly use Unicode character classes. Please verify if these patterns need to be updated to support the Pattern.UNICODE_CHARACTER_CLASS
flag mentioned in the summary.
languagetool-language-modules/fr/src/main/java/org/languagetool/rules/fr/MakeContractionsFilter.java (1)
29-32
: Improved Unicode support for word boundaries
The change from Pattern.UNICODE_CASE
to Pattern.UNICODE_CHARACTER_CLASS
is a good improvement. While UNICODE_CASE
only handles case-insensitive matching for Unicode characters, UNICODE_CHARACTER_CLASS
provides comprehensive Unicode support for word boundaries (\b
), which is crucial for correctly matching French contractions at word boundaries when the text contains Unicode characters.
Let's verify the handling of Unicode edge cases:
✅ Verification successful
Let me analyze the changes to verify if updating the regex pattern flags from Pattern.UNICODE_CASE
to Pattern.UNICODE_CHARACTER_CLASS
in MakeContractionsFilter.java
has any impact.
Let me check the implementation details of the MakeContractionsFilter to better understand the pattern usage.
Let me check the test case to understand how the contractions are being verified.
Based on the analysis of the code and its usage, I can now provide a final assessment of the change from Pattern.UNICODE_CASE
to Pattern.UNICODE_CHARACTER_CLASS
.
The change is beneficial and safe
The change from Pattern.UNICODE_CASE
to Pattern.UNICODE_CHARACTER_CLASS
in MakeContractionsFilter.java
is an improvement because:
- The patterns use word boundaries (
\b
) with French contractions like "de le" → "du", "à le" → "au", etc. Pattern.UNICODE_CHARACTER_CLASS
provides proper Unicode support for word boundaries and character classes, whileUNICODE_CASE
only affects case-insensitive matching- The patterns already use
CASE_INSENSITIVE
flag for case handling, so no functionality is lost - The change ensures word boundaries work correctly with Unicode characters in French text
The change maintains backward compatibility since:
- The core functionality of contraction replacement remains the same
- The patterns still match the same basic text patterns
- Case-insensitive matching is preserved through the separate
CASE_INSENSITIVE
flag
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Search for tests that verify Unicode edge cases in French contractions
# Expected: Find test cases that cover Unicode scenarios
# Look for test files related to French contractions
fd -e java -e xml "MakeContractionsFilter.*Test"
# Search for Unicode test patterns in test files
rg -l "(?i)unicode|[\u0080-\uffff]" $(fd -e java -e xml "MakeContractionsFilter.*Test")
Length of output: 302401
Script:
#!/bin/bash
# Search for other usages of UNICODE_CASE and UNICODE_CHARACTER_CLASS flags in the codebase
rg -l "Pattern\.(UNICODE_CASE|UNICODE_CHARACTER_CLASS)"
# Search for tests related to MakeContractionsFilter
fd -e java "MakeContractionsFilter.*Test"
Length of output: 3950
Script:
#!/bin/bash
# Get the implementation of MakeContractionsFilter
rg -A 10 -B 10 "class MakeContractionsFilter" languagetool-language-modules/fr/src/main/java/org/languagetool/rules/fr/MakeContractionsFilter.java
# Look for any tests that verify French contractions
rg -l "de le|à le|de les|à les" languagetool-language-modules/fr/src/test/
Length of output: 1283
Script:
#!/bin/bash
# Get the relevant test content
rg -A 10 -B 10 "de le|à le|de les|à les" languagetool-language-modules/fr/src/test/java/org/languagetool/language/FrenchTest.java
# Also check the parent class to understand the contract
rg -A 10 -B 10 "class AbstractMakeContractionsFilter" languagetool-core/src/main/java/org/languagetool/rules/AbstractMakeContractionsFilter.java
Length of output: 2281
languagetool-core/src/main/java/org/languagetool/rules/patterns/RegexAntiPatternFilter.java (1)
45-45
: LGTM! The addition of UNICODE_CHARACTER_CLASS flag improves Unicode support.
The change enhances regex matching by properly handling Unicode character classes, which is crucial for natural language processing.
Let's verify JDK compatibility and similar patterns in the codebase:
✅ Verification successful
The addition of Pattern.UNICODE_CHARACTER_CLASS flag is appropriate and compatible with JDK 17
The codebase is configured to use Java 17 as shown in the pom.xml:
<maven.compiler.source>17</maven.compiler.source>
<maven.compiler.target>17</maven.compiler.target>
The Pattern.UNICODE_CHARACTER_CLASS flag was introduced in Java 7 and is fully supported in Java 17. The flag improves Unicode support by making character classes like \w, \d, \s Unicode-aware, which is particularly important for natural language processing tasks.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for JDK version requirements and similar pattern usage
# Check if other regex patterns might need the same flag
rg "Pattern\.compile\([^,]+" --type java
# Look for JDK version specification
fd "pom.xml" --exec grep -A 5 -B 5 "<java.version>"
Length of output: 239437
Script:
#!/bin/bash
# Check for JDK version requirements in pom.xml files
fd "pom.xml" -x grep -A 5 -B 5 "java.version|source.version|target.version|maven.compiler.source|maven.compiler.release"
# Check for JDK version in GitHub workflows
fd "\.ya?ml$" -g '.github/workflows/**' -x grep -A 5 -B 5 "java-version"
Length of output: 301
Script:
#!/bin/bash
# Check for JDK version requirements in pom.xml files with simpler grep
find . -name "pom.xml" -type f -exec grep -A 2 -B 2 "<java\.version>\|<source>\|<target>\|<release>" {} \;
# Check for Java version in any workflow files
find . -name "*.yml" -o -name "*.yaml" -type f -exec grep -A 2 -B 2 "java-version\|jdk" {} \;
# Look for Java version related compiler arguments
find . -name "pom.xml" -type f -exec grep -A 2 -B 2 "maven.compiler.\|<source>\|<target>" {} \;
Length of output: 10827
languagetool-core/src/main/java/org/languagetool/tokenizers/SrxTools.java (3)
34-34
: LGTM: Required import for Pattern class
The addition of Pattern import is necessary for using Pattern.UNICODE_CHARACTER_CLASS flag.
63-64
: LGTM: Enhanced Unicode support in tokenization
Good improvement to add Unicode character class support via Pattern.UNICODE_CHARACTER_CLASS flag. The implementation is clean and uses modern Java features (Map.of).
63-64
: Verify impact on existing tokenization behavior
The addition of UNICODE_CHARACTER_CLASS flag might affect how certain characters are tokenized. Please ensure this change:
- Maintains backward compatibility with existing tokenization behavior
- Has been tested with various Unicode text samples
Let's check for test coverage and similar pattern flag usage:
✅ Verification successful
Let's generate additional verification scripts to check for test coverage and behavior:
This will help us verify if there are adequate tests covering Unicode character handling in the tokenization process, particularly with the SRX implementation.
Let's do one final check to understand the SRX tokenization behavior:
This will help us understand the existing test coverage and usage patterns of SrxTextIterator to make a final determination about the impact of the UNICODE_CHARACTER_CLASS flag.
Based on the gathered evidence, I can now provide a final response:
Pattern.UNICODE_CHARACTER_CLASS flag addition is safe and consistent with existing usage
The addition of UNICODE_CHARACTER_CLASS flag to SrxTools is safe because:
- The flag is already widely used across multiple language-specific tokenizers (Ukrainian, French, Spanish, etc.) for Unicode text handling
- The codebase has extensive tokenization test coverage with language-specific SRX tokenizer tests
- The change aligns with existing pattern matching behavior in other parts of the codebase
The modification improves Unicode support while maintaining backward compatibility since:
- It only affects how Unicode properties are interpreted in regular expressions
- Other tokenizers in the codebase already use this flag successfully
- The extensive test suite across multiple languages will catch any regressions
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify test coverage and consistent pattern flag usage
# Check for test files related to SrxTools
echo "Looking for related test files..."
fd -e java -e xml "Srx.*Test|test.*Srx"
# Check for other usages of DEFAULT_PATTERN_FLAGS_PARAMETER
echo "Checking for other DEFAULT_PATTERN_FLAGS_PARAMETER usage..."
rg "DEFAULT_PATTERN_FLAGS_PARAMETER" -A 2
# Check for other UNICODE_CHARACTER_CLASS usage
echo "Checking for consistent UNICODE_CHARACTER_CLASS usage..."
rg "UNICODE_CHARACTER_CLASS" -A 2
Length of output: 15828
Script:
#!/bin/bash
# Check for SRX test files and configurations
echo "Looking for SRX test files and configurations..."
fd -e xml ".*srx$"
# Check for test classes that might test SrxTools
echo "Looking for test classes..."
rg -l "class.*Test.*extends" | rg -i "srx|tokeniz"
# Check for any existing test methods related to Unicode handling
echo "Checking for Unicode-related tests..."
rg -B 2 -A 2 "(?i)test.*unicode|unicode.*test"
Length of output: 1117
Script:
#!/bin/bash
# Check for SrxTools usage and its tests
echo "Checking for SrxTools usage patterns..."
rg -A 5 "new SrxTextIterator"
# Look for existing tokenization tests
echo "Looking for tokenization test files..."
fd -e java "TokenizerTest|SrxTest"
# Check the actual content of any found test files
echo "Checking test file contents..."
rg -B 2 -A 5 "void test.*Tokenizer|void test.*Srx"
Length of output: 8017
languagetool-language-modules/es/src/main/java/org/languagetool/tokenizers/es/SpanishWordTokenizer.java (2)
46-46
: LGTM! Proper Unicode support for regex pattern.
The change from Pattern.UNICODE_CASE
to Pattern.UNICODE_CHARACTER_CLASS
is correct. This flag is more appropriate for proper Unicode handling in JDK 19, especially for word boundaries (\b
) in non-ASCII text.
46-46
: Verify pattern behavior with Spanish-specific texts.
Please ensure the pattern works correctly with various Spanish texts, especially those containing:
- Ordinal numbers with accented characters
- Numbers at word boundaries with surrounding Spanish-specific characters
✅ Verification successful
Let me search for specific test cases in Spanish tokenizer tests.
Let me try a different approach to find test cases and pattern usage.
Pattern behavior verified with comprehensive test coverage and grammar rules
The pattern is well-verified through:
-
Test cases in
SpanishWordTokenizerTest.java
covering:- Basic ordinals: "1.º", "21.ª", "45.º"
- Alternative forms: "21.a", "45.o"
- Complex cases: "11.as Jornadas"
-
Grammar rules in
grammar.xml
handling:- Gender agreement (º/ª)
- Spacing rules with ordinals
- Various ordinal number formats
-
Disambiguation rules in
disambiguation.xml
covering:- Ordinal patterns with numbers
- Special cases and exceptions
The pattern \\b([\\d]+)\\.(º|ª|o|a|er|os|as)\\b
correctly handles all Spanish ordinal number variations and is supported by extensive test cases and grammar rules.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Search for test cases in the Spanish test files
# Expected: Find existing test cases that verify ordinal number handling
# Look for test files
fd -e java -e txt "Test.*Spanish" -t f
# Search for ordinal number patterns in test files
rg -l "([0-9]+)(º|ª|o|a|er|os|as)" $(fd -e java -e txt "Test.*Spanish" -t f)
Length of output: 17070
Script:
#!/bin/bash
# Search for test cases in Spanish tokenizer tests
rg -A 5 -B 5 "([0-9]+)(º|ª|o|a|er|os|as)" languagetool-language-modules/es/src/test/java/org/languagetool/tokenizers/es/SpanishWordTokenizerTest.java
# Also check for any other Spanish test files that might contain ordinal number tests
fd -e java "Test.*Spanish" languagetool-language-modules/es/ -x rg -l "([0-9]+)(º|ª|o|a|er|os|as)" {}
Length of output: 257
Script:
#!/bin/bash
# Look for the test class and its content
fd -e java "SpanishWordTokenizerTest" -x cat {}
# Also check for any test resources that might contain ordinal numbers
fd -t f -e txt "test.*es" languagetool-language-modules/es/src/test/resources/ -x cat {}
# Look for similar patterns in other Spanish-related files
rg -A 3 "Pattern\\.compile.*º|ª" languagetool-language-modules/es/
Length of output: 27141
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/pt.sor (1)
181-181
: LGTM! Important fix for ordinal number conversion.
The addition of the (?U)
flag (UNICODE_CHARACTER_CLASS with reluctant quantifiers) is a crucial fix that prevents over-capturing in the ordinal number feminine conversion pattern. This change ensures proper handling of sentence ranges by making the .*
quantifiers non-greedy, which is especially important when converting masculine ordinal numbers (e.g., "primeiro" → "primeira") within larger text contexts.
Let's verify the impact of this change on ordinal number conversions:
✅ Verification successful
Let me generate new shell scripts to better analyze the ordinal number patterns across language modules:
The first command will help us understand how other language modules handle ordinal feminine conversions, while the second will show if the (?U) flag is used in similar contexts elsewhere. This will give us better context about whether this pattern is consistent across the codebase.
Let me fix the path pattern and try again:
This should show us the ordinal feminine patterns in Portuguese, Spanish and French language modules, and any usage of the (?U) flag in these files.
Based on the search results, I can now provide a definitive response:
The (?U) flag addition is a valid and important fix for Portuguese ordinal feminine conversion.
The search results show that:
- The (?U) flag is only used in Portuguese (pt.sor) for ordinal feminine conversion
- Spanish (es.sor) and French (fr.sor) use different patterns for ordinal feminine conversion
- The pattern is specifically used to handle word boundary cases with 'o' to 'a' conversion in Portuguese ordinals
The non-greedy quantifier is essential here because:
- It prevents over-capturing in cases where multiple ordinal numbers appear in sequence
- It ensures precise word boundary matching when converting masculine ordinals ending in 'o' to feminine forms ending in 'a'
- Other Romance languages (Spanish, French) use different patterns that don't require this flag
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any other ordinal number conversion patterns that might need similar fixes
# Search for similar patterns in other language modules
rg -g '*.sor' '(?U)|[.]\\*.*ordinal'
# Look for any related test cases
rg -g '*Test.java' 'ordinal.*feminine|feminine.*ordinal'
Length of output: 156748
Script:
#!/bin/bash
# Search for similar ordinal feminine conversion patterns in other language modules
rg -A 2 -B 2 '== ordinal-feminine ==' 'languagetool-language-modules/*/src/main/resources/org/languagetool/resource/*/**.sor'
# Look for other patterns using (?U) flag in .sor files
rg '(?U)' 'languagetool-language-modules/*/src/main/resources/org/languagetool/resource/*/**.sor'
Length of output: 479
Script:
#!/bin/bash
# Search for similar ordinal feminine conversion patterns in other language modules
rg -A 2 -B 2 '== ordinal-feminine ==' 'languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/pt.sor' 'languagetool-language-modules/es/src/main/resources/org/languagetool/resource/es/es.sor' 'languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/fr.sor'
# Look for other patterns using (?U) flag in .sor files
rg '(?U)' 'languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/pt.sor' 'languagetool-language-modules/es/src/main/resources/org/languagetool/resource/es/es.sor' 'languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/fr.sor'
Length of output: 67989
languagetool-language-modules/fr/src/main/java/org/languagetool/tokenizers/fr/FrenchWordTokenizer.java (1)
67-67
: LGTM! Improved Unicode support for number formatting patterns
The change from UNICODE_CASE to UNICODE_CHARACTER_CLASS is appropriate for these digit patterns as it provides better Unicode support for handling international number formats, particularly French thousand separators.
Also applies to: 69-69
languagetool-language-modules/es/src/main/java/org/languagetool/language/Spanish.java (1)
318-318
: LGTM: Pattern.UNICODE_CHARACTER_CLASS flag addition improves Unicode support
The addition of Pattern.UNICODE_CHARACTER_CLASS
flag is appropriate for JDK 19 compatibility and ensures proper handling of Unicode characters in Spanish contractions.
Let's verify the pattern matches expected Spanish contractions:
✅ Verification successful
Pattern correctly matches Spanish contractions requiring transformation
The verification confirms that the regex pattern \b([Aa]|[Dd]e) e(l)\b
with Pattern.UNICODE_CHARACTER_CLASS
flag correctly matches the expected Spanish contractions that should be transformed:
- Matches "a el" → should become "al"
- Matches "de el" → should become "del"
- Case-insensitive for the first word ("A el" also matches)
- Correctly excludes "él" (accented) cases which should not be contracted
The pattern implementation is correct for its intended purpose of identifying Spanish contractions that need to be transformed.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Create a test file with various Spanish contractions
echo 'a el
de el
A el
De EL
a él
de él' > test_contractions.txt
# Test the pattern
echo "Testing contraction matches:"
rg -p '\b([Aa]|[Dd]e) e(l)\b' test_contractions.txt
Length of output: 184
languagetool-language-modules/ca/src/main/java/org/languagetool/language/Catalan.java (2)
48-50
: LGTM: Improved Unicode support in apostrophe patterns
The addition of Pattern.UNICODE_CHARACTER_CLASS
flag enhances the robustness of these patterns by properly handling Unicode characters in Catalan text.
422-442
: Verify the impact of Unicode character class changes
The switch from Pattern.UNICODE_CASE
to Pattern.UNICODE_CHARACTER_CLASS
for these patterns is significant. While it improves Unicode support, we should verify that it doesn't affect existing functionality.
✅ Verification successful
Based on the extensive test coverage found, I can now provide a definitive response about the Unicode character class changes.
The switch to Pattern.UNICODE_CHARACTER_CLASS is safe and verified
The test results show comprehensive testing of Unicode character handling in Catalan text, including:
- Extensive tests for diacritical marks (à, è, é, í, ò, ó, ú)
- Proper handling of apostrophes and contractions
- Correct tokenization of words with Unicode characters
- Accurate case transformations with accented characters
- Proper suggestions with accented characters
- Handling of special punctuation marks (·, —)
The test coverage demonstrates that the patterns work correctly with:
- Word boundaries (\b)
- Case-insensitive matching
- Character classes
- Unicode properties
The change from UNICODE_CASE to UNICODE_CHARACTER_CLASS is actually an improvement as it provides better Unicode support while maintaining all existing functionality, as evidenced by the passing tests.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Search for test cases that might be affected by the pattern changes
# Look for test cases involving these patterns
rg -l "CA_CONTRACTIONS|CA_APOSTROPHES|POSSESSIUS_[vV]|CA_REMOVE_SPACES" \
--type java \
--glob "*Test.java"
# Search for Unicode-specific test cases
ast-grep --pattern 'class $_ {
$$$
@Test
$$$
void $_($$$) {
$$$
Pattern.$$$
$$$
}
}'
Length of output: 214
Script:
#!/bin/bash
# Let's try a different approach to find relevant test cases
# Search for test files in the Catalan language module
fd "Test.*\.java$" languagetool-language-modules/ca
# Look for pattern usage in test files with context
rg -C 3 "Pattern\.(UNICODE_CHARACTER_CLASS|UNICODE_CASE)" \
languagetool-language-modules/ca \
--type java \
--glob "*Test.java"
# Search for test methods that verify text transformations
rg -C 3 "assertEquals.*[\"'][^\"']*[àèéíòóúÀÈÉÍÒÓÚ][^\"']*[\"']" \
languagetool-language-modules/ca \
--type java \
--glob "*Test.java"
# Look for specific pattern usage in the Catalan class
rg -C 3 "(CA_CONTRACTIONS|CA_APOSTROPHES|POSSESSIUS_[vV]|CA_REMOVE_SPACES)" \
languagetool-language-modules/ca/src/main/java/org/languagetool/language/Catalan.java
Length of output: 106700
Script:
#!/bin/bash
# Let's search for test cases that specifically test these patterns and their behavior
# Look for test cases that verify contractions and apostrophes
rg -B 2 -A 2 "CA_(CONTRACTIONS|APOSTROPHES|REMOVE_SPACES)" \
languagetool-language-modules/ca/src/test/java/org/languagetool/language/CatalanTest.java
# Search for test methods that verify text transformations with Unicode characters
rg -B 2 -A 2 "adaptSuggestion|toAdvancedTypography" \
languagetool-language-modules/ca/src/test/java/org/languagetool/language/CatalanTest.java
# Look for any Unicode-related test cases
rg -B 2 -A 2 "Pattern\.UNICODE_" \
languagetool-language-modules/ca/src/test/java/org/languagetool/language/CatalanTest.java
# Check if there are any test cases for possessive patterns
rg -B 2 -A 2 "POSSESSIUS_[vV]" \
languagetool-language-modules/ca/src/test/java/org/languagetool/language/CatalanTest.java
Length of output: 1062
Script:
#!/bin/bash
# Let's try searching in the test directory with a broader scope
# Find all test files that might contain relevant test cases
fd "Test\.java$" languagetool-language-modules/ca/src/test
# Look for test cases that verify text transformations with Unicode characters
rg -B 2 -A 2 "adaptSuggestion|toAdvancedTypography" \
languagetool-language-modules/ca/src/test
# Look for test cases involving pattern matching with Unicode
rg -B 2 -A 2 "Pattern\.UNICODE_" \
languagetool-language-modules/ca/src/test
# Search for test cases that use these specific patterns
rg -B 2 -A 2 "(CA_CONTRACTIONS|CA_APOSTROPHES|POSSESSIUS_[vV]|CA_REMOVE_SPACES)" \
languagetool-language-modules/ca/src/test
# Look for test methods that verify Catalan-specific text transformations
rg -B 2 -A 2 "assertEquals.*[\"'][^\"']*[àèéíòóúÀÈÉÍÒÓÚ][^\"']*[\"']" \
languagetool-language-modules/ca/src/test
Length of output: 89174
languagetool-language-modules/ca/src/main/java/org/languagetool/rules/ca/PronomsFeblesHelper.java (2)
556-556
: LGTM! Enhanced Unicode support for Catalan text processing.
The addition of Pattern.UNICODE_CHARACTER_CLASS
flag improves the pattern's ability to handle Catalan text with accented characters, making the pronoun transformation more robust.
556-556
: Consider adding test cases for accented characters.
To ensure the enhanced Unicode support works correctly, consider adding test cases that verify the pattern matches Catalan text with accented vowels (à, è, é, í, ò, ó, ú).
languagetool-language-modules/fr/src/main/java/org/languagetool/language/French.java (1)
53-55
: LGTM! Addition of UNICODE_CHARACTER_CLASS flag improves Unicode support.
The addition of Pattern.UNICODE_CHARACTER_CLASS
flag to the apostrophe patterns enhances their ability to handle Unicode characters in French text, making the patterns more robust.
Let's verify the test coverage for these patterns:
✅ Verification successful
Let me analyze the test coverage for the apostrophe patterns. I can see that FrenchTest.java
contains several test cases for the toAdvancedTypography
method, but I need to verify if these tests specifically cover the patterns affected by the Unicode flag change. Let me check the implementation:
Let me check one final thing to understand the BEFORE_APOS
constant and its usage:
Addition of UNICODE_CHARACTER_CLASS flag is verified and well-tested
The changes to the apostrophe patterns are well-covered by existing tests in FrenchTest.java
. The patterns handle French elisions (l', c', d', etc.) and specific words (qu', jusqu', etc.) with proper Unicode support. The test cases verify various combinations:
- Basic apostrophe cases: "C'est"
- Mixed quotes: "C'est l'"homme"" → "C'est l'« homme »"
- Nested quotes: "'C'est'"
The addition of Pattern.UNICODE_CHARACTER_CLASS
flag enhances the robustness of these patterns for Unicode text processing without changing their core functionality, which is confirmed by the existing test coverage.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for test cases that verify the apostrophe patterns
# Expected: Find test files containing assertions for these patterns
# Search for test files
rg -l "BEFORE_APOS_PATTERN" --type java
# Search for specific test cases
rg "toAdvancedTypography.*['''\"]" --type java -A 3
Length of output: 9750
Script:
#!/bin/bash
# Check the implementation of French.java to understand how these patterns are used
rg -A 10 "BEFORE_APOS_PATTERN" languagetool-language-modules/fr/src/main/java/org/languagetool/language/French.java
Length of output: 1952
Script:
#!/bin/bash
# Check the BEFORE_APOS constant definition
rg "BEFORE_APOS\s*=" languagetool-language-modules/fr/src/main/java/org/languagetool/language/French.java
# Check test cases that might contain words matching BEFORE_APOS patterns
rg "l'[\"']|[cdjlmnst]'[\"']" languagetool-language-modules/fr/src/test/java/org/languagetool/language/FrenchTest.java
Length of output: 463
languagetool-core/src/main/java/org/languagetool/rules/AbstractUnitConversionRule.java (2)
199-199
: LGTM! Enhanced Unicode support for unit pattern matching.
The addition of Pattern.UNICODE_CHARACTER_CLASS flag improves the handling of Unicode spaces and word boundaries in unit measurements.
Let's verify the impact on unit pattern matching with various Unicode characters:
83-83
: LGTM! Adding Unicode support to number range detection.
The addition of Pattern.UNICODE_CHARACTER_CLASS flag improves the handling of Unicode digits and word boundaries in number range detection.
Let's verify the impact on number range detection with various Unicode digits:
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/entities/hyphenised.ent (1)
2-2
: Addition of (?U) flag improves pattern matching behavior.
The addition of the ungreedy flag (?U)
is a good improvement as it prevents over-matching in complex patterns by making quantifiers match as few characters as possible.
pom.xml (3)
194-194
:
Verify dependency updates and address TODO comments.
Several critical dependency updates have been made with TODO comments indicating potential issues:
org.sweble.wikitext
: Update from 1.1.1 might cause compilation errors in wikipedia, dev and rpm modules- Missing dependencies after updates:
- jopt-simple.jar
- commons-jxpath.jar
- log4j.jar
Run the following script to check for missing dependencies and potential compilation issues:
Please ensure all TODO items are addressed and verify the compatibility of the updated dependencies before merging.
194-194
: Verify changelog for net.loomchild.segment 2.0.4.
The version update from 2.0.3 to 2.0.4 might include changes relevant to JDK 19 compatibility and regex handling.
Run the following script to check the changelog and usage:
Line range hint 116-118
: Clarify Java version requirements.
The PR title mentions "Jdk19" but the project is configured to use Java 17:
<maven.compiler.source>17</maven.compiler.source>
<maven.compiler.target>17</maven.compiler.target>
Run the following script to check Java version requirements across the codebase:
Please clarify:
- Should the Java version be updated to 19?
- If not, are the regex fixes compatible with Java 17?
✅ Verification successful
Java version mismatch resolved - no update needed
The codebase is correctly configured to use Java 17:
- The main pom.xml explicitly sets both source and target to Java 17
- No Java 19-specific features are being used in the codebase
- The compiler plugin is properly configured to enforce Java 17 compatibility
The PR title mentioning "Jdk19" appears to be referring to the development environment rather than a requirement to update the project's Java version. The regex improvements in this PR are compatible with Java 17.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check Java version requirements
# Look for Java version-specific features or annotations
echo "Checking for Java version-specific features..."
rg "requires java\."
# Check for module-info.java files
echo "Checking for Java modules..."
fd -g "module-info.java"
Length of output: 335
Script:
#!/bin/bash
# Let's try a different approach to check Java version requirements
# Check for any Java version-specific API usage or imports
echo "Checking for Java version-specific APIs..."
rg -l "java\.lang\.invoke|java\.util\.concurrent\.Flow|sealed class|permits|record\s+\w+|--enable-preview"
# Check pom.xml files for Java version configurations
echo "Checking all pom.xml files for Java version settings..."
fd -g "pom.xml" -x grep -l "maven.compiler\|java.version"
# Look for any GitHub workflow files that might specify Java version
echo "Checking GitHub workflows for Java version..."
fd -g "*.yml" -g "*.yaml" --search-path .github/workflows -x grep -l "java-version"
Length of output: 4716
Script:
#!/bin/bash
# Let's check the actual content of pom.xml files and GitHub workflows
# Check main pom.xml for all Java version related configurations
echo "Checking main pom.xml for Java version settings..."
cat pom.xml | grep -A 2 -B 2 "maven.compiler\|java.version"
# Check GitHub workflows for Java version
echo "Checking GitHub workflows..."
find .github/workflows -type f -name "*.yml" -o -name "*.yaml" -exec cat {} \; | grep -A 2 -B 2 "java-version"
# Check if there are any Java 19 specific features in the changed files
echo "Checking for Java 19 specific features..."
rg "switch\s*\([^)]+\)\s*{|sealed\s+class|permits|record\s+class|pattern\s+matching|virtual\s+threads"
Length of output: 2608
languagetool-core/src/main/java/org/languagetool/rules/patterns/XMLRuleHandler.java (4)
21-28
: Approved: Added necessary imports for new functionality
The added imports are appropriate and necessary for the new classes and interfaces used in the updated code.
Line range hint 313-318
: Handle contradictory regex flags in hasPosixCharacterClass
The method hasPosixCharacterClass
throws a RuntimeException
when both ?i
(case-insensitive) and \\p{Lu}/\\p{Ll}
(case-sensitive) are present:
if (res && s.contains("(?i")) {
throw new RuntimeException("Contradicting regex contains both '?i' (case-insensitive) and \\p{Lu}/\\p{Ll} (case-sensitive): " + s + " in rule " + id);
}
Ensure that this exception handling aligns with the overall error management strategy of the application.
Check if similar patterns exist elsewhere and whether they are handled consistently.
✅ Verification successful
Let me search for more specific information about the regex pattern handling in XMLRuleHandler.
Based on the code analysis, I can now provide a final response:
The RuntimeException for contradictory regex flags is a valid error handling approach
The exception is thrown when a regex pattern contains both case-insensitive (?i
) and case-sensitive (\p{Lu}/\p{Ll}
) POSIX character classes. This is a deliberate design choice to:
- Prevent ambiguous pattern matching behavior
- Maintain compatibility with Java 15+ regex handling (as noted in the code comment)
- Provide clear feedback to rule authors about contradictory pattern specifications
The error handling aligns with the application's overall strategy as:
- It fails fast with a clear error message that helps identify the problematic rule
- It prevents silent failures or unexpected pattern matching behavior
- The message includes both the conflicting pattern and rule ID for easy debugging
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find other instances where regex flags may conflict.
# Search for regex patterns containing both (?i) and \\p{Lu} or \\p{Ll}
rg '\\\\p\{[LlUu]\}' -tjava | rg '\(\?i'
# Verify the exception handling consistency across the codebase
rg 'throw new RuntimeException' -tjava -A 2
Length of output: 214459
Script:
#!/bin/bash
# Search for regex pattern handling in XMLRuleHandler
rg -A 5 -B 5 'hasPosixCharacterClass' -tjava
# Look for other regex pattern validations
rg -A 5 'pattern|regex' -tjava languagetool-core/src/main/java/org/languagetool/rules/patterns/XMLRuleHandler.java
Length of output: 15020
Line range hint 130-172
: Verify the initialization and usage of phraseMap
and phrasePatternTokens
The methods preparePhrase(Attributes attrs)
and finalizePhrase()
introduce logic that manipulates phraseMap
and phrasePatternTokens
. While phraseMap
is lazily initialized in finalizePhrase()
, ensure that preparePhrase()
does not access phraseMap
before it is initialized to avoid potential NullPointerException
. Additionally, verify that the logic correctly handles cases when patternTokens
is empty or not.
Run the following script to examine the calls and usage of preparePhrase()
and finalizePhrase()
:
✅ Verification successful
No NullPointerException risk in phraseMap usage
The code is safe from NullPointerException for the following reasons:
phraseMap
is lazily initialized infinalizePhrase()
before any usage:
if (phraseMap == null) {
phraseMap = new HashMap<>();
}
-
preparePhrase()
is only called afterfinalizePhrase()
in the XML parsing flow, as evidenced by the handler pattern, ensuringphraseMap
is already initialized when accessed. -
phrasePatternTokens
is eagerly initialized at declaration:
protected List<ArrayList<PatternToken>> phrasePatternTokens = new ArrayList<>();
The code follows a proper initialization pattern and the review comment's concern about potential NPE can be dismissed.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the order and usage of `preparePhrase()` and `finalizePhrase()` methods.
# Find all calls to `preparePhrase` and `finalizePhrase`
rg 'preparePhrase\(' -tjava -A 3
rg 'finalizePhrase\(' -tjava -A 3
# Check for any usage of `phraseMap` in `preparePhrase` before initialization
rg -A 5 'void preparePhrase' -tjava | rg 'phraseMap'
# Verify if `patternTokens` is appropriately handled
rg -A 10 'void preparePhrase' -tjava | rg 'patternTokens\.isEmpty\(\)'
Length of output: 214459
Script:
#!/bin/bash
# Search for phraseMap and phrasePatternTokens usage in preparePhrase and finalizePhrase methods
rg -A 5 'void preparePhrase|void finalizePhrase' -tjava languagetool-core/src/main/java/org/languagetool/rules/patterns/XMLRuleHandler.java
# Search for phraseMap initialization
rg 'phraseMap\s*=\s*new' -tjava languagetool-core/src/main/java/org/languagetool/rules/patterns/XMLRuleHandler.java
# Search for phrasePatternTokens usage
rg 'phrasePatternTokens' -tjava languagetool-core/src/main/java/org/languagetool/rules/patterns/XMLRuleHandler.java
# Search for class fields/variables
rg '^(\s+)(protected|private).*phraseMap|phrasePatternTokens' -tjava languagetool-core/src/main/java/org/languagetool/rules/patterns/XMLRuleHandler.java
Length of output: 2047
Line range hint 700-715
: Potential thread-safety issues with shared mutable state
The use of non-thread-safe HashMap
instances internedStrings
, internedMatchers
, and internedPos
without synchronization may lead to race conditions if instances of XMLRuleHandler
are accessed concurrently by multiple threads. To ensure thread safety, consider using ConcurrentHashMap
or synchronizing access to these maps.
Run the following script to check if XMLRuleHandler
instances are used in a multithreaded context:
languagetool-commandline/src/test/java/org/languagetool/commandline/MainTest.java (2)
37-37
: Approval of class inheritance change
The MainTest
class no longer extends AbstractSecurityTestCase
. This change simplifies the class hierarchy and is appropriate given the new testing approach using ProcessBuilder
. The removal appears to have no adverse effects on the test setup.
Line range hint 108-114
: Redirection of System.out
and System.err
in setUp
and tearDown
The redirection of System.out
and System.err
streams to capture output for assertions is correctly implemented. Restoring the original streams in tearDown
ensures no side effects on other tests.
languagetool-core/src/main/resources/org/languagetool/resource/segment.srx (3)
5760-5760
: Verify Numbered List Handling in Regex
The regular expression on line 5760:
<beforebreak>\b\d{1,3}\.[\h]+</beforebreak>
This pattern matches one to three digits followed by a period and horizontal whitespace. Ensure that this does not inadvertently match decimal numbers or other numeric formats within sentences, which could cause improper sentence splitting.
To confirm the correctness, you might run tests to ensure that decimal numbers (e.g., "3.14") are not mistakenly split.
5833-5833
: Ensure Correct Handling of Currency and Quantity Abbreviations
In the regex on line 5833:
<beforebreak>\b(тис|млн|млрд|грн)\.[\h\v]*</beforebreak>
<afterbreak>[\h\v]*(\d|[КМ]Вт)</afterbreak>
This pattern deals with abbreviations for quantities and currency (thousand, million, billion, hryvnia). Verify that the pattern correctly prevents sentence breaks after these abbreviations, especially when followed by numbers or units.
Ensure that sentences like "Вартість становить 5 млн. грн. Це найнижча ціна." are not incorrectly split.
5859-5859
: Check the Context of 'нар.' Abbreviation
The regex on line 5859:
<beforebreak>\bнар\.[\h\v]*</beforebreak>
<afterbreak>([0-9]|бл\.|арт\.)</afterbreak>
Ensure that the abbreviation нар.
(народження) is correctly handled in contexts like dates of birth and does not interfere with other usages that might not signify sentence boundaries.
Verify that sentences such as "Він нар. у 1980 р." are processed correctly without unintended splits.
<message>&glalong;</message> | ||
<suggestion>p. a.</suggestion> | ||
<url>https://de.wiktionary.org/wiki/p._a.</url> | ||
<short>⪥</short> | ||
<example correction="p. a.">Hier gibt es eine Rendite von 2 % <marker>p.a.</marker></example> | ||
</rule> | ||
<rule><!--17--> | ||
<regexp case_sensitive='yes'>Dr\.\s?med\.\s?(dent|vet)\.</regexp> | ||
<regexp case_sensitive='yes'>Dr\.\s?med\.[ \t]?(dent|vet)\.</regexp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix inconsistent whitespace handling in academic title patterns
The patterns mix \s?
and [ \t]?
for whitespace handling. This inconsistency could lead to different matching behavior.
-<regexp case_sensitive='yes'>Dr\.\s?med\.[ \t]?(dent|vet)\.</regexp>
-<regexp case_sensitive='yes'>Dr\.\s?rer\.[ \t]?(nat|pol)\.</regexp>
+<regexp case_sensitive='yes'>Dr\.[ \t]?med\.[ \t]?(dent|vet)\.</regexp>
+<regexp case_sensitive='yes'>Dr\.[ \t]?rer\.[ \t]?(nat|pol)\.</regexp>
Also applies to: 79590-79590
@@ -79534,7 +79534,7 @@ | |||
<example correction="n. V.">Sprechstunde <marker>n.V.</marker></example> | |||
</rule> | |||
<rule><!--12--> | |||
<regexp>(u)\.\s?(v)\.\s?(m|a)\.</regexp> | |||
<regexp>(u)\.\s?(v)\.[ \t]?(m|a)\.</regexp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove redundant whitespace check in u.v.m. pattern
The pattern contains both \s?
and [ \t]?
, which is redundant and could match more whitespace than intended.
-<regexp>(u)\.\s?(v)\.[ \t]?(m|a)\.</regexp>
+<regexp>(u)\.[ \t]?(v)\.[ \t]?(m|a)\.</regexp>
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
<regexp>(u)\.\s?(v)\.[ \t]?(m|a)\.</regexp> | |
<regexp>(u)\.[ \t]?(v)\.[ \t]?(m|a)\.</regexp> |
private static final Pattern SPACE_DIGITS0 = Pattern.compile("([\\d]{4}) ", | ||
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE); | ||
private static final Pattern SPACE_DIGITS = Pattern.compile("([\\d]) ([\\d][\\d][\\d])\\b", | ||
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE); | ||
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS); | ||
private static final Pattern SPACE_DIGITS2 = Pattern.compile("([\\d]) ([\\d][\\d][\\d]) ([\\d][\\d][\\d])\\b", | ||
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE); | ||
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Maintain consistency in Unicode pattern flags
The SPACE_DIGITS0 pattern still uses UNICODE_CASE while SPACE_DIGITS and SPACE_DIGITS2 have been updated to UNICODE_CHARACTER_CLASS. This inconsistency could lead to unexpected behavior when handling Unicode digits and spaces.
Apply this change to maintain consistency:
private static final Pattern SPACE_DIGITS0 = Pattern.compile("([\\d]{4}) ",
- Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
+ Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
private static final Pattern SPACE_DIGITS0 = Pattern.compile("([\\d]{4}) ", | |
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE); | |
private static final Pattern SPACE_DIGITS = Pattern.compile("([\\d]) ([\\d][\\d][\\d])\\b", | |
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE); | |
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS); | |
private static final Pattern SPACE_DIGITS2 = Pattern.compile("([\\d]) ([\\d][\\d][\\d]) ([\\d][\\d][\\d])\\b", | |
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE); | |
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS); | |
private static final Pattern SPACE_DIGITS0 = Pattern.compile("([\\d]{4}) ", | |
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS); | |
private static final Pattern SPACE_DIGITS = Pattern.compile("([\\d]) ([\\d][\\d][\\d])\\b", | |
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS); | |
private static final Pattern SPACE_DIGITS2 = Pattern.compile("([\\d]) ([\\d][\\d][\\d]) ([\\d][\\d][\\d])\\b", | |
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS); |
private String readProcessOutput(Process process) throws IOException { | ||
try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) { | ||
StringBuilder output = new StringBuilder(); | ||
String line; | ||
while ((line = reader.readLine()) != null) { | ||
output.append(line).append(System.lineSeparator()); | ||
} | ||
return output.toString(); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Prevent potential deadlocks by handling both standard and error streams
In the readProcessOutput
method, only the process's standard output stream is read. If the process writes to its error stream and the buffer fills up, it could cause the process to hang due to unhandled data. It's important to read both the standard output and error streams to prevent this issue.
Apply this diff to read both the standard output and error streams:
private String readProcessOutput(Process process) throws IOException {
- try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
- StringBuilder output = new StringBuilder();
- String line;
- while ((line = reader.readLine()) != null) {
- output.append(line).append(System.lineSeparator());
- }
- return output.toString();
- }
+ StringBuilder output = new StringBuilder();
+ StringBuilder errorOutput = new StringBuilder();
+ try (
+ BufferedReader stdOutReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
+ BufferedReader stdErrReader = new BufferedReader(new InputStreamReader(process.getErrorStream()))
+ ) {
+ String line;
+ while ((line = stdOutReader.readLine()) != null) {
+ output.append(line).append(System.lineSeparator());
+ }
+ while ((line = stdErrReader.readLine()) != null) {
+ errorOutput.append(line).append(System.lineSeparator());
+ }
+ }
+ // Optionally, you can assert on errorOutput if needed
+ return output.toString();
}
This ensures that both output streams are consumed, preventing the process from hanging due to full buffers.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
private String readProcessOutput(Process process) throws IOException { | |
try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) { | |
StringBuilder output = new StringBuilder(); | |
String line; | |
while ((line = reader.readLine()) != null) { | |
output.append(line).append(System.lineSeparator()); | |
} | |
return output.toString(); | |
} | |
} | |
private String readProcessOutput(Process process) throws IOException { | |
StringBuilder output = new StringBuilder(); | |
StringBuilder errorOutput = new StringBuilder(); | |
try ( | |
BufferedReader stdOutReader = new BufferedReader(new InputStreamReader(process.getInputStream())); | |
BufferedReader stdErrReader = new BufferedReader(new InputStreamReader(process.getErrorStream())) | |
) { | |
String line; | |
while ((line = stdOutReader.readLine()) != null) { | |
output.append(line).append(System.lineSeparator()); | |
} | |
while ((line = stdErrReader.readLine()) != null) { | |
errorOutput.append(line).append(System.lineSeparator()); | |
} | |
} | |
// Optionally, you can assert on errorOutput if needed | |
return output.toString(); | |
} |
// int flags = regexCaseSensitive ? 0 : Pattern.CASE_INSENSITIVE|Pattern.UNICODE_CASE; | ||
int flags = regexCaseSensitive ? Pattern.UNICODE_CHARACTER_CLASS : Pattern.CASE_INSENSITIVE|Pattern.UNICODE_CHARACTER_CLASS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirm the replacement of Pattern.UNICODE_CASE
with Pattern.UNICODE_CHARACTER_CLASS
In the previous implementation (now commented out), when regexCaseSensitive
was false
, the flags included Pattern.UNICODE_CASE
to enable case-insensitive matching of Unicode characters. The new code replaces Pattern.UNICODE_CASE
with Pattern.UNICODE_CHARACTER_CLASS
, which extends character classes like \w
, \d
, and \s
to match Unicode characters but does not affect case-insensitive matching.
If the intent is to support both Unicode character classes and case-insensitive matching for Unicode characters when regexCaseSensitive
is false
, you should include both Pattern.UNICODE_CASE
and Pattern.UNICODE_CHARACTER_CLASS
in the flags. Otherwise, case-insensitive matching may only affect ASCII characters.
Consider applying the following change to include both flags:
-int flags = regexCaseSensitive ? Pattern.UNICODE_CHARACTER_CLASS : Pattern.CASE_INSENSITIVE|Pattern.UNICODE_CHARACTER_CLASS;
+int flags = regexCaseSensitive ? Pattern.UNICODE_CHARACTER_CLASS : Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS;
This ensures that both case-insensitive matching and Unicode character classes are enabled when regexCaseSensitive
is false
.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// int flags = regexCaseSensitive ? 0 : Pattern.CASE_INSENSITIVE|Pattern.UNICODE_CASE; | |
int flags = regexCaseSensitive ? Pattern.UNICODE_CHARACTER_CLASS : Pattern.CASE_INSENSITIVE|Pattern.UNICODE_CHARACTER_CLASS; | |
// int flags = regexCaseSensitive ? 0 : Pattern.CASE_INSENSITIVE|Pattern.UNICODE_CASE; | |
int flags = regexCaseSensitive ? Pattern.UNICODE_CHARACTER_CLASS : Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS; |
@@ -5744,7 +5744,7 @@ | |||
</rule> | |||
<!-- Наші в... Лос-Анджелесі --> | |||
<rule break="no"> | |||
<beforebreak>(?U)\b(в|у|на|за|з|із|зі|зо)(\.\.\.|…)[\h\v]*</beforebreak> | |||
<beforebreak>\b(в|у|на|за|з|із|зі|зо)(\.\.\.|…)[\h\v]*</beforebreak> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Review the Use of Word Boundary \b
with Unicode Text
In the regular expression on line 5747:
<beforebreak>\b(в|у|на|за|з|із|зі|зо)(\.\.\.|…)[\h\v]*</beforebreak>
The \b
word boundary may not correctly handle word boundaries in Unicode text for the Ukrainian language. This could lead to incorrect sentence segmentation.
Consider replacing \b
with a more Unicode-aware boundary matcher, such as (?<!\S)
to match the start of a word, or ensure that the regex engine is configured for Unicode word boundaries.
@@ -5915,7 +5915,7 @@ | |||
</rule> | |||
<!-- але розбиваємо «всього 20 м. Почалося» --> | |||
<rule break="no"> | |||
<beforebreak>(?U)(?<!\d[\h\v]*)\bм\.[\h\v]*</beforebreak> | |||
<beforebreak>(?<!\d[\h\v]*)\bм\.[\h\v]*</beforebreak> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Disambiguate the Abbreviation 'м.' Between 'місто' and 'метр'
The regex on line 5918:
<beforebreak>(?<!\d[\h\v]*)\bм\.[\h\v]*</beforebreak>
<afterbreak>[А-ЯІЇЄҐ][а-яіїєґ']</afterbreak>
The abbreviation м.
can stand for both 'місто' (city) and 'метр' (meter). This pattern may incorrectly split sentences when м.
is used as a unit of measurement.
To resolve this, refine the regex to distinguish between the two contexts.
Apply this diff to improve the regex:
-<beforebreak>(?<!\d[\h\v]*)\bм\.[\h\v]*</beforebreak>
+<beforebreak>(?<!\d[\h\v]*)\bм\.[\h\v]+(?=[А-ЯІЇЄҐ][а-яіїєґ'])</beforebreak>
This adjustment ensures that м.
is treated as 'місто' only when followed by a capital letter starting a city name.
@arysin I will update the tests for the sentence ranges to cover the issue that occurred. |
Reopen PR we saw some problems with the sentence ranges.
@arysin
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
pom.xml
file to reflect dependency changes and compatibility checks.Chores