This is a library with classes for any wildcard implementation that finds pattern with * (asterisk) and ? (query) token.
Without regular expression extension (named ext-pcre
- before PHP 5.3) you have no wildcard support within PHP.
See https://www.php.net/manual/en/pcre.installation.php.
The known wildcard behavior (see Wildcard character (@Wikipedia)) is limited for a good implementation.
A small remedy?
We need a kind of "compiled" and "cached" pattern. The implementation of regular expression like preg_match
has a
huge performance. We lost these advantage as of PHP 7.0, because of just-in-time compiled pcre pattern!
- Benchmark
- Wildcard variants
- Possible valid pattern
- Invalid pattern
- Escaping
- Repeating phrases
- Caching
- Wish list
- Appendix
API:
- Matcher (for use of single calls)
- Performer (for use of multiple calls)
- Converter (for use of regular expression)
When we benchmark several methods to match a fitting phrase (1000 random strings), we have a good results without "regular expressions":
Single Byte:
Benchmark | Time | Reference | Difference |
---|---|---|---|
WildcardPerformer |
0.00353789 | 100 % | - |
preg_match |
0.01223898 | 71 % | 71 % |
WildcardMatcher |
0.01635289 | 25 % | 78 % |
Multi Byte (like Unicode):
Benchmark | Time | Reference | Difference |
---|---|---|---|
WildcardPerformer |
0.00660086 | 100 % | - |
mb_ereg_match |
0.01290488 | 48 % | 48 % |
WildcardMatcher |
0.03252506 | 60 % | 79 % |
Function | WildcardMatcher | WildcardPerformer |
---|---|---|
strlen | 9 | 5 (-4) |
substr | 5 | 4 (-1) |
strpos | 5 | 1 (-4) |
chr | 2 | 2 (0) |
Conditions | 57 | 13 (-44) |
!= character | == 1 character | >= 2 characters | Token |
---|---|---|---|
0 | 0 | 0 | (invalid) |
0 | 0 | 1 | ??... / ??...* |
0 | 1 | 0 | ? |
0 | 1 | 1 | ** |
1 | 0 | 0 | (empty string) |
1 | 0 | 1 | ??** (draft) |
1 | 1 | 0 | ?* |
1 | 1 | 1 | * |
?? (2 characters)
???* (2-3 characters)
***
?**
?*?
*?
\\
\?
\*
Please note of this escaping scenarios:
Subject | Pattern | Explanation |
---|---|---|
search\\* |
*\\* |
matches search\ with wildcard, then matches escaped * |
search\\* |
*\\\\* |
matches search with wildcard, then matches escaped \ , then matches * with wildcard |
search\\* |
*\\\\\\* |
matches search with wildcard, then matches escaped \ , then matches escaped * |
search\\* |
*\\\\\\\\* |
matches search with wildcard, then matches escaped \ , then not matches escaped \ - ignore rest of * |
Further explanation:
programmatic | internal | after escaping |
---|---|---|
\\ |
\ |
\ |
\\\\ |
\\ |
\ |
The asterisk (*
) has a problem finding the right position. If several identical phrases can be found in a string,
the search must take place in all positions to find the pattern.
Search: *is?ue (where is "*is")
Subject: this is an asterisk issue
Founds: ^ ^ ^ ^
The simplest solution is to break down the pattern recursively, but it should be not recursively.
The regular expression extension has a caching and compiling strategy to improve performance. The second time the same pattern is called, a tremendous increase in performance is achieved.
To help us to improve also performance, we use a simple key-value caching.
protected array $cachedResults = [];
public function match(string $subject): bool
{
return $this->cachedResults[$subject] ?? $this->cachedResults[$subject] = $this->computePhrases($subject, 0);
}
This is not a preferred solution.
- Remove recursive call of
WildcardPerformer::computePhrases
(see Wikipedia - Matching wildcards) - Remove of
StringFunctionMapper
(too slow) - Combine Matcher and Performer (two advantages in one)
- Caching interface
- New pattern
??**
(0 or 2 characters) or?????**
(0 or 5 characters) - Glob support
- Example:
[abcdef0123456789]
,[0-9a-f]
,[!a-z]
- Use in YAWL:
[0-9a-f]?
(one times)[0-9a-f]?*
(zero or one times)[0-9a-f]*
(zero or N times)[0-9a-f]**
(one or N times)[0-9a-f]x
(one times then followsx
)
- Example:
Common Algorithms:
Non-recursive Algorithms:
Recursive Algorithms:
- Compares a filename or pathname to a pattern by "Apple"
- Stackoverflow - Recursive solutions for glob pattern matching by "Siler"
- NOP - Number Of Packages
- NOC - Number Of Classes
- NOM - Number Of Methods
- LOC – Lines of Code
- CYCLO - Cyclomatic complexity
- CALLS - number of distinct function- and method-calls
- ANDC - Average Number of Derived Classes
- AHH - Average Hierarchy Height