Native String for JS backend #767

Ihromant · 2023-09-29T21:07:37Z


Motivation.
TeaVM is considered a performant library/compiler. There is even a benchmark that compares the speed of TeaVM and GWT.
Still, there is one performance bottleneck in TeaVM. Unfortunately, it's in one of the most used classes: String.
I have a pet project that does a lot of DOM generation and manipulation through the JavaScript API. I noticed that a page with a very large number of DOM elements loads 3-4 times slower when it is built dynamically (get data from a websocket, parse JSON, generate DOM, append it) than when the same page is statically generated. I understand that the bottleneck could be in other places too. But today, while improving String.toLowerCase, I came to understand one of the bottlenecks well.
Simplified example:

public static void main(String[] args) {
    HTMLElement div = HTMLDocument.current().createElement("div");
    div.getStyle().setProperty("position", "relative");
    for (int i = 0; i < 100_500; i++) {
        HTMLElement child = HTMLDocument.current().createElement(generateTag());
        child.getStyle().setProperty("position", "absolute");
        Point p = generatePoint();
        child.getStyle().setProperty("left", p.x() + "px");
        child.getStyle().setProperty("top", p.y() + "px");
    }
}

private static String generateTag() {
    return ThreadLocalRandom.current().nextInt(2) == 0 ? "div" : "span";
}

private static Point generatePoint() {
    return new Point(ThreadLocalRandom.current().nextInt(0, 800), ThreadLocalRandom.current().nextInt(0, 600));
}

private record Point(int x, int y) {}

What do we have here? Literals were successfully inlined, but

$child = ju_Random_nextInt(juc_ThreadLocalRandom_current, 2) ? $rt_s(1) : $rt_s(2); // corresponds to generateTag, two conversions
$child = $div.createElement($rt_ustr($child));
...
var$8 = jl_StringBuilder__init_(); // three lines correspond to p.x() + "px", lots of conversions and transformations
jl_StringBuilder_append(jl_StringBuilder_append0(var$8, var$5), $rt_s(3));
var$9 = jl_StringBuilder_toString(var$8);
var$7.setProperty("left", $rt_ustr(var$9));

Even the first case was not inlined. In the second case there are lots of transformations. First, a constant pool of artificial strings is generated. Then the integer and "px" are appended to a StringBuilder, and finally the builder is converted to a String and that String is converted back to a JavaScript string. It stays the same after minification. All of this instead of simply having the following (which is always correct, given that var$5 is an int):

var$7.setProperty("left", var$5 + "px");

In a big loop this obviously leads to performance degradation that is not needed.
It would be silly to criticize without offering a solution, but today, while I was in the swimming pool, I came up with an idea, which I'm sharing here.

Assumptions.
Here are the assumptions, which are confirmed by documentation.

1. JavaScript strings are effectively char arrays themselves, because they are UTF-16 strings. So the encoding is already fixed and we can treat a JS string as a char[] array.
2. The most used methods common to both run exactly the same as in Java. Strings are immutable in both Java and JS, so there are no side effects or differences.
   - JS charCodeAt <-> Java charAt
   - Java toCharArray can be easily generated
   - JS codePointAt <-> Java codePointAt
   - and so on...
3. There should be no bugs in the methods common to Java and JS for the simplest cases.
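The Java half of these assumptions can be checked with plain JDK calls. The sketch below is only an illustration (the class name Utf16Demo is made up for it): it shows that a Java String is a sequence of UTF-16 code units, and that toCharArray is nothing more than a loop over charAt. JavaScript's length, charCodeAt and codePointAt report the same numbers for the same string.

```java
class Utf16Demo {
    // Copies a string into a char[] one UTF-16 code unit at a time.
    static char[] toCharArray(String s) {
        char[] result = new char[s.length()];
        for (int i = 0; i < result.length; i++) {
            result[i] = s.charAt(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which UTF-16 stores as a surrogate pair.
        String s = "a\uD83D\uDE00";
        System.out.println(s.length());          // 3 code units
        System.out.println((int) s.charAt(1));   // 55357 (0xD83D, the high surrogate)
        System.out.println(s.codePointAt(1));    // 128512 (0x1F600)
        System.out.println(java.util.Arrays.equals(toCharArray(s), s.toCharArray())); // true
    }
}
```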

Solution.
Let's consider the following interface:

public interface BaseString {
    int length();

    char charCodeAt(int index);

    int codePointAt(int index);

    default char[] toCharArray() {
        char[] result = new char[length()];
        for (int i = 0; i < result.length; i++) {
            result[i] = charCodeAt(i);
        }
        return result;
    }

    BaseString subString(int startInc);

    BaseString subString(int startInc, int endExc);

    BaseString toLowerCase(); // and so on; more methods can be added after analysis
}

JSString extends this interface directly.
The second class to add is

/* package-private */ class TEmulString implements BaseString {
    private final char[] characters;
    // emulated implementations of the BaseString methods operate on this array
}

TString becomes

public final class TString {
    private final BaseString base; // will be JSString in JS, TEmulString on other platforms (or other String analogues)
    private int hashCode;
    // delegate methods that BaseString already implements, wrapping results in TString; implement the rest here
}

Constructors: when we don't deal with charsets, everything is easy; when a charset other than UTF-16 is involved, use the emulated code.
With this approach we can use the most efficient methods of the JS platform string, and the rest will remain the same. About 90% of the String methods that most Java developers use are ones that Java and JS strings have in common. So performance will rise significantly.
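To make the delegation concrete, here is a minimal sketch in plain Java. All names in it (BaseStr, EmulStr, WrappedString) are hypothetical stand-ins for BaseString, TEmulString and TString, and none of them exist in TeaVM. Only the emulated side is shown, since a JSString-backed implementation only makes sense on the JS backend.

```java
// Hypothetical, trimmed-down version of the proposed interface.
interface BaseStr {
    int length();

    char charCodeAt(int index);

    BaseStr subString(int startInc, int endExc);
}

// Portable fallback backed by a char[], standing in for TEmulString.
final class EmulStr implements BaseStr {
    private final char[] characters;

    EmulStr(char[] characters) {
        this.characters = characters.clone(); // defensive copy keeps the string immutable
    }

    @Override
    public int length() {
        return characters.length;
    }

    @Override
    public char charCodeAt(int index) {
        return characters[index];
    }

    @Override
    public BaseStr subString(int startInc, int endExc) {
        if (endExc > characters.length) {
            throw new StringIndexOutOfBoundsException(endExc);
        }
        return new EmulStr(java.util.Arrays.copyOfRange(characters, startInc, endExc));
    }
}

// Wrapper standing in for TString: delegates to the base and caches the hash.
final class WrappedString {
    private final BaseStr base;
    private int hashCode; // 0 means "not computed yet", as in java.lang.String

    WrappedString(BaseStr base) {
        this.base = base;
    }

    // Handing the string to the platform needs no conversion, only unwrapping.
    BaseStr base() {
        return base;
    }

    int length() {
        return base.length();
    }

    char charAt(int index) {
        return base.charCodeAt(index);
    }

    WrappedString substring(int startInc, int endExc) {
        return new WrappedString(base.subString(startInc, endExc));
    }

    @Override
    public int hashCode() {
        int h = hashCode;
        if (h == 0) {
            // Same polynomial hash that java.lang.String documents.
            for (int i = 0; i < base.length(); i++) {
                h = 31 * h + base.charCodeAt(i);
            }
            hashCode = h;
        }
        return h;
    }
}
```

The point of the wrapper is that user code only ever sees WrappedString, while each backend picks the cheapest BaseStr it has.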

Bonuses.

1. It allows optimizations. For instance, here is how concat could be resolved:

class Point {
    int x;
    int y;

    @Override
    public String toString() {
        return "Point(" + x + "," + y + ")";
    }
}

becomes

function cie_Point_toString($var1) { return "Point(" + $var1.x + "," + $var1.y + ")"; } // instead of the much larger code generated earlier

If we have a double/float/object, we can evaluate their String representations first and then join natively.
I reviewed a few articles about optimizing with array.join(''), but some say that join is faster and some say that plain + is faster.
Here is an example. Anyway, for large concats or for concats with an unknown number of arguments, StringBuilder can be used.
2. At compile time it's possible to resolve some calls to a simpler implementation. Example:

String[] tokens = someString.split(";");

This is a real case. I was very surprised to find that this call increased the code size by 100kb, because a full regexp implementation was included.
If it is resolved at compile time, the following trick is possible: if the argument is a constant, its length is 1 and it is not a special regexp character, just use native split or emulate it with something like this (an example from my codebase, which I use to avoid regexps):

import java.util.Iterator;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Lazily splits s on ch; a single leading delimiter is skipped.
static Stream<String> split(String s, char ch) {
    Iterable<String> it = () -> new Iterator<>() {
        private int idx = s.isEmpty() || s.charAt(0) != ch ? 0 : 1;

        @Override
        public boolean hasNext() {
            return idx < s.length();
        }

        @Override
        public String next() {
            int pos = s.indexOf(ch, idx);
            if (pos < 0) {
                pos = s.length();
            }
            String res = s.substring(idx, pos);
            idx = pos + 1;
            return res;
        }
    };
    return StreamSupport.stream(it.spliterator(), false);
}

An implementation which returns String[] can easily be written and substituted for this call, saving the developer from including regexps (70-80% of cases will be covered by this optimization).
3. JS interaction becomes very easy. In our case we just need to read $var1.base to pass an emulated string to the JS context, and call new TString(jsString) to pass JS strings back, instead of converting back and forth between contexts.
4. We can get rid of the string pool, or just replace it with a native JS string array. The $rt_s implementation then just gets the required native string by index.
5. Integer.parseInt and Integer.toString (also Short, Byte) can be simplified in the JS case to fast and efficient conversions. Float and double can continue using the emulated methods.
6. Code size will shrink significantly for applications that use DOM manipulation or otherwise work with text.
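For the split case in the second bonus, a String[]-returning replacement could look roughly like the sketch below. The class and method names (CharSplit, isPlainSingleChar) are hypothetical, and this is not TeaVM code. The metacharacter list is, to my knowledge, the same set OpenJDK checks before taking its own non-regexp fast path, but treat that as an assumption. The split method follows the conventions of String.split(String): a leading empty string is kept and trailing empty strings are dropped.

```java
import java.util.ArrayList;
import java.util.List;

class CharSplit {
    // Characters that have a special meaning in a regular expression.
    private static final String REGEX_META = ".$|()[{^?*+\\";

    // True when the pattern is a single character with no regexp meaning,
    // so a compiler could safely swap in the char-based split below.
    static boolean isPlainSingleChar(String regex) {
        return regex.length() == 1 && REGEX_META.indexOf(regex.charAt(0)) < 0;
    }

    // Splits s on ch without touching the regexp engine.
    static String[] split(String s, char ch) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int pos;
        while ((pos = s.indexOf(ch, start)) >= 0) {
            parts.add(s.substring(start, pos));
            start = pos + 1;
        }
        if (start == 0) {
            // No delimiter at all: the result is the input itself.
            return new String[] { s };
        }
        parts.add(s.substring(start));
        // Drop trailing empty strings, as String.split(String) does.
        int size = parts.size();
        while (size > 0 && parts.get(size - 1).isEmpty()) {
            size--;
        }
        return parts.subList(0, size).toArray(new String[0]);
    }
}
```

A call such as someString.split(";") would qualify, because isPlainSingleChar(";") is true, while someString.split(".") would still need the regexp engine.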

Future.
String templates (JEP 430) will become part of the standard in the next stable JDK (or earlier). With native Strings, we will be able to reuse the power of JS string interpolation.

Implementation.
I understand that TeaVM is a mature project which is used in production at different companies. I suggest proceeding as follows:

1. Implement benchmarks for different String usage scenarios.
2. Implement the suggested approach in a separate branch.
3. Compare the benchmark results.
4. Merge if there is a performance improvement (I assume there will be, and a significant one).

@konsoletyper What do you think?

konsoletyper · 2023-09-30T03:53:43Z

Maintainer

Sure, this String part can be optimized, but not as easily as you think, and not by using these BaseString interfaces.

As for this StringBuilder thing: it's not how things are generated by TeaVM, it's how they are generated by javac. Sure, when you compile newer versions of the Java language, javac produces invokedynamic, and in theory this can serve as a hint. On the other hand, this would make the whole invokedynamic infrastructure more complicated. I suggest analysing control flow instead, to find such trivial StringBuilder cases, where the builder does not escape and all arguments are either string literals or numerics, and turn them into JavaScript concatenation. But this is not as easy as you think.
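A small Java sketch of the shapes being discussed may help (the class and method names are made up for illustration). The first two methods are the "trivial" case: the builder is created, filled and converted within one expression, so it never escapes. The third shows a builder that does escape, which a simple local analysis could not safely turn into a JavaScript concatenation.

```java
class ConcatShapes {
    // Source form: what the developer writes.
    static String withPlus(int x) {
        return x + "px";
    }

    // Roughly what javac emits for withPlus when it does not use invokedynamic.
    // The builder never leaves this expression, so every append is visible.
    static String desugared(int x) {
        return new StringBuilder().append(x).append("px").toString();
    }

    // Here the builder is handed to another method before toString(),
    // so the analysis can no longer assume it has seen every append.
    static String escaping(int x) {
        StringBuilder sb = new StringBuilder().append(x);
        appendUnit(sb);
        return sb.toString();
    }

    private static void appendUnit(StringBuilder sb) {
        sb.append("px");
    }
}
```

All three return the same text for the same input; the difference is only in how much the compiler has to prove before rewriting them.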

There's another thing: currently toString and parse(Type) are also implemented in Java, so randomly switching between the Java implementation and a JS implementation can be inconsistent. Please note this very carefully: when something does not behave according to the spec, that's not inconsistency! Inconsistency is when on some occasions a number is printed one way, and on other occasions the same number is printed another way.
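To illustrate the kind of mismatch meant here, the sketch below prints a few numbers with the standard Java conversions (the class name NumberText is made up). The comments about JavaScript describe how JS converts the same numbers to strings. If some concatenations were rewritten to native JS and others still went through the Java implementation, the same double could show up in both forms within one program.

```java
class NumberText {
    // Text of a double as the Java specification defines it.
    static String javaText(double d) {
        return Double.toString(d);
    }

    public static void main(String[] args) {
        // Integers: Java and JavaScript agree on the decimal text.
        System.out.println(Integer.toString(42)); // 42

        // Doubles: Java always keeps a fractional digit,
        // while JavaScript prints this value as 100.
        System.out.println(javaText(100.0));      // 100.0

        // Java switches to scientific notation at 10^7,
        // while JavaScript prints this value as 10000000.
        System.out.println(javaText(1.0E7));      // 1.0E7
    }
}
```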

Even the String.split optimization can't be done that easily. Even figuring out that only a literal is passed to a method is not that trivial. And many cases that look "trivial" from the user's standpoint (say, someone wrote a method that only calls String.split) won't work as expected.

Anyway, there are tons of places where TeaVM can be optimized, and I see no reason why this string stuff is the most important of them.

1 reply

Ihromant Sep 30, 2023
Author

I understand that it's not as easy as I think.
Also, while working on a recent PR, I came to understand how complicated it is to guess from bytecode what the actual implementation was.
About importance: this is my pain point. If you work with Strings a lot in your code, there is a 70-80% chance that you are working with the JS API and suffering from these additional unneeded conversions/invocations. If you have a small DOM, then yes, you almost don't feel it, but I feel it every time and am forced to use temporary (?) hacks to deal with performance issues. Some examples:

1. In the beginning I dropped double usage because of the slow toString conversion (it was noticeable even with a small DOM).
2. Then I wrote my own split-by-char to avoid using regexps for the 1-letter case. You can say that this is how it is intended to work. But inside the JDK there is an optimization that leads to a non-regexp implementation in 1-letter cases.
3. Then I noticed how slowly toLowerCase/toUpperCase work and dropped them wherever possible to reduce the number of conversions.

Still, the main problem remains: I have a large DOM and need to work with it.
I can guess why it's not a top priority: most TeaVM users paint almost everything to a canvas, and yes, nobody can complain that it works slower than native code (I assume that this is how most cross-platform applications are developed today). But in my case I would like to use the power of CSS and DOM manipulation, and I can't use it to the full.
You are not expected to work on these changes; I can do that, but I need confirmation from you that it will be accepted, plus your help/hints on how to do it and what should be considered.
