Add source position information to span nodes #202

leamingrad · 2024-05-31T16:23:11Z

(Note: The original context for this PR is covered in django-ftl/fluent-compiler#32, but the key parts are copied in here)

Hi 👋. Firstly, thanks for this library, and for the project in general!

I'm currently working with python-fluent (via fluent-compiler and django-ftl) as part of a large Django project. Its generally been great, but unfortunately parsing and compiling our fluent files contributes quite a lot (~10s) to our apps startup time.

With that in mind, I was hoping to contribute a couple of optimisations to fluent-compiler and project-fluent to speed up the handling of large fluent files.

One of the biggest contributor to compile times in fluent-compiler is the span_to_position function. This function takes the start of a Span node, and coverts it into (row, column) tuple. However in order to do this, it needs to scan through the source for all newlines to work out how many rows their have been.

This PR is an attempt to solve that problem by returning more information from the parsing code in fluent.syntax. Specifically, we add a new type of AST node to represent a row/column position in the source. We then track the current row/column while scanning through the source stream, and store that in Span nodes as they are created.

I've tried to make this PR as easy-to-review as possible, but unfortunately there is still quite a large diff, so if there is anything which would be helpful in terms of breaking commits up/squashing them together, please let me know.

I'm not entirely clear on whether the structure of the AST nodes is part of the fluent spec, or if it is something which python-fluent can decide unilaterally. If this turns out to be an insurmountable problem, I can switch to trying to solve this inside fluent-compiler instead.

(Note: This PR will result in a one-line merge conflict with #201, so I will rebase as needed if one is merged first)

Prior to this change, variables were using _index and _pos suffixes (or no suffix) to all refer to stream indexes. This change switches all of them to use _index, which will make future refactors clearer.

leamingrad · 2024-05-31T18:03:16Z

~~I've just noticed an issue with the position information around line boundaries - I'll fixup and then mark as ready for review again~~

Edit: On further checking there doesn't seem to be an issue, but I have updated the tests to make things clearer

Prior to this change, the only positional variables which were tracked when moving through the stream was the current index in the stream. We would like to start reporting richer positional information from the parser, specifically the position (in terms of row/column) in the source object. This change allows this by updating the Stream class to additionally track its current (zero-indexed) row and column while moving through the stream. We will persist this information into the syntax tree in the following commits.

Prior to this change, the location in the parser stream was represented by the current index. We would like to update the parser stream to also return the source position (row/column) as part of the location information. This change adds a type alias to represent a location in the parser stream, and a property to return it. Currently, this is just the index, but we will update it in future commits. This allows us to break up the change to make it easier to review, as we can switch code paths which create span information to use locations before making any functional changes.

Prior to this change, we were passing around integers when creating span information. This change uses locations instead. This has no functional change, but will make it easier to start passing through source position information in the following commit.

Prior to this change, we were tracking source position information in the stream, but were not storing it in the resulting spans. This change adds start_position and end_position SourcePosition attributes to Span nodes. To make this work we end up complicating the constructor for Span nodes, so that it can either take a location or an index/position combo (for JSON deserialization).

eemeli · 2024-07-25T18:51:49Z

The size of the diff here held me off from looking too closely for a while...

Have you considered other approaches to speeding up span_to_position? The additional data you're looking to include in the AST is pretty sizable as it touches all nodes, and it's duplicating the position info that's in the existing start and end indices.

For one possible approach, for the JS yaml library I ended up defining a lineCounter option and a LineCounter class that automatically tracks newlines, and using them, makes the index -> row/col position mapping much faster.

Could that sort of an approach work for you as well?

leamingrad · 2024-08-07T15:55:03Z

No problem about the delay, thanks for getting back to me.

I think there are definitely other approaches which could speed up span_to_position inside fluent-compiler itself, either something that looks a bit like this PR or something like a LineCounter class.

My original thinking was that by providing tools to enable that inside python-fluent would be generally useful - do you think there is anything it would be useful for python-fluent to provide?

eemeli · 2024-08-08T04:25:20Z

I would prefer an approach that added the least cost to users who do not need the line/column position for all nodes. Always calculating the positions during the parse seems like mostly unnecessary work, while providing some kind of side channel for getting the newline indices during the parse seems like a much lighter addition to fluent.syntax that would also avoid duplicating information in the parsed results.

Use consistent variable names in get_entry_or_junk

308e0fc

Prior to this change, variables were using _index and _pos suffixes (or no suffix) to all refer to stream indexes. This change switches all of them to use _index, which will make future refactors clearer.

leamingrad mentioned this pull request May 31, 2024

Optimisations for compiling large fluent files django-ftl/fluent-compiler#32

Open

leamingrad marked this pull request as draft May 31, 2024 18:02

leamingrad added 6 commits June 2, 2024 12:43

Use Locations when adding spans

c263139

Prior to this change, we were passing around integers when creating span information. This change uses locations instead. This has no functional change, but will make it easier to start passing through source position information in the following commit.

Add an ast node to represent a position in the source stream

b94039f

Add position information to all syntax fixtures

314ca90

leamingrad force-pushed the add-source-position-information branch from 00ec1ca to 314ca90 Compare June 2, 2024 11:43

leamingrad marked this pull request as ready for review June 2, 2024 11:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add source position information to span nodes #202

Add source position information to span nodes #202

leamingrad commented May 31, 2024 •

edited

Loading

leamingrad commented May 31, 2024 •

edited

Loading

eemeli commented Jul 25, 2024

leamingrad commented Aug 7, 2024

eemeli commented Aug 8, 2024

Add source position information to span nodes #202

Are you sure you want to change the base?

Add source position information to span nodes #202

Conversation

leamingrad commented May 31, 2024 • edited Loading

leamingrad commented May 31, 2024 • edited Loading

eemeli commented Jul 25, 2024

leamingrad commented Aug 7, 2024

eemeli commented Aug 8, 2024

leamingrad commented May 31, 2024 •

edited

Loading

leamingrad commented May 31, 2024 •

edited

Loading