Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add source position information to span nodes #202

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

leamingrad
Copy link
Contributor

@leamingrad leamingrad commented May 31, 2024

(Note: The original context for this PR is covered in django-ftl/fluent-compiler#32, but the key parts are copied in here)

Hi 👋. Firstly, thanks for this library, and for the project in general!

I'm currently working with python-fluent (via fluent-compiler and django-ftl) as part of a large Django project. Its generally been great, but unfortunately parsing and compiling our fluent files contributes quite a lot (~10s) to our apps startup time.

With that in mind, I was hoping to contribute a couple of optimisations to fluent-compiler and project-fluent to speed up the handling of large fluent files.

One of the biggest contributor to compile times in fluent-compiler is the span_to_position function. This function takes the start of a Span node, and coverts it into (row, column) tuple. However in order to do this, it needs to scan through the source for all newlines to work out how many rows their have been.

This PR is an attempt to solve that problem by returning more information from the parsing code in fluent.syntax. Specifically, we add a new type of AST node to represent a row/column position in the source. We then track the current row/column while scanning through the source stream, and store that in Span nodes as they are created.

I've tried to make this PR as easy-to-review as possible, but unfortunately there is still quite a large diff, so if there is anything which would be helpful in terms of breaking commits up/squashing them together, please let me know.

I'm not entirely clear on whether the structure of the AST nodes is part of the fluent spec, or if it is something which python-fluent can decide unilaterally. If this turns out to be an insurmountable problem, I can switch to trying to solve this inside fluent-compiler instead.

(Note: This PR will result in a one-line merge conflict with #201, so I will rebase as needed if one is merged first)

Prior to this change, variables were using _index and _pos suffixes (or
no suffix) to all refer to stream indexes.

This change switches all of them to use _index, which will make future
refactors clearer.
@leamingrad
Copy link
Contributor Author

leamingrad commented May 31, 2024

I've just noticed an issue with the position information around line boundaries - I'll fixup and then mark as ready for review again

Edit: On further checking there doesn't seem to be an issue, but I have updated the tests to make things clearer

Prior to this change, the only positional variables which were tracked
when moving through the stream was the current index in the stream.

We would like to start reporting richer positional information from the
parser, specifically the position (in terms of row/column) in the source
object.

This change allows this by updating the Stream class to additionally
track its current (zero-indexed) row and column while moving through the
stream. We will persist this information into the syntax tree in the
following commits.
Prior to this change, the location in the parser stream was represented
by the current index. We would like to update the parser stream to also
return the source position (row/column) as part of the location
information.

This change adds a type alias to represent a location in the parser
stream, and a property to return it. Currently, this is just the index,
but we will update it in future commits.

This allows us to break up the change to make it easier to review, as we
can switch code paths which create span information to use locations
before making any functional changes.
Prior to this change, we were passing around integers when creating
span information.

This change uses locations instead. This has no functional change, but
will make it easier to start passing through source position
information in the following commit.
Prior to this change, we were tracking source position information in
the stream, but were not storing it in the resulting spans.

This change adds start_position and end_position SourcePosition
attributes to Span nodes. To make this work we end up complicating the
constructor for Span nodes, so that it can either take a location or an
index/position combo (for JSON deserialization).
@leamingrad leamingrad force-pushed the add-source-position-information branch from 00ec1ca to 314ca90 Compare June 2, 2024 11:43
@leamingrad leamingrad marked this pull request as ready for review June 2, 2024 11:44
@eemeli
Copy link
Member

eemeli commented Jul 25, 2024

The size of the diff here held me off from looking too closely for a while...

Have you considered other approaches to speeding up span_to_position? The additional data you're looking to include in the AST is pretty sizable as it touches all nodes, and it's duplicating the position info that's in the existing start and end indices.

For one possible approach, for the JS yaml library I ended up defining a lineCounter option and a LineCounter class that automatically tracks newlines, and using them, makes the index -> row/col position mapping much faster.

Could that sort of an approach work for you as well?

@leamingrad
Copy link
Contributor Author

No problem about the delay, thanks for getting back to me.

I think there are definitely other approaches which could speed up span_to_position inside fluent-compiler itself, either something that looks a bit like this PR or something like a LineCounter class.

My original thinking was that by providing tools to enable that inside python-fluent would be generally useful - do you think there is anything it would be useful for python-fluent to provide?

@eemeli
Copy link
Member

eemeli commented Aug 8, 2024

I would prefer an approach that added the least cost to users who do not need the line/column position for all nodes. Always calculating the positions during the parse seems like mostly unnecessary work, while providing some kind of side channel for getting the newline indices during the parse seems like a much lighter addition to fluent.syntax that would also avoid duplicating information in the parsed results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants