Track line and column data incrementally when parsing strings #71

mmcqd · 2023-02-04T19:29:25Z

This PR replaces the integer index into a stream with a triple of integers representing offset, line and column information.

offset is what index originally was: how many individual elements of the stream have been consumed.

When the parser input is a string, line and column are updated incrementally with each character of the input string.

When the parser input is not a string, line and column are both set to -1 and ignored.
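To make the offset/line/column triple concrete, here is a minimal sketch of the idea. The `Position` field names follow the description above; the helper names `advance_char` and `advance_item` are hypothetical and not taken from the PR, and zero-based line and column numbers are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    offset: int  # number of stream elements consumed so far
    line: int    # zero-based line number (assumption), or -1 for non-string input
    column: int  # zero-based column number (assumption), or -1 for non-string input


def advance_char(pos: Position, ch: str) -> Position:
    """Hypothetical helper: advance past one character of a string input."""
    if ch == "\n":
        # A newline starts a new line and resets the column.
        return Position(pos.offset + 1, pos.line + 1, 0)
    return Position(pos.offset + 1, pos.line, pos.column + 1)


def advance_item(pos: Position) -> Position:
    """Hypothetical helper: advance past one element of a non-string input."""
    # Line and column carry no meaning here, so both are set to -1.
    return Position(pos.offset + 1, -1, -1)


pos = Position(0, 0, 0)
for ch in "ab\nc":
    pos = advance_char(pos, ch)
```

After the loop, `pos` holds the offset, line and column just past the last character consumed, without ever rescanning the input.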

This gives a large speedup on large inputs for parsers that make frequent use of the line_info parser. For reference, using the example JSON parser updated to track line/col data on each node, a 100k line JSON file goes from taking ~210 seconds to ~7 seconds on my machine. We've gone from quadratic-time position tracking to linear-time position tracking.
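As a rough illustration of why the complexity changes, the sketch below compares computing line/column from scratch at a given index with carrying them along in a single pass. Both function names are hypothetical and exist only for this comparison; they are not claimed to match the library's code.

```python
from typing import List, Tuple


def line_info_from_scratch(stream: str, index: int) -> Tuple[int, int]:
    """Scan the prefix stream[:index] on every call: O(index) per lookup."""
    line = stream.count("\n", 0, index)
    last_nl = stream.rfind("\n", 0, index)
    return line, index - (last_nl + 1)


def line_info_incremental(stream: str) -> List[Tuple[int, int]]:
    """Carry (line, column) along in one pass: O(1) work per character."""
    positions = []
    line = col = 0
    for ch in stream:
        positions.append((line, col))
        if ch == "\n":
            line, col = line + 1, 0
        else:
            col += 1
    positions.append((line, col))  # position just past the end of the input
    return positions


text = "ab\ncd\n"
incremental = line_info_incremental(text)
from_scratch = [line_info_from_scratch(text, i) for i in range(len(text) + 1)]
```

Both approaches give the same answers, but asking the from-scratch version for the position of every node in a file of n characters costs on the order of n² character scans in total, whereas the incremental version costs on the order of n.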

I added no new tests, since this PR does not add a new feature, only changes the implementation of an existing one. I'm happy to add more tests if needed though.

Let me know if there's anything I can do to improve this!

mmcqd · 2023-02-04T19:30:42Z

src/parsy/__init__.py

+ column: int
+
+
+@dataclass(frozen=True)


I made Result and Position frozen dataclasses, because it seems there's no reason to treat them mutably, and they ought to be hashable if they can be.

mmcqd · 2023-02-04T19:31:53Z

src/parsy/__init__.py

@@ -516,6 +513,17 @@ def fail(expected: str) -> Parser:
 return Parser(lambda _, index: Result.failure(index, expected))


+def make_index_update(consumed: str) -> Callable[[Position], Position]:


This function is curried to avoid recomputing the count and rfind calls every single time someone uses the string parser.
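A sketch of what such a curried helper could look like is below. Only the signature appears in the diff, so the body here is an assumption, and the function is named `make_index_update_sketch` to make clear it is not the PR's actual implementation. The `Position` dataclass is redefined so the example is self-contained.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Position:
    offset: int
    line: int
    column: int


def make_index_update_sketch(consumed: str) -> Callable[[Position], Position]:
    # These are computed once, when the parser is built, not on every parse.
    length = len(consumed)
    newlines = consumed.count("\n")
    last_nl = consumed.rfind("\n")
    column_after_last_nl = length - (last_nl + 1)

    def update(pos: Position) -> Position:
        if newlines == 0:
            # Same line: the column simply moves right by the consumed length.
            return Position(pos.offset + length, pos.line, pos.column + length)
        # At least one newline: the column restarts after the last one.
        return Position(pos.offset + length, pos.line + newlines, column_after_last_nl)

    return update


update_multiline = make_index_update_sketch("ab\ncd")
update_single = make_index_update_sketch("abc")
```

The returned closure does only integer arithmetic, so a string parser that is applied many times pays for `count` and `rfind` once, at construction time.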

codecov-commenter · 2023-02-05T20:14:16Z

Codecov Report

Base: 94.44% // Head: 94.43% // Decreases project coverage by -0.02% ⚠️

Coverage data is based on head (6f66f49) compared to base (da4593e).
Patch coverage: 100.00% of modified lines in pull request are covered.


Additional details and impacted files

@@            Coverage Diff             @@
##           master      #71      +/-   ##
==========================================
- Coverage   94.44%   94.43%   -0.02%     
==========================================
  Files           9        9              
  Lines        1027     1025       -2     
==========================================
- Hits          970      968       -2     
  Misses         57       57

Impacted Files	Coverage Δ
src/parsy/__init__.py	`100.00% <100.00%> (ø)`
tests/test_parsy.py	`99.37% <100.00%> (-0.02%)`	⬇️


spookylukey · 2023-02-05T20:31:57Z

Thanks for the PR, this looks like it could be a very valuable contribution for some users.

I haven't had time to look at it in depth, but it looks like it is significantly backwards incompatible. The Result class is publicly documented, as are a number of the methods whose signatures have changed. For example, see https://parsy.readthedocs.io/en/latest/ref/parser_instances.html . At the very minimum the example on that page would need to work unchanged.

Would there be any way to get this PR to work without these backwards incompatibilities? For example, could you make your Position object subclass from int for backwards compat, or some other solution that checks the type of values and converts as necessary?

If not, that would probably be a show stopper. A breaking change of this size would require a new major version at least, or a fork.
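For illustration, the int-subclass idea might look roughly like the sketch below. The class name `IntPosition` and its constructor defaults are hypothetical, and this is only one possible shape, not something taken from the PR.

```python
class IntPosition(int):
    """Hypothetical int subclass that also carries line and column."""

    def __new__(cls, offset: int, line: int = -1, column: int = -1):
        self = super().__new__(cls, offset)
        self.line = line
        self.column = column
        return self


pos = IntPosition(5, line=1, column=2)
rest = "hello world"[pos:]  # still usable anywhere a plain int index is expected
shifted = pos + 1           # arithmetic returns a plain int
```

Existing code that compares, slices or does arithmetic with the index keeps working, but note the caveat visible in the last line: the result of `pos + 1` is an ordinary `int`, so line and column are lost unless every update goes through a constructor that sets them again.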

mmcqd · 2023-02-06T01:48:25Z

Ah, yeah, I see the problem. I think it should be possible to keep this backwards compatible; let me see what I can do.

underyx · 2023-02-06T01:51:45Z

One idea for better backwards compat: Store the three-tuple on a different attribute, e.g. Result.index_position, and make Result.index a @property that returns an index based on the data in index_position.
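A minimal sketch of that suggestion, assuming a reduced result class: `ResultSketch` and its three fields are hypothetical and far smaller than the real Result class, and only `index_position` and `index` correspond to names mentioned in the comment above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Position:
    offset: int
    line: int
    column: int


@dataclass(frozen=True)
class ResultSketch:
    status: bool
    index_position: Position  # the new three-part position
    value: Any

    @property
    def index(self) -> int:
        # Backwards-compatible view: callers reading .index still get an int.
        return self.index_position.offset


result = ResultSketch(status=True, index_position=Position(3, 0, 3), value="abc")
```

Reading `result.index` keeps returning a plain integer for existing callers, while new code can use `result.index_position` for line and column. This only covers reads; code that constructs results with an integer index would still need some conversion step.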

mmcqd added 2 commits February 4, 2023 11:01

Track line/col info incrementally on string inputs

e6c040d

Update tests

6f66f49


underyx approved these changes Feb 4, 2023
