Speed parsing by running concurrent processes #267
I think speed is more important than memory. Is it possible to slice the text for multi-process parsing?
The old single-process version has been running for more than three hours; it seems the longer the string, the slower the parsing. I've written a temporary version that might help, hope it's useful. Thanks!
Thanks for sharing! I'll prioritize this ticket. Just recording this thought for myself when I revisit this ticket (hopefully soon): the challenge with implementing this will be finding a good place to split the text. Some of the grammars may include spaces, so it is not sufficient to split on a space. We could split on a "\n" as FANGOD did, but not all large bodies of text will necessarily have newlines in them. The solution may be to just split on "\n" for now and document that large texts without newlines won't benefit from the parallel processing.
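For illustration, a minimal sketch of that newline-splitting approach, assuming a hypothetical top-level `parse(text)` function in place of the library's real parser:

```python
# A minimal sketch, assuming a hypothetical `parse(text)` standing in for this
# library's parsing routine. Splitting on "\n" keeps each line intact, so
# grammars that contain spaces are never cut in half.
from multiprocessing import Pool


def parse(text):
    # Placeholder for the library's real (CPU-bound) parsing work.
    return len(text.split())


def parse_concurrently(text, processes=4):
    # Split on newlines; texts without "\n" fall back to a single chunk,
    # which is exactly the limitation noted above.
    chunks = [line for line in text.split("\n") if line]
    if len(chunks) <= 1:
        return [parse(text)]
    with Pool(processes=processes) as pool:
        return pool.map(parse, chunks)


if __name__ == "__main__":
    sample = "first line of text\nsecond line of text\nthird line"
    print(parse_concurrently(sample))
```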
Splitting on newlines gave wrong results on a JSON file, so I added splitting by length, with an overlapping window of 256 characters.
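A hedged sketch of what that length-based chunking could look like; the chunk size here is an assumption, and only the 256-character window comes from the comment above:

```python
# Fixed-size slices plus 256 characters of overlap, so a match that straddles
# a chunk boundary still appears whole in one of the chunks. Results found in
# the overlap region would need to be deduplicated afterwards.
WINDOW = 256


def chunk_by_length(text, chunk_size=10_000, window=WINDOW):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Extend each chunk `window` characters past its end.
        chunks.append(text[start:min(end + window, len(text))])
        start = end
    return chunks


# Example: a 25,000-character string becomes three overlapping chunks.
print([len(c) for c in chunk_by_length("x" * 25_000)])
# [10256, 10256, 5000]
```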
Thanks! I've started working on this in #276.
Update on this issue: after giving it some thought, I've reduced its priority because I don't think it is particularly urgent. Chunking text for concurrent processing would be really useful, and we may still implement it in this library, but someone using the library can reasonably chunk text outside of it and pass the chunks in.
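As a sketch of that workaround, chunking outside the library and parsing the chunks concurrently; `parse` here is a hypothetical stand-in for whatever entry point the library actually exposes:

```python
# Chunk the text yourself, then fan the chunks out across processes.
from concurrent.futures import ProcessPoolExecutor


def parse(chunk):
    # Replace with the real library call for your use case.
    return chunk.upper()


def parse_in_chunks(text, n_chunks=4):
    size = max(1, len(text) // n_chunks)
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with ProcessPoolExecutor() as executor:
        return list(executor.map(parse, chunks))


if __name__ == "__main__":
    print(parse_in_chunks("some reasonably large body of text to parse"))
```

Note that naive length-based chunking like this can split a match across a boundary; an overlap window as in the earlier sketch avoids that.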
Heyo! So I want to bump this issue: we've been running the library for some time now, and find it gets far too slow around 0.05-0.1 MB of data with all parsers enabled. This has been partially worked around by giving our users options to choose which parsers they want, but that's not really enough either. Sample:

Using more memory and CPU to do concurrent processing would be beautiful, as I found that e.g. the URL parser alone takes about 40% of the time. We have tons of use cases where the data exceeds 0.2 MB as well :) My thoughts:
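For context, a hedged sketch of how one might measure per-parser time to reproduce an observation like the 40% figure above; the parser functions here are hypothetical toy examples, not this library's actual parsers or profiling interface:

```python
# Time each parser independently over the same input and report its share of
# the total, using only the standard library.
import time


def url_parser(text):
    return [w for w in text.split() if w.startswith("http")]


def email_parser(text):
    return [w for w in text.split() if "@" in w]


parsers = {"url": url_parser, "email": email_parser}
text = "contact admin@example.com or see https://example.com " * 10_000

timings = {}
for name, parser in parsers.items():
    start = time.perf_counter()
    parser(text)
    timings[name] = time.perf_counter() - start

total = sum(timings.values())
for name, elapsed in timings.items():
    print(f"{name}: {elapsed:.3f}s ({elapsed / total:.0%})")
```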
Thanks for the input, I'll bump the priority up a bit on this ticket. I won't be able to spend a lot of time on this for the time being, unfortunately, so it may be some time before I get to work on this - but I'll try my best.
Would you be able to give me a quick walkthrough over a 15-minute call to show how the parsing works, so we can try to fork the threading for it ourselves?
Sorry for the delayed response. Unfortunately, life circumstances have significantly reduced the amount of time I can spend on this project for the foreseeable future. I've reached out to you via the email you use to make commits to GitHub.