Implementation of space-delimited data #30
* Add rule to rewrite to specialized function
* Inline decoding functions
I had a very brief look at the patch on my phone. Do we really need new |
I didn't give it much thought, but it looks like a good idea. The only other option is to use a record and ignore the custom delimiter
|
Maybe the flag for skipping the header should go into DecodeOptions too |
Could you describe the grammar and escaping rules for space delimited data? Can there be multiple spaces between columns? Can there be other space characters than ASCII spaces? I think we should be able to support this by adding fields to |
AFAIK there is no standard for space-delimited data (or whatever it is called). And if such a standard does exist, no one cares and everyone just does whatever they like. So it's important to be permissive. Fields are separated by one or more spaces or tabs, or any mixture of them. Leading and trailing spaces should also be dropped. The following data has leading spaces, multiple spaces as separators, and trailing spaces
Escaping rules are a complicated matter. I work mostly with numbers, so I don't know what escaping schemes are used. I assumed CSV escaping. Here is the grammar written as Haskell-like pseudocode. I hope it's understandable
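The splitting rule described in this comment (separators are runs of spaces and tabs, and leading and trailing whitespace is dropped) can be sketched in a few lines of Haskell. This is only an illustration that ignores escaping entirely; the names `isSep` and `splitFields` and the sample line are made up for the example and are not part of the patch.

```haskell
-- Hypothetical helper: the two separator characters named above.
isSep :: Char -> Bool
isSep c = c == ' ' || c == '\t'

-- Split one line into fields. Runs of separators count as a single
-- separator, and leading/trailing separators produce no empty fields.
-- Quoting and escaping are deliberately not handled here.
splitFields :: String -> [String]
splitFields s =
  case dropWhile isSep s of
    ""   -> []
    rest -> let (field, rest') = break isSep rest
            in field : splitFields rest'

main :: IO ()
main = print (splitFields "  1  2\t 3  ")  -- prints ["1","2","3"]
```

Because every field is found only after skipping separators, an unquoted empty field cannot be represented in this scheme, which is one reason the escaping rules matter.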
|
Would it be enough to change We would also need to add a Check out the Python (http://docs.python.org/3/library/csv.html) and Go (http://golang.org/pkg/encoding/csv/) CSV modules for ideas. They already support customizations like these. |
Maybe. There could be some subtle moments. I need to think about it |
Here is grammar for both CSV and space-delimited data. CSV grammar
Space delimited grammar
There are three differences: different separators, different field parsers, and the space-delimited parser drops leading/trailing spaces. Changes in the field parser are necessary. First, it needs to stop on both space and tab. Second, an unescaped field must be at least one character long. This is important, because otherwise the grammar is ambiguous. For example, the line "a " could be parsed as:
I hope it's clear. I've looked at both the Python and Go libraries. It looks like neither can parse data using the grammar above. It looks like the only way to push grammar selection into the options is either to enumerate grammars or to set the field and header parsers. |
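The ambiguity argument can be checked with a small backtracking parser. The sketch below uses `Text.ParserCombinators.ReadP` from `base` rather than the parser library used in the patch, and all names (`record`, `strictField`, `laxField`, `parses`) are hypothetical. It assumes the two readings of "a " are a single field followed by a dropped trailing space, and a field followed by an empty second field.

```haskell
import Text.ParserCombinators.ReadP
  (ReadP, eof, munch, munch1, readP_to_S, sepBy1)

isSep :: Char -> Bool
isSep c = c == ' ' || c == '\t'

-- One or more separator characters between fields.
sep :: ReadP ()
sep = munch1 isSep >> return ()

-- Optional leading or trailing whitespace, which is dropped.
optSep :: ReadP ()
optSep = munch isSep >> return ()

-- A record: trimmed whitespace around fields separated by 'sep'.
record :: ReadP String -> ReadP [String]
record field = do
  optSep
  fs <- field `sepBy1` sep
  optSep
  eof
  return fs

-- Unescaped field of at least one character (the proposed rule).
strictField :: ReadP String
strictField = munch1 (not . isSep)

-- Unescaped field that may be empty (the problematic variant).
laxField :: ReadP String
laxField = munch (not . isSep)

-- All complete parses of a line for a given field parser.
parses :: ReadP String -> String -> [[String]]
parses field = map fst . readP_to_S (record field)

main :: IO ()
main = do
  print (length (parses strictField "a "))  -- one parse
  print (length (parses laxField "a "))     -- two parses
```

With `strictField` the only result is `["a"]`; with `laxField` the parser also accepts `["a", ""]`, so the same input has two valid readings.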
I thought about this a bit today. I didn't get much further than breaking down the changes into a diff:
I made some initial attempts in adding more fields to |
I think that grammars are different enough that it's difficult to unify them
But then we lose record update syntax. It is possible to put everything to
In this case fields have different meanings depending on the values of other fields. There is also the type class approach. It could be extended to handle any CSV
|
Let me start with the constraints I'm working with:
At the extreme, the user could just provide a I'd like to avoid this if possible, especially if it's not needed. I went and looked at the output formats of Matlab, Octave, and Excel and they all use a single separator (space or tab), with the exception of the Excel Formatted Text output, which uses several spaces. I went ahead and checked the current field parser and it actually uses this grammar:
so it almost does what your grammar specifies. If we made At first I thought we could change the parsing code for separators from
Hence we'd have to change |
Well, I don't like the type-class idea either. There aren't enough CSV-like formats to justify such generality. I don't think that switching decDelimiter to
Then default options are:
|
I almost forgot about this So does the following design for decode/encode options seem reasonable to you? It does manage to unify CSV and space-delimited data.
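As a rough illustration of what "unifying" the two formats through options could look like, here is a toy sketch. It is not the design proposed in this thread: the record `Options`, its fields, and `splitWith` are all hypothetical names, and the sketch handles unquoted fields only. It simply enumerates the grammar differences discussed above (which characters delimit, whether runs of delimiters merge, whether the record is trimmed) as record fields.

```haskell
import Data.List (dropWhileEnd)

-- Hypothetical options record; none of these fields exist in cassava.
data Options = Options
  { optIsDelimiter     :: Char -> Bool  -- which characters separate fields
  , optMergeDelimiters :: Bool          -- treat a run of delimiters as one
  , optTrimRecord      :: Bool          -- drop leading/trailing delimiters
  }

csvOptions :: Options
csvOptions = Options (== ',') False False

spaceOptions :: Options
spaceOptions = Options (\c -> c == ' ' || c == '\t') True True

-- Split a single unquoted record according to the options.
splitWith :: Options -> String -> [String]
splitWith opts = go . trim
  where
    isD = optIsDelimiter opts
    trim
      | optTrimRecord opts = dropWhileEnd isD . dropWhile isD
      | otherwise          = id
    skip
      | optMergeDelimiters opts = dropWhile isD
      | otherwise               = id
    go s = case break isD s of
      (f, [])       -> [f]
      (f, _ : rest) -> f : go (skip rest)

main :: IO ()
main = do
  print (splitWith csvOptions "a,,b")            -- prints ["a","","b"]
  print (splitWith spaceOptions "  1  2\t 3  ")  -- prints ["1","2","3"]
```

Record update syntax still works with this shape, for example `spaceOptions { optTrimRecord = False }`, which is one of the concerns raised earlier in the discussion.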
|
Have you tried to implement it to make sure it actually works? :) |
Not yet. Being naturally lazy, I like to rule out obviously wrong ideas as early as possible. But it's a simple enumeration of the differences in grammar, so there shouldn't be any problems. Most of the pieces are already implemented and only need reshuffling. |
I'll look into it. |
I'm working on this on the https://github.com/tibbe/cassava/tree/space-delim branch. |
I've implemented correct trimming of spaces. It turned out to be tricky and I had to rewrite
I'm working on |
Implementation is mostly complete. There are no tests for encoding yet, and there are a few performance regressions too. |
I tried as well and came to the same conclusion. |
Encoding and decoding for all APIs, together with tests and documentation. Previously discussed at #14
Benchmarks don't indicate any performance regression for CSV.