Implementation of space-delimited data #30

Shimuuar · 2013-01-02T14:05:22Z

Encoding and decoding for all APIs, together with tests and documentation. Previously discussed in #14.

Benchmarks don't indicate any performance regression for CSV.

* Add rule to rewrite to specialized function
* Inline decoding functions

tibbe · 2013-01-02T15:37:50Z

I had a very brief look at the patch on my phone. Do we really need new
top-level functions? Can't we just extend DecodeOptions?

Shimuuar · 2013-01-03T11:10:57Z

I didn't give it much thought. But it looks like a good idea. The only
question is how to extend DecodeOptions. `decDelimiter` only makes sense
for CSV data.

data DecodeOptions = SpaceDelim | CSV Word8

The other option is to use a record and ignore the custom delimiter:

data DecodeOptions = DO { decDelim :: Word8, spaceDelim :: Bool }

Shimuuar · 2013-01-03T11:55:48Z

Maybe the flag for skipping the header should go into DecodeOptions too.

tibbe · 2013-01-04T16:18:55Z

Could you describe the grammar and escaping rules for space delimited data? Can there be multiple spaces between columns? Can there be other space characters than ASCII spaces?

I think we should be able to support this by adding fields to DecodeOptions for recognizing the separators, escaping the right characters, etc. Python does it this way. I don't think we need new top-level functions, except a convenient spaceDelimDecodeOptions :: DecodeOptions constant.

Shimuuar · 2013-01-04T17:29:16Z

AFAIK there is no standard for space-delimited data (or whatever it is called). And if such a standard does exist, no one cares and everyone just does whatever they like. So it's important to be permissive. Fields are separated by one or more spaces or tabs, or any mixture of them. Leading and trailing spaces should also be dropped. The following data has leading spaces, multiple spaces as separators, and trailing spaces:

   1 a   1.22
 123 bcd 3.3
  88 c   0.4   ← invisible trailing space
1078 d   0.4

Escaping rules are a complicated matter. I work mostly with numbers, so I don't know what escaping schemes are used. I assumed CSV escaping.

Here is the grammar written as Haskell-like pseudocode. I hope it's understandable:

row   = many ws *> field `sepBy` many1 ws <* many ws
field = csvEscapedField <|> many1 notWS
ws    = ' ' <|> '\t'
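
For illustration only, the pseudocode above can be turned into a small runnable sketch. It uses `Text.ParserCombinators.ReadP` from base rather than the parser library the patch itself uses, works on `String` instead of `ByteString`, and handles a single line. The names `isWS`, `quotedBody` and `parseRow` are made up for this example and are not part of the patch.

```haskell
import Text.ParserCombinators.ReadP

-- ws = ' ' <|> '\t'
isWS :: Char -> Bool
isWS c = c == ' ' || c == '\t'

-- CSV-style escaped field: wrapped in double quotes, with "" standing for
-- one literal quote. This helper reads everything after the opening quote.
quotedBody :: ReadP String
quotedBody = do
  chunk <- munch (/= '"')
  _ <- char '"'
  -- A second quote right after the closing one means "escaped quote, keep going".
  more <- (char '"' >> fmap ('"' :) quotedBody) <++ return ""
  return (chunk ++ more)

-- field = csvEscapedField <|> many1 notWS
field :: ReadP String
field = (char '"' >> quotedBody) <++ munch1 (not . isWS)

-- row = many ws *> field `sepBy` many1 ws <* many ws
row :: ReadP [String]
row = munch isWS *> sepBy field (munch1 isWS) <* munch isWS <* eof

-- Return the fields of one line, or Nothing if the line does not match.
parseRow :: String -> Maybe [String]
parseRow s = case [fs | (fs, "") <- readP_to_S row s] of
  (fs : _) -> Just fs
  []       -> Nothing
```

With this sketch, `parseRow "  88 c   0.4   "` gives `Just ["88","c","0.4"]`, so leading, repeated and trailing whitespace are all dropped.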

tibbe · 2013-01-05T20:44:29Z

Would it be enough to change decDelimiter to a Parser ByteString? If we did that, you could parse space-delimited data using the current code by setting decDelimiter = many (' ' <|> '\t'). We just need to make sure this doesn't kill performance.

We would also need to add a decStripWhitespace option.

Check out the Python (http://docs.python.org/3/library/csv.html) and Go (http://golang.org/pkg/encoding/csv/) CSV modules for ideas. They already support customizations like these.

Shimuuar · 2013-01-07T10:36:27Z

Maybe. There could be some subtle issues. I need to think about it.

Shimuuar · 2013-01-10T11:00:21Z

Here are the grammars for both CSV and space-delimited data.

CSV grammar

   file        = [header CRLF] record *(CRLF record) [CRLF]
   header      = name  *(COMMA name)
   record      = field *(COMMA field)
   name        = field
   field       = escaped | non-escaped
   escaped     = DQUOTE *(TEXTDATA | COMMA | CR | LF | 2DQUOTE) DQUOTE
   non-escaped = *TEXTDATA

   COMMA    = ',' or 1-byte delimiter
   DQUOTE   = "
   CRLF     = CR LF | LF
   TEXTDATA = ^[ DQUOTE COMMA CR LF ]
   LF       = %x0A
   CR       = %x0D

Space delimited grammar

   file        = [header CRLF] record *(CRLF record) [CRLF]
   header      = *WS name  *(+WS name)  *WS
   record      = *WS field *(+WS field) *WS
   name        = field
   field       = escaped | non-escaped
   escaped     = DQUOTE *(TEXTDATA | COMMA | CR | LF | 2DQUOTE) DQUOTE
   non-escaped = +TEXTDATA

   DQUOTE   = "
   CRLF     = CR LF | LF
   TEXTDATA = ^[ DQUOTE WS CR LF ]
   LF       = %x0A
   CR       = %x0D
   WS       = %x20 | %x09

There are three differences: different separators, different field parsers, and the space-delimited parser drops leading/trailing spaces. The changes in the field parser are necessary. First, it needs to stop on both space and tab. Second, an unescaped field must be at least one character long. This is important because otherwise the grammar is ambiguous. For example, the line "a " could be parsed as either of:

record 0WS (non-escaped "a") [] 1WS
record 0WS (non-escaped "a") [1WS (non-escaped "")] 0WS
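
The ambiguity can be reproduced with a small sketch. `ReadP` (from base) returns every possible parse, which makes the two readings of "a " visible. This is a simplified stand-in for the grammar above: unescaped fields only, `String` input, and TEXTDATA reduced to "anything that is not WS". The names `looseField`, `strictField` and `allParses` are hypothetical.

```haskell
import Data.List (sort)
import Text.ParserCombinators.ReadP

isWS :: Char -> Bool
isWS c = c == ' ' || c == '\t'

-- record = *WS field *(+WS field) *WS, parameterised over the field parser.
record :: ReadP String -> ReadP [String]
record field =
  munch isWS *> sepBy1 field (munch1 isWS) <* munch isWS <* eof

-- non-escaped = *TEXTDATA (may be empty) versus +TEXTDATA (at least one char).
looseField, strictField :: ReadP String
looseField  = munch  (not . isWS)
strictField = munch1 (not . isWS)

-- Every way the whole line can be parsed as a record, in sorted order.
allParses :: ReadP String -> String -> [[String]]
allParses field line = sort [fs | (fs, "") <- readP_to_S (record field) line]
```

Here `allParses looseField "a "` yields both `["a"]` and `["a",""]`, while `allParses strictField "a "` yields only `["a"]`.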

I hope it's clear.

I've looked at both the Python and Go libraries. It looks like neither could be used to parse data using the grammar above. It looks like the only way to push grammar selection into the options is either to enumerate the grammars or to set the field and header parsers.

tibbe · 2013-01-15T03:44:10Z

I thought about this a bit today. I didn't get much further than breaking down the changes into a diff:

-header      = name  *(COMMA name)
-record      = field *(COMMA field)
+header      = *WS name  *(+WS name)  *WS
+record      = *WS field *(+WS field) *WS
-non-escaped = *TEXTDATA
+non-escaped = +TEXTDATA
-TEXTDATA = ^[ DQUOTE COMMA CR LF ]
+TEXTDATA = ^[ DQUOTE WS CR LF ]

I made some initial attempts at adding more fields to DecodeOptions, but it didn't work out.

Shimuuar · 2013-01-16T12:15:21Z

I think that the grammars are different enough that it's difficult to unify them
using options. The best I can come up with:

data DecOpt = CSV Word8 | SpaceDelim

But then we lose record update syntax. It is possible to put everything into a
record:

data DecOpt = DecOpt
  { isSpaceDelim :: Bool
  , csvSeparator :: Word8
  }

In this case fields have different meanings depending on the values of other fields.
csvSeparator doesn't mean anything if isSpaceDelim is True.

There is also the type class approach. It could be extended to handle any CSV-like
format. But are there any others?

class DecOpt a where
  -- Return parsers for the header and for an ordinary record
  toDecOpt :: a -> (Parser, Parser)

data CsvOpt = CsvOpt Word8
instance DecOpt CsvOpt where
  toDecOpt (CsvOpt d) = (header d, record d)


data SpaceDelim = SpaceDelim
instance DecOpt SpaceDelim where
  toDecOpt _ = (headerTable, recordTable)

tibbe · 2013-01-16T17:50:55Z

Let me start with the constraints I'm working with:

* The more general we make the library (e.g. by providing more top-level combinators), the more difficult it becomes for users to grasp. For example, if we double the number of top-level decode and encode functions, there's more stuff there for the users to understand.
* Certain kinds of additions lead to a doubling of the number of functions in the API. For example, supporting files with and without headers (decode and decodeByName) led to such a doubling. I decided to try to only add new top-level functions if the return value differs (as in the case of decode and decodeByName) and handle any format differences using the options records.
* At some point we'll just be writing another generic parser library and we'll end up in the Turing tarpit, where everything can be done using cassava, but nothing can be done well and/or easily. I'm trying to constrain the problem domain we're working in to prevent that from happening. We don't want to end up with decodeOptions :: CsvOpts ... | SpaceDelimOpts ... | XmlOpts .... ;)

At the extreme, the user could just provide a Parser (Vector (Vector ByteString)) and we could offer a decodeUsing :: FromRecord a => Parser (Vector (Vector ByteString)) -> ByteString -> Vector a. That would be the most general (although not the most efficient, as we always need to construct the vector of vector of byte strings in memory).

I'd like to avoid this if possible, especially if it's not needed. I went and looked at the output formats of Matlab, Octave, and Excel and they all use a single separator (space or tab), with the exception of the Excel Formatted Text output, which uses several spaces.

I went ahead and checked the current field parser and it actually uses this grammar:

TEXTDATA = ^[ DQUOTE <escape-char> CR LF ]

so it almost does what your grammar specifies. If we made decDelimiter a predicate function it should do exactly what your grammar does (i.e. disallow any kind of whitespace).

At first I thought we could change the parsing code for separators from word8 (decDelimiter opts) to many (word8 (decDelimiter opts)), but that doesn't work, as this CSV data would still parse, silently dropping the empty third field:

a,b,,c

Hence we'd have to change decDelimiter to be of type Parser () as I mentioned before.

Shimuuar · 2013-01-16T18:54:30Z

Well, I don't like the type class idea either. There aren't enough CSV-like formats to justify such generality.

I don't think that switching decDelimiter to Parser () is a good idea. The parser for unescaped fields requires a predicate on the delimiter character. Since we need the predicate anyway, we can just add a flag to choose between one character and one or more characters as the delimiter. In the same way it's possible to add a flag for dropping leading/trailing whitespace.

data DecOpt = DecOpt
  { decDelim         :: Word8 -> Bool
  , oneOrMoreDelim   :: Bool
  , stripDelimOnEdge :: Bool
  }

Then default options are:

csvOpts = DecOpt (== 44) False False                        -- 44 = ','
spaceDelimOpts = DecOpt (\c -> c == 32 || c == 9) True True -- 32 = ' ', 9 = '\t'
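
As a rough sketch of how these three options could drive record splitting, the following works on `String` with a `Char` predicate (a simplification; the proposal uses `Word8`) and ignores quoted fields entirely. `splitRecord` is a hypothetical helper written for this example, not a function from the library.

```haskell
import Data.List (dropWhileEnd)

data DecOpt = DecOpt
  { decDelim         :: Char -> Bool  -- predicate for the separator character
  , oneOrMoreDelim   :: Bool          -- treat a run of separators as one
  , stripDelimOnEdge :: Bool          -- drop separators at line start/end
  }

csvOpts, spaceDelimOpts :: DecOpt
csvOpts        = DecOpt (== ',') False False
spaceDelimOpts = DecOpt (\c -> c == ' ' || c == '\t') True True

-- Split one line into unescaped fields according to the options.
splitRecord :: DecOpt -> String -> [String]
splitRecord opts line0 = go line
  where
    isD = decDelim opts
    -- Optionally trim separators from both ends before splitting.
    line
      | stripDelimOnEdge opts = dropWhileEnd isD (dropWhile isD line0)
      | otherwise             = line0
    go s = case break isD s of
      (field, [])       -> [field]
      (field, _ : rest) -> field : go (skipRun rest)
    -- After one separator, optionally swallow the rest of the run.
    skipRun
      | oneOrMoreDelim opts = dropWhile isD
      | otherwise           = id
```

Under these assumptions `splitRecord csvOpts "a,b,,c"` keeps the empty third field, while `splitRecord spaceDelimOpts " a  b\tc "` returns `["a","b","c"]`.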

Shimuuar · 2013-02-21T18:16:15Z

I almost forgot about this.

So does the following design for decode/encode options seem reasonable to you? It does manage to unify CSV and space-delimited data.

data DecOpt = DecOpt
  { decDelim         :: Word8 -> Bool
    -- Predicate for the separator character
  , oneOrMoreDelim   :: Bool
    -- Whether consecutive separator characters should be treated as a single separator
  , stripDelimOnEdge :: Bool
    -- Whether separators at the start/end of a line should be dropped
  }

data EncOpt = EncOpt
  { encDelim :: Word8
  , extraCharsToEscape :: Word8 -> Bool
    -- Any other characters which should be escaped.
  }
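
A possible reading of the encoding options, sketched on `String` with `Char` instead of `Word8` and assuming CSV-style quoting (wrap the field in double quotes and double any embedded quote). `encodeField` and `encodeRecord` are hypothetical names, and the sketch does not address how empty fields should be written.

```haskell
import Data.List (intercalate)

data EncOpt = EncOpt
  { encDelim           :: Char
  , extraCharsToEscape :: Char -> Bool  -- any other characters that force quoting
  }

-- Quote a field if it contains the delimiter, a quote, CR, LF,
-- or any character flagged by extraCharsToEscape.
encodeField :: EncOpt -> String -> String
encodeField opts s
  | any needsQuote s = "\"" ++ concatMap esc s ++ "\""
  | otherwise        = s
  where
    needsQuote c =
      c == encDelim opts || c `elem` "\"\r\n" || extraCharsToEscape opts c
    esc '"' = "\"\""
    esc c   = [c]

-- Join the encoded fields of one record with the delimiter.
encodeRecord :: EncOpt -> [String] -> String
encodeRecord opts = intercalate [encDelim opts] . map (encodeField opts)
```

For space-delimited output one might use `EncOpt ' ' (== '\t')`, so that a field containing either a space or a tab gets quoted.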

tibbe · 2013-02-22T19:55:47Z

Have you tried to implement it to make sure it actually works? :)

Shimuuar · 2013-02-22T20:30:04Z

Not yet. Being naturally lazy, I like to rule out obviously wrong ideas as early as possible. But it's a simple enumeration of the differences in the grammars, so there shouldn't be any problems. Most of the pieces are already implemented and only need reshuffling.

tibbe · 2013-02-22T22:55:31Z

I'll look into it.

tibbe · 2013-02-23T00:56:08Z

I'm working on this on the https://github.com/tibbe/cassava/tree/space-delim branch.

Shimuuar · 2013-02-25T17:18:59Z

I've implemented correct trimming of spaces. It turned out to be tricky and I had to rewrite the record parser.

Correctly strip leading/trailing spaces

It's hard to strip whitespace correctly because
 a) It's a valid part of a field for CSV, so
    "a,b,c " -> ["a","b","c "]
 b) If we're using spaces as the delimiter we get a spurious empty field
    at the end of the line
    "a b c " -> ["a","b","c",""]

Only reliable way to strip them is to read whole line, strip spaces
and parse stripped line.
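
Case b) and the strip-then-parse fix can be illustrated with a minimal sketch on `String`, ignoring quoting. `splitOnRuns` and `splitTrimmed` are made-up names for this example and do not correspond to functions in the branch.

```haskell
import Data.List (dropWhileEnd)

isWS :: Char -> Bool
isWS c = c == ' ' || c == '\t'

-- Split on runs of whitespace without any trimming.
-- A trailing run leaves an empty field at the end.
splitOnRuns :: String -> [String]
splitOnRuns s = case break isWS s of
  (field, [])   -> [field]
  (field, rest) -> field : splitOnRuns (dropWhile isWS rest)

-- Take the whole line, strip whitespace at both ends, then split.
splitTrimmed :: String -> [String]
splitTrimmed = splitOnRuns . dropWhileEnd isWS . dropWhile isWS
```

Here `splitOnRuns "a b c "` produces the spurious trailing `""`, and `splitTrimmed "a b c "` does not.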

I'm working on the space2 branch in my repo.

Shimuuar · 2013-02-28T16:53:03Z

The implementation is mostly complete. There are no tests for encoding yet, and there are a few performance regressions too.

tibbe · 2013-03-07T17:38:58Z

> Only reliable way to strip them is to read whole line, strip spaces and parse stripped line.

I tried as well and came to the same conclusion.

Shimuuar added 22 commits October 20, 2012 20:05

Add parsers for space-delimited data files

359b691

Add function to decode space-delimited files

6800c8b

Handle leading and trailing spaces

05398ba

Add ghc-prof-options to gather sensible profiling information

805bb75

Use named constants

b506196

Use takeWhile1 it slightly improves performance

9185c6f

Add named decoders for tables

5910b82

Merge branch 'master' into space-delim

35cde7e

Fix build

cbae628

Add decoding of space-delimited data with header

09b73f7

Escape tab too

953b09d

Add encoding of space-delimited data

7927d9c

Use more consistent naming

1bf95bd

Add space-delimited data decoding for incremental API

5dec679

Add decoding of space-delimited data for streaming API

2ace414

Add documentation

bd7b4ba

Add tests

7f6a544

Fix escaping bug for empty strings

371bdd2

Test for streaming

2ea024a

Copy optimization from CSV functions

6e4ecc5

* Add rule to rewrite to specialized function
* Inline decoding functions

Move common functionality and helpers to the end of module

53a1d23

Missing type signature

4f503c1

Shimuuar mentioned this pull request Mar 8, 2013

Second implementation of parsing for space-delimited data #36

Open

devonhollowood mentioned this pull request Feb 27, 2017

Current status of implementing space-delimited data #132

Open

hvr added this to the ⊥ milestone Jun 15, 2017
