This a fparsec subset clone that works on generalized stream classes. This means you can use combinators on binary streams, strings, or any other custom stream classes you want. Included in the project is a sample CSV parser and a sample MP4 header binary parser.
- Installation
- Target audience for documentation
- When to use and known limitations
- Types and notation
- Generic operators
- String operators
- Binary operators
- Bit parsers
- Bit parsing order
- Computation Expression Syntax
- A note on FParsec vs ParsecClone regarding strings
- Instantiating user states
- Debugging
- Dealing with value restrictions
- A CSV parser example (string)
- An MP4 parser example (binary)
- Bit parsing example
- Improving binary performance
Install ParsecClone via NuGet
Install-Package ParsecClone
This will install the ParsecClone F# library.
Included in the main ParsecClone.Combinator
dll are:
- The general operators are in
ParsecClone.Operators
- The string handling in
ParsecClone.StringCombinator
- And the binary operators in
ParsecClone.BinaryCombinator
The documentation below is intended for people who are familiar with combinator libraries. If you are not familiar with FParsec style combinators and notation, you may want to run through their tutorials and explanations first.
While the following documentation is not as robust as theirs, ParsecClone operators are very similar. Once you are familiar with FParsec operator and operator styles the following documentation should be enough to get you on your way.
ParsecClone is well suited for binary parsing which works on stream sources (memory streams, file streams, etc). Not only can you do byte level parsing, but also bit level parsing. Performance of parsecClone is close to native. In my tests it was only 2x times slower than hand written c++. Performance even exceeded C++ if you ran the parser multiple times (since the JIT had already run)!
ParseClone can also parse strings, but doesn't work on string streams. One of the reasons is that to use regular expressions you need to have unlimited lookahead to your stream. With a stream you'd end up having to read the whole stream in anyways! Since FParsec works on streams, I chose to not duplicate that functionality.
If you have strings you can buffer into memory, ParsecClone will work great (so smaller files that you can read all in one go).
More importantly, ParsecClone is great for adding new stream sources to extend its capabilities. To do so just implement the IStreamP
interface and hook into the matcher
function in the base combinator library.
A few other caveats. Currently ParsecClone's string parsing doesn't do any memoization, so you are stuck reparsing data. However, by default the binary parser does memoize using a custom cache. You can disable this by passing in a None
to the cache instantiator.
ParsecClone uses 3 main types for all of its combinators.
type State<'StateType, 'ConsumeType, 'UserState> = IStreamP<'StateType, 'ConsumeType, 'UserState>
type Reply<'Return, 'StateType, 'ConsumeType, 'UserState> = 'Return option * State<'StateType, 'ConsumeType, 'UserState>
type Parser<'Return, 'StateType, 'ConsumeType, 'UserState> = State<'StateType, 'ConsumeType, 'UserState> -> Reply<'Return, 'StateType, 'ConsumeType, 'UserState>
Since the types are kind of nasty, in the following operator examples I will use a shorthand notation of
Parser<'Return> implies Parser<'Return,_,_,_>
If other type information is needed in the signature I'll use the full parser type signature.
Included operators are
val (>>=) : Parser<'a> -> ('a -> Parser<'b>) -> Parser<'b>
Combiner with function callback
val (>>=?) : Parser<'a> -> ('a -> Parser<'b>) -> Parser<'b>
Combiner with function callback and backtracking
val (>>.) : Parser<'a> -> Parser<'b> -> Parser<'b>
Use result of second combinator
val (.>>) : Parser<'a> -> Parser<'b> -> Parser<'a>
Use result of first combinator
val preturn: 'a -> Parser<'a>
Return a value as a combinator
val pzero: unit -> Parser<'a>
Defines a zero (for use with folds and other starting states). Result is (None, state
)
val (|>>) : 'a -> ('a -> 'b) -> Parser<'b>
Pipe value into union or constructor
val |>>%) : 'a -> Parser<'a>
Pipe to zero argument discriminated union
val <|>) : Parser<'a> -> Parser<'b> -> Parser<'c>
Takes two parsers, and returns a new parser. The result is either the result of the first parser (if it succeeds) or the result of the second parser, as long as the or'd parsers don't modify the underlying state.
val .<?>>.) : Parser<'a> -> Parser<'a list> -> Parser<'a list>
Takes a single parser, and a list parser and applies both parers as options.
If the first parser succeeds and the second parser fails, returns a list of the result of the first parser (Some('a)::[]
).
If the first parser succeeds and the second parser succeeds returns a cons list of both results (Some('a)::Some('a) list
). This operator does not backtrack but will not fail if the first parser doesn't succeed (since its wrapped as an opt
).
If the first parser fails, this parser fails.
val .<<?>.) : Parser<'a list> -> Parser<'a> -> Parser<'a list>
The same as .<?>>.
except with the arguments inverted. The list parser is first and the single parser is second.
If the first parser fails, this parser fails.
val (>>--): Parser<'a> -> (unit -> 'a) -> Parser<'a>
This operator lets you capture the actual invocation result of a parser. For example, say you want to time how long a parser takes. You can create a time function like this:
let time identifier func =
let start = System.DateTime.Now
let value = func()
printfn "%s Took %s" s ((System.DateTime.Now - start).ToString())
value
And time an operator like
let newParser = parserImpl >>-- time "parserImpl"
Internally the right hand function is delayed and not executed till we actually call the parser:
let (>>--) parser wrapper =
fun state ->
wrapper (fun () -> parser state)
val (>>|.): Parser<'a> -> ('a -> 'b) -> Parser<'b>
Takes a parser and a transformer, applies the result of the parser to the transformer and returns a new parser that returns the transformed result.
val many: Parser<'a> -> Parser<'a list>
Repeats a parser zero or more times, until the parser fails to match or the end of stream is encountered.
val matcher: (State<_, 'ConsumeType, _> -> 'a -> int option) -> 'a -> Parser<'ConsumeType>
Generic match on predicate and executes state modifier to return result
val anyOf: ('a -> Parser<'a>) -> 'a list -> Parser<'a>
Takes a function that maps the list into a bunch of parsers and or's each result together with the <|>
combinator. For example: anyOf matchStr ["f";"o";"i";"g";"h";"t";"e";"r";"s";" "]
val choice: Parser<'a> list -> Parser<'a>
Takes a list of parsers and or's them together with <|>
val attempt: Parser<'a> -> Parser<'a>
If no match occurs or an exception happens, backtracks to the beginning of the state of the parser
val takeTill: ('a -> bool) -> Parser<'a> -> Parser<'a list>
Takes a predicate and a parser and consumes until the predicate is true. Then backtracks one element
val takeWhile: ('a -> bool) -> Parser<'a> -> Parser<'a list>
Takes a predicate and a parser and consumes until the predicate is false. Then backtracks one element
val manyN: int -> Parser<'a> -> Parser<'a list>
Takes a number and tries to consume N parsers. If it doesn't consume exactly N it will fail. Aliased by exactly
.
val many1: Parser<'a> -> Parser<'a list>
Repeats a parser one or more times (fails if no match found)
val lookahead: Parser<'a> -> Parser<'a>
Returns a result and a new parser, but backtracks the state
val manyTill: Parser<'a> -> Parser<'b> -> Parser<'a list>
Takes a parser and an end parser, and repeats the first parser zero or more times until the second parser succeeds
val manyTill1: Parser<'a> -> Parser<'b> -> Parser<'a list>
Same as manyTill
except fails on zero matches, so expects at least one or more
val between: Parser<'a> -> Parser<'b> -> Parser<'c> -> Parser<'b>
Takes a bookend parser, the parser, and another bookened parse and returns the value of the middle parser
val between2: Parser<'a> -> Parser<'b> -> Parser<'b>
Takes a bookend parser and the target parser, and applies the bookend parser twice to between
. Usage could be for parser |> between2 quote
val eof: Parser<unit>
Parser succeeds if the stream has nothing left to consume. Fails otherwise.
val manySatisfy: ('a -> bool) -> Parser<'a> -> Parser<'a list>
Alias for takeWhile
val satisfy : ('a -> bool) -> Parser<'a> -> Paser<'a>
Takes a predicate and a parser, applies the parser once and if the return result passes the predicate returns the result, otherwise backtracks.
val satisfyUserState : ('UserState -> bool) -> Parser<'a,_,_'UserState> -> Paser<'a>
Takes a predicate and a parser. It applies the parser and then calls the predicate with the new userstate. If the predicate succeeds, returns the result of the parser, otherwise it backtracks.
val opt : Parser<'a> -> Parser<'a option>
Takes a parser, applies the the state, and returns a result option. Careful using this in the context of a 'many' since it you can get into infinite loops since you always "succeed"
val createParserForwardedToRef: unit -> (Parser<'a>, ref Parser<'a>)
Returns a tuple of (parser, ref parser) to use for recursive calling parsers
val reproc elevator seed parser : ('result -> 'userState -> State<'newStateType,'newConsumeType,'userState>) ->
Parser<'result, 'originalStateType, 'consumeType, 'userState> ->
Parser<'b, 'newStateType, 'newConsumeType, 'userState> ->
Parser<'b, 'originalStateType, 'consumeType, 'userSTate>
This functions lets you apply a parser to a buffered set of data. The buffered set of data acts as its own parser state. The seed is a parser on the original state and is used to create a new parse state (by the elevator).
The second parser argument is the parser that will be applied to the new state. The original state is advanced by the amount that the seed consumed.
The return result is the return from the elevated parser, but the returned parser type continues to work on the underlying state.
val getUserState : unit -> Parser<'UserState, 'StateType, 'ConsumeType, 'UserState>
Returns a parser whose result is the currently stored user state
val setUserState : 'UserState -> Parser<unit>
Takes a value and sets the userstate with that value
val statePosition: unit -> Parser<int64>
Returns the position of the state. Does not modify the stream.
val parse: ParserCombinator
An instance of the ParserCombinator
workflow builder, which allows you to do workflow based combinators (basically a direct port of FParsecs combinator).
One major difference between this and fparsec is that my string parsing is based on regex, not single character parsing. To me, this makes parsing a little easier since I struggled with the string parsing in fparsec. Also it's kind of nice to not be an exact clone, because that's no fun.
String operators in the StringP
module are:
val matchStr: string -> Parser<string>
Matches a string if it starts with some text uses match
val regexStr: string -> Parser<string>
Takes a regular expression and tests to see if the current state begins with a match uses match
val any: Parser<string>
Parses the regex .
from the stream.
val anyBut: string -> Parser<string>
Takes a regular expression and returns a character that does not match the regex
val char: Parser<string>
Parses the regex [a-z]
from the stream.
val chars: Parser<string>
Parses the regex [a-z]+
from the stream.
val digit: Parser<string>
Parses the regex [0-9]
from the stream.
val digits: Parser<string>
Parses the regex [0-9]+
from the stream.
val newline: Parser<string>
Matches \r\n
or \n:
or \r
val whitespace: Parser<string>
Parses the regex \s
from the stream.
val whitespaces: Parser<string>
Parses the regex \s+
from the stream.
val space: Parser<string>
Parses the regex " "
from the stream.
val spaces: Parser<string>
Parses the regex " "+
from the stream.
val tab: Parser<string>
Parses the regex \t
from the stream.
val tabs: Parser<string>
Parses the regex \t+
from the stream.
val ws: Parser<string>
optional whitespace parser. Always succeeds, if it consumes actual whitespace the resulting string will not be an empty string. If it fails, it will return an empty string. Does not include newlines.
val foldStrings: string list -> Parser<string>
takes a string list and preturns a concatenated version of those strings string list -> parser<string>
val makeStringStream : String -> StringStreamP<unit>
Utility function to create a stream from a string. Use this if you don't need to create any user state.
val isNewLine : String -> bool
Returns true if the string is \r\n
, \n
, or \r
.
val allWhiteSpace : String -> bool
The same as ws
except includes newlines as well
val stringLiteral : (delim : String) -> (escapeString : String) -> Parser<string>
Parses a literal escaped using passed in parameters. The delim field is the field indicating how to segment an escaped string. For example, maybe you only want to escape a string between ,
or (as with JSON) you want to delimit inside of "
.
The escape string lets you specify what is the prefix escape string. Commonly this is \\
(which is the escaped version of \
).
For example:
let source = "a\,b\\n\r\\t,notmatched" |> makeStringStream
let delim = ","
let p = stringLiteral delim "\\"
let result = test source (many (p |> sepBy <| (matchStr delim)))
result |> should equal ["a\,b\\n\r\\t"; "notmatched"]
The first element in the result set is:
a\,b\n
\t
Notice how the non-escaped "r" was evaluated as a literal but the other characters maintained their delimiters
val quotedStringLiteral : String -> Parser<String>
Parses a quoted string literal leveraging the \
character to quote values. Stops when a non-escaped quote ("
) is encountered.
To use a binary parser, you need to instantiate the BinParser
class, which is the container for these operators. They are not imported into the global space. The reason being that you can pass an endianess converter to it. The endianess converter is run against all number converters, but not anything else.
let bp = new BinParser<_>(Array.rev)
The BinParser takes a generic argument representing the userstate of the stream. In general, just declare it as an unknown parameter and the type inference system will figure it out for you.
Binary operators of the BinParser
class in the BinaryParser
module are:
val makeBinStream: Stream -> BinStream<unit>
Helper function to take a stream and create a BinStream
instance for use with the binary combinators. Use this if you don't need any user state.
val byteN: int -> : Parser<byte[]>
Takes an integer N and consumes a byte array of that size
val byte1: Parser<byte>
Returns one byte
val byte2: Parser<byte[]>
Returns two bytes
val byte3: Parser<byte[]>
Returns three bytes
val byte4: Parser<byte[]>
Returns four bytes
val intB: Parser<int>
Returns the byte value as a signed integer
val int16: Parser<int16>
Parses 2 bytes and returns a signed 16 bit integer
val int32: Parser<int32>
Parses 4 bytes and returns a signed 32 bit integer
val int64: Parser<int64>
Parses 8 bytes and returns a signed 64 bit integer
val uintB: Parser<uint>
Returns the byte value as an unsigned integer
val uint16: Parser<uint16>
Parses 2 bytes and returns an unsigned 16 bit integer
val uint32: Parser<uint32>
Parses 4 bytes and returns an unsigned 32 bit integer
val uint64: Parser<uint64>
Parses 8 bytes and returns an unsigned 64 bit integer
val skip: int -> Parser<bool>
Skips N bytes in the stream by seeking. Returns true if succeeded.
val seekTo: int -> Parser<bool>
Seeks to the position in the stream starting from origin. Returns true if succeeded.
val skiptoEnd: Parser<unit>
Skips to the end of the stream
val shiftL: uint32 -> (uint32 -> Parser<uint32>)
Shifts left N bits
val shiftR: uint32 -> (uint32 -> Parser<uint32>)
Shifts right N bits
val floatP: Parser<float>
Parses a 4 byte float
val matchBytes: byte[] -> Parser<byte[]>
Parses the exact byte sequence (as byte array). Result is the byte sequence you expected. Fails if the byte sequence is not found at the start of the stream.
val byteToUInt : byte -> uint
Takes one byte, and converts to uint32
val toUInt16 : byte[] -> uint16
Takes a 2 byte array, applies endianess converter, and converts to uint 16
val toUInt24 : byte[] -> uint32
Takes a 3 byte array, applies endianess converter, and converts to uint 32
val toUInt32 : byte[] -> uint32
Takes a 4 byte array, applies endianess converter, and converts to uint 32
val toUInt64 : byte[] -> uint64
Takes a 8 byte array, applies endianess converter, and converts to uint 64
val byteToInt : byte -> int
Takes one byte and converts to int32
val toInt16 : byte[] -> int16
Takes a 2 byte array, applies endianess converter, and converts to int 16
val toInt24 : byte[] -> int32
Takes a 3 byte array, applies endianess converter, and converts to int 32
val toInt32 : byte[] -> int32
Takes a 4 byte array, applies endianess converter, and converts to int 32
val toInt64 : byte[] -> int64
Takes a 8 byte array, applies endianess converter, and converts to int 64
val parseStruct<'T, 'UserState>: bool -> int -> BinParser<'UserState> -> Parser<'T list>
This method is not defined on the BinParser
object, but is auto included with the BinaryCombinator
namespace. 'T
should be the struct type you want to parse. Internally this will read sizeof 'T * numEntries
bytes and marshal the bytes directly into the struct. This can be significantly faster than parsing each entry independently. The bool, if true, says to use network order (big endian). In this case it will read the bytes from end to begin. Your struct should be ordered backwards for this to work properly (so first field last, last field first). If the bool is false, it will be little endian. The int parameter says the number of structs to parse.
val defineStructParserLE<'T>: int -> BinParser<unit> -> Parser<'T list>
Defines a little endian struct parser. This method is not defined on the BinParser
object, but is auto included with the BinaryCombinator
namespace. Helper function to define a struct parser with no required user state.
val defineStructParserBE<'T>: int -> BinParser<unit> -> Parser<'T list>
Defines a big endian struct parser. This method is not defined on the BinParser
object, but is auto included with the BinaryCombinator
namespace. Helper function to define a struct parser with no required user state.
Also included in the binary parser are bit level parsers. These parsers need to work on a "seeded" byte stream. For example, you need to read in a 2 byte block, and then do bit level parsing on the 2 byte block. The byte stream will be advanced by 2 bytes, but you can work on the "seeded" (or cached) version of the stream with new parser types, by lifting the parser stream to a new stream type.
The bit type that is returned looks like this
type Bit =
| One
| Zero
Operators that make this possible include:
val makeBitP: Parser<byte[]> -> Parser<'ReturnType>
takes a seed parser (to provide the underlying byte array to use as the parser set) and a bit level parser and applies the bit level parser to the seed. Bit parsers are complicated because the smallest thing you can read off the stream is a byte, so you have to work in memory on your byte stream.
val bitsN: int -> Parser<Bit[]>
Takes an integer N and returns an array of Bit
union types (Zero
or One
)
val bitsToInt: Bit [] -> int
Takes a bit array and converts it to an int
val bitN : int -> Parser<Bit>
Takes an integer N and returns back the bit value at position N
val bit1 : Parser<Bit>
Returns the value of the first bit (zero or one)
val bit2 : Parser<Bit>
Returns the value of the second bit (zero or one)
val bit3 : Parser<Bit>
Returns the value of the third bit (zero or one)
val bit4 : Parser<Bit>
Returns the value of the fourth bit (zero or one)
val bit5 : Parser<Bit>
Returns the value of the fifth bit (zero or one)
val bit6 : Parser<Bit>
Returns the value of the sixth bit (zero or one)
val bit7 : Parser<Bit>
Returns the value of the seventh bit (zero or one)
val bit8 : Parser<Bit>
Returns the value of the eight bit (zero or one)
Bit parsing works left to right and doesn't get run through the endianness converter. Here is the layout of what is meant by bit 1 through bit 8,
0xF = 0b 1 1 1 1 1 1 1 1
bit# 1 2 3 4 5 6 7 8
If you need to extend the bit parsing, there is a BitStream
class that handles the bit handling from a byte array
Just like in FParsec, ParsecClone supports workflow syntax:
[<Test>]
let testExpression() =
let state = makeStringStream "this is a test"
let parser = parse {
let! _ = matchStr "this"
let! _ = ws
let! _ = matchStr "is a"
let! _ = ws
return! matchStr "test"
}
let result = test state parser
result |> should equal "test"
One thing I really wanted to implement that Fparsec didn't have was regular expression support for strings. Just to demonstrate what you need to do to parse a string given by the grammar
<f> := "f"
<oos> := "o" | "o"<oos>
<fighter> := "fighter"
<foofighter> := <f><oos><fighter>
Basically the word foofighter
with at least one or more o
's. Here is an example in fparsec
let fooString = pstring "f" >>= fun f ->
many1 (pstring "o") >>= fun os ->
pstring "fighter" >>= fun fighter ->
preturn (f + (List.reduce (+) os) + fighter)
test fooString "foofighter" |> should equal "foofighter"
Here is an example in ParsecClone
let foo = regexStr "fo+fighter"
let state = makeStringStream "foofighter"
test state foo |> should equal "foofighter"
Just different flavors. You can do the fparsec way in ParsecClone as well. Arguably, I'd say that FParsec is more correct here since you are forced to implement the grammar without cheating with regex, but regex does make the problem succinct.
One thing that can happen is you need to track context sensitive information during your parsing. This is where the user state comes into play. For the simple cases, makeStringStream
and makeBinStream
create state sources that have no user state (unit
). To create a stream source with a custom user state type do the following:
type VideoState = { IsAudio: bool; StateStart: int64 }
let makeBinStreamState (stream:Stream) =
new BinStream<VideoState>(stream, { IsAudio = false; StateStart = (int64)0 }, BinStreams.createCache())
In this scenario I am creating a user state of the record VideoState
and seeding the BinStream
with a default value. The user state is mutable, so you can pass it whatever you want. I'm also creating the default binary cache to use for memoization. If you don't want to use memoization (maybe your data source is huge and you would rather seek around in the stream), then pass in None
. The cache is an option type.
Just like with FParsec, you can run into an F# value restriction. This is due to un-inferrable generics that are used by the parser types. There are a lot of generic types (more than FParsec, since this is more customizable). The same rules apply here as with FParsec. If the parser gets used in some context where the final user state gets evaluated (for example by actually using the parser in a test
function), OR, by directly qualifying the parser.
The mp4 sample does this, since it defines parsers in one assembly but uses them from another. Notice the trick here:
type VideoParser<'Return> = Parser<'Return, System.IO.Stream, byte[], VideoState>
This creates a generic type of VideoParser
who's 'Return
type is unknown, but we know that it takes a Stream
, it will consume byte[]
, and the user state should be VideoState
. Then declare parsers return type as:
let video : VideoParser<_> = many (choice[ attempt ftyp;
moov;
mdat;
free;]) .>> eof
The other parsers don't need to be marked as VideoParser
since they all get used from video
. If you have errors, pin the types.
Debugging a combinator is hard. Right now ParsecClone supports only the most minimal debugging by printing when values are consumed and backtracked, and the current state at these steps. Debugging this way can be overwhelming, and it's recommended to leverage this only for inspection while building strong (tested) combinbators in small batches.
To enable debugging use set the debugging flag.
Combinator.enableDebug <- true
Lets actually use my parsec clone. Below is a grammar that will parse csv files
namespace StringMatchers
open Combinator
open StringCombinator
module CsvSample =
let delimType = ","
let(|DelimMatch|EscapedType|Other|) i =
if i = "\\" || i ="\"" then EscapedType
else if i = delimType then DelimMatch
else Other
let delim<'a> = matchStr delimType
let quote = matchStr "\""
let validNormalChars = function
| EscapedType
| DelimMatch -> false
| rest -> not (isNewLine rest)
let inQuotesChars = function
| "\"" -> false
| _ -> true
let unescape = function
| "n" -> "\n"
| "r" -> "\r"
| "t" -> "\t"
| c -> c
let quoteStrings = (many (satisfy (inQuotesChars) any)) >>= foldStrings
let escapedChar<'a> = matchStr "\\" >>. (anyOf matchStr [delimType; "\"";"n";"r";"t"] |>> unescape)
let normal<'a> = satisfy validNormalChars any
let normalAndEscaped = many (normal <|> escapedChar) >>= foldStrings
let literal<'a> = quoteStrings |> between2 quote
let csvElement = many (literal <|> normalAndEscaped) >>= foldStrings
let listItem<'a> = delim >>. ws >>. opt csvElement
let elements<'a> = csvElement .<?>>. many listItem
let lines<'a> = many (elements |> sepBy <| newline) .>> eof
If you've done the fparsec JSON tutorial this should look pretty similar. The basic gist is that you want to allow both string literals within quotes, and regular escaped characters. So:
foo, bar, baz
Is a valid csv, but so is
foo\,,,bar,baz\"
And so is
"foo,",bar,baz
All these cases work with the sample CSV parser
Here is an example of how to use the csv parser
[<Test>]
let testAll() =
let csv = makeStringStream @"This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words""
This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words""
This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words""
This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words""
This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words""
This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words""
This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words""
This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words""
This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words""
This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words""
This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"", This is some text! whoo ha, ""words"""
let result = test csv lines
List.length result |> should equal 11
As another example, this time of the binary parser, I wrote a small mp4 video file header parser. MP4 can be pretty complicated, so I didn't do the entire full spec, but you should be able to get the idea of how to use the binary parser.
For a detailed explanation and walkthrough check the sample readme
Parsing bits requires you to work on a pre-seeded byte stream. This is achieved by calling the makeBitP
parser which reads a certain number of bytes from the byte stream, and then elevates the stream into a bit stream. The returned result from the makeBitP
parser is the return result from the bit parsers. Only bit parsers can be used in the bit parsing stream, byte parsers won't work.
Here is a simple example:
[<Test>]
let bitParserTest() =
let bytes = [|0xF0;0x01|] |> Array.map byte
let parserStream = makeBinStream <| new MemoryStream(bytes)
let bitToBool = bp.bitsN 4
let bitP = bp.makeBitP (byteN 1) bitToBool
let result = test parserStream (bitP .>> bp.byte1 .>> eof)
result |> should equal 15
We create a single byte seed to use for the bit parsing, and then read 4 bits from the read byte (the other 4 bits are ignored). The underlying source stream was advanced by 1 byte, so I read the next byte to discard it, and then check for eof for completeness. The final tests makes sure that the 4 bits we read were all ones, validating that the value is 15 (0b1111).
Here is another example that applies the combinator many
to the bitParser. This example parses each bit and returns the bit value for a byte, and is applied to an array of 10 bytes
[<Test>]
let testApplyManyBits() =
let bytes = Array.init 10 (fun i -> byte(0x01))
let parserStream = makeBinStream <| new MemoryStream(bytes)
let selectAllBits = bp.bit1 >>= fun one ->
bp.bit1 >>= fun two ->
bp.bit1 >>= fun three ->
bp.bit1 >>= fun four ->
bp.bit1 >>= fun five ->
bp.bit1 >>= fun six ->
bp.bit1 >>= fun seven ->
bp.bit1 >>= fun eight ->
preturn [|one;two;three;four;five;six;seven;eight|]
let bitP = bp.makeBitP (byteN 1) selectAllBits
let result = test parserStream (many bitP .>> eof)
let target = [0..Array.length bytes - 1] |> List.map (fun _ -> bytesToBits <| bytes.[0..0])
result |> should equal target
One easy win is to wrap your input stream with a BufferedStream
. Check the included unit tests for an example.
The other big bottleneck in combinator parsing is when you need to parse the same item many times (maybe thousands, or hundreds of thousands of times). If you have to parse the same type many times in a row you should consider using a struct to hold your data type and leveraging the struct parsing functionality of ParsecClone.
Using the struct parsing requires a little bit of setup. Let me demonstrate using the sample mp4 parser.
Due to use of generics, we need to pin the UserState
for the struct parser.
/// <summary>
/// Creates a network order binary parser
/// </summary>
let bp = new BinParser<_>(Array.rev)
let pStructs<'T> bytes : VideoParser<_> = parseStruct<'T, VideoState> true bytes bp
The parseStruct
function comes defined in the BinParsers
module which is auto included with ParsecClone.BinaryCombinator
. The function takes a generic type 'T
which should be the struct type you want to parse, a boolean representing whether the bytes it reads are in network order (big endian) or not. This is important because for some binary structures, the files are written big endian vs the .NET normal of little endian. For big endian files, the bulk array is reversed, and read backwards during struct parsing. This means you should order your structures backwards as well, and then everything will map.
The other arguments to the struct parser include a byte array representing the entire byte chunk to read, and a reference to the binary parser instance.
As an example, let's say we have the following struct:
[<Struct>]
type TimeToSampleEntry =
struct
val SampleCount: uint32;
val SampleDuration: uint32
end
Which is contained in the following record:
type Stts = {
Atom: AtomBase
VersionAndFlags: VersionAndFlags
NumberOfEntries: uint32
SampleTimes: TimeToSampleEntry list
}
There could potentially be hundreds of thousands of the TimeToSampleEntry
elements, so it's good for us to batch this processing instead of reading 8 bytes at a time.
To parse the struct you can do something like the following:
let stts : VideoParser<_> =
atom "stts" >>= fun id ->
versionAndFlags >>= fun vFlags ->
bp.uint32 >>= fun numEntries ->
pStructs<TimeToSampleEntry> (int numEntries) >>= fun samples ->
preturn {
Atom = id
VersionAndFlags = vFlags
NumberOfEntries = numEntries
SampleTimes = samples
} |>> STTS
Notice how pStructs
takes the type of the struct as a generic as well as the number of structs to create. Internally the struct parser will read sizeof typeof('T) * count
bytes and marshal the raw byte array into the structure.
In general, don't optimize prematurely. The nice thing about records and structs is that it's trivial to change a record to a struct. Updating your parser requires only a line change (instead of a manyN (int numEntries) parser
you replace it with the pStructs
parser).