
Dashboard

About

stream-csv-as-json is a micro-library that provides a set of lightweight stream components to process huge CSV files with a minimal memory footprint. It is a companion library for stream-json, fully compatible with it, and can use advanced features provided by that library.

It can:

  • Parse CSV files compliant with RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files.
    • Properly handles values with newlines in them.
    • Supports a relaxed definition of "newline".
    • Supports a customizable separator.
  • Parse CSV files far exceeding available memory.
    • Even individual primitive data items (strings) can be streamed piece-wise.
    • Processing humongous files can take minutes and even hours. Shaving even a microsecond from each operation can save a lot of time waiting for results. That's why all stream-csv-as-json and stream-json components were meticulously optimized.
  • Stream using a SAX-inspired event-based API.
  • Provide utilities to handle huge database dumps.
  • Follow the conventions of the no-dependency micro-library stream-chain.

It was meant to be a set of building blocks for data processing pipelines organized around CSV, JSON and JavaScript objects. Users can easily create their own "blocks" using provided facilities.

Documentation

This is an overview that can be used as a cheat sheet. Click on individual components to see detailed API documentation with examples.

The main module

The main module returns a factory function, which produces instances of Parser decorated with emit().
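
A minimal sketch of the event-based use (an assumption based on how stream-json's emit() behaves: the decorated parser re-emits token names as events), counting rows by listening for startArray tokens:

const fs = require('fs');
const {parser} = require('stream-csv-as-json');

const pipeline = fs.createReadStream('data.csv').pipe(parser());

// Every CSV row is tokenized as an array, so each startArray marks a new row
// (assumes the emit()-decorated parser re-emits token names as events).
let rowCounter = 0;
pipeline.on('startArray', () => ++rowCounter);
pipeline.on('end', () => console.log(`Found ${rowCounter} rows.`));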

Parser

The heart of the package is Parser — a streaming CSV parser, which consumes text and produces a stream of tokens.

const fs = require('fs');
const {parser} = require('stream-csv-as-json');

const pipeline = fs.createReadStream('data.csv').pipe(parser());

Each row is logically represented in the token stream as an array of string values.

A stream produced by the CSV parser is compliant with the token stream of stream-json, so all data processing facilities of stream-json can be used on it: filters, streamers, and so on.
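
A quick way to inspect the token stream is to log it (a sketch; the exact sequence shown assumes the default packing and streaming options):

const fs = require('fs');
const {parser} = require('stream-csv-as-json');

// For a file containing "a,b\n1,2\n" this prints, per row, roughly:
//   startArray
//   startString, stringChunk 'a', endString, stringValue 'a'
//   startString, stringChunk 'b', endString, stringValue 'b'
//   endArray
fs.createReadStream('data.csv')
  .pipe(parser())
  .on('data', token => console.log(token.name, token.value === undefined ? '' : token.value));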

Essentials

Classes and functions to make streaming data processing enjoyable:

  • Stringer is a Transform stream. It receives a token stream and converts it to the CSV format. It is very useful when you want to edit a stream with filters and custom code, then save it back to a file.
    const fs   = require('fs');
    const zlib = require('zlib');
    const {chain}    = require('stream-chain');
    const {parser}   = require('stream-csv-as-json');
    const {stringer} = require('stream-csv-as-json/Stringer');
    const {pick}     = require('stream-json/filters/Pick');
    
    chain([
      fs.createReadStream('data.csv.gz'),
      zlib.createGunzip(),
      parser(),
      pick({filter: 'data'}),
      stringer(),
      zlib.createGzip(),
      fs.createWriteStream('edited.csv.gz')
    ]);
  • AsObjects is a Transform stream. It consumes a stream produced by Parser (each row as an array of string values), uses the first row as a header, and reformats each row array as an object, using the header values as keys for the corresponding fields. (A consumption sketch follows this list.)
    const fs   = require('fs');
    const zlib = require('zlib');
    const {chain}     = require('stream-chain');
    const {parser}    = require('stream-csv-as-json');
    const {asObjects} = require('stream-csv-as-json/AsObjects');
    
    chain([
      fs.createReadStream('data.csv.gz'),
      zlib.createGunzip(),
    
      // data:
      // a,b,c
      // 1,2,3
    
      parser(),
    
      // ['a', 'b', 'c']
      // ['1', '2', '3']
    
      asObjects()
    
      // {a: '1', b: '2', c: '3'}
    ]);
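
To consume those objects as regular JavaScript values, the object tokens produced by asObjects() can be assembled by a streamer from stream-json (a sketch assuming stream-json's streamValues; the file name is illustrative):

const fs = require('fs');
const {chain} = require('stream-chain');
const {parser} = require('stream-csv-as-json');
const {asObjects} = require('stream-csv-as-json/AsObjects');
const {streamValues} = require('stream-json/streamers/StreamValues');

const pipeline = chain([
  fs.createReadStream('data.csv'),
  parser(),
  asObjects(),     // the first row provides the keys
  streamValues()   // assembles each row back into a plain JS object
]);

let rowCounter = 0;
pipeline.on('data', data => {
  // data.value is a plain object, e.g. {a: '1', b: '2', c: '3'}
  ++rowCounter;
});
pipeline.on('end', () => console.log(`Processed ${rowCounter} rows.`));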

Advanced use

Performance tuning

Performance considerations are discussed in a separate document dedicated to Performance.

Credits

The test file tests/sample.csv.gz is Master.csv from Lahman’s Baseball Database 2012. The file is copyrighted by Sean Lahman and is used here under a Creative Commons Attribution-ShareAlike 3.0 Unported License. To test all features of the CSV parser, the file was minimally modified: row #1000 has a CRLF inserted in a value, row #1001 has a double quote inserted in a value, and then the file was compressed with gzip.
