A fast implementation of Golang epub to txt conversion.

Rexwang8/fast-epubtotxt

fast-epubtotxt

A fast epub to txt converter implemented in Go. In testing, the converter achieves speeds of 4-6 million characters per second. For example, it converts the novel 1984 in roughly 100-150 ms.

Usage

./convert \
    -inputDir [INPUT_DIRECTORY] \
    -outputDir [OUTPUT_DIRECTORY] \
    -writeHeader=[true|false] \
    -writeMetadata=[true|false] \
    -cleanOutput=[true|false] \
    -seperateFolders=[true|false] \
    -stopEarly=[INT_NUMBER_OF_BOOKS] \
    -silent=[true|false] \
    -skipCopyRight=[true|false] \
    -gutenbergCleaning=[true|false]

Example:

./convert \
    -inputDir ../data/test-lib \
    -outputDir ./output \
    -writeHeader=true \
    -writeMetadata=true \
    -cleanOutput=true \
    -seperateFolders=true \
    -stopEarly=100 \
    -skipCopyRight=false \
    -gutenbergCleaning=true

Arguments

| Argument | Type | Description | Default Value |
| --- | --- | --- | --- |
| inputDir | string | Input folder path. | ./input |
| outputDir | string | Output folder path. | ./output |
| writeHeader | bool | Write a metadata header to the *.txt file. | true |
| writeMetadata | bool | Write metadata to a separate file. | false |
| cleanOutput | bool | Remove strange characters and spacing from the output. | true |
| gutenbergCleaning | bool | Perform additional output cleaning for Gutenberg format books. | false |
| seperateFolders | bool | Write epub and metadata to a separate folder per book. | false |
| stopEarly | int | The number of books to process before stopping. | 0 (unlimited) |
| silent | bool | Suppress console output. | false |
| skipCopyRight | bool | Skip all books marked as copyrighted in the metadata. | false |
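
For readers who want to see how arguments like these map onto Go's standard `flag` package, here is a minimal sketch. Only the flag names and default values are taken from the list above; the `Config` struct, its field names, and the `parseFlags` helper are hypothetical and are not claimed to match what `convert.go` actually does.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Config is a hypothetical container for the documented arguments.
// Flag names and defaults follow the README; everything else is assumed.
type Config struct {
	InputDir          string
	OutputDir         string
	WriteHeader       bool
	WriteMetadata     bool
	CleanOutput       bool
	GutenbergCleaning bool
	SeperateFolders   bool // spelling matches the documented flag name
	StopEarly         int  // 0 means unlimited
	Silent            bool
	SkipCopyRight     bool
}

// parseFlags parses args (without the program name) into a Config.
func parseFlags(args []string) (Config, error) {
	var c Config
	fs := flag.NewFlagSet("convert", flag.ContinueOnError)
	fs.StringVar(&c.InputDir, "inputDir", "./input", "Input folder path")
	fs.StringVar(&c.OutputDir, "outputDir", "./output", "Output folder path")
	fs.BoolVar(&c.WriteHeader, "writeHeader", true, "Write a metadata header to the txt file")
	fs.BoolVar(&c.WriteMetadata, "writeMetadata", false, "Write metadata to a separate file")
	fs.BoolVar(&c.CleanOutput, "cleanOutput", true, "Remove strange characters and spacing")
	fs.BoolVar(&c.GutenbergCleaning, "gutenbergCleaning", false, "Extra cleaning for Gutenberg books")
	fs.BoolVar(&c.SeperateFolders, "seperateFolders", false, "One folder per book")
	fs.IntVar(&c.StopEarly, "stopEarly", 0, "Books to process before stopping (0 = unlimited)")
	fs.BoolVar(&c.Silent, "silent", false, "Suppress console output")
	fs.BoolVar(&c.SkipCopyRight, "skipCopyRight", false, "Skip books marked as copyrighted")
	err := fs.Parse(args)
	return c, err
}

func main() {
	cfg, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2) // the flag package has already printed the error
	}
	fmt.Printf("%+v\n", cfg)
}
```

One detail of the `flag` package is relevant to the usage shown earlier: boolean flags must be written as `-name=false` to turn them off, because `-name false` sets the flag to true and treats `false` as a positional argument. String and int flags accept both `-name value` and `-name=value`.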

Build instructions

Build the converter with the Go toolchain. This produces the `convert` binary used in the examples above.

go build convert.go

Official icon

[Image: official project icon]

Parsing benchmark

This converter processed 55,756 books from the Project Gutenberg library in less than 45 minutes.

Parsing took 44m56.324860172s, parsed 16465734085 characters at a rate of 6106732 characters per second.
Parsed 55756 books, 55340 finished and 416 skipped due to copy right.

Notes

The converter has not been exhaustively tested. If you find errors, please contact me or open an issue.
