- html: fix intermittent whitespace normalization issues
- pdf: fix a crash when aMetadata is null (e.g. metadata without the --all argument)
- fix running `make check` on clean machines
- fix dictionary (output order independent) tests [Debian Wheezy]
- fix mimetype alias resolution [shared-mime-info >= 1.1]
- clean up the Vim syntax support files
- update the man documentation for the cainteoir and dictionary programs
- add a man file for the phoneme-converter program
- support using Pico TTS
- espeak: support pronouncing text via the Cainteoir TTS API
- audio: support ALSA output
- support private use language tags
- improve mapping eSpeak's language codes to RFC 5646 language codes
- various API changes and enhancements
- improved and extended the API documentation
- improved CSS Counter Styles Level 3 support
- make use of the ucd-tools project for handling Unicode codepoint classification
- dictionary: manage pronunciation dictionaries
- phoneme-converter: convert between different phonetic transcriptions
- support tokenizing text into groups (words, numbers, punctuation, etc.)
- support basic context analysis (cardinal/ordinal number detection)
- support converting cardinal and ordinal numbers to words
- support a feature-based phonetic model and different phonetic transcription schemes
- support using a pronunciation dictionary to look up word pronunciations
- epub: support ePub3 Media Overlays [1]
- epub: support ePub3 navigation documents [1]
- html: improved the whitespace normalization logic [1]
- html: improved parsing HTML documents [1]
- html: fixed handling of unquoted percentage characters in attributes
- opf: correctly handle @id attributes on dublin-core elements (dc:*).
- zip/epub: support ZIP archives containing data descriptors
- email: support archived email files
- email: support Debian bugmail mbox files
- This feature was funded by the CSIR (Council for Scientific and Industrial Research).
- preserve the event information from the readers into the document model
- rework the document event structure to use a CSS-like subset
- model the CSS3 Counter Styles (previously Lists) spec for defining the disc and decimal list types
- epub3: support for onix and marc metadata
- mht: fix reading MHT documents
- fix compilation on clang 3.2
- add support for the mb-cz2 voice
- add a configure option to disable MBROLA support
- use C++11 range-based for loops instead of `foreach_iter`
- complete API documentation with doxygen (except the deprecated cainteoir::document API)
- rework and simplify the public buffer API
- improved space normalization logic with tests
- html: initial support for MS Word generated HTML (http://www.w3.org/TR/REC-html40 XML namespace)
- html: fix anchor generation for the root item (fixes navigation in ZIP files containing HTML documents)
- mime: support story-based Title, Story, Author and Keywords mime-like markup
- mime: support 'Article ... of ...' and 'From ...' first lines
- mime: improve email and username identification in From sections
- opf: dc:subject statements with comma-separated keywords are now supported
- add support for RDF Query Language expression templates
- optimize `rdf.select(s, _, _)` queries
- remove `rql::either` support
- update the IANA Language Sub-Tag Repository to 2012/09/04
- use RDF/RQL in the language tests
- fetch the extlang data from the IANA Language Sub-Tag Repository
- improve running the language tests -- set the LANGUAGE and LANG environment variables when setting the language
- improve the chart generation and feature support
- fixes for ascii-ipa; make each entry unique
- create a vim syntax file for the phoneme transcription scheme format (*.phon)
- add support for arpabet, Unicode IPA, x-sampa and sampa-de transcription schemes
- add support for en1, us1, us2, us3, de1, de2, de4 and de5 MBROLA voice phoneme transcription schemes
- doc2doc: improve text formatting
- improve argument and error handling for the command-line tools
- switch NEWS and README to markdown format
- xml: default to windows-1252 encoding
- xml: switch encoding via `<?xml encoding="..."?>`
- mime/email: switch encoding via `Content-Type: ...; charset="..."`
- html: switch encoding via `<meta charset="...">`
- html: switch encoding via `<meta http-equiv="Content-Type" content="...; charset=...">`
- mime: support multipart/alternative
- email: support email embedded within html `<pre>` tags
- html: fix parsing of html containing implicit (missing) tags
- html: support table markup
- ePub: be more relaxed about checking the epub mimetype file
- define an archive interface with a zip file implementation
- report the filename as a toc entry if no toc is present
- move the UTF-8 helpers to a common location
- optimize decoding of utf-8 and us-ascii encodings
- do not load a new decoder when switching to the same encoding
- switch document parsing from an event model to a reader model
- doc2doc: a simple command-line tool for converting documents from one format to another
- pdf: support PDF documents using the Poppler library
- (x)html: support extracting metadata from several meta tag names
- zip: support reading generic zipped document collections
- ePub: make the ePub parser use the archive interface
- use BCP 47 for interpreting and comparing language tags
- translate the language and region names using the iso-codes package
- support for espeak using installed mbrola voices
- (x)html: report h1..h6 sections as table of content items
- email: improve email content detection and support newsgroup information
- rdf/xml: support property attributes on an empty property element
- xml: support entities in DTD blocks
- fix detection of html that is valid xml, but is not xhtml
- improve detection of control characters when identifying binary data
- simplified and cleaned up the RDF model and query APIs
- use `std::shared_ptr` consistently over `std::tr1::shared_ptr`
- use `std::make_shared` instead of allocating the object explicitly
- remove the libxml2 dependency and make all xml-based parsers use xmlreader
- languages.rdf -- represent ISO-639-2/B, ISO-639-2/T, ISO-639-3, ISO-639-5, espeak and wikimedia language codes;
- add document format implementation status documentation.
- parse entities in attributes;
- support single-quote and html-style attributes;
- parse xml namespaces;
- unify xml and rdf namespace/uri support;
- support element/attribute mapping to context entry objects.
- epub -- support epub 3 @refines and @datatype metadata;
- html -- support all known implicit close tags and recognise more elements;
- ocf -- use the xmlreader api and expose a reader api;
- opf -- use the xmlreader api.
- release the file handle when finished recording ogg files.
- use the shared-mime-info database for MIME type detection.
- use iconv for character encoding conversions.
- support all Content-Transfer-Encoding types for MIME headers, including base64.
- improve audio error handling during reading/recording.
- support more language code mappings for new espeak voices.
- support UND as a language identifier (for Calibre eBooks).
- support XML encoded HTML without an associated xmlns.
- support ncx-based table of content in epub documents;
- support epub 3.0 metadata;
- basic support for SSML.
- support listing a document's table of contents and read/record a specified content range;
- support setting the voice reading speed, pitch and volume.
- support for single-file HTML pages (MHTML, MHT);
- support for RTF documents;
- enhanced HTML support.
- don't segfault if the OPF document is missing from an epub file.
- improved the heuristics for estimating the total reading time.
- support email (RFC822) mbox format;
- basic support for RTF documents.
- add rdfs:domain, rdfs:range and OWL markup to the text-to-speech schema;
- install the text-to-speech RDF schema used by the Cainteoir Engine;
- extract language and title metadata from HTML documents;
- extract metadata from email From and Subject headers.
- list supported document and audio formats in the --help output;
- don't output progress when writing audio data to stdout;
- don't crash when using the default voice (no dc:language metadata).
- command-line text-to-speech program;
- support for using espeak text-to-speech voices;
- listen to text, html and epub documents on the command line;
- record text, html and epub documents to ogg/vorbis and wave audio;
- see the reading/recording progress and (estimated) total time;
- preserve metadata from document to audio (currently epub to ogg/vorbis).
- extract metadata from epub, smil, rdf/xml and other documents;
- generate RDF N-Triples
- generate RDF Turtle statements
- generate VorbisComment metadata
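
As an illustration of the N-Triples output mentioned above, here is a minimal sketch of serializing one metadata statement as an N-Triples line. The `Triple` struct and the `escape_literal` and `to_ntriples` functions are hypothetical names chosen for this example; they are not the Cainteoir Engine RDF API. The sketch assumes the subject and predicate are already valid IRIs and that the object is a plain literal with an optional language tag.

```cpp
#include <string>

// Hypothetical statement type for illustration only (not the Cainteoir API).
struct Triple {
    std::string subject;   // IRI, assumed to be valid already
    std::string predicate; // IRI, assumed to be valid already
    std::string object;    // literal value (UTF-8 text)
    std::string language;  // optional language tag; empty if none
};

// Escape the characters that must not appear unescaped inside an
// N-Triples string literal: backslash, double quote, LF and CR.
std::string escape_literal(const std::string &text) {
    std::string out;
    for (char c : text) {
        switch (c) {
        case '\\': out += "\\\\"; break;
        case '"':  out += "\\\""; break;
        case '\n': out += "\\n";  break;
        case '\r': out += "\\r";  break;
        default:   out += c;      break;
        }
    }
    return out;
}

// Format a single statement as one N-Triples line (without the newline).
std::string to_ntriples(const Triple &t) {
    std::string line = "<" + t.subject + "> <" + t.predicate + "> \""
                     + escape_literal(t.object) + "\"";
    if (!t.language.empty())
        line += "@" + t.language;
    return line + " .";
}
```

For example, a `dc:title` statement with the language tag `en` would be written as an IRI pair followed by the quoted title, `@en`, and a terminating ` .`. A real serializer would also need to handle IRI and blank-node objects and typed literals, which this sketch leaves out.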