NOAT: Non-Overlapping Annotation Tagging

NOAT ("note") is a helper class for inserting reference-based annotations as HTML tags at arbitrary points in text, based on their start and end positions, while avoiding overlapping open and close tags of different types (which is invalid HTML). The result is a well-formed HTML document that yields a properly structured DOM.

The text is broken into segments, bounded by the start and end points of all of the annotations. It is then reassembled, with opening and closing tags for annotations inserted between the segments. Tags are closed and reopened as needed to prevent overlap.

For example, given:

text = "Duis mollis, est non commodo luctus, nisi erat porttitor ligula, eget lacinia odio sem nec elit."

annotations = [{
    'type': 'emphasis',
    'start': 5,
    'end': 30,
},{
    'type': 'strong',
    'start': 20,
    'end': 50,
}]

Simply inserting the tags at the given start and end positions would result in invalid HTML:

Duis <em>mollis, est non<strong> commodo l</em>uctus, nisi erat por</strong>
ttitor ligula, eget lacinia odio sem nec elit.

The correct output is:

Duis <em>mollis, est non<strong> commodo l</strong></em><strong>uctus, nisi
erat por</strong>ttitor ligula, eget lacinia odio sem nec elit.

Note the </strong> tag inserted before the strong annotation's actual end: it lets the emphasis annotation be closed without overlapping the <strong>. The strong annotation is then reopened with a new <strong> and closed at its actual end.
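The segment-and-reassemble approach described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not NOAT's actual source: the function name render_non_overlapping is hypothetical, annotations are plain (tag, start, end) tuples with start < end, and attributes and collapsed tags are left out to keep it short.

```python
def render_non_overlapping(text, annotations):
    """Wrap spans of `text` in tags without letting two tags overlap.

    `annotations` is a list of (tag, start, end) tuples with start < end.
    The closing tag is placed before the character at index `end`.
    (Hypothetical helper for illustration; not part of NOAT's API.)
    """
    # Every annotation start or end is a segment boundary.
    boundaries = {0, len(text)}
    for _tag, start, end in annotations:
        boundaries.update((start, end))
    boundaries = sorted(boundaries)

    out = []
    open_stack = []  # currently open annotations, innermost last

    for i, pos in enumerate(boundaries):
        ending = [a for a in open_stack if a[2] == pos]
        reopen = []
        # Unwind the stack until everything that ends here has been closed.
        while ending:
            ann = open_stack.pop()
            out.append("</%s>" % ann[0])
            if ann in ending:
                ending.remove(ann)
            else:
                reopen.append(ann)  # closed early only to avoid overlap
        # Open new annotations and reopen interrupted ones, longest first,
        # so that shorter spans nest inside longer ones.
        starting = [a for a in annotations if a[1] == pos]
        for ann in sorted(starting + reopen, key=lambda a: -a[2]):
            out.append("<%s>" % ann[0])
            open_stack.append(ann)
        # Emit the text segment up to the next boundary.
        if i + 1 < len(boundaries):
            out.append(text[pos:boundaries[i + 1]])
    return "".join(out)


example_text = ("Duis mollis, est non commodo luctus, nisi erat porttitor "
                "ligula, eget lacinia odio sem nec elit.")
print(render_non_overlapping(example_text, [("em", 5, 30), ("strong", 20, 50)]))
```

Run on the example text, the sketch closes <strong> early at position 30, closes <em>, and then reopens <strong>, which is the nesting shown in the correct output above.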

Usage

NOAT is available in three flavors: Python (2 & 3), CoffeeScript, and Ruby. The API is essentially the same in each, with slight differences to suit the language. In every case, adding annotations is lazy: the actual markup is not generated until the __str__, toString, or to_s method is called.

To install, simply include the noat.<ext> file where necessary. (Python users can pip install noat as well.)

There are no dependencies, and even the tests can be run directly, e.g. python tests.py.
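To make the lazy behaviour concrete, here is a minimal Python sketch of the pattern. It is not NOAT's implementation: the class name LazyMarkup and the pending and render_count attributes are hypothetical, and to stay short it handles only tags with no end (the 'collapsed' tags described in the .add section).

```python
class LazyMarkup:
    """Records annotations on add(); builds markup only when str() is called.

    Hypothetical illustration of lazy rendering, limited to collapsed tags.
    """

    def __init__(self, text):
        self.text = text
        self.pending = []      # annotations recorded so far, not yet rendered
        self.render_count = 0  # for illustration: how often markup was built

    def add(self, tag, position):
        # Only store the request; no string work happens here.
        self.pending.append((position, tag))

    def __str__(self):
        self.render_count += 1
        out = self.text
        # Insert from the rightmost position first so earlier offsets stay valid.
        for position, tag in sorted(self.pending, reverse=True):
            out = out[:position] + "<%s></%s>" % (tag, tag) + out[position:]
        return out


markup = LazyMarkup("abcdefgh")
markup.add("span", 4)        # nothing is rendered yet
print(markup.render_count)   # still 0 at this point
print(str(markup))           # markup is built here
```

Deferring the work this way means any number of annotations can be added, in any order, before the segments are computed once at output time.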

.add

Python       : .add(tag, start, [end,] [attributes,] [**attributes])
CoffeeScript : .add(tag, start, [end,] [attributes={}])
Ruby         : .add(tag, start, [end,] [attributes={}])

tag can be any string. start and end are integers giving the start and end positions of the annotation; as the examples below show, the opening tag is inserted before the character at index start and the closing tag before the character at index end. end is optional, allowing for 'collapsed' tags (e.g. abcd<span></span>efgh). attributes is an object/dict/hash (CoffeeScript/Python/Ruby) of key-value attributes to be added to the tag, e.g. <a href="http://example.com">link</a>. Python also accepts keyword arguments, which supersede any dict attributes.

For convenience, since class is a reserved word but a common annotation attribute, the attribute key '_class' will be converted to 'class', allowing for Python keyword arguments to be written as _class="marker". The CoffeeScript and Ruby versions will do the same, even though it's less necessary.
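The attribute rules above (dict plus keyword arguments, keywords winning, '_class' becoming 'class') can be sketched as follows. The function name open_tag is hypothetical and this is not NOAT's code; escaping values with html.escape is an added assumption, since the README does not say whether NOAT escapes attribute values. The sketch targets Python 3.

```python
from html import escape


def open_tag(tag, attributes=None, **kwargs):
    """Build an opening tag from a dict and/or keyword arguments.

    Hypothetical helper for illustration; not part of NOAT's API.
    """
    merged = dict(attributes or {})
    merged.update(kwargs)  # keyword arguments supersede dict entries
    parts = [tag]
    for key, value in merged.items():
        if key == "_class":
            key = "class"  # 'class' cannot be used as a Python keyword argument
        # Escaping is an assumption of this sketch, not documented behaviour.
        parts.append('%s="%s"' % (key, escape(str(value), quote=True)))
    return "<%s>" % " ".join(parts)


print(open_tag("a", {"href": "http://example.com"}))
print(open_tag("span", {"_class": "old"}, _class="marker"))
```

The second call passes '_class' both ways to show the keyword argument replacing the dict entry before the key is renamed.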

Python

>>> from noat import NOAT
>>> some_text = 'Lorem ipsum dolor sit amet.'
>>> markup = NOAT(some_text)
>>> markup.add('em', 5, 15)
>>> markup.add('a', 4, 10, href='http://example.com')
>>> str(markup)
'Lore<a href="http://example.com">m<em> ipsu</em></a><em>m dol</em>or sit amet.'

CoffeeScript

coffee> NOAT = require './noat'
coffee> some_text = 'Lorem ipsum dolor sit amet.'
coffee> markup = new NOAT(some_text)
coffee> markup.add('em', 5, 15)
coffee> markup.add('a', 4, 10, {href:'http://example.com'})
coffee> markup.toString()
'Lore<a href="http://example.com">m<em> ipsu</em></a><em>m dol</em>or sit amet.'

Ruby

irb(main):001:0> require './noat.rb'
irb(main):002:0> some_text = 'Lorem ipsum dolor sit amet.'
irb(main):003:0> markup = NOAT.new(some_text)
irb(main):004:0> markup.add('em', 5, 15)
irb(main):005:0> markup.add('a', 4, 10, {:href => 'http://example.com'})
irb(main):006:0> markup.to_s()
=> "Lore<a href=\"http://example.com\">m<em> ipsu</em></a><em>m dol</em>or sit amet."

Authors

License

Unlicensed aka Public Domain. See /UNLICENSE for more information.