class somajo.somajo.SoMaJo(language, *, split_camel_case=False, split_sentences=True, xml_sentences=None, character_offsets=False)
Bases: object
Tokenization and sentence splitting.
- Parameters
- language ({'de_CMC', 'en_PTB'}) – Language-specific tokenization rules.
- split_camel_case (bool, (default=False)) – Split words written in camelCase (excluding established names and terms).
- split_sentences (bool, (default=True)) – Perform sentence splitting in addition to tokenization.
- xml_sentences (str, (default=None)) – Delimit sentences by XML tags of this name (xml_sentences='s' → <s>…</s>). When used with XML input, this might lead to minor changes to the original tags to guarantee well-formed output (tags might need to be closed and re-opened at sentence boundaries).
- character_offsets (bool, (default=False)) – Compute the character offsets in the input for each token. This allows for stand-off tokenization; see the sketch below.
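With character_offsets=True, every token records the span it came from, which enables stand-off tokenization. A minimal sketch, assuming the offsets yielded by tokenize_text refer to positions within the corresponding input paragraph (see the character_offset attribute of Token below):

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC", character_offsets=True)
paragraph = "Heyi:)"
for sentence in tokenizer.tokenize_text([paragraph]):
    for token in sentence:
        start, end = token.character_offset
        # If the token text was not changed during tokenization,
        # paragraph[start:end] == token.text holds.
        print(token.text, (start, end), repr(paragraph[start:end]))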
tokenize_text(paragraphs, *, parallel=1)
Split paragraphs of text into sequences of tokens.
- Parameters
- paragraphs (iterable) – An iterable of single paragraphs of text.
- parallel (int, (default=1)) – Number of processes to use.
- Yields
list – The Token objects in a single sentence or paragraph (depending on the value of split_sentences).
Tokenization and sentence splitting; print one sentence per line:
>>> paragraphs = ["Heyi:)", "Was machst du morgen Abend?! Lust auf Film?;-)"]
>>> tokenizer = SoMaJo("de_CMC")
>>> sentences = tokenizer.tokenize_text(paragraphs)
>>> for sentence in sentences:
...     print(" ".join([token.text for token in sentence]))
...
Heyi :)
Was machst du morgen Abend ?!
Lust auf Film ? ;-)
Only tokenization; print one paragraph per line:
>>> tokenizer = SoMaJo("de_CMC", split_sentences=False)
>>> tokenized_paragraphs = tokenizer.tokenize_text(paragraphs)
>>> for paragraph in tokenized_paragraphs:
...     print(" ".join([token.text for token in paragraph]))
...
Heyi :)
Was machst du morgen Abend ?! Lust auf Film ? ;-)
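The split_camel_case option works analogously; a small sketch (the example word is illustrative and the exact split is not verified output):

from somajo import SoMaJo

# split_camel_case=True splits words written in camelCase,
# excluding established names and terms.
tokenizer = SoMaJo("de_CMC", split_camel_case=True)
for sentence in tokenizer.tokenize_text(["Das kleineWort steht hier."]):
    print(" ".join(token.text for token in sentence))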
Tokenization and sentence splitting; print one token per line with token classes and extra information; print an empty line after each sentence:
>>> tokenizer = SoMaJo("de_CMC")
>>> sentences = tokenizer.tokenize_text(paragraphs)
>>> for sentence in sentences:
...     for token in sentence:
...         print(f"{token.text}\t{token.token_class}\t{token.extra_info}")
...     print()
...
Heyi regular SpaceAfter=No
:) emoticon

Was regular
machst regular
du regular
morgen regular
Abend regular SpaceAfter=No
?! symbol

Lust regular
auf regular
Film regular SpaceAfter=No
? symbol SpaceAfter=No
;-) emoticon
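The SpaceAfter=No markers above carry enough information to restore the original spacing. A minimal detokenization sketch based on the documented space_after and original_spelling attributes of Token (see below); detokenize is a hypothetical helper, not part of SoMaJo:

def detokenize(tokens):
    """Rebuild the surface string from a sequence of Token objects."""
    parts = []
    for token in tokens:
        # Fall back to the original spelling if tokenization changed the text.
        parts.append(token.original_spelling or token.text)
        if token.space_after:
            parts.append(" ")
    return "".join(parts).rstrip()

for sentence in tokenizer.tokenize_text(paragraphs):
    print(detokenize(sentence))  # first sentence: "Heyi:)"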
Tokenization and sentence splitting; print one token per line and delimit sentences with XML tags:
>>> tokenizer = SoMaJo("de_CMC", xml_sentences="s")
>>> sentences = tokenizer.tokenize_text(paragraphs)
>>> for sentence in sentences:
...     for token in sentence:
...         print(token.text)
...
<s>
Heyi
:)
</s>
<s>
Was
machst
du
morgen
Abend
?!
</s>
<s>
Lust
auf
Film
?
;-)
</s>
tokenize_text_file(text_file, paragraph_separator, *, parallel=1)
Split the contents of a text file into sequences of tokens.
- Parameters
- text_file (str or file-like object) – Either a filename or a file-like object containing text.
- paragraph_separator ({'single_newlines', 'empty_lines'}) – How are paragraphs separated in the input? Is there one paragraph per line (‘single_newlines’) or do paragraphs span several lines and are separated by ‘empty_lines’?
- parallel (int, (default=1)) – Number of processes to use.
- Yields
list – The Token objects in a single sentence or paragraph (depending on the value of split_sentences).
Tokenization and sentence splitting; input file with paragraphs separated by empty lines; print one token per line with token classes and extra information; print an empty line after each sentence:
>>> with open("example_empty_lines.txt", encoding="utf-8") as f:
...     print(f.read())
...
Heyi:)

Was machst du morgen Abend?! Lust auf Film?;-)
>>> tokenizer = SoMaJo("de_CMC")
>>> sentences = tokenizer.tokenize_text_file("example_empty_lines.txt", paragraph_separator="empty_lines")
>>> for sentence in sentences:
...     for token in sentence:
...         print(f"{token.text}\t{token.token_class}\t{token.extra_info}")
...     print()
...
Heyi regular SpaceAfter=No
:) emoticon

Was regular
machst regular
du regular
morgen regular
Abend regular SpaceAfter=No
?! symbol

Lust regular
auf regular
Film regular SpaceAfter=No
? symbol SpaceAfter=No
;-) emoticon
Tokenization and sentence splitting; input file with paragraphs separated by single newlines; print one sentence per line:
>>> with open("example_single_newlines.txt", encoding="utf-8") as f:
...     print(f.read())
...
Heyi:)
Was machst du morgen Abend?! Lust auf Film?;-)
>>> tokenizer = SoMaJo("de_CMC")
>>> with open("example_empty_lines.txt", encoding="utf-8") as f:
... sentences = tokenizer.tokenize_text_file(f, paragraph_separator="empty_lines")
... for sentence in sentences:
... print(" ".join([token.text for token in sentence]))
...
Heyi :)
Was machst du morgen Abend ?!
Lust auf Film ? ;-)
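For larger inputs, the parallel parameter spreads tokenization over several worker processes. A sketch (the file name is a placeholder):

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")
# Tokenize with four worker processes.
sentences = tokenizer.tokenize_text_file(
    "large_corpus.txt", paragraph_separator="empty_lines", parallel=4
)
for sentence in sentences:
    print(" ".join(token.text for token in sentence))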
tokenize_xml(xml_data, eos_tags, *, strip_tags=False, parallel=1, prune_tags=None)
Split a string of XML data into sequences of tokens.
- Parameters
- xml_data (str) – A string containing XML data.
- eos_tags (iterable) – XML tags that constitute sentence breaks, i.e. tags that
do not occur in the middle of a sentence. For HTML input,
you might use the following list of tags:
['title', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'br', 'hr', 'div', 'ol', 'ul', 'dl', 'table']
- strip_tags (bool, (default=False)) – Remove the XML tags from the output.
- parallel (int, (default=1)) – Number of processes to use.
- prune_tags (iterable) – These XML tags and their contents will be removed from the input before tokenization. For HTML input, you might use ['script', 'style'] or, depending on your use case, ['head']; see the sketch after the examples below.
- Yields
list – The Token objects in a single sentence or stretch of XML delimited by eos_tags (depending on the value of split_sentences).
Tokenization and sentence splitting; print one token per line and an empty line after each sentence:
>>> xml = "<html><body><p>Heyi:)</p><p>Was machst du morgen Abend?! Lust auf Film?;-)</p></body></html>"
>>> eos_tags = "title h1 h2 h3 h4 h5 h6 p br hr div ol ul dl table".split()
>>> tokenizer = SoMaJo("de_CMC")
>>> sentences = tokenizer.tokenize_xml(xml, eos_tags)
>>> for sentence in sentences:
...     for token in sentence:
...         print(token.text)
...     print()
...
<html>
<body>
<p>
Heyi
:)
</p>

<p>
Was
machst
du
morgen
Abend
?!

Lust
auf
Film
?
;-)
</p>
</body>
</html>
Tokenization and sentence splitting; strip XML tags from the output and print one sentence per line:
>>> sentences = tokenizer.tokenize_xml(xml, eos_tags, strip_tags=True)
>>> for sentence in sentences:
...     print(" ".join([token.text for token in sentence]))
...
Heyi :)
Was machst du morgen Abend ?!
Lust auf Film ? ;-)
Only tokenization; print one chunk of XML (delimited by eos_tags) per line:
>>> tokenizer = SoMaJo("de_CMC", split_sentences=False)
>>> chunks = tokenizer.tokenize_xml(xml, eos_tags)
>>> for chunk in chunks:
...     print(" ".join([token.text for token in chunk]))
...
<html> <body> <p> Heyi :) </p>
<p> Was machst du morgen Abend ?! Lust auf Film ? ;-) </p> </body> </html>
Tokenization and sentence splitting; print one token per line and delimit sentences with XML tags:
>>> xml = "<html><body><p>Heyi:)</p><p>Was machst du morgen Abend?! Lust auf Film?;-)</p></body></html>"
>>> eos_tags = "title h1 h2 h3 h4 h5 h6 p br hr div ol ul dl table".split()
>>> tokenizer = SoMaJo("de_CMC", xml_sentences="s")
>>> sentences = tokenizer.tokenize_xml(xml, eos_tags)
>>> for sentence in sentences:
...     for token in sentence:
...         print(token.text)
...
<html>
<body>
<p>
<s>
Heyi
:)
</s>
</p>
<p>
<s>
Was
machst
du
morgen
Abend
?!
</s>
<s>
Lust
auf
Film
?
;-)
</s>
</p>
</body>
</html>
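prune_tags is useful for HTML input, where the contents of elements such as <script> and <style> are code rather than natural-language text. A minimal sketch, assuming a pruned and stripped document reduces to its textual tokens:

from somajo import SoMaJo

html = ("<html><head><style>p {color: red;}</style></head>"
        "<body><p>Heyi:)</p></body></html>")
tokenizer = SoMaJo("de_CMC")
# The style element and its CSS contents are removed before tokenization;
# strip_tags=True removes the remaining tags from the output.
for sentence in tokenizer.tokenize_xml(html, ["p"], strip_tags=True,
                                       prune_tags=["script", "style"]):
    print(" ".join(token.text for token in sentence))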
tokenize_xml_file(xml_file, eos_tags, *, strip_tags=False, parallel=1, prune_tags=None)
Split the contents of an XML file into sequences of tokens.
- Parameters
- xml_file (str or file-like object) – A file containing XML data. Either a filename or a file-like object.
- eos_tags (iterable) – XML tags that constitute sentence breaks, i.e. tags that
do not occur in the middle of a sentence. For HTML input,
you might use the following list of tags:
['title', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'br', 'hr', 'div', 'ol', 'ul', 'dl', 'table']
- strip_tags (bool, (default=False)) – Remove all XML tags from the output.
- parallel (int, (default=1)) – Number of processes to use.
- prune_tags (iterable) – These XML tags and their contents will be removed from the input before tokenization. For HTML input, you might use ['script', 'style'] or, depending on your use case, ['head'].
- Yields
list – The Token objects in a single sentence or stretch of XML delimited by eos_tags (depending on the value of split_sentences).
Tokenization and sentence splitting; print one token per line and an empty line after each sentence:
>>> with open("example.xml") as f:
... print(f.read())
...
<html>
<body>
<p>Heyi:)</p>
<p>Was machst du morgen Abend?! Lust auf Film?;-)</p>
</body>
</html>
>>> eos_tags = "title h1 h2 h3 h4 h5 h6 p br hr div ol ul dl table".split()
>>> tokenizer = SoMaJo("de_CMC")
>>> sentences = tokenizer.tokenize_xml_file("example.xml", eos_tags)
>>> for sentence in sentences:
...     for token in sentence:
...         print(token.text)
...     print()
...
<html>
<body>
<p>
Heyi
:)
</p>

<p>
Was
machst
du
morgen
Abend
?!

Lust
auf
Film
?
;-)
</p>
</body>
</html>
Tokenization and sentence splitting; strip XML tags from the output and print one sentence per line:
>>> with open("example.xml") as f:
... sentences = tokenizer.tokenize_xml_file(f, eos_tags, strip_tags=True)
... for sentence in sentences:
... print(" ".join(token.text for token in sentence))
...
Heyi :)
Was machst du morgen Abend ?!
Lust auf Film ? ;-)
Only tokenization; print one token per line:
>>> tokenizer = SoMaJo("de_CMC", split_sentences=False)
>>> chunks = tokenizer.tokenize_xml_file("example.xml", eos_tags)
>>> for chunk in chunks:
...     for token in chunk:
...         print(token.text)
...
<html>
<body>
<p>
Heyi
:)
</p>
<p>
Was
machst
du
morgen
Abend
?!
Lust
auf
Film
?
;-)
</p>
</body>
</html>
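A common end-to-end use is writing the tokenized output to a file, one token per line with a blank line after each sentence. A sketch using only the API shown above (example.tok is a placeholder output name):

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")
eos_tags = "title h1 h2 h3 h4 h5 h6 p br hr div ol ul dl table".split()
with open("example.xml", encoding="utf-8") as inp, \
     open("example.tok", "w", encoding="utf-8") as out:
    for sentence in tokenizer.tokenize_xml_file(inp, eos_tags, strip_tags=True):
        for token in sentence:
            out.write(token.text + "\n")
        out.write("\n")  # blank line marks the sentence boundary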
class somajo.token.Token(text, *, markup=False, markup_class=None, markup_eos=None, locked=False, token_class=None, space_after=True, original_spelling=None, first_in_sentence=False, last_in_sentence=False, character_offset=None)
Bases: object
Token objects store a piece of text (ultimately a single token) together with additional information.
- Parameters
- text (str) – The text that makes up the token object
- markup (bool, (default=False)) – Is the token a markup token?
- markup_class ({'start', 'end'}, optional (default=None)) – If markup=True, then markup_class must be either “start” or “end”.
- markup_eos (bool, optional (default=None)) – Is the markup token a sentence boundary?
- locked (bool, (default=False)) – Mark the token as locked.
- token_class ({'URL', 'XML_entity', 'XML_tag', 'abbreviation', 'action_word', 'amount', 'date', 'email_address', 'emoticon', 'hashtag', 'measurement', 'mention', 'number', 'ordinal', 'regular', 'semester', 'symbol', 'time'}, optional (default=None)) – The class of the token, e.g. “regular”, “emoticon”, “URL”, etc.
- space_after (bool, (default=True)) – Was there a space after the token in the original data?
- original_spelling (str, optional (default=None)) – The original spelling of the token, if it is different from the one in text.
- first_in_sentence (bool, (default=False)) – Is it the first token of a sentence?
- last_in_sentence (bool, (default=False)) – Is it the last token of a sentence?
- character_offset (tuple, (default=None)) – Character offset of the token in the input as a tuple (start, end) such that input[start:end] == text (if there are no changes to the token text during tokenization).
property extra_info
String representation of extra information.
- Returns – A string representation of the space_after and original_spelling attributes.
- Return type str
>>> tok = Token(":)", token_class="regular", space_after=False, original_spelling=": )")
>>> print(tok.text)
:)
>>> print(tok.extra_info)
SpaceAfter=No, OriginalSpelling=": )"
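Each attribute contributes to extra_info independently; a token created with the defaults is assumed to carry no extra information. A small sketch:

from somajo.token import Token

plain = Token("Hallo", token_class="regular")
print(repr(plain.extra_info))   # assumed: '' (nothing to report)

no_space = Token("Hallo", token_class="regular", space_after=False)
print(no_space.extra_info)      # SpaceAfter=No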