Plain Text Extractor

Plain Text Extractor is a Go library for extracting plain text from HTML and Markdown.

It provides a flexible, extensible interface: use the predefined extractors as they are, or supply your own extraction functions for custom requirements.

Features

Parse HTML and Markdown documents into plain text.
Support for custom extraction functions.
Easy-to-use API to convert complex documents to simple plain text.

Installation

go get github.com/huantt/plaintext-extractor

Usage

Markdown extractor

markdownContent := "# H1 \n*italic* **bold** `code` `not code [link](https://example.com) ![image](https://image.com/image.png) ~~strikethrough~~"
extractor := NewMarkdownExtractor()
output, err := extractor.PlainText(markdownContent)
if err != nil {
    panic(err)
}
fmt.Println(output)
// Output: H1 \nitalic bold code `not code link image strikethrough

Benchmark

goos: windows
goarch: amd64
pkg: github.com/huantt/plaintext-extractor/markdown
cpu: 11th Gen Intel(R) Core(TM) i5-1155G7 @ 2.50GHz
BenchmarkMarkdownExtractorMediumSize
BenchmarkMarkdownExtractorMediumSize-8   	12194006	        89.09 ns/op	      16 B/op	       1 allocs/op
BenchmarkMarkdownExtractorLargeSize
BenchmarkMarkdownExtractorLargeSize-8    	12645927	        88.25 ns/op	      16 B/op	       1 allocs/op
PASS
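
As a rough illustration of where the ns/op, B/op, and allocs/op columns come from, the sketch below runs a workload under `testing.Benchmark` from the standard library. `stripEmphasis` and `measure` are hypothetical placeholders, not the library's extractor or its benchmark code, so the numbers they produce are not comparable to the figures above.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps the compiler from discarding the benchmarked call.
var sink string

// stripEmphasis is a placeholder workload (assumption), not the library's extractor.
func stripEmphasis(input string) string {
	return strings.ReplaceAll(input, "*", "")
}

// measure runs the workload under testing.Benchmark and returns the result.
func measure(input string) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs() // record B/op and allocs/op as well as ns/op
		for i := 0; i < b.N; i++ {
			sink = stripEmphasis(input)
		}
	})
}

func main() {
	r := measure("*italic* **bold**")
	// These three values correspond to the ns/op, B/op, and allocs/op columns.
	fmt.Println(r.NsPerOp(), r.AllocedBytesPerOp(), r.AllocsPerOp())
}
```

The printed values depend on the machine, so no expected output is shown.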

Custom Markdown Tag

markdownContent := "This is {color:#0A84FF}red{color}"

customTag := markdown.Tag{
    Name:       "color-custom-tag",
    FullRegex:  regexp.MustCompile("{color:[a-zA-Z0-9#]+}(.*?){color}"),
    StartRegex: regexp.MustCompile("{color:[a-zA-Z0-9#]+}"),
    EndRegex:   regexp.MustCompile("{color}"),
}

markdownExtractor := NewMarkdownExtractor(customTag)
plaintextExtractor := plaintext.NewExtractor(markdownExtractor.PlainText)
output, err := plaintextExtractor.PlainText(markdownContent)
if err != nil {
    panic(err)
}
fmt.Println(output)
// Output: This is red
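
To see what the three patterns in the custom tag match, the following sketch uses only the standard `regexp` package. It is not the library's implementation: `stripColorTag` is a hypothetical helper that replaces each full match with its first capture group, which is one plausible way a tag like this could be reduced to plain text.

```go
package main

import (
	"fmt"
	"regexp"
)

// The same patterns as in the custom tag above. Go's regexp treats a "{"
// that does not start a valid repetition as a literal character.
var (
	fullRegex  = regexp.MustCompile("{color:[a-zA-Z0-9#]+}(.*?){color}")
	startRegex = regexp.MustCompile("{color:[a-zA-Z0-9#]+}")
	endRegex   = regexp.MustCompile("{color}")
)

// stripColorTag is a hypothetical helper, not part of the library.
// It keeps only the text captured between the opening and closing tags.
func stripColorTag(input string) string {
	return fullRegex.ReplaceAllString(input, "${1}")
}

func main() {
	input := "This is {color:#0A84FF}red{color}"
	fmt.Println(startRegex.FindString(input)) // {color:#0A84FF}
	fmt.Println(endRegex.FindString(input))   // {color}
	fmt.Println(stripColorTag(input))         // This is red
}
```

The non-greedy `(.*?)` matters when a line contains more than one tagged span: a greedy group would capture everything between the first opening tag and the last closing tag.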

HTML Extractor

html := `<div>This is a <a href="https://example.com">link</a></div>`
extractor := NewHtmlExtractor()
output, err := extractor.PlainText(html)
if err != nil {
    panic(err)
}
fmt.Println(output)
// Output: This is a link

Multiple extractors

input := `<div> html </div> *markdown*`
markdownExtractor := markdown.NewExtractor()
htmlExtractor := html.NewExtractor()
extractor := NewExtractor(markdownExtractor.PlainText, htmlExtractor.PlainText)
output, err := extractor.PlainText(input)
if err != nil {
    panic(err)
}
fmt.Println(output)
// Output: html markdown

Contribution

Contributions to the Plain Text Extractor project are welcome! If you find an issue or want to add a feature, feel free to open an issue or submit a pull request. See CONTRIBUTING.md for more information.

License

This project is released under the MIT License; see the LICENSE file.
