Script #0
The first commit of the script is a command-line tool that takes the name of an HTML file generated by Screenotate, the underlying OCR program. The Python script uses Beautiful Soup to parse the HTML, strips the tags with a simple regex, and then copies the relevant text to the clipboard.
#!/usr/bin/env python3
import sys
import re
import pyperclip
from bs4 import BeautifulSoup
The script is executed from the command line, so the first line is a shebang.
The script has four import statements:
- sys allows us to take arguments from the command line
- re provides regular expression matching operations
- pyperclip provides copy and paste clipboard functions
- Beautiful Soup allows us to parse HTML files
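
As a rough illustration of what the sys import is for, here is a minimal sketch of reading the file name from the command line. The helper name get_user_file and its usage message are assumptions made for this example and do not come from the original script.

```python
import sys


def get_user_file(argv):
    """Return the HTML file name given as the first command-line argument.

    Hypothetical helper: the original script may read sys.argv differently.
    """
    if len(argv) != 2:
        # argv[0] is the script name, so exactly one extra argument is expected
        prog = argv[0] if argv else "script"
        raise SystemExit(f"usage: {prog} FILE.html")
    return argv[1]


# In the real script this would be: user_file = get_user_file(sys.argv)
# A hard-coded argument list is used here so the example runs anywhere.
user_file = get_user_file(["script.py", "screenshot.html"])
print(user_file)
```

Passing the argument list in as a parameter, instead of reading sys.argv inside the function, keeps the helper easy to try out without a real command line.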
- The user passes the name of the HTML file generated by Screenotate, which is then stored in the user_file variable.
- The file is then run through Beautiful Soup, creating a BeautifulSoup object (where the document is represented as a nested data structure). The find_all("pre") method finds the relevant text for the user. (Screenotate uses <pre></pre> tags to enclose the OCR'd text.)
- We use a simple regex to find the pre tags, so they can be removed. We don't want them in our output text.
- Per the exhaustive docstring, the remove_tags() function removes the tags from the file generated by the OCR app, strips line breaks, and copies the relevant text to the clipboard:
def remove_tags(text):
    notags = tags.sub("", text)  # Strip tags
    without_line_breaks = notags.replace("\n", " ")  # Strip line breaks
    pyperclip.copy(without_line_breaks)  # Copy text to clipboard
    print("Parsed text copied to clipboard!")
    return without_line_breaks
- Finally, we call the function on the stringified text parsed by Beautiful Soup.
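
The tag-stripping and line-break steps described above can be tried on their own with only the standard library. This is a minimal sketch under stated assumptions: the script's actual regex is not shown, so the pattern below is illustrative, the name strip_pre_markup is hypothetical, and the clipboard step is left out so the example runs without pyperclip.

```python
import re

# Assumed pattern: the original regex is not shown, so this one is illustrative.
# It matches an opening or closing pre tag, with or without attributes.
tags = re.compile(r"</?pre\b[^>]*>")


def strip_pre_markup(text):
    """Remove pre tags and line breaks; the clipboard copy is omitted here."""
    notags = tags.sub("", text)       # Strip tags
    return notags.replace("\n", " ")  # Strip line breaks


# Hypothetical stand-in for the stringified Beautiful Soup result
sample = "<pre>Hello\nworld</pre>"
print(strip_pre_markup(sample))  # prints "Hello world"
```

The word boundary in the pattern keeps it from touching tags whose names merely start with "pre". Everything else in the text is passed through unchanged.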