Script #0

First step

The first commit of the script is a command-line tool: we pass it the name of the HTML file generated by Screenotate, the underlying OCR program. The Python script uses Beautiful Soup to parse the HTML, a simple regex to remove the tags, and then copies the relevant text to the clipboard.

#!/usr/bin/env python3
import sys
import re
import pyperclip
from bs4 import BeautifulSoup

Prerequisites: Shebangs & Imports

The script is executed from the command line, so the first line is a shebang. It tells the shell to run the file with whichever python3 interpreter is found on the PATH.

The script has 4 import statements:

sys allows us to take arguments from the command line
re provides regular expression matching operations
pyperclip provides copy and paste clipboard functions
Beautiful Soup allows us to parse HTML files
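
As a rough illustration of how sys supplies the file name, here is a minimal sketch. The helper name get_input_path and the usage message are hypothetical and not taken from the script; the only assumption is that the script expects exactly one argument, the HTML file.

```python
import sys


def get_input_path(argv):
    """Return the HTML file name passed on the command line.

    `argv` is expected to look like sys.argv: the script name first,
    then the arguments. This helper is hypothetical, not from the script.
    """
    if len(argv) != 2:
        # Exit with a usage message instead of an IndexError traceback.
        raise SystemExit(f"Usage: {argv[0]} <screenotate-file.html>")
    return argv[1]


# In the script itself this would be called as get_input_path(sys.argv).
user_file = get_input_path(["script.py", "screenshot.html"])
print(user_file)
```

Passing the argument list in explicitly, rather than reading sys.argv inside the function, keeps the helper easy to try out in an interpreter.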

Program Flow

The user passes the name of the HTML file generated by Screenotate, which is then stored in the user_file variable.
The file is then run through Beautiful Soup, creating a BeautifulSoup object, which represents the document as a nested data structure. The find_all("pre") method finds the relevant text for the user, since Screenotate encloses the OCR'd text in <pre></pre> tags.
We use a simple regex to match the pre tags so they can be removed; we don't want them in our output text.
Per the exhaustive docstring, the remove_tags() function removes the tags from the file generated by the OCR app, strips line breaks, and copies the relevant text to the clipboard:

def remove_tags(text):
    # `tags` is the compiled regex described above, defined elsewhere in the script
    notags = tags.sub("", text)  # Strip tags
    without_line_breaks = notags.replace("\n", " ")  # Strip line breaks
    pyperclip.copy(without_line_breaks)  # Copy text to clipboard
    print("Parsed text copied to clipboard!")
    return without_line_breaks

We call the function on the stringified text parsed by Beautiful Soup.
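
The tag-stripping step can be tried on its own, without a clipboard or an HTML file. The sketch below is an assumption-laden stand-in: the page does not show how tags is defined, so the pattern here is a guess at a "simple regex" for tags, and strip_tags_and_breaks is a hypothetical name for a variant of remove_tags() that skips the pyperclip call.

```python
import re

# Assumed pattern (not shown on this page): anything between "<" and ">"
# is treated as a tag.
tags = re.compile(r"<[^>]+>")


def strip_tags_and_breaks(text):
    """Remove tags, then replace line breaks with spaces (no clipboard)."""
    notags = tags.sub("", text)  # Strip tags
    return notags.replace("\n", " ")  # Strip line breaks


# A made-up sample in the shape Screenotate is described as producing.
sample = "<pre>Meeting notes\nfrom the screenshot</pre>"
print(strip_tags_and_breaks(sample))  # Meeting notes from the screenshot
```

A pattern this simple is only reasonable because the input is a small, known HTML fragment; it is not a general-purpose way to strip markup.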
