
Script #0

Liam Thompson edited this page Feb 5, 2021 · 2 revisions

First step

The first script commit is a command-line tool to which we pass the name of an HTML file generated by Screenotate, the underlying OCR program. The Python script uses Beautiful Soup to parse the HTML, a simple regex to remove the tags, and pyperclip to copy the relevant text to the clipboard.

#!/usr/bin/env python3
import sys
import re
import pyperclip
from bs4 import BeautifulSoup

Prerequisites: Shebangs & Imports

The script is executed from the command line, so its first line is a shebang that tells the shell to run it with python3.

The script has four import statements:

  • sys allows us to take arguments from the command line
  • re provides regular expression matching operations
  • pyperclip provides copy and paste clipboard functions
  • BeautifulSoup (from the bs4 package) allows us to parse HTML files
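As a rough illustration of the sys bullet above, the file name could be read from the command line along these lines. The helper name get_user_file and the usage message are assumptions made for this sketch, not part of the original script; only the user_file variable name comes from this page.

```python
import sys


def get_user_file(argv):
    """Return the HTML file name given as the only command-line argument.

    Hypothetical helper: the original script may read sys.argv directly.
    """
    if len(argv) != 2:
        # Exit with a usage hint when the file name is missing or extra args are given
        raise SystemExit("usage: script.py FILE.html")
    return argv[1]


# In a real run this would be: user_file = get_user_file(sys.argv)
# A fixed list is used here so the sketch runs without command-line arguments.
user_file = get_user_file(["script.py", "screenshot.html"])
print(user_file)
```

Passing the argument list explicitly, rather than reading sys.argv inside the function, keeps the helper easy to try out in an interactive session.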

Program Flow

  1. The user passes the name of the HTML file generated by Screenotate, which is then stored in the user_file variable.

  2. The file is then run through Beautiful Soup, creating a BeautifulSoup object, in which the document is represented as a nested data structure. The find_all("pre") method finds the relevant text for the user, because Screenotate encloses the OCR'd text in <pre></pre> tags.

  3. We use a simple regex (stored in the tags variable) to match the pre tags so they can be removed; we don't want them in our output text.

  4. Per the exhaustive docstring, the remove_tags() function removes the tags from the file generated by the OCR app, strips line breaks, and copies the relevant text to the clipboard:

def remove_tags(text):
    # `tags` is the compiled regex from step 3, defined at module level
    notags = tags.sub("", text)  # Strip tags
    without_line_breaks = notags.replace("\n", " ")  # Replace line breaks with spaces
    pyperclip.copy(without_line_breaks)  # Copy text to clipboard
    print("Parsed text copied to clipboard!")
    return without_line_breaks

  5. We call the function on the stringified text parsed by Beautiful Soup.
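The tag-stripping and line-break steps above can be tried without the clipboard using a minimal sketch like the one below. The regex pattern is an assumption (this page does not show the one the script actually compiles), and strip_tags_and_breaks is a hypothetical, clipboard-free stand-in for remove_tags(). The sample string stands in for the text Beautiful Soup would hand over.

```python
import re

# Assumed pattern: matches any HTML tag, e.g. <pre> or </pre>.
# The original script's pattern may differ.
tags = re.compile(r"<[^>]+>")


def strip_tags_and_breaks(text):
    """Clipboard-free variant of remove_tags(), for illustration only."""
    notags = tags.sub("", text)       # Strip tags
    return notags.replace("\n", " ")  # Replace line breaks with spaces


# Stand-in for the stringified <pre> element from Screenotate's HTML
sample = "<pre>Hello\nworld</pre>"
result = strip_tags_and_breaks(sample)
print(result)  # Hello world
```

Leaving pyperclip out of the sketch means it runs on machines without a clipboard backend; the real function adds the copy and the confirmation message on top of these two steps.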