A simple extractor based on BeautifulSoup, you can use it to iterate through all the HTML files in the website root directory and get the text, placeholders and other text.
-Python 3 (version 3.8.0) -BeautifulSoup 4 (version 4.4.0) -CSV tool
-path
: the url of the website root directory. (eg. 'CurrentRoot/src' ,'C:/User/Wbesite/src' )
Please modify this var before running this script.
The function findHTML
will be called to iterate the whole folder and store a list of urls(htmls
) of all .html files in this root directory.
-The function extract
needs one reference named path, which should be the url of targer .html which you want to deal with(generally one of the url from htmls list)
-This script is going to aim 3 types of text: Text
, Placeholder
,Mattooltip
. (Detailed script description on BeautifulSoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
text = soup.find_all(text=True)
placeholders = soup.find_all(placeholder=True)
mattooltips = soup.find_all(mattooltip=True)
-Filters: there is a list called blacklist
which contains all the tag names we want to ignore(e.g header, meta,style). For any specific tags, you can just add the tag name into this list to block them.
if s.name not in blacklist
-Rules: Generally, we need to block some meaningless texts like digits and interpolation expressions(Angular):
digit
:
if not t.strip().encode('UTF-8').isdigit():
Interpolation expressions
:
if not isInterpolationExpressions(t):
### following
def isInterpolationExpressions(t):
return '{{' in t.strip() and '}}' in t.strip()
In this script, this expected output is .csv spreadsheet file. Python 3.8.0 combined with the csv tool:
with open("names.csv", 'a+',newline="",encoding='utf-8') as csvfile:
writer = csv.writer(csvfile,dialect='excel')
writer.writerow(['Term','Url'])
for p in htmls:
print('Current File' + p)
extract(p)
extract(path):
###
code
###
def addTextToOutput():
###
code
###
writer.writerow([t.strip(),path])
def PlaceholderTextToOutput():
###
code
###
writer.writerow([t.strip(),path])
def addMatTooltipTextToOutput():
###
code
###
writer.writerow([t.strip(),path])
CSV tool is openning file name.csv
with encoding in utf-8
, then extract
function will add new rows including required Texts
, Placeholders
and Mattooltips
.
The ideal output includes two Column Term
and Url
Term | Url |
---|---|
Forgot password | src/app/view/account/forgot-password/forgot-password.component.html |
close | src/app/view/account/forgot-password/forgot-password.component.html |
Cancel | src/app/view/account/forgot-password/forgot-password.component.html |
Send | src/app/view/account/forgot-password/forgot-password.component.html |
Client Code | src/app/view/account/forgot-password/forgot-password.component.html |
User Name | src/app/view/account/forgot-password/forgot-password.component.html |
... | ... |