The code and logic in this repo have done their job, successfully scraping the legacy HTML content into structured DITA.
After nearly 18 months of development, the code was run once "for real" and will now be archived.
Legacy content for the Field Service Manual, a digital twin that we will develop against.
See the original mock data here.
- python >= 3.9
- beautifulsoup4
Python dependencies are specified in requirements.txt in the project root directory. Install them automatically by running the following pip command from the same directory:
pip install -r requirements.txt
Execute the following command from the project root directory to run the tests.
python -m unittest discover
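If you add tests of your own, note that discovery only picks up files whose names match test_*.py. A minimal discoverable test, purely as a sketch (the file and class names are hypothetical):

```python
# tests/test_smoke.py -- hypothetical path; discovery matches test_*.py files
import unittest


class SmokeTest(unittest.TestCase):
    def test_discovery(self):
        # A trivial assertion; the repo's real tests assert parser behaviour.
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```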
Execute the following command from the project root directory to run the program. The test_payload_for_parsed_json parameter is optional.
Before running, make sure DITA-OT is installed (it is used for the DITA validation step):

- Check if DITA-OT is already installed by running the command below:

  dita --version

- If the `dita` command is not found, you need to download and install the latest version of DITA-OT from the official website.
- Once you have downloaded and extracted the package to a directory on your system, you can add the `DITA-OT/bin` directory to your `PATH` variable by adding the following line to your `.bash_profile` file:

  export PATH=$PATH:/path/to/DITA-OT/bin

  Replace `/path/to/DITA-OT` with the actual path to your DITA-OT installation directory. After saving the file, you'll need to reload your `.bash_profile` file by running the following command:

  source ~/.bash_profile

- Test that DITA-OT is installed correctly by running the `dita --version` command in your terminal. If the command returns a version number, then DITA-OT is installed and configured correctly. You can now use the `dita` command in your terminal to perform various DITA-related tasks.
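If you prefer to check this from Python before enabling the validation step, here is a minimal sketch using only the standard library (the check is illustrative, not part of the parser):

```python
import shutil
import subprocess

# shutil.which returns the full path to the dita executable,
# or None if it is not on PATH.
dita = shutil.which("dita")
if dita is None:
    raise SystemExit("dita not found on PATH - install DITA-OT first")

# Mirror `dita --version` and print the reported version number.
result = subprocess.run([dita, "--version"], capture_output=True, text=True)
print(result.stdout.strip())
```

Once `dita` resolves, run the parser from the project root: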
python parser/lman_parser.py data
There are a number of arguments that can be passed to the parser script. Run with --help
to see them:
python parser/lman_parser.py --help
usage: lman_parser.py [OPTION] DATA_PATH [LOGGING_LEVEL]

positional arguments:
  DATA_PATH      Path to the source data
  LOGGING_LEVEL  Debug level - must be one of debug, warning or info

options:
  -h, --help            show this help message and exit
  --skip-first-run, --no-skip-first-run
                        Skip Run 1, loading the shopping list from the link_tracker.json file
  --warn-on-blank-runs, --no-warn-on-blank-runs
                        Print warning messages whenever runs of blank paragraphs are found
  --run-validation, --no-run-validation
                        Run the DITA validation step
These allow you to skip the first run, warn about sequences of blank paragraphs or run the DITA validation.
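The paired `--flag`/`--no-flag` options are the form argparse generates with `BooleanOptionalAction`, available from Python 3.9 (which matches the version requirement above). A minimal sketch of how such a flag can be declared; this is an illustration, not the parser's actual source:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("data_path", metavar="DATA_PATH", help="Path to the source data")
# BooleanOptionalAction creates both --skip-first-run and --no-skip-first-run.
parser.add_argument(
    "--skip-first-run",
    action=argparse.BooleanOptionalAction,
    default=False,
    help="Skip Run 1, loading the shopping list from the link_tracker.json file",
)

args = parser.parse_args()
print(args.skip_first_run)  # True with --skip-first-run, False with --no-skip-first-run
```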
There is a separate script to run the parser for a single file. Run:
python parser/parse_single_file.py INPUT_FOLDER SINGLE_FILE LOGGING_LEVEL
The only extra parameter is SINGLE_FILE, which should be the path to the individual file you want to process. For example:
python parser/parse_single_file.py data /home/LegacyMan/data/Britain_Cmplx/Phase_F.html debug
will process Phase_F.html with debug logging.
Apply the version-controlled (checked into Git) HTML discrepancy corrections available at the client site. Checking out the applicable branch and replacing the files in the project location should suffice.
Execute the following command from the project root directory to run the program. The test_payload_for_parsed_json parameter is optional.
The <data directory> placeholder must be replaced with a valid path containing the following directories:
- PlatformData (as in <data directory>/PlatformData/PD_1.html)
- QuickLinksData (as in <data directory>/QuickLinksData/Abbreviations.html)
From tests, an absolute path such as c:\real_data works more reliably than a relative path such as ..\..\real_data.
python parser\lman_parser.py <data directory> <target directory>
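For example, with hypothetical directory names:

python parser\lman_parser.py c:\real_data c:\dita_out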
We have also developed a checking script that searches for content in files in the source directory and checks that it appears in the target directory. Each 'chunk' of content is a piece of plain text (i.e. no tags in the middle of it) that is at least 40 characters long.
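As a rough sketch of what counts as a chunk, here is how such strings can be pulled from an HTML file with beautifulsoup4 (already in requirements.txt); the function and file names are illustrative, not the script's actual code:

```python
from bs4 import BeautifulSoup

MIN_CHUNK_LEN = 40  # minimum chunk length, per the description above


def text_chunks(html: str) -> list[str]:
    """Return tag-free text strings at least MIN_CHUNK_LEN characters long."""
    soup = BeautifulSoup(html, "html.parser")
    # stripped_strings yields each run of text with no tags in the middle.
    return [s for s in soup.stripped_strings if len(s) >= MIN_CHUNK_LEN]


with open("source.html", encoding="utf-8") as f:  # hypothetical input file
    for chunk in text_chunks(f.read()):
        print(chunk)
```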
To see the file checker's options, run:
python parser/check_files.py --help
This will show the following help text:
usage: check_files.py [OPTION] SOURCE_PATH TARGET_PATH

positional arguments:
  SOURCE_PATH    Path to the source data
  TARGET_PATH    Path to the target (converted) data

options:
  -h, --help     show this help message and exit
  --files FILES  Either an integer, to process that number of random files, or 'all' to process all files (default: 5)
  --text TEXT    Either an integer, to read that number of random text strings from each file, or 'all' to read all valid text strings from the file (default: 10)
Running with just the path to the source and target files will check 10 chunks of text each from 5 files. These numbers can be adjusted by setting the --files and --text parameters, and the checking can be run on all chunks of text in all files by passing --files all --text all.
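For example, to check every text chunk in 20 randomly chosen files (directory names here are hypothetical):

python parser/check_files.py data converted --files 20 --text all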