This document describes the steps to convert an HTML file to Markdown for MDN content.
To perform Markdown conversion, you must:
- Have Git, NodeJS >=v12 <v17 and Yarn installed
- Have a GitHub account (it's free!)
- Have a local copy of mdn/markdown: the conversion tool is located in this repo
  - See README for setup instructions
  - This script was originally in mdn/yari, but it has been forked into this repository for easier maintenance and sunsetting when it is no longer needed
- Have a local copy of mdn/content and/or mdn/translated-content
  - mdn/content is needed for en-US, whereas mdn/translated-content is for all other locales
Basically, we will perform the following:
- Perform a dry run of Markdown conversion and:
  - Assess the report generated by the script
  - Update the HTML document to remove problematic elements
  - Repeat the above steps until satisfactory
- Run the Markdown conversion script for real
- Submit pull requests with the changes
Perform a dry run of the conversion by running `yarn run h2m <folder> --locale <locale> --mode dry` in the folder of your local checkout of mdn/markdown, where `<folder>` is the specific folder relative to the language root (AKA `mdn-translated-content/`). This will perform a test run of the conversion and generate a report, but will not modify any files.
Once the script has finished, it will display a message containing the count of HTML elements that could not be converted, as well as the name of the report file (`md-conversion-problems-report-[TIMECODE].md`) that was created in the script repo's folder and contains more details. This report describes all of the elements that the script could not handle and has therefore left as HTML. Most of these will need to be removed from the HTML first (see Common unhandled elements), but some can be ignored.
You can see examples of such reports:
If the message did not appear and there was no new report file, great! That means that the conversion was 100% successful and you can now perform the real conversion.
There are a number of elements you will often see in the "unhandled elements list". This section lists the most common ones and explains how to fix them for conversion.
`dl` > `dt`/`dd`
- The conversion script has strict expectations for the contents of a `dl` element. The first child element should be a `dt` element, and for every `dt` element, there should be one, and only one, corresponding `dd` element.
  - If the number of `dt` and `dd` elements is not equal, the script cannot convert them.
  - Remove any stray `dt` elements, and combine sibling `dd` elements together using `<br />` tags, then try again.
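The pairing rule above can be checked before running the script. The following is a minimal sketch, not the conversion script's own logic: it assumes well-formed HTML with explicit closing tags, and it interprets the rule as strictly alternating `dt`, `dd` children. The names `DlChecker`, `is_convertible`, and `check_dls` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DlChecker(HTMLParser):
    """Collect the direct child element names of every dl element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # names of currently open elements
        self.open_dls = []   # one child-name list per currently open dl
        self.finished = []   # child-name lists of closed dl elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        # A direct child of a dl is any element opened while dl is the innermost open tag.
        if self.open_tags and self.open_tags[-1] == "dl":
            self.open_dls[-1].append(tag)
        self.open_tags.append(tag)
        if tag == "dl":
            self.open_dls.append([])

    def handle_endtag(self, tag):
        if tag in VOID or tag not in self.open_tags:
            return
        # Pop until the matching element is closed.
        while self.open_tags:
            closed = self.open_tags.pop()
            if closed == "dl":
                self.finished.append(self.open_dls.pop())
            if closed == tag:
                break


def is_convertible(children):
    """Assumed rule: non-empty, starts with dt, and alternates dt, dd, dt, dd."""
    if not children or len(children) % 2 != 0:
        return False
    return all(name == ("dt" if i % 2 == 0 else "dd")
               for i, name in enumerate(children))


def check_dls(html):
    """Return (children, convertible) for each dl, in closing order."""
    parser = DlChecker()
    parser.feed(html)
    parser.close()
    return [(children, is_convertible(children)) for children in parser.finished]
```

A `dl` with two sibling `dd` elements after one `dt` is reported as not convertible; merging the two `dd` elements with a `<br />` makes the same list pass.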
`*.hidden`
- The `.hidden` class was used for content that would show when editing the content in the old wiki engine, and would not show to a typical reader. More than likely, these should all simply be removed as they are no longer helpful.
  - Either remove the class or remove the entire element and its contents at your own discretion.
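To locate `.hidden` elements before deciding what to do with each one, a small scan can help. This is a minimal sketch using Python's standard `html.parser`; the names `HiddenFinder` and `find_hidden` are hypothetical, and it only reports elements, leaving the decision to remove the class or the whole element to you.

```python
from html.parser import HTMLParser


class HiddenFinder(HTMLParser):
    """Record (line, tag) for every start tag whose class list contains 'hidden'."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or valueless, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if "hidden" in classes:
            line, _column = self.getpos()
            self.found.append((line, tag))


def find_hidden(html):
    """Return the line number and tag name of each element with the hidden class."""
    finder = HiddenFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```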
`th`/`td`
- Table cells may sometimes include lists, code blocks, and other multiline content. Since Markdown does not allow this, tables with these cell contents cannot be converted.
  - Separate the table contents into a multi-paragraph structure if possible.
- Translators: compare the document to the current English locale for an example of how to handle that specific element.
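Table cells with multiline content can also be found ahead of time. The sketch below flags lists and `pre` blocks inside `th`/`td` cells. The set of tags in `BLOCK_TAGS` is an assumption about what counts as "multiline content", not the script's actual list, and the names `CellScanner` and `find_block_cells` are hypothetical. It assumes cells have explicit closing tags.

```python
from html.parser import HTMLParser

# Assumed set of elements that make a cell multiline; the script's real list may differ.
BLOCK_TAGS = {"ul", "ol", "dl", "pre"}


class CellScanner(HTMLParser):
    """Record (line, tag) for block-level elements found inside table cells."""

    def __init__(self):
        super().__init__()
        self.cell_depth = 0  # greater than 0 while inside a th or td
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.cell_depth += 1
        elif self.cell_depth and tag in BLOCK_TAGS:
            self.problems.append((self.getpos()[0], tag))

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell_depth:
            self.cell_depth -= 1


def find_block_cells(html):
    """Return the line number and tag of each block element nested in a cell."""
    scanner = CellScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.problems
```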
Once you have taken care of a good chunk of the elements the script reported as unhandled, re-run the script in the `dry` operating mode. A new report file will be generated to describe what remains, if anything. Repeat the above steps to reduce the list as much as possible. Once you are satisfied with the result, it is time to run the conversion script for real.
To make review easier when the document is converted to Markdown, you may want to submit a pull request containing only the cleanup. In the event that you have removed some portion from the file (e.g. a `.hidden` block), this will help convey that it was intentionally removed.
You may skip this step and head straight to conversion, but we recommend at least creating a separate commit to track the changes.
Once the preparations have been made, you are ready to perform the conversion. You can now run `yarn md h2m <folder> --locale <locale> --mode replace` and open a PR with the changes. The `replace` mode will first rename the HTML files from `.html` to `.md` without performing any conversion, then it will stage those changes, and finally convert the file contents (without staging them). To better retain git history, we recommend committing the staged changes (the files being moved), and then creating another commit with the conversion.
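The two-commit sequence described above can be written out as an ordered list of commands. This is only a sketch: the conversion command is copied from this guide, while the commit messages, the use of `git add --update`, the example arguments `web/html` and `fr`, and the function name `replace_mode_plan` are assumptions made for illustration.

```python
import shlex


def replace_mode_plan(folder, locale):
    """Return the commands for a two-commit conversion, in the order to run them."""
    return [
        # Renames .html to .md, stages the renames, then converts the
        # contents without staging them (as described in this guide).
        ["yarn", "md", "h2m", folder, "--locale", locale, "--mode", "replace"],
        # Assumed to run in the content checkout: commits only what is
        # already staged, that is, the renames.
        ["git", "commit", "-m", "Rename HTML files to .md"],
        # Stage the converted contents of the files that are now tracked as .md.
        ["git", "add", "--update"],
        ["git", "commit", "-m", "Convert to Markdown"],
    ]


# Print the plan as shell-ready command lines (hypothetical folder and locale).
for command in replace_mode_plan("web/html", "fr"):
    print(shlex.join(command))
```

Keeping the rename and the conversion in separate commits lets git record the first commit as a pure file move, which is what preserves the history of each page.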
Sometimes, characters within macros will be unintentionally escaped as a part of the conversion. Make sure to check for macro issues by running `yarn build <files...>` and checking for any errors.
To speed up review time and reduce the chance of merge conflicts while your PR is in review, it is highly recommended to keep the number of files touched to a minimum. Although the changes are created using this script, every PR still needs to be carefully reviewed for accuracy and malicious changes.
- Typography
  - You may decide to keep some unconverted elements for consistency and/or typography reasons. For instance, in the French docs, we were already using `<sup>` consistently for ordinal numbers and `<i lang="en">` for English terms which are not code and are sometimes kept untranslated for clarity (e.g. "viewport").
- Yari translation
  - As stated earlier, if you use the converter, please add your localizations for keywords to https://github.com/mdn/yari/tree/main/markdown/localizations. Yari will use those when translating back content from MD to HTML (if I understood correctly)
- Using commits from the last HTML state of mdn/content
  - When tackling issues in existing content, remember that the English content is the reference for the localized content: don't hesitate to browse the mdn/content repo at the last commit before the Markdown conversion. You may then be able to update your localized content's structure to match the most recent English one.
  - List of those commits per section (poking @wbamberg if it may help tagging :)) ⏳
- The discussions about Markdown on mdn/content
- The initial discussion about Markdown conversion for mdn/translated-content
  - Contains a list of PRs from @wbamberg
This guide was originally written by @SphinxKnight for the localization team. It has been updated by @queengooborg for the new script updates.