DOCX reader: Nested lists (numbers & literals) with new start values and new OL-blocks #10096

citizen422 · 2024-08-17T16:23:01Z

citizen422
Aug 17, 2024

Hello!

First of all, Pandoc has become very powerful for converting DOCX files to EPUB files!

Because of this I have tried to convert more complex Word DOCX files to EPUB2 files. Some of my documents have nested lists, with numbers and literals. Here, I can reproduce an error concerning the XHTML-output:

Although EPUB2 does not allow a START value on an OL block, some OL blocks in the output have specific start values.
Even though the lists are nested, the second part of the main level starts in a new OL block. Apart from changing the left indent, this interrupts the counter of the first-level numbering.

I have attached a DOCX file that shows this specific problem:

Nested List.docx

If there is a way to change how Pandoc reads the DOCX file, I would be interested to learn about it. Beyond that, I hope this can be confirmed as a reproducible bug and then fixed.

Regards!

jgm · 2024-08-17T17:43:48Z

jgm
Aug 17, 2024
Maintainer

I took a look at the XML in this document.

Pandoc usually expects the paragraphs in a list to have embedded numPr that indicates list level etc. For example,

<w:numPr><w:ilvl w:val="0" /><w:numId w:val="1001" /></w:numPr>

Here the w:ilvl indicates the list level, and the numId specifies the style of list.
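As a rough illustration of that lookup (not pandoc's actual reader code, which is written in Haskell), here is a minimal Python sketch that reads `w:ilvl` and `w:numId` from a paragraph's own `w:numPr`. The function name `direct_num_pr` and the sample paragraph are made up for this example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"


def direct_num_pr(paragraph):
    """Return (ilvl, numId) from the paragraph's own w:numPr, or None if absent."""
    num_pr = paragraph.find("w:pPr/w:numPr", NS)
    if num_pr is None:
        return None
    ilvl = num_pr.find("w:ilvl", NS)
    num_id = num_pr.find("w:numId", NS)
    return (
        ilvl.get(VAL) if ilvl is not None else None,
        num_id.get(VAL) if num_id is not None else None,
    )


# Hypothetical paragraph carrying the numPr shown above
SAMPLE = f'''<w:p xmlns:w="{W}">
  <w:pPr><w:numPr><w:ilvl w:val="0" /><w:numId w:val="1001" /></w:numPr></w:pPr>
  <w:r><w:t>Item</w:t></w:r>
</w:p>'''

print(direct_num_pr(ET.fromstring(SAMPLE)))  # ('0', '1001')
```

A paragraph without an embedded `w:numPr` makes this function return `None`.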

In your document, the paragraphs don't have this:

    <w:p w:rsidR="00AB53E2" w:rsidRPr="00AB53E2" w:rsidRDefault="00AB53E2" w:rsidP="00AB53E2">
      <w:pPr>
        <w:pStyle w:val="Listennummer2"/>
      </w:pPr>
      <w:r w:rsidRPr="00AB53E2">
        <w:t>Literal-a
        </w:t>
      </w:r>
    </w:p>

The styles.xml contains:

    <w:style w:type="paragraph" w:styleId="Listennummer">
      <w:name w:val="List Number"/>
      <w:basedOn w:val="Standard"/>
      <w:uiPriority w:val="99"/>
      <w:unhideWhenUsed/>
      <w:rsid w:val="00AB53E2"/>
      <w:pPr>
        <w:numPr>
          <w:numId w:val="3"/>
        </w:numPr>
        <w:spacing w:before="120" w:after="0"/>
        <w:jc w:val="lowKashida"/>
      </w:pPr>
      <w:rPr>
        <w:rFonts w:ascii="Calibri" w:hAnsi="Calibri"/>
        <w:sz w:val="21"/>
        <w:szCs w:val="21"/>
        <w:lang w:bidi="ar-SA"/>
      </w:rPr>
    </w:style>

    <w:style w:type="paragraph" w:styleId="Listennummer2">
      <w:name w:val="List Number 2"/>
      <w:basedOn w:val="Standard"/>
      <w:uiPriority w:val="99"/>
      <w:unhideWhenUsed/>
      <w:rsid w:val="00AB53E2"/>
      <w:pPr>
        <w:numPr>
          <w:numId w:val="4"/>
        </w:numPr>
        <w:spacing w:before="160" w:after="0"/>
        <w:jc w:val="lowKashida"/>
      </w:pPr>
      <w:rPr>
        <w:rFonts w:ascii="Calibri" w:hAnsi="Calibri"/>
        <w:sz w:val="21"/>
        <w:szCs w:val="21"/>
        <w:lang w:bidi="ar-SA"/>
      </w:rPr>
    </w:style>

which does specify the numId (so we get the right style of list), but nothing here specifies the list level. That is why they're all coming out as first-level, I think.

But obviously it works okay in Word. I'm not sure where Word is getting the list level information?

Anything special about how this was created?
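
To make the gap concrete, here is a small Python sketch (an illustration only, not how pandoc is implemented) that follows a paragraph's `w:pStyle` into styles.xml and reports what the style's `w:numPr` provides. The helper name `style_num_pr` is hypothetical, and the XML is trimmed down from the snippets in this comment.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"
STYLE_ID = f"{{{W}}}styleId"

# Trimmed-down styles.xml with only the parts relevant to numbering
STYLES = f'''<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Listennummer2">
    <w:name w:val="List Number 2"/>
    <w:pPr><w:numPr><w:numId w:val="4"/></w:numPr></w:pPr>
  </w:style>
</w:styles>'''

# Paragraph that references the style but has no numPr of its own
PARAGRAPH = f'''<w:p xmlns:w="{W}">
  <w:pPr><w:pStyle w:val="Listennummer2"/></w:pPr>
  <w:r><w:t>Literal-a</w:t></w:r>
</w:p>'''


def style_num_pr(paragraph, styles):
    """Follow w:pStyle into the styles tree and return (ilvl, numId) found there."""
    p_style = paragraph.find("w:pPr/w:pStyle", NS)
    if p_style is None:
        return None
    wanted = p_style.get(VAL)
    for style in styles.findall("w:style", NS):
        if style.get(STYLE_ID) != wanted:
            continue
        num_pr = style.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            return None
        ilvl = num_pr.find("w:ilvl", NS)
        num_id = num_pr.find("w:numId", NS)
        return (
            ilvl.get(VAL) if ilvl is not None else None,
            num_id.get(VAL) if num_id is not None else None,
        )
    return None


print(style_num_pr(ET.fromstring(PARAGRAPH), ET.fromstring(STYLES)))  # (None, '4')
```

The style yields a numId of 4 but no level, which matches the observation above that nothing in the style says which list level the paragraph is on.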


citizen422 · 2024-08-17T19:46:43Z

citizen422
Aug 17, 2024
Author

Hello John,

Thank you for addressing my question.

It is true that in this kind of Word document the indent levels are not given as list levels. Instead, there are at least four contextual indicators that might support getting the right corresponding indent level:

1. All subsequent paragraphs, numbered or bulleted, belong to one level or another level.
2. Each numbering style belongs to one specific indent level.
3. The content level is technically equal to its first appearance, which also corresponds to the left margin of the paragraph used on each level. Thus, the first numbering style used in this range defines the first indent level, the second numbering style the second indent level, and so on.
4. If an earlier numbering style and paragraph margin is repeated, the earlier numbered list is continued.

The third indicator in particular might help to identify the different levels at the point where they are introduced.

I hope this helps in understanding how this type of DOCX might be parsed into XHTML. If you need further examples, please let me know.

Kind regards,
Tobias
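
The third and fourth indicators could be sketched roughly as follows. This is only an illustration of the heuristic proposed in this comment, not something pandoc does; the function name `infer_levels` and the sample sequence of style IDs are assumptions made for the example, and the left-margin check is left out.

```python
def infer_levels(style_ids):
    """Assign each paragraph a 0-based level by the order in which its style first appears.

    A style seen again later maps back to its earlier level, so the earlier
    list is treated as continued rather than restarted.
    """
    first_seen = {}
    levels = []
    for style_id in style_ids:
        if style_id not in first_seen:
            first_seen[style_id] = len(first_seen)
        levels.append(first_seen[style_id])
    return levels


# Hypothetical run of consecutive list paragraphs, identified by their pStyle
paragraph_styles = ["Listennummer", "Listennummer2", "Listennummer2", "Listennummer"]
print(infer_levels(paragraph_styles))  # [0, 1, 1, 0]
```

The input is meant to be one contiguous range of list paragraphs; a new range would start with an empty `first_seen` mapping.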


jgm · 2024-08-17T23:04:41Z

jgm
Aug 17, 2024
Maintainer

> Instead, there are at least four contextual indicators that might support getting the right corresponding indent level: 1. All subsequent paragraphs, numbered or bulleted, belong to one level or another level. 2. Each numbering style belongs to one specific indent level.

I don't see where or how either of these things is specified in the XML, though.

