I have a bunch of docx files with deep outlines (up to 6 levels deep) in them. The author tried to apply Word's heading styles to each level, but for some reason it didn't work as expected.
My goal is to develop a pandoc (pandoc.org) filter that would automatically transform what you can see in original-from-docx.txt into what can be seen in expected-filtered-output.txt. Needless to say, this is just a sample of the 268-page docx file I'm processing.
This is the command I'm running to generate original-from-docx.txt:
pandoc -f docx -t markdown -S -o original-from-docx.txt outline.docx
I need to develop a filter that would:
- convert hard line breaks into proper paragraph breaks.
- for each new list item, take its first paragraph, remove any formatting, and convert it into a heading according to its nesting level.
But I'm open to other ideas :)
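To make the two requirements above concrete, here is a rough sketch of the kind of filter I have in mind, written against pandoc's JSON AST using only the Python standard library. It is an assumption-laden starting point, not a finished solution: it assumes a pandoc version that serializes the document as an object with a "blocks" key (older releases used a different top-level layout), it treats both bullet and ordered lists as outline levels, it caps headings at level 6, and the function names (plain_text, split_breaks, transform, and so on) are my own inventions. I have not run it against the real 268-page file.

```python
#!/usr/bin/env python3
"""Sketch of a pandoc JSON filter: nested list items -> headings."""
import json
import sys

# Inline types whose content is simply a list of inlines.
WRAPPERS = {"Emph", "Strong", "Strikeout", "Superscript",
            "Subscript", "SmallCaps", "Underline"}


def plain_text(inlines):
    """Flatten inlines to a string, discarding all formatting."""
    out = []
    for el in inlines:
        t, c = el["t"], el.get("c")
        if t == "Str":
            out.append(c)
        elif t in ("Space", "SoftBreak", "LineBreak"):
            out.append(" ")
        elif t in WRAPPERS:
            out.append(plain_text(c))
        elif t in ("Span", "Link", "Image", "Quoted"):
            # The inline content is the second field; quote marks are dropped.
            out.append(plain_text(c[1]))
        elif t == "Code":
            out.append(c[1])
    return "".join(out)


def split_breaks(block):
    """Split a Para/Plain at hard line breaks into separate Para blocks."""
    chunks, current = [], []
    for el in block["c"]:
        if el["t"] == "LineBreak":
            chunks.append(current)
            current = []
        else:
            current.append(el)
    chunks.append(current)
    return [{"t": "Para", "c": chunk} for chunk in chunks if chunk]


def header(level, inlines):
    """Build a Header block containing only plain words and spaces."""
    content = []
    for i, word in enumerate(plain_text(inlines).split()):
        if i:
            content.append({"t": "Space"})
        content.append({"t": "Str", "c": word})
    return {"t": "Header", "c": [level, ["", [], []], content]}


def transform_item(item, level):
    """Turn one list item into a heading followed by its remaining blocks."""
    out, rest = [], item
    if item and item[0]["t"] in ("Para", "Plain"):
        paras = split_breaks(item[0])
        if paras:
            out.append(header(min(level, 6), paras[0]["c"]))
            out.extend(paras[1:])
        rest = item[1:]
    out.extend(transform(rest, level))
    return out


def transform(blocks, depth=0):
    """Flatten lists into headings; split paragraphs at hard line breaks."""
    out = []
    for block in blocks:
        t = block["t"]
        if t in ("BulletList", "OrderedList"):
            # OrderedList carries its list attributes before the items.
            items = block["c"] if t == "BulletList" else block["c"][1]
            for item in items:
                out.extend(transform_item(item, depth + 1))
        elif t in ("Para", "Plain"):
            out.extend(split_breaks(block))
        else:
            out.append(block)
    return out


def main():
    raw = sys.stdin.read()
    if not raw.strip():
        return  # nothing to filter
    doc = json.loads(raw)
    doc["blocks"] = transform(doc["blocks"])
    json.dump(doc, sys.stdout)


if __name__ == "__main__":
    main()
```

Saved under a made-up name such as outline_to_headings.py and marked executable, it would be plugged into the same conversion with pandoc's --filter option (filtered.txt is also just a placeholder name):
pandoc -f docx -t markdown -S --filter ./outline_to_headings.py -o filtered.txt outline.docx
If the first paragraph of an item itself contains a hard line break, only the text before the break becomes the heading and the rest stays as ordinary paragraphs, which seemed the least surprising choice.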