Writing an Online Documentation System using Omnis Studio
© Omnis Software, and its licensors 2020. All rights reserved.
Converting the Source documents
The internal format of Word documents is proprietary, so for the actual conversion of Word
documents to HTML, we tried several third-party utilities, eventually deciding on a program called
Pandoc, a universal document converter.
Know your Limitations
Whilst Pandoc is immensely powerful and converts to and from many different document formats,
we found that conversion from Microsoft Word documents came with caveats. The ability of Pandoc
to extract embedded images as separate PNG files was immediately attractive, as was the ability to
generate an embedded HTML table-of-contents based on the document headings. Simple tables,
ordered and un-ordered lists, bold, italic and underlined text can also be converted. Formatting
options are limited, however. Fonts are essentially ignored, and only Heading styles (1 to 6 in our
case) are extracted and converted to HTML headings <h1> to <h6>.
Working to these limitations, we edited and produced a set of source documents that would respect
this “limited palette” of formatting options. We also had to come to terms with the fact that when
converting to HTML, there is a difference between line breaks and paragraph breaks. Pressing
ENTER in a Word document results in a paragraph break (<p>) in the converted document, whereas
a simple line break (<br>) is achieved using SHIFT+ENTER.
Similarly, browsers collapse runs of multiple spaces into a single space when rendering HTML. To
work around this issue, it was necessary to replace all such occurrences in the source documents
with non-breaking space characters (CTRL+SHIFT+SPACE).
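The non-breaking space substitution described above can also be illustrated in code. The following is a minimal Python sketch, not part of the original workflow (where the replacement was made in the Word documents themselves); the function name preserve_space_runs is hypothetical. It replaces every space in a run of two or more spaces with U+00A0 and leaves single spaces alone.

```python
import re

NBSP = "\u00a0"  # Unicode non-breaking space (&nbsp; in HTML)


def preserve_space_runs(text: str) -> str:
    """Replace each space in a run of two or more spaces with a non-breaking space.

    Single spaces are left untouched so that normal word wrapping still works.
    """
    return re.sub(r" {2,}", lambda match: NBSP * len(match.group(0)), text)
```

Because only runs of two or more spaces are rewritten, ordinary text is unaffected and the rendered page keeps the spacing that the author intended.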
We also needed to make sure that any existing hyperlinks in the Word documents were
turned into absolute/external links where necessary. Otherwise, document-relative links could
break once the converted documents are moved or hosted in a different location.
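As an illustration of the distinction drawn above, a link can be classified as absolute or document-relative with Python's standard urllib.parse module. This is only a sketch under stated assumptions: the helper name is_absolute_link is hypothetical, and it treats a link as absolute only when it carries both a scheme and a host (so mailto: links, for example, are not counted).

```python
from urllib.parse import urlparse


def is_absolute_link(href: str) -> bool:
    """Return True when href has both a scheme (e.g. https) and a host name."""
    parts = urlparse(href)
    return bool(parts.scheme and parts.netloc)
```

A check of this kind could be used to list the links that still need attention before conversion.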
It is also worth noting that Pandoc produces one HTML document for each Word (.docx) document.
To avoid overly long HTML documents, you may find it necessary to subdivide your source
documents.
A Simple Conversion
Pandoc is a command-line utility. The intricacies of the Pandoc command-line syntax are left for
further reading. In essence, however, and using our Programming manual as an example, to convert
a Word document entitled “00about.docx” to its HTML counterpart we use the command:
pandoc.exe --include-in-header=C:\Users\garya\pandoc2\htmlhead.txt --toc --extract-media="W:\onlinedocs\Programming\00about" -o "W:\onlinedocs\Programming\00about.html" "F:\VCS\Docs\Programming\00about.docx"
--include-in-header allows us to specify some HTML content to be inserted into the <head> section
of the converted document. We use this file to inject a stylesheet into the page and to insert some
JavaScript that will be used later to assist with navigation and to make sure the page is displayed
together with the table of contents.
--toc tells Pandoc to produce an embedded table-of-contents (TOC) inside the target document.
Omnis will process and extract TOCs after conversion to produce a separate HTML file containing the
merged TOC entries from all converted document files.
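Since Pandoc produces one HTML file per Word document, the command shown above has to be repeated for every source file. The following Python sketch shows one way the argument list could be assembled for a given document. It is an illustration only, not the method used by the authors: the function name build_pandoc_command and its parameters are hypothetical, and it assumes Windows-style paths and the same layout as the example (HTML file and media folder both named after the document). Only the options that appear in the command above are used.

```python
from pathlib import PureWindowsPath


def build_pandoc_command(docx_path, output_dir, header_file, pandoc="pandoc.exe"):
    """Build the Pandoc argument list for converting one .docx file to HTML.

    Mirrors the example command: the HTML file and the extracted-media
    folder are both named after the source document (assumed layout).
    """
    source = PureWindowsPath(docx_path)
    out_dir = PureWindowsPath(output_dir)
    stem = source.stem  # e.g. "00about" for "00about.docx"
    return [
        pandoc,
        f"--include-in-header={header_file}",
        "--toc",
        f"--extract-media={out_dir / stem}",
        "-o",
        str(out_dir / f"{stem}.html"),
        str(source),
    ]
```

The resulting list can be passed to subprocess.run(command, check=True); passing the arguments as a list avoids having to quote paths by hand.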