April 01, 2023

A tribute to the .docx file format

There is an endless number of document formats and word processors available. In the LaTeX community alone there are at least 30 different GUIs for entering TeX source code, and LaTeX is only one of the less frequently used systems. It seems impossible to unite the different file formats into a single standard.

On the other hand, such a standard already exists. The combination of plain text files and JPEG files can be read on any operating system. The reason is that plain text contains no formatting information, and JPEG is a highly standardized graphics format supported by practically every operating system. The only part that is not standardized is the markup format that describes a document's layout and logical structure.

The working hypothesis is that the best format available today for this purpose is .docx. Contrary to a common assumption, a .docx file is not a single monolithic document but a zip archive that contains other files. The interesting consequence is that such an archive can be converted into plain text plus JPEG files on any operating system. Even if the original formatting of the document is lost, the text and the embedded JPEG images can still be extracted.
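To make the zip claim concrete, here is a minimal sketch in Python that lists the files stored inside a .docx archive, using only the standard library. The function name list_docx_parts is a hypothetical helper chosen for illustration, and the path passed to it is assumed to point to an ordinary .docx file.

```python
import zipfile


def list_docx_parts(docx_path):
    """Return (name, size in bytes) for every file stored in the .docx archive.

    list_docx_parts is an illustrative helper, not part of any library.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]
```

For a file written by a word processor, the result typically includes entries such as word/document.xml for the text and files under word/media/ for embedded images.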

From a technical perspective, docx can be used as an alternative to a zip file. It stores some text files plus image data. The ability to parse the formatting information is optional, and this markup is less important for understanding the content of a text. Let me give an example.

Suppose there is a docx file that contains a 200-page book. Unfortunately, it is not possible to open the file in MS Word, but the docx file can be converted into 30 JPEG files plus a single .txt file with the text-only content. The book can then be read without any problems; the missing sections, columns and hyphenation matter little to the normal user.
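The conversion described in this example can be sketched in a few lines of Python with the standard library only. This is a rough illustration under stated assumptions, not a complete converter: extract_docx and the output name content.txt are hypothetical names chosen here, the text is assumed to live in word/document.xml and the images under word/media/ (the usual locations in files written by Word), and footnotes, headers and all formatting are ignored.

```python
import zipfile
from pathlib import Path
from xml.etree import ElementTree

# XML namespace used by the WordprocessingML elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx(docx_path, out_dir):
    """Write the plain text to content.txt and copy the embedded images.

    Illustrative helper: returns (list of paragraph strings, list of image names).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    images = []
    with zipfile.ZipFile(docx_path) as archive:
        root = ElementTree.fromstring(archive.read("word/document.xml"))
        paragraphs = []
        # Each <w:p> is a paragraph; its visible text sits in <w:t> elements.
        for p in root.iter(W_NS + "p"):
            paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
        (out / "content.txt").write_text("\n".join(paragraphs), encoding="utf-8")
        # Embedded images are copied unchanged (they may be JPEG, PNG or other types).
        for name in archive.namelist():
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out / Path(name).name
                target.write_bytes(archive.read(name))
                images.append(target.name)
    return paragraphs, images
```

Calling extract_docx("book.docx", "book_out") with a hypothetical book.docx would leave one text file and the image files in the book_out directory, which any text editor and image viewer can open.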

Even though docx was invented by a single company for creating formatted documents, it can be used as a zip container for storing ordinary text files. This makes the format interesting as a universal document format.

A little-known fact is that the powerful pandoc tool is able to render a .docx file to PDF. This is done with the lualatex engine as an intermediate step.

pandoc --pdf-engine=lualatex input.docx -o output.pdf