April 11, 2023

LaTeX and the Knuth-Plass algorithm

In the past, many attempts were made to compare LaTeX with MS Word. At first glance the comparison is subjective: one person likes MS Word while another does not. To judge on a more rational basis, it first has to be defined what the goal of typography is. The hidden goal is to create justified text, which is the opposite of left-aligned (ragged-right) text.

If the goal is only to produce left-aligned text, the output of LaTeX and MS Word is the same. At the end of each line there is a white gap of varying width. Such text is easy to read, and it is also easy to write software that formats it. Nearly all word processors are able to do so. In contrast, if the objective is to produce fully justified text, many word processors reach their limits.

The goal of typography in general, and of LaTeX in particular, is to realize a kind of formatting that is known to be difficult. For the same task, "produce a justified paragraph", it is possible to compare different programs such as MS Word, LaTeX, InDesign and so on. LaTeX is known for its strength in this single use case.

In contrast, possible alternatives to LaTeX such as MS Word, LibreOffice, the fpdf2 library or an ASCII text editor are not able to produce high-quality justified text. These programs were designed primarily to typeset left-aligned text. In other words, anyone who wants to write a LaTeX replacement from scratch needs software that can produce justified text with ease.

Let us take a step back. The most advanced challenge in typography is to produce justified text. This goal was hard to reach with metal type before the advent of the computer, and it is also hard for modern software. In contrast, left-aligned text is much easier to realize. All the software has to do is place the characters next to each other with the same white space between the words. Writing a program or a library that does so is an easy task.
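
To illustrate how little is needed for left-aligned text, here is a minimal sketch of a first-fit line breaker. It assumes a monospace setting in which every character is one unit wide; the function name break_greedy and the width of 16 characters are made up for this example and do not come from any existing word processor.

```python
def break_greedy(text, width):
    """First-fit line breaking: fill a line until the next word no longer fits."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        # An overlong single word is accepted on its own line.
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


lines = break_greedy("the quick brown fox jumps over the lazy dog", 16)
for line in lines:
    # The ragged gap at the end of each line is simply what is left over.
    print(f"{line!r:20} gap = {16 - len(line)}")
```

Each decision is made once and never revisited, which is why the gap at the right edge fluctuates from line to line.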

In other words, LaTeX sees itself as mastering the hardest topic within typography, and comparing LaTeX with other programs only makes sense for this single problem. It is not about putting glyphs on a sheet of paper in general but about doing so in a certain arrangement.

Interestingly, most people (say 9 out of 10) will agree that LaTeX is better at creating justified paragraphs than Word, because this task can be measured on an objective basis. Such a benchmark does not explain why it is important to format text in this way. It is mostly a non-practical challenge that tests whether a certain book printer or typesetting program masters the complicated problems.

It should be mentioned that programs like MS Word, web browsers and text editors never claim to master this problem. For example, MS Word uses the left-aligned paragraph as the default setting for every new document. In contrast, LaTeX uses justified text as the default. This is a hint at how the program sees itself. In other words, LaTeX assumes that a book or a journal should be typeset in justified mode, and anything else has to be requested explicitly.

April 10, 2023

The Knuth-Plass algorithm

Over the decades, the attitude of the LaTeX community has turned into something like an ideology. The idea is that LaTeX is superior to MS Word, and the user is only advised how to format documents within LaTeX. There is an endless number of definitions of what LaTeX is about, but a more technical explanation is missing.

The main difference between LaTeX and possible alternatives like LibreOffice Writer (lowriter) and MS Word is that LaTeX has the Knuth-Plass algorithm built in.[1][2] Implementing the algorithm from scratch is very difficult.[3] A large part of the TeX source code is devoted to this single problem.

Suppose MS Word added a simple button in the settings menu that enabled the algorithm for a Word document. The resulting justification and the distribution of content over multiple pages would work on the same principle as in LaTeX. The visual layout would improve, and the difference between MS Word and LaTeX would become smaller. All the other features of LaTeX, such as the separation between content and layout and a robust file format, are available in MS Word too. For example, Word has a built-in draft view that allows text to be entered without any formatting, and Word stores documents in an open XML format, which is arguably superior to the .tex format used in LaTeX.

The difference between MS Word and TeX can be reduced to the line-breaking and page-layout algorithm mentioned above. The idea behind the algorithm is that the boxes on the page are positioned more elegantly. Elegant means that the word spacing is as uniform as possible across the lines of a paragraph and that the pictures are placed at suitable positions.
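
The core idea of choosing all breaks of a paragraph together can be sketched in a few lines. The following is a heavily simplified stand-in and not the Knuth-Plass algorithm itself: it assumes characters of width 1, ignores hyphenation, glue stretchability and penalties, and simply minimizes the sum of squared leftover space on every line except the last. The function name break_optimal is made up for this example.

```python
def break_optimal(words, width):
    """Pick the break points that minimise the total squared leftover space.

    Simplifications (assumptions of this sketch): every character is one
    unit wide, words are joined by one space, the last line costs nothing.
    """
    n = len(words)
    best = [0.0] + [float("inf")] * n   # best[j]: minimal cost of setting words[:j]
    prev = [0] * (n + 1)                # prev[j]: index where the last line starts
    for j in range(1, n + 1):
        length = -1                     # length of words[i:j] joined by spaces
        for i in range(j - 1, -1, -1):
            length += len(words[i]) + 1
            if length > width and i < j - 1:
                break                   # line too long; a lone overlong word is tolerated
            slack = max(0, width - length)
            cost = best[i] + (0 if j == n else slack ** 2)
            if cost < best[j]:
                best[j], prev[j] = cost, i
    # Walk the chain of break points backwards to recover the lines.
    lines, j = [], n
    while j > 0:
        i = prev[j]
        lines.append(" ".join(words[i:j]))
        j = i
    return lines[::-1]


print(break_optimal("aaa bb cc ddddd".split(), 6))
```

For the width 6, a first-fit breaker would produce "aaa bb", "cc", "ddddd" with a gap of four units after "cc". The sketch above prefers "aaa", "bb cc", "ddddd", which spreads the leftover space more evenly. The real algorithm works on boxes, glue and penalties and rates each line by a badness value, but the dynamic programming structure is similar.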

Surprisingly, even within the LaTeX community the Knuth-Plass algorithm is ignored or discussed only rarely. The number of papers on the subject is small: fewer than 100 papers about this algorithm appear to have been published since the early 1980s. So it is a kind of expert knowledge that is not available to the masses. What makes this interesting is that, apart from the algorithm, the TeX ecosystem has nothing to offer, or at least nothing more advanced than what is available in MS Word today. Word can export a document to a PDF file with a single mouse click, it can format a document in the "Latin Modern Roman" font, and its ability to insert mathematical equations is excellent. The only real weakness of Word is the calculation of vertical and horizontal white space, which results in low-quality typesetting. Even untrained users will see at first glance whether a two-column text was formatted with MS Word or with LaTeX. That means the Knuth-Plass algorithm produces a visible difference.

The question is not how to indoctrinate happy MS Word users to switch to LaTeX; it is the other way around. The idea is to explain in simple words how TeX's internal line-breaking algorithm works, so that it can be integrated into mainstream applications like MS Word, fpdf2, LibreOffice and so on.

References
[1] Knuth, Donald E., and Michael F. Plass. "Breaking paragraphs into lines." Software: Practice and Experience 11.11 (1981): 1119-1184.
[2] Plass, Michael Frederick. Optimal pagination techniques for automatic typesetting systems. Stanford University, 1981.
[3] Verna, Didier E. "ETAP: Experimental Typesetting Algorithms Platform." ELS 2022: 15th European Lisp Symposium. 2022.

April 08, 2023

Creating a LaTeX clone from scratch

The core element of a text rendering engine is a data structure that holds the boxes on the page.

id  name        x   y   w    h
0   pageborder  0   0   210  297
1   textborder  35  30  140  227
2   line        60  30  115  14
3   line        35  49  140  14
4   line        35  68  140  14

The entry with id 2 is the first line of a paragraph and has a small indentation. All the elements on a page are stored in this single boxtable. The units are not pixel positions on the screen but millimeters on the physical sheet of paper. To render the boxtable to the screen, a second table is created that holds the pixel coordinates according to the scaling factor.
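
A minimal sketch of the boxtable and of the derived pixel table could look like this. The class name Box, the function to_pixels and the default of 96 dpi are assumptions made for this example; only the five rows are taken from the table above.

```python
from dataclasses import dataclass

MM_PER_INCH = 25.4


@dataclass(frozen=True)
class Box:
    id: int
    name: str
    x: float
    y: float
    w: float
    h: float


# The boxtable from the text, all values in millimeters.
boxtable = [
    Box(0, "pageborder", 0, 0, 210, 297),
    Box(1, "textborder", 35, 30, 140, 227),
    Box(2, "line", 60, 30, 115, 14),
    Box(3, "line", 35, 49, 140, 14),
    Box(4, "line", 35, 68, 140, 14),
]


def to_pixels(boxes, dpi=96):
    """Build the second table: same boxes, coordinates scaled to pixels."""
    scale = dpi / MM_PER_INCH
    return [
        Box(b.id, b.name, round(b.x * scale), round(b.y * scale),
            round(b.w * scale), round(b.h * scale))
        for b in boxes
    ]


pixeltable = to_pixels(boxtable, dpi=96)
for row in pixeltable:
    print(row)
```

Keeping the millimeter table as the single source of truth means that zooming in the preview only requires rebuilding the pixel table with a different scaling factor.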

What an algorithm like the Knuth-Plass line-breaking algorithm does is convert a piece of text like "the quick brown fox jumps over the lazy dog" into a boxtable. The program does not operate on the rendered 2D page on the screen; it uses the internal tabular representation of the boxes.
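
The conversion from text to boxtable rows can be sketched as follows. This is only an illustration: it uses simple first-fit breaking instead of Knuth-Plass, and it assumes a fixed character width of 5 mm, which is a made-up value. The text area (x = 35, y = 30, width 140), the indentation of 25 mm, the line height of 14 mm and the line distance of 19 mm are read off the table above. The function name text_to_boxes is hypothetical.

```python
def text_to_boxes(text, left=35.0, top=30.0, width=140.0, indent=25.0,
                  line_h=14.0, skip=19.0, char_w=5.0, first_id=2):
    """Turn a paragraph into rows (id, name, x, y, w, h, content) in mm."""
    rows, line = [], []
    for word in text.split() + [None]:          # None marks the end of the paragraph
        offset = indent if not rows else 0.0    # only the first line is indented
        if word is not None:
            trial = " ".join(line + [word])
            if not line or len(trial) * char_w <= width - offset:
                line.append(word)
                continue
        if line:
            rows.append((first_id + len(rows), "line", left + offset,
                         top + skip * len(rows), width - offset, line_h,
                         " ".join(line)))
        line = [word] if word is not None else []
    return rows


for row in text_to_boxes("the quick brown fox jumps over the lazy dog"):
    print(row)
```

With these assumed metrics the sentence fills two line boxes whose coordinates match the entries with id 2 and id 3 in the table. A better line breaker would only change which words end up in which row, not the shape of the table.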

April 07, 2023

Word vs. LaTeX, what is the difference?

There are at least two major document typesetting systems in practical use: MS Word and LaTeX. Both programs have a large fan base, and which one is preferred depends on personal judgment. What is missing in the debate is a general description of the differences. The only thing that is certain is that Word and LaTeX follow different design principles. So let us summarize the ideas behind the LaTeX typesetting system:
1. open source
2. strict separation between edit and rendering mode
3. high quality in both modes

Now these features can be explained in detail. The first criterion is that LaTeX is distributed under a free license (the LaTeX Project Public License), while MS Word is sold commercially. An open source counterpart to Word is LibreOffice.

The second and third features on the list are important for understanding the interaction with the system. The main feature of the LaTeX ecosystem is that the user has to switch back and forth between two modes: editing and viewing. The editing mode has much in common with the draft view known from Word. The difference is that LaTeX emphasizes the contrast between the two modes. Editing in a LaTeX editor usually means using a monospace font, hiding the images completely and avoiding justification and hyphenation of the paragraph. Ironically, what LaTeX users prefer is that there is no typography at all during editing. That means hyphenation is missing, the space between words is always the same, and the vertical space between paragraphs is always the same.

All the typographic enhancements are visible only in the rendering mode. The user has to press the preview button and then sees the DVI or PDF output on the screen, which contains hyphenation, justification and floating images. This two-mode philosophy is the core element of LaTeX typesetting.

So the underlying question is whether word processing needs two modes, or whether a single mode (which is what WYSIWYG DTP software usually offers) is enough.

The main reason why this two-mode interaction was introduced is that it simplifies human-machine communication and makes the software easier to program. In LaTeX, different front end / back end combinations are available. The user can run a LyX instance combined with lualatex, or use Texmaker in combination with the pdflatex back end. This allows a text editor and a text renderer to be developed as separate projects. This is perhaps one of the strengths of LaTeX, because each project can become more feature complete. In contrast, the GUI-based Word software combines the editing and the rendering capability in a single program. From a cynical perspective, this results in a medium-quality draft mode plus medium-quality typesetting. What LaTeX users prefer is a high-quality draft mode plus a high-quality layout.

April 05, 2023

Creating PDF documents without LaTeX Part 2

The only programs that can be called true alternatives to LaTeX are MS Word and LibreOffice. Both are powerful word processors that allow the creation of single-column and multi-column documents. In contrast to LaTeX, they have an elaborate document file format that allows images to be inserted and text to be annotated.

Until today, Word and LibreOffice have not been able to replace LaTeX, for many reasons. The problems are well known and described within the LaTeX community: the poor typesetting quality in combination with the missing separation between layout and content. Both are strengths of LaTeX, which has the best typesetting quality and allows the user to focus on the content of the text.

Instead of arguing about which program should be used in the future, the better idea is to first describe the current situation. The current situation is that LaTeX has the largest market share for creating academic content. It is followed by a large gap, and then by all the minor programs like LibreOffice and Distiller. The prediction is that nothing will change within the next 10 years. That means Knuth's software dominated the 1990s and will do so in the 2020s too.

WYSIWYM Editors
Between LaTeX and LibreOffice there is a big difference. LibreOffice works with a rendered layout editor and has no draft mode. In contrast, what LaTeX users prefer is the separation between entering the text and previewing it on the screen. Let us describe the principle of a WYSIWYM editor in detail:

Edit mode:
- fixed monospace font
- left-aligned text
- no hyphenation
- no page border
- only a frame for images

Preview mode:
- high quality typesetting
- fully justified text, global line-breaking algorithm
- precise position of captions and pictures

In other words, LaTeX combines two very different principles: a high-performance draft editor plus a visually advanced rendering capability. In contrast, LibreOffice combines both modes in a single GUI window. It has no draft mode and no advanced rendering mode.

In the history of software development, the Vim editor comes close to this concept: Vim also works with a two-mode concept, and the user has to switch between both modes.

The reason why a separation between edit mode and preview mode makes sense is the complex layout of two-column documents. Editing a two-column paper in LibreOffice is very complicated for the user, who has to understand the content and the visual appearance at the same time. For example, the user sees the columns, the pictures, possible footnotes and a fully justified paragraph with varying spaces between the words. This kind of rendering is not bad in itself, but it has nothing to do with editing a text. At least, this is the opinion of the LaTeX community.

Another, more traditional reason why LaTeX prefers a clear distinction between editing and rendering is program complexity. Implementing all the typesetting algorithms is a demanding project, and writing a full text editor is also a large project. It makes sense to develop both components in separate projects. Otherwise the resulting single project would have millions of lines of code.

Writing a LaTeX clone from scratch

Of course the idea sounds like a doomed project, because everybody knows that a LaTeX distribution consists of millions of lines of code. On the other hand, it would be interesting to write a prototype that reduces a typesetting system to its minimum.

The first thing to know is that an elaborate markup language already exists: Markdown. Markdown is an enhanced plain-text format that allows the user to define sections, bullet points and tables. The language is more than capable as an input format for a typesetting system.

The open question is how exactly a Markdown file gets rendered into a picture. Creating a .png file is itself a trivial task; many Python libraries are available for this purpose, and converting a picture into a PDF is also easy. The more serious problem is how to position characters, lines and paragraphs on the picture.

A rough estimate is that typesetting is mostly about a list of features stored in a long table. Features can be: left margin, bottom margin, font size for text, font size for section headings, line spacing, distance between pictures and so on. In addition, the table needs to store dynamic data like "word space in line 1", "word space in line 2" and so on.

The working thesis is that the PNG image is created by sending queries to the data table and storing information in the table.

Let me give an example. Suppose the idea is to draw only the first page of a book, and the page consists of a filled black rectangle. To do so, the drawing routine needs some information from the layout engine:
- margin of the page
- position of the rectangle
- color of the rectangle
and so on.

The idea is that every drawing process works on the same principle. That means the data table is the core element of a layout engine. Technically, such a table can be realized as a hierarchical Python data structure, but it remains unclear how to do so in detail.
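
One possible shape of such a table, purely as a sketch, is a nested dictionary that the drawing routine queries by dotted path. All names here (LAYOUT, query, draw_page, the key names and the numeric values) are invented for this example. To stay free of external libraries, the page is drawn into a plain list of pixel rows at one pixel per millimeter instead of a real PNG file.

```python
WHITE, BLACK = (255, 255, 255), (0, 0, 0)

# Hypothetical layout table: static features plus a slot for dynamic data.
LAYOUT = {
    "page": {"width": 210, "height": 297, "margin": 20},
    "rect": {"x": 30, "y": 40, "w": 50, "h": 25, "color": BLACK},
    "dynamic": {},
}


def query(table, path):
    """Look up a value by dotted path, e.g. query(LAYOUT, 'page.width')."""
    node = table
    for key in path.split("."):
        node = node[key]
    return node


def draw_page(table):
    """Draw the page by asking the table for everything it needs."""
    width, height = query(table, "page.width"), query(table, "page.height")
    canvas = [[WHITE] * width for _ in range(height)]
    rect = query(table, "rect")
    for y in range(rect["y"], rect["y"] + rect["h"]):
        for x in range(rect["x"], rect["x"] + rect["w"]):
            canvas[y][x] = rect["color"]
    # Store a result of the drawing step back into the table.
    table["dynamic"]["filled_pixels"] = rect["w"] * rect["h"]
    return canvas


canvas = draw_page(LAYOUT)
print(query(LAYOUT, "dynamic.filled_pixels"))
```

The drawing routine contains no layout constants of its own. Changing the margin or the rectangle means editing the table, not the code, which is the point of the working thesis above.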

Creating PDF documents without LaTeX

The pdflatex and lualatex tools are both established ways to create high-quality academic papers. Sometimes it makes sense to explore alternative ways of doing the same task. The following programs can be used instead:
- Adobe Distiller (a printer driver that can create PDF files from any application such as QuarkXPress and so on)
- Microsoft Word, LibreOffice Writer (can create PDF files natively)
- Ghostscript (a Distiller-like printer driver)
- Cairo, Apache FOP and fpdf (mostly used as report generators)

None of these programs has reached a larger market share in academic publishing. This might be surprising, because MS Word in particular is a well-known program that is installed on millions of computers worldwide. The problem with MS Word is that it is only yet another tool for creating PDF papers and is not able to dominate the market.

Put simply, anyone who wants to argue that LaTeX is obsolete has to name alternatives, and these are not available yet. It seems that the disadvantages of pdflatex are not so large that users stay away from the system completely; instead they have found strategies to master the learning curve. For example, if a user is not able to adjust the page margin in LaTeX, he does not switch back to MS Word but uses a different LaTeX package and stays within the LaTeX ecosystem.

But let us take a closer look at the possible LaTeX alternatives. The programs listed above can be divided into two groups: open source (LibreOffice, Ghostscript) and closed source. The programs act either as a printer driver or as an MS Word-like word processor. The fact that so many programs are available is a sign that generating documents is a highly complex topic. Some critics of LaTeX have asked why the TeX Live distribution needs 10 GB of space on the hard drive. The sad answer is that possible alternatives like Adobe Acrobat also need 5 GB or more of disk space, and this does not include external programs like QuarkXPress. So it seems that, similar to the situation in the 1980s, the task of desktop publishing can only be managed with dedicated DTP workstations with an endless amount of RAM.


April 04, 2023

Libreoffice Writer for LaTeX users

Even though LaTeX is the most used document generator for academic papers, there are some users who are not satisfied with the program. The next best alternative to pdflatex is LibreOffice Writer. It works completely differently from LaTeX and is used only occasionally for academic writing, but nevertheless it makes sense to give a short tutorial on how to do so.

The first thing to mention is that former LaTeX users will miss many useful tools. LibreOffice has no reference manager comparable to BibTeX. What can be created instead is a single-column table at the end of the document that stores the items in the format [AuthorYear] text. Such a table can be sorted alphabetically. If the user wants to cite one of the papers, he types the [AuthorYear] string into the normal text without a dedicated link.

The second important thing to mention is the placement of figures. LibreOffice supports many graphics formats like SVG, JPG and PNG, which can be inserted as a link. Similar to LaTeX, this reduces the file size. But if the image is only linked, the document cannot simply be sent to someone else as a single file. So the better idea is to embed the picture. In the default setting, LibreOffice places an image at the "here" position. But a simple click in the context menu allows the image to be positioned at the top of the page. After text is inserted or deleted, the picture's position is adjusted to the next top position, similar to what LaTeX knows as a float figure.

The mechanism does not work as accurately as in LaTeX, because in many cases the user has to adjust some settings manually, but with a bit of effort it works.

Another important aspect is fonts in LibreOffice Writer. The best idea is to use the default one, Liberation Serif, and not to embed additional fonts like Latin Modern Roman. The reason is that it is technically possible to embed fonts, but doing so increases the file size drastically.

It is pretty easy to recognize which software was used to create a document. LibreOffice has a completely different hyphenation algorithm, has no microtypographic enhancements and lacks vertical adjustment of the glue. So the average LaTeX user can easily identify which program was used. Nevertheless, both programs (LibreOffice and LaTeX) can be used to create well-formulated academic papers. The resulting PDF file will contain all the graphics, and the text can be read on the screen. The main advantage of the .odt file format is that it can be used to track the changes of different users and, very importantly, that it is not LaTeX, which has become a mainstream ecosystem.

April 01, 2023

A tribute to the .docx file format

There is an endless number of document formats and word processors available. In the LaTeX community alone there are at least 30 different GUIs for entering TeX source code, and LaTeX is only a less frequently used system. It seems impossible to unite the different file formats into a single standard.

On the other hand, such a standard is available. The combination of text files plus JPEG files can be read on any operating system. The reason is that plain text does not contain formatting information, and JPEG is a highly standardized graphics format supported by every operating system. The only format that is not standardized is the markup format that describes a document by its layout and its logical structure.

The working hypothesis is that the best format available today for this purpose is .docx. Contrary to a common myth, docx is not just a document format; it is a zip file that contains other files. The interesting point is that such a zip file can be converted into plain text plus JPEG files on any operating system. Even if the original formatting of the document is lost, the text and the embedded JPEG images can be extracted in any situation.

From a technical perspective, docx can be used as an alternative to a zip file. It stores some text files plus image information. The ability to parse the formatting information is optional, and this markup information is less important for understanding the content of a text. Let me give an example.

Suppose there is a docx file that contains a 200-page book. Unfortunately, it is not possible to open the file in MS Word, but the docx file can be converted into 30 JPEG files plus a single .txt file with the text-only content. It is possible to read the book without any problems, and the missing sections, columns and hyphenation are less important for the normal user.
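
Such an extraction needs nothing beyond the Python standard library. The sketch below assumes the usual package layout, with the main text in word/document.xml and embedded pictures under word/media/; a strict reader would look up these locations in the package relationships instead. The function name extract_docx is made up for this example, and paragraphs nested inside other paragraphs (for instance in text boxes) may show up twice.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace of WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx(source):
    """Return (plain_text, {image_name: bytes}) for a .docx path or file object."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
        # One output line per paragraph; all formatting is ignored.
        paragraphs = [
            "".join(t.text or "" for t in p.iter(W_NS + "t"))
            for p in root.iter(W_NS + "p")
        ]
        images = {
            name: archive.read(name)
            for name in archive.namelist()
            if name.startswith("word/media/")
        }
    return "\n".join(paragraphs), images
```

Calling extract_docx("book.docx") with a hypothetical file name returns the text-only content as one string plus a dictionary of image files, which can then be written to disk as a .txt file and a set of JPEG files.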

Even though docx was invented by a single company for creating formatted documents, it can be used as a zip container for storing normal text files. This makes the format interesting as a universal document format.

A little-known fact is that the powerful pandoc tool is able to render a .docx file into the PDF format. This is done with the lualatex engine in between.

pandoc --pdf-engine=lualatex input.docx -o output.pdf