October 16, 2021

How to reprogram LaTeX from scratch?

 

At first glance, the attempt looks very complicated. Some past projects, such as Lout, seem to have demonstrated that it is not possible to reinvent the LaTeX software. But perhaps it becomes possible if the existing LaTeX engines are investigated more closely. So the first question is: what exactly is LaTeX?
The current LuaTeX repository is available on GitHub and contains about 1 million lines of code. That means LaTeX is neither a small nor a mid-size software project but a very large one. The assumption is that a possible LaTeX replacement needs at least 1 million lines of code as well, otherwise the task can't be accomplished. The problem is that such a codebase can't be written by a single person, so there is a good reason not to try.
But suppose we try it anyway, no matter what the success rate is. The first thing to do is to describe which parts of the LaTeX project work well and which do not. The main idea behind LaTeX is that the document text and the LaTeX commands are written in the same text file, similar to the Markdown syntax. So LaTeX is, at its core, a rendering engine which takes a LaTeX command like \tex and converts it into graphical output. And yes, this is a very good idea. Similar to the gnuplot project, using commands makes it possible to create very complex output.
What works badly in the LaTeX ecosystem is that the existing LaTeX-to-PDF compilers such as LuaTeX, pdfTeX and XeTeX are very complicated and hard-to-maintain projects. Improving a codebase which consists of 1 million lines of code doesn't make much sense.
Suppose the idea is to rewrite LaTeX from scratch. The first thing to do is to write down the requirements. LaTeX is basically a text parser which understands around 250 commands. These commands are defined in the LaTeX reference. The important \tex command was mentioned already, but there are many other commands available for printing text in bold face or for formatting a page. A possible LaTeX replacement has to understand all these commands. Then it can be executed from the command line or from the LyX frontend, and it converts a .tex file into a .pdf file.
So the question is how to program a parser which understands around 250 typesetting commands. Exactly this seems to be the bottleneck. Suppose each command is implemented with 100 lines of code on average. Then the parser will need around 25k lines of code. A command like \indent can be implemented easily, but commands like \table will need a lot of code.
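One way to picture such a parser is a dispatch table that maps each command name to a small handler function. The following is only a minimal sketch under simplifying assumptions: the handler names and the tuple-based output are made up for illustration, the regular expression accepts at most one non-nested {argument}, and only two of the roughly 250 commands are covered.

```python
import re

# Hypothetical handlers: each one turns a command into a layout instruction.
def handle_indent(arg):
    return ("indent", None)

def handle_textbf(arg):
    return ("bold", arg)

# The dispatch table; a full replacement would hold around 250 entries.
HANDLERS = {"indent": handle_indent, "textbf": handle_textbf}

# Matches \name with an optional single {argument}; nesting is not supported.
COMMAND_RE = re.compile(r"\\([A-Za-z]+)(?:\{([^{}]*)\})?")

def parse(source):
    """Split source into ("text", str) items and handler results."""
    out = []
    pos = 0
    for m in COMMAND_RE.finditer(source):
        if m.start() > pos:
            out.append(("text", source[pos:m.start()]))
        name, arg = m.group(1), m.group(2)
        handler = HANDLERS.get(name)
        if handler is None:
            out.append(("unknown", name))
        else:
            out.append(handler(arg))
        pos = m.end()
    if pos < len(source):
        out.append(("text", source[pos:]))
    return out

def estimate_loc(commands=250, lines_per_command=100):
    """Back-of-the-envelope size estimate used in the text."""
    return commands * lines_per_command
```

Calling parse(r"\indent Hello \textbf{world}!") yields a flat list of text pieces and instructions that a later output stage could consume. Simple commands fit in a few lines each, while table-like constructs would need far more than the 100-line average.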
pylatexenc
There is a Python library available which simplifies the task of parsing LaTeX files. It is called pylatexenc; it takes a LaTeX file as input and prints out all the recognized commands. What this library can't do is convert the recognized strings into the PDF format. But such an engine can be created. The combination of the pylatexenc project plus a PDF output stage would result in a full-blown LaTeX-to-PDF converter. This is equivalent to creating a replacement for LuaTeX or pdfTeX.
Let us go a step back and investigate why former attempts to reprogram LaTeX have failed. The first thing to mention is that former projects assumed that Markdown or the XML format is able to replace the LaTeX syntax. They are not. LaTeX consists of around 250 dedicated commands which were designed for typesetting. A LaTeX replacement needs to implement these commands.
The other attempt to replace LaTeX was made with GUI systems like LibreOffice. These programs can't replace LaTeX because they do not provide low-level textual commands. So we can say that LaTeX is essentially a converter which takes a “.tex” file as input and converts it into a PDF file.
Creating such a tool is not easy, but it can be simplified drastically, and there is no need to use the existing LuaTeX software for this attempt. LuaTeX has some major problems. The first is that the project is too large: its current size is 1 million lines of code. The second is that many typographic features, such as a three-column layout or better hyphenation for non-English languages, aren't available. That means LuaTeX is not the best choice for creating PDF files.
The interesting thing about the pylatexenc library is that the project is relatively small. It consists of only 10k lines of Python code plus some XML files for the Unicode mapping.
PyFPDF
Another interesting existing tool is PyFPDF.
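To give a sense of what the PDF output stage of such a converter involves, here is a minimal sketch that writes a one-page PDF by hand, using only the Python standard library. It does not use PyFPDF or imitate its API; the function name make_pdf, the A4 page size, the start position and the built-in Helvetica font are assumptions chosen for illustration. A library like PyFPDF takes care of these low-level details and much more.

```python
def make_pdf(lines, font_size=12):
    """Return the bytes of a one-page PDF showing the given text lines.

    Sketch only: Latin-1 text, one built-in font, no line wrapping.
    """
    def escape(s):
        # Backslashes and parentheses are special inside PDF strings.
        return s.replace("\\", r"\\").replace("(", r"\(").replace(")", r"\)")

    # Content stream: begin near the top-left corner, then move down
    # by one line height before each further line.
    ops = ["BT", "/F1 %d Tf" % font_size, "72 770 Td"]
    for i, line in enumerate(lines):
        if i:
            ops.append("0 -%d Td" % (font_size + 4))
        ops.append("(%s) Tj" % escape(line))
    ops.append("ET")
    stream = "\n".join(ops).encode("latin-1")

    # The five objects of a minimal document, numbered 1 to 5.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset needed by the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object.
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

The returned bytes can be written to a file opened in binary mode, for example with open("out.pdf", "wb"). Fonts, line breaking, page breaks and Unicode are all missing here, and those are the parts where the real work of a typesetting engine lies.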
Why LaTeX rocks
In general, the existing LaTeX ecosystem works very well. That means even if the LuaTeX project consists of many lines of code, it does what the normal user wants. The main advantage of LaTeX is that with some simple commands it is possible to create nice-looking PDF files. These commands can be typed in manually or with a frontend like LyX.
So we can say that the core idea behind LaTeX works well. There are around 250 commands which are combined with the normal text, and this generates a document. The only question is about the details of this process. For example, do we really need 250 LaTeX commands, or would it be possible to use only 150 of them? Or, do we really need the LuaTeX project, or is it possible to create a much simpler LaTeX compiler?
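The question of how many commands are really needed can be approached empirically by counting which commands existing documents use. The following is a rough sketch: the function names are made up for illustration, and the regular expression only looks for a backslash followed by letters, so it ignores comments, verbatim blocks and environment names.

```python
import re
from collections import Counter

# A backslash followed by letters; symbols such as \\ or \% are ignored.
MACRO_RE = re.compile(r"\\([A-Za-z]+)")

def count_commands(tex_source):
    """Count how often each command name occurs in a LaTeX source string."""
    return Counter(MACRO_RE.findall(tex_source))

def coverage(counts, top_n):
    """Fraction of all command uses covered by the top_n most frequent commands."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / total
```

Run over a collection of real .tex files, coverage(counts, 150) would indicate how much of actual usage a 150-command subset handles. Whether the result supports the smaller subset depends entirely on the documents that are measured.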
Perhaps the best evidence that the original TeX project is built on obsolete technology is the transition from the WEB programming language invented by Knuth to the more recent C-dominated programming style. That means the original TeX software was written in a different language than today's LuaTeX engine. And it is possible that future versions of LaTeX will be written in yet another language.