October 31, 2021

Belief systems in typography

 

The community of computer users is famous for its debates about the pros and cons of software, programming languages and operating systems. A less researched area for creating opposing opinions is digital typography, which is closely identified with the LaTeX software. The problem with typography is that the domain is very complex and it is difficult to find a single discussion topic which will polarize the opinions of the users.
A highly visible conflict is between the LaTeX community and non-LaTeX users. The LaTeX community understands itself as the experts for typography and heavily criticizes the alternative software, which is MS Word and InDesign. MS Word can't be used for creating professional documents, and InDesign is only a WYSIWYG software, which means that lots of mouse clicks are needed before the document is created. If a user has decided on the LaTeX software, he will by default create high quality documents; this is at least the assumption.
There is a need to analyze the art of typography in detail to understand that the same software can be used right and wrong. Typography consists of many sub-problems, like micro typography, macro typography, fonts, bibliography, the creation of tables and so on. The question is which sort of setting is recommended. Is a document which was typeset in a 10pt font easier to read than a document which was typeset in an 11pt font? Not always, it depends on individual preferences. Therefore there is no conflict about this issue. The same absence of conflict applies to the margin space. In reality, documents are created with small margins and with big margins as well.
But there is one problem, not discussed frequently, which has the potential to divide the community. It is the problem of flush left paragraphs vs justified paragraphs. From a technical perspective this problem is easy to master. Every word processor has the ability to format a text in both styles. In MS Word there is a button available to change the formatting, and LaTeX knows the ragged2e package for doing so. After applying the command, the paragraph will either have a straight right edge or it will be formatted flush left.
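For example, a minimal LaTeX document switched to flush left could look like this (a sketch, assuming the ragged2e package from TeX Live):
\documentclass{article}
\usepackage[document]{ragged2e} % the document option applies flush left everywhere
\begin{document}
This paragraph is set flush left: the right edge stays ragged
instead of being stretched into a straight line.
\end{document}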
The reason why this formatting style has a high conflict potential is that in the world of professional publishing there is a clear dogmatic rule that only justified text is allowed. This rule is formulated in the guidelines for submitting documents to conference proceedings and it is visible in existing documents. That means everybody, which means really everybody, is following this rule in theory and practice.
Let us make a short survey to analyze the current situation. We take a look into randomly selected books, newspapers, conference proceedings and academic journals. And we enlarge the time period from the year 2010 back to the 1960s, and even to 1900. The surprising discovery is that all the documents were typeset with fully justified formatting. Calling this perception a typographic rule is an understatement; it has more in common with a dogma.
The reason why is hard to investigate. An often formulated argument is that a justified text looks more elegant. Another, less mentioned reason is that otherwise the mechanical printing press will explode because of the imbalance of the right edge. The printing press rotates very fast, and if the right edge creates a zigzag pattern, the rotating process is disturbed. That means professional typographers can be highly irrational on this point.
Nevertheless the dominant reason why fully justified text is preferred for professional publications like newspapers and journals is historical. For 200 years and longer, all the journals were typeset in this style. The font has changed, the font size can be adjusted and the image quality differs, but every text was typeset in the fully justified mode. This formatting style is so common that it is often identified with typography in general. That means, a fully justified text is equal to good typography and a flush left text is equal to its absence.
Perhaps it should be mentioned that in the digital world the situation is the other way around. Most HTML pages are using flush left rendering. From a technical perspective, an HTML page has no fixed page width but the window's size can be adapted. And nobody has complained about it.

Understanding the beauty of TeX

 

The TeX typesetting system is one of the largest and most important programs in the history of the Unix operating system. Some of the first electronically generated documents ever were created in the late 1970s with the TeX software, and to this day many scientific journals are created with this software. From an abstract perspective it is difficult to understand why LaTeX is so popular. The simple reason is that it has the built-in feature of creating justified text.
To understand why this feature is important we have to go back in the history of typesetting. The Monotype mechanical typesetting machine was invented around 1900. It consists of two parts: the keyboard desk for creating the punched tape, and the casting machine which takes the punched tape as input and casts the hot metal type. The interesting point is that even the Monotype machine was able to create fully justified text. Workers who used the machines 100 years ago can confirm this statement and, very important, newspapers and journals from that era were typeset in this format.
How can it be that a magazine 100 years ago was using fully justified text? Isn't it really complicated to stretch and compress a line of words this way? Yes, it is complicated, but the Monotype machine was using this feature all the time. It was the main difference between ordinary typewriter-produced manuscripts and professionally created journals.
The ability to typeset fully justified text was reimplemented by the TeX software in the 1970s. Similar to the Monotype machine, the TeX software uses this formatting style as the default. The assumption is that the desired output of the typesetting process is always fully justified text. This workflow is described as beautiful typesetting. In contrast, text which is only flush left is seen as the opposite.
Why is justified text better than flush left text? The main reason is perhaps that the first category is harder to realize. Creating a text in this way requires more effort than producing ordinary text. It seems that this extra effort has become a tradition.
Let us analyze the typesetting process around 1900. The input for the human typesetter was a manuscript created with a mechanical typewriter. Like all manuscripts, such a document was using flush left alignment in single column mode, because such an output is easier to create on a typewriter. The main task of a Monotype operator was to convert this input document into a fully justified description on the punched tape. That means, the operator beautifies the manuscript by realizing the justified mode.
In summary, the Monotype machine from 1900 was able to create fully justified text in a multicolumn layout. This combination was needed for academic journals, books and newspapers as well. That means justified text is not an invention of the computer age; it was common in the time of mechanical typesetting.
In contrast, the widespread usage of flush left text is a symbol of the digital age. The internet was the first medium in history which used this formatting style everywhere. Before the internet was invented, no other public medium like printed journals or newspapers was using flush left text. That means, the justified text which is more complicated to realize was invented much earlier, and the flush left text which is easier to realize is a new thing, available since roughly 1992.

October 30, 2021

Typography without justified alignment

 

The LaTeX typesetting system has become famous because of its high quality rendering of text. What most newbies don't know is that the LaTeX community uses the term typography with a certain bias. The assumption is that good typography is equal to justified text. Justification means the opposite of flush left: the right edge is straight from top to bottom.
A short look into the history of mechanical book printing shows that the LaTeX community argues with the same assumption as previous technicians. What the first Gutenberg printing press, the Linotype and the Intertype photosetter have in common is the ability to produce fully justified paragraphs. The goal of former typesetters wasn't only to print something on paper; the goal was to print it with a fully justified layout.
It is a seldom described fact, but all the academic journals and newspapers in the last 200 years were printed in the fully justified mode. This might be surprising because before the advent of desktop publishing only hot metal typesetting was available. That means there was a need for manual intervention to create a straight right edge.
It remains unclear why this unwritten rule in typography has been valid for such a long period. Perhaps it has to do with western tradition. The assumption is that good typography is equal to justified text, and this is the paradigm behind the LaTeX software as well.
From a technical perspective, all printing devices like the Linotype machine and modern word processing software are able to create flush left paragraphs easily. In most cases the algorithm for doing so is much easier to realize, because at the end of the line the last word gets wrapped to the next line and no in-word adjustments are needed; a short example is shown below. But the interesting situation is that for final printings this ability is never used. Especially for academic publications there is an unwritten rule that the text needs to be fully justified. The question is why?
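As an illustration, this simple greedy strategy is exactly what Python's standard textwrap module implements:
# flush left wrapping with the Python standard library
import textwrap

text = ("From a technical perspective, producing a ragged right edge "
        "only requires moving the last word that does not fit to the next line.")
print(textwrap.fill(text, width=40))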
Sometimes the argument is made that a fully justified text can be read faster. Another argument is that the layout looks better because the columns can be recognized. But in most cases the real argument is that fully justified text has a long tradition and therefore new typesetting has to follow this rule too. Basically spoken, it is not a need but a personal preference why professional publishing is equal to fully justified formatting.
Since the internet age the situation has changed drastically. 99% of websites are using the flush left layout. All the browsers, which includes Lynx, Firefox and Internet Explorer, render text in this way. Similar to the situation with justified text in the printed world, the main reason why all the browsers are using flush left rendering is that everybody else is doing so. That means in the internet world there is an unwritten rule that a paragraph needs to be displayed that way. The interesting fact is that the argument is the same as for the book printing world. The reason why Firefox is using flush left rendering is that such a text can be read faster and it looks better.
The unsolved question is: what is the best formatting style in an electronic pdf document? Nobody knows. A pdf document uses the US letter format, which works differently from a website. Technically it is possible to use justified and flush left formatting as well. So it is an individual decision how the typesetting is made. But judging by the produced documents, the amount of justified pdf papers is higher. A short look into the arxiv repository will show that 99% of the papers are fully justified. Especially the academic community believes that this unwritten rule is valid. A document formatted only flush left is common for draft documents in which improvements are needed before the final publication.
Mass printing
What newspapers, books and academic journals have in common is that the amount of printed copies is high. A single issue of a newspaper gets printed many thousands of times. This high circulation makes it economically worthwhile to make the typography very accurate. It makes sense to use the Linotype printing machine to produce fully justified text. That means the human typesetter will need many hours until the layout is ready, and nobody cares, because many thousands of people will read the result.
In contrast, manuscripts which circulate in a low amount of copies are created with simpler machines like a mechanical typewriter. A typewriter has no advanced features like fully justified text. Therefore normal manuscripts are flush left.
With the rise of digital printing the situation has become softer. In most programs it is only a simple click with the mouse to change the setting from justified to flush left. What we can say for sure is that in the past a justified text was equal to a high amount of copies, because a justified text was created by a professional typesetter, and this can only be realized if a larger audience is interested in a book or journal.

Reducing the LaTeX software to the core

 

The current TeX Live package is not only a piece of software, it is a software distribution which consists of millions of lines of code distributed over an endless amount of packages and commands. The newbie has to make sure that around 5GB are available on the hard drive, so it seems that desktop publishing with LaTeX is a demanding task. To understand the LaTeX system better we have to reduce all the programs to the minimum. The question is: what is the core idea behind LaTeX?
The surprising answer is that LaTeX has such a core idea. It is the typesetting of justified text. That means the main ability of TeX is the combination of the word wrap algorithm, plus hyphenation and, very important, the microtype package. All these things make sure that the right border of a column is straight from top to bottom.
All the other typographic elements of LaTeX, for example the ability to manage different fonts, the referencing of literature, the typesetting of mathematical formulas and so on, are only gimmicks on top of the paragraph justification algorithm.
Realizing such an algorithm in software is a complicated task. Especially the microtype package, which adjusts even the spacing within a word itself, is an example of elegant software. The generated output of justified text has a better quality than what LibreOffice has to offer, and it is the main reason why LaTeX is used everywhere.
The interesting situation is that without justified text the LaTeX generated pdf document will not look very impressive. It has the same or even a lower quality than the LibreOffice counterpart. So the question is: why is the ability to typeset justified text so important? Because it is assumed that this technical trick will improve the impression for the reader drastically. He will see at first glance that the document was typeset professionally. In the time before the advent of digital typography it was very hard to produce justified text, and with modern word processors the task remains a challenge. Only LaTeX and professional software like InDesign are able to do so, while normal tools including pandoc, asciitops and emacs can't.
The situation becomes more obvious if someone tries to program an algorithm from scratch which creates justified text. Creating a simple word wrapping algorithm can be realized in under 100 lines of code, but this won't produce a straight right edge. To stretch each line similar to the microtype package, an advanced piece of software is needed which includes the ability to handle different fonts. This is an advanced typesetting problem, and LaTeX is an expert for this task.
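To make the contrast concrete, here is a minimal from-scratch sketch of the simple case, greedy flush left wrapping without any stretching (the function name is made up):
# naive greedy word wrap: produces a ragged right edge, no hyphenation,
# no inter-word stretching; true justification needs far more machinery
def wrap_flush_left(text, width):
    lines, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return "\n".join(lines)

print(wrap_flush_left("This example shows how simple flush left wrapping is.", 20))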
There is a less common package available in the TeX Live distribution called ragged2e. This package produces a flush left text similar to what LibreOffice is doing. Activating this package is trivial from a technical perspective, but it will generate a pdf document which looks very different from a conference proceeding or an academic journal.

October 29, 2021

More about the \raggedright command

 

The \raggedright command in LaTeX was mentioned in a previous blog post and it should be explained further. The observation was that after inserting the command, a formerly typical LaTeX document looks very unusual. It seems that everything that is known as the typical LaTeX layout is the result of justified text.
Perhaps it makes sense to explain it in detail. In the default formatting style, LaTeX always formats the text justified. That means one column text as well as two column text has a perfectly straight line on the right side from top to bottom. This edge is created by intelligent word wrap plus hyphenation plus microtypographic adjustment. The reason why justified text is seen as a must-have is historical. In the long history of western typography, a two column text was always formatted in a justified way.
Even mechanical typesetting machines like the Linotype were able to do so. That means, 99% of typography is about creating justified text. LaTeX can do this as well, but it works with algorithms which improve the quality drastically.
The working hypothesis is that good typography is equal to using justified text. So the term good refers to this single style technique which has been used for hundreds of years? A rough description comes to the conclusion that justified text looks professional while missing justification looks unprofessional. What all the academic papers, books and newspapers have in common is that they are using justified text. It is the unwritten rule in typography. The open question is whether this rule makes sense anymore. Sure, LaTeX is a great software for creating justified text, but what is the underlying purpose? Will a text lose its readers or become less readable if the justified layout is not there?
The interesting situation is that in the Internet world, justified text is seldom seen. All the browsers, including Firefox, work with a much simpler word wrapping technique. And nobody has complained that the text is difficult to read. So it seems that only in the pdf format has justified text become the de facto standard?
Admittedly, the situation is a bit obscure. So let us listen to the experts on why justified text is in widespread use. The idea is that a two column text which is justified on the right edge looks better. Better in the sense that it is more professional and has something in common with an artist's work. Exactly this kind of layout is sought by the LaTeX community. They want to create a layout which looks classic and shows a high professionalism.

Why LaTeX looks always the same

 

Critics and fans of the LaTeX document system have argued that a LaTeX formatted document always looks the same. The idea is that this increases the quality and makes academic papers easier to read. What is missing in the debate is an analysis of which element of a LaTeX document produces the typical layout. It is not the selected font and it is not the table of contents. But what LaTeX formatted papers have in common is the justified paragraphs.
Justification means that the right edge of a paragraph forms a straight line. This principle has been used in western typography for 400 years. It is the only allowed style for authoritative academic documents. The LaTeX community has adapted to this style.
The interesting situation is that from a typographic perspective there are many arguments against justified text. But none of these arguments is valid for the LaTeX community. Basically spoken, the idea is that justified paragraphs are the most important rule and can't be changed.
The interesting situation is that from a technical perspective it is possible to create flush left text with LaTeX too. The ragged2e package was created for this purpose. But the resulting pdf paper looks different from the usual LaTeX document. It has a very unique style, and some design experts will say that it looks better than justified text.
Basically spoken, LaTeX means that the paragraph is justified. The combination of word wrapping, hyphenation and the microtype package can do this very well. The question is whether it makes sense to format documents this way. From a historical perspective, justified text is equal to professionally formatted text. Only very complicated typesetting systems are able to do so. Programming an algorithm like the microtype package is not a trivial case. The idea is that this effort results in a high quality layout. That means, after switching on the justified option the text will become a better one?

October 27, 2021

The \raggedright command in LaTeX

 

There is a seldom used command available in LaTeX which does not introduce a new feature to improve the layout of the document: \raggedright deactivates the justified text. After inserting the command at the beginning, no hyphenation is done and the text uses the simple word wrap algorithm known from text editors and LibreOffice. The rest of the formatting remains unchanged.
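A minimal sketch of the usage:
\documentclass{article}
\begin{document}
\raggedright % from here on: simple word wrap, no justification
After this command the paragraphs keep the same font and placement,
only the right edge of the text becomes ragged.
\end{document}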
The interesting situation is that with this simple command the resulting document will look very different. It has nothing in common with a normal LaTeX file, but looks like a badly formatted pdf file which was either created by a php script or with LibreOffice. That means, the subjective quality judgement is that the \raggedright command has reduced the quality from excellent down to zero. It was mentioned already that the font is the same and the pictures are placed at the same position; only the word wrap algorithm was modified. Is it possible that the elegance of a LaTeX generated document is only the result of the word wrap algorithm?
This remains unclear because the command is nearly unknown in the LaTeX community. Most users have no reason to use this command because high quality word wrapping is the reason why they have switched to LaTeX.
To understand the situation we have to focus on the normal word wrapping algorithm in LaTeX. By default, LaTeX formats the text justified, that means the right edge forms a straight line. This is realized with paragraph-wide optimization, hyphenation and micro typography. The result looks amazing. LaTeX easily outperforms possible alternatives like LibreOffice because most paragraphs don't need any hyphenation and the amount of white space is low.
The \raggedright command deactivates this entire feature. No adjustment is made and the line wrapping works the same way as known from web browsers. The text can be read, but it does not look professionally typeset.

Comparison between python and C++

 

There are two important programming languages available, which are Python and C++. The first one can be programmed much more easily, but the code runs slowly. Python programs are in general around 30x slower than the C++ counterpart. The additional problem is that code written in Python can't be deployed as a standalone binary because the Python runtime engine is needed.
On the other hand, C++ is the de facto standard for writing low and high level software. The interesting situation is that both languages are used by today's programmers because they fulfill different needs. C++ is an example of a classical programming language. There is a need to learn the language first, use a complicated IDE and take care of datatypes, pointers and object oriented features. In contrast to previous languages like Pascal or C, the C++ standard has simplified programming and can be used to create graphical applications. If someone likes to create a paint application or to write a game, then the C++ language is the way to go.
But if C++ is the number one language for creating applications and games as well, what is the idea of Python? The idea behind Python is that there is no need for programming; what is created instead is a script. A script is a Visual Basic like macro which automates something. It doesn't make sense to package a script into a program nor to upload it onto the internet; scripts are locally run modifications to existing software, similar to creating an Excel spreadsheet. So Python is not a programming language but a scripting language.
Basically spoken, if someone likes to automate something but has no time to write source code, then the Python language is the way to go.
Python and C++ each provide a certain perspective. C++ is a language which is oriented on the needs of a computer. Similar to Assembly language or C, C++ makes it easier for a human to communicate with a computer. The programmer doesn't need to know opcodes in hexadecimal notation; he can write down function names and datatypes. With this assumption it is pretty easy to write software.
On the other hand, Python is oriented on a problem. It supports problem solving, and things like datatypes and pointers are not needed. The interesting situation is that Python isn't able to replace C++. Even if Python is the more recent approach to programming, it leaves many problems unsolved. That means Python is only an additional language on top of C++.
Let us try to criticize C++ a bit. Suppose a programmer has no need for fast running software and no need to deploy the code in a production environment. Under such constraints, there is no need to use C++ anymore. The reason why Python has become a widespread language is changing demands. Running software very fast was only important until the 1990s, in which computer power was limited. And deploying production ready code is only needed if the code should be used by a larger audience.
This allows us to imagine what the purpose behind Python is. Python is used on fast modern computers on which performance is no longer a problem. And it is used to implement prototype software.

Groups for programming languages

 

The amount of programming languages is endless. Instead of describing a single one, it makes sense to put them into baskets and compare the entire category. A modern programming language works with an interpreter and is easy to learn. A typical example is Python. Compared to Ruby and JavaScript (which are in the same category) the Python language is more widely used.
The next group of languages consists of classical compiled languages. These languages are used by experts to create full blown desktop applications. C++ is the most dominant language in this group. The language is very powerful and much harder to learn than Python. Other examples of classical large scale programming languages are C#, Java, C and Rust.
The third group contains esoteric languages which work quite differently from the previously mentioned categories. Typical examples in this group are Lisp and Forth. It doesn't make much sense to compare languages from different groups with each other because they are very different. The better idea is to select one category first and then ask which of the languages works fine.

October 26, 2021

Introduction into object oriented programming

 

Before the powerful OOP style can be explained there is a need to take a step back and analyze how programs were created before the invention of object oriented programming. The idea was that there are functions which accept a parameter and then return something to the main program. The source code for a simple calculator app is shown.
# without classes: plain functions that communicate via parameters
def init(mylist):
    mylist = [2, 4, 6]   # fill the list with example values
    return mylist

def add(mylist):
    result = sum(mylist)
    return result

def avg(mylist):
    result = add(mylist) / len(mylist)
    return result

def main():
    a = []
    a = init(a)          # every call hands the list over explicitly
    s = add(a)
    av = avg(a)
    print(a, s, av)

main()
Each of the three functions takes a list as input and returns a value or a list as output. The idea is that the main program communicates with the subfunctions with the help of the parameters.
In the example only a single parameter was used, but the concept can be extended so that the function header takes five and more parameters as input. It is not very hard to guess that this programming style works technically, but it is hard to read. The alternative is object oriented programming. OOP means that no parameters have to be sent to the function, because the function knows in advance which variables are needed.
The resulting OOP syntax is very easy to read and consists mostly of statements like “object.dosomething()”. All the needed parameters are stored in the object already as class variables.
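As a sketch, the same calculator in OOP style could look like this (the class name is made up for illustration):
# with a class: the data lives in the object, methods need no parameters
class Calculator:
    def __init__(self):
        self.mylist = [2, 4, 6]

    def add(self):
        return sum(self.mylist)

    def avg(self):
        return self.add() / len(self.mylist)

calc = Calculator()
print(calc.mylist, calc.add(), calc.avg())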
Somebody may argue that the different notation doesn't result in faster binary code. And he is right: OOP is mainly something for the programmer, not for the computer. In most cases OOP oriented compilers are harder to create and will produce slower binary code. But the technique allows the creation of larger programs.
Even if OOP has fallen a bit out of fashion, it remains the leading programming paradigm. Most code is written in this style, and even C code was rewritten in an OOP style. It is a very influential programming style. From a historical perspective, OOP was introduced to the mainstream programmer in the mid 1990s. In that period, modern OOP languages like Java, C++ and Python reached a wide audience. It is very difficult to find a larger project which is not using the OOP programming style.
The interesting point is that the OOP concept can be extended drastically. It is possible to store another object in an object. This allows increasing the amount of code lines further. At the end the program will need 10k and more lines of code, occupied by lots of classes. From a technical perspective such a program is not very hard to compile or to execute. A program which has 10k lines of code fits into 400 kilobytes of disc space. The resulting binary file will also need around 400 kb of RAM. Compared to the average RAM of a computer this is a very small amount of space. But creating all these code lines takes a long time. A single programmer will need many months until the program is created.

October 25, 2021

Why do people reject C++?

 

C++ is known as a complicated to use but powerful programming language. Lots of tutorials were created around the ecosystem and many programmers explain to the newbies that C++ is difficult to learn. But what exactly does difficult mean?
The first thing a newbie should learn is to make things more comfortable. A good starting point is to use the command #include<myclass.cpp> at the beginning. This allows putting the classes in different files without creating header files. The second step to becoming a C++ expert is to ignore the so-called heap memory entirely and put all the class instances on the stack frame. This avoids the pointer syntax and the code will look very clean. In addition the auto keyword can be used, and the resulting C++ code has much in common with Java and Python.
The interesting situation is that it is pretty easy to create games, GUI applications or backend software in the C++ language. If the user runs into a difficult problem, it is almost certain that Stack Overflow knows the answer already. Admittedly, C++ is more complicated to program than Python. But suppose a Python interpreter isn't available and C++ is the only language at hand. Then it is a reasonably good language which fits most needs.
The amount of existing C++ books, GitHub projects and Stack Overflow questions is high, so the perception is that C++ is the opposite of a difficult to learn language. Using C++ for creating standard applications is like walking on a path which was used by many thousands before. It is unlikely that the user will get stuck implementing the projects he wants. And object oriented programming makes it easy to create large projects.
The real problem with C++ is that, similar to potential alternatives like C# or PHP, the productivity of programmers is limited. That means even experts are not able to write more than 10 lines of code per day, and they fail at creating entire operating systems on their own. So the only sensible projects for a newbie are small programs, for example a prime number generator or a Pong clone.

October 24, 2021

What is wrong with C?

 

Over the years, many programming languages have been invented. But which one is the best? To answer the question, the existing languages have to be ordered into categories. The following groups are available: interpreted vs compiled languages, procedural vs object oriented languages, mainstream vs uncommon languages. The idea is that a certain combination of these categories will result in a very good language.
Let us start with the famous Python language. Python is for sure one of the most popular languages ever. In contrast to the earlier Visual Basic, Python is accepted by everyone. No matter if the programmer prefers Linux or Windows, no matter if he is a beginner or an expert, Python is loved by everyone. Python is located in the interpreted group and has object oriented features. This combination makes it very easy to create programs. A possible alternative to Python is Groovy, which fills a similar role but is less popular.
The main problem of Python is that the performance is poor. It is not possible to write operating systems, libraries or production ready code in this language. So a second language is needed.
An interesting counterpart to Python is C++. C++ was used in the 1990s before scripted object oriented languages were available. The main idea is to combine OOP features with a compiler. The problem with C++ is that the user has to use pointers all the time. Possible alternatives to C++ like D promise to solve this issue, but real programs written in D use pointers as well. A typical situation in these languages is that the user has to initialize an object on the heap and then pass the pointer to this object to other objects. This programming style is very complicated to master, so C++ isn't the best language on earth.
From a more abstract perspective the question is how OOP features are related to compiled languages. It seems that OOP works fine for scripted languages. In the case of compiled languages, OOP is harder to realize. So a working hypothesis is that compiled languages shouldn't use OOP features at all. And this leads to the very interesting, old and new language C. C is a compiled language which has pointers but no OOP features. Writing code in C is different from writing code in Python. C is located on the low level and the user has to fulfill the needs of the machine. In exchange, C provides fast binary code which works great in a library.
The open question is whether OOP features can be combined with a compiled language. Let us try to convince existing Python programmers to switch to the C++ language. Unfortunately, this is impossible. Python programmers have no advantage if the code is compiled, but it will make the edit-compile-run cycle slower. Also, a Python programmer doesn't need to use pointers, so the C++ language doesn't make sense for him. The prediction is that C++ will struggle to replace Python. And that means C++ struggles to become the number one high level language used for prototyping purposes.
In the second step the idea is to convince existing low level C programmers to use C++. The advantage of C++ over plain C is that C++ has classes. Unfortunately, C programmers have no need for classes. The reason is that C programmers don't implement prototypes or write hello world scripts; C programmers write low level libraries and operating systems. That means C++ will fail to become the number one low level language. And this means C++ can't become the number one in any category.
The logical consequence is that the attempt to combine a compiled language with object oriented features has failed. Trying to do so is a dead end in programming language design. In general the categories are:
1. low level procedural languages which are compiled and very fast
2. high level object oriented languages which are interpreted and slow
It is not possible to invent a language which fulfills both needs. Instead, separate languages are needed for each purpose. Basically spoken, the following programming languages are outdated: Java (but Groovy not), C++ (but C not), and C#.
From a positive standpoint, C programmers will never switch to C++, and Python programmers will never switch to C++ either. A simple look at the existing programs written in C and Python will show that both programming styles are so powerful that no other language in the world can provide an alternative.

October 23, 2021

The LaTeX microtype package

 


Most features of the LaTeX typesetting system have to do with line wrapping. The LaTeX internal algorithm has to decide which position each word gets and in which situation a hyphenation is needed. On the left of the screenshot, the normal LaTeX typesetting behavior is shown. The result is a justified text which has no overlong white spaces. This makes LaTeX the better choice over alternatives like LibreOffice. But in spite of all this comfort, there are three hyphenation signs, which can be reduced to zero with the microtype package. On the right side the improved version is shown, which is also LaTeX but with the recent version of the mentioned package.
At first glance it looks like magic how the words were formatted, because it is the same text with the same font, but this time microtypographic modifications were made which are not visible to the untrained eye. The only thing that can be recognized easily is the missing hyphenations. Similar to the normal LaTeX output in the left window, no long white spaces are there and the paragraph looks easy to read.
As with most LaTeX features, the end user has nothing to do; the package works in the background. Somewhere in the LyX settings there is the option to activate the microtype package, which is recommended for improving the rendering, and then LaTeX does everything on its own. That means the same text will look improved without any human intervention.
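Outside of LyX, activating the package by hand is a one-line change in the preamble; a minimal sketch:
\documentclass{article}
\usepackage{microtype} % enables character protrusion and font expansion
\begin{document}
The paragraph text is entered exactly as before; the package adjusts
the glyphs in the background without any further commands.
\end{document}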
This emphasizes the leading role played by the LaTeX typesetting system. Its claim is that the software produces higher quality than any other word processing software and is easier to use than commercial alternatives. And yes, compared to the justified text of LibreOffice or the archaic flush left text of a web browser, the rendered pdf file looks amazingly good. It is comparable to the quality in a printed book.

October 20, 2021

Understanding the latex ecosystem

 



Before it is possible to reprogram LaTeX from scratch there is a need to understand what LaTeX is doing. In most cases the goal is to produce a nicely formatted pdf document as output. But a PostScript file can be accepted as well, because PostScript can easily be converted into pdf with the help of the ghostscript tool.
The remaining question is how to convert a LaTeX file into PostScript. Let us take a closer look at what PostScript is about. PostScript defines, on a low level, the position of the elements in a document. A simple two column text document is given as an example.
%!PS-Adobe-3.0
/Times-Roman findfont
9 scalefont setfont

/column1 {
0.5 setlinewidth
40 300 250 400 rectstroke    % column frame: x y width height
40 690 moveto 
(Hello World!) show
40 670 moveto 
(This is an example text to demonstrate how the Postscript language) show
40 660 moveto 
(is working internally. This is an example text to demonstrate how) show
40 650 moveto 
(the Postscript language is working internally.) show
} def

/column2 {
0.5 setlinewidth
310 300 250 400 rectstroke   % second column, shifted to the right
} def

%------main---------
column1
column2

showpage
The interesting point is that PostScript has no line wrapping command; the text lines have to be provided individually. What a LaTeX compiler does is create such a PostScript file, because creating a PostScript document by hand takes too long.
It seems that a LaTeX like processor needs to fulfill the following requirements:
1. convert a LaTeX file into a PostScript file
2. use boxes to place the information on the page
Let us go into the details. In the PostScript file, two boxes were defined. The rectstroke commands accept only absolute coordinates, and the (0,0) position is bottom left. So the question to answer first is: where exactly is the x/y position of a box? Right, only LaTeX knows this. It depends on the text and a bit of mathematical calculation. But in theory it is possible to do the calculations with a software program automatically, in the sense that the program code creates the boxes for a longer document within milliseconds.
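A minimal sketch of such an automatic calculation for a two column US letter page (the margin and gutter values are assumptions):
# compute absolute PostScript coordinates for two text columns
# on a US letter page (612 x 792 points)
PAGE_W, PAGE_H = 612, 792
MARGIN, GUTTER = 40, 20

def column_boxes():
    col_w = (PAGE_W - 2 * MARGIN - GUTTER) / 2
    col_h = PAGE_H - 2 * MARGIN
    # PostScript's origin is bottom left, so y starts at the margin
    left = (MARGIN, MARGIN, col_w, col_h)
    right = (MARGIN + col_w + GUTTER, MARGIN, col_w, col_h)
    return [left, right]

def to_postscript(boxes):
    lines = ["%!PS-Adobe-3.0", "0.5 setlinewidth"]
    for x, y, w, h in boxes:
        lines.append(f"{x:.0f} {y:.0f} {w:.0f} {h:.0f} rectstroke")
    lines.append("showpage")
    return "\n".join(lines)

print(to_postscript(column_boxes()))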

October 19, 2021

Reducing TeX to a minimum

 

Without any doubt, TeX is a powerful typesetting system which has revolutionized the computer industry. Unix combined with TeX introduced digital typography and reduced the costs for creating academic papers and books. The main problem with the TeX system is that the ecosystem has become too complex.
It seems that there is no higher authority who is able to remove some lines of code or certain programs from so-called TeX distributions, and as a result an endless amount of packages, binary files and different TeX compilers were created in the past.
Some smaller attempts were made to replace LaTeX, namely Lout and Sile. Both are LaTeX like compilers, but they do not provide the same functionality. Before it is possible to create a replacement for TeX there is a need to analyze what the project is about.
The main idea behind TeX, and the reason why the project is successful, is that it implements a list of layout commands. The current LaTeX ecosystem has around 250 different commands which are extended with parameters and additional packages. Not all of the commands are important. So the next question is which of the commands belong to the core of a typesetting system.
I have identified the most important single command, which allows creating boxes. A general \box command defines a frame on the output page. All the other commands in LaTeX for typesetting tables, mathematical equations or paragraphs have a lower priority.
Suppose a framed box is able to store different content like text, images or a table. Then it is possible to arrange the boxes on a US letter page, and the resulting .pdf file will look like the output of LaTeX. That means, in theory, the text or tables can be rendered outside of the layout formatter with an external program. There is no need to combine everything in a single program.
A minimal TeX replacement provides a small number of box creation commands and the ability to define columns and the page size. More is not needed to typeset a document. The estimation is that such a minimal program can be written in a low amount of code lines. What the software needs to do is parse a text file, recognize the \box command and render the pdf document as output.
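A minimal sketch of such a parser in Python (the \box syntax with four arguments is made up for illustration, and PostScript is emitted instead of pdf):
# reads commands like \box{x}{y}{width}{height} from a string
# and emits one PostScript rectangle per box
import re

BOX = re.compile(r"\\box\{(\d+)\}\{(\d+)\}\{(\d+)\}\{(\d+)\}")

def convert(tex_source):
    out = ["%!PS-Adobe-3.0", "0.5 setlinewidth"]
    for x, y, w, h in BOX.findall(tex_source):
        out.append(f"{x} {y} {w} {h} rectstroke")
    out.append("showpage")
    return "\n".join(out)

sample = r"\box{40}{150}{200}{400} \box{350}{150}{200}{400}"
print(convert(sample))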
The interesting point is that this simple functionality allows the creation of longer books, because what is in the boxes is fixed.
Example


From a technical perspective, a LaTeX compiler is a printer driver which works as a pipeline. That means multiple steps are executed after each other and at the end the output is printed. For creating the LaTeX software itself, the transition from an input text file to the rendered pdf document is what matters.
For simplification, the idea is not to create pdf files, which usually requires additional libraries; the PostScript format can be created much more easily. The task for a LaTeX typesetting software is to convert a “.tex” file into a “.ps” file. An example PostScript file is given next. It consists of boxes which are filled with content.
 
%!PS-Adobe-3.0
/Helvetica findfont
12 scalefont setfont

/text1 {
40 650 moveto 
(Hello World!) show
} def

/text2 {
350 400 moveto 
(Text is here)
show
} def

/box1 {
1 setlinewidth
40 600 40 20 rectstroke      % small box: x y width height
} def

/box2 {
2 setlinewidth
40 650 500 30 rectstroke     % wide box across the page
} def

/columns {
0.5 setlinewidth
40 150 200 400 rectstroke
350 150 200 400 rectstroke
} def


%------main---------
text1
text2
box1
box2
columns

showpage
The problem with the PostScript format is that creating such files manually is a time consuming task. The reason is that all the coordinates are absolute values. That means, somebody has to enter that the second column has its bottom left position at (350 150). The logical next step is to write software which determines the values automatically. This task goes beyond this small introduction.

Boxes in LaTeX

 


The main principle in the LaTeX rendering engine is the ability to show boxes as graphics. A minimal TeX like language consists of a single command, which is \box. This command allows typesetting a framed rectangle with different sizes. The box command is rendered into a concrete output. A single box can contain subboxes.
Let us slow down a bit. The current luatex program consists of around 1 million lines of code and lots of documentation. The ecosystem has grown over the decades, and nobody is able to understand everything around TeX. There is a need to simplify the program drastically. The idea is to program some sort of minimalist LaTeX rendering engine, on the principle that short programs are better than larger ones.
The \box command, which is provided in a text file, is the basic requirement for such a LaTeX replacement. The software has to parse the text file and convert the box commands into a layout stored in a pdf file. Such a program is maybe useless if someone likes to typeset a real document, but it makes clear what LaTeX is about.

October 18, 2021

Understanding the core idea behind LaTeX

 

LaTeX is known as the most powerful typesetting program available. The open question is which of the principles are future ready and which are not. At first let us describe some elements of TeX which can be ignored. The first problem is that the ecosystem is very huge. The TeX Live distribution is not a single binary program; it consists of hundreds of different programs which were created with millions of lines of code over decades. The problem is that nobody knows which of the code is relevant anymore.
The second problem with TeX is that the original program “tex.web” was written with literate programming in mind and is based on the Pascal syntax. Both ideas might be interesting as separate projects, but they have nothing to do with typesetting.
After this negative description, let us search for some elements which work great in TeX. The first thing is the idea to use a markup language to create the layout. LaTeX provides around 250 commands which are rendered into a pdf document. This concept is similar to what gnuplot is using, and it is very powerful.
The second interesting feature of TeX is that the rendering mechanism works with boxes. A box is a 2d space surrounded by a frame and can contain a single character, a paragraph, or an entire page. These boxes are arranged by the LaTeX renderer in the pdf document.
For reimplementing LaTeX, some software has to be written which understands 250 different TeX commands and uses boxes to create a .png image or a pdf file. The LaTeX markup language is used as input, it gets converted into boxes and then the output is generated as an image file. So the open question is how to implement such a software.
One possible attempt would be an interactive prototype which contains an input window on the left. The user types in a paragraph, and on the right side the rendered boxes are shown as an image. What the underlying renderer does is create new boxes and take decisions about their position on the screen.

October 16, 2021

The sile typesetting system

 

A lesser known possible replacement for LaTeX is Sile. For a one to one comparison the source code can be fetched and then the lines of code are counted:
git clone --depth 1 https://github.com/sile-typesetter/sile.git
cloc sile
No matter what the original documentation of the project likes to explain, the cloc tool is an objective measurement of what the project is about. According to the statistics, Sile was mainly programmed in Lua with some parts in C++. The overall count is 450k lines of code. Now we can compare this number with the well known luatex project. cloc will show that luatex consists of 1 million lines of code. So we can say that Sile has half the size of luatex.
Unfortunately, the amount of 450k lines of code is, similar to the original luatex project, not a small github code repository but a larger one. Maintaining all the code will need a lot of work. Just to get a better understanding of the size: one line of code needs around 40 bytes. Suppose the code is converted 1:1 into a binary file, then 1 million lines of code is equal to 40 MB for the binary file. So we are speaking about a 20 MB executable file for the Sile project and about a 40 MB file for the luatex project.
Now it is possible to analyze the reason why these layout engines are so big. What Sile, luatex and possible future LaTeX replacements have in common is that they implement commands for typesetting. Even if the vocabulary of all the LaTeX commands was never standardized, there is some sort of list available of which LaTeX commands are used frequently. In this list around 250 different commands are available, which includes commands for typesetting tables, mathematics and showing images in the document.
A well programmed layout engine has to implement these 250 commands at minimum. In the case of Sile the syntax is a bit different, but this is only a detail problem. The open question is how to program a layout engine which takes the 250 possible commands as input and produces a pdf file as output.
This is realized with lots of lines of code:
Sile: 450k lines of code for 250 LaTeX commands -> 1800 lines of code per command
luatex: 1000k lines of code for 250 LaTeX commands -> 4000 lines of code per command
So the only difference is how many lines of code are needed per command on average. From a technical side it is possible to implement only 10 commands, but such a layout engine is less powerful than the original TeX software. That means, even if someone reinvents LaTeX, he has to implement all the 250 commands.

Why LaTeX is great

 


Sometimes newbies ask why they should prefer LaTeX over possible alternatives like MS Word or LibreOffice. The answer is: they shouldn't. Instead, the best practice is to give the alternative software a trial and compare the results themselves.
The initial screenshot shows the same paragraph of text formatted with different programs, and the reader has to guess which of the outputs was generated by the LaTeX tool. A small hint: it has to do with the amount of hyphenation used in the text. One of the programs is able to reduce the hyphenation down to zero, the other is not.
The interesting point is that the LibreOffice software can't be modified so that it will produce better results. It is a built-in bug. And the only way to avoid it is to switch to LaTeX.
 
Both paragraphs have the same number of lines, which is 10. Both use the same font, but only one of the paragraphs was formatted well while the other was not. There must be a reason why most technical books and papers were created with LaTeX and not with other programs. It has to do with the fact that even recent versions of LibreOffice and MS Word are not able to create high quality documents. And this example shows only the problem with the word wrapping. Other problems like floating images, bibliographic indices or mathematical formulas are not even mentioned yet.

How to reprogram LaTeX from scratch?

 

At first glance, the attempt to do so looks very complicated. And some projects from the past like Lout have demonstrated that it is not possible to reinvent the LaTeX software. But perhaps it is possible if the existing LaTeX engines are investigated much better. So the first question is: what exactly is LaTeX?
The current luatex repository is available at github and consists of 1 million lines of code. That means LaTeX is not a small software project and not a mid size one; it is very large. The assumption is that a possible LaTeX replacement needs at least 1 million lines of code, otherwise the task can't be realized. The problem is that such a codebase can't be written by a single person, so there is a good reason not to do so.
But suppose we try it, no matter what the success rate is. The first thing to do is to describe which parts of the LaTeX project are realized very well and which are not. The main idea behind LaTeX is that in the same text file the document text itself plus the LaTeX commands are written, similar to the markdown syntax. So LaTeX is at its core a rendering engine which takes a LaTeX command like \TeX and converts this command into graphical output. And yes, this is a very good idea. Similar to the gnuplot project, the idea of using commands allows creating very complex output.
What works badly in the LaTeX ecosystem is that the existing LaTeX to pdf compilers like luatex, pdftex or xetex are very complicated and hard to maintain projects. Improving a code base which consists of 1 million lines of code doesn't make much sense.
Suppose the idea is to rewrite LaTeX from scratch. The first thing to do is to write down the requirements. LaTeX is basically a text parser which understands around 250 commands. These commands are defined in the LaTeX reference. The \TeX command was mentioned already, but there are many other commands available for printing text in bold face or for formatting a page. A possible LaTeX replacement has to understand all these commands. Then it can be executed from the command line or from the LyX frontend and converts a .tex file into a .pdf file.
So the question is how to program a parser which understands around 250 typesetting commands. Exactly this seems to be the bottleneck. Suppose each command is realized with 100 lines of code. Then the parser will need around 25k lines of code. A command like \indent can be realized easily, but commands like \table will need a lot of code.
pylatexenc
There is a Python library available which simplifies the task of parsing LaTeX files. It is called pylatexenc and takes a LaTeX file as input and prints out all the recognized commands. What this library isn't capable of doing is converting the recognized strings into the pdf format. But such an engine can be created. The combination of the pylatexenc project plus a pdf output would result in a full blown LaTeX to pdf converter. This is equal to creating a replacement for luatex or pdftex.
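A minimal sketch of the parsing step with pylatexenc (the input string is made up):
# parse a LaTeX fragment and list the recognized macro names
from pylatexenc.latexwalker import LatexWalker, LatexMacroNode

source = r"Hello \textbf{world}, see table \ref{tab:results}."
walker = LatexWalker(source)
nodelist, pos, length = walker.get_latex_nodes(pos=0)

for node in nodelist:
    if isinstance(node, LatexMacroNode):
        print("macro:", node.macroname)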
Let us go a step backward and investigate why former attempts to reprogram LaTeX have failed. The first thing to mention is that former projects have assumed that the markdown or the xml format is able to replace the LaTeX syntax. No, they can't. LaTeX consists of around 250 dedicated commands which were designed for typesetting. A LaTeX replacement needs to implement these commands.
The other attempt to replace LaTeX was made with GUI systems like LibreOffice. These programs can't replace LaTeX because they do not provide low level textual commands. So we can say that LaTeX is equal to a converter which takes a “.tex” file as input and converts it into a pdf file.
Creating such a tool is not easy, but it can be simplified drastically, and there is no need to use the existing luatex software for this attempt. Luatex has some major problems. The first is that the project is too large; 1 million lines of code is the current size. The second problem with luatex is that many typographic elements like a three column layout or better hyphenation for non-English languages aren't available. That means luatex is not the best choice for creating pdf files.
The interesting situation with the pylatexenc library is that the project is relatively small. It consists of only 10k lines of code in Python plus some XML files for the unicode mapping.
PyFPDF
Another interesting existing tool is PyFPDF, a Python library for generating pdf files directly.
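A minimal sketch of its usage (the exact API may differ between PyFPDF versions):
# place one line of text on a pdf page with PyFPDF
from fpdf import FPDF

pdf = FPDF()                       # default A4 portrait page
pdf.add_page()
pdf.set_font("Helvetica", size=12)
pdf.cell(40, 10, "Hello World!")   # a 40x10 mm cell containing text
pdf.output("hello.pdf")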
Why LaTeX rocks
In general the existing LaTeX ecosystem works very well. That means even if the luatex project consists of lots of code lines, it does what the normal user likes. The main advantage of LaTeX is that with some simple commands it is possible to create nice looking pdf files. These commands can be typed in manually or with a frontend like LyX.
So we can say that the core idea behind LaTeX works well. There are around 250 commands which are put in combination with the normal text, and this generates a document. The only question is about the details of this process. For example, do we really need 250 LaTeX commands, or would it be possible to use only 150 of them? Or do we really need the luatex project, or is it possible to create a much simpler LaTeX compiler?
Perhaps the best example that the original TeX project consists of obsolete technology is the transition from the WEB programming language invented by Knuth to the more recent C dominated programming style. That means the original TeX software was written in a different language than today's luatex engine. And it is possible that future versions of LaTeX will be written in a different language again.

Writing an operating system from scratch

 

Existing operating systems are described in a certain way. This affects how programmers think about the creation of new operating systems. The interesting part is that an operating system can be reduced to its core functionality, which is the file system. The DOS operating system was working with the FAT12 and the FAT16 filesystems. That means the overall system consists of a hard drive device driver plus a command shell on top of it. The same understanding is valid for Linux systems as well. Linux, in contrast to the famous myth, is not a kernel but a combination of the ext4 file system plus some add-ons.
The reason why the main focus is on the hard drive, but not on networking, the GUI or the ability to run programs, is that interacting with the hard drive is what most users are doing with their system. The typical MS-DOS user inserted floppy disks into the PC and then files were read and written. And the typical Linux user is accessing the hard drive as well. All the other features of an operating system are optional. They are nice-to-have gimmicks, but they are not essential operations.
Let us investigate what will happen if a computer has no hard drive and no floppy drive. Then the user isn't able to store texts and can't save computer programs. So it is not a computer but only a calculator. A pocket calculator has a built-in CPU, and of course it has an operating system, but a pocket calculator can't do much of anything.

October 14, 2021

Reinventing LaTeX

 

The current LaTeX project is a mess. The luatex implementation which is available in most Linux distributions consists of around 1 million lines of code in total. 50% of the code was written in C, the rest in other languages like Lua and shell scripts. The unsolved question is how to reinvent the well-known LaTeX software.
Some projects have tried to do so in the past, namely “lout” and “rinohtype”. The obvious difference is that these projects are much smaller ones. For example, “rinohtype” was announced as a Python project which consists of only 6500 lines of code. It is no wonder that such a small project doesn't match the needs of today's publishers. So the question is how to do it better?
A very basic LaTeX replacement would, similar to the gnuplot software, have a command interpreter at its core. That means, after starting the binary file, the user gets a command prompt in which commands can be entered. These commands affect the rendering of the text on the screen. For example, after entering the command “fontsize 10”, the document shown on the screen is re-rendered with the new font size. The idea is that the user can try out different options in the command prompt, and if the result works fine, the commands can be copied into the original file.
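A toy version of such a prompt can be written in a few lines of Python. Everything here is illustrative: the two commands “fontsize” and “quit” are a made-up vocabulary, not an existing tool:

    # Toy command interpreter in the spirit of the gnuplot prompt.
    state = {"fontsize": 11}

    def execute(line):
        parts = line.split()
        if not parts:
            return
        if parts[0] == "fontsize":
            state["fontsize"] = int(parts[1])
            print("re-rendering the document at", state["fontsize"], "pt")
        elif parts[0] == "quit":
            raise SystemExit
        else:
            print("unknown command:", parts[0])

    while True:
        execute(input("> "))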
The only problem with this idea is that it is very complicated to write a parser which accepts thousands of different commands. Typesetting is a very complex field, and there is a need to fine-tune every detail. The existing LaTeX project is so big because many requirements are fulfilled. The chance is high that even a reprogrammed LaTeX will consist of one million lines of code.
 
Command line interpreter
The basic functionality of a LaTeX-like software is to parse a mixture of plain text and formatting commands. The typical hello-world example for a LaTeX file is “Hello world \TeX”. The command after the backslash triggers a certain rendering action. LaTeX is mainly an endless list of commands which start with a backslash, and then the output is rendered.
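This endless list maps naturally onto a dispatch table: each backslash command points to a handler which produces output. A hedged sketch, with handler names and behavior invented for illustration; it consumes the (kind, value) tokens from the tokenizer sketched earlier:

    # Map command names to rendering handlers; unknown commands are ignored.
    def render_TeX():
        return "TeX"      # stand-in for rendering the real TeX logo

    def render_par():
        return "\n\n"     # start a new paragraph

    HANDLERS = {"TeX": render_TeX, "par": render_par}

    def render(tokens):
        out = []
        for kind, value in tokens:
            if kind == "text":
                out.append(value)
            elif value in HANDLERS:
                out.append(HANDLERS[value]())
        return "".join(out)

    print(render([("text", "Hello world "), ("command", "TeX")]))
    # prints: Hello world TeX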
The Overleaf project shows very well what the idea is. The user can copy & paste the input text into the left window, and the output is rendered in the right window. The assumption is that this principle is a powerful idea and should be implemented in a potential LaTeX replacement.
The assumed project would start with a small vocabulary which contains only 100 possible commands, and then many more commands would be implemented over time. Formatting complex documents can only be realized if 1000 and more LaTeX commands are parsed and rendered into graphical output. The idea is that the layout engine becomes more powerful as more commands are provided.
In contrast to a common assumption, this kind of interaction isn't difficult to master, because it is possible to write a command reference similar to what is known from gnuplot. That means the documentation provides a hierarchical list of scenarios, and each of them shows the usage of a command. The user has to identify the needed command in the documentation and can copy and paste the example into the window. This makes it possible to create longer documents.
The underlying idea behind LaTeX is a text window in which a short snippet can be inserted by the user. This text snippet is converted into an image which is rendered into a PDF file. An entire paper or document consists of the text plus the LaTeX commands which start with the backslash. The open question is which commands are needed and how many different commands are useful.

The filesystem as the core element of an operating system?

 

The thesis is formulated only as a question because the situation is a bit unclear. Operating systems are usually perceived through their GUI. The Windows OS has a certain form of window manager, and the Linux OS has a different one. But an operating system also has something located beneath the surface, which is the file system. Linux systems usually work with the ext4 filesystem, Windows systems are based on NTFS, and macOS works with APFS.
The interesting situation is that apart from the mentioned very powerful filesystems there are many others available, for example the famous FAT16 filesystem which was used in the MS-DOS age, or the ZFS filesystem which is used in FreeBSD. What these filesystems have in common is that they are completely incompatible with each other. Even so-called open source filesystems like ext4 are only available in Linux. Until today there is no simple-to-install software available for Windows to mount an ext4-formatted hard drive. Such a tool is only available for the btrfs filesystem, but btrfs is not used by most Linux users.
And other filesystems like NTFS are also not available for more than a single platform. It seems that the different operating systems are using their filesystems to make their users dependent on them. But what exactly is the ordinary user doing with his filesystem? The surprising insight is that the use-case scenario is mostly the same for Windows and Linux users. In most cases the user has a home directory in which all the files are stored hierarchically. In a network context, additional filesystems are mounted from a file server, which results in even more stored data.
Let us talk about some limitations. All the following user requests can't be realized with today's technology: installing Linux on the NTFS filesystem, installing Windows on the ext4 filesystem, reading ext4 USB sticks from Windows, writing to NTFS hard drives from macOS, reading ext4 partitions from macOS, mounting macOS partitions in Linux. It seems that the systems are not working very well together, for different reasons. In addition, many new filesystems are created each year. The chance is high that four years from today the ext4 filesystem will no longer be used in Linux but replaced by something different.
The only standard spoken by all major operating systems is FAT32. This outdated filesystem doesn't support journaling, but at least it can be read and written by most computers.

October 09, 2021

How Unix workstations became obsolete

 

There was a period in computer history in which so-called Unix workstations were the most advanced desktop computers available. Around the year 1988 the Apollo/Domain workstation series was sold for around 150k US$. These machines were used for CAD, programming and text processing tasks.
The interesting situation is that only a few years later the former workstations were no longer needed. Instead, the PC together with the Intel 386 replaced these machines. Intel 386 PCs were sold for less money and provided the same and even more performance. The more abstract reason why Unix workstations became obsolete is that the user is not interested in a certain use case but prefers computer technology with a certain specification: megabytes of RAM, the amount of MIPS for the CPU and, very important, the amount of megabytes on the hard drive. The IBM-compatible PC provided the better price-to-performance ratio.
The main reason why technical Unix workstations were available in the 1980s was historical. The idea was to compress former mainframe computers into a desktop case, and this was equal to the birth of the workstation. In the mid 1980s the normal IBM PC was too slow to replace former mainframe computers. The first IBM PC model was shipped without a hard drive and the processor was very slow. In contrast, the existing minicomputers and mainframe systems of the late 1970s were much more powerful. So the workstation connected both worlds.
The reason why workstations are interesting for today's users is not the hardware but the Unix operating system. Unix evolved into the Linux operating system, and it is used frequently on today's computers. Porting Unix to the IBM PC was an important milestone in computer history.