March 24, 2020

Freezing an academic paper

With the rise of the Open Access movement, it was heavily discussed in the literature what an academic journal is and what it is not. The most widely accepted definition was built around the term predatory journal. The term was coined to make clear what the difference is between a serious peer-reviewed journal and a joke, non-peer-reviewed journal.

The term predatory journal is widely accepted not because the definition is correct, but because it was used in many thousands of papers by Open Access experts. In this usage a predatory journal is simply a low-cost journal. What predatory publishers have in common is that the article processing charge is lower. The typical predatory journal is published online only, charges a reduced fee of around 100 US$ per paper, and no peer review takes place. Over a long time span this was an ideal definition for sorting the existing journals into two groups. But it fails to explain what peer review is.

A more elaborate definition divides academic journals into two groups: peer reviewed and non-peer reviewed. The problem is that the peer review process is much harder to define. Even Wikipedia has no clear understanding of what peer review is about. The working hypothesis here is that peer review is realized with a stable branch in which a frozen upstream gets evaluated.

Freezing the upstream is something more complicated than a normal preprint server. A preprint server is a location to which authors submit their manuscripts. Arxiv is a preprint server, but Academia.edu and a github folder are too. What all preprint servers have in common is that no peer review takes place. Somebody uploads a document and the reader doesn't know whether the document has a high or a low quality.

A naive assumption is that a preprint is transformed into a journal by peer reviewing the preprint. That means somebody sends the manuscript to an expert and the expert gives a quality judgment. This understanding describes only the surface. What is missing is the reason why somebody should peer review a paper. A typical assumption from the past was that peer review equals paid peer review. All the existing academic journals work with money in the loop. So the assumption is that a serious journal equals a high-priced journal.

The surprising thing is that this definition also doesn't describe the complete picture. It's possible to combine high-quality peer review with a non-commercial journal. The important underlying process has to do with freezing the upstream. Freezing is a term used by Open Source advocates who work with the git version control system. A freeze amounts to creating a branch. On the command line it's done with a simple “git branch stable”. This creates a new branch, called stable, which holds a snapshot of the master branch. A freeze is a point-in-time copy of the existing files.

Let me give an example. Suppose an author has uploaded an HTML document to a preprint server. The HTML document contains an 8-page paper with 20 bibliographic references at the end. Now somebody else creates a copy of the document. He is freezing the upstream document. The result is that both documents, original.html and copy.html, can be edited independently from each other. The files are located in different folders. The ability to edit a document independently produces a version conflict. To overcome the version conflict, some sort of communication is required.

The communication needed to overcome a version conflict is exactly the peer review process of an academic journal. There are different options for doing so:

- dedicated peer review by external experts

- decision making by the journal editor

- negotiating on a mailing list

- overwriting the version by technical means, because user1 has admin rights while user2 does not

A dedicated peer review is only one option for solving a version conflict in a two-branch project. The peer review process doesn't come first; it is an answer to a version conflict. The underlying reason is the out-of-sync behavior of two branches which hold the same file. Branch1 is maintained by the upstream on the preprint server, while branch2 is maintained by the journal in the downstream. Let us investigate what a potential alternative to an upstream freeze looks like.

Suppose the idea is not to create a stable branch but to contribute to the original.html file in a different fashion. The workflow would look as follows. At first, the author uploads the original.html file to a preprint server. Now a second user likes the article but would like to add something. He sends an e-mail to the author with an additional paragraph. The original author accepts the modification and a new file is uploaded to the preprint server, original-improved.html.

That means the modifications to the file take place in a single branch. The original author and user2 communicate back and forth, and if they have found a shared position, the file gets updated. Conflicts are not possible, because if the original author doesn't accept a modification the user has no way to modify the file.

The major difference between a single-branch development model and a two-branch model is that in the two-branch model a conflict is possible. This conflict produces a certain communication style. Perhaps it makes sense to provide an example. In the normal single-branch model, the original author owns the file and the second user is a subordinate. In a two-branch model the second user owns the stable branch and the original author is the subordinate. This kind of flipped social relationship exists for all peer-reviewed journals. The original author sends a manuscript to a serious peer-reviewed journal only to become the subordinate of the journal editor. Not the author but the journal decides whether the submission has a high quality. The advantage of this flipped role model is that the reader of the journal benefits from it. The reader trusts a journal when not the authors but the journal editor takes the decisions.

March 23, 2020

Creating academic journals as a Linux distribution



The best role model for an academic journal is the Debian Linux distribution. Debian works with two sections: upstream and downstream. A minimal academic journal consists of a wiki page which contains two sections, one for upstream and one for downstream.

The main feature is that the upstream and downstream sections run out of sync. In the downstream, the same papers are available as in the upstream section, but they have a different version. In terms of the git version control system, the downstream section is a fork of the upstream. The result is that both sections can be edited independently from each other. This produces a lot of chaos, and there is a need for an intermediate maintainer. His obligation is to sync the downstream with the upstream. And for doing so, some decisions have to be made.

The result is a working journal editing pipeline. The overall system accepts incoming manuscripts, provided by authors, in the upstream section and it generates stable releases which are consumed by the audience. The idea is not completely new. The upstream section is sometimes called a preprint server, while the downstream section is essentially an overlay journal. What was missing in the past is a clear, minimalist description of how to build such a pipeline.

The easiest system to realize holds all the sections in a single wiki file. That means the upstream and downstream sections are not branches in a github project, but sections in a text file. Then the changes to the text file have to be tracked. How well the system works depends only on the amount of edits. The more authors and maintainers are able to participate, the more efficient the journal will become.

Perhaps it makes sense to describe each part. The upstream section is equivalent to a classical submission system. Authors are invited to upload their manuscripts to a server. They can edit a document, which produces a new version. Every author can upload more than a single paper. This kind of preprint server makes sense for authors because it's a storage for their manuscripts, but the normal reader has no need to read through the documents. The upstream section is comparable to the Arch Linux project. There is a machine-generated trunk version which contains the latest version of each document. But this trunk version has no value for the reader.

In the “downstream” section the existing content gets aggregated. The first decision to take is which of the papers fit the academic journal. In the diagram the papers #1 and #2 are selected for the first issue of the journal. Issue #1 of the journal is a copy of a certain version from the upstream. It can be edited separately from the upstream version. This produces a conflict. Instead of providing a single trunk branch which holds all the papers, two branches are available which run out of sync. This two-branch model has a large impact:

- first, it generates distinct roles for author, reader and journal maintainer. They are located at different positions in the workflow.

- secondly, it produces unsolved questions. The maintainer has to decide which papers are the right ones and in which version they are accepted into the journal. The reader has the obligation to give feedback to the maintainer, and the author has to think about why a certain paper was rejected.

- third, the newly generated roles in combination with the unsolved questions result in a communication pipeline. A mailing list, a forum and an issue tracker are needed to coordinate all the stakeholders and requests.

Peer review made easy

Existing academic journals are equipped with peer review. This is the main advantage over a normal preprint server. A preprint server is only an online storage for a document, comparable to an individual blog, but a peer-reviewed journal provides a trust layer on top of a paper which makes it more likely that the paper gets referenced by others.

So what is the secret behind the peer review process? Does it have to do with sending a manuscript to experts? Yes and no. Peer review is the result of a two-branch development model, very similar to what Linux distributions are doing. The Arch Linux distribution can be compared with a preprint server; it doesn't have peer review. Debian, in contrast, consists of a stable and an unstable branch, and the result is some sort of moderation. Perhaps it makes sense to describe the overall workflow for a software project.

In the easiest case a single programmer creates a new project at github and uploads his self-written source code. By default a github project consists of a single branch, the master branch. Master is the development aka trunk branch. If the software author has created a new version of the software, he pushes the commit to this branch.

A more elaborate workflow consists of at least two branches: one development and one stable branch. By creating a stable branch, a point-in-time snapshot is taken from the development branch. After creating the branch, both branches will run out of sync. That means the same file helloworld.py can be edited in the development and in the stable branch independently from each other. The result is a conflict. The conflict surfaces as soon as both branches should be merged, because during the merge process the maintainer has to answer which of the versions is the right one.

Basically spoken, a second stable branch is created for the single purpose of creating a conflict during the merge process. Every conflict has to be resolved. This can be done with a mailing list or with a peer review. If only a single branch (the development branch) is available, no conflict exists and no peer review is needed. The conflict can be explained with social roles. In the example with the two-branch github model there are two conflicting roles: one programmer is responsible for the development branch and the other for the stable branch. The role conflict produces a higher quality of the project. That's the reason why the Debian Linux distribution is recommended for production servers, while Arch Linux isn't recommended for such a purpose. And for exactly the same reason, a peer-reviewed paper gets referenced by others while a non-peer-reviewed paper won't.

Let us go back to the inner workings of an academic journal. Suppose a journal consists of a development branch and a stable branch. The result is that in the stable branch some decisions have to be taken. The major decision is whether a paper in the development branch should be published in the next issue. Solving this problem can be done in many ways. Either a random generator is asked, a formalized rule book is asked, or, in the best case, an external peer reviewer is asked for a quality judgment. That means the maintainer of the stable branch of an academic journal makes his life easier if he sends an unpublished manuscript to external experts and asks them to review the content.

If the stable branch maintainer isn't doing so, he can't make a decision about whether the paper should be published. The consequence is that the next issue can't go online. The same situation exists for the Debian distribution. Before the next major release is published, the maintainers have to answer the question whether a certain upstream version should be included in the distribution or not. This kind of decision is only needed for stable-release Linux distributions. In the Arch Linux project there is no need for such a decision, because the upstream dictates which version is the correct one, which is always the latest, no matter whether it's an improvement or not.

Academic journal from scratch

Creating a peer-reviewed journal from scratch is pretty easy. All that is needed is a two-branch development model which runs out of sync. In the unstable branch the authors upload their manuscripts, and in the stable branch the next release of the journal is prepared. Everything else, for example in which file format the manuscript is accepted or which persons are allowed to peer review a paper, is a minor decision. The same principle of a two-branch model works in very different situations. It can be realized for a printed journal, for a predatory journal, for a serious journal, for a nonsense journal, for an amateur journal, for a journal which is based on MS-Word, or for one which is based on LaTeX.

The social mechanic of peer reviewing is the result of a conflict between upstream and downstream branches. That means a journal which works with a single branch doesn't provide peer review, and with two branches, peer review becomes possible.

March 22, 2020

Building an academic journal with stable releases

From a technical perspective all the tools are available to create an academic journal from scratch. Webspace is available in a blog which allows uploading pdf files easily, the pdf file can be created with most document processors like Libreoffice or LaTeX, and the version history during writing can be tracked with the git tool. Suppose a single author combines these tools and creates some papers: are these papers the same as an academic journal? No, they are not; something is missing, because the readers won't trust the journal. The reader understands what traditional journals such as Elsevier and Wiley are doing, but he isn't interested in reading self-created pdf papers, especially not if the content is provided for free.

It's possible to formalize the missing part better. It's called an Open Access downstream. The term downstream was coined in the domain of Linux distributions. For example, the Debian distribution is the downstream, while the original source code of the individual software projects is the upstream. The workflow from the beginning, which includes the pdf file created in LaTeX, is located in the upstream. It covers what the single author has to do to create the content. The missing part, called the downstream, makes sure that the content is forwarded to the normal user. It's a layer between the upstream and the normal user.

Let us describe what Debian is doing. Technically, Debian is an additional branch in the version control system. A branch is a copy of the original content. This idea can be simplified a bit for better understanding. Suppose there are two folders on the harddrive. In folder A the incoming files from the upstream are stored, which are the pdf documents of the authors containing the papers. In folder B the stable branch is stored, which can be read by the normal reader. The question the downstream has to answer is what exactly should be copied into the stable branch.



The diagram explains the idea visually. Without the downstream branch, the reader has direct access to the upstream version of the documents. It's some kind of Arch Linux for academic publication. The authors upload the pdf files to a server, and the reader can read the information. The interesting point is that in reality such a direct connection between author and reader doesn't work. To make the information from the upstream easier to read, the users are expecting a layer in between. This is called a journal. The journal is the downstream. It does the same thing the Debian project is about. The journal forks the content from the upstream into its own branch, and for doing so, some decisions have to be made. In the given example, the decision was made to accept pdf file 1 and also pdf file 2. The second decision was which version of the manuscript was accepted. The interesting result is that for the reader it's easier to consume the downstream information than the upstream one.

It's important to know that in the journal branch no content is created; the existing content is aggregated. The role model is again the Debian ecosystem. A Debian maintainer hasn't programmed the code himself, but he talks with the upstream developer on a mailing list. If somebody likes to create an online academic journal, he needs such a workflow. It's the only option to create trust.

It's interesting to know that an academic journal doesn't need to be a printed one. In the example diagram all the information is organized online only. What is important instead is that in the version control system the upstream branch is forked into the downstream branch. The concrete decision on how to do so is taken by the journal editor. The result is twofold. First, it's easier for upstream authors to communicate with the downstream section, and secondly it's easier for the reader to communicate with the downstream section.

How to communicate between two parties?

The diagram looks a bit complicated. There are so many circles and arrows. Why don't the authors simply copy the files to a server and let the reader browse through the content? This is a good question. The good news is that it was researched in detail in the context of creating Linux distributions. It's the old question whether Arch Linux or Debian Linux is the better development model. What the picture shows is the complicated workflow of Debian. According to the Debian community, it's not enough that the normal user gets the latest software from the upstream; he needs a hand-curated distribution which is different from a testing repository. The result is that software developers and end users are separated from each other. The author of a piece of software checks in the latest changes to the upstream repository, while the user of the software only has access to the downstream version. The layer in between, called downstream, is used for communicating back and forth. That means, if the reader of a pdf paper has found a mistake, he isn't contacting the original author but opens a thread on the mailing list of the downstream community.

In the debate around Open Access this principle is sometimes called an overlay journal. An overlay journal takes existing pdf papers hosted in a repository, creates a copy of them and redistributes them to the user. Technically an overlay journal can be realized as a branch in the version control system. Let us make a practical example.

Suppose the idea is to build an academic journal on github. At first, we need two authors who have uploaded a paper to their individual git repositories. In these repositories the authors are allowed to maintain their individual version histories. That means the initial project gets updated to correct spelling mistakes.

Then an additional git repository is created which is a copy of pdf file 1 and pdf file 2. Doing so is called forking. Forking means taking a snapshot of a github folder and copying the content into a new one. Then the fork is improved a bit; for example, a cover letter is created and a foreword is written by the journal. And voilà, the new academic journal is ready and can publish its first volume.

And now comes the interesting part. Such a pipeline will produce a lot of stress. The first thing that will happen is that both upstream authors recognize that their content was forked. They will open a new ticket in the journal's repository and ask for the reason. Secondly, the first readers are not happy with the content and they will open tickets as well. That means, in the github repository of the journal a lot of traffic is created in which both sides open unsolved tickets. And this is exactly what it means for the journal to be accepted by a third party. If somebody creates an issue against a github project, he has a need to communicate with this project.

Perhaps it makes sense to simplify the creation of an academic journal to a minimum. From a bottom-up perspective an academic journal is created with the unix command:

cp -r upstream/ downstream/

This unix command copies the existing upstream/ folder into a new one. It's not a soft link or a redirect but a copy. This copy creates a new branch from scratch which can be updated separately. That means, if somebody edits the file1.txt, both folders will run out of sync. This produces a stress which is compensated by communication on the mailing list. Basically spoken, an academic journal is a fork of existing pdf files.

Can Wikipedia be forked?

The entire Wikipedia is too large to fork. The project has over 20k users, and building a second encyclopedia from scratch would take too much manpower. But if the aim is to fork only a single category, for example articles about Artificial Intelligence, a fork isn't very hard.

Suppose a single user creates 30 edits per month with a size of 1,000 bytes each, and the fork consists of 10 users who are working in parallel. That is 10 × 30 × 1,000 = 300,000 bytes per month. After 5 years (60 months) the project has generated 18 million bytes, which equals 9,000 articles with 2,000 bytes each. And after 10 years the small team of 10 users has produced the same amount of content as is available in the real Wikipedia.

A good starting point for a Wikipedia fork is to submit new articles no longer to Wikipedia itself, but only to the fork. The list of requested articles about AI is located at https://en.wikipedia.org/wiki/Wikipedia:Requested_articles/Applied_arts_and_sciences/Computer_science,_computing,_and_Internet#Artificial_Intelligence The content isn't written yet. But it can be created from scratch, and then the article gets uploaded to the fork wiki. The bottleneck for the project is motivating some users to participate. In most cases users are only interested in uploading content to Wikipedia but not to a fork, because the clone has a smaller amount of pageviews and no working copy-editing team which corrects spelling mistakes and moderates the process.

On the other hand, the content of the original Wikipedia is overestimated. The AI section contains around 50 flagship articles with 50k bytes each, and the rest has a poor quality. It's possible to build something which works better from scratch. That means, without taking the existing content as a starting point but creating everything from scratch, which will result in the lowest possible copyright conflict.

The only thing that is harder to fork is Google Scholar. Google Scholar and the underlying full-text repositories contain around 50 million academic papers. The AI section in Google Scholar has around 1 million papers written by scholarly authors. Writing this content from scratch is very complicated and would take a large amount of time and manpower. In contrast, the Wikipedia project is some kind of slideshow community. The users create overview snippets for existing academic full-text papers in the hope that this is attractive for a larger audience.

The reason why academic publishers are not motivated to engage in Wikipedia is simple: the project is trivial. Trivial means that the amount of resources required to build an encyclopedia is low. The entire Wikipedia with all its articles can be run with around 10k people. If the aim is to build only a subpart of the project about a single academic topic, for example Artificial Intelligence, the needed resources are around 10 persons who create the content from scratch. That means academic authors are able to build their own encyclopedia from scratch without copy&pasting a single sentence. They can write all the articles from scratch with fewer than 100 users in a short amount of time.

Small rant against the C language

The C language is the big elephant in the room. Everybody is writing C code but nobody is talking about it. Let us change the rules of the game and try to overcome the outdated C syntax and use a different kind of programming language for writing operating system kernels, programming embedded applications, creating graphics libraries and writing object-oriented applications.

Possible candidates for replacing C code are Java, C# and especially C++. A short look into a Linux distribution shows that the C++ language isn't used very often: https://dwheeler.com/sloc/redhat71-v1/redhat71sloc.1.00.html Only 15% was written in C++. And the suspicion is strong that it is especially program code which is not installed on real computers, for example KDE, that was written in C++, so that on running machines the ratio is even worse for C++. But why exactly was C++ never able to replace C? The first version of C++ was published in the mid 1980s. Since then there was enough time to rewrite and recompile all the code. But this project was never started. All the newly written code uses the normal C language with minor modifications, for example C99 instead of C89.

The paradoxical situation is that C++ has replaced C in one category: the number of books written about object-oriented programming in C++ is much higher than for C. The only book ever written about OOP in C was published in 1993: “Axel-Tobias Schreiner: Object-Oriented Programming With ANSI-C, 1993”. Apart from that book, there are two (not more!) Dr. Dobb's articles from the 1990s which explain how to program classes in C, and in some stackoverflow postings the topic is also discussed.

In contrast, the number of papers, journals and books which explain how to program object-oriented software in the C++ language is larger than 20k overall. Additionally, nearly every university in the world teaches how to write object-oriented software in C++. It seems that the problem has to do with the difference between written code in the wild (mostly C) and computer classes at the university which are focused on C++.

From a technical perspective it's not very complicated to create object-oriented code in C. All that is needed are some function pointers, some structs and a bit of discipline from the programmer. A look into existing software projects at github will show that most C programmers are experts in object-oriented code. They manage complexity in the written code by combining structs with functions in the same module. And they have no need for other programming languages like C++, Java or C#.

The only user group using dedicated OOP languages like Python or Java are newbies who are not familiar with computer programming. They read all the C++ books in the hope of learning how to write object-oriented code. This paradoxical situation could be overcome easily. What is needed are books with the title “OOP in C”, similar to the mentioned book from A. T. Schreiner but published in the year 2020. Such books aren't available yet. And exactly for this reason, the newbies won't learn C at all. The typical newbie has understood that object-oriented programming is here to stay, because it simplifies the programming of GUI applications and games very well. And because the newbie has never programmed in any language at all, he decides for a typical OOP language in the hope that this is the future. For example, he learns Java or C++.

The consequence is that the newbie will waste his time, because C++ will never replace C code. The C language is way too powerful and provides enough object-oriented features that expert programmers won't switch to a different kind of language in the future. Basically spoken, the existing software projects written in C, for example the Linux kernel, the Windows kernel or a larger game, can be translated with a UML generator into a nice-looking object-oriented diagram. That means there are objects (aka structs) and functions which have access to these objects. There is no point in reprogramming the Linux kernel in C++ because it is using object orientation already.

The biggest strength of Python is its slowness

If a newbie tries out the Python interpreter for the first time, he will notice that the code runs horribly slow. Compared to the compiled C language, a Python program is around 20x slower, which makes the language unusable for practical applications. And exactly for this reason Python is a great language, because it draws a line between teaching and production scenarios.

From a technical point of view, it's not very hard to make Python faster. One option is to optimize the Python interpreter or develop a just-in-time compiler. The resulting language would have much in common with node.js, Java and C++. It would become a language which is used for teaching programming and for programming real systems at the same time.

The good news is that this is not the goal of Python. It's a teaching language. It allows one to learn programming and create prototypes, but the Python ecosystem prevents Python code from being executed in real operating systems. Let us compare Python with other object-oriented languages:

Java, C++, C#, node.js and Ruby have in common that they are used for teaching programming to newbies. Java, for example, is widely used in an academic context. It explains very well what object-oriented programming is. The fast execution speed is the main difference between Java and Python. A fast execution speed implies that the language can be used outside a learning environment as an alternative to C.

Is Java able to replace C programs? No, it isn't. C is the number one language in the wild. It's used for creating operating systems, libraries, AAA games and object-oriented desktop applications. The only problem with C is that it's not used for teaching programming, because it has no explicit classes. And exactly this gap was filled by Python. Python is the missing part to train the newbies. If somebody has understood how to write Python programs, he can try to use C structs and C pointers to do the same for writing production-ready code.

Python -> C -> Forth

Python is the number one language for creating prototypes and learning to program. The entry barrier for creating Python scripts is very low. Even a non-programmer can create a hello world application within minutes. The C language is the number one language for creating software in the wild. Most (>80%) software projects in reality are realized in C, and it's superior to C++, Java and C#. C is the dominant language for the x86 PC architecture, and any sort of application can be created with it. The Forth language is a special case; it's a language for programmers who are already familiar with C and who are searching for a faster alternative. The main difference is that Forth will run on non-x86 systems which can be designed in FPGAs from scratch. Rewriting existing C code in Forth is a good starting point to get familiar with stack-based computing.

Educational programming languages

Recent object-oriented languages like Java and C# are taught very often in computer courses as examples of object-oriented programming. The audience are newbies and non-programmers who are interested in learning the language from scratch. Python can also be taught in such courses. The main difference between Python and Java is that Python programmers are aware that their language can't be used for practical applications. If they write a small prime number generator with a for loop, they will recognize very soon that the language is way too slow for practical applications. Python is an educational-only language. That means, if somebody likes to program software in the wild he won't use Python.

In contrast, the educational situation for Java is different. Java is used in introductory courses, and the same Java language has become popular for writing real applications. Similar to C++, Java is used in an academic context and for practical applications at the same time. The problem is that programming experts have been using C for 30 years and they are not planning to rewrite their code in any other language. That means all the newly written Java, C++, Python and Ruby libraries are useless. Real operating systems are equipped with normal C libraries which provide the maximum performance and are maintained by experts, and any other language is criticized as a toy language. In the case of Python, the Python community won't argue against it. They know that Python can't replace a C library.

The situation in the programming world is that there is the expert language C on the one hand, which is used for creating important software, productive software and large-scale projects, and all the other languages, which were developed for niche problems, for academic purposes or as alternatives to C. A relatively new understanding of computer programming is that the C language is especially recommended for object-oriented programming. This is a bit surprising, because C++, Java and C# were developed as dedicated OOP languages, but they have failed to replace C in this domain.

What the alternative languages to C have in common is that they are widely used in educational settings. Many books were written about them and they are used in computer courses at the university. In contrast, the C language is hardly taught anywhere and modern literature isn't available. The assumption of the newbies is that the C language is outdated and has been replaced by Java, C++ and other languages. This thesis isn't backed up by reality. If software projects become larger and are realized with modern OOP techniques, they are in all cases C-only programming projects. This is not wishful thinking but can be determined by taking a look into the source code of the software.

Why is C so popular? The reason is that software engineering can't be separated from low-level programming. If somebody likes to write a high-level application, he will need an operating system and existing libraries for doing so. To get access to the existing source code, an API is needed, and every API works with pointers. Even higher-level languages like C++ and Java use pointers all the time, and before the newbie is able to program in Java he has to know what pointers are. That means it's not possible to ignore the topic at all.

And if C supports pointers, structs and modules out of the box, the programmer has no need to use a different language than the existing one. That means especially newly written code is created in C. The prediction is that this will be the same 10 years from now, unless somebody invents a language which can replace C.

The only area in which C can be ignored is for academic purposes and for software prototyping. If the idea is to explain in general what object-oriented programming is, how an algorithm works in theory and how to create a UML diagram, the C language isn't the best choice for doing so. A Java-based UML generator is the preferred choice for teaching software engineering, while algorithms can be explained with Python very well. It makes no sense to print a screenshot of C source code in a textbook because the syntax is hard to understand. C is way too low-level and exposes too many details of the underlying CPU.

What has happened with the C language since the 1990s?

Around the year 1990 there was a major milestone in computer programming, because in that year two important programming languages were available at the same time: the well-known C language plus the newly developed C++ language. In the year 1990 both of the larger compiler suites, from Borland and Microsoft, declared C++ the new major language.

From today's perspective it's hard to explain why this event took place. Since the year 1990, the well-known C language is no longer relevant for computer education but was replaced with C++, Java and C#. The interesting fact is that around the year 1990 some tutorials were published on how to combine object-oriented programming with a C compiler. Some libraries were written to extend the C standard with classes, but the tutorials also explained how to combine normal C structs with C functions into objects.

The attempt to combine the well-known C language with object-oriented design techniques was never very popular in the literature. In reality, which means in the written code, it was very popular. Nearly all serious C libraries use object-oriented methods in combination with an ANSI C coding style. But let us listen to what the advocates of C++ and Java explain in their books. The main idea was that object-oriented programming only makes sense if an object-oriented language is used. The C language doesn't contain objects and inheritance, so it seemed logical to invent new programming languages, which were C++ around 1990 and Java in the mid 1990s.

The interesting situation is that C++ was never popular among expert programmers. Linus Torvalds doesn't like C++, and he is not the only one. C++ was only popular with newbies who don't know anything about programming. They dreamed of writing larger projects in the object-oriented C++ style. Some attempts were made, but such projects are not very popular on production machines.

The reason is simple: the C language can be used by expert programmers, similarly to the C++ language, for realizing larger projects. The programmer divides the task into modules, uses structs and pointers everywhere and doesn't miss C++ classes. The advantage of C is that the program can be compiled for embedded systems and the C compiler is easier to maintain than a C++ compiler.

The question is not why the Linux kernel, systemd and the gtk+ system were written in C; the question is why C++ is taught at the university and has become so popular at Stackoverflow. The main problem solved by Stackoverflow and academic programming courses is not writing production-ready code; the objective is to explain to newbies what programming is about. C++ and other OOP languages can be interpreted as learning languages. They were designed for educational purposes. The idea is that the student can understand more easily with C++ what a class is. The unsolved question is what the student should do with his C++ knowledge if real projects are written entirely in ANSI C and will be for the next 30 years.

Somebody may argue that knowledge about object-oriented programming can be utilized for programming in any language. But can the C++ knowledge be used to create ANSI C programs as well? No, it can't, because this is not described in the literature. I have found only a single book which explains how to program in C with OOP techniques, plus some smaller discussion threads on Stackoverflow. That means, according to the literature, a programmer has to stay either in the C++ / Java language family, or he has to program with a procedural technique in C, which makes it impossible to create larger projects.

But if no book is available on how to use OOP knowledge for writing C code, the normal student isn't able to do so in reality. That means C++ knowledge isn't taught so that it can be used for writing better C programs; in that respect the literature doesn't make sense at all.

It's some kind of paradoxical situation that C programmers who are familiar with object-oriented programming never write a book about the topic, while teachers who are familiar with OOP write books about Java, but not about the C language. The result is a gap between theoretical education and practical software development. C programmers and C++ programmers don't talk to each other.

The exact year can be traced back in time. It was the year 1990. At that time only the C language was available, and C++ was not used in mainstream computing. Since 1990 it was discussed in the literature how to create larger software projects with the help of object-oriented programming. New languages like Borland Delphi were developed for this purpose. And even today, the question of which programming language is the best is still open.

The rosettacode website collects programming problems. There is a section in which the different programming languages should create a class. The interesting point is that even non-OOP languages like C and Forth are asked to do so: https://rosettacode.org/wiki/Classes#C The example code for C shows very well how expert C programmers create a class in their language. At first a struct is created. Instead of putting the struct on the stack, a pointer to the struct is created and the malloc command is used. Then a constructor and a destructor function allow one to create and destroy the class. What is also interesting is the naming convention. The functions have the class name at the beginning, followed by the method name: MyClass_delete.
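A minimal sketch of this pattern could look as follows; the struct and function names are illustrative and not copied from the rosettacode page:

#include <stdio.h>
#include <stdlib.h>

typedef struct {
  int x;
  int y;
} MyClass;

/* constructor: allocate the object on the heap and initialize it */
MyClass* MyClass_new(int x, int y) {
  MyClass* obj = malloc(sizeof(MyClass));
  obj->x = x;
  obj->y = y;
  return obj;
}

/* destructor: release the memory again */
void MyClass_delete(MyClass* obj) {
  free(obj);
}

int main(void) {
  MyClass* obj = MyClass_new(10, 5);
  printf("%d %d\n", obj->x, obj->y);
  MyClass_delete(obj);
  return 0;
}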

Creating single inheritance with the C syntax is possible but a bit more complicated.
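One common way of doing it, sketched here with invented names, is to embed the base struct as the first member of the derived struct, so that a pointer to the derived type can also be used as a pointer to the base type:

#include <stdio.h>

typedef struct {
  int x;
  int y;
} Shape;

typedef struct {
  Shape base;   /* the base "class" must be the first member */
  int radius;
} Circle;

void Shape_show(Shape* s) {
  printf("position %d %d\n", s->x, s->y);
}

int main(void) {
  Circle c = {{10, 5}, 3};
  Shape_show((Shape*)&c);   /* the upcast works because base comes first */
  printf("radius %d\n", c.radius);
  return 0;
}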

The rise of the C++ language

The C++ language is a widely discussed subject among computer programmers. Thousands of books and lots of stackoverflow entries have been written about the language. At the same time, C++ is quite difficult to learn for newbies, and they prefer more cleanly designed languages like C# or Java. The reason why C++ is perceived as complicated has to do with a certain sort of programming tutorial. In the case of C/C++ there is a gap between what programmers in reality are doing and how their source code is documented. To understand the gap we must go back to the early 1990s.

In the early 1990s there was a big transition from older C compilers to modern C++ compilers. Or, to be more specific, the books and published literature about C++ increased while the reports about C programming declined. Let us analyze the typical C programming book. What is written in the book are the C language standards, and what is missing is a tutorial on how to use the C language for creating object-oriented code. The interesting thing is that programmers in reality know very well how to create OOP with C. I have searched a bit in the source code at github. Most of the C projects, especially games and C libraries, use a certain C programming style which was described briefly at https://softwareengineering.stackexchange.com/questions/308640/is-it-bad-to-write-object-oriented-c

The idea is to put a struct plus its functions into the same file and call them from the outside, very similar to creating classes in C++. This programming style is seldom described in the manuals because it combines the classical C language with modern OOP design. But it's used in real github projects, and perhaps most commercial C projects work with the same style. That means in reality all the programmers are familiar with creating OOP software in C, but they haven't documented it in books.

What is described in the books is object-oriented programming with C++. This is described everywhere. The paradoxical situation is that in reality nobody is using C++, especially not for serious projects. The amount of videogame code written in academic C++ is low, and the same is true for operating system libraries or serious applications created by experts. And exactly this mismatch explains why newbies struggle with learning C++ at all. They read the C++ tutorials in the hope of learning how to program modern C++. But they can't use this knowledge in real projects, because they are using the wrong programming language.

The reason why C++ was invented was to replace the C language. In reality C++ has failed to do so. If the normal C language provides enough features to program semi-object-oriented code with the help of structs, modules and pointers, why should somebody switch to C++? Right, and because of this question around 65% of the code in the Debian Linux distribution was written in C, and the prediction is that the ratio will stay constant for the next 20 years. Basically spoken, if a newbie wants to learn a modern object-oriented language which is used in reality, he should learn C and search for tutorials which explain how to combine C with an object-oriented programming style.

The gap between C programs in reality and manuals about how to create C programs is obvious. The average C programmer knows very well what object-oriented programming is. The C source code is structured like a C++ program which includes classes. So the assumption is that the programmer has a deep knowledge of object-oriented design. At the same time this knowledge isn't made explicit in the literature. The number of tutorials which explain how to program OOP in C is very small. It's some kind of implicit knowledge how to combine the C language with object-oriented features.

It's easy to predict what the future will bring. Instead of inventing better C++-like programming languages, in the future better tutorials will be written on how to use the well-working C language for creating object-oriented code. This will allow newbies to reproduce the existing C codebase and copy the programming style of existing C programmers.

Game engines

According to an often repeated story, the C++ language has become the standard language for game development. All the major game engines are written in C++. At least, this is what is told to the newbies. But let us take a deeper look into the problem. First, the source code of the proprietary game engines is not available. In theory it could be written in plain C while the public is told the opposite. Some of the game engines have published their source code, and indeed it's written in C++ syntax. A normal gcc compiler can't convert the code into a binary file. But is the code really written in C++?

All the so-called C++ source code consists of cpp files plus header files. It's interesting that all the code is full of pointers to structs. The reason is that pointers combined with header files are the only option to realize the object-oriented paradigm in C++. Let us make a small thought experiment. What will happen if an expert programmer takes the existing C++ code and rewrites it in plain C? That means he replaces the class keyword with the struct keyword and adjusts some minor formatting issues. The resulting plain C source code will look nearly the same as the original C++ code.
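As a hedged sketch of the thought experiment (the file and function names here are invented for illustration, not taken from any real engine), such a rewritten header could look like this, and it compiles as plain C:

/* point.h */
#ifndef POINT_H
#define POINT_H

typedef struct {   /* "class" becomes "struct", the data layout stays the same */
  int x;
  int y;
} Point;

/* the "methods" stay free functions which take a pointer to the struct */
void Point_set(Point* p, int x, int y);
void Point_show(const Point* p);

#endif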

This thought experiment shows that the so-called C++ source code is in reality normal C code. It consists of header files, uses pointers in every function and could easily be handled by a normal C compiler. So why is it called a C++ project? The funny thing is that the programmers don't call it a C++ project. They know that the difference between C and C++ is small. They call it a C/C++ program, and they use the C++ syntax without a real purpose.

March 11, 2020

C++ is obsolete

The major problem with the C++ language is the complicated pointer syntax. In contrast to modern OOP languages like Java and C#, C++ requires pointers in many situations. A look into existing larger C++ programs shows that pointers are used together with classes very often. Not because the programmer doesn't know how to do it better, but because he is using the fastest programming technique available. That means it's not possible to program in C++ in a fundamentally different way. Even the latest iteration, C++20, requires that the programmer prefers “point->x” over “point.x”.

The good news is that the pointers in C++ make more sense if the same technique is realized in plain C. A hello world program which uses OOP plus pointers in the C language is given next:

// main.c
//------------------------
// compile with: gcc main.c
#include <stdio.h>
#include "point.c"   // include the module directly, so only main.c gets compiled
int main()
{
  printf("main\n");
  point_run();
  return 0;
}

// point.c
//------------------------
typedef struct {
  int x;
  int y;
} Point;
void set(Point* p, int x, int y) {
  printf("set\n");
  p->x=x;
  p->y=y;
}
void show(Point* p) {
  printf("show %d %d\n",p->x,p->y);
}
void point_run() {
  Point p;            // the "object" lives on the stack
  set(&p,10,5);       // the methods receive a pointer to the struct
  show(&p);
}

It looks very similar to a C++ program, except that no dedicated class statement was needed. At the same time, the programmer stores the source code in different files and splits the overall project into smaller modules, which makes it easy to maintain the code.

A look into existing C repositories at github will show that this sort of style is used by most programmers. Sometimes not in this clean, direct form, because the files are longer and more than a single struct is defined in them. But if the programmer likes, he can program in C in a way similar to Python.

The assumption is that with the C language everything is fine; it's not possible to replace the given source code with something which has more performance or can be written more elegantly. Even if somebody rewrites the code in C# or Python, it will look the same. The core idea of Python is:

- put every class in a new file which is less than 100 lines of code

- aggregate classes into more complex classes

- comment the code and create documentation for the API

This is, in short, the best-practice method to create modern object-oriented code. The paradox is that the old-school C language supports this idea very well. The syntax is a bit different from normal object-oriented languages, because the functions are available globally and a method needs a pointer to a struct before the member variables can be edited. But this is only a syntax decision.

What I want to explain is that modern languages like C++, C# and Java have struggled to replace C code with something which can be maintained better. That means all the projects from the past which were created in plain C but not in C++ are future-ready. Nobody will rewrite C code in something which is easier to maintain, because C is the king already.

One important reason to prefer C over C++ is that the pointers in C make sense. In the source code, the pointer is the natural way to hand the struct over to a function. The result is that all the C routines look the same. It's not possible to write the code in a very different way. This is important for newbies, who need clear advice on how to create a program.

Sourcecode browser



In the screenshot the geany IDE is shown together with the source code. The pane on the left is very interesting. Geany has parsed all the structs and functions from the file. If a single file is smaller than 100 lines of code, and if geany shows clearly which datatypes and functions are defined in the file, the programmer has everything he needs. It's the same sort of overview provided for object-oriented languages like Python or Java.

What exactly was the problem with C, why is it not used anymore? It is used; many thousands of games at github are using the language. The only thing that is wrong are the programming books about C++, Java and C#. They explain to the user that C is outdated, and the user believes it.

Object oriented programming in C

Instead of using a dedicated OOP language like Python or Java, it's possible to create an object-oriented project even in C. Some features are missing, but in general it's possible to do so. The only difference is that a function call doesn't need the full path to the module; it can be called directly.

// file: main.c
// -------------------------------
// compile with: gcc main.c
#include <stdio.h>
#include "point.c"   // include the module directly, so only main.c gets compiled

int main()
{
  printf("main\n");
  point_show();
  return 0;
}


// file: point.c
// -------------------------------
typedef struct {
  int x;
  int y;
} Point;

void set(Point* p, int x, int y) {
  printf("set\n");
  p->x=x;
  p->y=y;
}
void show(Point* p) {
  printf("show %d %d\n",p->x,p->y);
}
void point_show() {
  Point p;            // the "object" lives on the stack
  set(&p,10,5);       // the methods receive a pointer to the struct
  show(&p);
}

The interesting question is whether such a programming style works for practical applications. In the github archive there are many examples of games written not in C++ but in normal C. Many of them use this style. The subparts of the game are stored in dedicated files which contain a struct definition for the data and a list of functions for the code. What these modules do is call their own functions, similar to the concept of classes in other programming languages.

Understanding the inner workings is a bit harder than mastering Python OOP, because in a C project pointers are needed for everything. But suppose the idea is not to use Python, C++, C# or Java; then this kind of programming style is here to stay. It allows one to write semi-object-oriented software which scales very well. It's not a coincidence that most of the real programs which are available out of the box in Unix and Linux operating systems were written in the C language and not in Java or C++. The reason is that the advantage of dedicated OOP languages over C is small.

Let us take a closer look into the source code. In the main file, only a simple call to a module is made. All the details of the Point module are hidden in the external file. In the point.c file, the data structure is stored together with the functions in the same file. If the programmer takes care that the maximum length of the file remains under 100 lines of code, it's not very complicated to maintain and bugfix the code. If the C code is rewritten in Python or C++, it will look nearly the same. That means the overall project is divided into classes which are responsible for subparts of the project.

The assumption is that writing larger projects in C can be done with the same productivity as in Java or Python. That means the C programmer won't miss OOP features, because most of them can be replicated with the C language. This makes it hard to convince a C programmer to convert the existing code into a different language. Basically spoken, the C language has a bright future and will be used very often.

Programming language statistics

The well-known Tiobe index doesn't reflect programming languages in reality. There is a gap: in computer education Java and C++ are very important, but in reality nobody is using these languages. A more realistic picture comes from counting the lines of code: https://www.openhub.net/languages?query=&sort=code

According to the Openhub directory the most used languages in the wild are:

1. C: 9.4 billion lines of code
2. Javascript: 4.0 billion
3. XML: 2.9 billion
4. C++: 2.6 billion
5. Java: 2.4 billion
6. HTML: 1.4 billion
7. PHP: 0.9 billion
8. Go: 0.6 billion
9. Python: 0.7 billion
10. CSS: 0.5 billion

In another, older statistic the C language also outperforms C++ easily in the amount of written code lines. The source code of the Debian operating system was measured in the year 2005; it contains 105 million lines of code overall. 65% of them were written in C and only 12% were written in C++.[1]

There are some points against the C language. Most AAA videogames are not written in pure C; the normal C++ language is used. And many developers explain that C is dead and that they prefer C++ because videogames need object-oriented features. But a closer look into the C++ source code will show that nearly all the game engines, and the games written on top of the engines, use pointers in the C++ classes. It's not possible to avoid pointers in C++ because this ensures the maximum performance. What the programmers write in the code is not idiomatic C++; they are programming in C with pointers and using only the class statement and sometimes the templates of the C++ compiler.

Let us make a thought experiment. What will happen, if a larger computergame is reprogrammed in C? THe sourcecode will look nearly the same. That means, lots of pointers are needed to draw the sprites and call foreign modules. The difference between C a program which is using static functions to get access to structs and a C++ program which is using dedicated classes is low.

[1] Amor, Juan José, Gregorio Robles, and Jesús González-Barahona. "Measuring Woody: The size of Debian 3.0." arXiv preprint cs/0506067 (2005).

Understanding pointers and references in C++

The C++ programming language is hard to understand. In contrast to modern OOP languages like Java, C++ knows many different kinds of variables: normal variables, pointers, references and many more. Even the official manual isn't able to explain the details. The good news is that a simple look into a C manual helps a lot to get the details.

But let us go a step backward: there are two different programming languages available, C and C++. Most software in a Linux distribution wasn't created in C++ but in the normal C language. C++ is mostly taught in books but seldom used in reality. The obvious difference has to do with object oriented programming. A while ago a Stack Overflow user asked how to realize classes in C:

https://stackoverflow.com/questions/1403890/how-do-you-implement-a-class-in-c

The answer was that function pointers need to be created inside a struct. And yes, this explanation makes sense. It also explains why C++ is hard to grasp: C++ is doing the same thing but struggles to explain the reason why. Suppose somebody likes to program a state of the art application for the console or for the GUI, then the plain C language is the optimal choice. The funny thing is that even though the programmer has to declare function pointers, the source code is easy to read. If the plan is not to use pointers at all, the Python language is a good choice for software prototyping. It provides classes without pointers and is documented very well.

Instead of arguing against C++, we should ask why the language isn't used more frequently in reality.

But let us go back to the Stack Overflow post with the function pointers. A struct is a standard datatype in the C language. It allows combining different variables into a new one. Extending a struct with functions is the logical next step towards advanced software. The working hypothesis is that this kind of OOP technique is not an example of bad programming style, but the recommended way of programming modern software. The next interesting aspect is that 95% of C++ programs follow the same pattern. That means, pointers are used everywhere. The difference is that a C program which consists of structs and function pointers makes sense, while the equivalent C++ code doesn't make sense to the newbie.

In the basic version, no function pointers are used; instead the struct is passed as a pointer to normal functions:

#include <stdio.h>

/* data of the Point module */
typedef struct {
  int x;
  int y;
} Point;

/* the functions of the module receive a pointer to the struct instance */
void set(Point* p, int x, int y) {
  printf("set\n");
  p->x = x;
  p->y = y;
}

void show(Point* p) {
  printf("show %d %d\n", p->x, p->y);
}

int main(void)
{
  Point p;
  set(&p, 10, 5);
  show(&p);
  return 0;
}

In another Stack Overflow post it was explained how to extend the struct with a function pointer, https://stackoverflow.com/questions/17052443/c-function-inside-struct But the answer says that this is seldom used in reality. That means, the standard way of emulating OOP features in C is to define the function outside the struct but in the same .c file, and to call the function with a pointer to the struct instance.

March 09, 2020

How Linux will take over the business world

Microsoft is in the comfortable position that 99% of all business oriented PCs are running its operating system. From a technical point of view, these machines can be replaced with Linux software. And here is the road ahead. The first thing to do is to replace existing Microsoft SQL Servers with Linux systems. For applications like webserver, fileserver, printserver and user directory the Open Source systems work stable. The next step is to convert the former desktop PC clients into Linux stations. This can be realized with a terminal server. The idea of a terminal server is to put the whole database application onto the server, which includes the SQL database itself, the middleware and also the frontend.

The Citrix terminal server was mentioned already. The idea is that the multi-user database is programmed and deployed on the server and the normal users connect to the Citrix server.

In a Linux environment a state of the art terminal server is the Gnome Boxes software on the client. This software allows the user to connect to a remote desktop with the SPICE protocol. On the server side the qemu/kvm software is installed which runs a desktop application. In the screenshot the Debian operating system was installed in a virtual machine and the LibreOffice Calc program was started to enter a table. The user has to connect with the Gnome Boxes software to the remote qemu system.



This allows converting an existing local table into a multi-user, network ready table which can be used by many users at the same time. The idea is that apart from the Gnome Boxes software, the user has no additional programs installed. He can use a normal Windows 10 PC while the LibreOffice suite is installed at the server in the virtual machine. Apart from LibreOffice Calc any sort of database frontend can run on the server, for example a Java program or a Python tkinter application.

Creating a middleware API with Python

The advantage of Python is that it can be used for many things. It's used for creating GUI prototypes, testing out new algorithms, as a replacement for bash scripts or to program games. Python can even be used for creating the business logic in a database application, and this is described in the following blogpost.

The business logic is sometimes called a middleware API because it connects the frontend with the SQL backend. It can be visualized with a class diagram in the UML notation. The classes are realized with SQL tables which are connected in an entity relationship diagram.

A UML class diagram explains very well what middleware is about. It provides a high level API to the outside world and it realizes the technical details with methods and an underlying database. The good news is that Python has built in object oriented features. It can be used for creating classes and then the classes are filled with data.

An example is available online under the term “Northwind database”. Northwind is the example database introduced with the MS-Access software which contains the tables for a fictional company. From the perspective of a relational database the ER-diagram is important, but from the perspective of a middleware API the items are equal to classes.

Explaining what a middleware is can be done by giving the details of how a backend and how a frontend works. A backend is equal to a SQL database, for example the sqlite software. SQLite communicates with the outside world through SQL statements. On the other hand, a frontend for a database consists of a form generator which is able to draw windows on the screen. The user is allowed to press buttons and gets the information he needs. Between the frontend and the backend there is a gap. That means, it's not possible to convert the output of the sqlite software directly into graphical information. This in-between layer is the UML class diagram which describes the business logic from an abstract perspective.

According to Stack Overflow, the simplest form of storing a Python object in a file is the pickle module, https://stackoverflow.com/questions/4529815/saving-an-object-data-persistence But the pickle module will create a binary file which can't be used in external applications. The more elaborate way of storing Python objects in a file is to convert them into the JSON format:

import json

class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age
  def getjson(self):
    # serialize the object attributes into a JSON string
    return json.dumps({
      'name': self.name,
      'age': self.age,
    })

p = Person("peter", 30)
print(p.getjson())  # {"name": "peter", "age": 30}
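
The example above only produces the JSON string; to actually store the object in a file and load it back later, the string can be written to disk. Below is a minimal sketch which continues the Person example; the file name person.json is only a placeholder:

# store the Person object as a JSON file (continuing the example above)
with open("person.json", "w") as f:
  f.write(p.getjson())

# read it back into a plain dictionary
with open("person.json") as f:
  data = json.load(f)
print(data["name"], data["age"])  # peter 30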

March 07, 2020

Small introduction into the Kexi software

The number of tutorials about the Kexi project is very low. The official forum has around 200 postings, and even YouTube provides fewer than 5 videos about the topic. At the same time, there is a need for an Open Source desktop database, and Kexi is one of the most promising examples in that direction. The sad news for the beginning is that the project is in an early stage. It can be called an alpha version which doesn't provide much. Even MS-Access 2.0 has more features to offer, and that program was released 25 years ago.

But it makes no sense to be too critical, because all the advanced existing database applications are closed source projects, and so called web databases on top of PHP and Ajax are very complicated to use and can't replace a desktop database. So I'd like to introduce Kexi to newbies with some screenshots and explain what the developers have programmed so far.

The program is available in Fedora and Debian as well. According to the official description it's an “integrated database environment for the Calligra Suite” and a “visual database applications builder”.[1] The program has only a small size: the Kexi source code consists of 150k lines of code, which is around 4 MB for the program.



The main menu looks exactly like one would expect from a small project. The user is asked to create tables, forms and reports. The menu bar on top of the screen has no further purpose; all the features have to do with tables, forms and reports. At first the user can create a new table. I have done so, and the table stores the first name and the last name of fictional students. Additionally a primary key was defined which is incremented in auto mode.



The next screen shows the form editor, which is a minimalist one. There are some widgets available which can be dropped with the mouse into the form. The most important one is the text field which holds the data of the underlying table. The data source can be set right in the properties menu. In contrast to the Gambas software (which is a Visual Basic clone) the connection between the database table and the form works great. The user is able to see the data in the form.



The last item in the list is the report generator. This module has fewer features than the minimalist form generator, but the user is able to create simple reports. He can drag and drop text fields, specify the underlying data source, and with the printing driver of the Linux operating system it's possible to create a PDF document.



More features are not available. The official Kexi handbook mostly explains what is missing in the project, and indeed the software is in an early stage. But compared to other open source projects, Kexi is my personal favorite. It has two main advantages. The first is that the project is going in the right direction: the project goal is to program, in C++, an integrated database software which consists of tables, forms, reports and scripting features. The second advantage is that the code written so far is stable and the user can create a small but working database.

Let us make a simple thought experiment. Suppose enough manpower is provided for this project, and the amount of code lines grows by a factor of 10 from today's 150k to 1.5 million. Additionally some example .kexi files are provided and the forum gets more traffic. The prediction is that Kexi will become one of the most interesting Open Source projects in decades. The reason is that the normal user can do a lot of things with a database RAD tool. The options are endless. If such a software is programmed as open source with open standards in mind it can change the software industry.

I would guess that the next milestone for the project is to bring Kexi to the same level as the outdated MS-Access 2.0 software. Today Kexi provides maybe around 20% of the features of Access 2.0; most of the interesting features are missing. But from a technical perspective it's possible to improve the software.

Importing CSV

Under the tab “external data” the user can import existing csv data. I have tested out the feature with an example csv file. [2] It works great. The csv file is converted into a kexi table.

disadvantages

Some major problems are visible in the software. The first one is that after clicking with the right mouse button on a table and selecting “Export table as csv”, the Kexi program crashes without warning. It's not possible to export the table at all, which makes it hard to use the data for external purposes. The second problem is that Kexi ships with some preinstalled plugins which include an sqlite driver, but it's not possible to connect to an SQLite database.

It's not clear what exactly the reason is why these important features are not working, but suppose the connection to an SQLite database works and the export filter gets improved, then the normal user would be able to use Kexi for some smaller databases. It has some advantages over LibreOffice Calc to create a dedicated database and to create queries with the SQL language. As far as I can see the source code is available, so it would be possible either to contribute to the existing Kexi project or to fork the project and start a clone.

sources

[1] Debian kexi package, https://packages.debian.org/search?keywords=kexi

[2] CSV sample data, https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html

Building a minimalist database from scratch

From a historic point of view, many database management systems were developed over the years: dBASE was used in the 1980s, MS-Access in the early 1990s, since the 2000s MS-Access has been used in combination with SQL Server, and since 2010 the situation has become very complex, because many companies are experimenting with Java, Linux, C# and PHP.

The idea is to throw away everything and reinvent a database from scratch with Open Source software. A good starting point is the sqlite software which is available out of the box in all Linux distributions. The problem is that sqlite is only a small part of an overall database management system. To make things more comfortable, a middle layer and a frontend are needed:

backend sqlite -> middleware python -> frontend python

Programming a Python frontend is not very complicated. It has to do with drawing windows on the screen with the wxWidgets library and adding some animations plus sounds to make the application more pleasant. The more complicated part is the middleware. To introduce the business logic layer we have to analyze how to interact with a normal sqlite database.

Suppose the user has started the python3 interpreter and is connected to the sqlite database. What the user can do is submit an SQL statement. He has to type the SQL request into the command line and gets the feedback from the database. A naive assumption is that the interaction can be improved with the help of a GUI frontend. But in the long run, what is needed is not a GUI frontend but a middle layer. A business layer works on the textual level and allows the user to enter high level textual commands. Instead of typing in:

SELECT * FROM Customers WHERE First_Name='John'

The user enters the command:

middleware.showcustomer(“john”)

The middleware program code converts the high level statement into low level SQL commands which are submitted to the sqlite database. The interesting point is that no GUI interface is needed and the user can interact with the database very comfortably.
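
A minimal sketch of such a middleware module, written in Python with the built-in sqlite3 package, could look like the code below. The database file name customers.db is only a placeholder; the table Customers and the column First_Name are taken from the SQL example above:

# middleware.py -- sketch of a business logic layer on top of sqlite
import sqlite3

con = sqlite3.connect("customers.db")

def showcustomer(firstname):
  # translate the high level command into a low level SQL statement
  cur = con.execute("SELECT * FROM Customers WHERE First_Name=?", (firstname,))
  for row in cur.fetchall():
    print(row)

The user or the frontend only calls middleware.showcustomer("john") and never has to see the SQL statement itself.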

REST middleware is the future of databases

The common database application in the 1990s was based on the backend/frontend philosophy. The SQL Server from Microsoft was used as the backend and the MS-Access database realized the frontend. The disadvantage is that most of the logic is located in the frontend. The answer to the problem is to introduce a new layer called middleware.

Microsoft has invented the .NET C# platform which is directed towards this goal. But the term middleware can be realized with other programming languages as well. A famous one is Javascript, which can be used for creating RESTful middleware APIs. To understand how this is realized in reality, we have to take a look at outdated MS-Access applications.

The idea is that the user presses graphical buttons and this activates tasks in the software. The better alternative is to provide a textual interface. Basically, the programmer has the obligation to convert an existing MS-Access application into a text-only interface. This text-only interface is the REST API. Programming a graphical interface on top of an existing REST API is very easy.

The REST specification has become popular in web-oriented development. But is Javascript needed for classical desktop databases? Suppose the tables are stored in a sqlite database. The SQLite database can't be called a middleware; it's equal to the backend. Also the SQL language is located on the backend. What is needed is an additional layer on top of SQLite. This middleware layer can be realized with Python. The Python software doesn't display the GUI interface for the user, but it connects an existing GUI with the SQLite database.

sqlite -> python middleware -> python GUI

In the context of REST, the term business logic layer describes the middleware. Let us investigate how REST works internally. REST means that the methods of a Javascript class are mapped to external function calls. It has much in common with a programming library API which is realized over the HTTP interface. Suppose the idea is to program a desktop database, not a web database. What is needed here is a Python library.

The Python library consists of roughly 100 lines of code, has one class and can be imported from another Python program. The Python middle layer connects two layers:

sqlite -> python middleware -> python frontend

The communication between the two Python programs is realized with normal function calls. In the Python frontend the functions of the middleware are called with:

import middleware

middleware.updatedate()
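
What the middleware module behind this import might contain is sketched below. The sqlite file database.db, the table Orders and its column Order_Date are only placeholders for illustration:

# middleware.py -- sketch of the business logic layer behind the import
import sqlite3

def updatedate():
  # hide the SQL statement from the frontend
  con = sqlite3.connect("database.db")
  con.execute("UPDATE Orders SET Order_Date = date('now')")
  con.commit()
  con.close()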

March 06, 2020

Linux can replace MS-Access

The main reason why Linux struggles on the desktop, and especially on the desktop in the business world, is that Linux developers didn't listen carefully to what business users are trying to do with a computer. The most important software used on a PC in a company isn't the LaTeX software, it's not Python and it's not the webbrowser. All of these programs run great on Linux, but the average business user has no need for such software. What the user needs instead is MS-Access, MS-Access and nothing else but MS-Access.

The first step to bring Linux onto the desktop of companies is to abstract a bit from the user needs and describe in detail what the mentioned software from Microsoft is doing and why it is so important. The main reason why computers are used in the business world is to store and retrieve table-related data, for example inventory lists, product lists, customer lists, accounting tables and so on. 90% of the IT infrastructure in the business world has to do with database management, and with GUIs for database management software. The reason why the MS-Access software has become so popular is because it consists of two things:

- a database engine known as the Jet Engine
- a Rapid Application Development software in which the normal user can create forms, reports, views and macros

The combination of both fulfills the needs of business reality very well. The next step is to ask what the Open Source replacement will look like. The good news is that the technology is available but it's not documented well enough. The database engine can be realized with sqlite, and the rapid application development software can be found under the name Gambas Basic.

Let us describe the details of an open source MS-Access clone. A single sqlite file stores the data of many database tables. The information is created and updated with SQL statements. Sqlite is a very powerful software which is used for nearly everything. The main problem is that on the command line it's hard to interact with the tool. To make life easier, a rapid application development software is needed. This is a program which allows creating multi document interface applications, that is, a single application in which different windows are open at the same time, for example three tables plus two forms. The Gambas IDE https://en.wikipedia.org/wiki/Gambas was initiated in 1999 and provides such a graphical environment. Very similar to MS-Access, FoxPro and other database frontends, the user is allowed to create forms and reports, and all the windows are connected to an underlying sqlite database. Some example projects have shown that in theory the combination of sqlite plus Gambas is able to replace MS-Access databases in a business context.

To understand why desktop databases are used everywhere in a business context, it's important to analyze the workflow of creating new software in a business context. In contrast to classical programming projects, business software is never written in C++ or in Python. It would take too long to program 20 forms in these languages. And normal text oriented programming languages are not flexible enough to change an existing layout. What business users prefer instead are non-programming options to create applications, which has to do with MS-Excel sheets and Visual Basic layout software.

These non-classical programming systems are ignored by classical programmers. They have nothing to do with programming language development; instead they are called rapid application development systems. A RAD tool is some kind of painting software for creating lots of windows, and in the second step the window buttons are connected with macros. The concept was realized first in the MS-Access software, but many other MS-Windows programs are available for this need.

From a technical point of view it's possible to realize a RAD tool on Linux with Open Source as well. It's only a question of how much manpower is provided to create such software. The Gambas tool is on the same level as early versions of MS-Access: it allows creating smaller business applications, and for most purposes this fulfills the needs of the user.

What is missing are documentation and practical examples in which Gambas is used in the real world. The amount of literature in which MS-Windows based databases were created in the past is much higher. Linux in general, and RAD tools for business applications in particular, are a very new development. In theory this will increase the market share of Linux massively. The reason is that apart from creating GUI databases, the average business user has no further needs. If a software or an operating system fulfills this single task, the user is happy. He will use the application all the time, and he will recommend it to others.

Dive into Gambas3

After installing the software package the user can create forms. The software has much in common with Visual Basic. After double-clicking on a button it's possible to enter Basic source code which opens new forms. Creating a complex GUI with different windows and tabs is pretty easy with Gambas, and some YouTube tutorials explain the details.

The bottleneck is establishing a database connection. In theory the user has to create a connection first, then a datasource widget is created in a form, and then a databrowser will display the content of the sqlite database in the form. The problem is that it doesn't work. There is an error message that the connection can't be established, and the help section has no answer to it.

It seems that Gambas is in an early development status. But the assumption is that the database connectivity will work in future versions, and apart from Gambas there are similar projects. One of them is Camelot, a RAD tool built around the Python language. The idea is similar to Gambas because the user has to create forms and puts some glue code into the application.



sqlite and RAD tools

Let us describe the picture in general. What is available on Linux is the database backend. The user can create a single file in the sqlite format which holds all the tables; sqlite meets every requirement without problems. What is missing is a RAD tool for developing forms and reports on top of an existing sqlite database. Existing projects like Gambas, OpenOffice Base and Camelot are in an early development stage. They work in theory, but it's too early to recommend them as an alternative to MS-Access.

Suppose the team behind the Gambas project improves the database connection. Then this tool will be used very often in a business context. What the user can do today is create GUI prototype applications.

From an abstract point of view, the key requirement for business applications is a RAD tool which allows creating forms and reports which are connected to an underlying sqlite database. In the context of web development the missing RAD tool isn't a problem, because with PHP it's possible to output HTML code. But if somebody likes to create a desktop application he needs such a tool. Programming all the forms in Python with the tkinter framework won't work; it will take too much time. The prediction is that in 5 years from today a fully working RAD tool will be available as Open Source which uses a sqlite database as backend. Such a tool will motivate many millions of users to switch to the Linux operating system as their main desktop environment. A desktop database is the missing piece of software to make Open Source a success.

March 05, 2020

Making Linux compatible with MS-Access

The main reason why Linux has struggled on the desktop PCs of companies is because of a single program, which is MS-Access. This single program has top priority for companies but it has a low priority in the Open Source ecosystem. A naive approach to address the issue is to recommend replacing existing MS-Access databases with modern backend databases. Under Linux operating systems this can be realized with MySQL for the server and PHP as the frontend programming language.

Unfortunately, this advice doesn't work very well in reality. Converting existing MS-Access databases into PHP scripts which are displayed in the browser is equal to starting a large scale software project. It will take millions of man-hours and will cost millions of US$. No company in the world is rich enough to do so. And what the users do instead is not use Linux at all and stay on the well known MS-Windows plus MS-Access software.

Suppose a company doesn't have existing .mdb files on the harddrive and needs the computer only for browsing the internet, writing some Python scripts and creating LibreOffice text files. Then it's pretty easy to switch from the Windows 10 operating system to a modern Debian Linux system. The problem is that in reality, the average company has a need for a desktop database.

There is an Open Source software available which can maybe address the problem. It's called mdbtools and was started 15 years ago. MDB Tools is a software for reading existing Microsoft Access databases. https://github.com/brianb/mdbtools/tree/master/doc The format it handles is outdated, because the current format is called ACCDB and mdbtools can't write that format. But the tool goes in the right direction. Suppose the software is improved a bit. The idea is not to create a database with Linux; the real need is to read and write the MDB file format. In theory, this will help to bring Linux to the desktop.
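
To illustrate how mdbtools could be used from a Linux desktop, here is a minimal Python sketch which drives the mdb-tables and mdb-export command line utilities from the mdbtools package. The file name northwind.mdb and the table name Customers are only placeholders:

# read a legacy Access file with the mdbtools command line utilities
import subprocess, csv, io

mdbfile = "northwind.mdb"  # placeholder file name

# list all tables, one per line
tables = subprocess.run(["mdb-tables", "-1", mdbfile],
                        capture_output=True, text=True).stdout.split()
print(tables)

# export one table as CSV and parse it
out = subprocess.run(["mdb-export", mdbfile, "Customers"],
                     capture_output=True, text=True).stdout
for row in csv.DictReader(io.StringIO(out)):
  print(row)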

Let us go a step back and analyze why the LibreOffice suite is a good replacement for existing Microsoft Office software: because the tool is able to open and write the same file formats. If the user has a Word document he can open the file easily with LibreOffice. For the MS-Access software such a converter isn't available. Instead the normal Linux user is asked to write to and read from different file formats like JSON, CSV files and MySQL databases. But these formats are different from the existing one. The MS-Access file format is too important for the desktop PC to ignore it.

Is Microsoft's strong position on the desktop overestimated?

Microsoft Windows is known as the leading desktop operating system. It has a market share of around 90%, and this ratio has been stable over the last 20 years. Even though the Linux software has become more powerful and the Internet is widely used, the only thing which seems certain is that Microsoft is the leading software company in the world.

Perhaps it makes sense to describe the current situation more objectively and ask what Microsoft stands for. There are some recurring elements if somebody uses Microsoft products. At first the user will buy some books about Microsoft software. He has to buy 10-20 of them, each sold for 40 US$ in the hardcover version, for example from the O'Reilly publisher.

The second step is that the user needs the software itself, which includes the Windows 10 operating system, the Office 365 suite, the Visual Studio suite, the SQL Server, a webserver and additional programs like Photoshop. The license fees for all this software cost a lot of money, more than 1000 US$ for all these programs combined.

The third step is very important. The typical Microsoft user has no experience with Linux based operating systems. He prefers Microsoft products because he doesn't like Open Source software in general and isn't motivated to install Ubuntu or Arch Linux on his PC.

And now comes the question: does this behavior make any sort of sense? It's a behavior in which a user has to buy hardcover books, has to spend money on software licenses and isn't familiar with Open Source. It only makes sense if the future works by this principle too. But that future is not available in reality. All of the behavior which results in a Microsoft centric software ecosystem is outdated.

It's true that according to the latest statistics the Microsoft desktop operating system has a market share of 90%, but the underlying behavior needed to replicate such a ratio isn't ready for the future. It's some kind of look back into the past of computing before electronic documents were invented and before Linux distributions were created.

The assumption is that the importance of Microsoft and the ecosystem around Microsoft is overestimated. That means, it has reached its peak and what we will see next is some kind of long-term decline. The best example from the real world are printed newspapers. They no longer make sense, especially not for future needs. A while ago, Microsoft increased the price for Office 365. The reason why is unclear, and especially a new generation of users is asking why the software isn't available as an app for only 2.99 US$. And they are asking what is wrong with LibreOffice if the software provides the same features without any license fee. Such a question makes sense because LibreOffice and Microsoft Office have the same power. The only reason why MS-Office is used so often is for historical reasons.

Perhaps it makes sense to go a step backward and see Microsoft as part of a more complex media evolution. What can be described in the long run is a transition from classical media like newspapers, radio, television and books towards modern media located on the internet. Similar to Microsoft, the classical media didn't disappear; they are all available for today's audience. The New York Times is sold at the kiosk, FM radio and television are transmitted over the air, and most households have a book shelf. But all these media have lost their sense making capabilities. It's not possible to explain to a newbie why these media are important anymore. There are modern media which are cheaper and more powerful. The switch from old media to modern media will take some years.

The logical consequence is to ask how many decades it will take until Microsoft has lost its position on the desktop and Linux has won the market. Will it take 10 years, 20 years or more? According to a report [1] the open source market will grow to a volume of 66 billion US$ in 2026. The attempt to predict the future sounds a bit like forecasting the development of the printed newspaper.

What is important is not to compare Linux with Microsoft on detail questions, for example to analyze whether the ext4 filesystem is 10% faster than the NTFS filesystem. Instead the problem should be analyzed from the perspective of media studies. That means comparing the ecosystem around the Windows 10 operating system with the ecosystem which depends on the Linux kernel.

[1] https://www.verifiedmarketresearch.com/product/open-source-services-market/

March 03, 2020

The Windows operating system in the context of media history

The only group who discusses the pros and cons of different operating systems are Linux users. They are interested in examining the difference between the ext4 and the NTFS filesystem and they ask why Linux on the desktop never was a success. The average Windows user doesn't have such problems. He uses a computer not with a technical background; he is interested in playing back multimedia software. The operating system needed to run edutainment software is never selected by the user but by the producer of the CD-ROM.

The trick is to see operating systems not from a computing perspective but as media history. The standard user enters a store and buys a game, for example a vocabulary trainer to improve his English skills. On the CD-ROM box it is written that this game needs the Windows 8 operating system. The user has this operating system, so the product is compatible. He buys the product and uses it at home.

The reason why millions of normal users have decided for the Windows operating system but not for Linux is because they are attending a discourse in which Windows based edutainment software plays an important role. They are consuming a certain sort of educational information. In contrast, Linux users are attending a different sort of game which works with a different bias.

The Linux operating system can be interpreted as a media artifact as well. A media technology is some kind of dialogue about something. In the case of Linux the dialogue is driven from a producer perspective. The idea is to create computer programs and content under a GPL license, and the Linux community discusses the details of this plan. The media dialogue around Windows multimedia CD-ROMs is based on a different goal. Here the idea is that the user gets amused and can relax by buying a CD-ROM. Windows software is seen as the logical step after the invention of the music CD and home entertainment products.

media history

The advantage of the term “media history” is that it describes a large range of different communication technologies from the past. Well known mass media are journals and books, but board games, video, television and radio can be counted as media history too.

The perhaps surprising fact is that software can be described as media history as well. An intro for the Commodore 64 is not only pure computer code written in assembly language; it has a social context. The intro is produced and consumed with a certain purpose. The same is obvious for classical media artifacts. A printed newspaper doesn't consist only of the printed paper itself; a newspaper has a producer and a consumer who repeat a certain behavior during the interaction with each other.

One interesting point in comparing operating systems like Windows and Linux against each other is that in most cases these technologies are not interpreted as media technology but are described from their own understanding. That means, Linux is explained as software which was written in C, and Windows is also a software product written in C. This point of view ignores the context of why Linux or Windows is used by a large amount of users.

Instead of focussing on the Windows operating system itself, which is distributed under the brand name Windows 10 and is stored on a DVD, the more elaborate perspective is to focus on the media products which are based on this operating system. This is equal to the software sold in stores and consumed by a larger audience. Windows based multimedia software was the first computer based media artifact. It became popular in the early 1990s. Describing the transition towards Windows based multimedia CD-ROMs helps to understand what the Windows operating system is about.

Before the advent of the Microsoft dominated PC, media history consisted of music CDs, VHS video cassettes, printed newspapers and colored books. VHS video and the music CD were electronic media formats because they require electric current, while a printed newspaper doesn't need electricity. With the advent of the IBM PC plus the Microsoft operating system in the 1990s a new sort of media became visible. It was called software and was running on computer systems. In the beginning software was distributed on 3.5 inch floppy discs; later the CD-ROM was invented. Around this computer technology a new sort of media publishing companies was built from scratch. Their business model was to use the computer for distributing media to the consumer.

A software publishing house which is producing a CD-ROM for the Windows 98 operating system can't be called a computer company; it's first and foremost a media company. Instead of using printed paper, they are using the Windows 98 system calls to create an electronic book. The advantage of the Windows ecosystem is that it works with a classic understanding of media. First, the Windows PC is standardized. Secondly, the content is produced by media publishing houses.

It makes sense to see the Linux ecosystem as an anti pattern to Windows based multimedia publishing. Linux is based on two ideas: first, the idea is to boycott the existing Windows programming standard and provide a Unix API instead. Secondly, the idea is that anybody can become a producer and media is generated without a copyright.

The reason why Linux based operating systems never became popular on the desktop is because this understanding of media is way too advanced. From the self-understanding of the Open Source community they are trying to overcome the existing Windows driven media production cycle. And this kind of goal is a very demanding one. It's too much for most of the consumers who are focused on classical media production.

Let us describe the pipeline of classical Windows based multimedia CD-ROM production. The idea in the 1990s was that a commercial company, similar to a newspaper company, produces a CD-ROM, and this product is sold in the computer store to thousands of consumers for a fixed price. Basically, a multimedia CD-ROM is an electronic book. Which means, it's copyright protected and was produced by experts in this field who are doing it in a commercial context.

A Linux user would argue that this kind of media artifact is obsolete. It has a lot of problems, and Linux is the answer to overcome this outdated understanding of a multimedia CD-ROM. The question is: what is the alternative to a Windows based edutainment CD-ROM? The alternative is called the Internet. A website is some kind of CD-ROM which is shown on the desktop PC. It's no wonder that Internet centric computer systems like Android smartphones and the Google Chromebook have become very popular. It's not enough to describe only the underlying operating system, which is the Linux kernel; the more interesting description is based on the ecosystem around this technology. A Google Chromebook is used to display websites on the screen which are created with Javascript, HTML5, the PDF format and with H.264 videos. The normal user isn't focussed on the Linux kernel, he is interested in these websites. The interesting point is that a website which is displayed on a Google Chromebook is programmed differently from a classical Windows based multimedia CD-ROM.

Similar to Windows based software, a website has to be programmed, but the programming languages and the standards are different. What is different too is the pricing model. Most websites are available without any costs but are sponsored with advertisement. From a media history perspective, the Internet is the logical next step after Windows based CD-ROMs. The prediction is that the Windows operating system and the ecosystem around it will disappear while Internet based technology will become more important.

Let us listen to Linux advocates and ask why they reject the Windows operating system at all. The main reason is that they don't follow the price model of commercial CD-ROM multimedia products. The reason is located in the information explosion. In the early 1990s the amount of media products was small. In the local computer store the total number of CD-ROMs was smaller than 100. How many websites exist in the year 2020? Right, millions of them. It's not possible to burn all these websites on CD-ROM and distribute them for 5 US$ each to the consumer. Nobody has the storage space to collect hundreds of CD-ROMs. That means, Microsoft based content production isn't compatible with the information explosion. And the consequence is that it will fail. Or it has failed already, but similar to the printed newspaper it's sold as long as possible to the consumer.

Multimedia

The term multimedia was introduced in the mid 1990s and describes a desktop computer which is equipped with a color monitor, graphics card, CD-ROM drive and sound card. It's interesting to know that the pipeline for producing content was focussed on the Microsoft operating system. That means, 99.9% of all multimedia CD-ROMs since the 1990s are Microsoft Windows compatible. That means, the Microsoft multimedia PC was a playback platform for commercial media producers.

With the advent of the Internet the situation has changed drastically. The webbrowser has replaced the former Microsoft PC, and every device which can display a website is a multimedia PC. The result is that Android smartphones, Mac OS X PCs and Linux PCs can be seen as multimedia PCs as well. From the consumer perspective there is no need to ask if the consumer has a Microsoft PC; the smallest common standard is that he has access to a webbrowser.

What we see today is some kind of transition from old digital media to internet media. Windows multimedia CD-ROMs are still sold, at the same time the consumer is paying for commercial internet based content and he has access to free content on the internet as well. The most important breakthrough after the Microsoft Windows operating system was the Google search engine. It has become a standard used all over the world, no matter which sort of computer system people own. The interesting feature of Google is that it's not a certain programming language or a software framework; it's some kind of online library which collects information.