May 15, 2018

Publishing academic content as HTML or PDF?


It seems that both options are possible. Some papers can be downloaded as PDF, others are formatted as HTML. Sometimes we see Wiki syntax too. But which of the formats is the right one?
The first idea might be that HTML is superior to PDF, because PDF is a book format and can't be modified. That is the reason why Wikipedia uses Wiki syntax rather than PDF. The question is: does this make sense for research papers too? Let us suppose that HTML advocates try to replace the PDF format and push authors to upload HTML content that can be displayed in any browser. The result is equivalent to a scientific blog that can be edited later and annotated with comments. There would be no need to publish papers anymore, because we have a blog.
But are academic blogs able to replace academic papers? The problem with blogs is that they were not designed with archiving in mind. Instead, the idea is to write something down and link to it, and after some months the content gets lost because the website is no longer available. It is very complicated to export a complete blog into a different format. Even the author of a blog has problems doing so, and archiving someone else's blog is often not possible.
There is a good reason why PDF is so widely used for academic publication: it is the superior file format. Even if the file is never printed, it makes sense to store a paper in such a format. A PDF holds not only the text itself but also the pictures and the tables. Today, the standard format for academic work is PDF or PostScript, which means that the 50M available papers are provided in PDF format. I would guess that this will not change in the future. The only thing that might happen is a new PDF version, for example with better compression. But replacing PDF with HTML is not possible.
There is a difference between an academic blog and an academic paper. An academic paper has an in-depth focus: it contains many literature references and is aimed at an expert audience. There is no higher form of writing above the academic paper. That means that if the PDF format together with 200 bibliographic references is not enough to explain quantum computing and neural networks, then no other format can do so. A blog, in contrast, is a colloquial form of text. It is often created for entertainment, discusses lighter topics, and the energy invested by the author is small. It makes sense to use PDF for heavy content and HTML for lighter content.
The main idea behind PDF is that the paper gets a unique identifier. In the simplest form it is a URL, but sometimes a DOI and a BibTeX entry too. A PDF file is a closed system, which is created with a timestamp and then archived. It is not a living project or a discussion forum; it is written for eternity.
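To make the identifier idea concrete, here is a minimal sketch of how a URL and an optional DOI could be combined into a BibTeX entry. The function name bibtex_entry and all values in the usage example (the key, author, title, URL and DOI) are hypothetical and only serve as illustration; real entries usually carry more fields.

```python
def bibtex_entry(key, author, title, year, url, doi=None):
    """Format a minimal BibTeX @misc entry for an archived paper.

    Hypothetical helper for illustration only; it does no escaping
    of special characters in the field values.
    """
    fields = {
        "author": author,
        "title": title,
        "year": str(year),
        "url": url,
    }
    # The DOI is optional: in the simplest case the URL is the only identifier.
    if doi is not None:
        fields["doi"] = doi
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{key},\n{body}\n}}"


# Usage with made-up values.
entry = bibtex_entry(
    key="doe2018",
    author="Jane Doe",
    title="An Example Paper",
    year=2018,
    url="https://example.org/paper.pdf",
    doi="10.1234/example",
)
print(entry)
```

The entry is fixed once it is created, which matches the idea of a paper as a closed, citable unit rather than a page that changes over time.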
Some journals, like PLOS ONE, offer both formats in parallel. The reader sees the rendered HTML text on the screen but can also download the PDF. In my opinion, it is enough to provide only the PDF file. A researcher who is not interested in reading a PDF file is not interested in the article at all. It is not possible to lower the barrier and make the content easily accessible. A good academic paper is hard to read.