May 06, 2019

How to build a news site on the internet


Reddit has sometimes been called, much like Facebook, an overhyped website populated by bots. But what is the alternative? Let us analyze the idea behind Reddit, Hacker News and Facebook groups to understand how such websites could be created from scratch.
The internet consists of two elements. The first one is the World Wide Web, which consists of URLs: if the user types a URL into the browser, he sees the content. The second important element is a fulltext search engine, namely Google, which is able to provide URLs: if a user types a keyword into Google, he gets a list of clickable links which direct him to the content.
The problem with this kind of knowledge database is that the user must know the right keyword to find the URL. He can only see a piece of information if he knows what he is searching for. What most users prefer is a browsable list of URLs, ordered similar to the headlines of a newspaper. Websites like Hacker News and Reddit provide such a list.
Such a list works like a news ticker. The ticker consists at minimum of a headline plus an underlying URL, and it is updated constantly whenever new content is published on the internet. From a technical perspective, such a ticker is monitoring the activities of the Google crawler bot. The Google robot searches for new and updated websites and puts them into a fulltext database. Each second, many thousands of websites get updated. A small fraction of these updates is redirected to the Reddit ticker.
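To make the idea concrete, a ticker entry can be modelled as a tiny record of headline, URL and timestamp. The sketch below is illustrative only; the class names and fields are assumptions, not Reddit's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class TickerEntry:
    """One line of the ticker: a headline plus the underlying URL."""
    headline: str
    url: str
    published: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class Ticker:
    """A newest-first list of entries, fed by whoever notices an updated page."""
    entries: List[TickerEntry] = field(default_factory=list)

    def push(self, headline: str, url: str) -> None:
        # new items go to the top, like on a live news ticker
        self.entries.insert(0, TickerEntry(headline, url))

# example: the crawler (or a human) reports an updated page
ticker = Ticker()
ticker.push("Wiki article about the moon was rewritten", "https://example.org/moon")
print(ticker.entries[0].headline)
```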
Such a minimal web aggregator can be improved with upvotes, user comments, preview pictures and categories. This is what all the large news aggregation websites are doing. An enriched news aggregator is more interesting for users because it lets them get to the best content faster.
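One common way to use the upvotes is to rank entries by votes and age so that fresh, well-voted items float to the top. The function below is a sketch that follows the publicly documented Reddit "hot" formula; the epoch offset 1134028003 and the decay constant 45000 are taken from that published code, while the names and the example entry are made up for illustration.

```python
from datetime import datetime, timezone
from math import log10

def hot_score(upvotes: int, downvotes: int, published: datetime) -> float:
    """Rank an entry so that newer and better-voted items float to the top."""
    score = upvotes - downvotes
    order = log10(max(abs(score), 1))
    sign = 1 if score > 0 else (-1 if score < 0 else 0)
    # 1134028003 (an epoch offset) and 45000 (the decay constant) come from
    # the publicly documented Reddit ranking code and are copied unchanged
    seconds = published.timestamp() - 1134028003
    return round(sign * order + seconds / 45000, 7)

# example: compute a score for one entry (vote counts are invented)
entry = {"up": 42, "down": 3, "published": datetime.now(timezone.utc)}
print(hot_score(entry["up"], entry["down"], entry["published"]))
```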
Let us imagine we want to build a news aggregator and a search engine from scratch, so we can ignore the existing infrastructure of Google and Reddit. Taking both authorities offline as a working hypothesis makes the filtering problem visible on its own. The assumption is that there are millions of websites on the internet. These websites can be blog posts, forums, homepages, podcasts or source code repositories. The websites do not link to each other because they are not aware of their neighbors. A very large problem in this knowledge collection is to find the information which is best for a given user. From a technical side, the first thing which is needed is a fulltext search engine. It consists of a crawler bot which indexes the fulltext and a database which allows searching the content.
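In code, the smallest possible version of such a search engine is an inverted index that maps every word to the URLs containing it. The sketch below assumes the crawler has already downloaded the pages as plain text; fetching, parsing and a real database are left out.

```python
from collections import defaultdict
from typing import Dict, List, Set

class FulltextIndex:
    """A toy inverted index: maps every word to the URLs that contain it."""

    def __init__(self) -> None:
        self.index: Dict[str, Set[str]] = defaultdict(set)

    def add_document(self, url: str, text: str) -> None:
        # the crawler would call this for every page it has downloaded
        for word in text.lower().split():
            self.index[word].add(url)

    def search(self, query: str) -> List[str]:
        # return the URLs that contain every word of the query
        words = query.lower().split()
        if not words:
            return []
        hits = set.intersection(*(self.index.get(w, set()) for w in words))
        return sorted(hits)

# example usage with two invented pages
engine = FulltextIndex()
engine.add_document("https://localhost/wiki/Moon", "the moon orbits the earth")
engine.add_document("https://localhost/wiki/Mars", "mars is the red planet")
print(engine.search("the moon"))   # ['https://localhost/wiki/Moon']
```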
Such a search engine is only the first step towards getting access to the information. What is missing is a ticker. A ticker works without entering a search term, similar to the table of contents of a magazine: it presents all the information that is available. The ticker is like the menu in a restaurant; the user gets an impression of what is on offer even if he has no idea yet what he likes most.
Creating a live ticker can be done in its simplest form with an automatic algorithm. The ticker presents all the updated websites in a linear list and the user can scroll from top to bottom. The disadvantage is that even in a small internet the amount of recent changes is large, similar to what Wikipedia presents under the "recent changes" tab. The better idea is to organize the ticker into categories like sport, funny and cars, and to allow annotations which decide which of the content updates are important and which are not.
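A categorized ticker only needs a grouping step on top of the linear list. In the sketch below the category labels are hand-assigned for the example; a real aggregator would need human annotation or a classifier to produce them.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (category, headline, url) triples as they arrive on the raw ticker;
# all three entries are invented for the example
raw_ticker = [
    ("sport", "Local team wins the cup",      "https://example.org/cup"),
    ("cars",  "New electric model announced", "https://example.org/ev"),
    ("sport", "Marathon route published",     "https://example.org/run"),
]

def by_category(entries: List[Tuple[str, str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Turn the linear list into one short list per category."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for category, headline, url in entries:
        grouped[category].append((headline, url))
    return grouped

for category, items in by_category(raw_ticker).items():
    print(category, items)
```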
What Reddit and other content aggregators are doing is creating and monitoring the live ticker of the internet. In contrast to the Wikipedia "recent changes" section, not only a single website is on the ticker, but the complete internet.
The number of Wikipedia edits per minute is known because the data is publicly available. An estimate for the complete internet was made in the discussion https://www.quora.com/How-many-websites-are-created-each-year-month-week which comes to the conclusion that around 10 new websites are added each minute. That figure counts only new domains; if an existing website posts something new, the number is higher. From another statistic it is known that about 4 million blog posts are created each day, which is roughly 2,778 postings per minute (4,000,000 posts / 1,440 minutes). And blogs are only a small part of the internet. If we monitored all edits in a live ticker, the number of changes each minute would be even higher.


So-called news aggregation websites are trying to monitor, evaluate, categorize and contextualize the changes on the internet. The reason why so many bots are posting URLs to aggregator websites is that humans are overwhelmed by such a task. Which human can evaluate 1 million updates each minute?
Let us simplify the process a bit. The assumption is that in an intranet the only website available is a wiki which is filled with information. Apart from the wiki no other domain exists. So the only URL which contains content is https://localhost/wiki/
If the wiki is updated, this is made visible in the "last changes" section. To make the wiki more transparent, an article called "News" is created inside the wiki. Humans monitor the last changes section and put the URL of each change into the news section. Sometimes they add a small comment.
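If the wiki happens to be a MediaWiki installation, the "last changes" section can also be pulled over the standard api.php endpoint, so the human curator only has to pick the interesting items and add a comment. The endpoint URL below and the use of the requests library are assumptions for this sketch, not something stated in the text.

```python
import requests  # third-party library: pip install requests

API = "https://localhost/wiki/api.php"   # assumed MediaWiki endpoint

def fetch_recent_changes(limit: int = 10):
    """Return (title, timestamp, comment) for the newest wiki edits."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|timestamp|comment",
        "rclimit": limit,
        "format": "json",
    }
    data = requests.get(API, params=params, timeout=10).json()
    return [(rc["title"], rc["timestamp"], rc.get("comment", ""))
            for rc in data["query"]["recentchanges"]]

# a human curator reviews this list and copies the interesting items,
# plus a short comment, into the wiki's "News" article
for title, timestamp, comment in fetch_recent_changes():
    print(timestamp, title, comment)
```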
At first glance this looks like extra work, because the wiki already monitors itself in the version history. The problem is that the version history doesn't provide context. A new user who is not familiar with the wiki has no entry point to get an overview. If he can read through the news section, he gets a much better impression of how the wiki works internally.
Summarizing this concept is easy. Knowledge consists of three parts: the fulltext, a search engine, and annotated news created by humans to give an overview in chronological order.