June 16, 2021

Comparing Google Scholar with Microsoft Academic

 

Google Scholar was one of the first attempts at building a search engine for academic papers. It is best known for its powerful full-text search: in contrast to the websites of individual publishers, it searches inside the content of all the documents available to it. After typing in some keywords, the user gets a list of useful documents, similar to the normal Google web search. Just as important, it provides formatted BibTeX information as well, so the paper that was found can be cited easily.
Most users see no need to try out a potential alternative. Microsoft Academic was developed later and is seldom used. A closer look shows that Microsoft Academic can't replace Google Scholar, but it can provide useful additional information. Its strength is that it has more in common with a classical bibliographic database. All the entries are labeled with tags like “operating system” or “programming language”. In addition, the user can restrict the result list to papers, conference proceedings or books. The most interesting feature of Microsoft Academic is the ability to change the sorting order. For every paper or book the raw citation count is shown. This makes it possible to answer questions like “which books on a certain topic are popular?” or “how many papers in a subject remain uncited?”.
Let us make a simple experiment. We restrict the years to 1990 through 2000, select the domain “programming language” and show only books, ordered by raw citation count. The result page is a sort of recommended reading list. Somebody with no experience in the field can start with the most referenced books, and chances are good that they will be worth reading:
1. Edmund M. Clarke: Model Checking
2. James Rumbaugh: Object-Oriented Modeling and Design
3. Martin Fowler: Refactoring: Improving the Design of Existing Code
4. Ed Anderson: LAPACK Users' Guide
In contrast to a recommendation list curated by a human, these four books were determined purely by citation statistics. They form an objective list of books that are relevant for the discipline. A good library will provide these books, because they are requested frequently.
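The experiment above boils down to a filter followed by a sort. Here is a minimal Python sketch of that idea. It does not talk to Microsoft Academic at all: the `Publication` record, the `most_cited` function and the sample entries with their citation counts are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Publication:
    title: str
    year: int
    kind: str        # "book", "paper" or "proceedings"
    topic: str       # tag such as "programming language"
    citations: int   # raw citation count


def most_cited(records, topic, kind, first_year, last_year, limit):
    """Keep records matching topic, kind and year range, most cited first."""
    hits = [
        r for r in records
        if r.topic == topic
        and r.kind == kind
        and first_year <= r.year <= last_year
    ]
    hits.sort(key=lambda r: r.citations, reverse=True)
    return hits[:limit]


# Placeholder records with invented counts, not real search results.
records = [
    Publication("Book A", 1991, "book", "programming language", 120),
    Publication("Book B", 1999, "book", "programming language", 450),
    Publication("Paper C", 1995, "paper", "programming language", 900),
    Publication("Book D", 2005, "book", "programming language", 700),
    Publication("Book E", 1994, "book", "operating system", 300),
]

top = most_cited(records, "programming language", "book", 1990, 2000, limit=4)
for rank, pub in enumerate(top, start=1):
    print(rank, pub.title, pub.citations)
```

Only "Book B" and "Book A" survive the filter here: the paper has the wrong type, "Book D" is outside the year range and "Book E" belongs to another topic.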
Let us look at the situation from a broader perspective. Suppose there is a university library which has room for only 2000 books. Which books should the library buy? The answer is that it should buy the most requested books, because then the chance is high that students will find what they need.
Thanks to the Microsoft Academic search engine, the concrete list of books can be created easily. All the librarian has to do is write down the most cited books in each subject. In this example there are 500 different subjects, and each slot is filled with the 4 most important books, which gives 500 × 4 = 2000 books in total. The result is a small but beautiful classical printed library which its students will like very much.
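The librarian's procedure can be sketched in a few lines of Python as well. This is only one possible way to write it down: the function name `build_collection`, the tuple layout and the sample data are assumptions made for this example, and the citation counts are invented.

```python
from collections import defaultdict


def build_collection(books, per_subject, capacity):
    """Pick the most cited books in each subject without exceeding capacity.

    `books` is an iterable of (subject, title, citations) tuples.
    """
    by_subject = defaultdict(list)
    for subject, title, citations in books:
        by_subject[subject].append((citations, title))

    # Every subject gets the same number of slots, so check the total first.
    if len(by_subject) * per_subject > capacity:
        raise ValueError("too many subjects for the available shelf space")

    collection = {}
    for subject, entries in by_subject.items():
        entries.sort(reverse=True)  # highest citation count first
        collection[subject] = [title for _, title in entries[:per_subject]]
    return collection


# Placeholder data with invented counts, not real search results.
books = [
    ("programming language", "Book A", 120),
    ("programming language", "Book B", 450),
    ("programming language", "Book C", 80),
    ("operating system", "Book D", 300),
    ("operating system", "Book E", 310),
]

shelf = build_collection(books, per_subject=2, capacity=4)
for subject, titles in shelf.items():
    print(subject, titles)
```

With the numbers from the text the call would use per_subject=4 and capacity=2000, which is exactly enough for 500 subjects.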
The most referenced books for programming languages were mentioned above. Does it make sense to reduce all the books to only 4? Yes, because the requirement is that the imaginary library doesn't have more space. Each subject can only be filled with a limited number of books. If “programming language” gets 20 books, the 16 extra slots have to be taken away from four other subjects, which then end up with no books at all. The hard question is which four books are important enough to stand for an entire domain.
The entire domain of programming languages has around 300k publications: papers, conference proceedings, books, and PostScript files in different repositories. The reason is that a lot of people have contributed to the subject in the past. A library that wanted to store all of this information would have to be a huge building.