Before a computer can process written information, the corpus has to be transformed into a numerical representation; otherwise a neural network can't be trained on the data. The problem is that nearly all input datasets are formulated in plain English. Suppose a list of question-answer pairs is stored in a .csv file (a loading sketch follows the examples). A typical entry might be:
"What is the north?", "Its a direction similar to east and west"
"What is red?", "Its a color similar to blue or green".
These pairs make perfect sense to humans, but a computer won't understand the words. Every existing neural network architecture requires numerical input in a floating-point range. Unfortunately, the example dataset contains no floating-point numbers, only words.
Even though the problem is obvious in hindsight, it was addressed rather late in computer science. Early attempts at document retrieval didn't require a word embedding model, because classical text retrieval was realized with full-text search engines: the algorithm compares the input sentence against a database and returns the matching document. Only if text retrieval is to be realized with a neural network do the documents have to be converted into a vector space, which is the task of a word embedding model like word2vec or fastText.
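The conversion step is easy to try out. The following is a minimal word2vec sketch, assuming the gensim library is installed (pip install gensim); the toy corpus is built from the example pairs above and is far too small for useful embeddings, it only demonstrates the mechanics.

```python
from gensim.models import Word2Vec

# Toy corpus: pre-tokenized, lowercased sentences from the examples above.
corpus = [
    ["what", "is", "the", "north"],
    ["it's", "a", "direction", "similar", "to", "east", "and", "west"],
    ["what", "is", "red"],
    ["it's", "a", "color", "similar", "to", "blue", "or", "green"],
]

# Train a small embedding: every word is mapped to a 10-dimensional
# floating-point vector, which is exactly what a neural network needs.
model = Word2Vec(sentences=corpus, vector_size=10, window=3, min_count=1, seed=1)

print(model.wv["red"])               # a vector of 10 floats
print(model.wv.most_similar("red"))  # nearby words in the vector space
```

The vector_size parameter controls the dimensionality of the vector space; real models use hundreds of dimensions and millions of sentences.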
Modern large language models are built with word embedding models in the background. These embeddings make sure that the neural network understands the sentences in the corpus. The word embedding model influences how fast and how accurate the resulting chatbot is. For example, a minimalist bag-of-words model with a vocabulary of 100 words won't understand or generate an academic paper, because the translation from a full-text document into a vector space doesn't work well enough.
A domain-specific bag-of-words model is perhaps the most minimal example of a word embedding and can be explained with a point & click adventure game. Suppose there are only 4 verbs (walkto, pick, drop, use) and 4 nouns (key, stairs, door, ball). Each word is assigned a number, so possible sentences look like:
"walkto door" = (0,2)
"pick key" = (1,0)
"use ball" = (3,3)
The first number in the vector is submitted to neuron #1 while the second number is submitted to neuron #2. Most existing point & click adventures don't implement a dedicated word embedding model to store the internal communication vocabulary, but for exploring new tools and NLP techniques it makes sense to introduce word embeddings into video games.
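Such a lookup table fits in a few lines of Python. The following is a minimal sketch; the dictionary names and the encode helper are illustrative, not taken from any particular game engine, and the integer assignments match the examples above.

```python
# The adventure game's entire vocabulary: each word gets a fixed integer.
VERBS = {"walkto": 0, "pick": 1, "drop": 2, "use": 3}
NOUNS = {"key": 0, "stairs": 1, "door": 2, "ball": 3}

def encode(sentence: str) -> tuple[int, int]:
    """Convert a two-word command like 'walkto door' into a vector."""
    verb, noun = sentence.split()
    return (VERBS[verb], NOUNS[noun])

# Element 0 of the result goes to neuron #1, element 1 to neuron #2.
assert encode("walkto door") == (0, 2)
assert encode("pick key") == (1, 0)
assert encode("use ball") == (3, 3)
```

The design choice here is a fixed, hand-crafted vocabulary: unlike word2vec, nothing is learned from data, which is exactly why the model breaks down the moment a sentence contains a word outside the 8-word dictionary.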