Programs / Text Corrector
« on: October 12, 2021, 04:24:29 pm »
Text Corrector corrects writing errors while the user is composing a text. It doesn't require a vocabulary file; instead, it learns any language from a given sample text. The algorithm not only extracts the words from that sample, but also extracts the links between words and analyzes the probabilistic features of the language. Based on the collected data, typing errors are then modeled as Gaussian distributions whose mean is the correct value.
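The error model can be sketched as follows. The listing approximates the normal CDF with a lookup table (fillGaussTable) and a fixed deviation of 2; this Python sketch uses the closed form via erf instead, and the function names are mine:

```python
from math import erf, sqrt

def gauss_cdf(x: float, mean: float, deviation: float) -> float:
    """Cumulative probability of a normal distribution, in closed form via erf."""
    return 0.5 * (1.0 + erf((x - mean) / (deviation * sqrt(2.0))))

def error_probability(observed: int, intended: int, deviation: float = 2.0) -> float:
    """Probability mass of a Gaussian centered on the intended value, taken over
    the unit-wide interval around the observed value (the listing computes the
    same CDF difference, with deviation 2)."""
    return (gauss_cdf(observed + 0.5, intended, deviation)
            - gauss_cdf(observed - 0.5, intended, deviation))

# An observation equal to the intended value is the most probable outcome,
# and the probability falls off symmetrically with the distance from it.
assert error_probability(7, 7) > error_probability(6, 7) > error_probability(4, 7)
```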
While the user is typing a new word, the algorithm predicts which word the user intends to write by evaluating each vocabulary word's affinity, a weighted mean of three parameters:
- LENGTH(i): the probabilistic distance between the length of the typed word and the length of the i-th vocabulary word
- CONTEXT(i): the estimated frequency with which the i-th vocabulary word follows the last typed word
- STRUCTURE(i, j): the probabilistic distance of the j-th letter of the typed word from the letter at the j-th position of the i-th vocabulary word
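The affinity score itself is a plain convex combination. A minimal Python sketch, using the base weights that appear in the listing (0.15 for LENGTH, 0.30 for CONTEXT, 0.55 for STRUCTURE) and assuming the three parameters are already normalized to [0, 1]:

```python
def affinity(length_p: float, context_p: float, structure_p: float,
             a: float = 0.15, b: float = 0.30, c: float = 0.55) -> float:
    """Weighted mean of the three parameters; the weights must sum to 1
    (the same constraint is noted in the listing's computeWordAffinity)."""
    assert abs(a + b + c - 1.0) < 1e-9
    return a * length_p + b * context_p + c * structure_p

# A word that matches the typed prefix letter for letter (high STRUCTURE)
# outweighs one that is merely the right length.
assert affinity(0.2, 0.3, 0.9) > affinity(0.9, 0.3, 0.2)
```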
To enhance prediction performance, the algorithm recalibrates these parameters with two further correction mechanisms:
- BIAS: the weight of the LENGTH parameter is reduced while only a few letters have been typed, since the length is then probably not final (i.e., more letters will be typed, and the length will soon increase)
- BONUS: if the exact typed word exists in the vocabulary, its affinity is raised to the maximum, so that it prevails over all other vocabulary words.
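The two mechanisms can be sketched like this. The bias schedule below is a hypothetical one (the listing computes bias1 and bias2 elsewhere and subtracts them from the LENGTH weight in the call to computeWordAffinity), and score is an illustrative name:

```python
def recalibrated_weights(typed_letters: int, a: float = 0.15,
                         b: float = 0.30, c: float = 0.55):
    """BIAS: shift weight away from LENGTH toward CONTEXT and STRUCTURE while
    few letters have been typed; the shift fades out as the word grows.
    This linear schedule is hypothetical, but it mirrors the pattern of the
    listing's call (0.15 - bias1 - bias2, 0.30 + bias1, 0.55 + bias2)."""
    bias = max(0, 3 - typed_letters) * 0.02
    return a - 2 * bias, b + bias, c + bias

def score(word: str, typed: str, base_affinity: float) -> float:
    """BONUS: an exact vocabulary match saturates the score, so it prevails
    over every other candidate."""
    return 1.0 if word == typed else base_affinity

# With one letter typed, LENGTH counts for less than its base weight,
# and the three weights still sum to 1.
a, b, c = recalibrated_weights(1)
assert a < 0.15 and abs(a + b + c - 1.0) < 1e-9
# An exact match overrides the computed affinity.
assert score("park", "park", 0.4) == 1.0
```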
Even though the algorithm needs a long, complete text sample that is representative of the language in order to work fully, the default sample hard-coded in the program already gives a good demonstration of its predictive behavior.
Code: QB64:
_Title "Text Corrector"
sample$ = sample$ + temp$ + " "
Close #1
sample$ = "Today is a beautiful day, tomorrow I'm going to the park. The park is far from home, so today I will stay at home or maybe I will go to the supermarket. By the way, your house is beautiful and big, I really like it!"
lastArchivedWord$ = "."
isFirstTime = -1
lastInputWasPunctuation = 0
learnFromSample (sample$)
isFirstTime = 0
computeWordAffinity lastArchivedWord$, newWord$, 0.15 - bias1 - bias2, 0.30 + bias1, 0.55 + bias2, 1
refreshTextEditor archivedText$, lastArchivedWord$, newWord$
archivedText$ = archivedText$ + " " + lastArchivedWord$ + " "
archivedText$ = archivedText$ + " "
archivedText$ = archivedText$ + " " + highestAffinityWord$ + " "
newWord$ = ""
lastArchivedWord$ = key$
lastInputWasPunctuation = -1
Case " ":
archivedText$ = archivedText$ + " " + lastArchivedWord$ + " "
lastArchivedWord$ = newWord$
lastArchivedWord$ = highestAffinityWord$
newWord$ = ""
lastInputWasPunctuation = 0
lastInputWasPunctuation = 0
showComputedWordAffinity
toprint$ = archivedText$
toprint$ = toprint$ + lastArchivedWord$
toprint$ = toprint$ + newWord$ + "_"
Print "Write something, I'll try to correct you:"
Next I
Sub showComputedWordAffinity
best$ = highestAffinityWord$
Next i
Function highestAffinityWord$
highestAffinityWord$ = words(I)
Next I
decomposeIntoLinks (sample$)
orderLinks
countLinks
collectWords
Sub computeWordAffinity (oldword$, newWord$, A, B, C, BONUS) ' A = length weight, B = context weight, C = structure weight, BONUS = value added to global probability (constraint: A + B + C = 1)
wordsProbability(I, 1) = 0
wordsProbability(I, 2) = 0
wordsProbability(I, 3) = 0
wordsProbability(I, 3) = structureLikeness(newWord$, words(I))
Next I
searchLinks (oldword$)
position = wordPosition(searchedCountedLinks(I, 2))
Next I
wordsProbability(I, 4) = Int((A * wordsProbability(I, 1) + B * wordsProbability(I, 2) + C * wordsProbability(I, 3)) * 10 ^ 4) / 10 ^ 4
Next I
orderComputedWords
shuffleWords
Sub orderComputedWords
changed = 0
buffer$ = words(I - 1)
buffer1 = wordsProbability(I - 1, 1)
buffer2 = wordsProbability(I - 1, 2)
buffer3 = wordsProbability(I - 1, 3)
buffer4 = wordsProbability(I - 1, 4)
words(I - 1) = words(I)
wordsProbability(I - 1, 1) = wordsProbability(I, 1)
wordsProbability(I - 1, 2) = wordsProbability(I, 2)
wordsProbability(I - 1, 3) = wordsProbability(I, 3)
wordsProbability(I - 1, 4) = wordsProbability(I, 4)
words(I) = buffer$
wordsProbability(I, 1) = buffer1
wordsProbability(I, 2) = buffer2
wordsProbability(I, 3) = buffer3
wordsProbability(I, 4) = buffer4
changed = 1
Next I
Sub shuffleWords
buffer$ = words(I - 1)
buffer1 = wordsProbability(I - 1, 1)
buffer2 = wordsProbability(I - 1, 2)
buffer3 = wordsProbability(I - 1, 3)
buffer4 = wordsProbability(I - 1, 4)
words(I - 1) = words(I)
wordsProbability(I - 1, 1) = wordsProbability(I, 1)
wordsProbability(I - 1, 2) = wordsProbability(I, 2)
wordsProbability(I - 1, 3) = wordsProbability(I, 3)
wordsProbability(I - 1, 4) = wordsProbability(I, 4)
words(I) = buffer$
wordsProbability(I, 1) = buffer1
wordsProbability(I, 2) = buffer2
wordsProbability(I, 3) = buffer3
wordsProbability(I, 4) = buffer4
Next I
wordPosition = I
Next I
Sub collectWords
oldstr$ = countedLinks(1, 1)
j = 1
words(j) = oldstr$
j = j + 1
oldstr$ = countedLinks(i, 1)
Next i
words(j) = oldstr$
numWords = j
numWords = j - 1
totDistance = totDistance + probability(distance, 0)
Next I
structureLikeness = 0
searchedCountedLinks(I, 1) = ""
searchedCountedLinks(I, 2) = ""
searchedCountedLinks(I, 3) = ""
Next I
j = 0
totSearchedCountedLinks = 0
j = j + 1
searchedCountedLinks(j, 1) = countedLinks(I, 1)
searchedCountedLinks(j, 2) = countedLinks(I, 2)
searchedCountedLinks(j, 3) = countedLinks(I, 3)
Next I
numSearchedCountedLinks = j
Sub countLinks
oldstr1$ = links(1, 1)
oldstr2$ = links(1, 2)
counter = 1
j = 1
counter = counter + 1
countedLinks(j, 1) = oldstr1$
countedLinks(j, 2) = oldstr2$
oldstr1$ = links(I, 1)
oldstr2$ = links(I, 2)
j = j + 1
counter = 1
Next I
countedLinks(j, 1) = oldstr1$
countedLinks(j, 2) = oldstr2$
numCountedLinks = j
sample$ = standardize$(sample$)
oldword = newword
newword = extractedword
row = row + 1
I = j + 1
Next I
numLinks = row
Sub orderLinks
changed = 0
buffer1$ = links(I - 1, 1)
buffer2$ = links(I - 1, 2)
links(I - 1, 1) = links(I, 1)
links(I - 1, 2) = links(I, 2)
links(I, 1) = buffer1$
links(I, 2) = buffer2$
changed = 1
Next I
standardize$ = standardize$ + char$
Next i
destandardize$ = destandardize$ + char$
punctuation = isPunctuation(char$)
Next i
probability = cumulativeProbability(value + 0.5, mean, 2) - cumulativeProbability(value - 0.5, mean, 2)
value = (value - mean) / deviation
cumulativeProbability = GaussTable(adaptedvalue)
cumulativeProbability = 1
Sub fillGaussTable
GaussTable(0) = 0.5000
GaussTable(5) = 0.5199
GaussTable(10) = 0.5398
GaussTable(15) = 0.5596
GaussTable(20) = 0.5793
GaussTable(25) = 0.5987
GaussTable(30) = 0.6179
GaussTable(35) = 0.6368
GaussTable(40) = 0.6554
GaussTable(45) = 0.6736
GaussTable(50) = 0.6915
GaussTable(55) = 0.7088
GaussTable(60) = 0.7257
GaussTable(65) = 0.7421
GaussTable(70) = 0.7580
GaussTable(75) = 0.7734
GaussTable(80) = 0.7881
GaussTable(85) = 0.8023
GaussTable(90) = 0.8159
GaussTable(95) = 0.8289
GaussTable(100) = 0.8413
GaussTable(105) = 0.8531
GaussTable(110) = 0.8643
GaussTable(115) = 0.8749
GaussTable(120) = 0.8849
GaussTable(125) = 0.8944
GaussTable(130) = 0.9032
GaussTable(135) = 0.9115
GaussTable(140) = 0.9192
GaussTable(145) = 0.9265
GaussTable(150) = 0.9332
GaussTable(155) = 0.9394
GaussTable(160) = 0.9452
GaussTable(165) = 0.9505
GaussTable(170) = 0.9554
GaussTable(175) = 0.9599
GaussTable(180) = 0.9641
GaussTable(185) = 0.9678
GaussTable(190) = 0.9713
GaussTable(195) = 0.9744
GaussTable(200) = 0.9772
GaussTable(210) = 0.9821
GaussTable(220) = 0.9861
GaussTable(230) = 0.9893
GaussTable(240) = 0.9918
GaussTable(250) = 0.9938
GaussTable(260) = 0.9953
GaussTable(270) = 0.9965
GaussTable(280) = 0.9974
GaussTable(290) = 0.9981
GaussTable(310) = 0.9990
GaussTable(390) = 1
GaussTable(400) = 1
nextValidValue = 0
j = i + 1
nextValidValue = GaussTable(j)
nextValidValuePosition = j
j = j + 1
interpolation = Int((nextValidValue - lastValidValue) * (i - lastValidValuePosition) / (nextValidValuePosition - lastValidValuePosition) * 10 ^ 4) / 10 ^ 4
GaussTable(i) = lastValidValue + interpolation
lastValidValue = GaussTable(i)
lastValidValuePosition = i
Next i
Function compare (str1$, str2$) ' returns 0 if str1$ = str2$, 1 if str1$ < str2$, 2 if str1$ > str2$
compare = 2
compare = 1
Next I
charDistance = 10
charDistance = I
charDistance = I
Next I
isSpecial = 0
isPunctuation = 0
isUnsupported = 0
min = int1
min = int2
max = int1
max = int2