Description
The following is a long explanation of things going wrong currently. It offers no possible solutions yet. These will follow asap. I am trying to figure out the 'easiest fix'.
A/ We have recently adapted TICCL-rank to the needs of the new TICCL-chain by making it sort its best-first ranked (parameter --clip=1) output file numerically descending on the frequency of the Correction Candidate (CC). This has broken the correct working of TICCL-rank.
B/ We have also quite recently made TICCL-LDcalc output 'short' correction pairs to a new output file *short.ldcalc, and the ngrams from which the short correction pairs were derived to a new file with extension 'ambi'. This creates further problems for TICCL-rank, as we shall explain later.
C/ Furthermore, we do not know whether the new ranking feature, based on the number of observed ngrams in which a particular word form appears, is in fact operational in TICCL-LDcalc yet.
D/ We remain handicapped by the fact that we do not have an exhaustive description of the full ranking system as currently implemented in TICCL-LDcalc and TICCL-rank.
Addressing A/ : We have for a while been under the impression that TICCL 'just' misses the most obvious Correction Candidate. We think we now have found the cause for this.
We present output from TICCL-rank run with respectively --clip=1, --clip=5 and --clip=10 on TICCL-LDcalc output on the English book by Morse.
In CLIP5 we see clearly that the CCs are ranked according to their frequency and no longer according to the confidence score. In fact the highest confidence score is with the fifth ranked CC. In CLIP10 we see that the highest confidence score in CLIP5 is outranked by the even higher confidence score of CC 'Niles'.
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.ranked
nuiles#1#Naples#4000030272#2#0.998194
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP10.ranked
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#Dulles#2000005022#2#0.998645
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Wiles#2000001183#2#0.998052
nuiles#1#Nines#2000000883#2#0.998734
nuiles#1#Ailes#2000000578#2#0.999088
When we look at the appropriately sorted output of CLIP1000, we see that 'Niles' in fact has the highest confidence score. The now 'best' ranked top 10 CCs have swamped the actual desired correction 'miles'; its capitalized version 'Miles', which was present in CLIP5, is now out of sight, too.
Current TICCL output (incorrectly sorted by CC frequency) for non-word word form 'nuiles':
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP1000.ranked | sort -gr -t '#' -k 4 |head -n 10
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#Dulles#2000005022#2#0.998645
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Wiles#2000001183#2#0.998052
nuiles#1#Nines#2000000883#2#0.998734
nuiles#1#Ailes#2000000578#2#0.999088
Output as it should be sorted, by highest confidence:
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP1000.ranked | sort -gr -t '#' -k 6 |head -n 10
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Tules#2000000029#2#0.999486
nuiles#1#nuclei#1000008297#2#0.999478
nuiles#1#rules#1000152878#2#0.99946
nuiles#1#Rules#1000021220#2#0.999433
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#nails#1000009554#2#0.999203
nuiles#1#Suites#1705034559#2#0.999194
nuiles#1#Nilus#1000000335#2#0.999176
nuiles#1#Yules#2000000019#2#0.999097
Anyway, the main thing is that currently even the best-first ranked CC offered with CLIP1 is not the one with the highest confidence score, but the one with the highest frequency, which is plainly wrong. This is an undesired artefact of the resorting implemented for TICCL-chain.
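The difference between the two orderings can be reproduced outside TICCL-rank. The following is a minimal Python sketch, not TICCL code. It assumes the '#'-separated fields are variant, variant frequency, CC, CC frequency, a fifth field (apparently the edit distance) and the confidence score, as the sort keys 4 and 6 used above suggest. The names RankedLine, parse_ranked and best_candidate are hypothetical.

```python
from typing import NamedTuple


class RankedLine(NamedTuple):
    variant: str
    variant_freq: int
    candidate: str
    candidate_freq: int
    fifth_field: int   # assumed to be the edit distance
    confidence: float


def parse_ranked(line: str) -> RankedLine:
    """Split one '#'-separated line of a *.ranked file (field meanings assumed)."""
    variant, vfreq, cand, cfreq, fifth, conf = line.strip().split("#")
    return RankedLine(variant, int(vfreq), cand, int(cfreq), int(fifth), float(conf))


def best_candidate(lines, key):
    """Return the parsed line that maximizes the given key."""
    return max((parse_ranked(line) for line in lines), key=key)


# The CLIP10 lines for 'nuiles' quoted above.
CLIP10 = [
    "nuiles#1#Naples#4000030272#2#0.998194",
    "nuiles#1#Miles#3000031696#2#0.998486",
    "nuiles#1#Giles#2000014531#2#0.998273",
    "nuiles#1#Jules#2000014280#2#0.99853",
    "nuiles#1#suites#2000007207#2#0.9993",
    "nuiles#1#Dulles#2000005022#2#0.998645",
    "nuiles#1#Niles#2000004196#1#0.999699",
    "nuiles#1#Wiles#2000001183#2#0.998052",
    "nuiles#1#Nines#2000000883#2#0.998734",
    "nuiles#1#Ailes#2000000578#2#0.999088",
]

by_frequency = best_candidate(CLIP10, key=lambda r: r.candidate_freq)
by_confidence = best_candidate(CLIP10, key=lambda r: r.confidence)
print(by_frequency.candidate)   # Naples
print(by_confidence.candidate)  # Niles
```

Selecting on field 4 reproduces what --clip=1 currently delivers; selecting on field 6 gives the CC one would expect from a best-first ranking.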
We see much the same for 'Amarican', though the result is less wrong: here the highest confidence score is given to the right correction.
TICCL sorted output:
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked |more
Amarican#1#America#4000475833#2#0.996842
Amarican#1#American#3001522167#1#0.998421
Amarican#1#Americas#3000025187#2#0.995474
Amarican#1#Américas#3000000831#2#0.991158
Amarican#1#African#2000256933#2#0.993263
Output re-sorted in descending order of confidence:
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked |sort -gr -t '#' -k 6 |more
Amarican#1#American#3001522167#1#0.998421
Amarican#1#America#4000475833#2#0.996842
Amarican#1#Americas#3000025187#2#0.995474
Amarican#1#African#2000256933#2#0.993263
Amarican#1#Américas#3000000831#2#0.991158
Nevertheless: the 'best-first ranked' candidate without parameter --clip is still the one obtained by highest frequency sorting:
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.ranked |more
Amarican#1#America#4000475833#2#0.996842
Addressing B/ : In prior runs, in which the foci file was not yet curtailed to the foreground corpus only, we found that 'tire' is often a confusable for 'the'. We are rather surprised that this is still the case, although many more pairs representing this confusion now seem to have been properly filtered out on the basis of their frequencies, i.e. these being validated word form pairs. That it still happens in some cases is in itself another issue to be addressed. (This may be because capitalized word forms did not get the artifrq, at least in some of these cases.)
Example:
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tire~the' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi
tire~the#first_tire~First_the#first_tire~first_the#tire_Great_Kanhaway~the_Great_Kanhaway#tire_Great_Kanhaway~the_great_Kanhaway#tire_Guisos_Mexico~the_Guisos_Mexico#tire_Guisos_Mexico~the_guisos_Mexico#tire_Guisos~the_Guisos#tire_Guisos~the_guisos#tire_Milliiippi~the_Milliiippi#tire_life~the_LIFE#tire_life~the_Life#tire_life~the_life#
As stated before, we are not currently attempting to solve confusables. But this example allows us to explain the issue currently at hand.
The short forms have duly been added to the *short.ldcalc file, as we have recently decided to do. The pair 'tire~the' is the first of the last nine of 52 such 'confusable' pairs in *short.ldcalc.
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tire' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.short.ldcalc |tail -n 9
tire00the000220100120
tire0tides1000007728100000988102311001
tire00ties00022010050
tire0tin0002201001
tire00tis00022010010
tire0toe0002201001
tire00tone00022010010
tire0wine0002200002
tire00wise000220000~1
[Another new issue which seems to have popped up the last week or so (as a consequence of one of the latest adjustments to the work flow) is here apparent: for lots of these pairs the usual information such as frequencies etc. is now missing.]
The issue we are inching towards is this: short word forms may well be 'properly' handled by *short.ldcalc and *ambi, but other pairs based on the actual bigram (mostly, if not exclusively, we suspect) are still incorporated in the regular 'long' *ldcalc file. We no longer see the actual 'tire_land' and 'tire_bay' examples we had a couple of weeks ago; the first delivered e.g. CCs 'Ireland' and 'fireland' in the long ldcalc file. But the following examples are clear enough (granted: they should not be there by virtue of the frequencies of their composing words alone):
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tire_' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc
tire_as4455Tijeras1000000109100000010923318930336251110044
tire_as55Treas10000000981000000124238036236572511100
tire_as4455treas10000000261000000124238036236572511100266
tire_on266Ireson10000000921000000092148343068382510100
tire_on266266Tiron10000000841000000084232073370562511100266
tire_on266Treon10000000411000000041238036236572511100
tire_or6565TREVOR10520000183025512626967251110065
tire_or65Trevor2000018197200001830255126269672511100
tire_to170187Tirito1000000000100000000010444521431251110~0
A non-word example concerns 'ifle':
We have 596 pairs containing this non-word in short.ldcalc.
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ cat /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.short.ldcalc |grep '^ifle~' |wc
596 596 21098
For the probably correct resolution 'rifle' we have the following evidence:
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^ifle~rifle' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi
ifle~rifle#The_ifle_is~the_rifle_is#The_ifle~The_rifle#The_ifle~the_rifle#and_the_ifle~and_the_rifle#ifle_is~rifle_is#ifle_of~rifle_of#ifle_on_the~rifle_on_the#ifle_on~rifle_on#ifle_or~rifle_or#small_ifle~small_rifle#the_ifle_of~the_rifle_of#the_ifle~The_rifle#the_ifle~the_rifle#
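To count the evidence in such an ambi line, it helps to split it into its parts. Below is a minimal Python sketch. It assumes the format is `variant~CC#` followed by `ngram~corrected_ngram#` pairs; the tilde separator is inferred from pairs such as 'the_ifle~the_rifle' in this issue. parse_ambi is a hypothetical helper, not part of TICCL.

```python
def parse_ambi(line: str):
    """Split an ambi line into (variant, cc, [(ngram, corrected_ngram), ...]).

    Assumed layout: 'variant~CC#ngram~corrected#ngram~corrected#...'
    """
    fields = [f for f in line.strip().split("#") if f]
    variant, cc = fields[0].split("~")
    ngram_pairs = [tuple(f.split("~")) for f in fields[1:]]
    return variant, cc, ngram_pairs


# Shortened example built from the 'ifle' line above (tilde separator assumed).
line = "ifle~rifle#ifle_is~rifle_is#ifle_of~rifle_of#the_ifle~the_rifle#"
variant, cc, ngram_pairs = parse_ambi(line)
print(variant, cc, len(ngram_pairs))  # ifle rifle 3
```

The number of ngram pairs per variant~CC pair is the kind of count the ngram-based ranking feature mentioned under C/ would presumably rely on.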
'Long' LDcalc nevertheless still retains a number of 'ifle' bigrams.
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^ifle_' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc
ifle_is11Ifles142380362365725010001
ifle_is1ifles34238036236572501000
ifle_on11Flemon10000000021000000002280020708125101001
ifle_on1Fleron111197781063502500100
ifle_on11Flexon4444923474575025001001
ifle_on1Isleton10000000521000000052110889093722511100
ifle_or11Flexor1181128923474575025001001
ifle_or1flexor1010112892347457502500100
The problem with these is that TICCL-rank misses the possibly likeliest resolution, which is in short.ldcalc, and will rank the rest, probably delivering a False Positive.
I am not sure what would be best to do about this. I think for now we should keep both the short.ldcalc and ambi output. And still add the 'short' bigrams to 'long' ldcalc so that TICCL-rank has the data necessary to do its job well.
Given the inordinate amount of possible pairs for 'ifle' in short.ldcalc, I am not sure the very large background corpus containing also ngrams helps rather than obfuscates the situation. It seems that we should boost the evidence of validated ngrams present in the foreground corpus where and how possible.
Yet one more 'new' issue that bothers me is the fact that capitalized word forms seem to have gained prominence in the corrections. This is due to the fact that TICCL-anahash sorts the anagrams collected alphabetically, it seems. If at all possible, these should rather be sorted by frequency.
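This is what one would expect from a plain code-point (byte-order) sort, in which uppercase ASCII letters precede lowercase ones. Below is a small Python sketch of the two orderings, using the frequencies of 'rules' and 'Rules' from the confidence-sorted CLIP1000 output above. That TICCL-anahash sorts this way is an assumption; it has not been verified in its code.

```python
# Frequencies taken from the confidence-sorted CLIP1000 output for 'nuiles'.
freq = {"rules": 1000152878, "Rules": 1000021220}

# Plain alphabetical (code-point) order puts the capitalized form first.
alphabetical = sorted(freq)

# Ordering by descending frequency puts the more frequent form first.
by_frequency = sorted(freq, key=freq.get, reverse=True)

print(alphabetical)  # ['Rules', 'rules']
print(by_frequency)  # ['rules', 'Rules']
```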
Another thing... This run had --low=4. Yet we find the couple 'ifles~riffles', word lengths 5 and 7 respectively, in short.ldcalc.
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^ifles~' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi |grep 'ifles~riffles'
ifles~riffles#ifles_of~riffles_of#
How does that happen?
Addressing C/ : I need to know.
Addressing D/ : I need to know, too.
Further to the ranking features: now that we have the foreground foci file, we should use it as another, strong ranking feature: if the CC is present in it, boost.
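One way to read 'boost' is as a leading sort key: CCs attested in the foreground come first, and confidence decides within each group. The Python sketch below only illustrates that reading. The foreground set is hypothetical, the confidence scores are those quoted above for 'nuiles', and the way TICCL-rank actually combines its ranking features is not modelled.

```python
def rank_with_foreground_boost(candidates, foreground):
    """Order (cc, confidence) pairs: foreground CCs first, then by confidence."""
    return sorted(
        candidates,
        key=lambda item: (item[0] in foreground, item[1]),
        reverse=True,
    )


# Confidence scores quoted above for 'nuiles'.
candidates = [("Niles", 0.999699), ("Tules", 0.999486), ("Miles", 0.998486)]

# Hypothetical: pretend only 'Miles' occurs in the foreground foci file.
foreground = {"Miles"}

ranked = rank_with_foreground_boost(candidates, foreground)
print([cc for cc, _ in ranked])  # ['Miles', 'Niles', 'Tules']
```

A hard leading key like this is the strongest possible boost; a weighted contribution to the confidence score would be a milder alternative.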
Following up on mainly A/ and B/: I will post recommendations for remedial work asap.
MRE