Main two things currently wrong with TICCL-LDcalc and TICCL-rank (and two more gripes...)

The following is a long explanation of things going wrong currently. It offers no possible solutions yet. These will follow asap. I am trying to figure out the 'easiest fix'.

----------

A/ We have recently adapted TICCL-rank to the needs of the new TICCL-chain by making it sort its best-first ranked output file (parameter --clip=1) numerically descending on the frequency of the Correction Candidate (CC). This has broken the correct working of TICCL-rank.

B/ We have also quite recently made TICCL-LDcalc output 'short' correction pairs to a new output file *short.ldcalc, and the ngrams from which the short correction pairs were derived to a new file with extension 'ambi'. This creates further problems for TICCL-rank, as we shall explain later.

C/ Furthermore, we do not know if the new ranking feature based on the number of observed ngrams in which a particular word form appears is in fact operational in TICCL-LDcalc yet.

D/ We remain handicapped by the fact that we do not have an exhaustive description of the full ranking system as currently implemented in TICCL-LDcalc and TICCL-rank.

Addressing A/ : We have for a while been under the impression that TICCL 'just' misses the most obvious Correction Candidate. We think we now have found the cause for this.

We present output from TICCL-rank run with respectively --clip=1, --clip=5 and --clip=10 on TICCL-LDcalc output on the English book by Morse.

In CLIP5 we see clearly that the CCs are ranked according to their frequency and no longer according to the confidence score. In fact the highest confidence score is with the fifth ranked CC. In CLIP10 we see that the highest confidence score in CLIP5 is outranked by the even higher confidence score of CC 'Niles'.

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.ranked 
nuiles#1#Naples#4000030272#2#0.998194

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked 
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP10.ranked 
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#Dulles#2000005022#2#0.998645
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Wiles#2000001183#2#0.998052
nuiles#1#Nines#2000000883#2#0.998734
nuiles#1#Ailes#2000000578#2#0.999088

When we look at the appropriately sorted output of CLIP1000 we see that 'Niles' in fact has the highest confidence score. The now 'best' ranked top 10 CCs have swamped the actual desired correction 'miles'. Its capitalized version 'Miles', which was present in CLIP5, is now out of sight, too.

Current TICCL output (incorrectly sorted by CC frequency) for non-word word form 'nuiles':

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP1000.ranked | sort -gr -t '#' -k 4 |head -n 10
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#Dulles#2000005022#2#0.998645
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Wiles#2000001183#2#0.998052
nuiles#1#Nines#2000000883#2#0.998734
nuiles#1#Ailes#2000000578#2#0.999088

Output as should be sorted by highest confidence:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP1000.ranked | sort -gr -t '#' -k 6 |head -n 10
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Tules#2000000029#2#0.999486
nuiles#1#nuclei#1000008297#2#0.999478
nuiles#1#rules#1000152878#2#0.99946
nuiles#1#Rules#1000021220#2#0.999433
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#nails#1000009554#2#0.999203
nuiles#1#Suites#1705034559#2#0.999194
nuiles#1#Nilus#1000000335#2#0.999176
nuiles#1#Yules#2000000019#2#0.999097

Anyway, the main thing is that currently even the best-first ranked CC offered with CLIP1 is not the one with the highest confidence score, but the one with the highest frequency, which is plainly wrong. This is an undesired artefact of the resorting implemented for TICCL-chain.
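
To make the difference between the two sort keys concrete, here is a minimal Python sketch that parses the ten 'nuiles' records listed above and picks the 'best' CC once by CC frequency and once by confidence. The field names are my own labels: the listings only establish that field 4 is the CC frequency and field 6 the confidence score; reading field 2 as the variant frequency and field 5 as the Levenshtein distance is an assumption. This is not TICCL-rank's actual code.

```python
from typing import NamedTuple


class RankedRecord(NamedTuple):
    variant: str       # field 1: word form to be corrected
    variant_freq: int  # field 2 (assumed meaning: variant frequency)
    candidate: str     # field 3: Correction Candidate (CC)
    cc_freq: int       # field 4: CC frequency
    ld: int            # field 5 (assumed meaning: Levenshtein distance)
    confidence: float  # field 6: confidence score


def parse_ranked(line: str) -> RankedRecord:
    """Split one '#'-separated *.ranked line into typed fields."""
    v, vf, cc, cf, ld, conf = line.strip().split("#")
    return RankedRecord(v, int(vf), cc, int(cf), int(ld), float(conf))


# The ten 'nuiles' records with the highest CC frequency, copied from above.
LINES = """\
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#Dulles#2000005022#2#0.998645
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Wiles#2000001183#2#0.998052
nuiles#1#Nines#2000000883#2#0.998734
nuiles#1#Ailes#2000000578#2#0.999088
"""

records = [parse_ranked(line) for line in LINES.splitlines()]

# What --clip=1 currently delivers: the CC with the highest frequency.
best_by_freq = max(records, key=lambda r: r.cc_freq)
# What it should deliver: the CC with the highest confidence score.
best_by_conf = max(records, key=lambda r: r.confidence)

print(best_by_freq.candidate, best_by_conf.candidate)  # Naples Niles
```

The point of the sketch is only that clipping and sorting use two different keys: the clip should be taken on confidence, whatever order TICCL-chain needs afterwards.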

We see much the same, though the result is less wrong -- here the most confident score is given to the right correction, for 'Amarican':

TICCL sorted output:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked |more
Amarican#1#America#4000475833#2#0.996842
Amarican#1#American#3001522167#1#0.998421
Amarican#1#Americas#3000025187#2#0.995474
Amarican#1#Américas#3000000831#2#0.991158
Amarican#1#African#2000256933#2#0.993263

Output resorted descendingly by confidence:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked |sort -gr -t '#' -k 6 |more
Amarican#1#American#3001522167#1#0.998421
Amarican#1#America#4000475833#2#0.996842
Amarican#1#Americas#3000025187#2#0.995474
Amarican#1#African#2000256933#2#0.993263
Amarican#1#Américas#3000000831#2#0.991158

Nevertheless: the 'best-first ranked' candidate without parameter --clip is still the one obtained by highest frequency sorting:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.ranked |more
Amarican#1#America#4000475833#2#0.996842
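
A quick way to see how widespread this is would be to scan a --clip=N ranked file and flag every variant whose first-listed CC is not its most confident one. Below is a minimal Python sketch of such a check, run on the two CLIP5 listings quoted above. The function name is hypothetical, and the check assumes only that field 3 is the CC and field 6 the confidence score.

```python
from collections import defaultdict

# CLIP5 records for 'Amarican' and 'nuiles', copied from the listings above.
CLIP5 = """\
Amarican#1#America#4000475833#2#0.996842
Amarican#1#American#3001522167#1#0.998421
Amarican#1#Americas#3000025187#2#0.995474
Amarican#1#Américas#3000000831#2#0.991158
Amarican#1#African#2000256933#2#0.993263
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993
"""


def first_vs_most_confident(lines):
    """Return {variant: (first listed CC, most confident CC)} where they differ."""
    by_variant = defaultdict(list)
    for line in lines:
        fields = line.strip().split("#")
        if len(fields) != 6:
            continue  # skip anything that is not a six-field ranked record
        by_variant[fields[0]].append((fields[2], float(fields[5])))

    mismatches = {}
    for variant, ccs in by_variant.items():
        first = ccs[0][0]                       # CC the file ranks first
        best = max(ccs, key=lambda c: c[1])[0]  # CC with the highest confidence
        if first != best:
            mismatches[variant] = (first, best)
    return mismatches


mismatches = first_vs_most_confident(CLIP5.splitlines())
for variant, (first, best) in mismatches.items():
    print(f"{variant}: ranked first = {first}, most confident = {best}")
```

On these ten records both variants are flagged: 'America' versus 'American' for 'Amarican', and 'Naples' versus 'suites' for 'nuiles' (within CLIP5, 'suites' has the highest confidence; 'Niles' only appears from CLIP10 onwards).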


Addressing B/ : In prior runs without the foci file curtailed to the foreground corpus only we found that 'tire' is often a confusable for 'the'. We are rather surprised that that is still the case, although many more pairs representing this pair seem now to have been properly filtered out on the basis of their frequencies, i.e. these being validated word form pairs. We now see that in some cases this still happens, which is in itself another issue to be addressed. (This may be because capitalized word forms did not get the artifrq, at least in some of these cases).

Example:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tire~the' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi
tire~the#first_tire~First_the#first_tire~first_the#tire_Great_Kanhaway~the_Great_Kanhaway#tire_Great_Kanhaway~the_great_Kanhaway#tire_Guisos_Mexico~the_Guisos_Mexico#tire_Guisos_Mexico~the_guisos_Mexico#tire_Guisos~the_Guisos#tire_Guisos~the_guisos#tire_Milliiippi~the_Milliiippi#tire_life~the_LIFE#tire_life~the_Life#tire_life~the_life#

As stated before, we are not currently attempting to solve confusables. But this example allows us to explain the issue currently at hand.

The short forms have duly been added to the *short.ldcalc file, as we have recently decided to do. Here the pair 'tire~the' is the first of the last nine of 52 such 'confusable' pairs for 'tire' in *short.ldcalc.

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tire' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.short.ldcalc |tail -n 9
tire~0~0~the~0~0~0~2~2~0~1~0~0~12
tire~0~0~tides~1000007728~1000009881~0~2~3~1~1~0~0~1
tire~0~0~ties~0~0~0~2~2~0~1~0~0~5
tire~0~0~tin~0~0~0~2~2~0~1~0~0~1
tire~0~0~tis~0~0~0~2~2~0~1~0~0~1
tire~0~0~toe~0~0~0~2~2~0~1~0~0~1
tire~0~0~tone~0~0~0~2~2~0~1~0~0~1
tire~0~0~wine~0~0~0~2~2~0~0~0~0~2
tire~0~0~wise~0~0~0~2~2~0~0~0~0~1

[Another new issue which seems to have popped up the last week or so (as a consequence of one of the latest adjustments to the work flow) is here apparent: for lots of these pairs the usual information such as frequencies etc. is now missing.]
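
To quantify how many pairs are affected, a small check along the following lines could be run over *short.ldcalc. This is a Python sketch under an assumption: by analogy with lines such as 'tire_as~44~55~Tijeras~1000000109~1000000109~...' in the long ldcalc file, I take fields 5 and 6 of the 14 '~'-separated fields to be the CC frequency columns. I have not verified that against the TICCL-LDcalc source, and the function name is hypothetical.

```python
# The nine 'tire' records from *short.ldcalc, copied from the listing above.
SHORT_LDCALC = """\
tire~0~0~the~0~0~0~2~2~0~1~0~0~12
tire~0~0~tides~1000007728~1000009881~0~2~3~1~1~0~0~1
tire~0~0~ties~0~0~0~2~2~0~1~0~0~5
tire~0~0~tin~0~0~0~2~2~0~1~0~0~1
tire~0~0~tis~0~0~0~2~2~0~1~0~0~1
tire~0~0~toe~0~0~0~2~2~0~1~0~0~1
tire~0~0~tone~0~0~0~2~2~0~1~0~0~1
tire~0~0~wine~0~0~0~2~2~0~0~0~0~2
tire~0~0~wise~0~0~0~2~2~0~0~0~0~1
"""

EXPECTED_FIELDS = 14  # number of '~'-separated fields in the lines shown


def lacks_cc_frequency(line: str) -> bool:
    """True if both assumed CC frequency columns (fields 5 and 6) are zero."""
    fields = line.strip().split("~")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    return fields[4] == "0" and fields[5] == "0"


lines = SHORT_LDCALC.splitlines()
# Collect the CC (field 4) of every record without CC frequency information.
flagged = [line.split("~")[3] for line in lines if lacks_cc_frequency(line)]

print(f"{len(flagged)} of {len(lines)} records lack CC frequencies: {flagged}")
```

In this sample only the 'tides' record carries CC frequencies. Note that even the pair with 'the', a very frequent word form, has none.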

The issue we are inching towards is this: short word forms may well be 'properly' handled by *short.ldcalc and *ambi, but other pairs based on the actual bigram (mostly, if not exclusively, we suspect) are still incorporated in the regular 'long' *ldcalc file. (We no longer see the actual 'tire_land' and 'tire_bay' examples we had a couple of weeks ago; the first delivered e.g. CCs 'Ireland' and 'fireland' in the long ldcalc file.) But these examples are clear enough (granted: they should not be there by virtue of the frequencies of their composing words alone):

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tire_' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc 
tire_as~44~55~Tijeras~1000000109~1000000109~23318930336~2~5~1~1~1~0~0
tire_as~44~55~Treas~1000000098~1000000124~23803623657~2~5~1~1~1~0~0
tire_as~44~55~treas~1000000026~1000000124~23803623657~2~5~1~1~1~0~0
tire_on~266~266~Ireson~1000000092~1000000092~14834306838~2~5~1~0~1~0~0
tire_on~266~266~Tiron~1000000084~1000000084~23207337056~2~5~1~1~1~0~0
tire_on~266~266~Treon~1000000041~1000000041~23803623657~2~5~1~1~1~0~0
tire_or~65~65~TREVOR~105~2000018302~5512626967~2~5~1~1~1~0~0
tire_or~65~65~Trevor~2000018197~2000018302~5512626967~2~5~1~1~1~0~0
tire_to~170~187~Tirito~1000000000~1000000000~10444521431~2~5~1~1~1~0~0

A non-word example concerns 'ifle':

We have 596 pairs containing this non-word in short.ldcalc.

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ cat /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.short.ldcalc |grep '^ifle~' |wc
    596     596   21098

For the probably correct resolution 'rifle' we have the following evidence:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^ifle~rifle' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi
ifle~rifle#The_ifle_is~the_rifle_is#The_ifle~The_rifle#The_ifle~the_rifle#and_the_ifle~and_the_rifle#ifle_is~rifle_is#ifle_of~rifle_of#ifle_on_the~rifle_on_the#ifle_on~rifle_on#ifle_or~rifle_or#small_ifle~small_rifle#the_ifle_of~the_rifle_of#the_ifle~The_rifle#the_ifle~the_rifle#
    
'Long' LDcalc nevertheless still retains a number of 'ifle' bigrams.

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^ifle_' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc 
ifle_is~1~1~Ifles~1~4~23803623657~2~5~0~1~0~0~0
ifle_is~1~1~ifles~3~4~23803623657~2~5~0~1~0~0~0
ifle_on~1~1~Flemon~1000000002~1000000002~2800207081~2~5~1~0~1~0~0
ifle_on~1~1~Fleron~11~11~9778106350~2~5~0~0~1~0~0
ifle_on~1~1~Flexon~44~44~9234745750~2~5~0~0~1~0~0
ifle_on~1~1~Isleton~1000000052~1000000052~11088909372~2~5~1~1~1~0~0
ifle_or~1~1~Flexor~118~1128~9234745750~2~5~0~0~1~0~0
ifle_or~1~1~flexor~1010~1128~9234745750~2~5~0~0~1~0~0

The problem with these is that TICCL-rank misses the possibly likeliest resolution, which is in short.ldcalc, and will rank the rest, probably delivering a False Positive.
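
The overlap can be made visible by cross-referencing the two files: for each bigram variant in 'long' ldcalc, look up whether the ambi file already records an ngram resolution for it. Below is a minimal Python sketch on the 'ifle' data shown above. The function name is hypothetical, and the reading of the ambi format (a 'variant~CC' head followed by '#'-separated 'ngram~ngram' evidence pairs) is inferred from the example line only.

```python
from collections import defaultdict

# The 'ifle~rifle' line from *.ldcalc.ambi, copied from above.
AMBI_LINE = (
    "ifle~rifle#The_ifle_is~the_rifle_is#The_ifle~The_rifle#The_ifle~the_rifle#"
    "and_the_ifle~and_the_rifle#ifle_is~rifle_is#ifle_of~rifle_of#"
    "ifle_on_the~rifle_on_the#ifle_on~rifle_on#ifle_or~rifle_or#"
    "small_ifle~small_rifle#the_ifle_of~the_rifle_of#the_ifle~The_rifle#"
    "the_ifle~the_rifle#"
)

# The 'ifle_' records from the 'long' *.ldcalc file, copied from above.
LONG_LDCALC = """\
ifle_is~1~1~Ifles~1~4~23803623657~2~5~0~1~0~0~0
ifle_is~1~1~ifles~3~4~23803623657~2~5~0~1~0~0~0
ifle_on~1~1~Flemon~1000000002~1000000002~2800207081~2~5~1~0~1~0~0
ifle_on~1~1~Fleron~11~11~9778106350~2~5~0~0~1~0~0
ifle_on~1~1~Flexon~44~44~9234745750~2~5~0~0~1~0~0
ifle_on~1~1~Isleton~1000000052~1000000052~11088909372~2~5~1~1~1~0~0
ifle_or~1~1~Flexor~118~1128~9234745750~2~5~0~0~1~0~0
ifle_or~1~1~flexor~1010~1128~9234745750~2~5~0~0~1~0~0
"""


def parse_ambi(line: str):
    """Return ((variant, cc), {ngram variant: set of ngram corrections})."""
    parts = [p for p in line.strip().split("#") if p]  # drop the trailing empty part
    variant, cc = parts[0].split("~")
    evidence = defaultdict(set)
    for pair in parts[1:]:
        ngram_variant, ngram_cc = pair.split("~")
        evidence[ngram_variant].add(ngram_cc)
    return (variant, cc), dict(evidence)


(short_variant, short_cc), evidence = parse_ambi(AMBI_LINE)

# Bigram variants that TICCL-rank sees in 'long' ldcalc without ever being
# offered the resolution already recorded in the ambi file.
long_variants = {line.split("~")[0] for line in LONG_LDCALC.splitlines()}
overlooked = {v: evidence[v] for v in sorted(long_variants) if v in evidence}

for variant, resolutions in overlooked.items():
    print(f"{variant}: ambi has {sorted(resolutions)}, not offered in long ldcalc")
```

For all three bigrams ('ifle_is', 'ifle_on', 'ifle_or') the ambi file holds the 'rifle' reading, while none of the CCs in long ldcalc for them contains 'rifle'.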

I am not sure what would be best to do about this. I think for now we should keep both the short.ldcalc and ambi output, and still add the 'short' bigrams to 'long' ldcalc so that TICCL-rank has the data necessary to do its job well.

Given the inordinate amount of possible pairs for 'ifle' in short.ldcalc, I am not sure the very large background corpus containing also ngrams helps rather than obfuscates the situation. It seems that we should boost the evidence of validated ngrams present in the foreground corpus where and how possible.

Yet one more 'new' issue that bothers me is the fact that capitalized word forms seem to have gained prominence in the corrections. This is due to the fact that TICCL-anahash sorts the anagrams collected alphabetically, it seems. If at all possible, these should rather be sorted by frequency.

Another thing... This run had --low=4. Yet we find the couple 'ifles~riffles', word lengths 5 and 7 respectively, in short.ldcalc.

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^ifles~' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi |grep 'ifles~riffles'
ifles~riffles#ifles_of~riffles_of#

How does that happen?

Addressing C/ : I need to know.

Addressing D/ : I need to know, too.

Further to the ranking features: now that we have the foreground foci file, we should use it as another, strong ranking feature: if the CC is present in it, boost.

Following up on mainly A/ and B/: I will post recommendations for remedial work asap.

MRE
