### Disinflections

Posted by Angela Brett in News on May 12, 2021

I enjoy taking words that have irregular inflections and inflecting other words the same way: for instance, saying \*squoke as the past tense of squeak, analogous with speak and spoke, or even \*squought, analogous with seek and sought. Sometimes those disinflections, as I’ve decided to call them, look or sound like other words. For instance, analogous with fly, flew, and flown, I could use crew and crown as past tenses of cry, or boo and bone as past tenses of buy. Indeed, analogous with buy and bought, the past tense of fly could be \*flought, but then again, perhaps the present tense of bought could be ‘batch’, or ‘beak’, or ‘bite’, analogous with caught and catch, or sought and seek, or fought and fight.

## The Disinflectant app

For a while now, I’ve wanted to make an app to find these automatically, and now that I have a bit of free time, I’ve made a prototype, mostly reusing code I wrote to generate the rhyme database for Rhyme Science. I’m calling the app Disinflectant for now. Here’s what it does:

**Read words from a file and group them by lemma.** Words with the same lemma are usually related, though since this part is using text only, if two distinct lemmas are homographs (words with the same spelling but different meanings) such as bow🎀, bow🏹, bow🚢, and bow🙇🏻‍♀️, then they’re indistinguishable. This part is done using the Natural Language framework (henceforth referred to as ‘the lemmatiser’), so I didn’t write any complicated rules to do this.

**Find out the pronunciation of the word, as text representing phonemes.** This is done using the text-to-speech framework, so again, nothing specific to Disinflectant. The pronunciation is given in phoneme symbols defined by the API, not IPA.

**Find all the different ways that words with the same lemma can be transformed into another** by switching a prefix or suffix for another. For instance:

| Transform type | Transform | By analogy with |
|---|---|---|
| Spelling suffix | y→own | fly→flown |
| Pronunciation suffix | IYk→AOt | seek→sought |
| Spelling prefix | e→o | eldest→oldest |
| Pronunciation prefix | 1AW→w1IY | our→we’re |

Most prefixes in English result in words with different lemmas, so Disinflectant didn’t find many prefix transforms, and the ones it found didn’t really correspond to any actual grammatical inflection. I had it prefer suffixes over prefixes, and only add a prefix transform if there is no suffix found, so that bus→buses would result in the spelling suffix transform ∅→es and not the prefix transform bu→buse.
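To make the suffix-before-prefix rule concrete, here is a minimal sketch in Python. It is not Disinflectant’s actual code: the function name `derive_transform` is made up for illustration, and it assumes that ‘no suffix found’ means the two words share no leading letters at all.

```python
from os.path import commonprefix  # compares strings character by character


def derive_transform(source, target):
    """Return (kind, old, new) describing how `source` becomes `target`.

    Assumption: a suffix transform is preferred whenever the words share
    any leading letters; a prefix transform is only tried otherwise.
    """
    stem = commonprefix([source, target])
    if stem:
        return ("suffix", source[len(stem):], target[len(stem):])
    # No shared beginning, so look for a shared ending instead.
    tail = commonprefix([source[::-1], target[::-1]])[::-1]
    if tail:
        return (
            "prefix",
            source[:len(source) - len(tail)],
            target[:len(target) - len(tail)],
        )
    return None  # nothing in common at either end


print(derive_transform("bus", "buses"))      # suffix transform, '' -> 'es'
print(derive_transform("fly", "flown"))      # suffix transform, 'y' -> 'own'
print(derive_transform("eldest", "oldest"))  # prefix transform, 'e' -> 'o'
```

The same idea would work on pronunciations if each word were represented as a list of phoneme symbols rather than a string of letters.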

Each transform can apply to multiple pairs of real words. I included a way to label each transform with something like ‘past tense’, so the app could ask, ‘why isn’t crew the past tense of cry?’ but didn’t end up filling in any of them, so it just calls them all inflections.

**Apply each transform individually to each word, and see whether the transformed version matches another word with a different lemma.**

It could just make up words such as ‘squoke’, but then there would be hundreds of millions of possibilities and they wouldn’t be very interesting to sift through, so it’s better to look for real words that match.
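A rough sketch of this matching step, in Python and restricted to spelling suffixes, might look like the following. The word list and lemmas here are a hypothetical toy example, and the helper names are invented for illustration rather than taken from the app.

```python
def apply_suffix(word, old, new):
    """Swap the ending `old` for `new`, or return None if `word` doesn't end in `old`."""
    if not word.endswith(old):
        return None
    return word[:len(word) - len(old)] + new


def find_disinflections(lemma_of, transforms):
    """Yield (word, result, transform) where `result` is a real word with a different lemma."""
    for word, lemma in lemma_of.items():
        for old, new in transforms:
            result = apply_suffix(word, old, new)
            # Only keep results that are real words and are not already related.
            if result in lemma_of and lemma_of[result] != lemma:
                yield word, result, (old, new)


# Hypothetical toy data: each word mapped to its lemma.
lemma_of = {"fly": "fly", "flown": "fly", "cry": "cry", "crown": "crown", "buy": "buy"}
transforms = [("y", "own")]  # derived from fly -> flown

found = list(find_disinflections(lemma_of, transforms))
print(found)
```

With this toy data, fly→flown is skipped because the two words share a lemma, and buy→bown is skipped because ‘bown’ is not in the list, leaving only cry→crown.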

That’s it. Really just four steps of collecting and comparing data, with all the linguistic heavy lifting done by existing frameworks.

### The limitations

Before I show you some of the results, here are some limitations:

- **So far I’ve only given it a word list, and not a text corpus.** This means that it can’t distinguish words which have different lemmas or different pronunciations depending on context (such as ‘moped’ in ‘she moped around’, with the lemma ‘mope’, vs. ‘she rode around on her moped’, with the lemma ‘moped’). I have code to work with corpora to add homographs to rhyme.science, but I haven’t tried it in this app yet.
- **It’s only working with prefixes and suffixes.** So it might think ‘woke’ should be the past tense of ‘weak’ (by analogy with ‘speak’ and ‘spoke’) but won’t generalise that to, say, ‘slope’ as the past tense of ‘sleep’ unless there is another word ending in a p sound to model it on. I could fairly easily have it look for infix transforms as well, but haven’t done so yet.
- **It doesn’t distinguish between lemmas which are spelled the same**, as mentioned above.
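For the second limitation, one possible way to look for infix transforms is to strip whatever two pronunciations share at the start and at the end, and treat the differing middle as the transform. The Python sketch below is only an illustration of that idea under simplifying assumptions: pronunciations are plain lists of phoneme symbols with the stress markers left out, and both function names are hypothetical.

```python
def infix_transform(source, target):
    """Strip the shared start and end of two phoneme lists; return the (old, new) middles."""
    shortest = min(len(source), len(target))
    start = 0
    while start < shortest and source[start] == target[start]:
        start += 1
    end = 0
    # Don't let the shared ending overlap the shared beginning.
    while end < shortest - start and source[-1 - end] == target[-1 - end]:
        end += 1
    return source[start:len(source) - end], target[start:len(target) - end]


def apply_infix(phonemes, old, new):
    """Replace the first occurrence of `old` inside `phonemes`, or return None if absent."""
    for i in range(len(phonemes) - len(old) + 1):
        if phonemes[i:i + len(old)] == old:
            return phonemes[:i] + new + phonemes[i + len(old):]
    return None


speak = ["s", "p", "IY", "k"]
spoke = ["s", "p", "OW", "k"]
old, new = infix_transform(speak, spoke)  # the vowel change IY -> OW

sleep = ["s", "l", "IY", "p"]
print(apply_infix(sleep, old, new))  # the phonemes of 'slope'
```

Because the transform no longer includes the final consonant, it carries over from speak→spoke to sleep→slope without needing a model word that ends in a p sound.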

## The results

For my first full test run, I gave it the SCOWL 40 list, with 60523 words, and (after about a day and a half of processing on my mid-2014 MacBook Pro; it’s not particularly optimised yet) it found 157687 disinflections. The transform that applied to the most pairs of actually-related words was adding a ‘z’ sound to the end of a word, as for a plural or possessive noun or third-person singular present-tense verb ending in a voiced sound. This applies to 7471 pairs of examples. The SCOWL list I used includes possessives of a lot of words, so that probably inflates the count for this particular transform. It might be interesting to limit it to transforms with many real examples, or perhaps even more interesting to limit it to transforms with only one example.
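Filtering by the number of real examples could be sketched like this in Python. The input format (a list of source, target, and transform triples) and the function names are assumptions made for the example, and the data is a tiny made-up sample rather than real output.

```python
from collections import defaultdict


def group_examples(found):
    """Map each transform to the list of real word pairs it was derived from."""
    examples = defaultdict(list)
    for source, target, transform in found:
        examples[transform].append((source, target))
    return dict(examples)


def with_example_count(examples, minimum=1, maximum=None):
    """Keep only transforms whose number of example pairs is within the given range."""
    return {
        transform: pairs
        for transform, pairs in examples.items()
        if len(pairs) >= minimum and (maximum is None or len(pairs) <= maximum)
    }


# Hypothetical toy input: (source, target, (old suffix, new suffix)).
found = [
    ("cat", "cats", ("", "s")),
    ("dog", "dogs", ("", "s")),
    ("fly", "flown", ("y", "own")),
]
examples = group_examples(found)
common = with_example_count(examples, minimum=2)  # transforms with many examples
unique = with_example_count(examples, maximum=1)  # transforms with only one example
print(common)
print(unique)
```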

I just had it log what it found, and when a transform applied to multiple pairs of words, pick a random pair to show for the ‘by analogy with’ part in parentheses. Here are some types of disinflections it found, roughly in order from least interesting to most interesting:

#### Words that actually are related, just not so much that they have the same lemma:

Some words are clearly derived from each other and maybe should have the same lemma; others just have related meanings and etymology.

- Why isn’t **shoppers** (S1AApIXrz) with lemma shopper the inflection of **shops** (S1AAps) with lemma shop? (by analogy with **lighter’s → light’s**)
- Why isn’t **constraint** (kIXnstr1EYnt) with lemma constraint the inflection of **constrain** (kIXnstr1EYn) with lemma constrain? (by analogy with **shopped → shop**)
- Why isn’t **diagnose** (d1AYIXgn1OWs) with lemma diagnose the inflection of **diagnosis** (d1AYIXgn1OWsIXs) with lemma diagnosis? (by analogy with **he → his**)
- Why isn’t **sieves** (s1IHvz) with lemma sieve the inflection of **sift** (s1IHft) with lemma sift? (by analogy with **knives → knifed**)
- Why isn’t **snort** (sn1AOrt) with lemma snort the inflection of **snored** (sn1AOrd) with lemma snore? (by analogy with **leapt → leaped**)

#### Words that definitely should have had the same lemma, for the same reason the words in the analogy do:

These represent bugs in the lemmatiser.

- Why isn’t **patrolwoman’s** (pIXtr1OWlwUHmIXnz) with lemma patrolwoman’s the inflection of **patrolwomen** (pIXtr1OWlwIHmIXn) with lemma patrolwomen? (by analogy with **patrolman’s → patrolmen**)
- Why isn’t **blacker** (bl1AEkIXr) with lemma black the inflection of **blacken** (bl1AEkIXn) with lemma blacken? (by analogy with **whiter → whiten**)

#### Transforms formed from words which have the same lemma, but probably shouldn’t:

These also probably represent bugs in the lemmatiser.

- Why isn’t **car** (k1AAr) with lemma car the inflection of **air** (1EHr) with lemma air? (by analogy with **can’t → ain’t**) *Both ‘can’t’ and ‘ain’t’ are given the lemma ‘not’. I don’t think this is correct, but it’s possible I’m using the API incorrectly or I don’t understand lemmatisation.*

#### Words that are related, but the lemmatiser was considering an unrelated homograph of one of the words, and the actual related word was not picked up because of the first limitation above:

- Why isn’t **skier’s** (sk1IYIXrz) with lemma skier the inflection of **skied** (sk1IYd) with lemma sky? (by analogy with **downer’s → downed**) *In this case, the text-to-speech read ‘skied’ as the past tense of ‘ski’, but the lemmatiser read it as the past participle of ‘sky’, as in ‘blue-skied’, which I think is a slightly obscure choice, and might be considered a bug in the lemmatiser.*
- Why isn’t **ground** (gr1AWnd) with lemma ground the inflection of **grinding** (gr1AYndIHN) with lemma grind? (by analogy with **rewound → rewinding**) *Here the lemmatiser is presumably reading it as the noun or verb ‘ground’ rather than the past and past participle of ‘grind’.*

#### Pronunciation transforms finding homophones of actual related words:

- Why isn’t **sheikhs** (S1EYks) with lemma sheikh the inflection of **shaking** (S1EYkIHN) with lemma shake? (by analogy with **outstrips → outstripping**) *‘Sheikhs’ sounds just like ‘shakes’, which is indeed the present tense or plural of ‘shake’.*
- Why isn’t **soled** (s1OWld) with lemma sole the inflection of **selling** (s1EHlIHN) with lemma sell? (by analogy with **sold → selling**) *‘Soled’ sounds just like ‘sold’, which is indeed the past tense of ‘sell’.*

#### Pronunciation transforms based on an incorrect pronunciation:

These represent bugs in the text-to-speech. Try them yourself on a Mac by setting the system voice to an older American English one such as Victoria, selecting the word, and choosing **Speech→Start Speaking** from the **Edit** menu or the contextual menu.

- Why isn’t **nape’s** (n1AEpIYz) with lemma nape the inflection of **nappy** (n1AEpIY) with lemma nappy? (by analogy with **suffocation’s → suffocation**) *The text-to-speech pronounces ‘nape’ correctly, but pronounces ‘napes’ like ‘naps’ and ‘nape’s’ like ‘nappies’.*
- Why isn’t **mice** (m1AYs) with lemma mouse the inflection of **me** (m1IY) with lemma I? (by analogy with **modernity’s → modernity**) *The text-to-speech pronounces ‘modernity’ correctly, but pronounces ‘modernity’s’ like ‘modernitice’.*
- Why isn’t **queue’s** (ky1UWz) with lemma queue the inflection of **cubing** (ky1UWbIHN) with lemma cubing? (by analogy with **lambs → lambing**) *The text-to-speech pronounces the ‘b’ in ‘lambing’. I’m not sure if there is an accent where this is the correct pronunciation, but it isn’t in the dictionaries I’ve checked.*

#### Small transforms that can be applied to many other words:

Sometimes it will find that a word with the same lemma can have one letter or phoneme changed or added, and then there are a huge number of words that the transform can apply to. I wonder whether you could change almost any final letter or phoneme to any other.

- Why isn’t **mine** (m1AYn) with lemma I the inflection of **mind** (m1AYnd) with lemma mind? (by analogy with **shoe → shod**)
- Why isn’t **ham** (h1AEm) with lemma ham the inflection of **hay** (h1EY) with lemma hay? (by analogy with **them → they**) *This one could also be extended to hair (from them → their) to get a full set of weird pronouns.*
- Why isn’t **hearth** (h1AArT) with lemma hearth the inflection of **heart** (h1AArt) with lemma heart? (by analogy with **sheikh → sheik**)
- Why isn’t **captor** (k1AEptIXr) with lemma captor the inflection of **captain** (k1AEptIXn) with lemma captain? (by analogy with **whiter → whiten**)
- Why isn’t **colt** (k1OWlt) with lemma colt the inflection of **coal** (k1OWl) with lemma coal? (by analogy with **shopped → shop**)

#### Spelling prefixes and suffixes that don’t quite correspond to how the inflections are formed:

Sometimes changes such as doubling the final consonant are made when an -ing or -ed is added. Since Disinflectant only sees this as a suffix being added, it thinks that specific consonant can also be added to words that end in other consonants.

- Why isn’t **braking** (br1EYkIHN) with lemma brake the inflection of **bra** (br1AA) with lemma bra? (by analogy with **picnicking → picnic**)
- Why isn’t **garbs** (g1AArbz) with lemma garbs the inflection of **garbling** (g1AArblIHN) with lemma garble? (by analogy with **corrals → corralling**)
- Why isn’t **badgering** (b1AEJIXrIHN) with lemma badger the inflection of **badge** (b1AEJ) with lemma badge? (by analogy with **transferring → transfer**)
- Why isn’t **bobsled** (b1AAbslEHd) with lemma bobsled the inflection of **bobs** (b1AAbz) with lemma bob? (by analogy with **patrolled → patrol**)
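The consonant-doubling effect is easy to reproduce with a naive suffix comparison. This Python sketch (with made-up helper names, and not the app’s own code) treats everything after the shared beginning of two words as the suffix, so the doubled or inserted consonant ends up inside the transform:

```python
from os.path import commonprefix  # compares strings character by character


def suffix_transform(source, target):
    """Return the (old, new) endings left after removing the shared beginning."""
    stem = commonprefix([source, target])
    return source[len(stem):], target[len(stem):]


def apply_suffix(word, old, new):
    """Swap the ending `old` for `new`, or return None if `word` doesn't end in `old`."""
    if not word.endswith(old):
        return None
    return word[:len(word) - len(old)] + new


# The extra 'k' in 'picnicking' becomes part of the suffix being removed.
old, new = suffix_transform("picnicking", "picnic")
print((old, new))
print(apply_suffix("braking", old, new))
```

A version that knew about doubling rules would have to treat the repeated consonant as belonging to the stem, which a purely string-based comparison like this one cannot do.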

#### Disinflections I might have come up with myself:

- Why isn’t **hay** (h1EY) with lemma hay the inflection of **highs** (h1AYz) with lemma high? (by analogy with **lay → lies**)
- Why isn’t **bowled** (b1OWld) with lemma bowl the inflection of **belling** (b1EHlIHN) with lemma bell? (by analogy with **sold → selling**)
- Why isn’t **bodies** (b1AAdIYz) with lemma body the inflection of **bodice** (b1AAdIXs) with lemma bodice? (by analogy with **emphases → emphasis**)
- Why isn’t **lease** (l1IYs) with lemma lease the inflection of **loosed** (l1UWst) with lemma loose? (by analogy with **geese → goosed**)
- Why isn’t **wield** (w1IYld) with lemma wield the inflection of **welt** (w1EHlt) with lemma welt? (by analogy with **kneeled → knelt**)
- Why isn’t **gauze** (g1AOz) with lemma gauze the inflection of **goo** (g1UW) with lemma goo? (by analogy with **draws → drew**)
- Why isn’t **cheese** (C1IYz) with lemma cheese the inflection of **chosen** (C1OWzIXn) with lemma choose? (by analogy with **freeze → frozen**)

#### Transforms based on abbreviations:

- Why isn’t **chuckle** (C1UXkIXl) with lemma chuckle the inflection of **chuck’s** (C1UXks) with lemma chuck? (by analogy with **mile → mi’s**)
- Why isn’t **cooperative’s** (kOW1AApIXrrIXtIHvz) with lemma cooperative the inflection of **cooper** (k1UWpIXr) with lemma cooper? (by analogy with **negative’s → neg**)
- Why isn’t **someday** (s1UXmdEY) with lemma someday the inflection of **some** (s1UXm) with lemma some? (by analogy with **Friday → Fri**)

#### Other really weird stuff I’d never think of:

- Why isn’t **comedy** (k1AAmIXdIY) with lemma comedy the inflection of **comedown** (k1UXmdAWn) with lemma comedown? (by analogy with **fly → flown**)
- Why isn’t **aisle** (1AYl) with lemma aisle the inflection of **meal** (m1IYl) with lemma meal? (by analogy with **I → me**)
- Why isn’t **hand** (h1AEnd) with lemma hand the inflection of **hens** (h1EHnz) with lemma hen? (by analogy with **manned → men’s**)
- Why isn’t **out** (1AWt) with lemma out the inflection of **wheat** (w1IYt) with lemma wheat? (by analogy with **our → we’re**)

If people are interested, once I’ve fixed it up a bit I could either release the app, or import a bigger word list and some corpora, and then publish the whole output as a CSV file. Meanwhile, I’ll probably just tweet or blog about the disinflections I find interesting.