Implementation

Simple Normalisation

This section outlines the steps involved in the normalisation process performed by the simple_normalise function.

Normalisation Steps:

Lowercase Text: Convert the entire text to lowercase.
Remove Commas: Eliminate commas from the text.
Remove Undesirable Characters: Remove undesirable characters from the text.
Remove Duplicate Contiguous Characters: Remove duplicate contiguous characters from the text.
Add Space Around Punctuation: Ensure there are spaces around punctuation marks, while preserving slashes and hyphens in certain cases.
Anonymise Sentence: Anonymise the sentence, replacing sensitive information.
Remove Extra Spaces: Eliminate any extra spaces between words.
Fix Typos: Correct typos using the corrections dictionary.
Tokenise: Tokenise the text into a list of words.
Align Tenses to Present Tense: Adjust verb tenses to present tense using _to_present_tense.
Convert Nouns to Singular Form: Singularise words using _singularise.
Recreate Text from Tokens: Reconstruct the normalised text based on processed tokens.

For more detail on individual steps, see the sections that follow.
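The simpler text-level steps in the list can be sketched as follows. This is a minimal illustration, not the actual simple_normalise: the function name simple_normalise_sketch, the set of characters treated as undesirable, the choice to collapse only repeated punctuation, and the rule of always leaving slashes and hyphens attached are all assumptions. Typo fixing, anonymisation, tense alignment and singularisation are left out.

```python
import re


def simple_normalise_sketch(text: str) -> list[str]:
    """Hypothetical sketch of the text-level steps; returns the tokens."""
    text = text.lower()            # Lowercase Text
    text = text.replace(",", "")   # Remove Commas
    # Assumption: "undesirable" means anything other than letters, digits,
    # whitespace and a small set of punctuation marks.
    text = re.sub(r"[^a-z0-9\s.!?;:/\-]", "", text)
    # Assumption: only repeated punctuation is collapsed ("!!" -> "!"),
    # so legitimate double letters such as "ll" are left alone.
    text = re.sub(r"([^a-z0-9\s])\1+", r"\1", text)
    # Add spaces around sentence punctuation; slashes and hyphens stay
    # attached to their words (assumption: in every case, not just some).
    text = re.sub(r"([.!?;:])", r" \1 ", text)
    # Remove Extra Spaces
    text = re.sub(r"\s+", " ", text).strip()
    # Tokenise on whitespace
    return text.split()
```

For example, simple_normalise_sketch("Pump, is BROKEN!!  Check a/c.") yields the tokens pump, is, broken, !, check, a/c and a final full stop.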

Fix Typos

This step involves correcting typos in a given text based on a provided mapping dictionary.

Original Text	Corrected Text
pummp is Broken	pump is Broken
a/c leakin	air conditioner leak
wall was craked	wall was cracked
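A dictionary lookup per token is enough to reproduce the rows above. The sketch below assumes whitespace tokenisation and a case-sensitive lookup; the function name fix_typos_sketch and the contents of EXAMPLE_CORRECTIONS are hypothetical and only mirror the table.

```python
# Hypothetical mapping, built from the example rows only.
EXAMPLE_CORRECTIONS = {
    "pummp": "pump",
    "a/c": "air conditioner",
    "leakin": "leak",
    "craked": "cracked",
}


def fix_typos_sketch(text: str, corrections: dict[str, str]) -> str:
    """Replace each whitespace-separated token found in `corrections`."""
    return " ".join(corrections.get(token, token) for token in text.split())
```

Tokens that are not in the mapping, such as "Broken", pass through untouched, which is why the first row keeps its capital letter.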

Anonymise Sentence

This function anonymises a given sentence by replacing potential asset identifiers with the placeholder “AssetID”.

The pattern used to identify asset identifiers is defined as follows:

\b: Represents a word boundary, ensuring that the pattern is treated as a distinct word.
[A-Za-z]*: Matches zero or more alphabetical characters, covering cases where an asset identifier might start with letters.
\d+: Matches one or more numeric characters, ensuring patterns with numbers are identified.
[A-Za-z]*: Matches zero or more alphabetical characters, covering cases where an asset identifier might end with letters.
\b: Another word boundary to ensure the end of the matched pattern is treated as a distinct word.
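Assembled into a single regular expression, the components above read \b[A-Za-z]*\d+[A-Za-z]*\b. The sketch below applies it with Python's re module; the names ASSET_ID_PATTERN and anonymise_sentence_sketch are illustrative, not taken from the implementation.

```python
import re

# Word boundary, optional letters, one or more digits, optional letters,
# word boundary.
ASSET_ID_PATTERN = re.compile(r"\b[A-Za-z]*\d+[A-Za-z]*\b")


def anonymise_sentence_sketch(sentence: str, placeholder: str = "AssetID") -> str:
    """Replace every token matching the asset identifier pattern."""
    return ASSET_ID_PATTERN.sub(placeholder, sentence)
```

Because the letter groups are optional, a bare number also matches: "replace 2 bearings" becomes "replace AssetID bearings".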

Align Tenses to Present Tense

This section describes the rule-based steps that the _to_present_tense function applies to normalise words to present tense.

Pass through all words listed in the corrections dictionary unchanged.
If a word ends in -ing or -ed but is listed as a non-verb, pass it through unchanged.
Obtain the stem form by removing the -ing or -ed ending.
If the original word has only one syllable (that is, the stem consists only of consonants), pass it through unchanged.
Dealing with One-Syllable Root Words:
- One-syllable -ll stems
- Two-syllable -ying words
- One-syllable -yed words
- One-syllable -ied words
- Stem consists of vowel + consonant
Dealing with Patterns Applying Over Most Letters:
- Stems ending in a doubled consonant typically drop the last letter.
  Example: hopping/hopped => hop

  Exceptions: -ff, -ll, -ss, -zz
- Stems ending in a consonant + vowel + consonant pattern gain a final “e”.
  Example: hoping/hoped => hope

  Exceptions: -r, -w, -x, -y
Dealing with Two Vowel Syllable Division Exceptions:
- Handle cases when vowel sounds remain distinct.
Dealing with Vowel Digraphs Exceptions:
- Handle cases when vowel sounds blend together.
Alphabetical Handling of Exceptions:
- Go down exceptions alphabetically from the end of the stem word.
- Deal with double letters -ff, -ll, -ss, and -zz.
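The two general patterns (doubled consonant, and consonant + vowel + consonant) can be sketched as below. This is a partial illustration under stated assumptions, not the actual function: the name to_present_tense_sketch and the passthrough parameter (standing in for the corrections dictionary and non-verb list) are hypothetical, "y" is treated as a consonant in the pattern checks, and the one-syllable, two-vowel and vowel digraph exceptions are not implemented.

```python
VOWELS = set("aeiou")
KEEP_DOUBLE = ("ff", "ll", "ss", "zz")  # doubled endings that are kept
NO_E_AFTER = set("rwxy")                # final letters that do not take "e"


def to_present_tense_sketch(word: str, passthrough: frozenset = frozenset()) -> str:
    """Partial sketch: suffix stripping plus the two general patterns."""
    if word in passthrough:  # corrections dictionary and non-verbs
        return word
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            break
    else:
        return word  # no -ing/-ed ending
    # A stem with no vowel sound (e.g. "th" from "thing") is passed through.
    if not any(ch in "aeiouy" for ch in stem):
        return word
    # Doubled consonant: drop the last letter unless it is -ff/-ll/-ss/-zz.
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        return stem if stem[-2:] in KEEP_DOUBLE else stem[:-1]
    # Consonant + vowel + consonant: add "e" unless the stem ends in r/w/x/y.
    if (len(stem) >= 3 and stem[-3] not in VOWELS
            and stem[-2] in VOWELS and stem[-1] not in VOWELS):
        return stem if stem[-1] in NO_E_AFTER else stem + "e"
    return stem
```

Without the exception handling listed above, this sketch mishandles words such as "opening" (which it turns into "opene") or "need", so it should be read only as an illustration of the two general rules.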

Convert Nouns to Singular Form

This section describes the rule-based steps that the _singularise function applies to convert words to singular form.

Pass through all words listed in the corrections dictionary.
Short words (3 letters or fewer) are not singularised.
Words ending in “es” are handled based on specific rules:
- Special case words ending in -ices change to either -ex or -ix, depending on the word. e.g., indices => index, matrices => matrix
- Special case words ending in -ses change to -sis. e.g., analyses => analysis
- Other words ending in -ses, -xes, -zes, -ches or -shes drop the “es”. e.g., boxes => box
- Words ending in -ies and having a length greater than 4 change to -y. e.g., families => family
- Words ending in -oes drop the “es”. e.g., potatoes => potato
- Words ending in -ives change to -fe. e.g., knives => knife
- Words ending in -ves change to -f. e.g., leaves => leaf
- Exception words ending in -ves drop only the “s”. e.g., detectives => detective
- Other words ending in -es drop the “s”.
Dealing with plural words ending in -a:
- Special case words ending in -a change to -um. e.g., data => datum
- Special case words ending in -a change to -on. e.g., criteria => criterion
Dealing with plural words ending in -i:
- Words ending in -i change to -us. e.g., radii => radius
Words ending in -ys and preceded by a vowel change to -y. e.g., boys => boy
Words ending in -ss remain unchanged, including stems left after dropping “es”. e.g., glass => glass
Words ending in -s and not preceded by “u” or “i” drop the “s”.
- Example: cars => car, dogs => dog, radius => radius, tennis => tennis
- Exceptions: nouns that end in -as
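The rules above can be sketched as an ordered chain of suffix checks. This is a minimal illustration rather than the actual function: the name singularise_sketch, the passthrough parameter (standing in for the corrections dictionary), and the contents of IRREGULAR and VES_EXCEPTIONS are assumptions drawn only from the examples in this section.

```python
# Hypothetical special-case tables, filled from the examples above.
IRREGULAR = {
    "indices": "index", "matrices": "matrix", "analyses": "analysis",
    "data": "datum", "criteria": "criterion", "radii": "radius",
}
VES_EXCEPTIONS = {"detectives"}


def singularise_sketch(word: str, passthrough: frozenset = frozenset()) -> str:
    """Sketch of the singularisation rules, applied in the listed order."""
    if word in passthrough or len(word) <= 3:
        return word
    if word in IRREGULAR:  # -ices, -ses => -sis, -a and -i special cases
        return IRREGULAR[word]
    if word.endswith("es"):
        if word.endswith(("ses", "xes", "zes", "ches", "shes")):
            return word[:-2]          # boxes => box
        if word.endswith("ies") and len(word) > 4:
            return word[:-3] + "y"    # families => family
        if word.endswith("oes"):
            return word[:-2]          # potatoes => potato
        if word.endswith("ves"):
            if word in VES_EXCEPTIONS:
                return word[:-1]      # detectives => detective
            if word.endswith("ives"):
                return word[:-3] + "fe"  # knives => knife
            return word[:-3] + "f"    # leaves => leaf
        return word[:-1]              # other -es words drop the "s"
    if word.endswith("ss"):
        return word                   # glass => glass
    # General -s rule; it also covers vowel + "ys" (boys => boy).
    if word.endswith("s") and word[-2] not in "ui" and not word.endswith("as"):
        return word[:-1]
    return word
```

The order of the checks matters: the special cases must be tested before the generic -ses and -ves rules, otherwise "analyses" and "detectives" would be caught by the wrong branch.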