Formalizing Informal Text using Natural Language Processing
English is a funny language, they say. Be that as it may, learning English has always been a challenging task for non-native speakers. They find it difficult to grasp the nuances and subtleties that are obvious to native speakers. The crux of any language is its grammar, and the obscurity of English grammar does not make the language any easier to understand. However, things are changing. Globalization has pushed non-native speakers to improve their English proficiency by watching movies, reading books, or listening to news and sports commentary. They have no problem with informal conversations, messages, or tweets; it is formal English, with all its grammatical intricacies, that still perplexes them. This case study is an attempt to design a system that converts informal English text into formal English text using Natural Language Processing techniques.
The system should be able to generate formal English text, given its informal form. The informal text can be anything, from a random tweet to a comic extract. The formal output should preserve the meaning intended by its informal counterpart while being syntactically correct. The system should correct grammar, punctuation, and, if possible, even capitalization. For instance,
what r ya talking abt
What are you talking about?
Informal text here refers to text with many abbreviations and misspelled words, so it is imperative that the system correct these words. The system should generate the formal text in real time to avoid latency problems. Also, while producing grammatically correct text, it should not change the meaning of the underlying text, as that would be counterproductive to the task at hand.
There is a real scarcity of publicly available normalized datasets. Fortunately, the NUS Social Media Text Normalization and Translation Corpus can be used for this case study. The corpus was created for social media text normalization and translation, and was built by randomly selecting 2,000 messages from the NUS English SMS corpus. The messages were first normalized into formal English and then translated into formal Chinese; we can ignore the Chinese translations. The corpus is publicly available for download.
While technological advancements have sped up research in natural language processing, that research has largely concentrated on areas such as word embeddings and neural machine translation. As a result, text formalization as an application of NLP is still in its nascent stage. Below is a list of some pioneering works on text formalization.
1. Sentence Correction using Recurrent Neural Networks:
This work was published by Gene Lewis of Stanford University with the same objective as this case study. The author proposed embedding the ASCII characters found in the text as 94-dimensional one-hot vectors to create a character-level model; word-level embeddings were also used so that both approaches could be compared. The embedding vectors were learned while training the model. The model is composed of Long Short-Term Memory (LSTM) cells capable of capturing implicit relationships between the underlying characters. Two variations of recurrent neural networks were tried, with one and two hidden layers respectively. For estimating output probabilities, the author chose a softmax layer, as it produces a normalized probability for each character. For training, mini-batch stochastic gradient descent was used with the cross-entropy loss function, which we propose here as well.
The results were reported in terms of perplexity values, with the 1-layer model achieving 5.014 on the character-level version and 1.533 on the word-level version. The 2-layer model achieved perplexity values of 5.278 on the character-level version and 1.490 on the word-level version. These values were significantly lower than the baseline unigram model, indicating that the models were indeed capable of converting text into standard English. The author concluded that more data would allow the network to translate more accurately.
2. Dear Sir or Madam, May I Introduce the GYAFC Dataset: Corpus, Benchmarks and Metrics for Formality Style Transfer:
This monumental work in the style-transfer domain was published by Sudha Rao and Joel Tetreault at Grammarly to address data scarcity by compiling a dataset of nearly 110k pairs of formal and informal sentences. Considerable effort went into converting the raw informal sentences with the help of subject-matter experts. The informal text came from a Yahoo question-answer corpus, covering the Entertainment & Music and Family & Relationships domains in equal proportions. Because the two domains have varying degrees of formality, the authors reported results on each domain separately. Various neural machine translation models, such as encoder-decoder models with attention mechanisms, were trained on the dataset. The models were evaluated by ranking them on criteria such as formality, fluency, and meaning preservation.
The aim of this project was to kickstart the field of formality style transfer by building a benchmark dataset, which the authors achieved impressively.
3. Harnessing Pre-Trained Neural Networks with Rules for Formality Style Transfer:
This paper was published by Yunli Wang et al. The authors proposed a system that takes advantage of earlier rule-based approaches to formality style transfer. They preprocessed the input text with simple rules, such as capitalizing the first letter and fixing misspellings, and appended the result to the underlying informal text with an <EOS> token to form the input sequence. They benchmarked their results on Grammarly's Yahoo Answers Formality Corpus (GYAFC), a large dataset of informal-formal sentence pairs. The model was not trained from scratch; instead, state-of-the-art models such as GPT-2 were fine-tuned on the preprocessed data. The evaluation metrics were the BLEU score, which measures n-gram overlap, and PINC, an auxiliary metric indicating the dissimilarity between the output and input sentences; a PINC score of 0 means the input and output are identical. With improved scores over plain GPT-2 models, the authors showed that training on the concatenation of the informal text and its rule-preprocessed form is indeed more beneficial than training on the plain informal text alone.
4. An Empirical Analysis of Formality in Online Communication:
This work by Ellie Pavlick and Joel Tetreault classifies English text by its formality level on an ordinal 7-point Likert scale, with labels from -3 (very informal) to 3 (very formal). The text was gathered from community question-answer forums, blogs, emails, and news. Annotation was crowdsourced, with each sentence's label averaged over 5 different annotators. The authors derived 11 features per sentence, such as case, punctuation, readability, subjectivity, and part-of-speech tags, to train the classifier. Performance was evaluated using the Spearman correlation coefficient between predictions and human annotations, against various baselines such as sentence length, the Flesch scale, F-score, 3-gram perplexity, and a baseline ridge regression classifier. The classifier outperformed all the baseline models, showing that the formality of a sentence can be quantified using machine learning techniques.
5. Multi-Task Neural Models for Translating Between Styles Within and Across Languages:
This work by Xing Niu et al. improved on the earlier phenomenal work of Rao and Tetreault. The goal was to design a multi-task learning model that can perform both bi-directional English formality transfer (FT) and French-to-English translation with a desired formality (FSMT). It was trained jointly on monolingual formality-transfer data and bilingual translation data, and evaluated both automatically and by humans. The modeling involved three designs that jointly perform FT and FSMT via shared encoders and decoders. The NMT architecture consisted of a bi-directional encoder with a single LSTM layer (Bahdanau et al., 2015) of size 512, multilayer-perceptron attention with a layer size of 512, and word representations of size 512. The BLEU score improved over the earlier work of Rao and Tetreault by about 2 to 3%.
Augmenting the Dataset:
As we are limited to only 2,000 labeled data points, we can augment the dataset to address this scarcity. For our purpose, synonym augmentation and spelling augmentation are suitable techniques, and a text-augmentation library can handle both. For each formal sentence in the dataset, we will first add synonym-augmented pairs to get 4,000 instances. On top of that, we will apply spelling augmentation to get 8,000 instances in total.
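To make the idea concrete without depending on any particular library, here is a minimal spelling-augmentation sketch: it perturbs words by swapping adjacent characters, a crude stand-in for the typo models a real augmentation library provides. The pair list and probabilities are illustrative, not the actual corpus.

```python
import random

def spelling_augment(sentence, swap_prob=0.1, seed=42):
    """Perturb a sentence by swapping adjacent characters inside longer
    words, mimicking common typing mistakes (a simple stand-in for a
    full augmentation library)."""
    rng = random.Random(seed)
    augmented = []
    for word in sentence.split():
        chars = list(word)
        if len(chars) > 3 and rng.random() < swap_prob * len(chars):
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        augmented.append("".join(chars))
    return " ".join(augmented)

# Augment only the informal side; the formal target stays fixed,
# so each augmentation doubles the number of training pairs.
pairs = [("what r ya talking abt", "What are you talking about?")]
augmented_pairs = pairs + [(spelling_augment(inf), fml) for inf, fml in pairs]
```

The same pattern applies to synonym augmentation: generate a variant of each sentence and pair it with the unchanged formal target.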
The dataset is structured as the informal text, followed by its formal correction, and then the Chinese translation, each on a new line. We will build a pandas DataFrame with two columns, Informal text and Formal text, by reading the data's text file. As the text is collected from messages and general conversations, we will preserve case, punctuation, and stopwords.
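A loader for this three-lines-per-example layout can be sketched as follows (the column names match the DataFrame described above; the exact file formatting of the released corpus may differ slightly):

```python
import pandas as pd

def load_corpus(path):
    """Read a corpus stored as repeating triples of lines:
    informal text, its formal normalization, and a Chinese
    translation (which we discard)."""
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    records = [
        {"Informal text": lines[i], "Formal text": lines[i + 1]}
        for i in range(0, len(lines) - 2, 3)
    ]
    return pd.DataFrame(records)
```

Note that no lowercasing or stopword removal happens here, in line with the decision to preserve case, punctuation, and stopwords.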
The model we will design is known as a sequence-to-sequence model, since we provide a text sequence as input and expect a text sequence as output. Sequence-to-sequence models come in two flavors based on the level of tokenization: word level and character level. On experimenting with both, we find that the character-level tokenizer achieves much better results, which is to be expected given the size of the dataset, so we will focus on it in this case study.
Now, the encoder input should be wrapped with start-of-sentence and end-of-sentence tokens so the encoder knows the span of each sentence. We can use the '<' and '>' tokens for initiation and termination respectively. For the decoder, the input should be prepended with the '<' token, and the output should be appended with the '>' token. For example,
Encoder input: <I’m thai. what do u do?>
Decoder input: <I’m Thai. What do you do?
Decoder output: I’m Thai. What do you do?>
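The wrapping described above can be expressed as a small helper (the token choices mirror the '<' and '>' markers used in this case study):

```python
def make_sequences(informal, formal, sos="<", eos=">"):
    """Wrap a sentence pair with start/end-of-sentence markers,
    producing the three sequences the seq2seq model consumes."""
    encoder_inp = sos + informal + eos
    decoder_inp = sos + formal   # decoder sees the target shifted right
    decoder_out = formal + eos   # and predicts it shifted left
    return encoder_inp, decoder_inp, decoder_out

enc, dec_in, dec_out = make_sequences(
    "I'm thai. what do u do?", "I'm Thai. What do you do?"
)
```

The one-character offset between decoder input and output is what lets the model learn to predict the next character at every timestep (teacher forcing).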
Before we split the data, we will look at the distribution of lengths of encoder_inp, decoder_inp, and decoder_out to get an idea of the input shape into which we will need to embed our data.
As we can see, most sentences are around 50 characters long, and almost all are shorter than 200 characters. Hence, we can filter out sentences longer than 200 characters for uniformity. We can now split the data into train, validation, and test sets. As we have little data, we will use roughly a 90:5:5 split so that more data is available for training the model.
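The filtering and the 90:5:5 split can be sketched as below (the fractions and the 200-character cutoff come from the discussion above; the function names are illustrative):

```python
def filter_and_split(pairs, max_len=200, val_frac=0.05, test_frac=0.05):
    """Drop pairs whose either side exceeds max_len characters, then
    split the remainder roughly 90:5:5 into train/validation/test."""
    pairs = [(i, f) for i, f in pairs if len(i) <= max_len and len(f) <= max_len]
    n = len(pairs)
    n_test = max(1, int(n * test_frac))
    n_val = max(1, int(n * val_frac))
    train = pairs[: n - n_val - n_test]
    val = pairs[n - n_val - n_test : n - n_test]
    test = pairs[n - n_test :]
    return train, val, test
```

Shuffling the pairs before slicing (e.g. with `random.shuffle`) would avoid any ordering bias in the corpus.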
Tokenizing and Padding data:
Tokenizing the data means encoding the sentences as numbers: each token is assigned a unique id from the vocabulary, so a sentence is encoded as the sequence of ids of the tokens occurring in it. We will create two tokenizers, one each for the informal and formal data. Padding refers to appending a common id (generally 0) to make all the sentences the same length. As we saw earlier, we can fix the sentence length at 200 characters.
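A framework-independent sketch of character-level tokenizing and padding (a Keras `Tokenizer` with `char_level=True` plus `pad_sequences` would do the same job):

```python
def build_char_tokenizer(texts):
    """Map every character seen in the corpus to a unique integer id;
    id 0 is reserved for padding."""
    vocab = sorted(set("".join(texts)))
    return {ch: i + 1 for i, ch in enumerate(vocab)}

def tokenize_and_pad(texts, char2id, max_len=200):
    """Encode each sentence as a list of character ids, truncated or
    padded with trailing zeros to a fixed length."""
    sequences = []
    for text in texts:
        ids = [char2id.get(ch, 0) for ch in text][:max_len]
        sequences.append(ids + [0] * (max_len - len(ids)))
    return sequences
```

Reserving id 0 for padding is what later allows the masked loss to ignore padded positions.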
Designing the Simple Encoder-Decoder Network:
The seq2seq model consists of two subnetworks: the encoder and the decoder. The encoder receives sequences from the source language as inputs and produces a compact representation of the input sequence, trying to summarize or condense all of its information. That output then becomes the input, or initial state, of the decoder, which can also receive another external input. At each time step, the decoder generates an element of its output sequence based on the input received and its current state, while also updating its own state for the next time step. Here's a simple pictorial representation of how our model will work.
The encoder will take sequential character embeddings of the source sentences as input at each time step and encode their information into an encoded vector using the current input and the LSTM hidden state. Hence, at the output of the encoder, we get an encoded vector of the source sentence, which can be thought of as a latent information vector.
The decoder is designed in the same way as the encoder: it is a chain of LSTM units where the initial hidden state of the first unit is the encoder vector, and each subsequent unit accepts the hidden state from the previous unit.
Designing Encoder-Decoder Model:
Now that we have the encoder and decoder models, we can integrate them into the encoder-decoder model. We will add a dense output layer whose output is passed through a softmax function to obtain a probability for every token in the output vocabulary.
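A minimal tf.keras sketch of this architecture is given below. The vocabulary size, embedding dimension, and LSTM size are illustrative assumptions, not the tuned values used in this case study:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 100   # assumed character-vocabulary size
EMB_DIM = 64       # assumed embedding dimension
UNITS = 256        # assumed LSTM size
MAX_LEN = 200      # padded sequence length from earlier

# Encoder: embed the informal character ids and keep only the
# final LSTM states as the "encoded vector" of the sentence.
enc_inputs = layers.Input(shape=(MAX_LEN,), name="encoder_inp")
enc_emb = layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(enc_inputs)
_, state_h, state_c = layers.LSTM(UNITS, return_state=True)(enc_emb)

# Decoder: start from the encoder states and predict the formal
# character at each timestep through a softmax output layer.
dec_inputs = layers.Input(shape=(MAX_LEN,), name="decoder_inp")
dec_emb = layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(dec_inputs)
dec_seq, _, _ = layers.LSTM(UNITS, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c]
)
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(dec_seq)

model = Model([enc_inputs, dec_inputs], outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",
)
```

A `ReduceLROnPlateau` callback passed to `model.fit` would implement the learning-rate lowering described in the training section.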
Training the Encoder-Decoder Model:
We can now train the encoder-decoder model using sparse categorical cross-entropy as the loss function and the Adam optimizer with an initial learning rate of 0.0001, which can be lowered on plateau using a callback. The model converges after 27 epochs, achieving a validation loss of 0.5212 with the simple encoder-decoder model.
Regarding evaluation, we get a BLEU score of 0.454 on the test set. This is by no means a score good enough to deploy the model, but the test set contains only about 90 sentences. This score can serve as a baseline for future models.
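Sentence-level BLEU can be computed with NLTK as sketched below; the smoothing function is an assumption on my part to avoid zero scores on short sentences, and may differ from the exact scoring used for the numbers reported here:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu(reference, prediction):
    """Sentence-level BLEU between one reference and one prediction,
    on whitespace tokens, with smoothing for short sentences."""
    smooth = SmoothingFunction().method1
    return sentence_bleu(
        [reference.split()], prediction.split(), smoothing_function=smooth
    )
```

Averaging `bleu` over all test pairs gives the test-set score, and the per-sentence values give the distribution plotted below.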
The distribution shows that the model achieves a BLEU score of around 0.6 for a majority of the sentences. Let us generate a random prediction using this model.
The model corrected the words 'wat', 'r', and 'ya' to 'What', 'are', and 'you' respectively, along with capitalizing the first letter. It also correctly introduced '?' at the end. More importantly, though, the prediction is not meaningful or convincing. This issue can be overcome by training the model on a larger dataset.
While the simple encoder-decoder seq2seq model works well for shorter sequences, it struggles badly with longer ones. This is because the output token at a particular timestep might depend on a token the encoder parsed a while back, but the decoder only sees the encoder's final state. This is where the attention model comes in. It introduces a simple architecture between the encoder and decoder that enables the decoder to consider weighted encoder outputs from all previous timesteps.
The figure above shows the mechanism of the attention-based encoder-decoder model. As you can see, the encoder differs from the simple encoder-decoder model only in that it returns an output for every timestep along with the hidden state. All the encoder outputs are fed to the attention model, where weights corresponding to each encoder output are calculated, enabling the decoder to focus on certain tokens while making predictions. At each timestep, the decoder then combines the previous output token with the attention-weighted context of the encoder outputs to make its prediction.
The crucial part of the attention model is calculating the weights of the encoder output tokens, also known as attention weights. These weights are computed using specific scoring functions. We will consider three scoring functions in this case study, namely Dot, General, and Concat.
Hence, in total, we will train three Attention-based Encoder-Decoder models using Dot, General, and Concat scoring functions.
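The three scoring functions follow Luong-style attention. A NumPy sketch makes their differences explicit (the toy sizes are arbitrary; in the real model these operate on LSTM states):

```python
import numpy as np

def softmax(scores):
    """Normalize scores into attention weights that sum to 1."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def dot_score(dec_state, enc_outputs):
    """Dot: score(h_t, h_s) = h_t . h_s"""
    return enc_outputs @ dec_state

def general_score(dec_state, enc_outputs, W):
    """General: score(h_t, h_s) = h_t . (W h_s), with a learned W."""
    return (enc_outputs @ W.T) @ dec_state

def concat_score(dec_state, enc_outputs, W, v):
    """Concat: score(h_t, h_s) = v . tanh(W [h_t ; h_s])"""
    T = enc_outputs.shape[0]
    stacked = np.concatenate([np.tile(dec_state, (T, 1)), enc_outputs], axis=1)
    return np.tanh(stacked @ W.T) @ v

# Toy example: 6 encoder timesteps, hidden size 4.
rng = np.random.default_rng(0)
h_t = rng.normal(size=4)        # decoder hidden state at the current step
H = rng.normal(size=(6, 4))     # encoder outputs, one row per timestep
attn_weights = softmax(dot_score(h_t, H))
```

Note that with W equal to the identity matrix, the general score reduces exactly to the dot score; the extra parameters are what let the general model outperform it later.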
The encoder will remain the same as in the simple encoder-decoder model, encoding the source sentence into a latent information vector, with the one difference that its per-timestep outputs are now retained and passed to the attention model.
Designing Attention Model:
The attention model takes two inputs, the decoder hidden state of the previous timestep and the encoder outputs, and calculates the attention weights.
Designing Timestep Decoder:
At each timestep, the timestep decoder concatenates the decoder's output from the previous timestep with the context computed from the attention weights produced by the attention model.
The decoder model simply calls the timestep decoder at each timestep and generates the final output tokens.
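The per-step computation can be sketched in NumPy as follows (in the real model the result feeds a decoder LSTM cell; the function name and sizes are illustrative):

```python
import numpy as np

def decoder_timestep_input(prev_token_emb, enc_outputs, attn_weights):
    """Build the input for one decoder step: the context vector is the
    attention-weighted sum of all encoder outputs, concatenated with
    the embedding of the previously generated token."""
    context = attn_weights @ enc_outputs              # (hidden,)
    return np.concatenate([prev_token_emb, context])  # (emb + hidden,)

# Toy example: 3 encoder timesteps with hidden size 4, embedding size 2.
enc_outputs = np.arange(12.0).reshape(3, 4)
weights = np.array([0.2, 0.3, 0.5])
step_input = decoder_timestep_input(np.array([1.0, -1.0]), enc_outputs, weights)
```

Looping this step over the target length, feeding each predicted token back in, is exactly what the decoder model does.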
Designing Final Model Architecture:
The attention-based encoder-decoder model receives the tuple of input sequences and composes the Encoder, Attention, Timestep Decoder, and Decoder models using the subclassing API.
Training the Attention Model using Dot Scoring function:
To train the model, we will first use the dot scoring function together with a custom loss function that masks the padded zeros of the sequences while calculating the loss, giving a more reliable measure. For more clarity on this, refer to the TensorFlow documentation. Training with the Adam optimizer at an initial learning rate of 0.0001, lowered on plateau using a callback, the model converges after 23 epochs, achieving a validation loss of 0.3774, which is better than that of the simple encoder-decoder model.
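The masked loss idea can be sketched framework-independently: positions whose target id is 0 (the padding id) contribute nothing, and the loss is averaged over real tokens only. A TensorFlow version would use `tf.math.not_equal` and `SparseCategoricalCrossentropy` in the same shape:

```python
import numpy as np

def masked_loss(y_true, y_pred_probs, eps=1e-9):
    """Sparse categorical cross-entropy that ignores padding: targets
    with id 0 are masked out of both the sum and the average."""
    y_true = np.asarray(y_true)
    mask = (y_true != 0).astype(float)
    # probability the model assigned to each true token
    picked = y_pred_probs[np.arange(len(y_true)), y_true]
    return (-np.log(picked + eps) * mask).sum() / mask.sum()

# Toy example: three timesteps over a 3-token vocabulary,
# where the third target position is padding.
probs = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.9, 0.05, 0.05]])
loss = masked_loss([1, 2, 0], probs)
```

Without the mask, confident predictions on padding would dilute the loss and make it look better than it is.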
The model achieves a BLEU score of 0.505 on the test set, which is significantly better than that of the baseline encoder-decoder model. Let us check the distribution of the BLEU scores.
The distribution shows that the model with the dot scoring function achieves a BLEU score of around 0.7 for the majority of the sentences. Let us generate a random prediction using this model.
The model corrected the word 'how' to 'How', capitalizing the letter 'h'. It also tried to correct 'u' to 'you' but instead incorrectly predicted 'your'. It also introduced the correct punctuation '?'. More importantly, though, the prediction is not meaningful or convincing. This issue can be overcome by training the model on a larger dataset.
Training the Attention Model using General Scoring function:
Now, we will train the model using the general scoring function and the custom loss function mentioned earlier. Training with the Adam optimizer at an initial learning rate of 0.0001, lowered on plateau using a callback, the model converges after 50 epochs, achieving a validation loss of 0.2196, which is better than both the simple encoder-decoder model and the model with the dot scoring function.
The model achieves a BLEU score of 0.5096 on the test set, which is significantly better than that of the baseline encoder-decoder model and marginally better than the model with the dot scoring function. Let us check the distribution of the BLEU scores.
The distribution shows that the model with the general scoring function achieves a BLEU score of around 0.6 for the majority of the sentences, somewhat better than the dot model. Let us generate a random prediction using this model.
As expected from the lower validation loss, and unlike the model with the dot scoring function, this model predicts the formal text quite accurately. It also correctly introduced capitalization and punctuation.
Training the Attention Model using Concat Scoring function:
Now, we will train the model using the concat scoring function and the custom loss function mentioned earlier. Training with the Adam optimizer at an initial learning rate of 0.0001, lowered on plateau using a callback, the model converges after 20 epochs, achieving a validation loss of 0.4160, which is better than the simple encoder-decoder model but worse than the model with the general scoring function.
The model achieves a BLEU score of 0.459 on the test set, which does not beat the model with the general scoring function. Let us check the distribution of the BLEU scores.
The distribution shows that the model with the concat scoring function achieves a BLEU score of around 0.65 for the majority of the sentences, somewhat better than the general model at the distribution level. Let us generate a random prediction using this model.
The model corrected the informal word 'r' to 'are' and also fixed the capitalization. More importantly, though, the prediction is not meaningful or convincing. This issue can be overcome by training the model on a larger dataset.
The model with the general scoring function performed best among the three. Now we will analyze its behavior on the test dataset by checking the best and worst predictions it made. For that, we sort the BLEU scores achieved on the test set and print the corresponding predictions.
Informal Input : Okay.
Expected Output : Ok.
Predicted Output : Ok.
Bleu Score of Prediction : 1.0
Informal Input : So what are you doing now? Can 1 pastime you to a walk? See? Then Iodin maybe pass in by where you bequeath later.
Expected Output : So what are you doing now? Can I interest you to a walk? See? Then I maybe pass in by where you leave later.
Predicted Output : No. I just want to watch that someone ask you a few dollars on Friday. So if you want to go for the party.
Bleu Score of Prediction : 0.7071067811865476
Informal Input : You prefer other day's onr lyou want Monday em Thursday soy you can comming directly? Or DO you have fire days? You choose, because you aire the only on studying.
Expected Output : You prefer other days or you want Monday and Thursday so you can come directly? Or do you have free days? You choose, because you are the only one studying.
Predicted Output : You are having a good time too. Thanks for the dinner. I just want to watch movies. I was thinking of asking you and have to do then going?
Bleu Score of Prediction : 0.7001600475627672
Informal Input : Okeh, and so later call me.
Expected Output : Ok, then later call me.
Predicted Output : Ok. I want to there later.
Bleu Score of Prediction : 0.0
Informal Input : Okay. Then dont be late ah
Expected Output : Okay. Then don't be late.
Predicted Output : Ok. I am not crazy.
Bleu Score of Prediction : 0.0
Informal Input : How ' s they shopping?
Expected Output : How's the shopping?
Predicted Output : How about you and message me and sit with you then.
Bleu Score of Prediction : 0.0
The important observation is that, for both the best and worst predictions, the model is capable of correcting misspellings, capitalization, and punctuation. The predictions with a higher BLEU score have more words overlapping with the ground truth. The worst predictions occur on instances with many misspellings and incorrect capitalizations, to which the model is sensitive. Nevertheless, the model is trained on very little data and hence has a lot of scope for improvement with larger datasets like the GYAFC corpus.
The models with the dot and concat scoring functions did not generate satisfactory predictions, but the model with the general scoring function performed exceptionally well and also achieved the lowest validation loss. While data augmentation and the introduction of attention significantly improve performance, it can be further improved with a larger dataset. For now, though, the model with the general scoring function is suitable for our deployment purposes.
Deployment Using Flask Server:
For deploying the application, we can use a Flask API on any of the cloud services. For simplicity, let us deploy our app on a local server. Before deployment, ensure that there are no latency constraints at run time; if latency is unbearable, try reducing the model complexity or applying post-training quantization. The app given below is a Python script that loads the trained model with the general scoring function, along with its trained hyperparameters, and makes predictions.
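The skeleton of such an app can be sketched as follows; the route name is illustrative, and the `formalize` function is a placeholder where a real deployment would load the saved general-scoring model and run its decoding loop:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def formalize(text):
    """Placeholder for the trained model's prediction function; a real
    deployment would load the saved weights here and decode greedily."""
    return text  # identity stand-in

@app.route("/formalize", methods=["POST"])
def formalize_endpoint():
    # Accept {"text": "..."} and return the model's formal version.
    data = request.get_json(force=True)
    informal = data.get("text", "")
    return jsonify({"informal": informal, "formal": formalize(informal)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Loading the model once at startup, rather than per request, keeps the per-request latency down to a single forward pass.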
Now, it is time to see our app in action.
As discussed throughout the case study, the biggest challenge in formality style transfer is gathering a labeled dataset; there are few publicly available datasets other than this one. The real power of deep learning shows when models are trained on large datasets, so a large corpus like GYAFC (which, unfortunately, I could not get access to) might produce much better results.
Update: I have found the NAIST Lang-8 Learner Corpora which, although more oriented towards grammar correction, we can use for our purpose. If time permits, I will try the same. Stay tuned.
References:
Sentence Correction using Recurrent Neural Networks
Effective Approaches to Attention-based Neural Machine Translation
Attention is all you need
Character-level text generation with LSTM
Neural machine translation with attention
Applied AI Course
Thanks for reading the article and I hope you liked it as much as I did writing it.
Informal to Formal Sentence Converter