Once the text document has been parsed and made available for further analysis, the raw output is bulky and unwieldy. The parsed document is therefore broken down into words or terms known as tokens; this process is called tokenization. During tokenization, all punctuation, numbers, special characters, stop words, and extra white space are removed, and the entire text is lowercased.
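As a rough illustration of this step, the following base R sketch lower-cases a string, replaces everything except the letters a to z with spaces, and splits the result into tokens. The function name `tokenize_simple` and the sample sentence are invented for this example, and the sketch assumes plain ASCII English text; it does not remove stop words, which are handled separately later.

```r
# Minimal tokenization sketch in base R (hypothetical helper, invented input).
tokenize_simple <- function(text) {
  text <- tolower(text)                        # lower-case the whole string
  text <- gsub("[^a-z]+", " ", text)           # non-letters (punctuation, digits, spaces) become one space
  tokens <- strsplit(trimws(text), " +")[[1]]  # split on runs of spaces
  tokens[nzchar(tokens)]                       # drop any empty strings
}

tokens <- tokenize_simple("The Parsed document, in 2016, had 3 Sections!")
print(tokens)
```

Because the regular expression keeps only a to z, accented or non-Latin characters would also be dropped; a real corpus may need a more careful character class.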
After tokenization, the parsed document is represented as a list of tokens/terms (Karamcheti 2010, pp 21-26).

· Remove Punctuation
After tokenization, the document is available as a set of terms/words, and further pre-processing stages must be applied to this corpus. Punctuation provides grammatical context that supports understanding, and it may be essential later for extracting meaning, but it is often ignored in initial analyses (Ben Larsonr, 2016). In this thesis, we used the tm_map() function with removePunctuation() to remove punctuation from the corpus, because punctuation is not needed for the subsequent analysis.

· Remove Numbers
Numbers may or may not be relevant to this research. This transformation simply removes them.
We therefore used tm_map() with the removeNumbers() function to remove the numbers from the corpus.

· Convert text to lower-case
Computers do not treat ‘D’ and ‘d’ as the same letter, even though a human reader does. A word that starts with a capital letter is therefore treated as a different word from the same word written entirely in lower case, so a ‘D’ does not match the ‘d’ found in the list of stop words to remove. General character processing functions in R can be used to transform our corpus.
A typical requirement is therefore to map the documents to lower case, using tm_map() with tolower(). Such general R functions need to be wrapped in content_transformer() (Ben Larsonr, 2016).

· Remove stop words
After converting the whole document to lower case, the next issue to tackle is stop words. Stop words are common words found in a language. The most common words in text documents are articles, prepositions, and pronouns, which do not convey the meaning of the documents (Vijayarani, Ilamathi, and Nithya 2015).
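Before going further into stop words, the punctuation, number, and lower-case transformations described above can be sketched together in R. This is a minimal example under stated assumptions rather than the thesis's actual script: the two sample documents are invented, and the final stripWhitespace() step is an addition not mentioned in the text, included only to tidy the gaps left by the removed characters.

```r
library(tm)

# Two invented example documents; the thesis corpus itself is not shown here.
docs <- c("Data Mining, in 2016, was POPULAR!",
          "R has 3 useful Packages: tm, NLP.")
corpus <- VCorpus(VectorSource(docs))

corpus <- tm_map(corpus, removePunctuation)             # strip punctuation marks
corpus <- tm_map(corpus, removeNumbers)                 # strip digits
corpus <- tm_map(corpus, content_transformer(tolower))  # base R function, so it is wrapped
corpus <- tm_map(corpus, stripWhitespace)               # assumption: collapse leftover spaces

cleaned <- unname(sapply(corpus, as.character))
print(cleaned)
```

tolower() is wrapped in content_transformer() because it is a general character function rather than a tm transformation, whereas removePunctuation() and removeNumbers() can be passed to tm_map() directly.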
These words are frequent but provide little information, so they should be removed from the text. Typical English stop words include “I”, “she’ll”, and “the”. In the “tm” package, there are 400 – 500 stop words on this standard list (see footnote 1). In practice, an analysis will likely need to add to this list.
Leaving in frequent words that add no insight causes them to be overemphasized in a frequency analysis, which usually leads to a biased interpretation of the results. Along with the standard English stop words, we can use an extended list that excludes any other words in our corpus that do not provide useful information. The c() function allows us to add new words (separated by commas) to the stop-word list. Once we have a list of stop words that makes sense, we apply the removeWords() function to our text. removeWords() takes two arguments: the text object to which it is applied and the list of words to remove. Figure 11 presents the extended list of stop words used to remove further words from our corpus.
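The stop-word step can be sketched as follows. The extra words "also" and "however" and the sample sentence are hypothetical and chosen only for illustration; the extended list actually used in this thesis is the one in Figure 11. The whitespace clean-up at the end uses base R and is an added convenience, not part of the procedure described above.

```r
library(tm)

# Hypothetical additions to the standard list (not the thesis's Figure 11 list).
extra_words <- c("also", "however")
all_stop_words <- c(stopwords("english"), extra_words)

# Invented, already lower-cased example text.
corpus <- VCorpus(VectorSource("the model is also accurate however it is slow"))

# removeWords() receives the text object and the list of words to remove.
corpus <- tm_map(corpus, removeWords, all_stop_words)

# Removed words leave gaps, so collapse repeated spaces and trim the ends.
result <- trimws(gsub(" +", " ", as.character(corpus[[1]])))
print(result)
```

Lower-casing the corpus first matters here, because removeWords() matches the listed words as written and the standard list is in lower case.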
1 Standard set of English stop words, Available from: