ABSTRACT
Numerous systems and applications facilitate translation from English to other global and Indian languages. For many Indians living in remote regions, basic fluency in English, which is now a global necessity, is hard to attain. Furthermore, for many tourists visiting the country, a translation mechanism becomes essential, especially for roadside signboards and banners. Hence, a great deal of research has been carried out in this field. There exist many transfer-based machine translation (MT) systems, such as the MANTRA MT system, the Shakti MT system and the MATRA MT system. However, no significant work has been undertaken on translating Hindi text into English. This paper focuses on developing a translation tool using the transfer-based MT mechanism.
The system takes an input sentence in Hindi, analyses the individual word tokens within the structure of the sentence and uses grammar rules to generate the final translated sentence in English. What makes this approach better than a rule-based/corpus-based machine translation system is that the latter requires elaborate knowledge of the lexicon of the language, which is not an available resource, whereas the former requires only knowledge of the source and target languages to make the transfer rules. (Accuracy reports).
2. INTRODUCTION
English, being the dominant global language today, is required in a variety of domains. However, since Hindi is the most widely spoken language in the country, a vast population remains unfamiliar with the grammar and semantics of English.
Thus, there is a need for a machine translation application that bridges the gap between these two languages. Machine translation, as a domain, reflects extensive and exhaustive work. MANTRA, an initiative by CDAC Pune, is an application for translating documents written in English into Hindi. MANTRA translates English text into Hindi in the specific domain of Personnel Administration, namely Gazette Notifications, Office Orders, Office Memorandums and Circulars. The strategy adopted is not word-to-word and not rule-to-rule, but lexical tree to lexical tree.
MANTRA uses the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent both the English and the Hindi grammar. Furthermore, there is the Punjabi-to-Hindi MT system developed by G S Josan and G S Lehal, which is based on the direct word-to-word MT approach. This system comprises modules for pre-processing, word-to-word translation using a Punjabi-Hindi lexicon, morphological analysis, word sense disambiguation, transliteration and post-processing. The accuracy of the translation produced by this system is 90.67%.
There is also Shakti (developed in 2003), an initiative by Bharati, R Moona, P Reddy, B Sankar, D M Sharma and R Sangal, which translates English text into any Indian language using a simple system architecture. It combines a linguistic rule-based approach with a statistical approach. Another existing system is the Bengali-to-Hindi MT system developed by Chatterji S, Roy D, Sarkar S and Basu A, which is a hybrid machine translation system. It integrates SMT with a lexical transfer-based system (RBMT). The hybrid system achieves a BLEU score of 0.2275.
There also exists the UNL-based (Universal Networking Language) English-Hindi MT system developed by Dave S, Parikh J and Bhattacharyya P, which uses UNL as the interlingua structure. The UNL is an international project aimed at creating an interlingua for all major human languages. IIT Mumbai is the Indian participant in the UNL project. Furthermore, there is the VAASAANUBAADA project of 2002, developed by Vijayanand K, Choudhury S I and Ratna P, which is an automatic machine translation system for Bengali-Assamese news texts. It involves Bengali-Assamese sentence-level machine translation for Bengali text and includes pre-processing and post-processing tasks. The bilingual corpus has been constructed and aligned manually by feeding the real examples using pseudo code.
Hence, it can be observed that, while various tools exist for translation from Hindi to other regional languages and vice versa, as well as from English to Hindi, no system carries out translation from Hindi to English seamlessly. This paper elaborates on a method to do just that: take a set of sentences in Hindi as input and output their English translations, while maintaining the syntax of the English language.
3. LITERATURE SURVEY
Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another. On a basic level, MT systems carry out word-for-word translation from the source language to the target language. Word-for-word translation alone, however, does not meet the requirements of a machine translation system. The various approaches to machine translation are as follows:
i. Rule-based machine translation: Rule-based machine translation (RBMT) is generated on the basis of morphological, syntactic and semantic analysis of both the source and the target languages.
An example of a system adopting this approach is the English-Hindi MT System (2002).
ii. Direct, transfer and interlingual machine translation: The direct, transfer-based and interlingual methods of machine translation all belong to RBMT, but differ in the depth of analysis of the source language and in the extent to which they attempt to reach a language-independent representation of meaning or intent between the source and target languages. Their differences can be seen clearly in the Vauquois Triangle (Fig 1), which illustrates these levels of analysis. The direct MT approach has been adopted by the Punjabi-Hindi MT system (2007-2008), the transfer MT approach by ManTra (1999), developed by CDAC, and the interlingua MT approach by the UNL-based (Universal Networking Language) English-Hindi MT System.
Fig 1 : Vauquois Triangle
iii. Statistical and example-based machine translation: Statistical machine translation (SMT) is generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The initial model of SMT, based on Bayes' theorem, proposed by Brown et al.
takes the view that every sentence in one language is a possible translation of any sentence in the other, and that the most appropriate translation is the one assigned the highest probability by the system. Example-based machine translation (EBMT) is characterized by its use of a bilingual corpus with parallel texts as its main knowledge base, in which translation by analogy is the main idea. The statistical approach has been adopted by the Shakti MT system (2003) and the example-based methodology by the VAASAANUBAADA project of 2002.
The approach undertaken in this paper is the transfer-based approach. It is currently one of the most widely used methods of machine translation. In contrast to the simpler direct model of MT, transfer MT breaks translation into three steps: analysis of the source language text to determine its grammatical structure, transfer of the resulting structure to a structure suitable for generating text in the target language, and finally generation of this text.
Transfer-based MT systems are thus capable of using knowledge of both the source and target languages. This capability supports the proposed application and hence makes this approach the most suitable.
4. METHODOLOGY
The primary step of the translation process is to carry out a word-for-word translation from the individual Hindi word tokens to their corresponding English counterparts [3]. This sequence of word tokens is then passed on to be tagged with their specific parts of speech within the structure of the sentence [1]. To formulate a syntactically correct sentence, the following operations need to be carried out:
Pre-processing
In this phase, the sequence of words that requires translation is cleaned so that it can be processed by the machine translation system. This includes the treatment of punctuation and special characters that, in all probability, do not require translation.
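As an illustration of the pre-processing phase, the following is a minimal Python sketch, not the implementation used in this paper. It assumes that "punctuation and special characters" can be approximated by the Unicode punctuation and symbol categories (which include the Devanagari danda), and the function names `preprocess` and `tokenize` are hypothetical. A plain whitespace split is included to show how the cleaned text could be handed to the tokenization step.

```python
import unicodedata


def preprocess(sentence):
    """Replace punctuation and symbol characters with spaces (assumed policy)."""
    cleaned = []
    for ch in sentence:
        # Unicode general categories starting with "P" (punctuation) or
        # "S" (symbols) cover ASCII marks as well as the Devanagari danda.
        if unicodedata.category(ch)[0] in ("P", "S"):
            cleaned.append(" ")
        else:
            cleaned.append(ch)
    # Collapse the runs of whitespace left behind by the removed characters.
    return " ".join("".join(cleaned).split())


def tokenize(sentence):
    """Split a pre-processed sentence into word tokens on whitespace."""
    return sentence.split()
```

Replacing the stripped characters with spaces, instead of deleting them, keeps words that were separated only by a punctuation mark from being merged into one token.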
Tokenizer
The tokenizer, or lexical analyzer, segments the sequence of words that requires part-of-speech tagging into units known as tokens. The input for this phase is the output of the pre-processing phase.
Part-of-speech tagging
The above operations produce a sequence of word tokens, each of which must be tagged with its specific part of speech. These tags are the primary input for formulating a sentence according to the provided grammar rules.
Part-of-speech tags are useful because of the large amount of information they give about a word and its neighbors [2]. Knowing whether a word is a noun or a verb tells us a lot about likely neighboring words (nouns are preceded by determiners and adjectives, verbs by nouns) and about the syntactic structure around the word (nouns are generally part of noun phrases), which makes part-of-speech tagging an important component of syntactic parsing. This task is accomplished using the Hidden Markov Model (HMM) [6, 7]. The Hidden Markov Model uses a Markov process, which is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. A popular algorithm used with the HMM is the Viterbi algorithm. In this paper, the Viterbi algorithm is employed to assign a sequence of part-of-speech tags to a sequence of translated words [5]. The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states (called the Viterbi path) that results in a sequence of observed events.
The Viterbi algorithm used in this paper follows a bigram model, i.e. the part-of-speech tag assigned to a word depends only on the part-of-speech tag assigned to the previous word.
Translation
Individual words are processed to find their corresponding English translations, which are fetched from a database. Another task executed within this phase is transliteration. For proper nouns, transliteration is used instead of translation, as the entire set of proper nouns cannot be incorporated into the database.
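The translation phase can be pictured with the following minimal Python sketch. It is an illustration under stated assumptions, not the system's code: the dictionary `LEXICON` stands in for the database, `TOY_TRANSLIT` is a three-character toy mapping and not a real transliteration scheme, and falling back to transliteration whenever a word is missing from the lexicon is an assumed policy for proper nouns. All names are hypothetical.

```python
# Hypothetical stand-in for the Hindi-English database.
LEXICON = {
    "जल": "water",
    "ही": "only",
    "जीवन": "life",
    "है": "is",
}

# Toy character map covering a single example name; a real transliterator
# needs the full Devanagari inventory and inherent-vowel handling.
TOY_TRANSLIT = {"र": "r", "ा": "a", "म": "m"}


def transliterate(token):
    """Map each character through the toy table and capitalise the result."""
    return "".join(TOY_TRANSLIT.get(ch, ch) for ch in token).capitalize()


def translate_token(token, lexicon=LEXICON):
    """Look the token up; fall back to transliteration if it is not listed."""
    if token in lexicon:
        return lexicon[token]
    return transliterate(token)


def translate_sequence(tokens, lexicon=LEXICON):
    """Translate a token sequence word by word, preserving the source order."""
    return [translate_token(tok, lexicon) for tok in tokens]
```

Because the lookup is word by word, the output keeps the Hindi word order; reordering is left to the grammar check.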
Grammar check
The individual translated word tokens, along with the tags assigned to them, are used to determine the final structure of the translated sentence. This involves employing a predefined grammar, which is used as a reference for arranging the words syntactically.
5. EXPERIMENTS AND RESULTS
A sentence passed for translation is first pre-processed to strip any components that do not need to be translated (e.g. punctuation marks, special symbols, etc.).
Fig 2 : Original sentence
Once this is done, the original sentence (Fig 2) proceeds to the tokenization phase, which yields the individual word tokens shown in Fig 3.
Fig 3 : Tokenized Sentence
This collection of tokenized words is then passed on to the part-of-speech tagging module, which assigns the most appropriate sequence of tags (Fig 4).
Fig 4 : Tagged Words
As the Viterbi algorithm used in this paper follows a bigram model, there is no reference for the first word in the sequence, i.e. water. Hence, a pseudo-state named START is used as the reference for the first word of every sequence. Another pseudo-state, END, indicates the completion of the determined tag sequence.
The aim of this algorithm is to find the most suitable tag sequence for the word sequence “water only life is”. For a candidate tag sequence such as PRP, NN, PREP, VFM, the probability is computed with the following formula, in which $w_1, \dots, w_4$ denote the four words of the sequence in order:
\[
P(\text{tag sequence}) = P(\text{START}) \cdot P(\text{PRP} \mid \text{START}) \cdot P(w_1 \mid \text{PRP}) \cdot P(\text{NN} \mid \text{PRP}) \cdot P(w_2 \mid \text{NN}) \cdot P(\text{PREP} \mid \text{NN}) \cdot P(w_3 \mid \text{PREP}) \cdot P(\text{VFM} \mid \text{PREP}) \cdot P(w_4 \mid \text{VFM}) \cdot P(\text{END} \mid \text{VFM})
\]
where $P(\text{START}) = 1$.
Put simply, to decide whether NN is the tag for the word “water”, the probability that the current tag is NN (noun) given that the previous tag is START is multiplied by the probability that the current word is “water” given that the current tag is NN. However, this probability is not estimated only for the NN tag.
This calculation takes place for all of the tags present in the NLTK Hindi corpus [6] (the corpus referenced for this paper), and the tag with the maximum probability is assigned to the current word. The tag assigned to the first word is then used as the backward reference for the second word, and so on until a tag sequence has been estimated for the entire word sequence [5].
Reference key [5]:
NN : Singular noun
PRP : Personal pronoun
PREP : Preposition
VFM : Verb
A diagrammatic representation of the Viterbi algorithm is shown in Fig 5.
Fig 5 : Viterbi algorithm for tag sequence
Once the tag sequence has been obtained, the words proceed to the translation module.
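The tag-sequence search described in this section can be sketched in Python as follows. This is a minimal illustration of bigram Viterbi decoding with the START and END pseudo-states, not the code used for the experiments. The transition and emission tables `TRANS` and `EMIT` hold hypothetical toy probabilities chosen for readability; in practice they would be estimated from a tagged corpus. The sketch keeps, for every tag, the best path ending in that tag, so the final choice is made over whole sequences.

```python
TAGS = ["NN", "PRP", "PREP", "VFM"]

# Hypothetical toy values: TRANS[(prev, cur)] = P(cur | prev).
TRANS = {
    ("START", "NN"): 0.6, ("START", "PRP"): 0.4,
    ("NN", "NN"): 0.3, ("NN", "VFM"): 0.5, ("NN", "END"): 0.2,
    ("PRP", "NN"): 0.5, ("PRP", "VFM"): 0.5,
    ("VFM", "NN"): 0.2, ("VFM", "END"): 0.8,
}

# Hypothetical toy values: EMIT[(tag, word)] = P(word | tag).
EMIT = {
    ("NN", "water"): 0.5, ("NN", "life"): 0.5,
    ("PRP", "water"): 0.1, ("VFM", "is"): 1.0,
}


def viterbi(words, tags=TAGS, trans=TRANS, emit=EMIT):
    """Return (probability, tag path) of the most likely tag sequence.

    Entries missing from the tables are treated as probability 0.
    """
    # best[tag] = (probability, path) of the best path ending in `tag`.
    best = {"START": (1.0, [])}  # P(START) = 1
    for word in words:
        step = {}
        for cur in tags:
            e = emit.get((cur, word), 0.0)
            # Extend every surviving path by `cur` and keep the best one.
            step[cur] = max(
                ((p * trans.get((prev, cur), 0.0) * e, path + [cur])
                 for prev, (p, path) in best.items()),
                key=lambda c: c[0],
            )
        best = step
    # Close each path with the END pseudo-state and pick the overall best.
    return max(
        ((p * trans.get((prev, "END"), 0.0), path)
         for prev, (p, path) in best.items()),
        key=lambda c: c[0],
    )
```

With these toy tables, `viterbi(["water", "life", "is"])` selects the tag path NN, NN, VFM. For long sentences the products become very small, so summing log-probabilities in place of multiplying probabilities is a common refinement.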
Translation is carried out word by word: the English counterpart of every word is fetched, as shown in Fig 6.
Fig 6 : Word-for-word translations
After the translated word sequence is obtained, it needs to be checked against the predefined grammar, which is shown in Fig 7.
Fig 7 : Defined grammar
The word translations, along with the part-of-speech tags assigned to them, are used to determine which sequence of words best fits the grammar defined above. All of the tokenized words are placed into their appropriate part-of-speech arrays for processing. The primary rule that is always followed is S -> NP VP.
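A minimal Python sketch of this rule-driven arrangement is given here for illustration. It assumes only the three rules used in this walk-through (S -> NP VP, NP -> N and VP -> V NP), maps the tags NN and VFM to N and V, and uses a hypothetical function name `reorder`. The full grammar of Fig 7 is not reproduced, and words with other tags are simply ignored by this sketch.

```python
def reorder(tagged):
    """Arrange (word, tag) pairs using S -> NP VP, NP -> N, VP -> V NP."""
    # Place the words into per-category arrays, preserving source order.
    nouns = [word for word, tag in tagged if tag == "NN"]
    verbs = [word for word, tag in tagged if tag == "VFM"]
    if len(nouns) < 2 or not verbs:
        raise ValueError("input does not fit the S -> NP VP sketch")

    subject_np = [nouns.pop(0)]               # NP -> N
    verb_phrase = [verbs.pop(0)]              # VP -> V NP: the verb first ...
    verb_phrase.append(nouns.pop(0))          # ... then the object NP -> N
    return " ".join(subject_np + verb_phrase)  # S -> NP VP
```

For example, the verb-final sequence `[("water", "NN"), ("life", "NN"), ("is", "VFM")]` is rearranged into the subject-verb-object order "water is life".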
After this, the NP (noun phrase) is analysed for a matching rule. For NP, the only rule fitting the criteria is NP -> N. Thus, the noun phrase analysis ends with “water” being processed as a noun. Moving on to the VP (verb phrase), the rule used is VP -> V NP, so that “is” and the subsequent words are processed. This processing continues until all the word arrays are empty. The final sentence obtained after restructuring with proper grammar is shown in Fig 8.
Fig 8 : Final translated sentence
6. CONCLUSION
In this paper, a transfer-based machine translation approach, which is a form of rule-based machine translation, has been discussed.
A basic set of grammar rules has been incorporated to check the syntactic validity of the translated sentences. In future work, the system will be extended to take tense, gender and number into consideration. Furthermore, the grammar can be expanded to include rules for processing exclamatory and interrogative sentences.
7. REFERENCES
[1] Akanksha Gehlot, Vaishali Sharma, Shashipal Singh, Ajai Kumar; Hindi to English Transfer Based Machine Translation System; AAIG, Centre for Development of Advanced Computing, Pune, India
[2] Nisheeth Joshi, Hemant Darbari, Iti Mathur; HMM based PoS Tagger for Hindi; Department of Computer Science, Banasthali University; Center for Development of Advanced Computing, Pune, India
[3] Ye Zhonglin, Jia Zhen, Huang Junfu, Yin Hongfeng; Part-of-speech Tagging Based on Dictionary and Statistical Machine Learning; School of Information Science and Technology, Chengdu, China; DOCOMO Innovations Incorporation, Palo Alto, USA
[4] Jayashree Nair, K Amrutha Krishnan, R Deetha; An efficient English to Hindi machine translation system using hybrid mechanism; Advances in Computing, Communications and Informatics (ICACCI), 2016 International Conference
[5] G V Garje, G K Kharate; Survey of Machine Translation System in India; Department of Computer Engineering and Information Technology, PVG’s College of Engineering and Technology, Pune, India; Principal, Matoshri College of Engineering and Research Centre, Nashik, India
[6] Jagjeet Singh, Lakhvir Singh Garcha, Satinderpal Singh; A Survey on Parts of Speech Tagging for Indian; International Journal of Advanced Research in Computer Science and Software Engineering
[7] Jana Diesner; Part of Speech Tagging for English Text Data; School of Computer Science, Carnegie Mellon University, Pittsburgh
[8] Sanjay Kumar Dwivedi, Pramod Premdas Sukhadeve; Machine Translation System in Indian Perspectives; Department of Computer Science, Babasaheb Bhimrao Ambedkar University, Lucknow, India
[9] Ananthakrishnan R, Kavitha M, Jayprasad J Hegde, Chandra Shekhar, Ritesh Shah, Sawani Bade, Sasikumar M; MaTra: A Practical Approach to Fully-Automatic Indicative English-Hindi Machine Translation; Centre for Development of Advanced Computing (formerly NCST), Juhu, Mumbai, India
[10] Latha R. Nair, David Peter S; Machine Translation Systems for Indian Languages; International Journal of Computer Applications
[11] Antony P. J.; Machine Translation Approaches and Survey for Indian Languages; The Association for Computational Linguistics and Chinese Language Processing