Plagiarism Detection Techniques

Faculty Mentor: Dr. Deepti Sharma
Students: Latika Malhotra (MCA-II), Ashima Tyagi (MCA-II)

The evolution of information technology and the use of smart phones have expanded the availability of information. Plagiarism is usually defined as “the wrongful appropriation” and “stealing and publication” of another author’s “language, thoughts, ideas, or expressions” and the representation of them as one’s own original work. 1
According to the Merriam-Webster Online Dictionary, to “plagiarize” means:
– “To steal and pass off the ideas of another as one’s own.”
– “To present as new and original an idea derived from an existing source.”
– “To use another’s production without crediting the source.”
– “To commit literary theft.”

Plagiarism can be categorized into five divisions: Copy & Paste Plagiarism, Word Switch Plagiarism, Style Plagiarism, Metaphor Plagiarism, and Idea Plagiarism 2.
Although the techniques used to plagiarize keep acquiring new variations over time, two forms remain the most widely recognized:
1. Textual plagiarism: committed most frequently by researchers and students in academic settings, this sort of plagiarism results in documents that are identical to the original documents, reports and scientific papers.
2. Source code plagiarism: frequently committed by university students, this sort of plagiarism consists of copying all or some segments of source code created by someone else and presenting it as one’s own. Its ordinary appearance makes it hard to detect.
There are several variants of the plagiarism detection process, among them “textual based plagiarism, citation based plagiarism, and shape based plagiarism for flowchart” 3. Textual plagiarism is the name given to plagiarism in which the original text is copied with small alterations or translated by machine. When the translation is done by a human, however, currently used techniques have very little means to detect it. Citation-based plagiarism detection, at the other end of the spectrum, compares the occurrences of citations in order to identify similarities. Shape-based plagiarism detection for flowcharts applies techniques from multimedia retrieval and image processing, but it is still unable to recognize plagiarism across varying charts and figures and their contents.

Fig. 1 Four-stage Plagiarism Detection Process 2
1.1. Text Based Plagiarism
This approach detects similarities between documents with the help of the vector space model. It counts how often each term occurs in a document, then matches the document’s fingerprint against the fingerprints of other documents to find the similarity between them. Because it uses the complete document and relies on the vector space to match documents, the method is appropriate for non-partial plagiarism, but it yields very poor results when a document has been only partly plagiarized. Text-based plagiarism includes “copy and paste” and “modification or changing some words of the original information from the internet, book, magazine, newspaper, research, journal, personal information or ideas” 5.
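The vector space comparison described above can be sketched as follows. This is a minimal illustration, not the method of any particular tool: it assumes plain term counts as the document vector and cosine similarity as the matching measure, and the function names `term_frequencies` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    tf_a, tf_b = term_frequencies(text_a), term_frequencies(text_b)
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)
```

Because the vectors summarize the whole document, a short copied passage inside a long original essay changes the score only slightly, which is consistent with the weakness on partial plagiarism noted above.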
1.1.1. The text-based plagiarism detection process is divided into the following stages:
1) Stage One Collection: The first stage of the plagiarism detection process requires “the student or the researcher to upload their assignments or works to the web engine, the web engine acts as an interface between the students and the system.”
2) Stage Two Analysis: In the second stage, the submitted corpus or assignments are run through a similarity detection engine to check whether documents are similar to other ones. There are two categories of similarity engines: intra-corpal engines and extra-corpal engines. “The intra-corpal engines work by returning ordered list between each similar pairs.”
3) Stage Three Confirmations: This stage determines whether specific relevant text has been copied from the original text, or whether there is a high rate of similarity between another text and the source text.
4) Stage Four Investigation: This is the final stage of the plagiarism detection process and it relies on human intervention. In this step a human expert is responsible for determining whether the system ran correctly and whether a result has truly been plagiarized or simply cited. 2
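The analysis stage for an intra-corpal engine, which returns an ordered list of similar pairs, might look like the sketch below. It assumes Jaccard overlap of word sets as the similarity score, which is a stand-in chosen for brevity rather than a measure named in the source; `word_set`, `jaccard` and `rank_pairs` are hypothetical names.

```python
import re
from itertools import combinations


def word_set(text):
    """Return the set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(set_a, set_b):
    """Shared words divided by all words; 0.0 when both sets are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def rank_pairs(submissions):
    """Return (score, name_a, name_b) for every pair, most similar first."""
    sets = {name: word_set(text) for name, text in submissions.items()}
    scored = [
        (jaccard(sets[a], sets[b]), a, b)
        for a, b in combinations(sorted(sets), 2)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

The pairs at the top of the list are the ones a confirmation step and, finally, a human expert would look at first.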
1.1.2. Different methods used for textual plagiarism detection
The most prevalent form of plagiarism among high school students is copying phrases word for word, or taking an entire source, without adding quotation marks or citing it properly. The Internet has made copying and pasting extremely tempting for students. Common plagiarism detection techniques rely on character-based methods to compare the suspected document with the original document. Character matching approaches can detect identical strings.
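As a small illustration of character matching, the sketch below finds the longest run of characters shared by a suspected document and an original. It uses `difflib.SequenceMatcher` from the Python standard library; the wrapper name `longest_shared_string` is hypothetical, and real tools would typically report many matches rather than only the longest one.

```python
from difflib import SequenceMatcher


def longest_shared_string(suspect, original):
    """Return the longest run of characters that appears in both texts."""
    # autojunk=False disables the heuristic that ignores very frequent
    # characters in long inputs, so every character is compared.
    matcher = SequenceMatcher(None, suspect, original, autojunk=False)
    match = matcher.find_longest_match(0, len(suspect), 0, len(original))
    return suspect[match.a:match.a + match.size]
```

For example, comparing "the fast brown fox" with "the quick brown fox" yields only " brown fox": a single switched word is enough to cut an identical string into shorter pieces.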
The following are two methods for textual plagiarism detection.
Grammar-based method: The grammar-based method is an important tool for detecting plagiarism. It focuses on the grammatical structure of documents and uses a string-based matching approach to detect and measure similarity between them. Grammar-based methods are suitable for detecting an exact copy made without any modification, but not for detecting copied text that has been modified by rewriting or by switching some words for others with the same meaning. This is considered one of the method’s limitations.
External plagiarism detection method: This method depends on a reference corpus composed of documents from which passages might have been plagiarized. A suspicious document is checked for plagiarism by searching for passages that duplicate the corpus. The external plagiarism system then reports these findings to a human controller, who is responsible for deciding whether the detected passages are plagiarized or not. 5
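The external detection workflow can be sketched as a lookup against a reference corpus. The example below assumes that a "passage" is a run of five consecutive words (a word n-gram, often called a shingle); that window size and the names `shingles`, `build_index` and `find_candidates` are illustrative assumptions, not part of the cited method.

```python
import re


def shingles(text, size=5):
    """Return the set of word n-grams (as tuples) of the given size."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def build_index(corpus, size=5):
    """Map each shingle to the names of the reference documents containing it."""
    index = {}
    for name, text in corpus.items():
        for shingle in shingles(text, size):
            index.setdefault(shingle, set()).add(name)
    return index


def find_candidates(suspicious, index, size=5):
    """Return {reference name: sorted shared passages} for a human reviewer."""
    report = {}
    for shingle in shingles(suspicious, size):
        for name in index.get(shingle, ()):
            report.setdefault(name, []).append(" ".join(shingle))
    return {name: sorted(passages) for name, passages in report.items()}
```

The returned report only lists candidate passages; deciding whether they are plagiarized or properly cited is left to the human controller, as described above.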

Copy & Paste
~70% Good results even for short fragments Unsuitable as short fragments cannot be detected
Disguised Plagiarism