Plagiarism detection
Plagiarism detection is the process of locating instances of plagiarism within a work or document. The widespread use of computers and the advent of the Internet have made it easier to plagiarize the work of others. Most cases of plagiarism are found in academia, where documents are typically essays or reports. However, plagiarism can be found in virtually any field, including scientific papers, art designs, and source code.

Detection can be either manual or computer-assisted. Manual detection requires substantial effort and excellent memory, and is impractical in cases where too many documents must be compared, or original documents are not available for comparison. Computer-assisted detection allows vast collections of documents to be compared to each other, making successful detection much more likely.

Computer-assisted plagiarism detection

Computer-assisted plagiarism detection (CaPD) is an information retrieval (IR) task supported by specialized IR systems, referred to as plagiarism detection systems (PDS).

Plagiarism detection for text-documents

Systems for text-plagiarism detection implement one of two generic detection approaches, one being external, the other being intrinsic.
External PDS compare a suspicious document with a reference collection, which is a set of documents assumed to be genuine. Based on a chosen document model and predefined similarity criteria, the detection task is to retrieve all documents that contain text that is similar, to a degree above a chosen threshold, to text in the suspicious document.
Intrinsic PDS solely analyze the text to be evaluated without performing comparisons to external documents. This approach aims to recognize changes in an author's unique writing style as an indicator of potential plagiarism.
PDS are not capable of reliably identifying plagiarism without human judgment: similarities are computed with the help of predefined document models and might represent false positives.

Detection methods

The figure below represents a classification of proposed methods for computer-assisted plagiarism detection from a technical point of view. The techniques are characterized by the type of similarity assessment they apply. Global similarity assessments use features taken from larger parts of the text or the document as a whole for similarity computation, while local methods take confined text segments as input.
Fingerprinting is currently the most widely applied approach to computer-assisted plagiarism detection. The procedure forms representative digests of documents by selecting sets of multiple substrings (n-grams) from them. The sets represent the fingerprints and their elements are called minutiae. A suspicious document is checked for plagiarism by computing its fingerprint and querying minutiae against a precomputed index of fingerprints for all documents of a reference collection. Minutiae matching those of other documents indicate shared text segments and suggest potential plagiarism when a chosen similarity threshold is exceeded. Generally, only a subset of minutiae is compared in order to speed up the process and allow for checks against very large collections, such as the Internet.
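The fingerprinting procedure can be sketched as follows: a minimal Python illustration assuming word n-grams, MD5 hashing, and a naive "hash mod p" selection strategy (real systems use more refined selection schemes such as winnowing; all names here are hypothetical).

```python
import hashlib

def fingerprint(text, n=5, p=4):
    """Hash every word n-gram and keep the hashes divisible by p as minutiae.

    The mod-p filter is a crude stand-in for the selection strategies
    used by real fingerprinting systems.
    """
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    hashes = {int(hashlib.md5(g.encode()).hexdigest(), 16) for g in grams}
    return {h for h in hashes if h % p == 0}

def overlap(suspicious_fp, reference_fp):
    """Fraction of the suspicious document's minutiae found in the reference fingerprint."""
    return len(suspicious_fp & reference_fp) / max(len(suspicious_fp), 1)
```

In a real PDS the reference fingerprints would live in an inverted index keyed by minutia, so that a suspicious document's minutiae can be looked up without pairwise comparison against every reference document.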

Checking documents for verbatim text overlaps represents a classical string-matching problem known from other areas of computer science. Numerous approaches have been proposed to tackle this task, some of which have been adapted to external CaPD. Checking a suspicious document in this setting requires the computation and storage of efficiently comparable representations for all documents in the reference collection, which are compared pairwise. Generally, suffix document models, such as suffix trees or suffix vectors, have been adapted for this task in the context of CaPD. Nonetheless, substring matching remains computationally expensive, which makes it a non-viable solution for checking large document collections.
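For illustration, the verbatim-overlap formulation can be sketched with Python's `difflib.SequenceMatcher` standing in for an actual suffix structure (the function name is hypothetical; a production PDS would use a linear-time suffix tree or suffix array rather than this quadratic matcher).

```python
from difflib import SequenceMatcher

def longest_verbatim_overlap(doc_a, doc_b):
    """Return the longest run of words appearing verbatim in both documents.

    Suffix trees answer this query in linear time; SequenceMatcher is
    quadratic but suffices to illustrate the string-matching formulation.
    """
    a, b = doc_a.split(), doc_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])
```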
Bag-of-words analysis represents the adoption of vector space retrieval, a traditional IR concept, to the domain of CaPD. Documents are represented as one or multiple vectors, e.g. for different document parts, which are used for pairwise similarity computations. These might be based on the traditional cosine similarity measure or on more sophisticated similarity functions.
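A minimal sketch of the vector space comparison, using raw term frequencies and the cosine measure (Python, hypothetical function name; real systems typically apply tf-idf weighting before comparing).

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the raw term-frequency vectors of two texts."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```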
Citation-based plagiarism detection is a computer-assisted plagiarism detection approach designed for use with academic documents, since it does not rely on the text itself but on citation and reference information. It identifies similar patterns in the citation sequences of two academic works. Citation patterns represent subsequences non-exclusively containing citations shared by both documents being compared. Similar order and proximity of citations within the text are the main criteria for identifying citation patterns. Other factors, such as the absolute number or relative fraction of shared citations in the pattern, as well as the probability that citations co-occur in a document, are considered for quantifying the patterns' degree of similarity.
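One simple proxy for the order criterion is the longest common subsequence of the two citation sequences, sketched below in Python (the function name and the use of plain LCS are illustrative assumptions; actual citation-based PDS use more elaborate pattern models that also weigh proximity and co-occurrence probability).

```python
def shared_citation_pattern(cites_a, cites_b):
    """Length of the longest common subsequence of two citation sequences.

    Each argument is the ordered list of citation identifiers as they
    appear in one document; a long common subsequence indicates that
    shared citations occur in a similar order.
    """
    m, n = len(cites_a), len(cites_b)
    # dp[i][j] = LCS length of the first i citations of A and first j of B
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if cites_a[i - 1] == cites_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```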

Stylometry subsumes statistical methods for quantifying an author's unique writing style and is mainly used for authorship attribution or intrinsic CaPD. By constructing and comparing stylometric models for different text segments, passages that are stylistically different from others, and hence potentially plagiarized, can be detected.
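As a toy illustration of intrinsic detection, the Python sketch below flags segments whose mean word length is a statistical outlier relative to the rest of the document (hypothetical names; real stylometric models combine many features such as sentence length, function-word frequencies, and character n-grams).

```python
import statistics

def style_outliers(segments, threshold=2.0):
    """Return indices of segments whose mean word length deviates strongly.

    A single toy feature standing in for a full stylometric model:
    segments more than `threshold` standard deviations from the
    document-wide mean are flagged as stylistically different.
    """
    means = [statistics.mean(len(w) for w in seg.split()) for seg in segments]
    mu = statistics.mean(means)
    sigma = statistics.pstdev(means) or 1.0  # avoid division by zero
    return [i for i, m in enumerate(means) if abs(m - mu) / sigma > threshold]
```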

Plagiarism detection systems for text-documents

The general design of academic plagiarism detection systems geared for text documents includes a number of factors:
  • Scope of search – the public Internet (using search engines), institutional databases, or a local, system-specific database.
  • Analysis time – delay between the time a document is submitted and the time when results are made available.
  • Document capacity / batch processing – number of documents the system can process per unit of time.
  • Check intensity – how often and for which types of document fragments (paragraphs, sentences, fixed-length word sequences) the system queries external resources, such as search engines.
  • Comparison algorithm type – the algorithms that define the way the system compares documents against each other.
  • Precision and recall – the number of documents correctly flagged as plagiarized compared to the total number of flagged documents, and to the total number of documents that were actually plagiarized. High precision means that few false positives were found, and high recall means that few false negatives were left undetected.
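The precision and recall definitions above translate directly into code; a minimal Python sketch over sets of document ids (names are hypothetical).

```python
def precision_recall(flagged, actually_plagiarized):
    """Precision and recall for a set of flagged documents.

    flagged: ids of documents the system reported as plagiarized.
    actually_plagiarized: ground-truth ids of plagiarized documents.
    """
    true_positives = len(flagged & actually_plagiarized)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = (true_positives / len(actually_plagiarized)
              if actually_plagiarized else 0.0)
    return precision, recall
```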


Most large-scale plagiarism detection systems use large, internal databases (in addition to other resources) that grow with each additional document submitted for analysis. However, this feature is considered by some as a violation of student copyright.

The following web-based systems are, with the exception of CopyTracker, closed source; the list is non-exhaustive:
Free
Chimpsky
CitePlag
CopyTracker
eTBLAST
Plagium
SeeSources
The Plagiarism Checker

Commercial
Attributor
Copyscape
iParadigms: iThenticate, Turnitin
Plagiarismdetect
PlagScan
Urkund
Veriguide

Detection performance

Comparative evaluations of plagiarism detection systems indicate that their performance depends on the type of plagiarism present (see figure). Except for citation pattern analysis, all detection approaches rely on textual similarity. Accordingly, detection accuracy decreases the more heavily plagiarism cases are obfuscated.



Literal copies, also known as copy&paste (c&p) plagiarism, or modestly disguised plagiarism cases can be detected with high accuracy by current external PDS if the source is accessible to the software. Substring matching procedures in particular achieve good performance for c&p plagiarism, since they commonly use lossless document models, such as suffix trees. The performance of systems using fingerprinting or bag-of-words analysis in detecting copies depends on the information loss incurred by the document model used. By applying flexible chunking and selection strategies, they are better capable of detecting moderate forms of disguised plagiarism when compared to substring matching procedures.

Intrinsic plagiarism detection using stylometry can overcome the boundaries of textual similarity to some extent by comparing linguistic similarity. Given that the stylistic differences between plagiarized and original segments are significant and can be identified reliably, stylometry can help in identifying disguised and paraphrased plagiarism. Stylometric comparisons are likely to fail in cases where segments are strongly paraphrased to the point where they more closely resemble the personal writing style of the plagiarist, or if a text was compiled by multiple authors. The results of the International Competitions on Plagiarism Detection held in 2009, 2010 and 2011, as well as experiments performed by Stein, indicate that stylometric analysis seems to work reliably only for document lengths of several thousand or tens of thousands of words. This limits the applicability of the method to CaPD settings.

An increasing amount of research is being performed on methods and systems capable of detecting translated plagiarism. Currently, cross-language plagiarism detection (CLPD) is not viewed as a mature technology, and respective systems have not been able to achieve satisfactory detection results in practice.

Citation-based plagiarism detection using citation pattern analysis is capable of identifying stronger paraphrases and translations with higher success rates than other detection approaches, because it is independent of textual characteristics. However, since citation pattern analysis depends on the availability of sufficient citation information, it is limited to academic texts. It remains inferior to text-based approaches in detecting shorter plagiarized passages, which are typical for cases of copy&paste or shake&paste plagiarism; the latter refers to mixing slightly altered fragments from different sources.

Source code plagiarism detection

Plagiarism in computer source code is also frequent, and requires different tools than those found in textual document plagiarism. Significant research has been dedicated to academic source-code plagiarism.

A distinctive aspect of source-code plagiarism is that there are no essay mills, such as can be found in traditional plagiarism. Since most programming assignments expect students to write programs with very specific requirements, it is very difficult to find existing programs that already meet them. Since integrating external code is often harder than writing it from scratch, most plagiarizing students choose to copy from their peers.

According to Roy and Cordy, source-code similarity detection algorithms can be classified as based on either
  • Strings – look for exact textual matches of segments, for instance five-word runs. Fast, but can be confused by renaming identifiers.
  • Tokens – as with strings, but using a lexer to convert the program into tokens first. This discards whitespace, comments, and identifier names, making the system more robust to simple text replacements. Most academic plagiarism detection systems work at this level, using different algorithms to measure the similarity between token sequences.
  • Parse trees – build and compare parse trees. This allows higher-level similarities to be detected. For instance, tree comparison can normalize conditional statements and detect equivalent constructs as similar to each other.
  • Program dependency graphs (PDGs) – a PDG captures the actual flow of control in a program, and allows much higher-level equivalences to be located, at a greater expense in complexity and calculation time.
  • Metrics – metrics capture 'scores' of code segments according to certain criteria; for instance, "the number of loops and conditionals" or "the number of different variables used". Metrics are simple to calculate and can be compared quickly, but can also lead to false positives: two fragments with the same scores on a set of metrics may do entirely different things.
  • Hybrid approaches – for instance, parse trees + suffix trees can combine the detection capability of parse trees with the speed afforded by suffix trees, a type of string-matching data structure.
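The token-based level above can be sketched in Python with the standard `tokenize` module: identifiers are replaced by a placeholder, and similarity is taken as the Jaccard index over token n-grams (the function names and the choice of Jaccard are illustrative assumptions; note that Python's tokenizer lexes keywords as NAME tokens, so they are normalized as well).

```python
import io
import tokenize

def normalized_tokens(source):
    """Lex Python source and replace identifiers with a placeholder.

    Comments, newlines, and indentation tokens are dropped, so simple
    renamings and reformatting no longer affect the comparison.
    Keywords also lex as NAME tokens, so they are normalized too.
    """
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skip:
            continue
        out.append("ID" if tok.type == tokenize.NAME else tok.string)
    return out

def jaccard_similarity(tokens_a, tokens_b, n=4):
    """Jaccard index over the token n-grams of two normalized streams."""
    grams = lambda toks: {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    ga, gb = grams(tokens_a), grams(tokens_b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0
```

With this normalization, two functions that differ only in identifier names produce identical token streams and score a similarity of 1.0.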


The previous classification was developed for code refactoring, and not for academic plagiarism detection (an important goal of refactoring is to avoid duplicate code, referred to as code clones in the literature). The above approaches are effective against different levels of similarity; low-level similarity refers to identical text, while high-level similarity can be due to similar specifications. In an academic setting, when all students are expected to code to the same specifications, functionally equivalent code (with high-level similarity) is entirely expected, and only low-level similarity is considered as proof of cheating.

Source code plagiarism detection systems

MOSS and JPlag can be used free of charge, but both require registration and the software remains proprietary. Personal systems are normal desktop applications, and most of them are both free of charge and released as open-source software.

See also

  • Locality sensitive hashing
  • Nearest neighbor search
  • Kolmogorov complexity#Compression – used to estimate similarity between token sequences in several systems

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.