
Content similarity detection: Difference between revisions

Content deleted Content added
Undid revision 1121151777 by AManWithNoPlan: restore valid ref added in edit of 17:59, 9 August 2019 but erroneously altered in edit of 19:06, 18 March 2021 and subsequently deleted in edit of 19:52, 10 November 2022
Citation bot (talk | contribs)
Add: arxiv, citeseerx. | Use this bot. Report bugs. | Suggested by BorgQueen | Category:All articles needing examples | #UCB_Category 405/701
Line 2:
{{Use dmy dates|date=May 2017}}
{{cleanup|date=December 2010}}
'''Plagiarism detection''' or '''content similarity detection''' is the process of locating instances of [[plagiarism]] or [[copyright infringement]] within a work or document. The widespread use of computers and the advent of the Internet have made it easier to plagiarize the work of others.<ref>{{Cite web |title=Plagiarism, prevention, deterrence and detection |url=https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.107.178&rep=rep1&type=pdf |url-status=dead |access-date=2022-11-11|archive-url=https://web.archive.org/web/20210418111409/http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.107.178&rep=rep1&type=pdf |archive-date=18 April 2021 |last1=Culwin |first1=Fintan |last2=Lancaster |first2=Thomas |year=2001 | citeseerx=10.1.1.107.178 |via=[[Advance HE|The Higher Education Academy]]}}{{cbignore}}</ref><ref name=":0">Bretag, T., & Mahmud, S. (2009). A model for determining student plagiarism: Electronic detection and academic judgement. ''Journal of University Teaching & Learning Practice, 6''(1). Retrieved from <nowiki>http://ro.uow.edu.au/jutlp/vol6/iss1/6</nowiki></ref>
 
Detection of plagiarism can be undertaken in a variety of ways. Human detection is the most traditional form of identifying plagiarism in written work. This can be a lengthy and time-consuming task for the reader<ref name=":0" /> and can also result in inconsistencies in how plagiarism is identified within an organization.<ref>Macdonald, R., & Carroll, J. (2006). Plagiarism—a complex issue requiring a holistic institutional approach. ''Assessment & Evaluation in Higher Education, 31''(2), 233–245. {{doi|10.1080/02602930500262536}}</ref> Text-matching software (TMS), which is also referred to as "plagiarism detection software" or "anti-plagiarism" software, has become widely available, both as commercial products and as open-source{{Example needed|s|date=November 2020}} software. TMS does not detect plagiarism per se; instead, it finds specific passages of text in one document that match text in another document.
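
As a minimal illustration of text matching in this sense, the following sketch reports the word sequences that two documents share. It is not the method of any particular TMS product: the function names <code>word_ngrams</code> and <code>matching_passages</code>, the choice of three-word sequences, and the sample sentences are all assumptions made for the example.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def matching_passages(doc_a, doc_b, n=3):
    """Return the word n-grams that occur in both documents, sorted."""
    return sorted(word_ngrams(doc_a, n) & word_ngrams(doc_b, n))


# Hypothetical sample texts, not taken from any real corpus.
source = "The quick brown fox jumps over the lazy dog."
suspect = "Yesterday a quick brown fox jumps over my fence."

matches = matching_passages(source, suspect)
for ngram in matches:
    print(" ".join(ngram))
```

The output lists only overlapping passages. Whether an overlap amounts to plagiarism (for example, whether it is a properly attributed quotation) remains a matter of human judgement.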
Line 42:
 
=====Neural networks=====
More recent approaches that assess content similarity using [[neural networks]] have achieved significantly greater accuracy, but at great computational cost.<ref>{{Cite arXiv|title=Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks |eprint=1908.10084 |last1=Reimers |first1=Nils |last2=Gurevych |first2=Iryna |year=2019 |class=cs.CL }}</ref> Traditional neural network approaches map both pieces of content to semantic vector embeddings and then calculate the similarity of those vectors, often as their cosine similarity. More advanced methods perform end-to-end similarity prediction or classification using the [[Transformer (machine learning model)|Transformer]] architecture.<ref>{{Cite journal |last1=Lan |first1=Wuwei |last2=Xu |first2=Wei |date=2018 |title=Neural Network Models for Paraphrase Identification, Semantic Textual Similarity, Natural Language Inference, and Question Answering |url=https://aclanthology.org/C18-1328 |journal=Proceedings of the 27th International Conference on Computational Linguistics |location=Santa Fe, New Mexico, USA |publisher=Association for Computational Linguistics |pages=3890–3902}}</ref><ref>{{Citation |last1=Wahle |first1=Jan Philip |title=Identifying Machine-Paraphrased Plagiarism |date=2022 |url=https://link.springer.com/10.1007/978-3-030-96957-8_34 |work=Information for a Better World: Shaping the Global Future |volume=13192 |pages=393–413 |editor-last=Smits |editor-first=Malte |place=Cham |publisher=Springer International Publishing |language=en |doi=10.1007/978-3-030-96957-8_34 |isbn=978-3-030-96956-1 |access-date=2022-10-06 |last2=Ruas |first2=Terry |last3=Foltýnek |first3=Tomáš |last4=Meuschke |first4=Norman |last5=Gipp |first5=Bela|arxiv=2103.11909 |s2cid=232307572 }}</ref> [[Paraphrasing (computational linguistics)|Paraphrase detection]] in particular benefits from highly parameterized pre-trained models.
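
The embedding-based comparison can be sketched as follows. The example covers only the final similarity step: the three short vectors are hypothetical stand-ins for embeddings that a trained model would produce, and the helper name <code>cosine_similarity</code> is an assumption made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; a real system would obtain these from a model.
emb_a = [1.0, 2.0, 0.0]
emb_b = [2.0, 4.0, 0.0]  # same direction as emb_a
emb_c = [0.0, 0.0, 3.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # close to 1: same direction
print(cosine_similarity(emb_a, emb_c))  # 0.0: no shared direction
```

Because cosine similarity depends only on the direction of the vectors, <code>emb_a</code> and <code>emb_b</code> score as near-identical even though their magnitudes differ.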
 
====Performance====