This is an old piece I wrote a few years ago (late 2012) and, I think, never published. I couldn't be bothered to write up a conclusion - what to conclude was pretty obvious anyway.

Research in the age of sampling - dissecting the Frankenpaper

Around Halloween I came across a paper titled "Cost Effective Software Test Metrics", by Lazic et al. It appears to have been published in "WSEAS Transactions on Computers", in June 2008.

WSEAS is an organization whose academic standing is a little difficult to investigate; searching for its name leads, for instance, to a number of blogs that vigorously denounce other organizations for creating bogus conferences, sending spam, or defrauding researchers. The Shakespearean phrase "the lady doth protest too much" comes to mind.

Plagiarism in the Google age

WSEAS may not be bogus, but we will see that this paper definitely is; it can be charitably described as a "patchwork of plagiarized chunks", or what I'll call a Frankenpaper for short.

Plagiarism has become a riskier gamble than it used to be. Getting caught once depended on limited human memories; plagiarists are now much more easily detected thanks to Google, and to large, comprehensive search indexes in general.

The general technique for detecting plagiarism consists of taking a short snippet from the text suspected of being a copy or clone, and Googling for this snippet as an exact phrase. Finding a match in a text other than the original paper provides prima facie evidence that there has been plagiarism.

Some caveats apply:

  • the phrase may be common enough that there is a high likelihood that it has been used by two authors independently; using longer snippets will largely mitigate this
  • the plagiarism could have been in either direction: it's important to establish the publication dates of the two texts, to ascertain who copied whom
  • the match could also be a quotation, properly attributed: it's important to look at the suspected text for evidence of attribution or quotation
  • failing to find a match is not positive evidence that the source text is original: it may result from subtler plagiarism, with some words substituted from the original
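The procedure is mechanical enough to sketch in code. Here is a minimal Python sketch that cuts a suspect text into overlapping snippets and prints them as quoted, exact-phrase queries; the ten-word snippet length, the sampling step, and the file name are illustrative assumptions, and the actual searching is left to whatever engine you paste the queries into.

    import re

    def exact_phrase_queries(text, words_per_snippet=10, step=50):
        # Cut a suspect text into quoted snippets for exact-phrase searching.
        # Longer snippets (the first caveat above) make an accidental match
        # between two independent authors much less likely.
        words = re.findall(r"[A-Za-z0-9'-]+", text)
        queries = []
        # Sample the text at intervals rather than querying every sentence.
        for start in range(0, max(len(words) - words_per_snippet, 1), step):
            snippet = " ".join(words[start:start + words_per_snippet])
            queries.append('"%s"' % snippet)  # quotes force an exact-phrase match
        return queries

    # "suspect_paper.txt" is a placeholder for the text under scrutiny.
    with open("suspect_paper.txt") as f:
        for query in exact_phrase_queries(f.read()):
            print(query)  # a hit in an earlier text is a lead worth chasing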

The abstract

The paper's abstract appears to be original. This hypothesis is reinforced by the fact that it has a few grammatical flaws: for instance, "Software test metrics is a useful for test managers".

A rule of thumb that my investigations seem to confirm: the less grammatical a passage is, the better its chances of being original text - generally connective tissue used to stitch the plagiarized bits together.

However, searching for the exact phrase "Software test metrics is a useful for test managers" turns up evidence that the paper's abstract has itself been reused by someone else, in a paper titled [Program-Operators to Improve Test Data Generation Search|http://www.wseas.us/e-library/transactions/computers/2010/89-830.pdf], also published by WSEAS. This paper does give credit to Lazic et al., but fails to mark the borrowing with quotation marks. It appears to be itself guilty of at least one count of plagiarism, as can be verified by searching for the first few words of its very first sentence. This could be a pattern - but we will defer that investigation, and stick close to our original Frankenpaper.

Dissecting Section 1

The fun starts with Section 1.

The text "As organizations strive..." up to "improve both development and testing processes" (a bit under 120 words) matches exactly the description of an [Arizona State University class|http://web.archive.org/web/20081025142445/http://webapp.poly.asu.edu/jacmet/courses/practicalsoftware.html] given in mid-2008, slightly before the Frankenpaper, by an instructor named Jim Collofello.

The evidence for plagiarism seems solid enough, though there is a tiny bit of room for doubt: the dates are close enough that it's just barely possible that the class description is copied from the article rather than the other way round.

There is less doubt concerning the next chunk: "Software testing is one activity...", up to "ready for release" (55 words). This is the same text that appears in a 2004 brochure by "IV&V Australia" titled [How Are We Going? Good Test Metrics to Collect|http://www.ivvaust.com.au/downloads/thGoodTestMetrics.pdf].

Interestingly, the same chunk of text also appears in a 2006 press newsletter advertising the EuroSTAR conference, under the byline of one "Ramesh Pusala, Infosys, India" (and without attribution), in an article on [Operational Excellence through Efficient Software Testing Metrics|http://archive.newsweaver.com/qualtech/newsweaver.ie/qualtech/e_article000573744.html].

The lesson here is that more than one author might appreciate the value of one particular source text.

Next: "We had many areas to address..." up to "logical place to start" (45 words). This is lifted from an article by Rick Tennant, [Creating Five-Star Test Metrics On a One-Star Budget|http://www.stickyminds.com/sitewide.asp?Function=edetail&ObjectType=ART&ObjectId=6033], originally presented at STARWest 2002.

Next: "With metrics collection..." up to "problem report (PR) information and test information." (100 words). Here the authors are going back to the "IV&V Australia" 2004 brochure.

These Aussies are definitely valuable contributors! And indeed, the next bit, from "Testing is often seen..." to "significant value to the development process" (50 words) is from a different brochure from "IV&V Australia", also dated 2004, titled [How Do You Improve Your Testing Process?|http://www.ivvaust.com.au/downloads/thImproveTestProcess.pdf].

Not everything good can be imported from Australia, and so the following text, "Planning for testing on a software project..." up to "costly and painful surprises late in the project" (165 words) is from a different source: [Managing the Testing Process|http://www.stickyminds.com/sitewide.asp?Function=edetail&ObjectType=ART&ObjectId=3650], a 2002 StickyMinds newsletter article by Stephen Shimeall. With one little difference: the insertion of a sentence about "the Balanced Productivity Metrics (BPM) strategy", which seems to be the authors' original contribution.

The next sentence is such a cliché that it would be hard to prove plagiarism: "It is often said, you cannot improve what you cannot measure." However, there is much stronger evidence that the promise to "describe some basic software measurement principles and suggest some metrics that can help you understand and improve the way your organization operates" is borrowed from a 1999 article by Karl Wiegers, [A Software Metrics Primer|http://www.drdobbs.com/a-software-metrics-primer/184415704].

At this point come a couple of sentences that appear to be original connective tissue, followed by a chunk: "Effective test management..." up to "benefit to the end-users" (90 words), which is lifted from a CrossTalk article by Dr. Richard Bechtold, [Efficient and Effective Testing of Multiple COTS-Intensive Systems|http://www.crosstalkonline.org/storage/issue-archives/2004/200405/200405-Bechtold.pdf], dated 2004.

This is followed by a brief paragraph giving an overview of how the rest of the paper is organized.

Cost effective plagiarism metrics

At this point, let's review the stats: Section 1 of the Frankenpaper consists of 818 words, of which about 650 are plagiarized - roughly 80%.

Section 2:

"Metrics are defined"..."over multiple projects." (220) (From "Test Metrics: A Practical Approach to Tracking & Interpretation", Bradshaw, 2004, http://www.stickyminds.com/sitewide.asp?Function=edetail&ObjectType=ART&ObjectId=11480)

"Metrics are measurements"..."lines of code inspected" (75) (From "Metrics", Software Productivity Center, 2007, http://web.archive.org/web/20070220172340/http://www.spc.ca/resources_metrics.htm, which itself cribs from "The Information Security Dictionary", Gattiker, 2004)

Section 2.1:

"Once it was clear"..." an important testing activity" (90) (Partly from Design of Biomedical Devices and Systems, possibly other sources)

Section 2.2:

"Test metrics and data gathering" (25) (From "Managing Testing Through the Innovation Lifecycle from Research to Disposal",Määtta, 2005)

"Accurate data and relevant metrics"..."at a project management level" (230) (From "Managing Testing Through the Innovation Lifecycle from Research to Disposal",Määtta, 2005, this is also plagiarized in a Validata white paper from 2011 titled "Realizing the true value of testing" and a IntellectUK white paper with the same title.)

Section 2.3:

"What Are Software Metrics?".."collecting and using these metrics" (270) (From Software Productivty Center, 2007, http://www.spc.ca/resources/metrics/index.htm, also plagiarized elsewhere)

"...begins by showing the value of tracking".."definitions and explanations" (100) (From Test Metrics: A Practical Approach to Tracking & Interpretation, Bradshaw, 2004)

Section 2.4:

"Test Metrics are meaningful".."implement process changes" (190) (From Test Metrics: A Practical Approach to Tracking & Interpretation, Bradshaw, 2004)

Section 2.5:

"Tracking Test Metrics throughout".." into the metrics." (150) (From Test Metrics: A Practical Approach to Tracking & Interpretation, Bradshaw, 2004)

Section 2.6:

"Base Metrics constitute"..."in fact an improvement" (290) (From Test Metrics: A Practical Approach to Tracking & Interpretation, Bradshaw, 2004)

Section 2.7:

"As mentioned earlier"..."reason for the change" (180) (From Test Metrics: A Practical Approach to Tracking & Interpretation, Bradshaw, 2004)

Section 3:

"Metrics help you"..."SMART technique" (130) (This has heavier scar tissue than other sections; bits and pieces can be recognized from "A Software Metrics Primer" by Karl Wiegers, 1999 (http://www.drdobbs.com/a-software-metrics-primer/184415704), "Efficiency and Effectiveness Measures To Help Guide the Business of Software Testing" by Jon Huber, 1999 (http://www.stickyminds.com/sitewide.asp?Function=edetail&ObjectType=ART&ObjectId=1452).)

"Furthermore, we need"..."evaluate process stability" (120) (From "An integrated process and product model", Schneidewind, 1998)

Section 3.1:

"Measuring the impact"..."attempting to uncover" (60) (From "Managing the Testing Process")

Table 1 appears to be cribbed from http://www.psmsc.com/UG1998/Workshops/PSySM%20workshop%201998.PDF

"Once a list of valid"..."time period on previous project" (200) (From Efficiency and Effectiveness Measures, Huber, 1999)

"After collection"..."contribute to the problem" (65) (From http://www.qualitytimes.co.in/6sigma_dmaic.htm)

"Root Cause Analysis"..."get through the analyze phase" (180) (From Six Sigma Best Practices: A Guide to Business Process Excellence for Diverse, Kumar, 2006, see also http://www.leanflowconsulting.fr/DMAIC%20Quick%20Ref.pdf)

Table 4 appears cribbed from http://as.nida.ac.th/~rattakorn/Forum/Software_Engineering/evening/Group10/Appendix_B.html (source unknown)

"This information"..."black box" (50) (From "Managing the Testing Process")

Section 3.1 (again):

"Tracking defects"..."phase of origin and detection." (250) (From "Cost of Quality: a Key Effectiveness Metric for Software and IT", Gary Gack, 2007-2009?)

"The following indicators"..."payoff potential is very large" (230) (From "Core Set of Effectiveness Metrics for Software and IT", Gary Gack, 2007-2010? - the most recent published date of these Gack articles is posterior to 2008, but the first one appears in a version that has a 2007 copyright indicator; it's a reasonable assumption that they are republished versions of material written earlier by the same author. We can't quite rule out the thesis that Gack is plagiarizing Lazic rather than the other way round - but that does seem vanishingly unlikely.)

"By comparing defect counts"..."which the defect was detected." (85) (From "Is Software Inspection Value Added?", Gack, 2001- for this one there is definite evidence of an earlier publication date, even though all current Web pages carry a relatively recent date, see http://web.archive.org/web/20020805195142/http://www.isixsigma.com/library/content/c020520a.asp)

"Defects are real"..."understand the dynamics of software development" (240) (From “In Praise of Defects", Laird, 2005, http://www.cs.stevens.edu/~lbernste/CS%20533/Lectures/lecture%208%20-%20defects%202.ppt)

"In general, defects follow"..."Lower Control Bounds" (30) (From "What will the reliability be?", Laird, 2004)

Section 3.2.1 (again):

"There are two very important measurements of"..."would be 90 percent." (585) (From "Measuring Defect Potentials and Defect Removal Efficiency", Jones, 2008 http://www.crosstalkonline.org/storage/issue-archives/2008/200806/200806-Jones.pdf)

"Adopting an optimized testing approach"..."existing testing environment" (75) (From "How to balance quality, cost and schedules", Compuware white paper, 2006)

Section 3.2.2:

"The cost of fixing defects increases"..."and higher-quality applications" (110) (From "How to balance quality, cost and schedules", Compuware white paper, 2006, http://www.computerworlduk.com/white-paper/it-strategy/3266/how-to-balance-quality-cost-and-schedules/)

"The specification defines"..."an error or a fault" (100) (Apparently cribbed with alterations from "Testing Object-Oriented Systems: Models, Patterns, and Tools", Binder, 1999)

Section 3.3.3:

"Cost numbers vary"..."phases of the development cycle" (315) (From "Eliminating Embedded Software Defects Prior to Integration Test", Bennett and Wennberg, 2005, http://www.triakis.com/pages/Downloadable%20Files/Triakis%202005%20Xtalk%20Article-c.pdf)

Section 3.3.4:

"An important area to focus"..." how long it will take to test them." (220) (Apparently from "How to balance quality, cost and schedules", Compuware white paper, 2006)

Postscript
Additional fun fact I just discovered: their Figure 10 cites Bennett, Ted L., and Paul W. Wennberg, "Eliminating Embedded Software Defects Prior to Integration Test," Dec. 2005, but the figure is actually lifted from an article in the January 2008 issue of Crosstalk (which uses that figure and cites the same paper).

How do we know? Because that version has a typo, "Relative Cost of software fault propogation" (instead of "propagation"). As I've written elsewhere, plagiarism can often be detected from errors being copied forward. You can always copy something from someone and correct a typo as you do it, but plagiarists are not that careful; they tend to keep typos. And it should be quite rare for two authors to independently introduce the exact same typo.
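This check, too, can be crudely mechanized: flag words that appear in both texts but in no dictionary. A minimal Python sketch, where /usr/share/dict/words stands in for any word list and the two file names are placeholders:

    import re

    def shared_oddities(text_a, text_b, dictionary):
        # Words appearing in both texts but in the dictionary of neither:
        # a shared misspelling like "propogation" is strong evidence of
        # copying, since independent authors rarely make the same typo.
        def tokens(text):
            return {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
        return (tokens(text_a) & tokens(text_b)) - dictionary

    with open("/usr/share/dict/words") as f:
        dictionary = {line.strip().lower() for line in f}
    # "crosstalk_2008.txt" and "frankenpaper.txt" are placeholder file names.
    with open("crosstalk_2008.txt") as a, open("frankenpaper.txt") as b:
        print(shared_oddities(a.read(), b.read(), dictionary))  # e.g. {'propogation'}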
