Steal this article, robot. I dare you

Graphics by Sean Mullins.

I recently discovered the shocking fact that bots are stealing articles from Webster’s Magazine. Naturally, I had the only logical response: to write an article about it, which will itself inevitably be stolen.

On Monday, October 10th, I was scrolling through the article comments on the journal website dashboard (because that’s what you do when you have no life) and found several pingbacks, the notifications WordPress sends us when other sites embed links to our articles in their own text. However, these articles had not been shared. They had been copied and pasted or, in colloquial parlance, “borrowed indefinitely” by scam sites.

This phenomenon is the work of crawlers, bots that invade websites and search engines to republish digital content. We noticed these stolen articles only because the bots, in their infinite wisdom, accidentally copied the embedded links to the originals. Veteran and new writers across the board fell prey to the crawlers; apparently, the bots really liked my critique of a dating app, as it got pingbacks from four unique impostors.

Seeing these pirated articles is like a real-world “can I copy your homework” meme. They distinguish themselves through mistakes (“Dine and Discuss” becomes “Dine and Discus”) or clumsily substituted synonyms (“Students take hands-on learning to new level” becomes “College students take study to a brand new stage”). Most notably, my review of “Shovel Knight Dig” was run through a bad Google translation into German, resulting in a like for this s-rated tweet.

Photo by Sean Mullins. The Journal’s WordPress site got a pingback from a news-scraping bot that plagiarized a student’s article, slightly altering its title.

We’re lucky that crawlers are stupid enough to embed links, but who knows how long they’ve been targeting us without our knowledge? How many of our articles have been stolen by malware-infected websites? For all I know, there is a bootleg version of my “Deltarune: Chapter 2” review that, when clicked, downloads Spamton G. Spamton to your hard drive. I had to investigate further.

According to Rami Essaid’s MediaShift report, 2015 was the first year that web traffic from bots surpassed that from humans, with approximately 59% of website visits being automated. Essaid attributes 36% of web traffic that year to “good bots,” including search engines and social media aggregators that benefit sites, and 23% to “bad bots,” such as news scrapers, that rob the Internet.

“[Bad bots] can cause loss of site visitors, hurt their SEO rankings and reduce ad revenue,” says Essaid. “And because of all the bandwidth bad bots are using while stealing content, they make pages load slower, annoy human visitors and further damage search engine rankings.”

Photo by Morgan Smith. The sign in room 116 of Sverdrup Hall, where Journal staff write, edit, contribute and publish articles.

From stealing clicks to harvesting data, a crawler can steal weeks of honest work in minutes. This is not limited to text content such as journal articles; image and video content can also be plagiarized. Essaid noted that bot prevention software can address the problem for individual sites, but solving this systemic problem in digital publishing is impossible without effective enforcement of laws such as the Digital Millennium Copyright Act.

Even major publications are not immune to scrapers. After the shady website Newsbuzzr published a bootleg version of her article, HuffPost senior reporter Jesselyn Cook shared her journey down the rabbit hole of bot plagiarism. Reading her article feels like a point-by-point description of everything I saw in our pingbacks: randomly placed synonyms that break coherent sentences, patently unsafe scam sites, the occasional link to the original article, and so on.

Cook’s searches turned up several sites that were stealing content from every big-name publisher imaginable, from The New York Times to Wired, and monetizing it through revenue programs like Google AdSense. Although Google later notified Cook that Newsbuzzr had violated its quality guidelines and had been blocked from AdSense, the sites had abused a broken system that relied on manual responses rather than proactively fighting crawlers. That’s the motivation for article theft: profit.

“On the face of it, this rip-reword-repost operation is a creative little con (yes, it produces some really good ‘Florida Male’ content),” Cook said. “But the concern is that scraping content for ad traffic in this way is clearly profitable, and the scheme illustrates just how distorted the economic incentives are in our click-driven media industry.”

To learn more about how crawlers work and the solutions that exist, I reached out to webmaster, author, and plagiarism/copyright consultant Jonathan Bailey. Scraping popped up during the rapid growth of the internet in the early 2000s, when Bailey saw his news writing being routinely plagiarized. This led him to launch his own website, “Plagiarism Today,” in 2005 to share “techniques for detecting and preventing abuse of online content.”

Google began cracking down on crawlers in February 2011, rolling out several algorithm updates, including Google Panda, which promotes high-quality websites with original, well-researched content. Although scrapers temporarily became less common after these changes, they never completely disappeared. Bailey explained that scrapers have recently seen a resurgence in popularity as they have become cheaper and more convenient. They’re designed to be as hands-free and hassle-free as possible, making them look lucrative.

“Even if it doesn’t work 99.9% of the time, all people have to do is create 1,000 sites hosting spinoff content to have some level of success. Spam, all kinds of spam, is always a numbers game. The numbers are only starting to lean more toward those sites,” Bailey said.
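Bailey’s numbers game can be made concrete with a rough back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that every site succeeds or fails independently and that the 99.9% failure rate from the quote applies to each one; the independence assumption and the function name are mine, not Bailey’s.

```python
def chance_of_any_success(sites: int, failure_rate: float) -> float:
    """Probability that at least one of `sites` independent attempts succeeds.

    Assumes (hypothetically) that every site fails with the same
    probability and that the sites do not influence each other.
    """
    return 1.0 - failure_rate ** sites


sites = 1_000          # number of scraper sites, from the quote
failure_rate = 0.999   # "doesn't work 99.9% of the time"

# Average number of sites that work out under these assumptions.
expected_successes = sites * (1.0 - failure_rate)
# Chance that the operator gets at least one working site.
p_any = chance_of_any_success(sites, failure_rate)

print(f"Expected successful sites: {expected_successes:.2f}")
print(f"Chance at least one succeeds: {p_any:.1%}")
```

Under these assumptions, a thousand nearly worthless sites yield about one success on average and a roughly 63% chance of at least one, which is why cheap, hands-free tools tilt the math toward scrapers.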

Because of their customizability, crawlers employ a wide range of strategies, from crawling keywords to following RSS feeds. Some crawlers copy individual pages and push them into a synonym generator, while others combine sentences from different pages into what Bailey says “reads like an incoherent Frankenstein essay.” Monetization methods also vary, from AdSense revenue to boosting the signal of other dubious sites through links, or even ads from other scammers.

I find it ludicrous to hear that scammers can profit from the work of student journalists. To me, the copies aren’t even good enough to pass as fakes. However, Bailey assured me they might be scraping the bottom of the barrel. The handful of companies that advertise on scam sites pay very little, and any meager revenues likely go to someone higher up the chain.

“I have confidence [scrapers] make a living off of ill-gotten gains, but they’re often middlemen, either serving clients who want unscrupulous SEO or advertisers peddling questionable products,” Bailey said. “These groups may do better.”

So, what can digital publishers do to fight back? Anyone can submit a copyright claim to Google or a web host, and Google regularly updates its algorithm to downgrade low-quality sites that irritate search engine users. That said, Google itself errs on rare occasions, sometimes mistakenly flagging the crawler as the original source and punishing the victim.

One of the reasons creators might not pursue crawlers is that fighting them requires an investment of time or money, which is much more limited for smaller publications. Bailey recommends using these resources for a targeted approach; while it would be unreasonable and ineffective to find every crawler, searching for duplicate articles that rank higher in Google will provide the most valuable targets with less effort.
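One way to picture that targeted approach is a small script that scores how much a suspect page overlaps with an original article. This is a minimal sketch, not a tool that Bailey or the journal uses: the function names, the three-word “shingle” size and the sample sentences are all assumptions made for illustration, and a real check would first have to fetch each page and strip out its HTML.

```python
import re


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Split text into overlapping runs of `size` lowercase words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap_score(original: str, suspect: str, size: int = 3) -> float:
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    a, b = shingles(original, size), shingles(suspect, size)
    if not a or not b:
        return 0.0  # too short to compare
    return len(a & b) / len(a | b)


# Hypothetical sample sentences, not taken from any real article.
original = "the quick brown fox jumps over the lazy dog"
spun = "the quick brown fox leaps over the lazy dog"
unrelated = "an entirely different sentence about something else"

print(overlap_score(original, original))   # identical copy
print(overlap_score(original, spun))       # one synonym swapped
print(overlap_score(original, unrelated))  # no shared phrasing
```

A straight copy scores 1.0 and unrelated text scores 0.0, while a single swapped synonym already knocks out every three-word run that contains it. Heavily “spun” copies will therefore score low, so any cutoff for flagging a page is a judgment call rather than a rule.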

Photo by Morgan Smith. The laptop of Journal editor-in-chief Sean Mullins, on which this article was written.

Having your hard work stolen can be frustrating, although thieves probably won’t be able to afford even kraft paper on the money from stolen website traffic alone. I suspect this won’t be the last time it happens, and that the next generation of journal staff will get hit too. While I strongly doubt I’ll pursue journalism myself after graduation, copyright theft is common across all media fields. Fortunately, Bailey left me with words of encouragement from one writer to another.

“As cheesy as it may sound, the best advice I can give is not to let it get to you. Understand that it does exist and you may have to deal with it sometimes, but don’t overplay it. Remember, it’s not personal. This is done by bots. The problem is the internet itself, not you or your work,” Bailey said. “Keep it pragmatic and you should be fine.”



Sean Mullins (she/they) is the opinion editor and webmaster of the magazine. She majored in media studies and minored in professional writing at Webster University, but has been in student journalism since high school and was previously a games columnist, blogger and cartoonist for the Webster Groves Echo at Webster Groves High School. She is passionate about writing and editing stories about video games and other entertainment media. In addition to writing, Sean is the treasurer of the Webster Literary Club. She enjoys playing games, spending time with friends, supporting LGBTQ+ people and people with disabilities, streaming, making scary puns, and listening to music.
