…the process of scanning through a large number of blogs, usually daily, searching for and copying content. This process is conducted through automated software. The software and the individuals who run the software are sometimes referred to as blog scrapers. Scraping is copying a blog that is not owned by the individual initiating the scraping process. If the material is copyrighted it is considered copyright infringement, unless there is a license relaxing the copyright. The scraped content is often used on spam blogs or splogs.
I’ve noticed (usually by looking at the trackbacks in my WordPress dashboard) a variety of scrape blogs grabbing content from my site over the past few years. Sometimes these sites include an attribution link; other times they do not. Alan Levine growled about scrape blogs today in his post, “Take My Whole Blog Post, Please? Why?” and after reading it I thought I’d add my voice to his and start a chorus. I’m not suggesting there is a constructive pathway of action for stopping this type of behavior; I’m just lamenting this type of content replication and promoting greater awareness of the phenomenon. As Open Educational Resources (OER) become more common and popular, it’s entirely possible we’ll see blog scrapers who specialize in OER. (Let’s hope not, but it’s possible.) Blog scrapers are seeking page views for advertisements, and seem to copy posts from well-read blogs and blogs highly rated on Google to boost their own page view potential. It’s certainly NOT a bad thing to be an entrepreneur and seek ways to earn money, but the legitimacy and ethics of these methods appear dubious at best.
The screenshot below is one I snapped from Google (with Skitch) last week on April 8th, after I posted about my Facebook account hack. I discovered these scrape blogs when I was searching for other bloggers who might be writing about the same situation. I didn’t find any. Instead, I just found scrape blog examples. 🙁 The screen snap shows top Google search results for the keywords “cnbc8 facebook phish” (without quotation marks in the query). There were 348 results on Google as of April 8th:
In that query, you’ll note my post was listed first. It was the only “real” post in the first five Google results for that query; everything else was a copy/scrape of my post.
In this same search today, my original post is listed third, and the top two results are scrape blogs. The total number of results matching the search is now up to 1180.
I’m not losing any sleep over this, but it is interesting (and maybe a few other things, like “irritating”) to see how blog scraping is continuing. I don’t have research results to cite that show blog scraping is on the rise, but I strongly suspect it is, along with legitimate blogging more generally. Authors of the post “Defending Your Site From Scrapers” suggest bloggers and website owners should use a “cloaking” method to somehow send garbage content to “blog thieves.” I have no idea how such a method would tell legitimate readers (who should get the “correct” content) apart from scrapers (who would get the garbage). It also seems that strategy would contravene the open standards which power blogging and RSS/feed aggregation in the first place.
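One guess at how a cloaking method might try to make that distinction: look at the User-Agent header each request sends and at how quickly requests arrive from a single address. The Python sketch below is purely illustrative. The function name, the list of suspect markers and the rate threshold are all assumptions made up for this example; none of them come from the “Defending Your Site From Scrapers” post.

```python
from collections import defaultdict, deque

# Hypothetical markers; a real scraper can send any User-Agent it likes.
SUSPECT_AGENT_MARKERS = ("curl", "wget", "python-requests", "scrapy")
MAX_REQUESTS_PER_MINUTE = 30  # assumed threshold, not a recommended value

# Maps each IP address to the timestamps (in seconds) of its recent requests.
_recent_requests = defaultdict(deque)


def looks_like_scraper(ip, user_agent, now):
    """Guess (heuristically) whether a request comes from a scraper."""
    window = _recent_requests[ip]
    window.append(now)
    # Forget requests that are more than 60 seconds old.
    while window and now - window[0] > 60:
        window.popleft()

    agent = (user_agent or "").lower()
    # A missing User-Agent, or one naming a common fetch tool, is flagged.
    if not agent or any(marker in agent for marker in SUSPECT_AGENT_MARKERS):
        return True
    # Otherwise flag only addresses that request pages unusually fast.
    return len(window) > MAX_REQUESTS_PER_MINUTE
```

Even this toy version shows the weakness: a scraper that sends a browser-like User-Agent and fetches slowly would pass as a reader, while a legitimate feed aggregator that polls often could be served garbage.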
“Web scraping” is not just a term used to describe wholesale copying and re-posting of blog content to generate ad revenue and page views. It’s also used to describe data harvesting activities. The following description for a “Web Scraping project” was posted on smartfreelancers.com on April 7th:
I need an experienced operator to conduct a quick web scraping project. The site to be scraped is: www.isc.co.uk I need the names of all the schools on the site along with the Headmaster name, the full address, telephone and email address… (Budget: $30-250, Jobs: Web Scraping)
I don’t really have a problem with that type of web scraping / data mining, but like Alan I think wholesale copying of blog content without the addition of ANY commentary or original ideas seems dishonest and wrong. There’s a difference between plagiarism and the legitimate use of quotations with citations. Alan wrote:
I am still left trying to figure out the purpose of a web site that just lifts full content from others and republishes it (in the worst case, it is a splog, but this site was not that bad). The site is affiliated with a town in Colorado.
But c’mon- if you are going to have a blog powered site, its one thing to write stories based on what other people blog, maybe pull quotes, but to lift an entire blog post and republish it is either lazy or worse.
Like Alan I publish under a Creative Commons license, so when scrape bloggers include an attribution link they may POSSIBLY be complying with my use license and attribution terms. Still, this type of content re-use is NOT what Creative Commons sharing is all about or, in my view, seeks to empower. I suppose this is a mildly dark (or at least irritating) side of open content licensing.
In the case of blog scrapers, as well as email and blog comment spammers, my main thought is: How sad these folks aren’t doing something more CONSTRUCTIVE and CREATIVE with their technology and communication skills?! The coders who are doing these kinds of sites can be anywhere in the world, so who is to say what their vocational options might be at this point? It would be great if someone could point the scrape bloggers of the world (and the spammers) to lucrative, legal, and constructive ways to use their talents. Perhaps this would make a good case study / discussion topic for educational courses focusing on blogs and blogging.
The Simple Trackback Validation Plugin for WordPress is one I use to cut down on trackback spam. I don’t think there’s a plugin which can stop scrape blogs, unfortunately. If there were a Creative Commons license which specifically forbade blog scraping, I’d be very interested in learning about it and would consider using it here. Brian Lamb, commenting on Alan’s post, points out the CC Attribution license permits authors to specify attribution terms. I include attribution terms on my blog now, but I’m not sure whether they could legally prohibit scrape blogs.