Wednesday, May 7, 2003
The craft of Weblogging is something I've been trying to write about all week. (The Cyveillance thing just took on a life of its own... the things a guy will do to avoid cleaning the garage.) But I digress.
I once spoke at, even helped organize, a digital imaging conference for fellow photographers. We all agreed that the measure of success would be when we stopped talking about RAID arrays, SCSI, and CPUs and started talking about making compelling photos. That conference no longer exists.
Weblogging is maturing fast: I guess I missed all the conferences and punditry. Was there a WebLogCon Expo 2002? Heh. I guess Weblogs are one of those things that just appeared... pushed their way up from the grass roots.
But the good news is that we're already talking about compelling content. I marvel at the quality of the blogs I'm fortunate enough to peruse daily. Today I dropped in at Blaugustine, produced by Augustine, Natalie d'Arbeloff's cartoon alter ego, who was in an interview with Vince (you know, the Van Gogh kid). Yikes... you might find production standards as high as this in mainstream media, but I don't think you'll find the spirit, the originality, the compelling humanness... I think a golden age of authorship is upon us... more tomorrow...
10:30:50 PM
The Cyveillancebot home page is a compendium of what's known about the technology, what's broken, and how to protect networks and servers from its 'thrashing' mode.
The whole point of the Cyveillance attention is not just that it annoys me that their badly written bot freezes my Net connection once or twice a month, or even that their business model - which seems to involve getting paid for presenting other people's work, including mine - is a bit disingenuous for self-styled defenders of copyright.
The real issues are in this article about systems like CAPPS II and TIA. We seem to be entering a strange world, a time when snitches rule.
Human snitches have never been the best of people, but at least they're human.
Today it doesn't take a ne'er-do-well jailbird: badly written software can basically wreck your life. Think about a bug in CAPPS II, the airline screening program. Imagine nervous, armed policemen and pilots: now imagine that the CAPPS II bug causes you to be identified as a 'shoot on sight' terrorist.
Even Cyveillancebot can hurt you. The record companies clearly want to hurt people by making examples of customers whose listening habits they don't happen to like, as 4 students found out last week to the tune of some $50,000. Imagine if Cyveillancebot fingered you, inaccurately, as an egregious copyright miscreant and a legion of RIAA lawyers sought to get you fired, in debt and in prison as an example to the rest of the world.
If the inference engine that Cyveillance uses is as badly written as its 'bot, then we're all in trouble. Not only do we live in a time of snitches, but the snitches are bad, buggy software...
4:01:26 PM
How to block Cyveillancebot: Hans gives some guidance on both checking your logs and setting Apache to reject Cyveillancebot's normal IP range. You may want to do this if you're a victim of Cyveillancebot's 'thrashing' mode, which amounts to a DoS attack that may consume all your bandwidth while it's working...
12:41:06 PM
Jonathan Peterson notes that Cyveillancebot "seems somewhat selective, following links mainly to stories that have DRM, DMCA, RIAA and other keywords." He also mentions the two modes that I and others have seen. In mode 1, the 'bot comes in on a link and visits just the linked page. In mode 2, the disruptive one, the 'bot voraciously tears through dozens or hundreds of pages, as fast as it can. The host server is effectively DoSed until Cyveillancebot finishes. Modes 1 and 2 don't seem to be linked, as far as I can see at this point...
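One rough way to tell the two modes apart in a log is to group a visitor's requests into bursts and count the pages in each. What follows is only a sketch: the 60-second gap and the 12-page cutoff (standing in for 'dozens') are assumed thresholds, the timestamps are invented, and `split_into_bursts` and `classify_burst` are hypothetical helper names.

```python
from datetime import datetime, timedelta

def split_into_bursts(timestamps, max_gap=timedelta(seconds=60)):
    """Group request times into bursts separated by gaps longer than max_gap."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)   # close enough to the previous hit: same burst
        else:
            bursts.append([ts])     # long silence (or first hit): new burst
    return bursts

def classify_burst(burst, thrash_pages=12):
    """Label a burst 'mode 1' (a page or so) or 'mode 2' (tearing through a site)."""
    return "mode 2" if len(burst) >= thrash_pages else "mode 1"

# Invented timestamps for illustration: one lone hit, then 40 hits in 20 seconds.
start = datetime(2003, 5, 5, 9, 0, 0)
hits = [start] + [start + timedelta(hours=3, seconds=i / 2) for i in range(40)]
for burst in split_into_bursts(hits):
    print(burst[0], len(burst), classify_burst(burst))
```

Both thresholds would need tuning against real logs. A slow connection may stretch a mode 2 session enough that a 60-second gap splits it in two.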
11:07:55 AM
Slashdotted! Daved! Guess the Cyveillancebot rant touched a nerve... good news on the log analysis front: the greater activity and inbound linking mean that some patterns are easier to discern.
I spent some time reading the Cyveillance Web site, and then reading cached versions of previous press releases it has since deleted, in the hopes of understanding just what their technology does, the better to see if patterns in my Web access logs fit various theories.
One theory that's emerged is that Cyveillancebot is particularly sensitive to certain hubs, one of which is almost certainly www.scripting.com. A quick eyeball of the logs seems to show a correlation (and one other blogger noted this) between being linked from Dave's site and being crawled by Cyveillance.
It's as if Cyveillancebot regards certain sites as 'evil' (or maybe it's 'good') and hurries to see what they're linking to. It would be informative to analyze logs at scripting.com and a couple of other high-flow zeitgeist sites and see if there's a discernible pattern.
Cyveillance touts its proprietary Extraction Agents and Data Transformation technology, claiming that it delivers 100% relevant results (and inviting you to call a salesperson for a demo). However, knowledgeable people who hang out at Webmasterworld and Slashdot describe Cyveillancebot as 'stupid' and 'badly behaved'. Not only is it unusually aggressive (it routinely saturates my modest Net connection completely), it also gets caught in loops on database-driven sites that most other crawlers have long since been programmed to avoid.
While a 2000 press release bragged of indexing every one of the Web's then 2 billion pages, I note that, as far as I can tell, there are only 4 or 5 Cyveillancebots versus the dozens (or more) that Google runs. Napkin math would seem to indicate that you'd be hard pressed to crawl 3 billion Web pages - including the 7 million new ones every day - at anything like a level that would provide Cyveillance clients with the sort of immediate warning of misappropriation of assets or brand that Cyveillance marketing seems to promise.
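The napkin math can be written out as a rough sketch. The page counts and the bot count are the figures quoted above. The one-week refresh window is purely an assumption for illustration, as is the simplification that the work divides evenly among the bots. None of this is a known Cyveillance specification.

```python
# Back-of-the-napkin crawl arithmetic. Page and bot counts come from the
# text above; the refresh window is an assumed figure, not a Cyveillance spec.
SECONDS_PER_DAY = 24 * 60 * 60

def pages_per_second_per_bot(total_pages, new_pages_per_day, bots, refresh_days):
    """Fetch rate each bot must sustain to visit every page once per window."""
    pages_in_window = total_pages + new_pages_per_day * refresh_days
    seconds_in_window = refresh_days * SECONDS_PER_DAY
    return pages_in_window / seconds_in_window / bots

rate = pages_per_second_per_bot(
    total_pages=3_000_000_000,    # "3 billion Web pages"
    new_pages_per_day=7_000_000,  # "7 million new ones every day"
    bots=5,                       # "only 4 or 5 Cyveillancebots"
    refresh_days=7,               # assumption: a weekly re-crawl
)
print(f"{rate:,.0f} pages per second per bot")
```

Under those assumptions each bot would need to sustain roughly a thousand pages per second, around the clock. A longer refresh window lowers that figure proportionally, but it also weakens any promise of immediate warning.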
Anyway, I've set a couple more 'bot experiments in place: it'll be fun to see what data, if any, can be gleaned. Despite the Slashdotting, the presence of many of Cyveillance's major customers' names in close proximity to hot-button terms in this document, and a word list likely to be very interesting to the RIAA, Cyveillancebot has not visited since it came in on a link on Monday. And please do continue to send me 'GREP sightings' of Cyveillancebot: you can find it in your access logs by typing at the command line:
grep '^63\.148\.99\.' access_log
...if you're equipped and inclined to do the analysis, I'd be interested in correlations between its activity (especially the intense sessions where it downloads every page of a site) and inbound links. Do links from certain sites seem to trigger it?
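For anyone inclined to try that analysis, here is a minimal sketch of one way to start. It assumes Apache's combined log format (client IP first, referrer in a quoted field after the status and size) and treats any address starting with 63.148.99. as the 'bot. For each page the 'bot requested, it counts which referring sites had sent other visitors to that page in the preceding hour. The one-hour window, the function names and the sample log lines are all invented for illustration.

```python
import re
from collections import Counter
from datetime import datetime, timedelta
from urllib.parse import urlparse

# Combined log format: ip - - [time] "METHOD path proto" status size "referrer" "agent"
LINE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)[^"]*" \d+ \S+ "([^"]*)"'
)
BOT_PREFIX = "63.148.99."  # same range as the grep above

def parse(line):
    """Return (ip, time, path, referrer host), or None if the line doesn't match."""
    m = LINE.match(line)
    if not m:
        return None
    ip, stamp, path, referrer = m.groups()
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    return ip, when, path, urlparse(referrer).netloc

def referrers_before_bot(lines, window=timedelta(hours=1)):
    """Count referrer hosts seen on a page within `window` before a 'bot hit on it."""
    hits = [h for h in map(parse, lines) if h]
    others = [h for h in hits if not h[0].startswith(BOT_PREFIX)]
    counts = Counter()
    for ip, when, path, _ in hits:
        if not ip.startswith(BOT_PREFIX):
            continue
        for _, seen, seen_path, host in others:
            if host and seen_path == path and timedelta(0) <= when - seen <= window:
                counts[host] += 1
    return counts

# Invented sample lines: a visitor arrives via a link, the 'bot follows 20 minutes later.
sample = [
    '10.0.0.7 - - [05/May/2003:09:00:00 -0700] "GET /rant.html HTTP/1.0" 200 512 "http://www.scripting.com/" "Mozilla/4.0"',
    '63.148.99.240 - - [05/May/2003:09:20:00 -0700] "GET /rant.html HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
]
print(referrers_before_bot(sample))
```

To run it on a real log, pass an open file in place of `sample`. A tally like this shows only which referrers preceded a crawl, not what triggered it, so it is worth comparing against how often the same referrers appear when the 'bot does not show up.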
9:14:19 AM
Updated 6/1/03; 5:39:04 PM