Friday, May 9, 2003

Roger finds another 'creepy crawler':

12.148.209.198 - - [21/Feb/2003:05:45:13 +0000] "GET /robots.txt HTTP/1.1" 404 300

12.148.209.198 - - [21/Feb/2003:06:08:01 +0000] "GET / HTTP/1.1" 200 7651

12.148.209.198 - - [21/Feb/2003:06:43:01 +0000] "GET /gate/archive.html HTTP/1.1" 401 493

12.148.209.198 - - [21/Feb/2003:07:21:54 +0000] "GET /gate/video.html HTTP/1.1" 401 493

12.148.209.198 - - [25/Feb/2003:11:29:35 +0000] "GET /robots.txt HTTP/1.1" 404 300

12.148.209.198 - - [25/Feb/2003:12:00:44 +0000] "GET /gate/album/index.html HTTP/1.1" 401 493

12.148.209.198 - - [25/Feb/2003:12:31:26 +0000] "GET /gate/archive.html HTTP/1.1" 401 493

12.148.209.198 - - [25/Feb/2003:13:04:24 +0000] "GET /gate/video.html HTTP/1.1" 401 493

12.148.209.198 - - [09/May/2003:20:18:32 +0100] "GET /robots.txt HTTP/1.1" 404 300

12.148.209.198 - - [09/May/2003:20:18:32 +0100] "GET /gate/album/index.html HTTP/1.1" 401 493

12.148.209.198 doesn't resolve on a reverse DNS lookup, but further investigation turns up this link, which leads to http://www.nameprotect.com

Interesting to see the words they are looking for: album, archive, video. At least they look for robots.txt first.

I'm getting hooked on access_log!

I should note that Roger's 'blog and mine are hosted on the same server, run by a guy who lives up to his motto, "World's worst ISP". We have to grep the Apache logs to find the hits for just our sites, two of the six hosted by this guy. He's lazy, and can't be bothered to configure Apache to spit out separate logs... at least this 'bot is relatively polite...
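
For anyone else stuck grepping a shared access_log, here is a minimal Python sketch of the same idea: pull one client's requests out of the log and summarize what it asked for. It assumes the log lines follow the Common Log Format seen in the excerpt above; the names `CLF`, `hits_from` and `SAMPLE` are made up for illustration, and the sample lines are copied from this page.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status bytes
# No end-of-line anchor, so lines with extra trailing fields also match.
CLF = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def hits_from(lines, host):
    """Yield a dict of parsed fields for each line sent by `host`."""
    for line in lines:
        m = CLF.match(line)
        if m and m.group("host") == host:
            yield m.groupdict()

# Sample lines copied from the log excerpts on this page.
SAMPLE = [
    '12.148.209.198 - - [21/Feb/2003:05:45:13 +0000] "GET /robots.txt HTTP/1.1" 404 300',
    '12.148.209.198 - - [21/Feb/2003:06:08:01 +0000] "GET / HTTP/1.1" 200 7651',
    '12.148.209.198 - - [21/Feb/2003:06:43:01 +0000] "GET /gate/archive.html HTTP/1.1" 401 493',
    '63.148.99.232 - - [09/May/2003:08:31:47 -0700] "GET / HTTP/1.1" 200 81802 '
    '"http://www.gavinsblog.com/2003/03/31.html" '
    '"Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 3.51)"',
]

hits = list(hits_from(SAMPLE, "12.148.209.198"))
paths = [h["path"] for h in hits]                    # what the 'bot asked for
status_counts = Counter(h["status"] for h in hits)   # how the server answered

for h in hits:
    print(h["time"], h["path"], h["status"])
```

To run it against a real log, pass an open file instead of `SAMPLE`. Filtering by client address is only one option; matching on the path prefix would be closer to picking out one site's hits on a shared server.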
Comments 10:48:50 PM    


Here's a strange site that produces a disturbing result when I search on my name. Yikes...
Comments 2:12:17 PM    

Cyveillancebot returns:

63.148.99.232 - - [09/May/2003:08:31:47 -0700] "GET / HTTP/1.1" 200 81802 "http://www.gavinsblog.com/2003/03/31.html" "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 3.51)"

63.148.99.232 - - [09/May/2003:09:00:38 -0700] "GET /2003/05/05.html HTTP/1.1" 200 27517 "http://scriptingnews.userland.com/2003/05/05" "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 5.0)"

63.148.99.232 - - [09/May/2003:09:52:31 -0700] "GET /2003/05/01.html HTTP/1.1" 200 31896 "http://www.smalla.net/infofeed/index-full.rdf" "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 3.51)"

63.148.99.233 - - [09/May/2003:10:35:04 -0700] "GET /blog/ HTTP/1.1" 200 36281 "http://www.prandial.com/archives/004099.html" "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 3.51)"

63.148.99.232 - - [09/May/2003:11:38:33 -0700] "GET /rutland/ HTTP/1.1" 404 285 "http://www.n-net.com/vt.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

Note the changing IDs (MSIE 5.05 or 6.0, Windows NT 3.51 or 5.0): I wonder if this is how they know which crawler is working. But it still doesn't seem to be aware of the existence of a veritable HTML shrine to all things Cyveillancebot. Googlebot knows, which lends credence to the theory that Cyveillancebot doesn't really have the horsepower to traverse the Web and keep up with new pages. Most of the crawls above are from and to old material the 'bot has already crawled.
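
The changing IDs are the browser identification (user-agent) strings at the end of each line, and they are easier to compare with a tally. What follows is a rough Python sketch, assuming the lines are in Apache's combined log format as above; `COMBINED`, `agents_by_host` and `LINES` are names invented for this example, and the log lines are the ones quoted in this post.

```python
import re
from collections import defaultdict

# Combined Log Format adds "referer" and "user-agent" after the byte count.
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def agents_by_host(lines):
    """Map each client address to the set of user-agent strings it sent."""
    seen = defaultdict(set)
    for line in lines:
        m = COMBINED.match(line)
        if m:
            seen[m.group("host")].add(m.group("agent"))
    return seen

# The five Cyveillancebot lines quoted in this post.
LINES = [
    '63.148.99.232 - - [09/May/2003:08:31:47 -0700] "GET / HTTP/1.1" 200 81802 "http://www.gavinsblog.com/2003/03/31.html" "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 3.51)"',
    '63.148.99.232 - - [09/May/2003:09:00:38 -0700] "GET /2003/05/05.html HTTP/1.1" 200 27517 "http://scriptingnews.userland.com/2003/05/05" "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 5.0)"',
    '63.148.99.232 - - [09/May/2003:09:52:31 -0700] "GET /2003/05/01.html HTTP/1.1" 200 31896 "http://www.smalla.net/infofeed/index-full.rdf" "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 3.51)"',
    '63.148.99.233 - - [09/May/2003:10:35:04 -0700] "GET /blog/ HTTP/1.1" 200 36281 "http://www.prandial.com/archives/004099.html" "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 3.51)"',
    '63.148.99.232 - - [09/May/2003:11:38:33 -0700] "GET /rutland/ HTTP/1.1" 404 285 "http://www.n-net.com/vt.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"',
]

agents = agents_by_host(LINES)
for host in sorted(agents):
    print(host, len(agents[host]), "distinct user-agent strings")
    for agent in sorted(agents[host]):
        print("   ", agent)
```

On these five lines, 63.148.99.232 shows three distinct strings and 63.148.99.233 shows one. That says nothing about why the strings change; it only makes the pattern visible.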

As seen above, Cyveillancebot has, finally, crawled the page you're reading now, which has lots of links to the C-bot materials, so it will be interesting to see how long it takes before it crawls those directories. I'm not looking forward to the 'thrashing' or mode 2 behavior that it exhibited in the past - it will consume all our bandwidth while it's at it, but I guess it's the only way to discover how it works.

Speaking of Googlebot, I wonder why they don't just use Google or Yahoo search? That way they wouldn't have to tie up the resources for a crawler farm, their searches would be invisible, and their results would be much more up to date. If I were going to start a service, call it CyberSnitch, I'd cut a deal with Google to do the crawling, and then point the AI to the Google results...


Comments 1:41:50 PM    


Web servers are amazing: I'm so used to them that I hardly ever think about them anymore. Yesterday I created a whole new edition of this Weblog in about 5 minutes, including an RSS feed. That's an outrageously short period in which to create what amounts to a new publication.

Some ten years ago, I helped a fellow named Neil Chase and the Hearst Corporation launch a new weekly magazine called "We", a bilingual publication that circulated in the U.S. and Russia. It was, at the time, the fastest and cheapest launch of a Hearst publication ever. Though wildly popular in Russia, We succumbed to market forces: Izvestia, the Russian publishing partner, couldn't afford the paper to print it on.

And though We didn't make it, the Hearst Corporation adopted the methods that Neil and I developed - based on desktop publishing - for all of its launches since. That 'miracle' publication went from idea to newsstand in something like 3 months. Yesterday, I did what amounts to the same thing, by myself, in 5 minutes.

It helped that I had excellent software - Userland Radio and NetNewsWire - as well as Apache, but it is still an awesome thing. Five minutes from idea to a publication with global possibilities... and I can afford the 'paper' to 'print' it on...
Comments 9:23:05 AM    





Updated 4/16/04; 12:38:07 PM

Chris Gulker's view from Silicon Valley - in words and pictures


Interesting blogs et al.:

AlwaysOn Network
Natalie d'Arbeloff
Azeem Azhar
Ken Bereskin
Blogging Ecosystem
Blogging Network
BlogStreet
Boing Boing
Tim Bray
Matt Croydon
DaveNet
Rael Dornfest
Esther Dyson
Dave Farber's IP
Dave Fitch
David Galbraith
John Getze
William Gibson
Dan Gillmor
James Gleick
Bernie Goldbach
Meg Hourihan
Joi Ito
Xeni Jardin
Jeff Jarvis
Linux Journal
Mitch Kapor
Kuro5hin
Gunnar Langemark
Joshua Levy
Scott Loftesness
Macintouch
Ross Mayfield
Hans Moravec
Rafe Needleman
Nonsense Verse
OS Opinion
Tim Porter
Recommended Reading
Reverse Cowgirl
Glenn Reynolds
Roger Ridey
Phil Ringnalda
John Robb
Scott Rosenberg
Anita Rowland
Brent Simmons
Robert Scoble
Doc Searls
Jessica Shea
Gavin Sheridan
Shifted Librarian
Stefan Smalla
Bruce Sterling
Scripting News
Slashdot
Dan Shafer
John Tringham
Jon Udell
Mochio Umeda
Philipp Weltentummler
Kevin Werbach
Amy Wohl
