A compendium of information about Cyveillancebot, a crawler believed to be run by Cyveillance, Inc., based in Arlington, Virginia. Cyveillancebot is reported to be the work of MIT graduate Jason Thomas. Cyveillance's clients include Bell Atlantic, Dell Computer Corp., Levi Strauss & Co., Mobil Corporation, Time Inc.-New Media, Washington Post, Newsweek Interactive, Bell South, ASCAP and the RIAA, in addition to leading companies in the pharmaceutical, financial services and computer industries, among others.

It is believed that Cyveillancebot crawls the Web to mine information about the current Web zeitgeist for corporations, and to search for copyrighted materials, brands and logos that may be misappropriated. The bot, according to the Cyveillance Web site and other sources, is part of a suite of technologies that feeds its results to a human analyst. The technology was called NetSapien in the 1999-2000 timeframe, though Cyveillance's Web site uses other terms in 2003.

Cyveillancebot uses IP addresses in the range 63.148.99.224 - 63.148.99.255, and may use others (unconfirmed). Here's a list of other 'media enforcer' bots, servers et al.
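The reported range happens to line up with a single CIDR block, which is handy for firewall rules and log filters. What follows is a minimal sketch using Python's standard ipaddress module; the helper name in_reported_range is made up for illustration, and the range is only the one reported above, not a confirmed list of everything the bot uses.

```python
import ipaddress

# Range reported above; other addresses may be in use but are unconfirmed.
FIRST = ipaddress.ip_address("63.148.99.224")
LAST = ipaddress.ip_address("63.148.99.255")

# Collapse the first/last pair into the smallest list of CIDR networks.
NETWORKS = list(ipaddress.summarize_address_range(FIRST, LAST))


def in_reported_range(addr: str) -> bool:
    """Return True if addr falls inside the reported range (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in NETWORKS)


if __name__ == "__main__":
    print(NETWORKS)  # the range collapses to 63.148.99.224/27
    print(in_reported_range("63.148.99.232"))
    print(in_reported_range("63.148.99.10"))
```

The 32 addresses from .224 to .255 collapse to 63.148.99.224/27, so a single network entry covers the whole reported range.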

Cyveillancebot ignores robots.txt, as far as anyone can tell. Cyveillancebot spoofs its identity, naming itself various flavors of Windows browsers:

63.148.99.232 - - [02/May/2003:13:01:37 -0700] "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 3.51)"
63.148.99.232 - - [02/May/2003:13:01:37 -0700] "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 5.0)"
63.148.99.232 - - [02/May/2003:13:01:58 -0700] "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 3.51)"
63.148.99.232 - - [02/May/2003:13:01:58 -0700] "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 3.51)"
63.148.99.232 - - [02/May/2003:13:02:57 -0700] "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 4.0)"

Whether this is code changing the ID at frequent intervals, or, say, a number of machines behind a common firewall is unknown. Cyveillancebot sometimes shows the same ID for all accesses over a given time period.
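One way to look at the changing ID is to group user-agent strings by address and timestamp and see which seconds show more than one. This is only a rough sketch: the regular expression matches the abbreviated log lines shown above (a real access log has more fields, so it would need adjusting), and agents_per_second is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Matches the abbreviated lines shown above: IP, two dashes, timestamp, quoted agent.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)"$')

SAMPLE = """\
63.148.99.232 - - [02/May/2003:13:01:37 -0700] "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 3.51)"
63.148.99.232 - - [02/May/2003:13:01:37 -0700] "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 5.0)"
63.148.99.232 - - [02/May/2003:13:01:58 -0700] "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 3.51)"
63.148.99.232 - - [02/May/2003:13:01:58 -0700] "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 3.51)"
63.148.99.232 - - [02/May/2003:13:02:57 -0700] "Mozilla/4.0 (compatible; MSIE 5.05; Windows NT 4.0)"
"""


def agents_per_second(lines):
    """Map (ip, timestamp) -> set of user-agent strings seen in that second."""
    seen = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ip, stamp, agent = m.groups()
            seen[(ip, stamp)].add(agent)
    return dict(seen)


if __name__ == "__main__":
    for (ip, stamp), agents in agents_per_second(SAMPLE.splitlines()).items():
        if len(agents) > 1:
            # One address, one second, more than one claimed browser.
            print(ip, stamp, sorted(agents))
```

On the excerpt above, only the 13:01:37 entries show two different IDs within the same second. That fits either explanation and does not settle which one is right.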

Cyveillancebot is a bit unusual for a bot in that it includes the referrer line (Googlebot doesn't), but this may be part of the ploy to look like a browser in access logs.

Cyveillancebot doesn't, however, download graphics files, Java or other page components. It does seem to download other types of binary files (perhaps it is looking for illegal MP3s, etc.).

Cyveillancebot seems to operate in two modes. In mode 1, it comes in on a link and reads a single page. In mode 2, it downloads every page in a directory or even a whole Web site as fast as it can. The mode 2 behavior is notoriously bad, and can amount to a denial-of-service (DoS) attack that consumes all available bandwidth.
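Mode 2 traffic can, in principle, be spotted by counting requests per address inside a sliding time window. The sketch below assumes the log has already been parsed into (ip, datetime) pairs sorted by time. The thresholds (more than 30 requests in 10 seconds) are arbitrary illustrative values, and find_bursts and synthetic_events are hypothetical names; the traffic is made up, not taken from a real log.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_bursts(events, max_requests=30, window=timedelta(seconds=10)):
    """Return the set of IPs that made more than max_requests within any window.

    events: iterable of (ip, datetime) pairs, sorted by time.
    """
    recent = defaultdict(deque)
    flagged = set()
    for ip, when in events:
        q = recent[ip]
        q.append(when)
        # Drop timestamps that have fallen out of the sliding window.
        while when - q[0] >= window:
            q.popleft()
        if len(q) > max_requests:
            flagged.add(ip)
    return flagged


def synthetic_events(start):
    """Build made-up traffic: one fast crawler and one ordinary visitor."""
    crawler = [("63.148.99.232", start + timedelta(seconds=0.1 * i)) for i in range(100)]
    visitor = [("192.0.2.10", start + timedelta(seconds=5 * i)) for i in range(20)]
    return sorted(crawler + visitor, key=lambda e: e[1])


if __name__ == "__main__":
    start = datetime(2003, 5, 2, 13, 0, 0)
    print(find_bursts(synthetic_events(start)))
```

A mode 1 visit (a single page) never comes near such a threshold, so only the bulk-download behavior is flagged.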

The bot is also reported to get stuck in query loops on database-driven sites and is reported to have brought at least one server down. No one knows why this behavior is allowed to continue: the parameters and practices for good behavior are well understood. Whether this is incompetence, indifference or, perhaps, a (curious) example of music industry 'punitive' technology, is unknown.

Here is a description of Cyveillance's 'NetSapien' technology from a journal at Emory University written in 1999:

How exactly does NetSapien work?

Spidering

The search begins by using client-specified data to constantly search for sites with relevant topics. Then it performs an 'intelligent' or deep crawl to look at each linked page. Then an algorithm is applied to determine if a link should be further examined. As the spider continues to search on the client specified topics, the 'smarter' it gets about the topic.

Filtering

This step deletes any broken links or previously searched pages from the results.

Prioritizing

This is where the inference engine steps in. Here, algorithms are plugged in to the engine to prioritize the results based on the client's business criteria. The feature looks through all the pages for recognition of text, video, audio, hidden text, meta tags, and links to assess how revenue generation is to occur and the intent of the content. It then groups the pages back into sites to show the most relevant examples of the site.

Extracting

NetSapien technology then extracts the relevant data from each page and enters it into a database according to the client's business criteria. Some questions may be:

· Is this page generating revenue?

· Is it domestic or international?

· Does this target specific clientele?

The last step prepares the data for easy formatting and analysis, being constantly updated by a learning/feedback loop through the process.

How to tell if Cyveillancebot is accessing your site using grep to look in your Apache (or other Web server) access log:

grep '63\.148\.99\.' /var/log/httpd/access_log
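A prefix match on 63.148.99 also catches neighbouring addresses outside the reported range, as well as any line that merely contains that string somewhere else (in a referrer, say). The sketch below is a narrower check that only looks at the first field of each line and compares it against the reported range. It assumes the client address is the first whitespace-separated field, as in Apache's common log format; from_reported_range is a hypothetical name, and the sample lines are made up.

```python
import ipaddress

# 63.148.99.224 - 63.148.99.255, the range reported in this document.
REPORTED = ipaddress.ip_network("63.148.99.224/27")


def from_reported_range(lines):
    """Yield log lines whose first field is an address inside REPORTED."""
    for line in lines:
        fields = line.split(None, 1)
        if not fields:
            continue  # blank line
        try:
            client = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # hostname or malformed field; skip it
        if client in REPORTED:
            yield line


if __name__ == "__main__":
    sample = [
        '63.148.99.232 - - [02/May/2003:13:01:37 -0700] "GET / HTTP/1.0" 200 1024',
        '63.148.99.10 - - [02/May/2003:13:01:40 -0700] "GET / HTTP/1.0" 200 1024',
        '192.0.2.7 - - [02/May/2003:13:01:41 -0700] "GET /?ip=63.148.99.232 HTTP/1.0" 200 1024',
    ]
    for hit in from_reported_range(sample):
        print(hit)
```

To scan a real log, pass an open file object in place of the sample list, for example the /var/log/httpd/access_log path used in the grep command.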

How to block Cyveillancebot:

If you are using Apache, have access to your .htaccess file, and the rewrite engine is enabled, then you may add the following lines to block the Cyveillance bot:

# Cyveillance
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.(22[4-9]|2[3-5][0-9])$
# FILTER BOTS : 403-Forbidden
RewriteRule ^.* - [F,L]

These lines will return an HTTP status code of 403 (Forbidden) any time a request is made from any of Cyveillance's IP addresses in the range above.
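Before deploying a rule like this, it is worth checking that the pattern covers exactly the intended addresses. The sketch below tests the RewriteCond pattern, written with escaped dots, against every last octet from 0 to 255. It uses Python's re module rather than Apache's own regex engine, on the assumption that the two behave the same for a pattern this simple; blocked_last_octets is a hypothetical helper name.

```python
import re

# The RewriteCond pattern, with the dots escaped so they match literally.
PATTERN = re.compile(r"^63\.148\.99\.(22[4-9]|2[3-5][0-9])$")


def blocked_last_octets():
    """Return the last octets in 63.148.99.0-255 that the pattern matches."""
    return [n for n in range(256) if PATTERN.match(f"63.148.99.{n}")]


if __name__ == "__main__":
    octets = blocked_last_octets()
    print(octets[0], octets[-1], len(octets))  # prints: 224 255 32
```

The 2[3-5][0-9] alternative would also accept 256 through 259, but those are not valid octets, so in practice the pattern matches 224 through 255 and nothing else.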