I think it is interesting that while few have dared to go to market with new products like botbox.com, many companies have instead chosen to offer web-monitoring services on a corporate basis, and there is a growing buzz around making money from buzz: social media surveillance.
At some point these companies will have to develop intelligent, directed web bots to go and find new content, and to keep feeding in both that content and the meta-stats on how hot something is.
Web bots – spiders or crawlers – are the principal types of web robot. They work through URLs, find keywords, links and metacontent, and store them for indexing in the lists behind the search engines.
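To make that concrete, here is a minimal sketch of the extraction step a spider performs on each page it visits – pulling out links to follow and meta keywords to store for the index. The page content and URL are invented for illustration; a real crawler would fetch pages over the network and obey robots.txt.

```python
from html.parser import HTMLParser

class SpiderParser(HTMLParser):
    """Extracts links and meta keywords from one page, as a crawler would."""
    def __init__(self):
        super().__init__()
        self.links = []     # URLs to crawl next
        self.keywords = []  # terms to store for indexing

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name") == "keywords":
            self.keywords += [k.strip() for k in attrs.get("content", "").split(",")]

# Invented sample page standing in for a fetched document.
page = """<html><head><meta name="keywords" content="crawler, indexing"></head>
<body><a href="http://example.com/next">next</a></body></html>"""

parser = SpiderParser()
parser.feed(page)
```

The crawler's loop is then just: fetch, parse, add `parser.links` to the frontier, add `parser.keywords` to the index, repeat.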
A distributed web spider/crawler is one operating on decentralised processing nodes, usually with a central machine allocating domain names and MIME types to look at. This is exciting because it could be integrated with a specialist group's web-browsing behaviour, searches and conscious wishes, building more relevant listings on the search engines and directing updates on particular topics, sites, themes or "web rings" – i.e. interactivity.
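The central-allocation idea can be sketched very simply: a coordinator hands each domain (plus the MIME types worth fetching) to a worker node in round-robin fashion. Node names and domains here are invented; a real coordinator would also track completion and re-queue failures.

```python
from itertools import cycle

def allocate(domains, nodes, mime_types=("text/html",)):
    """Round-robin allocation of (domain, mime_types) work items to nodes."""
    assignments = {node: [] for node in nodes}
    for domain, node in zip(domains, cycle(nodes)):
        assignments[node].append((domain, mime_types))
    return assignments

work = allocate(["a.org", "b.org", "c.org"], ["node1", "node2"])
```

Each node then crawls only its own slice, and the coordinator merges the results into the shared index.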
What is exciting to me is to create a personalised web-bot system which lets a user in a specialist field get to more relevant data per search minute, keeps them updated on the sites or pages they have found, and gathers new, closely relevant data from the web.
Web bots and indexing are heavy on bandwidth, memory and processing time, but if the tasks are spread across a cluster of like-minded individuals active in a search area, and spread over the time when their machines are online but doing very little, then we have the basis for something which can be truly personalised and efficient – and a product or service which can attract revenue streams.
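A sketch of that idle-time idea: only hand crawl tasks to volunteer machines whose current load is below some threshold. The node names, load figures and threshold are all invented for illustration; a real system would measure load continuously and rebalance.

```python
def dispatch(tasks, node_loads, idle_threshold=0.2):
    """Assign crawl tasks round-robin, but only to nodes below the load threshold."""
    idle = [n for n, load in node_loads.items() if load < idle_threshold]
    plan = {n: [] for n in idle}
    for i, task in enumerate(tasks):
        if not idle:
            break  # nobody idle: tasks wait for the next scheduling pass
        plan[idle[i % len(idle)]].append(task)
    return plan

# nodeB is busy, so only nodeA and nodeC receive work.
plan = dispatch(["t1", "t2", "t3"], {"nodeA": 0.1, "nodeB": 0.9, "nodeC": 0.05})
```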
The key principle in reducing actual live-search time is to metacrawl and screen out undesirable links. Users, or the user cluster, build efficiency and relevance by having their actual behaviour monitored: the system learns, by AI-like processes, what the user finds relevant and what they discard or skip over. Successful filters built from metacrawling then lead to directed and "temporal" crawls (returning for updates, expansions and deletions) in slow time.
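One very simple way to realise that learn-from-behaviour loop is a keyword-weight filter: terms from pages the user keeps gain weight, terms from pages they discard lose it, and new links are screened by their score. The terms and observations below are invented toy data, not a real model.

```python
from collections import Counter

class RelevanceFilter:
    """Toy behaviour-trained filter: weights terms by keep/discard feedback."""
    def __init__(self):
        self.weights = Counter()

    def observe(self, terms, kept):
        # Reward terms from pages the user kept, penalise discarded ones.
        for t in terms:
            self.weights[t] += 1 if kept else -1

    def score(self, terms):
        # Positive score: crawl it; negative: screen it out.
        return sum(self.weights[t] for t in terms)

f = RelevanceFilter()
f.observe(["crawler", "indexing"], kept=True)
f.observe(["viagra", "crawler"], kept=False)
```

After these two observations, `f.score(["indexing"])` is positive and `f.score(["viagra"])` is negative, which is exactly the screening decision a directed crawl needs.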
It is a sad fact that maybe only 25% of the web gets indexed on the main engines, largely because of the pathway rules the bots obey and which top-level URLs are included for crawling down. Partly this is due to the duality of being "linked in" – the principle of links in and out "catching" the bots as they explore beyond the URL list (this was Google's founding algorithm, the mantra being "the more linked-to and relevant you are, the higher the listing"). So they miss material – but of course much of that material is the vanity web-logging so many of us do, as I am right now. In proportion to the traffic of, say, 1997, are search engines really any more useful to us now than following web rings or installing those channel buttons on your desktop was back then?
I see spam listings covering at least half of any simple keyword search, and I end up doing my own old-fashioned web-ringing to get to new, unlisted or "buried" content in an area, or simply resorting to several Boolean search alternatives – both tedious, if only someone could do this for me while I am away from my keyboard!
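That tedious round of Boolean alternatives is easy to automate: expand one query into every AND-combination of the synonyms for each concept, and let the bot run them while you are away. The synonym groups below are invented examples.

```python
from itertools import product

def boolean_variants(concept_synonyms):
    """Expand synonym groups into all AND-combinations of Boolean queries."""
    return [" AND ".join(combo) for combo in product(*concept_synonyms)]

# Two concepts, two synonyms each -> four queries to run unattended.
queries = boolean_variants([["spider", "crawler"], ["distributed", "clustered"]])
```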
I am also guessing that in conducting new searches there would either have to be a screen on URL names, MIME content tags and so on, or a randomly generated domain-name engine which uses keywords and web rings to construct domain names and test them for real content.
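The generative half of that guess might look like this: combine keywords pairwise with common TLDs into candidate domain names. Everything here is hypothetical; a real system would then fetch each candidate and screen out parked or empty pages before adding it to the crawl list.

```python
from itertools import product

def candidate_domains(keywords, tlds=(".com", ".org", ".net")):
    """Build candidate domain names from keyword pairs and common TLDs."""
    return [f"{a}{b}{tld}"
            for a, b in product(keywords, repeat=2) if a != b
            for tld in tlds]

names = candidate_domains(["web", "bot"])  # e.g. "webbot.com", "botweb.org", ...
```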
We have sites like Compete, Technorati, Google Labs and so on which do something for free, but they are apparently metacrawlers, and some want payment as soon as value-added data is approached. Metastats are of course useful stuff to marketeers and to criminal/terrorist intelligence, and for those you have to pay.
With the advent of app stores on several platforms and the predominance of book and music buying on the web among the under-30s, I think that revenue streams from sales and advertising can be secured, either for international companies or for those with some local-knowledge/language advantage.