How does a Social Media Monitoring Search Engine work? Does it need to be so expensive to use when Google is free? What can we do ourselves in-house? And how do we cope with analysing 100 million tweets per day?
Right now, from what I have seen, the SMM companies combine several principal computing competencies (or modules if you like), all of which are actually undergrad level, but which of course require man-hours and server resources beyond what universities will lavish on "youff" or even post-docs. This is probably why there are so many start-ups in SMM strangling each other and dragging down the price chargeable by the most established consumer data-mining companies!
Here are the key modules and processes that an SMM Search Tool will include:
1) The "Bot":
This is the robot, crawler or spider, call it what you will, which goes out and gathers information from web sites. In effect it works just like your browser, but instead of routing all the HTML and graphics into a nicely displayed web page, it identifies the "posts" or "tweets" or other common text fields (or information like consumer-assigned sentiment). This is also called screen scraping, but the complexity of forums and the perishability of tweets mean it has to be more sophisticated; still, this is undergrad project stuff.
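To make the scraping half of that concrete, here is a minimal sketch using only the Python standard library. The markup and the `class="post"` convention are hypothetical stand-ins for a real forum's layout; the point is simply that, unlike a browser, the bot keeps only the consumer comments and throws the rest of the page away.

```python
from html.parser import HTMLParser

# Sketch of a screen scraper: pull out just the post text, ignore
# everything else on the page. The "post" class name is made up.
class PostScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_post = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "post") in attrs:
            self.in_post = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_post = False

    def handle_data(self, data):
        if self.in_post and data.strip():
            self.posts.append(data.strip())

# A toy page: one navigation menu plus two consumer comments.
sample_html = """
<html><body>
  <div class="menu">Home | Login</div>
  <div class="post">I love BrandX coffee</div>
  <div class="post">BrandX is overpriced</div>
</body></html>
"""
scraper = PostScraper()
scraper.feed(sample_html)
print(scraper.posts)  # only the two comments survive; the menu is dropped
```

In practice the HTML would come from an HTTP fetch rather than a string, and each forum engine needs its own extraction rules, which is exactly the maintenance burden described below.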
Another issue here is that you will probably want a link back to the original web page, and you may want to track users by name and time-stamp the posts, so your crawler-scraper has to handle this too. As you get the drift, the eventual items end up stored in a database, most often these days a MySQL back end. Each item gets its own ID and sits like a record with a few fields, and an indexing scheme is applied so it can be found efficiently from keyword searches.
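That record-with-an-ID idea can be sketched in a few lines. SQLite stands in for MySQL here, and the table and column names are illustrative, not any vendor's actual schema:

```python
import sqlite3

# Sketch of the back-end store: each scraped item becomes a record
# with its own auto-assigned ID, the link back to the source page,
# the author name and a timestamp. SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE items (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        url       TEXT,      -- link back to the original page
        author    TEXT,
        posted_at TEXT,
        body      TEXT
    )""")
conn.execute("CREATE INDEX idx_author ON items (author)")  # fast lookups

conn.execute(
    "INSERT INTO items (url, author, posted_at, body) VALUES (?, ?, ?, ?)",
    ("http://forum.example/t/42", "jsmith",
     "2010-06-01T09:30", "BrandX broke after a week"),
)

row = conn.execute(
    "SELECT id, body FROM items WHERE author = ?", ("jsmith",)
).fetchone()
print(row)  # the record, retrieved via the index rather than a full scan
```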
Now of course there are lateral differences across the web, between web forums and of course FB versus Digg or MySpace, and these sites also vary longitudinally, i.e. over their lifecycle they get re-formatted, revamped and extended, and forums often get a new installation of the "forum engine", or a completely new one. Now the project starts to take on the jobbing graduate (often drop-outs, actually, who end up earning more than those who stayed to finish their bachelor's!) just to keep on top of the restructuring, and to cope when crawlers get kicked off forums, i.e. when the IP address or browser type is "sniffed" as an unwanted spy!
2) The "Itemiser": This is the next step from screen scraping: the information has to be repackaged in a standard format so that the firm's software can use it, i.e. all data is standardised and "unitised". So another key difference from your web browser, and things like the Firefox ScrapBook add-on, is that the unit of selection is no longer the entire web page but the individual snippet relating to one consumer comment, which is stored and indexed.
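A sketch of that "unitising" step: whatever shape the raw scrape arrives in, every snippet gets repackaged into one standard unit. The field names and the two raw-input shapes below are purely hypothetical:

```python
from dataclasses import dataclass

# The standard "unit" every source gets repackaged into.
@dataclass
class Item:
    source: str   # e.g. "twitter", "forum"
    url: str      # link back to the original page
    author: str
    text: str

# Two itemisers for two differently-shaped raw inputs (made-up schemas).
def itemise_tweet(raw: dict) -> Item:
    return Item(source="twitter",
                url=f"http://twitter.com/{raw['user']}/status/{raw['id']}",
                author=raw["user"],
                text=raw["text"])

def itemise_forum_post(raw: dict) -> Item:
    return Item(source="forum",
                url=raw["permalink"],
                author=raw["poster"],
                text=raw["message"])

tweet = itemise_tweet({"id": 99, "user": "ann", "text": "BrandX ftw"})
post = itemise_forum_post({"permalink": "http://f.example/p/7",
                           "poster": "bob", "message": "BrandX meh"})
print(tweet.text, post.text)  # both now share one schema
```

Everything downstream (indexing, search, the portal) only ever sees `Item`s, so new sources just need a new itemiser, not changes to the rest of the stack.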
3) The Indexer and Navigator: this is applied both to entire datasets and to individual entries. This is a little beyond the scope of this blog, and frankly beyond my know-how outside XML, but all the words are indexed against each post ID. Some indexers ignore common English words ("the", etc.); some run "triages" to index variations of difficult brand names (e.g. acronyms, nicknames, abbreviations). Most SQL databases have an engine which will naively index, or can be programmed to do the above, or you can build a specific indexing strategy in the engine or import the structure to it.
Indexing is "in"; navigation is "out", but in effect relies on good indexing. You type in a search word and the system has a quick way of finding that word in the index and then finding all the records which contain it (or, say, batches of 100 per results page). This is the clever bit at Google and Yahoo: getting this to work at super speeds!
Computers are natural top athletes when it comes to running through lists, referring to "attached" or referenced data, and then retrieving it. In effect Google and Yahoo make a library index-card system, but instead of telling you the shelf location of one book by author or title from the alphabetical indexes, they know the shelf location by each word. Thereafter they retrieve a short abstract (the listing entry) and of course the link back to the original web page, and display this for you (at, say, 30 or 100 at a time; the index also carries a ranking which orders posts by various generic, pedantic or esoteric means, and this is the soul of the arms race in Search Engine Marketing/Optimisation).
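The index-card idea above is what search people call an inverted index: a map from each word to the IDs of the posts containing it, so a query is one dictionary lookup rather than a scan of every post. A toy sketch, with a tiny stop-word list as mentioned above (all data made up):

```python
# Words too common to be worth indexing ("the" etc.).
STOP_WORDS = {"the", "a", "is", "and", "at", "from"}

# Three toy posts keyed by their record IDs.
posts = {
    1: "the coffee from BrandX is great",
    2: "BrandX service is slow",
    3: "great service at CafeY",
}

# Build the inverted index: word -> set of post IDs containing it.
index = {}
for post_id, text in posts.items():
    for word in text.lower().split():
        if word not in STOP_WORDS:
            index.setdefault(word, set()).add(post_id)

def search(word):
    # One lookup in the index, not a pass over every record.
    return sorted(index.get(word.lower(), set()))

print(search("BrandX"))  # posts 1 and 2
print(search("great"))   # posts 1 and 3
print(search("the"))     # stop word: never indexed, so no hits
```

The real engines layer ranking, brand-name variant handling and sharding across machines on top, but the lookup structure is the same.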
4) The Web Portal: this is then just a straightforward front end and "middleware" to shuttle the query and results back and forth from a standard web browser or a mobile device portal. You can also use the back-end SQL interface itself, and sometimes an intranet HTTP version is included so you can see how Oracle or MySQL think the web interface should work. Even a Boolean search page and results service like Google Advanced or AlltheWeb is now an undergrad-level project, and loads of "home brew" programmers can do a pro-looking job in PHP. All this, and the server (the computers at the SMM company end), are covered in other lectures.
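To show just how thin this layer is, here is a complete toy portal in stdlib Python (standing in for the PHP front end the home-brew crowd would write). The `/search?q=` route and the tiny in-memory index are invented for the demo; real middleware would hand the query to the SQL back end instead.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# A made-up in-memory index standing in for the real back end.
INDEX = {"brandx": [1, 2], "great": [1, 3]}

class PortalHandler(BaseHTTPRequestHandler):
    """Shuttles the query in and the matching record IDs back out."""
    def do_GET(self):
        qs = parse_qs(urlparse(self.path).query)
        word = qs.get("q", [""])[0].lower()
        body = json.dumps({"query": word,
                           "ids": INDEX.get(word, [])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

# Start the portal on any free port, then query it like a browser would.
server = HTTPServer(("127.0.0.1", 0), PortalHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/search?q=BrandX"
reply = json.loads(urllib.request.urlopen(url).read())
print(reply)  # the query echoed back plus the matching post IDs
server.shutdown()
```

The real work (scraping, itemising, indexing) has all happened before this point; the portal genuinely is just plumbing.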