
Thursday, March 24, 2011

DIY Social Media Sentiment "Barometer" Techniques

Contention: Traditional, hard and boring market research techniques can save your bacon when you are looking to turn around fast, accurate and, most of all, PROFITABLE research reports or data mining portals.

I've been lucky enough to get inside the doors of a couple of Social Media Monitoring (SMM) hot-shops, and I also have a friend who has programmed both the back and front end of screen-scraping and data-mining web monitoring tools.

What I see is that today's technology means the usual IT smoke and mirrors dissolves away into what is now pretty standard undergraduate project work, just lots of it. Even sentiment rating tools are at the level of a final-year project for a Harvard Comp Sci major.

In this blog let us look back under the bonnet (see the last blog) and maybe chuck out the V8 single carb for a nice little supercharged injection engine.

A Heavy Handed, Verbose Approach to Social Media Monitoring

What the media-monitoring companies seem obsessed with is having an accurate census of all posts in consumer forums, and now, with access to FB and Twitter, all hits on the brand names. This is because, like Everest or the Great Wall, they can build it. Not that they really need to 'go there', but because they are used to huge data resources it seems the right thing to do.

Now this is by no means a labor of Sisyphus; it probably just feels like that, especially when major forum web sites kick your crawler's IP address off their triage server. Up until now, server resources have not been an issue, but with capture and indexing for Twitter, Digg and not least FB, indexing and retrieval navigation start to become sluggish and resource heavy, both vertically and horizontally, and in the labor of maintenance and expansion.

So, faced with the practical, economic and client-patience issues of providing huge data mines, what is the alternative "blue sky" out there? Well, of course, come on guys: don't reinvent the wheel.

Comp'Sci'Bachelor's Myopia

There is a cultural issue with computer science "majors": they race through math, especially statistics, and only listen to the more web-oriented parts of any marketing course units (if they take any at all). Like many people outside the profession, they see marketing as all comms and "fluffy", not a serious branch, just a bunch of luvvies making hot air.

What they all want to do now is either be a hot Java programmer (admirably) and/or do phone apps or APIs for web sites. What they do not want to spend time on is "aligning" with the harder side of marketing. The web is the new patch; keep off it with your Taylor Nelson 1950s stats and punch cards.

Well, post-pubescent programmer, er, sorry, software engineer/app developer, you are wrong. Traditional, hard and boring market research techniques can save your bacon when you are looking to turn around fast, accurate and, most of all, PROFITABLE research reports or data mining portals.

Sample Dear Boy!

The first population census was perhaps conducted and documented as long ago as 3400 BC, according to Wiki. Census means getting ALL the possible data points in the bag. Even in the infancy of market research, the census approach was rejected for consumer reports because it is just too cumbersome and expensive, and if you make any mistakes in the methodology, like ambiguities in questions, then your entire results are biased and inaccurate. Also, there are just a lot of irrelevant data capture points.

Market research managers, the hard men of the industry, soon adopted sample-set techniques from statistical science (used earlier in biology, for example) and went on to develop their own methodologies for creating such smaller sub-groups. The key here is that the mathematics gives you a known accuracy, or probability, that your results from the sample group represent the population - the whole of a country, say, or a set such as people visiting Walmart on Tuesdays.

Researchers used random probability sampling and went on to use stratified random sampling and nice temporal techniques like "sample-resample", where time is also a random factor, in combination with geographical location as a random or defined set.

At the outset, all the techniques mentioned (apart from sample-resample, where you define the time period) rely on knowing the size of the total set, maybe the national population or the geographical locations of all supermarkets. However, you can use educated estimates or other statistical techniques to produce a number.
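To make that concrete, here is a minimal Python sketch of simple random versus stratified random sampling over a set of itemised posts; the post list and its "source" field are made-up stand-ins for whatever your own records look like.

import random

# hypothetical itemised posts, tagged by the source they were scraped from
posts = [{"id": i, "source": random.choice(["forumA", "forumB", "twitter"]),
          "text": f"post {i}"} for i in range(10_000)]

# simple random sample: every record has an equal chance of selection
simple_sample = random.sample(posts, 400)

# stratified random sample: sample within each source so small sources
# are still represented, e.g. up to 100 records per stratum
by_source = {}
for p in posts:
    by_source.setdefault(p["source"], []).append(p)

stratified_sample = []
for source, group in by_source.items():
    stratified_sample.extend(random.sample(group, min(100, len(group))))

print(len(simple_sample), len(stratified_sample))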

Dangers and Benefits of Statistical Sampling
These techniques rely on a few assumptions, and often on some manual intervention in the form of common-sense sampling choices. Herein lie two dangers, though:

1) If you have, say, one model with assumptions for the population total, and there are of course further assumptions built into things like the t-test and chi-squared test, then you can create massive sampling errors through the combination of the two levels of assumptive error.

2) You can become overly confident in results based on a very good (and expensive) sampling methodology, while being completely let down by "non-sampling error", usually a crappy questionnaire.

The big benefits are that you get a sensible, and often very small, sample size to then look into with GOOD study methodology, i.e. you can put the man-hours into design, execution and results and not the shoe work. Also, to the delight of many accountants or numbers guys with MBAs in senior management, you can present the margin of error expected from a given sample size (always "n") and thus make cost-benefit decisions based on the need for accuracy.
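As a rough illustration of that cost-benefit trade-off, here is a little Python sketch of the standard sample-size calculation for a proportion (e.g. "% of posts with negative sentiment") at 95% confidence, with a finite population correction; the population figure N is just an assumed placeholder.

import math

def sample_size(margin, N, p=0.5, z=1.96):
    # required n for a given margin of error, with finite population correction
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / N))

N = 250_000  # assumed total posts captured for the period
for margin in (0.05, 0.03, 0.01):
    print(f"margin of +/-{margin:.0%} -> n = {sample_size(margin, N)}")

The accountants tend to like the shape of that curve: halving the margin of error far more than doubles the manual work.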

How Does this Apply to SMM?

Sampling is great for SMM because, if your crawler is working to 99% coverage or better, and your clients are only interested in the web sources you index, then you can spend time doing far better sentiment analysis, manually, than the algorithms will EVER manage to produce on larger sets of data.

You have two approaches to using sampling: sample the entire set of data BEFORE you index it, or sample into the results (or index strata) for a given topic.

So for Twitter, for example, with maybe 100 million global tweets per day, it would be an efficient strategy to work from the total per day (you hear about it in the news, or you can maybe pay for the info on a country basis too); if you can geographically restrict your base, then you can work on the main number being in accessible languages and sample down to a much smaller figure.
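A back-of-the-envelope sketch (the country and language shares below are invented purely for illustration) shows how quickly 100 million tweets a day collapses to a very manageable sample:

import math

daily_tweets = 100_000_000
country_share = 0.02      # assumed share of tweets from the target country
accessible_lang = 0.90    # assumed share of those in languages you can rate

base = daily_tweets * country_share * accessible_lang  # roughly 1.8M tweets/day

# n for a +/-3% margin on a proportion at 95% confidence; the finite
# population correction barely matters at a base this size
z, p, margin = 1.96, 0.5, 0.03
n0 = (z ** 2) * p * (1 - p) / margin ** 2
n = math.ceil(n0 / (1 + (n0 - 1) / base))
print(f"base of about {base:,.0f} tweets/day, sample needed about {n}")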

Comes with a Health Warning

Now here come two dangers. The first relates to the latter instance of Twitter sampling: if you are looking to drill down to tweets on a brand name, in one country or language, over a short time, then you run the risk of NOT sampling enough records to give statistical robustness at that level of "cross tabulation", aka drill-down. This is actually a big problem with many standard market research studies, because when you drill into, say, age by location by salary by marital status, suddenly your stats on that sub-sample, whether to compare them or predict from them, evaporate into improbabilities. Hence you stratify your sample: you use a common-sense or pre-filtered method to include those sub-populations, or you study those sub-populations against a sample of the general public.
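A quick sketch of that drill-down problem, with invented records and a made-up minimum cell size, might look like this: count how many records land in each cross-tabulation cell and flag the ones that are too thin to report on.

from collections import Counter
import random

records = [{"brand": random.choice(["BrandX", "BrandY"]),
            "country": random.choice(["UK", "DE", "SE"]),
            "week": random.choice(range(1, 5))} for _ in range(2_000)]

MIN_CELL = 100  # below this, treat the cell as indicative only, or boost the stratum
cells = Counter((r["brand"], r["country"], r["week"]) for r in records)
for cell, n in sorted(cells.items()):
    flag = "" if n >= MIN_CELL else "  <- too thin, stratify or over-sample here"
    print(cell, n, flag)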

You see where I am going with this: your itemised consumer posts are the "population", not the individual users.

The other problem is in this stratification: you choose a criterion which is not exclusive enough, or exhaustive enough, to satisfactorily capture the sample or the population to sample into. This would be the case when building a simple query, a search taxonomy to dig out records in a certain topic area, time frame or geography. If you don't capture accurately, then all the stats techniques in the world applied to that sample-population relationship won't save you from GIGO.

Beauty in Eating only Some of the Elephant

However, the beauty of sampling as a tool is also apparent: you can run very quick manual dips into pre-indexed data (indexing takes time!), into query results from your indexed data, or into your indexed data set as a whole. You then have a back-up when delivering automated sentiment results and taxonomy query reportage: in other words, you run the sentiment algorithm on the total data for the period you want, then sample into it and compare your manual ratings (based on at least the same principles as the algorithm) done carefully on the sample. This gives you a whole new strategy for adding value to client reports; it gives you a means of building a taxonomy with the security that you have enough threads to take keywords out of; and it gives you a QA on your taxonomy or automated sentiment reportage.
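A minimal sketch of that QA comparison might look like the following; both sets of ratings are simulated here, whereas in practice one column would come from your algorithm run on the full data and the other from a careful human coder working only the sample.

import random

random.seed(1)
sample = [{"id": i,
           "auto_rating": random.randint(1, 5),     # algorithm's 1-5 score
           "manual_rating": random.randint(1, 5)}   # human coder's 1-5 score
          for i in range(400)]

exact = sum(r["auto_rating"] == r["manual_rating"] for r in sample)
close = sum(abs(r["auto_rating"] - r["manual_rating"]) <= 1 for r in sample)
print(f"exact agreement: {exact / len(sample):.0%}")
print(f"within one scale point: {close / len(sample):.0%}")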

DIY Social Media Sentiment Rating

Worse than that for an SMM company: it gives the average biology graduate, let alone a stats major, a tool to go and do it all manually, by knowing forum sizes or tweet rates and then looking up the random sampling tables to get their sample points or periodicity.

It then comes down to how objective the human can be in allocating sentiment to a post on, say, a five-point scale. Some people would prefer to leave this to an algorithm, and I have sympathy for that!
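For the DIY-er, drawing that periodic (systematic) sample is a few lines of Python; the post list below is just a stand-in for a real forum export.

import random

posts = [f"post {i}" for i in range(5_000)]

target_n = 200
step = len(posts) // target_n    # the "periodicity": rate every 25th post
start = random.randrange(step)   # random start so position in the thread doesn't bias you
to_rate = posts[start::step]

print(f"rate every {step}th post starting at {start}: {len(to_rate)} posts to score 1-5 by hand")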

Under the Bonnet of Social Media Monitor Tools

How does a social media monitoring search engine work? Need it be so expensive to use when Google is free? What can we do ourselves in-house? How do we cope with analysing 100 million tweets per day?

Right now, from what I have seen, the SMM companies combine several principal computing competencies (or modules, if you like), all of which are actually undergrad level, but which of course require man-hours and server resources beyond what universities will lavish on "youff" or even post-docs. This is probably why there are so many start-ups in SMM strangling each other and dragging down the prices chargeable by the most established consumer data mining companies!

Here are the key modules and processes that an SMM Search Tool will include:

1) The "'bot":
this is the robot, or crawler/spider, call it what you will, which goes out and gathers information from the web sites. In effect it works just like your browser, but instead of routing all the HTML and graphics to be displayed as a nice web page, it identifies the "posts" or "tweets" or other common text fields (or information like consumer-assigned sentiment). This is also called screen scraping, but the complexity of forums and the perishability of tweets mean it has to be more sophisticated; still, this is undergrad project stuff.

Another issue here is that you probably want a link back to the original web page, and you may want to track users by their names and time-stamp the posts, so your crawler-scraper has to handle this. As you get the drift, the eventual items will be stored in a database, most often these days a MySQL back end. Each item gets its own ID and sits as a record with a few fields, and an indexing methodology is applied so it can be found efficiently from keyword searches.

Now of course, out in the world there are lateral differences between web forums, and of course between FB and Digg or MySpace, and these sites themselves vary longitudinally, i.e. in the course of their lifecycle they get reformatted, revamped and extended, and forums often get a new installation of the "forum engine", or a completely new one. Now the project starts to take on jobbing graduates (often drop-outs, actually, who end up earning more than those who stayed to finish their bachelor's!) just to keep on top of the restructuring, and to deal with crawlers getting kicked off forums, i.e. the IP address or browser type is "sniffed" as being unwanted spying!!!
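As a rough sketch of what such a 'bot boils down to, assuming the requests and BeautifulSoup libraries and some invented CSS selectors (every forum engine lays its pages out differently, which is exactly the maintenance headache described above):

import requests
from bs4 import BeautifulSoup

URL = "https://example-forum.com/threads/brandx-reviews"  # hypothetical thread page
headers = {"User-Agent": "Mozilla/5.0 (compatible; polite-research-bot)"}

html = requests.get(URL, headers=headers, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

items = []
for post in soup.select("div.post"):  # assumed container element for one consumer post
    items.append({
        "author": post.select_one(".author").get_text(strip=True),
        "timestamp": post.select_one(".date").get_text(strip=True),
        "text": post.select_one(".message").get_text(" ", strip=True),
        "url": URL,  # link back to the original page
    })
print(f"scraped {len(items)} posts")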

2) The "itemiser": This is the next step from screen scraping the information: it has to be repakaged in a standard format so that the firms software can use it ie all data should be standardised, and "Unitised". So another key difference to what your web browser and things like FF Scrap Book do is that the unit of selection is no longer the entire web page, but rather the individual snippit relating to one consumer comment which is stored and indexed.

3) The indexer and navigator: this is applied both to entire datasets and to individual entries. This is a little beyond the scope of this blog, and to be frank beyond my know-how outside XML, but all the words are indexed against each post ID: some indexers ignore common English words ("the", etc.), and some run "triages" to index variations of difficult brand names (e.g. acronyms, nicknames, abbreviations). Most SQL databases have an engine which will naively index, or can be programmed to do the above, or you can build a specific indexing strategy in the engine or import the structure into it.

Indexing is the way "in"; navigation is the way "out", but in effect it relies on good indexing. So you type in a search word, and the system has a quick way of finding that word in the index and then finding all the records which contain that word (or batches of, say, 100 per results page). This is the clever bit at Google and Yahoo: getting this to work at super speeds!

Computers are naturally top athletes when it comes to running through lists, referring to "attached" or referenced data and then retrieving it. In effect, Google and Yahoo build a library index-card system, but instead of telling you the shelf location of one book by author or title from the alphabetical indexes, they know the shelf location for every word. Thereafter they retrieve a short abstract (the listing entry) and of course the link back to the original web page, and display this for you (at, say, 30 or 100 at a time; the index has a ranking which orders posts by various generic, pedantic or esoteric means, and this is the soul of the arms race in Search Engine Marketing/Optimisation).
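A toy sketch of that index-card idea, an inverted index mapping each word to the post IDs that contain it, with common words skipped as described above:

import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}

posts = {1: "The BrandX phone is great",
         2: "BrandX battery died after a week",
         3: "Switched to BrandY and never looked back"}

index = defaultdict(set)
for post_id, text in posts.items():
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        if word not in STOPWORDS:
            index[word].add(post_id)

# "navigation" is then just a lookup into the index
print(sorted(index["brandx"]))  # -> [1, 2]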

4) The web portal: this is then just a straightforward front end and "middleware" to shuttle the query and results back and forth from the standard web browser, or the mobile device portal. You can also use the back-end SQL interface itself, and sometimes an intranet HTTP version is included so you can see how Oracle or MySQL think the web interface should work. Even a Boolean search page and results service like Google Advanced or All-the-Web is now an undergrad-level project, and loads of "home brew" programmers can do a pro-looking job in PHP. All this and the server (the computers at the SMM company end) are covered in other lectures.
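And a minimal sketch of that portal layer, here using Python and Flask rather than PHP, with a hypothetical search_index() helper standing in for whichever back end you actually built:

from flask import Flask, request, jsonify

app = Flask(__name__)

def search_index(term):
    # placeholder: look the term up in your inverted index or SQL full-text index
    return [{"id": 1, "snippet": f"example hit for '{term}'"}]

@app.route("/search")
def search():
    # shuttle the query from the browser to the back end and the results back out
    term = request.args.get("q", "").strip().lower()
    return jsonify(results=search_index(term) if term else [])

if __name__ == "__main__":
    app.run(port=8080)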