Contention: Traditional hard and boring market research techniques can save your bacon when you are looking to turn around fast, accurate and, most of all, PROFITABLE research reports or data mining portals.
I've been lucky enough to get inside the doors of a couple of Social Media Monitoring (SMM) hot-shops, and I also have a friend who has programmed both back end and front end for screen-scraping and data mining in web monitoring resources.
I see that with today's technology the usual IT smoke-and-mirrors dissolves away into what is now pretty standard undergraduate project work, just lots of it. Even sentiment rating tools are at the level of a final-year project for a Harvard Comp Sci major.
In this blog let us look back under the bonnet (see the last blog) and maybe chuck out the V8 single carb for a nice little supercharged injection engine.
A Heavy Handed, Verbose Approach to Social Media Monitoring
What the media-monitoring companies seem obsessed with is having an accurate census of all posts in consumer forums and, now that they have access to FB and Twitter, of all hits on the brand names. This is because, like Everest or the Great Wall, they can build it. Not that they really need to 'go there', but because they are used to huge data resources it seems the right thing to do.
Now this is by no means a labor of Sisyphus; it probably just feels like that, especially when major forum web sites kick your crawler's IP address off their triage server. Up until now, server resources have not been an issue, but with capture and indexing for Twitter, Digg and not least FB, indexing and retrieval navigation start to become sluggish and resource-heavy on the vertical, on the horizontal and in the labor of maintenance and expansion.
So, faced with the practical, economic and client-patience issues of providing huge data mines, what is the alternative "blue sky" out there? Well, come on guys, don't re-invent the wheel.
There is a cultural issue with computer science "majors": they race through math, especially statistics, and only listen to the more web-oriented parts of any marketing course units (if they take any at all). Like many people outside the profession, they see marketing as all comms and "fluffy", not a serious branch, just a bunch of luvvies making hot air.
What they all want to do now is either be a hot Java programmer, admirably, and/or do phone apps or APIs for web sites. What they do not want to spend time on is "aligning" to the harder side of marketing. The web is the new patch; keep off with your Taylor Nelson 1950s stats and punch cards.
Well, post-pubescent programmer, er sorry, software engineer/app developer, you are wrong. Traditional hard and boring market research techniques can save your bacon when you are looking to turn around fast, accurate and, most of all, PROFITABLE research reports or web data mining portals.
Sample Dear Boy!
The first population census was perhaps conducted and documented as long ago as 3400BC, according to Wikipedia. Census means getting ALL the possible data points in the bag. Even in the infancy of market research, the census approach was rejected for consumer reports because it is just too cumbersome and expensive, and if you make any mistakes in the methodology, like ambiguities in questions, then your entire results are biased and inaccurate. You also capture a lot of irrelevant data points.
Market research managers, the hard men of the industry, soon adopted sample-set techniques from statistical science, used earlier in biology for example, and went on to develop their own methodologies for creating such smaller sub-groups. The key here is that the mathematics gives you a known accuracy, or probability, that your results from the sample group represent the population, whether that is the whole of a country or a set of, say, people visiting Walmart on Tuesdays.
Researchers used random probability sampling and went on to use stratified random sampling and nice temporal techniques like "sample-resample", where time is also a random factor in combination with geographical location as a random or defined set.
At the outset, all the techniques mentioned (apart from sample-resample, where you define the time period) rely on knowing the size of the total set, maybe the national population or the geographical locations of all supermarkets. However, you can use educated estimates or other statistical techniques to produce a number.
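To make stratified random sampling a bit more concrete for forum posts, here is a minimal Python sketch. The post records, the `forum` field and the function name `stratified_sample` are all invented for illustration, and proportional allocation is just one common way to share a sample out between strata, not the only one.

```python
import random
from collections import defaultdict

def stratified_sample(posts, stratum_of, total_n, seed=0):
    """Draw about total_n posts, shared out in proportion to stratum size."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    strata = defaultdict(list)
    for post in posts:
        strata[stratum_of(post)].append(post)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        # Proportional allocation, with at least one post per stratum
        # so that small groups are not lost altogether.
        k = max(1, round(total_n * len(members) / len(posts)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 1000 posts, a quarter of them on a "phones" forum.
posts = [{"id": i, "forum": "phones" if i % 4 == 0 else "cars"} for i in range(1000)]
sample = stratified_sample(posts, lambda p: p["forum"], total_n=100)
```

Note that the function needs the full list of posts up front, which is exactly the "know the size of the total set" requirement described above.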
Dangers and Benefits of Statistical Sampling
These techniques rely on a few assumptions, and often on some manual intervention to keep the sampling common-sense. Herein lie two dangers though:
1) If you have, say, one model with assumptions for the population total, and there are of course further assumptions in things like the t-test and the chi-squared test, then you can create massive sampling errors through the combination of the two levels of assumptive error.
2) You can become overly confident in the results based on a very good (and expensive) sampling methodology, while being completely let down by "non sampling error", usually a crappy questionnaire.
The big benefits are that you get a sensible and often very small sample size to then look into with GOOD study methodology, i.e. you can put the man-hours into design, execution and results and not the shoe work. Also, to the delight of many accountants or numbers guys with MBAs in senior management, you can present the margin of error expected from a sample size (always "n") and thus make cost-benefit decisions based on the need for accuracy.
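The trade-off between "n" and margin of error can be sketched in a few lines of Python. This uses the textbook normal approximation for a proportion at 95% confidence (z = 1.96) with the worst case p = 0.5. It assumes a simple random sample from a population much larger than the sample, and the function names are mine.

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level

def margin_of_error(n, p=0.5, z=Z_95):
    """Half-width of the confidence interval for a proportion p at sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

def sample_size(margin, p=0.5, z=Z_95):
    """Smallest n that gives at most the requested margin of error."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

needed_for_5_points = sample_size(0.05)  # 385 posts
needed_for_3_points = sample_size(0.03)  # 1068 posts
```

Going from plus or minus 5 points to plus or minus 3 points nearly triples the manual rating work, which is the kind of cost-benefit call the numbers guys can actually make.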
How Does this Apply to SMM?
Sampling is great for SMM because if your crawler is working to 99% or better, and your clients are only interested in the web sources you index, then you can spend time doing far better sentiment analysis, manually, than the algorithms will EVER manage to produce on larger sets of data.
You have two approaches to using sampling: sample the entire set of data BEFORE you index it, or sample into results (or index strata) for a given topic.
So for Twitter, for example, with maybe 100 million global tweets per day, it would be an efficient strategy to work from the total per day (you hear about it in the news, or maybe you can pay for the info, on a country basis too). If you can restrict your base geographically, you can work on the main number being in accessible languages and sample down to a much smaller figure.
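As a rough sketch of sampling BEFORE indexing: if you trust the daily total, you can pick the positions to keep in advance and let everything else flow past the indexer. The 100 million figure is the guess from the text, the sample size of 1,000 is arbitrary, and `pick_positions` and `sample_stream` are hypothetical helper names, not part of any Twitter API.

```python
import random

def pick_positions(daily_total, sample_size, seed=0):
    """Choose which positions in the day's stream to keep, before any indexing."""
    rng = random.Random(seed)
    # range() is lazy, so this does not build a 100-million-item list.
    return set(rng.sample(range(daily_total), sample_size))

def sample_stream(stream, positions):
    """Yield only the items whose position in the stream was picked."""
    for position, item in enumerate(stream):
        if position in positions:
            yield item

# Assumed figures: 100 million tweets in the day, 1,000 wanted in the sample.
positions = pick_positions(daily_total=100_000_000, sample_size=1000)
```

In use you would wrap the incoming feed as `sample_stream(feed, positions)` and index only what comes out. If the real total turns out lower than your estimate you simply end up with a slightly smaller sample.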
Comes with a Health Warning
Now here come two dangers. The first relates to the latter instance of Twitter sampling: if you are looking to drill down to tweets on a brand name, in one country or language, over a short time, then you run the risk of NOT sampling enough records to give statistical robustness at this level of "cross tabulation", aka drill-down. This is actually a big problem with many standard market research studies, because when you drill into, say, age by location by salary by single status, the stats you need to compare or predict on that sub-sample suddenly evaporate into improbabilities. Hence you stratify your sample: you use a common-sense or pre-filtered method to include those sub-populations, or you study those sub-populations against a sample of the general public.
You see where I am going: your itemised consumer posts are the "population", not the individual users.
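The first danger, too few records left in a drill-down cell, is easy to check for before anyone starts quoting percentages. A minimal sketch follows. The records, field names and the threshold of 30 are assumptions for illustration (30 is only a common rule of thumb, pick your own).

```python
from collections import Counter

def thin_cells(sample, keys, minimum=30):
    """Return the drill-down cells holding fewer than `minimum` sampled records."""
    counts = Counter(tuple(record[key] for key in keys) for record in sample)
    return {cell: n for cell, n in counts.items() if n < minimum}

# Hypothetical sample of 200 tweets tagged with brand and country.
sample = (
    [{"brand": "acme", "country": "UK"}] * 150
    + [{"brand": "acme", "country": "NO"}] * 38
    + [{"brand": "zenith", "country": "NO"}] * 12
)
flagged = thin_cells(sample, keys=("brand", "country"))
# flagged == {("zenith", "NO"): 12}
```

Cells with no records at all will not show up here, so a combination you expected to see but cannot find in the counts is a warning in its own right.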
The other problem lies in this stratification: you choose criteria which are not exclusive enough or exhaustive enough to satisfactorily capture that sample, or the population to sample into. This would be the case when building a simple query, a search taxonomy, to dig out records in a certain topic area, time frame or geography. You don't capture accurately, so all the stats techniques in the world applied to that sample-population relationship won't save you from GIGO.
Beauty in Eating only Some of the Elephant
However, the beauty of it as a tool is also apparent: you can run very quick manual dips into either pre-indexed data (indexing takes time!), or into query results from your indexed data, or into your indexed data set as a whole. Then you have a back-up too in delivering automated sentiment results and taxonomy query reportage: in other words, you run the sentiment algorithm on the total data for the period you want, and then sample into it and compare it with your manual ratings (based on the same principles, at least, as the algorithm) done carefully on the sample. This gives you three things: a whole new strategy to add value to client reports; a means of building taxonomy with the security that you have enough threads to take key words out of; and a QA on your taxonomy or automated sentiment reportage.
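The manual-versus-algorithm comparison can be boiled down to two numbers: raw agreement and Cohen's kappa, which discounts the agreement you would get by chance. Kappa is my choice of QA statistic here, not something the SMM shops necessarily use, and the ratings below are made up.

```python
from collections import Counter

def agreement_and_kappa(manual, automated):
    """Return (raw agreement, Cohen's kappa) for two lists of ratings."""
    if len(manual) != len(automated) or not manual:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(manual)
    observed = sum(m == a for m, a in zip(manual, automated)) / n
    manual_counts, auto_counts = Counter(manual), Counter(automated)
    # Agreement expected by chance, from each rater's category frequencies.
    expected = sum(manual_counts[c] * auto_counts[c] for c in manual_counts) / n ** 2
    if expected == 1:
        # Both raters used a single identical category; treated as full agreement.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)

# Hypothetical five-point ratings for the same ten sampled posts.
manual = [1, 2, 3, 3, 4, 5, 2, 3, 4, 1]
automated = [1, 2, 3, 2, 4, 5, 2, 3, 5, 1]
observed, kappa = agreement_and_kappa(manual, automated)
```

Plain kappa treats a 4-versus-5 disagreement as being as bad as 1-versus-5. On a five-point scale a weighted kappa would be fairer, at the cost of a little more code.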
DIY Social Media Sentiment Rating
Worse than that for an SMM company: it gives the average biology graduate, let alone a stats major, a tool to go and do it all manually by knowing forum sizes or tweet rates, and then looking up the random sampling tables to give them their sample points or periodicity.
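These days the random sampling tables fit in a dozen lines. Here is a sketch of the "periodicity" route, i.e. systematic sampling: a random start followed by every k-th thread. The thread IDs and the function name are hypothetical, and it assumes the threads are not listed in some order that cycles with the step.

```python
import random

def systematic_sample(thread_ids, sample_size, seed=0):
    """Random start, then every k-th item, giving exactly sample_size items."""
    if not 0 < sample_size <= len(thread_ids):
        raise ValueError("sample_size must be between 1 and the number of items")
    step = len(thread_ids) // sample_size  # the periodicity
    start = random.Random(seed).randrange(step)  # random start within the first step
    return thread_ids[start::step][:sample_size]

# Hypothetical forum with 1000 threads: take every 20th one from a random start.
picked = systematic_sample(list(range(1000)), sample_size=50)
```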
It then comes down to how objective the human can be in allocating sentiment to a post on, say, a five-point scale. Some people would prefer to leave this to an algorithm, and I have sympathy for that!