Thursday, March 24, 2011

DIY Social Media Sentiment "Barometer" Techniques

Contention: traditional, hard and boring market research techniques can save your bacon when you are looking to turn around fast, accurate and, most of all, PROFITABLE research reports or data-mining portals.

I've been lucky enough to get inside the doors of a couple of Social Media Monitoring (SMM) hot-shops, and I also have a friend who has programmed both the back and front end of screen-scraping and data-mining web-monitoring tools.

I see that today's technology means the usual IT smoke-and-mirrors dissolves away into what is now pretty standard undergraduate project work, just lots of it. Even sentiment-rating tools are at the level of a final-year project for a Harvard Comp Sci major.

In this blog let us look back under the bonnet (see the last blog) and maybe chuck out the single-carb V8 for a nice little supercharged, fuel-injected engine.

A Heavy Handed, Verbose Approach to Social Media Monitoring

What the media-monitoring companies seem obsessed with is having an accurate census of all posts in consumer forums, and now, with access to Facebook and Twitter, all hits on the brand names. This is because, like Everest or the Great Wall, they can build it. It is not that they really need to 'go there'; it is just that, being used to huge data resources, it seems the right thing to do.

Now this is by no means a labour of Sisyphus, it probably just feels like that, especially when the major forum web sites kick your crawler's IP address off their triage server. Up until now, server resources have not been an issue, but with capture and indexing for Twitter, Digg and not least Facebook, indexing and retrieval navigation start to become sluggish and resource heavy, both vertically and horizontally, and in the labour of maintenance and expansion.

So faced with the practical, economic and client-patience issues of providing huge data mines, what is the "blue sky" alternative out there? Well of course, come on guys, don't re-invent the wheel.

Comp'Sci'Bachelor's Myopia

There is a cultural issue with computer science "majors": they race through maths, especially statistics, and only listen to the more web-oriented parts of any marketing course units (if they take any at all). Like many people outside the profession, to them it all seems like comms and "fluffy" stuff, not a serious branch, just a bunch of luvvies making hot air.

What they all want to do now is either be a hot Java programmer, admirably, and/or build phone apps or APIs for web sites. What they do not want to spend time on is "aligning" with the harder side of marketing. The web is the new patch; keep off with your 1950s Taylor Nelson stats and punch cards.

Well, post-pubescent programmer, er, sorry, software engineer/app developer, you are wrong. Traditional, hard and boring market research techniques can save your bacon when you are looking to turn around fast, accurate and, most of all, PROFITABLE research reports or data-mining portals.

Sample, Dear Boy!

The first population census was perhaps conducted and documented as long ago as 3400 BC, according to Wikipedia. A census means getting ALL the possible data points in the bag. Even in the infancy of market research, the census approach was rejected for consumer reports because it is just too cumbersome and expensive, and if you make any mistakes in the methodology, like ambiguities in questions, then your entire results are biased and inaccurate. There are also a lot of irrelevant data-capture points.

Market research managers, the hard men of the industry, soon adopted sample-set techniques from statistical science, used earlier in biology for example, and went on to develop their own methodologies for creating such smaller sub-groups. The key here is that the mathematics gives you a known accuracy, or probability, that your results from the sample group represent the population, whether that is a whole country or the set of, say, people visiting Walmart on Tuesdays.

Researchers used random probability sampling and went on to use stratified random sampling and nice temporal techniques like "sample-resample", where time is also a random factor, in combination with geographical location as a random or defined set.

At the outset, all the techniques mentioned (apart from sample-resample, where you define the time period) rely on knowing the size of the total set, maybe the national population or the geographical locations of all supermarkets. However, you can use educated estimates or other statistical techniques to produce that number.

Dangers and Benefits of Statistical Sampling
These techniques rely on a few assumptions, and often some manual intervention on common-sense sampling choices. Herein lie two dangers though:

1) If you have, say, one model with assumptions for the population total, and there are of course further assumptions baked into things like the t-test and the chi-squared test, then you can create massive sampling errors through the combination of the two levels of assumptive error.

2) You can become overly confident in results based on a very good (and expensive) sampling methodology, while being completely let down by "non-sampling error", usually a crappy questionnaire.

The big benefits are that you get a sensible, and often very small, sample size to then look into with GOOD study methodology, i.e. you can put the man-hours into design, execution and results and not the shoe-leather work. Also, to the delight of many accountants or numbers guys with MBAs in senior management, you can present the margin of error expected from a sample size (always "n") and thus make cost-benefit decisions based on the need for accuracy.
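To put a number on that for the accountants, here is a minimal sketch in Python (my choice of language for illustration, nothing about any particular toolchain is implied) of the textbook margin-of-error arithmetic for a simple random sample, using the usual 95% confidence z-value and the worst-case p = 0.5 assumption:

```python
import math

Z_95 = 1.96  # z-value for a 95% confidence level (standard textbook assumption)

def margin_of_error(n, p=0.5):
    """Worst-case margin of error for a simple random sample of size n."""
    return Z_95 * math.sqrt(p * (1 - p) / n)

def sample_size(margin, p=0.5):
    """Smallest n achieving the requested margin of error at 95% confidence."""
    return math.ceil((Z_95 ** 2) * p * (1 - p) / margin ** 2)

print(sample_size(0.03))      # ~1068 hand-rated posts buys you roughly +/-3%
print(margin_of_error(400))   # ~0.049, i.e. roughly +/-5% from n = 400
```

So the cost-benefit conversation becomes: plus or minus three points costs about a thousand ratings, plus or minus five costs four hundred.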

How Does this Apply to SMM?

Sampling is great for SMM because, if your crawler is working to 99% coverage or better, and your clients are only interested in the web sources you index, then you can spend the time doing far better sentiment analysis manually than the algorithms will EVER manage to produce on larger sets of data.

You have two approaches to using sampling: sample the entire set of data BEFORE you index it, or sample into the results (or index strata) for a given topic.

So for Twitter, for example, with maybe 100 million global tweets per day, an efficient strategy would be to work from the total per day (you hear about it in the news, or you can maybe pay for the figure on a country basis too); if you can geographically restrict your base, then you can work on the main number being in accessible languages and sample down to a much smaller figure.
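As a hedged illustration of sampling down from a stream you cannot hold: reservoir sampling gives every tweet seen so far an equal chance of ending up in the final sample, without ever storing the whole day's volume. The tweet source below is just a stand-in generator, not any real Twitter feed or API.

```python
import random

def reservoir_sample(stream, k):
    """Uniform random sample of k items from an arbitrarily long stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)     # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample

# stand-in for one day's worth of already-filtered, same-language tweets
tweets = (f"tweet {i}" for i in range(1_000_000))
daily_sample = reservoir_sample(tweets, 1_000)
print(len(daily_sample))                 # 1000 tweets, ready for manual rating
```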

Comes with a Health Warning

Now here come two dangers, the first relating to the latter instance of Twitter sampling: if you are looking to drill down to tweets on a brand name, in one country or language, over a short time, then you run the risk of NOT sampling enough records to give statistical robustness at that level of "cross tabulation", aka drill-down. This is actually a big problem with many standard market research studies, because when you drill into, say, age by location by salary by single status, then suddenly your stats on that sub-sample, whether to compare or predict, evaporate into improbabilities. Hence you stratify your sample: you use a common-sense or pre-filtered method to include those sub-populations, or you study those sub-populations against a sample of the general public.
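A sketch of the stratification idea in Python, assuming your itemised posts already carry a country field (the field names here are invented for illustration): you sample within each stratum so that thin sub-populations still get represented.

```python
import random
from collections import defaultdict

def stratified_sample(posts, stratum_key, n_per_stratum):
    """Take an equal-sized random sample from each stratum (e.g. each country)."""
    strata = defaultdict(list)
    for post in posts:
        strata[post[stratum_key]].append(post)
    sample = []
    for members in strata.values():
        k = min(n_per_stratum, len(members))   # tiny strata: just take them all
        sample.extend(random.sample(members, k))
    return sample

# hypothetical itemised posts -- 'country' and 'text' are made-up field names
posts = [{"country": random.choice(["UK", "SE", "NO"]), "text": "..."}
         for _ in range(10_000)]
subset = stratified_sample(posts, "country", 200)   # 200 posts per country
```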

You see where I am going: it is your itemised consumer posts that are the "population", not the individual users.

The other problem lies in the stratification itself: you choose a criterion which is not exclusive enough, or not exhaustive enough, to satisfactorily capture that sample or the population to sample into. This would be the case with building a simple query, a search taxonomy, to dig out records in a certain topic area, time frame or geography. If you don't capture accurately, then all the stats techniques in the world applied to that sample-population relationship won't save you from GIGO.

Beauty in Eating only Some of the Elephant

However the beauty of sampling as a tool is also apparent: you can run very quick manual dips into either pre-indexed data (indexing takes time!), into query results from your indexed data, or into your indexed data set as a whole. You then also have a back-up when delivering automated sentiment results and taxonomy-query reportage: in other words, you run the sentiment algorithm on the total data for the period you want, then sample into it and compare your manual ratings (based on at least the same principles as the algorithm), done carefully on the sample. This gives you a whole new strategy for adding value to client reports; it gives you a means of building a taxonomy with the security that you have enough threads to take keywords out of; and it gives you a QA check on your taxonomy or automated sentiment reportage.
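As a sketch of that QA step, and only a sketch: assume both the human and the algorithm have rated the same sampled posts on the same five-point scale (the record fields below are invented), and simply ask how often they land within a point of each other.

```python
def sentiment_agreement(items, tolerance=1):
    """Share of sampled items whose manual 1-5 rating sits within
    `tolerance` points of the automated rating."""
    close = sum(1 for it in items
                if abs(it["manual"] - it["auto"]) <= tolerance)
    return close / len(items)

# hypothetical sampled records carrying both ratings
sampled = [
    {"id": 101, "auto": 4, "manual": 5},
    {"id": 102, "auto": 2, "manual": 2},
    {"id": 103, "auto": 3, "manual": 1},
]
print(f"{sentiment_agreement(sampled):.0%} of the sample within one scale point")
```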

DIY Social Media Sentiment Rating

Worse than that for an SMM company: it gives the average biology graduate, let alone a stats major, a tool to go and do it all manually, by knowing forum sizes or tweet rates and then looking up the random sampling tables to give them their sample points or periodicity.

It then comes down to how objective the human can be in allocating sentiment to a post on, say, a five-point scale. Some people would prefer to leave this to an algorithm, and I have sympathy for that!

Under the Bonnet of Social Media Monitor Tools

How does a social media monitoring search engine work? Need it be so expensive to use when Google is free? What can we do ourselves in house? How do we cope with analysing 100 million tweets per day?

Right now, from what I have seen, the SMM companies combine several principal computing competencies (or modules if you like), all of which are actually undergrad level, but which of course require man-hours and server resources beyond what universities will bestow upon "youff" or even post-docs. This is probably why there are so many start-ups in SMM strangling each other and dragging down the price chargeable by the most established consumer data-mining companies!

Here are the key modules and processes that an SMM Search Tool will include:

1) The " 'Bot":
this is the robot, crawler or spider, call it what you will, which goes out and gathers information from the web sites. In effect it works just like your browser, but instead of routing all the HTML and graphics to be displayed as a nice web page, it identifies the "posts" or "tweets" or other common text fields (or information like consumer-assigned sentiment). This is also called screen scraping, but the complexities of forums and the perishability of tweets mean it has to be more sophisticated; still, this is undergrad project stuff.

Another issue here is that you probably want to keep a link back to the original web page, and you may want to track users by their name and the time stamps of their posts, so your crawler-scraper has to handle this too. As you get the drift, the eventual items are stored using a database methodology, most often these days in a MySQL back-end database. Each item gets its own ID and sits as a record with a few fields, and an indexing methodology is applied so it can be found efficiently from keyword searches.
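For flavour, here is a minimal, heavily hedged sketch of such a crawler-scraper in Python, with SQLite standing in for the MySQL back end; the CSS selectors ("div.post", ".author" and so on) and the thread URL are pure invention, since every forum engine lays its pages out differently.

```python
import sqlite3
import requests
from bs4 import BeautifulSoup   # pip install requests beautifulsoup4

db = sqlite3.connect("posts.db")
db.execute("""CREATE TABLE IF NOT EXISTS posts
              (id INTEGER PRIMARY KEY, url TEXT, author TEXT,
               posted_at TEXT, body TEXT)""")

def scrape_thread(url):
    """Fetch one forum thread and itemise each post into the database."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # hypothetical selectors -- every forum engine structures its pages differently
    for post in soup.select("div.post"):
        author = post.select_one(".author").get_text(strip=True)
        stamp = post.select_one(".timestamp").get_text(strip=True)
        body = post.select_one(".message").get_text(" ", strip=True)
        db.execute("INSERT INTO posts (url, author, posted_at, body) "
                   "VALUES (?, ?, ?, ?)", (url, author, stamp, body))
    db.commit()

scrape_thread("https://forum.example.com/thread/123")   # placeholder URL
```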

Now of course there are lateral differences between web forums, and between Facebook, Digg and MySpace, and these sites also vary longitudinally, i.e. over the course of their lifecycle they get reformatted, revamped and extended, and forums often get a new installation of their "forum engine" or a completely new one. So the project starts to take on the jobbing graduate (often drop-outs actually, who end up earning more than those who stayed to finish their bachelor's!) just to keep on top of the restructuring, and to cope when crawlers get kicked off forums, i.e. when the IP address or browser type is "sniffed" as unwanted spying!

2) The "itemiser": This is the next step from screen scraping the information: it has to be repakaged in a standard format so that the firms software can use it ie all data should be standardised, and "Unitised". So another key difference to what your web browser and things like FF Scrap Book do is that the unit of selection is no longer the entire web page, but rather the individual snippit relating to one consumer comment which is stored and indexed.

3) The Indexer and Navigator: this is applied both to entire datasets and to individual entries. This is a little beyond the scope of this blog and, to be frank, my know-how outside XML, but all the words are indexed against each post ID: some indexers ignore common English words ("the" etc.), some run "triages" to index variations of difficult brand names (e.g. acronyms, nicknames, abbreviations). Most SQL databases have an engine which will naively index, or can be programmed to do the above, or you can build a specific indexing strategy in the engine or import the structure into it.

Indexing is the "in", navigation is the "out", but in effect it relies on good indexing. So you type in a search word and the system has a quick way of finding that word in the index and then finding all the records which contain it (or serving them in batches of, say, 100 per results page). This is the clever bit at Google and Yahoo: getting this to work at super speed!
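A toy inverted index makes the point; this is a Python sketch of the idea only, skipping stop words the way some indexers do, and nothing like the triage or ranking a real SMM engine layers on top.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "was", "is"}

def build_index(posts):
    """Map each word to the set of post IDs containing it (an inverted index)."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in text.lower().split():
            word = word.strip(".,!?")
            if word and word not in STOP_WORDS:
                index[word].add(post_id)
    return index

posts = {
    101: "The new BrandX phone is brilliant",
    102: "BrandX support was useless and slow",
    103: "Switched from BrandY to BrandX last week",
}
index = build_index(posts)
print(index["brandx"])    # {101, 102, 103} -- 'navigation' is now just a lookup
```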

Computers are naturally top athletes when it comes to running through lists, referring to "attached" or referenced data and then retrieving it. In effect Google and Yahoo make a library index-card system, but instead of telling you the shelf location of one book by author or title from the alphabetical indexes, they know the shelf location by each word. Thereafter they retrieve a short abstract (the listing entry) and of course the link back to the original web page, and display this for you (at say 30 or 100 at a time; the indexing has a ranking which orders results by various generic, pedantic or esoteric means, and this is the soul of the arms race in Search Engine Marketing/Optimisation).

4) The Web Portal: this is then just a straightforward front end and "middleware" to shuttle the query and results back and forth from the standard web browser or the mobile device portal. You can also use the back-end SQL interface itself, and sometimes an intranet HTTP version is included so you can see how Oracle or MySQL think the web interface should work. Even a Boolean search page and results service like Google Advanced or AlltheWeb is now an undergrad-level project, and loads of "home brew" programmers can do a pro-looking job in PHP. All this, and the server (the computers at the SMM company end), are covered in other lectures.

Tuesday, March 22, 2011

Mini Me

Well here it is actually, the mini CV.

As I have said in earlier blogs on job hunting, a few tips and experiences shared, this is a great way of working around recruitment consultants and getting your details kept at hand.

A CV is a private document, so the two-page A4 version is both physically and psychically cumbersome. People want to file it in a safe place, or shred it, or, vis-à-vis, delete it and clear it from Outlook's waste basket!

Consultants have had CD-ROM mini calling cards for years, and freelancers have had portfolios on the same media, and I dare say a few punters have had their CVs on there too.

But now you have a wallet-sized drop-off!

I'd not put as much on it as the guy in the image, just something like: "Mike Hunt, MSc Marketing 2011-12" (i.e. not "student" in such thick ink); "Expertise in: marketing, inorganic chemistry"; "Last employer (project): dee dee dum"; then contact details, Skype, e-mail, LinkedIn.

Friday, March 18, 2011

Assumptions and the Black Box

One fundamental management issue is what I call "assuming what the black box does", and this comes in the context of two negative outcomes: you either think the black-box process will perform as desired, or you think the black-box process is completely unimportant.

Are you getting a handle on the manager's idea of the "black box"? In this monologue the said satanic box is clearly a process, product, service or system which delivers part of a larger process or system. Managers assume that it will perform as described or, importantly, as they presume: it will deliver quality results, all on time; or, on the other side of the coin, it is a currently operating system which requires no management attention.

There is a very topical case right now, and that is the Fukushima power plant. A ten-to-fifteen-meter wave and body of water could perhaps be anticipated. However, everyone presumed they had adequate back-up for a failure in power to the cooling systems. The entire cooling system is no doubt "doubly redundant", in that there are two systems, each with their own internal back-up. I dare say they had many banks of doubled-up diesel generators in the event of power loss. However, they did not anticipate a swamping of these "fail safe" engine houses. They made an assumption that the black box would deliver in whatever crisis may come.

Now I am a pedantic type when it concerns detail I am interested in, and also "what if" scenarios. I should probably have become a safety engineer or the like; anyway, instead I often hear alarm bells when I see managers either lighting up with glee that "this black box will solve our problems!" or, eyes buried condescendingly in paperwork, muttering "that black box is unimportant, Jim".

Black boxes most often arise in corporations within software, system management or consultancies, and woe betide thee who combines all three on your desk as a general or marketing manager! The related group of nouns includes: "alignment; migration; evolution; deliverables; and even, basically, Project".

Patenting as an Example of Intellectual Russian Dolls
One area I have been involved with, in some depth, is patenting. Or rather the results of "black box fix-its" in the whole patenting process and the contention of patents thereafter. After a three-year break from anything to do with the EPC and the US Patent Office, it lands on my desk again.

This is a very good case in point: a competitor was challenged for possible infringement. Very quickly, however, the patent showed itself not to cover all eventualities or arrangements of that type of solution. It was a watertight patent, but only for one defined, narrow solution. The black box here was the patent itself.

The patent agents' and attorneys' very own black box was how the patent offices would treat the patent and how it could be contended, that being the next process they pipe the application and "file wrapper" into. They DID NOT presume they could get an umbrella patent; in fact the reverse: they considered a safe way of feeding the black box with food it would most probably not spit back in their faces: a narrow, single-application patent.

The agents then of course could call the patent "Patent of this type of thing" in the title, which is what managers read, and then tell them, only upon challenge, that an umbrella patent would never have been accepted, or would have been so expensive to contest that it is better to have a nicely defined, safe and, most of all, granted little patent.

If you read line one of the "claims" then you know immediately that this is a simple patent based around the single design solution, instead of trying to be an "umbrella patent" which ideally covers the whole concept. Management assumed that paying 50,000 USD for the relevant global coverage WOULD deliver exclusivity on the general solution.


So you kind of get the point... and moreover... worse: there can be a Russian-doll chain of black boxes. One could argue the basis of the western capitalist system rests on just that: you believe the next black box in line to you will create value, and that is enough for your own little horizon. Hence the whole sub-prime pyramid sell could happen and completely undermine the entire system.

Buyer Beware Black Box Peddlers


I remember the film Total Recall, where Arnie is visited by an expert doctor sent to coax him out of his alleged dream world as a rebel on Mars. Arnie goes along with the story, until he notices that a single bead of sweat is running down the good doctor's brow. He then realises the trap and responds in true Terminator style.

Now of course you can never know the exact workings of all those black boxes in the chain of systems or events around you. However, you can apply quality standards and systems like Six Sigma to impose standards across all black boxes. Hence if Six Sigma is a commonly accepted and learned mantra, you worry less about what each box does. A different approach is to get expert advice on an independent basis, or to go and learn the gist of the system: this is where I stand. I either give a damn and learn as much as I can, and feel I need to, about the system, or I ask someone to show me it, or better yet an impartial person to tell me about its strengths, scope and weaknesses.

When dealing with consultants, sellers or internal managers, especially IT managers, you have to look for that bead of sweat, or that little bit of over-confidence in their offer to solve the woes of the boss. Also avoid, as I have blogged before, ambushes in these situations, and rather make space for "scoping" and "scenario affirmation" or whatever BS, to basically smoke out the BS and get to what is really being offered and which eventualities really may not be covered. Then throw in a ten-meter-high wall of ocean visiting you for an hour, for good measure!

(c) Author 2011

Wednesday, March 16, 2011

Steely Scam

During the recession of the 1970s and early 1980s, there was alleged overcapacity in steel and, to a lesser extent, aluminium. Earlier in the 1970s there had been an energy crisis, which in part precipitated the recession (citation 1, 2). At the time of the recession, many steel works around the western world were either in public hands or subsidised, and protectionism (which is still an issue with countries like the USA and Japan (citation needed)) was frequently used by countries trying both to defend their industry from low-cost countries and to secure enough supply (and perhaps oversupply) for their key industries like car building.

In a very odd combination of command economics and free marketism, countries more or less conspired to drive capacity out of these industries, with the UK being a major loser, first in terms of jobs and then later in loss of capacity.

So on one side subsidies were to be reduced and companies fully privatised, while on the other, supply would be "capped" so as to make the price of steel higher and thus more attractive to stock market investors. As mentioned, once again many countries "cheated" and took legal loophole clauses and "modernisation" exceptions to their actual plans.

Meanwhile the UK, under the Thatcher administration, went about taking out value-adding processes such that smelting plants looked less economic and could be closed on grounds of unviability. So for example the strip mill at Gartcosh, which added value to the raw ingots produced at Ravenscraig, was closed by command.

Ravenscraig enjoyed economies of scale, local energy supplies and a deep-water port within economic distance, and should have been a prime candidate for privatisation towards a free market, with its existing value-adding plants and local customers, especially in the oil industry.
However, the Conservative government of the 1980s considered that the stock market would not want to invest in an industry with overcapacity and the potential for more competition from the Far East at that time. Since they had few seats in Scotland, they chose this plant, and it was not just Sir Ian MacGregor's management board who decided on the action (citations needed). The same was also partly true of the UK aluminium industry (citation).

A decade or so later, steel was starting to come into short supply, with many of the Far Eastern smelters increasingly tied to Japanese, Korean and latterly Chinese customers, where geographical cost savings and fortuities could be taken into account. Aluminium started to become an economic alternative in some industries, and there too some shortfalls in supply drove prices up (citations).

Dole or Subsidy in Transition to Market Metals?

The key socio-economic question is: was it better to subsidise and protect the industry, as the USA did anyway through to the 2000s (cite), or to pay the social benefits and soft capital to try to rebuild the economies of those communities which were so dependent on steel and coal?

There is no real yes-or-no answer to this question, because there may have been an inevitability that, with oversupply and emerging low-cost nations (India included), the outcome would have been that the UK industry was not competitive. On the other side, given a smoother de-subsidisation and privatisation with the state as a benevolent share- and stakeholder (the current Chinese and Norwegian model for primary and heavy industries), and the upturn in demand for steel which did happen anyway, these plants could have been active in an eventual fully free market (in the EU at least).

Couple to this the advances in technology in production, logistics and the whole value-adding chain, and the steel industry could have survived and thrived as a larger entity in the UK. Whether or not the stock market would have been particularly interested in the difficult transition period is another matter, but institutional investors may have been willing both to take the long-term risk AND to contribute to protecting value (e.g. savings and pensions, value-chain share prices) in the economy as a whole.

Saturday, March 12, 2011

Dynamic Web Page Interface Languages

This is the first entry in the blog version of lecture II in "web site languages". It is aimed at the non-IT manager as well as alma mater Strathclyde University MSc Marketing students with no prior technical IT background. It is broken down into smaller blogs for ease of reading and repurposing.


PHP is probably the leading dynamic language today

PHP has probably become the world-leading language in terms of the volume of web sites and transactions handled. It is an open-source language and engine, so many developers cooperate in its development and it is available in free versions as a server engine. As a language it is interpreted on the server by a "mother" programme, very much like the other languages discussed below (ASP, CFM), so the browser itself never sees the PHP code, only the resulting output.

Coding is embedded in the HTML, in URL constructs and in posted form data, as well as in the files on the server side. The language has evolved a diversity of really useful "routines" which you call up with a fairly simple mnemonic or English-sounding programming command, and which are actioned by the server through its interpreter.

By and large, information from the client (you!) is requested in forms, buttons etc., or referred to via cookies (or, if you are midway through a transaction, the session file), and then fed into programmed actions which give planned results in the information and web pages you get back. Thus what you actually see on the web page is often the result of quite a lot of dynamic computation, based on what the server had to work with in the second you visited or interacted. This applies to everything from building up a shopping cart full of goods and moving to checkout, to just the simple dot-com request for the home page.

Historical note on PHP
While Cold Fusion was ruling the cutting-edge roost, PHP was the poor man's cousin: "Personal Home Pages" was a free shareware language used mainly in small web start-ups and colleges. Microsoft then took the Cold Fusion crown with ASP, probably because Active Server Pages was better integrated with the SQL Server language, which a lot of nerds were learning at university.

At that time, Cold Fusion programmers could charge a premium of even several thousand dollars a day! ASP was pretty quick to pick up if you had both HTML and SQL Server, and Microsoft had the infrastructure to support widespread training. Thus CF and .cfm lost the stranglehold they had on the market, and Macromedia faded into a buy-out by Adobe. I actually moved a whole web agency over to ASP from Cold Fusion in 1999, because of both cost and the obstreperous nature of "cfm" programmers.

Meanwhile, the community of PHP developers was sneaking up, and it wasn't long before the cross-platform, open and cheap nature of using PHP and the free MySQL eroded Microsoft's heavier-weight and, arguably, restrictive ASP/SQL Server structures. Being free and open source meant programmers could get mutual benefit from building upon the language itself, having peer review on the web for their prototype commands/routines, and then the whole language could evolve forward, keeping the best constructs while rejecting the weaker ones.
I remember turning up at a fledgling web agency in 2000, having worked at a CF/ASP hot-shop as a project manager, to see this home-brew code in PHP and taking the whole thing to be amateur. How wrong we were all to be proven.

What Dynamic Languages Do and How

All the languages mentioned above (but not Perl script and some others very little used today) have a fairly common modus operandi: when the server is configured to accept PHP requests and run PHP commands server side from those client requests, input strings/forms/requests go to the appropriate area on the server and allow the programme to execute a response, and most often send information back to the user on the internet. Most often this will execute a database query based upon an input string. Otherwise it could fetch content from a source which varies, such as a file location or a third-party web site: by just changing one parameter, like the contents of the folder or the web address, you then serve up new information without needing to change the page code itself.

Information sent to and from the server can be text based, mediated in various string forms or XML; it can be graphics or other "sub-files" to be used on the web pages at the client end; or it can be array based, where data is hidden on the client end and presented upon request.

Programmes like PHP and their eventual coding allow several useful operations to be combined in one command, which often has some "English" comprehensibility, e.g. $_GET is a way of handling the strings from a GET request.

In PHP you can choose to use GET or POST when forms are submitted: POST goes behind the scenes, so to speak, while GET places the search string in the URL.
You can look into the "under the bonnet" workings of many web sites if there is a GET query in the URL when you interact with the website. Usually this approach is submitting a query behind the file name:

eg. www.xyz.com/page.asp?ring=3&type=30&date-01-01-2011:01.03.2011.....

The question mark denotes the GET query; the first or primary element is defined, and thereafter the ampersand '&' is used to separate the parameters, the conditions and any Boolean AND search terms, which are included either by your typing in a word or as part of your order processing, for example.
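To see what the receiving script is handed once that string is split up, here is a small Python sketch using the standard urllib module; it only illustrates the mechanics that PHP's $_GET array gives you, and the URL is a simplified, made-up version of the example above.

```python
from urllib.parse import urlparse, parse_qs

url = "http://www.xyz.com/page.asp?ring=3&type=30&date=01-01-2011"
query = urlparse(url).query    # "ring=3&type=30&date=01-01-2011"
params = parse_qs(query)       # roughly what $_GET hands to the server-side script
print(params)                  # {'ring': ['3'], 'type': ['30'], 'date': ['01-01-2011']}
```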

If you come across one such GET line in the URL, then you can play around with building your own queries and submitting them to see what works and what comes back. As a web developer you can of course also use such a web site to deliver queried information via your own server (but not directly in web pages, due to the 'same origin policy' enforced in browsers, see 'security'). The only current alternative to scraping content or 'proxy' serving queries is to use RSS feeds from third-party web sites and then parse their content into the web page or create an on-the-fly array to query.

Another way of analysing dynamic web sites is to use Firebug: right-click on any dynamic element and "Inspect Element", which takes you to the code directly. This is invaluable because web code can take ages to read through and it is easy to miss a table element, for example. This is one use of the term "screen scraping", in this case for dynamic code and not XML content.

e.g. www.searchmonkey.com/search.php?q=monkeys : Apache gets the URL request and knows both to go to the search.php location and to send the string "monkeys" to be processed by the PHP there.

The PHP programme must be installed on the server's network; CFM used to be just a disc or an online download for ISP web hosts. The primary end of the server network, usually Apache, then sends requests for PHP pages from the browser to the appropriate file location.

PHP has some clumsy language and syntax according to many programmers, but it is relatively light on the client-server connection and easy to learn. PHP 5.1 and 5.2 were stable and largely debugged by 2010 on most server types.

There is a master command file, an initiation/config file usually called "php.ini", which sets up which elements and syntax will be allowed.

Despite the criticisms, PHP above all delivers reliable and scalable web sites. Programming resources are now widely available, and the ease of learning and shareware nature put dynamic web sites within the reach of many amateurs. It is very much a language that is here to stay and will continue to evolve with browser and server technology. So for the foreseeable future we will see many web sites, or at least pages, based around PHP or wholly dependent on this dynamic language.

Web Site Languages Part I

Languages Utilised in Internet Technologies

This is part of a larger lecture-blog series which is aimed at marketing students at the University of Strathclyde, and at non-IT managers. In this lecture, and the following one, we introduce some more terminology and take a behind-the-scenes look at how modern web sites are programmed and how they interact with you, the "client".

Introduction:

How and Where Does All This Fancy Stuff Get Done?

We have considered the simple protocol languages which enable communication to be ordered and policed on the physical internet. Once connected to a server, in order to present and interact with web sites, there are computational tasks mediated both by programmes on the web server itself, like Apache and PHP, and on your own home computer, mediated via your web browser programme.

Both "server side" and "client side" language-engines rely on reading instructions ( code) from files which are either downloaded or refered to, called upon if you like, to perform tasks in presenting the web pages, shutteling information and performing calculations.

For many programmers, the goal is to have most of the work done server side, so that communication to and from the client (i.e. you and your Internet Explorer/Firefox) is quick. This approach is known as the "thin client" and has become popular with the growth of mobile phones being used on variable connections. It limits the number of HTTP requests and the size of code in each download to the client machine, thus reducing front-end server load and outward bandwidth requirements.

The downside of the "thin client" is that, with more complex web sites and interactions, there is a heavy computational load on the server structure, usually a second layer of computers, which are then required to work more processes. This requires more thought about "vertical scaling", a discussion taken up in another lecture-blog. Also, although the processing reduces code download, it may in some cases increase the amount of raw data, because little can be "expanded upon" at the client end.

However, to provide the most rapid web experience, a compromise is often reached for the more data-rich web sites, like Google Maps, whereby the local client runs an API (see below) and requests packets of information without needing to reload the entire web page (more detail on this in the second lecture in this series). Many mobile phones have specific "apps" for web-mediated services like mail and social media sites, which offer a slightly thicker client at a single download but are better optimised for speed and presentation in the small-screen environment.

In some more technical depth:
In essence, most languages used in building and running web sites, whether they run client- or server-side, are actually instructing a mother programme to perform computational tasks: that is to say, the language coding itself is not interpreted or compiled directly.

Exceptions to this would perhaps include running Java modules on your local 'client' PC, some PERL script used in communications, or using "C" and other interpreted/autocompiled languages running server side from instructions sent forward from other languages like XHTML, PHP or Javascript.

At the client end, i.e. you, the mother programme is most often the internet browser (apart from Java, which is nonetheless presented through the browser window), and some of the updates you receive to browsers are important because they allow the latest advances in the languages to operate on your machine, for example the "JavaScript engine".

Server side, it really depends on what you install: PHP, CFM etc. all come with an installation disc, and these "interpreter" programmes are really very large relative to, say, the JavaScript element in a browser.

HTML


HyperText Mark-up Language is the standard language type for web sites, but it varies and evolves, so that earlier web browsers, like IE version 5 for example, do not work with more modern web sites. Also, some browsers do not automatically fix bugs in the HTML code.

In essence, HTML does as it says: it allows text to be formatted in a simple, high-level (i.e. "near English instructions" in the code) language which is easy to learn for anyone slightly interested in computers with the motivation to make web sites! Apart from formatting text into headings, fonts, paragraphs etc., the code "corrals" the structure of the web page and how it expands.

HTML-only web sites are very outdated today, but the language is still at the core of structuring web pages and allowing for communication. Indeed, in its later forms it corrals this structure, the communication to and from the browser, and also the inclusion of elements which utilise other languages or call upon other data sources.

It is VERY worthwhile as a marketing student to take a course on HTML and learn how to read the "anatomy" of web pages using tools like Firebug. Even if you never actually build an entire web site yourself, it will allow you to edit things like links or titles quickly, paste up emergency notices, or take out elements immediately for legal reasons. We go into little detail here because it is a subject in its own right.

So HTML still forms, for now, the core of almost all web sites, but other elements, mediated by the other languages below and the use of API elements, are reducing the volume of HTML code found in a modern web site. Hence we take up the thread with the language most closely related to HTML, XML:

XML

Historically, XML's roots predate HTML, being related to a form for sharing ASCII/Unicode-based documents across formats. This goes back to the 1970s and the forerunner Standard Generalized Markup Language, SGML. XML takes this cross-platform, simple form of sharable text mark-up and adds many desirable features and flexibilities, which now mean it is used for both static text and very dynamic data shuttling.

XML is very like HTML: it is close to being easily interpreted in English, i.e. high level, and it marks up text with mark-up tags while allowing the content to be basic Unicode text. There are elementary mark-ups, as in HTML, which are simple and allow documents to be parsed into other web sites and presented in different formats based on the key heading and paragraph structure. In fact XML is to a large extent a simpler, cleaned-up text-handling approach compared with HTML, which has evolved to have many other functions while its text functions have become somewhat limited.

Features of XML

Mark Up : Text Document Standard, Shareable Formatting:


Presentation: integrates/interoperates across platforms and repurposed documents like PDF output, or .doc/.xdoc. Apart from text, there are also some vector-based graphic elements (in the vein of simple Flash) which can be useful.


Exchange: used to shuttle data in HTML, JavaScript/JSON/Ajax and also between non-browser systems: for example between different ERP or legacy systems, or from web site orders over to ERP systems by an indirect route (not a true "back end" database set-up).

Programmability: e.g. XQuery (an SQL-like query language for XML) and XPath routing/indexing tools to search and operate on data in larger XML files, plus SAX, the Simple API for XML.
XML celebrated 10 years in 2009 and has come into extremely frequent use in web sites and shared news or document sources.


XML has become "interoperable": because of its simplicity, programmes other than the (X)HTML engines in browsers can integrate quickly with the language and extract information from XML files readily. In terms of the language it is "neutral", in that data can be exchanged between languages and operating systems/platforms. Some XML formats are very data centric: the benefit over "flat" csv/psv/tab-separated files is that the information is marked up with syntax/formatting tags or other tags, and custom field-name tags (for example a price or customer-name tag of your own choosing) can be applied to text strings, thus rendering XML a database resource.


One benefit is that the structure of the files is easily readable; in fact a pure XML file is usually easier to understand than the corresponding HTML. Also, older ASCII text sources or scan-read sources can be readily parsed into XML and then easily shared with different web sites with their own formats.

Mark-up tags can be programmer definable, which is one area of true advanced capability with XML. The second area, often lumped into the umbrella term "XML", is using APIs and server-side applets to access the data held in XML repositories: often this works like a simplified SQL.


The evolution of XML continued until the point where the drawbacks of all the text embedded in HTML became so large that a simpler route to the text data was attractive enough. In other words, the "atmosphere" around embedded text was suffocating the flow of information on the internet and necessitating the re-purposing of text between web sites and systems.


With the widespread use of Cascading Style Sheets, XML became even more attractive, as the one source of information can be repurposed to different styles automatically, or in a browser type/version dependent way after sniffing the browser.

Yahoo's YUI has good JavaScript applets, like a pop-out calendar.

Another benefit of using XML files and information sources is in serving the same web site to different devices or bandwidths. For example, it is still the case that JavaScript on many mobile devices (phones) is either very limited or absent; therefore you can have a simpler web site drawing on the same XML source for the text information. The same is true for optimising screen layout for the window on iPhone, Windows Mobile or Windows CE.

What this all means is that text documents can be published once in simple XML and then accessed from all over the internet and republished in the format of the web site requesting the XML from the link or server-side source file. When using APIs, this means that the document content can be accessed and presented in more dynamic ways by using small programmes, rather than having to place the XML document or text in a back-end server-database system with perhaps three layers of access: Apache web server; MySQL interface; routing interface and data repository. This also allows the programmer to develop very small, well-defined client-side JavaScript applets with very specific, targeted functionality: for example, find the latest news items relating to the US Senate from three XML sources (RSS is an XML subscription application).

XML is fast and simple for text based dynamic web sites with several internal and external sources of updated text.

Another very common use of XML today is in form submission and the handling of orders on e-commerce sites, especially when there are "under-vendors", i.e. suppliers external to the main web site. Using XML, the information can be passed in a simple, cross-platform message compiled during the session. Here XML is an intermediate, common format which allows for whatever specific tags are useful, such as an order-number or delivery-address tag.

In fact any UID database which can be output to a tab/pipe-separated file can be parsed into XML, and reverse parsers for Oracle, SQL Server etc. exist: the field names move into the tags, and then the data can be imported intelligently to many different systems, with manual reference made somewhat idiot-proof by the use of English and a simple mark-up coding.

It need not be documents: any ASCII/Unicode file can be used, so for example a tab- or pipe-separated database could be encoded and referred to by simple searches and operations server side, with the form going out as XML.

One powerful capability with XML is being able to link different sources of data with a unique ID: for example you can combine a graphic, like maps with post codes or coordinates (longitude/latitude), with a text source like population. This is used for the tiles in Google Maps and satellite images, for example, and to link the simple geotags for the icons which appear with a pop-out text box. This is often done in APIs using the standard SAX XML module.
Structure in XML files: an XML file should follow the schematic that the structure of the text is like a tree: there is a sequence of data nested within its tags, or branching out if you like.


So these form a simpler version of tables in a database, and you should consider structuring different XML files with common, strict UIDs so as to be tidy and offer more functionality, while keeping individual file sizes small and closely clustered in relevance of data. XPath helps navigate this efficiently: an XPath expression may be used to create an efficient, exclusive search which goes to a given nested node and returns just its contents, for example.

You can add a lot of tag attributes in a text document you want to be highly structured/relational/nested for searching, to give exclusive search results or actually just serve up the relevant information in terms of pages, paragraphs or a specific hit count with links to the lines of text (the actual sentences). So you tag up "US president" beside each relevant name and maybe add a period-of-office date range, whereby you can find each president, who was in office when, and which of them reigned over a period you are interested in, without knowing the terms of office themselves.
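A hedged sketch of that president example, using ElementTree's limited XPath support in Python; the document structure and attribute names are invented, though the terms of office shown are real.

```python
import xml.etree.ElementTree as ET

xml_doc = """
<presidents>
  <president term_start="1981" term_end="1989">Ronald Reagan</president>
  <president term_start="1989" term_end="1993">George H. W. Bush</president>
  <president term_start="1993" term_end="2001">Bill Clinton</president>
</presidents>
"""

root = ET.fromstring(xml_doc)
# XPath-style selection of the nodes, then a date-range filter on the attributes
in_office_1990 = [p.text for p in root.findall("./president")
                  if int(p.get("term_start")) <= 1990 <= int(p.get("term_end"))]
print(in_office_1990)    # ['George H. W. Bush']
```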

ISO 8601 is an ISO standard for date formats contained within data fields, which is easily adopted into XML and easily searched for. The same is true for longitude-latitude geotags.

RSS (Really Simple Syndication)
RSS is a subscription news-feed service which works through XML and allows you to get updates on news or alerts, view them in any browser and then also re-publish the XML. The example on the Harvard E75 course uses a JavaScript interface to show both Google Maps and "geotagged" news feeds as links in pop-up bubbles upon rolling over town locations. RSS can also be used for updates which are not text based, or as a quick way to move (push!) small amounts of information to a web site a user is subscribed to: podcasts, for example, which should not really be included in RSS by the spirit of the original convention, since they push a file and not just a simple news item.

Podcasts on iTunes and other clients are actually based on RSS with a media file linked in.
A web source will publish RSS to a URL, ready for insertion into your web page or for server-side parsing.
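A minimal sketch of pulling such a feed apart server side with nothing but the Python standard library; the feed address is a placeholder, and real-world code would want error handling and caching.

```python
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://www.example.com/news/rss.xml"   # placeholder feed address

with urllib.request.urlopen(FEED_URL) as response:
    feed = ET.parse(response)

# RSS 2.0 keeps its stories under channel/item
for item in feed.findall("./channel/item")[:5]:
    print(item.findtext("title"), "->", item.findtext("link"))
```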

RSS is useful too for capturing information from syndicated/external sources, because it is so simple and allows you to parse, then search, store, reformat and summarise/abbreviate information from a web site which otherwise offers no XML you can integrate with; the alternative is to parse in the whole page and dig out the marked-up text you want, which is actually called "page scraping".




XHTML: this is the most XML-friendly version of HTML, with some extra syntax and case sensitivity for XML tags, but all later-version browsers run it because it is XML optimised. XHTML also has rules for handling otherwise illegal characters.

XPath
is a sub-language which allows SQL-style querying of such data. Thus you can have XML either on the page or hidden, as either a permanent dataset or something dynamic, even transient, and query it without the need for a complex back-end database interaction at this point.

Ajax was a synthesis of JavaScript and XML which allowed client-side JavaScript to request the required data as simple XML/XHTML, rather than having that intelligent computation on client behaviour happen over HTTP to the server and back for a whole page. The best-known example of this is Google Maps, which was the first to run a credible JavaScript scrolling system for showing graphics and map tiles. The JavaScript detects the scrolling locally and asks only for the packets of information you require, incorporating them without any more server-intensive processes being needed. The term "Ajax" is now used to include XHTML and JSON as well; we will return to this in the next lecture.

API: application programming interface:
a function, or a little pre-made application, which is available for you to call up and utilise in your web site. This differs somewhat from a "web service", which is a web server that performs a service for you on its processing time, e.g. tweet decks, where you might use a GET URL to send in the data or the request for computation to happen there. An API works in JavaScript on the client side, and these are shared as programmes to allow web sites to propagate their service externally while most likely holding the updated data, and say related push advertising, at their source: e.g. Google Maps, Twitter, Facebook linking/writing to FB. We will also return to APIs, like Google Maps, in the next blog.

User-definable tags mean that whole new mark-up languages can be made within XML: for example, a DNA sequence mark-up language.


The next lecture will focus on the leading dynamic languages, starting with PHP and moving on to the most modern, Ajax/JSON, which keep Facebook and Google Maps hyper-dynamic!

Wednesday, March 09, 2011

The Anatomy of a Web Site Address ( URL)

Intro

The web address you type in the "address bar" of your internet browser (Internet Explorer, Firefox, Safari etc.) has a subtle structure and interpretation. Browsers read URLs (Uniform Resource Locators, "uniform" and not actually "unique" as it is sometimes mis-expanded) from right to left, i.e. backwards, and when they read through an internet domain-name designator, the primary one being ".com", the browser sends a request for the IP address of that domain name to the local domain name server (more on this below).

Once the browser has this IP address, and it may be cached for a while thereafter, your browser goes direct to that address and sends the entire URL, expecting to open a communication most often governed by the means of transmitting web pages, that is, HTTP.


But since we start at the beginning of a web address, and anything after the "dot com" point gets read left to right by the computers serving us information, let us begin at the left-hand side:

http:// HTTP, Hyper Text Transfer Protocol: the protocol handling text which is sharable across the internet, and the agreed means to transmit and receive HTML-based and related files (hyper text mark-up language), or the smaller packets of information used to make a web page appear in your browser (for example, the primary web page may be written in XHTML while it calls upon JavaScript files and PHP files/information through PHP operations on the server). The colon and the forward slashes are just a computer convention and were irritating before browsers autocompleted them for you. However, they serve a purpose in making the browser use the address line to formally request to open an HTTP dialogue with the server.
Other protocols include HTTPS, a more stringent one-to-one version of HTTP; FTP, which is simpler and relates to moving files from one computer to another, nearly always associated with a password and a resulting level of permission over which folders/routers you get access to; and ICMP, which is used in PING (see below). VoIP is Voice over Internet Protocol, used in software like Skype.

Different protocols can utilise or search for different PORTS on routers and eventual servers, which are then optimised for that type of traffic. HTTP usually uses port 80.
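For the curious, here is a bare-bones sketch of what an HTTP request on port 80 actually looks like on the wire, written with Python's socket library; example.com is just a placeholder host, and real clients send many more headers.

```python
import socket

HOST = "example.com"    # placeholder host
request = ("GET / HTTP/1.1\r\n"
           f"Host: {HOST}\r\n"
           "Connection: close\r\n\r\n")

with socket.create_connection((HOST, 80)) as sock:   # port 80: plain HTTP
    sock.sendall(request.encode("ascii"))
    reply = b""
    while chunk := sock.recv(4096):
        reply += chunk

print(reply.split(b"\r\n")[0].decode())   # e.g. "HTTP/1.1 200 OK"
```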

Secure, Encrypted Protocol: https
SSL is secure serving, using the HTTPS protocol, for use with credit cards, banking and so on; domain names wanting to use this must have a unique IP, i.e. only one web site can be served from it, which is really very limiting. It means banks have more power than maybe they should have in controlling and offering paid services for access to SSL servers.

Http latest version
The latest version allows for multiple exchanges of HTTP/meta data in one connection, i.e. there is a virtual pipe, which may follow the same path over the internet nodes, and there is no need for the full HTTP handshake each time.

Tricks for less HTTP handshaking: download just one image or one XML file while the client side shows only part of it, or activate the HTTP requests for images and content only upon scrolling down or across to them.

"www." As stated above, this just states that the type of communication will be world wide web based, web page oriented. It is now a little superflous and as you will see you can type many web addresses without this and get to the site's first page or the page when presented to you will lack the www. See below/above for another comment on this.

Finally there is the annoying colon-slash-slash, which on early browsers had to be typed manually! "http://..." is just the syntax for instructing HTTP communication to and from the web address and, like "www", is largely redundant and could well be excluded from view. I expect browsers to do this in future, especially with the growth in both mobile-handset and HD-TV-based internet browsing, whereby you may be using a key pad with "multi-touch" spelling rather than a proper keyboard, making it tedious to even type "www".

"xyz.com" The Top Level Domain

Dot com is just one of many TLDs, with some having restrictions on who can own them, like .gov.uk, .ac.uk and so on. Dot com remains the most popular in western countries at least, despite the availability of .biz, .info and so on.

" Dot TV" ie www.younameit.tv is actually a country code for Tuvalo ( a TLD; top level domain) so the islanders there make a lot of cash out of TV firms and streaming web services.

There used to be restrictions on ".net" being for some kind of network provision, and on ".org", but the domain name authorities are not strict on these now. They are of course strict on the uniqueness of the address: if an address is currently owned by someone, you cannot own it and have it pointed to "your" server IP address until it expires or they sell it to you.

Some countries' domain name authorities are fairly strict, up front or if approached, when you infringe a trademark or "impersonate" a trading company, registered organisation or person. More on this below.

The web server identifies any request for just the .com, then sends out the default page and information, for example index.html or default.asp, back to your IP address.
www.xyz.com

The domain name per se is now complete, as all modern browsers would be able to process it from this. As mentioned, the name and TLD are to be chosen carefully so as to be exclusive to your company, while exhaustive in covering TLD variants, abbreviations, brand names, acronyms etc., and you may well want to explore common typos, misconceived spellings of your company name, and former or merged company names, to help your customers find you and to protect you against people trying to steal your traffic or misrepresent themselves.

Additional Syntax: Full-Stop-Text, e.g. www.joebloggs.xyz.com
Your name, as mentioned, needs to be unique and you need to know the IP address of the server computer it will be held upon. However, because the address is read from right to left by the computers on the internet, from the dot-com TLD area leftwards, you can arrange interesting sub-domain names, like this type which I hypothesised for individual-person campaigns (there would be some legal issues in some countries in using this approach). It is not until the actual web server is reached that the computer tries to interpret this: in Apache you can set up such sub-domain names, or by default send them all to the home page. They are sometimes used as a slight privacy screen, or for "perishable" messages when you only want people who have received the actual address to be able to get access for a given time.

dot com slash...... .com/
Now the computer stops reading right-to-left at the server end. The forward slash works just like a file path in DOS, Windows or Mac and means that the file will be in the next folder down, in this case the top-level folder of course.
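As a small aside for the more hands-on reader, a minimal Python sketch (using the standard urllib.parse module and the made-up address www.joebloggs.xyz.com purely for illustration) shows how a full URL splits into the scheme, host and file-path parts discussed above, and how the host itself is read right to left:

from urllib.parse import urlparse

url = "http://www.joebloggs.xyz.com/products/index.html"
parts = urlparse(url)

print(parts.scheme)   # 'http'  - the protocol part before ://
print(parts.netloc)   # 'www.joebloggs.xyz.com' - the host name
print(parts.path)     # '/products/index.html'  - the file path after the slash

# The host name is made of labels separated by dots; the right-most label is
# the TLD, and anything left of the registered domain is a sub-domain.
labels = parts.netloc.split(".")
print(labels)         # ['www', 'joebloggs', 'xyz', 'com']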

Apache and other such server programs direct bare .com requests to the "index" folder and serve up information from there, or, as is often the case, let you interact with the server computer through this location.
For most web sites in 1998 this meant that index.html was sent out to you, and this would usually in turn ask for graphics files and maybe a style sheet to help the web site look, erm, nice. Graphics were usually held in a folder called /images, and for fun you could browse that folder in some browsers' "FTP" view.

In the case of more advanced "dynamic" web sites, the information you send, which may just be the initial HTTP request for the dot-com name, is processed by a server-side program addressed through this URL folder location, and the result, i.e. the computed answer, is the information returned to you. So, for example, if you have a permanent cookie for a web site, the first page may well be personalised just for you, or, more likely, the advertising may be related to your previous interactions with that web site, with a profile being stored on the server; or when you completed a form, the data may be held: a user name, say, or a search term (string).
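To make that concrete, here is a minimal sketch of a "dynamic" response using only Python's standard http.server module; the cookie name ("visits"), the port and the greeting are my own invented examples for illustration, not anything a real site uses:

from http.server import BaseHTTPRequestHandler, HTTPServer
from http.cookies import SimpleCookie

class DynamicHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read any cookie the browser sent back with this request.
        cookie = SimpleCookie(self.headers.get("Cookie", ""))
        visits = int(cookie["visits"].value) + 1 if "visits" in cookie else 1

        # The page is computed per visitor rather than read from a static file.
        body = f"<html><body>You have visited {visits} time(s).</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Set-Cookie", f"visits={visits}")  # state kept via the cookie
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), DynamicHandler).serve_forever()

Reload http://localhost:8000/ a few times and the count goes up: the same URL, different answer, which is the whole point of "dynamic".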

Behind the scenes a little:

If you own a web site and either transfer the files up for the first time or overwrite them: when you move the web pages and other files up (usually by FTP, file transfer protocol) to put the web site "up", the server has a structure, often with a unique folder for the home page. Some servers stipulate that the first file must be called "index.html"; others have one folder (directory) where you place only one master file. "Apache"-type software knows to send you the contents of this folder first if you have not specified a longer URL: just http://www.zyx.com will serve you http://www.zyx.com/index.html, a single HTML file which travels over the internet and is put together by your browser (IE, Firefox, Safari, etc.). Conscientious programmers will also direct any non-existent URLs beyond a correct "dot com", or requests lacking "www." at the front, to this first page.
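You can see this "serve index.html by default" behaviour without installing Apache at all; a couple of lines of standard-library Python, run in a folder containing an index.html (port 8000 is my arbitrary choice), do the same job for local experiments:

from http.server import SimpleHTTPRequestHandler, HTTPServer

# A request for "/" is answered with ./index.html if it is present,
# mirroring the default-document behaviour described above.
HTTPServer(("localhost", 8000), SimpleHTTPRequestHandler).serve_forever()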

The HTML source mentions further files, most often images such as GIFs or JPEGs, and your computer then goes back to the web server to fetch them. This time it can go direct, because it knows the IP address from the DNS. The IP is cached locally, and this can be for a notable period of time, even days.

Looking Behind the Scenes:

These small, sub-page-level HTTP requests cause load on servers, which would perhaps prefer a more direct relationship. FIREBUG is a Firefox plug-in with a network analysis function that lets you see which requests back for information were made. In other words, each of these HTTP requests is subject to delay and re-packeting issues for the individual user, who may be on a cell phone for example, while on the server side there is much more secondary handshaking going on, which is just low-value load on the Apache server.
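If you do not have Firebug to hand, a rough-and-ready count of how many extra requests a page is likely to trigger can be made with Python's standard library alone. This sketch simply counts img, script and link tags in the HTML; it is a simplification (it ignores caching, CSS background images and inline scripts), and example.com is only a stand-in address:

from html.parser import HTMLParser
from urllib.request import urlopen

class SubResourceCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Each of these tags usually implies one more HTTP request to a server.
        if tag in ("img", "script", "link"):
            self.count += 1

html = urlopen("http://example.com/").read().decode("utf-8", errors="replace")
counter = SubResourceCounter()
counter.feed(html)
print("Extra sub-resource requests implied by the HTML:", counter.count)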

A web site which waits for JPEGs, Flash and so on to load BEFORE rendering the general structure and text in HTML is badly developed. The latest dynamic web sites actually have timers built in to delay loading content or to rotate content, such as advertising or a short video.
Having a lot of files needing separate HTTP requests is one load factor; the other is having pages with a lot of code which is not compressed. If you run a popular web site, then even spaces, indents and long parameter names like "query" instead of "q" add up, when you are serving millions of hits, to paying your ISP for more bandwidth.
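A quick way to see what compression and tidier markup are worth is to compare byte counts in Python; the HTML strings here are invented fragments purely for illustration:

import gzip

verbose  = b"<div class='product-listing'>    <span>Price</span>    </div>\n" * 1000
minified = b"<div class=p><span>Price</span></div>" * 1000

print("verbose, uncompressed :", len(verbose), "bytes")
print("minified, uncompressed:", len(minified), "bytes")
print("verbose, gzipped      :", len(gzip.compress(verbose)), "bytes")
print("minified, gzipped     :", len(gzip.compress(minified)), "bytes")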

DNS: Domain Name Servers - What do they do ?


The route to the "nearest" DNS to your computer is predetermined in your internet connection settings: this involves your own router (the WiFi internet box for most of us) and the telephone company's / ISP's DNS (domain name server, a.k.a. domain name system). It is really a digital telephone exchange, and your ISP will either own it or pay for access to a regional or national DNS.
The DNS IP address is in the internet settings supplied by your ISP, and your computer dials it and uses it as a telephone directory for who the dot-com name is, i.e. where, and on which computer, it lives. The DNS duly sends back the IP address for the web site, if it exists. Thereafter your browser (or FTP client) will store this IP address for use when dialling that web address over a period of time, so as not to hassle the DNS server again.
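You can ask that "telephone directory" yourself from Python; socket.gethostbyname uses exactly the DNS settings your ISP supplied (example.com is used here simply because it is a reserved demonstration name):

import socket

# Ask the configured DNS which IP address the name points to.
ip = socket.gethostbyname("example.com")
print("example.com lives at", ip)

# getaddrinfo does the same job but can also return IPv6 addresses.
for family, _, _, _, sockaddr in socket.getaddrinfo("example.com", 80):
    print(family, sockaddr)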


What actually happens is that the DNS sends back the IP address corresponding to the web address you have written. Your computer then sets up an HTTP handshake, placing its own IP address in a little "packet" of data which it sends out to the internet nodes / routers. In other words, your "dialled" request follows a route which is like a cross between a country code, an area code, a zip code (local postal code) AND a unique telephone line.
DNS machines are powerful in terms of server-client capacity, but in reality all they do is look up a two-column flat table: the "dot com" address, e.g. www.xyz.com, in one column and the IP address in the other.
When you launch a new "dot com" web site, or park a name somewhere else, for example at a web hotel, ALL the DNS servers in the free world need to add this bit of data. This takes time to populate, of course, and not all ISPs have admin rights to enter it into the process. Several layers of delay are possible: manual delay at your ISP while it sends the request on to an authorised ISP; queuing at the DNS-authorised ISP; queuing at the primary DNS nodes; and populating, and thereafter queuing, at local ISP nodes and routers.

Internet Router Machines

After your own home router there is not just one but a whole string of routers which form the internet nodes for trafficking information. There is a large element of bandwidth control on the "free" internet: different routes are chosen between these nodes, and your HTTP request packet (a little postcard asking for the web site to reply) may be deprioritised, sent further afield or wholly re-routed. In the USA it is quite common to pass through 14 routers, even for internal traffic.

In days of old, all traffic in and out of the USA went through a transatlantic cable, and the first router "node" was in Maryland; hence web stats reported lots of viewers living in that state! I don't know if the US blocked ICMP further in, or if it was simply not fully available at that point, but that was as far as you could "see".

Behind the Scenes of Internet Routing:
Ping, Traceroute and ICMP: detective work on the often twisty route your HTTP request takes on the internet. ICMP is a small companion protocol which the PING and Traceroute commands use (from the command line, e.g. the DOS-style command prompt).

Traceroute is a little program in itself which tries to identify as many nodes, i.e. routers, as possible in the path to the eventual server holding the IP / web site you requested. It will identify at least how many steps there are, and shows you the time it took from your machine to each step. This can help identify whether your ISP is relying on cheap, slow routes on the internet, and also, for example, which country a web site, or a client requesting your web site, lives in.
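Traceroute is normally run straight from the command line, but for completeness here is a minimal Python wrapper around it. It assumes a Unix-like system with traceroute installed (on Windows the equivalent command is tracert), and example.com is again just a stand-in destination:

import subprocess
import sys

host = sys.argv[1] if len(sys.argv) > 1 else "example.com"
command = "tracert" if sys.platform.startswith("win") else "traceroute"

# Each line of output is one router ("hop") on the way to the destination,
# along with the round-trip time measured to that hop.
result = subprocess.run([command, host], capture_output=True, text=True)
print(result.stdout)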

Some routers do not allow ICMP, for load or security reasons, and the same is true for VOIP, which is why it can be such rubbish quality: it often takes tortuous routes and the data packets get lost.

Paths on the Internet
Your data does not need to follow the same path with each data packet: your computer and the eventual server piece the data packets together. So, for example, a JPEG for a web page may come back to you by a completely different path, through different routers on the internet.

Incidentally, the registration tying a URL to its DNS entry runs out over time, typically after a year or two, and you then have to renew it so that the DNS entries around the world stay current. WHOIS lookup sites offer information on the expiry dates of URLs.

Domain Name Hosts, Web Site Servers, Web Hotels and Hijackers


These usually manage your DNS entry and the "pointing" to them, and it can be simplest to buy URLs through them. The best in the USA is arguably "Go Daddy" dot com, who, despite a very commercial web interface, offer very good service and technical DNS / URL management features for a low price, from about 10 USD. "Network Solutions Inc" used to be the most trusted source of this service.

Web hotels are companies which will hold your URL and show a holding screen, or their own home page with a message about your URL. They are useful if you own a good URL name but have no design or current content yet. It is best to find a reputable one, or use a well-known web host. Some less reputable ones may sell off your web name, or wait in the wings to buy it when it expires without warning you, having covered "expiry" or ownership only in the small print.

"squatters" are web hotels and individuals who have registered domain names speculatively : the famous case was "coca cola", the famous plaintiff won of course over the couple in london.

"Hijackers" are worse: previoulsy hackers and unscrupolous types could use weak internet countries like new zealand to send e-mail with your domain name, or redirect you away from a web request. These days som ISPs are doing it with mispells and typos, and if you have a cable provider you can find you go to a site which you must be careful you don' end up buying something from: they know who you are and can probably bill your credit card. I tried once to register "cacao calo" dot com and Glof.com as misspells of coke and a big american/ UK middle management waste of time site. Both were refused by my ISP, who blame the DNS authorities, while I bet the ISP registered themselves.

This is another REAL danger of sniffing out domain names on some sites like "whois", and of your own ISP's routers watching what you are requesting: there is a risk that some companies monitor these look-ups and snaffle up the domain names you were checking were clear.

A 404 error does not mean the PDN (primary domain name) is inactive; it just means the server which owns it does not serve that page to you.

In the late 90s, BT in the UK had issues with their primary DNS being located in Wales, and with massive amounts of transatlantic and DVLC data going down the backbone through Wales. What they did was to re-route people, and they actually cached entire HTML web sites, in their thousands, at sites north and east of Wales. We wondered why, when we FTPed a site up and could see it in the file path, we could not get it on the internet. This would not be "possible" in today's dynamic web site environment.

Virtual Host
This is just the name for a single IP address used to serve multiple domain names, usually mediated on the main Apache server or by an intelligent router acting as a "triage" for larger ISPs and web hotels. It is very normal today of course, but it actually relies on the browser sending the requested domain name (and the whole file-path URL) in the packet which "dials" the IP address. Early browsers just sent the IP.
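You can see this "name inside the packet" for yourself by writing the HTTP request by hand in Python; the Host: line is what lets one IP address serve many domain names (example.com is again just a stand-in):

import socket

host = "example.com"
ip = socket.gethostbyname(host)          # the DNS step: name -> IP address

request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {host}\r\n"                  # the virtual-host selector
    "Connection: close\r\n"
    "\r\n"
).encode()

with socket.create_connection((ip, 80)) as s:
    s.sendall(request)
    reply = b""
    while chunk := s.recv(4096):
        reply += chunk

print(reply.decode("latin-1").split("\r\n\r\n")[0])   # just the response headers

Change the Host: line while keeping the same IP and, on a shared server, you would be handed a completely different web site.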

Red Indians out There ? Apache?
Apache is a very common piece of software which operates server computers: for example, it sorts out the individual URL requests, dealing out the actual web sites from the servers behind it. The information needed to serve and administer an account, say for a paying customer, is held in a config file: the primary domain name and any variants / aliases, such as the address without www, or the .co.uk version.

If the files for the web site are down or not present, then the 404 error will contain a reference to this. The config file also sets permissions for types of technical programming and access, e.g. using CFM, ASP or PHP. The top HTML or PHP file for the home page is usually held in the (unix-style) folder "/public", and the other publicly accessible files are in folders which have the permission level to let external requests be answered, i.e. not restricted. Quite often you can see that you do not have permission to view a folder when you write a "hot-wired" file path directly into the URL.

Apache contains small programming areas of web-master commands which are called modules. The first and most important of these for hosting the domain name is the "rewrite" module, used for redirects of domain name URL requests: wrong file paths, old deleted pages, making URLs non-case-sensitive (caps-lock issues for people browsing), and temporary redirects (code 302). NB: if you make www.jim.com go to www2.jim2.com, then when search engines crawl you without a temporary-redirect note, they will list you as www2; the same applies to web hotelling. You can also capture frequently mistyped URLs, which can reveal bad links from external web sites or poor indexing on search engines, and these can be worth implementing as pages close to what was described at the source or listing. HTTPS routing can also be achieved in this module, although of course you need an SSL certificate from a recognised certificate authority in order for users' browsers to accept the HTTPS connection (see the security lecture for non IT managers).
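You can check what kind of redirect a site is sending without any special tools. This Python sketch uses the standard http.client module; example.com and the path "/some-old-page" are purely illustrative, so substitute your own domain, and remember that a 301 is the "permanent" hint search engines respect while a 302 is the "temporary" one:

import http.client

conn = http.client.HTTPConnection("example.com", 80, timeout=10)
conn.request("GET", "/some-old-page")
response = conn.getresponse()

print("Status:", response.status)                    # e.g. 301, 302, 404 ...
print("Location:", response.getheader("Location"))   # where the redirect points, if any
conn.close()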

Usually these days there will also be a primary web-site language engine running on the machine, such as PHP or Active Server Pages, and often an SQL server program to facilitate more advanced "dynamic" features on web sites and access to "back end" database computers. This will be the topic of a subsequent lecture.

Tuesday, March 08, 2011

The Internet for Non IT Managers and Students

FOREWORD: This is actually the very first lecture in the series "The Internet for Non IT Managers". Since blogs run in descending order anyway, and some readers either use a "blog deck" or republish to get a logical order, this will appear above some of the more involved and technical blogs.
This first blog is also aimed at MSc Marketing students at Strathclyde who have no real IT background; some further reading is given at the foot, which delves further into technical details while mostly being written in good prose for the non-IT student.


Lecture Notes on The Internet
for Non IT Managers and Students



What is the Internet?

The internet is the physical and electrical data network which now covers large geographical areas on a global basis. To be more precise, the internet exists within what we know as the telephone and data networks owned by, amongst others, phone companies and governments. Since nearly all calls in western countries are digitised now, the internet is just another type of data traffic which is handled in a distinct way on these networks.


Further to this, you can say that the internet is the portion of the total capacity, the bandwidth within the diverse cables, which is accessible to this data traffic. The traffic itself uses special types of communication languages, for example the exchange protocols called TCP/IP and HTTP. These are agreed methods of opening the connection, exchanging information and then closing the connection when handling data such as web sites, internet file transfer, video streaming, "Skype"-type calls (VOIP) and e-mail (POP3 and SMTP).

Outside organisational private networks, very few data cable routes carry exclusively internet traffic. However, the internet's own "telephone exchanges" around the world, which route your computer to find, and then receive and send, information to and from remote web sites, are often solely dedicated to connecting internet traffic.


Origins and Principles of the Internet and World Wide Web


The internet as we know it originates from two sources: academic networking, i.e. connections between universities, and governmental security and military networking. It was the latter who decided the internet should in theory be a "web" in which no single break in a connecting line or node (city or exchange) would result in loss of data connectivity between two other points. The idea was that if one line or city got "nuked", the information flowing between unaffected cities would find another path around the destroyed element.

So we have, on the one side, the egalitarian principle of free flow of information, and on the other, the cold-war concern about nuclear attack in the US and other countries, coming together to create such an efficient means of connecting information sources and people.

The structural "redundancy" ie more than one route being available from A to B, has evolved into its own world of money-men and prioritisation of traffic. So for example, the UK driving licence authority in Swansea, pays for a very large amount of bandwidth in the UK fibre optic backbone towards and through south wales, and internet traffic is deprioritised on this route when large transfers of data to and from the Police Force are being done.

Connections between computers of course have a far longer history, as indeed does the use of phone lines for sending binary data signals. Many of the historical means of establishing, permitting and holding continuity in communications over a network have been built into the internet, and there is one global standard group of methods for this, the "protocol" suite called TCP/IP, which we will come back to.


Economics and Freedom

People frequently refer to the internet as being free, meaning that by and large there are no major restrictions on sharing information and being connected. Also, of course, it does not cost you a "postage stamp" to get information, and you can read blogs like this and get free lectures over the internet. Some countries, and the DNS servers associated with them, are not as strict as the USA, EU countries, China and indeed Norway at controlling criminal activity, illegal information and in particular illegal images. So the freedom both to publish on and to access the internet comes with some downsides, and in fact many argue for stronger policing within the DNS structure.

Internet access is not, however, free of charge. At some point you pay your phone company, or a portion of your cable TV fees, to access the internet, and they in turn pay for access to national backbones and international connections and networks. (There may still be some ISPs offering free connectivity funded by advertising revenue, and as a fee-paying student your university may see fit to include internet access.)

As the owner of a web site server hosting web sites, you have to pay for the number of physical lines, the IP addresses and the bandwidth demand coming to and from you. As mentioned in the last paragraph, because there is a degree of horse-trading to establish cheap routes within the internet, your data will sometimes seem slow from all web sites in the USA, for example, because your national phone company is not paying for larger bandwidth on the fastest route at that point in time and is diverting you down a cheaper, slower route.

In 2009, advertising revenue spent on the internet in the USA and UK surpassed that spent on traditional media like newspapers and TV [1][2][3]. If you also factor in that many companies use their web address and get free listings on Google, then effort, spend and traffic have now flipped decisively over to the internet for many brands, shopping experiences, holiday purchases and, of course, the huge volume of business-to-business marketing. The evolution of brands which are virtually exclusive to the internet, like Google, Yahoo, Amazon, Facebook and Twitter, is worth an entire book, and there are probably many good blogs on this topic. However, the fact that this media space and these brands command so much economic power is concerning to some commentators and threatening to owners of traditional pay-to-read/view media.


links/ references:

1. http://www.guardian.co.uk/media/2009/sep/30/internet-biggest-uk-advertising-sector
2. http://www.socialtimes.com/2010/12/online-overtakes-newspaper-ad-revenue-so-does-google/
3. http://business.timesonline.co.uk/tol/business/industry_sectors/media/article2767087.ece



What's the difference, or relationship between the World Wide Web (www) and the Internet ?

..more to be written accurately here. The World Wide Web was the area of the internet set up for, and allowed to use, HTTP, "hypertext transfer protocol", i.e. web site traffic. So originally it consisted of defined IP servers and routers through which your communication would be channelled when a "www" request was sent from your browser. Now the term is somewhat redundant, as "www" traffic makes up the majority of public internet traffic if you exclude e-mail. All internet routers accept and handle HTTP traffic now, and have done for a long time, and almost any IP address can host an HTTP web site, in western countries at least.

"www2" was an odd colaboration...more to be written accurately here

Principles of Internet Protocol and Web Communication

"Hello, Operator?"

How do we know we are on the world wide web, and how do computers know to connect to web sites or send e-mail? An analogy we will revisit is that of an individual householder, or one company, trying to contact another company in the nineteen-forties:

In those old days, to start a call you had to dial a simple number to reach your first manual operator: you could then either ask for a telephone number in a town to be dialled, or ask directory enquiries if you knew the name but not the number of the company. We will see that this exact analogy plays out today when your computer contacts the DNS computer in your area or country to ask where a web page called "www.xyz.com" is and what its number is.

In the 1940s you might be lucky and be connected via just one more operator on the public network before you reached the company's own phone exchange, but on the internet there are often dozens of exchanges between you and the host computer and, worse than that, these change between "calls" and even during calls! Luckily this all happens automatically and with an unbelievably high level of accuracy, usually ensuring a seamless flow of information with only tiny delays.

International calls, way back when, could be somewhat tortuous to connect, and some were restricted by governments, for example to and from the USSR. International operators would have to speak to each other before you got connected, adding delays. Luckily the internet uses universal languages between computers to automate all of this, but there are still some restrictions placed on connectivity, in China for example, or to web sites deemed undesirable for those dialling out from Norway, for example.

Once through to your company, you would then need to establish contact, make your purpose known, and ask to be transferred further to a named person at the host's telephone board. Sometimes your enquiry is simply denied: that person does not exist, or you do not have the right to get information from them! Or you find that the company is defunct, but a new company pops up to offer you a service. Eventually you may get to one person, who then just refers you to another internal line, and so on.

Using the internet to download and communicate with web pages is very analogous to talking to all those manual operators working at different points in the old phone network, and with good reason: communication has to have protocols, agreed methods of connecting, inter-connecting, passing information and actually terminating a call. Otherwise computers would not know whether a communication was safe or correct, routers (the telephone exchanges of the internet) would not know where to relay information, and your host computer could be left waiting around, like an operator on a line you have not hung up on might have done 70 years ago.

In this brief lecture for non IT Managers, we will summarise the main protocols and languages used for internet communication and to display information on web pages.

Internet Protocol and Hyper Text Transfer Protocol - TCP/IP & HTTP


Internet Protocol (IP) is the primary connectivity protocol between two computers using the internet, and it is a very near parallel to the old manual operator phone system above. Being a "protocol" means that this type of communication happens in an agreed way, with a defined set of data being exchanged to establish the connection.

The full terminology is actually TCP/IP. The first half, Transmission Control Protocol (TCP), provides reliable, ordered delivery of a stream of bytes from a program on one computer to a program on another computer, whereas IP is more focused on dialling up the initial connection and routing the packets. Together, TCP/IP is the combined protocol suite that major internet applications rely on: the World Wide Web, e-mail and file transfer, for example.

TCP/IP allows computers to communicate with each other on the public internet and on private ethernets, with a defined exchange of small packets of information and, with that, handshakes to confirm the connection, most often when downloading web pages to a client or passing forms back to a server. TCP controls the start, continuation, confirmation and termination of this sending of data packets.
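A very small Python sketch shows that sequence end to end on your own machine: a connection is opened (the handshake is done for you by the operating system's TCP stack), bytes are exchanged in order, and the connection is closed. The port number 9000 and the echo behaviour are my own invented example, and for brevity it ignores the fact that recv can return data in more than one piece:

import socket
import threading

# A throw-away TCP server on the local machine, just to have something to talk to.
srv = socket.create_server(("localhost", 9000))

def echo_once():
    conn, _ = srv.accept()          # the TCP handshake has completed at this point
    conn.sendall(conn.recv(1024))   # send the same ordered bytes straight back
    conn.close()                    # orderly termination of the connection

threading.Thread(target=echo_once, daemon=True).start()

# The client side: open the connection, exchange data, close.
with socket.create_connection(("localhost", 9000)) as client:
    client.sendall(b"hello over TCP")
    print(client.recv(1024))        # b'hello over TCP'

srv.close()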

In essence, the primary purpose of IP itself is to dial between computers through the "telephone exchanges" called routers to reach the eventual server which owns the IP "number", which is, ironically, called an address.

Today we have 32-bit, x.y.z.w-style IP addressing (IPv4). The problem is that there are only about 4 billion possible combinations in the current xxx.yyy.zzz.www address format, and in practice the internet is now effectively full of unique IP addresses; we are reliant on people and companies defaulting on or giving up addresses so they can be released for new sites. This is why IPv6 is being rolled out: it allows for a 128-bit IP address, a very much longer internet address format with trillions upon trillions of new possible addresses.
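The arithmetic behind the squeeze is easy to check; Python's built-in ipaddress module and a couple of powers of two show the difference between the two address formats (192.0.2.1 and 2001:db8::1 are reserved documentation addresses, used here only as examples):

import ipaddress

print(2 ** 32)    # 4,294,967,296 possible IPv4 addresses (about 4.3 billion)
print(2 ** 128)   # roughly 3.4e38 possible IPv6 addresses

print(ipaddress.ip_address("192.0.2.1"))                 # dotted 32-bit IPv4 notation
print(ipaddress.ip_address("2001:db8::1"))               # 128-bit IPv6, hex groups
print(ipaddress.ip_address("192.0.2.1").packed.hex())    # the raw 4 bytes
print(ipaddress.ip_address("2001:db8::1").packed.hex())  # the raw 16 bytes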

Today, though, most internet IP addresses for web-site servers, and often for individual users' computers accessed through them, begin with a first block in the range 0 to 255, and most universities have their own fixed leading digits, i.e. their own address prefix.

So, as a student at a university, the outside world probably sees you as that one phone number, and sending e-mail or direct web-page communications to you relies on software to transfer traffic to and from you as a unique user.

One IP address can host many domain names via the primary web server. This server is a computer itself, or a router-computer, like the old telephone exchange at the company in the 1940s. Instead of manual receptionists, this computer is often running a useful bit of software, the most popular being "Apache Server". This sorts out the requests, a bit like a mail office or a telephone exchange in an old-fashioned company.

Monday, March 07, 2011

Quick Review +ves/-ves X10 Mini Pro

Just worthy of a quick mini review: my new X10 Mini Pro.

This is the third smart phone I have owned, having had a nice little rest from their short battery lives with a bog-basic Nokia which often lasted more than a week between charges!

Previous phones have included the really very good Nokia N70e and the SE P1. The latter was a pain to use for ordinary calls, with the off button being only a soft key tucked away in the corner of the otherwise interesting screen. Come to think of it, the P1 was otherwise very good as a scribe phone, but it kind of taught you to write rather than the other way around!

Okay

X10 Mini Pro


Firstly, the phone has the usual SE build quality: about a 90% total-quality feel, falling down a little in heft and in the execution of the slide-out QWERTY keyboard. The slider is a little imprecise, suggesting it will be the fatal failure of the phone in the long term; there is quite a lot of leverage when using (or mis-using) it, so the joint looseness can only get worse. But the screen and keyboard are very nice. The hard keys are a little thin and vague: a triumph of style over functionality.

Don't get me wrong though! This is a great little phone!


Where the P1 felt a little too big in pre-HD2 / Desire / Galaxy days, the X10 Mini Pro will suit those who like "tout bijou". If it were any smaller at all it would be unusable for my manly, spade-like fingers, but the screen is spacious enough for good presentation of data, the touch keys are sensibly sized and located, and the hard QWERTY is very nice to use, even if it is small. I guess the white version has been a big hit with the ladies and their non-navvy-like fingers.

Bugbears are the thin hard keys as mentioned, the imprecise slider, and the ring / call volume, which is a little too easy to turn down unwittingly while holding the phone. The vibrate is rather gentle and the ring volume / speaker seems to get relatively muffled in pockets compared with any other phone I have had. Also, on hard keys again, there is no call pick-up on a hard key, and sometimes the back key is tortuous to use: there is no quick way out of an app or web site without the menu screen dropping down over what you are doing, which can leave the app running in the background. Resizing web pages is also a bugbear, with no pinch zoom and often a delay before the plus/minus zoom appears; double-tap to enlarge risks hitting a link instead, and often zooms in far too little on non-mobile versions of web sites. Still, it seems to run 95% of them, albeit with microscopic text at first sight and painful sliding around to make sense of the page.

Comparing other phones on all of the above, the HTC Touch Pro II is the one which leaps to mind: it is better built and just a bit bigger, but has a slower touch interface.

Touch Interface



The touch screen is plastic and, unlike some phones, it does not need the galvanic voltage set-up of real skin: it works fine through my plastic waterproof watersports case. Great!

The interchangeable corner shortcuts are really ideal: stick Facebook on there if you are having a social-media connection frenzy, or keep them for what phones actually do: calls, contacts, SMS. Also, being Android, there are different dialers and "core system" UI apps which may suit those who simply must get into a new "skin".

The Android app menu scheme is very iPhone-like and nice to use and rearrange; the only thing is that there is no circular scroll-through, i.e. you come to the last page of app icons and then have to scroll all the way back (it does remember the last page you were on!).

Operating System

The Android OS and the underlying (or parallel) phone OS are seamlessly integrated, and personalising, adding new apps or altering some core functions with downloaded widgets/apps is all a joy.

The phone comes with a capacious 2 GB (1.8 effective) memory card with its own USB-stick converter, and the phone charges over USB, with a nice USB plug transformer which tops up the iPod Touch much better than a PC does!


I mention memory because the on-board phone memory is of course used as working RAM, so it is worth installing an app killer. How good this is remains to be seen, but several apps running in the background make the entire phone user interface slow: this is one disadvantage of such closely integrated call-and-Android systems. Keep free memory above 60 MB and the phone will run quickly.

Android apps lack some of the branded trustworthiness and range of the iPhone App Store market, but really they seem pretty much as good: the pay-for apps are maybe a bit cheaper on average, but some of the free preview apps are more limited than those for iPhone.

Incidentally, WiFi is about 90% as good as on the iPod Touch and probably only slowed down by my own home line-in! It is worth avoiding use on hot spots due to the lack of security, and so far I am not sure about the effectiveness of anti-malware programs in what is a multitasking OS.

Web Browsing

A few web apps are included which run in their own screen UI rather than the standard browser; FB eventually boots you over to the browser. The reasons for this are twofold:

1) one touch, pre-registered log in
2) a screen-size optimised first interface


This is noteworthy because the screen is small and not all content has a mobile browser version: FB's "www.m." site is also too oriented around larger screen sizes, so resizing is a bugbear, as described above. Still, the browser seems to run 95% of sites, but it lacks Flash Player (TM).

YouTube, however, has its own X10-optimised player app, which launches out of web pages and gives a nice, streamlined interface to the site when searching within the app itself.

Mobile content which is built around WAP-type simple one-column table cells works superbly, and some versions of this content are actually better than the full web site: e.g. yr.no weather.

Normal web content is presented with microscopic text at first sight and painful sliding around to make sense of the page. However, the browser does include an auto column resize to screen width when you are lucky and tap a table cell roughly aligned to the centre. This is a boon! If the text is still small in portrait, a quick flick over to landscape makes things readable.

Android Market, by the way, is nearer to the plain browser rendering, and thus has pretty small text, just on the edge of being annoyingly so!


GPS is a weakness

Some apps are a little annoying, GPS ones in particular: the installed freebies all seem to want to trade you up to either the pay-for version or an online map service. It does come with an okay Google Maps integrated GPS navigation package, and a free preview of an in-car sat-nav (which is okay!), but these are tortuous for adding simple waypoints and saving them.

The phone takes ages to acquire five satellites the first time, and some of the freeware GPS apps force a location recapture between functionality screens.

Camera:
The camera lacks zoom, images are dark and video is pixelated. In good sunlight the still shots are actually very good, and can be cropped down to 20% of the area with no visible loss in a resulting 700-pixel-wide final image. The LED flash-light is good enough, but I have yet to find an app which will switch it on as a "torch" (the Mag Light app does not work, unfortunately).

But: Will it Call Your Mother on Sundays?

The all-important: connectivity and call quality!

Like most smart phones it drops calls once in a while, and generally has fairly low signal connectivity relative to the bars actually showing and to simpler phones. In-call quality is good though, and the hands-free speaker-mic seems better than anything I have ever been in the room with! (Unlike the stupid set-up on the Touch Pro II, which has to be flat on the desk and whose mic is rubbish.)

It does take a while to connect calls, and it has a dumb swipe-to-accept-call gesture, which I would rather have as a hard-button option for quick-draw pick-up from a holster!


Conclusion: Is this the Phone for You?

The X10 Mini Pro version with the QWERTY keyboard is probably the most compact in its class in terms of actually being usable and offering both a big enough screen and keyboard for practicality and enjoyment.

The small size means it has a mugger-advantage: it can be hidden in the palm. Nice also for surreptitious FB ing at boring conferences not to mention up-skirting for the pervs. In fact, let's not mention that at all.

If you already own an iPod Touch or a tablet then this makes a nice, more discreet addition to what you are familiar with, and a good alternative to booting up your netbook or laptop for social media in particular. SE have their own integrated contact "total surround" set-up, which I can't be bothered with, to be honest, despite being an FB fan.

Personalisation is generally pretty good, with corners, ring tones, apps, etc.

If you use the web as a serious tool and want to read a lot of text-heavy sites or PDFs, then this is resolutely NOT the phone for you: the web is best experienced in mobile-content versions, when provided and auto-detected. If you want to check the weather, the odd share price, and catch up on FB / Twitter and Gmail, then this is a truly quick and fun tool, erm, toy for you.

With a 2.5 mm stereo jack (with data appendage flange!), the phone is a bit of an all-rounder for music, social media, web browsing and Android apps/games, while some functions which would be nice to have 100% operational, e.g. GPS, the zoomless camera and the poor video quality, don't quite get there.

If you are looking to downsize from your slab of a Desire, or from the iPhone you just broke without insurance, then this is an ideal phone for those with nimble fingers and a lust for social media.

Sorry, it's a "handset", my teenage colleague reminds me.

Off to put on the gramophone then, me.