Monday, January 24, 2011

The Road to the Next "Facebook": Start Ups

When establishing your next venture to rival FB for popularity and success, you have to start somewhere, buddy, and that somewhere is soft capital and the business angel network.

Business Angel and First VC Round Funding

Business angels can be roughly divided in two: individuals with resources who are prepared to invest in perhaps only one venture; and portfolio "angels", who are in effect self-made, high-risk fund managers. Both have their value in bringing not only money but much-needed guidance and goal-setting for the real world, i.e. 'window dressing' the company for the next round of funding.

Angels fund, of course, but they usually also match other resources to the start-up's needs: most often that means talented people, as employees on the initial management team, as consultants to help the team, and often as key members of the board of directors. Some will also arrange premises, or are themselves property owners with suites ready for expanding businesses.

Start-up enterprises should be sensitive to conflicts of interest in both of these areas: human resource allocation and premises:

1) Cronyism: it is all very well for the business angel to give the seal of approval in bringing someone in, but without advertising the post generally you do not actually know if this person's experience best suits your company and your market. The angel may shoe-horn someone into business development or engineering who has been successful elsewhere in their portfolio but who may not be ideal. On the other hand, at least their work capacity and motivation will have been quality-assured, which is a large part of the battle.

2) In terms of property and services, some business angels (and management "guns for hire") are far from angelic: they look for ROI in terms of direct revenue back to them, and subsidiary capital gains. They may push you into buying property so that you end up paying it off with revenues, or rent you property they have an interest in. They may also want consultancy fees or a direct salary.

3) Stand-alone business "angels" with limited experience, and perhaps not even a portfolio, will probably want more of the action than they should: they will want to sit on the board themselves, or even act as an operational manager or internal consultant. This must be weighed against point 2 above as to their intentions, and also against their own skill base and personal network: is their market and resource access really good enough?

4) Other personal biases: angels will have personal biases which can aggravate all of the above potential conflicts. They may also favour a market player, partner, university or IPR source which is not the ideal route for the company. That route may even be counter-productive when the initial positive personal contact mediated by the "angel" dries up, leaving behind a deal that would have been worth more with a competitor.

Show Me The Money......

These days, whether in software, hardware, consultancy or business services, a highly profitable company with high growth potential can be hitting its KPIs and milestones in Y1 and Y2 on first-round funding of half a million to a million dollars or euros in Y1, and one to two million in Y2.

Angels will often want to pick up the next round too, because they see potential overlooked by VCs, or the VCs are not "in" that industry or sector at the time. VCs invest in industries and high-growth sectors by and large, not in individual companies per se. They will have expertise in a segment, or buy it in when the KPIs look more rewarding than elsewhere. So VCs can be shy of high-value niche products, because they have bet on other growth sectors as attractive and are hunting for companies in those.

Alternatively, one happy accident of portfolio angels is that VCs trust them: when VCs are looking to invest extra money, or when sectors they have looked into go sour, they may back a wild-card horse from the trusted portfolio. Some VCs in Silicon Valley probably just take companies "screened" by their favourite portfolio angels.

Business angels without portfolios, say with just a few companies, are going to be more hands-on, while those with portfolios are going to be more dogmatic and prescriptive: the latter will want to see a management team and board of proven calibre, and this has its costs in terms of salaries, and in the arrogance of managers coming in to ride and steer the founding entrepreneur's ideas.

Backdoor references are often sought: unsolicited networking to assess the qualities of personnel. So there are ongoing issues for the entrepreneur and core "engineering" team, both in accepting new personnel and in being assessed off the record on their own "personal brand".

Business Angels and Venture Screening: Why the "Elevator Pitch"?

At the ground floor of things, the two groups of investors behave quite similarly: both use the "elevator pitch" as the first major screen of potential companies to invest in, and even then there has probably been a "triage" whereby the company pitching has been recommended or screened by an associate.

So the elevator pitch has to explain the product or service, the USP and the size and growth of the market in a very concise and attractive way.

VCs will also be doing a further job in deciding which industries to invest in: where there is growth and margin, where there is a need for supply. The angel, by contrast, is probably entrenched in the industry, or simply has a more open mind towards high-risk backing of a good concept.

Practically, a portfolio manager or angel will see between five pitches a week and five a day. Any pitches that get through the "elevator" exec-summary stage then get screened by looking at those unsolicited references, the CVs of the founders, the supply chain, and an idea of what people already pay for products and services in that area. The knock-out rate here is therefore high.

This means, though, that even with a good pitch you are only as good as 1) the attractiveness of your industry to the VC or angel, and 2) how good the other pitches were that day, week, month or quarter.

So the elevator pitch is just a function of human buying behaviour: we browse, we feel branding, we pick the best from the shelf to match our thinking on that day. We don't spend time reading the best Harvard MBA business plans, risk assessments and algorithmic financial-scenario results: we take short cuts based on highly summarised information, a concise presentation of the idea, and the quality of the concept itself.

Attrition and Success Rates for Angel Supported Companies

Fifty to one? A third go bust, a third return their original investment plus perhaps some interest, while the top third make the large multiples in ROI: the 100% (2x), 500% or 1000% Y2 ROI on their valuation when they are sold on or floated by FPO. The large ROI on the one-third of stars must of course pay for the failed third and the "stake back on your bet" third.
Conversely, an independent angel without a portfolio may have such a good grasp of the concept, and be able to offer so many open doors and resources, that they choose to invest and the marriage is successful; not only thereby, but because the angel is prepared to put time and shoe leather into supporting the business.
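To see why the winning third must carry the rest, here is a quick back-of-envelope calculation. The multipliers are illustrative assumptions, not data:

```python
# Rough expected-return check on the "thirds" rule of thumb for an
# angel portfolio. The multipliers are illustrative assumptions.
def portfolio_multiple(n_companies, stake_per_company,
                       fail_mult=0.0, payback_mult=1.0, star_mult=5.0):
    """Overall money multiple if the portfolio splits evenly into
    failures, stake-back returns, and stars."""
    third = n_companies / 3
    invested = n_companies * stake_per_company
    returned = stake_per_company * third * (fail_mult + payback_mult + star_mult)
    return returned / invested

# A 5x average on the winning third gives 2x on the whole portfolio:
print(portfolio_multiple(30, 100_000))  # 2.0
```

In other words, the stars have to average roughly a 5x exit just for the portfolio as a whole to double.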

Nimble and Lean Burn is Important to Angels and VCs

Being "nimble" means being able to move quickly to take advantage of changing conditions in your supply chain. Most of all, it means being open-minded about morphing the business in a new direction. This can include:

  • Picking up on new potential revenue streams quickly and converting them
  • Redefining who and where the customer is in the supply chain and in the world
  • Being able to prototype rapidly and test-market: Slide.com tests 4 to 6 new (NPI) products A DAY!
  • Open-mindedness about moving the shell of the business in a completely new direction, based on happenstance, or on a rationalisation that the current route is unprofitable or low in ROI.
  • Dropping poor routes or products and carrying on with the good ones, or, as above, using the shell to do something new in which the team has encountered potential.
  • The capability to "breathe": expanding by right-sizing and outsourcing so that projects can be delivered without committing to overhead. This could even be short term, a matter of weeks, in terms of hiring and firing costs.
  • Low Base Cash Burn: being able to return to a low base cash burn rate while still holding and building value in the proposition.
To give an example of the latter two: a company may need to prototype and implement an ISO system for a potential customer. This takes more resources and tooling than the company has; they could of course hire, but it would be better to outsource under sound CDAs. After the prototyping there may be a protracted evaluation period, or changes to the specifications, which slow the time to payment. These can be handled at a lower head count, and the company returns to a base cash burn while it waits for the milestone payment or revenue.

Alternatively, a company may reach a bottleneck: there may be no properly experienced Java programmers in "the valley" at that time to implement the product. It may then be worth putting the company on ice while the labour market loosens.

A low burn rate in itself enables nimbleness, because companies can survive while trying out new directions. Trial and error can then be a positive experience from which the company emerges stronger.

Nimbleness appears to be the exact opposite of strategy: it is more tactical and reactive than earlier concepts of "core business" and "mission statements". Indeed, strategy now seems to be both a wider direction and a focus on the abilities and will to win of the team. Overall you are likely to have a strategy in a market, and that may begin in one niche or direction. You therefore have a strategy to build the initial team of five to eight people and take them into the market and client groups at which they aim their bow.

Right-Sizing and Outsourcing: Risks and Rewards of Being Nimble
The problem with outsourcing or using contract labour is that your business concept and IPR are revealed to personnel who have no definable loyalty to the company, or perhaps even any knowledge of IPR law. In software, for example, they may be able to implement in C++ a very similar solution to the one you have in Java, circumventing copyright and at least some US software patents. However, some suppliers specialise in secretive projects for start-ups or for companies needing "breathing" capacity.

Looking for suppliers in this way, addressing top management under CDA, is also a very good means of approaching potential merger and acquisition partnerships. A jobbing engineering workshop that does a lot of one-offs and prototypes may want to move into steadier, higher-value production; or a Java hot shop may be looking to partner with the next "Facebook" entrepreneur.

Milestones Should Be Market and Quality Oriented

Businesses should indeed set NPI, or earlier prototyping and proof-of-concept, milestones: sometimes with no market contact, which could expose the IPR too early. Soon, however, the company should be setting alternative milestones and KPIs: breadth of client and partner contact, and depth of business development penetration.

  • Number of presentations at CEO level
  • Number of call-backs
  • Ranking of contacts and meeting outcomes for their quality, and for the match with the customer
  • Number of escalations
  • Conversion rate in the funnel
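These KPIs boil down to stage-by-stage conversion rates through the funnel. A minimal sketch, with hypothetical stage names and counts:

```python
# Stage-by-stage conversion through a business-development funnel.
# The stage names and counts below are hypothetical examples.
funnel = [
    ("CEO-level presentations", 40),
    ("call-backs", 16),
    ("escalations", 8),
    ("deals closed", 2),
]

# Conversion from each stage to the next:
for (stage, count), (next_stage, next_count) in zip(funnel, funnel[1:]):
    print(f"{stage} -> {next_stage}: {next_count / count:.0%}")

overall = funnel[-1][1] / funnel[0][1]
print(f"overall funnel conversion: {overall:.0%}")  # 5%
```

Tracking where the funnel narrows most tells you whether the problem is the pitch, the follow-up, or the close.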

For a consumer-oriented business, the milestones should be rapid prototyping and feedback from significant numbers of potential users, on as wide a geographic basis as practical.

The level of benefit over competing products and services must in fact be assessed by customers, not by internal staff. This is a key reason that high-tech companies fail: they are myopic about their own solution being unique and better, while missing the customer's trade-off between the value increment and the risk of adoption.

So a key milestone must be that the NPI really does deliver x% better, and that this is more than a just-noticeable difference. It must cross a real threshold which will win over early adopters, and not just innovator "geeks", at the front edge of introduction and word-of-mouth marketing.

What is Better? What do consumers like in NPI?

In fact, the big brands on the internet do very little that was not being done by the mid to late nineties: Twitter can be traced back to newsgroups and even earlier, perhaps to phone phreaks and early pre-TCP/IP notes.

The Facebook concept has had various guises since "personal home pages" first became popular in universities in the mid nineties: Friends Reunited among others. Digg and MySpace, too, could essentially be found in the 1990s, and "tweet/blog" decks relate back to the 1990s "jump stations".

AltaVista and FAST had leading, performant search engines which rivalled Yahoo, and later Google, in the nineties. Why did they not win their share of the market?

Why are the big brands then, all massive successes?

On the "product" side, they provide a more cohesive user experience which is, in fact, simple, and often this means the service is faster both to use and to grasp. You could also say the products were in the right place at the right time for the explosion of mobile devices onto media-rich web sites.

Perhaps many people had gained experience with the earlier products and so were warmed up to adopt a tighter, neater execution funded solely by discreet advertising. The last point cannot be stressed enough: intrusive advertising and pay-for-contact social web sites were doomed to reach their limits in utility for building a social network and communicating with it online.

Certainly there is a lot to be said for the interplay of early adopters, good-simple-branding, a performant NPI concept, and a level of familiarity and readiness for the wider market to adopt these new social media platforms.

Branding is important: we trust FB not to cause issues for us itself, not to change radically, not to stop being reliable or pack up shop... or just sell out to the latest JavaScript pop-over banner ads. People trust Twitter to be fast and accurate, and not to glue them into spam but rather into the extended social network and information constellation out there. We recognise FB's graphics, we are comfortable with the core offering, the layout is consistent, we accept "share to FB" from just about anywhere, and we trust it to work and to be a resource for us in future.

What Chance the Start Up?

One thing all the big internet brands have in common is that they all got funding and personnel in Silicon Valley. At the time it was the only place where a critical mass could be achieved: like Microsoft in Seattle before them, or the big brands and Madison Avenue in the heyday of TV and press advertising.

So the place you seed in is going to be important, and for any given industry the labour supply is also key.

People and ideas have become more important than numbers and projections. Perhaps you need the "excuse" of starting up to do X, Y and maybe not Z right now in a fast growing market.
But wait up here: before Facebook came along, off-search-engine advertising was a scary business model. How did FB swing the balance?

Well, the cost of gaining the critical mass needed to set the huge social word-of-mouth marketing snowball rolling was actually relatively low, and once FB attained exponential growth, the advertisers wanted on board as an alternative to Google.

So for the start-up there is hope, but you have to get on the train, perhaps, at the right station.

Saturday, January 22, 2011

Scaling Web Servers for the Non IT Manager

(c) Authors 2011


This lecture will help non IT managers come to terms with the topic, and help them to improve their decision making and interactions with internal and external IT resources. When faced with expanding web traffic, management decisions should be informed, practical and economically justifiable.

The level is also pitched at MSc Marketing students at Strathclyde University. Some of them will actually have an IT background, and it may help expand their thinking into the management issues involved in scaling web server capacity.

An executive summary and discussion workshop is reserved for the end of this lecture, so that you stand better informed before assertions are summarised and further discussion is taken offline. Within the lecture we will delve into the technical solutions, trying to explain them accurately in plain English where possible, or building upon your knowledge from my own and other learning materials.

Key Management Issues in Coping with Internet Server Traffic

Scalability, in terms of web site server capacity, means being able to handle more of the following:
  • Volume of incoming traffic, i.e. simultaneous user sessions: requests for web pages and information, and a seamless, connected flow through web shops;
  • Quantity and intensity of internal data-handling for these incoming requests;
  • The power and appropriateness of the back-end database (e.g. MySQL, Oracle);
  • Outward serving of requested web pages and 'dynamic' information.
This is very much a topic for general and entrepreneurial management, because it is the main investment in resources for a 'growing' web site in terms of hardware, telecoms connectivity, software installation for scaling, and incremental manual maintenance.

Often the investment in scaling will be many times the cost of initial web site development.

Management Perspective: Conflicts of Interest - It Pays to Be Informed

The conflicts of interest between general or marketing management on one side, and IT vendors, consultancies and internal IT fiefdoms on the other, very often centre on over-engineering solutions to meet "best practice and industry benchmark standards", rather than matching a solution to anticipated needs, one which can itself go on to scale further if required. Not to mince words, IT people often have a vested interest in over-delivery: in terms of their top line as a supplier, or their departmental head count and budget as your IT department.

However, the opposite situation may also be true, whereby a loyal IT department or web-host supplier struggles to shoe-horn more capacity into a fundamentally insubstantial architecture, and continues to patch-scale and fire-fight, tying up resources in this rather than in the planning and implementation of a new server solution.


What is a Web Server?

A web server is simply a powerful computer with an Ethernet card, capable of handling a larger number of users than a PC or a local shared office file server would. The other main differences from your ordinary PC or Mac are that it will run a different operating system (OS), most often Linux these days, and completely different software:
  • Apache (for example), for handling requests for web pages and information, and to-and-from e-commerce transactions from web pages;
  • a dynamic language engine, like PHP, ColdFusion or ASP, which reads incoming GET URLs and POSTed data, and organises computation and replies to these;
  • an SQL database interface, such as the one to MySQL, which allows languages like PHP to work with a larger, well-designed database through a more efficient interface than PHP alone could manage.
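To make that division of labour concrete, here is a toy version of the request/response cycle in Python's standard library. It is a sketch of the same job Apache plus a PHP script perform, not the real thing; the URL and parameter names are made-up examples:

```python
# A toy request/response cycle: parse the GET query string, run some
# "dynamic" logic, return HTML. Python stdlib only.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

def render(path):
    """The 'dynamic language engine' step: read GET parameters and
    build an HTML reply, as a PHP script would."""
    params = parse_qs(urlparse(path).query)
    name = params.get("name", ["world"])[0]
    return f"<html><body>Hello, {name}</body></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render(self.path).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To actually serve requests on port 8080:
#     HTTPServer(("", 8080), Handler).serve_forever()
print(render("/index.php?name=Ada"))  # <html><body>Hello, Ada</body></html>
```

Every layer named above (request parsing, dynamic logic, response) is visible here in miniature; scaling is about doing this many thousands of times per second.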

What do we mean by Scaling Web Server Capacity?

When we talk about scaling web servers, we mean increasing the number of simultaneous user sessions possible on the hardware, and tackling any increased complexity of those sessions. We increase the computing processor power, connectivity (bandwidth, internal server LAN, on-board RAM and bus speed), and local and back-end database memory.

We do this by utilising:

a) more powerful machines (vertical scaling, see below)
b) more machines, most often "clones" (horizontal scaling)
c) better architecture
d) intelligent load balancing
e) software accelerators / short cuts


Goals in Scaling to Meet New Demand

In up-scaling to provide capacity, what do we actually want and need to achieve?

1) A fast and contiguous user experience: users expect rapid interaction on the internet, and it is vital to maintain continuity in user sessions.

2) Fidelity, Redundancy and Back-Up of Data: we need data stored and retrieved accurately, to an acceptable level of currency, with periodic back-up of "dynamic" data records. Data will need to be duplicated across "load-balanced clone servers", or stored in a central file server accessed by the clones. Some of this data will need to be prioritised when populating all the cloned sites' data stores: e.g. user data for log-in, important news items, product launches or deletions. If there is a computer failure or downtime, we would want the site to continue functioning: do we need the complexity of preserving user sessions even if a front-end server goes down?

3) A realistic level of investment and cost control in scaling to meet new and anticipated demand: being able to meet projected capacity, perhaps with a margin to exceed it, without over-engineering a solution beyond a definable capacity.

4) A known "road map" of potential upgrades to system architecture given projected growth or scenario setting for different potential user numbers and computing intensity.

Traffic Load Monitoring and Planning

Internet traffic for a newly launched web site is by and large chaotic: it is a function as much, or even more, of "word of mouth" (tweets, blogs, RSS feeds, top news sites) as it is of online marketing or offline advertising.

Obviously, though, a web site will see traffic peaks around the times its key user group is most active on the internet around your offering. You may also see peak traffic relating to promotions, ticket sales or fortuitous on/offline PR. It is possible to hire in extra server banks or set up a queuing system for users, but that is not really within the scope of this lecture, which deals rather with scaling demand over time.

Over time, ignoring any fortuitous "clicky" PR, you have the following to consider in the equation:

1) underlying growth rate (a cumulative moving annual total, Y2 versus Y1, will help you assess this)

2) seasonality (run rate, i.e. ordinary graphing of Y1; data from similar web sites; a moving quarterly total)

3) daily internet time habits

4) planned marketing, promotions and ranking on the key search engines for keywords
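The moving annual total (MAT) mentioned in point 1 is easy to compute: it is simply the trailing twelve-month sum, and comparing it now against a year ago strips out seasonality to expose the underlying growth rate. The monthly figures below are invented, with 10% annual growth baked in:

```python
# Moving annual total: the sum of the trailing 12 months of traffic.
# Monthly visit figures are invented for illustration.
def moving_annual_total(monthly, window=12):
    return [sum(monthly[i - window + 1 : i + 1])
            for i in range(window - 1, len(monthly))]

y1 = [100, 90, 110, 120, 100, 90, 130, 140, 120, 110, 110, 100]
y2 = [round(v * 1.10) for v in y1]      # same seasonality, 10% growth
mat = moving_annual_total(y1 + y2)

growth = mat[-1] / mat[0] - 1
print(f"underlying growth, Y2 over Y1: {growth:.1%}")  # 10.0%
```

Note how the raw monthly numbers swing by 40% with the seasons, while the MAT series moves smoothly and recovers the true 10% trend.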

As mentioned below, you can choose different strategies to meet forecast demand, or tactics to cope with sudden peaks in traffic. At the outset of a new web venture, you will ideally have a handle on your marketing budget and on the success of previous web ventures, either internal or from a consultancy.

In this way you can balance a desirable budget for hardware and for setting up the server banks against a set peak demand, and consider this in light of break-even and target income per day, i.e. hits converted to sales.

So practically, to keep the accountants happy, you would want to be able to cope with the back-end SQL requirements for processing your target sales volume, from the projected time of break-even to the point when you want to achieve target operational profitability. This would be done in light of a physical measure of capacity, probably sales transactions per peak-time hour. Meanwhile, you may wish to "lighten" the server load from people not in the shopping channel of your web site, i.e. make casual visitors pay a queuing dividend, or simply send them a server-busy reply.
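That back-of-envelope capacity plan can be written down explicitly. All the figures below are invented assumptions; the point is the shape of the calculation:

```python
# Back-of-envelope capacity plan: how many front-end servers are
# needed to clear the target peak sales load? All figures invented.
import math

target_sales_per_day = 20_000     # orders/day at target profitability
peak_hour_share = 0.25            # quarter of daily orders in the peak hour
browsers_per_buyer = 20           # casual sessions per completed sale
requests_per_session = 8          # page/asset requests per session
server_capacity_rps = 50          # sustained requests/second per server

peak_sales_per_sec = target_sales_per_day * peak_hour_share / 3600
peak_requests_per_sec = (peak_sales_per_sec
                         * browsers_per_buyer * requests_per_session)

servers = math.ceil(peak_requests_per_sec / server_capacity_rps)
print(f"peak load: {peak_requests_per_sec:.0f} req/s -> {servers} server(s)")
```

Notice that casual browsers dominate the load: shedding them with a queue or a server-busy reply, as suggested above, cuts the requirement far more than trimming the buying traffic ever could.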

At this point you would want to engage the concepts of game theory and scenario building, whereby you consider different influential factors and outcomes, and create plans, or a framework, to meet these scenarios. Producing a variety of possible peak and steady traffic numbers is beyond the scope of this lecture, but you still need to understand how to cope with each likely traffic volume.

Introduction to the Requirement Definition and
Technical Management in Scaling

Eventually, as visits to and interaction with a web site grow, the internet traffic loads the server hardware to a point of saturation: no more users can be coped with, and the user gets either a loading spinner while the browser pane just "hangs", a blank page, or, if they are lucky, the following:
Error 503 Service Unavailable
The server is currently unavailable (because it is overloaded or down for maintenance). Generally, this is a temporary state.
What is going on in the server? Well, firstly, the Ethernet card handles a limited number of transactions per second, if indeed it is specified to match anything near the incoming bandwidth, which may itself be insufficient.

Further into the "layers" of the server computer, RAM and near-CPU cache fill up, the ports' bandwidth fills in and out, any back-end database connections become busy or the allowed pool of user connections fills, and the CPU in the primary server gets overloaded. At this point various queuing issues arise, or users' requests and interactions with the web site simply fail. A server-busy error message is often generated. Sometimes the entire server will crash, or even overheat.

The major choices in hardware are to either scale 'vertically' or 'horizontally':

Vertically: this means buying better-quality machines (with, for example, more CPUs on the board and more level-2 cache), more powerful Ethernet ports, and actual "front-end" intelligent traffic routers. The latter integrates with the other route, which is to scale horizontally: i.e. to install more machines running in parallel (or interconnected, sharing data repositories and user sessions) to deliver the web interaction to many more users.

The software needed to load-balance is another area which, although inexpensive, can have cost implications in personnel resources for implementation, maintenance and life-cycle management. Another, more expensive software route, requiring more expert intervention and ongoing management, is software which optimises server performance, thus reducing CPU/RAM load: these work in different ways, as we will learn, and are most relevant to "dynamic" web sites.

Cost Implications of Scaling

The relatively significant costs of scaling must be balanced against the expected income or utility from the web site. A "best practice" technical solution recommended by a consultancy or vendor may be both an over-engineered solution and more expensive than expanding the server and traffic channel "organically", that is to say, incrementally, to meet demand encountered over time.

Vendors (suppliers) and consultants will of course give a very good outline of the cost per thousand user sessions, and of what level of complexity can be handled by a given set-up (system architecture and software). In implementing a scalable web site which anticipates high demand and high return on investment, using consultancies and installers such as Cisco, Microsoft partners, IBM or Oracle, to name but a few, can instil a high level of confidence throughout your own organisation.

Internal corporate IT managers will also be able to indicate how much load the current system handles, or how much a "cloned" horizontal scaling (see below) replicating the current server would cost.

Anticipating demand, and therefore traffic load levels, for a telecoms connection can be difficult, given that search engine optimisation, links on leading web sites, fortuitous search engine listings, or successful on/offline marketing and PR campaigns can all deliver more traffic than the service-level agreement allows for, and trigger punitive over-capacity charges from the bandwidth supplier.

Connective bandwidth is one area for cost control and planning, while the number of, and quality of investment in, server computers is another, and an area for hidden costs or overly expensive investment.

Hardware will then contribute the majority of your costs, but software can rapidly add to the cost of initial implementation or of ongoing operational resources.

Management Perspective: take a scenario. You implement a solid hardware-based solution from a leading vendor (supplier). The functionality of the implementation (i.e. the web site's dynamic features) outgrows the initial hardware capacity specification, and it becomes desirable to implement various levels of load balancing, caching and acceleration. The original consultancy re-quotes, and this is outwith budget. You engage a small consultancy, which installs a cutting-edge solution. It works at first, but when you alter the web site structure it stops working: the consultancy is bankrupt, and no programmers will certify a fix on the software. You should perhaps have opted for a better-tested software solution, and reached a compromise with the consultancy to budget for cooperation with a cheaper, specialist supplier. The maintenance problem is then reduced, as common solutions are often taught in computer science or learned on the job.

Where the budget is limited but the utility of the web site is reasonably high while income is low, or where it is in fact a social enterprise (NPO), lower-value users may not actually mind being told they are queued, or that their data will be processed later. In that case the current hardware, or that which can be budgeted for, determines the user experience and service level, rather than the reverse.

Database Resources and Hidden Costs

Most dynamic web sites which handle any large amount of data for many simultaneous users use a "back-end database" server. This will be discussed in more detail in a later lecture, but in terms of scaling it also has direct and hidden costs.

A simple web site with some 'non-perishable' interactivity, such as log-in or a shopping cart, may well be developed without a full database, or with a "local" database which is just a file repository on the Apache server, accessed by the web language through an SQL interface plug-in, so to speak. For example, in PHP you can install SQLite, which uses simplified commands to deposit and retrieve data from a file on the server. However, this will create bottlenecks if you use one "table", i.e. one file, for all users, or start to use a lot of memory and CPU time if you have a table per session or per user. With larger numbers of data fields (name, address and so on), and with the need to store associated files such as JPEGs or documents, it soon becomes highly desirable to employ a separate machine with a powerful database installed on it: the "back-end" database server.
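The same SQLite engine mentioned above for PHP also ships with Python, which makes for a compact illustration of the "database as a single local file" stage; the table and data here are invented:

```python
# SQLite as the "local database" stage: the whole database lives in a
# single file next to the web code, with no separate database server.
import sqlite3

conn = sqlite3.connect(":memory:")   # use a filename on a real server
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("Ada", "ada@example.com"))
conn.commit()

row = conn.execute("SELECT email FROM users WHERE name = ?",
                   ("Ada",)).fetchone()
print(row[0])  # ada@example.com
```

Every query here runs on the web server's own CPU and disk, which is exactly why this approach stops scaling: a dedicated back-end database server moves that work onto its own machine.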

There are a lot of benefits in moving over to this approach, but there are two main issues in scaling up to a powerful database engine:

1) Physical: the web server to database ratio, and redundancy of data.

2) Re-engineering web sites to new databases

For the former, if most users do not access the database, then one back-end server may suffice. However, where the function of the web site is like a banking service or Facebook, nearly all users will want to log in and interact with data in the database. We will return to scaling back-end databases below.

Strategies and Technologies for Implementing Scale
in Web Server Capacity:

Vertical Scaling : Expanding Capacity by Increasing Quality of Machinery

The first means of scaling server capacity comes without any need to reprogram the Apache/PHP/Linux software environment:

Most simple, 'local' server machines, intended for light loads of say 50 to 100 simultaneous user sessions, have the physical capacity to take more RAM and ports, and a second CPU on the motherboard, or a second CPU motherboard. When this is done, the system auto-detects and installs the new hardware, integrating it seamlessly, with all the higher-level software running immediately. Everything runs faster, so more users can be accommodated, and their page requests and form responses go quicker.

The real-world limits for motherboards are around 256 GB of RAM, 2 x 3 GHz CPUs and 64-bit buses: the cycles still top out, though, with many users or complex scripting and back-end SQL interfacing.

There is always a premium for top-end servers, so it is often cheaper to scale with double the number of cheaper systems. Also, if one breaks down, there is still one up!

So instead of spending on fast but expensive machines while still carrying the risk of downtime, you scale horizontally:

Horizontal Scaling: Expanding Capacity by Number of Server Machines

This just means deploying more servers to handle the load. The most common scenario at the primary server level is that these are in effect clones of each other in terms of operating system, dynamic web language engine, related code and data repositories. This makes scaling easy to implement, because reprogramming is basically a very rapid "copy and paste" of all programming code, systems and information, thus minimising downtime during the up-scale to further multiples of servers, during minor bug fixing, or during any new web site implementations.

In my experience, however, this has not always been possible, because a reliable workhorse of a single server became obsolete over time, meaning it was not powerful enough to run a new OS and newer dynamic language engine. It may actually be discontinued, so the next machine is only a "clone" to the outside world and is really running a new environment. In this case there will be significant duplication of manual tasks in updates etc., so it may be better to migrate the whole site to a new server which is 'vertically' scaled and can scale further on a horizontal basis. The scenario of rapid obsolescence has now been largely overcome by the very common use of efficient Linux OSs and PHP/ASP systems with their accelerators (see below), allowing a reasonable lifespan for server machines.

The additional cost of this is in designing and managing load balancing between the two: for example, you may have an XML-mediated CSV (comma-separated values) database running to millions of records, which could be best split over the two servers, with a partitioning of users "A to L" to server 1 and "M to Z" to server 2. This requires intelligent routing, which we will come back to.
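That alphabetical partitioning rule is trivial to express; a minimal sketch in Python (the server names are invented for illustration):

```python
def server_for(username: str) -> str:
    """Route users A-L to server 1 and M-Z to server 2 by first letter."""
    first = username.strip().upper()[0]
    return "server1" if "A" <= first <= "L" else "server2"

print(server_for("alice"))    # first letter in A-L
print(server_for("mallory"))  # first letter in M-Z
```

The hard part is not the rule itself but applying it consistently at the router and rebalancing when one half of the alphabet outgrows its server.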

Session ID now becomes complex: if you load-balance over to the other server during a session, the user's session UID and related temporary data cached on the first server (e.g. shopping cart, log-in, progress in a form) will not be found there, so the user's state could be lost if a subsequent HTTP request goes to the other server.

Duplication of databases is often not practical if there is a need for strong real-time continuity, so user behaviour can be examined instead: perhaps users view many static HTML pages before they start to interrogate the SQL server, or most traffic never even reaches the pages containing these requests.

Even sessions can be kept contiguous for individual users across different servers, by saving the session temp cache and a duplicate of the UID in a commonly accessible network folder: this needs to be very quick to access, and is an expensive investment on large server banks. Then there is the issue of redundancy for this important part of the architecture: it takes investment and uses time/space on the internal bandwidth and CPUs.

Load Sharing:

Load sharing, or load balancing, means distributing the HTTP traffic to different machines in the horizontally scaled server bank. This can be achieved at several levels or "layers", and by using either simple or intelligent traffic management:

1) Multiple IP addresses per dot-com on DNS servers can be the first point of load sharing: on the original DNS request from a browser, the DNS can return several IP addresses for the one web site, and most DNS servers have software which simply cycles through the list of IPs for that domain name sequentially ("round-robin DNS"). This gives you low-cost load sharing. Drawbacks, though: caching at ISP level (virtual DNS request management) can interfere with this and with other changes of IP address, and the browser itself may cache the IP for a domain name. You could allow for days of overlap between servers, but you then risk unsynchronised database entries. It also takes time for a new IP to propagate to the world's DNS servers, and then to the ISPs' virtual DNS servers. Also, if one server IP goes down, a disproportionate number of users can be affected because of the vDNS caching: even if the other three machines could handle all the traffic while #4 is down (at night, for example), you still lose roughly 25% of visits no matter what you can serve from the machines still up.
(On Windows, "ipconfig /flushdns" clears the local DNS cache from the command line.)

2) "Layer 4 or 7" balancing: intelligent router load balancing and IP sharing. A device owned by you, a router-server, redirects requests for the web site to a bank of servers. One way is to use sub-domains, e.g. www2..., so you can control more actual machines with IPs behind the scenes and load-balance without DNS effects; the use of multiple sub-domain names can, however, be annoying for users if they bookmark the address. Intelligent routers can act on content, parsing the packets quickly and sending them to the most relevant server: e.g. database requests for certain folders, particular page requests, or setting up cookies/sessions.

These routers can themselves be scaled horizontally: the second device can be just for out-times, or for when the primary goes down, with the IP simply passed over to the standby's Ethernet card. Even session information can be balanced, i.e. duplicated between the two routers in case one goes down, because they are 1) really just routing, transparently, but 2) intelligent enough to route users back to the web server they have already been on. These are called "sticky sessions", and can be cookie-mediated at the intelligent-router level.
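A toy sketch of cookie-mediated sticky sessions at the balancer (Python; the cookie name and server names are invented for illustration):

```python
import random

SERVERS = ["web1", "web2", "web3"]

def route(cookies: dict) -> tuple:
    """Return (server, cookies): reuse the pinned server if the sticky
    cookie is present, otherwise pick one and pin it in the cookie jar."""
    server = cookies.get("sticky_srv")
    if server not in SERVERS:
        server = random.choice(SERVERS)
        cookies = {**cookies, "sticky_srv": server}
    return server, cookies

# First request: no cookie, so a server is chosen and pinned.
srv, jar = route({})
# Subsequent requests carrying the cookie always hit the same server.
assert route(jar)[0] == srv
```

The router stays transparent to the application: the web servers never need to know the pinning happened.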

Partitioning: at the intelligent router/load balancer, log-ins for user names A to L go to one server, with their data held there or in a back-end SQL server.

So this has a cost-benefit calculation for online e-commerce shops. "Fibre Channel" is an expensive, rapid-access network file server and connection, for example; NFS is at the other end, being very cheap, low-maintenance database/source file sharing on Unix/Linux.

RAID arrays of managed hard drives were previously very popular because they could manage information redundancy as a shared file/data repository, which overcame some of the issues of potential machine failure and recovery of data after crashes.

There is various software used to balance at layer 7 or at the primary web server level: LVS, Piranha (Red Hat Linux), HAProxy (on Ubuntu Linux). A pair of Citrix or Cisco load balancers alone costs approximately 100K USD, as an installed hardware solution with optimised Linux/Unix software.

Scaling Issues with "Back End Databases"

When a web site has either a large amount of complexity in the fields it will read and write in a back-end database, or many more simultaneous users (and especially when both are the case), then either a web-server-directory database or a single back-end database machine will become insufficient.

In order to cope with the complexity and demand, planning of the hardware, internal connectivity and software optimisation will be required to effect good load balancing and give users a fast web site experience.

This means you have both the issue of horizontal/vertical back-end scaling AND load balancing and data sharing, such that users have a contiguous experience. One way to ensure the latter is to link the log-in to a route to only one unique back-end server, so all users A to D go to that server. More likely, you date-stamp the user and use an Apache server-side "triage" log-in table, such that the user name is recognised, a date is found, and the correct back-end server is hooked up for that user, irrespective of which web server they have come in on.
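A minimal sketch of such a "triage" look-up table (Python; the user names, dates and shard names are invented):

```python
# Server-side triage table: username -> (registration date, back-end shard).
TRIAGE = {
    "alice": ("2010-03-01", "db-shard-1"),
    "bob":   ("2011-01-12", "db-shard-2"),
}

def backend_for(username: str) -> str:
    """Look up which back-end database serves this user, regardless of
    which front-end web server the request arrived on."""
    _date, shard = TRIAGE[username]
    return shard

assert backend_for("alice") == "db-shard-1"
```

Because every web server consults the same table, the user's data stays on one back end however the front-end load is balanced.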

Vendors of branded database engines, and experts in the 'freeware' MySQL, will be able to present scenarios for traffic handling:
  • what type of back-end servers (CPUs, OS, RAM, hard drives, RAID systems)
  • and what physical connectivity (Ethernet cards, ports) will be needed
  • what level of redundancy and back-up is needed, etc.
The above will be taken into account and costs presented, to determine how many simultaneous user sessions can be handled economically, and how the load balancing weighs up against contiguity of data and user session. As mentioned before, there may be an economic case for simply not allowing more users on at peak times, or for re-iterating low-security, frequently accessed data on the Apache server side in small tables dumped out of the database (see SQL caching).

Software and Reprogramming Costs in Changing Back End

However, the other major issue when faced with suggestions by a vendor or internal developer to change the back-end database engine is that the database interaction coding in the web site front end will ALL need to be re-engineered. SQL commands vary, the database call-ups in PHP/ASP vary, and the plug-in modules which add Apache-side functionality vary in their commands and utility.

Once again there is a vendor-client conflict-of-interest scenario: the vendor has vested interests as a "partner" firm to a branded database supplier, and these may change over time, from MS to Oracle for example. It may look like, say, extending your MS SQL Server licensing from your existing ERP system to replace the previous shareware MySQL back end on your web site is a sensible integration at little cost. However, the programming costs of the conversion may outweigh the costs of horizontally scaling and maintaining MySQL.

It is nearly always possible to create a flat table on a file server which several different database engines can access. In the worst-case scenario, you can just schedule a data dump of flat files (CSV or PSV etc.) from MySQL to a file location at scheduled intervals, for the data to be populated over to the main ERP system. This was in fact a very common integration method between earlier "green screen" ERP systems (often referred to as "legacy" systems) and e-commerce front ends in the 1990s and early 2000s: orders would be dumped down in a simple separator-delimited flat file every hour or so, and there would be a manual check to see that they were accurate and not duplicates.
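A sketch of such a scheduled flat-file dump, with Python's csv module standing in for the database export job (the order rows are invented):

```python
import csv, io

# Pretend result set from the web shop's orders table.
orders = [
    ("1001", "alice", "2 x widget"),
    ("1002", "bob",   "1 x gadget"),
]

# Dump to a separator-delimited flat file the ERP side can pick up.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="|")  # pipe-separated ("PSV")
writer.writerows(orders)
print(buf.getvalue())
```

In a real deployment the buffer would be a file in a drop folder, written on a cron-style schedule, exactly as the hourly dumps described above.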

Management Perspective: Development Phase and Uncertainty over Choice of Database Engine and Back End Architecture

When presented with the need to enter a development phase for a new dynamic web project, it may be unclear which back-end system to opt for before the actual functionality and user demand are defined. The costs and time penalties of re-engineering to a different back-end system, which may better suit the specific functionality and higher user demand, may be unforeseen by other managers.

So when prototyping a web site, it may be best to use a very simple, single-server set-up, with for example just XML data depositories or SQLite. From this perspective, the initial programming will be faster and cheaper, and because there will be only a small user test base, server loads will not be an issue. Furthermore, the simple programming and data tables will be very self-evident to the eventual developers. The code will therefore be easier to replace and supplement with higher database command-functionality when you upgrade to a more powerful back-end database.

Performance Enhancers

There are diverse strategies for optimising performance of a system in terms of streamlining the amount of processor tasks and the route the data takes:

1) Compiled Code (Java/C++ etc. Applets) and Opcode Cache
One is to run a compiled-language alternative to a higher-level interpreted language on the server: PHP is compiled on the fly by the PHP engine, so there is an extra load on the server processor. Requests to a PHP file location which are high-traffic and require more logical data processing could instead go to a Java server applet, which receives the request in the URL or POST from PHP, interprets it in a fixed parser, and then runs a very much quicker compiled Java routine.

The downside of this route is that it requires higher programmer skill to update or rewrite code, in comparison to a higher-level language like Active Server Pages (ASP) or PHP.

PHP and other dynamic web language engines do have compilers as accelerators, though. Opcode caching means you can have your cake and eat it, so to speak: the server BOTH parses the PHP, keeping the source there for reference or re-iteration, AND caches the compiled opcodes for execution during normal operation, so web pages and interactions are handled more directly and the on-the-fly compiling is eliminated. In PHP this is done by the likes of eAccelerator, APC and XCache. The programme is only recompiled when new PHP code is loaded onto the server file location.

2) Optimised Web Coding for Dynamic Web Sites

i) Think Traffic! Optimise the HTML

From the very top of the home page HTML document you start to incur web traffic loading as a multiplier: bytes x users. Coupled with the latest advances in highly dynamic, self-updating web sites, server load can soon become a headache, and you a victim of your own success.

When you consider the many billions of hits Google gets every day, even their URL expressions look to be optimised in terms of pure numbers of bytes, reducing the ASCII characters to a minimum.

Taking Google further: their pages have always been somewhat graphically sparse, with YouTube neatly partitioned off now. This is intentional! It is their brand, and their saving on server load.

The latest web sites present data in page elements which are fully dynamic: they can update every 5 seconds or at whatever periodicity is set, they request data as soon as the user moves the cursor, and so on. This AJAX/JSON/XML-mediated HTTP traffic, and the server intelligence load if you like, can snowball as users come to your site or start to be more interactive. Furthermore, they have to actually close the web page before, say, a five-second share or ticker price refresher stops making those HTTP hits on the server!

Best practice in this field is generally to be seen at, you guessed it, Google: in terms of the search engine, and especially the Google Maps API, which serves up a graphics-rich drag-scroll environment to millions. Banks also tend to be clever at cutting down HTTP traffic, and perhaps protecting their sites, by having succinct coding, minimal byte submission, low graphic content and efficient server-side applications.

Software ("minifiers") will rewrite your lengthy HTML/PHP for you, as mentioned, to make it more compact, if a little less understandable and lacking any documentation. Simple things like shorter parameter names or file paths can save on the bytes.
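A toy illustration of the byte-saving idea (Python; a real minifier does far more than collapse whitespace between tags):

```python
import re

def crude_minify(html: str) -> str:
    """Collapse runs of whitespace between tags -- a tiny fraction of
    what a real HTML minifier does, but it shows the bytes-saved idea."""
    return re.sub(r">\s+<", "><", html).strip()

page = """
<ul>
    <li>one</li>
    <li>two</li>
</ul>
"""
small = crude_minify(page)
print(len(page), "->", len(small))
```

Multiplied over millions of page serves, even a few dozen bytes saved per page becomes real bandwidth.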

So at the outset, the route for a high-volume consumer site through project management, testing, debugging, beta versioning, launching and eventually server scaling (i.e. the entire web build and life cycle) will be different from that of a low-demand, high-content business-to-business user service.

ii) Information and Query Caching

Flat HTML content files are very much quicker for a server to process and serve outwards, because they are just files with little or no interactive requirement; they load the CPU very little. For web sites built in a dynamic web language, however, it can be inefficient to serve a home page which is only altered now and again by running PHP each time: the request for index.php can instead just be answered with a current version of that page, previously generated dynamically and saved as plain HTML.

In fact, entire web sites are organised in this way, so that they are only dynamic during scheduled processing rounds, when new submissions and versions coming from users (like a notice board) or sources (like an RSS or other XML news feed, or even a back-end SQL database) are actually run through the dynamic language and output as HTML/XHTML.
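A sketch of that scheduled regeneration step (Python; the notice source and page template are invented for illustration):

```python
import time

def fetch_notices():
    """Stand-in for reading new submissions from a feed or database."""
    return ["Server maintenance Friday", "New catalogue online"]

def regenerate_home_page() -> str:
    """Run the 'dynamic' step once, emitting plain HTML that the web
    server can serve as a flat file until the next scheduled round."""
    items = "".join(f"<li>{n}</li>" for n in fetch_notices())
    stamp = time.strftime("%Y-%m-%d")
    return f"<html><body><ul>{items}</ul><p>Updated {stamp}</p></body></html>"

html = regenerate_home_page()
# In production this string would be written out to index.html on a schedule.
```

Between rounds, every visitor costs the server only a flat-file read, not a scripting-engine run.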

Economic and quality issues for management mean there has to be a utility-to-cost relationship: will lower-value users be content interacting with flat files? This approach was used quite frequently with Perl-based dynamic web sites in the late 90s and early 2000s, but it has drawbacks in the immediacy of data. So, for example, it is perhaps appropriate for a catalogue of products with current prices, whereas it is totally inappropriate for the shopping-cart side of buying those same products.

This is, however, an economic approach for interactive user-posting web sites where there is actually little monetary revenue from the hosting: it can be done by simply informing the user that their post / free advert etc. will appear after the next processing round.

A further area which can be treated as "low-currency / low-perishability content" is the SQL requests emanating from PHP code going to back-end databases: for example, in ColdFusion days, even a simple "welcome" message in a fully dynamic home page text area would incur ColdFusion engine CPU usage (in and out), back-end data connectivity delay and SQL Server CPU usage.

SQL Query - Result Caching

The response to a given query is cached: the information can be cached locally on the web server once generated, thus both being "static" information and avoiding the connection and CPU use on the back-end SQL server machine. This is actually very easy to implement on many web-server / back-end-server software set-ups, with simple command codes to store popular SQL queries and their results: a repeated query is recognised as a re-iteration and the local data served up, while new queries are sent down the PHP/SQL-server route.
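The recognise-and-reuse logic can be sketched as a simple memo table (Python; run_on_backend is a stand-in for the real database round trip):

```python
query_cache = {}

def run_on_backend(sql: str) -> list:
    """Stand-in for the expensive trip to the back-end SQL server."""
    return [("row-for", sql)]

def cached_query(sql: str) -> list:
    """Serve a repeated query from the local cache; only new queries
    travel to the back end."""
    if sql not in query_cache:
        query_cache[sql] = run_on_backend(sql)
    return query_cache[sql]

cached_query("SELECT name FROM users")   # goes to the back end
cached_query("SELECT name FROM users")   # served locally, no round trip
```

The open question in any real deployment is invalidation: when the underlying table changes, the cached result must be dropped or refreshed.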

Further to this, and most likely used by Facebook, is the "memcached" approach, which builds a local in-memory array of frequently requested information which is then queried on the fly from just that server. So, for example, if you do not update your profile, your little set of user preferences and your UID are actually local to the server (amongst millions of others, in this case perhaps!). This is more efficient because the underlying information may be stored across diverse tables in the SQL database, whereas the frequently paired information is very rapidly accessed in this small, simple, aggregated local server array.

This can also be used for emergency load balancing, or as back-up for frequently accessed information or UID codes: the small arrays are populated across servers, so you may update on one while a "back-up" copy sits on another; if the back end goes down, you then have at least some redundancy for simple log-ins or posting etc.

3) Back End Optimisations on SQL Data Servers
Master-slave: here one master takes the writes, and then populates several slave SQL servers which deliver that data out, thus balancing queries over the slaves. Given failure of, or upgrades to, the original master machine, the slaves should be able to take over the master role.

Load balancers are then used to manage traffic to the slaves and to send write commands only to the master; they can also help manage the cycles of populating the slaves.
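The read/write split at the balancer can be sketched as follows (Python; the server names are invented, and a real balancer would parse SQL far more carefully):

```python
SLAVES = ["db-slave-1", "db-slave-2"]
MASTER = "db-master"

_next = 0

def dispatch(sql: str) -> str:
    """Send writes to the master; round-robin reads over the slaves."""
    global _next
    if sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
        return MASTER
    server = SLAVES[_next % len(SLAVES)]
    _next += 1
    return server

assert dispatch("UPDATE users SET name = 'x'") == MASTER
print(dispatch("SELECT * FROM users"))  # one of the slaves
```

Replication lag is the catch: a read routed to a slave immediately after a write to the master may briefly return stale data.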


Also, you may find that the database tables, navigation and actual hardware are not sufficient, so you would want to scale vertically as load renders your current software/hardware combination inadequate for the traffic and processing encountered or expected.

4) Partitioning Users and Their Data
This can first and foremost be achieved geographically: Facebook UK, say, would store all user profile information for that market, or on a regional basis. You could also partition on user name, just alphabetically, or based upon some knowledge of surfing behaviour.
If we therefore know how many current and/or potential users there are for a partitioning possibility, then we can invest resources specifically in relation to that population "n".
This is most applicable where the "global" information to be served to users is somewhat limited in complexity, while local data relates more closely to the partitioned group. Thus the updates from a "super master" database server are global and scheduled at night time, for example, whereas within the partition, relevant information is virtually live, but not accessible outside that partition.
You could then impose a level of partition screening, or virtual partitioning: if, for example, a user logged in from abroad to their home country, all the servers in the world would know the simple user ID and which partitioned server to redirect the user to.

Summary of Key Management Issues in Scaling Web Server Capacity

The Key Management issues and decisions centre on the following:

  • Cost-benefit: this can have its own KPIs, for example cost per user, cost per thousand simultaneous transactions, and scaling-related cost-outcome scenarios.
Do we want to offer a very high quality of web experience to a defined number of users? If so, then vertical scaling based upon known traffic is a good solution to speed up the user session, integrate processing-hungry operations on the server side, or support larger outward downloads of files and scripts to the local user's computer via their browser.

Do we want to prepare for explosive growth in demand? Will we lease machinery at our location, or from the web host, to cope with demand from a marketing campaign? Do we expect to introduce far more dynamic code which is processor- and internal-bandwidth-intensive? If so, then we must consider both effective horizontal hardware scaling and optimisations in performance through software acceleration, data caching and intelligent routing.

Alternatively, do we expect predictable growth in demand, or a slow rate of new, incremental traffic? If so, we would want cost-effective horizontal scaling which can grow organically and pay for itself within defined time and margin parameters. Can we roll out implementation to selective customer groups or geographical markets, thus being able to plan for a known or estimable capacity? A horizontal solution with partitioning could be a very cost-definable means to plan for a maximum forecast capacity from partitioned user groups.

Scale... and scalability: one issue is to achieve the scale you require for current or anticipated levels of web traffic; the other is to have an architecture which allows for further scaling, in terms of both traffic and complexity of your web site. Will the web site start to search much larger data sources to offer users a richer experience? Will we roll out our full range of products, old stock, components and consumables onto the web shop? Will we force lower-end users over to web information / shopping? All of these can incur extra CPU, RAM and internal bus (bandwidth) load, through volume of users or complexity of code and SQL back-end computing.

As a manager faced with exponential growth in user traffic, plus integration to ERP and legacy data, capacity planning must be an open discussion with IT departments, web hosts and ISPs, with eventually the DNS authorities being involved to help balance your load.

Risk: do we accept a level of risk in either reducing costs or implementing a more innovative, higher-performing web experience for our target market? Or do we opt for safer technology with generally available personnel skills? Do we over-engineer to industry "best practice", or actually allow for a degree of over-capacity? Finally, do we risk adopting a stance of covering a minimum projected demand with the initial investment, while making a road map for scaling at a later date? How much do we expose ourselves to the risk of having peculiar systems delivered by either suppliers or internal IT developers?

Location: do you want to continue hosting at a virtual web host? "Will they be able to integrate a shop solution with your ERP order and inventory system?" would be a good place to start asking questions, and you may want to benchmark the performance of other similar solutions they host which connect to ERP at a client: are these slow at peak times? Are their computers suitably secure and redundant, and do they offer enough back-up and redundancy for your implementation? How much extra bandwidth cost is incurred? Will a rapid expansion in traffic to your site incur punitive surcharges which you are contractually obliged to cover? It may therefore be preferable to host on your own servers, in terms of quality in e-commerce and security of sensitive data. However, do you actually command the bandwidth and speed, in terms of being near to a fibre-optic "backbone" or "ring"?

The concepts of cloud computing and edge computing relate to a high level of interaction and redundancy made possible by super-fast backbone communications and the sharing of computer capacity in terms of uptime, port traffic capacity, RAM/disk memory and processor power. These are a bit beyond the scope of this lecture series, but are an emergent area which will have value both for large multinational corporates and web brands, and for egalitarian, information-based services like, perhaps, Wikipedia.

Finally, on location: will it be desirable to partition traffic from different geographical areas so as to load-balance? How do we avoid cross-domain-policy clashes? Do we in fact want to implement new domain names for different geographical regions so we can serve nearer to market, or do we need to enter a Google/Facebook-style agreement with the DNS authorities, allowing IP addresses to serve a dot-com from the nearest regional web server?

Crisis management: a new blog post going viral, for example, raises the question of how and when we bump users during supernova demand.

(c) Authors 2011

Sources: 1) Wikipedia: links given and all authors recognised.
2) Harvard University Dept. of Computer Science, "E75: Building Dynamic Web Sites", online streamed lectures (academic earth dot com), Professor David Malan.

3) VMware Blog, VMware World Headquarters, Palo Alto, CA 94304, USA.

Wednesday, January 19, 2011

Eating Cookies...not So Bad for Your PC Health

Cookies Explained

A cookie is a small internet-mediated capsule of information, if you like, which is downloaded by your machine and read by the remote computer during your session on their web site, or upon your next visit.

A cookie is just a small text file really, with a unique user ID, a small amount of related information and a reference to the host web site.
Cookies can only be held in one or two file locations (e.g. temporary internet files) and can only be short files conforming to a set syntax (layout); this security step is managed by your browser.

The cookie session ID is usually a simple, long integer, which allows for continuity in a single session between the little HTTP handshakes which go on between activities, or an ID which can be referred to between visits to the web site. The ID number relates your session or log-in to information stored on the server (the remote computer).
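The value of such an ID rests on it being practically unguessable; a minimal sketch of generating one (Python's secrets module standing in for whatever the server language provides):

```python
import secrets

def new_session_id() -> str:
    """A long random token: hard to guess, pre-empt or collide with,
    which is what makes the UID-only cookie approach reasonably safe."""
    return secrets.token_hex(16)  # 128 random bits as 32 hex characters

sid = new_session_id()
print(sid)
```

A sequential or timestamp-derived ID, by contrast, would let an attacker predict other users' sessions.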

Cookies can be used between visits, aka "permanent cookies": the cookie can include the username, for example, as well as the UID and cookie host name. One security issue here, which we will come back to, is that this can auto-populate fields with user names and, if you so specify, the password, with these words being held on your computer in the cookie. A time "print" from the server clock and an expiry date can also be set in the cookie. The default expiry is when you exit your browser, and in fact the usual browsers (IE, Firefox, Safari and Chrome) allow you to set "delete all cookies" when you close the programme.

Security Issues and Cookies

When you ask Facebook or Hotmail to "remember me on this machine", this is what it does, via the cookie; so it is important that you don't do this on public computers, and best not to if asked while using public WiFi with your laptop.

Some cookies will contain the user name and password, verbatim, for auto-completing forms or for an auto log-in, and this is really very bad security. Luckily, the majority use a UID alone, and this will be part of a site getting SSL certification to use HTTPS.

Also, with further due diligence to your own personal security, you should not reuse the user names and passwords from your more important accounts (your online bank, PayPal/Amazon etc., or the e-mail account itself) on less important log-ins. And don't let a site save these via the tick-box option: they may be left completely readable in a cookie! Or the site's server may not be secure, or you may be intercepted by a session hijack on a public WiFi hot spot or the LAN at school or the office.

Many people consider this a breach of privacy or potential security: it would only be so if the server hosting, say, Amazon were hacked and your ID number associated with your credit card, for example. However, some spyware (an executable you inadvertently download, or get from an infected disk/stick) may look at cookies to examine where you have been, and determine what content to serve up to you depending on your cookie host names.

Also, due to the cross-site policy installed in all good browsers, other web sites do not have permission to access the cookie. But, as for example with Google ads, a cookie can be requested from a second owner and used to record your surfing behaviour while on those first-party sites: this is called a third-party cookie, and you should be able to block it. It creates a lot more HTTP requests, which can be slow, and then you will be served up "relevant" ads when you go to Gmail: this is how they know. Google pays for access to other web sites so that they can send these cookies.

The PHP session ID (PHPSESSID) is the small identifier which could be picked up on a public network and a hijacking cookie written. This is why, for any sensitive information, an HTTPS/SSL session should be used.

Security concerns are valid, but the use of cookies is not risky; rather the opposite. The UID is long and only temporary to the session, so it is very unlikely that a hacker would be able to guess, pre-empt or copy it. Even if someone had some nasty spyware or control-ware over your machine, or ran a "session hijack" or man-in-the-middle attack in an internet WiFi café, it would probably be obvious that there was something wrong with your session, and you should log out and switch off your internet connection immediately.

In the most common language these days, PHP, cookies are sent from the server when the php file has a 'session_start' command at the top. The client's browser then replies to confirm it has saved the cookie.

'Session' Explained

A session is actually a description of capturing data for a user in a file on the server side. It is rightly called a session because it helps continuity between HTTP requests and handshakes while the user is still using the web site. Good examples are internet shops, where the session holds information on your progress through the buying process.

Using POST and GET for Submitting Data from Browser Side Web Pages
POST sends a hidden data capsule in the request body, which does not appear in the URL: so it is better for password/username submission in simple non-HTTPS/SSL set-ups, or for transaction data in internet shops.

Data submitted from web sites in HTML and XHTML are referred to as forms, like name and address lines; this also applies to radio buttons, check boxes and hidden-letter log-in fields.
GET sends the request in a URL, which has the benefit that the "string" or form data is visible and can be bookmarked or cached, though its length is limited in practice. Browsers have an encoding routine which takes the string from the field in the form and converts it to a URL; this incidentally escapes conflicting characters like the ampersand "&", "=" and "!", encoding them as their ASCII hexadecimal codes. This also has a security implication in other POST and URL constructs, whereby a hacker may try to gain unauthorised access to a database via input which is not covered by the browser's tidy-up on form submission / GET / POST operations.
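The URL-encoding step can be seen directly, with Python's urllib.parse standing in for the browser's routine (the field names are invented):

```python
from urllib.parse import urlencode

# Characters like "&" and "=" inside a value are escaped to their
# hexadecimal codes so they cannot be confused with the URL's own syntax.
query = urlencode({"name": "Smith & Son", "q": "a=b!"})
print(query)  # name=Smith+%26+Son&q=a%3Db%21
```

Server-side code must never assume this tidy-up happened: a hand-crafted request can bypass the browser entirely, which is exactly the injection risk the paragraph alludes to.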

Cookie data, like the username or session UID, is sent along before the form is even submitted, which is how fields come pre-populated and how the user's session UID is referred to.

Summary On Issues With Cookies

Cookies are actually a fairly secure set of information files. Firstly, the browser only allows them to live in one or two places, and they must be small and conform to the right syntax. Also, serious programmers will only ever store a user session ID number in the cookie, and never embed the user name or, worse, the password.

Taking these two together, the system is far more secure than the weak external link: Homo sapiens Mk1. Also, because cookies establish a long UID number, and to some extent manage the user session via it, the continuity in identity makes this a security feature both inside and outside SSL (https) connections.

A third party would have to obtain this number from the cookie in transit to be able to hijack the session. If we relied on IP addresses for the UID instead, it would be possible for someone to mimic your session and run a parallel attack; network administrators and ISPs would also have to assign proper IP addresses to all users, visible to the internet sites and nodes outside their routers.

The next layer of security is the cross-site policy: again in the browser, it will not allow other sites or owners to see your cookies from other web pages. The loophole is that if one web site incorporates a cookie which comes from another owner, like Google Analytics, then that owner can gather information about your browsing across different domains and then serve you ads in your Gmail or on other sites they own or have a permitted presence on.

Currently I have seen little evidence on the web that third-party cookies do anything other than intrude on your privacy a little and lead to 'personalised' banner-ad serving based on your surfing habits. However, if you visit sites in countries with poor DNS policing, a third party may be able to fake being some other site and run a man-in-the-middle con based on knowing there is continuity in the user: i.e. you are still on the computer with the cookie and they know a little about you, even if it is just that you have been on an unsavoury or inadvisable web site. That could be enough for a blackmail attempt, for example.

Internet Explorer 8 (IE8) does allow blocking of third-party cookies, in the advanced panel of the privacy window in internet settings. Earlier versions of Firefox had this built in; now you will need a plug-in such as Cookie Monster to block them selectively and automatically.

So essentially, cookie use is more secure than your own human error, ignorance or gullibility. If you must use a public machine or a machine on a public network (include hot-desking at work in this!), then it is worth doing the following:

a) request warning and approval before cookies are accepted, in internet options, privacy settings (advanced); that way you will know how many cookies are coming and what their sources actually are.

b) do not check the box "remember me on this computer" and especially do not check the box for "remember my password"

c) delete all cookies in internet options in the browser and then close the browser down before you walk away.

d) use a wholly different, unique user name and password for each web site.

I recommend having a second "light security" e-mail account under a non-guessable pseudonym: freddampy@geewhizmail for example, because so many sites stipulate e-mail as the login/user name. For this and your primary hot/gmail login, use long and strong passwords. This has become more of an issue for escalation of a potential identity theft now that YouTube, Gmail and Blogger share a common Google login, and with Facebook Share, other sites could just rip off your FB login: thus if someone learns your login they can grab more information on you, like your address, and judge whether you are worth stealing an identity from, i.e. you have a high credit rating.

Use an unrelated user name and password for banking and shopping, and if your bank or account type does not include a digital token pendant, then switch to one that does.

(c) 2011 Authors

Sources: Prof D Malan, Harvard, E75 course on Dynamic Web Sites
Mozilla Corp. User Help Web Site

Monday, January 17, 2011

Internet Security for Non IT Managers

Internet Security

In this lecture we will talk about the major issues in internet security for non IT managers involved in secure web projects or operations, and for us all as consumers using the internet for banking, shopping and social-media.

We will consider firstly the largest threats; look at some of the technical means of attacking and defending our security at the user level, also touching on server-level diligence; visit a complete alternative to open internet traffic for encrypted communication; look in more detail at some of the potential technical loopholes or hacker opportunities; and then summarise what steps should be included in web site development. At the very end, as more than just an appendix, we will discuss the highest level of practical internet security used for consumer banking today, the digital token system.

Acknowledgements and further reading: In compiling this I am very much indebted to the Harvard University E75 course, as streamed on Academic and conducted by Professor David Malan. Please regard this as the main reference material for building a less technical yet well-informed background to this lecture. The course has been excellent in extending my own knowledge as a largely non-technical project leader earlier within the industry. One thing which inspired me to write this is actually the verbose and often esoteric nature of many Wikipedia entries relating to internet technologies; for all acronyms, though, I would refer the reader there, as I have at least followed them up and used some as information sources, all authors recognised and copyright not infringed.

The Major Threats in Internet Security

The biggest security threats to your use of web sites and related e-mails on the internet today all stem from what we could loosely summarise as "human error", namely**:

  • Phishing scams and the resulting identity theft, or temporary seizure of control of bank/shopping accounts - the human error is in opening the links!
  • Sloppy programming allowing loopholes like SQL/script injection and session-ID capture, as well as link-mediated phishing scams or copy-cat web sites.
  • Using your own laptop on an open network or public WiFi spot... and the resulting identity theft or temporary seizure of control of bank/shopping accounts.
** This also includes people wandering off onto certain countries' domains where criminals and hackers may enjoy greater freedoms than they do in the developed west, where DNS servers, routers and web hosts are really quite well policed. It also includes poor internet "hygiene", see the blog post on cookies: by having common logins across domains and not deleting cookies you expose yourself to one level of attack, plus the whole on- and off-Facebook invasion-of-privacy issue, where people can scan you for your potential worth in stealing your identity. Internet "hygiene" may form a forthcoming FredRant.

Threat #1: Identity Theft and Open Public Networks

Having someone hijack your Facebook session ID and log into your Facebook may seem trivial and just irritating, but in doing so they can get enough information to commence a very good identity theft. Also, your session may be with a web shop whose security is lax "around the edges", meaning that someone could take control of your account details, deny you access, and use your credit card or balance on the site to make purchases or bogus payments.

Why are public WiFi networks not safe?

Public WiFi is one of the main threats to internet security, whenever you engage with a public network or do not suitably protect your home or office WiFi. Passwords sent over a public network, Starbucks etc., are completely open to copying if there is no SSL or VPN encryption of communications.

On any non-https (SSL) web site, or web page for that matter, your data communication is completely open on a public network: so you are exposed to potential identity theft and to more sophisticated phishing scams which use other communications (SMS, e-mail and telephone calls) to coerce you into giving out more information over a web URL, leading to theft.

For example, Wireshark (formerly Ethereal) sniffs WiFi or other routers and can intercept any http packet on an open network router, like in a WiFi hotspot.

Even with SSL you could be the victim of an attack from a local computer on the WiFi/LAN running a copy-cat site which intercepts your internet traffic and makes you believe you are entering your user name / password into the real version of the site (a.k.a. a "man-in-the-middle attack").

The same is actually true for FTP, where the password can be intercepted on a network which is not secure, and for standard http, where the http packets can be intercepted by third parties. SFTP encrypts the whole transfer, passwords included.

A "Perfect" Secure Solution is Out There: VPN - But at What Cost?

The supposedly perfect means of connecting securely over the "Internet" is actually to go against all the usual principles of free movement and negotiate a VPN: a virtual private network.

These are similar to WANs or GANs (wide or global area networks) but use a higher level of security, encrypting everything with a private key that is never communicated over the internet/WAN. Sometimes VPNs dispense with the usual TCP/IP and have alternative connection protocols and technologies centred on encryption.

The VPN relies on a defined route over the internet, sometimes between just two internet nodes/routers, and crucially it relies on both sides having the same private network key for encryption, or synchronising this key in a way that relies on the closed loop of the key never being transmitted. The VPN is made yet more secure by being relatively private, closed off from the rest of the internet traffic at the router level, where interception of data by hackers could otherwise be possible.

However, VPNs are largely impractical and overly expensive for most e-commerce applications, where users, sorry, consumers, are not readily able or willing to set up a secure encryption key. For simple shopping on the internet, people would just move away from a shop demanding VPN set-up and go to a competitor. Consumers are used to simple steps using https and a password, and for lower-value transactions with big brands or trusted sources they are comfortable with the level of risk.

Also, we would undoubtedly have to pay a lot more for our connection to the "internet" in using this kind of ring-fenced data route (a.k.a. a tunnel).

On the other hand, with the use of token-mediated security, the cost and time of establishing a VPN between high-value consumers and their banks and secure trading areas may outweigh the risks of using http and https, or at least the perceived risk.

However, this could lead to a false sense of security: if someone does take control of the remote computer or hack the server, any delay in detecting this could lead to a higher relative financial or IPR loss, because the stakes were probably high enough to pay for a VPN at the outset.

Threat # 2: Sloppy Programming

There are several areas which are subject to security breaches caused by sloppy programming, for want of a better phrase for laziness or poor investment of project resources.

  • Server Side issues: File Management, Security, Configuration and so on.
  • Web site areas which go in and out of secure pages while the "session" is still continuing
  • Making code easily copyable such that someone can fake your web site in a phishing or middle-man scam.
  • Not programming out loopholes in how information from web pages enters the web server, allowing SQL/PHP or JS scripts to control the server login area or crash the server.
Usually this is by neglecting to use "escaping", such that data-control characters, like the ampersand "&", quotes, slashes and the greater-than/less-than signs "<" and ">", could instruct a server computer to return personal data to be intercepted at the WiFi cafe, like login and password, or perhaps cause a server crash and permit a hacker to gain complete control of the server.

  • Also not covering for a simple security breach in those auto-form-fill requests, allowing an external site to phish out the login name and even the password.
    • Not managing the session ID (cookie) in a secure way, which means making them expire and replacing the ID number next time with something which cannot be guessed from the previous one.
    • Neglecting to use POST data for some important stages when a visible GET URL is shown instead, exposing a security risk.

    It should be pointed out that the small JS routines (scripts) operating in the browser, which do things like allow only 10 characters in a user name, or ensure the operators "<>&%?" etc. in a field are invalidated or neutralised (escaping the data), also help increase security against unforeseen attacks with new command lines.
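The same checks belong on the server side too, since a hacker can bypass the browser entirely. A sketch in Python using the standard html module (the 10-character limit mirrors the example above; the function name is invented for illustration):

```python
import html

MAX_LEN = 10  # example limit, as with a 10-character user name field

def sanitise_field(value: str) -> str:
    """Enforce a length limit and escape HTML-significant characters,
    so submitted data cannot smuggle in markup or script."""
    if len(value) > MAX_LEN:
        raise ValueError("field too long")
    return html.escape(value)   # & < > " ' become harmless entities

print(sanitise_field("a<b>&c"))   # a&lt;b&gt;&amp;c
```

After escaping, a submitted "<script>" tag is just inert text when echoed back into a page, rather than something the browser will execute.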

    We will cover most of these types of attack in a little more detail below. Realistically they can all be managed out with due-diligence routines in the security section of a web project, long before launch.

    Don't Take Your Internet Browser For Granted

    What we actually rely on for much of our internet security at home, is not our ISP or national DNS authorities, but our humble internet browser.

    Sometimes new techniques evolve within the possibilities offered by the internet, which mean that hackers can gain access to information being passed to-and-from, or held in our PCs. Or worse, they can take control of our own PCs over the internet, or actual entire web servers.

    One example of this was when JavaScript allowed pop-ups, and there was the possibility of cross-domain JS scripts running automatically before you could stop a cascade of new windows opening.

    I myself lost control of a laptop once by wandering my cursor over a gambling banner ad or icon and clicking by accident: it opened many casino sites and some hard porn sites in Russia, dozens of them within a couple of seconds. These in turn opened a download dialogue and landed a trojan-horse executable in the background, which was active just before the antivirus found it, and was therefore able to run several exe's, essentially taking administrative control of my machine and not even allowing me to use CTRL-ALT-DEL to find out what was running!

    The pop-up issue with JS (JavaScript) above led to a rapid introduction of new security measures in the browsers. This included the control of all pop-ups which are cross-domain, or those which automatically ran the scripts leading to this action, and involuntary start-up of scripts running file-management actions, i.e. auto-download of files out of your control.

    The later versions of all main browsers have closed down this type of activity, so you have to opt in to opening such pop-ups, third-party scripts, file downloads and external links, with a warning controlled by the browser, so clicking "cancel" does not take your vote for the Republicans. This has of course gone further, with more restrictions within the Hot- and G-mail interfaces for example, which block all html link and script content at source-code level in e-mails.

    Browser security is an arms race which has, by and large for now, been won by the leading makers of browsers. Hackers spend more of their energy trying to capture data in transit, steal identities from snail mail, or catch out our sloppy programmers as mentioned, rather than programming work-arounds for browser security.

    Internet Explorer, Firefox, Safari, Opera and Chrome hold the majority of the world's internet access in their hands, and do a really admirable job for what is freeware! These programmes contain the https (SSL) encryption and certification system, the controls on JavaScript, and the security rules on what web sites can and cannot access: for example, web sites should not be able to view other web sites' cookies (as info references, stored user names or session IDs), and one web site may not use scripting from another source, this being policed by the browser in the standard cross-domain policy.

    However, despite having just stated that the battle against browser-mediated hacking is "won", Microsoft, Apple, Mozilla and Google all keep ahead of potential or emerging threats by updating their software, which is more or less automatic for you and me. Therefore you really should not impede the progress of updates to IE, Firefox or Safari: despite being free, they are of high value to you!

    Threat # 3 : Server Hijacking

    Gaining control of supposedly secure servers used to be one of the nirvanas of hacking. Now, as e-commerce and web serving become more widespread and come under cost pressure to offer cheaper hosting, non-IT managers should be aware that investment in resources and best practice in this area is essential.

    Apache and other server security issues: super-user status. By default on Linux, Apache set-ups can leave the super user as "root", which means some programmes are actually running at super-user level; if they crash or are manipulated, an external hacker can run the whole server by assuming that super user during and after the crash. This is avoided by having a different default super user on an internal machine, so that programmes crashing just crash, or the server needs to be rebooted by the administrator with those permissions.

    When web pages on a unix/linux system are served via Apache, the permission to access is fully public, but a properly configured Apache server will send your dot-com or IP requests at top-level domain just to index.php or .html, and will send the request in dynamic web URLs to the file to be processed to give a result (it expects the result to be the reply to the URL request, say a GET$), and NOT the actual source code. Other users on the Apache-served network could see your source code though, but there are other security programmes which impose a layer of administration permission onto your own file area: e.g. suPHP, substitute-user PHP.

    Some web hosts will, however, configure their Linux environment so that Apache has super-user access and allows a public route to the served content and operations, while not allowing file admin to anyone but the owner of those web sites and the Apache/Linux super-user administrator. If there is a PHP environment running anyway, then suPHP does this rather well, according to Prof. David Malan at least, so the extra work is probably not worth the time. The permission can be applied to directory AND file, such that aberrant files or hacks can be stopped from working.

    Threat # 4: Cookie and Session Hijack Issues

    Cookies: storing the user name and password in a cookie is a bad idea, because it is both open over any open network when you connect to the owning site and communicate these details over, and, if your machine is compromised or you are on a shared machine, it sits there as visible text on the hard drive. Https only protects data in transit, and quite a few web sites are sloppy, starting a user session and setting up initial user-name cookies outside the https encryption, meaning the cookies can be read by interceptors.

    Session hijacking means that the open account, after the password is passed, has the cookie and session ID running, which can simply be copied while you are logged in. The hacking user can then have a duplicate session running: this is why resetting a password should require re-iterating the current password, so that they cannot just exclude you from your own account after such an interception and temporary hijack.

    Sessions are useful because http is not a continuous connection, and also now you have quite a lot of JSON/XML/Ajax/API functionality in the background giving you useful data while you are logged into an area or otherwise in a "session".

    Google Analytics cookies: Google gets around the cross-domain policy because they are so often opted in on so many sites, because they offer those sites free web stats in return. This means they can link the information on where you have been, because the same domain is accessing and laying the cookie upon request. As we will see, a phishing URL could mimic Google while actually sending cookie data to a third-party site in an XSS attack.
    Browsers can disable such third-party cookies ("external sites" in Firefox), but at the moment it is not the default, and in fact in Firefox's latest versions it is only available via add-ons which must be installed separately.

    Session IDs can be completely random, and they are huge numbers in PHP's session-ID engine for example. The chance of a hacker second-guessing one is low compared to a user name, and the chance of two users on the same server and web site being issued the same session ID is effectively nil if the random issuing is not reiterative through a pseudo-random listing.
    A cookie's sphere of influence can also be limited to certain file directories in the server URL/directory set-up, via the little packet of info in the cookie, which helps restrict GET-URL theft of the session ID or further use of it.
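Both points can be sketched in a few lines of Python (a sketch of the idea, not PHP's actual engine; "SID" and "/shop/" are invented examples): a 128-bit random ID is practically unguessable, and the cookie's Path attribute limits which directories it is sent back to.

```python
import secrets
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["SID"] = secrets.token_hex(16)   # 128 random bits: ~3.4e38 possibilities
cookie["SID"]["path"] = "/shop/"        # only sent back for URLs under /shop/
print(cookie.output())                  # Set-Cookie: SID=<hex>; Path=/shop/
```

With that many possibilities, two users colliding on the same ID, or a hacker guessing a live one, is astronomically unlikely.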

    The Major Cornerstone in Today's Internet Security: SSL and https

    When you see the sign for https, you know you are using a line which encrypts form data and the information coming back from the site, using a certified system integrated into servers and browsers called SSL (Secure Sockets Layer).

    Https is what you see at your end, as well as some pop-up dialogue warnings, usually about entering or leaving a secured area. You may also be warned that the security certificate for a web site is expiring, which means its accreditation from the issuing certificate authority has run out.

    These little warnings about entering and leaving a secure area may be annoying, but if you are using a public network it is well worth noting that you may still be "logged in" and using facilities private to you while the "http" has lost its little "s": your communications are then open again. It is worth deleting all cookies when you leave a public computer, or making sure you log off from the web site you were on, before you shut down the machine or window.

    Using https helps with the problem of router interception or public-WiFi session hijacking, but it is still possible that any info or pages not in https reveal the cookie's session ID, on a public WiFi network for example, or via an insecure router. Such a hijacked session could then revert to the https area and abuse your account or credit card.

    SSL/https is not supported by all servers (if not all routers?), and on most sites https requires that bit of extra time and processing, which means that after login, Facebook amongst others kicks you over to standard http and uses a session ID. So https is a performance and server-capacity issue, but under the UK Data Protection Act at least, personal details linked to web registration would reasonably be expected to be protected by SSL as a matter of due diligence.

    SSL, Secure Sockets Layer, and SSL Certificates: What is all this About?

    SSL certificates usually cover one domain name on one IP address, and they are issued/certified by your trusted domain name partner, like Network Solutions, Verisign, GoDaddy etc., when you buy an https-ready web domain.

    The browsers themselves, like IE6, have approved SSL partner lists from Microsoft, giving a certain level of confidence, and meaning that you should have a host which is on the MS list of SSL-certified sources. Having one enables you to receive and send https on the network or internet routers. The SSL certificate contains a code which denotes the vendor.

    You can get a *.com "wildcard" SSL certificate for a single IP address running as a virtual web host, but this is expensive and the key will possibly be the same across all the web sites you have running there. However, given good Apache/Linux file admin, this can be attractive for TLD-name and sub-domain-level name security.

    SSL's Public Key Encryption in Practice and Theory

    SSL's system is sometimes called asymmetric encryption, because the users each have their own public key, and this public encrypting key is different from the decrypting key, the private key.

    As a new customer/visitor we do not have a shared secret key in advance: a shared secret would have to be sent "in the clear", so it could be intercepted.

    Public and private keys are generated on the SSL-certified server side when, say, Apache is installed or upgraded to SSL at that unique IP address, while the browser's side of the key exchange is set up afresh for each secure connection. There is actually a mathematical relationship between the public and private keys.

    What happens is that you communicate the public encryption key to each other and send data encrypted to that key, while it is actually decrypted by a local private key. Your SSL public key is open for any https connection to read, and a web site will then encrypt using that key, but only YOU can decode that encryption. The reverse is true for sending data: you get the server's public key code when you request an https transaction, and only the server can then decrypt that, in theory, using their private key.


    The two numbers, as mentioned, are related to each other, and the public is generated from the private. To over-simplify: the public key might be the private key raised to the power n, with the private key being a prime positive integer, for example. The public numbers are shared to perform a one-way encryption of ASCII and other characters over to numbers. The public key can be accessed, but the computational power needed to decrypt any messages with it alone would be so huge that it is impractical to do so.
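To make that over-simplified description concrete, here is a textbook toy RSA-style example in Python, with deliberately tiny primes (real keys use numbers hundreds of digits long; this is purely illustrative and not how SSL libraries are implemented in practice):

```python
# Toy RSA: private primes p, q; public modulus n and exponent e.
p, q = 61, 53
n = p * q                 # 3233, published as part of the public key
e = 17                    # public exponent, also published
phi = (p - 1) * (q - 1)   # 3120, kept secret
d = pow(e, -1, phi)       # 2753, the private decryption exponent

msg = 65                  # e.g. the ASCII code for 'A'
cipher = pow(msg, e, n)   # anyone can encrypt with the public key
plain = pow(cipher, d, n) # only the private-key holder can decrypt
print(cipher, plain)      # 2790 65
```

Recovering d from n and e alone requires factoring n, which is what makes the one-way property hold once the numbers are huge.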

    Issue: the man-in-the-middle attack, or of course hijacking, in an internet-cafe scenario: someone on that network intercepts your initial https request to, say, Amazon, and presents themselves as Amazon, sending a key they of course can decrypt. The same could be true if a misspelt or expired domain name is hijacked: maybe you return there with your password and user name, and they then ask you to re-issue your credit card details.

    Other Security Risks Relating to Sloppy Programming :

    SQL Injection Attacks

    This means that SQL code is written into a form field and submitted, becoming a query, i.e. a valid SQL command. This is worked around by "escaping" the syntax, like the apostrophes which enclose a string to be submitted: the content of the field string then becomes like CDATA. This can also be done at the JavaScript level, to help users whose names contain "illegal" characters.
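A sketch in Python with sqlite3 of the difference between naive string-building (injectable) and a parameterised query, where the driver treats the field content as plain data (the table, rows and payload are invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, secret TEXT)")
db.execute("INSERT INTO users VALUES ('alice', 's3cret')")

evil = "' OR '1'='1"   # classic injection payload typed into a form field

# BAD: the payload becomes part of the SQL itself and matches every row
rows_bad = db.execute(
    "SELECT * FROM users WHERE name = '%s'" % evil).fetchall()

# GOOD: the driver escapes the payload as data, so nothing matches
rows_good = db.execute(
    "SELECT * FROM users WHERE name = ?", (evil,)).fetchall()

print(len(rows_bad), len(rows_good))   # 1 0
```

The same principle applies in PHP via prepared statements: the string a user typed never gets a chance to be parsed as SQL.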

    Same Origin Policy Breaches
    JS code does not allow direct republishing or integration of data from cross-origin web sites. RSS and XML do allow for this, usually done server-side by importing via PHP, but that is safer than allowing JS to roam the internet outside the origin that wrote it. So FB can update itself in JSON/Ajax, but another origin cannot use this API from its own scripts.

    Third-party JavaScript cannot look into your browser behaviour, so ads cannot directly do this. Frames are the same: you cannot manipulate a frame requested from another URL.
    The data has to first be "imported" to your own or a common domain, so that when served to you it is same-origin. The same is true of APIs: you have to copy the Google Maps JS, and then it is just looking for permitted flat data in the Google Maps repository.

    CSRF ( Phishing email and bad web site links)
    A link to a commonly used web site which involves finances or buying can be embedded in an e-mail or on a phishing site; you click on it and it uses your live session ID to perform an action directly on your account. These links can even be hidden in the http requests for jpegs at the top of a page, so you don't even have to click on them.
    This can be worked around by good cookie management and by requesting the password for any purchase or sending of sensitive data. It is also related to URL GET$, so it is wise to use POST data for key transactions.
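"Good cookie management" here usually means a per-session CSRF token: the server embeds a random token in each genuine form and rejects any request arriving without it, which a forged cross-site request cannot supply. A hedged sketch in Python (the in-memory store and function names are invented for illustration):

```python
import hmac
import secrets

SESSIONS = {}  # session_id -> csrf token (illustrative in-memory store)

def issue_token(session_id: str) -> str:
    """Embed this in every genuine form as a hidden field."""
    token = secrets.token_hex(16)
    SESSIONS[session_id] = token
    return token

def check_token(session_id: str, submitted: str) -> bool:
    """A forged cross-site request cannot know the token, so it fails."""
    expected = SESSIONS.get(session_id, "")
    return hmac.compare_digest(expected, submitted)

sid = "abc123"
tok = issue_token(sid)
print(check_token(sid, tok))       # True  - real form submission
print(check_token(sid, "guess"))   # False - forged request
```

compare_digest is used rather than == so the comparison takes constant time, giving a timing attacker nothing to measure.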

    Cross Site Scripting - XSS - Vulnerability
    This is similar to SQL injection: an HTML GET line or form submission is mimicked in a link, which may also contain a script-activation reference, overcoming the cross-domain policy. The first half of the URL mimics the real web site and its form-fill/submit page or GET $URL, but it then also requests a bit of JavaScript linking to the attacking site.

    Once again, these URL links can be run automatically in http headers for jpegs etc., and hence Hotmail and Gmail protect against content with links in HTML mail.

    The phishing URL will request a JS file held on the baddie's URL, embedded in or referenced from the fake host URL: a script along the lines of document.location = attacker site + document.cookie will, for example, look at the cookie for amazon dot com and then refer it over to the bad guy, such that they could steal the user's session ID.

    Increasing Security

    Top Ten Tips When Considering Implementation or LifeCycle Updates of a web site which should have a diligent level of security:

    1. Block subversive commands in data submitted to your dynamic web sites ("/&%" and so on). All GET$ and POST$ data being submitted should be "escaped", that is, the string sent is treated like CDATA: the computer knows to ignore the content until it comes to the end of the escape.
    2. Limit the length of the form $ to a known or reasonable number of characters in the POST/GET on the server side; you can also do this with JS during form control.
    3. You also limit the characters possible to submit from the web form.
    4. Use POST data instead of GET$ in an open URL, and call this POST command set-up in a file area hidden from general web-page code.
    5. Make sessions ID only temporary with an expiry inversely proportional to the potential personal financial loss from session hijack or reuse of a public computer.
    6. Request the password (or token key number, see below) at vulnerable points where payment is required, value is to be transferred, or personal details and credit cards can be viewed.
    7. Resetting the password should also require re-iterating the current password, so that a hacker cannot just exclude you from your own account after a session hijack. Ideally this should be accompanied by an e-mail or SMS action, as in 8:
    8. To activate accounts or change the e-mail address on an account, force an e-mail to be sent to the original e-mail address from the original registration: keep this field as a locked UID for the user and copy the e-mail address over to a field modifiable by the user. Consider using an SMS-mediated activation code or temporary password sent to the originally registered mobile phone number in this vein.
    9. Log suspicious IP addresses which attempt the submission of SQL/Boolean or HTML code in form fields. Refresh the session ID underway for those which are suspicious, and leave a permanent cookie with a special user ID on that side to see if they repeat an attack. If the IP address belongs to a small ISP, a university or the like, contact their system administration or security manager. Exclude IP addresses from countries with which you have no business or which are noteworthy sources of criminal internet attacks.
    10. When the transactions, personal information or IPR exposed is of significant value and the traffic volume is acceptable in cost-benefit, then consider either using an SMS key or password system, a digital token system or setting up a VPN connection to users ( See below) . The same could be true if you are aware that there have been many attempts or several successful breaches of security from user identity theft or Session Hijacking.
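Tip 8's activation links are typically built from a signed, time-stamped token rather than a database lookup alone. A hedged sketch in Python using HMAC (the secret, function names and one-hour limit are invented examples):

```python
import hashlib
import hmac

SECRET = b"server-side-secret"   # hypothetical server key, never sent out

def activation_token(email: str, issued_at: int) -> str:
    """Signed token: only the server can mint it, and the timestamp
    lets the server refuse tokens that are too old."""
    msg = f"{email}|{issued_at}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{issued_at}.{sig}"

def verify(email: str, token: str, now: int, max_age: int = 3600) -> bool:
    issued, sig = token.split(".")
    expected = activation_token(email, int(issued)).split(".")[1]
    return hmac.compare_digest(sig, expected) and now - int(issued) <= max_age

t0 = 1_300_000_000
tok = activation_token("user@example.com", t0)
print(verify("user@example.com", tok, t0 + 60))     # True
print(verify("user@example.com", tok, t0 + 7200))   # False (expired)
```

Because the signature covers both the e-mail address and the issue time, a hacker can neither forge a token for their own address nor reuse an old one indefinitely.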
    Technical Approaches at the Highest Level of Consumer Internet Security in Practice Today

    A VPN (virtual private network) is one solution. This relies on a defined route, sometimes between just two internet nodes/routers, and on both sides sharing the same private network key. This has been discussed in some detail already, and we have mentioned the use of tokens, which are in effect symmetric keys or strongly encrypted asymmetric keys.

    Another means of internet security for the highest levels of banking, software exchange (including high-value web services), film streaming and e-commerce itself is the use of "token"-based password/verification/encryption codes. This adds an additional layer of security which is difficult to intercept and virtually impossible to guess. The token itself, plus the log-in and password, would all need to be stolen from the user and used before they could alert the bank.

    The most popular type of token is the stand-alone, thumb-sized medallion, which carries a pre-programmed number generator that produces a password or encryption key number, either synchronously/sequentially with the server or interpreted like an SSL key.
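    The synchronous number-generator scheme works much like a time-based one-time password (TOTP): the token and the server derive the same short code from a shared secret and the current time step. A minimal sketch of that idea, assuming an HMAC-SHA1 construction and 30-second steps (the approach later standardised in RFC 6238):

    ```python
    import hmac
    import hashlib
    import struct
    import time

    def totp(secret: bytes, now=None, step=30, digits=6):
        """Derive a short one-time code from a shared secret and the clock.

        Token and server run the same function; as long as their clocks
        agree to within one step, they produce the same code.
        """
        t = time.time() if now is None else now
        counter = int(t // step)
        msg = struct.pack(">Q", counter)          # 8-byte big-endian counter
        digest = hmac.new(secret, msg, hashlib.sha1).digest()
        offset = digest[-1] & 0x0F                # dynamic truncation
        code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
        return str(code % 10 ** digits).zfill(digits)
    ```

    Because the code changes every 30 seconds, an intercepted value is only useful for a very short window, which is precisely the property the medallion tokens exploit.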

    There are also dongle and cordless types which send the data when you allow a connection, and more recently out-of-band tokens, which send an SMS or other data packet over a telephony channel other than internet TCP/IP.

    In many banking web systems, the token key is a one-time password and is requested both at log-in and when movements or payments are to be made. Banks always use HTTPS and SSL certification, so requiring this token to be to hand adds a very good extra layer of security. If you had both your token AND your identity stolen, you would notice it, or be the victim of "extortion with menaces", i.e. an offline-mediated crime.

    This further reduces the chances of session hijacking getting anywhere, but in theory you could still be open to man-in-the-middle scams if someone were really able to mimic the bank site and then pass the current synchronised token key on, actioning fraud in your real web account.

    Back End Security

    The key issues of back-end security arise beyond the SSL decryption on the server side. For example, log-in passwords and credit card numbers are then fed back to the database server for verification, or for initial data entry. In the past this has led to employees stealing databases, or developers losing laptops which included copies for WIP testing.

    Now even MySQL, the open-source database, includes functions (e.g. AES) which allow industry-standard encryption of passwords: some schemes use the password itself as the encryption key, which has some native appeal given that the initial connection was over SSL and the user keeps the password to themselves. The downside is that no one can recover the original password, or even its length under better encryption; so if you forget it, you have to register anew or go through some other security checks and have a replacement sent to you by e-mail or SMS.
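    A common alternative to encrypting stored passwords, with the same "no one can recover the original" property described above, is to store only a salted, iterated hash, so that even a stolen database copy does not reveal the passwords. A minimal sketch using Python's standard library (the PBKDF2 parameters here are illustrative, not a recommendation for any particular system):

    ```python
    import hashlib
    import hmac
    import os

    def hash_password(password: str, salt: bytes = None, rounds: int = 100_000):
        """Return (salt, digest); only these are stored, never the password."""
        salt = os.urandom(16) if salt is None else salt
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, rounds)
        return salt, digest

    def verify_password(password: str, salt: bytes, digest: bytes,
                        rounds: int = 100_000):
        """Re-derive the hash from the candidate password and compare safely."""
        candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, rounds)
        # Constant-time comparison avoids leaking information via timing
        return hmac.compare_digest(candidate, digest)
    ```

    The per-user random salt means two users with the same password get different digests, and the iteration count deliberately slows down offline guessing against a stolen table.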