Friday, April 08, 2011
Data is data- Using an Excel Trick
Okay well I did say I would blog more on the nuts-and-bolts of data, but for now we can look at data its simple form we can relate to: in tables- rows and columns. In earlier days tables were called "arrays" in computers which simply meant that there was a column and row "cell" address structure and values / text could reside in those cells and therefore be referred to or manipulated systematically. (These simple arrays are still used today, for instance, in XML often "on the fly" data-shutteling in JSON.) Of course for the last twenty years we have all been accustomed to the king of the table: Microsoft Excel. Now because data in larger databases, or on for example tables or lists on web sites, can be exported to excel ( often via ".csv' files- comma separated values) or directly pasted in we have the opportunity to "slice and dice" or grab information and use it quickly. You can think of database data as being lots of simple excel tables, but all being connected by common denominators like unique user ID, street address or so on. So here we have the first two tips- check there are no commas in the data itself ( choose tab separated if so! ) and use paste special > values to paste in from any HTML or word etc formatted source. You find this in right click and now there is a short cut to plain text values. Why would we want to export to excel or paste in tables? For example, we may want to compare listings between an old and new database to ensure completeness and non duplication. Alternatively we may want to be able to summate, search, sort and work up a table we find on a web site. This is very common in say "migrating" customer listings from an older system to a new database. On a more complex case: we have two databases running on different computers: we have a list of flights from a travel company exportable to csv, we have a list of cost codes and budgets in our own ERP and we then find a web page with listings over room prices at near to airport locations. We want to calculate a cost. A case where we recieve two excels would be in a tender process, and these may be far from immediately comparable: in effect they are two different "data bases". Just as in the cases above, we need to combine the data to make for comparability and cross refernce. Steps okay so we have cut and pasted in clean, unformatted data or opened a simple table file in excel. Cowabunga: our old database we have in excel output has "Mr John Dow" in one field whereas the new has Mr - John - Dow in title/firstname/surname fields! Okay so we need to do a text to columns, but we may want to do the reverse : merge the three into one: ideally we should have the three but retain the single field for common sense checks when an operator in the company is faced with a name they cannot interpret. Text to columns is in DATA. Choose either fixed, which gives a line between characters when all are the same width ( useless for peoples names!!) or tab or comma delimination: you get a preview to scroll down to see if this will cut well, or disect up your names into garbage. If this fails you can say replace all Spaces "" pipe ( CTRL H is find-replace shortcut) and then export to unlimited columns: then you can clean the data later easily by running a sort on the outer columns to merge them back in. Here then comes in another fundamental: using SORT you can find empty fields, wild outlier values and plain old garbage so as to clean data or get an idea on how dirty it actually is!! Of course this can show you that your preparation work has had a bum-steer and you need to rethink your approach. Now another very common thing to do is to have two different tables with one nice clean column of cells in common. This is preferably unique: ie values are not repeated/duplicated. Now we can apply a very simple database command without having to go over to setting up the straight jacket of Excel databases and learning all that jazz. We have the VLOOKUP function: this does what in SQL is: GOTO-tableX-ColumnZ-Find Value Y: report back Value in Column Z+1 value. So you can lookup data from another table preferably in the same workbook on a different worksheet so it is not lost ( you can link to other workbooks) or pasted over in the working area. What VLOOKUP depends on is that the data is the two most leftward columns of your selection and that the first column is preferably ascending numerical values. You can therefore cut and paste the column you know you will find the unique value in over to the LHS and then insert another in column A and draw down 1,2,3 ....until you reach the bottom of the record. The first part of VLOOKUP then is, using the "wizard" thing not hard programming per se: Cell Value to Lookup in the other table: this is best to have as a fixed column with '$' if you want to draw out several other column values from the search: eg date, monetary value, total, etc based on looking up "name" ( use F4 for inserting dollar sign $ to cell references_ it toggles both-columnonly-row # only btw) The next is the table array to go and look for this value in: here YOU MUST make it absolute and this should really be default because as you draw-paste down the column, you $ both columns and row numbers. Otherwise the table area you look in will move with each cell you draw the formula down into. Next there is which value to return by colum : this is tedious, it is the number of columns in the selection so in this case "A" is one, B two and so on: it may be you use column F to make for a clean data search avoiding some noise to the left, so this then is your column one. Usually you would want to return the value of interest, but you can also use VLOOKUP to look for exahustiveness and fidelity between lists on the same supposedly identical value. Finally the logical operator as they call it is always "FALSE" for some reason which then send back the data. True does not for some reason?? Okay, so now you should see returned data, and maybe some "#N/A" valuse: n ot avialble. If everything is NA then you have to check the formular: or if NA starts at one point, you probably have forgotten $ ( F4) Errors are usually caused by: not having a left hand column in numerical having hidden spaces or tab characters in the lookup column having confusing data there being duplicates ( the first one is the only one reported back on)
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment