Organizing Data for Analysis.
You’ve gathered your data and put it all in the computer. You open your database or spreadsheet anxious to see what secrets it reveals. Then frustration — no matter what you do you can’t produce the analysis you had expected. Why, everything that’s needed seems to be there?
The Problem. The raw data may well be there, but they may not be organized in an a way that can be used to make the required calculations. For example, client data can’t be analyzed by state unless each client’s state is recorded separately from the rest of the address.
Problems like this typically occur when project planning focuses mainly on identifying data to be collected, rather than results to be produced. That approach is about as logical as a pastry chef mixing ingredients and putting then in the oven before finding out whether the evening’s menu calls for apple pie or chocolate cake.
Getting Started. Obviously the chef’s first step should be to find out what people want to eat. For data analysis a good way to start is by creating a sample final report (you can use make-believe numbers). Then identify what information is needed to compute each item on the report and where it will come from. If anything is missing, you need to figure out how to get it.
Organizing Your Data. Next you need to make sure each input item can be accessed as needed, each client’s state for example. In programs like Access, Excel and FileMaker, that means deciding how your information will be organized into fields and records. Those are the elements the software uses to make its calculations.
Field and record are abstract concepts, so it can be helpful to visualize them in terms of the cutting-edge data storage technology of the 1870s, the index card. Library catalogues once were filled with cards listing their holdings by author, subject and title.
If a single card lists several books by the same author, you can’t catalogue each book alphabetically by title since there is only one card to file. Each book needs its own card, or record in database terminology. The collection of all the records is called a table. In Excel and database tables the records appear as rows.
Information on each record is subdivided into categories called fields. That is how a library patron can tell whether a catalogue entry for “Shakespeare” refers to a holding written by Shakespeare (author), one about Shakespeare (subject) or one with Shakespeare’s name in the title (e.g., Shakespeare in Love). Computers search and sort records by field, so if you want to list people alphabetically by first and last name, there must be separate fields for first and last name. Deciding what fields are needed is a critical part of designing any data analysis system.
Fixing Problems. Unfortunately you are sometimes handed a file whose structure is not appropriate for your project. For example, states were not recorded in a separate field or multiple items were put on a single record. Don’t give up, you may be able to fix it.
For example, if city, state and ZIP code were entered in a single field, your software may have text functions or a utility like Excel’s Text to Column feature that will separate them. Be careful, though, none of the tools is foolproof — depending on your data the software may not be able to fix everything. Run a test with a small data set to see what the program can do.
Another common problem is that items that should have been put on separate records instead were all put on one. For example, survey responses where a participant's answers to several questions were all entered on a single record rather than separate records for each. The sample to the right shows how both structures would look in Excel and illustrates why they are sometimes referred to as horizontal and vertical data.
There is no universal solution for transposing one set up to the other. For small data sets it may be easiest just to re-enter or copy and paste the data. In some cases your software may have a utility that can make the conversion or you may be able to do it with an Excel Pivot Table.1
* * * * *
The truth is you need to know where you are going if you want to get there. For the chef that means finding out what’s on the menu before starting to cook, for the data analyst it means identifying information you want to produce before assembling data. You may still encounter detours, but staying focused on the goal likely will get you there long before the person who wanders aimlessly hoping to stumble upon the right path.
- Video by Doug H posted on YouTube April 30, 2011. The video uses an older version of Excel, but the approach still works.
This article originally appeared in our free semi-monthly newsletter. To receive future issues, please add your name to the subscription list.