Introduction\n\nEvery business runs on at least some kind of data, including employee, financial and customer information, and seemingly more often than not, spreadsheets are used to store and visualise this 📈📊. I've spoken before about the superpowers that spreadsheets can give you wrapped up in an easy-to-use package, and have also shown how this can lead to very real data integrity issues when workbooks are contributed to in a "collaborative" manner. These well-worn issues can lead to inefficiency, extra work and a great deal of frustration for those who have to rationalise the resulting data sets. Not a job to be underplayed or taken lightly 😂.\n\nSi's HeyJournal: Give Me the Truth, the Whole Truth and Nothing But the Truth\n\nSi's HeyJournal: Data Quality and the Strange Case of Spreadsheet Privilege\n\nAnother problem I see in most examples of spreadsheets is data that is difficult (or nigh on impossible) to keep consistent without following an instruction sheet, because of the many steps required to change a single value! Every update then NEEDS to be followed by many consistency checks to ensure data integrity hasn't been lost, which *News Flash* isn't commonplace at all 😱! This problem is exacerbated when having to manage those super-workbooks with LOADS of columns: some hidden, some with meanings that are unknown / cryptic / known only by one person who's on leave, etc. Ultimately, it becomes an absolute NIGHTMARE to manage the data integrity of such sheets, and the problems that arise are usually noticed much later down the line, when it's too late.\n\nIf you've used any spreadsheet tool in anger for important information, I'm sure you know what I'm talking about, and have been frustrated by the mistakes you've noticed and the added manual work required to keep your data in an accurate state. If you've ever asked yourself whether there's a better way to work, you'll be pleased to know that there is 😁! 
We just need to borrow concepts from elsewhere to make managing our data integrity easier and vastly improve our efficiency.\n\nLady with her head in her hands, stressed out from all the data she needs to fix in the company spreadsheet\n\nWhy Are We Even Here?\n\nBefore we embark on this journey of discovery together, let's ask the all-important question required to steer us correctly...\n\nWhat is the purpose of using spreadsheets for anything?\n\nFrom what I've seen, people like to use spreadsheet tools for many purposes including (but definitely not limited to) financial reporting, checklists, questionnaires, forms, signature front pages, stocktaking, leaflets, project management, etc. This is but a small collection of uses, as a comprehensive list would stretch on for an eternity! Any vendor of a popular spreadsheet tool that requires a paid-for license will be RAKING IN MONEY 💰 (take a bow, Microsoft). From what I have seen, I would say the main usage can be condensed down to the following elements:\n\nStore and Retrieve Useful Data\n\nLayout and Display Data to be Read\n\nAnalyse Data for Insights\n\nWhen it comes to the storing and displaying of information, in a spreadsheet the two concepts are easily conflated and thought of as the same thing. This can make data manipulation difficult, as the two become tied together. Also, when it comes to storing data, I've argued in the past that spreadsheets definitely aren't the best (or sometimes even a good) solution, and have included a link to this below. In my opinion, when spreadsheets are arranged correctly (i.e. 
designed instead of evolving over time), they are fantastic for laying out information and analysing it effectively.\n\nSi's HeyJournal: Putting Data First: Real-time Information as a First Class Citizen\n\nUnfortunately, the lack of spreadsheet structural design and the tying of data storage to layout is a major cause of a lot of nightmares and headaches 🤯!\n\nA lack of spreadsheet design can lead to a TOTAL mess of our information\n\nThe Same Old Problems\n\nIn most of the longstanding workbooks that I've seen, the layout and design has evolved over time rather than being designed with data integrity in mind. What this means is that these started from humble beginnings, with a manageable number of attribute columns presenting only integral information. Then, as time passed and more information was deemed important, column after column was tacked on at the end with no thought about data integrity and how one piece of information affected other elements. "Helper" columns (raw data and formulas) were then added to assist (adding further management and administration overheads), which can do more harm than good. Colour schemes were added to make it easier to read and "guide the eyes" to the most pertinent information (shouldn't it all be important?). Columns were subsequently hidden because there was too much information, but people were too scared to delete them just in case the information was useful (how do you judge this?). It is then that you realise that you've allowed your sheet to become a sprawling, uncontrollable monster 👹 that NO ONE wants to touch 😬!\n\nHard-to-manage spreadsheets are a breeding ground for lapses in data integrity, and this problem is exacerbated further when you are mixing data storage with layout and all the fancy colours, rules, graphs, formulas etc. on the same sheet. 
When all of this is factored together, production spreadsheets which provide important insights for mission-critical decisions can start to report misleading and/or incorrect trends which are very difficult to spot, let alone swiftly fix! This is where the suite of checks comes in to ensure some kind of consistency, but as anyone who has managed a huge workbook will tell you, it's impossible to catch everything 😣.\n\nSo what are the common mistakes that this previously mentioned "breeding ground" leads to? I have compiled a list of popular ones, which is by no means exhaustive (which in itself is quite alarming)!\n\nAttributes depending on others, meaning that a SINGLE update requires MULTIPLE changes to the spreadsheet, which can become difficult to predict.\n\nAttribute column after attribute column appended to the spreadsheet without scrutiny to capture every little element, leading to information that scrolls horizontally for an eternity.\n\nAttribute columns with ambiguous meanings and usage that people use inconsistently, and thus don't represent an attribute clearly.\n\nThe emergence of redundant information which needs to be managed as accurately as significant data.\n\nPoorly structured tables which inhibit basic operations, including inappropriate merging, headings in silly places and summary data existing where it shouldn't.\n\nCell/text colouring being used as data in its own right, which can become a nightmare when data is manipulated and/or analysed and/or exported.\n\nA reliance on custom buttons, macros and formulas, many of which haven't been rigorously tested, to do "clever" things behind the scenes that few people actually understand but everyone relies on. 
Any errors could be catastrophic, but who would be able to determine this?\n\nFormula after formula added to calculate and check information, which causes the spreadsheet to run INCREDIBLY slowly, especially as the amount of data contained increases.\n\nPeople overwriting formulas with raw values, which can be hard to spot, leading to stale data which misrepresents reality.\n\nAnd the many more that you are currently thinking of!\n\nA sad man taking some time out to reflect on all the issues he has to deal with in his company's spreadsheets\n\nIt is here that we need to take a step back and remind ourselves again what we are actually trying to achieve when we use spreadsheets to lay out our information. Ultimately, such tools are supposed to make our lives easier, not create lots of extra manual work and stress 😂! So instead of accepting the status quo and battling on, how about we try to address these issues and buck this problematic trend?!\n\nGoing through an example spreadsheet of orders for an online bookshop and identifying how the structure could be improved.\n\nThe New Normal(isation)\n\nFortunately, the aforementioned data problems have already been solved in the database world using the concept of "Database Normalisation", which is what we are going to borrow today. This is the process of structuring your database into tables (think spreadsheet workbooks and sheets) to prevent redundant data and preserve integrity using standard norms. This involves following the rules of First Normal Form (1NF), Second Normal Form (2NF) and Third Normal Form (3NF). 
To be clear, by designing to 2NF, the rules of 1NF must be met, and by designing to 3NF, the rules of both 2NF and 1NF will be satisfied.\n\nTo design your spreadsheet workbook to First Normal Form, all data tables contained should satisfy ALL of the following rules:\n\nThe order in which your rows are presented DOESN'T matter (reordering your data using the popular sorting function won't break the information you are trying to present)\n\nThe order of your horizontal attribute columns doesn't matter (keep your tables clear and simple, and avoid attribute columns in the same table depending on each other)\n\nNo rows are duplicated (meaning that each piece of data can be uniquely identified easily, which is facilitated by an identifier made up of one or more attributes, known as a Primary Key in the database world)\n\nEvery cell reference (e.g. A1, B2, C3 etc.) contains ONE piece of information ONLY and NOTHING else (i.e. you can't put multiple values in a single cell separated by a comma/semicolon etc., and you can't have BLANK values)\n\nAll attribute columns are specific to the information captured (so no hidden elements, special row/object IDs, helper columns etc.)\n\nAdherence to these rules will fix the majority of the previously mentioned issues right away 🤓! Taking the concept of the Primary Key (which uniquely identifies each piece of data) further, databases STRICTLY enforce this, meaning that they will EXPLICITLY block duplicate Keys. It's trivial to get a database to handle this for you; however, in the spreadsheet world, this will likely be manual (and time-consuming, trying to keep up with all the keys), so my advice is to use a UUID generator to trivially create a unique key for every record (it saves you having to manually keep track of an incrementing number 1, 2, 3 ... 
etc).\n\nOnline UUID Generator\n\nThe rules for designing to Second Normal Form are:\n\nMust be designed to First Normal Form\n\nAll non Primary Key attributes must be dependent on the Primary Key (i.e. no partial dependency).\n\nDesigning to Second Normal Form will be a given if you use a single attribute to define your Primary Key 👌. The partial dependency element only becomes relevant when your Primary Key is a composite of two or more attributes. In this case, all non-key attributes must depend on all Key elements, and if not, should be moved to another table/worksheet where they can depend ONLY on the relevant Key element(s).\n\nFinally, to ensure that we reach the baseline standard of Third Normal Form, we need to satisfy the following rules:\n\nMust be designed to Second Normal Form\n\nEvery non Primary Key attribute column must depend SOLELY on the Primary Key and NOT have a transitive relationship with ANY non Primary Key attribute (i.e. changing any non Primary Key attribute should have NO EFFECT on any other non Primary Key attribute)\n\nFor me, this is the BIG ONE and fixes some of the more serious issues mentioned earlier. We should be able to change any value in ONE PLACE and not worry about all hell breaking loose! Doing this will KILL any issues you have with integrity and completely remove the need for those consistency checks. A few examples of non Primary Key attributes with transitive dependencies on each other could be:\n\nA date of birth attribute alongside an age attribute.\n\nA height in cm attribute alongside a height in feet and inches attribute.\n\nA full name attribute alongside a forename(s) attribute and a surname attribute.\n\nA single change to any of these linked attributes would require other attributes to also be updated. If someone neglects this (pressed for time, didn't know, the column was hidden etc.), then data will become inconsistent FAST and could potentially lead to incorrect conclusions and decisions being made. 
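To make the "derive, don't duplicate" idea concrete, here's a minimal Python sketch (the record and field names are made up for illustration) showing how an age and a full name can be computed on demand from the attributes you keep, rather than being stored alongside them:

```python
from datetime import date

# A record stores ONLY the source-of-truth attributes; age and full
# name are derived, so there is nothing extra to keep consistent.
person = {"dob": date(1990, 6, 15), "forenames": "Ada Mary", "surname": "Lovelace"}

def age(record: dict, today: date) -> int:
    """Derive age from the stored date of birth instead of storing it."""
    dob = record["dob"]
    # Subtract one if this year's birthday hasn't happened yet.
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

def full_name(record: dict) -> str:
    """Derive the full name from the forename(s) and surname attributes."""
    return f"{record['forenames']} {record['surname']}"

print(age(person, date(2024, 6, 1)))   # 33 (birthday not yet reached in 2024)
print(full_name(person))               # Ada Mary Lovelace
```

Update the date of birth or the surname in ONE place, and every derived value is automatically correct the next time it's calculated.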
In each of these cases, the redundant attributes could easily be removed and their information derived from the attributes which remain.\n\nBy making these changes and removing repeated and redundant data, we can become far more efficient and productive. Remember, our sheets are supposed to help us and perform all the heavy lifting for us. They aren't supposed to require constant TLC and copious amounts of our attention just to occasionally hold high-integrity data (after hours of our manual work 😂).\n\nDemonstrating how a spreadsheet is improved when normalised.\n\nOther Benefits\n\nOnce you have taken the time to normalise your workbook, sheets and data, aside from saving yourself a lot of time and frustration, you'll also have the following benefits to look forward to:\n\nEasily manipulate your raw data into other forms to be used in a multitude of different ways.\n\nEasily stitch together your raw data with data from other sources using your Primary Keys (a concept called JOINS in the database world).\n\nEasily export your information into other tools for further processing and analysis.\n\nI've attached a spreadsheet example of bookstore orders, both denormalised and normalised, so you can see how the data changes. Also, if you're interested in databases and have experience with them, I've attached various example files (exported from my database) if you want to mess around with my bookstore example within your own Microsoft SQL Server database instance. You'll be able to import the data, and then create the views using the attached SQL script file. Have fun!\n\nSpreadsheet Implementation Advice\n\nShould you choose to put together normalised spreadsheets in the future, or even normalise existing sheets, there are definitely certain elements for you to consider. 
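As a rough illustration of that "stitch together" benefit, here's a small Python sketch (the bookstore data below is invented, not taken from my attached example) joining two normalised tables on a shared UUID Primary Key, much like a database JOIN would:

```python
import uuid

# Two normalised "worksheets": each book row is keyed by a generated
# UUID Primary Key, and each order references a book via that key.
book_id = str(uuid.uuid4())
books = {book_id: {"title": "Moby-Dick", "price": 9.99}}  # (DATA) Books sheet
orders = [{"order_id": str(uuid.uuid4()),                 # (DATA) Orders sheet
           "book_id": book_id,
           "quantity": 2}]

# Stitch the two tables together on the shared key (an inner JOIN):
# each order row is combined with its matching book row.
report = [
    {**order, **books[order["book_id"]]}
    for order in orders
    if order["book_id"] in books
]

for row in report:
    print(row["title"], row["quantity"], row["quantity"] * row["price"])
```

Because the book's title and price live in ONE place, every joined report picks up corrections automatically, with no duplicated values to chase down.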
By getting things right ahead of time, you'll save yourself a lot of extra work in the future 😎.\n\nUse a UUID Generator to create a unique identifier for each piece of your data.\n\nUse the full worksheet for any data storage, where Row 1 contains all the attribute headings and Column A contains all of your UUID Primary Keys.\n\nDon't mix your formulas in with your data on the same sheet; instead, keep your data fixed and base any formula derivations on references to this data.\n\nSeparate your data storage sheets from your layout/visualisation/analytics sheets. Prefix the titles of your worksheets with (DATA), (VISUAL) and (ANALYTICS) to make it easy for users. Use a separate workbook if it helps.\n\nTake the time up front to design ALL data worksheets to Third Normal Form.\n\nIf the structure is getting complicated, consider using a multi-user database.\n\nIf you need strict relationship enforcement across different sets of data, consider using a multi-user database.\n\nIf user access control is required, consider using a multi-user database.\n\nIf concurrent/collaborative working is required, consider using a multi-user database.\n\nSpreadsheets Without the Fuss\n\nSpreadsheet applications are very powerful tools that provide a lot of functionality to everyone, with no real rules around how to structure data effectively. Because of this, many mission-critical workbooks evolve over time and are put together in ways that facilitate the destruction of data integrity, which can cause an immense amount of stress. Fortunately, rules that specifically address these same problems do exist in the realm of databases, and they can easily be ported over to the world of spreadsheets to help us manage our data better. By normalising our data, we can become more efficient and productive, as well as remain sane when dealing with our data!\n\nTake care and all the best, Si.