Taming Bad Data
A great concept for a book. At a time when we increasingly engage with things we call datasets, take on challenges to make sense of big data, and engage with one another around stuff we call data, here is a series of lessons on dealing with data. Taking a very case-oriented approach, the articles in this edited volume look at the problems we run into, whether overtly or unawares, when working with data. How many of us have run into a character encoding problem, received data in a semi-structured form that had to be transformed quickly and efficiently into something more usable, or had to find a way to identify potential bias or the results of collection errors? Well, that’s what the Bad Data Handbook is all about.
Editor Q. Ethan McCallum has assembled an impressive array of contributors who present articles on determining data quality and detecting potential flaws, fixing data errors to make the data usable for your specific purpose, and using up-to-date techniques and methods to tame data and interrogate it effectively for analytical purposes. The premise of this book is data that is not fit for purpose, or at least not for the purpose you might have in mind for it, and in that respect we will call it bad data. The various chapters look at doing ‘sniff tests’ on the data to see whether it is sound for the purposes you might consider putting it to. How do we find outliers? Can we spot gaps? Some handy automated routines help answer both questions. The second chapter looks at techniques for taking data that was formatted for human consumption and transforming it into something machine-readable. Subsequently, the authors explore ways to examine the data models that define the collection and processing procedures, which may or may not render data unfit for purpose.
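To give a flavour of what such a sniff test might look like, here is a minimal sketch of my own, not code taken from the book. It assumes a list of numeric readings and a list of daily dates; the function names, the sample values, and the choice of Tukey's interquartile-range fences for outliers are all my assumptions for illustration.

```python
from datetime import date, timedelta
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1                    # interquartile range
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def find_gaps(days, step=timedelta(days=1)):
    """Return (previous, next) pairs of dates that are more than one step apart."""
    ordered = sorted(days)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > step]


# Hypothetical sample data: one suspicious reading, two missing days.
readings = [10.1, 9.8, 10.3, 10.0, 55.0, 9.9, 10.2]
days = [date(2024, 1, d) for d in (1, 2, 3, 6, 7)]

print(find_outliers(readings))  # flags the 55.0 reading
print(find_gaps(days))          # reports the jump from 3 January to 6 January
```

A check like this only flags candidates for a closer look. Whether a flagged value is truly bad still depends on the purpose you have in mind for the data.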
The articles collected in this book are extremely valuable, and the solutions they propose are code-based. Dealing with the data ultimately means applying routines that make the data suit your needs. The routines are Python-based, so they are about as approachable as possible for users who are less accustomed to using code to deal with data problems.
I was particularly impressed by the inclusion of a section on working with various text encoding formats and applying techniques to remedy situations that render the data ‘bad’. The series of quick exercises included in this section is particularly apt.
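As an illustration of the kind of encoding remedy I have in mind, here is a short sketch of my own rather than one of the book's exercises. It assumes the raw bytes are in one of a few common encodings, and that the garbling to repair is UTF-8 text mistakenly decoded as Latin-1; the function names and the candidate list are hypothetical.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Only reachable if the candidate list omits latin-1, which accepts any byte.
    raise ValueError("none of the candidate encodings could decode the input")


def repair_mojibake(text):
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1, if possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage; leave the text unchanged


text, used = decode_with_fallback("café".encode("cp1252"))
print(text, used)

garbled = "café".encode("utf-8").decode("latin-1")  # simulate the mix-up
print(repair_mojibake(garbled))
```

The order of the candidate list matters: UTF-8 is strict and fails loudly on most non-UTF-8 input, so it goes first, while Latin-1 never fails and belongs last as a catch-all.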
The general approach of the book is to identify a specific problem, explain its significance, and then provide hands-on examples of how a user can approach a solution.
The later chapters move to applied techniques that look at data more broadly, such as using sentiment analysis and natural language processing to sniff out whether online reviews are genuine. These address real-world problems with online information rather than with the data alone.
This is an intriguing book. It looks at the down-and-dirty manipulation and munging of data, then takes a higher-level look at how we might mistake information for solid data. In all cases it applies good techniques and suggests how one can use sound statistical reasoning, interrogate the data model, or delve into code-based manipulation in the pursuit of more truthful data. Because the book covers so much ground, it is harder to determine who it is directly aimed at. I believe that selective reading could inform general practitioners in the digital humanities and in emerging areas of study that are engaging with data in new ways. It brings to light many lessons of experience that are simply invaluable and that would normally be learned only through hands-on tinkering and discovery, often well into larger projects. It also appeals to data scientists more broadly, who benefit for similar reasons and from the wealth of hands-on techniques that refine and empower standard practice.
In any case, I do feel that as a collection of articles it can be a very helpful reference source, with individual sections consulted as needed; this is by no means a volume designed to be read linearly. It is, however, a very valuable contribution to a field that is gaining mass popular engagement.