Taming Bad Data

The Bad Data Handbook is a great concept for a book. We increasingly engage with things we call datasets, take on challenges to make sense of big data, and engage with one another around stuff we call data; here is a series of lessons for dealing with it. Taking a very case-oriented approach, the articles in this edited volume look at the problems we run into, knowingly or unknowingly, when working with data. How many of us have run into the character encoding challenge, received data in a semi-structured form and needed to transform it quickly and efficiently into something more usable, or had to find a way to identify potential bias or the effects of collection errors? Well, that’s what the Bad Data Handbook is all about.

Editor Q. Ethan McCallum has assembled an impressive array of contributors who present articles on determining data quality and detecting potential flaws, fixing data errors to make the data usable for your specific purpose, and using the most up-to-date techniques and methods to tame data and effectively interrogate it for analytical purposes. The premise of this book is data not fit for purpose, or at least not for the purpose you might have in mind for it; in that respect, we will call it bad data. The various chapters look at doing ‘sniff tests’ on the data to see whether it is sound for the purposes you might consider putting it to. How do we find outliers? Can we spot gaps? Both can be checked with some handy automated routines. The second chapter looks at techniques for transforming data that was formatted for human consumption into something machine-readable. Subsequently, the authors explore ways to examine the data models used to define the collection and processing procedures, which may or may not render data unfit for purpose.
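To give a flavour of what such a ‘sniff test’ might look like, here is a minimal sketch of my own; it is not code from the book. It assumes a plain list of numeric readings and a sorted list of integer record IDs, and the function names, the median-based scoring, and the 3.5 threshold are all illustrative choices on my part.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    The score uses the median and the median absolute deviation (MAD),
    so a single extreme value does not distort the baseline.
    The 3.5 cut-off is a common rule of thumb, assumed here for illustration.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All values (or most of them) are identical; nothing to score against.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


def find_gaps(sorted_ids):
    """Return (previous, next) pairs where consecutive integer IDs skip values."""
    return [(a, b) for a, b in zip(sorted_ids, sorted_ids[1:]) if b - a > 1]


readings = [10, 11, 9, 10, 12, 11, 250]
record_ids = [1, 2, 3, 7, 8, 10]

print(find_outliers(readings))
print(find_gaps(record_ids))
```

Checks like these do not say why a value is odd or why records are missing, only where to start looking.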

The articles in this book are extremely valuable, and the solutions proposed are code-based. Dealing with the data ultimately involves applying routines to make it suit your needs. The routines are Python-based, so they are about as approachable as possible for users who may be less familiar with or accustomed to using code to deal with data problems.

I was particularly impressed by the inclusion of a section on working with various text encoding formats and applying techniques to remedy situations that render the data ‘bad’. The series of quick exercises included in this section is particularly apt.
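As an illustration of the kind of encoding problem at stake, here is a small sketch of my own rather than one of the book’s exercises. It assumes the bytes are in one of a short list of candidate encodings, and that garbled text arose from UTF-8 bytes being misread as Windows-1252; both helper names are hypothetical.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each candidate encoding in order and return (text, encoding_used).

    The candidate list is an assumption; latin-1 accepts any byte sequence,
    so with the defaults it acts as a last resort that never fails.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def repair_mojibake(text):
    """Undo the common case of UTF-8 bytes that were wrongly read as cp1252.

    If the round trip fails, the text is assumed to be fine and is returned as-is.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


utf8_bytes = "caf\u00e9".encode("utf-8")
legacy_bytes = "caf\u00e9".encode("cp1252")

print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(legacy_bytes))
print(repair_mojibake("caf\u00c3\u00a9"))
```

A successful decode only means the bytes were valid in that encoding, not that it was the right one, so a fallback like this is a starting point rather than a guarantee.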

The general presentation of the book is to identify a specific problem, explain its significance, and then provide hands-on examples of how a user can approach a solution.

The transition to applied techniques that look at data on a broader basis, such as using sentiment analysis and Natural Language Processing to sniff out whether online reviews are genuine, addresses real-world problems with online information more than with data itself.

This is an intriguing book. It looks at the down-and-dirty manipulation and munging of data, then takes higher-level looks at how we might mistake information for solid data. In all cases it applies good techniques and suggests how one can use sound statistical reasoning, interrogate the data model, or delve into code-based manipulation in the pursuit of more truthful data. Because of its broad coverage, it is harder to determine whom the book is directly aimed at. I believe that selective reading of it could inform general practitioners in the digital humanities and in emerging areas of study that are increasingly engaging with data in new ways. It brings to light many lessons of experience that are simply invaluable and would normally be learned only through hands-on tinkering and discovery, often well into larger projects. It also has broader appeal to data scientists, who benefit for similar reasons, but also from the wealth of hands-on techniques provided that refine and empower standard practice.

In any case, I do feel that as a collection of articles it can be a very helpful reference source, with individual sections consulted as needed; by no means is this a volume designed to be read linearly. It is, however, a very valuable contribution to a field that is gaining mass popular engagement.

