randomosity

strikingly random thoughts and 'maximum data existentialisation'


Taming Bad Data

Posted by shawnday on 29 November 2012
Posted in: Info Architecture, Review.

The Bad Data Handbook is a great concept for a book. In this day and age we increasingly engage with things we call datasets, take on challenges to make sense of big data, and engage with one another around stuff we call data. Here is a series of lessons on how to deal with that data. Taking a very case-oriented approach, the collection of articles in this edited volume looks at the problems we run into, whether overtly or unawares, when working with data. How many of us have run into the character encoding challenge, received data in a semi-structured form and needed to transform it quickly and efficiently into something more usable, or had to find a way to identify potential bias or errors introduced during collection? Well, that’s what the Bad Data Handbook is all about.

Editor Q. Ethan McCallum has assembled an impressive array of contributors who present articles on determining data quality and detecting potential flaws, fixing data errors to make the data usable for your specific purpose, and using the most up-to-date techniques and methods to tame data and effectively interrogate it for analytical purposes. The premise of this book is data that is not fit for purpose, or at least not for the purpose you might have in mind for it; in that respect, we will call it bad data. The various chapters look at doing ‘sniff tests’ on the data to see whether it is sound for the purposes you might consider putting it to. How do we find outliers? Can we spot gaps with the help of some handy automated routines? The second chapter looks at techniques for transforming data that was formatted for human consumption into something machine-readable. Subsequently the authors explore ways to examine the data models used to define the collection and processing procedures, which may or may not render data unfit for purpose.
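To make the ‘sniff test’ idea concrete, here is a minimal sketch of what such automated routines might look like. It is not taken from the book: the function names, the median-based outlier rule and the 3.5 threshold are my own assumptions, chosen because they need nothing beyond the Python standard library.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so one wild value cannot hide itself by
    inflating the mean. The 3.5 cut-off is a common rule of thumb,
    assumed here for illustration.
    """
    values = list(values)
    if not values:
        return []
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values are identical; this rule cannot score them.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


def find_gaps(years):
    """Return the years missing from an otherwise consecutive run."""
    present = set(years)
    if not present:
        return []
    return [y for y in range(min(present), max(present) + 1) if y not in present]


# Hypothetical sample data, invented for this sketch.
counts = [10, 12, 11, 13, 12, 11, 95]
years = [1871, 1872, 1874, 1875, 1878]

print(find_outliers(counts))  # [95]
print(find_gaps(years))       # [1873, 1876, 1877]
```

Neither check says the data is wrong, only that something deserves a second look before the data is put to use.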

The articles collected in this book are deadly valuable, and the solutions proposed are code-based: dealing with the data ultimately means applying routines that make it suit your needs. The routines are Python-based, so about as approachable as possible for users who may be less accustomed to using code to deal with data problems.

I was particularly impressed by the inclusion of a section on working with various text encoding formats and applying techniques to remedy situations which render the data ‘bad’. The series of quick exercises included in this section is particularly apt.
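As an illustration of the kind of encoding problem meant here, the sketch below shows two common remedies in Python. It is my own example rather than one of the book's exercises; the function names and the choice of UTF-8 and Windows-1252 as candidate encodings are assumptions.

```python
def decode_bytes(raw, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in turn and report which one worked.

    UTF-8 goes first because it is strict: bytes that are not valid
    UTF-8 raise an error instead of silently decoding to nonsense.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: keep what we can and mark the rest as unreadable.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


def repair_mojibake(text):
    """Undo the classic mistake of reading UTF-8 bytes as Latin-1.

    If the text cannot be round-tripped it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(decode_bytes("café".encode("cp1252")))  # ('café', 'cp1252')
print(repair_mojibake("cafÃ©"))               # café
```

Guessing an encoding is never fully reliable, so it is worth recording which encoding was chosen alongside the decoded text.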

The general approach of the book is to identify a specific problem, explain its significance, and then provide hands-on examples of how a user can approach a solution.

The book then moves on to applied techniques that look at data on a broader basis, such as using sentiment analysis and Natural Language Processing to sniff out whether online reviews are genuine. These chapters address real-world problems with online information rather than with data itself.

This is an intriguing book. It looks at the down and dirty manipulation and munging of data, then takes higher-level looks at how we might mistake information for solid data. In all cases it applies good techniques and suggests how one can use sound statistical reasoning, interrogate the data model or delve into code-based manipulation in the pursuit of more truthful data. Because the coverage is so broad, it is harder to determine who the book is directly aimed at. I believe that selective reading of it could inform general practitioners in the digital humanities and in emerging areas of study that are increasingly engaging with data in new ways. It brings to light many lessons of experience that are simply invaluable and would normally be learned only through hands-on tinkering and discovery, often well into larger projects. It also has wider appeal to data scientists, who benefit for similar reasons, but also from the wealth of hands-on techniques provided that refine and empower standard practice.

In any case, I do feel that as a collection of articles it can be a very helpful reference source, with individual sections consulted as needed; this is by no means a volume designed to be read linearly. It is, however, a very valuable contribution to a field that is gaining mass popular engagement.

 

  • about.me

    Shawn Day

    Shawn Day is an entrepreneur, digital historian, economist and blender of the aesthetic and the informative. Raised in Canada, Shawn now works with the Digital Humanities Observatory, a project of the Royal Irish Academy, to leverage Ireland's participation in the emerging practice of digital humanities scholarship. He lectures in Social Computing and the Philosophy of Technology.

    His own research explores the social and economic circumstances of the nineteenth-century retail liquor trade and its impact on family. He applies digital, spatial and social network analysis to the study of the relationships between credit, respectability, and order in the Victorian community. Recent articles have examined the social dimensions of the Victorian public mental hospital using GIS and statistical modeling tools. Shawn has been involved in a number of successful and innovative digital humanities projects throughout Canada. Most recently he has worked with large manuscript census databases in the 1871/1891 census project (University of Guelph). He is a team member of the national TAPoR text analysis portal project, the Canadian Network for Economic History and the Network for Canadian History and the Environment (NiCHE - UWO).

    Shawn has blended his background in management economics with an entrepreneurial ethos to found a number of successful software development ventures in Canada and find a means to leverage this in the academic arena.
