Collaborative Record Matching

I have lately been exploring various means for the automated longitudinal matching of census manuscript records. It’s a huge challenge, and I seem to have spent as much time identifying potential problems as identifying potential solutions. This is not to say I haven’t pondered a couple of solutions, but the list of challenges remains much longer and seems to be growing much faster. All this means is a more challenging research problem, demanding some innovation in methodology. Fun!

But there is a paradigm shift happening, one that I have been participating in, and certainly embrace, but am not always cognizant of. The idea of online collaboration continues to permeate more and more of our everyday tasks. It emerged from specialized research objectives such as the SETI@Home initiative, which sought to use excess personal computing capacity distributed around the world, and it now extends to efforts that take advantage not only of excess processor cycles but also of people themselves, engaging the masses to carry out specific manual tasks.

I started playing with the Google Image identification programme a few months back. If you haven’t tried it, it basically pairs you with a random online user, and the two of you spend 90 seconds typing in words to describe a picture displayed to both of you. You quickly type words that come to mind until both users have typed the same word, at which point the engine accepts that the word is likely to be a relevant descriptor. The key to participation is that the exercise is fun and fast, and you can hop on at any time; given the global scope, you will quickly be paired with an online user. Moreover, you have the small satisfaction of being part of a bigger exercise of improving the descriptors attached to Google’s image search repository. This little ‘game’ also clearly illustrates one of the downsides of Google’s repository: the descriptors are determined through a process that favours simple terms over more specialized ones. As I ‘play’ I realize that I may recognize the image as a particular movie poster, but I also suspect that my online partner may not catch the subtleties, so I resort to suggesting a predominant colour rather than the name of the movie or, say, an actor in it. As a result I choose the more obvious descriptor to encourage a faster match. The objective in the Google game is to match words for the highest number of images during the 90-second period, which may not yield the best descriptions. However, the process does deliver some basic descriptive terms that an automated process would miss. The key is making it fun for the participants.
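For the programmers in the audience, the agreement rule described above can be sketched in a few lines of Python. This is only a guess at the behaviour, not Google’s actual code: the function name `first_agreed_label`, the (seconds, word) input format, and the lowercasing of words are all assumptions made for illustration.

```python
ROUND_SECONDS = 90  # round length mentioned in the post


def first_agreed_label(guesses_a, guesses_b, limit=ROUND_SECONDS):
    """Return the first word typed by both players, or None if they never agree.

    Each argument is a list of (seconds_elapsed, word) tuples for one player.
    This input format is hypothetical, chosen only for the sketch.
    """
    # Tag every guess with its player and replay all guesses in time order.
    events = [(t, "a", w) for t, w in guesses_a]
    events += [(t, "b", w) for t, w in guesses_b]

    seen = {"a": set(), "b": set()}
    for t, player, word in sorted(events):
        if t > limit:
            break  # the round is over
        word = word.strip().lower()  # assumed normalization
        other = "b" if player == "a" else "a"
        if word in seen[other]:
            return word  # both players have now typed this word
        seen[player].add(word)
    return None


player_a = [(2, "poster"), (5, "red"), (9, "Casablanca")]
player_b = [(3, "man"), (7, "Red"), (12, "hat")]
label = first_agreed_label(player_a, player_b)
print(label)
```

Note how the sketch rewards the obvious word: a specialized guess such as a film title only counts if the partner happens to type it too.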

In this same vein, Kris Inwood pointed me at a census initiative, Automated Genealogy. Working from the same premise of trying to ‘funify’ a process requiring mass user intervention, the Automated Genealogy site is a meeting point where genealogists sign up to manually enter manuscript census records into a database. The hope here is to engage that vast army of genealogists out there to contribute time to help their fellow genealogists and, in return, gain access to records which benefit their own research efforts. Collaboration at its best. Additionally, they have begun a similar process to match Canadian manuscript census records between the 1901 and 1911 censuses. This is the same task for which I have been ruminating over developing an automated process. At AG they are using automated means to do simple matching and then allowing users to refine the match where human discretion is required. This is a clever approach to a real-world research problem. As to progress, the published results indicate that they have transcribed 93.15% of the entire Canadian census for 1911 and 99.99% of the 1901 census, with 55.15% of the proofing carried out on the latter.
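The split between simple automated matching and human refinement might look something like the sketch below. To be clear, this is not Automated Genealogy’s actual method, which I have not seen; the record fields (`id`, `surname`, `given`, `birth_year`), the exact-match key, and the function names are all hypothetical. The idea is simply that a 1901 record is linked automatically only when exactly one 1911 record shares its key, and everything else is set aside for a person to look at.

```python
from collections import defaultdict


def match_key(record):
    # Hypothetical key: normalized surname, given name, and birth year.
    return (
        record["surname"].strip().lower(),
        record["given"].strip().lower(),
        record["birth_year"],
    )


def link_censuses(records_1901, records_1911):
    """Return (auto_links, needs_review) for two lists of record dicts."""
    # Index the 1911 records by key so each lookup is cheap.
    index = defaultdict(list)
    for rec in records_1911:
        index[match_key(rec)].append(rec)

    auto_links, needs_review = [], []
    for rec in records_1901:
        candidates = index.get(match_key(rec), [])
        if len(candidates) == 1:
            # Unambiguous: exactly one 1911 record shares the key.
            auto_links.append((rec["id"], candidates[0]["id"]))
        else:
            # Zero or several candidates: leave it to human discretion.
            needs_review.append((rec["id"], [c["id"] for c in candidates]))
    return auto_links, needs_review


census_1901 = [
    {"id": "a1", "surname": "Tremblay", "given": "Marie", "birth_year": 1880},
    {"id": "a2", "surname": "Smith", "given": "John", "birth_year": 1875},
    {"id": "a3", "surname": "Roy", "given": "Anne", "birth_year": 1890},
]
census_1911 = [
    {"id": "b1", "surname": "tremblay ", "given": "marie", "birth_year": 1880},
    {"id": "b2", "surname": "Smith", "given": "John", "birth_year": 1875},
    {"id": "b3", "surname": "Smith", "given": "John", "birth_year": 1875},
]
auto_links, needs_review = link_censuses(census_1901, census_1911)
print(auto_links)
print(needs_review)
```

An exact key like this will miss spelling variants and misreported ages, which is exactly the sort of case where a human reviewer earns their keep.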

This is a great example of the emerging trend to mobilize individual efforts en masse to assist with processes that in the past would have been carried out by a small group of specialized researchers. Both projects recognize that tasks can be divided, with different and appropriate resources applied at different stages. Mass collaboration on simple tasks, made fun!

One Response

  1. shawnday says:

    There have been a number of articles circulating in the last few days relating to the Xgrid software that Apple included in Tiger. I hadn’t been aware of it, and it is relevant to this discussion as it allows you to ‘plug’ your Mac into a distributed grid of processors. Pretty cool. It is similar to the seti@home plug-in that we used to install. This time, it’s right at the OS level and as simple as choosing a project to be part of through your System Preferences and, lo and behold, spare cycles are donated to worthy causes.
