Collaborative Record Matching

I have of late been exploring various means for the automated longitudinal matching of census manuscript records. It’s a huge challenge, and I seem to have spent more time identifying potential problems than identifying potential solutions. This is not to say I haven’t pondered a couple of solutions, but the list of challenges remains much longer and seems to be growing much faster. All this means is a more challenging research problem, demanding some innovation in methodology. Fun!

But there is a paradigm shift happening, one that I have been participating in, and certainly embrace, but am seldom fully cognizant of. The idea of online collaboration continues to permeate more and more of our everyday tasks. It emerged from specialized research objectives such as the SETI@Home initiative, which sought to use excess personal computing capacity distributed around the world, and it extends to efforts today that take advantage not only of excess processor cycles but also of the masses themselves, who are engaged to carry out specific manual tasks.

I started playing with the Google Image identification programme a few months back. If you haven’t tried it, it basically involves matching you with a random online user, and the two of you spend 90 seconds typing in words to describe a picture displayed to both users. You quickly type words that come to mind until both users type in the same word, at which point the engine accepts that word as likely to be a relevant descriptor. The key to participation is that the exercise is fun and fast, and you can hop on at any time; given the global scope, you will quickly be paired with an online user. Moreover, you have the small satisfaction of being part of a bigger exercise of improving the descriptors attached to Google’s image search repository. This little ‘game’ also clearly illustrates one of the downsides of Google’s repository, as these descriptors are determined through a process which renders them simple rather than more specialized. As I ‘play’ I realize that I may recognize the image as a particular movie poster, but I also think that my online partner may not catch the subtleties, so I may resort to simply choosing a predominant colour as a suggested word, rather than the name of the movie or, say, an actor in the movie. As a result I choose the more obvious descriptor to encourage a faster match. The objective in the Google game is to match words for the highest number of images during the 90-second period, which may not yield the best descriptions. However, the process does deliver some basic descriptive terms that an automated process would miss. The key is making it fun for the participants.
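The agreement rule described above (a word is accepted once both players have typed it) can be sketched in a few lines of Python. This is only an illustration of the idea, not Google's actual implementation: the function name `first_agreed_label`, the turn-by-turn interleaving of the two players, and the lower-casing of words are all assumptions of mine.

```python
from itertools import zip_longest
from typing import Iterable, Optional


def first_agreed_label(words_a: Iterable[str], words_b: Iterable[str]) -> Optional[str]:
    """Return the first word that both players have typed, or None.

    Hypothetical helper: the two players are assumed to type in
    alternating turns, and words are compared case-insensitively.
    """
    seen_a = set()  # words typed so far by player A
    seen_b = set()  # words typed so far by player B
    for a, b in zip_longest(words_a, words_b):
        if a is not None:
            a = a.strip().lower()
            if a in seen_b:  # B already typed this word: agreement
                return a
            seen_a.add(a)
        if b is not None:
            b = b.strip().lower()
            if b in seen_a:  # A already typed this word: agreement
                return b
            seen_b.add(b)
    return None  # time ran out without a shared word


# One player recognizes the movie poster, the other does not;
# they only converge on the obvious colour.
label = first_agreed_label(
    ["Casablanca", "Bogart", "grey"],
    ["poster", "man", "grey"],
)
print(label)
```

The sketch also shows why the accepted descriptors drift toward the simple: the specialized words typed by one player never appear in the other player's list, so only the obvious word is recorded.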

In this same vein, Kris Inwood pointed me at a census initiative, Automated Genealogy. Working from the same premise of trying to funify a process requiring mass user intervention, the Automated Genealogy site is a meeting point where genealogists sign up to manually enter manuscript census records into a database. The hope here is to engage that vast army of genealogists out there to contribute time to help their fellow genealogists and to gain access to records which benefit their own research efforts. Collaboration at its best. Additionally, they have begun a similar process to match Canadian manuscript census records between the 1901 and 1911 censuses. This is the same task for which I have been ruminating over developing an automated process. At AG they are using automated means to do simple matching and then allowing users to refine the match where human discretion is required. This is a clever approach to a real-world research problem. As to progress, the published results indicate that they have transcribed 93.15% of the entire Canadian census for 1911 and 99.99% of the 1901 census, with 55.15% of the proofing carried out on the latter.
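To make the division of labour concrete, here is a minimal Python sketch of the general pattern: let the machine accept the unambiguous matches and queue everything else for a human. I do not know how Automated Genealogy actually implements its matching, so everything here is an assumption for illustration: the `Person` fields, the `triage` function, the use of `difflib.SequenceMatcher` for name similarity, and the threshold values.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Person:
    # Hypothetical, simplified record; real census rows carry many more fields.
    name: str
    birth_year: int


def name_similarity(a: str, b: str) -> float:
    """Similarity between two names in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(person, candidates, auto=0.9, review=0.7, year_tolerance=2):
    """Classify a 1901 record against 1911 candidates.

    Returns ("auto", [match]) when exactly one candidate is plausible and
    it is a strong match, ("review", [candidates]) when a human should
    decide, and ("unmatched", []) otherwise. Thresholds are illustrative
    guesses, not tuned values.
    """
    # Keep only candidates whose reported birth year is roughly consistent.
    scored = [
        (name_similarity(person.name, c.name), c)
        for c in candidates
        if abs(c.birth_year - person.birth_year) <= year_tolerance
    ]
    strong = [c for score, c in scored if score >= auto]
    possible = [c for score, c in scored if score >= review]
    if len(strong) == 1 and len(possible) == 1:
        return "auto", strong
    if possible:
        return "review", possible
    return "unmatched", []


john_1901 = Person("John McDonald", 1870)
census_1911 = [Person("John McDonald", 1871), Person("Mary Smith", 1870)]
print(triage(john_1901, census_1911))
```

Two John McDonalds of about the same age in the 1911 list would send the record to the review queue instead, which is exactly where human discretion earns its keep.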

This is a great example of the emerging trend to mobilize individual efforts en masse to assist with processes that in the past would have been carried out by a small group of specialized researchers. Both processes recognize that tasks can be divided and that different, appropriate resources can be applied at different stages. Mass collaboration on simple tasks made fun!

One comment

shawnday

February 14, 2007 / 11:18

A number of articles have been circulating in the last few days about the Xgrid software that Apple included in Tiger. I hadn’t been aware of it, and it is relevant to this discussion as it allows you to ‘plug’ your Mac into a distributed grid of processors. Pretty cool. It is similar to the SETI@Home plug-in that we used to install. This time, it’s right at the OS level and as simple as choosing a project to be part of through your System Preferences and, lo and behold, spare cycles are donated to worthy causes.
