View this PageEdit this Page (locked)Attachments to this PageHistory of this PageHomeRecent ChangesSearch the SwikiHelp Guide
Hotspots: Admin Pages | Turn-in Site |
Current Links: Cases Final Project Summer 2007

Spring2004 M6 - Matching and merging

Overview

One of the really tough problems in genealogical work is figuring out where two trees overlap. Fundamental to this problem is the matching function included in the last milestone, though the matching approach you have implemented so far must be extended to deal with the notion of matching portions of family trees rather than simply matching individuals.

Your system should already have the capability to deal with multiple Genealogies. The essence of this milestone is to provide a user with a way to compare two Genealogies in order to identify Persons from each that may represent the same individual. Identifying such matches will be key to the process of merging two Genealogies.

Matching Revisited

The matching you have done so far has relied entirely on the information describing an individual. Matching can be improved by also using family information. However, the different family relationships impact matching in very different ways. Finding parents that can't possibility be the same people is a good way to eliminate a match. Parents that do match are a strong positive clue (but not a guarantee, since siblings obviously have the same parents and may have similar names). Matching spouses also are a strong positive indicator, while non-matching spouses typically don't help, since multiple marriages are a possibility. Similarly, matching children are a positive indicator, but lack of matching children may only be a sign of incomplete information in one or both trees.

The comments above address the question of using family relationships to look at the probability of a match between two Persons. A more sophisticated approach would be to use this same information to attempt to identify matching family structures (graphs) spanning any number of generations. Taking this approach might be particularly useful in attempting to combine two GEDCOM files that include a small number of separate changes to a common original file.

Merging (and unmerging)

Merging can be thought of as happening one of two ways, either as starting from a merger of two Person objects identified as representing the same individual or as operating on two whole Genealogies, attempting to merge all of the matching Person pairs possible.

Beginning with the simpler approach, merging has implications not only for the Person objects being merged, but even more so for surrounding family structures. When two Persons are merged, the resulting representation should contain a combination of the information in the original two representations. If there are conflicts, the user should be asked to guide the merger by making choices (or perhaps by designating one Person as overriding the other in all cases).

Doing a merger has significant implications on the family structures. Once a merger between two Persons is indicated, their parents and any other ancestors represented in the two files must necessarily be the same. Merging ancestors may potentially generate a string of requests to the user to guide the choice of information about ancestors. Spouses of two Persons being merged may also be merged, but not necessarily (again, multiple marriages are a possibility). If they are merged, then a family match has been identified. Children in matched families may be merged, if identified as matching; otherwise, the lists of children in the two families should be combined.

If the merge all matching pairs approach is being used, the above is repeated as necessary. The process of matching ancestors described above should result in significantly shortening the list of pairs to merge. Alternatively, if family structure graph matching has been done, the merger process can be driven by starting from the matching family structures so discovered. Here is an article to read if you are interested in thinking about an advanced merging algorithm: Remerge.pdf

Requirements

  1. Add a Compare and Merge capability to your user interface. It must first support the notion of designating two Genealogies under consideration. Since your interface currently operates on a Genealogy, it probably makes the most sense for you to provide a way to select a second Genealogy to operate on in conjunction with the “current” Geneaolgy.
  2. Using your existing Genealogy browsing capability, give the user a way to designate Persons from the two Genealogies as the same individual, thus beginning or continuing the process of merging the two structures.
  3. Provide the ability to automatically compare two maps, looking for Persons from each map that may represent the same individual and displaying the likely matches in a useful way.
  4. Provide an interface that allows the user to approve matches before you incorporate the information from two matched Person objects in a merged Genealogy.
  5. Finally (and most importantly) generate some kind of interface or visualization that illustrates the merged structure. The user has to be able to move around representations of Persons, see the relationships, and somehow discern which nodes are merged, and which nodes are from which Genealogies and are NOT merged. This interface must also support the user unmerging nodes and merging nodes that your algorithm didn't merge. Note that this new interface or visualization must continue to work and be usable in parallel with views of the original structures. (Hint!: What was that MVC thing again...?)

Given all of the complexity of merging, unmerging is hard! It is best implemented as an undo feature, meaning you need to have a way to put things back the way they were before a merge step. (Hint: this means mergers should be accomplished by creating new objects, not modifying the original.)

Scenarios and testing

There are a lot of possibilities covered in the matching and merging discussions above. You should carefully generate scenarios and test cases that cover the various situations. Start with simple ones and then add additional complexity. Of course, add new CRC cards and UML structures as you find it necessary

Here are two example GEDCOM files that you may want to use to construct your test cases:
Example.ged
Kennedy.ged
The simplest way to create a test case is to use some subset of a GEDCOM file (perhaps a very small subset) to create a GEDCOM and use it as the second file to be matched or merged with the first. You can create a sequence of progressively more complex tests by including more person and family records in the subset. Of course, you can delete or edit information to demonstrate inexact matching.

Here is a new GEDCOM file for testing merging:
Rome-Dugas.ged
It is a relatively simple file that includes some shared ancestors as you follow the two lines back from the most recently born person included. In addition, the Individual and Family numbers are not consecutive, which is a good test of the generality of your mechanisms for importing data from GEDCOM files.

Grading Criteria


Note: Please include a README.txt with an explanation of how to file-in your code, how to run your code, and things the TA should know when grading your project.

The percentages above go well past 100%. Doing it all is a big challenge, not the expected outcome for most groups. As should be clear from the notes, test planning is an important aspect of this milestone. Good test suite design will allow you to check out your implementation incrementally and to demonstrate clearly just what your system can do. Be sure you realize that implementtion of the merging capabilities is independent of the matching algorithm being used. These two parts can be developed in parallel, since merging can work quite adequately simply using the matching capabilities you implemented in Milestone 5.


Links to this Page