Where truly unique personal identification numbers are not available across all information sources, probabilistic linkage allows connections (or linkages) to be created by comparing the personal information available and calculating the likelihood that records belong to the same person, place or event.
Linkage is a complex process that makes many 'passes' through the datasets, using a different arrangement of the data items at each pass. Weights are assigned based on the likelihood of a "true match", and thresholds are set to separate "probable" links from "improbable" links. Linkage strategies are designed so that these thresholds are as close together as possible, minimising the number of matches that need manual review and a decision by a linkage officer.
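As a rough illustration of how weights and thresholds interact, the sketch below scores a pair of records using log-ratio agreement weights in the Fellegi-Sunter style. That formulation is common in probabilistic linkage but is an assumption here, not something stated above. The field names, the m and u probabilities, and both threshold values are hypothetical.

```python
import math

# Hypothetical m/u probabilities per field (illustrative values only):
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
FIELD_PROBS = {
    "family_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "date_of_birth": (0.97, 0.003),
    "sex": (0.99, 0.5),
}

# Hypothetical thresholds; pairs scoring between them go to manual review.
UPPER_THRESHOLD = 10.0
LOWER_THRESHOLD = 0.0


def pair_weight(rec_a, rec_b):
    """Sum log2 agreement/disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight):
    """Place a pair weight into one of the three outcome bands."""
    if weight >= UPPER_THRESHOLD:
        return "probable link"
    if weight <= LOWER_THRESHOLD:
        return "improbable link"
    return "manual review"
```

Moving the two thresholds closer together shrinks the "manual review" band, which is the design goal described above.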
The linkage process can be split into the following steps:
1. Obtain demographic data
Raw data is provided for linkage. All or some of the following demographic fields are included:
- Name (first name, second name, family name, aliases)
- Date of Birth
- Address (house number, street name, suburb, postcode)
- Sex
- Record date
- Other unique identifiers (e.g. Hospital Unique Medical Record Number)
2. Clean and standardise data
The data fields are cleaned and put into a standard format that can be used for linkage. Customised identifiers are assigned. For example:
- MC DONALD > MCDONALD
- O'CONNOR > OCONNOR
- 12th August 1982 > 19820812
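The three examples above could be reproduced with a sketch like the following. The function names are hypothetical, and the rules are deliberately minimal: names keep only the letters A to Z, so hyphens and accented characters would also be removed, and dates are assumed to arrive in day, English month name, year order.

```python
import re
from datetime import datetime


def standardise_name(raw):
    """Upper-case a name and drop spaces, apostrophes and other non-letters."""
    return re.sub(r"[^A-Z]", "", raw.upper())


def standardise_date(raw):
    """Convert a date such as '12th August 1982' to 'YYYYMMDD'."""
    # Remove ordinal suffixes (1st, 2nd, 3rd, 12th) before parsing.
    without_suffix = re.sub(
        r"(\d+)(st|nd|rd|th)\b", r"\1", raw.strip(), flags=re.IGNORECASE
    )
    # Assumes English month names and day-month-year order.
    parsed = datetime.strptime(without_suffix, "%d %B %Y")
    return parsed.strftime("%Y%m%d")


print(standardise_name("MC DONALD"))
print(standardise_name("O'CONNOR"))
print(standardise_date("12th August 1982"))
```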
3. Load demographic tables
The demographic details are loaded into tables in an Oracle database. There are different tables for different datasets since not all datasets have the same variables (the specific data items collected for health records, e.g. name, address, date of birth and sex).
4. Extract linkage variables
Customised scripts are used to extract only those records and fields required for a given linkage into "flat data files".
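A minimal sketch of such an extraction is shown below, assuming the records are already available as dictionaries and that the flat file is comma-delimited text. Neither assumption comes from the description above; the function name, field names and sample rows are hypothetical.

```python
import csv
import io


def write_flat_file(records, fields, out, keep=lambda rec: True):
    """Write only the chosen fields of the selected records as delimited text."""
    # extrasaction="ignore" silently drops fields not listed in `fields`.
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for rec in records:
        if keep(rec):
            writer.writerow(rec)


# Made-up sample rows for illustration.
records = [
    {"record_id": "H1", "family_name": "MCDONALD", "dob": "19820812", "ward": "4B"},
    {"record_id": "H2", "family_name": "OCONNOR", "dob": "", "ward": "2A"},
]
buffer = io.StringIO()
write_flat_file(
    records,
    ["record_id", "family_name", "dob"],
    buffer,
    keep=lambda rec: rec["dob"] != "",  # keep only records with a date of birth
)
flat_text = buffer.getvalue()
```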
5. Run linkage engine
The linkage program is used to run comparisons between two flat data files. Linkage officers can customise their linkage strategies according to the individual characteristics of each dataset. Some links pass as automatic matches, some are automatic rejections, and some fall into a "grey area" in between where links are manually checked for validity.
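One way to picture a multi-pass comparison between two flat files is blocking: each pass compares only the records that agree on a chosen combination of fields. The sketch below is an assumption about how passes might be organised, not a description of the actual linkage program. The field combinations and the `record_id` field are hypothetical.

```python
from collections import defaultdict

# Hypothetical blocking keys, one tuple of fields per pass.
PASSES = [
    ("family_name", "date_of_birth"),
    ("first_name", "postcode"),
]


def candidate_pairs(file_a, file_b, passes=PASSES):
    """Return the (id_a, id_b) pairs that share a blocking key in any pass."""
    pairs = set()
    for key_fields in passes:
        # Index file B by this pass's blocking key.
        index = defaultdict(list)
        for rec in file_b:
            key = tuple(rec.get(f) for f in key_fields)
            if None not in key:
                index[key].append(rec["record_id"])
        # Look up each record of file A in that index.
        for rec in file_a:
            key = tuple(rec.get(f) for f in key_fields)
            if None in key:
                continue  # cannot block on a missing field
            for id_b in index.get(key, []):
                pairs.add((rec["record_id"], id_b))
    return pairs
```

Each candidate pair would then be scored and sorted into automatic match, automatic rejection, or the "grey area" for manual checking.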
6. Load links
The IDs of linked records are assigned an identical "chain number", which is stored in a separate database. A chain is a group of links that have been determined to relate to one individual.
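Assigning one number to every record in a group of links can be sketched with a union-find structure, as below. Treating links as transitive (if A links to B and B links to C, all three share a chain) is an assumption, as are the function name and the sequential numbering. Every ID that appears in `links` is expected to be present in `record_ids`.

```python
def assign_chain_numbers(record_ids, links):
    """Give every record a chain number; linked records share the same number."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parents to the root, shortening the path along the way.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    # Merge the groups of each linked pair.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the groups in order of first appearance.
    chain_of_root = {}
    chains = {}
    for rid in record_ids:
        root = find(rid)
        if root not in chain_of_root:
            chain_of_root[root] = len(chain_of_root) + 1
        chains[rid] = chain_of_root[root]
    return chains
```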
7. Update links as required
Linkages are regularly revisited to ensure that the system of links is continually refined and improved.
8. Extract linkage keys
Customised, project-specific linkage keys are extracted by encrypting the "master ID" for each chain of records. The various data collections then attach their service data to these keys.
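The idea of a project-specific key can be illustrated with a keyed hash, as in the sketch below. This swaps in HMAC-SHA256, which is a one-way keyed hash and not reversible encryption; the actual scheme is not specified above. The function name, the master ID format and the per-project secrets are hypothetical.

```python
import hashlib
import hmac


def project_linkage_key(master_id, project_secret):
    """Derive a project-specific key from a chain's master ID."""
    # A different secret per project yields unrelated keys for the same
    # master ID, so keys from separate projects cannot be matched up.
    digest = hmac.new(project_secret, master_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


key_a = project_linkage_key("CHAIN000123", b"project-A-secret")
key_b = project_linkage_key("CHAIN000123", b"project-B-secret")
```

Under this assumption, the same individual receives a stable key within a project but different keys across projects.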
