Where truly unique personal identification numbers are not available across all information sources, probabilistic linkage allows connections (or linkages) to be created by comparing the personal information available and calculating the likelihood that records belong to the same person, place or event.

Data linkage (also called 'Record Linkage' or 'Linkage') is a complex technique for connecting data records within and between datasets using demographic data (e.g. name, date of birth, address, sex, medical record number). Probabilistic linkage is a method of linking records using non-unique identifiers (e.g. name, date of birth) to establish weights which represent the likelihood that two records belong to the same person. These weights are used to inform matches and non-matches, and can include clerical review for a selected 'grey area' in between.

Linkage is a complex process that makes many 'passes' through the datasets, using a different arrangement of the data items at each pass. Weights are assigned based on the likelihood of a "true match", and thresholds are set to separate "probable" from "improbable" links.

Linkage strategies are designed so that these thresholds are as close together as possible to minimise the number of matches that need manual review and a decision by a Linkage Officer, while still minimising the number of missed and incorrect links.
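As a rough illustration of how weights and thresholds interact, the sketch below uses the classic Fellegi-Sunter style of scoring, in which each compared field adds log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees. This is an assumption made for illustration and not a description of the linkage engine itself: the field names, the m and u probabilities and both thresholds are hypothetical values.

```python
import math

# Hypothetical m/u probabilities per field (illustrative values only):
# m = P(field agrees | records belong to the same person)
# u = P(field agrees | records belong to different people)
FIELD_PROBS = {
    "family_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "date_of_birth": (0.97, 0.003),
    "postcode": (0.80, 0.02),
}

# Hypothetical thresholds separating the three outcomes.
UPPER_THRESHOLD = 12.0  # at or above: automatic match
LOWER_THRESHOLD = 3.0   # at or below: automatic rejection


def pair_weight(rec_a, rec_b):
    """Sum agreement and disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if rec_a.get(field) is None or rec_b.get(field) is None:
            continue  # a missing value contributes no evidence either way
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight):
    """Place a pair above, below or between the two thresholds."""
    if weight >= UPPER_THRESHOLD:
        return "match"
    if weight <= LOWER_THRESHOLD:
        return "non-match"
    return "clerical review"
```

Moving the two thresholds closer together shrinks the "clerical review" band, at the cost of more pairs being accepted or rejected automatically.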

Data extraction is then performed to meet the requirements of approved projects. This can refer to: (1) the extraction of linkage keys (by Linkage Officers); or (2) the extraction of service data to which these keys will be appended (by CARES or the relevant Data Custodians).

 

The Linkage Process

The linkage process can be split into the following steps:

[Diagram: steps in the linkage process]

 

1. Obtain Demographic Data, Clean and Standardise

Raw data is provided for linkage. All or some of the following demographic fields are included:

  • Name (first name, second name, family name, aliases)
  • Date of Birth
  • Address (house number, street name, suburb, postcode)
  • Other unique identifiers (e.g. Hospital Unique Medical Record Number)

Customised identifiers are assigned, and the data fields are cleaned and put into a standard format that can be used for linkage. For example:

  • MC DONALD > MCDONALD
  • 12th August 1982 > 19820812
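The two examples above can be reproduced with a short sketch. The function names are hypothetical and the rules are deliberately simplistic (for instance, every character other than A to Z is removed from names, which would also strip accented letters); real cleaning rules are likely to be more extensive. The date parser assumes English month names.

```python
import re
from datetime import datetime


def standardise_name(raw):
    """Upper-case a name and drop spaces and punctuation (simplified rule)."""
    return re.sub(r"[^A-Z]", "", raw.upper())


def standardise_date(raw):
    """Convert a date such as '12th August 1982' to YYYYMMDD."""
    # Remove ordinal suffixes ('st', 'nd', 'rd', 'th') that follow the day.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", raw.strip(), flags=re.IGNORECASE)
    # %B matches the full month name in the current locale (English assumed).
    return datetime.strptime(cleaned, "%d %B %Y").strftime("%Y%m%d")


print(standardise_name("MC DONALD"))
print(standardise_date("12th August 1982"))
```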

 

2. Load Demographic Tables

The demographic details are loaded into tables in a relational database. There are different tables for different datasets because not all datasets have the same variables (the specific data items that are collected for health records, e.g. name, address, date of birth, sex). Researchers can apply to obtain certain variables from DOHWA data collections; the release of these variables is decided by the individual data custodians of those collections.
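A minimal sketch of the "one table per dataset" idea follows, using SQLite from the Python standard library purely for convenience. The table names, column names and sample rows are hypothetical; the source does not say which database product or schema is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only

# Hypothetical layouts: each dataset's table holds only the variables it collects.
conn.executescript("""
CREATE TABLE hospital_demog (
    record_id   TEXT PRIMARY KEY,
    first_name  TEXT,
    family_name TEXT,
    dob         TEXT,   -- YYYYMMDD
    postcode    TEXT,
    umrn        TEXT    -- hospital medical record number
);
CREATE TABLE cancer_demog (
    record_id   TEXT PRIMARY KEY,
    first_name  TEXT,
    family_name TEXT,
    dob         TEXT,   -- YYYYMMDD
    suburb      TEXT
);
""")

conn.execute(
    "INSERT INTO hospital_demog VALUES (?, ?, ?, ?, ?, ?)",
    ("H1", "ANN", "MCDONALD", "19820812", "6000", "U123"),
)
conn.execute(
    "INSERT INTO cancer_demog VALUES (?, ?, ?, ?, ?)",
    ("C1", "ANN", "MCDONALD", "19820812", "PERTH"),
)
conn.commit()


def columns(table):
    """List the column names of a table."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
```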

 

3. Run Linkage Engine and Load Links

The linkage program runs comparisons between two datasets. Linkage strategies are customised according to the individual characteristics of each dataset.

Some links pass as automatic matches, some are automatic rejections, and some fall into a "grey area" in between, where links are manually checked by Linkage Officers for validity.
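The comparison step between two datasets can be pictured as a series of passes, each one pairing up records that agree on a different combination of fields. The sketch below shows one common way to do this (blocking); the pass definitions and field names are hypothetical, and a real linkage strategy would go on to score each candidate pair before deciding whether it is a match, a rejection or a "grey area" case.

```python
from collections import defaultdict

# Hypothetical passes: each pass blocks on a different combination of fields.
PASSES = [
    ("family_name", "dob"),
    ("first_name", "dob"),
    ("family_name", "postcode"),
]


def candidate_pairs(dataset_a, dataset_b, passes=PASSES):
    """Return the (id_a, id_b) pairs that share a blocking key in any pass."""
    pairs = set()
    for fields in passes:
        # Index dataset B by the blocking key for this pass.
        index = defaultdict(list)
        for rec in dataset_b:
            key = tuple(rec.get(f) for f in fields)
            if None not in key:
                index[key].append(rec["id"])
        # Look up each record of dataset A in that index.
        for rec in dataset_a:
            key = tuple(rec.get(f) for f in fields)
            if None in key:
                continue  # records missing a blocking field skip this pass
            for id_b in index.get(key, []):
                pairs.add((rec["id"], id_b))
    return pairs
```

Using several passes means that a pair missed by one blocking key (for example, because of a misspelt family name) can still be picked up by another.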

With more than 1.2 million records, on average, being linked every week by the WA Data Linkage Branch (DLB), and a dynamic and constantly changing system, it is important to ensure that the links we make between records and chains are of the highest quality. DLB is the specialist team at the Department of Health responsible for developing and maintaining the WA Data Linkage System, performing data linkage, and facilitating access to linked data. There are many ways to assess the quality of both existing and proposed links, and DLB employs a variety of strategies and tools to ensure that our linkage system contains the highest quality links. These are detailed in DLB’s Linkage Quality Paper.

Linkages are also regularly revisited to ensure that the system of links is continually refined and improved.

 

4. Extract Linkage Keys

Customised, project-specific linkage keys are extracted by encrypting the "master ID" for each chain of records. These are the keys to which the various data collections attach service data.
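To show why project-specific keys are useful, the sketch below derives a key from a master ID and a per-project secret. Note the substitution: it uses a keyed hash (HMAC-SHA256 from the Python standard library), which is one-way, as a stand-in for the encryption mentioned above, whose actual method is not described here. The function name, secrets and key length are all hypothetical.

```python
import hashlib
import hmac


def project_linkage_key(master_id, project_secret):
    """Derive a project-specific key from a chain's master ID.

    Stand-in only: a keyed hash cannot be reversed, whereas the process
    described in the text uses encryption.
    """
    digest = hmac.new(project_secret, master_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16].upper()


# The same chain receives a different key in each project, so extracts
# released to two projects cannot be joined on the key alone.
key_project_a = project_linkage_key("M0001", b"hypothetical-secret-A")
key_project_b = project_linkage_key("M0001", b"hypothetical-secret-B")
```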

 

The Extraction Process

The diagram below shows how linked data is extracted:

[Diagram: the linked data extraction process]

1. Identify Study Population

First, the study population is selected, either via linkage, where the researcher already has the study population chosen, or via selection from one or more of the health data collections. For example:

  • All people who went to hospital for a colonoscopy (from HMDC)
  • All people with colorectal cancer (from WA Cancer Registry)
  • People in both these groups (from both HMDC and WA Cancer Registry).

Control populations are also identified (e.g. random sample of people from the electoral roll who are the same age and gender as the cases).
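One simple way to picture the selection of controls is to draw, for each case, a random sample of people with the same age and gender. The sketch below is only an illustration of that idea: the function name, record layout and the choice to sample without replacement are assumptions, not a description of how control populations are actually drawn.

```python
import random
from collections import defaultdict


def select_controls(cases, roll, per_case=1, seed=0):
    """Pick up to `per_case` controls per case, matched on age and gender."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    case_ids = {case["id"] for case in cases}

    # Group eligible people (anyone who is not a case) by age and gender.
    pool = defaultdict(list)
    for person in roll:
        if person["id"] not in case_ids:
            pool[(person["age"], person["gender"])].append(person)
    for group in pool.values():
        rng.shuffle(group)

    # Draw controls without replacement, so nobody is used twice.
    matched = {}
    for case in cases:
        group = pool[(case["age"], case["gender"])]
        take = min(per_case, len(group))
        matched[case["id"]] = [group.pop()["id"] for _ in range(take)]
    return matched
```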

 

2. Extract linkage keys

Once the study population is defined, the linkage team extracts the linkage keys for each requested dataset. The Project Manager then distributes these lists of keys to the relevant data collections for the service data to be attached.

 

3. Attach service data

The Data Custodians arrange for the requested service data from their collection to be attached to the linkage keys.

For some Data Collections, DLB can perform this process using the Custodian Administered Research Extract Server (CARES), a DLB initiative that streamlines linked data extraction, quality control and delivery services. There is more information about CARES in DLB’s technical paper, "The custodian administered research extract server: 'improving the pipeline' in linked data delivery systems" (2014, PDF).

For core data collections, the files are sent back to the DLB Project Manager. For some external datasets, the service data is released directly to the researcher.

 

4. Checking

The service data files come back in various formats. The DLB Project Manager arranges for a DLB analyst to check that the data matches the request and to convert all the data to fixed-width text files. Supporting documentation is also written to describe the data requested.
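The conversion to fixed-width text can be sketched as follows. The layout (field names and widths) and the sample row are hypothetical; in practice the layout would come from the documentation written for each extract.

```python
import csv
import io

# Hypothetical layout: (field name, width) in output order.
LAYOUT = [("linkage_key", 16), ("admission_date", 8), ("diagnosis", 6)]


def to_fixed_width(csv_text, layout=LAYOUT):
    """Convert comma-delimited text with a header row into fixed-width lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        parts = []
        for name, width in layout:
            value = (row.get(name) or "").strip()
            if len(value) > width:
                # Fail loudly rather than silently truncating a value.
                raise ValueError(f"{name!r} value {value!r} exceeds width {width}")
            parts.append(value.ljust(width))  # pad with spaces on the right
        lines.append("".join(parts))
    return lines


sample = "linkage_key,admission_date,diagnosis\nABC123,20200115,C18.9\n"
fixed_lines = to_fixed_width(sample)
```

Every output line has the same length (the sum of the widths), which is what allows a layout file to describe where each field starts and ends.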

 

5. Data release

The DLB Project Manager prepares the data for release by encrypting it and burning it to a disc, then arranges for secure delivery to the researcher.

 

6. Linked Data Preview

Linked data files are usually provided as delimited text files. They will be encrypted/password protected using WinZip, 7-Zip or the DLB's own encryption program. Encryption transforms information so that it is unrecognisable; the transformation can only be reversed (decrypted) by a person with the same secret key used to encrypt the original data. On receipt of the data, researchers will also be given an information form and a WA Data Linkage Branch Project Report form.

The following are examples of files a researcher might receive. Please see the data dictionaries on the Downloads page for detailed information on each dataset.

Data file

[Image: example data file]

Layout file

[Image: example layout file]

Mapping file

[Image: example mapping file]