HDR (Hamming Distance Ratio) 2


A comprehensive program for genome scanning by the HDR method (version 2)

Akihiro Nakaya, Atsuko Imai-Okazaki, and Jurg Ott

Last updated: Feb 18, 2018

The HDR2 program is a comprehensive tool, based on HDR (Hamming distance ratio), for prioritizing variants/regions in exome sequencing data that are likely to be pathogenic. In the HDR approach, we calculate the "difference" in genotypes between an affected individual and control individuals using HDR over target regions, and rank those regions according to a suitable test statistic. The HDR2 program currently includes three modes:

  1. AR mode (HDR program): prioritizes candidate homozygous variants in autosomal recessive families. HDR is defined as the difference in homozygous status between a patient and control individuals (Imai et al., 2015 and Imai et al., 2016).
  2. DD mode: prioritizes chromosomal deletion regions. HDR is defined as the difference in heterozygous status between a patient and control individuals (Imai-Okazaki et al., 2017).
  3. AD mode: HDR is defined as the difference across all genotypes between a patient and control individuals (in preparation).

In all three modes, HDR scores are computed between all pairs of subjects, including both the case subject and the control subjects. This yields a test statistic that compares two types of HDR: HDR between case and control individuals, and HDR between control individuals. The candidate variants/regions are then ranked according to the test statistic, using the maxstatRS program by Prof. Jurg Ott in AR mode (a GUI version of the maxstatRS program is available here) and the Statdel program by Prof. Jurg Ott in DD mode (a GUI version of the Statdel program is in preparation).
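The pairwise comparison described above can be illustrated with a minimal Python sketch. It is not the HDR2 implementation. It assumes that each subject's genotypes over one region have already been encoded as a status vector (how a genotype is reduced to a status in each mode is defined in the cited papers and is not reproduced here), and that the ratio is the Hamming distance divided by the number of positions compared. The function names and the example vectors are hypothetical.

```python
from itertools import combinations

def hamming_distance_ratio(a, b):
    """Fraction of positions at which two equal-length status vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must cover the same positions")
    if not a:
        raise ValueError("vectors must not be empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def pairwise_hdr(case, controls):
    """Return (case-vs-control HDRs, control-vs-control HDRs) for one region."""
    case_control = [hamming_distance_ratio(case, c) for c in controls]
    control_control = [
        hamming_distance_ratio(c1, c2) for c1, c2 in combinations(controls, 2)
    ]
    return case_control, control_control

# Hypothetical status vectors over one four-position region (1 = status present).
case = [1, 1, 1, 1]
controls = [[0, 1, 0, 1], [0, 1, 1, 1], [0, 0, 0, 1]]

case_control, control_control = pairwise_hdr(case, controls)
print(case_control)     # [0.5, 0.25, 0.75]
print(control_control)  # [0.25, 0.25, 0.5]
```

In this toy region the case-versus-control values tend to be larger than the control-versus-control values, which is the kind of contrast the test statistic is meant to capture. The statistic itself is computed by maxstatRS or Statdel and is not sketched here.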

Download and invoke the HDR2 executable

The JAR file on the download page is an implementation of HDR written in the Scala language. It runs on a Java 8 VM and uses JavaFX 8 for its GUI. No complicated installation is necessary; only the single JAR file is required. The HDR program requires Java 8 (or later) and works in multiple environments, including Windows, macOS (Mac OS X), and Linux. Start it by double-clicking the JAR file or by invoking it from a command line interface as follows:

% java -Xmx8G -cp hdr2fx.jar hdr2.fx

Note: the "-Xmx" option sets the maximum heap size for the Java VM. "-Xmx8G" means the size is increased up to approximately 8 GB.
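When the program is launched from a script, for example to try different heap sizes, the same command line can be assembled programmatically. The following Python sketch only builds the argument list shown above; the helper name `hdr2_command` and its parameters are hypothetical and are not part of HDR2.

```python
def hdr2_command(heap_gb=8, jar="hdr2fx.jar"):
    """Build the argument list for launching HDR2 with a given maximum heap size."""
    if heap_gb < 1:
        raise ValueError("heap size must be at least 1 GB")
    # Mirrors the documented command: java -Xmx8G -cp hdr2fx.jar hdr2.fx
    return ["java", f"-Xmx{heap_gb}G", "-cp", jar, "hdr2.fx"]

print(" ".join(hdr2_command(4)))  # java -Xmx4G -cp hdr2fx.jar hdr2.fx
```

The resulting list can be passed to `subprocess.run` from the Python standard library to start the program.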

Load the data files for computation

When the HDR2 program is invoked, an initial window for computation appears. To carry out the computation, the user must specify several input files from the "Data" menu of this initial window:

  1. The VCF file(s) for the case subject(s),
  2. The VCF file(s) for the control subject(s),
  3. The TSV file(s) for the target variant positions (AR mode and AD mode), or
  4. The TSV file(s) for the target regions (DD mode).

VCF: variant call format, TSV: tab-separated values.

In addition to full-size VCF files, VCF files produced by data slicing can also be imported.
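For readers who want to inspect the genotypes in a VCF file before loading it, the following Python sketch extracts the chromosome, position, and GT field from each record. It assumes a single-sample VCF whose sample is in the tenth column, as in the VCF specification, and it is independent of how HDR2 itself reads the files. The function name and the example records are hypothetical.

```python
def read_genotypes(lines):
    """Yield (chrom, pos, genotype) from the lines of a single-sample VCF."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the header line, and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        # FORMAT (column 9) names the colon-separated keys of the sample column.
        keys = fields[8].split(":")
        values = fields[9].split(":")
        genotype = dict(zip(keys, values)).get("GT", "./.")
        yield chrom, pos, genotype

# Hypothetical records, for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n",
    "1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP\t1/1:30\n",
    "1\t12400\t.\tC\tT\t45\tPASS\t.\tGT:DP\t0/1:28\n",
]
print(list(read_genotypes(example)))
# [('1', 12345, '1/1'), ('1', 12400, '0/1')]
```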

In DD mode, users can obtain the TSV file(s) for the target regions with the FindRun program, which generates text files listing candidate chromosomal deletion regions for different region sizes of interest. To obtain a potentially significant p-value by permutation analysis, the HDR method requires at least 20 control individuals sequenced on the same platform as the patient.
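One plausible reading of the 20-control requirement, offered here as an assumption rather than as the authors' stated rationale, is that a permutation test with one case and n controls has only n + 1 ways to choose which subject is labeled as the case, so its smallest attainable p-value is 1/(n + 1). The Python sketch below works through that arithmetic; the function name is hypothetical.

```python
from fractions import Fraction

def smallest_permutation_p(n_controls):
    """Smallest p-value attainable when one of n_controls + 1 subjects is the case."""
    if n_controls < 1:
        raise ValueError("at least one control is required")
    return Fraction(1, n_controls + 1)

for n in (10, 19, 20):
    p = smallest_permutation_p(n)
    # Print the control count, the p-value, and whether it is below 0.05.
    print(n, p, p < Fraction(5, 100))
# 10 1/11 False
# 19 1/20 False
# 20 1/21 True
```

Under this assumption, 20 is the smallest number of controls for which the p-value can fall below 0.05.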

Set the parameters for computation

For AR mode, see the "Set the parameters for computation" section for the HDR program. In DD mode, once the case file and region file have been selected, there is no need to select search window sizes. Instead, after selecting DD mode, users choose whether to include or exclude indels in the HDR calculation.

Execute and save the results

See the "Execute and save the results" section for the HDR program.

Data interpretations

Once the results are obtained, the output files can be imported into the maxstatRS program (AR mode) or the Statdel program (DD mode), where users can see all variants/regions ranked according to the test statistic. A higher rank indicates a greater difference in genotypes between the patient and the control individuals; that is, higher-ranked variants/regions are more likely to be specific to the patient.

Sample datasets

Sample datasets, which we created by modifying 1000 Genomes Project datasets, are available here.

  1. The VCF file(s) for the case subject(s): NA18939_v2.vcf
  2. The VCF file(s) for the control subject(s): NA18940_v2.vcf to NA18961_v2.vcf (20 VCF files)
  3. The TSV file(s) for the target variant positions (AR mode and AD mode): CHR_POS_NA18939_v2.txt
  4. The TSV file(s) for the target regions (DD mode): Region_NA18939_v2.txt