HDR (Hamming Distance Ratio)


A program for genome scanning by the HDR method

Akihiro Nakaya, Atsuko Imai-Okazaki, and Jurg Ott

Last updated: Feb 18, 2018

The HDR program implements the HDR (Hamming distance ratio) method, which evaluates the significance of an SNV (single nucleotide variant) in the genomic DNA sequence of a case subject of interest (Imai et al., 2015). Instead of assessing the SNV alone, the HDR method assesses the variant pattern (haplotype) of the chromosomal region that contains the SNV. It generates a value called an HDR score, which is based on the Hamming distance and indicates how much the variant pattern differs from that of the subject it is compared with.

By computing the HDR scores between all pairs of subjects in a population that includes both the case subject and the control subjects, we obtain an empirical distribution of HDR scores for evaluating the statistical significance of the variant pattern in the region flanking the SNV. The significance of the SNV is then defined by that of the variant pattern in the flanking region, which takes the local LD (linkage disequilibrium) structure around the SNV into account.

The HDR program carries out an all-against-all computation of the HDR scores among the case and control subjects and generates an HDR score matrix. The matrix is used to validate the statistical significance of the SNV found in the case subject with the maxstatRS program by Prof. Jurg Ott (a GUI version of the maxstatRS program is also available). For each SNV in a given list, the HDR program can automatically generate an HDR score matrix while varying the size of the flanking region.
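As a minimal illustration of the all-against-all structure described above, the following Python sketch computes plain Hamming distances between every pair of subjects. It is not the HDR score itself: how the distances are turned into a ratio is defined in Imai et al. (2015) and is not reproduced here. The 0/1 coding of the variant patterns, the toy data, and the function names are assumptions made for this example.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Count the positions at which two equal-length variant patterns differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(patterns):
    """Return a symmetric matrix of Hamming distances for all subject pairs."""
    n = len(patterns)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = hamming_distance(patterns[i], patterns[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical toy data: one case subject followed by three control subjects,
# each coded as 0/1 at five variant sites of the region (an assumed encoding).
patterns = [
    [1, 0, 1, 1, 0],  # case
    [0, 0, 1, 0, 0],  # control 1
    [0, 0, 1, 0, 1],  # control 2
    [0, 1, 1, 0, 0],  # control 3
]
matrix = pairwise_distances(patterns)
```

In this toy matrix, the row of the case subject holds larger distances than the rows of the controls, which is the kind of contrast the empirical distribution is meant to expose.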

Download and invoke the HDR executable

The JAR file on the download page is an implementation of HDR written in the Scala language. It runs on a Java 8 VM and uses JavaFX 8 for its GUI. No complicated installation is necessary; only the single JAR file is required. The HDR program requires Java 8 (or later) and works in multiple environments, including Windows, macOS (Mac OS X), and Linux. Start it by double-clicking the JAR file or by invoking it from a command line interface as follows:

% java -Xmx8G -cp hdrfx.jar hdr.fx

Note: The "-Xmx" option sets the maximum heap size of the Java VM. "-Xmx8G" allows the heap to grow up to approximately 8 GB.

Load the data files for computation

When the HDR program is invoked, an initial window for computation appears. To carry out the computation, the user must specify several input files from the "Data" menu of the initial window:

  1. The VCF file(s) for the case subject(s),
  2. The VCF file(s) for the control subject(s),
  3. The TSV file(s) for the target variant positions, or
  4. The TSV file(s) for the region boundaries.

VCF: variant call format, TSV: tab-separated values.

Selecting a file type in the "Data" menu (Fig. 1) opens a file chooser dialog.

Figure 1: "Data" menu.

VCF files for variants

When "Load Case VCF File(s)" is selected in the pull-down menu, the VCF files for the case subjects are loaded and shown in red tabs that display the header of each VCF file (Fig. 2). In the file chooser dialog, the user can select multiple VCF files at once or add VCF files one by one, but only one of them is used as the case subject. The case subject is selected in a later step.

In the same way, by selecting "Load Ctrl VCF File(s)", the VCF files for the control subjects are loaded and indicated in the blue tabs (Fig. 3). All the VCF files in the blue tabs are used as the control subjects (selection is not required or allowed).



Figure 2: Load the VCF file(s) for case subjects.

Figure 3: Load the VCF file(s) for control subjects.

TSV files for positions/regions

By selecting "Load Position TSV file(s)", the TSV (tab-separated values) files for the target variant positions are loaded and indicated in the green tabs (Fig. 4). Each line of those TSV files specifies the chromosome name and the chromosomal position of a target variant. Empty lines and lines starting with '#' will be skipped.

    chr1     1690888
    chr1     17025251
    chr1     17429082
    chr1     18025251
    :
Figure 4: Load the TSV file(s) for target variant positions.

Here, let p denote the chromosomal position of a target variant in bp (base pairs) and w denote the window size in bp given in a later step. The boundaries of the region are [p-w/2, p+w/2], where we assume that w is even (w%2 = 0) and that both boundaries (p-w/2 and p+w/2) are included in the region (therefore, the region size is w + 1, to be exact).
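The boundary rule above can be sketched in a few lines of Python. The function name region_bounds is hypothetical and not part of the HDR program; the sketch simply restates the definition [p-w/2, p+w/2] and does not model how the program treats boundaries that fall outside a chromosome.

```python
def region_bounds(p, w):
    """Return the inclusive boundaries (p - w/2, p + w/2) of a region.

    p: position of the target variant in bp
    w: window size in bp, assumed to be a non-negative even number
    """
    if w < 0 or w % 2 != 0:
        raise ValueError("window size w must be a non-negative even number")
    half = w // 2
    return p - half, p + half


# First position of the example TSV file with a 10 kbp (10000 bp) window.
start, end = region_bounds(1690888, 10000)
size = end - start + 1  # both boundaries are included in the region
```

For this example the region is [1685888, 1695888] and its size is 10001 bp, that is, w + 1.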

In a similar way, by selecting "Load Region TSV file(s)", the TSV (tab-separated values) files for the region boundaries are loaded and indicated in the purple tabs. With these TSV files, the user can specify the boundaries of the regions directly. Each line of these TSV files specifies the chromosome name and both boundaries of a region in bp (base pairs). Empty lines and lines starting with '#' will be skipped.

    chr1     1690888     1700888
    chr1     17025251     17035251
    chr1     17429082     17439082
    chr1     18025251     18035251
    :

TSV files can contain a header line. If the first line contains non-integer values in the columns except for the left-most one (e.g., "Chr Pos1 Pos2"), it will be regarded as a header line. At least one of these TSV files is required for computation. The selection of the TSV file for positions or regions will be made in the succeeding steps.
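The line-handling rules for the TSV files (skip empty lines and '#' lines, detect an optional header line) can be sketched as follows. This is an illustration under stated assumptions, not the parser of the HDR program: the function names are hypothetical, and the header test is applied to the first line that is not skipped, which the description above does not state explicitly.

```python
def _is_int(text):
    """Return True if the text can be read as an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def parse_tsv(lines):
    """Parse position or region TSV lines into (chromosome, int, ...) tuples."""
    rows = []
    first_data_line = True
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Empty lines and lines starting with '#' are skipped.
        if not line.strip() or line.startswith("#"):
            continue
        fields = [f.strip() for f in line.split("\t")]
        if first_data_line:
            first_data_line = False
            # Header rule: any non-integer value in a column other than
            # the left-most one marks the line as a header (assumed to be
            # checked on the first non-skipped line).
            if any(not _is_int(f) for f in fields[1:]):
                continue
        rows.append((fields[0], *(int(f) for f in fields[1:])))
    return rows


# A region file with a header, a comment, and an empty line.
regions = parse_tsv([
    "Chr\tPos1\tPos2\n",
    "# regions of interest\n",
    "\n",
    "chr1\t1690888\t1700888\n",
    "chr1\t17025251\t17035251\n",
])

# A position file without a header.
positions = parse_tsv(["chr1\t1690888\n", "chr1\t17025251\n"])
```

The same function handles both file types because the number of integer columns (one for positions, two for regions) is taken from the line itself.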

Set the parameters for computation

A VCF file for the case subject

The user must specify a VCF file for the case subject from the red pull-down menu (Fig. 5).

Figure 5: Select a VCF file for the case subject.

A TSV file for positions/regions

In a similar way, a TSV file for positions can be selected from the green pull-down menu, or a TSV file for region boundaries can be selected from the purple pull-down menu. The two kinds of TSV file are mutually exclusive alternatives: only one of them is selected. The selection determines how the region boundaries are specified, either by positions combined with a region size or by explicit boundary positions.

Iteration parameters for varying region sizes

The HDR method is executed in a loop over window sizes. The start, stop, and interval of this loop must be set in kbp (kilobase pairs) with the light green pull-down menus labeled "Start", "Stop", and "Interval", respectively (Fig. 6).

Figure 6: Set the iteration parameters.

For example, if 10, 100, and 10 are set for the respective parameters (Fig. 7), the HDR computation is performed for each target variant position with window sizes of 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 kbp.

Figure 7: An example of parameters for computation.
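The sequence of window sizes in this example can be reproduced with a short Python sketch. The function name window_sizes_kbp is hypothetical; treating "Stop" as inclusive follows the example above, and the behavior when "Stop" is not reached exactly by the interval is an assumption.

```python
def window_sizes_kbp(start, stop, interval):
    """List the window sizes in kbp from start to stop (inclusive)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return list(range(start, stop + 1, interval))


# Parameters from the example: Start = 10, Stop = 100, Interval = 10.
sizes = window_sizes_kbp(10, 100, 10)

# The same sizes in bp, the unit used for the region boundaries.
sizes_bp = [s * 1000 for s in sizes]
```

With these parameters the loop runs ten times per target variant position, so one HDR score matrix is produced for each of the ten window sizes.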

We are now ready to execute the HDR method.

Execute and save the results

Clicking the yellow button labeled "Exec & Save" opens a file chooser dialog for the output file that will store the HDR score matrices. The user can set the file name freely; by default, a name generated from the current machine time ("hdrar_scores_YYYYMMDD-HHMM.txt") is suggested (e.g., "hdr_scores_20141224-2401.txt"). When the dialog is closed, the computation starts. The program does not accept any operations during the computation. Completion of the computation is reported in the "LOG" tab (Fig. 8).

Figure 8: Execution of computation with the given parameters.
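For scripts that need to locate or mimic the default output name, the stated pattern "hdrar_scores_YYYYMMDD-HHMM.txt" can be generated as in the sketch below. The function name is hypothetical, and the sketch only follows the pattern given in this document; it has not been checked against the names the program actually writes.

```python
from datetime import datetime


def default_scores_filename(now=None):
    """Build a file name of the form hdrar_scores_YYYYMMDD-HHMM.txt."""
    if now is None:
        now = datetime.now()
    return now.strftime("hdrar_scores_%Y%m%d-%H%M.txt")


# Example with a fixed time: 18 Feb 2018, 09:05.
name = default_scores_filename(datetime(2018, 2, 18, 9, 5))
```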

Note that a chromosome name that is not in the format "chrN" (where N is an integer) is automatically replaced by one in that format (e.g., chrX and chrY are replaced by chr23 and chr24, respectively). Replacements of chromosome names are indicated in the first line of the file. Executing the program also generates a file that lists the subjects used ("hdrar_subjects_YYYYMMDD-HHMM.txt" is the default name).
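When preparing or post-processing data, the renaming rule can be mirrored with a sketch like the following. Only the two replacements stated above (chrX to chr23, chrY to chr24) are encoded; how the program renames any other non-numeric chromosome name is not described here, so such names are returned unchanged in this sketch. The function and variable names are hypothetical.

```python
import re

# Replacements stated in this document; other mappings are not known.
KNOWN_REPLACEMENTS = {"chrX": "chr23", "chrY": "chr24"}


def normalize_chromosome(name):
    """Return the name in 'chrN' form where the mapping is known."""
    if re.fullmatch(r"chr\d+", name):
        return name  # already in the format "chrN"
    # Assumption: names without a documented mapping are left unchanged.
    return KNOWN_REPLACEMENTS.get(name, name)


def collect_replacements(names):
    """Return {original: replacement} for every name that was changed."""
    changed = {}
    for name in names:
        new_name = normalize_chromosome(name)
        if new_name != name:
            changed[name] = new_name
    return changed


replacements = collect_replacements(["chr1", "chrX", "chr22", "chrY"])
```

A mapping such as the one in replacements is also what is needed to translate chr23 and chr24 in the output back to the original names.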

Limitations