Score calculation in HaploGrep

In this blog post we show how HaploGrep’s default measure (the so-called Kulczynski measure) works on a concrete example. (Note: this blog post has been updated after receiving a correction from Chris Simpson. Thanks!)

So let’s say this is your input sample in hsd format:

test 16024-16569;1-576; ? 73G 263G 285T 315.1C 455.1T 523G 524T 16051G 16129A 16188.1C 16249C 16264G

The Kulczynski measure is defined as follows:
(HaplogroupWeight + SampleWeight) * 0.5

HaploGrep applies this formula to all haplogroups in Phylotree and finally returns the overall best hit. In this example, I’ll calculate the measure only for the best hit, which in our case is U1a2.

1) First, we have to calculate the HaplogroupWeight:

HaplogroupWeight = FoundPolymorphismsWeight/ExpectedPolymorphismsWeight

Found Polymorphisms: Polymorphisms from the input sample that are detected (or found) in the currently tested haplogroup (i.e. in our case U1a2)

Expected Polymorphisms: Polymorphisms that are included (or expected) in the currently tested haplogroup (i.e. in our case U1a2).

Found polymorphisms + weights:
455.1T (6.7), 263G (8.8), 285T (10.0), 16249C (4.5), 16129A (2.6), 73G (5.6)

Expected polymorphisms + weights:
455.1T (6.7), 263G (8.8), 285T (10.0), 16189C (2.0), 16249C (4.5), 16129A (2.6), 73G (5.6)

FoundPolymorphismsWeight:
6.7 + 8.8 + 10 + 4.5 + 2.6 + 5.6 = 38.2

ExpectedPolymorphismsWeight:
6.7 + 8.8 + 10 + 2 + 4.5 + 2.6 + 5.6 = 40.2

HaplogroupWeight: 41.5 / 43.5 = 0.9540229885

As you can see, 16189C is the only polymorphism that is expected by the haplogroup but not found in the sample.
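Step 1 boils down to a set comparison, which can be sketched in Python as follows. This is only an illustration, not HaploGrep's actual code: the variable names are made up for this example, the weights are the ones listed above, and the sample polymorphisms are copied from the hsd line at the top of the post. The sketch reproduces the two sums written out above (38.2 and 40.2).

```python
# Expected polymorphisms of U1a2 with the weights listed in this post.
expected = {
    "455.1T": 6.7, "263G": 8.8, "285T": 10.0, "16189C": 2.0,
    "16249C": 4.5, "16129A": 2.6, "73G": 5.6,
}

# Polymorphisms of the input sample, copied from the hsd line.
sample = {
    "73G", "263G", "285T", "315.1C", "455.1T", "523G", "524T",
    "16051G", "16129A", "16188.1C", "16249C", "16264G",
}

# Found = expected by the haplogroup AND present in the sample.
found = {poly: weight for poly, weight in expected.items() if poly in sample}

# Not found = expected by the haplogroup but missing in the sample.
not_found = sorted(set(expected) - sample)

# Round to one decimal to hide floating-point noise in the sums.
found_weight = round(sum(found.values()), 1)
expected_weight = round(sum(expected.values()), 1)

print(found_weight, expected_weight, not_found)  # 38.2 40.2 ['16189C']
```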

2) Second, we have to calculate the SampleWeight:

SampleWeight = FoundPolymorphismsWeight/SamplePolymorphismsWeight

Found Polymorphisms: see above

Sample Polymorphisms: Polymorphisms that are included in the sample and fall within the specified range (2nd column of hsd, in our case 16024-16569;1-576;)

Sample polymorphisms (weights):
16264G (0.0), 16188.1C (0.0), 263G (8.8), 311C (0.0), 16129A (2.6), 16051G (4.5), 455.1T (6.7), 523G (0.0), 285T (10.0), 524T (0.0), 16249C (4.5), 73G (5.6)

SamplePolymorphismsWeight:
8.8 + 2.6 + 4.5 + 6.7 + 10 + 4.5 + 5.6 = 42.7

SampleWeight: 41.5 / 46 = 0.90217391304

As you can see, 16051G is included in the sample but not required by the haplogroup.
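The range check in step 2 can be sketched like this. Again, this is a rough illustration under a few assumptions rather than HaploGrep's implementation: the helper names (parse_ranges, position, in_range) are hypothetical, a range entry without a dash is treated as a single position, and the weights are the ones from the sample list above.

```python
import re

def parse_ranges(range_field):
    """Turn an hsd range field such as '16024-16569;1-576;' into (start, end) pairs."""
    pairs = []
    for part in range_field.split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece after a trailing ';'
        start, _, end = part.partition("-")
        # Assumption: an entry without '-' denotes a single position.
        pairs.append((int(start), int(end or start)))
    return pairs

def position(poly):
    """Leading digits of a polymorphism, e.g. '16188.1C' -> 16188."""
    return int(re.match(r"\d+", poly).group())

def in_range(poly, ranges):
    pos = position(poly)
    return any(start <= pos <= end for start, end in ranges)

# Sample polymorphisms with the weights listed in this post.
sample_weights = {
    "16264G": 0.0, "16188.1C": 0.0, "263G": 8.8, "311C": 0.0,
    "16129A": 2.6, "16051G": 4.5, "455.1T": 6.7, "523G": 0.0,
    "285T": 10.0, "524T": 0.0, "16249C": 4.5, "73G": 5.6,
}

ranges = parse_ranges("16024-16569;1-576;")
in_range_weights = {p: w for p, w in sample_weights.items() if in_range(p, ranges)}
sample_weight_total = round(sum(in_range_weights.values()), 1)

print(sample_weight_total)  # 42.7
```

In this example every polymorphism lies inside the control-region range, so nothing is filtered out. A coding-region variant would be dropped before the weights are summed.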

3) Third, use the calculated weights to compute the final measure:
(HaplogroupWeight + SampleWeight) * 0.5
(0.90217391304 + 0.9540229885) / 2 = 0.92809845077

The best hit (U1a2) has a quality of 0.928. We have also integrated this example as an automated test case.
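Put together, the three steps fit into a few lines of Python. This is a minimal sketch and not the code HaploGrep runs: the function name kulczynski and its parameter names are made up here, and it assumes both totals in the denominators are greater than zero. The call uses the totals that appear in the two ratios above (41.5, 43.5 and 46).

```python
def kulczynski(found_weight, expected_weight, sample_weight):
    """Average of HaplogroupWeight and SampleWeight, as defined in this post.

    Assumes expected_weight and sample_weight are both > 0.
    """
    haplogroup_weight = found_weight / expected_weight  # step 1
    sample_weight_ratio = found_weight / sample_weight  # step 2
    return 0.5 * (haplogroup_weight + sample_weight_ratio)  # step 3

quality = kulczynski(found_weight=41.5, expected_weight=43.5, sample_weight=46.0)
print(round(quality, 3))  # 0.928
```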

Keep in mind that in a real-life scenario the quality is calculated for all haplogroups, the results are sorted, and the best 50 hits are returned.
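The ranking itself is a plain "sort by quality, keep the top 50" step. A small sketch, assuming the qualities are already available in a dictionary; the function name best_hits is hypothetical, and only the U1a2 value comes from the worked example, the other scores are made up for illustration.

```python
import heapq

def best_hits(scores, limit=50):
    """Return the `limit` highest-scoring (haplogroup, quality) pairs, best first."""
    return heapq.nlargest(limit, scores.items(), key=lambda item: item[1])

# Only the U1a2 value is taken from this post; the others are invented.
scores = {"U1a2": 0.928, "U1a": 0.90, "U1": 0.85, "H": 0.55}

top = best_hits(scores, limit=3)
print(top[0])  # ('U1a2', 0.928)
```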

Hope this helps to understand the measure.

 

Errors and Warnings in HaploGrep2

What triggers an Error and what is considered a Warning in HaploGrep2? Here’s a short overview of the events:

Errors:

  • The detected haplogroup quality is low. Sample is marked red. Quality <=80%
  • The expected haplogroup is not a super group of the detected haplogroup
  • Common rCRS polymorphism not found! The sample seems not properly aligned to rCRS.
  • The sample seems to be aligned to RSRS. Haplogrep only supports rCRS aligned samples.
  • The sample misses >2 expected polymorphisms

Warnings:

  • Fasta Alignment check:  positions to recheck
  • The detected haplogroup does not match the expected haplogroup but represents a valid sub haplogroup
  • The sample shows ambiguous best results
  • The sample contains heteroplasmic positions
  • The detected haplogroup quality is moderate. Sample is marked yellow. Quality <= 90% and > 80%.
  • The sample contains polymorphisms that are equal to the reference / The sample contains variants according to the rCRS
  • The sample contains >2 global private mutation(s) that are not known by Phylotree
  • The sample contains >=2 local private mutation(s) associated with other Haplogroups
  • Different haplogroup with 2 local private remaining mutation(s) found
  • The sample contains undetermined variants N
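The two quality-based events from the lists above (red at 80% or below, yellow above 80% and up to 90%) can be written as a tiny helper. This is just a sketch of the thresholds: the function name quality_flag is hypothetical, the quality is assumed to be given as a fraction between 0 and 1, and the "ok" label for everything above 90% is an assumption of this example, since the lists above only name the red and yellow cases.

```python
def quality_flag(quality):
    """Map a haplogroup quality (0..1) to the flag described in the lists above."""
    if quality <= 0.80:
        return "error"    # low quality, sample is marked red
    if quality <= 0.90:
        return "warning"  # moderate quality, sample is marked yellow
    return "ok"           # assumption: no quality-related message above 90%

print(quality_flag(0.928))  # ok
print(quality_flag(0.85))   # warning
print(quality_flag(0.80))   # error
```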
 

HaploGrep2 Update 2.1.0

We just updated HaploGrep2 with the following minor changes:

  • The export of the extended report (Export / Haplogroup Extended (txt)) now includes the “Found_Polys” column (see the description below).
  • The report for the potential phantom mutations got corrected, so that positions with bases according to the rCRS reference are not listed anymore.
  • The report of possible recombinations is now based on Hamming distance instead of Kulczynski distance.
  • 6 haplogroups were wrongly labeled with H2 instead of H:
    Until 2.0.3      Correct in 2.1.0
    H2+195           H+195
    H2+195+146       H+195+146
    H2+152           H+152
    H2+16129         H+16129
    H2+16291         H+16291
    H2+13708         H+13708
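If you still have result files created with version 2.0.3 or earlier, the six labels can be corrected with a simple lookup. This is a small sketch, not part of HaploGrep: the names RELABELED and corrected_label are made up for this example, and the mapping contains only the six labels listed above.

```python
# Labels used until 2.0.3 -> corrected labels in 2.1.0 (from the list above).
RELABELED = {
    "H2+195": "H+195",
    "H2+195+146": "H+195+146",
    "H2+152": "H+152",
    "H2+16129": "H+16129",
    "H2+16291": "H+16291",
    "H2+13708": "H+13708",
}

def corrected_label(haplogroup):
    """Return the 2.1.0 label; haplogroups not in the list are left unchanged."""
    return RELABELED.get(haplogroup, haplogroup)

print(corrected_label("H2+152"))  # H+152
print(corrected_label("U1a2"))    # U1a2
```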

Currently, the following columns are included:

SampleID – the identifier of the sample

Range – Sequenced / Genotyped positions on the mitochondrial genome

Haplogroup – resulting Haplogroup

Cluster – if first hit is ambiguous, the result of the cluster is listed in this column

Overall_Rank – the haplogrouping score (from 0.5 to 1), where 0.5 indicates that no SNPs were found and 1 is a perfect match. Now always with “.” as decimal separator

Not_found_Polys – false negatives: mutations expected in this haplogroup but not found

Found_Polys – true positives: mutations found for the resulting haplogroup. Backmutations are considered as well, indicated by ! (see 182T! or 195C! in Sample Africa01)

Remaining_Polys – variants not used for this haplogroup classification. They may indicate a) a possibly new haplogroup, or b) possible sample admixture or phantom mutations (false positives). Listed here are hotspot mutations, local private mutations (found in at least one other haplogroup), global private mutations (unknown in the current phylogeny), heteroplasmic mutations, and reference-identical positions (the latter is often the case for microarray-based data).

AAC_Remaining – the remaining variants from the previous column are checked and marked if they are involved in an amino acid change.

Input_Sample – the profile used for the classification
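When post-processing the Found_Polys column, the back mutations can be picked out by their trailing "!". A minimal sketch, assuming the polymorphisms in the column are separated by whitespace (the separator is not specified above); the function name split_found_polys is hypothetical, and the example string is put together from polymorphisms mentioned in this blog, not copied from a real report.

```python
def split_found_polys(field):
    """Split a Found_Polys value into (polymorphism, is_back_mutation) pairs."""
    result = []
    for token in field.split():  # assumption: whitespace-separated entries
        is_back_mutation = token.endswith("!")
        result.append((token.rstrip("!"), is_back_mutation))
    return result

# Illustrative value only, not taken from an actual export.
pairs = split_found_polys("73G 182T! 195C! 263G")
print(pairs)
# [('73G', False), ('182T', True), ('195C', True), ('263G', False)]
```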

 

HaploGrep 2 Stand-alone Version

With some delay, we have finally updated the stand-alone version of HaploGrep to version 2. It includes the latest Phylotree 17 (Forensic Science International: Genetics Supplement Series, December 2015), which refines the human phylogeny even further. This version has no file-upload limit like the one currently applied on the web service (file size of 5 MB and max. 3,000 samples; you can use compressed files in zip format, though). We also provide a command-line version of HaploGrep2, which makes it straightforward to integrate it into your workflows or pipelines directly. Alternatively, you can use the REST API to do so.

Here’s the direct link to the Download Page – enjoy – and don’t hesitate to contact us in case of questions, suggestions, or any kind of problems.

And here’s the evolution of HaploGrep’s sessions per month from Google Analytics, together with the releases of the Phylotree versions:

[Figure: haplogrep_phylotree – HaploGrep sessions per month and Phylotree releases]