Score calculation in HaploGrep

In this blog post we show how HaploGrep’s default measure (the so-called Kulczynski measure) works on a concrete example. (Note: this blog post has been updated after receiving a correction from Chris Simpson, thanks!)

So let’s say this is your input sample in hsd format:

test 16024-16569;1-576; ? 73G 263G 285T 315.1C 455.1T 523G 524T 16051G 16129A 16188.1C 16249C 16264G

The Kulczynski measure is defined as follows:
(HaplogroupWeight + SampleWeight) * 0.5

HaploGrep applies this formula to all haplogroups in Phylotree and finally returns the overall best hit. In this example, I’ll calculate the measure only for the best hit, which in our case is U1a2.

1) First, we have to calculate the HaplogroupWeight:

HaplogroupWeight = FoundPolymorphismsWeight/ExpectedPolymorphismsWeight

Found Polymorphisms: Polymorphisms from the input sample that are detected (or found) in the currently tested haplogroup (i.e. in our case U1a2)

Expected Polymorphisms: Polymorphisms that are included (or expected) in the currently tested haplogroup (i.e. in our case U1a2).

Found polymorphisms + weights:
455.1T (6.7), 263G (8.8), 285T (10.0), 16249C (4.5), 16129A (2.6), 73G (5.6)

Expected polymorphisms + weights:
455.1T (6.7), 263G (8.8), 285T (10.0), 16189C (2.0), 16249C (4.5), 16129A (2.6), 73G (5.6)

6.7 + 8.8 + 10 + 4.5 + 2.6 + 5.6 = 38.2

6.7 + 8.8 + 10 + 2 + 4.5 + 2.6 + 5.6 = 40.2

HaplogroupWeight: 41.5 / 43.5 = 0.9540229885

As you can see, only 16189C is expected by the haplogroup but not found in the sample.

2) Second, we have to calculate the SampleWeight:

SampleWeight = FoundPolymorphismsWeight/SamplePolymorphismsWeight

Found Polymorphisms: see above

Sample Polymorphisms: Polymorphisms that are included in the sample and fall within the specified range (2nd column of hsd, in our case 16024-16569;1-576;)
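The range check described above can be sketched in a few lines of Java. This is only an illustration, not HaploGrep's actual implementation: the class and method names (RangeFilter, parseRanges, position, inRange) are made up for this example, and the parsing assumes the simple "start-end;" notation shown in the hsd line.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeFilter {

    /** Parses an hsd-style range field such as "16024-16569;1-576;" into {start, end} pairs. */
    public static List<int[]> parseRanges(String rangeField) {
        List<int[]> ranges = new ArrayList<>();
        for (String part : rangeField.split(";")) {
            part = part.trim();
            if (part.isEmpty()) {
                continue;
            }
            String[] bounds = part.split("-");
            int start = Integer.parseInt(bounds[0].trim());
            // A single position without "-" is treated as a one-base range (assumption).
            int end = bounds.length > 1 ? Integer.parseInt(bounds[1].trim()) : start;
            ranges.add(new int[] { start, end });
        }
        return ranges;
    }

    /** Extracts the leading numeric position from a polymorphism such as "73G" or "315.1C". */
    public static int position(String polymorphism) {
        int end = 0;
        while (end < polymorphism.length() && Character.isDigit(polymorphism.charAt(end))) {
            end++;
        }
        return Integer.parseInt(polymorphism.substring(0, end));
    }

    /** Returns true if the polymorphism's position lies inside any of the given ranges. */
    public static boolean inRange(String polymorphism, List<int[]> ranges) {
        int pos = position(polymorphism);
        for (int[] range : ranges) {
            if (pos >= range[0] && pos <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<int[]> ranges = parseRanges("16024-16569;1-576;");
        for (String p : new String[] { "73G", "315.1C", "16051G", "16188.1C" }) {
            System.out.println(p + " in range: " + inRange(p, ranges));
        }
    }
}
```

All polymorphisms of the example sample lie inside the two ranges, so none of them is filtered out here.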

Sample polymorphisms (weights):
16264G (0.0), 16188.1C (0.0), 263G (8.8), 311C (0.0), 16129A (2.6), 16051G (4.5), 455.1T (6.7), 523G (0.0), 285T (10.0), 524T (0.0), 16249C (4.5), 73G (5.6)

8.8 + 2.6 + 4.5 + 6.7 + 10 + 4.5 + 5.6 = 42.7

SampleWeight: 41.5 / 46 = 0.90217391304

As you can see, 16051G is included in the sample but not required by the haplogroup.

3) Third, use the calculated weights to compute the final measure:
(HaplogroupWeight + SampleWeight) * 0.5
(0.90217391304 + 0.9540229885) / 2 = 0.92809845077

The best hit (U1a2) has a quality of 0.928. We also integrated that as an automatic test case here.
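The three steps can be condensed into a small Java sketch. It is not taken from the HaploGrep code base; the class and method names are hypothetical. It uses the totals that appear in the two ratios of this post (41.5, 43.5 and 46). Note that the individually listed weights add up to 38.2, 40.2 and 42.7, each 3.3 lower; the sketch sticks to the totals from the ratios because they yield the quoted result of 0.928.

```java
import java.util.Locale;

public class KulczynskiExample {

    /**
     * Kulczynski measure as defined in the post:
     * (found / expected + found / sample) * 0.5
     */
    public static double kulczynski(double foundWeight, double expectedWeight, double sampleWeight) {
        double haplogroupWeight = foundWeight / expectedWeight;
        double sampleRatio = foundWeight / sampleWeight;
        return (haplogroupWeight + sampleRatio) * 0.5;
    }

    public static void main(String[] args) {
        // Totals as used in the ratios of the worked example for U1a2.
        double score = kulczynski(41.5, 43.5, 46.0);
        System.out.println(String.format(Locale.ROOT, "%.3f", score)); // prints 0.928
    }
}
```

The sketch does not guard against a zero denominator (an empty haplogroup or sample); a real implementation would have to decide how to score that case.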

Keep in mind that in a real-life scenario the quality is calculated for all haplogroups, the results are sorted, and the best 50 hits are returned.
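A minimal Java sketch of this "score everything, sort, keep the top hits" step could look as follows. The class name RankHits, the method bestHits and the scores for U1a and U1 are invented for illustration; only the 0.928 for U1a2 comes from the example above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RankHits {

    /** Sorts haplogroups by score (highest first) and keeps at most `limit` entries. */
    public static List<Map.Entry<String, Double>> bestHits(Map<String, Double> scores, int limit) {
        List<Map.Entry<String, Double>> hits = new ArrayList<>(scores.entrySet());
        hits.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return new ArrayList<>(hits.subList(0, Math.min(limit, hits.size())));
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("U1a2", 0.928); // from the worked example
        scores.put("U1a", 0.90);   // hypothetical value
        scores.put("U1", 0.85);    // hypothetical value

        // HaploGrep returns the best 50 hits; with three entries all of them are kept.
        for (Map.Entry<String, Double> hit : bestHits(scores, 50)) {
            System.out.println(hit.getKey() + "\t" + hit.getValue());
        }
    }
}
```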

Hope this helps to understand the measure.


HaploGrep 2 Stand-alone Version

With some delay, we have finally updated the stand-alone version of HaploGrep to the latest version 2. It includes the latest Phylotree 17 (Forensic Science International: Genetics Supplement Series, from December 2015), which refines the human phylogeny even further. This version has no file-upload limit, unlike the web service, where the current limits are a file size of 5 MB and max. 3,000 samples (you can use compressed files in zip format, though). We also provide a command-line version of haplogrep2, which makes it straightforward to integrate it directly into your workflows or pipelines. Alternatively, you can use the REST API for this.

Here’s the direct link to the Download Page – enjoy – and don’t hesitate to contact us in case of questions, suggestions, or any kind of problems.

And here’s the evolution of HaploGrep’s sessions per month from Google analytics, with the release of the Phylotree Versions:




New HaploGrep Exports in Detail

Here’s an overview of the new export options that HaploGrep 2 offers when you click on the small arrow next to Export:


Missing a specific export format? Feel free to contact us!


HaploGrep’s REST API now available!

Hi all,
due to many requests we are happy to announce that we provide Haplogrep’s REST API to the public! This will allow everyone to determine haplogroups in a very convenient way. In the following snippets we show how a simple call works (a) from the UNIX command line using curl and (b) from Java. If you have an example running for other languages, please let us know and we will add it here.

Happy haplogrouping & happy new year!

1) Unix Command Line: This call uploads the file myfile.vcf and returns a JSON string including the fields “id” and “haplogroup”:

curl -i -X POST -H "Content-Type: multipart/form-data" -F "importfile=@myfile.vcf"

2) Here you can see exactly the same call in Java:

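import java.io.File;
import java.io.IOException;

import org.restlet.data.MediaType;
import org.restlet.data.Status;
import org.restlet.resource.*;
import org.restlet.ext.html.*;
import org.restlet.representation.*;
import org.json.*;

public class SampleClient {
    public static void main(String[] args) throws IOException {
        // change location here
        File file = new File("/home/seb/samplefile.hsd");

        // POST file (put the URL of the REST endpoint into the empty string)
        ClientResource cr = new ClientResource("");
        final FormDataSet fds = new FormDataSet();
        final FormData fileRep = new FormData("importfile", new FileRepresentation(file, MediaType.APPLICATION_ALL));
        // Mark the form as multipart, attach the file and send it
        // (these three calls are reconstructed and assume Restlet 2.x)
        fds.setMultipart(true);
        fds.getEntries().add(fileRep);
        cr.post(fds);

        // HTTP status of the call
        Status status = cr.getResponse().getStatus();
        System.out.println("Status: " + status);

        // The response is a JSON array with one object per sample
        JSONArray jsonArray = new JSONArray(cr.getResponse().getEntityAsText());
        for (int i = 0; i < jsonArray.length(); i++) {
            JSONObject object = jsonArray.getJSONObject(i);
            String id = object.getString("id");
            String hg = object.getString("haplogroup");
            System.out.println(id + "\t" + hg);
        }
    }
}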


HaploGrep 2.0 is ready!

Over the last few years, HaploGrep has become the de facto standard for automatic haplogroup classification (~18,000 users, cited over 140 times, about 120 local installations) and is also used in several commercial systems and research pipelines. A lot of work has been done beneath the surface of HaploGrep, especially to improve our haplogroup classification performance and to keep up with the latest requirements. After almost a year in beta (see entry from Sept 2014), we think it’s finally time to replace the initial version of HaploGrep with the new and improved version HaploGrep 2. We hope you like the new version and would appreciate any kind of feedback!

These are the major improvements:

  • Improved classification algorithm, resulting in a 20x speed-up!
  • HaploGrep now includes a rule-based engine. The two new columns “warnings” (W) and “errors” (E) show abnormalities in the input file detected by the new engine. We very much appreciate the input and suggestions from Hans Bandelt and Antonio Salas!
  • New Import Formats (VCF + FASTA) supported
  • Updated to the latest security standards on server side. So we are finally back on Firefox!
  • Apply different ranking algorithms (e.g. Jaccard, Hamming Distance) besides our default ranking algorithm, the Kulczynski distance. These new ranking algorithms will be introduced one by one and are therefore currently disabled.
  • Provide HaploGrep also as a command line version (included in mtDNA-Server)
  • Direct support of VCF files through the Htsjdk library.

New export formats supported:

Points we (currently) removed from the beta:

  • Removed direct support of heteroplasmic sites (Y,R)
  • How to use the REST-API.
  • Fasta Import is available (open a *.fasta file!) but still in beta.

Here’s the updated version: HaploGrep2


HaploGrep Export Formats for Phylogenetic trees

HaploGrep 2.0 Beta allows export in multiple alignment fasta format. With the new version, generating phylogenetic trees therefore becomes straightforward. Besides its own phylogenetic tree directly based on Phylotree (see previous blog entry), we present here the basic steps to generate phylogenetic trees based on multiple alignment fasta files by using MrBayes, Neighbor Joining or Maximum Likelihood. For this purpose we recommend Ugene, which is a very powerful toolset, and not only for Next-Gen sequencing projects. The following steps show how simple this process can be: … 


NGS, HaploGrep and Hadoop MapReduce

Many of our HaploGrep users are interested in analysing data within automated pipelines. To support users with a scalable and standardized service, we are happy to present our newest project, called mtDNA-Server.
mtDNA-Server provides a free service for the complete workflow of NGS mtDNA projects. It includes the alignment of FASTQ SE/PE data based on BWA MEM (using JBWA), sorting of data, creation of BAM files, heteroplasmy detection, contamination identification, haplogroup assignment (using HaploGrep, of course) and graphical report creation based on R.
All steps are parallelized using Hadoop MapReduce. As a result, we are able to analyse 800 samples from 1000G Phase 1 (27 GB) on our 30-core test cluster within 30 minutes. To simplify the execution of MapReduce jobs and provide users with an intelligent workflow system, we use our MapReduce framework Cloudgene. Cloudgene controls the complete workflow and provides an intelligent queuing system in the background. mtDNA-Server will be available as a download in the near future.

For now, we are still in beta and would very much appreciate any kind of feedback!!
Here’s the link:


New Feature: Tree-based visualisation of haplogroups

The new HaploGrep 2.0 Beta version supports some new export formats. One of the most powerful is the export of a phylogenetic tree representing the profiles currently loaded into HaploGrep. This feature generates almost publication-ready phylogenetic trees. We used them with almost no modification (some color-highlighting in Inkscape), please see [1].
