NGS, HaploGrep and Hadoop MapReduce

Many of our HaploGrep users are interested in analysing data within automated pipelines. To support them with a scalable and standardized service, we are happy to present our newest project, mtDNA-Server.
mtDNA-Server provides a free service for the complete workflow of NGS mtDNA projects. It includes the alignment of FASTQ SE/PE data based on BWA MEM (using JBWA), sorting of data, creation of BAM files, heteroplasmy detection, contamination identification, haplogroup assignment (using HaploGrep, of course) and graphical report creation based on R.
All steps are parallelized using Hadoop MapReduce. As a result, we are able to analyse 800 samples from 1000G Phase 1 (27 GB) on our 30-core test cluster within 30 minutes. To simplify the execution of MapReduce jobs and provide users with an intelligent workflow system, we use our MapReduce framework Cloudgene. Cloudgene controls the complete workflow and provides an intelligent queuing system in the background. mtDNA-Server will be available as a download in the near future.
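
To put the benchmark figures into perspective, here is a small back-of-the-envelope calculation in Python. It uses only the numbers quoted above (800 samples, 27 GB, 30 cores, 30 minutes). The function name and the 1 GB = 1000 MB convention are our own assumptions, and since "within 30 minutes" is an upper bound, the derived rates are rough estimates rather than measurements.

```python
# Figures quoted in the post; everything derived from them is an estimate.
SAMPLES = 800
DATA_GB = 27
CORES = 30
WALL_MINUTES = 30  # "within 30 minutes", so treat this as an upper bound


def throughput(samples, data_gb, cores, wall_minutes):
    """Derive simple per-sample and per-minute rates from the run totals."""
    core_minutes = cores * wall_minutes
    return {
        "seconds_per_sample": wall_minutes * 60 / samples,
        "core_minutes_per_sample": core_minutes / samples,
        "gb_per_minute": data_gb / wall_minutes,
        "mb_per_sample": data_gb * 1000 / samples,  # assumes 1 GB = 1000 MB
    }


stats = throughput(SAMPLES, DATA_GB, CORES, WALL_MINUTES)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

In other words, the cluster finishes roughly one sample every 2.25 seconds of wall-clock time, at a cost of a bit more than one core-minute per sample.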

For now, we are still in beta and would very much appreciate any kind of feedback!
Here’s the link:

New Feature: Tree-based visualisation of haplogroups


The new HaploGrep 2.0 Beta version supports several new export formats. One of the most powerful is the export of a phylogenetic tree representing the profiles currently loaded into HaploGrep. This feature generates almost publication-ready phylogenetic trees. We used them with almost no modification (just some color highlighting in Inkscape); please see [1].


HaploGrep 2.0 – Beta

We are happy to announce the HaploGrep 2.0 Beta version. We want YOU to give it a try and help us improve the software. We also want to say thanks for all the fruitful discussions, especially with our collaborators Antonio Salas and Hans Bandelt.

All new features will be introduced one by one in the next couple of weeks.

What you will notice in this first release:

  • GUI didn’t change a lot :-)
  • The “Find HaploGroups” button is no longer needed
  • A new tab (“Errors and Warnings”) shows problematic sequences based on a newly developed internal rule system
  • Annotation of amino acid changes for remaining mtSNPs
  • Import of VCF files (introduced below)
  • FASTA export
  • New server architecture based on REST

Feature highlights that will be released in the next weeks:

  • Support of heteroplasmic sites (Y,R)
  • New distance metrics for haplogroup classification
  • Phylogenetic representation based on the rCRS tree for all samples
  • FASTA import
  • Additional quality checks
  • How to use the REST API

We are looking forward to your feedback and your replies,

Hansi, Lukas, Sebastian



Importing VCF files into HaploGrep 2.0


With the establishment of NGS devices and the resulting flood of data, new file formats such as FASTQ, SAM, BAM and VCF have become de facto standards in the bioinformatics world.
The VCF file, which contains the variants, has come up especially often in user requests lately. There are some Python scripts available that convert a VCF file to a HaploGrep hsd file, as well as a publication with a tool dedicated to this topic. To simplify your life, we decided to implement the VCF file import directly in HaploGrep 2. You can give it a try with the 1000 Genomes mtDNA VCF file from Phase 1:

1000G Phase 1 VCF File
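
For readers curious about what such a conversion involves, here is a minimal Python sketch of the general idea. It is not the code used inside HaploGrep, nor one of the scripts mentioned above. The function name `vcf_to_hsd_rows` is hypothetical, and we assume a tab-separated hsd layout of sample ID, range, haplogroup (`?` when unknown) and polymorphisms, with the range fixed to `1-16569`. Only simple SNPs are handled; indels and heteroplasmic calls are deliberately ignored.

```python
def vcf_to_hsd_rows(vcf_lines):
    """Turn VCF text lines into one hsd-style row per sample (SNPs only)."""
    samples = []
    variants = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            variants = {name: [] for name in samples}
            continue
        pos, ref, alts = fields[1], fields[3], fields[4].split(",")
        gt_index = fields[8].split(":").index("GT")
        for name, call in zip(samples, fields[9:]):
            genotype = call.split(":")[gt_index]
            for allele in genotype.replace("|", "/").split("/"):
                if allele.isdigit() and int(allele) > 0:
                    alt = alts[int(allele) - 1]
                    # Keep single-base substitutions only (assumption: no indels).
                    if len(ref) == 1 and len(alt) == 1:
                        variants[name].append(f"{pos}{alt}")
                    break  # use the first non-reference allele only
    # Assumed hsd layout: ID, range, haplogroup placeholder, polymorphisms.
    return ["\t".join([name, "1-16569", "?"] + variants[name]) for name in samples]


example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "MT\t73\t.\tA\tG\t.\tPASS\t.\tGT\t1\t0",
    "MT\t263\t.\tA\tG\t.\tPASS\t.\tGT\t1\t1",
    "MT\t16189\t.\tT\tC,TC\t.\tPASS\t.\tGT\t0\t2",
]
rows = vcf_to_hsd_rows(example)
for row in rows:
    print(row)
```

The sample names and variants in `example` are made up for illustration. A real converter would also have to deal with insertions, deletions and heteroplasmic genotypes, which is exactly the kind of work the built-in import is meant to take off your hands.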


rCRS vs. RSRS vs. HG19 (Yoruba)

Working with human mitochondrial DNA is not always straightforward, even for a question as basic as “which reference sequence do you use?” There should be just one reference, which would make the question obsolete, but that is not the case.

Since the first sequencing of the human mitochondrial genome by Anderson et al. in 1981, its length has been defined as 16,569 base pairs; that sequence is known as the Cambridge Reference Sequence (CRS). When Andrews et al. corrected errors in this first sequence in 1999, the revised Cambridge Reference Sequence (rCRS, GenBank NC_012920.1) kept the same length: a deletion was found at position 3107, but instead of removing the position, an N was introduced as a placeholder. So 3107N is effectively a deletion, retained so that positions in the CRS and the rCRS remain comparable.
In many respects, choosing a European haplogroup (H2a2a1) as the reference sequence is not ideal. Behar et al. therefore proposed a new, hypothetical reference sequence in 2012, the so-called Reconstructed Sapiens Reference Sequence (RSRS). It preserves the historical genome annotation numbering, but instead of starting from a leaf sequence of the phylogenetic tree, as is the case with the rCRS, it uses a “mitochondrial Eve” as root. The two bases at positions 523-524, which read AC in the rCRS, are represented as NN, so the RSRS has three N positions (523N, 524N, 3107N). Mannis, the father of Phylotree, made this table showing the differences between rCRS and RSRS.
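
The placeholder trick is easy to see on a toy example. The following Python sketch uses a made-up 10-base sequence, not real mtDNA, and hypothetical helper names. It shows that removing a base shifts every downstream position, while replacing it with an N keeps the numbering intact.

```python
def delete_base(seq, position):
    """Remove the base at a 1-based position; downstream positions shift by one."""
    index = position - 1
    return seq[:index] + seq[index + 1:]


def mask_base(seq, position, placeholder="N"):
    """Replace the base at a 1-based position with a placeholder; numbering is kept."""
    index = position - 1
    return seq[:index] + placeholder + seq[index + 1:]


def real_bases(seq, placeholder="N"):
    """Count the positions that hold an actual nucleotide."""
    return len(seq) - seq.count(placeholder)


toy = "ACGTACGTAC"             # made-up sequence, for illustration only
shifted = delete_base(toy, 4)  # position 4 removed outright
masked = mask_base(toy, 4)     # position 4 kept as N

print(toy[5], shifted[5], masked[5])  # base at position 6 in each variant
print(len(masked), real_bases(masked))
```

In the masked version, position 6 still carries the same base as in the original, which is what makes old and new annotations comparable. Applied to the numbers above, 16,569 positions with one N leave 16,568 actual bases in the rCRS, and, assuming the RSRS spans the same 16,569 positions, its three N positions leave 16,566.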

However, we didn’t switch to the RSRS, since we agree with Bandelt et al. that an additional reference sequence causes confusion. There is yet another mtDNA reference sequence around that you should be aware of: the one present in GRCh37/UCSC Hg19 and in the older GRCh36/UCSC Hg18.


HaploGrep presents new mtDNA-KnowledgeBase

It’s been a while since the last update of HaploGrep. The last few updates were mostly due to new Phylotree releases by Mannis. Some updates were due to user feedback, which we hope to encourage with this blog, in addition to presenting new tools for human mitochondrial DNA data management and analysis.

In the coming weeks we want to present new features, help with questions other users have come across, and ask you which features you’d like to have in HaploGrep.