Over the last few years, HaploGrep has become the de facto standard for automatic haplogroup classification (about 18,000 users, cited over 140 times, about 120 local installations) and is also used in several commercial systems and research pipelines. A lot of work has gone on beneath the surface of HaploGrep, especially to improve our haplogroup classification performance and to keep up with the latest requirements. After almost a year in beta (see entry from Sept 2014), we think it’s finally time to replace the initial version of HaploGrep with the new and improved version, HaploGrep2. We hope you like the new version and would appreciate any kind of feedback!
These are the major improvements:
- Improved classification algorithm, resulting in a 20x speed-up!
- HaploGrep now includes a rule-based engine. The two new columns “warnings” (W) and “errors” (E) show abnormalities in the input file detected by the new engine. We very much appreciate the input and suggestions from Hans Bandelt and Antonio Salas!
- New Import Formats (VCF + FASTA) supported
- Updated to the latest security standards on the server side, so we are finally back on Firefox!
- Different ranking algorithms (e.g. Jaccard, Hamming distance) can be applied besides our default ranking algorithm, the Kulczynski distance. These new ranking algorithms will be introduced one by one and are therefore currently disabled.
- HaploGrep is also provided as a command-line version (included in mtDNA-Server)
- Direct support of VCF files through the htsjdk library.
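To give a feeling for how the ranking metrics mentioned above differ, here is a minimal Python sketch that compares a sample's variants with the variants expected for a candidate haplogroup. It is not HaploGrep's implementation: the function names, the plain unweighted set formulas and the example variants are assumptions made for illustration, and HaploGrep's own scoring may treat variants differently.

```python
def jaccard(sample, expected):
    """Jaccard index: shared variants over all variants seen in either set."""
    union = sample | expected
    return len(sample & expected) / len(union) if union else 1.0


def kulczynski(sample, expected):
    """Unweighted Kulczynski measure: mean of the two overlap fractions."""
    if not sample or not expected:
        return 0.0
    shared = len(sample & expected)
    return 0.5 * (shared / len(sample) + shared / len(expected))


def hamming(sample, expected):
    """Hamming distance on presence/absence: variants found in only one set."""
    return len(sample ^ expected)


# Hypothetical example data: variants of a sample and of a candidate haplogroup.
sample = {"73G", "263G", "750G", "1438G"}
candidate = {"73G", "263G", "750G", "4769G"}

print(jaccard(sample, candidate))     # 3 shared / 5 total
print(kulczynski(sample, candidate))  # mean of 3/4 and 3/4
print(hamming(sample, candidate))     # 2 variants differ
```

Jaccard and Kulczynski are similarities (higher is better), while the Hamming distance counts mismatches (lower is better), so a ranking has to sort them in opposite directions.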
New export formats supported:
Points we (currently) removed from the beta:
- Direct support of heteroplasmic sites (Y,R)
- How to use the REST API.
- FASTA import is available (open a *.fasta file!) but still in beta.
Here’s the updated version: HaploGrep2
HaploGrep 2.0 Beta allows the export of a multiple alignment FASTA file, which makes the generation of phylogenetic trees straightforward. Besides its own phylogenetic tree based directly on Phylotree (see previous blog entry), we present here the basic steps to generate phylogenetic trees from multiple alignment FASTA files using MrBayes, Neighbor Joining or Maximum Likelihood. For this purpose we recommend Ugene, a very powerful toolset that is useful not only for next-gen sequencing projects. The following steps show how simple this process can be: …
Many of our HaploGrep users are interested in analysing data within automated pipelines. To support users with a scalable and standardized service, we are happy to present our newest project, called mtDNA-Server.
mtDNA-Server provides a free service for the complete workflow of NGS mtDNA projects. It includes the alignment of FASTQ SE/PE data based on BWA MEM (using JBWA), sorting of data, creation of BAM files, heteroplasmy detection, contamination identification, haplogroup assignment (using HaploGrep, of course) and graphical report creation based on R.
All steps are parallelized using Hadoop MapReduce. As a result, we are able to analyse 800 1000G Phase 1 samples (27 GB) on our 30-core test cluster within 30 minutes. To simplify the execution of MapReduce jobs and provide users with an intelligent workflow system, we use our MapReduce framework Cloudgene. Cloudgene controls the complete workflow and provides an intelligent queuing system in the background. mtDNA-Server will be available as a download in the near future.
For now, we are still in beta and would very much appreciate any kind of feedback!
Here’s the link:
Within the new HaploGrep 2.0 Beta version some new export formats are now supported. One of the most powerful is the export of a phylogenetic tree, representing the current profiles loaded into HaploGrep. This feature generates almost publication-ready phylogenetic trees. We used them with almost no modification (some color-highlighting in Inkscape) – please see .
We are happy to announce the HaploGrep 2.0 Beta version. We want YOU to give it a try and help us improve the software. We also want to say thanks for all the fruitful discussions, especially with our collaborators Antonio Salas and Hans Bandelt.
All new features will be introduced one by one in the next couple of weeks.
What you will notice in this first release:
- GUI didn’t change a lot 🙂
- The “Find HaploGroups” Button is not needed anymore
- A new tab (“Errors and Warnings”) shows problematic sequences based on a newly developed internal rule system
- Annotation of Amino Acid Changes for remaining mtSNPs
- Import of VCF files (introduced below)
- Fasta export
- New server architecture based on REST
Feature-Highlights that will be released in the next weeks:
- Support of heteroplasmic sites (Y,R)
- New distance metrics for haplogroup classification
- Phylogenetic representation based on the rCRS tree for all samples
- Fasta Import
- Additional Quality Checks
- How to use the REST API
We are looking forward to your feedback and your replies,
Hansi, Lukas, Sebastian
With the establishment of NGS devices and the resulting data flood, new file formats such as FASTQ, SAM, BAM and VCF have become de facto standards in the bioinformatics data world.
VCF files, which contain the variants, have come up particularly often in user requests lately. There are some Python scripts available that convert a VCF file to a HaploGrep hsd file, as well as a publication with a tool dedicated to this topic. To simplify your life, we decided to implement VCF file import directly in HaploGrep2. You can give it a try with the 1000 Genomes mtDNA VCF file from Phase 1:
1000G Phase 1 VCF File
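For readers curious what such a VCF-to-hsd conversion involves, here is a minimal Python sketch. It is neither one of the scripts mentioned above nor HaploGrep's importer. The function name `vcf_to_hsd` is hypothetical, and the hsd column layout (ID, range, haplogroup, variants) and its separators are assumptions. The sketch also assumes a single-sample VCF in which every listed ALT allele is present in the sample, and it ignores genotype fields.

```python
def vcf_to_hsd(vcf_lines, sample_id, seq_range="1-16569"):
    """Turn single-sample VCF text lines into one tab-separated hsd-style row."""
    variants = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and meta/header lines
        fields = line.rstrip("\n").split("\t")
        pos, ref, alt = fields[1], fields[3], fields[4]
        # Keep only simple single-base substitutions; indels and
        # multi-allelic records need extra handling not shown here.
        if len(ref) == 1 and len(alt) == 1 and alt in "ACGT":
            variants.append(f"{pos}{alt}")
    # "?" leaves the haplogroup column open for classification (assumed layout).
    return "\t".join([sample_id, seq_range, "?"] + variants)


# Hypothetical VCF content for illustration only.
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "MT\t73\t.\tA\tG\t.\tPASS\t.",
    "MT\t263\t.\tA\tG\t.\tPASS\t.",
    "MT\t310\t.\tT\tTC\t.\tPASS\t.",  # insertion, skipped by this sketch
]
row = vcf_to_hsd(vcf, "Sample1")
print(row)
```

Insertions, deletions and heteroplasmic calls are exactly the cases where a naive conversion goes wrong, which is one reason a built-in import is more convenient.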
Working with human mitochondrial DNA is not always straightforward, even when it comes to the question: “what reference sequence do you use?” There should be just one reference, which would make the question obsolete, but that is not the case.
Since the first sequencing of the human mitochondrial genome by Anderson et al. in 1981, its length has been defined as 16,569 base pairs; this sequence is called the Cambridge Reference Sequence (CRS). When Andrews et al. corrected errors in this first sequence in 1999, the revised Cambridge Reference Sequence (rCRS, GenBank NC_012920.1) kept the same length: although a deletion at position 3107 was found, the position was retained by introducing an N. So 3107N is basically a deletion, kept so that the positions on the CRS and rCRS are still comparable.
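The effect of the placeholder can be shown with a toy example. The Python sketch below uses a made-up ten-base sequence, not real mtDNA, and a hypothetical helper `position_of`. It compares what happens to downstream coordinates when an erroneous base is deleted and when it is masked with an N.

```python
def position_of(sequence, motif):
    """Return the 1-based start position of motif in sequence."""
    return sequence.index(motif) + 1


original = "ACGTCATTGA"  # toy sequence; pretend position 5 is a sequencing error
with_deletion = original[:4] + original[5:]           # erroneous base removed
with_placeholder = original[:4] + "N" + original[5:]  # erroneous base masked

print(position_of(original, "TTGA"))          # 7
print(position_of(with_deletion, "TTGA"))     # 6, everything downstream shifts
print(position_of(with_placeholder, "TTGA"))  # 7, numbering is preserved
```

With the placeholder, variants reported against the old and the revised sequence keep the same position numbers.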
In many respects, choosing a European haplogroup (H2a2a1) as the reference sequence is not ideal. Therefore Behar et al. proposed a new, hypothetical reference sequence in 2012, the so-called Reconstructed Sapiens Reference Sequence (RSRS). It preserves the historical genome annotation numbering but starts not from a leaf sequence in the phylogenetic tree, as is the case with the rCRS, but from a “mitochondrial Eve” as root. The two-base insertion at 523-524 is represented as NN instead of AC, so the RSRS has 3 N positions (523N, 524N, 3107N). Mannis, the father of Phylotree, made this table showing the differences between rCRS and RSRS.
However, we didn’t change to the RSRS, since we agree with Bandelt et al. that an additional reference sequence causes confusion. But there’s yet another mtDNA reference sequence around that you should be aware of: the one present in GRCh37/UCSC Hg19 or the older GRCh36/UCSC Hg18.
It’s been a while now since the last update of HaploGrep. The last few updates were mostly due to the new Phylotree releases by Mannis. Some updates were due to user feedback, which we hope to improve with this blog, besides presenting new tools for human mitochondrial DNA data management and analysis.
In the near future we want to present new features, help with questions other users have come across and ask you what features you’d like to have in HaploGrep.