With HaploGrep version 2.1.21, the handling of human mitochondrial FASTA sequences has been improved. Because we use BWA 0.7.17 to align FASTA sequences, the resulting variants do not always follow the mtDNA nomenclature by default. We therefore provide a new parameter that rewrites them to the nomenclature used in Phylotree, which ultimately yields a better HaploGrep score. The parameter is:
Currently we apply 66 rules that fix issues around indels (insertions and deletions), e.g. the deletion 8281-8289d for haplogroup B, but also around positions 315 and 524 (which, however, are not relevant for the mitochondrial phylogeny).
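To make the idea of such a rule concrete, here is a minimal sketch in Python. It is not HaploGrep's rule engine: the rule table, the function name `fix_nomenclature` and the single example rule (an aligner reporting the 9 bp deletion at 8272-8280 instead of 8281-8289) are assumptions made purely for illustration.

```python
# Hypothetical rule table: maps a variant as an aligner might report it
# to the notation used in Phylotree. The single entry is illustrative
# only and is NOT taken from HaploGrep's actual set of 66 rules.
NOMENCLATURE_RULES = {
    "8272-8280d": "8281-8289d",
}


def fix_nomenclature(variants):
    """Return the variants with known alternative notations rewritten.

    Variants without a matching rule are passed through unchanged.
    """
    return [NOMENCLATURE_RULES.get(variant, variant) for variant in variants]


profile = ["73G", "263G", "8272-8280d", "16189C"]
print(fix_nomenclature(profile))
```

A plain lookup table is enough for one-to-one rewrites like this one; rules that depend on several neighbouring variants at once would need more context than a single dictionary entry can express.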
We are happy to announce the publication of the HaploGrep2 paper in this year's (2016) Web Server Issue of Nucleic Acids Research. For data generated with massively parallel sequencing devices in the form of FASTQ or BAM files, the mtDNA-Server paper, also published in this year's issue, gives more details.
After exactly 2 years, Phylotree, the “database” behind HaploGrep, has been updated by Mannis van Oven. Here’s the accompanying publication on Phylotree 17. The mtDNA tree now has 5,437 haplogroups, a growth of over 13% compared to the previous version. Find out how Phylotree 17 differs for your dataset by using the updated HaploGrep 2 version with the latest mtDNA tree, Build 17.
Over the last years, HaploGrep has become the de facto standard for automatic haplogroup classification (~18,000 users, cited over 140 times, about 120 local installations) and is also used in several commercial systems and research pipelines. Quite some work has been done beneath the surface of HaploGrep, especially to improve our haplogroup classification performance and to keep up with the latest requirements. After almost a year in beta (see entry from Sept 2014), we think it’s finally time to replace the initial version of HaploGrep with the new and improved version, HaploGrep2. We hope you like the new version and would appreciate any kind of feedback!
These are the major improvements:
- Improved classification algorithm, resulting in a speed-up of 20x!
- HaploGrep now includes a rule-based engine. The two new columns “warnings” (W) and “errors” (E) show abnormalities in the input file detected by the new engine. We very much appreciate the input and suggestions from Hans Bandelt and Antonio Salas!
- New Import Formats (VCF + FASTA) supported
- Updated to the latest security standards on the server side, so we are finally back on Firefox!
- Different ranking algorithms (e.g. Jaccard, Hamming distance) can be applied besides our default ranking algorithm, the Kulczynski distance. These new ranking algorithms will be introduced one by one and are therefore currently disabled.
- HaploGrep is also provided as a command-line version (included in mtDNA-Server)
- Direct support of VCF files through the Htsjdk library.
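To give a feeling for how the ranking measures mentioned in the list above differ, here is a small sketch in Python that compares a sample with a haplogroup, each treated as a plain set of variants. This is an unweighted textbook version under our own assumptions, not HaploGrep's implementation: the function names and the toy profiles are hypothetical, and the Kulczynski measure is written in its similarity form (higher means more similar).

```python
def jaccard(sample, haplogroup):
    """Jaccard similarity: shared variants divided by all distinct variants."""
    union = sample | haplogroup
    return len(sample & haplogroup) / len(union) if union else 1.0


def hamming(sample, haplogroup):
    """Hamming distance on sets: number of variants found in only one of them."""
    return len(sample ^ haplogroup)


def kulczynski(sample, haplogroup):
    """Kulczynski similarity: mean of the shared fraction seen from each side."""
    if not sample or not haplogroup:
        return 0.0
    shared = len(sample & haplogroup)
    return 0.5 * (shared / len(sample) + shared / len(haplogroup))


# Toy profiles, made up for this example only.
sample = {"73G", "263G", "8281-8289d"}
haplogroup = {"73G", "263G", "8281-8289d", "16189C"}

print(jaccard(sample, haplogroup))
print(hamming(sample, haplogroup))
print(kulczynski(sample, haplogroup))
```

Jaccard and Kulczynski are similarities, so candidates would be ranked from high to low, while the Hamming distance ranks from low to high.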
New export formats supported:
Points we (currently) removed from the beta:
- Direct support of heteroplasmic sites (Y, R)
- How to use the REST-API.
- FASTA import is available (open a *.fasta file!) but still in beta.
Here’s the updated version: HaploGrep2
Working with human mitochondrial DNA is not always straightforward, even when it comes to the question: “What reference sequence do you use?” There should be just one reference, which would make the question obsolete, but that is not the case.
Since the first sequencing of the human mitochondrial genome by Anderson et al. in 1981, its length has been defined as 16,569 base pairs; this sequence is called the Cambridge Reference Sequence (CRS). When errors in this first sequencing were corrected some years later by Andrews et al. in 1999, the new revised Cambridge Reference Sequence (rCRS, GenBank NC_012920.1) was kept at the same length: although a deletion at position 3107 was found, the position was retained by introducing an N. So 3107N is basically a deletion, kept so that the positions on the CRS and rCRS remain comparable.
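Why a placeholder keeps positions comparable can be shown with a toy example in Python. The sequence and the helper names `drop_base` and `mask_base` are made up for illustration; nothing here is real mtDNA.

```python
def drop_base(seq, pos):
    """Remove the base at 1-based position pos; all later bases shift left."""
    return seq[:pos - 1] + seq[pos:]


def mask_base(seq, pos):
    """Replace the base at 1-based position pos with the placeholder N."""
    return seq[:pos - 1] + "N" + seq[pos:]


original = "ACGTTGCA"  # toy sequence, not real mtDNA
error_pos = 4          # pretend the base here was a sequencing error

dropped = drop_base(original, error_pos)
masked = mask_base(original, error_pos)

# With the placeholder the length is unchanged, so position 6 still
# refers to the same base as before; after dropping it does not.
print(len(original), len(masked), len(dropped))
print(original[5], masked[5], dropped[5])
```

The rCRS follows the second approach at position 3107, which is why variant positions published against the original CRS can still be read against the rCRS.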
In many respects, choosing a European haplogroup (H2a2a1) as the reference sequence is not ideal. Behar et al. therefore proposed a new, hypothetical reference sequence in 2012, the so-called Reconstructed Sapiens Reference Sequence (RSRS). It preserves the historical genome annotation numbering but, unlike the rCRS, does not start from a leaf sequence in the phylogenetic tree; instead it uses a “mitochondrial Eve” as root. The two-base insertion at 523-524 is represented as NN instead of AC, so the RSRS has 3 N positions (523N, 524N, 3107N). Mannis, the father of Phylotree, made this table showing the differences between rCRS and RSRS.
However, we didn’t switch to the RSRS, since we agree with Bandelt et al. that an additional reference sequence causes confusion. But there is yet another mtDNA reference sequence around that you should be aware of: the one present in GRCh37/UCSC hg19 or the older GRCh36/UCSC hg18.
It’s been a while now since the last update of HaploGrep. The last few updates were mostly due to the new Phylotree releases by Mannis. Some updates were due to user feedback, which we hope to improve with this blog, besides presenting new tools for human mitochondrial DNA data management and analysis.
In the near future, we want to present new features, help with questions other users have come across, and ask you which features you’d like to have in HaploGrep.