We are happy to announce the HaploGrep 2.0 Beta Version. We want YOU to give it a try and help us improve the software. We also want to say thanks for all the fruitful discussions, especially with our collaborators Antonio Salas and Hans Bandelt.
All new features will be introduced one by one in the next couple of weeks.
What you will notice in this first release:
- The GUI didn’t change a lot 🙂
- The “Find HaploGroups” button is no longer needed
- A new tab (“Errors and Warnings”) shows problematic sequences based on a newly developed internal rule system
- Annotation of Amino Acid Changes for remaining mtSNPs
- Import of VCF files (introduced below)
- Fasta export
- New server architecture based on REST
Feature highlights that will be released in the coming weeks:
- Support of heteroplasmic sites (Y,R)
- New distance metrics for haplogroup classification
- Phylogenetic representation based on the rCRS tree for all samples
- Fasta Import
- Additional Quality Checks
- How to use the REST API
We are looking forward to your feedback and your replies,
Hansi, Lukas, Sebastian
With the establishment of NGS devices and the resulting flood of data, new file formats such as FASTQ, SAM, BAM and VCF have become de facto standards in the bioinformatics world.
The VCF file, which contains the variants, has come up especially often in recent user requests. There are some Python scripts available that convert a VCF file to a HaploGrep hsd file, as well as a publication with a tool dedicated to this topic. To simplify your life, we decided to implement VCF file import directly in HaploGrep2. You can give it a try with the 1000 Genomes mtDNA VCF file from Phase 1:
1000G Phase 1 VCF File
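To give an idea of what such a VCF-to-hsd conversion involves, here is a minimal Python sketch. It is not the import code used in HaploGrep2. It assumes a tab-delimited hsd row of the form sample ID, range, haplogroup ("?" when unknown), followed by the polymorphisms, and it only handles single-base substitutions; indels and heteroplasmies are skipped. The function name `vcf_to_hsd_lines` and the default range `1-16569` are our own choices for illustration.

```python
def vcf_to_hsd_lines(vcf_lines, seq_range="1-16569"):
    """Turn VCF text lines into hsd-style rows, one per sample (SNVs only).

    Assumed hsd layout (an assumption of this sketch):
    SampleId <tab> Range <tab> Haplogroup <tab> Polymorphisms...
    """
    samples = []
    polymorphisms = {}
    for raw in vcf_lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the FORMAT column
            polymorphisms = {name: [] for name in samples}
            continue
        pos, ref, alts = fields[1], fields[3], fields[4].split(",")
        gt_index = fields[8].split(":").index("GT")
        for name, call in zip(samples, fields[9:]):
            genotype = call.split(":")[gt_index].replace("|", "/")
            # dict.fromkeys removes duplicate alleles but keeps their order
            for allele in dict.fromkeys(genotype.split("/")):
                if not allele.isdigit() or allele == "0":
                    continue  # missing call or reference allele
                alt = alts[int(allele) - 1]
                # keep simple substitutions only; indels are skipped here
                if len(ref) == 1 and len(alt) == 1 and alt in "ACGT":
                    polymorphisms[name].append(f"{pos}{alt}")
    return ["\t".join([name, seq_range, "?"] + polymorphisms[name])
            for name in samples]


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
        "MT\t73\t.\tA\tG\t.\tPASS\t.\tGT\t1\t0",
        "MT\t263\t.\tA\tG\t.\tPASS\t.\tGT\t1\t1",
    ]
    for row in vcf_to_hsd_lines(example):
        print(row)
```

A real converter would also need to translate insertions and deletions into the notation the hsd file expects, which is the harder part and is left out here on purpose.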
Working with human mitochondrial DNA is not always straightforward, even when it comes to the question: “what reference sequence do you use?” There should be just one reference, which would make the question obsolete. That, however, is not the case.
Since the first sequencing of the human mitochondrial genome by Anderson et al. in 1981, its length has been defined as 16,569 base pairs; this sequence is known as the Cambridge Reference Sequence (CRS). When errors in this first sequencing were corrected by Andrews et al. in 1999 (GenBank NC_012920.1), the revised Cambridge Reference Sequence (rCRS) was kept at the same length: although a deletion at position 3107 was found, the position was retained by introducing an N. So 3107N is basically a deletion, kept so that the positions on the CRS and the rCRS remain comparable.
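The effect of the 3107N placeholder on coordinates can be sketched in a few lines of Python. This is only an illustration of the numbering convention described above: the helper name `ungapped_index` is hypothetical, and it simply maps a 1-based rCRS position to the 1-based index of the same base in the sequence with the placeholder removed.

```python
RCRS_LENGTH = 16569      # numbering length shared by CRS and rCRS
PLACEHOLDER_POS = 3107   # the N that stands in for the deleted base


def ungapped_index(pos):
    """Map a 1-based rCRS position to its 1-based index once the N is dropped.

    Returns None for the placeholder itself, since no real base exists there.
    """
    if not 1 <= pos <= RCRS_LENGTH:
        raise ValueError(f"position {pos} is outside 1..{RCRS_LENGTH}")
    if pos == PLACEHOLDER_POS:
        return None
    # positions after the placeholder shift down by one
    return pos if pos < PLACEHOLDER_POS else pos - 1


if __name__ == "__main__":
    for p in (3106, 3107, 3108, 16569):
        print(p, ungapped_index(p))
```

Keeping the N means no such shift is ever needed in practice: a variant reported at a given position on the CRS refers to the same position on the rCRS.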
In many respects, a European haplogroup (H2a2a1) is not the best choice of reference sequence. Behar et al. therefore proposed a new, hypothetical reference sequence in 2012: the so-called Reconstructed Sapiens Reference Sequence (RSRS). It preserves the historical genome annotation numbering, but instead of starting from a leaf sequence in the phylogenetic tree, as the rCRS does, it starts from a “mitochondrial Eve” as root. The two-base insertion at 523-524 is represented as NN instead of AC, so the RSRS has 3 N positions (523N, 524N, 3107N). Mannis, the father of Phylotree, made this table showing the differences between rCRS and RSRS.
However, we did not switch to the RSRS, since we agree with Bandelt et al. that an additional reference sequence causes confusion. But there is yet another mtDNA reference sequence around that you should be aware of: the one present in GRCh37/UCSC Hg19 or the older GRCh36/UCSC Hg18
It’s been a while since the last update of HaploGrep. The last few updates were mostly due to the new Phylotree releases by Mannis. Some updates were due to user feedback, which we hope to improve with this blog, alongside presenting new tools for human mitochondrial DNA data management and analysis.
In the near future we want to present new features, help with questions other users have come across, and ask you what features you’d like to have in HaploGrep.