DNA Compression

Description

Apply compression algorithms to reduce the storage required for DNA sequences. Mitochondrial DNA (mtDNA) has been used as a model for compressing full genome sequences, and chromosome 1 of the HapMap data has also been tried. The downloadable software can be run on your own data set.


Data

Software

  • Should work on most Unix-based operating systems (written on a Mac).
  • Requires Perl and certain Perl modules listed in the documentation.
  • Download Software (gunzip, untar, and see the README)
  • SQLite is required only if working with HapMap data.
  • The SQLite chromosome 1 database (1.2 GB) is required only if working with HapMap data. Gunzip and untar this file in the same directory that you used for the compression software.
  • Contact Marty Brandon if you have difficulty.

Results

mtDNA

Encodings

HapMap

Encodings

  • Position-variant
    • analysis
    • consensus - A list of the consensus haplotypes at each variable position in chromosome 1.
    • Assumes one position-variant pair per line followed by a one-character newline.
    • For each position, a consensus haplotype was computed and data matching the consensus haplotype was removed.
    • File identifiers are not included in the size calculation.
    • Positions are the absolute DNA position as reported in the HapMap data file.
    • Use the average number of variants per sequence reported here together with the expected costs per runlength reported in the encoding schemes below to get the expected cost per sequence.
  • Huffman
  • Golomb
  • Elias-Gamma
  • Unary
    • analysis - Not a practical encoding. Just did this one as an exercise.
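The position-variant notes above (a consensus per position, matching data removed, one pair per line plus a one-character newline) can be sketched roughly as follows. This is only an illustration, not the Perl software's actual code: the function names, the in-memory `samples` layout, the space separator between position and variant, and the toy data are all assumptions made for the example.

```python
from collections import Counter


def consensus_by_position(samples):
    """Return {position: most common variant} across all samples.

    `samples` maps a sample id to a {position: variant} dict (hypothetical layout).
    """
    counts = {}
    for calls in samples.values():
        for pos, variant in calls.items():
            counts.setdefault(pos, Counter())[variant] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def position_variant_lines(calls, consensus):
    """Keep only the calls that differ from the consensus, one pair per line."""
    return [
        f"{pos} {variant}"
        for pos, variant in sorted(calls.items())
        if consensus.get(pos) != variant
    ]


def encoded_size(lines):
    """Size in characters: each line plus a one-character newline."""
    return sum(len(line) + 1 for line in lines)


# Toy data, invented for illustration only.
samples = {
    "s1": {1000: "A", 2500: "G", 7200: "T"},
    "s2": {1000: "A", 2500: "C", 7200: "T"},
    "s3": {1000: "G", 2500: "G", 7200: "T"},
}
consensus = consensus_by_position(samples)
for name, calls in samples.items():
    lines = position_variant_lines(calls, consensus)
    print(name, lines, encoded_size(lines))
```

A sample that matches the consensus everywhere produces no lines at all, which is where most of the saving in this notation comes from.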
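As a rough illustration of the encoding schemes listed above, the sketch below encodes positive run lengths with unary, Elias-gamma, and a Golomb code, derives Huffman code lengths from a frequency table, and multiplies an expected cost per run length by an average number of variants per sequence. Several things here are assumptions rather than details taken from the analysis files: the unary convention (ones terminated by a zero), restricting Golomb to a power-of-two parameter (the Rice special case), the function names, and the toy frequencies. The per-sequence figure covers only the run-length part of the cost.

```python
import heapq


def unary(n):
    """Unary code for n >= 1: (n - 1) ones followed by a zero (one common convention)."""
    return "1" * (n - 1) + "0"


def elias_gamma(n):
    """Elias-gamma code for n >= 1: (bit length - 1) zeros, then n in binary."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b


def golomb_rice(n, k):
    """Golomb code with divisor 2**k for n >= 1: unary quotient, k-bit remainder."""
    q, r = divmod(n - 1, 1 << k)
    return "1" * q + "0" + (format(r, f"0{k}b") if k else "")


def huffman_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over `freqs`."""
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)  # unique tie-breaker so the dicts are never compared
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (p1 + p2, next_id, merged))
        next_id += 1
    return heap[0][2]


def expected_bits(freqs, code_length):
    """Expected cost per run length, in bits, under the given code-length function."""
    return sum(p * code_length(n) for n, p in freqs.items())


def expected_cost_per_sequence(avg_variants, bits_per_runlength):
    """Average variants per sequence times expected bits per run length."""
    return avg_variants * bits_per_runlength


# Toy run-length frequencies, invented for illustration only.
freqs = {1: 0.5, 2: 0.25, 5: 0.25}
gamma_cost = expected_bits(freqs, lambda n: len(elias_gamma(n)))
lengths = huffman_lengths(freqs)
huffman_cost = expected_bits(freqs, lambda n: lengths[n])
print(gamma_cost, huffman_cost, expected_cost_per_sequence(40, gamma_cost))
```

Unary costs n bits for a run of length n, which is why it is impractical for the long runs between variants; Elias-gamma and Golomb grow far more slowly, and Huffman adapts to the observed frequency table at the price of storing that table.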

Topic attachments
| Attachment | Size | Date | Who | Comment |
| ASHG08_HapMapTutorial.ppt | 6 MB | 04 Feb 2009 - 03:55 | UnknownUser | HapMap PowerPoint tutorial downloaded from the HapMap site. |
| accession_list.txt | 35 K | 02 Dec 2008 - 00:22 | UnknownUser | List of GenBank accession numbers for the sequences used. |
| chr1.tar.gz | 1 GB | 14 Apr 2009 - 00:04 | UnknownUser | SQLite database file for chromosome 1 |
| chr1_hapmap_consensus.txt | 3 MB | 08 Mar 2009 - 01:53 | UnknownUser | HapMap chromosome 1 consensus variants. |
| chr1_runlength_Elias-Gamma.txt | 2 MB | 07 Mar 2009 - 23:36 | UnknownUser | Chromosome 1 runlength encoding using Elias-Gamma |
| chr1_runlength_Golomb.txt | 1 MB | 07 Mar 2009 - 23:37 | UnknownUser | Chromosome 1 runlength encoding using Golomb |
| chr1_runlength_Huffman.txt | 2 MB | 07 Mar 2009 - 23:38 | UnknownUser | Chromosome 1 runlength encoding using Huffman |
| chr1_runlength_counts.txt | 1 MB | 07 Mar 2009 - 23:22 | UnknownUser | HapMap chromosome 1 runlength counts |
| chr1_runlength_freqs.txt | 1 MB | 07 Mar 2009 - 23:28 | UnknownUser | HapMap chromosome 1 runlength frequencies |
| chr1_variant_Elias-Gamma.txt | 110 bytes | 07 Mar 2009 - 23:39 | UnknownUser | Chromosome 1 variant encoding using Elias-Gamma |
| chr1_variant_Golomb.txt | 99 bytes | 07 Mar 2009 - 23:40 | UnknownUser | Chromosome 1 variant encoding using Golomb |
| chr1_variant_Huffman.txt | 96 bytes | 07 Mar 2009 - 23:40 | UnknownUser | Chromosome 1 variant encoding using Huffman |
| chr1_variant_Unary.txt | 121 bytes | 08 Mar 2009 - 00:06 | UnknownUser | Chromosome 1 variant encoding using Unary |
| chr1_variant_counts.txt | 187 bytes | 07 Mar 2009 - 23:23 | UnknownUser | HapMap chromosome 1 variant counts |
| chr1_variant_freqs.txt | 165 bytes | 07 Mar 2009 - 23:28 | UnknownUser | HapMap chromosome 1 variant frequencies |
| compression_results.txt | 821 bytes | 24 Nov 2008 - 22:07 | UnknownUser | Final compression results for the full collection of mtDNA sequences. |
| dna_compression_software.tar.gz | 5 MB | 14 Apr 2009 - 01:07 | UnknownUser | DNA compression software |
| hapmap_chr1_Elias-Gamma.txt | 355 bytes | 07 Mar 2009 - 23:50 | UnknownUser | Analysis of the HapMap chromosome 1 Elias-Gamma encoding |
| hapmap_chr1_Golomb.txt | 345 bytes | 07 Mar 2009 - 23:50 | UnknownUser | Analysis of the HapMap chromosome 1 Golomb encoding |
| hapmap_chr1_Huffman.txt | 347 bytes | 07 Mar 2009 - 23:51 | UnknownUser | Analysis of the HapMap chromosome 1 Huffman encoding |
| hapmap_chr1_Unary.txt | 345 bytes | 07 Mar 2009 - 23:51 | UnknownUser | Analysis of the HapMap chromosome 1 Unary encoding |
| hapmap_chr1_compression_factors.txt | 905 bytes | 17 Mar 2009 - 17:19 | UnknownUser | Compression factors computed for each of the encodings. |
| hapmap_chr1_data_size.txt | 632 bytes | 09 Mar 2009 - 18:15 | UnknownUser | File sizes of the HapMap data for chromosome 1 |
| hapmap_chr1_pv.txt | 871 bytes | 10 Mar 2009 - 23:25 | UnknownUser | Analysis of the HapMap chromosome 1 position-variant notation |
| mtdna_consensus_Elias-Gamma.txt | 441 bytes | 13 Mar 2009 - 23:49 | UnknownUser | Analysis of mtDNA consensus encoding using Elias-Gamma |
| mtdna_consensus_Golomb.txt | 431 bytes | 13 Mar 2009 - 23:42 | UnknownUser | Analysis of mtDNA consensus encoding using Golomb |
| mtdna_consensus_Huffman.txt | 432 bytes | 13 Mar 2009 - 23:42 | UnknownUser | Analysis of mtDNA consensus encoding using Huffman |
| mtdna_consensus_compression_factors.txt | 878 bytes | 15 Mar 2009 - 23:54 | UnknownUser | Compression factors computed for mtdna_consensus |
| mtdna_consensus_pv.txt | 961 bytes | 13 Mar 2009 - 23:43 | UnknownUser | Analysis of the mtDNA_consensus position-variant notation |
| mtdna_consensus_runlength_counts.txt | 61 K | 12 Mar 2009 - 23:17 | UnknownUser | mtDNA consensus runlength counts |
| mtdna_consensus_runlength_freqs.txt | 75 K | 12 Mar 2009 - 23:13 | UnknownUser | mtDNA consensus runlength frequencies |
| mtdna_consensus_variant_counts.txt | 992 bytes | 12 Mar 2009 - 23:17 | UnknownUser | mtDNA consensus variant counts |
| mtdna_consensus_variant_freqs.txt | 960 bytes | 12 Mar 2009 - 23:14 | UnknownUser | mtDNA consensus variant frequencies |
| mtdna_file_sizes.txt | 385 bytes | 13 Mar 2009 - 23:43 | UnknownUser | Predicted file sizes for mtDNA data |
| mtdna_rcrs_Elias-Gamma.txt | 431 bytes | 13 Mar 2009 - 23:44 | UnknownUser | Analysis of mtDNA rCRS encoding using Elias-Gamma |
| mtdna_rcrs_Golomb.txt | 421 bytes | 13 Mar 2009 - 23:45 | UnknownUser | Analysis of mtDNA rCRS encoding using Golomb |
| mtdna_rcrs_Huffman.txt | 422 bytes | 13 Mar 2009 - 23:45 | UnknownUser | Analysis of mtDNA rCRS encoding using Huffman |
| mtdna_rcrs_compression_factors.txt | 873 bytes | 15 Mar 2009 - 23:53 | UnknownUser | Compression factors computed for mtdna_rcrs |
| mtdna_rcrs_pv.txt | 956 bytes | 13 Mar 2009 - 23:45 | UnknownUser | Analysis of the mtDNA_rCRS position-variant notation |
| mtdna_rcrs_runlength_counts.txt | 45 K | 13 Mar 2009 - 22:39 | UnknownUser | mtDNA rCRS runlength counts |
| mtdna_rcrs_runlength_freqs.txt | 55 K | 12 Mar 2009 - 23:10 | UnknownUser | mtDNA rCRS runlength frequencies |
| mtdna_rcrs_variant_counts.txt | 992 bytes | 13 Mar 2009 - 22:39 | UnknownUser | mtDNA rCRS variant counts |
| mtdna_rcrs_variant_freqs.txt | 960 bytes | 12 Mar 2009 - 23:08 | UnknownUser | mtDNA rCRS variant frequencies |
| mtdna_sequence_stats.txt | 523 bytes | 13 Mar 2009 - 16:24 | UnknownUser | Sequence statistics for mtDNA sequences. |
Topic revision: r1 - 12 Feb 2016, UnknownUser
