From The MarthLab

Overview

Mosaik is a suite of three modular programs: MosaikBuild, MosaikAligner, and MosaikAssembler.

  • MosaikBuild converts various sequence formats into Mosaik’s native read format.

  • MosaikAligner pairwise aligns each read to a specified series of reference (anchor) sequences.

  • MosaikAssembler parses the aligned sequence archive and produces a multiple sequence alignment, which is then saved in an assembly file format.

Mosaik is written in C++ with multiple platforms in mind. Compiled versions are currently available for Microsoft Windows and Linux, and builds for other platforms can be made available upon request. Cluster-aware (MPI) versions of MosaikAligner exist and have been tested on up to 160 processors.

At the time of the beta release, the workflow consists of supplying sequences in a FASTA format consistent with the output of phd2fasta (i.e., separate FASTA files for bases, base qualities, and base positions) and obtaining assembly files in the phrap ace format, which can be viewed with utilities such as consed, Sequencher, or EagleView.
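As a rough illustration of the phd2fasta-style input, the following Python sketch pairs a bases file with its matching base-quality file. It is not Mosaik's own parser. The function names (read_fasta_records, pair_bases_and_qualities) are hypothetical, and the sketch assumes that the quality file lists whitespace-separated integers under the same header name as the corresponding read in the bases file. Under the same assumption, a base-positions file could be read with numeric=True as well.

```python
from io import StringIO


def _join(chunks, numeric):
    """Merge the body lines of one record into a string or a list of ints."""
    if numeric:
        return [int(value) for chunk in chunks for value in chunk.split()]
    return "".join(chunks)


def read_fasta_records(handle, numeric=False):
    """Yield (name, payload) pairs from a FASTA-style stream.

    numeric=False: payload is the concatenated base string.
    numeric=True:  payload is a list of ints (assumed quality-file layout).
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, _join(chunks, numeric)
            header = line[1:].split()
            name = header[0] if header else ""  # first token is the read name
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, _join(chunks, numeric)


def pair_bases_and_qualities(bases_handle, quals_handle):
    """Return {read name: (bases, qualities)}, checking that lengths agree."""
    quals = dict(read_fasta_records(quals_handle, numeric=True))
    reads = {}
    for name, bases in read_fasta_records(bases_handle):
        qualities = quals.get(name)
        if qualities is None or len(qualities) != len(bases):
            raise ValueError(f"read {name!r}: bases and qualities do not match")
        reads[name] = (bases, qualities)
    return reads


if __name__ == "__main__":
    # Tiny in-memory stand-ins for a bases file and a quality file.
    bases = StringIO(">read1 example\nACGT\nAC\n>read2\nGG\n")
    quals = StringIO(">read1\n30 30 20 20\n10 10\n>read2\n40 40\n")
    for name, (seq, qual) in pair_bases_and_qualities(bases, quals).items():
        print(name, seq, qual)
```

Checking that each read has the same number of bases and quality values before conversion catches mismatched or truncated input files early.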

Features

  • aligns a large range of read lengths
    from short Illumina reads to medium 454 reads to long legacy Sanger reads

  • co-assembly
    can create an assembly with multiple sequencing technologies (Illumina, 454, and Sanger)

  • cluster-aware
    can be run on any number of processors (tested up to 160 processors)

  • anchored aligner
    use an entire genome as a reference when aligning reads

  • gapped alignment
    especially useful for insertion / deletion (indel) detection

  • fast
    aligns 100 million Illumina reads in 90 minutes on one processor

Publication & Release Schedule

Summer 2007

Beta Release

Registered beta testers can download the documentation and Mosaik distribution here:

Found a bug?

Found a bug in Mosaik? Please report it on our bug tracking website so it can be resolved more quickly.

Masked Human Genome 36.2

Derek Barnett has masked out all of the repeats in human genome build 36.2, using RepeatMasker for documented repeats and BLAT for micro-repeats.

Illumina Support

If you do not currently have a utility to convert Illumina data sets to FASTA, the following Mosaik utilities can convert them to both FASTA and the Mosaik internal read format (bypassing MosaikBuild).

If GERALD base quality calibration was used, use the ConvertGerald utility. Otherwise use the ConvertBustard utility. For a list of options, just run each program with no parameters.

Developers

Michael Strömberg