Remove Empty Variants From VCF

I ran into an issue when I tried to use known SNPs from dbSNP that contained empty alleles. If you are using GATK's base quality score recalibrator (BQSR), it will complain that the VCF is not valid. There are a handful of Perl one-liners posted in forums, but none of them appeared to work. Here is my solution.

The script below takes a VCF as input (gzipped or not) and writes a new VCF. It simply checks for empty REF and ALT fields; lines that meet that criterion are skipped while writing out the new VCF.

#! /usr/bin/env python

# This script removes empty alleles from a VCF. Some data sources, such as
# dbSNP, provide improperly formatted VCFs. This causes headaches when trying
# to use strict tools such as the GATK.

# It works by checking for an empty string in the REF and ALT columns. When
# a line matches, it is skipped while writing the new VCF output.

import argparse
import gzip

def parse_args():
    """
    Parses the command line arguments.
    """
    parser = argparse.ArgumentParser(description='Strip empty alleles from VCF')
    parser.add_argument('input', type=str, help='Input VCF - gzip accepted')
    parser.add_argument('output', type=str, help='Output VCF')

    return parser.parse_args()

def vcf_open(path):
    """
    Obtain file handle if gzipped or normal text file.
    """
    if path.endswith('.gz'):
        return gzip.open(path, 'rt')

    return open(path, 'r')

def is_empty(allele):
    """
    Check whether an allele field is an empty string.
    """
    return allele.strip() == ''

def read_and_write_vcfs(input_path, output_path):
    """
    Reads the input VCF and writes out a corrected VCF.
    """
    with vcf_open(input_path) as f:
        with open(output_path, 'w') as w:
            for line in f:
                if not line.startswith('#'):
                    records = line.strip().split("\t")
                    ref = records[3]
                    alt = records[4]

                    if is_empty(ref) or is_empty(alt):
                        print("skipping.... " + line.strip())
                        continue

                w.write(line)

def main():
    args = parse_args()
    read_and_write_vcfs(args.input, args.output)

if __name__ == '__main__':
    main()
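To sanity-check the filtering rule, the REF/ALT test can be exercised on a couple of hand-made VCF body lines. The records below are hypothetical, not from dbSNP; this is just a minimal sketch of the same split-and-check logic the script uses.

```python
# Minimal sketch of the filtering rule: a tab-delimited VCF body line is
# dropped when its REF (column 4) or ALT (column 5) field is empty.

def is_empty(allele):
    """Check whether an allele field is an empty string."""
    return allele.strip() == ''

def keep_line(line):
    """Return True when the variant line has non-empty REF and ALT fields."""
    records = line.strip('\n').split('\t')
    return not (is_empty(records[3]) or is_empty(records[4]))

good = "1\t12345\trs1\tA\tG\t.\t.\t.\n"
bad = "1\t12346\trs2\t\tG\t.\t.\t.\n"   # empty REF field

print(keep_line(good))  # True
print(keep_line(bad))   # False
```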

Fix FASTQ Headers

I ran into a case in the lab where I received some odd FASTQ headers; they were not standard by any means. The data consisted of paired-end reads, but the headers did not use the standard format for specifying which end a read came from. Here is the script I wrote to fix the issue. It may be useful to you as well, but be aware that I hard-coded the sequencer's instrument identifier into the header-matching logic. The script also assumes that "read1" or "read2" appears in the file name, which it uses to determine which end the FASTQ came from.

#! /usr/bin/env python

# Some of the FASTQ files have an odd header format, so this script fixes the
# headers so that paired reads match properly. BWA complains when they are
# different.
#
# The read counts are the same in each FASTQ.
# -------
# Example
# -------
# Read 1
# @D3LH75P1:3:1101:1237:1997:0:1:1
#
# Becomes
# @D3LH75P1:3:1101:1237:1997:0/1
#
# Read 2
# @D3LH75P1:3:1101:1237:1997:0:2:1
#
# Becomes
# @D3LH75P1:3:1101:1237:1997:0/2

import argparse
import gzip

def parse_args():
    """
    Parses the command line arguments.
    """
    parser = argparse.ArgumentParser(description='Fix fastq headers.')
    parser.add_argument('input', type=str, help='Input FASTQ - gzip accepted')
    parser.add_argument('output', type=str, help='Output FASTQ')

    return parser.parse_args()

def open_file(path):
    """
    Obtain file handle if gzipped or normal text file.
    """
    if path.endswith('.gz'):
        return gzip.open(path, 'rt')

    return open(path, 'r')

def is_read1(file_path):
    """
    Infer read from file name. Our files have read1 or read2 explicitly.
    """
    return "read1" in file_path

def main():
    args = parse_args()
    read1 = is_read1(args.input)

    with open_file(args.input) as f:
        with open(args.output, "w") as w:
            for line in f:
                new_line = line

                if line.startswith("@D3LH75P1"):
                    data = line.strip().split(":")
                    del data[-2:]
                    new_line = ":".join(data)

                    if read1:
                        new_line = new_line + "/1"
                    else:
                        new_line = new_line + "/2"

                    new_line = new_line + "\n"

                w.write(new_line)

if __name__ == "__main__":
    main()
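The header rewrite itself is small enough to check in isolation. Using the example headers from the comment above, a minimal sketch of the split/trim/join step looks like this:

```python
# Rebuild a header by dropping the last two ':'-separated fields and
# appending the read-end suffix, mirroring the script's logic.

def fix_header(header, read1):
    """Trim the trailing two fields and tag the header with /1 or /2."""
    data = header.strip().split(':')
    del data[-2:]
    return ':'.join(data) + ('/1' if read1 else '/2')

print(fix_header("@D3LH75P1:3:1101:1237:1997:0:1:1", True))
# @D3LH75P1:3:1101:1237:1997:0/1
print(fix_header("@D3LH75P1:3:1101:1237:1997:0:2:1", False))
# @D3LH75P1:3:1101:1237:1997:0/2
```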

Paper - Dimension reduction techniques for the integrative analysis of multi-omics data

This is a great paper that introduces techniques for analyzing multiple "omics" data sets. Given my interest in integrative "omics" data analysis, I thought it would be worth saving here. The paper discusses current techniques and challenges in reducing the number of data dimensions.

Dimension reduction techniques for the integrative analysis of multi-omics data

Abstract

State-of-the-art next-generation sequencing, transcriptomics, proteomics and other high-throughput ‘omics' technologies enable the efficient generation of large experimental data sets. These data may yield unprecedented knowledge about molecular pathways in cells and their role in disease. Dimension reduction approaches have been widely used in exploratory analysis of single omics data sets. This review will focus on dimension reduction approaches for simultaneous exploratory analyses of multiple data sets. These methods extract the linear relationships that best explain the correlated structure across data sets, the variability both within and between variables (or observations) and may highlight data issues such as batch effects or outliers. We explore dimension reduction techniques as one of the emerging approaches for data integration, and how these can be applied to increase our understanding of biological systems in normal physiological function and disease.


Contents © 2019 Tyler Marrs