New Bovine Genome Comparison

Since 2014, the standard reference for bovine genome studies has been the UMD 3.1 assembly, available for download e.g. here:

 

However, a few days ago a new assembly called ARS-UCD1.2 was released. It can be downloaded here:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/263/795/GCA_002263795.2_ARS-UCD1.2/

Just to get a quick impression, I compared the two assemblies and checked their similarity using LAST (http://last.cbrc.jp/).

First, I created a LAST database like this:

lastdb -P0 -uNEAR -R01 $FOLDER/UMD31/UMD31-NEAR $FOLDER/UMD31/UMD3.1_chromosomes.fa

Then, I determined the substitution and gap frequencies:

last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 $FOLDER/UMD31/UMD31-NEAR $FOLDER/ARS/ARS_UCD12.fna > $FOLDER/UMD-ARS.mat

After the training, the alignment was performed (here is the parallel part of the Slurm script):

FOLDER="/wrk/daniel/References/";
chr=($(ls $FOLDER/ARS/chr*));

lastal -m50 -E0.05 -C2 -p $FOLDER/UMD-ARS.mat $FOLDER/UMD31/UMD31-NEAR ${chr[$SLURM_ARRAY_TASK_ID]} | last-split -m1 > UMD-ARS-$SLURM_ARRAY_TASK_ID.maf
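The array job needs one task per chromosome file, and the task IDs have to start at 0 because they index a bash array. A minimal sketch of how the matching `--array` range could be derived from the number of files, assuming the per-chromosome FASTA files are named `chr*` as above (the helper name `array_range` and the script name `align.sh` are made up for illustration):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a zero-based Slurm array range ("0-(N-1)")
# for the chr* files found in the given directory.
array_range() {
  local dir=$1
  local files=("$dir"/chr*)
  # Without nullglob an unmatched pattern stays literal, so test for existence.
  if [ ! -e "${files[0]}" ]; then
    echo "no chr* files in $dir" >&2
    return 1
  fi
  echo "0-$(( ${#files[@]} - 1 ))"
}

# Usage sketch (align.sh is a placeholder for the job script shown above):
#   sbatch --array="$(array_range "$FOLDER/ARS")" align.sh
```

Both the glob and `ls` sort names lexicographically (chr1, chr10, chr11, ...), which is fine here because each task only needs a distinct file, not a particular order.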

As I ran the alignment in parallel for each chromosome, the repeated headers of the per-chromosome files needed to be removed:

cat UMD-ARS-*.maf > all.maf
sed '/^#/ d' < all.maf > temp.maf
head -n 22 all.maf > headerLines
cat headerLines temp.maf > alignments.maf
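The `head -n 22` step depends on the header being exactly 22 lines long. A sketch of a merge that does not rely on that count, assuming every per-chromosome file starts with its `#` header lines followed by the alignments (the function name `merge_maf` is hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge MAF files, keeping only the first file's header.
# Usage: merge_maf out.maf in1.maf in2.maf ...
merge_maf() {
  local out=$1
  shift
  # Copy the leading '#' lines of the first file, however many there are.
  awk '/^#/ { print; next } { exit }' "$1" > "$out"
  # Append everything except '#' lines from all input files.
  sed '/^#/d' "$@" >> "$out"
}

# Usage sketch:
#   merge_maf alignments.maf UMD-ARS-*.maf
```

The output name should not match the input pattern, otherwise a second run would feed the merged file back into itself.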

Finally, simple-sequence alignments were discarded from the merged file, the remaining alignments were converted to tabular format, and alignments with error probability > 10^-5 were discarded:

last-postmask alignments.maf |
maf-convert -n tab |
awk -F'=' '$2 <= 1e-5' > alignments.tab
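The `awk` filter works because, in the tabular output, the error probability sits in a trailing `mismap=...` field: splitting each line on `=` makes that value the second field. A small sketch with made-up alignment lines (the coordinates, scores, and the `filter_mismap` name are invented for illustration, and it assumes `mismap` is the only `key=value` field on a line):

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the filter above, with the threshold as argument.
filter_mismap() {
  # -F'=' splits at '=', so $2 is the text after "mismap=".
  # Adding 0 forces a numeric comparison on both sides.
  awk -F'=' -v max="$1" '$2 + 0 <= max + 0'
}

# Two invented lines in LAST tabular layout; only the first one is kept.
printf '1520\tchr1\t100\t800\t+\t1000000\tchr1\t120\t800\t+\t1000000\t800\tmismap=1e-09\n980\tchr1\t5000\t600\t+\t1000000\tchr5\t70\t600\t+\t1000000\t600\tmismap=0.02\n' |
  filter_mismap 1e-5
```

Comment lines are not treated specially in this sketch, so a line without any `=` evaluates to 0 and passes the filter.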

From that tab file, the dotplot was then created:

last-dotplot -x 4000 -y 4000 alignments.tab alignment.png

This is what the dotplot looks like. The two assemblies are largely the same genome, but some regions have clearly changed (open the image and zoom in on the diagonal to see the differences).
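To put a number on what the dotplot shows, the tabular file can also be summarised per chromosome pair. A minimal sketch, assuming the usual LAST tabular layout (column 2 = name of the first sequence, column 4 = aligned length in the first sequence, column 7 = name of the second sequence); the function name `aligned_per_pair` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum the aligned bases for each pair of sequences
# in a LAST tabular file and list the pairs, largest total first.
aligned_per_pair() {
  awk -F'\t' '
    /^#/ { next }                      # skip comment lines
    { bases[$2 "\t" $7] += $4 }        # add aligned length to this pair
    END { for (p in bases) printf "%s\t%d\n", p, bases[p] }
  ' "$1" | sort -t $'\t' -k3,3nr
}

# Usage sketch:
#   aligned_per_pair alignments.tab | head
```

Pairs of different chromosomes that show up with large totals would be the first places to look for rearranged regions.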

 

For the steps, I followed the tutorial here: https://github.com/mcfrith/last-genome-alignments

 

Another update of bitools

During the last two weeks, I updated the bitools container twice. Two new tools were added to it:

1. velvet

An easy-to-use de novo assembler that I use for metagenome studies.

2. Bandage

A tool to visualize the assembly graphs produced by velvet.

Great book on statistical inference

Over the last few days I came across a really nice book that offers an update on, and a refresher of, statistical inference: ‘Computer Age Statistical Inference: Algorithms, Evidence and Data Science’ by Bradley Efron and Trevor Hastie.

Trevor Hastie's webpage even has a download link to the PDF of the book. I can highly recommend it!

New version of the bitools docker container

I updated bitools, the Docker container I maintain to keep all the bioinformatics tools I currently use in one place. Yesterday I added FEELnc, a tool to detect lncRNAs from RNA-seq data.

UPDATE: Apparently there was an issue with the ForkManager Perl module in the Docker container. I fixed it on 7.12.2017 and updated the Docker image v0.1.6 on Docker Hub.

New publication

The November issue of Computer Methods and Programs in Biomedicine contains an article about my R-package ‘GenomicTools’:

The R-package GenomicTools for multifactor dimensionality reduction and the analysis of (exploratory) Quantitative Trait Loci

Background and objectives

We introduce the R-package GenomicTools to perform, among others, a Multifactor Dimensionality Reduction (MDR) for the identification of SNP-SNP interactions. The package further provides a new class of tests for an (exploratory) Quantitative Trait Loci analysis that overcomes some of the limitations of other popular (e)QTL approaches. Popular (e)QTL approaches that use linear models or ANOVA are often based on over-simplified models that have weak statistical properties and which are not robust against outlying observations.

Method

The algorithm to calculate the MDR is well established. To speed up its calculation in R, we implemented it in C++. Further, our implementation also supports the combination of several MDR results to an MDR ensemble classifier. The (e)QTL test procedure is based on a generalized Mann-Whitney test that is tailored for directional alternatives, as they are present in an (e)QTL analysis.

Results

Our package GenomicTools provides functions to determine SNP combinations that have the highest accuracy for a MDR classification problem. It also provides functions to combine the best MDR results to a joined ensemble classifier for improved classification results. Further, the (e)QTL analysis is based on a solid statistical theory. In addition, informative visualizations of the results are provided.

Conclusion

The here presented new class of tests and methods have an easy to apply syntax, so that also researchers inexperienced in R are able to apply our proposed methods and implementations. The package creates publication ready Figures and hence could be a valuable tool for genomic data analysis.