Category: Bioinformatics

Introducing the Fluidigm R-Package

Our Fluidigm R-package was just released on CRAN. The package is designed to streamline the analysis of genotyping data from Fluidigm machines. It offers a suite of tools for data handling and analysis, making it easier for researchers to work with their data. Here are the key functions provided by the package:

  1. fluidigm2PLINK(...): Converts Fluidigm data to the format used by PLINK, creating a ped/map-file pair from the CSV output received from the Fluidigm machine.
  2. estimateErrors(...): Estimates errors in the genotyping data.
  3. calculatePairwiseSimilarities(...): Calculates pairwise similarities between samples.
  4. getPairwiseSimilarityLoci(...): Determines pairwise similarity loci.
  5. similarityMatrix(...): Generates a similarity matrix.

Users can choose to run these functions individually or execute them all at once using the convenient fluidigmAnalysisWrapper(...) wrapper function.

Finding the Closest Variants to Specific Genomic Locations

In the field of genomics, we often need to find the closest variants (e.g., SNPs, indels) to a set of genomic locations of interest. This task can be accomplished using various bioinformatics tools such as bedtools. In this blog post, we will walk through a step-by-step guide on how to achieve this.

Prerequisites

Before we start, make sure you have the following files:

  1. A BED file with your locations of interest. In this example, we’ll use locations_of_interest.bed
  2. A VCF file with your variants. In this example, we’ll use FinalSetVariants_referenceGenome.vcf

Step 1: Sorting the VCF File

The first issue we encountered was that the VCF file was not sorted. bedtools requires its input to be sorted by chromosome (lexicographically) and then by position (numerically). We can sort the VCF file, keeping the header lines on top, with the following command:

(grep '^#' FinalSetVariants_referenceGenome.vcf; grep -v '^#' FinalSetVariants_referenceGenome.vcf | sort -k1,1 -k2,2n) > sorted_FinalSetVariants_referenceGenome.vcf

This command separates the header lines (those starting with #) from the data lines, sorts the data lines, and then concatenates the header and sorted data into a new file sorted_FinalSetVariants_referenceGenome.vcf.
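
To see what this does, here is a toy example with a made-up VCF (file name and records are hypothetical):

```shell
# Build a tiny, unsorted toy VCF (hypothetical data)
printf '##fileformat=VCFv4.2\n'       >  toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\n'  >> toy.vcf
printf 'chr2\t500\t.\tA\tG\n'         >> toy.vcf
printf 'chr1\t900\t.\tC\tT\n'         >> toy.vcf
printf 'chr1\t100\t.\tG\tA\n'         >> toy.vcf

# Header lines stay on top; data lines are sorted by chromosome, then position
(grep '^#' toy.vcf; grep -v '^#' toy.vcf | sort -k1,1 -k2,2n) > toy.sorted.vcf
cat toy.sorted.vcf
# data lines now appear in the order: chr1 100, chr1 900, chr2 500
```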

Step 2: Converting VCF to BED and Finding the Closest Variants

The next step is to find the variants closest to our locations of interest. By default, bedtools closest outputs the entire VCF entry, which might be more information than we need. To limit the output, we can convert the VCF file to BED format on the fly and add the marker name, chr_bpLocation (the naming convention we use for our markers), as an additional field. We can also add the -d option to report the distance between each location of interest and its closest variant. Here is the command:

awk 'BEGIN {OFS="\t"} {if (!/^#/) {print $1,$2-1,$2,$4"/"$5,"+",$1"_"$2}}' sorted_FinalSetVariants_referenceGenome.vcf | bedtools closest -a locations_of_interest.bed -b stdin -d

This command uses awk to read the VCF data, convert it to BED format, and write the result to the standard output. The pipe (|) then feeds this output directly into bedtools closest as the -b file. The keyword stdin is used to tell bedtools to read from the standard input.
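
As a toy illustration (the file name and the record are made up), this is what the awk conversion produces for a single VCF data line:

```shell
# One hypothetical VCF record
printf 'chr1\t12345\t.\tA\tG\n' > toy_variants.vcf

# Convert to BED on the fly: 0-based start, 1-based end, alleles, strand, marker name
awk 'BEGIN {OFS="\t"} {if (!/^#/) {print $1,$2-1,$2,$4"/"$5,"+",$1"_"$2}}' toy_variants.vcf
# -> chr1  12344  12345  A/G  +  chr1_12345
```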

Conclusion

With these two steps, we can efficiently find the closest variants to a set of genomic locations of interest. This approach is flexible and can be adapted to different datasets and requirements.

Couldn’t delete a failed R package installation

Today I faced a strange issue while trying to install an update to one of my R packages. As usual, I installed the latest version like this:

library("devtools")
install_github("fischuu/GenomicTools")

But the installation failed, and when I tried to reinstall it, I got this error message:

Installing package into ‘/homeappl/home/<user>/R/x86_64-redhat-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)
ERROR: failed to lock directory ‘/homeappl/home/<user>/R/x86_64-redhat-linux-gnu-library/4.1’ for modifying
Try removing ‘/homeappl/home/<user>/R/x86_64-redhat-linux-gnu-library/4.1/00LOCK-GenomicTools’
Warning message:
In i.p(...) :
  installation of package ‘/tmp/Rtmp<...>Lv/file3c36<...>34/GenomicTools_0.2.11.tar.gz’ had non-zero exit status

So I went into said directory to delete the folder manually, and there I received another error:

rm: cannot remove '00LOCK-GenomicTools/GenomicTools/libs/.nfs00000001002e2<...>d': Device or resource busy

I tried various suggested fixes, but nothing helped to delete that folder; it kept telling me that the device was busy. Eventually, it helped to just rename the folder like this:

mv 00LOCK-GenomicTools/ 00LOCK-GenomicTools-deleteThis/

The renamed folder is still hanging around, but I was able to reinstall the R package. I will revisit the folder in a few days and check whether the device is still busy or whether I can delete it then…

Take a random sample of size k from paired-end FASTQ

Today I wrote a bash script that creates a random subset of a paired-end FASTQ file pair. It takes the names of the two FASTQ files as input, along with the number of reads that the sample should contain.

The script is mainly based on this blog post. It is rather rough code and could be made more user-friendly with more options, but in its current form it does what I need it to do.

#!/bin/bash

round() {
    printf "%.2f" "$1"
}

file1=$1
file2=$2
sample=$3
# Input test
if ! [[ $sample =~ ^-?[0-9]+([.][0-9]+)?$ ]]; then
  >&2 echo "$sample is not a number"; exit 1
fi

extension1="${file1##*.}"
extension2="${file2##*.}"
filename1="${file1%.*}"
filename2="${file2%.*}"

fn1=$filename1"_"$sample".fastq"
fn2=$filename2"_"$sample".fastq"

if [ $extension1 == "gz" ]; then
  gunzip $file1;
  file1=$filename1;
  filename1="${file1%.*}"
  fn1=$filename1"_"$sample".fastq"
fi
if [ $extension2 == "gz" ]; then
  gunzip $file2;
  file2=$filename2;
  filename2="${file2%.*}"
  fn2=$filename2"_"$sample".fastq"
fi

lines=$(wc -l < $file1)
echo $lines
echo $sample

# A sample value <= 1 is interpreted as a fraction of the reads (one read = 4 lines)
if (( $(awk 'BEGIN {print ("'$sample'" <= 1)}') )); then
  sample=$(awk 'BEGIN {printf("%.0f", "'$sample'" * "'$lines'" / 4)}')
fi

echo $sample

paste $file1 $file2 | \
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | \
awk -v k=$sample 'BEGIN{srand(systime() + PROCINFO["pid"]); }{ s=x++<k?x-1:int(rand()*x);
                  if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | \
awk -F"\t" -v file1=$fn1 -v file2=$fn2 '{print $1"\n"$3"\n"$5"\n"$7 > file1;\
                                         print $2"\n"$4"\n"$6"\n"$8 > file2}'
                                         
if [ $extension1 == "gz" ]; then
  gzip $fn1;
  gzip $file1;
fi
if [ $extension2 == "gz" ]; then
  gzip $fn2;
  gzip $file2;
fi
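
As a quick sanity check, the core of the sampling pipeline can be run on a pair of tiny, made-up FASTQ files (all file names here are hypothetical):

```shell
# Build two tiny paired FASTQ files with 8 reads each (made-up data)
rm -f demo_R1.fastq demo_R2.fastq
for i in 1 2 3 4 5 6 7 8; do
  printf '@read%s/1\nACGT\n+\nIIII\n' "$i" >> demo_R1.fastq
  printf '@read%s/2\nTGCA\n+\nIIII\n' "$i" >> demo_R2.fastq
done

k=3
# Interleave the pairs, collapse each read (4 lines) onto one tab-separated line,
# reservoir-sample k reads, then split the pairs back into two files
paste demo_R1.fastq demo_R2.fastq | \
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | \
awk -v k=$k 'BEGIN{srand()}{ s=x++<k?x-1:int(rand()*x); if(s<k)R[s]=$0 }END{for(i in R)print R[i]}' | \
awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "demo_R1_sub.fastq";
             print $2"\n"$4"\n"$6"\n"$8 > "demo_R2_sub.fastq"}'

wc -l demo_R1_sub.fastq demo_R2_sub.fastq   # 12 lines each: 3 reads x 4 lines
```

Note that srand() is seeded from the clock here; the original script additionally mixes in PROCINFO["pid"], which is gawk-specific.
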
New Bovine Genome Comparison

Since 2014, the standard for bovine genome studies has been the UMD 3.1 assembly.

However, a few days ago a new assembly was released, called ARS-UCD1.2. This assembly can be downloaded here:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/263/795/GCA_002263795.2_ARS-UCD1.2/

Just to get a quick impression, I compared both assemblies and checked their similarity using LAST (http://last.cbrc.jp/).

First, I created a LAST database like this:

lastdb -P0 -uNEAR -R01 $FOLDER/UMD31/UMD31-NEAR $FOLDER/UMD31/UMD3.1_chromosomes.fa

Then, I determined the substitution and gap frequencies:

last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 $FOLDER/UMD31/UMD31-NEAR $FOLDER/ARS/ARS_UCD12.fna > $FOLDER/UMD-ARS.mat

After the training, the alignment itself was performed (here is the parallel part of the SLURM script):

FOLDER="/wrk/daniel/References/";
chr=($(ls $FOLDER/ARS/chr*));

lastal -m50 -E0.05 -C2 -p $FOLDER/UMD-ARS.mat $FOLDER/UMD31/UMD31-NEAR ${chr[$SLURM_ARRAY_TASK_ID]} | last-split -m1 > UMD-ARS-$SLURM_ARRAY_TASK_ID.maf

As I ran the alignment in parallel for each chromosome, the headers of the individual output files needed to be removed:

cat *.maf > all.maf
sed '/^#/ d' < all.maf > temp.maf
head -n 22 all.maf > headerLines
cat headerLines temp.maf > alignments.maf

Finally, the merged simple-sequence alignments were discarded, the alignments were converted to tabular format, and alignments with error probability > 10^-5 were discarded:

last-postmask alignments.maf |
maf-convert -n tab |
awk -F'=' '$2 <= 1e-5' > alignments.tab
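
As a toy illustration of the filtering step (the lines and the field values are made up), the awk command keeps only lines whose value after the first '=' is at most 1e-5:

```shell
# Three made-up tab lines with an error-probability annotation
printf 'aln1 mismap=1e-07\naln2 mismap=0.5\naln3 mismap=1e-06\n' > toy.tab

# Split each line on '=' and keep it if the value is <= 1e-5
awk -F'=' '$2 <= 1e-5' toy.tab
# keeps aln1 and aln3; aln2 (0.5) is filtered out
```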

From that tab file, the dotplot was then created:

last-dotplot -x 4000 -y 4000 alignments.tab alignments.png

This is what the dotplot looks like: the two assemblies are largely the same genome, but some regions have clearly changed! (Open the image and zoom in on the diagonal to see the differences.)

For these steps, I followed the tutorial at https://github.com/mcfrith/last-genome-alignments.