Author: Daniel

Workshop: “Foundation for the Future Agenda”

I just noticed I haven’t written anything yet about my attendance at the Shared Workshop: “Foundation for the Future Agenda” in Hinxton, UK, during February. This was a workshop organized by EBI for the three H2020 projects GeneSwitch, BovReg and AquaFAANG.

As always, it was an excellent meeting in Hinxton, with lots of good interaction and talks. Also, a visit to the Red Lion is always a nice experience.

Git submodules

Today I came across git submodules for the first time. While cloning a repository from GitHub, I noticed that one folder in it remained empty. On closer inspection, the folder turned out to be a reference to a tree in another repository.

In this case, the folder htslib comes from a tree in a different repository. After I cloned the repository like this (I had forked it before):

git clone https://github.com/fischuu/SE-MEI.git

the folder htslib remained empty. That is because the files of submodules are not fetched by default. This has to be done separately, by first initializing the submodules (after changing into the cloned repository with cd):

git submodule init

and then fetching their files:

git submodule update

After these steps, the repository should be complete. Instead of initializing and updating the submodules separately, there is also a shortcut that fetches everything in one step by passing an additional option to git clone:

git clone --recurse-submodules https://github.com/fischuu/SE-MEI.git
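
The behaviour described above can be reproduced offline with two throwaway local repositories. The following is only a minimal sketch: the repository names, the helper function g and the temporary directory are made up for the demo, and the setting protocol.file.allow=always is needed only because recent git versions block submodules that are cloned from local paths by default.

```shell
#!/bin/bash
# Sketch: reproduce the "empty submodule folder" situation locally.
# All repository names and the helper g() are hypothetical demo names.
set -euo pipefail

demo=$(mktemp -d)

# Wrapper that allows local-path submodules and sets a throwaway identity.
g() {
  git -c protocol.file.allow=always \
      -c user.name=demo -c user.email=demo@example.org "$@"
}

# 1. A small "library" repository that will act as the submodule.
g init -q "$demo/lib"
echo "hello" > "$demo/lib/README"
g -C "$demo/lib" add README
g -C "$demo/lib" commit -q -m "lib: initial commit"

# 2. A parent repository that references the library in the folder htslib.
g init -q "$demo/parent"
g -C "$demo/parent" submodule -q add "$demo/lib" htslib
g -C "$demo/parent" commit -q -m "parent: add submodule"

# 3. A plain clone leaves the submodule folder empty ...
g clone -q "$demo/parent" "$demo/clone"
files_before=$(ls -A "$demo/clone/htslib" | wc -l | tr -d ' ')

# 4. ... until the submodule is initialized and updated.
g -C "$demo/clone" submodule -q init
g -C "$demo/clone" submodule -q update
files_after=$(ls -A "$demo/clone/htslib" | wc -l | tr -d ' ')

echo "entries in htslib before: $files_before, after: $files_after"
```

For a repository that has already been cloned without its submodules, the two steps can also be combined into a single call, git submodule update --init.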

More details on git submodules can be found here.

Take a random sample of size k from paired-end FASTQ

Today I wrote a bash script that creates a random subset of a paired-end FASTQ file pair. It takes the names of the two FASTQ files as input, followed by the number of reads the sample should contain (a value of at most 1 is treated as a fraction of the reads).

The script is mainly based on this blog post. The code is rather rough; it could be more user-friendly and offer more options, but in its current form it does what I need it to do.

#!/bin/bash

round() {
    printf "%.2f" "$1"
}

file1=$1
file2=$2
sample=$3
# Input test: the sample size must be a non-negative number
# (a value <= 1 is interpreted as a fraction)
if ! [[ $sample =~ ^[0-9]+([.][0-9]+)?$ ]]; then
  >&2 echo "$sample is not a number"; exit 1;
fi
  
extension1="${file1##*.}"
extension2="${file2##*.}"
filename1="${file1%.*}"
filename2="${file2%.*}"

fn1=$filename1"_"$sample".fastq"
fn2=$filename2"_"$sample".fastq"

# Decompress gzipped input and derive the output names from the plain files
if [ "$extension1" == "gz" ]; then
  gunzip "$file1";
  file1=$filename1;
  filename1="${file1%.*}"
  fn1=$filename1"_"$sample".fastq"
fi
if [ "$extension2" == "gz" ]; then
  gunzip "$file2";
  file2=$filename2;
  filename2="${file2%.*}"
  fn2=$filename2"_"$sample".fastq"
fi

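# Number of lines in the first file (each FASTQ record spans four lines)
lines=$(wc -l < "$file1")
echo "$lines"
echo "$sample"

# A value <= 1 is a fraction: convert it into an absolute number of reads
if (( $(awk -v s="$sample" 'BEGIN { print (s <= 1) }') )); then
  sample=$(awk -v s="$sample" -v n="$lines" 'BEGIN { printf("%.0f", s * n / 4) }')
fi

echo "$sample"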

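# Pair up the lines of both files, put each 4-line record on one tab-separated
# row, reservoir-sample k rows and split them back into the two output files.
# Note: systime() and PROCINFO are gawk extensions.
paste "$file1" "$file2" | \
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | \
awk -v k="$sample" 'BEGIN{srand(systime() + PROCINFO["pid"]); }{ s=x++<k?x-1:int(rand()*x);
                  if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | \
awk -F"\t" -v file1="$fn1" -v file2="$fn2" '{print $1"\n"$3"\n"$5"\n"$7 > file1;\
                                         print $2"\n"$4"\n"$6"\n"$8 > file2}'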
                                         
# Compress the samples and restore the originally gzipped input files
if [ "$extension1" == "gz" ]; then
  gzip "$fn1";
  gzip "$file1";
fi
if [ "$extension2" == "gz" ]; then
  gzip "$fn2";
  gzip "$file2";
fi
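
To check that the sampled reads stay paired, the same pipeline can be run on a small toy data set. This is only a sketch under a few assumptions: the file names, the read names and the fixed seed srand(42) are invented for the demo (the script above seeds from the clock and the process ID instead), and the toy reads carry /1 and /2 suffixes so that the mates can be compared by name.

```shell
#!/bin/bash
# Sketch: sample k read pairs from a toy paired-end FASTQ pair and
# collect the read names of both outputs. All names are hypothetical.
set -euo pipefail

work=$(mktemp -d)
k=3

# Ten toy reads per file; read i is named @read<i>/1 and @read<i>/2.
for i in $(seq 1 10); do
  printf '@read%s/1\nACGT\n+\nIIII\n' "$i" >> "$work/toy_R1.fastq"
  printf '@read%s/2\nTGCA\n+\nIIII\n' "$i" >> "$work/toy_R2.fastq"
done

# Pair the lines, one record pair per row, reservoir-sample k rows, split again.
paste "$work/toy_R1.fastq" "$work/toy_R2.fastq" |
awk '{ printf("%s", $0); n++; if (n % 4 == 0) printf("\n"); else printf("\t") }' |
awk -v k="$k" 'BEGIN { srand(42) }
     { s = (x++ < k) ? x - 1 : int(rand() * x); if (s < k) R[s] = $0 }
     END { for (i in R) print R[i] }' |
awk -F"\t" -v out1="$work/sub_R1.fastq" -v out2="$work/sub_R2.fastq" \
    '{ print $1 "\n" $3 "\n" $5 "\n" $7 > out1
       print $2 "\n" $4 "\n" $6 "\n" $8 > out2 }'

# Read names are on every 4th line, starting at line 1; drop the /1 or /2 suffix.
names1=$(awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' "$work/sub_R1.fastq")
names2=$(awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' "$work/sub_R2.fastq")
n_reads=$(( $(wc -l < "$work/sub_R1.fastq") / 4 ))

echo "sampled $n_reads read pairs"
```

If names1 and names2 are identical, both output files contain the same reads in the same order. Sampling whole rows after the paste step is what keeps the mates together.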