Take a random sample of size k from paired-end FASTQ
Today I wrote a bash script that creates a random subset of a paired-end FASTQ file pair. It requires the names of the two FASTQ-files as input and also the amount of reads that the sample should have.
The script is mainly based on this Blog post. This is a rather rough code and it could be more user-friendly and allow for more options, but in its current form, it does what I need it to do.
#!/bin/bash round() { printf "%.2f" "$1" } file1=$1 file2=$2 sample=$3 # Input test if ! [[ $sample =~ ^-?[0-9]+([.][0-9]+)?$ ]]; then >&2 echo "$sample is not a number"; exit 1; fi extension1="${file1##*.}" extension2="${file2##*.}" filename1="${file1%.*}" filename2="${file2%.*}" fn1=$filename1"_"$sample".fastq" fn2=$filename2"_"$sample".fastq" if [ $extension1 == "gz" ]; then gunzip $file1; file1=$filename1; filename1="${file1%.*}" fn1=$filename1"_"$sample".fastq" fi if [ $extension2 == "gz" ]; then gunzip $file2; file2=$filename2; filename2="${file2%.*}" fn2=$filename2"_"$sample".fastq" fi lines=$(wc -l < $file1) echo $lines echo $sample if (( $(awk 'BEGIN {print ("'$sample'" <= 1)}') )); then sample=$(awk 'BEGIN {printf("%.0f", "'$sample'" * "'$lines'")}') fi echo $sample paste $file1 $file1 | \ awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | \ awk -v k=$sample 'BEGIN{srand(systime() + PROCINFO["pid"]); }{ s=x++<k?x- 1:int(rand()*x); if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | \ awk -F"\t" -v file1=$fn1 -v file2=$fn2 '{print $1"\n"$3"\n"$5"\n"$7 > file1;\ print $2"\n"$4"\n"$6"\n"$8 > file2}' if [ $extension1 == "gz" ]; then gzip $fn1; gzip $file1; fi if [ $extension2 == "gz" ]; then gzip $fn2; gzip $file2; fi