Take a random sample of size k from paired-end FASTQ

Today I wrote a bash script that creates a random subset of a paired-end FASTQ file pair. It takes the names of the two FASTQ files as input, plus the number of reads that the sample should contain.

The script is mainly based on this blog post. The code is rather rough and could be more user-friendly and offer more options, but in its current form it does what I need it to do.

#!/bin/bash

round() {
    printf "%.2f" "$1"
}

file1=$1
file2=$2
sample=$3
# Input test: the sample size must be a non-negative number
if ! [[ $sample =~ ^[0-9]+([.][0-9]+)?$ ]]; then
  >&2 echo "$sample is not a number"; exit 1
fi

extension1="${file1##*.}"
extension2="${file2##*.}"
filename1="${file1%.*}"
filename2="${file2%.*}"

fn1=$filename1"_"$sample".fastq"
fn2=$filename2"_"$sample".fastq"

if [ $extension1 == "gz" ]; then
  gunzip $file1;
  file1=$filename1;
  filename1="${file1%.*}"
  fn1=$filename1"_"$sample".fastq"
fi
if [ $extension2 == "gz" ]; then
  gunzip $file2;
  file2=$filename2;
  filename2="${file2%.*}"
  fn2=$filename2"_"$sample".fastq"
fi

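# A FASTQ record spans four lines, so the number of reads is lines / 4
lines=$(wc -l < "$file1")
reads=$((lines / 4))
echo $reads
echo $sample

# A value of at most 1 is treated as a fraction of all reads
if (( $(awk -v s="$sample" 'BEGIN {print (s <= 1)}') )); then
  sample=$(awk -v s="$sample" -v n="$reads" 'BEGIN {printf("%.0f", s * n)}')
fi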

echo $sample

paste "$file1" "$file2" | \
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | \
awk -v k=$sample 'BEGIN{srand(systime() + PROCINFO["pid"]); }{ s=x++<k?x- 1:int(rand()*x);
                  if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | \
awk -F"\t" -v file1=$fn1 -v file2=$fn2 '{print $1"\n"$3"\n"$5"\n"$7 > file1;\
                                         print $2"\n"$4"\n"$6"\n"$8 > file2}'
                                         
if [ $extension1 == "gz" ]; then
  gzip $fn1;
  gzip $file1;
fi
if [ $extension2 == "gz" ]; then
  gzip $fn2;
  gzip $file2;
fi
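
A quick way to check that the sampling pipeline keeps the read pairs in sync is to run it on a few synthetic reads. The sketch below is not part of the script above: the file names, the five made-up reads and the fixed seed are assumptions for illustration, and srand(seed) stands in for the gawk-specific seeding so that it should also run with other awk implementations.

```shell
#!/bin/bash
# Sketch: reservoir-sample k read pairs from two tiny, made-up FASTQ files.
tmp=$(mktemp -d)

# Build 5 synthetic read pairs (4 lines per read in each file)
for i in 1 2 3 4 5; do
  printf '@read%s/1\nACGT\n+\nIIII\n' "$i" >> "$tmp/r1.fastq"
  printf '@read%s/2\nTGCA\n+\nIIII\n' "$i" >> "$tmp/r2.fastq"
done

k=2
# Step 1: paste the files side by side and join the 4 lines of each
#         read pair into one tab-separated record.
# Step 2: reservoir sampling, keep the first k records and then replace
#         kept records at random as further records arrive.
# Step 3: split each sampled record back into the two output files.
paste "$tmp/r1.fastq" "$tmp/r2.fastq" |
  awk '{ printf("%s", $0); n++; if (n % 4 == 0) printf("\n"); else printf("\t") }' |
  awk -v k="$k" -v seed=42 'BEGIN { srand(seed) }
       { s = x++ < k ? x - 1 : int(rand() * x); if (s < k) R[s] = $0 }
       END { for (i in R) print R[i] }' |
  awk -F'\t' -v f1="$tmp/s1.fastq" -v f2="$tmp/s2.fastq" \
      '{ print $1 "\n" $3 "\n" $5 "\n" $7 > f1
         print $2 "\n" $4 "\n" $6 "\n" $8 > f2 }'

# Both samples should hold k reads, and the read names (without the
# /1 and /2 suffixes) should be identical and in the same order.
h1=$(awk 'NR % 4 == 1 { sub(/\/1$/, ""); print }' "$tmp/s1.fastq")
h2=$(awk 'NR % 4 == 1 { sub(/\/2$/, ""); print }' "$tmp/s2.fastq")
echo "reads in sample: $(( $(wc -l < "$tmp/s1.fastq") / 4 ))"
[ "$h1" = "$h2" ] && echo "pairs are in sync"
```

The same header comparison can be run on the real output files after sampling, as long as the read names in both files differ only in such a suffix.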
Adding an external HDD to fstab

To add an external HDD permanently, the best way is to first identify its UUID and the corresponding file system by typing

sudo blkid

With that information one can edit the fstab

sudo vim /etc/fstab

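and add a line of the format

UUID=<UUID> \tab <mountPoint> \tab <filesystem> \tab <options> \tab 0 \tab 2

Here, \tab stands for a tab character. The values for <UUID> and <filesystem> come from the blkid command, the mount point is a free choice, and as option I choose e.g. errors=remount-ro. The last two fields are the dump flag and the fsck pass number; fstab(5) recommends pass number 2 for filesystems other than the root filesystem.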

Once the fstab is populated like this, just try to mount the disk by typing

sudo mount -a
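
To avoid typos in the entry, the line can also be assembled by a small helper. This is only a sketch: the function name fstab_line, the UUID and the mount point are made up for illustration, and the pass number 2 follows the fstab(5) recommendation for non-root filesystems. It only prints the line and does not touch /etc/fstab.

```shell
#!/bin/bash
# Print one fstab entry with tab-separated fields.
# Arguments: UUID, mount point, filesystem type, mount options.
fstab_line() {
  # Fields 5 and 6 are the dump flag (0 = off) and the fsck pass number.
  printf 'UUID=%s\t%s\t%s\t%s\t0\t2\n' "$1" "$2" "$3" "$4"
}

# Hypothetical values; take the real UUID and type from `sudo blkid`.
fstab_line "1234-ABCD-EXAMPLE" /mnt/external ext4 errors=remount-ro
```

After checking the printed line, it can be pasted into /etc/fstab by hand as described above.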
Cancel all slurm jobs with a job ID larger than X

Sometimes we have a whole bunch of slurm jobs running from different projects: some have been running for days, while others have only just been fired off. And then we notice, damn, the 100 jobs that I just fired off are wrong and need to be cancelled. Unfortunately, there is no single slurm command that can do that; it requires some scripting.

The following script takes a slurm job ID as input and cancels all jobs with a larger ID (that belong to the logged-in user).

#!/bin/bash

declare -a jobs=()

if [ -z "$1" ] ; then
    echo "Minimum Job Number argument is required.  Run as '$0 jobnum'"
    exit 1
fi

minjobnum="$1"

myself="$(id -u -n)"

for j in $(squeue --user="$myself" --noheader --format='%i') ; do
  if [ "$j" -gt "$minjobnum" ] ; then
    jobs+=($j)
  fi
done

# Only call scancel when at least one job matched
if [ "${#jobs[@]}" -gt 0 ] ; then
  scancel "${jobs[@]}"
fi

If you store this, e.g. as killLarger.sh, somewhere in your PATH, you can call it from anywhere to cancel all of your slurm jobs with an ID larger than the one given.
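
Before cancelling anything, it can be reassuring to preview which job IDs would be selected. The following is a minimal sketch of such a dry run, not part of the script above: the helper name ids_above and the job IDs are made up, and stripping everything after an underscore assumes that array jobs are listed as <base_job_id>_<index>.

```shell
#!/bin/bash
# Read job IDs from stdin (one per line) and print those whose numeric
# base ID is larger than the threshold given as the first argument.
ids_above() {
  min=$1
  while read -r id; do
    base=${id%%_*}               # "1003_4" -> "1003" (array task suffix)
    case $base in
      ''|*[!0-9]*) continue ;;   # skip anything that is not a plain number
    esac
    if [ "$base" -gt "$min" ]; then
      echo "$id"
    fi
  done
}

# Self-contained demonstration with made-up job IDs
printf '%s\n' 1001 1002 1003_4 1005 | ids_above 1002
```

On a cluster, the printf line would be replaced by the squeue call from the script, i.e. squeue --user="$(id -u -n)" --noheader --format='%i' | ids_above 1002, and the printed IDs can be checked before they are passed on to scancel.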

RumenPredict

Predicting appropriate GHG mitigation strategies based on modelling variables that contribute to ruminant environmental impact.

Objective:


Ruminant production is responsible for ~9% of anthropogenic CO2 emissions and 37% of CH4 emissions. Release of methane results in 6-12% less energy being available to the animal. Ruminants also contribute N2O to the environment, a persistent gas in the atmosphere with 296 times the warming potential of CO2. RumenPredict brings together members of the international Rumen Microbial Genomics (RMG) network (led by IBERS, AU), which includes the Hungate 1000 project (focussed on sequencing 1000 rumen microbes) and the Rumen Census (focussed on evaluating the effects of diet, host genetics and geographical location on the rumen microbiome).

RumenPredict brings together key members of the RMG network to generate the data needed to link rumen microbiome information to host genetics and phenotype and to develop feed-based mitigation strategies. This will enhance innovative capacity and allow integration of new knowledge with that previously generated to devise geographic and animal-specific solutions to reduce the environmental impact of livestock ruminants. The project members have access to recent data/tools resulting from an array of projects, and RumenPredict will build upon and enhance the integration of knowledge generated from these projects whilst providing innovation through further testing and validation of key hypotheses resulting from the previously obtained data. RumenPredict will provide a platform for predicting how host genetics, feed additives or the microbiome may affect emission phenotypes, and will develop genetic/diet/prediction technologies further for implementation to improve nitrogen use efficiency whilst decreasing the environmental impact of ruminants.

Link: https://eragas.eu/research-projects/rumenpredict

NanoBioMass

Natural Secreted Nano Vesicles as a Source of Novel Biomass Products for Circular Economy

Objective:

This BioFuture2025 project targets the nano- and micro-vesicles that are collectively called exosomes here. The exosomes represent a new humoral, systemic layer that controls homeostasis. Since the exosomes are around the size of viruses and are also present in saliva, they may function as a novel bio aerosol class. The exosomes transmit various types of relevant cellular biomolecules such as proteins, RNA/DNA and metabolites. For these reasons the exosomes may offer openings to target (biological) drugs, to image tissues and organs in vivo, and eventually even to develop non-invasive surgery therapies. The exosomes can be expected to offer fundamental opportunities for disease diagnostics. Individual exosomes may themselves serve as biological drugs when produced in mass quantities for medical practice. In summary, the exosomes offer important opportunities to develop significant bio economically valuable products. In the project we will enrich exosomes from the air, milk and certain other biological fluids. We will define the composition of the exosomes, their nucleic acids and proteins. We will develop better ways to purify the exosomes and methods to define their molecular signatures. With the identified molecular tools we aim to enrich specific types of exosomes. We will then use the enriched exosomes in assays to learn more about their cellular functions and mechanisms of action. We will use nano-level filters to analyse air and to study whether the exosomes may serve as a novel way to characterize the quality of air. We will develop technologies to enrich and characterize exosomes from milk. We will go on to target the roles of the milk-derived exosomes in wealth in defined model assay systems. The aim is to reveal the mode of their cellular entry and roles in metabolic control.
Moreover, we will study how nutrition may be reflected in the composition of the exosomes and the quality of milk, and whether milk offers ways to obtain large amounts of exosomes and to generate custom-made exosomes for the different sectors of the bio economy. From the obtained data sets we will generate a data bank.

Link: https://www.aka.fi/en/research-and-science-policy/academy-programmes/current-programmes/biofuture2025/