Category: Bioinformatics

New R-package started

I just started a new R package called ‘SnakebiteTools’, where I plan to collect small helper functions for analysing and monitoring Snakemake runs. The output can later be used for resource optimisation, for checking the status of an ongoing Snakemake run (which can get messy for runs with many jobs), and so on.

Creating Drop-Down Menus in Excel

I often do not remember how to create simple drop-down menus in Excel, so I decided to write a short note here. Here is what I want:

  1. In one tab, I want a column with the possible values for my drop-down menu, e.g. my project names.
  2. In another tab, I want each cell of a column to offer a drop-down that lets me choose from these values.

This is in principle rather easy to achieve:

Step 1: Prepare the List on Another Sheet

  1. Open your Excel file and go to the sheet where you want to store the drop-down values (e.g., Projects).
  2. Enter the list of values in a column (e.g., A1:A10 in Projects).

Step 2: Name the List (Optional but Recommended)

  1. Select the range of values in Projects (e.g., A1:A10 or the whole column).
  2. Click on the Formula tab → Define Name.
  3. Enter a name (e.g., MyProjects) and click OK.

Step 3: Create the Drop-Down List

  1. Go to the sheet where you want the drop-down (e.g., Tasks).
  2. Select the cell(s) where you want the drop-down.
  3. Click on the Data tab → Data Validation.
  4. In the Allow box, choose List.
  5. In the Source box:
    • If you named the range: enter =MyProjects
    • If not: enter the range directly, e.g. =Projects!A1:A10
  6. Click OK.

If the column with the drop-downs has a header in the first row (e.g. a column name), you probably do not want a drop-down in the header cell itself. You can remove it like this:

Remove Data Validation from One Cell (Header Only)

  1. Click on the first cell of the column (e.g., A1).
  2. Go to the Data tab → Click Data Validation.
  3. In the pop-up, click Clear All → Click OK.

Finding Files in My Folders

Managing disk space efficiently is essential, especially when working on systems with strict file quotas. Recently, I encountered a situation where I had exceeded my file limit and needed a quick way to determine which folders contained the most files. To analyze my storage usage, I used the following command:

for d in .* *; do [ -d "$d" ] && [ "$d" != "." ] && [ "$d" != ".." ] && echo "$d: $(find "$d" -type f | wc -l)"; done | sort -nr -k2

Breaking Down the Command

This one-liner efficiently counts files across all directories in the current location, including hidden ones. Here’s how it works:

  • for d in .* * – Loops through all files and directories, including hidden ones. The pattern .* also matches the special entries . and .., which is why those are filtered out explicitly.
  • [ -d "$d" ] && [ "$d" != "." ] && [ "$d" != ".." ] – Ensures that only directories are processed, skipping . and .. (otherwise . would count the entire current directory and always top the list).
  • find "$d" -type f | wc -l – Counts all files (not directories) inside each folder, including subdirectories.
  • sort -nr -k2 – Sorts the results in descending order based on the number of files.
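
To see the counting loop in action without touching real data, here is a self-contained sketch that builds a tiny directory tree inside a mktemp sandbox (the directory names big, small and .hidden are invented for the demo); the special entries . and .. are skipped explicitly so they do not show up in the counts:

```shell
# Build a throwaway directory tree inside a mktemp sandbox.
tmp=$(mktemp -d)
mkdir -p "$tmp/big/sub" "$tmp/small" "$tmp/.hidden"
touch "$tmp/big/a" "$tmp/big/b" "$tmp/big/sub/c" "$tmp/small/x" "$tmp/.hidden/h"

# Count files per directory, including hidden ones, skipping . and ..
counts=$(cd "$tmp" && for d in .* *; do
  [ -d "$d" ] && [ "$d" != "." ] && [ "$d" != ".." ] \
    && echo "$d: $(find "$d" -type f | wc -l)"
done | sort -nr -k2)

# big holds 3 files (including its subdirectory), so it is listed first.
echo "$counts"
rm -rf "$tmp"
```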

Why This is Useful

With this command, I quickly identified the directories consuming the most inodes and was able to take action, such as cleaning up unnecessary files. It’s an efficient method for understanding file distribution and managing storage limits effectively.

Alternative Approaches

If you only want to count files directly inside each folder (without subdirectories), you can modify the command like this:

for d in .* *; do [ -d "$d" ] && [ "$d" != "." ] && [ "$d" != ".." ] && echo "$d: $(find "$d" -maxdepth 1 -type f | wc -l)"; done | sort -nr -k2

This variation is useful when you need a more localized view of file distribution.

Introducing the Fluidigm R Package

Our Fluidigm R-package was just released on CRAN. The package is designed to streamline the process of analyzing genotyping data from Fluidigm machines. It offers a suite of tools for data handling and analysis, making it easier for researchers to work with their data. Here are the key functions provided by the package:

  1. fluidigm2PLINK(...): Converts Fluidigm data to the format used by PLINK, creating a ped/map-file pair from the CSV output received from the Fluidigm machine.
  2. estimateErrors(...): Estimates errors in the genotyping data.
  3. calculatePairwiseSimilarities(...): Calculates pairwise similarities between samples.
  4. getPairwiseSimilarityLoci(...): Determines pairwise similarity loci.
  5. similarityMatrix(...): Generates a similarity matrix.

Users can choose to run these functions individually or execute them all at once using the convenient fluidigmAnalysisWrapper(...) wrapper function.

Finding the Closest Variants to Specific Genomic Locations

In the field of genomics, we often need to find the closest variants (e.g., SNPs, indels) to a set of genomic locations of interest. This task can be accomplished using various bioinformatics tools such as bedtools. In this blog post, we will walk through a step-by-step guide on how to achieve this.

Prerequisites

Before we start, make sure you have the following files:

  1. A BED file with your locations of interest. In this example, we’ll use locations_of_interest.bed
  2. A VCF file with your variants. In this example, we’ll use FinalSetVariants_referenceGenome.vcf

Step 1: Sorting the VCF File

The first issue we encountered was that the VCF file was not sorted lexicographically. bedtools requires the input files to be sorted in this manner. We can sort the VCF file using the following command:

(grep '^#' FinalSetVariants_referenceGenome.vcf; grep -v '^#' FinalSetVariants_referenceGenome.vcf | sort -k1,1 -k2,2n) > sorted_FinalSetVariants_referenceGenome.vcf

This command separates the header lines (those starting with #) from the data lines, sorts the data lines, and then concatenates the header and sorted data into a new file sorted_FinalSetVariants_referenceGenome.vcf.
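
The effect of the sort can be checked on a tiny, made-up VCF fragment (the file name mini.vcf and its contents are invented for the demo; fields are space-separated here only for readability, real VCF files use tabs):

```shell
# Minimal, fabricated VCF fragment with deliberately unsorted data lines.
printf '%s\n' \
  '##fileformat=VCFv4.2' \
  '#CHROM POS ID REF ALT' \
  'chr2 500 . A T' \
  'chr1 1000 . G C' \
  'chr1 20 . T A' > mini.vcf

# Same recipe as above: keep the header lines on top, sort the data lines
# by chromosome (lexicographically) and position (numerically).
sorted=$( (grep '^#' mini.vcf; grep -v '^#' mini.vcf | sort -k1,1 -k2,2n) )
echo "$sorted"
rm mini.vcf
```

Note that without the numeric flag on the second key, chr1 1000 would sort before chr1 20.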

Step 2: Converting VCF to BED and Finding the Closest Variants

The next step is to find the closest variants to our locations of interest. However, by default, bedtools closest outputs the entire VCF entry, which might be more information than we need. To limit the output, we can convert the VCF file to a BED format on-the-fly and assign an additional feature, the marker name, as chr_bpLocation (which is the convention we use for naming our markers). We can also add the -d option to get the distance between the location of interest and the closest variant. Here is the command:

awk 'BEGIN {OFS="\t"} {if (!/^#/) {print $1,$2-1,$2,$4"/"$5,"+",$1"_"$2}}' sorted_FinalSetVariants_referenceGenome.vcf | bedtools closest -a locations_of_interest.bed -b stdin -d

This command uses awk to read the VCF data, convert it to BED format, and write the result to the standard output. The pipe (|) then feeds this output directly into bedtools closest as the -b file. The keyword stdin is used to tell bedtools to read from the standard input.
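
The awk step can also be tried in isolation on a single fabricated VCF data line (bedtools itself is left out here), which makes the BED columns it produces easy to inspect:

```shell
# Feed one made-up, tab-separated VCF data line (CHROM POS ID REF ALT)
# through the conversion; header lines (starting with '#') would be skipped.
bed=$(printf 'chr1\t1000\t.\tG\tC\n' \
  | awk 'BEGIN {OFS="\t"} {if (!/^#/) {print $1,$2-1,$2,$4"/"$5,"+",$1"_"$2}}')

# -> chr1  999  1000  G/C  +  chr1_1000
# i.e. 0-based start, 1-based end, REF/ALT, and the chr_bpLocation marker name.
echo "$bed"
```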

Conclusion

With these two steps, we can efficiently find the closest variants to a set of genomic locations of interest. This approach is flexible and can be adapted to different datasets and requirements.