Daniel Fischer

Limiting Parallel Jobs in Snakemake Using Resources

1. April 2025 Daniel

Introduction

When running computationally intensive workflows with Snakemake, you might encounter issues where too many jobs are running in parallel, causing excessive I/O load, memory pressure, or high latency on your hard drive. This can lead to failed jobs or degraded performance.

Snakemake provides a way to limit parallel execution per rule using the resources directive, but this only works if you also specify a global resource limit when executing the workflow.

In this blog post, we will demonstrate how to properly limit the number of parallel jobs for a specific rule using Snakemake’s resource management system.

The Problem: Too Many Jobs Running at Once

Consider the following Snakemake rule:

rule process_data:
    input:
        "{sample}.raw"
    output:
        "{sample}.processed"
    resources:
        process_data_jobs=1  # Assign a resource unit for limiting the number of jobs
    shell:
        """
        some_tool --input {input} --output {output}
        """

Why Doesn’t `resources` Alone Limit Job Execution?

You might expect that setting resources: process_data_jobs=1 would automatically limit Snakemake to running only 1 job of this rule at a time. However, process_data_jobs is a user-defined resource, and Snakemake only enforces such a resource during scheduling if you specify a global limit for it when launching the workflow.

Without a global limit, Snakemake may still launch too many jobs in parallel, overloading your system.
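To make the difference concrete, here is a toy model of resource-based admission. It is only a sketch of the idea, not Snakemake's actual scheduler: the function name admit_jobs and the sample names are made up for illustration, and other limits such as the number of cores are ignored.

```python
def admit_jobs(pending, units_per_job, global_limit=None):
    """Return the jobs that may start right now under a simple resource budget.

    global_limit=None models a run without a global limit: the per-job
    request is never checked, so every pending job is admitted.
    """
    if global_limit is None:
        return list(pending)

    admitted = []
    used = 0
    for job in pending:
        # Admit a job only if its request still fits into the remaining budget.
        if used + units_per_job <= global_limit:
            admitted.append(job)
            used += units_per_job
    return admitted


# Hypothetical sample names, one job per sample.
samples = [f"sample{i}" for i in range(25)]

print(len(admit_jobs(samples, units_per_job=1)))                   # 25
print(len(admit_jobs(samples, units_per_job=1, global_limit=10)))  # 10
```

In this toy model the per-job request has no effect until a budget exists to compare it against, which mirrors the behaviour described above.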

The Solution: Enforce Resource Limits

To actually restrict the number of parallel jobs, pass a global limit for the resource when launching Snakemake (in addition to your usual options such as --cores):

snakemake --resources process_data_jobs=10

How Does This Work?

Each job of process_data requests 1 unit of process_data_jobs.
The global limit process_data_jobs=10 ensures that at most 10 jobs of this rule (10 / 1 = 10) run in parallel. You can also use other unit sizes: if each job requested 2 units, the same limit of 10 would allow 5 parallel jobs.

Before setting this limit, too many jobs could run at once. After applying it, at most 10 process_data jobs are allowed to run simultaneously.
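The arithmetic behind the limit can be written down in a few lines. This is a minimal sketch with a hypothetical helper name (max_parallel_jobs); it only covers this one resource, and the real number of parallel jobs is additionally capped by the available cores and any other resources.

```python
def max_parallel_jobs(global_limit, units_per_job):
    """Upper bound on simultaneous jobs imposed by a single resource."""
    if units_per_job <= 0:
        raise ValueError("units_per_job must be a positive integer")
    # Only whole jobs can run, so round down.
    return global_limit // units_per_job


print(max_parallel_jobs(10, 1))  # 10
print(max_parallel_jobs(10, 2))  # 5
print(max_parallel_jobs(10, 3))  # 3
```

The last call shows that leftover units are simply unused: with 3 units per job and a limit of 10, a fourth job does not fit.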

Conclusion

If you are facing high disk latency, I/O pressure, or excessive job execution in Snakemake, the best way to control it is by:

Using resources to define per-job resource requirements.
Setting a global resource limit (--resources process_data_jobs=10) when executing Snakemake.

This approach ensures your workflow runs efficiently and reliably without overloading your system!

Genomic Prediction for Timothy Grass in Finland

19. March 2025 Daniel

Timothy (Phleum pratense L.) is a key forage grass for Finnish agriculture, and improving its yield, winter hardiness, and digestibility is crucial for sustainable production. Our recent study explored the potential of genomic prediction to accelerate breeding progress by leveraging genotyping-by-sequencing and advanced statistical models.

Key findings:
* Heritability estimates ranged from 0.13 (yield at first cut) to 0.86 (digestibility at second cut).
* Genetic correlations suggest trade-offs between yield and winter survival but positive links between digestibility traits.
* Genomic breeding values were estimated using advanced statistical approaches, including a novel scaling of the genomic relationship matrix.
* Predictive ability reached up to 0.62 for digestibility, and validation confirmed moderate accuracy with little dispersion.

Despite concerns that genotype quality might impact predictions, our results show that genomic prediction remains a powerful tool for Timothy breeding in Finland. This research highlights the potential for data-driven breeding strategies to enhance forage crop resilience and quality.

https://link.springer.com/article/10.1007/s00122-025-04860-9

New R-package started

28. February 2025 Daniel

I just started a new R package called ‘SnakebiteTools’. I would like to use it to collect small helper functions for analysing and monitoring Snakemake runs. The output can later be used for resource optimization, for checking the status of an ongoing Snakemake run (which can get messy for runs with plenty of jobs), etc.

Creating Drop-Down Menus in Excel

27. February 2025 Daniel

I often do not remember how to create simple drop-down menus in Excel, so I decided to write a short note here. This is what I want to have:

In one tab, I want to have a column with the possible values for my drop-down menu, e.g. my project names.
In another tab, I want to have in each cell of a column a drop-down button that allows me to choose from these values.

This is in principle rather easy to achieve:

Step 1: Prepare the List on Another Sheet

Open your Excel file and go to the sheet where you want to store the drop-down values (e.g., Projects).
Enter the list of values in a column (e.g., A1:A10 in Projects).

Step 2: Name the List (Optional but Recommended)

Select the range of values in Projects (e.g., A1:A10 or the whole column).
Click on the Formulas tab → Define Name.
Enter a name (e.g., MyProjects) and click OK.

Step 3: Create the Drop-Down List

Go to the sheet where you want the drop-down (e.g., Tasks).
Select the cell(s) where you want the drop-down.
Click on the Data tab → Data Validation.
In the Allow box, choose List.
In the Source box:
- If you named the range: enter =MyProjects
- If not: enter e.g. =Projects!A1:A10
Click OK.

If the first row contains a header (e.g. the column name) and you applied the validation to the whole column, the header cell gets a drop-down as well. You can remove it from that cell like this:

Method 1: Remove Data Validation from One Cell (Header Only)

Click on the first cell of the column (e.g., A1).
Go to the Data tab → Click Data Validation.
In the pop-up, click Clear All → Click OK.

Finding Files in My Folders

10. February 2025 Daniel

Managing disk space efficiently is essential, especially when working on systems with strict file quotas. Recently, I encountered a situation where I had exceeded my file limit and needed a quick way to determine which folders contained the most files. To analyze my storage usage, I used the following command:

for d in .* *; do [ -d "$d" ] && echo "$d: $(find "$d" -type f | wc -l)"; done | sort -nr -k2

Breaking Down the Command

This one-liner efficiently counts files across all directories in the current location, including hidden ones. Here’s how it works:

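for d in .* * – Loops through all entries in the current directory, including hidden ones. Depending on the shell, .* can also match . and .. (the current and the parent directory), so these two may show up in the output as well.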
[ -d "$d" ] – Ensures that only directories are processed.
find "$d" -type f | wc -l – Counts all files (not directories) inside each folder, including subdirectories.
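sort -nr -k2 – Sorts the results in descending order based on the number of files. This relies on the count being the second whitespace-separated field, so it assumes directory names without spaces.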
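For comparison, the same steps can be sketched in Python. This is not the command from above but an approximate equivalent under a few assumptions: the function names are made up for illustration, symlinks are not counted as files (as with find -type f), and . and .. never appear because os.scandir does not return them.

```python
import os


def count_files_recursive(directory):
    """Count regular files below `directory`, similar to `find DIR -type f | wc -l`."""
    total = 0
    # os.walk does not descend into symlinked directories by default.
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            # Count regular files only; symlinks are skipped.
            if os.path.isfile(path) and not os.path.islink(path):
                total += 1
    return total


def files_per_directory(base="."):
    """Return (name, count) pairs for every subdirectory of `base`, largest first."""
    counts = []
    with os.scandir(base) as entries:
        for entry in entries:
            # Hidden directories are included; plain files in `base` are ignored.
            if entry.is_dir():
                counts.append((entry.name, count_files_recursive(entry.path)))
    return sorted(counts, key=lambda pair: pair[1], reverse=True)
```

Calling files_per_directory(".") returns a list that can be printed line by line, for example with print(f"{name}: {count}") for each pair. Because the counts are kept as numbers, sorting does not depend on directory names being free of spaces.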

Why This is Useful

With this command, I quickly identified the directories consuming the most inodes and was able to take action, such as cleaning up unnecessary files. It’s an efficient method for understanding file distribution and managing storage limits effectively.

Alternative Approaches

If you only want to count files directly inside each folder (without subdirectories), you can modify the command like this:

for d in .* *; do [ -d "$d" ] && echo "$d: $(find "$d" -maxdepth 1 -type f | wc -l)"; done | sort -nr -k2

This variation is useful when you need a more localized view of file distribution.
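A non-recursive count is also easy to sketch in Python. As before, this is an illustrative equivalent rather than the command itself, and the function names are hypothetical; symlinks are not counted, in line with find -type f.

```python
import os


def count_direct_files(directory):
    """Count regular files located directly in `directory` (no subdirectories)."""
    with os.scandir(directory) as entries:
        # follow_symlinks=False makes symlinks not count as files.
        return sum(1 for entry in entries if entry.is_file(follow_symlinks=False))


def direct_files_per_directory(base="."):
    """Return (name, count) pairs for every subdirectory of `base`, largest first."""
    counts = []
    with os.scandir(base) as entries:
        for entry in entries:
            if entry.is_dir():
                counts.append((entry.name, count_direct_files(entry.path)))
    return sorted(counts, key=lambda pair: pair[1], reverse=True)
```

A directory that contains only subfolders gets a count of 0 here, which makes it easy to spot where the files sit directly and where they are nested deeper.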