In the field of genomics, we often need to find the closest variants (e.g., SNPs, indels) to a set of genomic locations of interest. This task can be accomplished using various bioinformatics tools such as bedtools
. In this blog post, we will walk through a step-by-step guide on how to achieve this.
Prerequisites
Before we start, make sure you have the following files:
- A BED file with your locations of interest. In this example, we’ll use
locations_of_interest.bed
- A VCF file with your variants. In this example, we’ll use
FinalSetVariants_referenceGenome.vcf
Step 1: Sorting the VCF File
The first issue we encountered was that the VCF file was not sorted lexicographically. bedtools
requires the input files to be sorted in this manner. We can sort the VCF file using the following command:
(grep '^#' FinalSetVariants_referenceGenome.vcf; grep -v '^#' FinalSetVariants_referenceGenome.vcf | sort -k1,1 -k2,2n) > sorted_FinalSetVariants_referenceGenome.vcf
This command separates the header lines (those starting with #
) from the data lines, sorts the data lines, and then concatenates the header and sorted data into a new file sorted_FinalSetVariants_referenceGenome.vcf
.
Step 2: Converting VCF to BED and Finding the Closest Variants
The next step is to find the closest variants to our locations of interest. However, by default, bedtools closest
outputs the entire VCF entry, which might be more information than we need. To limit the output, we can convert the VCF file to a BED format on-the-fly and assign an additional feature, the marker name, as chr_bpLocation
(which is the convention we use for naming our markers). We can also add the -d
option to get the distance between the location of interest and the closest variant. Here is the command:
awk 'BEGIN {OFS="\t"} {if (!/^#/) {print $1,$2-1,$2,$4"/"$5,"+",$1"_"$2}}' sorted_FinalSetVariants_referenceGenome.vcf | bedtools closest -a locations_of_interest.bed -b stdin -d
This command uses awk
to read the VCF data, convert it to BED format, and write the result to the standard output. The pipe (|
) then feeds this output directly into bedtools closest
as the -b
file. The keyword stdin
is used to tell bedtools
to read from the standard input.
Conclusion
With these two steps, we can efficiently find the closest variants to a set of genomic locations of interest. This approach is flexible and can be adapted to different datasets and requirements.