Test datasets

Occasionally one needs some test data. Here are some small datasets that I use to test my R packages.


Download: test.vcf.gz

From the 1000 Genomes Project I downloaded the low-coverage pilot data for one population (CEU) and subsampled the file so that it still covers the whole genome, but less densely than the original 1000 Genomes data.
The sample file contains 100,000 SNPs and XXX samples.
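
The two numbers above can be checked directly on the downloaded file. The following is only a sketch: the helper names `count_records` and `count_samples` are made up for illustration, and the sample count assumes the usual VCF layout with nine fixed columns (CHROM to FORMAT) before the first sample column.

```shell
# Hypothetical helpers, not part of any package.

# Count data records (all lines that are not header lines) in a gzipped VCF.
count_records() {
    gunzip -c "$1" | grep -vc '^#'
}

# Count samples: columns on the #CHROM line beyond the 9 fixed VCF columns
# (assumes a FORMAT column is present, as in genotype VCFs).
count_samples() {
    gunzip -c "$1" | awk -F'\t' '/^#CHROM/ { print NF - 9; exit }'
}

# Only report if the download is present in the current directory.
if [ -f test.vcf.gz ]; then
    echo "SNPs:    $(count_records test.vcf.gz)"
    echo "Samples: $(count_samples test.vcf.gz)"
fi
```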

The steps I performed to create the file were:

# Download the file
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/paper_data_sets/a_map_of_human_variation/low_coverage/snps/CEU.low_coverage.2010_09.genotypes.vcf.gz

# Unzip the file
gunzip CEU.low_coverage.2010_09.genotypes.vcf.gz

# Remove the header
sed '/^#/d' CEU.low_coverage.2010_09.genotypes.vcf > CEU.low_coverage.2010_09.genotypes_noHeader.vcf

# Extract the header only
sed -n '/^#/p' CEU.low_coverage.2010_09.genotypes.vcf > CEU.low_coverage.2010_09.genotypes_onlyHeader.vcf

# Sample the headerless vcf
shuf -n 100000 CEU.low_coverage.2010_09.genotypes_noHeader.vcf > tmp.vcf

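# Sort the output by chromosome, then numerically by position
# (-k1,1V is a GNU sort option; with only -n -k1,1 the positions would not end up in numeric order)
sort -k1,1V -k2,2n tmp.vcf > tmp.sorted.vcf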

# Merge the header back to the subsampled file
cat CEU.low_coverage.2010_09.genotypes_onlyHeader.vcf tmp.sorted.vcf > test.vcf

# Compress the result to get test.vcf.gz
gzip test.vcf

# Clean up
rm CEU.low_coverage.2010_09.genotypes.vcf
rm CEU.low_coverage.2010_09.genotypes_noHeader.vcf
rm CEU.low_coverage.2010_09.genotypes_onlyHeader.vcf
rm tmp.vcf
rm tmp.sorted.vcf
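
Since the sampled lines are shuffled and sorted again, it can be worth checking that the final file really is ordered by position within each chromosome. Below is a minimal sketch of such a check. The function name `check_vcf_sorted` is hypothetical; it reads a VCF from standard input and only assumes that CHROM and POS are the first two tab-separated columns.

```shell
# Hypothetical helper: succeeds (exit status 0) if, among the data lines,
# every chromosome forms one contiguous block and positions never decrease
# within a block. Reads an uncompressed VCF from standard input.
check_vcf_sorted() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        $1 != chrom {                      # a new chromosome block starts
            if ($1 in seen) {
                print "chromosome " $1 " is not contiguous"
                bad = 1; exit
            }
            seen[$1] = 1; chrom = $1; last = 0
        }
        $2 + 0 < last {                    # position went backwards
            print "position out of order at " $1 ":" $2
            bad = 1; exit
        }
        { last = $2 + 0 }
        END { exit bad + 0 }
    '
}
```

It could be used as `check_vcf_sorted < test.vcf`, or as `gunzip -c test.vcf.gz | check_vcf_sorted` for the compressed download.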