Test datasets
Occasionally one needs some test data. Here are some small datasets that I use to test my R packages.
VCF
Download: test.vcf.gz
From the 1000 Genomes Project I downloaded the pilot data for one population (CEU, low coverage) and subsampled the file so that it still covers the whole genome, but less densely than the original 1000 Genomes data.
The sample file contains 100,000 SNPs and XXX samples.
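The SNP and sample counts can be read from the file itself. Below is a minimal sketch, assuming a tab-delimited VCF with the usual nine fixed columns (CHROM through FORMAT) before the sample columns. The function name `vcf_counts` and the toy file are hypothetical and only serve as illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: read a plain-text VCF from stdin and report
# the number of records and the number of sample columns.
vcf_counts() {
  awk -F'\t' '
    /^#CHROM/ { samples = (NF > 9 ? NF - 9 : 0) }  # columns after FORMAT are samples
    !/^#/     { records++ }                        # every non-header line is one record
    END       { printf "%d records, %d samples\n", records, samples }
  '
}

# Toy VCF (made up for this example): 3 records, 2 samples
toy_dir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.0\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf '1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\n'
  printf '1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0|1\t1|1\n'
  printf '2\t150\t.\tG\tA\t.\tPASS\t.\tGT\t0|0\t0|0\n'
} > "$toy_dir/toy.vcf"

vcf_counts < "$toy_dir/toy.vcf"
```

For the compressed download, the same function could be fed with `gzip -dc test.vcf.gz | vcf_counts`.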
The steps that I performed to create the file were:

# Download the file
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/paper_data_sets/a_map_of_human_variation/low_coverage/snps/CEU.low_coverage.2010_09.genotypes.vcf.gz

# Unzip the file
gunzip CEU.low_coverage.2010_09.genotypes.vcf.gz

# Remove the header
sed '/^#/d' CEU.low_coverage.2010_09.genotypes.vcf > CEU.low_coverage.2010_09.genotypes_noHeader.vcf

# Extract the header only
sed -n '/^#/p' CEU.low_coverage.2010_09.genotypes.vcf > CEU.low_coverage.2010_09.genotypes_onlyHeader.vcf

# Sample the headerless vcf
shuf -n 100000 CEU.low_coverage.2010_09.genotypes_noHeader.vcf > tmp.vcf

# Sort the output
sort -n -k1,1 tmp.vcf > tmp.sorted.vcf

# Merge the header back to the subsampled file
cat CEU.low_coverage.2010_09.genotypes_onlyHeader.vcf tmp.sorted.vcf > test.vcf

# Clean up
rm CEU.low_coverage.2010_09.genotypes.vcf
rm CEU.low_coverage.2010_09.genotypes_noHeader.vcf
rm CEU.low_coverage.2010_09.genotypes_onlyHeader.vcf
rm tmp.vcf
rm tmp.sorted.vcf
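One thing to be aware of in the steps above: `sort -n -k1,1` uses only the chromosome column as its numeric key, so positions within a chromosome may not end up in numeric order. A possible alternative, shown here only as a sketch and not as the procedure used for test.vcf.gz, is to number the records before sampling and restore the original order afterwards. The function name `subsample_vcf` and the toy file are hypothetical; `shuf` is assumed to be available from GNU coreutils, as in the steps above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: keep the header, sample n records at random,
# and write them out in their original file order.
subsample_vcf() {
  local in="$1" n="$2" out="$3"
  # Header lines (starting with '#') go straight to the output
  awk '/^#/' "$in" > "$out"
  # Number the records, sample n of them, sort by the line number
  # to restore the original order, then strip the line number again
  awk '!/^#/' "$in" | cat -n | shuf -n "$n" | sort -k1,1n | cut -f2- >> "$out"
}

# Toy VCF (made up for this example): 2 header lines, 10 records
sub_dir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.0\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  for pos in 100 200 300 400 500 600 700 800 900 1000; do
    printf '1\t%s\t.\tA\tG\t.\tPASS\t.\n' "$pos"
  done
} > "$sub_dir/toy.vcf"

subsample_vcf "$sub_dir/toy.vcf" 4 "$sub_dir/toy.sub.vcf"

# Number of records kept
grep -vc '^#' "$sub_dir/toy.sub.vcf"
```

Because the input file is already ordered, restoring the line order keeps both chromosome and position order without needing a separate sort on those columns.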