6.3.3.11. vcf2bed

The vcf2bed script converts 1-based, closed [start, end] Variant Call Format v4.2 (VCF) to sorted, 0-based, half-open [start-1, start) extended BED data.

Note

Note that this script converts from [start, end] to [start-1, start). Unless the --snvs, --insertions or --deletions options are added, we perform the equivalent of a single-base insertion to make BED output that is guaranteed to work with BEDOPS, regardless of what the actual variant may be, to allow operations to be performed. The converted output contains additional columns which allow reconstruction of the original VCF data and associated variant parameters.

For convenience, we also offer vcf2starch, which performs the extra step of creating a Starch-formatted archive.

6.3.3.11.1. Dependencies

The vcf2bed script requires convert2bed. The vcf2starch script requires starch. Both dependencies are part of a typical BEDOPS installation.

This script is also dependent on input that follows the VCF v4.2 specification.

Tip

Conversion of data which are VCF-like, but which do not follow the specification can cause parsing issues. If you run into problems, please check that your input follows the VCF specification using validation tools, such as those packaged with VCFTools.

6.3.3.11.2. Source

The vcf2bed and vcf2starch conversion scripts are part of the binary and source downloads of BEDOPS. See the Installation documentation for more details.

6.3.3.11.3. Usage

The vcf2bed script parses VCF from standard input and prints sorted BED to standard output. The vcf2starch script uses an extra step to parse VCF to a compressed BEDOPS Starch-formatted archive, which is also directed to standard output.

The header data of a VCF file is usually discarded, unless you add the --keep-header option. In this case, BED elements are created from these data, using the chromosome name _header to denote content. Line numbers are specified in the start and stop coordinates, and unmodified header data are placed in the fourth column (ID field).

Note

By default, multiple BED annotations are printed if there are multiple alternate alleles in a variant call. Use the --do-not-split-alt-alleles option to preserve the alternate allele string and print only one BED element for the variant call.

Tip

By default, all conversion scripts now output sorted BED data ready for use with BEDOPS utilities. If the converted BED output looks truncated or incomplete, and you are converting a VCF file that is larger than the capacity of your /tmp folder, then use the --sort-tmpdir option to specify an alternative directory to store intermediate sort results. Or, if you do not want to sort the converted output, use the --do-not-sort option. Run the script with the --help option for more details.

Tip

If you are sorting data larger than system memory, use the --max-mem option to limit sort memory usage to a reasonable fraction of available memory, e.g., --max-mem 2G or similar. See --help for more details.

6.3.3.11.4. Customized variant handling

By default, the vcf2bed script translates all variants to single-base positions in the resulting BED output. Depending on the category of variant you are interested in, however, you may want more specific categories handled differently.

Based on the VCF v4.2 specification, we also provide three custom options for filtering input for each of the three types of variants listed: --snvs, --insertions and --deletions. In each case, we use the length of the reference and alternate alleles to determine which type of variant is being handled.

In addition, using any of these three custom options automatically results in processing of mixed variant records for a microsatellite, where present. For instance, the following record contains a mixture of a deletion and insertion variant (GTC -> G and GTC -> GTCT, respectively):

#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     12.4.417 microsat1 GTC    G,GTCT  50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

When using --snvs, --insertions or --deletions, this record is split into two distinct BED records and filtered depending on which custom option was chosen. The --insertions option would only export the single-base position of the insertion in this mixed variant, while --deletions would show the deletion.

In this way, you can control what kinds of variants are translated into BED outputs—most importantly, there is also no confusion about what the length of the BED element signifies.

6.3.3.11.5. Example

To demonstrate these scripts, we use a sample VCF input called foo.vcf (see the Downloads section to grab this file).

Note

This data is also publicly available from the Broad Institute.

##fileformat=VCFv4.0
##FILTER=<ID=LowQual,Description="QUAL < 50.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model">
##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]"
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878
chr1    873762  .       T       G       5231.78 PASS    AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473      GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047       A       G       3931.66 PASS    AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD=0.1185      GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431      C       T       71.77   PASS    AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391       T       C       29.84   LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

We can convert VCF to sorted BED data in the following manner:

$ vcf2bed < foo.vcf
chr1    873761  873762  .       5231.78 T       G       PASS    AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473      GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
chr1    877663  877664  rs3828047       3931.66 A       G       PASS    AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD=0.1185      GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899281  899282  rs28548431      71.77   C       T       PASS    AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974164  974165  rs9442391       29.84   T       C       LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

As you see here, the header data of the VCF file is discarded, unless you add the --keep-header option. In this case, BED elements are created from these data, using the chromosome name _header to denote content. Line numbers are specified in the start and stop coordinates, and unmodified header data are placed in the fourth column (ID field).

Here we use --keep-header with our example dataset:

$ vcf2bed --keep-header < foo.vcf
_header 0       1       ##fileformat=VCFv4.0
_header 1       2       ##FILTER=<ID=LowQual,Description="QUAL < 50.0">
_header 2       3       ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
_header 3       4       ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
_header 4       5       ##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
_header 5       6       ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
_header 6       7       ##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
_header 7       8       ##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
_header 8       9       ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
_header 9       10      ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
_header 10      11      ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
_header 11      12      ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
_header 12      13      ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
_header 13      14      ##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
_header 14      15      ##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
_header 15      16      ##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
_header 16      17      ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
_header 17      18      ##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
_header 18      19      ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
_header 19      20      ##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
_header 20      21      ##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model">
_header 21      22      ##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]"
_header 22      23      #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878
chr1    873761  873762  .       5231.78 T       G       PASS    AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473      GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
chr1    877663  877664  rs3828047       3931.66 A       G       PASS    AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD=0.1185      GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899281  899282  rs28548431      71.77   C       T       PASS    AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974164  974165  rs9442391       29.84   T       C       LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

With this option, the vcf2* scripts are completely “non-lossy”. Use of awk or other scripting tools can munge these data back into a VCF-formatted file.

Note

Note the conversion from 1- to 0-based coordinate indexing, in the transition from VCF to BED. While BEDOPS supports 0- and 1-based coordinate indexing, the coordinate change made here is believed to be convenient for most end users.

6.3.3.11.6. Column mapping

In this section, we describe how VCF v4.2 columns are mapped to BED columns. We start with the first five UCSC BED columns as follows:

VCF v4.2 field BED column index BED field
#CHROM 1 chromosome
POS - 1 2 start
POS (*) 3 stop
ID 4 id
QUAL 5 score

The remaining columns are mapped as follows:

VCF v4.2 field BED column index BED field
REF 6  
ALT 7  
FILTER 8  
INFO 9  

If present in the VCF v4.2 input, the following columns are also mapped:

VCF v4.2 field BED column index BED field
FORMAT 10  
Sample ID 1 11  
Sample ID 2 12  
13, 14, etc.  

When using --deletions, the stop value of the BED output is determined by the length difference between ALT and REF alleles. Use of --insertions or --snvs yields a one-base BED element.

If the ALT field contains more than one allele, multiple BED records will be printed. Use the --do-not-split option if you only want one BED record per variant call.

The “meta-information” (starting with ##) and “header” lines (starting with #) are discarded, unless the --keep-headers options is specified.

6.3.3.11.7. Downloads