m6ASNP – Annotation of genetic variants by m6A function

m6ASNP is an online service that predicts and annotates N6-methyladenosine alterations from genetic variants data such as germline SNPs or cancer somatic mutations. Based on the recently published miCLIP data, we trained two accurate prediction models for human and mouse using Random Forest algorithm. Combining genetic variants in the prediction model, we separately predicted the m6A status in reference sample and mutant sample. For those m6As that occurred in reference sample and lost in mutant sample, we defined them as m6A-loss alterations. Whilst in the opposite case, we defined them as m6A-gain alterations. Taking advantages of other annotation tools, we developed an analysis pipeline to interpret the landscape distribution of predicted m6A alterations at a whole genomic level. Multiple statistical diagrams and genome browser are also embed in the web server to provide visualization for all the analysis results.

How do I use it?

1. Firstly, to use m6ASNP you will need to upload a data file or paste a formatted text in VCF format or simple tabular format.

An example for VCF format:

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
20  14370   rs6054257   G   A   29  PASS    NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ:AD  0|0:48:1:51,51:0,1  1|0:48:8:51,51:4,4  1/1:43:5:.,.:1,4
20  17330   .   T   A   3   q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3
20  1110696 rs6040355   A   G,T 67  PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB   GT:GQ:DP:HQ 1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4
20  1230237 .   T   .   47  PASS    NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60  0|0:48:4:51,51  0/0:61:2
20  1234567 microsat1   GTC G,GTCT  50  PASS    NS=3;DP=9;AA=G  GT:GQ:DP:AD 0/1:35:4:2,1,1  0/2:17:5:1,1,3  1/1:40:3:0,2,1
22  30421786    TR1 A   T   29  PASS    NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ:AD  0|0:48:1:51,51:1,0  1|0:48:8:51,51:4,4  1/1:43:5:.,.:0,5
17  41244435    VBRCA1  T   C   30  PASS    NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ:AD  0|0:48:3:51,51:3,0  1|0:48:8:51,51:3,5  1/1:43:6:.,.:0,6
22  29446079    .   A   G   67  PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB   GT:GQ:DP:HQ 1|0:21:6:23,27  0|0:2:0:18,2    1/1:35:4
22  40814500    TR3 T   C   3   q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3
22  40815256    .   C   T   47  PASS    NS=3;DP=13;AA=T GT:GQ:DP:HQ 1|0:54:7:56,60  0|1:48:4:51,51  0/0:61:2
12  123466292   .   GGAAGAAGAA  G,GGAA,GGAAGAA  50  PASS    NS=3;DP=13;AA=T GT:GQ:DP:HQ:AD  0|1:48:4:51,51:2,2,0,0  1|2:21:6:23,27:0,2,4,0  3|1:48:8:51,51:0,3,0,5

Note:We support the standard variant call format (VCF) version 4.2 specification. A lower VCF version are also well support by m6ASNP. A genotype field is required for prediction and multiple samples can be submitted in a single VCF file.

An example for tabular format:

#Chromosome   Position    ID  Reference   Alteration  Sample
20  14370   rs6054257   G   A   NA00003
20  17330   .   T   A   NA00002
20  1110696 rs6040355   A   G,T NA00001
20  1110696 rs6040355   A   G,T NA00002
20  1110696 rs6040355   A   G,T NA00003
20  1230237 .   T   .   NA00001
20  1234567 microsat1   GTC GTCT    NA00003
22  30421786    TR1 A   T   NA00002
17  41244435    VBRCA1  T   C   NA00003
22  29446079    .   A   G   NA00003
22  40814500    TR3 T   C   NA00002
22  40815256    .   C   T   NA00001
22  40815256    .   C   T   NA00002
12  123466292   .   GGAAGAAGAA  G   NA00003

Note:The simple tabular format requires six fixed fields at each line. Fields must be tab-separated. Lines prefixed with octothorpe (#) will be regarded as comments and ignored by the prediction tool. Also, all the fields in each feature line must contain a value; 'empty' columns should be denoted with a '.'

Chromosome- name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.
Position– startposition of the genetic variants, with sequence numbering starting at 1.
ID– identifier of the genetic variants, i.e., rsID in dbSNP.
Reference- reference base(s). Each base must be one of A,C,G,T,N. Insertion denote as '-'.
Alteration- alternate base(s). Deletion denote as '-'.
Sample– Sample ID or sample name regarding the genetic variants.

2. Next, select the corresponding species and assembly version. Currently, due to data limitation, only human and mouse are supported. For human genome, two assembly versions, GRCh37 (hg19) and GRCh38 (hg38), are supported. For mouse genome, NCBIM37 (mm9) and GRCm38 (mm10) are available.

3. Finally, select a prediction threshold to start the calculation. The current status can be view in the console panel.

What do the score mean?

The scores quantitatively represent the degree of m6A alterations between reference and mutant samples. Scores larger than 0 represent m6A-gain alterations, while scores lower than 0 represent m6A-loss alterations. In some m6A-loss mutations, scores are assigned to NA, which mean that the core AC motif is destroy by genetic variants and leading to complete loss of m6A at those sites.

What is the prediction threshold?

The prediction models for m6As were trained using Random Forest algorithm. After training the prediction models, we carried out 10-fold cross-validation to validate the prediction performance. To balance the prediction accuracy, we selected three thresholds with high, medium and low stringencies based on the evaluation results. The detailed performance under these three thresholds was presented as follow:

Usage of the stand-alone program

Once you have downloaded the java package, you can run it directly from the command line.

java -jar m6ASNP.jar –h

Note:The java runtime environment must be installed and the version should be greater than 7.0.

After a short delay, you will see the following usage instructions:

Usage: java -jar m6ASNP.jar [Flags] [Options]

Flags:
   -predict    Predict m6A alteration based on VCF file

Options:
   Predict m6A-association from vcf or tab file:
     -i        Input file
     -it       Input file format, default: vcf (vcf, tab)
     -sp       The species for prediction, default: Human (Human, Mouse)
     -a        Known gene annotation file from UCSC
     -g        The genome sequence file (2bit format)
     -t        The prediction threshold, default: Medium (High, Medium, Low)
     -o        The output file

Example:
java -jar m6AVarAnno.jar -predict -i test.vcf -a Knowngene_hg19.txt -g hg19.2bit -o out.txt

Command-line options

-predict
Perform a prediction on the inputted variants. The potential m6A-association variants will be identified by our random forest classifier.
-i
The inputted file of candidate genetic variants.
-it
Specify the format of the input file. Can be either a standard VCF format or a simplified tab-delimited format.
-sp
Specify the species for prediction. We have trained two separate models for both human and mouse.
-a
A knownGene annotation file downloaded from the UCSC Table Browser. The annotation file for human (hg19 and hg38) and mouse (mm9 and mm10) can also be downloaded from our website.
-g
The genomic sequences in 2bit format. The 2bit files of human and mouse can be downloaded from our website or generated by UCSC faToTwoBit program.
-t
The prediction threshold.
-o
Specify the full location of the output file.

I'm having trouble with m6ASNP that isn't addressed on this page. What should I do?

If you are having trouble with m6ASNP please contact the two major authors: Dr. Jian Ren and Dr. Zhixiang Zuo. We will try to resolve it.