m6ASNP is an online service that predicts and annotates N6-methyladenosine alterations from genetic variants data such as germline SNPs or cancer somatic mutations. Based on the recently published miCLIP data, we trained two accurate prediction models for human and mouse using Random Forest algorithm. Combining genetic variants in the prediction model, we separately predicted the m6A status in reference sample and mutant sample. For those m6As that occurred in reference sample and lost in mutant sample, we defined them as m6A-loss alterations. Whilst in the opposite case, we defined them as m6A-gain alterations. Taking advantages of other annotation tools, we developed an analysis pipeline to interpret the landscape distribution of predicted m6A alterations at a whole genomic level. Multiple statistical diagrams and genome browser are also embed in the web server to provide visualization for all the analysis results.
1. Firstly, to use m6ASNP you will need to upload a data file or paste a formatted text in VCF format or simple tabular format.
An example for VCF format:
##fileformat=VCFv4.1 ##fileDate=20090805 ##source=myImputationProgramV3.1 ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ:AD 0|0:48:1:51,51:0,1 1|0:48:8:51,51:4,4 1/1:43:5:.,.:1,4 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2 20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP:AD 0/1:35:4:2,1,1 0/2:17:5:1,1,3 1/1:40:3:0,2,1 22 30421786 TR1 A T 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ:AD 0|0:48:1:51,51:1,0 1|0:48:8:51,51:4,4 1/1:43:5:.,.:0,5 17 41244435 VBRCA1 T C 30 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ:AD 0|0:48:3:51,51:3,0 1|0:48:8:51,51:3,5 1/1:43:6:.,.:0,6 22 29446079 . A G 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|0:21:6:23,27 0|0:2:0:18,2 1/1:35:4 22 40814500 TR3 T C 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3 22 40815256 . C T 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 1|0:54:7:56,60 0|1:48:4:51,51 0/0:61:2 12 123466292 . GGAAGAAGAA G,GGAA,GGAAGAA 50 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ:AD 0|1:48:4:51,51:2,2,0,0 1|2:21:6:23,27:0,2,4,0 3|1:48:8:51,51:0,3,0,5
Note:We support the standard variant call format (VCF) version 4.2 specification. A lower VCF version are also well support by m6ASNP. A genotype field is required for prediction and multiple samples can be submitted in a single VCF file.
An example for tabular format:
#Chromosome Position ID Reference Alteration Sample 20 14370 rs6054257 G A NA00003 20 17330 . T A NA00002 20 1110696 rs6040355 A G,T NA00001 20 1110696 rs6040355 A G,T NA00002 20 1110696 rs6040355 A G,T NA00003 20 1230237 . T . NA00001 20 1234567 microsat1 GTC GTCT NA00003 22 30421786 TR1 A T NA00002 17 41244435 VBRCA1 T C NA00003 22 29446079 . A G NA00003 22 40814500 TR3 T C NA00002 22 40815256 . C T NA00001 22 40815256 . C T NA00002 12 123466292 . GGAAGAAGAA G NA00003
Note:The simple tabular format requires six fixed fields at each line. Fields must be tab-separated. Lines prefixed with octothorpe (#) will be regarded as comments and ignored by the prediction tool. Also, all the fields in each feature line must contain a value; 'empty' columns should be denoted with a '.'
Chromosome- name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.
Position– startposition of the genetic variants, with sequence numbering starting at 1.
ID– identifier of the genetic variants, i.e., rsID in dbSNP.
Reference- reference base(s). Each base must be one of A,C,G,T,N. Insertion denote as '-'.
Alteration- alternate base(s). Deletion denote as '-'.
Sample– Sample ID or sample name regarding the genetic variants.
2. Next, select the corresponding species and assembly version. Currently, due to data limitation, only human and mouse are supported. For human genome, two assembly versions, GRCh37 (hg19) and GRCh38 (hg38), are supported. For mouse genome, NCBIM37 (mm9) and GRCm38 (mm10) are available.
3. Finally, select a prediction threshold to start the calculation. The current status can be view in the console panel.
The scores quantitatively represent the degree of m6A alterations between reference and mutant samples. Scores larger than 0 represent m6A-gain alterations, while scores lower than 0 represent m6A-loss alterations. In some m6A-loss mutations, scores are assigned to NA, which mean that the core AC motif is destroy by genetic variants and leading to complete loss of m6A at those sites.
The prediction models for m6As were trained using Random Forest algorithm. After training the prediction models, we carried out 10-fold cross-validation to validate the prediction performance. To balance the prediction accuracy, we selected three thresholds with high, medium and low stringencies based on the evaluation results. The detailed performance under these three thresholds was presented as follow:
Once you have downloaded the java package, you can run it directly from the command line.
java -jar m6ASNP.jar –h
Note:The java runtime environment must be installed and the version should be greater than 7.0.
After a short delay, you will see the following usage instructions:
Usage: java -jar m6ASNP.jar [Flags] [Options]
Flags:
-predict Predict m6A alteration based on VCF file
Options:
Predict m6A-association from vcf or tab file:
-i Input file
-it Input file format, default: vcf (vcf, tab)
-sp The species for prediction, default: Human (Human, Mouse)
-a Known gene annotation file from UCSC
-g The genome sequence file (2bit format)
-t The prediction threshold, default: Medium (High, Medium, Low)
-o The output file
Example:
java -jar m6AVarAnno.jar -predict -i test.vcf -a Knowngene_hg19.txt -g hg19.2bit -o out.txt
-predict
Perform a prediction on the inputted variants. The potential m6A-association variants will be identified by our random forest classifier.
-i
The inputted file of candidate genetic variants.
-it
Specify the format of the input file. Can be either a standard VCF format or a simplified tab-delimited format.
-sp
Specify the species for prediction. We have trained two separate models for both human and mouse.
-a
A knownGene annotation file downloaded from the UCSC Table Browser. The annotation file for human (hg19 and hg38) and mouse (mm9 and mm10) can also be downloaded from our website.
-g
The genomic sequences in 2bit format. The 2bit files of human and mouse can be downloaded from our website or generated by UCSC faToTwoBit program.
-t
The prediction threshold.
-o
Specify the full location of the output file.
If you are having trouble with m6ASNP please contact the two major authors: Dr. Jian Ren and Dr. Zhixiang Zuo. We will try to resolve it.