PausePred

PausePred Help Guide


Introduction

"PausePred" is designed to predict translational halts using ribosome footprint density information obtained from ribosomal profiling data. The high resolution of the ribosomal profiling technique allows the location and strength of translational pauses to be detected with high precision. PausePred is designed to work efficiently on transcriptome as well as genome alignments. A window-based approach is used to detect pauses.


Inputs:



1. Alignment file:

A sorted BAM file can be generated using the SAMtools sort option (Li et al., 2009). For users not familiar with command-line tools, a BAM alignment file can be generated and sorted in RiboGalaxy (http://ribogalaxy.ucc.ie). The PausePred tool accepts single-end read BAM alignment files only. The file name should have a .bam extension and should not contain spaces or special characters like '('. To select the file from your computer, click the 'Choose file' option.
For example:
example1_genome.bam is correct.
example1_genome(1).bam is incorrect.
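
File names can be checked locally before upload. The sketch below assumes that a "safe" name contains only letters, digits, underscores, hyphens and dots; the guide itself only rules out spaces and special characters such as '(', so the exact pattern and the function name is_safe_bam_name are illustrative assumptions, not PausePred's own validation.

```python
import re

# Assumed "safe" pattern: letters, digits, underscore, hyphen, dot, ending in .bam.
# This is an illustration only; PausePred's actual checks are not documented here.
SAFE_BAM_NAME = re.compile(r"[A-Za-z0-9_.-]+\.bam")

def is_safe_bam_name(file_name: str) -> bool:
    """Return True if the name ends in .bam and has no spaces or special characters."""
    return SAFE_BAM_NAME.fullmatch(file_name) is not None

print(is_safe_bam_name("example1_genome.bam"))     # True
print(is_safe_bam_name("example1_genome(1).bam"))  # False
```

The same check applies to the reference FASTA and annotation files if the extension in the pattern is changed.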








2. Reference FASTA file:

The reference sequence (genomic or transcriptomic) in FASTA format which was used to generate the BAM file should be uploaded in the second input tab. Please avoid spaces or special characters like '(' in the file name.
For example:
example_genome.fa is correct.
example_genome(1).fa is incorrect.


3. Annotation file (optional):

Providing an annotation file as input will give codon-level information for each predicted pause, for transcriptome alignments only. This file should be uploaded only if you are using transcriptome alignments for the pause prediction. The coordinate positions in this file should be one-based, i.e. the first position is numbered 1, not 0. Please avoid spaces or special characters like '(' in the file name. The required file format is given in example_annotation.txt and is also shown below:






The annotation file requires the positions of the start and stop codons within each gene/transcript. GFF (General Feature Format) or GTF (Gene Transfer Format) files can be helpful in generating the required annotation file.

4. Fold change for Pause:

The fold change for a pause is also referred to as the pause score (Si). Pauses are predicted using the following procedure.
The number of reads r mapped to each position i within a sliding window of length n (step = n/2) is normalized by the average read density within the window to score the pause. The average of these values across overlapping windows is used as the pause score, Si.
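
The formula itself is not reproduced in this text. One way to write the score that is consistent with the description above is sketched here. The per-window score S_i^{(W)}, the window symbol W, and the set \mathcal{W}_i of overlapping windows in which position i is scored are notation introduced for illustration, not taken from PausePred.

```latex
S_i^{(W)} = \frac{r_i}{\dfrac{1}{n}\sum_{j \in W} r_j},
\qquad
S_i = \frac{1}{|\mathcal{W}_i|} \sum_{W \in \mathcal{W}_i} S_i^{(W)}
```

Here r_i is the number of reads mapped to position i, n is the window length, and the second expression averages the per-window scores over the overlapping windows.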








5. Window Size:

The default size of the window (n) for calculating the background density is 1000 nucleotides (nt). An overlapping window approach is used in the pause prediction such that each window overlaps the previous window by 50% (i.e. the step size is n/2). For pauses that are predicted in both overlapping windows, the pause score and window coverage are averaged for the final output.
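
The overlapping-window procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PausePred implementation: the function name predict_pauses is hypothetical, the fold change is treated as a simple threshold on the score, positions are reported one-based, and a shorter window at the end of the sequence is normalized by its actual length.

```python
from collections import defaultdict

def predict_pauses(counts, fold_change, window=1000):
    """Score positions in 50%-overlapping windows and average scores per position.

    counts      : list of read counts, one per position (index 0 = position 1)
    fold_change : threshold on the pause score (assumed usage of the parameter)
    window      : window length n; the step is n // 2
    """
    step = max(1, window // 2)
    scores = defaultdict(list)  # one-based position -> scores from each window
    for start in range(0, len(counts), step):
        chunk = counts[start:start + window]
        mean_density = sum(chunk) / len(chunk)
        if mean_density == 0:
            continue  # no reads in this window, nothing to score
        for offset, reads in enumerate(chunk):
            score = reads / mean_density
            if score >= fold_change:
                scores[start + offset + 1].append(score)
    # Average the scores of pauses found in more than one window
    return {pos: sum(vals) / len(vals) for pos, vals in sorted(scores.items())}

# Toy profile: 20 positions with a strong peak at position 8
toy_counts = [1] * 20
toy_counts[7] = 50
toy_counts[12] = 3
print(predict_pauses(toy_counts, fold_change=5.0, window=10))
```

In the toy profile, position 8 falls into two overlapping windows with slightly different background densities, so its reported score is the mean of the two per-window scores.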


6. Read length range:

You can specify the footprint read lengths to be considered for the pause prediction(s). The default read length range is 25-35 nt.

7. Coverage (%):

The coverage (%) represents the percentage of positions within a window that have at least one read mapped. This parameter helps to filter out genes/genomic regions that have low read density across the window.
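
As a small worked example of this definition, the sketch below computes the coverage of a single window from its per-position read counts. The function name window_coverage is hypothetical.

```python
def window_coverage(counts):
    """Percentage of positions in the window with at least one mapped read."""
    if not counts:
        return 0.0
    covered = sum(1 for c in counts if c >= 1)
    return 100.0 * covered / len(counts)

# 4 of the 10 positions have at least one read
print(window_coverage([0, 3, 0, 1, 5, 0, 0, 2, 0, 0]))  # 40.0
```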

8. Up and downstream nucleotide sequence:

The flanking nucleotide sequences of a predicted pause location are provided in the output. You can specify the length (nt) of the upstream and downstream sequences to be provided in the output file (the default value is 50 nt).
The sequence information can be used as input to secondary structure prediction tools which may provide insight into the pause occurrence.

9. Offset value:

An offset value can be used to infer the location of the A-site. Use a positive value to add an offset to the 5' read end, whereas a 3' offset should be specified with a negative number. The offset values should be entered as a comma-separated list, and the number of offset values should be equal to the number of read lengths entered. The position of the A-site is inferred using the following calculations:
## for 5' offset
Inferred A-site = Read position + offset value
## for 3' offset
Inferred A-site = (Read position + read length - 1) - offset value
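
The two calculations above can be sketched in Python. The function names parse_offsets and infer_a_site are hypothetical, and the sketch assumes that for a 3' offset (entered as a negative number) the magnitude of the value is subtracted from the 3' end of the read. The read lengths and offsets in the example are arbitrary illustrative values.

```python
def parse_offsets(read_lengths, offsets_text):
    """Pair each read length with its offset from a comma-separated string."""
    offsets = [int(value) for value in offsets_text.split(",")]
    if len(offsets) != len(read_lengths):
        raise ValueError("number of offsets must equal number of read lengths")
    return dict(zip(read_lengths, offsets))

def infer_a_site(read_pos, read_length, offset):
    """Infer the A-site from the 5' read position.

    Positive offset: added to the 5' end of the read.
    Negative offset: its magnitude is subtracted from the 3' end (assumption).
    """
    if offset >= 0:
        return read_pos + offset
    return (read_pos + read_length - 1) - abs(offset)

offsets = parse_offsets([28, 29, 30], "15,16,16")
print(infer_a_site(100, 28, offsets[28]))  # 115, using a 5' offset of 15
print(infer_a_site(100, 28, -12))          # 115, using a 3' offset of 12
```

For a 28 nt read, a 5' offset of 15 and a 3' offset of 12 point to the same nucleotide, since 28 - 1 - 12 = 15.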


10. E-mail address:

The link to the result files will be emailed to you. Please enter a valid email address such as abc@domain.com.


Output:



The PausePred calculations are carried out on our server and, once processed, an email will be sent to you with a hyperlink to the output file. The output is in CSV (comma-separated values) format. The snapshot below shows the output of PausePred for our example dataset with an 8-column output format.





1. Column 1:

The name of the gene or chromosome where a pause is predicted.

2. Column 2:

The particular coordinate position where a pause is detected. In this example the coordinate corresponds to the 5' end of the mapped reads.

3. Column 3:

The total number of reads (5' end) mapped to the predicted pause location.

4. Column 4:

This column provides the pause score for each predicted pause. The pause score is calculated by dividing the number of reads mapped to a pause position by the average number of reads mapped within the corresponding window. As an overlapping window approach is used for the calculations (i.e. each window overlaps the previous window by 50%), if a pause location is predicted in overlapping windows, the average of the pause scores will be provided in the output file.

5. Column 5:

This column contains the window coverage, which represents the percentage of positions within a window that have at least one read mapped. The coverage parameter can be used to filter out regions with low footprint density.

6. Columns 6 and 7:

These are the flanking nucleotide sequences of a predicted pause. The downstream nucleotide sequence includes the nucleotide of the pause location. The sequence information can be useful for the prediction of secondary structures related to the pause.

7. Column 8:

This column contains the z-score for each predicted pause, which can help to determine the statistical significance of the predicted pauses. PausePred carries out a z-score transformation of the pause scores grouped into bins of size 300 based on the window coverage (Andreev et al., 2015).
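
A binned z-score transformation of this kind can be sketched as follows. The details are assumptions made for illustration: pauses are sorted by window coverage, split into consecutive bins of 300 (the last bin may be smaller), and each pause score is standardized with the mean and population standard deviation of its bin. The function name binned_z_scores is hypothetical, and PausePred's exact procedure may differ.

```python
from statistics import mean, pstdev

def binned_z_scores(pauses, bin_size=300):
    """Return z-scores in input order.

    pauses : list of (pause_score, window_coverage) tuples
    """
    # Indices of the pauses, ordered by window coverage
    order = sorted(range(len(pauses)), key=lambda k: pauses[k][1])
    z_scores = [0.0] * len(pauses)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        scores = [pauses[k][0] for k in members]
        mu = mean(scores)
        sigma = pstdev(scores)
        for k in members:
            # A bin with no spread gets z-scores of 0 (assumption)
            z_scores[k] = (pauses[k][0] - mu) / sigma if sigma > 0 else 0.0
    return z_scores

# Toy input with a bin size of 3 to keep the example short
toy = [(10, 50), (20, 55), (30, 60), (100, 90), (200, 95), (300, 99)]
print(binned_z_scores(toy, bin_size=3))
```

Binning by coverage means each pause score is compared only with pauses from windows of similar coverage, rather than with all pauses at once.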

Output when annotation file is supplied:







8. Column 9:

This column provides the codon related to the predicted pause. It is predicted by using the coding frame information provided in the annotation file.


How long will it take to process my job?



The total time taken to process your data depends on several factors, such as:

1. Size of input file:

Large input files will take longer to upload and process.

2. Internet speed on your system:

Your internet connection speed will affect the time taken to upload your files.

3. Number of users running PausePred simultaneously:

The number of people using PausePred simultaneously may affect the processing time.

*We have tested the tool with different file sizes. The approximate time PausePred took to process the data is given below:
~1 GB: 10-15 mins
~2 GB: 30-45 mins
~4 GB: 2-2.5 hours



Important links:



Stand-alone version


The stand-alone version of PausePred can be downloaded from the GitHub link provided on the PausePred website (highlighted point 11 in the PausePred screenshot image).

Pause visualization with Rfeet


The pauses can be visually explored using the Rfeet tool that we have developed to visualise Ribo-seq and RNA-seq read density profiles. Please click the link provided on the PausePred webpage to go to Rfeet (highlighted point 11 in the PausePred screenshot).





Rfeet ribosome density profile of the E. coli gene araC showing a ribosomal stall detected by PausePred at position 214 with a pause score of 96.16, using Ribo-seq data generated in E. coli mutants lacking elongation factor EFP (red profile). The background grey coverage plot represents Ribo-seq data generated for the wild-type strain. Both Ribo-seq datasets are from Woolstenhulme et al. (2015).



Contact email address:

pausepred@gmail.com