Small RNA on Pipeline Pilot
This pipeline was designed by Dena Leshkowitz.
Pipeline Execution Instructions
Pipeline Parameters
You can see the description of each parameter in the help tab.
There are several parameters worth mentioning:
Parameter Name | Description |
---|---|
Adapter | The adapter sequence shared by all samples. If some samples use different adapters, you can set them manually through samples.csv, as described below |
Do Partial Protocol | If True, only the per-sample analysis is run. If False, after analyzing each sample the pipeline unites the results from all the samples and blasts them |
Species | Choose the species of the samples |
Top Sequences Size | How many sequences from the united reads (from all samples) will be blasted |
Min Count | Discards, from each sample, every sequence whose count is lower than this value |
Over Length | Discards, from each sample, every sequence shorter than this value |
Max Length | Discards, from each sample, every sequence longer than this value |
Blast parameters | Mir Blast Parameters: the blast parameters for the miRNA databases (hairpin and mature). Nt Blast Parameters: the blast parameters for the nt database. DBs Path: if you want to blast against custom databases, list their FASTA files here. DBs Blast Parameters: a list of blast parameters, one entry per custom database, in the same order as the FASTA files listed in DBs Path |
Input data
Samples File
This file is generated automatically from the Samples Directory parameter: for every folder in that directory, a row is added to the samples file with the name of the sample (without the prefix Sample_) and the path of the folder.
If for some reason (for example, multiple adapters) you want to generate this file manually, build a CSV file in the format described below, name it samples.csv and place it under the Output Directory.
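The automatic generation described above can be sketched in Python as follows. This is only an illustration and not the Pipeline Pilot component itself: the function name `build_samples_rows` and the row keys are hypothetical, and the sketch assumes that the `Sample_` prefix is simply stripped from the folder name.

```python
import os


def build_samples_rows(samples_dir):
    """Return one {'sample', 'path'} row per sub-folder of samples_dir."""
    rows = []
    for entry in sorted(os.listdir(samples_dir)):
        path = os.path.join(samples_dir, entry)
        if not os.path.isdir(path):
            continue  # only folders are treated as samples
        # Assumption: the sample name is the folder name without "Sample_".
        name = entry[len("Sample_"):] if entry.startswith("Sample_") else entry
        rows.append({"sample": name, "path": path})
    return rows
```

Plain files in the samples directory are skipped here; the pipeline itself expects the directory to contain only sample folders.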
The current format is a comma-delimited file with the following columns per sample:
sample | path | adapter (Optional) |
---|---|---|
sample name | path to sample fastq directory | the adapter in this sample |
Notes:
- adapter is an optional column. Add it if you want to specify the adapter used in each sample. If all the samples have the same adapter, you can set it in the pipeline Adapter parameter instead.
The adapter column in the sample file has priority over the pipeline parameter.
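The priority rule can be illustrated with a short Python sketch. It assumes that samples.csv starts with a header row naming the columns `sample`, `path` and (optionally) `adapter`; whether the real file carries a header is not stated here, and the function name `resolve_adapters` is hypothetical.

```python
import csv
import io


def resolve_adapters(samples_csv_text, pipeline_adapter):
    """Map each sample name to the adapter that should be used for it."""
    reader = csv.DictReader(io.StringIO(samples_csv_text))
    resolved = {}
    for row in reader:
        # A missing column or an empty cell both fall back to the parameter.
        adapter = (row.get("adapter") or "").strip()
        resolved[row["sample"]] = adapter or pipeline_adapter
    return resolved
```

For example, with the placeholder adapters `AAAA` (per sample) and `CCCC` (pipeline parameter), a sample whose adapter cell is empty resolves to `CCCC`, while a sample with `AAAA` in its adapter cell keeps `AAAA`.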
Logging
For each project there is a logs folder under the Output Directory and for each sample there is a logs folder under the sample folder.
Each logs folder includes the following:
- Logs for each step.
- A recovery file which is internally used by the pipeline and not meant to be tampered with.
Pipeline Steps
General
This protocol analyzes every sample separately and then combines all of them for a final report.
Input
Samples Directory - the directory containing the sample folders (and nothing but the sample folders).
Output Directory - where the results are written.
Adapter - the 3’ end adapter.
1 - miRNA QC
This sub-protocol is executed for every sample in the samples folder.
Input
Directory containing FASTQ files (zipped or unzipped, single read).
Adapter - the 3’ end adapter.
Output files prefix - that will be the sample name.
Output directory - where this sample's results are held.
The steps that every sample undergoes are described below.
Merge FASTQ files
Unzip (if needed) and merge FASTQ files from the given directory to a single file named “<prefix>.merged.fastq”.
Input
Zipped FASTQ Input Directory Path - the input directory.
Merged FASTQ File - the output file.
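A minimal Python sketch of this merge step is shown below. It is not the pipeline's own component: the function name `merge_fastq` is hypothetical, and the sketch assumes that zipped files are gzip files ending in `.fastq.gz` and that unzipped files end in `.fastq`.

```python
import gzip
import os
import shutil


def merge_fastq(input_dir, merged_path):
    """Concatenate all FASTQ files in input_dir (gzipped or not) into merged_path."""
    # The file list is built before the output is opened, so a merged file
    # created inside input_dir is not read back into itself on a first run.
    names = sorted(
        n for n in os.listdir(input_dir) if n.endswith((".fastq", ".fastq.gz"))
    )
    with open(merged_path, "wb") as out:
        for name in names:
            opener = gzip.open if name.endswith(".gz") else open
            with opener(os.path.join(input_dir, name), "rb") as src:
                shutil.copyfileobj(src, out)
    return names
```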
Trim by quality
Trims the reads in the input file by quality, discards reads that end up shorter than <Min Length>, and writes the result to the output file.
Input
FASTQ Input File - name of input file
FASTQ Output File - name of output file
Quality Cutoff - the quality threshold (0 - 93) used for trimming the reads
Min Length - reads shorter than <Min Length> after trimming are discarded
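The exact trimming algorithm is not described here, so the following Python sketch only illustrates the idea under explicit assumptions: bases are removed from the 3' end while their quality is below the cutoff, and qualities are Phred+33 encoded (consistent with the 0 - 93 range). The function name `trim_by_quality` is hypothetical.

```python
def trim_by_quality(seq, qual, cutoff, min_length, offset=33):
    """Trim low-quality bases from the 3' end; return None if the read is too short."""
    end = len(seq)
    # Assumption: walk back from the 3' end while the quality is below the cutoff.
    while end > 0 and ord(qual[end - 1]) - offset < cutoff:
        end -= 1
    if end < min_length:
        return None  # read is discarded
    return seq[:end], qual[:end]
```

With a cutoff of 20, the read `ACGTACGT` with qualities `IIIII###` is trimmed to `ACGTA`, because `#` encodes quality 2 and `I` encodes quality 40.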
Trim adapter
Trims the adapter from the reads and discards the reads that did not contain this adapter.
Input
FASTQ Input File - name of input file
FASTQ Output File - name of output file
Adapter - the 3’ end adapter to trim
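As a rough illustration of this step, the Python sketch below removes the adapter and everything after it, and drops reads without the adapter. It assumes exact matching of the full adapter sequence, which is a simplification: the trimming tool used by the pipeline may tolerate mismatches or partial adapters at the read end. The function name `trim_adapter` is hypothetical.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact adapter occurrence; None if no adapter."""
    idx = seq.find(adapter)
    if idx == -1:
        return None  # reads without the adapter are discarded
    return seq[:idx], qual[:idx]
```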
Trim by quality (second time)
Trims the reads again after the adapter was removed.
Filter PhiX reads
Discards the PhiX reads from the input file and writes a filtered file without those reads.
Input
FASTQ Input File - name of input file
FASTQ Output File - name of output file
Filter reads by length to FINAL FASTQ
Discards the reads shorter than Over Length or longer than Max Length (lengths in base pairs) and saves the final FASTQ file.
Input
FASTQ Input File - name of input file
FASTQ Output File - name of output file
Over Length - the lower bound on read length
Max Length - the upper bound on read length
Filter reads and unique to CSV file
Discards the reads shorter than Over Length or longer than Max Length (lengths in base pairs), keeps only unique sequences (adding a ‘count’ field to each data record), discards the sequences whose count is lower than Min Count, and sorts the sequences.
Input
FASTQ Input File - name of input file
CSV Output File - name of output file
Over Length - the lower bound on read length
Max Length - the upper bound on read length
Min Count - the lower bound on sequence count
This step is almost identical to the previous one, the only difference being the additional filter by count.
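The filtering and collapsing described in this step can be sketched in Python as follows. The function name `collapse_reads` is hypothetical, and the sort order (highest count first, ties broken alphabetically) is an assumption, since the order used by the pipeline is not stated.

```python
from collections import Counter


def collapse_reads(seqs, over_length, max_length, min_count):
    """Length-filter reads, collapse them to unique sequences and filter by count."""
    counts = Counter(s for s in seqs if over_length <= len(s) <= max_length)
    kept = [(s, c) for s, c in counts.items() if c >= min_count]
    # Assumption: sort by descending count, then by sequence.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept
```

Each returned pair corresponds to one row of the CSV output: the unique sequence and its count.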
QC
Run FastQC
Runs FastQC on the filtered sequences file.
Input
FASTQ Input File - name of input file.
FASTQ Output Directory - name of output directory.
miRNA QC part 2
This part outputs a length distribution CSV file along with an image of the length distribution.
It also outputs the over-represented sequences that account for more than 1% of the reads and blasts them.
Input
FastQC Data Output Directory - the FastQC directory for the data file
BLAST Output Files Prefix - name of the sample
Analysis Folder Name - folder for the analysis output
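One way to picture the 1% selection is the Python sketch below, which reads the over-represented sequences module from the text of a FastQC `fastqc_data.txt` report. It assumes the tab-separated layout with `>>Overrepresented sequences` and `>>END_MODULE` markers and the columns sequence, count, percentage and possible source; the function name `overrepresented_over` is hypothetical and this is not the pipeline's own code.

```python
def overrepresented_over(fastqc_data_text, min_percent=1.0):
    """Return (sequence, count, percent) for sequences above min_percent."""
    hits = []
    in_module = False
    for line in fastqc_data_text.splitlines():
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
        elif line.startswith(">>END_MODULE"):
            in_module = False
        elif in_module and line and not line.startswith("#"):
            seq, count, percent = line.split("\t")[:3]
            if float(percent) > min_percent:
                hits.append((seq, int(count), float(percent)))
    return hits
```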
miRNA QC part 3
Runs a Python script that decides which blast match is the best one for each over-represented (over 1%) sequence, choosing among the blast databases (mature, hairpin, nt and the custom databases).
The script decides what the best match is according to this algorithm:
- By default it takes the mature match, unless the hairpin alignment is bigger than the mature alignment by at least 2, in which case it takes the hairpin match.
- If both mature and hairpin report NO HIT, it looks for a HIT in the custom databases in their listed order and takes the first database that has a HIT.
- When all the others fail, it takes the nt match.
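The decision rules above can be sketched in Python as follows. This is an interpretation and not the pipeline's script: the function name `best_match` is hypothetical, each hit is represented only by its alignment size (or `None` for NO HIT), and the handling of a sequence that hits only one of mature or hairpin is an assumption.

```python
def best_match(mature, hairpin, custom, nt):
    """Pick the database whose match is reported for a sequence.

    mature, hairpin, nt: alignment size, or None for NO HIT.
    custom: ordered list of (database_name, alignment size or None).
    """
    if mature is not None and hairpin is not None:
        # Mature wins unless hairpin is bigger by at least 2.
        return "hairpin" if hairpin - mature >= 2 else "mature"
    if mature is not None:
        return "mature"
    if hairpin is not None:
        return "hairpin"  # assumption: a lone hairpin hit is accepted
    for name, size in custom:
        if size is not None:
            return name  # first custom database with a HIT
    return "nt" if nt is not None else None
```

For instance, a mature alignment of 22 and a hairpin alignment of 23 yield `mature`, whereas 20 versus 22 yields `hairpin`.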
Input
BLAST Output Directory - the analysis folder of the previous part
BLAST Output Files Prefix - name of the sample
2 - Blast merged samples
This sub-protocol collects all the filtered reads that the previous sub-protocol generated per sample and combines them into one file.
Then it converts the combined file to FASTA format, blasts it against the 3 blast databases (or more, if custom databases were added), decides which match is the best one for each sequence and outputs the final file.
Input
Output Directory - results directory
Input Samples CSV - a CSV file containing each sample's name and path (created automatically)
Output Files Prefix - ‘ALL’ (combined samples)
3 - Creating Tags Annotation Report
This sub-protocol collects all the filtered unique sequences and creates a matrix of sequences vs. samples.
The matrix contains how many times each sequence appeared in each sample, the total count of each sequence across all the samples, and the blast result of that sequence from the previous sub-protocol.
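The structure of this matrix can be sketched in Python as follows. The function name `build_tag_matrix`, the row keys and the `NO HIT` default are hypothetical; the inputs are assumed to be per-sample dictionaries of sequence counts and a dictionary of blast results per sequence.

```python
def build_tag_matrix(sample_counts, blast_results):
    """Build one row per unique sequence with per-sample counts and a total.

    sample_counts: {sample_name: {sequence: count}}
    blast_results: {sequence: blast annotation}
    """
    samples = sorted(sample_counts)
    sequences = sorted({s for counts in sample_counts.values() for s in counts})
    matrix = []
    for seq in sequences:
        per_sample = [sample_counts[name].get(seq, 0) for name in samples]
        matrix.append({
            "sequence": seq,
            "counts": dict(zip(samples, per_sample)),
            "total": sum(per_sample),
            # Assumption: sequences without a blast result are marked NO HIT.
            "blast": blast_results.get(seq, "NO HIT"),
        })
    return matrix
```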
Input
Output Directory - results directory
Input Samples CSV - a CSV file containing each sample's name and path (created automatically).
Output Files Prefix - ‘ALL’ (combined samples)
4 - Creating Mir Annotation Report
This sub-protocol takes the matrix from the previous sub-protocol, filters out all the sequences that are not mature miRNA, and merges all the sequences that match the same miRNA id. It generates a new matrix with one row per unique miRNA, containing its miRNA database, the number of sequences that mapped to it, its count in each sample, its count in all the samples, and its description.
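The merge by miRNA id can be sketched in Python as follows. The function name `build_mir_report` and the row keys (`db`, `mirna_id`, `description`, `counts`) are hypothetical stand-ins for the columns of the tags matrix, and the database label `mature` is an assumption about how mature miRNA hits are marked.

```python
def build_mir_report(tag_rows):
    """Merge tag rows that hit the same mature miRNA into one report row."""
    report = {}
    for row in tag_rows:
        if row["db"] != "mature":
            continue  # only mature miRNA hits are kept
        entry = report.setdefault(row["mirna_id"], {
            "db": row["db"],
            "description": row["description"],
            "n_sequences": 0,
            "counts": {},
            "total": 0,
        })
        entry["n_sequences"] += 1
        for sample, count in row["counts"].items():
            entry["counts"][sample] = entry["counts"].get(sample, 0) + count
            entry["total"] += count
    return report
```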
Input
Output Directory - results directory
5 - Build report
This sub-protocol collects all the information from previous steps and builds a report containing details on every sample and all of them combined.
Input
Output Directory - results directory
Input Samples CSV - a CSV file containing each sample's name and path (created automatically)
Show Viewer - if True it shows the final report, else it just saves it in the output directory
Utilities
Compress Project
This protocol takes the original project folder and a new folder for the compressed project.
It copies the original project files to the new folder, keeping just the main files and not the whole project.