This pipeline was designed by Dena Leshkowitz.

Pipeline Execution Instructions

Pipeline Parameters

You can see the description of each parameter in the help tab.

There are several parameters worth mentioning:

Parameter Name	Description
Adapter	The adapter sequence used by all samples. If samples use different adapters, you can set them manually through samples.csv as described below.
Do Partial Protocol	If True, the pipeline only analyzes each sample. If False, after analyzing each sample it unites the results from all the samples and blasts them.
Species	The species of the samples.
Top Sequences Size	How many sequences from the united reads (from all samples) undergo blast.
Min Count	Filters out of each sample all sequences whose count is lower than this value.
Over Length	Filters out of each sample all sequences shorter than this value.
Max Length	Filters out of each sample all sequences longer than this value.
Blast parameters	Mir Blast Parameters: the blast parameters for the miRNA databases (hairpin and mature). Nt Blast Parameters: the blast parameters for the nt database. DBs Path: the FASTA files of any custom databases you want to blast against. DBs Blast Parameters: lists of blast parameters, given in the same order as the FASTA files listed in DBs Path.

Input data

Samples File

This file is generated automatically from the Samples Directory parameter. For every folder in that directory, a row is added to the samples file with the name of the sample (without the prefix Sample_) and the path of the folder.

If for some reason (for example, multiple adapters) you want to generate this file manually, build a CSV file in the format described below, name it samples.csv and place it under the Output Directory.

The current format is a comma delimited file with the following information per sample:

sample	path	adapter (Optional)
sample name	path to sample fastq directory	the adapter in this sample

Notes:

adapter is an optional column. Add it if you want to specify the adapter used in each sample. If all the samples have the same adapter, you can set it in the pipeline Adapter parameter instead.
The adapter column in the samples file takes priority over the pipeline parameter.
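
The samples file logic described above can be sketched in Python. This is a minimal illustration, not the pipeline's own code: the function names are hypothetical, and writing a header row with the column names sample, path and adapter is an assumption based on the format table.

```python
import csv
from pathlib import Path


def build_samples_rows(samples_dir):
    """Return one row per sub-folder of samples_dir (hypothetical helper)."""
    rows = []
    for folder in sorted(Path(samples_dir).iterdir()):
        if not folder.is_dir():
            continue
        name = folder.name
        # The sample name is the folder name without the "Sample_" prefix.
        if name.startswith("Sample_"):
            name = name[len("Sample_"):]
        rows.append({"sample": name, "path": str(folder)})
    return rows


def write_samples_csv(rows, output_dir):
    """Write samples.csv under output_dir; a header row is an assumption."""
    fieldnames = ["sample", "path"]
    if any("adapter" in row for row in rows):
        fieldnames.append("adapter")  # optional column
    out_path = Path(output_dir) / "samples.csv"
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)  # missing adapter values are written as ""
    return out_path


def effective_adapter(row, pipeline_adapter):
    """The adapter column wins over the pipeline Adapter parameter."""
    return row.get("adapter") or pipeline_adapter
```

For a row without an adapter value, effective_adapter falls back to the pipeline-level Adapter parameter.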

Logging

For each project there is a logs folder under the Output Directory and for each sample there is a logs folder under the sample folder.

Each logs folder includes the following:

Logs for each step.
A recovery file which is internally used by the pipeline and not meant to be tampered with.

Pipeline Steps

General

This protocol analyzes every sample separately and then combines all of them for a final report.

Input

Samples Directory - the directory containing the sample folders (and nothing else).
Output Directory - where the results are written.
Adapter - the 3’ end adapter.

1 - miRNA QC

This sub-protocol is executed for every sample in the samples folder.

Input

Directory containing FASTQ files (zipped or unzipped, single read).
Adapter - the 3’ end adapter.
Output files prefix - the sample name.
Output directory - where this sample's results are held.

The steps that every sample undergoes are described below.

Merge FASTQ files

Unzip (if needed) and merge FASTQ files from the given directory to a single file named “<prefix>.merged.fastq”.

Input

Zipped FASTQ Input Directory Path - the input directory.
Merged FASTQ File - the output file.
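
A minimal Python sketch of this merge step follows. It is an illustration under stated assumptions rather than the pipeline's implementation: the function name, the recognized file suffixes and the alphabetical merge order are all hypothetical.

```python
import gzip
import shutil
from pathlib import Path

# Assumed file suffixes; the pipeline's own naming rules are not documented here.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def merge_fastq(input_dir, merged_path):
    """Concatenate all (possibly gzipped) FASTQ files of input_dir into merged_path."""
    merged_path = Path(merged_path)
    sources = sorted(
        p for p in Path(input_dir).iterdir()
        if p.name.endswith(FASTQ_SUFFIXES) and p.resolve() != merged_path.resolve()
    )
    with open(merged_path, "wb") as out:
        for path in sources:
            # gzip.open decompresses on the fly; plain files are copied as-is.
            opener = gzip.open if path.name.endswith(".gz") else open
            with opener(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return sources
```

Calling merge_fastq("Sample_A", "A.merged.fastq") would produce a single uncompressed file from every FASTQ file in the folder.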

Trim by quality

Trims the reads from the input file according to quality, discards reads shorter than <Min Length> and writes the remaining reads to the output file.

Input

FASTQ Input File - name of input file
FASTQ Output File - name of output file
Quality Cutoff - the bound for the quality (0 - 93) of the reads
Min Length - after trimming discards reads shorter than <Min Length>
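
The exact trimming algorithm is not stated here. As a rough sketch only, the following Python function assumes trimming from the 3’ end and Phred+33 quality encoding (which matches the 0 - 93 range); the function name and both assumptions are illustrative.

```python
def trim_by_quality(seq, qual, cutoff, min_length, offset=33):
    """Trim low-quality bases from the 3' end of one read.

    Returns (seq, qual) after trimming, or None if the read became
    shorter than min_length and should be discarded.
    Assumes Phred+33 encoded qualities (scores 0 to 93).
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the cutoff.
    while end > 0 and ord(qual[end - 1]) - offset < cutoff:
        end -= 1
    if end < min_length:
        return None
    return seq[:end], qual[:end]
```

Real trimmers often use a running-sum or sliding-window criterion instead of this simple per-base rule.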

Trim adapter

Trims the adapter from the reads and discards the reads that did not contain this adapter.

Input

FASTQ Input File - name of input file
FASTQ Output File - name of output file
Adapter - the 3’ end adapter to trim
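
The behavior of this step can be illustrated with a simplified Python sketch. It assumes an exact, full-length adapter match, whereas real adapter trimmers usually tolerate mismatches and partial adapters at the read end; the function name is hypothetical.

```python
def trim_adapter(seq, qual, adapter):
    """Remove the 3' adapter and everything after it from one read.

    Returns (seq, qual) without the adapter, or None when the read does
    not contain the adapter and should be discarded.
    Simplification: only exact, full-length adapter matches are found.
    """
    index = seq.find(adapter)
    if index == -1:
        return None
    return seq[:index], qual[:index]
```

For example, a read ACGTTGGAATTC with adapter TGGAATTC is reduced to ACGT, while a read lacking the adapter is dropped.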

Trim by quality (second time)

Trims the reads again after the adapter was removed.

Filter PhiX reads

Discards the PhiX reads from the input file and outputs a filtered file without those reads.

Input

FASTQ Input File - name of input file
FASTQ Output File - name of output file

Filter reads by length to FINAL FASTQ

Discards the reads shorter than Over Length or longer than Max Length (in bases) and saves a final FASTQ file.

Input

FASTQ Input File - name of input file
FASTQ Output File - name of output file
Over Length - the length low bound
Max Length - the length high bound

Filter reads and unique to CSV file

Discards the reads shorter than Over Length or longer than Max Length, keeps only unique sequences (adding a ‘count’ field to each data record), discards sequences whose count is lower than Min Count and sorts the sequences.

Input

FASTQ Input File - name of input file
CSV Output File - name of output file
Over Length - the length low bound
Max Length - the length high bound
Min Count - the count lower bound

This step is almost identical to the previous one. The only difference is the additional filter by count.
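
A minimal Python sketch of the length filter, collapsing to unique sequences and count filter follows. The function name is hypothetical, and the sort order (descending count, then sequence) is an assumption, since the actual order is not stated.

```python
from collections import Counter


def unique_sequences(seqs, over_length, max_length, min_count):
    """Collapse reads to unique sequences with counts.

    Keeps sequences with over_length <= length <= max_length and
    count >= min_count. Sorting by descending count is an assumption.
    """
    counts = Counter(
        s for s in seqs if over_length <= len(s) <= max_length
    )
    kept = [(s, c) for s, c in counts.items() if c >= min_count]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept
```

Each returned pair holds a sequence and its count, which corresponds to the ‘count’ field added to the data record.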

QC

Run FastQC

Runs FastQC on the filtered sequences file.

Input

FASTQ Input File - name of input file.
FASTQ Output Directory - name of output directory.

miRNA QC part 2

This part outputs a length distribution CSV file along with an image of the length distribution.

It also outputs the over-represented sequences that are over 1% and blasts them.

Input

FastQC Data Output Directory - the FastQC directory for the data file
BLAST Output Files Prefix - name of the sample
Analysis Folder Name - folder for the analysis output

miRNA QC part 3

Runs a Python script that chooses the best blast match for each over-represented sequence (over 1%) among the blast databases (mature, hairpin, nt and the custom databases).

The script chooses the best match according to this algorithm:

By default it takes the mature match, unless the hairpin alignment is longer than the mature alignment by at least 2, in which case it takes the hairpin match.
If both mature and hairpin return NO HIT, it looks for a hit in the custom databases in their listed order and takes the first database that returns a hit.
When all of the above fail, it takes the nt match.
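
The decision rules above can be sketched in Python as follows. This is an illustration, not the pipeline's script: the function name and the hit representation (None for NO HIT, otherwise a dict with a hypothetical align_len key) are assumptions, as is taking the hairpin match when only hairpin has a hit.

```python
def best_match(mature, hairpin, custom, nt):
    """Pick the best blast match for one sequence.

    Each hit is None (NO HIT) or a dict with an 'align_len' key.
    custom is a list of (database_name, hit) pairs in DBs Path order.
    Returns (database_name, hit).
    """
    if mature is not None or hairpin is not None:
        # Assumption: with no mature hit at all, the hairpin hit is taken.
        if hairpin is not None and (
            mature is None
            or hairpin["align_len"] - mature["align_len"] >= 2
        ):
            return "hairpin", hairpin
        return "mature", mature
    # Both miRNA databases returned NO HIT: try custom databases in order.
    for name, hit in custom:
        if hit is not None:
            return name, hit
    # Last resort: the nt database.
    if nt is not None:
        return "nt", nt
    return "NO HIT", None
```

With a mature alignment of length 22 and a hairpin alignment of length 23, the mature match is kept; the hairpin match wins only from a difference of 2 upward.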

Input

BLAST Output Directory - the analysis folder of the previous part
BLAST Output Files Prefix - name of the sample

2 - Blast merged samples

This sub-protocol collects all the filtered reads that the previous sub-protocol generated per sample and combines them into one file.

Then it converts the combined file to FASTA format, runs blast on the combined sequences against the 3 blast databases (or more, if custom databases were added), decides which match is the best and outputs the final file.

Input

Output Directory - results directory
Input Samples CSV - a CSV file containing the samples name and path (created automatically)
Output Files Prefix - ‘ALL’ (combined samples)

3 - Creating Tags Annotation Report

This sub-protocol collects all the filtered unique sequences and creates a matrix of sequences vs. samples.

The matrix contains how many times each sequence appeared in each sample, the total count of the sequences in all the samples and the blast result of that sequence from the previous sub-protocol.
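
As a rough illustration of this matrix, the following Python sketch builds one row per sequence from per-sample counts. The function name, the row keys (sequence, total, blast) and the NO HIT default are hypothetical and do not reflect the report's actual column names.

```python
def build_tag_matrix(sample_counts, annotations):
    """Build a sequences-vs-samples matrix as a list of row dicts.

    sample_counts maps sample name -> {sequence: count}.
    annotations maps sequence -> blast result (hypothetical format).
    """
    samples = list(sample_counts)
    sequences = sorted(
        {seq for counts in sample_counts.values() for seq in counts}
    )
    rows = []
    for seq in sequences:
        # A sequence absent from a sample gets a count of 0 there.
        per_sample = {name: sample_counts[name].get(seq, 0) for name in samples}
        rows.append({
            "sequence": seq,
            **per_sample,
            "total": sum(per_sample.values()),
            "blast": annotations.get(seq, "NO HIT"),
        })
    return rows
```

A sequence seen 3 times in one sample and 2 times in another yields a row with both per-sample counts and a total of 5.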

Input

Output Directory - results directory
Input Samples CSV - a CSV file containing the samples name and path (created automatically).
Output Files Prefix - ‘ALL’ (combined samples)

4 - Creating Mir Annotation Report

This sub-protocol takes the matrix from the previous sub-protocol, filters out all the sequences that are not mature miRNA and merges all the sequences that match the same miRNA id. It then generates a new matrix of the unique miRNAs, containing for each one its miRNA database, the number of sequences that mapped to it, its count in each sample, its count in all the samples and its description.
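
The filter-and-merge logic can be sketched in Python as below. All field names (database, mirna_id, description, n_sequences, total) are hypothetical stand-ins for the report's real columns, and the sketch assumes each input row already carries one count per sample.

```python
def build_mir_matrix(tag_rows, samples):
    """Merge tag rows that map to the same mature miRNA id.

    tag_rows: dicts with 'database', 'mirna_id', 'description' and one
    count per sample name (hypothetical field names).
    """
    merged = {}
    for row in tag_rows:
        if row["database"] != "mature":
            continue  # keep only mature miRNA matches
        entry = merged.setdefault(row["mirna_id"], {
            "mirna_id": row["mirna_id"],
            "database": row["database"],
            "n_sequences": 0,
            **{name: 0 for name in samples},
            "total": 0,
            "description": row["description"],
        })
        entry["n_sequences"] += 1
        for name in samples:
            entry[name] += row[name]
            entry["total"] += row[name]
    return list(merged.values())
```

Two sequences mapping to the same miRNA id collapse into one row whose per-sample counts are the sums of the two.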

Input

Output Directory - results directory

5 - Build report

This sub-protocol collects all the information from previous steps and builds a report containing details on every sample and all of them combined.

Input

Output Directory - results directory
Input Samples CSV - a CSV file containing the samples name and path (created automatically)
Show Viewer - if True it shows the final report, else it just saves it in the output directory

Utilities

Compress Project

This protocol takes the original project folder and a new folder for the compressed project.

It copies the original project files to the new folder, keeping only the main files rather than the whole project.

Small RNA on Pipeline Pilot