Small RNA on Pipeline Pilot

This pipeline was designed by Dena Leshkowitz.

Pipeline Execution Instructions

Pipeline Parameters

You can see the description of each parameter in the help tab.

There are several parameters worth mentioning:

Parameter Name - Description

  • Adapter - The adapter sequence of all samples. If some samples use different adapters, you can set them manually through samples.csv, as described below.

  • Do Partial Protocol - If True, it only analyzes each sample. If False, after analyzing each sample it unites the results from all the samples and blasts them.

  • Species - The species of the samples.

  • Top Sequences Size - How many sequences from the united reads (from all samples) will undergo blast.

  • Min Count - Discards from each sample all the sequences with a count lower than this.

  • Over Length - Discards from each sample all the sequences shorter than this.

  • Max Length - Discards from each sample all the sequences longer than this.

  • Blast parameters:

    • Mir Blast Parameters - the blast parameters for the miRNA databases (hairpin and mature).

    • Nt Blast Parameters - the blast parameters for the nt database.

    • DBs Path - if you want to blast against custom databases, list their FASTA files here.

    • DBs Blast Parameters - every parameter here is a list of blast parameters, in the same order as the FASTA files listed in the previous parameter (DBs Path).

Input data 

Samples File

This file is generated automatically from the Samples Directory parameter: for every folder in that directory, a row is added to the samples file with the name of the sample (without the prefix Sample_) and the path of the folder.

If for some reason (for example, multiple adapters) you want to generate this file manually, build a CSV file in the format described below, name it samples.csv and place it under the Output Directory.

The current format is a comma delimited file with the following information per sample:

  • sample - sample name
  • path - path to sample fastq directory
  • adapter (Optional) - the adapter in this sample

Notes:

  1. adapter is an optional column. If you want to specify the adapter used in each sample, add this column. If all the samples have the same adapter, you can set it in the pipeline Adapter parameter instead.
    The adapter column in the samples file has priority over the pipeline parameter.
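
The samples file rules above can be sketched in Python. This is a minimal illustration, not the pipeline's own code: the function names, the header row and the handling of folders without the Sample_ prefix are assumptions.

```python
import csv
from pathlib import Path


def build_sample_rows(samples_dir, prefix="Sample_"):
    """Return one row per sub-folder of samples_dir (hypothetical helper)."""
    rows = []
    for folder in sorted(Path(samples_dir).iterdir()):
        if not folder.is_dir():
            continue  # only sample folders become rows
        name = folder.name
        if name.startswith(prefix):
            name = name[len(prefix):]  # drop the Sample_ prefix
        rows.append({"sample": name, "path": str(folder)})
    return rows


def write_samples_csv(rows, output_dir):
    """Write samples.csv under output_dir; the adapter column is optional."""
    fieldnames = ["sample", "path"]
    if any(row.get("adapter") for row in rows):
        fieldnames.append("adapter")
    target = Path(output_dir) / "samples.csv"
    with open(target, "w", newline="") as handle:
        # Assumption: the file starts with a header row of column names.
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
    return target


def resolve_adapter(row, pipeline_adapter):
    """The adapter column of the samples file wins over the pipeline parameter."""
    return row.get("adapter") or pipeline_adapter
```

A row without an adapter value falls back to the pipeline Adapter parameter, which mirrors the priority rule in the note.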

Logging

For each project there is a logs folder under the Output Directory and for each sample there is a logs folder under the sample folder. 

Each logs folder includes the following:

  1. Logs for each step.
  2. A recovery file which is internally used by the pipeline and not meant to be tampered with.

Pipeline Steps

General

This protocol analyzes every sample separately and then combines all of them for a final report.

Input

  • Samples Directory - directory of the samples folders (only the samples folders).

  • Output Directory - where the result will be.

  • Adapter - the 3’ end adapter.

1 - miRNA QC

This sub-protocol is executed for every sample in the samples folder.

Input

  • Directory containing FASTQ files. Zipped/unzipped. Single read.

  • Adapter - the 3’ end adapter.

  • Output files prefix - that will be the sample name.

  • Output directory - where this sample's results are held.

The various steps which every sample undergoes are described below

Merge FASTQ files

Unzip (if needed) and merge FASTQ files from the given directory to a single file named “<prefix>.merged.fastq”.

Input

  • Zipped FASTQ Input Directory Path - the input directory.

  • Merged FASTQ File - the output file.
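
A minimal Python sketch of this merge step, assuming the inputs are recognised by their file extension and concatenated in file-name order (both are assumptions, as is the function name):

```python
import gzip
import shutil
from pathlib import Path

FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def merge_fastq(input_dir, merged_path):
    """Concatenate every FASTQ file in input_dir into merged_path.

    Gzipped files are decompressed on the fly. Returns the number of
    input files that were merged.
    """
    target = Path(merged_path).resolve()
    inputs = sorted(
        p for p in Path(input_dir).iterdir()
        if p.name.endswith(FASTQ_SUFFIXES) and p.resolve() != target
    )
    with open(target, "wb") as merged:
        for path in inputs:
            opener = gzip.open if path.name.endswith(".gz") else open
            with opener(path, "rb") as source:
                shutil.copyfileobj(source, merged)
    return len(inputs)
```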

Trim by quality

Trims the reads from the input file according to quality, discards reads shorter than <Min Length> and writes the remaining reads to the output file.

Input

  • FASTQ Input File - name of input file

  • FASTQ Output File - name of output file

  • Quality Cutoff - the quality threshold (0 - 93) used when trimming the reads

  • Min Length - after trimming discards reads shorter than <Min Length>
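
The document does not name the trimming algorithm, so the sketch below is only one plausible reading: it assumes Phred+33 quality encoding (which matches the 0 - 93 range) and removes bases from the 3' end while their quality is below the cutoff. All names are hypothetical.

```python
def trim_3prime_by_quality(seq, qual, cutoff, offset=33):
    """Drop trailing bases whose Phred score is below cutoff."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < cutoff:
        end -= 1
    return seq[:end], qual[:end]


def quality_trim(records, cutoff, min_length):
    """Yield trimmed (header, seq, qual) records of at least min_length bases."""
    for header, seq, qual in records:
        seq, qual = trim_3prime_by_quality(seq, qual, cutoff)
        if len(seq) >= min_length:
            yield header, seq, qual
```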

Trim adapter

Trims the adapter from the reads and discards the reads that do not contain this adapter.

Input

  • FASTQ Input File - name of input file

  • FASTQ Output File - name of output file

  • Adapter - the 3’ end adapter to trim
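
As a rough illustration, the step can be sketched with exact string matching. This is an assumption made for brevity: real adapter trimmers usually tolerate mismatches and partial adapters at the read end, and the pipeline may do the same.

```python
def trim_adapter(records, adapter):
    """Cut each read at the first exact adapter match; drop reads without one."""
    for header, seq, qual in records:
        index = seq.find(adapter)
        if index == -1:
            continue  # the adapter was not found, so the read is discarded
        yield header, seq[:index], qual[:index]
```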

Trim by quality (second time)

Trims the reads again after the adapter was removed.

Filter PhiX reads

Discards the PhiX reads from the input file and outputs a filtered file without those reads.

Input

  • FASTQ Input File - name of input file

  • FASTQ Output File - name of output file

Filter reads by length to FINAL FASTQ

Discards the reads that are shorter than Over Length or longer than Max Length base pairs and saves a final FASTQ file.

Input

  • FASTQ Input File - name of input file

  • FASTQ Output File - name of output file

  • Over Length - the length low bound

  • Max Length - the length high bound
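
A small Python sketch of the length filter, assuming four-line FASTQ records and inclusive bounds (a read whose length equals Over Length or Max Length is kept). The helper names are hypothetical.

```python
def read_fastq(handle):
    """Yield (header, seq, qual) from a four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def filter_by_length(records, over_length, max_length):
    """Keep reads whose length lies between over_length and max_length."""
    for header, seq, qual in records:
        if over_length <= len(seq) <= max_length:
            yield header, seq, qual


def write_fastq(records, handle):
    """Write records back in FASTQ format."""
    for header, seq, qual in records:
        handle.write(f"{header}\n{seq}\n+\n{qual}\n")
```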

Filter reads and unique to CSV file

Discards the reads that are shorter than Over Length or longer than Max Length base pairs, keeps only unique sequences (adding a ‘count’ field to each data record), discards the sequences whose count is lower than Min Count and sorts the sequences.

Input

  • FASTQ Input File - name of input file

  • CSV Output File - name of output file

  • Over Length - the length low bound

  • Max Length - the length high bound

  • Min Count - the count lower bound

This step is almost identical to the previous one, the only difference being the additional filter by count.
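
The collapse to unique sequences can be sketched with a counter. The sort order (highest count first) and the CSV column names are assumptions, since the document only says the sequences are sorted.

```python
import csv
from collections import Counter


def unique_sequences(sequences, over_length, max_length, min_count):
    """Collapse reads to (sequence, count) pairs after length and count filters."""
    counts = Counter(
        seq for seq in sequences if over_length <= len(seq) <= max_length
    )
    kept = [(seq, n) for seq, n in counts.items() if n >= min_count]
    # Assumption: sort by descending count, then alphabetically.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


def write_counts_csv(rows, handle):
    """Write the unique sequences with their counts as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["sequence", "count"])
    writer.writerows(rows)
```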

QC

Run FastQC

Runs FastQC on the filtered sequences file.

Input

  • FASTQ Input File - name of input file.

  • FASTQ Output Directory - name of output directory.

miRNA QC part 2

This part outputs a length distribution CSV file along with an image of the length distribution.

It also outputs the over-represented sequences that are over 1% and blasts them.

Input

  • FastQC Data Output Directory - the FastQC directory for the data file

  • BLAST Output Files Prefix - name of the sample

  • Analysis Folder Name - folder for the analysis output
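
The two outputs of this part can be illustrated as follows. The sketch computes them directly from a list of read sequences, which is an assumption: the step itself works from the FastQC output directory, and its exact method is not described here.

```python
from collections import Counter


def length_distribution(sequences):
    """Return sorted (length, number_of_reads) pairs."""
    return sorted(Counter(len(seq) for seq in sequences).items())


def over_represented(sequences, threshold=0.01):
    """Return (sequence, fraction) pairs whose fraction of reads exceeds threshold."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return [
        (seq, n / total)
        for seq, n in counts.most_common()
        if n / total > threshold
    ]
```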

miRNA QC part 3

Runs a Python script that decides which blast match is the best for each over-represented sequence (over 1%) among the blast databases (mat, nt, hair and the custom databases).

The script decides what the best match is according to this algorithm:

  • By default it takes the mature match, unless the hairpin alignment is bigger than the mature alignment by at least 2, in which case it takes the hairpin match.
  • If both mature and hairpin return NO HIT, it looks for a HIT in the custom databases in their order and takes the first database that gets a HIT.
  • When all others fail, it takes the nt match.

Input
  • BLAST Output Directory - the analysis folder of the previous part

  • BLAST Output Files Prefix - name of the sample
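
The selection rules of this step can be sketched as below. This is not the pipeline's script: the Hit type and function name are hypothetical, "alignment" is taken to mean alignment length, and a missing mature hit is treated as length 0, so a lone hairpin hit is accepted.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    database: str          # e.g. "mature", "hairpin", "nt" or a custom name
    subject: str           # id of the matched database entry
    alignment_length: int  # assumed measure of how "big" the alignment is


def best_match(mature, hairpin, custom_hits, nt):
    """Pick the best hit; None stands for NO HIT in that database."""
    if mature is not None or hairpin is not None:
        mature_len = mature.alignment_length if mature else 0
        hairpin_len = hairpin.alignment_length if hairpin else 0
        if hairpin is not None and hairpin_len - mature_len >= 2:
            return hairpin
        return mature if mature is not None else hairpin
    for hit in custom_hits:  # custom databases, in their configured order
        if hit is not None:
            return hit
    return nt  # last resort; may itself be None
```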

2 - Blast merged samples

This sub-protocol collects all the filtered reads that the previous sub-protocol generated per sample and combines them into one file.

Then it converts the combined file to FASTA format, runs blast on all samples with the 3 blast databases (or more if custom databases were added), decides which match is the best and outputs the final file.

Input

  • Output Directory - results directory

  • Input Samples CSV - a CSV file containing the samples name and path (created automatically)

  • Output Files Prefix - ‘ALL’ (combined samples)
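
A possible sketch of the combine-and-convert part, using the Top Sequences Size parameter to limit how many sequences go to blast. The input layout (a count dictionary per sample) and the FASTA header format are assumptions made for the example.

```python
from collections import Counter


def top_sequences_fasta(sample_counts, top_size):
    """Sum counts over all samples and return the top sequences as FASTA text.

    sample_counts maps a sample name to a {sequence: count} dictionary.
    """
    combined = Counter()
    for counts in sample_counts.values():
        combined.update(counts)  # adds the per-sample counts together
    lines = []
    for rank, (seq, n) in enumerate(combined.most_common(top_size), start=1):
        lines.append(f">seq{rank}_count{n}")  # hypothetical header format
        lines.append(seq)
    return "\n".join(lines) + ("\n" if lines else "")
```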

3 - Creating Tags Annotation Report

This sub-protocol collects all the filtered unique sequences and creates a matrix of sequences vs. samples.

The matrix contains how many times each sequence appeared in each sample, the total count of the sequences in all the samples and the blast result of that sequence from the previous sub-protocol.

Input

  • Output Directory - results directory

  • Input Samples CSV - a CSV file containing the samples name and path (created automatically).

  • Output Files Prefix - ‘ALL’ (combined samples)
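
The sequences vs. samples matrix can be illustrated with plain dictionaries. The column names and the "NO HIT" default for sequences without a blast result are assumptions.

```python
def build_tag_matrix(sample_counts, annotations):
    """Build a sequences x samples count matrix with totals and blast results.

    sample_counts maps a sample name to a {sequence: count} dictionary;
    annotations maps a sequence to its blast result.
    """
    samples = sorted(sample_counts)
    sequences = sorted(
        {seq for counts in sample_counts.values() for seq in counts}
    )
    header = ["sequence", *samples, "total", "blast_result"]
    rows = []
    for seq in sequences:
        per_sample = [sample_counts[s].get(seq, 0) for s in samples]
        rows.append(
            [seq, *per_sample, sum(per_sample), annotations.get(seq, "NO HIT")]
        )
    return header, rows
```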

4 - Creating Mir Annotation Report

This sub-protocol takes the matrix from the previous sub-protocol, filters out all the sequences that are not mature miRNA and merges all the sequences that match the same miRNA id. It then generates a new matrix of the unique miRNAs, containing for each one its miRNA database, the number of sequences that mapped to it, its count in each sample, its count in all the samples and its description.

Input

  • Output Directory - results directory
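
The merge by miRNA id can be sketched as a simple aggregation. The row layout (dictionaries with "database", "mirna_id", "description" and one key per sample) is hypothetical and only serves the example.

```python
def build_mir_matrix(tag_rows, samples):
    """Merge mature-miRNA rows that share a miRNA id into one row each."""
    merged = {}
    for row in tag_rows:
        if row["database"] != "mature":
            continue  # only mature miRNA matches are kept
        entry = merged.setdefault(row["mirna_id"], {
            "mirna_id": row["mirna_id"],
            "database": row["database"],
            "description": row["description"],
            "n_sequences": 0,
            "total": 0,
            **{sample: 0 for sample in samples},
        })
        entry["n_sequences"] += 1
        for sample in samples:
            entry[sample] += row[sample]
            entry["total"] += row[sample]
    return list(merged.values())
```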

5 - Build report

This sub-protocol collects all the information from previous steps and builds a report containing details on every sample and all of them combined.

Input

  • Output Directory - results directory

  • Input Samples CSV - a CSV file containing the samples name and path (created automatically)

  • Show Viewer - if True it shows the final report, else it just saves it in the output directory

Utilities

Compress Project

This protocol takes the original project folder and a new folder for the compressed project.

It copies the original project files to the new folder, keeping just the main files and not the whole project.