深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
HOMtnbaS(BGISEQ-500
Small RNA)
2021/4/27
@2021 BGI All Rights Reserved
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
HOMtnbaS(BGISEQ-500
Small RNA)
2021/4/27
@2021 BGI All Rights Reserved
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
4
4
4
4
5
7
9
9
11
12
13
14
15
15
16
18
18
18
19
20
20
20
21
21
21
22
22
23
23
23
24
25
25
26
26
26
27
27
Table of Contents
Results
1 Disclaimer
2 Checklist: Getting Started
3 Data Statistics
4 sRNA Annotation
5 sRNA Prediction
6 sRNA Expression
7 miRNA Target gene Prediction
8 Screening Differentially Expressed piRNA
9 Screening Differentially Expressed siRNA
10 Screening Differentially Expressed miRNA
11 DESs Hierarchical Clustering
12 DESs Target Prediction
13 Gene Ontology Enrichment Analysis Of DESs Target
14 Pathway Enrichment Analysis Of DESs Target
Methods
1 Experiment Pipeline Steps for Small RNA Sequencing
2 Bioinformatics Pipeline
3 Data Filtering
4 Reads Mapping
5 sRNA Classification
6 sRNA Prediction
7 sRNA Expression
8 Target Gene Prediction
9 Screening DESs (Possion Distribution)
10 Hierarchical Clustering Analysis
11 Gene Ontology Enrichment Analysis
12 Pathway Enrichment Analysis
Help
1 Report Introduction
2 FASTQ Format
3 BAM Format
4 File Format Of Genome Statistics
5 File Format Of sRNA Catalog
6 File Format Of Unknown sRNA
7 File Format Of Novel miRNA
8 File Format Of Novel piRNA
9 File Format Of Novel siRNA
2/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
27
27
28
28
29
29
30
31
10 File Format Of sRNA Expression
11 File Format Of Target Prediction
12 DESs screening Format (Possion Distribution)
13 Cluster Format
14 GO Enrichment Analysis Result
15 Pathway Enrichment Analysis Result
FAQs
References
3/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
Results
1 Disclaimer
This is a BGI-tech Standard Small RNA Project report. It may contain privileged and/or
confidential information and is intended solely for the specific customer. If you have
received this report in error, please notify the sender and delete this report immediately.
You are hereby notified that any use, retention, disclosure, dissemination, copying, or
takingany otheractioninrelianceoncontents ofthis reportis prohibited.
2 Checklist: Getting Started
1. When you receive the package and hard drive, please check it is intact and not
damagedintransit.
2. Assuming there isn't an issue with the packaging or hard drive; proceed to check your
data.
3.Ensureyouback upyourdatabeforeprocessingit.
3 Data Statistics
In the project F21FTSCCKF1658_HOMtnbaS, we sequenced animal using BGISEQ500 technology
[1][2].The number of detected small non-coding rna (sncRNA) for each
samplewereshowninTable1.
Table 1 Summary of detected sncrna for each sample (Download)
Sample
name
Known miRNA
count
Novel miRNA
count
Known piRNA
count
Novel piRNA
count
Known siRNA
count
Novel siRNA
count
K10mDPCs 754 60 161 630 0 0
NCmDPCs 871 108 115 509 0 0
Before data analysis can be carried out, low-quality tags need to be removed (for more
information on this see Data Filtering in the Method page.)Table2 summarizes the
sequencing data for each sample and the distribution of base quality on clean tags is
presented inFigure1. Length distribution analysis allows us to see the composition of a
small RNA (sRNA) sample(seeFigure2).In general, the length of small RNA is found to
bebetween18ntand30nt.
Table 2 Summary of sequencing data for each sample (Download)
Sample
name
Sequence
type
Raw tag
count
Low quallity
tag count
Invalid
adapter tag
count
PolyA
tag
count
Short valid
length tag
Clean tag
count
Q20 of
clean tag
(%)
Percentage
of clean
tag(%)
K10mDPCs SE50 36,452,956 245,719 1,690,523 2,358 1,306,395 33,207,961 98.60 91.10
NCmDPCs SE50 37,253,264 240,828 588,178 430 1,323,304 35,100,524 98.80 94.22
Percentage(%)=Clean tag count/Raw tag count
4/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
Figure 1 Distribution of base quality on clean tag.
In Figure 1, the X axis is the base position along a tag. The Y axis is the base quality value. Each dot in
Figure 1 represents the total number of bases with a specific quality value of the corresponding base along
a tag. A darker dot color indicates a higher base number. If the percentage of the bases with low quality (<
20) is very high, then the sequencing quality of this tag is low. If one sample name appears several times,
then that means it consists of more than one sequencing tag.
Figure 2 Length distribution of sRNA.
The X axis shows the length of sRNA and the Y axis shows the percentage of the number of sRNA with a
specific length.
Table 3 K10mDPCs.Length_distribution.txt (Download)
Table 4 NCmDPCs.Length_distribution.txt (Download)
Table 5 Allsamples.Length_distribution.txt (Download)
4 sRNA Annotation
After filtering, clean tags were mapped to sRNA database such as miRBase
[3], Rfam[4],
siRNA,piRNA, snoRNAandetc.Table6lists seperate mapping rate foreach sample and
Figure3 shows the distribution of tags.The proportion of all kinds of sRNA is shown in
Confirm Show All
K10mDPCs
NCmDPCs
Confirm Show All
K10mDPCs
NCmDPCs
5/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
Figure4
The files listed in the tables (see links below) which are suffixed with catalog.xls are
results of sRNA annotation for each sample (see File Format Of sRNA Catalog in help
page).
Table 6 Alignment statistics of tags align to reference genome (Download)
Sample name Total tag Mapped tag Percentage(%)
K10mDPCs 33,207,961 28,929,564 87.12
NCmDPCs 35,100,524 33,134,337 94.40
Figure 3 Genome distribution of tags.
The X axis shows the relative position in the chromosome, and the Y axis shows the number of tags. The
color red indicates tag count, whereas green indicates tag catalogs.
Summarize all alignments and annotation before, the proportion of all kinds of small rna
inFigure4.
Figure 4 Catalog of sRNA.
Figure 4 shows the proportion of different types of sRNA. To make sure each sRNA corresponds to only
Confirm Show All
K10mDPCs
NCmDPCs
Confirm Show All
K10mDPCs
NCmDPCs
6/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
one annotation, we follow this priority rule: miRNA > piRNA > snoRNA > Rfam > other sRNA.
The following listed file suffixed withcatalog.xlsare results of small rna annotation for
eachsample(seeFileFormat Of sRNACataloginhelppage).
Table 7 K10mDPCs.catalog.xls (Download)
Table 8 NCmDPCs.catalog.xls (Download)
Summarize all kinds of konwn or concerved miRNA. The mature sequences and hairpin
sequences of miRNAs of the species listed in file known_miRNA.xls when miRbase
database have known miRNAs of this species, the mature sequences of other species
and predicted hairpin sequences of this species if miRbase database do not have known
miRNAs ofthisespecies.
Table 9 known_miRNA.xls (Download)
Figure 5 The first base distribution. <
X axis is the length of miRNA.
5 sRNA Prediction
After sRNA annotation, those unknown tags (seeFile Format Of Unknown sRNAin help
page) will be used to predict novel sRNA based on their architectural feature, the
corresponding tools seesRNAPredictionin method page. Novel miRNA was predicted,
the result format seeFile Format Of Novel miRNA in help page. Novel piRNA was
predicted, the result format seeFile Format Of Novel piRNAin help page. Novel siRNA
was predicted, the result format seeFile Format Of Novel siRNAin help page. The details
ofpredictedsRNAs as belows:
Table 10 unknown.xls (Download)
Table 11 predicted_miRNA.xls (Download)
Table 12 predicted_piRNA.xls (Download)
Table 13 predicted_siRNA.xls (Download)
An example for the stem loop structure of precursors of predicted miRNA was shown in
Figure6,thestructureofeachcandidatemiRNA was storedinprecursor.tar.gz. 7/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
Figure 6 The stem loop structure of precursors of predicted miRNA. <
Figure 5 shows the secondary structure of candidate miRNA based on maximum free energy (MFE)
structure.
Figure 7 The first base distribution of predicted miRNAs. <
Statistics the first base of all kinds of predicted miRNAs. X axis means the length of miRNA, the numbers in
figure means the kinds number of predicted miRNAs
Figure 8 The first base distribution of predicted piRNAs. <
Statistics the first base of all kinds of predicted piRNAs. X axis means the length of piRNA, the numbers in
figure means the kinds number of predicted piRNAs
8/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
Figure 9 The first base distribution of predicted siRNAs. <
Statistics the first base of all kinds of predicted siRNAs. X axis means the length of siRNA, the numbers in
figure means the kinds number of predicted siRNAs
6 sRNA Expression
BaseontheUMI species numbers of sequencingreads, thesmallrnaexpressionlevel is
calculated by using UMI species numbers (seesRNA Expressionin method page), and
the result format seeFile Format Of sRNA Expressionin help page. The expression of
eachsamplewerelistedas below.
Table 14 Allsamples.exp.xls (Download)
Table 15 K10mDPCs.expr.xls (Download)
Table 16 NCmDPCs.expr.xls (Download)
Figure 10 Correlation analysis of all samples. <
Both X and Y axis represent each sample. Coloring indicate Pearson correlation (high: blue; low: white).
7 miRNA Target gene Prediction
9/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
BGI uses multiple types of software to find the target gene of miRNA. You can find out
moreinformationonTarget GenePredictionintheMethodpage.
BGI chooses the intersection or union of the target gene as the final result. The results
areshownbelow inTable17and Figure11.
To see the full results, downloadTable18from the web link below.For more information
on the results format please seeFile Format Of Target Prediction in the Additional
Informationpage.
Table 17 Target statistics (Download)
Type miRNA count Target count RNAhybrid target TargetScan target miRanda
Filter 2,039 107,049 107,049 14,002 107,049
Union 2,040 107,186 107,182 14,086 107,056
Figure 11 Venn statistics of different target gene prediction software. <
Venn statistics of different target gene prediction software.
Table 18 Target union result (Download)
The next step involves setting up appropriate filter conditions such as MFE and then
furtheranalyzingtheintersectiontargets.Thefilteredresult canbeseeninFigure12.
To see the full results, downloadTable19from the web link below. For more information
on the results format please seeFile Format Of Target Prediction on the Additional
Informationpage.
10/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
Figure 12 Venn statistics of different target gene prediction software based on appropriate filter
conditions. <
Venn statistics of different target gene prediction software.
Table 19 Target filter result (Download)
8 Screening Differentially Expressed piRNA
Differentially Expressed SRNAs(DESs) screening is aimed to find differentially
expressed small rna between samples and do the further analysis. We use ExpDiff (see
Screening DESs \\(Possion Distribution\\)inmethodpage)methodtodothis analysis.The
DESs ineachpairwiseas showninFigure13.
Figure 13 Differentially expressed piRNA. <
The X axis represents the DESs pair, and the Y axis represents the number of screened DESs, with green
indicating down-regulation and red indicating up-regulation. 11/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
All expressed piRNA of each pairwise were stored in*.diffexp.xlslisted as below. The
result format please see ExpDiff (seeDESs screening Format \\(Possion Distribution\\)in
helppage)
Table 20 NCmDPCs-vs-K10mDPCs_ExpDiff.diffexp.xls (Download)
Screened DESs were stored*.diffexpfilter.xlslisted as below. The result format please
seeExpDiff(seeDESs screeningFormat \\(Possion Distribution\\)inhelppage)
Table 21 NCmDPCs-vs-K10mDPCs_ExpDiff.diffexpfilter.xls (Download)
Figure 14 Distribution of differentially expressed piRNA. <
In Figure, the X axis is the log2(Fold change) of DES, the Y axis is the -log10(FDR) of DES, with green
points indicating down-regulation (log2(Fold change)<=-1 and FDR <=0.001) and red points indicating upregulation (log2(Fold change)>=1 and FDR <=0.001).
9 Screening Differentially Expressed siRNA
Differentially Expressed SRNAs(DESs) screening is aimed to find differentially
expressed small rna between samples and do the further analysis. We use ExpDiff (see
Screening DESs \\(Possion Distribution\\)inmethodpage)methodtodothis analysis.The
DESs ineachpairwiseas showninFigure16.
12/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
Figure 16 Differentially expressed siRNA. <
The X axis represents the DESs pair, and the Y axis represents the number of screened DESs, with green
indicating down-regulation and red indicating up-regulation.
All expressed siRNA of each pairwise were stored in*.diffexp.xlslisted as below. The
result format please see ExpDiff (seeDESs screening Format \\(Possion Distribution\\)in
helppage)
Table 22 NCmDPCs-vs-K10mDPCs_ExpDiff.diffexp.xls (Download)
Screened DESs were stored*.diffexpfilter.xlslisted as below. The result format please
seeExpDiff(seeDESs screeningFormat \\(Possion Distribution\\)inhelppage)
Table 23 NCmDPCs-vs-K10mDPCs_ExpDiff.diffexpfilter.xls (Download)
10 Screening Differentially Expressed miRNA
Differentially Expressed SRNAs(DESs) screening is aimed to find differentially
expressed small rna between samples and do the further analysis. We use ExpDiff (see
Screening DESs \\(Possion Distribution\\)inmethodpage)methodtodothis analysis.The
DESs ineachpairwiseas showninFigure19.
Figure 19 Differentially expressed miRNA. < 13/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
The X axis represents the DESs pair, and the Y axis represents the number of screened DESs, with green
indicating down-regulation and red indicating up-regulation.
All expressed miRNA of each pairwise were stored in*.diffexp.xlslisted as below. The
result format please see ExpDiff (seeDESs screening Format \\(Possion Distribution\\)in
helppage)
Table 24 NCmDPCs-vs-K10mDPCs_ExpDiff.diffexp.xls (Download)
Screened DESs were stored*.diffexpfilter.xlslisted as below. The result format please
seeExpDiff(seeDESs screeningFormat \\(Possion Distribution\\)inhelppage)
Table 25 NCmDPCs-vs-K10mDPCs_ExpDiff.diffexpfilter.xls (Download)
Figure 20 Distribution of differentially expressed miRNA. <
In Figure, the X axis is the log2(Fold change) of DES, the Y axis is the -log10(FDR) of DES, with green
points indicating down-regulation (log2(Fold change)<=-1 and FDR <=0.001) and red points indicating upregulation (log2(Fold change)>=1 and FDR <=0.001).
11 DESs Hierarchical Clustering
BGI performs hierarchical clustering for differentially expressed miRNA using function
pheatmap. For more than two groups, hierarchical clustering of the intersection is
performed, followed by union DESs.Figure22 shows the results of DESs Hierarchical
Clustering.
14/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
Figure 22 Hierarchical clustering. <
The X axis represents each pair of differences, and the Y axis represents Differently Expressed Genes
(DEGs). The colors indicate the fold change, with red showing up-regulation, and blue showing downregulation.
To see the full results, download tables from the web link below. For more information on
theresults formatpleaseseeCluster Formatinhelppage.
Table 26 Clustering list of NCmDPCs-vs-K10mDPCs_ExpDiff.union (Download)
12 DESs Target Prediction
BGI uses multiple types of software to find the target gene of DESs.You can find out
moreinformationaboutthis ontheTarget GenePredictioninmethodpage
Toseethefullresults,downloadthetables fromtheweblink below.
Table 27 NCmDPCs-vs-K10mDPCs_ExpDiff.target.xls (Download)
Formoreinformationontheresults format pleaseseeFile Format Of Target Predictionin
thehelppage.
13 Gene Ontology Enrichment Analysis Of DESs Target
GeneOntology (GO) enrichment analysis is performed for screened DESs target
genes (for more information seeGene Ontology Enrichment Analysis on the Method
page).You can see the fullresults by visiting theGO Enrichment Analysis Resultpage in
Help.
We use WEGO software
[5]to perform GO function classification to show the distribution
of gene function (seeFigure23). The GO enrichment analysis result is shown as
Figure24. You can find out more details about GO enrichment analysis by visiting
GOView
15/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
Figure 23 GO functional classification. <
The X axis shows the number of DEGs (their square root value), and the Y axis shows GO terms. All GO
terms are grouped in to three ontologies: blue indicates biological process, brown indicates cellular
components and red indicates molecular function.
Figure 24 GO functional enrichment.
BGI uses the Directed Acyclic Graph (DAG) to visualize GO enrichment results. Each node shows the
name of the GO term and P-value. The darker the red color the lower the P-value, which indicates more
significant enrichment.
14 Pathway Enrichment Analysis Of DESs Target
Apathway enrichment analysis ofDESs target can help us to identify how genes interact
andperformroles inbiological functions.
BGI performs pathway enrichment analysis of DESs target genes using the KEGG
database
[6]. We generate a report for target genes in each pair of DESs. For more
Confirm Show All
NCmDPCs-vsK10mDPCs_ExpDiff.C
NCmDPCs-vsK10mDPCs_ExpDiff.F
NCmDPCs-vsK10mDPCs_ExpDiff.P
16/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
information on this, see the sectionPathway Enrichment Analysisin Method page, and
thesectionPathwayEnrichment Analysis ResultinHelppage.
In addition,BGI generates a scatter plot of theKEGG enrichmentresults (seeFigure25),
andabarplot of theKEGGterms (seeFigure26).You can download the fullresults forthe
pathways usingthis link:
NCmDPCs-vs-K10mDPCs_ExpDiff
Figure 25 Statistics of pathway enrichment in each pair. <
The top 20 enriched pathway terms are displayed as a scatterplot in Figure25. The Rich Factor is the ratio of
DESs target gene numbers annotated in this pathway term to all gene numbers annotated in this pathway
term. The greater the Rich Factor the greater the degree of enrichment. The Q-value is the corrected Pvalue, and ranges from 0~1; the lower the Q-value the greater the level of enrichment.
Figure 26 KEGG classification for each pair. <
The X axis shows the number of DESs target genes, and the Y axis shows the second KEGG pathway
terms. The first pathway terms are indicated using different colors. And second pathway terms are sub- 17/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
groups of the first pathway terms and are grouped together on the X axis on the right side.
Methods
1 Experiment Pipeline Steps for Small RNA Sequencing
1)Small RNAenrichmentandpurification.
2) 3' end adaptor ligation: Ligate the 5-adenylated and 3-blocked adaptor to the 3' end of
thesmall RNAfragment.
3)AddUniquemolecularidentifiers (UMI)labeledPrimer.
4)5'endadaptorligation.
5)First strandsynthesiswithUniquemolecularidentifiers (UMI)labeledPrimer.
6)Secondstrandsynthesis.
7)Fragment selection.
8)Library quantitativeandpoolingcyclization.
9)LibraryQC.
10)SequencingonDNBSEQ.
Theexperimentpipelineis describedinFigure1.
Figure 1 Experimental workflow of standard DNBSEQ UMI small RNA library
2 Bioinformatics Pipeline
18/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
Figure 2 Bioinformatics analysis pipeline for Small-RNA project
BioinformaticsAnalysisPipelineforSmallRNA Sequencing
1.Eliminatethelow-quality reads,adaptors andother contaminants toget cleanreads
2.Summarize the length distribution of the clean tags, common and specific sequences
betweensamples
3.Annotatethecleantags intodifferent categories
4.Predictthenovel miRNA
5.Functionannotationof knownmiRNAs
3 Data Filtering
The impurities of raw data are: 5' primer contaminants, no-insert tags, oversized
insertion tags, low quality tags, poly A tags and small tags, tags without 3' primer.
Generally, an adaptor contaminant is caused by the sample quality, the adaptor or
sample concentration. The higher the adapter proportion is, the more the contaminant
there is. Low quality tags also need to be filtered; low quality tags are those which have
more than four bases and whose quality is less than ten, or those which have more than
six bases andhaveaquality less thanthirteen.19/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
All these contaminant tags are removed and the length distribution of clean tags is then
summarized. The length distribution analysis is helpful to find the compositions of the
sample. Normally, the length distribution of small RNA is between 18nt and 30nt. For
example,miRNAis normally 21ntor22nt, siRNAis 24nt,andpiRNAis 30nt.
Thedataisprocessedusingthefollowingsteps:
1)Removelow quality tags
2)Removetagswith5'primer contaminants
3)Removetagswithout3'primer
4)Removetagswithoutinsertion
5)RemovetagswithpolyA
6)Removetagsshorterthan18nt
7)Summarizethelengthdistributionofthecleantags
After filtering, the remaining tags are called 'clean tags' and stored in FASTQ format
[7]
(more information aboutFASTQ Format can be found on the Additional Information
page).
4 Reads Mapping
In general, the higherthe alignmentratio, the closerthe genetic relationship between the
sample and the reference species. The lower rate may be due to low similarity with the
referencegenomeorduetocontaminants.
Bowtie2
[8] is used to map clean reads to the reference genome and to other sRNA
databases. Please note that for Rfam we use cmsearch
[9].Their default alignment
parameters areas follows:
Bowtie2 : -q -L 16 --phred64 -p 6
cmsearch: --cpu 6 --noali
To learn more about Bowtie2 parameters, please refer to the Options section on
http://computing.bio.cam.ac.uk/local/doc/bowtie2.html#.
5 sRNA Classification
Intheannotationinformationof differentRNAs, somesmall RNAtags may bemappedto
more than one category. To make sure every unique small RNA is mapped to only one
category,wefollow this priority rule:
MiRbase> pirnabank> snoRNA(human/plant)> Rfam> other sRNA
6 sRNA Prediction
We use miRDeep2
[10](for animals) and miRA[11](for plants) to predict novel miRNA by
exploringthecharacteristic hairpinstructureofmiRNAprecursor. 20/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
Piano
[12] is used to predict piRNAs. This algorithm is based on the Support Vector
Machine (SVM) algorithm and transposon interaction information.The SVM classifier
canbeusedinawiderangeof species includinghuman,mouse,rat,fruitfly andinsects.
Small interfering RNA (siRNA[13])is a 22-24nt double-strand RNA, each strand of which
is 2nt longer than the other. Due to this structural feature, we align tags to each other to
find sRNAs meeting these criteria. These resultant tags are most likely to be siRNA
candidates.
7 sRNA Expression
Unique molecular identifier (UMI)
[14], 8-10nt seqences. It was connected to cDNA
molecules to marking each molecule in the original sample at the early stage of library
construction. It was used to reduce the quantitative bias introduced by PCR
amplification, and is conducive to obtaining enough readings for testing. We calculate
thespecies numbers toaccuratequantitativeof sRNAs.
8 Target Gene Prediction
In order to find more accurate targets multiple types of software are used. Generally, we
use RNAhybrid
[15],miRanda
[16] or TargetScan
[17] to predict animal targets, and use
psRobot
[18],TAPIR[20]orTargetFinder
[19]topredict plant targets.Thedefault parameters
areas follows:
miRanda: -en -20 -strict
RNAhybrid: -b 100 -c -f 2,8 -m 100000 -v 3 -u 3 -e -20 -p 1 -s 3utr_human
psRobot: -gl 17 -p 8 -gn 1
TargetFinder: -c 4
TAPIR: --score 5 --mfe_ratio 0.6
9 Screening DESs (Possion Distribution)
Using the findings of the Genome Res paper entitled the significance of digital gene
expression profiles
[21], BGI have developed an algorithm to identify differentially
expressedgenes betweentwosamples.
If x is defined as the number of reads from small RNA A, x yields to the Poisson
distribution:
In orderto calculate the expression level of the genes we use a different equation. In this
equation,thenumberoftotal cleanreads mappedtothereferencegenomeof sample1is
N1
, and the number of total clean reads mapped to the reference genome of sample 2 is
N2
; small RNA A holds x tags in sample 1 and y tags in sample 2. The probability of small
RNAAexpressedequally betweentwosamples canbecalculatedas follows:
21/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
In the equation above, the P-value of the differential gene expression test is corrected
using the Bonferroni method
[22]. DES analysis is then performed on the sample;
however, DES analysis creates a problem as it generates thousands of hypothesis
simultaneously (only if gene x is differentially expressed between the two groups),
correction for false positive (type I errors) and false negative (type II) errors are
performedusingtheFalseDiscoveryRate(FDR )method
[23].
Inthenext step, it is assumedthatwehavepickedoutR differentially expressedgenes in
whichSgenes show differential expression,andtheotherVgenes arefalsepositives.
The error ratio Q = V / R. The customer sets a cutoff value for Q (e.g. BGI sets a default
cutoff of 5%), FDR is preset to a value less than 0.05. To judge the significance of gene
expression difference, FDR ≤ 0.001 and the absolute value of Log2Ratio ≥ 1' are set as
thedefaultthreshold.
More stringent criteria, with smaller FDR and bigger fold-change value, can be used to
identifyDEGs.
Next, we make multiple hypothesis tests for the P-value of the differential gene
expression test
[22] and determine the P-value field by controlling the FDR result.
Conditions aresetinadvancesothatthe FDR result cannotexceed0.05.
At the same time, we calculated the gene expression level (FPKM value) in order to
calculatethegeneexpressiondifferencebetweendifferent samples.
The smallerthe FDR value, the greaterthe difference multiple, which indicate there are
significantdifferences inexpression.
In BGI’s experience, differentially expressed genes were defaulted as genes with FDR
≤ 0.001andmultiples ofmorethan2-fold.
10 Hierarchical Clustering Analysis
BGI performs hierarchical clustering for differentially expressed miRNA using function
pheatmap. For more than two groups, hierarchical clustering of the intersection is
performed,followedby unionDESs.
11 Gene Ontology Enrichment Analysis
GeneOntology (GO) is an international standard gene functional classification
system. It offers a dynamically updated and controlled vocabulary, as well as a defined
concept to comprehensively describe properties of genes and their products. GO has
three ontologies: molecular function, cellular component and biological process. The
22/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
basic unitofGOisGO-term,andeveryGO-terminGObelongs toatypeofontology.
GO enrichment analysis finds all GO-terms that are significantly enriched in a list of
DESs target genes, it also finds the genes that correspond to specific biological
functions.To perform this analysis,BGI first maps all genes toGO-terms in the database
(http://www.geneontology.org/), which calculates the gene numbers for every term. The
hypergeometric test is then used to find significantly enriched GO-terms in the input
gene list. This test is based on 'GO::TermFinder'
(http://www.yeastgenome.org/help/analyze/go-term-finder).
BGI has developed an algorithm to perform this analysis, and the method used is
describedas follows:
Intheequationabove,N is thenumberofall geneswithGOannotation;nis thenumberof
DESs target genes in N; M is the number of all genes that are annotated to a specific GO
term;mis thenumberofDESs targetgenes inM.
The P-value is corrected by using the Bonferroni method
[22], a corrected P-value ≤ 0.05
is taken as a threshold. GO terms fulfilling this condition are defined as significantly
enrichedGOterms.
12 Pathway Enrichment Analysis
Pathway-based analysis helps to further understand DESs target genes biological
functions. KEGG[24](the major public pathway-related database) is used to perform
pathway enrichment analysis. This analysis identifies significantly enriched metabolic
pathways or signal transduction pathways in DESs target genes when compared with
thewholegenomebackground.
The calculating formula is the same as that in GO analysis. Here N is the number of all
genes with KEGG annotation; n is the number of DESs target genes in N; M is the
number of all genes that are annotated to a specific pathway; m is the number of DESs
target genes in M. The P-value is corrected by using the Bonferroni method
[22], a
corrected P-value <= 0.05 is taken as a threshold. KEGG terms fulfilling this condition
aredefinedas significantly enrichedKEGGterms.
Help
1 Report Introduction
This concludingreport contains eight sections :
Results:Includes project results in a variety of display forms including text, tables and
pictures.
Methods:Included the project workflow, experimental processes, bioinformatics
analysis. 23/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
Tables:Includes statistics tables displayingresults.
Figures:Includes statistics figures oftheresults.
Files:This where projectresults are stored in downloadable formats.Please referto`The
Guide`tolearnhow todownloadfiles.
Reference:This section show small RNA project analysis methods and references
articles.
Additional Information:Includes aguidetohelpyouininterprettheresults.
FAQ:This section answers some common questions, such as those relating to the data
interpretationprocess andsampleselectionanddelivery.
Thevideoforresultinterpretionas follows:
report: https://pan.genomics.cn/ucdisk/s/73UZjq
data: https://pan.genomics.cn/ucdisk/s/2QjYBb
2 FASTQ Format
The original image data is transferred into sequence data via basecalling , which is
defined as raw data or raw reads and saved as an FASTQ file. The FASTQ files are
theoriginal data provided by BGI to the users, and include detailed read sequences and
read quality information. In each FASTQ file, each read is described by four lines, listed
as follows:
@A80GVTABXX:4:1:2587:1979#ACAGTGAT/1
NTTTGATATGTGTGAGGACGTCTGCAGCGTCACCTTTATCGGCCATGGT
+
BMMTKZXUUUdddddddddddddddddddddddddddadddddd^WYYU
The first and third lines are sequence names generated by the sequence analyzer; the
second line is the sequence; the fourth line is the sequencingquality value, in
whicheach letter corresponds to the bases in line 2; the base quality is equal to theASCII
value of the characterin line 4 minus 64, e.g. theASCII value of c is 99, therefore its base
quality valueis 35.
Table1 demonstrates the relationship between sequencingerror and
sequencingquality . Specifically, if the sequencingerror rate is denoted as E and
base quality value is denoted as SQ, the relationship is described using the following
formulas:
T a b l e 1 Relationship between sequencing error rate and sequencing quality
24/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
value (Download)
Sequencing Error Rate(%) Sequencing Quality Value Character
1.00 20 T
0.10 30 ^
0.01 40 h
3 BAM Format
Mapping results are stored in a BAM file, which is the binary equivalent of a SAM file.
SAMtools software can perform format conversion between BAM and SAM, and also
support more complex tasks like variant calling and alignment viewing as well as sorting,
indexinganddataextraction.Youcangetmoredetails aboutSAMtools here
http://samtools.sourceforge.net/samtools.shtml#5.
If required by clients, BAM files of the genome mapping result can be provided. BGI also
recommends using IGV tool (Integrative Genomics Viewer) to visualize BAM files. IGV
supports the loading of multiple samples in order to compare in the same scale, and can
view the distribution of reads on the Exon, Intron, UTR, and Intergenic regions;which
makes it very convenientandintuitive.
You can read an introduction to IGV that in the project result, and more detailed
information about IGV is available on this website
http://www.broadinstitute.org/software/igv/UserGuide.Figure2 shows an example using
Windows operatingsystem.
Figure 2 A screenshot of IGV interface. <
This example loads two samples into IGV tool in window operation system.
4 File Format Of Genome Statistics
Thegenomestatistics result of eachsamplearestoredinatab-separatedtext filenamed
`samplename`followedby .genome_stat.xls.
Table2describes thefileformatforgenomestatistics. 25/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
Table 2 Format description of genome statistics (Download)
Field Description
Flag Chomosome name
Count Mapped tags number
Percentage Percentage of mapped tags
Alias Alias for chomosome, which was used in figure (*.genome_stat.png)
5 File Format Of sRNA Catalog
sRNA catalog result of each sample are stored in a tab-separated text file named
`samplename`followedby .catalog.xls.
Table3describes thefileformatfor sRNAcatalog.
Table 3 Format description of sRNA catalog (Download)
Field Description
sRNA id sRNA name
Count Mapped tags number
Type sRNA type
Description Description for sRNA
6 File Format Of Unknown sRNA
UnknownsRNAresultis storedintab-seperatedtextfilenamedunknown.xls.
Table4describes thefileformatforunknownsRNA.
Table 4 Format description of Unknown sRNA (Download)
Field Description
Tag Tag name
Chr Chromosome name
Start Start position
End End position
Strand Strand
* Count for each sample
Sequence Tag sequence
7 File Format Of Novel miRNA
Novel miRNA result of each sample is stored in a tab-separated text file named
predicted_miRNA.xls.
Table5describes thefileformatforNovel miRNA.
Table 5 Format description of Novel miRNA (Download)
Field Description
miRNA id miRNA name
Chromosome Chomosome name
Strand Strand of miRNA
Sequence(mature) Mature sequence of miRNA
Sequence(star) Star sequence of miRNA
26/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
Start Start position
End End position
Sequence(precursor) Precursor sequence of miRNA
Tag info Mapped tag infomation
8 File Format Of Novel piRNA
Novel siRNA result of each sample is stored in a tab-separated text file named
predicted_piRNA.xls.
Table6describes thefileformatforNovel piRNA.
Table 6 Format description of Novel piRNA (Download)
Field Description
piRNA id miRNA name
Chromosome Chomosome name
Repeat family Repeat family
Strand Strand of miRNA
Start Start position
End End position
Sequence piRNA sequence
Tag info Mapped tag infomation
9 File Format Of Novel siRNA
Novel siRNA result of each sample is stored in a tab-separated text file named
predicted_siRNA.xls.
Table7describes thefileformatforNovel siRNA.
Table 7 Format description of Novel siRNA (Download)
Field Description
siRNA id siRNA name
Guide sequence Candidate siRNA sequence
Passenger sequence Passenger sequence
Tag info Mapped tag infomation
10 File Format Of sRNA Expression
sRNA Expression result is stored in a tab-separated text file named`sample name`
followedby .expr.xls.
Table8describes thefileformatfor sRNAexpression.
Table 8 Format description of sRNA expression (Download)
Field Description
siRNA id miRNA name
Count Mapped tag number, the value in parentheses is the total number
Type Small rna type
Description Description for small rna
UMI The real expression
11 File Format Of Target Prediction
27/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
The Target Prediction result for each sample is stored in a tab-separated text file named
target.filter.xls .
Table9describes thefileformatforTargetPrediction.
Table 9 Format description of Target Prediction (Download)
Field Description
miRNA id miRNA name
Target id Target name
TargetScan site type Type of TargetScan
TargetScan NA meaningless
RNAhybrid MFE MFE
RNAhybrid pvalue Predicted p-value
miRanda MFE MFE of miRanda
miRanda score Score value of miRanda
psRobot score Score value of psRobot
psRobot NA meaningless
TargetFinder score Score value of TargetFinder
TargetFinder NA meaningless
12 DESs screening Format (Possion Distribution)
The differentially expressed sRNA result, which was screened using the Poisson
Distribution method in each control-treatment pair, is stored in atab-separated text file
named`pair`followedby _ExpDiff.diffexpfilter.xls.
Table10describes thefileformatforDESs screeningformat(PoissonDistribution).
Table 10 Format description of DESs screening result file (Download)
Field Description
miRNA id miRNA name
Count(A) Tag number of sample A
Count(B) Tag number of sample B
TPM(A) TPM value of sample A
TPM(B) TPM value of sample B
log2 Ratio(B/A) Log2 ratio
Up-Down-Regulation(B/A) Up(down) regulated
P-value P value
FDR FDR
13 Cluster Format
The Cluster result is stored in a tab-separated text file named `pair` followed by
inter/union.xls
Table11describes thefileformatforCluster.
Table 11 Format description of cluster result file (Download)
Field Description
miRNA miRNA name
compare Log2 ratio28/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
14 GO Enrichment Analysis Result
Go Enrichment analysis results are supplied as GOView.html files. You can open these
files usingyourweb-browser.
On yourresults file, the left navigation displays three types of GO terms for each controltreatmentpair(C: cellular component,P:biological process&F:molecularfunction).
OneexampleofthethreeenrichedGOterms results is showninFigure3.
Figure 3 Significantly enriched GO terms. <
Column 1 shows the GO term name, Column 2 the ratio of DESs target genes enriched to this GO term,
Column 3 the ratio of genes enriched to this GO term and Column4 is the Corrected P-value which
indicates the degree of enrichment. The smaller the Corrected P-value, the more significantly enriched the
DESs target gene.
Theresults listhas beensortedbyCorrectedP-value.
Click the term name `intermediate filament` inFigure3to find out more about genes that
enriched to it.More information about intermediate filament is available on
http://amigo.geneontology.org/amigo.
The full list of genes that enriched to this GO term can be seen by clicking `view genes`in
Figure3.Anexampleofwhat youwill seeis showninFigure4.
Figure 4 Gene list related to GO terms. <
As shown in Figure4, 48 genes are enriched to the GO term ` intermediate filament `.
15 Pathway Enrichment Analysis Result
The pathway enrichment result and the enriched KEGG pathways will be listed as
showninFigure5.
Figure 5 Pathway enrichment analysis of DESs target. <
In Figure 5, Column 3 is the ratio of DEGs enriched to the pathway (Column 2), Column 4 is the ratio of
genes enriched to this pathway.P-value and Q-value are both values that indicate the degree of
enrichment; Q-value is created by correcting the P-value by use of the FDR method. The smaller the Pvalue or Q-value, the more significantly DEGs is enriched to this pathway.The results list has been sorted
by Q-value.The last column is the pathway ID which corresponds to the pathway name. 29/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
For example, to view the gene ID list of `Glycosaminoglycan biosynthesis`, click the
pathway name `Glycosaminoglycan biosynthesis` shown inFigure5.This will give you
thegeneIDs thatenrichedtoit(as showninFigure6).
Figure 6 Gene list related to pathway. <
There are 23 DEGs enriched to the pathway ` Glycosaminoglycan biosynthesis `.
Also, the pathway links inFigure6allows you to see detailed pathway information in the
KEGGdatabase.Forexample, clickingthehyperlink on
` Glycosaminoglycan biosynthesis ` inFigure6will show you detailed information on the
KEGGpathway forGlycosaminoglycanbiosynthesis (as showninFigure7).
Figure 7 An example of KEGG pathway of `Glycosaminoglycan biosynthesis`. <
Up-regulated genes are marked with red borders and down-regulated genes with green borders. Nonchange genes are marked with black borders. Related DEGs appear ifyou hover your mouse where the
border with the red or green meet. Clicking on a gene name will redirect you to the KEGGwebsite.
FAQs
Can the research be carried out on species without a reference genome?
It is better to provide the genome reference, or another closely related species genome, or EST sequence. The exon/intron,
repeat information and gene encode sequence are also necessary.
What is the identified method of miRNA?
Any of these methods can be used: PCR-Stem-loop quantitative real-time PCR, Quantitative PCR (qPCR), qRT-PCR, as well
as other methods.
Is it feasible to perform a joint analysis between small RNA analysis and other RNA product analysis?
Yes, not only can it be combined with the analysis of mRNA results, but it also can be combined with circular RNA analysis.
How do I read '.fq' files ?
if you are using Linux or Unix environment, '.fq' files are opened by typing the command 'less'. For Windows environment, you
need to unpack the files first, then use software UltraEdit to view the '.fq' files.
In the Gene Expression Difference Analysis, for example A-vs-B, what is meant by up-regulated and down-regulated?
30/31
深圳华⼤基因股份有限公司 400-706-6615 ©2017BGIAllRightsReserved.
A-vs-B means sample A is the control and sample B is the case. In the corresponding files, A-vs-B.diffexp.xls and A-vsB.diffexpfilter.xls, if a gene is up-regulated it means the expression of this gene is up-regulated in sample B compared to
sample A. Likewise the same applies for down-regulation.
In pathway enrichment analysis (Figure 13), why are the number of genes not equal to the coloured borders?
In Figure 13, each border represents one kind of enzyme, and the function of an enzyme involves the participation of several
genes, so one border may relate to multiple genes.
References
[1] Wang Z., et al. (2009). RNA-Seq: a revolutionary tool for Transcriptomics. Nature Reviews Genetics, 10(1): 57-63.
[2] Mortazavi, A., et al. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5(7):621-8.
[3] Kozomara A, Griffiths-Jones S. miRBase: annotating high confidence microRNAs using deep sequencing data. NAR 2014 42:D68-D73
[4] Rfam 12.0: updates to the RNA families database. Eric P. Nawrocki, Sarah W. Burge, et al. Nucleic Acids Research (2014)
[5] Ye, J., et al. (2006). WEGO: a web tool for plotting GO annotations. Nucleic Acids Res, 34(Web Server issue): W293-7.
[6] Kanehisa, M., et al. (2008). KEGG for linking genomes to life and the environment. Nucleic Acids Res, 36 (Database issue): D480-4.
[7] Cock P., et al.(2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids
Research, 38(6): 1767-1771.
[8] Langmead B, et al. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3): 25-
34.
[9] E. P. Nawrocki and S. R. Eddy, Infernal 1.1: 100-fold faster RNA homology searches , Bioinformatics 29:2933-2935 (2013).
[10] Friedlander, M.R., Chen, W., et al. 'Discovering microRNAs from deep sequencing data using miRDeep', Nature Biotechnology, 26, 407-415
(2008)
[11] Evers M, Huttner M, et al. miRA: adaptable novel miRNA identification in plants using small RNA sequencing data. BMC Bioinformatics. 2015
Nov 5; 16:370
[12] Kai Wang, Chun Liang, et al. Prediction of piRNAs using transposon interaction and a support vector machine. BMC Bioinformatics201415:419
[13] Jagla B, Aulner N, Kelly PD, Song D, Volchuk A, et al. (2005) Sequence characteristics of functional siRNAs. RNA 11: 864-872
[14] Kivioja T, V盲h盲rautio A, Karlsson K, et al. Counting absolute numbers of molecules using unique molecular identifiers[J]. Nature Methods,
2012, 9(1):72-74.
[15] Rehmsmeier, Marc and Krueger, Jan RNAhybrid: microRNA target prediction easy, fast and flexible, Nucleic Acids Res., 2006
[16] John B, Enright AJ, Aravin A, Tuschl T, Sander C, Marks DS., PLoS Biol. 2005 Jul;3(7):e264.
[17] Agarwal V, Bell GW, Nam J, Bartel DP. Predicting effective microRNA target sites in mammalian mRNAs. eLife, 4:e05005, (2015)
[18] Wu HJ, Ma YK, Chen T, Wang M, Wang XJ. (2012) PsRobot: a web-based plant small RNA meta-analysis toolbox. Nucleic Acids Res.
DOI:10.1093/nar/gks554.
[19] Fahlgren N, Carrington JC (2010) miRNA Target Prediction in Plants. Methods in molecular biology (Clifton, NJ) 592.
[20] Bonnet, E., He, Y., Billiau, K., Van de Peer, Y. (2010) TAPIR, a web server for the prediction of plant microRNA targets, including target
mimics. Bioinformatics 26(12):1566-8.
[21] Audic, S. and J. M. Claverie. (1997). The significance of digital gene expression profiles. Genome Res, 10: 986-95.
[22] Abdi, H.The bonferroni and Sidak corrections for multiple comparisons. In N.J. Salkind ( ed.). Encyclopedia of Measurement and Statistics.
[23] Benjamini, Y. and D. Yekutieli. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics,
29: 1165-1188.
[24] Kanehisa, M., et al. (2008). KEGG for linking genomes to life and the environment. Nucleic Acids Res, 36 (Database issue): D480-4.
2021 Copyright BGI All Rights Reserved 粤ICP备 12059600
Technical Support E-mail:info@bgitechsolutions.com
Website: www.bgitechsolutions.com
31/31