Methodology & Instructions
Details on how calculations are performed and references to cite.
Overview
This page documents how sncLex computes and aggregates metrics for RNA-derived short products, including definitions, formulas, and references.
Only sncRNA products that map to the positive genomic strand are retained in the database. Any negative-strand calls (identified by the trailing "-" marker in the source tables) are filtered out before loading, so downstream analyses cover positive-strand products only.
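The strand filter described above can be sketched in a few lines of Python. This is an illustration only, not the sncLex loader: the identifier layout (`chr:start-end` followed by a strand character) and the function names are assumptions, and the only detail taken from this page is that negative-strand calls carry a trailing "-" marker.

```python
def is_negative_strand(call_id: str) -> bool:
    """Return True when a call identifier ends with the '-' strand marker."""
    return call_id.rstrip().endswith("-")


def keep_positive_strand(call_ids):
    """Drop every negative-strand call, preserving the input order."""
    return [call_id for call_id in call_ids if not is_negative_strand(call_id)]


# Hypothetical identifiers; the real source tables may use another layout.
calls = ["chr1:1000-1022+", "chr1:2050-2071-", "chr2:300-321+"]
print(keep_positive_strand(calls))
```

Checking only the final character means the hyphen inside the coordinate range is never mistaken for a strand marker.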
Calculation Methods
sncLex calculates each metric directly from aligned read piles around a called product peak. The windows below refer to the configurable small and big windows that flank the peak coordinate.
- Expression: Starts with the read count at the peak itself and then sums all reads found within the ±small_window expression window. Expression is tracked per input file so the totals used in multi-sample analyses remain separable by library.
- Background: Begins with the same peak read count and surveys the broader ±big_window region, skipping the peak coordinate when scanning so it is only counted once. All remaining reads inside this background window are accumulated to provide the total background signal for that candidate.
- Expression fraction: Defined as Expression / Background. It captures how much of the local background support is concentrated in the expression window and is stored alongside the raw counts.
- Prominence (Enrichment): Reported as (Expression / Background) × 100. This percentage is used as an enrichment score, and products falling below the configured threshold are filtered before being written to the database.
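The four metrics above can be illustrated with a minimal Python sketch. It is not the sncLex or missRNA implementation. The read pile is assumed to be a mapping from genomic position to read count, the names `window_sum`, `peak_metrics`, and `min_prominence` are hypothetical, and the sketch assumes the peak is counted exactly once in both windows. Returning 0 when the background is empty is also an assumption.

```python
from typing import Dict, Mapping


def window_sum(pile: Mapping[int, int], peak: int, half_width: int) -> int:
    """Sum read counts in [peak - half_width, peak + half_width].

    The peak count is taken first and the scan skips the peak position,
    so the peak contributes only once.
    """
    total = pile.get(peak, 0)
    for offset in range(1, half_width + 1):
        total += pile.get(peak - offset, 0) + pile.get(peak + offset, 0)
    return total


def peak_metrics(pile: Mapping[int, int], peak: int,
                 small_window: int, big_window: int) -> Dict[str, float]:
    """Compute Expression, Background, Expression fraction, and Prominence."""
    expression = window_sum(pile, peak, small_window)
    background = window_sum(pile, peak, big_window)
    # Assumption: an empty background yields a fraction of 0 instead of an error.
    fraction = expression / background if background else 0.0
    return {
        "expression": expression,
        "background": background,
        "expression_fraction": fraction,
        "prominence": fraction * 100,
    }


# Toy read pile for one input file: position -> read count.
pile = {95: 4, 99: 10, 100: 50, 101: 15, 110: 21}
metrics = peak_metrics(pile, peak=100, small_window=2, big_window=10)
print(metrics)  # expression 75, background 100, fraction 0.75, prominence 75.0

min_prominence = 50.0  # hypothetical threshold value
print(metrics["prominence"] >= min_prominence)  # True: the product is kept
```

Because Expression is tracked per input file, a function like this would be called once per library, with the resulting counts stored separately for each one.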
Data Processing Pipeline
- Download raw fastq.gz files from the ENCODE portal for each experiment.
- Run FastQC to assess overall quality and adapter content before any trimming.
- Trim adapters with CutAdapt, retaining only reads where an adapter was confidently detected.
- Filter trimmed reads by average quality (≥Q20) using fastq-filter.
- Collapse identical sequences with fastx_collapser from the FASTX-Toolkit.
- Align the collapsed reads with STAR against the GRCh38 Homo sapiens reference from GENCODE v47, using the comprehensive gene annotation plus predicted tRNA genes.
- Apply a custom mispriming removal script to samples prepared with Poly-A selection.
- Process all libraries jointly with missRNA 2.0 in multi-sample mode to call products and metrics.
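To make the quality-filtering and collapsing steps concrete, here is a conceptual Python sketch. It does not call or reproduce fastq-filter or fastx_collapser. It assumes Phred+33 quality strings and a plain arithmetic mean of the Phred scores, which may differ from how the real tool averages quality, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def mean_phred(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores in a quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_and_collapse(records: Iterable[Tuple[str, str]],
                        min_mean_quality: float = 20.0) -> List[Tuple[str, int]]:
    """Keep reads at or above the mean-quality cutoff, then count identical sequences.

    Returns (sequence, count) pairs ordered from most to least abundant.
    """
    counts = Counter(
        seq for seq, qual in records
        if qual and mean_phred(qual) >= min_mean_quality
    )
    return counts.most_common()


# Toy (sequence, quality) pairs standing in for trimmed reads.
reads = [
    ("ACGT", "IIII"),  # mean Q40, kept
    ("ACGT", "IIII"),  # duplicate of the first read
    ("TTGA", "####"),  # mean Q2, dropped
    ("GGCC", "5555"),  # mean Q20, kept (cutoff is inclusive)
]
print(filter_and_collapse(reads))  # [('ACGT', 2), ('GGCC', 1)]
```

Collapsing before alignment means each distinct sequence is aligned once while its read count is carried along, which is why the per-sequence counts need to be preserved.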
Publications and References
List key publications that describe the methods and data sources. Include DOIs and links where possible.
- Author et al., Year. Title. Journal. DOI/URL.
- Author et al., Year. Title. Journal. DOI/URL.
How to Cite
Provide a recommended citation for sncLex and any underlying methods paper.
LastName, FirstName; ... sncLex: Atlas of RNA-derived Irregular Short End-products, Year, Version, URL
Versioning
Note the current data/model version and any major changes affecting metrics or interpretation.