Methodology & Instructions

Details on how calculations are performed and references to cite.


Overview

This page documents how sncLex computes and aggregates metrics for RNA-derived short products, including definitions, formulas, and references. It also summarizes the data sources, the processing pipeline, and the quality control steps.

Only sncRNA products that map to the positive genomic strand are retained in the database. Negative-strand calls (identified by the trailing - marker in the source tables) are removed before loading, so downstream analyses cover positive-strand products only.
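The strand filter can be pictured with a minimal Python sketch. It assumes each source-table row carries an identifier whose last character is the strand marker; the field name product_id, the identifier layout, and the function name are hypothetical and not taken from sncLex.

```python
def keep_positive_strand(rows, strand_field="product_id"):
    """Drop rows whose identifier ends with the '-' strand marker.

    `strand_field` is a hypothetical column name used for illustration.
    """
    return [
        row for row in rows
        if not row[strand_field].rstrip().endswith("-")
    ]


# Toy rows; the identifier format is an assumption.
rows = [
    {"product_id": "chr1:1000+", "reads": 42},
    {"product_id": "chr1:2000-", "reads": 17},
    {"product_id": "chr2:500+", "reads": 8},
]

positive_only = keep_positive_strand(rows)
```

In this toy input, only the two rows ending in + remain in positive_only.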

Calculation Methods

sncLex calculates each metric directly from the aligned read pileup around a called product peak. The windows below are the configurable small_window and big_window, each extending on both sides of the peak coordinate.

  • Expression: Starts with the read count at the peak itself and then sums all reads found within the ±small_window expression window. Expression is tracked per input file so the totals used in multi-sample analyses remain separable by library.
  • Background: Begins with the same peak read count and surveys the broader ±big_window region, skipping the peak coordinate when scanning so it is only counted once. All remaining reads inside this background window are accumulated to provide the total background signal for that candidate.
  • Expression fraction: Defined as Expression / Background. It captures how much of the local background support is concentrated in the expression window and is stored alongside the raw counts.
  • Prominence (Enrichment): Reported as (Expression / Background) × 100, i.e. the expression fraction expressed as a percentage. It is used as an enrichment score, and products falling below the configured threshold are removed before being written to the database.
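The four metrics above can be sketched in Python as follows. This is an illustration under stated assumptions, not the sncLex implementation: read counts are taken as a position-to-count mapping for a single input file, the peak is counted once in both windows (the text states this explicitly only for Background), and the function name, the threshold parameter, and the zero-background fallback are hypothetical.

```python
def window_metrics(read_counts, peak, small_window, big_window, threshold=0.0):
    """Compute Expression, Background, fraction, and Prominence for one peak.

    read_counts: dict mapping genomic position -> read count (one input file).
    """
    peak_reads = read_counts.get(peak, 0)

    # Expression: peak reads plus reads within +/- small_window.
    # Assumption: the peak position itself is not added a second time.
    expression = peak_reads
    for offset in range(-small_window, small_window + 1):
        if offset != 0:
            expression += read_counts.get(peak + offset, 0)

    # Background: peak reads plus all other reads within +/- big_window,
    # skipping the peak coordinate so it is counted only once.
    background = peak_reads
    for offset in range(-big_window, big_window + 1):
        if offset != 0:
            background += read_counts.get(peak + offset, 0)

    # Assumption: a peak with no reads at all gets a score of zero.
    fraction = expression / background if background else 0.0
    prominence = 100 * expression / background if background else 0.0

    return {
        "expression": expression,
        "background": background,
        "expression_fraction": fraction,
        "prominence": prominence,
        "passes_threshold": prominence >= threshold,
    }


# Toy pileup: 50 reads at the peak, 20 nearby, 30 further away.
counts = {99: 10, 100: 50, 101: 10, 110: 30}
metrics = window_metrics(counts, peak=100, small_window=2, big_window=15,
                         threshold=60.0)
```

For this toy pileup, Expression is 70 and Background is 100, so the expression fraction is 0.7 and Prominence is 70%, which passes the hypothetical threshold of 60.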

Data Processing Pipeline

  1. Download raw fastq.gz files from the ENCODE portal for each experiment.
  2. Run FastQC to assess overall quality and adapter content before any trimming.
  3. Trim adapters with Cutadapt, retaining only reads in which an adapter was confidently detected.
  4. Filter trimmed reads by average quality (≥Q20) using fastq-filter.
  5. Collapse identical sequences with fastx_collapser from the FASTX-Toolkit.
  6. Align the collapsed reads with STAR against the GRCh38 Homo sapiens reference from GENCODE v47, using the comprehensive gene annotation plus predicted tRNA genes.
  7. Apply a custom mispriming removal script to samples prepared with poly(A) selection.
  8. Process all libraries jointly with missRNA 2.0 in multi-sample mode to call products and metrics.
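To make steps 4 and 5 concrete, here is a small Python sketch of what a mean-quality filter followed by sequence collapsing does conceptually. It is not the code of fastq-filter or fastx_collapser: it assumes Phred+33 quality encoding and uses a plain arithmetic mean of Phred scores, which may differ from how the actual tool averages qualities. All function names are hypothetical.

```python
from collections import Counter


def mean_phred(quality_string, offset=33):
    """Arithmetic mean of Phred scores, assuming Phred+33 encoding."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def filter_and_collapse(records, min_mean_quality=20):
    """Keep reads with mean quality >= threshold, then count identical sequences.

    records: iterable of (sequence, quality_string) pairs.
    Returns (sequence, copy_number) pairs, most abundant first.
    """
    counts = Counter(
        seq
        for seq, qual in records
        if qual and mean_phred(qual) >= min_mean_quality
    )
    return counts.most_common()


# Toy reads: 'I' encodes Q40, '5' encodes Q20, '!' encodes Q0.
records = [
    ("ACGT", "IIII"),
    ("ACGT", "IIII"),
    ("TTTT", "!!!!"),
    ("GGCC", "5555"),
]

collapsed = filter_and_collapse(records)
```

Here the TTTT read is discarded for low quality, the two ACGT reads collapse into a single entry with copy number 2, and GGCC is kept because its mean quality sits exactly at the Q20 cutoff.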

Publications and References

List key publications that describe the methods and data sources. Include DOIs and links where possible.

  • Author et al., Year. Title. Journal. DOI/URL.
  • Author et al., Year. Title. Journal. DOI/URL.

How to Cite

Provide a recommended citation for sncLex and any underlying methods paper.

LastName, FirstName; ... sncLex: Atlas of RNA-derived Irregular Short End-products, Year, Version, URL

Versioning

Note the current data/model version and any major changes affecting metrics or interpretation.