教程來自ANALYSIS OF SINGLE CELL RNA-SEQ DATAANALYSIS OF SINGLE CELL RNA-SEQ DATA
https://broadinstitute.github.io/KrumlovSingleCellWorkshop2020/index.html
本文主要針對Seurat官網教程上一些難懂的部分進行分享,
scRNA-Seq和bulk RNA-Seq兩個關鍵的不同是drop-out和potential for quality control (QC) metrics,前者是指單細胞中有限的RNA探測導致很多的0值,也正因如此,使用sparse matrix而不是dense matrix進行存盤,這樣節省存盤空間,因為sparse matrix默認大多數為0(用 . 表示),只存盤非零值,
識別分析細胞亞群之前,質量控制非常重要
counts <- Read10X(data.dir = counts_matrix_filename) # Seurat function to read in 10x count data
將檔案讀入后得到sparse Matrix of class "dgCMatrix"
rownames是不同的gene名(feature),colnames是不同的16 nt 長barcode序列(細胞),可以通過dim(counts)等函式查看matrix的行列數,即總細胞數和基因數,
過濾低質量的細胞
通過 Matrix::colSums(counts) 計算每一列的count和,即counts per cell
通過 Matrix::rowSums(counts) 計算每一行的count和,即counts per gene
而 Matrix::colSums(counts > 0)則是分別計算每一列中counts大于0的有多少行,即細胞中有多少個基因被表達了,colSums 和 rowSums函式回傳的是向量vector形式,
通過探測的基因數量(文庫復雜性)對細胞排序
通過plot圖可以觀察到failed libraries (lower end outliers) 或cell doublets (higher end outliers).
plot(sort(genes_per_cell), xlab='cell', log='y', main='genes per cell (ordered)')
使用Seurat beginning
- 創建Seurat物件
seurat <- CreateSeuratObject(counts = counts, min.cells = 3, min.features = 350, project = "10X_NSCLC")留下在3個或更多細胞中表達的基因,表達350個或更多基因的細胞,
seurat物件包括多個slots ,不僅存盤原始的count資料,也存盤計算后的結果,會隨著分析不斷更新,如meta.data,是一個資料框,原始變數包括orig.ident, nCount_RNA, nFeature_RNA,可以用$或[參考,
ps:orig.ident指cell identity,細胞身份,此時它的identity是指10X_NSCLC,但在細胞聚類后,則指細胞所屬的cluster
- 獲取物件資訊
GetAssayData(object = seurat, slot = 'data')[1:10, 1:10]
- 獲取seurat包含的method 列出seurat的所有函式
utils::methods(class = 'Seurat') ls("package:Seurat")預處理1:過濾掉低質量細胞
- 除去被破壞的受傷細胞,當細胞應急凋亡,會導致線粒體泄露和RNA降解,因此會有線粒體來源基因的富集,計算每個細胞中線粒體轉錄本的比例percent.mt,作小提琴圖,
# The [[ operator can add columns to object metadata. This is a great place to stash QC
seurat[["percent.mt"]] <- PercentageFeatureSet(object = seurat, pattern = "^MT-")
VlnPlot(object = seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
- 管家基因:細胞中管家基因的表達數量可以反映細胞常規程序的普遍活躍水平,風度高,穩定表達,對dropout不敏感,計算每個細胞中有多少管家基因表達,
# Load the the list of house keeping genes
hkgenes <- read.table(paste0(dirname, "housekeepers.txt"), skip = 2)
hkgenes <- as.vector(hkgenes$V1)
# remove hkgenes that were not found
hkgenes.found <- which(toupper(rownames(seurat@assays$RNA@data)) %in% hkgenes)
n.expressed.hkgenes <- Matrix::colSums(seurat@assays$RNA@data[hkgenes.found, ] > 0)
seurat <- AddMetaData(object = seurat, metadata = n.expressed.hkgenes, col.name = "n.exp.hkgenes")
VlnPlot(object = seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt", "n.exp.hkgenes"), ncol = 2)
- 過濾
seurat <- SubsetData(object = seurat, subset.names = c("nFeature_RNA", "percent.mito","n.exp.hkgenes"), low.thresholds = c(350, -Inf,55), high.thresholds = c(5000, 0.1, Inf)) seurat <- subset(seurat, nFeature_RNA < 5000) seurat <- subset(seurat, nFeature_RNA > 350) seuart <- subset(seurat, n.exp.hkgenes > 55) seuart <- subset(seurat, percent.mt < 10)預處理2:表達標準化
- LogNormalize:表達量乘scale因子,默認10的4次方,再進行對數轉換,可以穩定方差
seurat <- NormalizeData(object = seurat, normalization.method = "LogNormalize", scale.factor = 1e4)差異基因的檢測
- 計算每個基因在細胞中的平均表達量和細胞間離散程度z-score
seurat <- FindVariableFeatures(object = seurat, selection.method = "vst", nfeatures = 2000) # Identify the 10 most highly variable genes top10 <- head(VariableFeatures(seurat), 10) seurat <- FindVariableFeatures(object = seurat, selection.method = "vst", nfeatures = 2000, mean.function = ExpMean, dispersion.function = LogVMR, num.bin = 40, mean.cutoff = c(0.0125, 1), dispersion.cutoff = c(0, 0.5)) ## Check number of variable genes length(seurat@assays$RNA@var.features)selection.method
How to choose top variable features. Choose one of :vst: First, fits a line to the relationship of log(variance) and log(mean) using local polynomial regression (loess). Then standardizes the feature values using the observed mean and expected variance (given by the fitted line). Feature variance is then calculated on the standardized values after clipping to a maximum (see clip.max parameter).
mean.var.plot (mvp): First, uses a function to calculate average expression (mean.function) and dispersion (dispersion.function) for each feature. Next, divides features into num.bin (deafult 20) bins based on their average expression, and calculates z-scores for dispersion within each bin. The purpose of this is to identify variable features while controlling for the strong relationship between variability and average expression.
dispersion (disp): selects the genes with the highest dispersion values
關于selection的3種方法,下期再敘!
轉載請註明出處,本文鏈接:https://www.uj5u.com/qita/395014.html
標籤:AI
