[Bioinfo] RNA-seq project 1 - (2) data cleaning, indexing

- data download

- data transformation

- data cleaning

- data mapping

- mapped read counting

- DEGs (statistical analysis)

- Gene expression pattern

- data visualization

Read cleaning

After downloading the data, the raw reads are cleaned. Read cleaning removes erroneous nucleotides and duplicate reads.

How should duplicates be handled?

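In theory, each template in the original library should contribute exactly one read, but because the NGS workflow includes PCR amplification, a single template can yield several reads. If these duplicates are not removed, SNP calling produces skewed allele frequencies. Removing duplicates carries its own risk, however: reads that cover the same region but come from different molecules (and so have identical sequences) may be discarded as well. In principle, duplicates could be found by looking for reads with identical sequences, but once sequencing errors are taken into account, perfectly identical copies are unlikely, so highly similar sequences have to be considered too.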
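As a rough illustration of the simplest case, the sketch below counts exact duplicates in a FASTQ file with awk. It assumes the standard four-lines-per-record layout and only catches byte-identical sequences; it does not handle the near-identical reads discussed above. The function name `count_exact_duplicates` is made up for this example.

```shell
# Count exact-duplicate reads in a FASTQ file.
# Assumes 4 lines per record, with the sequence on the 2nd line of each record.
count_exact_duplicates() {
    awk 'NR % 4 == 2 { seen[$0]++; total++ }
         END {
             unique = 0
             for (s in seen) unique++
             # "duplicates" = reads that are extra copies of an already-seen sequence
             printf "total=%d unique=%d duplicates=%d\n", total, unique, total - unique
         }' "$1"
}
```

Calling `count_exact_duplicates sample.fastq` (file name hypothetical) prints one summary line, which gives a quick feel for the duplication level before any filtering is run.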

The read cleaning step uses PRINSEQ, a quality-control and filtering tool. It reports summary statistics such as read length, GC content, sequence complexity, quality score distributions, counts of duplicate reads, Ns and poly-A tails, assembly quality measures, and tag sequences.

prinseq-lite -fastq ${fq} -out_format 3 -out_good ${fq%.fastq}good -out_bad ${fq%.fastq}bad -log ${fq%.fastq}.log -min_len 50 -min_qual_score 5 -min_qual_mean 15 -derep 14 -trim_qual_left 15 -trim_qual_right 15

Our data is single-end FASTQ, so the input is a single FASTQ file.
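The command above refers to `${fq}`, which suggests it runs inside a loop over input files. The sketch below shows one way that loop might look, as a dry run that prints each command instead of executing it. The helper name `prinseq_dry_run` and the file names `sample1.fastq` and `sample2.fastq` are assumptions for illustration, not part of the original workflow.

```shell
# Print (rather than run) the prinseq-lite command for one single-end FASTQ file.
# Remove the leading "echo" to execute it; that requires prinseq-lite on PATH.
prinseq_dry_run() {
    input=$1
    base=${input%.fastq}   # "${var%pattern}" strips the shortest matching suffix
    echo prinseq-lite -fastq "$input" -out_format 3 \
        -out_good "${base}good" -out_bad "${base}bad" -log "${base}.log" \
        -min_len 50 -min_qual_score 5 -min_qual_mean 15 -derep 14 \
        -trim_qual_left 15 -trim_qual_right 15
}

# Hypothetical file names, for illustration only.
for fq in sample1.fastq sample2.fastq; do
    prinseq_dry_run "$fq"
done
```

For `sample1.fastq`, the output prefixes expand to `sample1good` and `sample1bad`, and the log goes to `sample1.log`. Quoting the expansions keeps the command safe for paths that contain spaces.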

-fastq <file>: Input

-out_format <int>: output format

1: FASTA only

2: FASTA and QUAL

3: FASTQ

4: FASTQ and FASTA

5: FASTQ, FASTA and QUAL

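-out_good <file>: output file for reads that pass the filters

-out_bad <file>: output file for reads that fail the filters

-min_len <int>: discard reads whose remaining length after filtering is below this value (50 here). Set it to roughly half the original read length.

-min_qual_score <int>, -min_qual_mean <int>: PHRED score thresholds. -min_qual_score discards a read if any base scores below the threshold; -min_qual_mean discards a read whose mean score is below the threshold.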

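-derep <int>: remove duplicate reads. The value is built from the digits 1 to 5, combined in any order (the command above uses 14, i.e. types 1 and 4). Type 1 is a subset of types 2 and 3, and type 4 is a subset of type 5, so selecting 2 or 3 also covers 1, and selecting 5 also covers 4.

A high proportion of duplicate reads suggests that the RNA was in poor condition; in a good sequencing run it is around 10%.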

1: exact duplicate

2: 5' duplicate

3: 3' duplicate

4: reverse complement exact duplicate

5: reverse complement 5'/3' duplicate

-trim_qual_left <int>, -trim_qual_right <int>: trim bases whose quality is below the threshold from the 5' (left) and 3' (right) ends of the read.
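To make the quality thresholds above more concrete, the sketch below decodes FASTQ quality characters into PHRED scores and computes a read's mean score. It assumes the Phred+33 encoding, which is an assumption about this data set rather than something stated above, and the function names `phred_score` and `mean_phred` are made up for this example.

```shell
# Decode one quality character into its integer PHRED score (assumes Phred+33).
phred_score() {
    # printf with a leading single quote yields the character's ASCII code
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

# Mean PHRED score of a whole quality string, rounded down.
mean_phred() {
    qual=$1
    n=${#qual}
    [ "$n" -gt 0 ] || return 1   # avoid dividing by zero on an empty string
    sum=0
    i=0
    while [ "$i" -lt "$n" ]; do
        ch=$(printf '%s' "$qual" | cut -c $((i + 1)))
        sum=$((sum + $(phred_score "$ch")))
        i=$((i + 1))
    done
    echo $((sum / n))
}
```

Under this encoding `I` decodes to 40 and `5` to 20, so a read with quality string `II55` has a mean of 30 and would pass `-min_qual_mean 15`. A quality string containing `!` (score 0) would be rejected by `-min_qual_score 5` whatever its mean.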

Indexing

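Before mapping, the reference has to be indexed. Indexing is often compared to building the index of a book: just as an index lets you jump quickly to the chapter you want, a reference index speeds up mapping. Indexing here is done with bowtie2. The command below calls align_and_estimate_abundance.pl, a utility script distributed with Trinity that wraps an aligner (bowtie2 here) and an abundance estimation method (RSEM here); with --prep_reference it prepares the reference index. Trinity itself is a de novo assembler, so when no reference is available, transcripts can be built by joining read fragments (de novo assembly).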

align_and_estimate_abundance.pl --transcripts Arab_mRNA.fa --est_method RSEM --aln_method bowtie2 --prep_reference

References: Chungcheong ICT AI Bioinformatics course

https://bioinf.comav.upv.es/courses/sequence_analysis/read_cleaning.html
