Column: VG生信软件 (VG Bioinformatics Software)

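The first company in China to develop visual bioinformatics desktop software for the Windows platform. It is dedicated to providing leading bioinformatics software products and system services. Its products and services include microbial diversity analysis software, transcriptome analysis software, resequencing analysis software, and a bacterial genome analysis system.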

High-Impact Bioinformatics Literature Digest

VG生信软件 · WeChat official account · 2018-04-02 17:30

1. A comprehensive evaluation of module detection methods for gene expression data (Nature Communications)

Abstract

A critical step in the analysis of large genome-wide gene expression datasets is the use of module detection methods to group genes into co-expression modules. Because of limitations of classical clustering methods, numerous alternative module detection methods have been proposed, which improve upon clustering by handling co-expression in only a subset of samples, modelling the regulatory network, and/or allowing overlap between modules. In this study we use known regulatory networks to do a comprehensive and robust evaluation of these different methods. Overall, decomposition methods outperform all other strategies, while we do not find a clear advantage of biclustering and network inference-based approaches on large gene expression datasets. Using our evaluation workflow, we also investigate several practical aspects of module detection, such as parameter estimation and the use of alternative similarity measures, and conclude with recommendations for the further development of these methods.

2. Genome-wide analyses using UK Biobank data provide insights into the genetic architecture of osteoarthritis (Nature Genetics)

Abstract

Osteoarthritis is a common complex disease imposing a large public-health burden. Here, we performed a genome-wide association study for osteoarthritis, using data across 16.5 million variants from the UK Biobank resource. After performing replication and meta-analysis in up to 30,727 cases and 297,191 controls, we identified nine new osteoarthritis loci, in all of which the most likely causal variant was noncoding. For three loci, we detected association with biologically relevant radiographic endophenotypes, and in five signals we identified genes that were differentially expressed in degraded compared with intact articular cartilage from patients with osteoarthritis. We established causal effects on osteoarthritis for higher body mass index but not for triglyceride levels or genetic predisposition to type 2 diabetes.

3. SvABA: genome-wide detection of structural variants and indels by local assembly (Genome Research)

Abstract

Structural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at scale genome-wide for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA's performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs and substantially improves detection performance for variants in the 20–300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (<1000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types and found that short templated-sequence insertions occur in ∼4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized (50–300 bp) SVs.

4. Cultivation and sequencing of rumen microbiome members from the Hungate1000 Collection (Nature Biotechnology)

Abstract

Productivity of ruminant livestock depends on the rumen microbiota, which ferment indigestible plant polysaccharides into nutrients used for growth. Understanding the functions carried out by the rumen microbiota is important for reducing greenhouse gas production by ruminants and for developing biofuels from lignocellulose. We present 410 cultured bacteria and archaea, together with their reference genomes, representing every cultivated rumen-associated archaeal and bacterial family. We evaluate polysaccharide degradation, short-chain fatty acid production and methanogenesis pathways, and assign specific taxa to functions. A total of 336 organisms were present in available rumen metagenomic data sets, and 134 were present in human gut microbiome data sets. Comparison with the human microbiome revealed rumen-specific enrichment for genes encoding de novo synthesis of vitamin B12, ongoing evolution by gene loss and potential vertical inheritance of the rumen microbiome based on underrepresentation of markers of environmental stress. We estimate that our Hungate genome resource represents ∼75% of the genus-level bacterial and archaeal taxa present in the rumen.

5. Linear assembly of a human centromere on the Y chromosome (Nature Biotechnology)

Abstract

The human genome reference sequence remains incomplete owing to the challenge of assembling long tracts of near-identical tandem repeats in centromeres. We implemented a nanopore sequencing strategy to generate high-quality reads that span hundreds of kilobases of highly repetitive DNA in a human Y chromosome centromere. Combining these data with short-read variant validation, we assembled and characterized the centromeric region of a human Y chromosome.

6. Multiancestry genome-wide association study of 520,000 subjects identifies 32 loci associated with stroke and stroke subtypes (Nature Genetics)

Abstract

Stroke has multiple etiologies, but the underlying genes and pathways are largely unknown. We conducted a multiancestry genome-wide-association meta-analysis in 521,612 individuals (67,162 cases and 454,450 controls) and discovered 22 new stroke risk loci, bringing the total to 32. We further found shared genetic variation with related vascular traits, including blood pressure, cardiac traits, and venous thromboembolism, at individual loci (n = 18), and using genetic risk scores and linkage-disequilibrium-score regression. Several loci exhibited distinct association and pleiotropy patterns for etiological stroke subtypes. Eleven new susceptibility loci indicate mechanisms not previously implicated in stroke pathophysiology, with prioritization of risk variants and genes accomplished through bioinformatics analyses using extensive functional datasets. Stroke risk loci were significantly enriched in drug targets for antithrombotic therapy.

7. Target-enrichment sequencing for detailed characterization of small RNAs (Nature Protocols)

Abstract

Identification of important, functional small RNA (sRNA) species is currently hampered by the lack of reliable and sensitive methods to isolate and characterize them. We have developed a method, termed target-enrichment of sRNAs (TEsR), that enables targeted sequencing of rare sRNAs and diverse precursor and mature forms of sRNAs not detectable by current standard sRNA sequencing methods. It is based on the amplification of full-length sRNA molecules, production of biotinylated RNA probes, hybridization to one or multiple targeted RNAs, removal of nontargeted sRNAs and sequencing. By this approach, target sRNAs can be enriched by a factor of 500–30,000 while maintaining strand specificity. TEsR enriches for sRNAs irrespective of length or different molecular features, such as the presence or absence of a 5′ cap or of secondary structures or abundance levels. Moreover, TEsR allows the detection of the complete sequence (including sequence variants, and 5′ and 3′ ends) of precursors, as well as intermediate and mature forms, in a quantitative manner. A well-trained molecular biologist can complete the TEsR procedure, from RNA extraction to sequencing library preparation, within 4–6 d.

8. Capture Hi-C identifies putative target genes at 33 breast cancer risk loci (Nature Communications)

Abstract

Genome-wide association studies (GWAS) have identified approximately 100 breast cancer risk loci. Translating these findings into a greater understanding of the mechanisms that influence disease risk requires identification of the genes or non-coding RNAs that mediate these associations. Here, we use Capture Hi-C (CHi-C) to annotate 63 loci; we identify 110 putative target genes at 33 loci. To assess the support for these target genes in other data sources we test for associations between levels of expression and SNP genotype (eQTLs), disease-specific survival (DSS), and compare them with somatically mutated cancer genes. 22 putative target genes are eQTLs, 32 are associated with DSS and 14 are somatically mutated in breast, or other, cancers. Identifying the target genes at GWAS risk loci will lead to a greater understanding of the mechanisms that influence breast cancer risk and prognosis.

9. Structure and evolution of the Fam20 kinases (Nature Communications)

Abstract

The Fam20 proteins are novel kinases that phosphorylate secreted proteins and proteoglycans. Fam20C phosphorylates hundreds of secreted proteins and is activated by the pseudokinase Fam20A. Fam20B phosphorylates a xylose residue to regulate proteoglycan synthesis. Despite these wide-ranging and important functions, the molecular and structural basis for the regulation and substrate specificity of these kinases are unknown. Here we report molecular characterizations of all three Fam20 kinases, and show that Fam20C is activated by the formation of an evolutionarily conserved homodimer or heterodimer with Fam20A. Fam20B has a unique active site for recognizing Galβ1-4Xylβ1, the initiator disaccharide within the tetrasaccharide linker region of proteoglycans. We further show that in animals the monomeric Fam20B preceded the appearance of the dimeric Fam20C, and the dimerization trait of Fam20C emerged concomitantly with a change in substrate specificity. Our results provide comprehensive structural, biochemical, and evolutionary insights into the function of the Fam20 kinases.

10. QAPA: a new method for the systematic analysis of alternative polyadenylation from RNA-seq data (Genome Biology)

Abstract

Alternative polyadenylation (APA) affects most mammalian genes. The genome-wide investigation of APA has been hampered by an inability to reliably profile it using conventional RNA-seq. We describe ‘Quantification of APA’ (QAPA), a method that infers APA from conventional RNA-seq data. QAPA is faster and more sensitive than other methods. Application of QAPA reveals discrete, temporally coordinated APA programs during neurogenesis and that there is little overlap between genes regulated by alternative splicing and those by APA. Modeling of these data uncovers an APA sequence code. QAPA thus enables the discovery and characterization of programs of regulated APA using conventional RNA-seq.

11. SUPPA2: fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions (Genome Biology)

Abstract

Despite the many approaches to study differential splicing from RNA-seq, many challenges remain unsolved, including computing capacity and sequencing depth requirements. Here we present SUPPA2, a new method that addresses these challenges, and enables streamlined analysis across multiple conditions taking into account biological variability. Using experimental and simulated data, we show that SUPPA2 achieves higher accuracy compared to other methods, especially at low sequencing depth and short read length. We use SUPPA2 to identify novel Transformer2-regulated exons, novel microexons induced during differentiation of bipolar neurons, and novel intron retention events during erythroblast differentiation.

12. FusorSV: an algorithm for optimally combining data from multiple structural variation detection methods (Genome Biology)

Abstract

Comprehensive and accurate identification of structural variations (SVs) from next generation sequencing data remains a major challenge. We develop FusorSV, which uses a data mining approach to assess performance and merge callsets from an ensemble of SV-calling algorithms. It includes a fusion model built using analysis of 27 deep-coverage human genomes from the 1000 Genomes Project. We identify 843 novel SV calls that were not reported by the 1000 Genomes Project for these 27 samples. Experimental validation of a subset of these calls yields a validation rate of 86.7%. FusorSV is available at https://github.com/TheJacksonLaboratory/SVE.
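
To make the idea of merging callsets from several SV callers concrete, here is a minimal Python sketch. It is not FusorSV's fusion model, which the abstract does not describe. It swaps in a much simpler and widely used heuristic: group calls of the same type by reciprocal overlap and keep groups supported by at least two callers. The names `SVCall`, `reciprocal_overlap` and `merge_callsets`, and the thresholds `min_ro=0.5` and `min_callers=2`, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from statistics import median_low


@dataclass(frozen=True)
class SVCall:
    """One structural-variant call (hypothetical record layout)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    svtype: str  # e.g. "DEL", "DUP", "INV"
    caller: str  # name of the tool that produced the call


def reciprocal_overlap(a: SVCall, b: SVCall) -> float:
    """Fraction of the longer-relative overlap shared by both calls (0.0 to 1.0)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def merge_callsets(calls, min_ro=0.5, min_callers=2):
    """Greedily cluster calls by reciprocal overlap with each cluster's first
    call, then keep clusters supported by at least `min_callers` callers."""
    clusters = []
    ordered = sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start, c.end))
    for call in ordered:
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], call) >= min_ro:
                cluster.append(call)
                break
        else:
            clusters.append([call])

    merged = []
    for cluster in clusters:
        callers = sorted({c.caller for c in cluster})
        if len(callers) >= min_callers:
            # Report consensus breakpoints as the (low) median of the cluster.
            start = median_low([c.start for c in cluster])
            end = median_low([c.end for c in cluster])
            merged.append((cluster[0].chrom, start, end, cluster[0].svtype, callers))
    return merged
```

A fixed caller-count threshold treats every caller as equally reliable. The abstract indicates that FusorSV instead assesses each algorithm's performance before merging, which this sketch does not attempt.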

13. taxMaps: comprehensive and highly accurate taxonomic classification of short-read data in reasonable time
