1. A comprehensive evaluation of module detection methods for gene expression data (Nature Communications)
Abstract
A critical step in the analysis of large genome-wide gene expression
datasets is the use of module detection methods to group genes into
co-expression modules. Because of limitations of classical clustering
methods, numerous alternative module detection methods have been
proposed, which improve upon clustering by handling co-expression in
only a subset of samples, modelling the regulatory network, and/or
allowing overlap between modules. In this study we use known regulatory
networks to do a comprehensive and robust evaluation of these different
methods. Overall, decomposition methods outperform all other strategies,
while we do not find a clear advantage of biclustering and network
inference-based approaches on large gene expression datasets. Using our
evaluation workflow, we also investigate several practical aspects of
module detection, such as parameter estimation and the use of
alternative similarity measures, and conclude with recommendations for
the further development of these methods.
2. Genome-wide analyses using UK Biobank data provide insights into the genetic architecture of osteoarthritis (Nature Genetics)
Abstract
Osteoarthritis is a common complex disease imposing a large public-health burden.
Here, we performed a genome-wide association study for osteoarthritis,
using data across 16.5 million variants from the UK Biobank resource.
After performing replication and meta-analysis in up to 30,727 cases and
297,191 controls, we identified nine new osteoarthritis loci, in all of
which the most likely causal variant was noncoding. For three loci, we
detected association with biologically relevant radiographic
endophenotypes, and in five signals we identified genes that were
differentially expressed in degraded compared with intact articular
cartilage from patients with osteoarthritis. We established causal
effects on osteoarthritis for higher body mass index but not for
triglyceride levels or genetic predisposition to type 2 diabetes.
3. SvABA: genome-wide detection of structural variants and indels by local assembly (Genome Research)
Abstract
Structural variants (SVs), including small insertion and deletion variants
(indels), are challenging to detect through standard alignment-based
variant calling methods. Sequence assembly offers a powerful approach to
identifying SVs, but is difficult to apply at scale genome-wide for SV
detection due to its computational complexity and the difficulty of
extracting SVs from assembly contigs. We describe SvABA, an efficient
and accurate method for detecting SVs from short-read sequencing data
using genome-wide local assembly with low memory and computing
requirements. We evaluated SvABA's performance on the NA12878 human
genome and in simulated and real cancer genomes. SvABA demonstrates
superior sensitivity and specificity across a large spectrum of SVs and
substantially improves detection performance for variants in the 20–300
bp range, compared with existing methods. SvABA also identifies complex
somatic rearrangements with chains of short (<1000 bp)
templated-sequence insertions copied from distant genomic regions. We
applied SvABA to 344 cancer genomes from 11 cancer types and found that
short templated-sequence insertions occur in ∼4% of all somatic
rearrangements. Finally, we demonstrate that SvABA can identify sites of
viral integration and cancer driver alterations containing medium-sized
(50–300 bp) SVs.
4. Cultivation and sequencing of rumen microbiome members from the Hungate1000 Collection (Nature Biotechnology)
Abstract
Productivity of ruminant livestock depends on the rumen microbiota, which ferment
indigestible plant polysaccharides into nutrients used for growth.
Understanding the functions carried out by the rumen microbiota is
important for reducing greenhouse gas production by ruminants and for
developing biofuels from lignocellulose. We present 410 cultured
bacteria and archaea, together with their reference genomes,
representing every cultivated rumen-associated archaeal and bacterial
family. We evaluate polysaccharide degradation, short-chain fatty acid
production and methanogenesis pathways, and assign specific taxa to
functions. A total of 336 organisms were present in available rumen
metagenomic data sets, and 134 were present in human gut microbiome data
sets. Comparison with the human microbiome revealed rumen-specific
enrichment for genes encoding de novo synthesis of vitamin B12, ongoing
evolution by gene loss and potential vertical inheritance of the rumen
microbiome based on underrepresentation of markers of environmental
stress. We estimate that our Hungate genome resource represents ∼75% of
the genus-level bacterial and archaeal taxa present in the rumen.
5. Linear assembly of a human centromere on the Y chromosome (Nature Biotechnology)
Abstract
The human genome reference sequence remains incomplete owing to the
challenge of assembling long tracts of near-identical tandem repeats in
centromeres. We implemented a nanopore sequencing strategy to generate
high-quality reads that span hundreds of kilobases of highly repetitive
DNA in a human Y chromosome centromere. Combining these data with
short-read variant validation, we assembled and characterized the
centromeric region of a human Y chromosome.
6. Multiancestry genome-wide association study of 520,000 subjects identifies 32 loci associated with stroke and stroke subtypes (Nature Genetics)
Abstract
Stroke has multiple etiologies, but the underlying genes and pathways are largely unknown. We conducted a multiancestry genome-wide-association meta-analysis in 521,612 individuals (67,162 cases and 454,450 controls) and discovered 22 new stroke risk loci, bringing the total to 32. We further found shared genetic variation with related vascular traits, including blood pressure, cardiac traits, and venous thromboembolism, at individual loci (n = 18), and using genetic risk scores and linkage-disequilibrium-score regression. Several loci exhibited distinct association and pleiotropy patterns for etiological stroke subtypes. Eleven new susceptibility loci indicate mechanisms not previously implicated in stroke pathophysiology, with prioritization of risk variants and genes accomplished through bioinformatics analyses using extensive functional datasets. Stroke risk loci were significantly enriched in drug targets for antithrombotic therapy.
7. Target-enrichment sequencing for detailed characterization of small RNAs (Nature Protocols)
Abstract
Identification of important, functional small RNA (sRNA) species is currently hampered
by the lack of reliable and sensitive methods to isolate and
characterize them. We have developed a method, termed target-enrichment
of sRNAs (TEsR), that enables targeted sequencing of rare sRNAs and
diverse precursor and mature forms of sRNAs not detectable by current
standard sRNA sequencing methods. It is based on the amplification of
full-length sRNA molecules, production of biotinylated RNA probes,
hybridization to one or multiple targeted RNAs, removal of nontargeted
sRNAs and sequencing. By this approach, target sRNAs can be enriched by a
factor of 500–30,000 while maintaining strand specificity. TEsR
enriches for sRNAs irrespective of length or different molecular
features, such as the presence or absence of a 5′ cap or of secondary
structures or abundance levels. Moreover, TEsR allows the detection of
the complete sequence (including sequence variants, and 5′ and 3′ ends)
of precursors, as well as intermediate and mature forms, in a
quantitative manner. A well-trained molecular biologist can complete the
TEsR procedure, from RNA extraction to sequencing library preparation,
within 4–6 d.
8. Capture Hi-C identifies putative target genes at 33 breast cancer risk loci (Nature Communications)
Abstract
Genome-wide association studies (GWAS) have identified approximately 100 breast cancer risk loci. Translating these findings into a greater understanding of the mechanisms that influence disease risk requires identification of the genes or non-coding RNAs that mediate these associations. Here, we use Capture Hi-C (CHi-C) to annotate 63 loci; we identify 110 putative target genes at 33 loci. To assess the support for these target genes in other data sources we test for associations between levels of expression and SNP genotype (eQTLs), disease-specific survival (DSS), and compare them with somatically mutated cancer genes. 22 putative target genes are eQTLs, 32 are associated with DSS and 14 are somatically mutated in breast, or other, cancers. Identifying the target genes at GWAS risk loci will lead to a greater understanding of the mechanisms that influence breast cancer risk and prognosis.
9. Structure and evolution of the Fam20 kinases (Nature Communications)
Abstract
The Fam20 proteins are novel kinases that phosphorylate secreted proteins and proteoglycans. Fam20C phosphorylates hundreds of secreted proteins and is activated by the pseudokinase Fam20A. Fam20B phosphorylates a xylose residue to regulate proteoglycan synthesis. Despite these wide-ranging and important functions, the molecular and structural basis for the regulation and substrate specificity of these kinases are unknown. Here we report molecular characterizations of all three Fam20 kinases, and show that Fam20C is activated by the formation of an evolutionarily conserved homodimer or heterodimer with Fam20A. Fam20B has a unique active site for recognizing Galβ1-4Xylβ1, the initiator disaccharide within the tetrasaccharide linker region of proteoglycans. We further show that in animals the monomeric Fam20B preceded the appearance of the dimeric Fam20C, and the dimerization trait of Fam20C emerged concomitantly with a change in substrate specificity. Our results provide comprehensive structural, biochemical, and evolutionary insights into the function of the Fam20 kinases.
10. QAPA: a new method for the systematic analysis of alternative polyadenylation from RNA-seq data (Genome Biology)
Abstract
Alternative polyadenylation (APA) affects most mammalian genes. The genome-wide investigation of APA has been hampered by an inability to reliably profile it using conventional RNA-seq. We describe ‘Quantification of APA’ (QAPA), a method that infers APA from conventional RNA-seq data. QAPA is faster and more sensitive than other methods. Application of QAPA reveals discrete, temporally coordinated APA programs during neurogenesis and that there is little overlap between genes regulated by alternative splicing and those by APA. Modeling of these data uncovers an APA sequence code. QAPA thus enables the discovery and characterization of programs of regulated APA using conventional RNA-seq.
11. SUPPA2: fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions (Genome Biology)
Abstract
Despite the many approaches to study differential splicing from RNA-seq, many challenges remain unsolved, including computing capacity and sequencing depth requirements. Here we present SUPPA2, a new method that addresses these challenges, and enables streamlined analysis across multiple conditions taking into account biological variability. Using experimental and simulated data, we show that SUPPA2 achieves higher accuracy compared to other methods, especially at low sequencing depth and short read length. We use SUPPA2 to identify novel Transformer2-regulated exons, novel microexons induced during differentiation of bipolar neurons, and novel intron retention events during erythroblast differentiation.
12. FusorSV: an algorithm for optimally combining data from multiple structural variation detection methods (Genome Biology)
Abstract
Comprehensive and accurate identification of structural variations (SVs) from next generation sequencing data remains a major challenge. We develop FusorSV, which uses a data mining approach to assess performance and merge callsets from an ensemble of SV-calling algorithms. It includes a fusion model built using analysis of 27 deep-coverage human genomes from the 1000 Genomes Project. We identify 843 novel SV calls that were not reported by the 1000 Genomes Project for these 27 samples. Experimental validation of a subset of these calls yields a validation rate of 86.7%. FusorSV is available at https://github.com/TheJacksonLaboratory/SVE.
13. taxMaps: comprehensive and highly accurate taxonomic classification of short-read data in reasonable time