我有一系列这样定义的类(注意:它们通过 扩展了一个已经存在的数据库automap
,所以对下面列的引用可能不会反映这里的类):
class VariantAssociation(Base):
__tablename__ = "sample_variant_association"
vid = Column(Integer, ForeignKey("variants.variant_id"),
primary_key=True, index=True)
sid = Column(Integer, ForeignKey("samples.sample_id"),
primary_key=True, index=True)
vdepth = Column(Integer, index=True)
valt_depth = Column(Integer, index=True)
gt = Column(Text)
gt_type = Column(Integer)
fraction = Column(Float, index=True)
variant = relationship("Variant", back_populates="samples")
sample = relationship("Samples", back_populates="variants")
class Variant(Base):
__tablename__ = "variants"
variant_id = Column(Integer, primary_key=True)
info = deferred(Column(LargeBinary))
samples = relationship("VariantAssociation",
back_populates="variant")
def __repr__(self):
data = "<Variant {chrom}:{start}-{end} {gene} {ref}/{alt} {type}>"
return data.format(chrom=self.chrom,
start=self.start,
end=self.end,
gene=self.gene,
ref=self.ref,
alt=self.alt,
type=self.type)
class Samples(Base):
__tablename__ = "samples"
sample_id = Column(Integer, primary_key=True, index=True)
name = Column(Text, index=True)
variants = relationship("VariantAssociation",
back_populates="sample")
它们组装在一个相当复杂的查询中,但在这里精简了:
query = session.query(Variant).join(VariantAssociation.variant_id).join(Samples)
query = query.filter(VariantAssociation.vdepth >= 60)
现在,我想计算两列的组合:ref
和alt
。
我认为这很简单:
query = query.with_entities(Variant.ref, Variant.alt,
func.count()).distinct().group_by(gemini.Variant.ref, gemini.Variant.alt)
其中产生(一行示例):
('A', 'C', 308)
但是,如果我只是迭代查询并计数:
from collections import defaultdict, Counter
counts = defaultdict(Counter)
for row in query.with_entities(Variant.ref, Variant.alt):
counts[f"{row.ref}>{row.alt}"].update(["present"])
给我
'A>C': Counter({'present': 155})
近一半的我发现通过count
。我知道后者是正确的,而不是前者。但我想使用前者,因为后者可能非常慢(大型 SQLite 数据库)。
我是否错误地设置了计数?
编辑:根据要求,完整查询count
(包括来自数据库本身的更多过滤器)
SELECT DISTINCT variants.ref AS variants_ref, variants.alt AS variants_alt, count(*) AS count_1
FROM variants JOIN sample_variant_association ON variants.variant_id = sample_variant_association.vid JOIN
samples ON samples.sample_id = sample_variant_association.sid
WHERE sample_variant_association.gt_type != ? AND variants.impact NOT IN (?, ?, ?, ?) AND
sample_variant_association.vdepth >= ? AND sample_variant_association.fraction >= ? AND variants.chrom NOT IN (?,
?) AND variants.aaf_1kg_eur < ? AND variants.type = ? AND sample_variant_association.fraction >= ? AND
sample_variant_association.vdepth >= ? GROUP BY variants.ref, variants.alt
还有一个用于迭代的:
SELECT DISTINCT variants.ref AS variants_ref, variants.alt AS variants_alt
FROM variants JOIN sample_variant_association ON variants.variant_id = sample_variant_association.vid JOIN
samples ON samples.sample_id = sample_variant_association.sid
WHERE sample_variant_association.gt_type != ? AND variants.impact NOT IN (?, ?, ?, ?) AND
sample_variant_association.vdepth >= ? AND sample_variant_association.fraction >= ? AND variants.chrom NOT IN (?,
?) AND variants.aaf_1kg_eur < ? AND variants.type = ? AND sample_variant_association.fraction >= ? AND
sample_variant_association.vdepth >= ?
编辑 2:我将此追溯到基本查询中重复的变体 ID 的存在:
query.with_entities(gemini.Variant.variant_id).count()
18288
query.with_entities(gemini.Variant.variant_id).distinct().count()
14437
所以这个问题与我最初的想法不同。以某种方式在循环中考虑了重复记录,但不考虑func.count()
.
使用子查询有效,首先删除重复项:
id_subquery = query.with_entities(Variant.variant_id).distinct().subquery()
然后获取实际数据:
c_query = session.query(Variant.ref, Variant.alt, func.count(1))
c_query = c_query.filter(Variant.variant_id.in_(id_subquery))
c_query = c_query.group_by(Variant.ref, Variant.alt)
c_query.first()
('A', 'C', 155)
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句