Django ORM:如何按日期排序,然后选择按外键分组的最佳对象?

杰夫

我知道我的头衔有点复杂,但是请允许我示范一下。我在使用Python 3的Django 2.2.5。这是我目前正在使用的模型:

from django.db import models
from django.db.models import F
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField, SearchVector, SearchQuery, SearchRank

class Thread(models.Model):
    title = models.CharField(max_length=100)
    last_update = models.DateTimeField(auto_now=True)

class PostQuerySet(models.QuerySet):
    _search_vector = SearchVector('thread__type') + \
                     SearchVector('thread__title') + \
                     SearchVector('from_name') + \
                     SearchVector('from_email') + \
                     SearchVector('message')

    ###
    # There's code here that updates the `Post.search_vector` field for each `Post` object
    # using `PostQuerySet._search_vector`.
    ###

    def search(self, text):
        """
            Search posts using the indexed `search_vector` field. I can, for example, call
            `Post.objects.search('influenza h1n1')`.
        """
        search_query = SearchQuery(text)
        search_rank = SearchRank(F('search_vector'), search_query)
        return self.annotate(rank=search_rank).filter(search_vector=search_query).order_by('-rank')

class Post(models.Model):
    thread = models.ForeignKey(Thread, on_delete=models.CASCADE)
    timestamp = models.DateTimeField()
    from_name = models.CharField(max_length=100)
    from_email = models.EmailField()
    message = models.TextField()
    in_response_to = models.ManyToManyField('self', symmetrical=False, blank=True)
    search_vector = SearchVectorField(null=True)

    objects = PostQuerySet().as_manager()

    class Meta:
        ordering = ['timestamp']
        indexes = [
            GinIndex(fields=['search_vector'])
        ]

(为了简洁起见,我在这些模型中切入了一些东西,我认为这是无关紧要的,但是如果以后变得很重要,我将其添加进来。)

用英语,我正在使用一个表示电子邮件列表服务中数据的应用程序。基本上,其中Thread包含多个Post对象。人们会回复所有最初的帖子并进行讨论。我刚刚使用Django提供的内置支持在Django中进行全文搜索来实现搜索功能。超级快,我喜欢。这是我在中搜索的示例views.py

###
# Pull `query` from a form defined in `forms.py`.
###

search_results = Post.objects.search(query).order_by('-timestamp')

一切都很好,而且返回的搜索结果绝对有意义。但是我刚刚遇到一种情况,我不确定如何处理。显示的结果不是我想要的那么干净。该查询使我得到的是Post与用户提供的匹配的所有对象query很好,但是Post同一对象中可能有很多对象Thread阻塞了结果。可能是这样的:

post5 from thread2 - timestamp 2018-04-01, rank 0.5
post1 from thread3 - timestamp 2018-03-01, rank 0.25
post3 from thread2 - timestamp 2018-02-01, rank 0.75
post3 from thread1 - timestamp 2018-01-01, rank 0.6
post2 from thread1 - timestamp 2017-12-01, rank 0.7
post2 from thread2 - timestamp 2017-11-01, rank 0.7

(这里rank是DjangoSearchRank方法返回的相关性。)

我真正想要的是这样的:我想要显示的最有代表性的匹配Post每个Thread,通过排序降序时间戳。换句话说,对于在搜索结果中Thread包含的每个对象Post,仅rank Post应显示最高的rank Post对象,并且这些最高的对象应按时间戳按降序排序。因此,在上面的示例中,这些是我希望看到的结果:

post1 from thread3 - timestamp 2018-03-01, rank 0.25
post3 from thread2 - timestamp 2018-02-01, rank 0.75
post2 from thread1 - timestamp 2017-12-01, rank 0.7

通过几个for循环来完成我想做的事情将是相当简单的,但是我真的希望有一种方法可以纯粹在ORM中完成以提高效率。你们有什么建议吗?如果您需要我澄清有关问题设置或所需内容的任何信息,请告诉我。

保罗·梅尔基奥尔(Paolo Melchiorre)

我认为您必须查询Post模型,按线程等级时间戳对其进行排序,然后distinct线程上使用

搜索

这是按时间戳排序的搜索:

Post.objects.search("text").order_by("-timestamp")

这是在本地PostgreSQL上执行的SQL:

SELECT
    "post"."from_name",
    "thread"."title",
    "post"."timestamp",
    ts_rank("post"."search_vector", plainto_tsquery('text')) AS "rank"
FROM
    "post"
    INNER JOIN "thread" ON ("post"."thread_id" = "thread"."id")
WHERE
    "post"."search_vector" @@ (plainto_tsquery('dolor')) = TRUE
ORDER BY
    "post"."timestamp" DESC

这些是带有我的本地数据的搜索结果:

post1 from thread1 - timestamp 2019-07-01, rank 0.0607927
post2 from thread1 - timestamp 2019-06-01, rank 0.0759909
post1 from thread2 - timestamp 2019-06-01, rank 0.0759909
post2 from thread2 - timestamp 2019-05-01, rank 0.0607927
post3 from thread1 - timestamp 2019-05-01, rank 0.0607927
post1 from thread3 - timestamp 2019-05-01, rank 0.0607927
post3 from thread2 - timestamp 2019-04-01, rank 0.0759909
post4 from thread1 - timestamp 2019-04-01, rank 0.0759909
post2 from thread3 - timestamp 2019-04-01, rank 0.0759909
post5 from thread1 - timestamp 2019-03-01, rank 0.0607927
post3 from thread3 - timestamp 2019-03-01, rank 0.0607927
post4 from thread2 - timestamp 2019-03-01, rank 0.0607927
post5 from thread2 - timestamp 2019-02-01, rank 0.0759909
post4 from thread3 - timestamp 2019-02-01, rank 0.0759909
post5 from thread3 - timestamp 2019-01-01, rank 0.0759909

这是正确的查询,它仅显示每个线程的最具代表性的匹配帖子(基于搜索排名),并按时间戳降序排序

Post.objects.search("text").order_by(
   "thread", "-rank", "-timestamp"
).distinct("thread")

这是在本地PostgreSQL上执行的SQL:

SELECT DISTINCT ON ("forum_post"."thread_id")
    "forum_post"."from_name",
    "forum_thread"."title",
    "forum_post"."timestamp",
    ts_rank("forum_post"."search_vector", plainto_tsquery('dolor')) AS "rank"
FROM
    "forum_post"
    INNER JOIN "forum_thread" ON ("forum_post"."thread_id" = "forum_thread"."id")
WHERE
    "forum_post"."search_vector" @@ (plainto_tsquery('dolor')) = TRUE
ORDER BY
    "forum_post"."thread_id" ASC,
    "rank" DESC,
    "forum_post"."timestamp" DESC

这些是带有我的本地数据的搜索结果:

post2 from thread1 - timestamp 2019-06-01, rank 0.0759909
post1 from thread2 - timestamp 2019-06-01, rank 0.0759909
post2 from thread3 - timestamp 2019-04-01, rank 0.0759909

注意

您可以distinct在Django官方文档中了解更多信息

更新资料

如果您需要按相反的顺序按时间戳绝对排序,并且不需要显示排名,则可以使用子查询在上一个查询之后对帖子进行排序:

Post.objects.filter(
    pk__in=Subquery(
        Post.objects.search("dolor")
        .order_by("-thread", "-rank", "-timestamp")
        .distinct("thread")
        .values("id")
    )
).order_by("-timestamp")

这是在本地PostgreSQL上执行的SQL:

SELECT
    "forum_post"."from_name",
    "forum_thread"."title",
    "forum_post"."timestamp"
FROM
    "forum_post"
    INNER JOIN "forum_thread" ON ("forum_post"."thread_id" = "forum_thread"."id")
WHERE
    "forum_post"."id" IN ( SELECT DISTINCT ON (U0. "thread_id")
            U0. "id"
        FROM
            "forum_post" U0
        WHERE
            U0. "search_vector" @@ (plainto_tsquery('dolor')) = TRUE
        ORDER BY
            U0. "thread_id" DESC,
            ts_rank(U0. "search_vector", plainto_tsquery('dolor'))
            DESC,
            U0. "timestamp" DESC)
ORDER BY
    "forum_post"."timestamp" DESC

这些是带有我的本地数据的搜索结果:

post2 from thread1 - timestamp 2019-06-01
post1 from thread2 - timestamp 2019-06-01
post2 from thread3 - timestamp 2019-04-01

本文收集自互联网,转载请注明来源。

如有侵权,请联系 [email protected] 删除。

编辑于
0

我来说两句

0 条评论
登录 后参与评论

相关文章