Potential method for enhancing docs search results for important terms #1649

jacklinke · 2024-10-07T03:31:50Z

Problem

Some terms are important for users to find in the Django documentation, but they are not in the title, slug, or table of contents for the document. As a result, these terms are not weighted as highly in the search results as they should be.

Example: Terms like "TextChoices" and "IntegerChoices" are important, but they are not in the title, slug, or table of contents for the Model field reference page. They are only mentioned in the body of the document.

Background

The current full-text search implementation for the docs uses the Document model's metadata JSONField. This field is populated by the Sphinx-based documentation for the Django project.

Upon each new release, the update_docs management command runs sphinx-build json (among many other things) and invokes the DocumentRelease model's sync_to_db method to import the json data into the Document model.

The metadata field contains various keys like body, toc, etc. But there is no provision for adding additional text to the metadata field and to the search vector to account for important terms that are in the body of the Document but not in the higher-weighted title, slug, or toc (table of contents).

We cannot simply add the additional text to a new key in the metadata field because the Document instances are overwritten with each new release. We need a way to store the additional text separately and add it to the metadata field when the Document instances are created or updated.

A Potential Solution

I propose adding a DocumentEnhancement model that stores additional text to be merged into the metadata field of the Document model. This text can be weighted at the same level as the toc field ("B"), so that important terms missing from the title, slug, and table of contents still carry weight in the search vector.

The DocumentEnhancement model has a one-to-one relationship with Document.
The DocumentRelease.sync_to_db method is updated to read the additional text from the DocumentEnhancement model and add it to the metadata field.
The DOCUMENT_SEARCH_VECTOR in docs/search.py is updated to include the enhancements key in the search vector.
At any time, the migrate_enhancements management command can be run to migrate DocumentEnhancement data to the metadata field of the Document model instances.
Optionally, we can improve the admin interface to allow for easy management of DocumentEnhancement.
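
As a rough, dependency-free illustration of the steps above, the sketch below uses plain dicts in place of Document and DocumentEnhancement rows. The function name `sync_release` and the sample paths are hypothetical and not part of the codebase; the point is only to show the snapshot, wipe, and re-merge order.

```python
# Sketch only: dicts stand in for database rows; all names are hypothetical.

def sync_release(existing_docs, enhancements, decoded_documents):
    """Rebuild a release's documents while keeping enhancement text.

    existing_docs: dict mapping path -> metadata (discarded on each sync)
    enhancements: dict mapping path -> additional text (must survive the sync)
    decoded_documents: list of dicts with at least "path", "title", "body"
    """
    # 1. Snapshot the enhancement text, keyed by path, before the wipe.
    saved = dict(enhancements)

    # 2. Wipe the old documents (in the database this cascades to enhancements).
    existing_docs.clear()
    enhancements.clear()

    # 3. Recreate each document and copy the saved text into its metadata.
    for doc in decoded_documents:
        metadata = dict(doc)
        metadata["enhancements"] = saved.get(doc["path"], "")
        existing_docs[doc["path"]] = metadata
        enhancements[doc["path"]] = metadata["enhancements"]
    return existing_docs


docs = {"ref/models/fields": {"title": "Model field reference", "body": "old"}}
extra = {"ref/models/fields": "TextChoices IntegerChoices"}
new_build = [
    {"path": "ref/models/fields", "title": "Model field reference", "body": "new"},
    {"path": "topics/db/models", "title": "Models", "body": "..."},
]
sync_release(docs, extra, new_build)
print(docs["ref/models/fields"]["enhancements"])
print(repr(docs["topics/db/models"]["enhancements"]))
```

The enhancement text survives the rebuild because it is keyed by path rather than by the primary key of a row that is about to be deleted.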

Implementation

Note: These are not yet tested. Just jotted down this evening for input from the community.

docs/models.py

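import html
from pathlib import Path

from django.conf import settings
from django.db import models, transaction
from django.utils.html import strip_tags

# Document, _clean_document_path, TSEARCH_CONFIG_LANGUAGES and
# DEFAULT_TEXT_SEARCH_CONFIG are assumed to be available as in the existing module.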


class DocumentEnhancement(models.Model):
    """Additional text to be added to a document."""

    document = models.OneToOneField("docs.Document", on_delete=models.CASCADE, related_name="enhancement")
    path = models.CharField(max_length=500)
    additional_text = models.TextField(blank=True)

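    # No extra uniqueness constraint is needed: the OneToOneField already limits
    # each document to one enhancement, and unique_together cannot span the
    # document__release relation.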


class DocumentRelease(models.Model):
    # ... existing content

    @transaction.atomic
    def sync_to_db(self, decoded_documents):
        """
        Sync the given list of documents (decoded fjson files from sphinx) to
        the database. Deletes all the release's documents first then
        reinserts them as needed.
        """
        # ** Store existing document enhancements, since they will be CASCADE deleted
        # Alternately, we could decouple the DocumentEnhancement from the Document by setting `path` to None
        document_enhancements = list(
            DocumentEnhancement.objects.select_related("document").filter(document__release=self)
        )
        document_enhancement_dict = {de.path: de for de in document_enhancements}

        self.documents.all().delete()

        # Read excluded paths from robots.docs.txt.
        robots_path = settings.BASE_DIR.joinpath("djangoproject", "static", "robots.docs.txt")
        with open(str(robots_path)) as fh:
            excluded_paths = [
                line.strip().split("/")[-1]
                for line in fh
                if line.startswith(f"Disallow: /{self.lang}/{self.release_id}/")
            ]

        for document in decoded_documents:
            if (
                "body" not in document
                or "title" not in document
                or document["current_page_name"].split("/")[0] in excluded_paths
            ):
                # We don't care about indexing documents with no body or title,
                # or partially translated
                continue

            document_path = _clean_document_path(document["current_page_name"])
            document["slug"] = Path(document_path).parts[-1]
            document["parents"] = " ".join(Path(document_path).parts[:-1])

            # ** Add enhancements to metadata
            # Look up by the cleaned path; the decoded document has no "path" key
            matching_enhancement = document_enhancement_dict.get(document_path)
            document["enhancements"] = matching_enhancement.additional_text if matching_enhancement else ""

            # ** Use a variable that we can use later to create the DocumentEnhancement
            created_doc = Document.objects.create(
                release=self,
                path=document_path,
                title=html.unescape(strip_tags(document["title"])),
                metadata=document,
                config=TSEARCH_CONFIG_LANGUAGES.get(self.lang[:2], DEFAULT_TEXT_SEARCH_CONFIG),
            )

            if matching_enhancement:
                # ** Recreate document enhancement
                DocumentEnhancement.objects.create(
                    document=created_doc, path=created_doc.path, additional_text=matching_enhancement.additional_text
                )
            else:
                # ** Create document enhancement if none existed
                DocumentEnhancement.objects.create(document=created_doc, path=created_doc.path)

        for document in self.documents.all():
            document.metadata["breadcrumbs"] = list(Document.objects.breadcrumbs(document).values("title", "path"))
            document.save(update_fields=("metadata",))

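        # ** Delete any unattached document enhancements. With a non-nullable
        # OneToOneField and CASCADE this matches nothing; it only matters if the
        # relation is later made nullable.
        DocumentEnhancement.objects.filter(document__isnull=True).delete()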

docs/search.py

# ... existing content

DOCUMENT_SEARCH_VECTOR = (
    SearchVector("title", weight="A", config=F("config"))
    + SearchVector(KeyTextTransform("slug", "metadata"), weight="A", config=F("config"))
    + SearchVector(KeyTextTransform("toc", "metadata"), weight="B", config=F("config"))
    + SearchVector(KeyTextTransform("enhancements", "metadata"), weight="B", config=F("config"))  # ** added
    + SearchVector(KeyTextTransform("body", "metadata"), weight="C", config=F("config"))
    + SearchVector(KeyTextTransform("parents", "metadata"), weight="D", config=F("config"))
)
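
To get a feel for what a "B" weight buys, here is a deliberately simplified scoring sketch. The numeric weights are PostgreSQL's default `ts_rank` weights for the labels D, C, B, A; the scoring function itself (weight times occurrence count) is a toy stand-in and not the formula `ts_rank` uses. The sample page content is made up.

```python
# PostgreSQL's default ts_rank weights for the labels A, B, C, D.
WEIGHTS = {"A": 1.0, "B": 0.4, "C": 0.2, "D": 0.1}

# Field -> weight label, mirroring the proposed DOCUMENT_SEARCH_VECTOR.
FIELD_LABELS = {
    "title": "A",
    "slug": "A",
    "toc": "B",
    "enhancements": "B",
    "body": "C",
    "parents": "D",
}


def toy_score(document, term):
    """Sum weight * occurrence count over fields (a toy model, not ts_rank)."""
    term = term.lower()
    score = 0.0
    for field, label in FIELD_LABELS.items():
        words = document.get(field, "").lower().split()
        score += WEIGHTS[label] * words.count(term)
    return score


# Made-up page: the term only occurs in the body.
page = {
    "title": "Model field reference",
    "slug": "fields",
    "toc": "Field options Field types",
    "body": "TextChoices appears in example code TextChoices",
    "parents": "ref models",
}
before = toy_score(page, "TextChoices")

# Same page with the term added through the hypothetical enhancements key.
page["enhancements"] = "TextChoices IntegerChoices"
after = toy_score(page, "TextChoices")

print(before, after)
```

Under this toy model a single mention in a "B" field counts as much as two mentions in the body, which is the effect the enhancements key is meant to have.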

docs/management/commands/migrate_enhancements.py

from django.core.management.base import BaseCommand
from django.db import transaction
from docs.models import Document, DocumentEnhancement


class Command(BaseCommand):
    help = "Migrate existing DocumentEnhancement data to Document metadata"

    def add_arguments(self, parser):
        parser.add_argument(
            "--force-update",
            action="store_true",
            help="Force update all documents, overwriting existing document enhancements in metadata",
        )

    @transaction.atomic
    def handle(self, *args, **options):
        force_update = options["force_update"]
        migrated = 0
        updated = 0

        # Create a dictionary of document enhancements for lookup
        document_enhancements = {(e.document_id, e.path): e.additional_text for e in DocumentEnhancement.objects.all()}

        # Process all documents
        for document in Document.objects.all():
            document_enhancement_text = document_enhancements.get((document.id, document.path), "")
            # Record whether the key existed before writing, so the counters stay accurate
            had_enhancements = "enhancements" in document.metadata

            if force_update or not had_enhancements:
                document.metadata["enhancements"] = document_enhancement_text
                document.save(update_fields=["metadata"])

                if had_enhancements:
                    updated += 1
                else:
                    migrated += 1

        self.stdout.write(self.style.SUCCESS(f"Migrated {migrated} document enhancements to Document metadata"))
        if force_update:
            self.stdout.write(
                self.style.SUCCESS(f"Updated {updated} existing document enhancements in Document metadata")
            )
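
The counting behaviour of such a command can be checked without a database. The sketch below is a hypothetical pure-Python stand-in (`backfill_enhancements` is not part of the codebase) that treats documents as dicts and returns the two counters. Note that it checks for the key before writing it, since a check made after the write would always find the key.

```python
def backfill_enhancements(documents, enhancement_text, force_update=False):
    """Copy enhancement text into each document's metadata.

    documents: list of dicts with "id", "path" and "metadata" keys
    enhancement_text: dict mapping (document id, path) -> text
    Returns a (migrated, updated) tuple.
    """
    migrated = updated = 0
    for document in documents:
        metadata = document["metadata"]
        had_key = "enhancements" in metadata  # check before writing
        if had_key and not force_update:
            continue
        key = (document["id"], document["path"])
        metadata["enhancements"] = enhancement_text.get(key, "")
        if had_key:
            updated += 1
        else:
            migrated += 1
    return migrated, updated


sample_docs = [
    {"id": 1, "path": "ref/models/fields", "metadata": {"body": "..."}},
    {"id": 2, "path": "topics/db/models", "metadata": {"body": "...", "enhancements": "old"}},
]
texts = {(1, "ref/models/fields"): "TextChoices IntegerChoices"}

first_run = backfill_enhancements(sample_docs, texts)
forced_run = backfill_enhancements(sample_docs, texts, force_update=True)
print(first_run, forced_run)
```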

docs/admin.py

Not required, but maybe useful for managing enhancements in the admin interface.

from django.contrib import admin
from django.urls import reverse
from django.utils.html import format_html

from .models import Document, DocumentEnhancement


class DocumentEnhancementInline(admin.StackedInline):
    """Inline for DocumentEnhancement model, used in DocumentAdmin."""
    model = DocumentEnhancement
    extra = 1


@admin.register(Document)
class DocumentAdmin(admin.ModelAdmin):
    list_display = ["title", "path", "release", "enhancement_link"]
    inlines = [DocumentEnhancementInline]

    def enhancement_link(self, obj):
        """Link to the document enhancement admin."""
        if hasattr(obj, "enhancement"):
            url = reverse("admin:docs_documentenhancement_change", args=[obj.enhancement.id])
            return format_html('<a href="{}">Edit Enhancement</a>', url)
        return "No Enhancement"

    enhancement_link.short_description = "Enhancement"


@admin.register(DocumentEnhancement)
class DocumentEnhancementAdmin(admin.ModelAdmin):
    """Admin for DocumentEnhancement model."""
    list_display = ["document", "path"]
    search_fields = ["document__title", "path", "additional_text"]


pauloxnet · 2024-10-07T10:39:45Z

Hi Jack and thanks for the great analysis of the problem and for having detailed this issue so much.

I have some doubts about the solution you propose, and I will try to explain some of them below.

The documentation is quite extensive, and manually adding words to those extracted automatically seems burdensome and, above all, arbitrary. In your example you cite "TextChoices", a term that appears only 3 times on the page, and only in the code examples:
https://docs.djangoproject.com/en/stable/ref/models/fields/

On that same page there are many other terms that could interest other people in other searches (e.g. "OneToOneField"), so the relevance of "TextChoices" is lower than that of terms that appear as section headings, for example.

Paradoxically, the term "TextChoices" is more relevant in the release notes, because the surrounding text contains fewer terms and the term does not appear only in code:
https://docs.djangoproject.com/en/stable/releases/3.0/

If we analyze the model field reference page, the relevance given to the term "TextChoices" is in fact very low, because it is only used in the example code snippets and is never introduced in the descriptive text. Furthermore, the page is very long and contains many highly relevant words, which penalizes every term on it.
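
The length effect can be illustrated with a toy calculation. This sketch assumes a ranking that divides by document length; PostgreSQL's `ts_rank` offers that as an optional normalization, and whether the site's search enables it is not stated here. The two sample texts are invented stand-ins for a short release note and a long reference page.

```python
def length_normalized_frequency(text, term):
    """Occurrences of `term` divided by the total number of words in `text`."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


# Invented stand-ins: a short release note and a long reference page.
release_notes = "new TextChoices and IntegerChoices classes for model field choices"
field_reference = " ".join(["field option"] * 500 + ["TextChoices"] * 3)

short_page_score = length_normalized_frequency(release_notes, "TextChoices")
long_page_score = length_normalized_frequency(field_reference, "TextChoices")

print(short_page_score, long_page_score)
```

Even though the long page mentions the term three times and the short one only once, the short page scores far higher once length is taken into account.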

The solutions I would suggest are:

1. Add a specific subsection for "TextChoices/IntegerChoices/..." so that the term appears in the list of headings in the page's "contents" column, which is indexed with a higher weight in the full-text index.
2. Split pages that concentrate too many topics of interest, such as "Model field reference", into multiple pages to simplify reading and searching.
3. While we wait to improve the content of the pages or to split pages that are too large, we could find a way to influence the ranking of some pages in the search, for example by penalizing some pages (e.g. release notes) or rewarding others (e.g. the model field reference).
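
The last idea, penalizing or rewarding whole groups of pages, could look roughly like the sketch below. The multipliers, the `SECTION_BOOST` table, the path layout, and the sample rank values are all hypothetical; the actual ranking change is tracked separately and may work differently.

```python
# Hypothetical multipliers keyed by the first segment of a document path.
SECTION_BOOST = {"releases": 0.5, "ref": 1.5}


def adjusted_rank(path, rank):
    """Scale a raw search rank by a per-section multiplier (default 1.0)."""
    section = path.split("/", 1)[0]
    return rank * SECTION_BOOST.get(section, 1.0)


# Invented (path, raw rank) pairs for a single query.
results = [
    ("releases/3.0", 0.30),
    ("ref/models/fields", 0.20),
    ("topics/db/models", 0.22),
]
ranked = sorted(results, key=lambda item: adjusted_rank(*item), reverse=True)

for path, rank in ranked:
    print(path, round(adjusted_rank(path, rank), 3))
```

With these made-up numbers the release notes drop from first to last place, while the reference page moves to the top.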

Another concern with your proposal is the fact of adding an additional model, increasing the complexity. At the moment all the fields needed to build the vector of terms for full-text search reside in a single model
https://github.com/django/djangoproject.com/blob/main/docs/search.py#L42

This leaves open the possibility of automatically generating a GiST index or converting the field into a Generated Field ( #1650 ).

If some of the search data were to reside in separate models, we would not be able to use these automatic PostgreSQL mechanisms, and the whole process would be more complex and slower.

Ultimately I think the problem you raise is real, but I believe that the fact that the search for the term "TextChoices" returns only 7 results, with the page you consider most relevant in third place, is a reflection of the documentation itself. I believe it is a warning sign that the term has been given little importance in general, and especially on that page. We should work to improve the documentation, rather than manually rigging the search results.

For example, if you search for "OneToOneField" you will see that the first page the search proposes is exactly the one you expect, but only because the documentation does a good job of giving the term the right importance.
https://docs.djangoproject.com/en/5.1/search/?q=OneToOneField

The search applied the same search logic, but the result was different because the indexed data for the two terms were decidedly different.

Sorry for the length of the answer, but I wanted to make clear how the search works under the hood. I am very happy that you are also interested in this functionality, because many people have complained about the results, and I hope that together we can make it better.

jacklinke · 2024-10-08T16:52:07Z

Paul, thanks for the response!

Sometimes I get a bit too excited to throw a technical solution at something when another approach might be better to start with. Working through this proposal taught me a great deal about the codebase for the site and how everything works together, so even if it is a bit misguided based on the excellent points you brought up, I am glad I did it.

Regarding your suggested solutions:

1. That's probably a great approach for this particular case.
2. Good idea! The page you mention, in particular, is exceedingly long. For instance, it is interesting that the Model field reference page has a pretty detailed section on the relationship fields, but then there are separate pages (1 2 3 4) that also go into detail on these same fields. Only two of those are linked to from the Model field reference page. **
3. It looks like this is being addressed in "Docs search: tweak results ranking so release notes have lower priority" (#1628). I know you know this, but in case anyone else is following the discussion, I figured I should link it.

> Another concern with your proposal is the fact of adding an additional model, increasing the complexity. At the moment all the fields needed to build the vector of terms for full-text search reside in a single model

My proposal adds another model, but it has no direct interaction with the search vector. The enhancement data is pushed into the Document model's metadata as another key, so that the search vector is still only considering a single model (Document) to perform its work.

> We should work to improve the documentation, rather than manually rigging the search results.

Great point 😄

Maybe a useful task (though one I'd have to really think about how to approach) would be to identify those terms commonly used in Django (classes, methods, arguments, etc) that are not well-represented in either the slug or table of contents of pages within Django's docs. Then we would have a starting point of what terms we could boost naturally by improving the documentation to ensure they are properly represented. Further, identifying which of these are stop words would aid with addressing #1097. If I can find a good approach to this, would this be helpful?
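
One possible starting point for that task is sketched below. It assumes each document is available as a dict with "slug", "toc" and "body" strings (plain text, not the real metadata format) and only looks for CamelCase identifiers; the function name, the regular expression, and the sample data are illustrative guesses rather than an agreed approach.

```python
import re
from collections import Counter

# CamelCase identifiers such as TextChoices or OneToOneField.
CAMEL_CASE = re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")


def underrepresented_terms(documents, min_count=2):
    """Return (term, count) pairs for CamelCase terms that occur in bodies
    at least `min_count` times but in no slug or table of contents."""
    body_counts = Counter()
    prominent = set()
    for doc in documents:
        body_counts.update(CAMEL_CASE.findall(doc["body"]))
        # Slugs are lowercase, so compare everything case-insensitively.
        prominent.update(re.findall(r"\w+", f'{doc["slug"]} {doc["toc"]}'.lower()))
    candidates = [
        (term, count)
        for term, count in body_counts.items()
        if count >= min_count and term.lower() not in prominent
    ]
    # Most frequent first, then alphabetical for a stable order.
    return sorted(candidates, key=lambda pair: (-pair[1], pair[0]))


# Invented sample data.
sample = [
    {
        "slug": "fields",
        "toc": "Field options Field types OneToOneField ForeignKey",
        "body": "OneToOneField ForeignKey models.TextChoices TextChoices IntegerChoices",
    },
    {
        "slug": "3.0",
        "toc": "Enumerations for model field choices",
        "body": "TextChoices IntegerChoices",
    },
]
report = underrepresented_terms(sample)
print(report)
```

A real version would need to handle lowercase names (methods, arguments, settings) and the stop-word question raised in #1097, which this sketch ignores.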

Also, I think during the conversations at DjangoCon Sprints the group brought up a desire to ask the ops team for a list of search terms used on the docs search over a period of time. Do you recall whether we got an answer on 1) whether that is feasible, and 2) when we could get a chunk of this data? Or has the ops team been formally asked for this yet?

** Interestingly, while searching for the other relationship field pages to link to, I found that when searching for ForeignKey, the second-most relevant page ("Many-to-one relationships") is at the bottom of page 3 of the results!
