Potential method for enhancing docs search results for important terms #1649
Comments
Hi Jack, and thanks for the great analysis of the problem and for describing this issue in such detail. I have some doubts about the solution you propose, and I will try to lay out some of them below.
The documentation is quite extensive, and the work of adding extra words to those extracted automatically seems burdensome and, above all, arbitrary. In your example you cite "TextChoices", a term that appears only 3 times on the page and only in the code examples. That same page contains many other terms that could interest others in other searches (e.g. "OneToOneField"), so the relevance of "TextChoices" is lower than that of terms that appear as paragraph titles, for example. Paradoxically, "TextChoices" is more relevant in the release notes, because the text surrounding it contains fewer terms and it does not appear only as code.
If we analyze the model field reference page, the relevance given to "TextChoices" is indeed very low, because it is used only in the example code snippets without being introduced in the descriptive part. Furthermore, the page is very long and contains many highly relevant words, which penalizes every term on it. The solutions that I hypothesize could be:
Another concern with your proposal is that it adds an additional model, which increases complexity. At the moment, all the fields needed to build the vector of terms for full-text search reside in a single model. This leaves open the possibility of automatically generating a GiST index or converting the field into a Generated Field (#1650). If some of the search data were to reside in separate models, we would not be able to use these automatic PostgreSQL mechanisms, and the whole process would be more complex and slower.
Ultimately, I think the problem you raise is real, but I believe that the fact that a search for "TextChoices" returns only 7 results, of which the third is the one you think should be the most relevant, is a reflection of the documentation itself. I believe it is a warning sign that the term you are looking for has been given little importance in general, and especially on that page. We should work to improve the documentation rather than manually rigging the search results. For example, if you search for "OneToOneField", you will see that the first page proposed by the search is exactly what you expect, but only because the documentation did a good job of giving the term the right importance. The search applied the same logic, but the result was different because the indexed data for the two terms were decidedly different.
Sorry for the length of the answer, but I wanted to make clear how things work under the hood of the search. I am very happy that you are also interested in this functionality, because many have indeed complained about the results, and I hope that together we can make it better.
Paul, thanks for the response! Sometimes I get a bit too excited to throw a technical solution at something when another approach might be better to start with. Working through this proposal taught me a great deal about the codebase for the site and how everything works together, so even if it is a bit misguided based on the excellent points you brought up, I am glad I did it. Regarding your hypothesis:
My proposal adds another model, but it has no direct interaction with the search vector. The enhancement data is pushed into the Document model's
Great point 😄 Maybe a useful task (though one I'd have to really think about how to approach) would be to identify those terms commonly used in Django (classes, methods, arguments, etc.) that are not well-represented in either the slug or table of contents of pages within Django's docs. Then we would have a starting point of what terms we could boost naturally by improving the documentation to ensure they are properly represented. Further, identifying which of these are stop words would aid with addressing #1097. If I can find a good approach to this, would this be helpful?
Also, I think during the conversations at DjangoCon Sprints the group brought up a desire to ask the ops team for a list of search terms used on the docs search over a period of time. Do you recall whether we got an answer on 1) whether that is feasible, and 2) when we could get a chunk of this data? Or has the ops team been formally asked for this yet?
** Interestingly, while searching for the other relationship field pages to link to, I found that when searching for
Problem
Some terms are important for users to find in the Django documentation, but they are not in the title, slug, or table of contents for the document. As a result, these terms are not weighted as highly in the search results as they should be.
Example: Terms like "TextChoices" and "IntegerChoices" are important, but they are not in the title, slug, or table of contents for the Model field reference page. They are only mentioned in the body of the document.
Background
The current full-text search implementation for the docs uses the `Document` model's `metadata` JSONField. This field is populated by the Sphinx-based documentation for the Django project.
Upon each new release, the `update_docs` management command runs `sphinx-build json` (among many other things) and invokes the `DocumentRelease` model's `sync_to_db` method to import the JSON data into the `Document` model.
The `metadata` field contains various keys like `body`, `toc`, etc. But there is no provision for adding additional text to the `metadata` field and to the search vector to account for important terms that are in the `body` of the `Document` but not in the higher-weighted `title`, `slug`, or `toc` (table of contents).
We cannot simply add the additional text to a new key in the `metadata` field because the `Document` instances are overwritten with each new release. We need a way to store the additional text separately and add it to the `metadata` field when the `Document` instances are created or updated.
A Potential Solution
I am proposing adding a `DocumentEnhancement` model to store additional text that can be added to the `metadata` field of the `Document` model. This additional text can be weighted at the same level as the `toc` field ("B") to allow for additional text to be added into the search vector when important terms are not in the title, slug, or table of contents.
- `DocumentEnhancement` model has a one-to-one relationship with `Document`.
- `DocumentRelease.sync_to_db` method is updated to read the additional text from the `DocumentEnhancement` model and add it to the `metadata` field.
- `DOCUMENT_SEARCH_VECTOR` in `docs/search.py` is updated to include the `enhancements` key in the search vector.
- `migrate_enhancements` management command can be run to migrate `DocumentEnhancement` data to the `metadata` field of the `Document` model instances.
- `DocumentEnhancement`.
Implementation
Note: These are not yet tested. Just jotted down this evening for input from the community.
docs/models.py
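The code for this file is not preserved in this copy of the issue. As a rough, framework-free illustration of the idea (not the proposed Django model itself), the sketch below uses plain dataclasses in place of Django models. The field names `path`, `document_path`, and `enhancement_text`, the function `sync_documents`, and the choice to key enhancements by document path are all hypothetical. Only `Document`, `DocumentEnhancement`, `metadata`, and the `enhancements` key come from the proposal.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    path: str
    title: str
    metadata: dict = field(default_factory=dict)


@dataclass
class DocumentEnhancement:
    # Hypothetical: stored apart from Document and keyed by path,
    # so the text is still available after documents are rebuilt.
    document_path: str
    enhancement_text: str


def sync_documents(built_docs, enhancements):
    """Rebuild every Document from a fresh build, re-applying enhancements."""
    by_path = {e.document_path: e for e in enhancements}
    documents = []
    for raw in built_docs:
        metadata = dict(raw["metadata"])  # copy, the build output stays untouched
        enhancement = by_path.get(raw["path"])
        if enhancement is not None:
            metadata["enhancements"] = enhancement.enhancement_text
        documents.append(Document(raw["path"], raw["title"], metadata))
    return documents
```

Because the enhancement text lives outside `Document`, calling `sync_documents` again with a new build produces fresh documents that still carry the same `enhancements` value.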
docs/search.py
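The code for this file is likewise not preserved here. To show why adding an `enhancements` key at weight "B" would change the ranking, here is a toy scoring sketch in plain Python. It is not PostgreSQL's `ts_rank` and not the site's `DOCUMENT_SEARCH_VECTOR`. The numeric values are the default weights PostgreSQL assigns to the labels A through D. The labels given to `title`, `slug`, and `body` are assumptions, and only `toc` at "B" is stated in the proposal. `SEARCH_FIELDS` and `toy_score` are hypothetical names.

```python
# Default numeric weights PostgreSQL uses for the labels A, B, C, D.
WEIGHTS = {"A": 1.0, "B": 0.4, "C": 0.2, "D": 0.1}

# (key, weight label). Only toc="B" comes from the proposal;
# the other labels are assumed for illustration.
SEARCH_FIELDS = [
    ("title", "A"),
    ("slug", "A"),
    ("toc", "B"),
    ("enhancements", "B"),  # the proposed new key
    ("body", "D"),
]


def toy_score(metadata, term):
    """Sum weight * occurrences of term over every weighted field."""
    term = term.lower()
    score = 0.0
    for key, label in SEARCH_FIELDS:
        words = metadata.get(key, "").lower().split()
        score += WEIGHTS[label] * words.count(term)
    return score
```

In this toy model, a page that mentions "TextChoices" only in its body scores lower than the same page with the term also listed under `enhancements`, which is the effect the proposal is after.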
docs/management/commands/migrate_enhancements.py
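The command's code is also missing from this copy. Its core step, as described above, is to copy stored enhancement text into the `metadata` of existing documents. A minimal sketch of that step in plain Python follows, with dictionaries standing in for the database tables. The function signature and the path-keyed dictionaries are assumptions, not the actual management command.

```python
def migrate_enhancements(documents, enhancements):
    """Copy enhancement text into document metadata.

    documents: maps a document path to its metadata dict (updated in place).
    enhancements: maps a document path to its enhancement text.
    Returns the number of documents whose metadata changed.
    """
    updated = 0
    for path, text in enhancements.items():
        metadata = documents.get(path)
        if metadata is None:
            continue  # no matching document, nothing to update
        if metadata.get("enhancements") != text:
            metadata["enhancements"] = text
            updated += 1
    return updated
```

Comparing before writing keeps the step idempotent: running it a second time with unchanged data reports zero updates.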
docs/admin.py
Not required, but maybe useful for managing enhancements in the admin interface.