[spelunker]: Misc changes from holiday season (#519)

* Use larger chatModel for queryMaker; it definitely works better
* Fix subtle bug in purging (too many distinct types resolve to
'string', so the type checker didn't catch this!)
* Root out the last mentions of topics and goals
* Make it easier to change index names; (topics, goals) -> (tags,
synonyms).
* Tweak presentation of TF*IDF (no change in the formula)
* Add description of hit and chunk scoring to design.md
* Hoist TF calculation out of inner loop

---------

Co-authored-by: robgruen <[email protected]>
gvanrossum-ms and robgruen authored Jan 7, 2025
1 parent 7f4106e commit 99cc0c3
Showing 7 changed files with 132 additions and 93 deletions.
32 changes: 28 additions & 4 deletions ts/examples/spelunker/design.md
@@ -38,7 +38,7 @@ A user query is handled using the following steps:

### High level

- Is it worth pursuing this further?
- Is it worth pursuing this further? (GitHub Copilot does a better job summarizing a project.)
- How to integrate it as an agent with shell/cli?
Especially since the model needs access to conversation history, and the current assumption is that you focus on spelunking exclusively until you say you are (temporarily) done with it.
Does the cli/shell have UI features for that?
@@ -54,7 +54,7 @@ A user query is handled using the following steps:
### Import process open questions

- Should we give the LLM more guidance as to how to generate the best keywords, topics etc.?
- Do we need all five indexes? Or could we do with fewer, e.g. just **summaries** and **topics**?
- Do we need all five indexes? Or could we do with fewer, e.g. just **summaries** and **topics**? Or **summaries** and **relationships**?
- Can we get it to produce better summaries and topics (etc.) through different prompting?
- What are the optimal parameters for splitting long files?
- Can we tweak the splitting of large files to make the split files more cohesive?
@@ -76,6 +76,30 @@ A user query is handled using the following steps:

## Details of the current processes

E.g. my TF\*IDF variant, etc.
### Scoring hits and chunks

This is TODO. For now just see the code.
- When scoring responses to a nearest neighbors query, the relevance score of each response is a number between -1 and 1 giving the "cosine similarity".
(Which, given that all vectors are normalized already, is just the dot product of the query string's embedding and each potential match's embedding.)
We sort all responses by relevance score and keep the top maxHits responses, calling them "hits". (Possibly also applying minScore, which defaults to 0.)
Usually maxHits = 10; it can be overridden per user query by a setting and/or per index by the LLM in step 1.
Each hit includes a list of chunk IDs that produced its key (e.g. all chunks whose topic was "database management").

- When computing the score of a chunk relative to a query result (consisting of multiple hits), we use TF\*IDF (see the sketch after this list).

- We keep a mapping from chunk IDs to TF\*IDF scores. Initially each chunk's score is 0.
- For each index, for each hit, we compute the TF\*IDF score for each chunk referenced by the hit.
- The TF\*IDF score for the chunk is then added to the previous cumulative score for that chunk in the mapping mentioned above.
- TF (Term Frequency) is taken to be the hit's relevance score.
Example: if "database management" scored 0.832 against the actual query, TF = 0.832 for that hit.
- IDF (Inverse Document Frequency) is computed as 1 + log(totalNumChunks / (1 + hitChunks)). (Using the natural logarithm.)
Here totalNumChunks is the total number of chunks indexed, and hitChunks is the number of chunks mentioned by this hit.
Reference: "inverse document frequency smooth" in this [table in Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency).
Example: If there are 250 chunks in the database, and "database management" is mentioned by 5 chunks, IDF = 1 + log(250 / 6), i.e. approximately 4.73.

- After processing all hits for all indexes, we end up with a cumulative TF\*IDF score for each chunk that appeared at least once in a hit.
We sort these by score and keep the maxChunks highest-scoring chunks to send to the LLM in step 4.
Currently maxChunks is fixed at 30; we could experiment with this value (and with maxHits).
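
As a rough illustration, here is a minimal TypeScript sketch of the two steps above. The `Hit` shape and the helper names are simplified for this document and are not the project's actual types; the chunk-scoring part of the real code lives in `runIndexQueries()` in `queryInterface.ts`.

```ts
// Illustrative sketch only; simplified shapes, not the project's actual types.

interface Hit {
    score: number;       // relevance: cosine similarity against the query embedding
    sourceIds: string[]; // IDs of the chunks that produced this key
}

// Step 1: keep the top responses to a nearest-neighbors query as "hits".
function selectHits(responses: Hit[], maxHits = 10, minScore = 0): Hit[] {
    return responses
        .filter((r) => r.score >= minScore)
        .sort((a, b) => b.score - a.score)
        .slice(0, maxHits);
}

// Step 2: accumulate a TF*IDF score per chunk across all hits from all indexes.
function scoreChunks(
    hitsPerIndex: Hit[][],   // one list of hits per index
    totalNumChunks: number,  // total number of chunks indexed
    maxChunks = 30,
): [chunkId: string, score: number][] {
    const chunkScores = new Map<string, number>(); // cumulative TF*IDF per chunk

    for (const hits of hitsPerIndex) {
        for (const hit of hits) {
            const tf = hit.score; // TF = the hit's relevance score
            // IDF ("inverse document frequency smooth"), natural logarithm.
            const idf = 1 + Math.log(totalNumChunks / (1 + hit.sourceIds.length));
            for (const chunkId of hit.sourceIds) {
                // Combine scores by addition.
                chunkScores.set(chunkId, (chunkScores.get(chunkId) ?? 0) + tf * idf);
            }
        }
    }

    // Keep the maxChunks highest-scoring chunks to send to the LLM.
    return [...chunkScores.entries()]
        .sort((a, b) => b[1] - a[1])
        .slice(0, maxChunks);
}
```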

### TODO

The rest is TODO. For now just see the code.
39 changes: 15 additions & 24 deletions ts/examples/spelunker/src/chunkyIndex.ts
@@ -12,12 +12,14 @@ import { createJsonTranslator, TypeChatJsonTranslator } from "typechat";
import { createTypeScriptJsonValidator } from "typechat/ts";
import { AnswerSpecs } from "./makeAnswerSchema.js";

export type IndexType =
| "summaries"
| "keywords"
| "topics"
| "goals"
| "dependencies";
export const IndexNames = [
"summaries",
"keywords",
"tags",
"synonyms",
"dependencies",
];
export type IndexType = (typeof IndexNames)[number];
export type NamedIndex = [IndexType, knowLib.TextIndex<string, ChunkId>];

// A bundle of object stores and indexes etc.
@@ -33,11 +35,7 @@ export class ChunkyIndex {
rootDir!: string;
answerFolder!: ObjectFolder<AnswerSpecs>;
chunkFolder!: ObjectFolder<Chunk>;
summariesIndex!: knowLib.TextIndex<string, ChunkId>;
keywordsIndex!: knowLib.TextIndex<string, ChunkId>;
topicsIndex!: knowLib.TextIndex<string, ChunkId>;
goalsIndex!: knowLib.TextIndex<string, ChunkId>;
dependenciesIndex!: knowLib.TextIndex<string, ChunkId>;
indexes!: Map<IndexType, knowLib.TextIndex<string, ChunkId>>;

private constructor() {
this.chatModel = openai.createChatModelDefault("spelunkerChat");
@@ -52,7 +50,7 @@ export class ChunkyIndex {
1000,
);
this.fileDocumenter = createFileDocumenter(this.chatModel);
this.queryMaker = createQueryMaker(this.miniModel);
this.queryMaker = createQueryMaker(this.chatModel);
this.answerMaker = createAnswerMaker(this.chatModel);
}

@@ -73,11 +71,10 @@ export class ChunkyIndex {
instance.rootDir + "/answers",
{ serializer: (obj) => JSON.stringify(obj, null, 2) },
);
instance.summariesIndex = await makeIndex("summaries");
instance.keywordsIndex = await makeIndex("keywords");
instance.topicsIndex = await makeIndex("topics");
instance.goalsIndex = await makeIndex("goals");
instance.dependenciesIndex = await makeIndex("dependencies");
instance.indexes = new Map();
for (const name of IndexNames) {
instance.indexes.set(name, await makeIndex(name));
}

async function makeIndex(
name: string,
@@ -104,13 +101,7 @@
}

allIndexes(): NamedIndex[] {
return [
["summaries", this.summariesIndex],
["keywords", this.keywordsIndex],
["topics", this.topicsIndex],
["goals", this.goalsIndex],
["dependencies", this.dependenciesIndex],
];
return [...this.indexes.entries()];
}
}

34 changes: 31 additions & 3 deletions ts/examples/spelunker/src/fileDocSchema.ts
@@ -3,14 +3,42 @@

// Extracted information for a chunk of code.
export type ChunkDoc = {
// Optional file identifier or language context for this chunk.
fileName?: string;

lineNumber: number;

name: string; // Function, class or method name (fully qualified)

// Optional list of base classes, for classes.
bases?: string[];

// Optional list of parameter names/types used by this chunk.
// E.g. ["x: list[int]", "y"] # y is untyped
// Take from `__new__` or `__init__` for classes.
parameters?: string[];

// Optional return type or output specification.
// E.g. "dict[str, int]" or "None".
// Don't set for classes.
returnType?: string;

// One paragraph summary of the code chunk starting at that line.
// Concise, informative, don't explain Python or stdlib features.
summary: string; // Can be multiline
summary: string;

// Propose keywords/phrases capturing the chunk's functionality,
// context, and notable traits. Make them concise but descriptive,
// ensuring users can find these points with common queries or synonyms.
keywords?: string[];
topics?: string[];
goals?: string[];

// Optional high-level labels (e.g., "algorithmic", "I/O").
tags?: string[];

// Additional synonyms or related domain concepts.
synonyms?: string[];

// References to other chunks or external files.
dependencies?: string[];
};

2 changes: 1 addition & 1 deletion ts/examples/spelunker/src/fileDocumenter.ts
@@ -46,7 +46,7 @@ export function createFileDocumenter(model: ChatModel): FileDocumenter {
"Method C.foo finds the most twisted anagram for a word.\n" +
"It uses various heuristics to rank a word's twistedness'.\n" +
"```\n" +
"Also fill in the lists of keywords, topics, goals, and dependencies.\n";
"Also fill in the lists of keywords, tags, synonyms, and dependencies.\n";
const result = await fileDocTranslator.translate(request, text);

// Now assign each comment to its chunk.
4 changes: 2 additions & 2 deletions ts/examples/spelunker/src/makeQuerySchema.ts
@@ -15,8 +15,8 @@ export type QuerySpecs = {
// Queries directed to various indexes. Comments describe what's in each index.
summaries?: QuerySpec; // A paragraph describing the code
keywords?: QuerySpec; // Short key words and phrases extracted from the code
topics?: QuerySpec; // Slightly longer phrases relating to the code
goals?: QuerySpec; // What the code is trying to achieve
tags?: QuerySpec; // Optional high-level labels (e.g. "algorithmic", "I/O")
synonyms?: QuerySpec; // Additional synonyms or related domain concepts
dependencies?: QuerySpec; // External dependencies

// If the question can be answered based on chat history and general knowledge.
50 changes: 16 additions & 34 deletions ts/examples/spelunker/src/pythonImporter.ts
@@ -9,7 +9,7 @@ import * as knowLib from "knowledge-processor";
import { asyncArray } from "typeagent";

import * as iapp from "interactive-app";
import { ChunkyIndex } from "./chunkyIndex.js";
import { ChunkyIndex, IndexNames } from "./chunkyIndex.js";
import { ChunkDoc, FileDocumentation } from "./fileDocSchema.js";
import {
Chunk,
@@ -236,40 +236,22 @@ async function embedChunk(
summaries.push(chunkDoc.summary);
}
const combinedSummaries = summaries.join("\n").trimEnd();
if (combinedSummaries) {
await exponentialBackoff(
io,
chunkyIndex.summariesIndex.put,
combinedSummaries,
[chunk.id],
);
}

for (const chunkDoc of chunkDocs) {
await writeToIndex(
io,
chunk.id,
chunkDoc.topics,
chunkyIndex.topicsIndex,
);
await writeToIndex(
io,
chunk.id,
chunkDoc.keywords,
chunkyIndex.keywordsIndex,
);
await writeToIndex(
io,
chunk.id,
chunkDoc.goals,
chunkyIndex.goalsIndex,
);
await writeToIndex(
io,
chunk.id,
chunkDoc.dependencies,
chunkyIndex.dependenciesIndex,
);
for (const indexName of IndexNames) {
let data: string[];
if (indexName == "summaries") {
data = [combinedSummaries];
} else {
data = (chunkDoc as any)[indexName];
}
const index = chunkyIndex.indexes.get(indexName)!;
if (data && index) {
await writeToIndex(io, chunk.id, data, index);
}
}
}

const t1 = Date.now();
if (verbose) {
log(
@@ -284,7 +266,7 @@ async function embedChunk(
async function writeToIndex(
io: iapp.InteractiveIo | undefined,
chunkId: ChunkId,
phrases: string[] | undefined, // List of keywords, topics, etc. in chunk
phrases: string[] | undefined, // List of summaries, keywords, tags, etc. in chunk
index: knowLib.TextIndex<string, ChunkId>,
) {
for (const phrase of phrases ?? []) {
64 changes: 39 additions & 25 deletions ts/examples/spelunker/src/queryInterface.ts
@@ -74,8 +74,8 @@ export async function interactiveQueryLoop(
search,
summaries,
keywords,
topics,
goals,
tags,
synonyms,
dependencies,
files,
purgeFile,
@@ -168,8 +168,9 @@ export async function interactiveQueryLoop(
writeNote(io, "SUMMARY: None");
}
} else {
const docItem: string[] | undefined =
chunkDoc[name];
const docItem: string[] | undefined = (
chunkDoc as any
)[name];
if (docItem?.length) {
writeNote(
io,
@@ -291,32 +292,32 @@ export async function interactiveQueryLoop(
await _reportIndex(args, io, "keywords");
}

function topicsDef(): iapp.CommandMetadata {
function tagsDef(): iapp.CommandMetadata {
return {
description: "Show all recorded topics and their postings.",
description: "Show all recorded tags and their postings.",
options: commonOptions(),
};
}
handlers.topics.metadata = topicsDef();
async function topics(
handlers.tags.metadata = tagsDef();
async function tags(
args: string[] | iapp.NamedArgs,
io: iapp.InteractiveIo,
): Promise<void> {
await _reportIndex(args, io, "topics");
await _reportIndex(args, io, "tags");
}

function goalsDef(): iapp.CommandMetadata {
function synonymsDef(): iapp.CommandMetadata {
return {
description: "Show all recorded goals and their postings.",
description: "Show all recorded synonyms and their postings.",
options: commonOptions(),
};
}
handlers.goals.metadata = goalsDef();
async function goals(
handlers.synonyms.metadata = synonymsDef();
async function synonyms(
args: string[] | iapp.NamedArgs,
io: iapp.InteractiveIo,
): Promise<void> {
await _reportIndex(args, io, "goals");
await _reportIndex(args, io, "synonyms");
}

function dependenciesDef(): iapp.CommandMetadata {
@@ -560,19 +561,32 @@ export async function purgeNormalizedFile(
);

// Step 2: Remove chunk ids from indexes.
const deletions: ChunkId[] = Array.from(toDelete);
const chunkIdsToDelete: ChunkId[] = Array.from(toDelete);
for (const [name, index] of chunkyIndex.allIndexes()) {
let updates = 0;
const affectedValues: string[] = [];
// Collect values from which we need to remove the chunk ids about to be deleted.
for await (const textBlock of index.entries()) {
if (textBlock?.sourceIds?.some((id) => deletions.includes(id))) {
if (
textBlock?.sourceIds?.some((id) =>
chunkIdsToDelete.includes(id),
)
) {
if (verbose) {
writeNote(io, `[Purging ${name} entry ${textBlock.value}]`);
}
await index.remove(textBlock.value, deletions);
updates++;
affectedValues.push(textBlock.value);
}
}
writeNote(io, `[Purged ${updates} ${name}]`); // name is plural, e.g. "keywords".
// Actually update the index (can't modify it while it's being iterated over).
for (const value of affectedValues) {
const id = await index.getId(value);
if (!id) {
writeWarning(io, `[No id for value ${value}]`);
} else {
await index.remove(id, chunkIdsToDelete);
}
}
writeNote(io, `[Purged ${affectedValues.length} ${name}]`); // name is plural, e.g. "keywords".
}

// Step 3: Remove chunks (do this last so if step 2 fails we can try again).
@@ -741,15 +755,15 @@ async function runIndexQueries(

// Update chunk id scores.
for (const hit of hits) {
// IDF only depends on the term.
// Literature suggests setting TF = 1 in this case,
// but the term's relevance score intuitively makes sense.
const tf = hit.score;
// IDF calculation ("inverse document frequency smooth").
const fraction =
totalNumChunks / (1 + (hit.item.sourceIds?.length ?? 0));
const idf = 1 + Math.log(fraction);
const newScore = tf * idf;
for (const chunkId of hit.item.sourceIds ?? []) {
// Binary TF is 1 for all chunks in the list.
// As a tweak, we multiply by the term's relevance score.
const tf = hit.score;
const newScore = tf * idf;
const oldScoredItem = chunkIdScores.get(chunkId);
const oldScore = oldScoredItem?.score ?? 0;
// Combine scores by addition. (Alternatives: max, possibly others.)
