[spelunker]: Misc changes from holiday season (#519)

* Use larger chatModel for queryMaker; it definitely works better
* Fix subtle bug in purging (too many distinct types resolve to
'string', so the type checker didn't catch this!)
* Root out the last mentions of topics and goals
* Make it easier to change index names; (topics, goals) -> (tags,
synonyms).
* Tweak presentation of TF*IDF (no change in the formula)
* Add description of hit and chunk scoring to design.md
* Hoist TF calculation out of inner loop

---------

Co-authored-by: robgruen <[email protected]>
gvanrossum-ms and robgruen authored Jan 7, 2025
1 parent 7f4106e commit 99cc0c3
Showing 7 changed files with 132 additions and 93 deletions.
32 changes: 28 additions & 4 deletions ts/examples/spelunker/design.md
@@ -38,7 +38,7 @@ A user query is handled using the following steps:

### High level

- Is it worth pursuing this further?
- Is it worth pursuing this further? (GitHub Copilot does a better job summarizing a project.)
- How to integrate it as an agent with shell/cli?
Especially since the model needs access to conversation history, and the current assumption is that you focus on spelunking exclusively until you say you are (temporarily) done with it.
Does the cli/shell have UI features for that?
@@ -54,7 +54,7 @@ A user query is handled using the following steps:
### Import process open questions

- Should we give the LLM more guidance as to how to generate the best keywords, topics etc.?
- Do we need all five indexes? Or could we do with fewer, e.g. just **summaries** and **topics**?
- Do we need all five indexes? Or could we do with fewer, e.g. just **summaries** and **topics**? Or **summaries** and **relationships**?
- Can we get it to produce better summaries and topics (etc.) through different prompting?
- What are the optimal parameters for splitting long files?
- Can we tweak the splitting of large files to make the split files more cohesive?
@@ -76,6 +76,30 @@ A user query is handled using the following steps:

## Details of the current processes

E.g. my TF\*IDF variant, etc.
### Scoring hits and chunks

This is TODO. For now just see the code.
- When scoring responses to a nearest neighbors query, the relevance score of each response is a number between -1 and 1 giving the "cosine similarity".
(Which, given that all vectors are normalized already, is just the dot product of the query string's embedding and each potential match's embedding.)
We sort all responses by relevance score and keep the top maxHits responses, calling them "hits". (Possibly also applying minScore, which defaults to 0.)
Usually maxHits = 10; it can be overridden per user query by a setting and/or per index by the LLM in step 1.
Each hit includes a list of chunk IDs that produced its key (e.g. all chunks whose topic was "database management").

- When computing the score of a chunk relative to a query result (consisting of multiple hits), we use TF\*IDF (see the sketch after this list).

- We keep a mapping from chunk IDs to TF\*IDF scores. Initially each chunk's score is 0.
- For each index, for each hit, we compute the TF\*IDF score for each chunk referenced by the hit.
- The TF\*IDF score for the chunk is then added to the previous cumulative score for that chunk in the mapping mentioned above.
- TF (Term Frequency) is taken to be the hit's relevance score.
Example: if "database management" scored 0.832 against the actual query, TF = 0.832 for that hit.
- IDF (Inverse Document Frequency) is computed as 1 + log(totalNumChunks / (1 + hitChunks)). (Using the natural logarithm.)
Here totalNumChunks is the total number of chunks indexed, and hitChunks is the number of chunks mentioned by this hit.
Reference: "inverse document frequency smooth" in this [table in Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency).
Example: If there are 250 chunks in the database, and "database management" is mentioned by 5 chunks, IDF = 1 + log(250 / 6), i.e. approximately 4.73.

- After processing all hits for all indexes, we end up with a cumulative TF\*IDF score for each chunk that appeared at least once in a hit.
We sort these by score and keep the maxChunks highest-scoring chunks to send to the LLM in step 4.
Currently maxChunks is fixed at 30; we could experiment with this value (and with maxHits).
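
As a rough illustration, here is a minimal TypeScript sketch of the two steps above. The `Hit` shape and the helper names are simplified for this document and are not the project's actual types; the chunk-scoring part of the real code lives in `runIndexQueries()` in `queryInterface.ts`.

```ts
// Illustrative sketch only; simplified shapes, not the project's actual types.

interface Hit {
    score: number;       // relevance: cosine similarity against the query embedding
    sourceIds: string[]; // IDs of the chunks that produced this key
}

// Step 1: keep the top responses to a nearest-neighbors query as "hits".
function selectHits(responses: Hit[], maxHits = 10, minScore = 0): Hit[] {
    return responses
        .filter((r) => r.score >= minScore)
        .sort((a, b) => b.score - a.score)
        .slice(0, maxHits);
}

// Step 2: accumulate a TF*IDF score per chunk across all hits from all indexes.
function scoreChunks(
    hitsPerIndex: Hit[][],   // one list of hits per index
    totalNumChunks: number,  // total number of chunks indexed
    maxChunks = 30,
): [chunkId: string, score: number][] {
    const chunkScores = new Map<string, number>(); // cumulative TF*IDF per chunk

    for (const hits of hitsPerIndex) {
        for (const hit of hits) {
            const tf = hit.score; // TF = the hit's relevance score
            // IDF ("inverse document frequency smooth"), natural logarithm.
            const idf = 1 + Math.log(totalNumChunks / (1 + hit.sourceIds.length));
            for (const chunkId of hit.sourceIds) {
                // Combine scores by addition.
                chunkScores.set(chunkId, (chunkScores.get(chunkId) ?? 0) + tf * idf);
            }
        }
    }

    // Keep the maxChunks highest-scoring chunks to send to the LLM.
    return [...chunkScores.entries()]
        .sort((a, b) => b[1] - a[1])
        .slice(0, maxChunks);
}
```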

### TODO

The rest is TODO. For now just see the code.
39 changes: 15 additions & 24 deletions ts/examples/spelunker/src/chunkyIndex.ts
@@ -12,12 +12,14 @@ import { createJsonTranslator, TypeChatJsonTranslator } from "typechat";
import { createTypeScriptJsonValidator } from "typechat/ts";
import { AnswerSpecs } from "./makeAnswerSchema.js";

export type IndexType =
| "summaries"
| "keywords"
| "topics"
| "goals"
| "dependencies";
export const IndexNames = [
"summaries",
"keywords",
"tags",
"synonyms",
"dependencies",
];
export type IndexType = (typeof IndexNames)[number];
export type NamedIndex = [IndexType, knowLib.TextIndex<string, ChunkId>];

// A bundle of object stores and indexes etc.
@@ -33,11 +35,7 @@ export class ChunkyIndex {
rootDir!: string;
answerFolder!: ObjectFolder<AnswerSpecs>;
chunkFolder!: ObjectFolder<Chunk>;
summariesIndex!: knowLib.TextIndex<string, ChunkId>;
keywordsIndex!: knowLib.TextIndex<string, ChunkId>;
topicsIndex!: knowLib.TextIndex<string, ChunkId>;
goalsIndex!: knowLib.TextIndex<string, ChunkId>;
dependenciesIndex!: knowLib.TextIndex<string, ChunkId>;
indexes!: Map<IndexType, knowLib.TextIndex<string, ChunkId>>;

private constructor() {
this.chatModel = openai.createChatModelDefault("spelunkerChat");
@@ -52,7 +50,7 @@ export class ChunkyIndex {
1000,
);
this.fileDocumenter = createFileDocumenter(this.chatModel);
this.queryMaker = createQueryMaker(this.miniModel);
this.queryMaker = createQueryMaker(this.chatModel);
this.answerMaker = createAnswerMaker(this.chatModel);
}

@@ -73,11 +71,10 @@ export class ChunkyIndex {
instance.rootDir + "/answers",
{ serializer: (obj) => JSON.stringify(obj, null, 2) },
);
instance.summariesIndex = await makeIndex("summaries");
instance.keywordsIndex = await makeIndex("keywords");
instance.topicsIndex = await makeIndex("topics");
instance.goalsIndex = await makeIndex("goals");
instance.dependenciesIndex = await makeIndex("dependencies");
instance.indexes = new Map();
for (const name of IndexNames) {
instance.indexes.set(name, await makeIndex(name));
}

async function makeIndex(
name: string,
@@ -104,13 +101,7 @@
}

allIndexes(): NamedIndex[] {
return [
["summaries", this.summariesIndex],
["keywords", this.keywordsIndex],
["topics", this.topicsIndex],
["goals", this.goalsIndex],
["dependencies", this.dependenciesIndex],
];
return [...this.indexes.entries()];
}
}

34 changes: 31 additions & 3 deletions ts/examples/spelunker/src/fileDocSchema.ts
@@ -3,14 +3,42 @@

// Extracted information for a chunk of code.
export type ChunkDoc = {
// Optional file identifier or language context for this chunk.
fileName?: string;

lineNumber: number;

name: string; // Function, class or method name (fully qualified)

// Optional list of base classes, for classes.
bases?: string[];

// Optional list of parameter names/types used by this chunk.
// E.g. ["x: list[int]", "y"] # y is untyped
// Take from `__new__` or `__init__` for classes.
parameters?: string[];

// Optional return type or output specification.
// E.g. "dict[str, int]" or "None".
// Don't set for classes.
returnType?: string;

// One paragraph summary of the code chunk starting at that line.
// Concise, informative, don't explain Python or stdlib features.
summary: string; // Can be multiline
summary: string;

// Propose keywords/phrases capturing the chunk's functionality,
// context, and notable traits. Make them concise but descriptive,
// ensuring users can find these points with common queries or synonyms.
keywords?: string[];
topics?: string[];
goals?: string[];

// Optional high-level labels (e.g., "algorithmic", "I/O").
tags?: string[];

// Additional synonyms or related domain concepts.
synonyms?: string[];

// References to other chunks or external files.
dependencies?: string[];
};

2 changes: 1 addition & 1 deletion ts/examples/spelunker/src/fileDocumenter.ts
@@ -46,7 +46,7 @@ export function createFileDocumenter(model: ChatModel): FileDocumenter {
"Method C.foo finds the most twisted anagram for a word.\n" +
"It uses various heuristics to rank a word's twistedness'.\n" +
"```\n" +
"Also fill in the lists of keywords, topics, goals, and dependencies.\n";
"Also fill in the lists of keywords, tags, synonyms, and dependencies.\n";
const result = await fileDocTranslator.translate(request, text);

// Now assign each comment to its chunk.
4 changes: 2 additions & 2 deletions ts/examples/spelunker/src/makeQuerySchema.ts
@@ -15,8 +15,8 @@ export type QuerySpecs = {
// Queries directed to various indexes. Comments describe what's in each index.
summaries?: QuerySpec; // A paragraph describing the code
keywords?: QuerySpec; // Short key words and phrases extracted from the code
topics?: QuerySpec; // Slightly longer phrases relating to the code
goals?: QuerySpec; // What the code is trying to achieve
tags?: QuerySpec; // Optional high-level labels (e.g. "algorithmic", "I/O")
synonyms?: QuerySpec; // Additional synonyms or related domain concepts
dependencies?: QuerySpec; // External dependencies

// If the question can be answered based on chat history and general knowledge.
50 changes: 16 additions & 34 deletions ts/examples/spelunker/src/pythonImporter.ts
@@ -9,7 +9,7 @@ import * as knowLib from "knowledge-processor";
import { asyncArray } from "typeagent";

import * as iapp from "interactive-app";
import { ChunkyIndex } from "./chunkyIndex.js";
import { ChunkyIndex, IndexNames } from "./chunkyIndex.js";
import { ChunkDoc, FileDocumentation } from "./fileDocSchema.js";
import {
Chunk,
@@ -236,40 +236,22 @@ async function embedChunk(
summaries.push(chunkDoc.summary);
}
const combinedSummaries = summaries.join("\n").trimEnd();
if (combinedSummaries) {
await exponentialBackoff(
io,
chunkyIndex.summariesIndex.put,
combinedSummaries,
[chunk.id],
);
}

for (const chunkDoc of chunkDocs) {
await writeToIndex(
io,
chunk.id,
chunkDoc.topics,
chunkyIndex.topicsIndex,
);
await writeToIndex(
io,
chunk.id,
chunkDoc.keywords,
chunkyIndex.keywordsIndex,
);
await writeToIndex(
io,
chunk.id,
chunkDoc.goals,
chunkyIndex.goalsIndex,
);
await writeToIndex(
io,
chunk.id,
chunkDoc.dependencies,
chunkyIndex.dependenciesIndex,
);
for (const indexName of IndexNames) {
let data: string[];
if (indexName == "summaries") {
data = [combinedSummaries];
} else {
data = (chunkDoc as any)[indexName];
}
const index = chunkyIndex.indexes.get(indexName)!;
if (data && index) {
await writeToIndex(io, chunk.id, data, index);
}
}
}

const t1 = Date.now();
if (verbose) {
log(
@@ -284,7 +266,7 @@ async function embedChunk(
async function writeToIndex(
io: iapp.InteractiveIo | undefined,
chunkId: ChunkId,
phrases: string[] | undefined, // List of keywords, topics, etc. in chunk
phrases: string[] | undefined, // List of summaries, keywords, tags, etc. in chunk
index: knowLib.TextIndex<string, ChunkId>,
) {
for (const phrase of phrases ?? []) {
64 changes: 39 additions & 25 deletions ts/examples/spelunker/src/queryInterface.ts
@@ -74,8 +74,8 @@ export async function interactiveQueryLoop(
search,
summaries,
keywords,
topics,
goals,
tags,
synonyms,
dependencies,
files,
purgeFile,
@@ -168,8 +168,9 @@ export async function interactiveQueryLoop(
writeNote(io, "SUMMARY: None");
}
} else {
const docItem: string[] | undefined =
chunkDoc[name];
const docItem: string[] | undefined = (
chunkDoc as any
)[name];
if (docItem?.length) {
writeNote(
io,
@@ -291,32 +292,32 @@ export async function interactiveQueryLoop(
await _reportIndex(args, io, "keywords");
}

function topicsDef(): iapp.CommandMetadata {
function tagsDef(): iapp.CommandMetadata {
return {
description: "Show all recorded topics and their postings.",
description: "Show all recorded tags and their postings.",
options: commonOptions(),
};
}
handlers.topics.metadata = topicsDef();
async function topics(
handlers.tags.metadata = tagsDef();
async function tags(
args: string[] | iapp.NamedArgs,
io: iapp.InteractiveIo,
): Promise<void> {
await _reportIndex(args, io, "topics");
await _reportIndex(args, io, "tags");
}

function goalsDef(): iapp.CommandMetadata {
function synonymsDef(): iapp.CommandMetadata {
return {
description: "Show all recorded goals and their postings.",
description: "Show all recorded synonyms and their postings.",
options: commonOptions(),
};
}
handlers.goals.metadata = goalsDef();
async function goals(
handlers.synonyms.metadata = synonymsDef();
async function synonyms(
args: string[] | iapp.NamedArgs,
io: iapp.InteractiveIo,
): Promise<void> {
await _reportIndex(args, io, "goals");
await _reportIndex(args, io, "synonyms");
}

function dependenciesDef(): iapp.CommandMetadata {
@@ -560,19 +561,32 @@ export async function purgeNormalizedFile(
);

// Step 2: Remove chunk ids from indexes.
const deletions: ChunkId[] = Array.from(toDelete);
const chunkIdsToDelete: ChunkId[] = Array.from(toDelete);
for (const [name, index] of chunkyIndex.allIndexes()) {
let updates = 0;
const affectedValues: string[] = [];
// Collect values from which we need to remove the chunk ids about to be deleted.
for await (const textBlock of index.entries()) {
if (textBlock?.sourceIds?.some((id) => deletions.includes(id))) {
if (
textBlock?.sourceIds?.some((id) =>
chunkIdsToDelete.includes(id),
)
) {
if (verbose) {
writeNote(io, `[Purging ${name} entry ${textBlock.value}]`);
}
await index.remove(textBlock.value, deletions);
updates++;
affectedValues.push(textBlock.value);
}
}
writeNote(io, `[Purged ${updates} ${name}]`); // name is plural, e.g. "keywords".
// Actually update the index (can't modify it while it's being iterated over).
for (const value of affectedValues) {
const id = await index.getId(value);
if (!id) {
writeWarning(io, `[No id for value ${value}]`);
} else {
await index.remove(id, chunkIdsToDelete);
}
}
writeNote(io, `[Purged ${affectedValues.length} ${name}]`); // name is plural, e.g. "keywords".
}

// Step 3: Remove chunks (do this last so if step 2 fails we can try again).
@@ -741,15 +755,15 @@ async function runIndexQueries(

// Update chunk id scores.
for (const hit of hits) {
// IDF only depends on the term.
// Literature suggests setting TF = 1 in this case,
// but the term's relevance score intuitively makes sense.
const tf = hit.score;
// IDF calculation ("inverse document frequency smooth").
const fraction =
totalNumChunks / (1 + (hit.item.sourceIds?.length ?? 0));
const idf = 1 + Math.log(fraction);
const newScore = tf * idf;
for (const chunkId of hit.item.sourceIds ?? []) {
// Binary TF is 1 for all chunks in the list.
// As a tweak, we multiply by the term's relevance score.
const tf = hit.score;
const newScore = tf * idf;
const oldScoredItem = chunkIdScores.get(chunkId);
const oldScore = oldScoredItem?.score ?? 0;
// Combine scores by addition. (Alternatives: max, possibly others.)
