[RFC] Partial loading with FAISS engine. #2401

Open
0ctopus13prime opened this issue Jan 17, 2025 · 16 comments
Labels: Enhancements, memory management, memory-reduction, search-improvements, v3.0.0

0ctopus13prime commented Jan 17, 2025

Partial Loading Low Level Design

1. Objective

OpenSearch KNN plugin supports three engines: NMSLIB, FAISS, and Lucene.
The first two native engines, NMSLIB and FAISS, require all vector-related data structures (such as HNSW graphs) to be loaded into memory for search operation.
For large workloads, this memory cost can quickly become substantial if quantization techniques are not applied.
Therefore, 'Partial Loading' must be offered as an option in native engines to control the available memory for KNN search.
The objective of partial loading is twofold:

  1. To allow users to control the maximum memory available for KNN searching.
  2. To enable native engines to partially load only the necessary data within the constraint.

Looking closely, an HNSW graph mainly consists of the following:

  1. Full-precision 32-bit vectors.
  2. Graph representations.
  3. Metadata such as dimensions, number of vectors, space type, headers, etc.

Of the above items, main memory is mostly consumed by the full-precision vectors: 4 bytes * the number of vectors * the number of dimensions. For example, 1M 768-dimensional vectors take roughly 4 × 1,000,000 × 768 bytes ≈ 3 GB.
FAISS stores these vectors in a Flat index; during serialization and deserialization, the vectors are written to and read from the file and placed in main memory, which increases memory consumption.

GH Issue : #1693

2. Scope

The partial loading LLD described in this issue is specifically tailored for FAISS HNSW. However, the core concept can be easily generalized and extended to support other types of vector search.

3. PARTIAL_LOADING_MODE in Mappings

I propose adding a PARTIAL_LOADING_MODE setting to the mappings (e.g., at the field level). To deliver an "out-of-the-box" experience for users, we can begin with straightforward options and progressively introduce more advanced configurations.

Proposed Modes:

  • (Default) BEST_PERFORMANCE (Or simply DISABLED):
    Partial loading is disabled by default, ensuring the engine loads full graphs into memory for optimal performance.

  • MEMORY_EFFICIENT:
    The engine minimizes memory usage when serving search requests by delegating to IndexInput to fetch the required bytes during a search. Performance depends on the configured IndexInput type. For instance, with MMapDirectory, which maps index files using mmap, this design proposes directly invoking the system call from the C++ layer, bypassing multiple JNI calls to retrieve bytes.

4. Low Level Design

We have two approaches for achieving partial loading:

  1. [Recommended] Run a Lucene-like HNSW search algorithm over the FAISS index in Java.
  2. Run FAISS HNSW search logic in C++ but use a provided IndexInput to fetch bytes on demand during the search.

Despite the differences between these two options, the underlying concept remains the same: during search operations, the core component relies on a provided IndexInput to fetch bytes as needed, instead of loading all bytes into memory upfront, which is the default behavior. The choice between the Java and C++ implementations should prioritize maximizing productivity while maintaining optimal performance.

I strongly recommend proceeding with the Java-based option for the following four reasons:

1. The Java version is much faster than the C++ version, mainly due to JNI call overhead in the C++ version.

For the 1M dataset, the Java version delivered a median throughput of 157.81 ops/s, while the C++ version delivered 115.75 ops/s, making Java 36% faster.
With the 10M dataset, the gap widened: the Java version was 137% faster, showing 15.11 ops/s versus 6.36 ops/s for C++. For more details, please refer to Section 5, Performance Benchmark.

2. Minimal Code Changes Compared to the C++ Option

Implementing partial loading in C++ requires significantly more effort compared to Java. In C++, integrating partial loading logic into the existing FAISS components is challenging. FAISS tightly couples its search and storage mechanisms, making it infeasible to enable partial loading transparently with simple patches. To achieve this, we would need to extract the minimum required HNSW search algorithms from FAISS and adapt them for partial loading.

This complexity arises because FAISS lacks a clear separation between search and storage. In contrast, Lucene’s architecture inherently supports partial loading through its separation of concerns and its flexible Directory implementation. This makes implementing partial loading in Lucene as simple as switching the Directory.
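
To make this contrast concrete, switching the byte-fetching strategy in Lucene is a one-line change (standard Lucene API, shown here purely to illustrate the separation of concerns):

import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;

// Everything above the Directory (codecs, readers, HNSW search) is unchanged;
// only how bytes are fetched from the underlying storage differs.
Directory openDirectory(Path indexPath, boolean useMmap) throws IOException {
    return useMmap
        ? new MMapDirectory(indexPath)    // mmap-backed reads
        : new NIOFSDirectory(indexPath);  // positional file reads
}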

Consequently, introducing partial loading in FAISS would require fundamental changes that compromise some of its core design principles. In comparison, implementing partial loading in Java is far less intrusive!

While supporting partial loading for IVF still requires additional implementation, keeping the logic on the Java side is undoubtedly simpler and more maintainable than handling it in C++.

3. Debugging is Significantly Easier

Debugging in Java is far simpler and more efficient compared to C++. Debugging C++ can be cumbersome, especially when dealing with JNI bindings and JVM shared libraries, where issues often become complex and difficult to trace. It's widely accepted that debugging Java is much more straightforward than debugging C++. I'll stop here.

4. Miscellaneous Benefits

  1. **"Free Lunch" from prefetch and Vector Calculations
    **Lucene contributors continuously work on improving the system. By leveraging Lucene’s HNSW searching logic to operate on the FAISS index, we can benefit from these improvements without additional effort. One good example is the prefetch optimization during search, which triggers the underlying storage to load bytes asynchronously. (Good opportunity for kernel or network file system daemon to load bytes in parallel). This reduces I/O wait times, and decreasing search latency.

  2. Simpler Unit Testing
    Adding unit tests becomes significantly easier, leading to better productivity and maintainability. Mocking in the JVM ecosystem is far simpler and requires much less effort compared to testing in C++. Additionally, flaky tests are easier to identify and resolve in Java, further streamlining the development process.
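
As a concrete example of such an inherited optimization, newer Lucene versions expose an asynchronous prefetch hint on IndexInput. The sketch below is hypothetical (the helper name and offset parameters are assumptions, and the availability of prefetch depends on the Lucene version in use):

// Hint the underlying storage to read a neighbor-list range ahead of time,
// so the bytes are ideally resident by the time the search visits them.
void prefetchNeighborList(IndexInput input, long sectionStart,
                          long relativeOffset, long lengthInBytes) throws IOException {
    // Non-blocking hint: the kernel or a network file system daemon
    // may load these bytes in parallel with the ongoing search.
    input.prefetch(sectionStart + relativeOffset, lengthInBytes);
}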

4.1. [Recommended] Run HNSW Search on FAISS index using Java.

4.1.1. Low Level Design Overview

(Low-level design overview diagram)

4.1.2. Index Loading

We use a lazy loading approach for vector indices. The system first checks the cache manager and loads the index only if it is absent from the cache. This loading process differs from the baseline behavior, which fully loads the index into memory.
Instead of reading the entire file, we record the starting offset of each section and skip directly to the next section. When a specific section is accessed later, the system computes the absolute file offset by adding the section's starting offset to the given relative offset, then fetches the required bytes.

To achieve this, we need to introduce a new component capable of interpreting the FAISS index file layout. This component will handle the conversion of relative positions into absolute offsets and mark the starting offsets of regions within the load method of the IndexLoadStrategy.

This section will not cover index setting changes related to introducing the partial loading mode, as the focus here is on the core components.

IndexLoadStrategy::load

In the load function, we initialize the partial loading components and store them in a PartialLoadingContext. During this process, we save only the starting offsets of each section rather than fully loading the data. The PartialLoadingContext will then be used during the search phase.

@Override
public NativeMemoryAllocation.IndexAllocation load(
       NativeMemoryEntryContext.IndexEntryContext indexEntryContext) {

    ...
    // Retrieve the partial loading mode from settings and compare it to `DISABLED`.
    final boolean partialLoadingEnabled = ...;

    try (IndexInput readStream = directory.openInput(vectorFileName, IOContext.RANDOM)) {
        if (partialLoadingEnabled) {
            // Let's mark offsets
            FaissIndex faissIndex = FaissIndex.load(readStream);

            // Create partial loading context.
            PartialLoadingContext partialLoadingContext =
                new PartialLoadingContext(faissIndex, vectorFileName);

            // We don't have native pointer.
            final long nullIndexAddress = 0;

            // Index does not have a pre-allocated space in memory, 0.
            final long zeroSizeKB = 0;

            return createIndexAllocation(
                       indexEntryContext,
                       knnEngine,
                       nullIndexAddress,
                       zeroSizeKB,
                       vectorFileName,
                       partialLoadingContext);
        } else {
            // Fallback to the baseline loading full bytes into memory.
            ...
        }
    }
}

[New Component] FaissIndex

This class primarily focuses on marking the offsets of regions within the FAISS index file. (You can find the corresponding FAISS code in read_index.)
Most regions will be skipped during loading, with only their start offsets marked. However, the three regions listed below will be loaded into memory due to their relatively small size, typically amounting to at most a few megabytes. Lucene follows the same approach by loading these regions into memory:

  1. Probabilities per layer (double[]): usually just a few bytes.
  2. Cumulative number of neighbors per level (int[]): a few bytes.
  3. Neighbor list offsets (long[]): generally a few kilobytes, at most a few megabytes.
// This class represents a general FAISS index.
// There is a plethora of index types other than the HNSW graph.
public class FaissIndex {
    public static FaissIndex load(IndexInput input) {
        final String indexName = readFourBytes(input);
        if (indexName == "IxMp") {
            // Maps to IdMapIndex
            FaissIdMapIndex idMapIndex = new FaissIdMapIndex();
            idMapIndex.readCommonHeader(input);

            // Partial load a nested index recursively.
            idMapIndex.setNestedIndex(load(input));

            // Mark `list` section.
            idMapIndex.partialLoadIdListSection(input);
            return idMapIndex;
        } else if (indexName == "IHNf") {
            // Maps to HNSWFlatIndex
            FaissHNSWFlatIndex hnswFlatIndex = new FaissHNSWFlatIndex();
            hnswFlatIndex.readCommonHeader(input);

            // Mark regions of HNSW graph. Ex: neighbors per level etc.
            hnswFlatIndex.partialLoadHNSW(input);
            hnswFlatIndex.setStorage(load(input));
            return hnswFlatIndex;
        } else if (indexName == "IxF2") {
            // Maps to IndexFlatL2
            FaissIndexFlatL2 indexFlatL2 = new FaissIndexFlatL2();
            indexFlatL2.readCommonHeader(input);

            // Mark starting offset of flat vector region.
            indexFlatL2.partialLoadFlatVectors(input);
            return indexFlatL2;
        } else {
            // We throw here since the HNSW graph is the only index we support,
            // but in the future we can gradually port more FAISS index types.
            throw new IllegalStateException(
               "Failed to recognize FAISS index, got=" + indexName);
        }
    } 
}

//
//  FaissHNSWFlatIndex
//
class FaissHNSWFlatIndex extends FaissIndex {
    private FaissHNSW hnsw = new FaissHNSW();

    public void partialLoadHNSW(IndexInput input) {
        //
        // Partially load the HNSW graph: neighbor lists, offsets, etc.
        //

        // 1. Load the probability per level. Only a few bytes.
        long sectionSize = input.readLong();
        hnsw.assignedProbabilities = new double[Math.toIntExact(sectionSize)];
        byte[] doubleBytes = new byte[8];
        for (int i = 0; i < sectionSize; ++i) {
            input.readBytes(doubleBytes, 0, 8);
            hnsw.assignedProbabilities[i] =
                ByteBuffer.wrap(doubleBytes)
                          .order(ByteOrder.nativeOrder())
                          .getDouble();
        }

        // 2. Load the cumulative number of neighbors. Only a few bytes.
        sectionSize = input.readLong();
        hnsw.cumNumberNeighborPerLevel = new int[Math.toIntExact(sectionSize)];
        input.readInts(hnsw.cumNumberNeighborPerLevel, 0, Math.toIntExact(sectionSize));

        // 3. Mark the starting offset of the `levels` section.
        hnsw.levels.markSection(input);

        // 4. Load offsets into memory. Typically a few KB.
        //    These offsets are used to locate each neighbor list.
        sectionSize = input.readLong();
        hnsw.offsets = new long[Math.toIntExact(sectionSize)];
        input.readLongs(hnsw.offsets, 0, hnsw.offsets.length);

        // 5. Mark the starting offset of the `neighbor list` section.
        hnsw.neighbors.markSection(input);

        // 6. Read the HNSW graph properties.
        hnsw.entryPoint = input.readInt();
        hnsw.maxLevel = input.readInt();
        hnsw.efConstruction = input.readInt();
        hnsw.efSearch = input.readInt();

        // Dummy-read one integer. It's just a relic of the legacy format.
        input.readInt();
    }
    
    ...
}


//
//  FaissIndexFlatL2
//

public class FaissIndexFlatL2 extends FaissIndex {
    private Storage vectorBytes = new Storage();
    private int oneVectorByteSize;
    
    public void partialLoadFlatVectors(IndexInput input) {
        //
        // Partial load vector[] section
        //
    
        oneVectorByteSize = 4 * dimensions;
        
        final long totalVectorBytesSize = input.readLong() * 4;
        vectorBytes.baseOffset = input.getFilePointer();
        vectorBytes.regionSize = totalVectorBytesSize;

        // Skip to the next section.
        input.seek(vectorBytes.baseOffset + totalVectorBytesSize);
    }
    
    ...
}
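
The markSection / readInt calls above assume a small section-tracking helper on the Java side, analogous to the Storage concept used in the C++ design (Section 4.2). A minimal sketch, assuming the FAISS section layout of an 8-byte element count followed by the payload (the signatures here are illustrative, not a final API):

//
//  Storage (sketch)
//    Records where a section's payload starts, skips over it during loading,
//    and later resolves relative element indices into absolute file offsets.
//
class Storage {
    long baseOffset;  // absolute start offset of the payload in the file
    long regionSize;  // total byte size of the section

    // Record the section start and skip to the next section.
    void markSection(IndexInput input, int itemByteSize) throws IOException {
        final long numItems = input.readLong();
        baseOffset = input.getFilePointer();
        regionSize = numItems * itemByteSize;
        input.seek(baseOffset + regionSize);
    }

    // Absolute offset = section base offset + relative offset.
    // Callers pass a cloned IndexInput, so seeking is confined to one thread.
    int readInt(IndexInput input, long relativeIndex) throws IOException {
        input.seek(baseOffset + relativeIndex * Integer.BYTES);
        return input.readInt();
    }
}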


[New Component] PartialLoadingContext

We will introduce the PartialLoadingContext to IndexAllocation. This new component will include the FaissIndex along with all necessary information required for partial loading.

public class PartialLoadingContext implements Closeable {
    // Partial loaded proxy index.
    private FaissIndex faissIndex;
    private String vectorFileName;
    private IndexInput indexInput;
    private PartialLoadingMode partialLoadingMode;
    
    // Other properties for partial loading
    ...
    
    public IndexInput getIndexInput(Directory directory) {
        if (indexInput == null) {
            indexInput = openIndexInput(directory);
        }
        // `clone` method is not thread safe.
        synchronized (indexInput) {
            return indexInput.clone();
        }
    }
    
    // Clean up resources
    ...
}

4.1.3. Vector Search Part

During search, if partial loading is enabled, we can delegate the search to FaissIndex. Otherwise, it falls back to the default behavior. In the case of FaissHNSWFlatIndex, it executes the HNSW search logic ported from FAISS to Java. Other index types, such as IVF, will be supported shortly.

Currently, the vector search is triggered in the KNNWeight::scorer method. After acquiring results from a single segment, we convert the results into KNNScorer and return them. In the short term, I propose extending the KNNWeight::doANNSearch method to branch based on whether partial loading is enabled.

For the long term, we should consider refactoring this logic into KNNQuery instead of keeping it in KNNWeight.
For parallel search, Lucene uses MultiLeafKnnCollector to collect results, allowing each leaf-level collector to exchange eligible minimum scores (e.g., the inverse of distance) at the global level, which helps terminate loops early and improves performance. However, since the scorer method is called at the leaf level (IndexSearcher::searchLeaf), it will not benefit from such a collector during parallel searches.

KNNWeight

private Map<Integer, Float> doANNSearch(
    LeafReaderContext context,
    BitSet filterIdsBitSet,
    int cardinality,
    int k) {
    ...
    
    // Get index allocation from cache manager.
    NativeMemoryAllocation indexAllocation = ...;
    
    // Prepare for searching
    FilterIdsSelector filterIdsSelector = FilterIdsSelector.getFilterIdSelector(filterIdsBitSet, cardinality);
    long[] filterIds = filterIdsSelector.getFilterIds();
    FilterIdsSelector.FilterIdsSelectorType filterType = filterIdsSelector.getFilterType();
    // Now that we have the allocation, we need to readLock it
    indexAllocation.readLock();
    indexAllocation.incRef();
    
    // Start searching
    if (partialLoadingMode != DISABLED) {
        results = doPartialLoadingANNSearch(...);
    } else {
        // Fallback to default searching logic
        ...
    }

    // Post processing after ANN search
    ...
    if (quantizedVector != null) {
        return Arrays.stream(results)
        .collect(Collectors.toMap(KNNQueryResult::getId, result -> knnEngine.score(result.getScore(), SpaceType.HAMMING)));
    }
    return Arrays.stream(results)
        .collect(Collectors.toMap(KNNQueryResult::getId, result -> knnEngine.score(result.getScore(), spaceType)));
}

private KNNResult[] doPartialLoadingANNSearch(
            NativeMemoryAllocation indexAllocation,
            Directory directory,
            KNNQuery query,
            int[] parentIds,
            BitSet filterIds) {
    // Load partial load search strategy.
    // For now, we only have `MEMORY_EFFICIENT` mode.
    PartialLoadingSearchStrategy strategy =
        PartialLoadingSearchStrategy.load(
            indexAllocation.partialLoadingContext.partialLoadingMode);

    // Building search parameters
    PartialLoadingSearchParameters parameters =
        PartialLoadingSearchParameters.builder()
                                      .indexAllocation(indexAllocation)
                                      ...;

    // Start searching. After done, convert TopDocs to KNNResult[]
    return toKNNResults(strategy.searchLeaf(parameters));
}

[New Component] MemoryEfficientPartialLoadingSearchStrategy

For scalability, we define a partial load search strategy that performs vector search based on its mode. Currently, we only have the MEMORY_EFFICIENT mode, but this strategy pattern allows us to easily add more modes in the future.

The MemoryEfficientPartialLoadingSearchStrategy constructs both the KnnCollector and RandomVectorScorer, and then passes them to FaissIndex to initiate the search. Note that both components are defined in the Lucene core package.

The following code illustrates the "happy path" scenario, where there are no parent IDs or filterIdsBitSet. This example is simplified for clarity, but it demonstrates how each case can be implemented in the future.

//
//  MemoryEfficientPartialLoadingSearchStrategy
//

@Builder
@Getter
class PartialLoadingSearchParameters {
    private float[] floatQueryVector;
    private byte[] byteQueryVector;
    private PartialLoadingContext partialLoadingContext;
    private int k;
    private Integer efSearch;
    @Setter
    private MatchDocSelector matchDocSelector;
    @Setter
    private DocIdGrouper docIdGrouper;
    private SpaceType spaceType;
    private IndexInput indexInput;
}

class MemoryEfficientPartialLoadingSearchStrategy extends PartialLoadingSearchStrategy {
    @Override
    public KNNQueryResult[] searchLeaf(PartialLoadingSearchParameters searchParameters) {
        // Prepare the search
        PartialLoadingContext partialLoadingContext = searchParameters.getPartialLoadingContext();
        FaissIndex faissIndex = partialLoadingContext.getFaissIndex();

        // Start searching a single index
        final DocIdAndDistance[] results = new DocIdAndDistance[searchParameters.getK()];
        for (int i = 0; i < searchParameters.getK(); i++) {
            results[i] = new DocIdAndDistance(DocIdAndDistance.INVALID_DOC_ID, 0);
        }

        try {
            faissIndex.searchLeaf(partialLoadingContext.getIndexInput(), results, searchParameters);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

        // Transform results to query results
        int idx = results.length - 1;
        while (idx >= 0 && results[idx].id == DocIdAndDistance.INVALID_DOC_ID) {
            --idx;
        }
        // Ex: [id0, id2, id3, -1, -1, -1] where k == 6, then resultSize = 3
        final int resultSize = idx + 1;

        final KNNQueryResult[] queryResults = new KNNQueryResult[resultSize];
        for (int i = 0; i < resultSize; i++) {
            queryResults[i] = new KNNQueryResult(results[i].id, results[i].distance);
        }
        return queryResults;
    }
}
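
As noted above, the strategy is also responsible for constructing the RandomVectorScorer handed to Lucene's searcher. A minimal sketch of a scorer over the partially loaded flat vectors (the readVector helper is a hypothetical addition to FaissIndexFlatL2, and the exact RandomVectorScorer shape varies across Lucene versions):

//
//  PartialLoadingVectorScorer (sketch)
//    Scores candidates by fetching one vector at a time through IndexInput.
//
class PartialLoadingVectorScorer implements RandomVectorScorer {
    private final FaissIndexFlatL2 flatIndex;  // partially loaded vectors
    private final IndexInput input;            // cloned, thread-confined
    private final float[] query;
    private final int numVectors;

    PartialLoadingVectorScorer(FaissIndexFlatL2 flatIndex, IndexInput input,
                               float[] query, int numVectors) {
        this.flatIndex = flatIndex;
        this.input = input;
        this.query = query;
        this.numVectors = numVectors;
    }

    @Override
    public float score(int node) throws IOException {
        // Hypothetical helper: reads `dimension` floats at the node's
        // absolute offset within the flat vector section.
        final float[] vector = flatIndex.readVector(input, node);
        float l2 = 0;
        for (int i = 0; i < query.length; i++) {
            final float diff = query[i] - vector[i];
            l2 += diff * diff;
        }
        // Lucene expects "higher is better", so invert the L2 distance.
        return 1f / (1f + l2);
    }

    @Override
    public int maxOrd() {
        return numVectors;
    }
}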

[New Component] FaissHNSWFlatIndex::searchLeaf

FaissHNSWFlatIndex prepares Lucene HNSW graph adapters and delegates the execution of the HNSW search algorithm to Lucene. Lucene defines an HNSW graph interface in its core package. The idea is to create a LuceneFaissHnswGraph class that wraps the FaissHNSW to implement the required contracts, allowing Lucene’s HnswGraphSearcher to navigate the graph and compute scores. Similarly, this approach can be extended to implement specific vector search algorithms in the various FaissIndex subclasses, such as the IVF index.

//
//  FaissHNSWFlatIndex
//
public class FaissHNSWFlatIndex extends FaissIndex {
    private FaissHNSW hnsw = new FaissHNSW();

    public void partialLoadHNSW(IndexInput input) {
        ...
    }

    @Override
    public TopDocs searchLeaf(RandomVectorScorer scorer,
                              KnnCollector knnCollector,
                              PartialLoadingSearchParameters parameters) {
        // Create the Lucene HNSW graph adapter that wraps FaissHNSW inside.
        LuceneFaissHnswGraph faissHnswGraph = 
            new LuceneFaissHnswGraph(hnsw, parameters.indexInput);

        // Delegate the vector search to Lucene.
        HnswGraphSearcher.search(scorer, 
                                 knnCollector, 
                                 faissHnswGraph, 
                                 parameters.filterIdsBitSet);

        // Return collected results
        return knnCollector.topDocs();
    }
}

//
//  LuceneFaissHnswGraph :
//    Implements Lucene's HnswGraph interface while employing FaissHNSW.
//

public class LuceneFaissHnswGraph extends HnswGraph {
    private FaissHNSW faissHnsw;
    private long begin;
    private long end;
    private long currOffset;
    private IndexInput indexInput;
    
    @Override 
    public void seek(int level, int target) {
        // Get the relative starting offset of the neighbor list at `level`.
        long o = faissHnsw.offsets[target];
        
        // `begin` and `end` are the start and end offsets of the neighbor list.
        // `end` is the maximum offset a neighbor list at this level can have,
        // so the list must be traversed until the terminal `-1` is reached.
        begin = o + faissHnsw.cumNumberNeighborPerLevel[level];
        end = o + faissHnsw.cumNumberNeighborPerLevel[level + 1];
        currOffset = begin;
    }
    
    @Override 
    public int nextNeighbor() {
        if (currOffset < end) {
            // Read the next neighbor id.
            final int neighborId = faissHnsw.neighbors.readInt(indexInput, currOffset);
            ++currOffset;

            // Negative integer is a terminal.
            if (neighborId >= 0) {
                return neighborId;
            }
        }
        
        // Neighbor list has been exhausted.
        return NO_MORE_DOCS;
    }
       
    // Other miscellaneous APIs 
    ...
}

4.1.4. Pros / Cons

Pros

  1. We can implement partial loading with the minimum code changes possible.
  2. Partial loading will continuously benefit from Lucene’s improvements over time.
  3. Easier to debug than C++.
  4. More readable than C++.
  5. No context-switching overhead between Java and JNI.

Cons

  1. With the same index, partial loading may yield different results compared to when it is disabled. While the overall HNSW search logic is similar between Lucene and FAISS, there is a slight difference in the loop termination conditions. Lucene halts the loop when the best score in the candidate heap is no longer better than the minimum eligible score[Ref], whereas FAISS continues until the candidate heap is exhausted. [Ref].

    However, recall and precision will remain at the same level. If this becomes a significant concern, I can port the FAISS HNSW search logic to Java. I've tested this as well, and it yielded the same performance as Lucene.

4.2. [Not Recommended] Implement Partial Loading in C++

(If you’re not interested in knowing what to change in C++, you can skip this section and move on to the performance section.)

In this section, we will discuss how partial loading can be achieved in C++. The underlying concept remains the same: whenever bytes are requested, we rely on Lucene’s IndexInput to fetch them.

However, for this approach, we need to duplicate the FAISS core logic and introduce several modifications due to the reasons outlined below. This is not something that can be accomplished with simple patches.

4.2.1. Why it is difficult to introduce changes in FAISS

1. Absence of a storage concept in FAISS.

In its current implementation, FAISS directly reads data from faiss::IOReader into a std::vector. (READXBVECTOR) There is no concept of "Storage" in FAISS; everything is expected to reside in memory. However, partial loading is the opposite of this approach, where bytes are loaded on-demand from the underlying storage. This means the resulting bytes could end up being scattered across memory.

The challenge is that it’s not straightforward to extend core components to switch between full loading and partial loading modes. std::vector should be used when partial loading is disabled, but when it’s enabled, a Storage type wrapping Lucene's IndexInput must be used instead. Implementing this change would require widespread modifications throughout the codebase. (Chain explosion!)

Furthermore, since FAISS doesn’t have a clear distinction between updating and searching, these fundamental changes would also affect the index writing logic. FAISS’s in-memory data structures can be updated while being searched in parallel. This interdependence makes it difficult to modify core components without bringing additional changes to other parts of the system.

While a storage concept could solve these challenges, introducing such a change would require agreement and collaboration from the FAISS team. Until that happens, it would be unwise to implement disruptive changes that would likely introduce unwanted side effects in other areas of the system.

2. FAISS’s strong 'faith' that bytes will be in a contiguous memory space.

FAISS passes size_t* or float* pointers around freely, with the firm belief that everything will be stored in a contiguous block of memory. This assumption is deeply embedded in its design, particularly in the distance computation logic, which simply looks up vectors using &ptr[index]. To implement partial loading, this approach would need to change significantly to rely on a Storage system to fetch the required bytes on demand.

Beyond the technical challenges, I also question whether many contributors fully understand the complexities of partial loading in C++. This could hinder the speed at which we can add new features or maintain the system in a robust manner.

Therefore, if we are determined to implement partial loading in C++, the only viable approach would be to extract the HNSW searching logic from FAISS and tailor it specifically for partial loading. However, as FAISS continues to evolve, we would need to constantly update our partial loading implementation to keep pace. Given that only a few contributors have a deep understanding of FAISS's internal workings, it is likely that we will fall behind. Falling behind on maintaining such critical logic can lead to technical debt, blocking further changes and ultimately resulting in deprecation.

For this reason, I believe it is not worth the risk to pursue this approach. Developing logic that may be deprecated in the future does not make sense.

4.2.2. Low Level Design Overview

(Low-level design overview diagram)

4.2.3. [New Component] FaissIndexInputStorage

FAISS HNSW heavily relies on std::vector<T> for its search operations, using it as a memory block to manage various data structures, such as float vectors and adjacency neighbor lists. During the loading phase, FAISS uses IndexInput to fetch bytes and populate the memory regions managed by std::vector<T>.

By introducing a generalized Storage<T> implementation with functionality equivalent to std::vector<T>, we could abstract the storage layer in FAISS and replace std::vector<T> with this more flexible solution. This would provide a more adaptable way to manage memory while preserving the functionality of the existing system.

Storage is an abstract layer that provides loading APIs with offsets. This will replace std::vector in faiss::Index. Since we will use static polymorphism to introduce this abstract layer, it will be passed as a template type.
Storage must be templated with a data type T, such as long, and must have the following four methods:

  1. markSection(FaissOpenSearchIOReader* reader, size_t nitems)
    Uses the provided reader to mark the start offset of the section and skip to the next section.
    nitems refers to the number of items of data type T. For example, if T is long and nitems is 10, it skips 10 * 8 bytes.

  2. setAtomicUnit(int32_t atomic_unit)
    Sets the atomic_unit to guarantee that an atomic unit of data can be loaded.
    For example, with a dimension of 768 for a float vector, atomic_unit must be set to 3072 (i.e., 768 * sizeof(float) bytes).

  3. const T& operator[](const size_t index) const
    Loads bytes, the size of which equals the atomic unit, located at the given index in the file.

  4. size()
    Returns the total byte size of the storage.

struct FaissIndexInputStorageBase {
  size_t nitems_;
  size_t base_offset_;
  size_t atomic_unit_;

  FaissIndexInputStorageBase()
      : nitems_(),
        base_offset_(),
        atomic_unit_(1) {
  }
};

template<typename T>
struct FaissIndexInputStorage final : public FaissIndexInputStorageBase {
    void markSection(FaissOpenSearchIOReader *reader, size_t nitems) {
        // Marking start offset
        base_offset_ = reader->getOffset();
        nitems_ = nitems;
        
        // Skip to the next section
        reader->seek(base_offset_ + sizeof(T) * nitems);
    }
    
    const T& operator[](const size_t index) const {
        // Thread-local state for partial loading is initialized in QueryIndex_WithFilter.
        auto* state = PartialLoadingContext::getThreadLocalState();
        // This mediator wraps Lucene's IndexInput.
        auto* mediator = state->mediator;
        auto* buffer = state->buffer;

        // Fetch the minimum required bytes at the absolute offset.
        mediator->copyBytesWithOffset(
            base_offset_ + (sizeof(T) * index),
            sizeof(T) * atomic_unit_,
            buffer);

        return *reinterpret_cast<const T*>(buffer);
    }
    
    void setAtomicUnit(int32_t atomic_unit) {
        atomic_unit_ = atomic_unit;
    }
    
    size_t size() const {
        return sizeof(T) * nitems_;
    }
};

Ex: [Baseline] HNSW Graph Traversal

//
// Graph
//

struct HNSW {
    ...
    std::vector<storage_idx_t> neighbors; // Neighbor list
};

HNSWStats greedy_update_nearest(...) {
    for (size_t j = begin; j < end; j++) {
      storage_idx_t v = hnsw.neighbors[j]; 
      if (v < 0)
          break;
      ndis += 1;
      
    ...
}

//
// Vectors
//

struct IndexFlatCodes : Index {
    size_t code_size;

    // encoded dataset, size ntotal * code_size
    std::vector<uint8_t> codes;
    ...
};

Ex: [Partial Loading] HNSW Graph Traversal

//
// Graph
//

template <template <class> typename Storage>
struct OpenSearchHNSW {
    ...
    Storage<storage_idx_t> neighbors;  // Replacing std::vector
};

HNSWStats greedy_update_nearest(...) {
    for (size_t j = begin; j < end; j++) {
      // Calling Storage<T>::operator[](size_t)
      storage_idx_t v = hnsw.neighbors[j];
      if (v < 0)
          break;
      ndis += 1;
    
    ...
}

//
// Vectors
//

template<template<class> typename Storage>
struct OpenSearchIndexFlatCodes : Index {
    size_t code_size;
  
    // encoded dataset, size ntotal * code_size
    Storage<uint8_t> codes;  // Replacing std::vector
    ...
};

4.2.4. Index Loading

Similar to the first approach, we duplicate FAISS's read_index method. When partial loading is enabled, it marks the starting offsets of each section in the file and moves on to the next section. If partial loading is disabled, the flow falls back to the default behavior, which loads the entire index into memory.

IndexLoadStrategy::load

@Override
public NativeMemoryAllocation.IndexAllocation load(
       NativeMemoryEntryContext.IndexEntryContext indexEntryContext) {

    ...
    // Retrieve the partial loading mode from settings and compare it to `DISABLED`.
    final boolean partialLoadingEnabled = ...;

    try (IndexInput readStream = directory.openInput(vectorFileName, IOContext.READONCE)) {
        if (partialLoadingEnabled) {
            // Partial load native index
            IndexInputWithBuffer indexInputWithBuffer = ...;
            PartialLoadingContext partialLoadingContext =
                new PartialLoadingContext(vectorFileName, partialLoadingMode);

            // JNI layer will branch on partial loading in C++
            long indexAddress = JNIService.loadIndexWithStream(
                      indexInputWithBuffer,
                      partialLoadingContext,
                      indexEntryContext.getParameters(),
                      knnEngine);

            return createIndexAllocation(
                       indexEntryContext,
                       knnEngine,
                       indexAddress,
                       indexSizeKB,
                       vectorFileName,
                       partialLoadingContext);
        } else {
            // Fallback to the baseline loading full bytes into memory.
            ...
        }
    }
}

//
//  PartialLoadingContext
//
public class PartialLoadingContext implements Closeable {
    private String vectorFileName;
    private IndexInput indexInput;
    private PartialLoadingMode partialLoadingMode;
    
    // Other properties for partial loading
    ...
    
    public IndexInput getIndexInput(Directory directory) {
        if (indexInput == null) {
            indexInput = openIndexInput(directory);
        }
        // `clone` method is not thread safe.
        synchronized (indexInput) {
            return indexInput.clone();
        }
    }
    
    // Clean up resources
    ...
}

org_opensearch_knn_jni_FaissService.cpp

Here, we create a C++ struct, PartialLoadingContext, which wraps the passed Java instance of PartialLoadingContext. This struct provides a set of convenient methods that internally handle the necessary JNI calls.

jlong Java_org_opensearch_knn_jni_FaissService_loadIndexWithStream(
         JNIEnv* env,
         jclass cls,
         jobject indexInput,
         jobject partialLoadingCtx) {
    // Create a mediator.
    knn_jni::stream::NativeEngineIndexInputMediator mediator {
               &jniUtil, env, indexInput};

    // Wrap the mediator with a glue code inheriting IOReader.
    knn_jni::stream::FaissOpenSearchIOReader faissOpenSearchIOReader {&mediator};

    // Create a partial loading context
    knn_jni::partial_loading::PartialLoadingContext partialLoadingContext {
               &jniUtil, env, partialLoadingCtx};
    
    // Load or Partial load index.
    return faiss_wrapper::LoadIndexWithStream(
             &faissOpenSearchIOReader,
             &partialLoadingContext);
}

faiss_wrapper.h

Here, depending on whether partial loading is enabled, the loading function branches. If partial loading is enabled, it calls partialLoadingFaissIndexLoad to load the index partially. Otherwise, it falls back to the default behavior, which loads the full index into memory.

jlong faiss_wrapper::LoadIndexWithStream(
                         FaissOpenSearchIOReader* io_reader,
                         PartialLoadingContext* partial_loading_ctx) {
    if (partial_loading_ctx->isEnabled()) {
        return partialLoadingFaissIndexLoad(io_reader);
    } else {
        // Fallback to default. Loading all bytes into memory.
        ...
    }
}

partialloading/faiss_index_load.h

Similar to what we did in FaissIndex::load in Java, we need to mark the starting offset of each section in the FAISS index.


faiss::Index* partialLoadingFaissIndexLoad(FaissOpenSearchIOReader* reader) {
    uint32_t index_name;
    READ1(index_name);
    
    if (index_name == faiss::fourcc("IxMp")) {
        // Corresponding to IndexIDMap2 in FAISS
        auto* idxmap = 
            new OpenSearchIndexIDMapTemplate<faiss::Index, FaissIndexInputStorage>();
        // Read common header
        faiss::read_index_header(idxmap, reader);
        
        // Recursively load nested index
        idxmap->index = partialLoadingFaissIndexLoad(reader);
        
        // Mark id section
        idxmap->id_map.markSection(reader);
        return idxmap;
    } else if (index_name == faiss::fourcc("IHNf")) {
        // Corresponding to IndexHNSWFlat in FAISS
        auto* idxhnsw = new OpenSearchIndexHNSWFlat<FaissIndexInputStorage>();
        
        // Read common header
        faiss::read_index_header(idxhnsw, reader);
        
        // Partial load HNSW graph
        partialLoadHNSW(&idxhnsw->hnsw, reader);
        
        // Load nested storage
        idxhnsw->storage = partialLoadingFaissIndexLoad(reader);
        return idxhnsw;
    } else if (index_name == faiss::fourcc("IxF2")) {
        // Corresponding to IndexFlatL2 in FAISS
        auto* idxf = new OpenSearchIndexFlatL2<FaissIndexInputStorage>();

        // Read common header
        faiss::read_index_header(idxf, reader);
        
        // Size of all vector bytes.
        // idxf->d -> Dimensions
        idxf->code_size = sizeof(float) * idxf->d;
        
        // Partial load vector bytes
        // When actually reading bytes, it needs to be at least one single vector.
        idxf->codes.setAtomicUnit(idxf->code_size);
        idxf->codes.markSection(reader, 4);
        
        return idxf;
    } else {
        throw std::runtime_error(
            "Unrecognized index type : " + toString(index_name));
    }
}

// Similar to FaissHNSWFlatIndex::partialLoadHNSW, we mark sections in hnsw file.
void partialLoadHNSW(OpenSearchFaissHNSW* hnsw, 
                     FaissOpenSearchIOReader* reader) {
    // Load assigned probabilities.
    READVECTOR(hnsw->assign_probas);
    
    // Load cumulative number of neighbors per level
    READVECTOR(hnsw->cum_nneighbor_per_level);
    
    // Mark `levels` section
    hnsw->levels.markSection(reader);
    
    // Load `offsets` section
    READVECTOR(hnsw->offsets);
    
    // Mark `neighbor list` section
    hnsw->neighbors.markSection(reader);
    
    // Read hnsw graph properties
    READ1(hnsw->entry_point);
    READ1(hnsw->max_level);
    READ1(hnsw->efConstruction);
    READ1(hnsw->efSearch);
    
    // Read dummy field
    READ1_DUMMY(int);
}

4.2.5. Vector Search Part

Once an index is partially loaded in C++, we can perform the vector search in the same way we did in Java. The search process in C++ is more complicated than in Java, as it does not delegate to Lucene's HnswGraphSearcher. Instead, we need to duplicate the FAISS search logic, specifically tailored for partial loading. However, the underlying idea remains the same: depending on the partial loading mode, we create a search strategy and proceed with the vector search.

KNNWeight::doANNSearch

private Map<Integer, Float> doANNSearch(
    LeafReaderContext context,
    BitSet filterIdsBitSet,
    int cardinality,
    int k) {
    ...
    
    // Get index allocation from cache manager.
    NativeMemoryAllocation indexAllocation = ...;
    
    // Prepare for searching
    FilterIdsSelector filterIdsSelector = FilterIdsSelector.getFilterIdSelector(filterIdsBitSet, cardinality);
    long[] filterIds = filterIdsSelector.getFilterIds();
    FilterIdsSelector.FilterIdsSelectorType filterType = filterIdsSelector.getFilterType();

    // Now that we have the allocation, we need to readLock it
    indexAllocation.readLock();
    indexAllocation.incRef();
    
    // Start searching
    if (partialLoadingMode != DISABLED) {
        results = JNIService.queryIndex(
                    indexAllocation.getMemoryAddress(),
                    indexAllocation.getPartialLoadingContext(),
                    knnQuery.getQueryVector(),
                    k,
                    knnQuery.getMethodParameters(),
                    knnEngine,
                    filterIds,
                    filterType.getValue(),
                    parentIds);
    } else {
        // Fallback to default searching logic
        ...
    }
    
    // Post processing after ANN search
    ...
    if (quantizedVector != null) {
        return Arrays.stream(results)
        .collect(Collectors.toMap(KNNQueryResult::getId, result -> knnEngine.score(result.getScore(), SpaceType.HAMMING)));
    }
    return Arrays.stream(results)
        .collect(Collectors.toMap(KNNQueryResult::getId, result -> knnEngine.score(result.getScore(), spaceType)));
}

faiss_wrapper.cpp

jobjectArray QueryIndex_WithFilter(knn_jni::JNIUtilInterface *jniUtil,
                                   JNIEnv *env,
                                   jlong indexPointerJ,
                                   jobject partial_loading_ctx_j,
                                   jfloatArray queryVectorJ,
                                   jint kJ,
                                   jobject methodParamsJ,
                                   jlongArray filterIdsJ,
                                   jint filterIdsTypeJ,
                                   jintArray parentIdsJ) {
    // Prepare search
    //   1. Initialize method parameters.
    //   2. Prepare two std::vector of distances + ids
    //   3. Acquire query vector

    ...
    
    // NOTE : Let's just assume we don't have filter ids 
    //        nor parent ids for simplicity.
    
    knn_jni::partial_loading::PartialLoadingContext partialLoadingContext {
               &jniUtil, env, partial_loading_ctx_j};

    if (partialLoadingContext.isEnabled()) {
        // Override `efSearch`
        hnswParams.efSearch = getIntegerMethodParameter(
            env, jniUtil, methodParams, EF_SEARCH, hnswReader->hnsw.efSearch);
        searchParameters = &hnswParams;
        
        // Acquire IndexInput and set it as a thread local.
        // The thread local variable will be used by each `Storage`.
        partialLoadingContext.setThreadLocalState();
    
        // For now, we only have `MEMORY_EFFICIENT` mode.
        PartialLoadingSearchStrategy* strategy =
            PartialLoadingSearchStrategy::load(partialLoadingContext.mode);
        
        // Start search!
        strategy->search(indexPointerJ, rawQueryvector, kJ, dis.data(), ids.data(), searchParameters);
        
        // Post-search result processing
        ...
    } else {
        // Fallback to default one
    }
}

[New Component] partialloading/memory_efficient_partial_loading_search_strategy

MemoryEfficientPartialLoadingSearchStrategy allows each index to continue the vector search and fetch bytes on demand. The actual byte fetching is done through JNI calls to invoke Lucene's IndexInput.


class MemoryEfficientPartialLoadingSearchStrategy : public PartialLoadingSearchStrategy {
  public:
    void search(long topIndexAddress,
                int32_t num_queries,
                float* query_vectors,
                int32_t k,
                float* distances,
                int32_t* ids,
                const SearchParameters* params) final {
        // Top-level IndexIDMap
        auto* indexReader =
            reinterpret_cast<OpenSearchIndexIDMapTemplate<
                faiss::Index, FaissIndexInputStorage>*>(topIndexAddress);

        // Nested index of IndexIDMap. HNSW is the only index type being supported.
        auto* hnswReader =
            dynamic_cast<OpenSearchIndexHNSWFlat<FaissIndexInputStorage>*>(
                indexReader->index);

        if (hnswReader == nullptr) {
            throw std::runtime_error(
                "IVF is not supported in partial loading");
        }

        // Start searching
        indexReader->search(num_queries, query_vectors, k, distances,
                            ids, params);
    }
};

[New Component] OpenSearchIndexIDMapTemplate

This is the top-level index that delegates the search to its nested index (e.g., currently IndexHNSWFlat) and then transforms the result IDs.

template <typename IndexT, template <class> typename Storage>
void OpenSearchIndexIDMapTemplate<IndexT, Storage>::search(
            int32_t num_queries,
            float* query_vectors,  // e.g. float[dimensions * num_queries]
            int32_t k,
            float* distances,      // e.g. float[k * num_queries]
            int32_t* ids,          // e.g. int32_t[k * num_queries]
            const SearchParameters* params) {
    // Let the nested index search.
    // For now, IndexHNSWFlat is the only available nested index.
    index->search(num_queries, query_vectors, k, distances, ids, params);
    
    // Then transform the ids.
    for (int i = 0 ; i < num_queries * k ; ++i) {
        ids[i] = ids[i] < 0 ? ids[i] : id_map[ids[i]];
    }
}

[New Component] OpenSearchIndexHNSWFlat

Eventually, it will reach the hnswSearch method, which performs vector search by traversing the HNSW graph. All byte fetching will be handled transparently by each Storage implementation.

template <template<class> typename Storage>
void OpenSearchIndexHNSWFlat<Storage>::hnswSearch(
            int32_t num_queries,
            const float* query_vectors,
            BlockResultHandler& block_result_handler,
            const SearchParameters* params) {
    // Get `efSearch` from params. Otherwise use the default one.
    const int32_t ef_search = ...;
    
    // Make a visit table. You can think of this as a BitSet in Java
    VisitedTable vt (index->ntotal);
    
    // Make result heap
    BlockResultHandler::SingleResultHandler res (block_result_handler);
    
    // Get distance computer
    // Unlike Java, distance computer should be retrieved from storage index.
    // For example, IndexFlatL2 will give L2 distance computer.
    auto distance_computer = storage->get_distance_computer();
    distance_computer->set_query(query_vectors);
    
    // Start HNSW graph searching
    hnsw.search(*distance_computer, res, vt, params);
    
    // Post processing
    ...
}

[New Component] OpenSearchHNSW

For simplicity, I will not go into the details of the HNSW search algorithm. In short, the algorithm consists of two parts:

  1. It first greedily finds the closest vector at each level, starting from the entry point and moving down to the bottom layer.
  2. It then performs an exhaustive greedy search at the bottom graph using a bounded max heap with a capacity of efSearch.

For more details, refer to the greedy_update_nearest and search_from_candidates methods in FAISS.

template <template <class> typename Storage>
struct OpenSearchHNSW {
    // Probability per level; typically a few bytes.
    std::vector<double> assign_probas;
    // Cumulative number of neighbors per level. A few KB.
    std::vector<int32_t> cum_nneighbor_per_level;
    // Level of each vector (base level = 1), size = ntotal.
    Storage<int32_t> levels;
    // Offsets of neighbor lists. A few KB up to a few MB.
    std::vector<size_t> offsets;
    // Neighbor lists.
    Storage<int32_t> neighbors;
    
    // Graph properties
    int32_t entry_point;
    int32_t max_level = -1;
    int32_t ef_construction = 40;
    int32_t ef_search = 16;
};

template <template <class> typename Storage>
void OpenSearchHNSW<Storage>::search(
              DistanceComputer& distance_computer,
              ResultHandler& result_handler,
              VisitedTable& visited_table,
              const SearchParametersHNSW* params) {
    // Implementing HNSW graph searching
    
    // Obtain `k` from result_handler
    const int k = extract_k_from_ResultHandler(result_handler);
    
    // Greedily reach out to the bottom layer from entry point
    int32_t nearest = entry_point;
    float d_nearest = distance_computer(entry_point);
    
    // Start the first part in HNSW graph search
    for (int level = max_level ; level >= 1 ; --level) {
        greedyUpdateNearest(distance_computer, level, nearest, d_nearest);
    }
    
    // Start the second part of exploring at the bottom graph.
    const int32_t ef = std::max(params ? params->efSearch : ef_search, k);
    MinimaxHeap candidates (ef);
    candidates.push(nearest, d_nearest);  // Put the starting point at the bottom.
    searchFromCandidates(distance_computer, result_handler, candidates, visited_table, ...);
    
    // Post processing
    ...
}


[New Component] OpenSearchFlatL2Dis

This is the L2 distance calculator created by OpenSearchIndexFlatL2. As the name suggests, it calculates the L2 distance (i.e., Euclidean distance) between the given query vector and a vector stored within the index.
It fetches vectors via Storage, loading bytes equal to the size configured through setAtomicUnit. This ensures that a single vector is fully loaded. For example, for a 768-dimensional float vector, 3072 bytes (i.e., sizeof(float) * 768) will be loaded at once.

template<template<class> typename Storage>
struct OpenSearchFlatL2Dis : DistanceComputer {
    // This is acquired from storage index having vector data.
    const Storage<uint8_t>* codes_data;
    size_t code_size;  // sizeof(one single vector)
    size_t d;  // dimensions
    int32_t nb;  // #vectors
    const float* query_vector;
    
    // Compute the distance between the query vector and vector i.
    float operator()(int32_t i) final {
        // Partially load the vector bytes.
        // The atomic unit is set to the size of a single vector in `Storage`,
        // so the vector is guaranteed to be loaded fully.
        // This calls operator[]() of `Storage`.
        const float* vec =
            reinterpret_cast<const float*>(&(*codes_data)[i * code_size]);
        
        // Calls the FAISS API
        return fvec_L2sqr(query_vector, vec, d);
    }
    
    // Optimized version in batch.
    void distances_batch_4(
                const idx_t idx0,
                const idx_t idx1,
                const idx_t idx2,
                const idx_t idx3,
                float &dis0,
                float &dis1,
                float &dis2,
                float &dis3) final {
        // compute first, assign next.
        // Partial load four vectors from `Storage`.
        // Internally, it will keep four vectors in a temp buffer.
        // Each line will call operator[]() of `Storage`
        const auto *__restrict y0 =
            reinterpret_cast<const float *>(&(*codes_data)[idx0 * code_size]);
        const auto *__restrict y1 =
            reinterpret_cast<const float *>(&(*codes_data)[idx1 * code_size]);
        const auto *__restrict y2 =
            reinterpret_cast<const float *>(&(*codes_data)[idx2 * code_size]);
        const auto *__restrict y3 =
            reinterpret_cast<const float *>(&(*codes_data)[idx3 * code_size]);
        
        // Call FAISS APIs
        float dp0 = 0;
        float dp1 = 0;
        float dp2 = 0;
        float dp3 = 0;
        fvec_L2sqr_batch_4(query_vector,
                           y0, y1, y2, y3,
                           d,
                           dp0, dp1, dp2, dp3);

        // Assign results
        dis0 = dp0;
        dis1 = dp1;
        dis2 = dp2;
        dis3 = dp3;
    }
    
    void set_query(const float* _query_vector) {
        query_vector = _query_vector;
    }
};

4.2.6. Pros / Cons

Pros

  1. Honestly, I could not find a single pro for this approach.

Cons

  1. Need to keep up with FAISS updates.
  2. Hard to debug.
  3. Not readable.
  4. More importantly, it is slow.
  5. Will need more code changes.

5. Performance Benchmark

Vector search performance is primarily influenced by three factors:

  1. The HNSW search algorithm.
  2. Distance computation.
  3. Loading bytes from storage, which is largely determined by the Directory implementation.
    Partial loading has little to no impact on this part.

Partial loading mainly affects the first two factors. In my testing, I observed that the more time it takes to load bytes from storage (e.g., when using Network File System with NioFSDirectory, though results aren't included here), the harder it becomes to isolate and analyze the performance impact of partial loading.

To mitigate this, I used MMapDirectory to load the entire index into memory, minimizing interaction with the Directory and enabling me to assess the worst-case performance. This provides a lower bound on latency for comparison, with Lucene as the baseline.

5.1. Benchmark Environment

  • Data set : Random 1M, Random 10M
  • Dimensions : 768
  • Index Type : float[]
  • Instance Type : c6a.32xlarge
    • 128 CPUs
    • 256GB RAM
  • JVM Args : -Xms100g -Xmx100g
  • Search parameters
    • k: 10
    • efSearch : 100
  • Engines :
    • Partial Loading - Java : The Java implementation of partial loading (Section 4.1).
    • Partial Loading - C++ : The C++ implementation of partial loading (Section 4.2).
    • Lucene : Lucene engine.
    • FullLoading FAISS : Current OpenSearch KNN engine that loads all bytes into memory.

5.2. 1M Summary

Throughput

| Engine | Min Throughput | Mean Throughput | Median Throughput | Max Throughput | recall@k / recall@1 |
|---|---|---|---|---|---|
| Partial Loading - Java | 155.24 | 157.81 | 158.43 | 158.84 | 0.60 / 0.75 |
| Partial Loading - C++ | 114.71 | 115.75 | 115.86 | 116.6 | 0.60 / 0.75 |
| Lucene | 153.88 | 157.46 | 157.92 | 159.59 | 0.68 / 0.85 |
| FullLoading FAISS | 175.28 | 178.76 | 179.32 | 181.14 | 0.6 / 0.75 |

Latency

| Engine | 50% Latency | 90% Latency | 99% Latency | 99.9% Latency | 100% Latency | recall@k / recall@1 |
|---|---|---|---|---|---|---|
| Partial Loading - Java | 5.05 | 5.39 | 5.75 | 7.73 | 11.8 | 0.60 / 0.75 |
| Partial Loading - C++ | 7.36 | 7.78 | 8.68 | 9.64 | 14.18 | 0.60 / 0.75 |
| Lucene | 5.04 | 5.41 | 5.9 | 6.92 | 10.87 | 0.68 / 0.85 |
| FullLoading FAISS | 4.2 | 4.54 | 4.89 | 8.47 | 11.12 | 0.6 / 0.75 |

5.3. 10M Summary

For some reason, I was consistently getting low recall (around 10%) after indexing the 10M dataset on a single instance and performing a force merge with max_segment=1. Despite trying various efConstruction values, the results remained the same.

However, when I allowed the engine to keep around 70 segments instead of merging everything into one giant segment, recall improved to over 80%. In this case, Lucene outperformed FAISS with about 1.8x the throughput (31.42 ops/s vs. 17.31 ops/s). Interestingly, 'Partial Loading - Java' was only 7% to 14% slower than 'FullLoading FAISS' (15.09 ops/s vs. 17.31 ops/s). I suspect this performance difference stems from how the index is organized.

I dug deeper into this and discovered that the number of vectors visited per query in FAISS is roughly twice that of Lucene. Lucene visits around 30,000 vectors per query, while FAISS visits between 60,000 and 75,000. This explains Lucene's faster performance, as it visits far fewer vectors and performs less distance computation. I suspect this difference is due to the quality of the built index, and I believe that with a similarly high-quality index, 'Partial Loading - Java' will deliver performance comparable to Lucene's.
It seems odd that partial loading performed similarly with a 1M dataset but experienced a significant performance drop with a 10M dataset.

Nonetheless, I believe this experiment still provides a useful proxy (though not perfect), showing that if we can implement partial loading in Java, we can achieve similar performance to 'FullLoading FAISS', assuming IO costs are not a significant factor. (Note that as more IO is involved, the performance gap narrows. In this analysis, we are focusing on the lower bound of performance.)

Throughput (ops/s)

| Engine | Min Throughput | Mean Throughput | Median Throughput | Max Throughput | recall@k / recall@1 |
|---|---|---|---|---|---|
| Partial Loading - Java | 15.09 | 15.11 | 15.1 | 15.27 | 0.89 / 0.97 |
| Partial Loading - C++ | 6.36 | 6.36 | 6.36 | 6.41 | 0.89 / 0.97 |
| Lucene | 31.42 | 31.66 | 31.6 | 32.07 | 0.68 / 0.85 |
| FullLoading FAISS | 17.31 | 17.44 | 17.43 | 17.73 | 0.89 / 0.97 |

Latency (ms)

| Engine | 50% Latency | 90% Latency | 99% Latency | 99.9% Latency | 100% Latency | recall@k / recall@1 |
|---|---|---|---|---|---|---|
| Partial Loading - Java | 64.98 | 66.44 | 68.42 | 70.84 | 80.92 | 0.89 / 0.97 |
| Partial Loading - C++ | 155.78 | 158.74 | 162.79 | 189.62 | 208.32 | 0.89 / 0.97 |
| Lucene | 30.39 | 33.18 | 35.82 | 38.78 | 81.12 | 0.68 / 0.85 |
| FullLoading FAISS | 55.92 | 58.93 | 63.87 | 66.56 | 92.87 | 0.89 / 0.97 |

6. Conclusion

With the partial loading design, we can selectively decide whether to load all bytes into memory. When the mode is MEMORY_EFFICIENT, bytes are fetched via IndexInput on demand, eliminating the need to allocate memory space in the application.
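To illustrate the MEMORY_EFFICIENT idea, here is a minimal sketch of fetching a single vector through IndexInput on demand; the class and its fields are hypothetical, assuming a flat section of 32-bit float vectors starting at a known offset:

```java
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// Minimal sketch (hypothetical names): read one vector on demand instead of
// keeping all vectors on-heap. `baseOffset` marks where the flat 32-bit
// vector section starts within the FAISS file.
final class OnDemandVectorReader {
    private final IndexInput input;   // abstraction over the underlying storage
    private final long baseOffset;    // start of the flat vector section
    private final int dimension;

    OnDemandVectorReader(IndexInput input, long baseOffset, int dimension) {
        this.input = input;
        this.baseOffset = baseOffset;
        this.dimension = dimension;
    }

    // Fetch the vector with the given internal id; only dimension * 4 bytes are read.
    float[] readVector(int internalId, float[] reusableBuffer) throws IOException {
        input.seek(baseOffset + (long) internalId * dimension * Float.BYTES);
        input.readFloats(reusableBuffer, 0, dimension);
        return reusableBuffer;
    }
}
```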

From the benchmark results, we observe two things:

  1. Partial loading implemented in Java will give performance comparable to Lucene.
  2. Warm-up performance is much better than with 'FullLoading FAISS'.

7. Discussion Points

7.1. getSizeInKB

For the MEMORY_EFFICIENT partial loading mode, should we return 0 from the getSizeInKB method in NativeMemoryAllocation? I believe so for most cases, but FaissIndexInputStorage<T> really depends on the IndexInput implementation, which can allocate memory blocks and hold them internally. This can be rare, but it is technically not zero.
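To make this concrete, here is a minimal sketch of the size accounting under the two modes; all class, field, and method names below are hypothetical, not the actual NativeMemoryAllocation API:

```java
// Minimal sketch (hypothetical names): size accounting for the two modes.
// An IndexInput-backed storage usually holds no vector data on-heap, but a
// caching IndexInput could, so we let it report what it actually pins.
final class PartialLoadAllocation {
    enum Mode { DISABLED, MEMORY_EFFICIENT }

    private final Mode mode;
    private final long fullIndexSizeInKB;   // size when the index is fully loaded off-heap
    private final long storageOnHeapBytes;  // usually 0; non-zero only for caching IndexInputs

    PartialLoadAllocation(Mode mode, long fullIndexSizeInKB, long storageOnHeapBytes) {
        this.mode = mode;
        this.fullIndexSizeInKB = fullIndexSizeInKB;
        this.storageOnHeapBytes = storageOnHeapBytes;
    }

    long getSizeInKB() {
        if (mode == Mode.MEMORY_EFFICIENT) {
            return storageOnHeapBytes / 1024; // typically 0, but technically not guaranteed
        }
        return fullIndexSizeInKB;
    }
}
```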

7.2. Setting Update

To update the partial loading mode, the user must close the index, update the mode, and then reopen the index. I don't think it necessarily needs to be dynamically configurable, but I'm open to making it dynamic.

@navneet1v
Collaborator

I dug deeper into this and discovered that the number of vectors visited per query in FAISS is roughly twice that of Lucene. Lucene visits around 30,000 vectors per query, while FAISS visits between 60,000 and 75,000 vectors. This explains the faster performance in Lucene, as it visits far fewer vectors and performs fewer distance computations. I suspect this difference is due to the quality of the index built, and I believe that with a similarly high-quality index, 'Partial Loading - Java' will deliver performance comparable to Lucene's.

@0ctopus13prime this is an interesting finding. I think we should dig into this more. Let's open a GH issue for a further deep-dive.

cc: @vamshin , @jmazanec15 , @kotwanikunal

@kotwanikunal
Member

@0ctopus13prime Great work on the RFC. The code dive and walkthrough are amazing.
I had some questions on the 10M benchmarking section.

However, when I allowed the engine to keep around 70 segments instead of merging everything into one giant segment, recall improved to over 80%. In this case, Lucene outperformed FAISS by roughly 1.8x (31.42 ops/s vs. 17.31 ops/s). Interestingly, 'Partial Loading - Java' was only 7% to 14% slower than 'FullLoading FAISS' (15.09 ops/s vs. 17.31 ops/s). I suspect this performance difference stems from how the index is organized.

In this case, the Lucene numbers you are comparing with are standard Lucene runs, which are also covered as part of https://opensearch.org/benchmarks/, right?

I dug deeper into this and discovered that the number of vectors visited per query in FAISS is roughly twice that of Lucene. Lucene visits around 30,000 vectors per query, while FAISS visits between 60,000 and 75,000 vectors. This explains the faster performance in Lucene, as it visits far fewer vectors and performs fewer distance computations. I suspect this difference is due to the quality of the index built, and I believe that with a similarly high-quality index, 'Partial Loading - Java' will deliver performance comparable to Lucene's.
It seems odd that partial loading performed similarly with a 1M dataset but experienced a significant performance drop with a 10M dataset.

Would this mean Lucene should perform better (given that it does less work) w.r.t. FAISS? I know there are fundamental differences, like a Java vs. C++ implementation, but purely in terms of work complexity, Lucene should be more optimal, right?

@0ctopus13prime
Collaborator Author

@navneet1v, @kotwanikunal
Thank you for the comments. I suspect this comes from the way the vector index is organized. Lucene's lower recall (0.68 compared to 0.89) seems to have made the graph search converge more quickly than FAISS's. Even though it does less work, the results may not meet the quality bar we're expecting.

Sorry for the confusion; let me rerun the benchmark and post the results to make sure all engines have the same recall and precision.

@0ctopus13prime
Collaborator Author

@navneet1v
Created a new issue to identify the inherent indexing logic differences between Lucene and FAISS: #2407

@jed326
Contributor

jed326 commented Jan 21, 2025

In this case, the Lucene numbers you are comparing with are standard Lucene runs, which are also covered as part of https://opensearch.org/benchmarks/, right?

I had a naive question on this -- from this dashboard, the Lucene 10m Cohere perf seems to be worse than the FAISS 10m Cohere perf on both search and indexing, which is the opposite of what we're discussing here. Are those perf runs unrelated to this discussion, or is there another piece I am missing?

@navneet1v
Collaborator

@jed326 Yes, you are correct; in the nightlies we saw high latencies. This is a known bug we have in the nightlies. @shatejas can add more here; I know he was doing some deep-dive around this.

@sohami

sohami commented Jan 21, 2025

@0ctopus13prime Thanks for the great and detailed writeup. I have a couple of questions here:

However, recall and precision will remain at the same level. If this becomes a significant concern, I can port the FAISS HNSW search logic to Java. I've tested this as well, and it yielded the same performance as Lucene.

Any reason not to use the FAISS search algorithm and instead use the Lucene one under the hood? It seems to me that it will be confusing for users to see different results based on the loading mode, especially when the data representation and engine are the same. Also, it looks like recall is affected even for cases with a single segment? One use case I am thinking of is changing the partial loading mode while tiering the index to a cheaper tier (i.e., warm tier). It will be confusing if search returns different results in different tiers.

7.2. Setting Update
To update the partial loading mode, the user must close the index, update the mode, and then reopen the index. I don't think it necessarily needs to be dynamically configurable, but I'm open to making it dynamic.

In a similar tiering use case like the one above, it would be useful to update the partial loading mode setting dynamically to avoid downtime while tiering indices. Otherwise, live tiering will be difficult to achieve, as one needs to close the index, and the index will not be searchable during that time. We can treat the field-level option as an override of the index-level setting. Also, I am not sure the field level is really needed currently, since it will be complicated for a user to think about each field. They will think from a use-case perspective and make trade-offs between performance and cost based on the resources available to support the use case. So providing it at the index level will fit most users' requirements. Another way to look at it: today, users make similar choices while choosing between the FAISS and Lucene engines during index creation and understand the trade-off upfront. Similarly, they can make the same trade-off at the index level with the loading mode.

@navneet1v
Collaborator

7.2. Setting Update

To update the partial loading mode, the user must close the index, update the mode, and then reopen the index. I don't think it necessarily needs to be dynamically configurable, but I'm open to making it dynamic.

I don't think closing and updating the index is necessary. Right now we don't allow any mapping attribute of the k-NN field to be updatable, but we could very well do that and avoid this closing and reopening of the index.

Otherwise, live tiering will be difficult to achieve, as one needs to close the index, and the index will not be searchable during that time. We can treat the field-level option as an override of the index-level setting. Also, I am not sure the field level is really needed currently, since it will be complicated for a user to think about each field.

Also, I kind of agree on a few points here with @sohami. I would like to keep this setting at the index level to start with, and move it to the field level only when there is a customer need. Putting this at the field level sounds like a very attractive option, but updating the value is always painful, as it requires building a complex mapping request.

Another way to look at it: today, users make similar choices while choosing between the FAISS and Lucene engines during index creation and understand the trade-off upfront. Similarly, they can make the same trade-off at the index level with the loading mode.

Just to clarify here @sohami, users don't make the engine choice at the index level; they make the choice at the field level. In an index you can very well have multiple vector fields with a different engine on each field.

However, recall and precision will remain at the same level. If this becomes a significant concern, I can port the FAISS HNSW search logic to Java. I've tested this as well, and it yielded the same performance as Lucene.

Any reason not to use the FAISS search algorithm and instead use the Lucene one under the hood? It seems to me that it will be confusing for users to see different results based on the loading mode, especially when the data representation and engine are the same.

+1 on this. But while reading the RFC, I see that in some places you mention using the Lucene search algorithm and in others adopting the FAISS search algorithm. Can you please clarify this? See the quoted sentences from the RFC below.

During search, if partial loading is enabled, we can delegate the search to FaissIndex. Otherwise, it will fall back to the default behavior. In the case of FaissHNSWFlatIndex, it will execute the HNSW searching logic ported from FAISS to Java. Other index types, such as IVF, will be supported shortly.

Along with that, and I think we have discussed this in the past:

  1. FAISS filtered search and Lucene search are a little different. This can also lead to different results when the partial loading mode differs. The only way to get around this would be to use the same search logic for all loading modes.
  2. Since the Lucene library controls both the reading and writing of the HNSW graph, its search will always be compatible, but that might not be true for us, and it may end up breaking the search logic for a FAISS index with partial loading.

But having said that, there are pros to using the Lucene HNSWGraphSearcher algorithm:

  1. We get a lot of optimizations for free, like Multi-KNNCollector, SeededKNNQueries, etc.

I believe we should dig a little deeper here to see what the right strategy should be.


On the performance benchmarks, I am having a hard time understanding what is being benchmarked and how to interpret the results, so can you please add the following details, which will help in understanding the benchmarks properly:

  1. Number of data nodes used
  2. Heap size of data nodes, as I see JVM Args : -Xms100g -Xmx100g, which is very big
  3. Total RAM of data nodes
  4. Number of shards per node
  5. Number of replicas

@sohami

sohami commented Jan 23, 2025

Another way to look at it: today, users make similar choices while choosing between the FAISS and Lucene engines during index creation and understand the trade-off upfront. Similarly, they can make the same trade-off at the index level with the loading mode.

Just to clarify here @sohami, users don't make the engine choice at the index level; they make the choice at the field level. In an index you can very well have multiple vector fields with a different engine on each field.

Probably a separate discussion, but why is it at the field level? Do users really use a different engine per field?

@navneet1v
Collaborator

navneet1v commented Jan 23, 2025

Another way to look at it: today, users make similar choices while choosing between the FAISS and Lucene engines during index creation and understand the trade-off upfront. Similarly, they can make the same trade-off at the index level with the loading mode.

Just to clarify here @sohami, users don't make the engine choice at the index level; they make the choice at the field level. In an index you can very well have multiple vector fields with a different engine on each field.

Probably a separate discussion, but why is it at the field level? Do users really use a different engine per field?

Since the k-NN plugin supports multiple vector fields, putting the engine at the index level is not a good option, since it limits usability. A simple example: a user may want to use a native engine with the IVF algorithm and quantization for a higher-dimensional vector field, but the Lucene engine for a smaller-dimensional one. There are multiple other use cases out there.

Feel free to cut a GH issue to kick off the conversation if you think the engine should be an index-level setting. :) It would be an interesting discussion to have.

@sohami

sohami commented Jan 27, 2025

@0ctopus13prime Maybe this could be something that can be used if a Java wrapper for FAISS is available. We can probably see if the partial loading mechanism can be added there directly.

@shatejas
Collaborator

shatejas commented Feb 5, 2025

By leveraging Lucene’s HNSW searching logic to operate on the FAISS index, we can benefit from these improvements without additional effort. One good example is the prefetch optimization during search, which triggers the underlying storage to load bytes asynchronously. (A good opportunity for the kernel or a network file system daemon to load bytes in parallel.) This reduces I/O wait times and decreases search latency.

I don't think Lucene prefetches during vector search. Vector search in Lucene specifically uses the madvise MADV_RANDOM system call to tell the kernel not to prefetch, which optimizes HNSW search given the random-access nature of graph traversal.

public static FaissIndex load(IndexInput input) throws IOException {
    final String indexName = readFourBytes(input);
    if ("IxMp".equals(indexName)) {
        // Maps to IdMapIndex
        final FaissIdMapIndex idMapIndex = new FaissIdMapIndex();
        idMapIndex.readCommonHeader(input);

        // Partially load a nested index recursively.
        idMapIndex.setNestedIndex(load(input));

        // Mark the `list` section.
        idMapIndex.partialLoadIdListSection(input);
        return idMapIndex;
    } else if ("IHNf".equals(indexName)) {
        // Maps to HNSWFlatIndex
        final FaissHNSWFlatIndex hnswFlatIndex = new FaissHNSWFlatIndex();
        hnswFlatIndex.readCommonHeader(input);

We should rethink this if-else structure; it can quickly become hard to maintain. A strategy pattern would be helpful here (see the sketch below).
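To illustrate the suggestion, here is one possible shape of such a strategy pattern; the loader registry and the partiallyLoad factory methods below are hypothetical, built on the types named in the quoted snippet:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.lucene.store.IndexInput;

// One possible shape of the strategy pattern (illustrative names): map each
// four-byte FAISS index tag to a dedicated loader, so supporting a new index
// type means registering one entry instead of growing an if-else chain.
interface FaissIndexLoader {
    FaissIndex load(IndexInput input) throws IOException;
}

final class FaissIndexLoaders {
    // The partiallyLoad factory methods are hypothetical.
    private static final Map<String, FaissIndexLoader> LOADERS = Map.of(
            "IxMp", FaissIdMapIndex::partiallyLoad,
            "IHNf", FaissHNSWFlatIndex::partiallyLoad);

    static FaissIndex load(IndexInput input) throws IOException {
        final byte[] tagBytes = new byte[4];
        input.readBytes(tagBytes, 0, 4);
        final String indexTag = new String(tagBytes, StandardCharsets.US_ASCII);
        final FaissIndexLoader loader = LOADERS.get(indexTag);
        if (loader == null) {
            throw new IOException("Unsupported FAISS index type: " + indexTag);
        }
        return loader.load(input);
    }
}
```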

I propose extending the KNNWeight::doANNSearch method to branch based on whether partial loading is enabled.

Have you looked into introducing an interface here, considering these are two different implementations? If not, we should look into it instead of branching with if-else and adding code in the same file (a rough sketch follows below).
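And a rough sketch of the interface idea; all names here are illustrative, with the point being that KNNWeight resolves a single strategy up front rather than branching inside doANNSearch:

```java
import java.io.IOException;
import org.apache.lucene.search.TopDocs;

// Illustrative sketch: one search interface, two implementations, chosen once
// up front so the weight's search path does not branch internally.
interface AnnSearchStrategy {
    TopDocs search(float[] queryVector, int k) throws IOException;
}

final class AnnSearchStrategies {
    // Hypothetical resolver; the partial-loading flag would come from the field's mapping.
    static AnnSearchStrategy resolve(boolean partialLoadingEnabled,
                                     AnnSearchStrategy fullLoadViaJni,
                                     AnnSearchStrategy partialLoadViaIndexInput) {
        return partialLoadingEnabled ? partialLoadViaIndexInput : fullLoadViaJni;
    }
}
```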

For the long term, we should consider refactoring this logic into KNNQuery instead of keeping it in KNNWeight.

Any effect on NativeEngineKnnVectorQuery with partial loading?

@0ctopus13prime
Collaborator Author

0ctopus13prime commented Feb 13, 2025

Hi @sohami, regarding the setting, I'm open to having it at either the field level or the index level. I will follow other people's opinions on that.

For this one

Any reason not to use the FAISS search algorithm and instead use the Lucene one under the hood? It seems to me that it will be confusing for users to see different results based on the loading mode, especially when the data representation and engine are the same.

Answer: I'm sure people will pick up the difference very quickly. Fundamentally, it is not that different from how Lucene uses Directory as a storage layer under the hood. Users who understand the current OpenSearch KNN will quickly understand the concept of 'partial loading': using IndexInput to load bytes on demand versus loading everything off-heap.

Also, it looks like recall is affected even for cases with a single segment? One use case I am thinking of is changing the partial loading mode while tiering the index to a cheaper tier (i.e., warm tier). It will be confusing if search returns different results in different tiers.

Answer: If we pour a large amount of data into a single segment, recall will likely be affected. In that case, the queue size during search should be adjusted. What I can guarantee with partial loading is that it will always produce the same results, as long as the same hyperparameters that were used for FAISS are provided.
Put simply, an unexpected result does not indicate that something is wrong within partial loading; if anything, it indicates that the hyperparameters need adjustment. Even if the internal mode were configured as baseline FAISS, which loads everything into memory, it would get the same result. If recall is not high enough, the user should adjust efSearch.

@0ctopus13prime
Collaborator Author

0ctopus13prime commented Feb 13, 2025

@navneet1v

For the setting: I'm truly open to putting it at either the field level or the index level. I will follow other folks' opinions.

FAISS filtered search and Lucene search are a little different. This can also lead to different results when the partial loading mode differs. The only way to get around this would be to use the same search logic for all loading modes.
Since the Lucene library controls both the reading and writing of the HNSW graph, its search will always be compatible, but that might not be true for us, and it may end up breaking the search logic for a FAISS index with partial loading.

Answer: No, it is not likely. The FAISS file format will guarantee compatibility, so already-working logic is guaranteed to locate and load the file. The issue is when a new file format is introduced. In that case, we need to catch up, but that is different from the breaking you mentioned.

But having said that, there are pros to using the Lucene HNSWGraphSearcher algorithm:
We get a lot of optimizations for free, like Multi-KNNCollector, SeededKNNQueries, etc.
I believe we should dig a little deeper here to see what the right strategy should be.

Answer: This is not true. A multi-KNN collector can benefit both cases, regardless of whether we use Lucene's HNSWGraphSearcher or the FAISS-ported one.
Even for HNSWGraphSearcher, we still need code to create and initialize the multi collector and then pass it to the searcher.
Equally, that holds for the FAISS-ported one.
Therefore, as long as we can introduce a multi collector, we can benefit from it regardless of the internal implementation.

But as we discussed in the past, I'm hoping to have the FAISS-ported version so that, regardless of the mode, users always get the same result as far as possible. One callout brought up in the past discussion was the cost of keeping up with Lucene API changes over time; we want to avoid this.

On the performance benchmarks, I am having a hard time understanding what is being benchmarked and how to interpret the results, so can you please add the following details, which will help in understanding the benchmarks properly:

Answer:
  • Number of data nodes used: 1
  • Heap size of data nodes: -Xms100g -Xmx100g (only half of the physical RAM, which is 256 GB)
  • Total RAM of data nodes: 256 GB
  • Number of shards per node: 1
  • Number of replicas: 0

Sorry, but could you drop your questions directly here? I can answer them.

@0ctopus13prime
Collaborator Author

@sohami

@0ctopus13prime Maybe apache/lucene#12615 (comment) could be something that can be used if a Java wrapper for FAISS is available. We can probably see if the partial loading mechanism can be added there directly.

Answer: No, it's a wrapper; we could leverage it to minimize the JNI layer, but it still loads everything into memory.
The design objectives of the two are different: the one you shared wants to integrate FAISS within Lucene, while we want to give users the option to run FAISS in a memory-constrained environment.

@0ctopus13prime
Collaborator Author

0ctopus13prime commented Feb 13, 2025

@shatejas
For the prefetch:
Hmm, let me dig into this more. If I can't find prefetch in HNSW searching, I will correct the doc.

We should rethink this if-else structure; it can quickly become hard to maintain. A strategy pattern would be helpful here.

Answer: Sure, happy to discuss in the PR.

Have you looked into introducing an interface here, considering these are two different implementations? If not, we should look into it instead of branching with if-else and adding code in the same file.

Answer: Sure, I think we already started the conversation; happy to discuss more. I generally agree with you.

Any effect on NativeEngineKnnVectorQuery with partial loading?

Answer: I don't think there will be any side effect on it. I wish we could have the collector enabled within createWeight, like how Lucene does nowadays. - Code
