Semantic Search PDF Files Locally Using .NET / C# And Build5Nines.SharpVector

The ability to extract and semantically search through unstructured documents is becoming not just a convenience, but a necessity. This is especially true for software developers working with large repositories of technical documentation, compliance reports, or knowledge bases. In this article, we’ll explore how to build a semantic search solution using C#, PDFs, and the Build5Nines.SharpVector library—a tool designed to bring simple and powerful vector search capabilities to your applications.

This article walks through how to ingest the contents of PDF documents using the PdfPig NuGet package, generate text vector embeddings in memory using Build5Nines.SharpVector, and run semantic search queries using natural language. This scenario is increasingly common in Generative AI and Retrieval Augmented Generation (RAG) solutions as adoption of AI and LLMs grows.

Let’s get started!

Understanding Semantic Search with Build5Nines.SharpVector

To implement a semantic search capability in C#, we’ll need to understand two core pieces:

Vector Embeddings: These are high-dimensional numerical representations of text, capturing semantic meaning rather than just text keywords.
Vector Search: This retrieves results based on semantic similarity rather than simple keyword matching.

The Build5Nines.SharpVector library is a lightweight, in-memory vector database built for .NET that supports semantic text similarity search. By default it generates vectors locally, and it can integrate with OpenAI, Azure OpenAI, or Ollama embeddings models for more robust text vector generation. Once data is loaded and vectorized, a search compares the vector of the query against the vector of each stored text using cosine similarity and returns the matching texts, each with a similarity score.
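To make the cosine similarity step concrete, here is a minimal, self-contained sketch. It is not the library's implementation: the `CosineSimilarity` helper and the three-dimensional toy vectors are made up for illustration, and real embeddings have far more dimensions.

```csharp
using System;

// Cosine similarity: dot(a, b) / (|a| * |b|).
// Hypothetical helper for illustration; not part of Build5Nines.SharpVector.
static float CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, magA = 0, magB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }

    // A zero vector has no direction, so treat it as "no similarity".
    if (magA == 0 || magB == 0) return 0f;
    return (float)(dot / (Math.Sqrt(magA) * Math.Sqrt(magB)));
}

// Toy "embeddings" standing in for vectorized text.
float[] query = { 1f, 2f, 0f };
float[] similarDoc = { 2f, 4f, 0f };   // points in the same direction as the query
float[] unrelatedDoc = { 0f, 0f, 5f }; // orthogonal to the query

Console.WriteLine(CosineSimilarity(query, similarDoc));   // close to 1
Console.WriteLine(CosineSimilarity(query, unrelatedDoc)); // 0
```

A score near 1 means the two vectors point in nearly the same direction, which is why a higher threshold returns fewer, closer matches.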

Here’s a quick example of how SharpVector works:

using Build5Nines.SharpVector;

// create vector database, with 'string' metadata
var db = new BasicMemoryVectorDatabase();

// Add text document to vector database with a string of metadata
db.AddText("The cloud is a model for enabling ubiquitous access...", "Page 1");
// ... load more text data ...

// Perform semantic search on text in database
var results = db.Search("What is the cloud?", threshold: 0.5f);
// 'threshold' of '0.5' means only results with a similarity score of 0.5 or greater will be returned

// Loop through search results
foreach (var result in results.Texts) {
  var text = result.Text;
  var metadata = result.Metadata;
  var similarity = result.VectorComparison;
}

Additionally, a NuGet package like PdfPig can be used to extract text from PDFs page by page, so that each page’s text can be stored with its page number as metadata for semantic search and precise retrieval.

Setting Up Your Environment

Before we build the solution, you’ll need to install the required tools and libraries. Follow these steps to get started.

Prerequisites

.NET 8.0 SDK or later

Install Required NuGet Packages

Run the following commands in your terminal or package manager console to add the Build5Nines.SharpVector and PdfPig NuGet package references to the application:

dotnet add package Build5Nines.SharpVector
dotnet add package PdfPig

Now you’re ready to start coding.

Building the Solution

Let’s now build a working semantic search utility for PDF documents that includes the following functionality:

Loading and embedding PDF contents
Searching semantically using queries
Returning metadata and similarity scores

Create the Vector Database

To start, create a new in-memory, text vector database using Build5Nines.SharpVector:

using Build5Nines.SharpVector;

var vdb = new BasicMemoryVectorDatabase();

The Build5Nines.SharpVector project site contains full documentation on its usage, covering more advanced scenarios including OpenAI, Azure OpenAI, and Ollama embeddings model support.

Read / Load the PDF File

Now let’s open the PDF using PdfPig and add the text of each page to the vector database, using the page number as the metadata for that text:

// Open PDF file with PdfPig
using (var pdfDocument = UglyToad.PdfPig.PdfDocument.Open("document.pdf"))
{
    // Loop through the pages of the document
    foreach (var page in pdfDocument.GetPages())
    {
        // define the Metadata value setting it to the Page Number
        var metadata = page.Number.ToString();

        // Add the Text to the vector database; along with the metadata
        vdb.AddText(page.Text, metadata);
    }
}

This code performs the following tasks:

Opens the PDF with PdfPig.
Loops through each page, extracting text.
Adds page text to the vector DB with the page number as metadata.

With the PDF document text loaded into the vector database, semantic searching is now enabled!

Perform Semantic Search

Here are a couple of basic examples of performing a semantic search on the PDF text loaded into the Build5Nines.SharpVector database:

// Semantic search returning only matches with a high similarity score
var query = "Azure ML";
var semanticResults = vdb.Search(query, threshold: 0.6f);

// Semantic search returning only the top 3 matches with the highest similarity score
// (a separate variable name, since 'semanticResults' is already declared above)
var topResults = vdb.Search(query, pageCount: 3);
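The difference between filtering by threshold and limiting to a top N can be sketched without the library. The snippet below uses made-up page numbers and similarity scores and plain LINQ; it illustrates the two filtering styles only and makes no claim about how the library applies them internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical (page, similarity) pairs standing in for scored search results.
var scored = new List<(string Page, float Similarity)>
{
    ("1", 0.82f), ("2", 0.35f), ("3", 0.61f), ("4", 0.58f), ("5", 0.12f)
};

// Threshold-style filtering: keep every match at or above the cutoff.
var aboveThreshold = scored
    .Where(r => r.Similarity >= 0.6f)
    .OrderByDescending(r => r.Similarity)
    .ToList();

// Top-N-style filtering: keep the N best matches, however low they score.
var topThree = scored
    .OrderByDescending(r => r.Similarity)
    .Take(3)
    .ToList();

Console.WriteLine(string.Join(", ", aboveThreshold.Select(r => r.Page))); // 1, 3
Console.WriteLine(string.Join(", ", topThree.Select(r => r.Page)));       // 1, 3, 4
```

A threshold can return zero results when nothing is similar enough, while a top-N limit always returns something if the database is not empty, so the right choice depends on whether a weak match is better than no match.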

Loop Through Semantic Search Results

Once the semantic search is performed, the search results expose the matched text and its associated metadata:

// Loop through semantic search results
foreach (var result in semanticResults.Texts)
{
    // Access data
    var text = result.Text;
    var metadata = result.Metadata;
    var similarity = result.VectorComparison;

    // do something; like pass text results to Generative AI + RAG + LLM for more intelligent AI solution

    Console.WriteLine($" - Page: {result.Metadata} - Similarity: {result.VectorComparison}");
}

Now that you have the semantic search results from the PDF document, the text data can then be passed to a Generative AI + RAG solution for building more intelligent AI solutions, or some other task like showing the user what pages in the PDF document are relevant to their search query.
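As one possible way to hand the results to an LLM, the sketch below assembles a grounded prompt from the best-scoring pages. Everything in it is an assumption made for illustration: the `BuildPrompt` helper, the prompt wording, and the sample matches, which stand in for the `Text`, `Metadata`, and `VectorComparison` values read in the loop above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper: builds a prompt from the highest-scoring matches.
static string BuildPrompt(
    string question,
    IEnumerable<(string Text, string Page, float Similarity)> matches,
    int maxMatches)
{
    var sb = new StringBuilder();
    sb.AppendLine("Answer the question using only the context below.");
    sb.AppendLine();

    // Most similar pages first, capped so the prompt stays small.
    foreach (var match in matches.OrderByDescending(x => x.Similarity).Take(maxMatches))
    {
        sb.AppendLine($"[Page {match.Page}]");
        sb.AppendLine(match.Text);
        sb.AppendLine();
    }

    sb.AppendLine($"Question: {question}");
    return sb.ToString();
}

// Made-up matches standing in for semantic search results.
var sampleMatches = new List<(string Text, string Page, float Similarity)>
{
    ("Azure ML is a service for training models.", "3", 0.74f),
    ("Billing is calculated per hour.", "9", 0.21f),
    ("Pipelines automate model deployment.", "4", 0.66f)
};

var prompt = BuildPrompt("What is Azure ML?", sampleMatches, maxMatches: 2);
Console.WriteLine(prompt);
```

Keeping the page number next to each passage lets the model's answer, or your UI, point the user back to the relevant pages.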

Further Optimization

While this solution is powerful, a few improvements could still be made to ensure robustness and performance:

Text Chunking: Search quality for large PDF files can be improved by breaking pages into paragraphs or logical sections, so that each stored text is smaller and more focused.
Embedding Quality: Integrate OpenAI, Azure OpenAI, or Ollama embedding models to increase the accuracy of semantic search.
Metadata Enrichment: Include section titles, document names, or topics using the more customized metadata supported by Build5Nines.SharpVector.
Logging Similarity Scores: Track low-score results to tune the threshold parameter and better understand what users are searching for.
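For the text chunking idea, one simple approach is a fixed-size word window with a small overlap. The sketch below is an assumption-laden illustration rather than a feature of either library: the `ChunkByWords` helper and the window sizes are hypothetical, and extracted PDF text may not contain reliable paragraph breaks, which is why it splits on words instead.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits text into chunks of at most maxWords words,
// repeating the last overlapWords words of each chunk at the start of the next.
static List<string> ChunkByWords(string text, int maxWords, int overlapWords)
{
    if (maxWords <= 0)
        throw new ArgumentOutOfRangeException(nameof(maxWords));
    if (overlapWords < 0 || overlapWords >= maxWords)
        throw new ArgumentOutOfRangeException(nameof(overlapWords));

    var words = text.Split(
        new[] { ' ', '\t', '\r', '\n' },
        StringSplitOptions.RemoveEmptyEntries);

    var chunks = new List<string>();
    int step = maxWords - overlapWords;

    for (int start = 0; start < words.Length; start += step)
    {
        int count = Math.Min(maxWords, words.Length - start);
        chunks.Add(string.Join(" ", words, start, count));

        // Stop once a chunk has reached the end of the text.
        if (start + count >= words.Length) break;
    }

    return chunks;
}

// Tiny example; real chunks would typically be much larger.
var pageText = "one two three four five six seven";
var pageChunks = ChunkByWords(pageText, maxWords: 3, overlapWords: 1);

foreach (var chunk in pageChunks)
{
    Console.WriteLine(chunk);
}
// Prints:
// one two three
// three four five
// five six seven
```

Each chunk could then be added with `vdb.AddText(chunk, metadata)` in place of the whole page, keeping the page number as metadata so results still point back to the right page.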

If the solution is not getting relevant semantic search results:

OCR errors may be occurring (if the PDF is scanned), and a cloud service like Microsoft Azure Document Intelligence may provide more robust text extraction from the PDF files.
The built-in, local embeddings vectorization of Build5Nines.SharpVector may not be accurate enough for your content, in which case integrating with OpenAI, Azure OpenAI, or Ollama embeddings models can help.

Additionally, if you need to load PDF files once and then search them many times, the Data Persistence functionality of Build5Nines.SharpVector can be used to save the database to a file and load it again later. This lets you reuse the stored PDF vectors instead of re-reading and re-vectorizing the PDF files on each run, which removes that loading latency from your searches.
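A load-or-build pattern for that persistence might look like the sketch below. Treat it as an assumption to verify against the version you install: it assumes the library exposes `SaveToFile` and `LoadFromFile` methods on the database, and the file name and sample text are hypothetical.

```csharp
using System;
using System.IO;
using Build5Nines.SharpVector;

// Hypothetical file name for the saved vector database.
var dbPath = "pdf-vectors.b59vdb";
var vdb = new BasicMemoryVectorDatabase();

if (File.Exists(dbPath))
{
    // Assumed API: restores previously saved text, metadata, and vectors.
    vdb.LoadFromFile(dbPath);
}
else
{
    // Stand-in for the PDF loading loop shown earlier in the article.
    vdb.AddText("The cloud is a model for enabling ubiquitous access.", "1");

    // Assumed API: writes the database to disk for later runs.
    vdb.SaveToFile(dbPath);
}

// Searching works the same way whether the data was loaded or freshly built.
var results = vdb.Search("What is the cloud?", pageCount: 3);
foreach (var result in results.Texts)
{
    Console.WriteLine($" - Page: {result.Metadata} - Similarity: {result.VectorComparison}");
}
```

With this shape, only the first run pays the cost of reading and vectorizing the PDF; delete the saved file whenever the source documents change so the database is rebuilt.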

Conclusion

This article walked through the steps to build a text semantic search that parses PDF files, vectorizes the text content, and retrieves relevant pages and information based on natural language queries. With Build5Nines.SharpVector and PdfPig, this process is efficient, flexible, and entirely local—making it a solid choice for offline search or sensitive document repositories.

Key Takeaways:

Semantic search improves information retrieval over traditional keyword-based search.
Build5Nines.SharpVector simplifies the embedding generation, semantic search and retrieval process.
Combining it with PdfPig creates a completely local path for building semantic search over PDF files.

As a next step, consider integrating this solution into an ASP.NET Core web app, or persisting your vector store to disk or cloud storage for scalability. You could even replace the default embedding engine with OpenAI or Azure OpenAI for enhanced accuracy and robustness.

Happy coding!

Original Article Source: Semantic Search PDF Files Locally using .NET / C# and Build5Nines.SharpVector written by Chris Pietschmann (If you're reading this somewhere other than Build5Nines.com, it was republished without permission.)