A Coding Implementation of DocSearchAgent: A Semantic Document Search Engine

In today’s information-rich world, finding relevant documents quickly is crucial. Traditional keyword-based search systems often fall short when dealing with semantic meaning. This tutorial demonstrates how to build a powerful document search engine using:

  1. Hugging Face’s embedding models to convert text into rich vector representations
  2. ChromaDB as our vector database for efficient similarity search
  3. Sentence transformers for high-quality text embeddings

This implementation enables semantic search capabilities: finding documents based on meaning rather than just keyword matching. By the end of this tutorial, you will have a working document search engine that can:

  • Process and embed text documents
  • Store those embeddings efficiently
  • Retrieve the most semantically similar documents for any query
  • Handle a range of document types and search needs

Follow the detailed steps below to implement DocSearchAgent.

First, we need to install the necessary libraries.

!pip install chromadb sentence-transformers langchain datasets

Let’s start by importing the libraries we’ll use:

import os
import numpy as np
import pandas as pd
from datasets import load_dataset
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
import time

For this tutorial, we use a subset of Wikipedia articles from the Hugging Face datasets library. This gives us a diverse set of documents to work with.

dataset = load_dataset("wikipedia", "20220301.en", split="train[:1000]")
print(f"Loaded {len(dataset)} Wikipedia articles")


documents = []
for i, article in enumerate(dataset):
   doc = {
       "id": f"doc_{i}",
       "title": article["title"],
       "text": article["text"],
       "url": article["url"]
   }
   documents.append(doc)


df = pd.DataFrame(documents)
df.head(3)
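Note: newer releases of the datasets library removed script-based builders, so the "wikipedia" loader above may fail there. A hedged alternative is the Hub-hosted mirror shown below (the dataset name and snapshot date are assumptions; verify availability on the Hub before relying on them):

# Fallback for newer `datasets` versions (assumed Hub mirror; verify the config)
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train[:1000]")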

Now let’s divide our documents into smaller chunks for more granular search:

text_splitter = RecursiveCharacterTextSplitter(
   chunk_size=1000,
   chunk_overlap=200,
   length_function=len,
)


chunks = []
chunk_ids = []
chunk_sources = []


for i, doc in enumerate(documents):
   doc_chunks = text_splitter.split_text(doc["text"])
   chunks.extend(doc_chunks)
   chunk_ids.extend([f"chunk_{i}_{j}" for j in range(len(doc_chunks))])
   chunk_sources.extend([doc["title"]] * len(doc_chunks))


print(f"Created {len(chunks)} chunks from {len(documents)} documents")

We use a pre-trained sentence transformer model from Hugging Face to create our embeddings:

model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(model_name)


sample_text = "This is a sample text to test our embedding model."
sample_embedding = embedding_model.encode(sample_text)
print(f"Embedding dimension: {len(sample_embedding)}")

Now, let’s set up ChromaDB, a lightweight vector database perfect for our search engine:

chroma_client = chromadb.Client()


embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)


collection = chroma_client.create_collection(
   name="document_search",
   embedding_function=embedding_function
)


batch_size = 100
for i in range(0, len(chunks), batch_size):
   end_idx = min(i + batch_size, len(chunks))
  
   batch_ids = chunk_ids[i:end_idx]
   batch_chunks = chunks[i:end_idx]
   batch_sources = chunk_sources[i:end_idx]
  
   collection.add(
       ids=batch_ids,
       documents=batch_chunks,
       metadatas=[{"source": source} for source in batch_sources]
   )
  
   print(f"Added batch {i//batch_size + 1}/{(len(chunks)-1)//batch_size + 1} to the collection")


print(f"Total documents in collection: {collection.count()}")

Now comes the exciting part – searching through our documents:

def search_documents(query, n_results=5):
   """
   Search for documents similar to the query.
  
   Args:
       query (str): The search query
       n_results (int): Number of results to return
  
   Returns:
       dict: Search results
   """
   start_time = time.time()
  
   results = collection.query(
       query_texts=[query],
       n_results=n_results
   )
  
   end_time = time.time()
   search_time = end_time - start_time
  
   print(f"Search completed in {search_time:.4f} seconds")
   return results


queries = [
   "What are the effects of climate change?",
   "History of artificial intelligence",
   "Space exploration missions"
]


for query in queries:
   print(f"\nQuery: {query}")
   results = search_documents(query)
  
   for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
       print(f"\nResult {i+1} from {metadata['source']}:")
       print(f"{doc[:200]}...") 

Let’s create a simple interactive interface to provide a better user experience:

def interactive_search():
   """
   Interactive search interface for the document search engine.
   """
   while True:
       query = input("\nEnter your search query (or 'quit' to exit): ")
      
       if query.lower() == 'quit':
           print("Exiting search interface...")
           break
          
       n_results = int(input("How many results would you like? "))
      
       results = search_documents(query, n_results)
      
       print(f"\nFound {len(results['documents'][0])} results for '{query}':")
      
       for i, (doc, metadata, distance) in enumerate(zip(
           results['documents'][0],
           results['metadatas'][0],
           results['distances'][0]
       )):
           relevance = 1 - distance  
           print(f"\n--- Result {i+1} ---")
           print(f"Source: {metadata['source']}")
           print(f"Relevance: {relevance:.2f}")
           print(f"Excerpt: {doc[:300]}...")  
           print("-" * 50)


interactive_search()
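A note on the relevance score above: relevance = 1 - distance only behaves as expected when the collection uses cosine distance, while ChromaDB’s HNSW index defaults to squared L2, where distances can exceed 1 and the score can go negative. If you rely on that conversion, you can request cosine distance explicitly at creation time, as in this sketch:

# Sketch: create the collection with cosine distance so 1 - distance stays meaningful
cosine_collection = chroma_client.create_collection(
    name="document_search_cosine",
    embedding_function=embedding_function,
    metadata={"hnsw:space": "cosine"},  # supported spaces: "l2", "ip", "cosine"
)
# (documents would need to be added to this collection as before)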

Let’s add the ability to filter our search results by metadata:

def filtered_search(query, filter_source=None, n_results=5):
   """
   Search with optional filtering by source.
  
   Args:
       query (str): The search query
       filter_source (str): Optional source to filter by
       n_results (int): Number of results to return
  
   Returns:
       dict: Search results
   """
   where_clause = {"source": filter_source} if filter_source else None
  
   results = collection.query(
       query_texts=[query],
       n_results=n_results,
       where=where_clause
   )
  
   return results


unique_sources = list(set(chunk_sources))
print(f"Available sources for filtering: {len(unique_sources)}")
print(unique_sources[:5])  


if len(unique_sources) > 0:
   filter_source = unique_sources[0]
   query = "main concepts and principles"
  
   print(f"\nFiltered search for '{query}' in source '{filter_source}':")
   results = filtered_search(query, filter_source=filter_source)
  
   for i, doc in enumerate(results['documents'][0]):
       print(f"\nResult {i+1}:")
       print(f"{doc[:200]}...") 

Finally, we have demonstrated how to build a semantic document search engine using Hugging Face embedding models and ChromaDB. The system retrieves documents based on meaning rather than just keywords by converting text into vector representations. The implementation processes Wikipedia articles, chunks them for granularity, embeds them with sentence transformers, and stores them in a vector database for efficient retrieval. The final product features interactive search, metadata filtering, and relevance ranking.




