Building Your Own AI Q&A Bot for Web Pages Using Open Source AI Models

In today’s information-rich digital landscape, navigating extensive web content can be overwhelming. Whether you are researching a project, studying complex material, or trying to extract specific information from long articles, the process can be time-consuming and inefficient. This is where an AI-driven question-answering (Q&A) bot becomes invaluable.

This tutorial will guide you through building a practical AI Q&A system that can analyze web page content and answer specific questions. Instead of relying on expensive API services, we use open source models from Hugging Face to create a solution that:

  • Is completely free to use
  • Runs in Google Colab (no local setup required)
  • Can be adapted to your specific needs
  • Is built on state-of-the-art NLP technology

By the end of this tutorial, you will have a functional web Q&A system that can help you extract insights from online content more effectively.

What we build

We create a system that:

  1. Takes a URL as input
  2. Extracts and processes the web page content
  3. Accepts natural language questions about the content
  4. Provides accurate, contextual answers based on the webpage
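
To make the flow concrete, here is a minimal sketch of how the finished pieces fit together (extract_text_from_url and answer_question are the two functions we will build in Steps 2 and 4; the URL is just a placeholder):

# High-level flow of the finished system
url = "https://example.com/article"                   # Any article URL
context = extract_text_from_url(url)                  # Step 2: fetch and clean the page text
answer = answer_question("What is this page about?", context)  # Step 4: run the QA model
print(answer)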

Prerequisites

  • A Google account to access Google Colab
  • Basic understanding of Python
  • No previous knowledge of machine learning required

Step 1: Setting up the environment

First, go to Google Colab and create a new notebook.

Let’s start by installing the necessary libraries:

# Install required packages

!pip install transformers torch beautifulsoup4 requests

This installs:

  • transformers: Hugging Face’s library for state-of-the-art NLP models
  • torch: the PyTorch deep learning framework
  • beautifulsoup4: for parsing HTML and extracting web content
  • requests: for making HTTP requests to fetch web pages
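
If you want to confirm everything installed correctly before moving on, an optional quick check is to print the library versions:

# Optional: verify the installations
import transformers, torch, bs4, requests
print(transformers.__version__, torch.__version__, bs4.__version__, requests.__version__)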

Step 2: Importing libraries and setting up helper functions

Now let’s import all the necessary libraries and define some helper functions:

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import requests
from bs4 import BeautifulSoup
import re
import textwrap

# Check if GPU is available

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Function to extract text from a webpage

def extract_text_from_url(url):
   try:
       headers = {
           'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
       }
       response = requests.get(url, headers=headers)
       response.raise_for_status()  
       soup = BeautifulSoup(response.text, 'html.parser')


        # Remove non-content elements (scripts, styles, navigation, etc.)
        for script_or_style in soup(['script', 'style', 'header', 'footer', 'nav']):
            script_or_style.decompose()

        # Get the raw text
        text = soup.get_text()

        # Break into lines, strip whitespace, and drop blank fragments
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = "\n".join(chunk for chunk in chunks if chunk)

        # Collapse any remaining whitespace runs into single spaces
        text = re.sub(r'\s+', ' ', text).strip()

        return text


   except Exception as e:
       print(f"Error extracting text from URL: {e}")
       return None

This code:

  1. Imports all the necessary libraries
  2. Sets up our device (GPU if available, otherwise CPU)
  3. Defines a function to extract readable text content from a webpage URL
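
Before wiring up the model, you can sanity-check the extractor on any public page (the URL below is just a placeholder):

# Quick test of the extractor
sample_text = extract_text_from_url("https://example.com")
if sample_text:
    print(sample_text[:200])  # Preview the first 200 characters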

Step 3: Loading the question-answering model

Now let’s load a pre-trained question-answering model from Hugging Face:

# Load pre-trained model and tokenizer

model_name = "deepset/roberta-base-squad2"


print(f"Loading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name).to(device)
print("Model loaded successfully!")

We use deepset/roberta-base-squad2, which is:

  • Based on the RoBERTa architecture (a robustly optimized BERT approach)
  • Fine-tuned on SQuAD 2.0 (the Stanford Question Answering Dataset)
  • A good balance between accuracy and speed for our task
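
As an optional sanity check that the model loaded correctly, you can run it on a toy example through Hugging Face’s high-level pipeline API, which wraps the model and tokenizer we just loaded:

from transformers import pipeline

# Sanity check the loaded model on a tiny hand-written context
qa = pipeline("question-answering", model=model, tokenizer=tokenizer,
              device=0 if torch.cuda.is_available() else -1)
print(qa(question="What is the capital of France?",
         context="Paris is the capital of France."))
# Expected: an answer of 'Paris' with a high confidence score

In Step 4 we build our own answering function instead of relying on the pipeline, so that we can control how long pages are split into chunks.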

Step 4: Implementing the question-answering function

Now let’s implement the core functionality: the ability to answer questions based on the extracted web page content:

def answer_question(question, context, max_length=512):
    # Reserve room for the question and special tokens; note that this chunk
    # size is in characters, a rough proxy for the model's 512-token limit
    max_chunk_size = max_length - len(tokenizer.encode(question)) - 5
    all_answers = []

    # Slide over the context in chunks so long pages still fit the model
    for i in range(0, len(context), max_chunk_size):
        chunk = context[i:i + max_chunk_size]

        inputs = tokenizer(
            question,
            chunk,
            add_special_tokens=True,
            return_tensors="pt",
            max_length=max_length,
            truncation=True
        ).to(device)

        with torch.no_grad():
            outputs = model(**inputs)

        # The model predicts start and end positions of the answer span
        answer_start = torch.argmax(outputs.start_logits)
        answer_end = torch.argmax(outputs.end_logits)

        # Use the summed logits as a confidence score for this chunk's answer
        start_score = outputs.start_logits[0][answer_start].item()
        end_score = outputs.end_logits[0][answer_end].item()
        score = start_score + end_score

        input_ids = inputs.input_ids.tolist()[0]
        tokens = tokenizer.convert_ids_to_tokens(input_ids)

        answer = tokenizer.convert_tokens_to_string(tokens[answer_start:answer_end+1])

        # Strip special tokens (RoBERTa uses <s>/</s>; BERT-style models use [CLS]/[SEP])
        for special in ("<s>", "</s>", "[CLS]", "[SEP]"):
            answer = answer.replace(special, "")
        answer = answer.strip()

        if answer and len(answer) > 2:
            all_answers.append((answer, score))

    # Return the highest-confidence answer across all chunks
    if all_answers:
        all_answers.sort(key=lambda x: x[1], reverse=True)
        return all_answers[0][0]
    else:
        return "I couldn't find an answer in the provided content."

This function:

  1. Takes a question and the web page content as input
  2. Handles long content by processing it in chunks
  3. Uses the model to predict the answer span (start and end positions)
  4. Processes multiple chunks and returns the answer with the highest confidence score
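
Before pointing it at a full web page, it is worth a quick smoke test of answer_question on a short hand-written context:

# Smoke test on a short context
context = "The Eiffel Tower was completed in 1889 and is located in Paris, France."
print(answer_question("When was the Eiffel Tower completed?", context))
# Expected: a short span such as '1889'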

Step 5: Testing with examples

Let’s test our system with some examples. Here is the complete code:

# Extract text from a Wikipedia article
url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
webpage_text = extract_text_from_url(url)

print("Sample of extracted text:")
print(webpage_text[:500] + "...")

# Ask a few questions about the article
questions = [
    "When was the term artificial intelligence first used?",
    "What are the main goals of AI research?",
    "What ethical concerns are associated with AI?"
]

for question in questions:
    print(f"\nQuestion: {question}")
    answer = answer_question(question, webpage_text)
    print(f"Answer: {answer}")

This will demonstrate how the system works with real examples.
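
For repeated use, you may want to wrap both steps in a single call; ask_webpage below is a small optional helper (the name is our own, not part of any library):

# Optional helper combining extraction and answering
def ask_webpage(url, question):
    text = extract_text_from_url(url)
    if text is None:
        return "Could not extract content from the URL."
    return answer_question(question, text)

print(ask_webpage("https://en.wikipedia.org/wiki/Artificial_intelligence",
                  "What are the main goals of AI research?"))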

Limitations and future improvements

Our current implementation has some limitations:

  1. It can struggle with very long web pages due to the model’s limited context length (you can gauge this with the check below)
  2. The model may not understand complex or ambiguous questions
  3. It works best with factual content rather than opinions or subjective material
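
To see how the first limitation arises in practice, you can compare a page’s token count against the model’s 512-token window (this reuses webpage_text and tokenizer from the earlier steps):

# Gauge how far a page exceeds the model's 512-token window
num_tokens = len(tokenizer.encode(webpage_text))
print(f"Page is about {num_tokens} tokens; the model sees at most 512 per chunk.")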

Future improvements may include:

  • Implementing semantic search to better handle long documents (sketched briefly after this list)
  • Adding document summarization
  • Supporting multiple languages
  • Implementing memory of previous questions and answers
  • Fine-tuning the model on specific domains (e.g., medical, legal, technical)
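
As a taste of the first improvement, here is a rough sketch of semantic search over chunks using the sentence-transformers library (an extra dependency, not installed above; the helper name top_chunks is our own). Instead of running the QA model on every chunk, you would embed the question and the chunks and keep only the most similar ones:

# Sketch: rank text chunks by semantic similarity to the question
# (assumes: !pip install sentence-transformers)
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('all-MiniLM-L6-v2')

def top_chunks(question, chunks, k=3):
    q_emb = embedder.encode(question, convert_to_tensor=True)
    c_emb = embedder.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]   # Cosine similarity to each chunk
    best = scores.topk(min(k, len(chunks)))
    return [chunks[i] for i in best.indices]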

Conclusion

You have now successfully built an AI-powered Q&A system for web pages using open source models. This tool can help you:

  • Extract specific information from long articles
  • Research topics more efficiently
  • Get quick answers from complex documents

By using Hugging Face’s powerful models and the flexibility of Google Colab, you have created a practical application that demonstrates the capabilities of modern NLP. Feel free to customize and expand this project to meet your specific needs.

Useful resources


Here is the Colab notebook for this tutorial.
