In today’s information-rich digital landscape, navigating extensive web content can be overwhelming. Whether you are researching a project, studying complex material, or trying to extract specific information from long articles, the process can be time-consuming and inefficient. This is where an AI-powered question-answering (Q&A) bot becomes invaluable.
This tutorial will guide you through building a practical AI Q&A system that can analyze web page content and answer specific questions. Instead of relying on expensive API services, we will use open-source models from Hugging Face to create a solution that is:
- Completely free to use
- Runs in Google Colab (no local setup required)
- Adaptable to your specific needs
- Built on state-of-the-art NLP technology
By the end of this tutorial, you will have a functional web Q&A system that helps you extract insights from online content more effectively.
What we build
We will create a system that:
- Takes a URL as input
- Extracts and processes the web page content
- Accepts natural language questions about the content
- Provides accurate, contextual answers based on the webpage
Prerequisites
- A Google account to access Google Colab
- Basic understanding of Python
- No previous knowledge of machine learning required
Step 1: Setting up the environment
First, go to Google Colab and create a new notebook.
Let’s start by installing the necessary libraries:
# Install required packages
!pip install transformers torch beautifulsoup4 requests
This installs:
- transformers: Hugging Face’s library for state-of-the-art NLP models
- torch: the PyTorch deep learning framework
- beautifulsoup4: for parsing HTML and extracting web page content
- requests: for making HTTP requests to fetch web pages
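If you want to confirm that the installation succeeded before continuing, a quick optional check is to import the packages and print their versions:
# Optional: verify that the key packages import correctly and print their versions
import transformers, torch, bs4, requests
print(transformers.__version__, torch.__version__, bs4.__version__, requests.__version__)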
Step 2: Importing libraries and setting up helper functions
Now let’s import all the necessary libraries and define some helper functions:
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import requests
from bs4 import BeautifulSoup
import re
import textwrap
# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Function to extract text from a webpage
def extract_text_from_url(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'html.parser')

        # Remove scripts, styles, and navigation elements that are not part of the main content
        for script_or_style in soup(['script', 'style', 'header', 'footer', 'nav']):
            script_or_style.decompose()

        text = soup.get_text()

        # Break the text into lines, strip whitespace, and drop empty chunks
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = '\n'.join(chunk for chunk in chunks if chunk)

        # Collapse any remaining whitespace into single spaces
        text = re.sub(r'\s+', ' ', text).strip()

        return text
    except Exception as e:
        print(f"Error extracting text from URL: {e}")
        return None
This code:
- Imports all necessary libraries
- Sets up our device (GPU if available, otherwise CPU)
- Defines a function to extract readable text content from a webpage URL
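To sanity-check the extractor before moving on, you can run it against any public page; the Wikipedia URL below is just an illustrative choice:
# Quick check of the extractor (any public URL works; this one is just an example)
sample_text = extract_text_from_url("https://en.wikipedia.org/wiki/Natural_language_processing")
if sample_text:
    print(f"Extracted {len(sample_text)} characters")
    print(sample_text[:300])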
Step 3: Loading the question-answering model
Now let’s load a pre-trained question-answering model from Hugging Face:
# Load the pre-trained model and tokenizer
model_name = "deepset/roberta-base-squad2"
print(f"Loading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name).to(device)
print("Model loaded successfully!")
We use deepset/roberta-base-squad2, which is:
- Based on the RoBERTa architecture (a robustly optimized BERT approach)
- Fine-tuned on SQuAD 2.0 (the Stanford Question Answering Dataset)
- A good balance between accuracy and speed for our task
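As an aside, the transformers library also offers a pipeline wrapper that bundles tokenization and span decoding behind a one-line interface. Here is a minimal sketch of that alternative; in this tutorial we keep the manual tokenizer/model approach below so we can control how long pages are chunked:
# Optional alternative: the question-answering pipeline (sketch, not used in the rest of the tutorial)
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model=model_name,
    tokenizer=model_name,
    device=0 if torch.cuda.is_available() else -1,
)
result = qa_pipeline(question="What is this page about?", context="Some extracted webpage text...")
print(result["answer"], result["score"])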
Step 4: Implementing the question-answering function
Now let’s implement the core functionality: answering questions based on the extracted web page content:
def answer_question(question, context, max_length=512):
    # Reserve room for the question and special tokens when sizing each chunk
    max_chunk_size = max_length - len(tokenizer.encode(question)) - 5
    all_answers = []

    # Process the content in (character-based) chunks so long pages fit within the model's limit
    for i in range(0, len(context), max_chunk_size):
        chunk = context[i:i + max_chunk_size]

        inputs = tokenizer(
            question,
            chunk,
            add_special_tokens=True,
            return_tensors="pt",
            max_length=max_length,
            truncation=True
        ).to(device)

        with torch.no_grad():
            outputs = model(**inputs)

        # Most likely start and end positions of the answer span
        answer_start = torch.argmax(outputs.start_logits)
        answer_end = torch.argmax(outputs.end_logits)

        # Combine the start and end logits into a confidence score for this chunk
        start_score = outputs.start_logits[0][answer_start].item()
        end_score = outputs.end_logits[0][answer_end].item()
        score = start_score + end_score

        # Convert the predicted token span back into text
        input_ids = inputs.input_ids.tolist()[0]
        tokens = tokenizer.convert_ids_to_tokens(input_ids)
        answer = tokenizer.convert_tokens_to_string(tokens[answer_start:answer_end+1])

        # Strip special tokens (RoBERTa uses <s>/</s>; BERT-style models use [CLS]/[SEP])
        for special_token in ("<s>", "</s>", "[CLS]", "[SEP]"):
            answer = answer.replace(special_token, "")
        answer = answer.strip()

        if answer and len(answer) > 2:
            all_answers.append((answer, score))

    # Return the answer with the highest confidence score across all chunks
    if all_answers:
        all_answers.sort(key=lambda x: x[1], reverse=True)
        return all_answers[0][0]
    else:
        return "I couldn't find an answer in the provided content."
This function:
- Takes a question and the web page content as input
- Handles long content by processing it in chunks
- Uses the model to predict the answer span (start and end positions)
- Scores every chunk and returns the answer with the highest confidence score
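Before pointing the function at a full web page, you can try it on a short, hand-written passage; the context below is made up purely for illustration:
# Toy check with a hand-written context (illustrative only)
toy_context = "The Eiffel Tower was completed in 1889 and stands about 330 metres tall in Paris, France."
print(answer_question("When was the Eiffel Tower completed?", toy_context))
# Expect a short span such as "1889"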
Step 5: Testing with examples
Let’s test our system with some examples. Here is the complete code:
url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
webpage_text = extract_text_from_url(url)
print("Sample of extracted text:")
print(webpage_text[:500] + "...")
questions = [
"When was the term artificial intelligence first used?",
"What are the main goals of AI research?",
"What ethical concerns are associated with AI?"
]
for question in questions:
    print(f"\nQuestion: {question}")
    answer = answer_question(question, webpage_text)
    print(f"Answer: {answer}")
This will demonstrate how the system works with real examples.
Limitations and future improvements
Our current implementation has some limitations:
- It can struggle with very long web pages because of the model’s limited context length
- The model may not understand complex or ambiguous questions
- It works best with factual content rather than opinions or subjective material
Future improvements may include:
- Implementing semantic search to better handle long documents (see the sketch after this list)
- Adding document summarization
- Supporting multiple languages
- Adding memory of previous questions and answers
- Fine-tuning the model on specific domains (e.g., medical, legal, technical)
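As a pointer for the first of these ideas, here is a minimal sketch of semantic search over page chunks using the sentence-transformers library. This package is not installed in this tutorial, and the chunk size and embedding model name below are illustrative assumptions, not part of the original system:
# pip install sentence-transformers   (assumed extra dependency, not installed above)
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model choice

def answer_with_semantic_search(question, context, chunk_size=500, top_k=3):
    # Split the page into fixed-size chunks and embed them
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)
    question_embedding = embedder.encode(question, convert_to_tensor=True)

    # Keep only the chunks most similar to the question, then run the QA model on them
    hits = util.semantic_search(question_embedding, chunk_embeddings, top_k=top_k)[0]
    relevant_text = " ".join(chunks[hit["corpus_id"]] for hit in hits)
    return answer_question(question, relevant_text)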
Conclusion
You have now built an AI-powered Q&A system for web pages using open-source models. This tool can help you:
- Extract specific information from long articles
- Research topics more efficiently
- Get quick answers from complex documents
By using Hugging Face’s powerful models and the flexibility of Google Colab, you have created a practical application that demonstrates the capabilities of modern NLP. Feel free to customize and expand this project to meet your specific needs.
Useful resources
Here is the Colab notebook for this tutorial.

Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.
