A Step-by-Step Guide to Building a Trend-Finding Tool with Python: Web Scraping, NLP (Sentiment Analysis & Topic Modeling), and Word Cloud Visualization

Monitoring and extracting trends from web content has become essential for market research, content creation, and staying ahead in your field. In this tutorial, we provide a practical guide to building your own trend-finding tool with Python. Without external APIs or complex setups, you will learn how to scrape publicly available websites, apply powerful NLP (natural language processing) techniques such as sentiment analysis and topic modeling, and visualize emerging trends with dynamic word clouds.
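Everything in this guide runs on standard open-source Python libraries. If you are working in a notebook environment such as Colab, a minimal setup sketch looks like this (package versions are not pinned here; adjust as needed):

# Install the libraries used throughout this tutorial (skip any that are already available)
!pip install requests beautifulsoup4 nltk textblob scikit-learn wordcloud matplotlib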

import requests
from bs4 import BeautifulSoup


# List of URLs to scrape
urls = ["https://en.wikipedia.org/wiki/Natural_language_processing",
        "https://en.wikipedia.org/wiki/Machine_learning"]  


collected_texts = []  # to store text from each page


for url in urls:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract all paragraph text
        paragraphs = [p.get_text() for p in soup.find_all('p')]
        page_text = " ".join(paragraphs)
        collected_texts.append(page_text.strip())
    else:
        print(f"Failed to retrieve {url}")

With the code above, we demonstrate a straightforward way to scrape text data from publicly available websites using Python’s requests and BeautifulSoup. It retrieves the content of the specified URLs, extracts the paragraph text from the HTML, and prepares it for further NLP analysis by joining the paragraphs into a single string per page.
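Before moving on, it can help to confirm that the scrape actually returned usable text. A quick sanity check, assuming the loop above completed successfully, might look like this:

# Inspect how much text was collected from each page (exact counts will vary as the pages change)
print(f"Collected {len(collected_texts)} documents")
for i, text in enumerate(collected_texts, 1):
    print(f"Document {i}: {len(text.split())} words | preview: {text[:80]}...")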

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords


stop_words = set(stopwords.words('english'))


cleaned_texts = []
for text in collected_texts:
    # Remove non-alphabetical characters and lower the text
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    # Remove stopwords
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))

Next, we clean the scraped text by lowercasing it, removing punctuation and special characters, and filtering out common English stop words using NLTK. This preprocessing ensures that the text data is clean, focused, and ready for meaningful NLP analysis.
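To make the effect of this step concrete, here is the same cleaning logic applied to a short, made-up sentence (the example text is purely illustrative and not taken from the scraped pages):

# Hypothetical example sentence to show what the cleaning step removes
sample = "NLP, in 2024, is everywhere: chatbots, search, and translation!"
sample = re.sub(r'[^A-Za-z\s]', ' ', sample).lower()
sample = " ".join(w for w in sample.split() if w not in stop_words)
print(sample)  # nlp everywhere chatbots search translation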

from collections import Counter


# Combine all texts into one if analyzing overall trends:
all_text = " ".join(cleaned_texts)
word_counts = Counter(all_text.split())
common_words = word_counts.most_common(10)  # top 10 frequent words
print("Top 10 keywords:", common_words)

Now we compute word frequencies over the cleaned text and identify the 10 most frequent keywords. This highlights dominant trends and recurring themes across the collected documents, giving immediate insight into the most popular or significant topics in the scraped content.
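If you also want to compare pages against each other rather than only looking at the combined corpus, the same Counter approach works per document; a small optional extension:

# Top keywords per individual document (optional, for page-by-page comparison)
for i, text in enumerate(cleaned_texts, 1):
    doc_top = Counter(text.split()).most_common(5)
    print(f"Document {i} top words: {doc_top}")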

!pip install textblob
from textblob import TextBlob


for i, text in enumerate(cleaned_texts, 1):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:
        sentiment = "Positive 😀"
    elif polarity < -0.1:
        sentiment = "Negative 🙁"
    else:
        sentiment = "Neutral 😐"
    print(f"Document {i} Sentiment: {sentiment} (polarity={polarity:.2f})")

We then perform sentiment analysis on each cleaned document using TextBlob, a Python library built on top of NLTK. It evaluates the overall emotional tone of each document (positive, negative, or neutral) and prints the sentiment label along with a numeric polarity score, giving a quick indication of the general mood or attitude in the text data.
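Polarity is not the only signal TextBlob exposes; it also reports a subjectivity score between 0 (objective) and 1 (subjective). If that is useful for your use case, a small addition could look like this:

# Subjectivity: 0.0 = very objective wording, 1.0 = very subjective wording
for i, text in enumerate(cleaned_texts, 1):
    subjectivity = TextBlob(text).sentiment.subjectivity
    print(f"Document {i} Subjectivity: {subjectivity:.2f}")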

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


# Adjust these parameters
vectorizer = CountVectorizer(max_df=1.0, min_df=1, stop_words="english")
doc_term_matrix = vectorizer.fit_transform(cleaned_texts)


# Fit LDA to find topics (for instance, 3 topics)
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(doc_term_matrix)


feature_names = vectorizer.get_feature_names_out()


# Print the top 10 keywords for each discovered topic
for idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[:-11:-1]]
    print(f"Topic {idx + 1}: ", top_words)

Next, we use Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, to discover the underlying topics in the text corpus. It first transforms the cleaned texts into a numeric document-term matrix with scikit-learn's CountVectorizer and then fits an LDA model to identify the primary themes. The output lists the top keywords for each discovered topic, briefly summarizing the key concepts in the collected data.
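To see how strongly each document is associated with each discovered topic, you can also inspect the per-document topic distribution that the fitted LDA model provides:

# Each row is a document; each value is the estimated proportion of that topic in the document
doc_topic_dist = lda.transform(doc_term_matrix)
for i, dist in enumerate(doc_topic_dist, 1):
    print(f"Document {i} topic mix: {[round(p, 2) for p in dist]}")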

# Build a word cloud from the cleaned, combined text
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import re


nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


# Preprocess and clean the text (repeated here so the word cloud step can run standalone):
cleaned_texts = []
for text in collected_texts:
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))


# Generate combined text
combined_text = " ".join(cleaned_texts)


# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white", colormap='viridis').generate(combined_text)


# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud of Scraped Text", fontsize=16)
plt.show()

Finally, we generate a word cloud visualization displaying the most prominent keywords from the combined, cleaned text. By visually emphasizing the most frequent and relevant terms, this approach allows intuitive exploration of the main trends and themes in the collected web content.

Word Cloud Output from the Scraped Pages
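If you want to reuse the visualization outside the notebook, for example in a report or dashboard, the WordCloud object can also be written directly to an image file (the filename below is just an example):

# Save the rendered word cloud as a PNG image
wordcloud.to_file("trend_wordcloud.png")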

In conclusion, we have successfully built a robust trend-finding tool. This exercise gave you hands-on experience with web scraping, NLP analysis, topic modeling, and intuitive visualization using word clouds. With this powerful yet simple approach, you can continuously track industry trends, gain valuable insight from social and blog content, and make informed decisions based on real-time data.

