Building a crawler comes with its own set of challenges, such as handling dynamic content, infinite scrolling, and varying URL structures across different platforms.
In this blog post, I’ll walk you through how I built a scalable web crawler that can handle multiple domains and dynamically loaded content, and extract product URLs efficiently. I’ll also share the technical details, challenges faced, and future improvements planned for the project.
Problem Statement
The goal of this project was to build a web crawler that:
Takes a list of domains (e.g., amazon.in, flipkart.com) as input.
Crawls all the URLs on the given domains.
Returns a comprehensive list of product URLs found on those domains.
Key Features:
Scalability: Handle a minimum of 10 domains and scale to hundreds.
Dynamic Content Handling: Crawl websites with infinite scrolling or dynamically loaded content.
URL Pattern Recognition: Identify product URLs using AI and regex patterns.
Performance: Execute crawls in parallel or asynchronously to minimize runtime.
Robustness: Handle edge cases like invalid domains, non-HTML resources, and varying URL structures.
Project Overview
The project is built using the following tech stack:
Backend: Node.js with Express for handling HTTP requests.
Database: MongoDB for storing crawled domains and product URLs.
Queue System: BullMQ for managing asynchronous crawling tasks.
Crawling Logic: Puppeteer for handling dynamic content and infinite scrolling.
AI Integration: Google’s Gemini AI for filtering product URLs.
Here’s a high-level overview of the project structure:
x-crawler/
├── src/
│ ├── app.ts // Express app setup
│ ├── server.ts // Server initialization
│ ├── config/ // Configuration files
│ ├── controllers/ // API controllers
│ ├── db/ // Database models and connections
│ ├── middlewares/ // Request validation
│ ├── models/ // MongoDB schemas
│ ├── routes/ // API routes
│ ├── services/ // Business logic
│ ├── types/ // TypeScript interfaces
│ ├── utils/ // Utility functions
│ └── workers/ // Background workers for crawling
Technical Deep Dive
1. Crawler Service
The core of the project is the CrawlerService, which handles both static and dynamic crawling. Here’s how it works:
Static Crawling
The static crawler is designed to crawl websites that do not rely on JavaScript to load content dynamically. It uses BFS (Breadth-First Search) to traverse URLs starting from a seed URL (e.g., https://amazon.in). Below is a detailed explanation of how it works:
a. Initialization
The static crawler is initialized in the CrawlerService class. It takes a domain (e.g., amazon.in) and an optional maxDepth parameter to limit the depth of crawling.
async startCrawl(domain: string, maxDepth?: number): Promise<string[]> {
const seedUrl = url.format({
protocol: 'https',
hostname: domain,
pathname: '/',
});
const visitedUrls = new Set<string>(); // Track visited URLs
const visitUrlsQueue: { url: string; depth: number }[] = [{ url: seedUrl, depth: 0 }]; // BFS queue
const foundUrlsSet = new Set<string>(); // Store unique URLs
seedUrl: The starting point for crawling (e.g., https://amazon.in/).
visitedUrls: Tracks URLs that have already been crawled.
visitUrlsQueue: A queue for BFS traversal containing URLs and their depths.
foundUrlsSet: Stores unique URLs found during crawling.
b. BFS Traversal
The crawler explores URLs level by level using BFS, starting from the seed URL and exploring all links on the page within the same domain.
while (visitUrlsQueue.length > 0) {
const { url: currentUrl, depth } = visitUrlsQueue.shift()!; // Dequeue the next URL
// Skip if already visited
if (visitedUrls.has(currentUrl)) continue;
// Skip if max depth is reached
if (maxDepth !== undefined && depth >= maxDepth) {
return Array.from(foundUrlsSet);
}
try {
// Fetch the page content
const response = await axios.get(currentUrl, { timeout: 30000 });
visitedUrls.add(currentUrl); // Mark URL as visited
// Parse the HTML content
const dom = new JSDOM(response.data);
const anchorTags = dom.window.document.querySelectorAll('a'); // Extract all anchor tags
BFS Queue: URLs are processed in the order they are discovered.
Max Depth: The crawler stops exploring URLs beyond the specified maxDepth.
HTML Parsing: The JSDOM library extracts <a> tags from the HTML content.
c. URL Filtering and Processing
For each anchor tag, the crawler checks if the URL is valid and belongs to the same domain, skipping non-HTML resources.
for (const anchorTag of anchorTags) {
const href = anchorTag.href;
// Skip invalid or non-HTML URLs
if (
!href ||
href.includes('#') ||
href.includes('mailto:') ||
href.includes('tel:') ||
CrawlerService.NON_HTML_EXTENSIONS.test(href)
) {
continue;
}
// Convert relative URLs to absolute URLs
const absoluteUrl = CrawlerService.getAbsoluteUrl(currentUrl, href);
// Skip URLs not from the same domain
if (!CrawlerService.isSameDomain(seedUrl, absoluteUrl)) continue;
// Skip if the URL could not be resolved or has already been visited
if (!absoluteUrl || visitedUrls.has(absoluteUrl)) continue;
// Add URL to the found set
foundUrlsSet.add(absoluteUrl);
// Add URL to the queue if not already visited or in the queue
const urlExistsInQueue = visitUrlsQueue.some((urlObj) => urlObj.url === absoluteUrl);
if (!visitedUrls.has(absoluteUrl) && !urlExistsInQueue) {
visitUrlsQueue.push({
url: absoluteUrl,
depth: depth + 1,
});
}
}
Relative to Absolute URLs: Converts relative URLs (e.g., /product/123) to absolute URLs (e.g., https://amazon.in/product/123).
Same Domain Check: Ensures URLs belong to the same domain as the seed URL.
Unique URLs: Adds URLs to the foundUrlsSet only if they are unique.
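The static helpers CrawlerService.getAbsoluteUrl, CrawlerService.isSameDomain, and the NON_HTML_EXTENSIONS regex are referenced above but their implementations aren’t shown in this post. A minimal sketch of what they could look like using Node’s built-in URL class (the actual implementations in the repository may differ):
// Sketch only: how CrawlerService's static helpers might look; not the repository's actual code
export class CrawlerService {
  // Matches common non-HTML resources (images, styles, scripts, documents, media)
  static readonly NON_HTML_EXTENSIONS = /\.(png|jpe?g|gif|svg|webp|ico|css|js|json|pdf|zip|gz|mp4|mp3|woff2?)(\?|$)/i;

  // Resolve a possibly relative href against the page it was found on
  static getAbsoluteUrl(currentUrl: string, href: string): string {
    try {
      return new URL(href, currentUrl).toString();
    } catch {
      return ''; // invalid hrefs resolve to an empty string and get filtered out by the caller
    }
  }

  // Compare hostnames, ignoring a leading "www."
  static isSameDomain(seedUrl: string, candidateUrl: string): boolean {
    if (!candidateUrl) return false;
    try {
      const normalize = (host: string) => host.replace(/^www\./, '');
      return normalize(new URL(seedUrl).hostname) === normalize(new URL(candidateUrl).hostname);
    } catch {
      return false;
    }
  }
}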
d. Delay Between Requests
To avoid overloading the server, a random delay is added between requests.
// Add a delay between requests
const delayInMS = Math.round((Math.random() + 1) * 10000);
await new Promise((resolve) => setTimeout(resolve, delayInMS));
- Random Delay: Mimics human behavior to avoid being blocked by the server.
e. Returning the Results
After processing all URLs, the crawler returns the list of unique URLs.
return Array.from(foundUrlsSet); // Return unique URLs as an array
f. Error Handling
If an error occurs (e.g., network issues or invalid HTML), the crawler logs the error and continues with the next URL.
} catch (error) {
console.log(`Error crawling ${currentUrl}: `, error);
}
Example Workflow
Input: User provides a domain (e.g., amazon.in).
Seed URL: Starts with https://amazon.in/.
BFS Traversal:
- Fetches HTML content of https://amazon.in/.
- Extracts and processes <a> tags.
- Converts relative URLs to absolute URLs.
- Filters out invalid/non-HTML URLs.
- Adds valid URLs to the queue.
Depth Control: Stops exploring URLs beyond maxDepth.
Output: Returns a list of unique URLs found.
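For completeness, this is roughly how the static crawl might be invoked; the startCrawl signature matches the snippets above, while the maxDepth value and logging are just illustrative:
// Hypothetical usage of the static crawler
const crawler = new CrawlerService();
crawler
  .startCrawl('amazon.in', 2) // maxDepth of 2 is an arbitrary example value
  .then((foundUrls) => console.log(`Found ${foundUrls.length} unique URLs`))
  .catch((error) => console.error('Crawl failed:', error));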
Key Components
CrawlerService
- Manages core crawling logic.
- Uses axios for HTTP requests and JSDOM for HTML parsing.
Static Crawler
- Implementation of CrawlerService for static websites.
- Uses BFS for URL traversal.
URL Validation
- Filters out non-HTML resources using regex.
- Ensures URLs belong to the same domain.
Queue System
- Managed by BullMQ for asynchronous job processing.
Dynamic Crawler Workflow
The dynamic crawler uses Puppeteer, a headless browser, to simulate a real user interacting with the website. This allows it to handle JavaScript-rendered content, infinite scrolling, and other dynamic behaviors.
Step-by-Step Breakdown
1. Initialization
The dynamic crawler is initialized in the CrawlerService
class. It takes a domain (e.g., amazon.in
) and an optional maxDepth
parameter to limit the depth of crawling.
async startCrawlDynamic(domain: string, maxDepth?: number): Promise<string[]> {
const seedUrl = url.format({
protocol: 'https',
hostname: domain,
pathname: '/',
});
const visitedUrls = new Set<string>(); // Track visited URLs
const visitUrlsQueue: { url: string; depth: number }[] = [{ url: seedUrl, depth: 0 }]; // BFS queue
const foundUrlsSet = new Set<string>(); // Store unique URLs
seedUrl: The starting point for crawling (e.g., https://amazon.in/).
visitedUrls: Tracks URLs already crawled.
visitUrlsQueue: BFS queue for traversal.
foundUrlsSet: Stores unique URLs discovered.
2. Launching Puppeteer
The dynamic crawler launches a headless browser using Puppeteer.
// Launch Puppeteer browser
const browserInstance = await puppeteer.launch({ headless: false });
const page = await browserInstance.newPage();
headless: false: Runs in non-headless mode for debugging. Set to true for production.
page: Opens a new browser tab for navigation.
3. BFS Traversal with Puppeteer
The crawler explores URLs level by level using BFS.
while (visitUrlsQueue.length > 0) {
const { url: currentUrl, depth } = visitUrlsQueue.shift()!; // Dequeue the next URL
// Skip if already visited
if (visitedUrls.has(currentUrl)) continue;
// Skip if max depth is reached
if (maxDepth !== undefined && depth >= maxDepth) {
break;
}
try {
// Navigate to the current URL
await page.goto(currentUrl, { waitUntil: 'domcontentloaded', timeout: 30000 });
visitedUrls.add(currentUrl); // Mark URL as visited
BFS Queue: Processes URLs in the discovery order (FIFO).
Max Depth: Stops exploration beyond the specified depth.
Puppeteer Navigation: Navigates and waits for the DOM to load.
4. Handling Infinite Scrolling
Simulates scrolling to load additional content.
let scrollCount = 0;
let previousHeight = await page.evaluate('document.body.scrollHeight');
while (scrollCount <= maxScrolls) {
console.log(`Scroll attempt ${scrollCount + 1} of ${maxScrolls}`);
// Extract all anchor href links on the webpage
const hrefs = await page.evaluate(() =>
Array.from(document.querySelectorAll('a'), (a) => a.href)
);
// Add valid URLs to queue
for (const href of hrefs) {
if (
!href ||
href.includes('#') ||
href.includes('mailto:') ||
href.includes('tel:') ||
CrawlerService.NON_HTML_EXTENSIONS.test(href)
) continue;
const absoluteUrl = CrawlerService.getAbsoluteUrl(currentUrl, href);
if (!CrawlerService.isSameDomain(seedUrl, absoluteUrl) || visitedUrls.has(absoluteUrl)) continue;
foundUrlsSet.add(absoluteUrl);
const urlExistsInQueue = visitUrlsQueue.some((urlObj) => urlObj.url === absoluteUrl);
if (!visitedUrls.has(absoluteUrl) && !urlExistsInQueue) {
visitUrlsQueue.push({ url: absoluteUrl, depth: depth + 1 });
}
}
// Scroll to the bottom of the page
await page.evaluate(() => window.scrollBy(0, document.body.scrollHeight));
// Delay for content loading
await new Promise((resolve) => setTimeout(resolve, scrollDelay));
const newHeight = await page.evaluate('document.body.scrollHeight');
if (newHeight === previousHeight) {
console.log('No further scrolling possible. Exiting.');
break;
}
previousHeight = newHeight;
scrollCount++;
}
Scrolling Logic: Scrolls to the bottom to load new content.
Max Scrolls: Limits scrolling attempts.
Delay Between Scrolls: Allows new content to load.
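Note that maxScrolls and scrollDelay are not defined in the snippet above; in the service they are presumably configuration values. Illustrative values might look like this:
// Example values only; the real limits live in the crawler's configuration
const maxScrolls = 10; // give up after 10 scroll attempts per page
const scrollDelay = 3000; // wait 3 seconds after each scroll so lazy-loaded content can render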
5. Delay Between Requests
Adds a random delay between requests to mimic human behavior.
const delayInMS = Math.round((Math.random() + 1) * 10000);
await new Promise((resolve) => setTimeout(resolve, delayInMS));
6. Closing the Browser
Closes the Puppeteer instance after processing URLs.
await browserInstance.close();
7. Returning the Results
Returns the list of unique URLs.
return Array.from(foundUrlsSet);
Example Workflow
Input: Domain provided (e.g., amazon.in).
Seed URL: Starts with https://amazon.in/.
Navigation:
- Opens a headless browser.
- Navigates to https://amazon.in/.
- Waits for DOM content to load.
Infinite Scrolling:
- Scrolls to load additional content.
- Extracts all <a> tags and processes their href attributes.
URL Filtering:
- Converts relative URLs to absolute.
- Filters invalid or non-HTML URLs.
- Ensures URLs belong to the same domain.
Depth Control: Stops beyond the specified maxDepth.
Output: Returns a list of unique URLs.
Key Components
CrawlerService
- Handles core crawling logic.
- Uses Puppeteer for dynamic content.
Dynamic Crawler
- Simulates user interactions like scrolling.
URL Validation
- Filters non-HTML resources.
- Ensures domain consistency.
Queue System
- Managed by BullMQ for asynchronous processing.
Challenges and Solutions
Dynamic Content
Challenge: Infinite scrolling or lazy-loaded content.
Solution: Simulate scrolling and wait for new content.
Performance
Challenge: Crawling dynamic websites can be slow.
Solution: Use asynchronous processing and random delays.
Edge Cases
Challenge: Some websites block automated crawlers.
Solution: Mimic human behavior with random delays and headless mode.
2. Queue System with BullMQ
To handle multiple domains efficiently, the project uses BullMQ for job queuing. Each domain is processed as a separate job, and the system can scale to handle hundreds of domains.
const connection = new Redis({
...REDIS_CONFIG,
maxRetriesPerRequest: null,
});
// for static crawler
export const validDomainQueue = new Queue(validDomainQueueName, { connection });
export const validDomainWorker = new Worker(validDomainQueueName, domainCrawlJob, { connection });
// TODO: implement logger
validDomainWorker.on('completed', (job: Job) => {
console.log(`Job completed: ${job.id}`);
});
validDomainWorker.on('failed', (job: Job | undefined, err) => {
console.log(`Job failed: ${err}`);
});
// for dynamic crawler
export const dynamicValidDomainQueue = new Queue(dynamicValidDomainQueueName, { connection });
export const dynamicValidDomainWorker = new Worker(
dynamicValidDomainQueueName,
dynamicDomainCrawlJob,
{
connection,
}
);
// TODO: implement logger
dynamicValidDomainWorker.on('completed', (job: Job) => {
console.log(`Job completed: ${job.id}`);
});
dynamicValidDomainWorker.on('failed', (job: Job | undefined, err) => {
console.log(`Job failed: ${err}`);
});
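The snippet above only wires up the queues and workers; jobs themselves are added from the API layer. A minimal sketch of how a validated domain might be enqueued (the import path, job name, and payload shape are assumptions):
// Hypothetical controller-side enqueue using the queues exported above
import { validDomainQueue, dynamicValidDomainQueue } from '../config/queues'; // path is an assumption

export const enqueueCrawlJob = async (domain: string, dynamic = false): Promise<string | undefined> => {
  const queue = dynamic ? dynamicValidDomainQueue : validDomainQueue;
  const job = await queue.add('crawl-domain', { domain }); // BullMQ: add(jobName, data)
  return job.id;
};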
3. AI Integration for Product URL Filtering
To identify product URLs, the project uses Google’s Gemini AI. The model is prompted to recognize patterns in URLs that indicate product pages (e.g., /product/, /item/).
export const getProductUrls = async (foundUrls: string[]): Promise<string[]> => {
try {
const genAI = new GoogleGenerativeAI(String(Config.GEMINI_AI_API_KEY));
const model = genAI.getGenerativeModel(googleAIConfigs);
const prompt = `${promptForFetchingProductUrls}\n${foundUrls}`;
const result: GenerateContentResult = await model.generateContent(prompt);
const generatedText: string = result.response.text();
const productUrls: string[] = JSON.parse(generatedText);
return productUrls;
} catch (error) {
console.log('Something went wrong :( with google AI, using regex as fallback', error);
let productUrls: string[] = [];
const productUrlRegex =
/\/product\/|\/products\/|\/item\/|\/items\/|\/p\/|\/dp\/|\/buy\/|product_id=|item=/;
for (const foundUrl of foundUrls) {
if (productUrlRegex.test(foundUrl)) {
productUrls.push(foundUrl);
}
}
return productUrls;
}
};
export const promptForFetchingProductUrls = `You are an AI assistant that identifies product URLs. A product URL points to a page containing details about a single product, such as its name, price, specifications, or "Add to Cart" button.
Here are some examples:
- Product URL: https://example.com/product/12345
- Product URL: https://example.com/item/67890
- Non-Product URL: https://example.com/category/shoes
- Non-Product URL: https://example.com/search?q=shoes
Here are some more real examples of productUrls:
1. https://amzn.in/d/gxGhtVq
2. https://www.amazon.in/Weighing-Balanced-Batteries-Included-A121/dp/B083C6XMKQ/
3. https://www.flipkart.com/mi-11-lite-vinyl-black-128-gb/p/itmac6203bae9394?pid=MOBG3VSKRKGZKJAR
4. https://www.myntra.com/watches/coach/coach-women-grand-embellished-dial-analogue-watch-14503941/30755447/buy
5. https://www.myntra.com/tshirts/nautica/nautica-pure-cotton-polo-collar-t-shirt/31242755/buy
6. https://jagrukjournal.lla.in/products/ek-geet-2025-collectible-calendar
7. https://mrbeast.store/products/kids-basics-panther-tee-royal-blue
productUrls = ['https://abc.com/product/1', 'https://abc.com/product/2']
Return: Array<productUrls>
Given the following URLs, return only the product URLs in array[]:`;
// This prompt is structured for Google AI only, but I plan to extend it into a generalized prompt for other popular LLMs
4. Database Models
The project uses MongoDB to store crawled domains and product URLs. Here are the schemas:
CrawlDomain Schema
const createCrawlDomainSchema = (documentExpirationInSeconds: number): Schema<IDomainDocument> =>
new Schema<IDomainDocument>(
{
domain: { type: String, required: true, unique: true },
jobId: { type: [String], default: [] },
status: { type: String, enum: Object.values(CrawlDomainStatus), default: CrawlDomainStatus.PENDING },
expiresAt: { type: Date, default: () => new Date(Date.now() + documentExpirationInSeconds * 1000) },
},
{ timestamps: true }
);
CrawlProductUrl Schema
const createCrawlProductUrlSchema = (documentExpirationInSeconds: number): Schema<IProductUrlsDocument> =>
new Schema<IProductUrlsDocument>(
{
urls: { type: [String], required: true },
domainId: { type: Schema.Types.ObjectId, ref: 'Domain', required: true },
expiresAt: { type: Date, default: () => new Date(Date.now() + documentExpirationInSeconds * 1000) },
},
{ timestamps: true }
);
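The worker processors (domainCrawlJob and dynamicDomainCrawlJob) referenced in the queue setup aren’t shown in this post. Here is a rough sketch of how the static one might tie the crawler, the AI filter, and these models together; the model names, status value, and omitted imports are assumptions based on the schemas above:
// Hypothetical job processor: crawl a domain, filter product URLs, persist the results
// (imports of CrawlerService, getProductUrls, CrawlDomain, CrawlProductUrl, and CrawlDomainStatus
//  are omitted because their paths are project-specific)
import { Job } from 'bullmq';

export const domainCrawlJob = async (job: Job<{ domain: string }>): Promise<void> => {
  const { domain } = job.data;

  // 1. Crawl the domain and collect candidate URLs
  const crawler = new CrawlerService();
  const foundUrls = await crawler.startCrawl(domain);

  // 2. Filter the candidates down to product URLs (Gemini, with the regex fallback)
  const productUrls = await getProductUrls(foundUrls);

  // 3. Persist the results and mark the domain as completed
  const domainDoc = await CrawlDomain.findOneAndUpdate(
    { domain },
    { status: CrawlDomainStatus.COMPLETED }, // COMPLETED is an assumed enum value
    { new: true }
  );
  if (domainDoc) {
    await CrawlProductUrl.create({ urls: productUrls, domainId: domainDoc._id });
  }
};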
Detailed Sequence Diagram
Challenges and Solutions
1. Dynamic Content Handling
Challenge: Websites with infinite scrolling or dynamically loaded content were difficult to crawl.
Solution: Used Puppeteer to simulate a browser and scroll through the page to load additional content.
2. Performance
Challenge: Crawling large websites sequentially was slow.
Solution: Implemented BullMQ for asynchronous job processing, allowing multiple domains to be crawled in parallel.
3. URL Pattern Recognition
Challenge: Different e-commerce platforms use varying URL structures for product pages.
Solution: Integrated Google’s Gemini AI to identify product URLs based on patterns.
Future Improvements
While the project meets the initial requirements, there’s always room for improvement:
Concurrency: Implement concurrency to handle multiple domains simultaneously (see the sketch after this list).
Dead Letter Queue (DLQ): Handle failed jobs more efficiently.
Caching: Use Redis to cache frequently accessed data.
Error Handling: Improve error handling and logging for better debugging.
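For the concurrency item, BullMQ workers already accept a concurrency option, so one possible direction (a sketch, not the current implementation) is:
// Possible future change: let each worker process several domains in parallel
export const validDomainWorker = new Worker(validDomainQueueName, domainCrawlJob, {
  connection,
  concurrency: 5, // number of jobs processed concurrently per worker; value is illustrative
});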
Conclusion
Building this web crawler was a challenging yet rewarding experience. It not only helped me deepen my understanding of web scraping and asynchronous programming but also gave me hands-on experience with AI integration and queue systems. I’m excited to continue improving this project and exploring new features.
Feel free to check out the code in the x-crawler GitHub repository and share your feedback!