Building an Efficient and Scalable Web Crawler: A Beginner's Guide

Building a crawler comes with its own set of challenges, such as handling dynamic content, infinite scrolling, and varying URL structures across different platforms.

In this blog post, I’ll walk you through how I built a scalable web crawler that can handle multiple domains, dynamically loaded content, and extract product URLs efficiently. I’ll also share the technical details, challenges faced, and future improvements planned for the project.


Problem Statement

The goal of this project was to build a web crawler that:

  1. Takes a list of domains (e.g., amazon.in, flipkart.com) as input.

  2. Crawls all the URLs on the given domains.

  3. Returns a comprehensive list of product URLs found on those domains.

Key Features:

  • Scalability: Handle a minimum of 10 domains and scale to hundreds.

  • Dynamic Content Handling: Crawl websites with infinite scrolling or dynamically loaded content.

  • URL Pattern Recognition: Identify product URLs using AI and regex patterns.

  • Performance: Execute crawls in parallel or asynchronously to minimize runtime.

  • Robustness: Handle edge cases like invalid domains, non-HTML resources, and varying URL structures.


Project Overview

The project is built using the following tech stack:

  • Backend: Node.js with Express for handling HTTP requests.

  • Database: MongoDB for storing crawled domains and product URLs.

  • Queue System: BullMQ for managing asynchronous crawling tasks.

  • Crawling Logic: Puppeteer for handling dynamic content and infinite scrolling.

  • AI Integration: Google’s Gemini AI for filtering product URLs.

Here’s a high-level overview of the project structure:

x-crawler/
├── src/
│   ├── app.ts              // Express app setup
│   ├── server.ts           // Server initialization
│   ├── config/             // Configuration files
│   ├── controllers/        // API controllers
│   ├── db/                 // Database models and connections
│   ├── middlewares/        // Request validation
│   ├── models/             // MongoDB schemas
│   ├── routes/             // API routes
│   ├── services/           // Business logic
│   ├── types/              // TypeScript interfaces
│   ├── utils/              // Utility functions
│   └── workers/            // Background workers for crawling
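
To make the structure concrete, here is a minimal sketch of how a route might validate input and hand domains off to the queue. The file path and handler names are assumptions for illustration; only validDomainQueue (shown later in the queue section) comes from the project itself.

// src/routes/crawl.route.ts (hypothetical file and handler names)
import { Router, Request, Response } from 'express';
import { validDomainQueue } from '../workers/queues'; // assumed import path

export const crawlRouter = Router();

// POST /api/crawl  body: { "domains": ["amazon.in", "flipkart.com"] }
crawlRouter.post('/crawl', async (req: Request, res: Response) => {
    const { domains } = req.body as { domains: string[] };

    if (!Array.isArray(domains) || domains.length === 0) {
        return res.status(400).json({ message: 'domains must be a non-empty array' });
    }

    // Enqueue one crawl job per domain; workers pick them up asynchronously
    const jobs = await Promise.all(
        domains.map((domain) => validDomainQueue.add('crawl-domain', { domain }))
    );

    return res.status(202).json({ jobIds: jobs.map((job) => job.id) });
});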

Technical Deep Dive

1. Crawler Service

The core of the project is the CrawlerService, which handles both static and dynamic crawling. Here’s how it works:

Static Crawling

The static crawler is designed to crawl websites that do not rely on JavaScript to load content dynamically. It uses BFS (Breadth-First Search) to traverse URLs starting from a seed URL (e.g., https://amazon.in). Below is a detailed explanation of how it works:

a. Initialization

The static crawler is initialized in the CrawlerService class. It takes a domain (e.g., amazon.in) and an optional maxDepth parameter to limit the depth of crawling.

async startCrawl(domain: string, maxDepth?: number): Promise<string[]> {
    const seedUrl = url.format({
        protocol: 'https',
        hostname: domain,
        pathname: '/',
    });

    const visitedUrls = new Set<string>(); // Track visited URLs
    const visitUrlsQueue: { url: string; depth: number }[] = [{ url: seedUrl, depth: 0 }]; // BFS queue
    const foundUrlsSet = new Set<string>(); // Store unique URLs

  • seedUrl: The starting point for crawling (e.g., https://amazon.in/).

  • visitedUrls: Tracks URLs that have already been crawled.

  • visitUrlsQueue: A queue for BFS traversal containing URLs and their depths.

  • foundUrlsSet: Stores unique URLs found during crawling.

b. BFS Traversal

The crawler explores URLs level by level using BFS, starting from the seed URL and exploring all links on the page within the same domain.

while (visitUrlsQueue.length > 0) {
    const { url: currentUrl, depth } = visitUrlsQueue.shift()!; // Dequeue the next URL

    // Skip if already visited
    if (visitedUrls.has(currentUrl)) continue;

    // Skip if max depth is reached
    if (maxDepth !== undefined && depth >= maxDepth) {
        return Array.from(foundUrlsSet);
    }

    try {
        // Fetch the page content
        const response = await axios.get(currentUrl, { timeout: 30000 });
        visitedUrls.add(currentUrl); // Mark URL as visited

        // Parse the HTML content
        const dom = new JSDOM(response.data);
        const anchorTags = dom.window.document.querySelectorAll('a'); // Extract all anchor tags

  • BFS Queue: URLs are processed in the order they are discovered.

  • Max Depth: The crawler stops exploring URLs beyond the specified maxDepth.

  • HTML Parsing: The JSDOM library extracts <a> tags from the HTML content.

c. URL Filtering and Processing

For each anchor tag, the crawler checks if the URL is valid and belongs to the same domain, skipping non-HTML resources.

for (const anchorTag of anchorTags) {
    const href = anchorTag.href;

    // Skip invalid or non-HTML URLs
    if (
        !href ||
        href.includes('#') ||
        href.includes('mailto:') ||
        href.includes('tel:') ||
        CrawlerService.NON_HTML_EXTENSIONS.test(href)
    ) {
        continue;
    }

    // Convert relative URLs to absolute URLs
    const absoluteUrl = CrawlerService.getAbsoluteUrl(currentUrl, href);

    // Skip URLs not from the same domain
    if (!CrawlerService.isSameDomain(seedUrl, absoluteUrl)) continue;

    // Skip malformed or already visited URLs
    if (!absoluteUrl || visitedUrls.has(absoluteUrl)) continue;

    // Add URL to the found set
    foundUrlsSet.add(absoluteUrl);

    // Add URL to the queue if not already visited or in the queue
    const urlExistsInQueue = visitUrlsQueue.some((urlObj) => urlObj.url === absoluteUrl);
    if (!visitedUrls.has(absoluteUrl) && !urlExistsInQueue) {
        visitUrlsQueue.push({
            url: absoluteUrl,
            depth: depth + 1,
        });
    }
}

  • Relative to Absolute URLs: Converts relative URLs (e.g., /product/123) to absolute URLs (e.g., https://amazon.in/product/123).

  • Same Domain Check: Ensures URLs belong to the same domain as the seed URL.

  • Unique URLs: Adds URLs to the foundUrlsSet only if they are unique.

d. Delay Between Requests

To avoid overloading the server, a random delay is added between requests.

// Add a delay between requests
const delayInMS = Math.round((Math.random() + 1) * 10000);
await new Promise((resolve) => setTimeout(resolve, delayInMS));

  • Random Delay: Waits a random 10–20 seconds between requests to mimic human behavior and avoid being blocked by the server.

e. Returning the Results

After processing all URLs, the crawler returns the list of unique URLs.

return Array.from(foundUrlsSet); // Return unique URLs as an array

f. Error Handling

If an error occurs (e.g., network issues or invalid HTML), the crawler logs the error and continues with the next URL.

} catch (error) {
    console.log(`Error crawling ${currentUrl}: `, error);
}

Example Workflow

  1. Input: User provides a domain (e.g., amazon.in).

  2. Seed URL: Starts with https://amazon.in/.

  3. BFS Traversal:

    • Fetches HTML content of https://amazon.in/.

    • Extracts and processes <a> tags.

    • Converts relative URLs to absolute URLs.

    • Filters out invalid/non-HTML URLs.

    • Adds valid URLs to the queue.

  4. Depth Control: Stops exploring URLs beyond maxDepth.

  5. Output: Returns a list of unique URLs found.
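
To tie the workflow together, here is a short usage sketch. The CrawlerService constructor and import path are assumptions; the call to startCrawl with a domain and an optional depth limit matches the code above.

import { CrawlerService } from './services/crawler.service'; // assumed path

(async () => {
    const crawler = new CrawlerService();

    // Crawl amazon.in up to two levels deep and inspect the result
    const foundUrls = await crawler.startCrawl('amazon.in', 2);
    console.log(`Found ${foundUrls.length} URLs`, foundUrls.slice(0, 10));
})();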

Key Components

CrawlerService

  • Manages core crawling logic.

  • Uses axios for HTTP requests and JSDOM for HTML parsing.

Static Crawler

  • Implementation of CrawlerService for static websites.

  • Uses BFS for URL traversal.

URL Validation

  • Filters out non-HTML resources using regex.

  • Ensures URLs belong to the same domain (a sketch of these helper functions follows this list).

Queue System

  • Managed by BullMQ for asynchronous job processing.
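
The helper members referenced in the snippets above (NON_HTML_EXTENSIONS, getAbsoluteUrl, isSameDomain) are not shown in this post, so here is a minimal sketch of how they might look. The exact regex and the www-handling are assumptions, not the project's actual code.

// Sketch of the static helpers on CrawlerService (bodies are illustrative)
export class CrawlerService {
    // Skip obvious non-HTML resources (assumed extension list)
    static readonly NON_HTML_EXTENSIONS =
        /\.(jpe?g|png|gif|svg|webp|css|js|json|xml|pdf|zip|mp4|mp3|ico|woff2?)(\?|$)/i;

    // Resolve a possibly relative href against the page it was found on
    static getAbsoluteUrl(currentUrl: string, href: string): string | null {
        try {
            return new URL(href, currentUrl).toString();
        } catch {
            return null; // malformed href
        }
    }

    // Treat www.amazon.in and amazon.in as the same domain
    static isSameDomain(seedUrl: string, targetUrl: string | null): boolean {
        if (!targetUrl) return false;
        try {
            const seedHost = new URL(seedUrl).hostname.replace(/^www\./, '');
            const targetHost = new URL(targetUrl).hostname.replace(/^www\./, '');
            return targetHost === seedHost || targetHost.endsWith(`.${seedHost}`);
        } catch {
            return false;
        }
    }

    // ...startCrawl and startCrawlDynamic as shown in this post...
}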

Dynamic Crawler Workflow

The dynamic crawler uses Puppeteer, a Node.js library that drives a headless Chromium browser, to simulate a real user interacting with the website. This allows it to handle JavaScript-rendered content, infinite scrolling, and other dynamic behaviors.

Step-by-Step Breakdown

1. Initialization

The dynamic crawler is initialized in the CrawlerService class. It takes a domain (e.g., amazon.in) and an optional maxDepth parameter to limit the depth of crawling.

async startCrawlDynamic(domain: string, maxDepth?: number): Promise<string[]> {
    const seedUrl = url.format({
        protocol: 'https',
        hostname: domain,
        pathname: '/',
    });

    const visitedUrls = new Set<string>(); // Track visited URLs
    const visitUrlsQueue: { url: string; depth: number }[] = [{ url: seedUrl, depth: 0 }]; // BFS queue
    const foundUrlsSet = new Set<string>(); // Store unique URLs

  • seedUrl: The starting point for crawling (e.g., https://amazon.in/).

  • visitedUrls: Tracks URLs already crawled.

  • visitUrlsQueue: BFS queue for traversal.

  • foundUrlsSet: Stores unique URLs discovered.

2. Launching Puppeteer

The dynamic crawler launches a headless browser using Puppeteer.

// Launch Puppeteer browser
const browserInstance = await puppeteer.launch({ headless: false });
const page = await browserInstance.newPage();

  • headless: false: Runs in non-headless mode for debugging. Set to true for production.

  • page: Opens a new browser tab for navigation.

3. BFS Traversal with Puppeteer

The crawler explores URLs level by level using BFS.

while (visitUrlsQueue.length > 0) {
    const { url: currentUrl, depth } = visitUrlsQueue.shift()!; // Dequeue the next URL

    // Skip if already visited
    if (visitedUrls.has(currentUrl)) continue;

    // Skip if max depth is reached
    if (maxDepth !== undefined && depth >= maxDepth) {
        break;
    }

    try {
        // Navigate to the current URL
        await page.goto(currentUrl, { waitUntil: 'domcontentloaded', timeout: 30000 });
        visitedUrls.add(currentUrl); // Mark URL as visited

  • BFS Queue: Processes URLs in the discovery order (FIFO).

  • Max Depth: Stops exploration beyond the specified depth.

  • Puppeteer Navigation: Navigates and waits for the DOM to load.

4. Handling Infinite Scrolling

Simulates scrolling to load additional content.

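// Note: maxScrolls and scrollDelay are not defined in this snippet; assume they are
// class-level configuration (e.g., maxScrolls = 5 scroll attempts, scrollDelay = 2000 ms).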
let scrollCount = 0;
let previousHeight = await page.evaluate('document.body.scrollHeight');

while (scrollCount <= maxScrolls) {
    console.log(`Scroll attempt ${scrollCount + 1} of ${maxScrolls}`);

    // Extract all anchor href links on the webpage
    const hrefs = await page.evaluate(() =>
        Array.from(document.querySelectorAll('a'), (a) => a.href)
    );

    // Add valid URLs to queue
    for (const href of hrefs) {
        if (
            !href ||
            href.includes('#') ||
            href.includes('mailto:') ||
            href.includes('tel:') ||
            CrawlerService.NON_HTML_EXTENSIONS.test(href)
        ) continue;

        const absoluteUrl = CrawlerService.getAbsoluteUrl(currentUrl, href);

        if (!absoluteUrl || !CrawlerService.isSameDomain(seedUrl, absoluteUrl) || visitedUrls.has(absoluteUrl)) continue;

        foundUrlsSet.add(absoluteUrl);

        const urlExistsInQueue = visitUrlsQueue.some((urlObj) => urlObj.url === absoluteUrl);
        if (!visitedUrls.has(absoluteUrl) && !urlExistsInQueue) {
            visitUrlsQueue.push({ url: absoluteUrl, depth: depth + 1 });
        }
    }

    // Scroll to the bottom of the page
    await page.evaluate(() => window.scrollBy(0, document.body.scrollHeight));

    // Delay for content loading
    await new Promise((resolve) => setTimeout(resolve, scrollDelay));

    const newHeight = await page.evaluate('document.body.scrollHeight');

    if (newHeight === previousHeight) {
        console.log('No further scrolling possible. Exiting.');
        break;
    }

    previousHeight = newHeight;
    scrollCount++;
}

  • Scrolling Logic: Scrolls to the bottom to load new content.

  • Max Scrolls: Limits scrolling attempts.

  • Delay Between Scrolls: Allows new content to load.

5. Delay Between Requests

Adds a random delay between requests to mimic human behavior.

const delayInMS = Math.round((Math.random() + 1) * 10000);
await new Promise((resolve) => setTimeout(resolve, delayInMS));

6. Closing the Browser

Closes the Puppeteer instance after processing URLs.

await browserInstance.close();

7. Returning the Results

Returns the list of unique URLs.

return Array.from(foundUrlsSet);

Example Workflow

  1. Input: Domain provided (e.g., amazon.in).

  2. Seed URL: Starts with https://amazon.in/.

  3. Navigation:

    • Opens a headless browser.

    • Navigates to https://amazon.in/.

    • Waits for DOM content to load.

  4. Infinite Scrolling:

    • Scrolls to load additional content.

    • Extracts all <a> tags and processes their href attributes.

  5. URL Filtering:

    • Converts relative URLs to absolute.

    • Filters invalid or non-HTML URLs.

    • Ensures URLs belong to the same domain.

  6. Depth Control: Stops beyond specified maxDepth.

  7. Output: Returns a list of unique URLs.

Key Components

CrawlerService

  • Handles core crawling logic.

  • Uses Puppeteer for dynamic content.

Dynamic Crawler

  • Simulates user interactions like scrolling.

URL Validation

  • Filters non-HTML resources.

  • Ensures domain consistency.

Queue System

  • Managed by BullMQ for asynchronous processing.

Challenges and Solutions

Dynamic Content

  • Challenge: Infinite scrolling or lazy-loaded content.

  • Solution: Simulate scrolling and wait for new content.

Performance

  • Challenge: Crawling dynamic websites can be slow.

  • Solution: Use asynchronous processing and random delays.

Edge Cases

  • Challenge: Some websites block automated crawlers.

  • Solution: Mimic human behavior with random delays and headless mode.
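
One further hardening step, not in the current code but a common complement to random delays, is to give Puppeteer a realistic browser fingerprint before navigating inside the dynamic crawl loop. page.setUserAgent and page.setViewport are standard Puppeteer APIs; the specific values below are purely illustrative.

// Illustrative only: present a realistic user agent and viewport before page.goto()
await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
);
await page.setViewport({ width: 1366, height: 768 });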


2. Queue System with BullMQ

To handle multiple domains efficiently, the project uses BullMQ for job queuing. Each domain is processed as a separate job, and the system can scale to handle hundreds of domains.

const connection = new Redis({
    ...REDIS_CONFIG,
    maxRetriesPerRequest: null,
});

// for static crawler
export const validDomainQueue = new Queue(validDomainQueueName, { connection });

export const validDomainWorker = new Worker(validDomainQueueName, domainCrawlJob, { connection });

// TODO: implement logger
validDomainWorker.on('completed', (job: Job) => {
    console.log(`Job completed: ${job.id}`);
});

validDomainWorker.on('failed', (job: Job | undefined, err) => {
    console.log(`Job failed: ${err}`);
});

// for dynamic crawler
export const dynamicValidDomainQueue = new Queue(dynamicValidDomainQueueName, { connection });

export const dynamicValidDomainWorker = new Worker(
    dynamicValidDomainQueueName,
    dynamicDomainCrawlJob,
    {
        connection,
    }
);

// TODO: implement logger
dynamicValidDomainWorker.on('completed', (job: Job) => {
    console.log(`Job completed: ${job.id}`);
});

dynamicValidDomainWorker.on('failed', (job: Job | undefined, err) => {
    console.log(`Job failed: ${err}`);
});
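
The workers above reference domainCrawlJob and dynamicDomainCrawlJob, which are not shown here. As a hedged sketch, a processor might wire the crawler, the AI filter, and the database together roughly like this; the import paths, model name, and depth value are assumptions.

// Hypothetical processor passed to the static-crawl Worker above
import { Job } from 'bullmq';
import { CrawlerService } from '../services/crawler.service'; // assumed paths
import { getProductUrls } from '../services/ai.service';
import { CrawlProductUrl } from '../models/crawlProductUrl.model';

export const domainCrawlJob = async (job: Job<{ domain: string; domainId: string }>) => {
    const { domain, domainId } = job.data;

    // 1. Crawl the domain with the static BFS crawler (depth is an example value)
    const crawler = new CrawlerService();
    const foundUrls = await crawler.startCrawl(domain, 3);

    // 2. Filter down to product URLs (Gemini, with the regex fallback)
    const productUrls = await getProductUrls(foundUrls);

    // 3. Persist the result
    await CrawlProductUrl.create({ urls: productUrls, domainId });

    return { domain, total: foundUrls.length, products: productUrls.length };
};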

3. AI Integration for Product URL Filtering

To identify product URLs, the project uses Google’s Gemini AI. The model is prompted to recognize URL patterns that indicate product pages (e.g., /product/, /item/), with a regex-based fallback if the AI call fails.

export const getProductUrls = async (foundUrls: string[]): Promise<string[]> => {
    try {
        const genAI = new GoogleGenerativeAI(String(Config.GEMINI_AI_API_KEY));
        const model = genAI.getGenerativeModel(googleAIConfigs);

        const prompt = `${promptForFetchingProductUrls}\n${foundUrls}`;

        const result: GenerateContentResult = await model.generateContent(prompt);
        const generatedText: string = result.response.text();
        const productUrls: string[] = JSON.parse(generatedText);
        return productUrls;
    } catch (error) {
        console.log('Something went wrong :( with google AI, using regex as fallback', error);

        let productUrls: string[] = [];

        const productUrlRegex =
            /\/product\/|\/products\/|\/item\/|\/items\/|\/p\/|\/dp\/|\/buy\/|product_id=|item=/;

        for (const foundUrl of foundUrls) {
            if (productUrlRegex.test(foundUrl)) {
                productUrls.push(foundUrl);
            }
        }

        return productUrls;
    }
};

export const promptForFetchingProductUrls = `You are an AI assistant that identifies product URLs. A product URL points to a page containing details about a single product, such as its name, price, specifications, or "Add to Cart" button.
Here are some examples:
- Product URL: https://example.com/product/12345
- Product URL: https://example.com/item/67890
- Non-Product URL: https://example.com/category/shoes
- Non-Product URL: https://example.com/search?q=shoes

Here are some more real examples of productUrls: 
1. https://amzn.in/d/gxGhtVq
2. https://www.amazon.in/Weighing-Balanced-Batteries-Included-A121/dp/B083C6XMKQ/
3. https://www.flipkart.com/mi-11-lite-vinyl-black-128-gb/p/itmac6203bae9394?pid=MOBG3VSKRKGZKJAR
4. https://www.myntra.com/watches/coach/coach-women-grand-embellished-dial-analogue-watch-14503941/30755447/buy
5. https://www.myntra.com/tshirts/nautica/nautica-pure-cotton-polo-collar-t-shirt/31242755/buy
6. https://jagrukjournal.lla.in/products/ek-geet-2025-collectible-calendar
7. https://mrbeast.store/products/kids-basics-panther-tee-royal-blue

productUrls = ['https://abc.com/product/1', 'https://abc.com/product/2']
Return: Array<productUrls>

Given the following URLs, return only the product URLs in array[]:`;

// This prompt is currently structured for Google's Gemini only; I plan to extend it into a generalized prompt for other popular LLMs soon.

4. Database Models

The project uses MongoDB to store crawled domains and product URLs. Here are the schemas:

CrawlDomain Schema

const createCrawlDomainSchema = (documentExpirationInSeconds: number): Schema<IDomainDocument> =>
    new Schema<IDomainDocument>(
        {
            domain: { type: String, required: true, unique: true },
            jobId: { type: [String], default: [] },
            status: { type: String, enum: Object.values(CrawlDomainStatus), default: CrawlDomainStatus.PENDING },
            expiresAt: { type: Date, default: () => new Date(Date.now() + documentExpirationInSeconds * 1000) },
        },
        { timestamps: true }
    );

CrawlProductUrl Schema

const createCrawlProductUrlSchema = (documentExpirationInSeconds: number): Schema<IProductUrlsDocument> =>
    new Schema<IProductUrlsDocument>(
        {
            urls: { type: [String], required: true },
            domainId: { type: Schema.Types.ObjectId, ref: 'Domain', required: true },
            expiresAt: { type: Date, default: () => new Date(Date.now() + documentExpirationInSeconds * 1000) },
        },
        { timestamps: true }
    );
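
One detail worth calling out: an expiresAt field on its own does not delete anything. MongoDB only removes expired documents when a TTL index exists on that field. Here is a hedged sketch of how the models might be registered with such an index; the retention value is an example, and only 'Domain' is grounded in the post (it matches the ref used in the product URL schema).

import mongoose from 'mongoose';
// createCrawlDomainSchema and IDomainDocument come from the schema snippet above

// expireAfterSeconds: 0 tells MongoDB to delete each document once its expiresAt date passes
const crawlDomainSchema = createCrawlDomainSchema(60 * 60 * 24); // e.g. 24-hour retention
crawlDomainSchema.index({ expiresAt: 1 }, { expireAfterSeconds: 0 });

export const CrawlDomain = mongoose.model<IDomainDocument>('Domain', crawlDomainSchema);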

Detailed Sequence Diagram

(Sequence diagram of the end-to-end crawl flow: API request → BullMQ queue → crawler worker → Gemini product-URL filtering → MongoDB.)

Challenges and Solutions

1. Dynamic Content Handling

  • Challenge: Websites with infinite scrolling or dynamically loaded content were difficult to crawl.

  • Solution: Used Puppeteer to simulate a browser and scroll through the page to load additional content.

2. Performance

  • Challenge: Crawling large websites sequentially was slow.

  • Solution: Implemented BullMQ for asynchronous job processing, allowing multiple domains to be crawled in parallel.

3. URL Pattern Recognition

  • Challenge: Different e-commerce platforms use varying URL structures for product pages.

  • Solution: Integrated Google’s Gemini AI to identify product URLs based on patterns.


Future Improvements

While the project meets the initial requirements, there’s always room for improvement:

  1. Concurrency: Increase BullMQ worker concurrency so multiple domains are crawled simultaneously (see the sketch after this list).

  2. Dead Letter Queue (DLQ): Handle failed jobs more efficiently.

  3. Caching: Use Redis to cache frequently accessed data.

  4. Error Handling: Improve error handling and logging for better debugging.
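
For the concurrency item, BullMQ supports this directly through the worker's concurrency option, so the change is small; a minimal sketch reusing the names from the queue setup above:

import { Worker } from 'bullmq';

// Process up to 5 domain-crawl jobs in parallel within a single worker process
export const validDomainWorker = new Worker(validDomainQueueName, domainCrawlJob, {
    connection,
    concurrency: 5,
});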


Conclusion

Building this web crawler was a challenging yet rewarding experience. It not only helped me deepen my understanding of web scraping and asynchronous programming but also gave me hands-on experience with AI integration and queue systems. I’m excited to continue improving this project and exploring new features.

Feel free to check out the code on the x-crawler GitHub and share your feedback!