Building a High-Performance Sitemap Generator with Node.js & Crawlee

A scalable, customizable sitemap generator capable of crawling thousands of pages while respecting robots.txt and regex filters. Built to automate technical SEO tasks.

The Challenge

Maintaining up-to-date sitemaps for large, dynamic websites is a manual nightmare: deep pages are often missed, and non-indexable content is rarely excluded automatically.

The Solution

An automated, configuration-driven crawler that generates compliant XML, HTML, and TXT sitemaps with precision control over depth, frequency, and regex-based exclusion patterns.

Implementation Details

Introduction

In the world of Technical SEO, discoverability is everything. You can have the best content on the web, but if search engines can't find it, it doesn't exist. The humble XML Sitemap is the roadmap we give to Google, Bing, and other crawlers to ensure every piece of valuable content is indexed.

However, for large, dynamic, or legacy websites, maintaining an up-to-date sitemap is a logistical nightmare. Plugin-based solutions often fail on decoupled CMS architectures, and enterprise SaaS tools can be prohibitively expensive.

I recognized a gap in the market for a developer-first, highly configurable automation tool that could crawl any website—regardless of its tech stack—and generate compliant XML, HTML, and TXT sitemaps.

This case study breaks down how I built the Fast Sitemap Generator, a high-performance actor on the Apify platform, designed to solve the problem of SEO visibility at scale.

The Challenge: Why Not Just Use a Plugin?

Building a sitemap generator sounds simple on paper: visit a page, grab links, repeat. But when you move from a 10-page blog to a 10,000-page e-commerce site, complexity explodes.

I faced three core technical hurdles:

  1. Infinite Loops & Spider Traps: Many modern sites have calendar widgets, faceted search parameters (e.g., ?price=low&color=red), or relative link structures that can trap a naive crawler in an infinite loop.
  2. Politeness & Compliance: A responsible crawler must respect robots.txt. Ignoring standard exclusion protocols can get IPs banned or negatively impact the target site's server load.
  3. Performance vs. Depth: How do you crawl deep into a site structure (Level 5+) without the process taking hours or consuming excessive memory?
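
Hurdles 1 and 3 both come down to controlling how far the crawler wanders. As a rough sketch of the idea (not the actor's actual code, and with an assumed maxCrawlDepth value), each request can carry its depth in Crawlee's userData, and links are only enqueued while the limit has not been reached:

// Minimal depth-limit sketch. `maxCrawlDepth` and the start URL are
// illustrative values, not the actor's real configuration.
import { CheerioCrawler } from 'crawlee';

const maxCrawlDepth = 5;

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, enqueueLinks }) => {
        const depth = (request.userData.depth as number | undefined) ?? 0;
        if (depth >= maxCrawlDepth) return; // don't expand this branch any further

        await enqueueLinks({
            // Tag every newly discovered request with its depth so the limit
            // can be enforced on the next hop.
            transformRequestFunction: (req) => {
                req.userData = { ...req.userData, depth: depth + 1 };
                return req;
            },
        });
    },
});

await crawler.run(['https://example.com']);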

My goal was to build a tool that was robust enough for enterprise sites but simple enough for a junior dev to configure.

The Architecture: Built on Giant Shoulders

To tackle these challenges, I chose a stack optimized for speed and reliability:

  • Runtime: Node.js (for non-blocking I/O).
  • Crawler Engine: Crawlee (specifically CheerioCrawler).
  • Platform: Apify (for serverless execution and dataset storage).
  • Language: TypeScript (for type safety in complex configuration objects).

Why Crawlee?

I opted for Crawlee because of its intelligent queue management. Unlike a simple recursive function, Crawlee handles the RequestQueue persistently. If the crawler crashes (or the server restarts) after 5,000 pages, the state is saved, and it resumes exactly where it left off.
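
To make that persistence concrete, here is a minimal sketch of a named RequestQueue in Crawlee; the queue name and start URL are placeholders rather than the actor's real configuration. Re-running the script after a crash picks up the same queue instead of starting from scratch:

// Sketch: a named RequestQueue keeps crawl state on disk (or in Apify storage),
// so re-running the script resumes rather than restarts. Names are illustrative.
import { CheerioCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open('sitemap-crawl');
await requestQueue.addRequest({ url: 'https://example.com' });

const crawler = new CheerioCrawler({
    requestQueue,
    requestHandler: async ({ request, log, enqueueLinks }) => {
        log.info(`Visited ${request.url}`);
        await enqueueLinks(); // same-hostname links by default
    },
});

await crawler.run();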

I specifically chose CheerioCrawler over PlaywrightCrawler or PuppeteerCrawler.

  • CheerioCrawler: Downloads raw HTML and parses it. It's lightning-fast and consumes minimal CPU.
  • Browser-Based Crawlers: Render JavaScript. While powerful, they are slower and costlier. For sitemaps, we rarely need to execute JS just to find <a href> tags, making Cheerio the efficient choice.
// Simplified view of the crawler setup
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    requestQueue,
    maxRequestsPerCrawl: input.maxPagesPerCrawl,
    requestHandler: async ({ request, $, log, enqueueLinks }) => {
        const title = $('title').text();
        log.info(`Processing ${request.url}...`);

        // Extract links and add them to the queue if they match the patterns
        await enqueueLinks({
            globs: input.includePatterns,
            exclude: input.excludePatterns,
        });

        // Store data for sitemap generation
        await Dataset.pushData({ url: request.url, title });
    },
});

The "Aha!" Moment: Solving the Regex Puzzle

The biggest UX challenge was giving users control over what to crawl. A simple "match domain" rule isn't enough. Users needed to exclude /admin panels, /login pages, or dynamic tags like /tag/*/feed.

I implemented a dual-layer filtering system using Regex (Regular Expressions) within the enqueueLinks strategy.

The breakthrough came when I realized I could map the user's simple configuration directly to Crawlee's globs and exclude options, but I had to sanitize them to prevent ReDoS (Regex Denial of Service) attacks. By strictly typing the input schema and validating regex patterns before the crawl starts, I ensured the actor fails fast with a helpful error message rather than hanging indefinitely on a bad pattern.
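
The fail-fast idea itself is small enough to sketch. The function below (illustrative, not the actor's exact code) compiles each user-supplied pattern up front so that a malformed pattern surfaces as a clear error before any request is made; screening for ReDoS-prone constructs such as nested quantifiers can be layered on top of the same check:

// Fail-fast pattern validation (illustrative; the actor's real checks may differ).
function validatePatterns(patterns: string[]): RegExp[] {
    return patterns.map((pattern) => {
        try {
            return new RegExp(pattern);
        } catch (err) {
            // Surface a helpful error before the crawl starts.
            throw new Error(
                `Invalid exclusion pattern "${pattern}": ${(err as Error).message}`,
            );
        }
    });
}

// Valid patterns compile cleanly; a malformed one (e.g. an unclosed group)
// throws immediately instead of causing trouble mid-crawl.
const excludeRegexps = validatePatterns(['^/admin', '/tag/.+/feed$']);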

Respecting Robots.txt

One of the most complex features was respecting robots.txt. I integrated a parsing library that fetches the robots file at the start of the run, creates a rule set, and checks every potential URL against it before adding it to the queue. This ensures the generator is a "good citizen" of the web.
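
The case study doesn't name the parsing library, but the check can be sketched with the widely used robots-parser package and Node 18+'s built-in fetch; the user-agent string here is a placeholder:

// Sketch of the robots.txt gate, assuming the `robots-parser` npm package
// and Node 18+'s global fetch. URLs and user agent are illustrative.
import robotsParser from 'robots-parser';

const robotsUrl = 'https://example.com/robots.txt';
const robotsTxt = await (await fetch(robotsUrl)).text();
const robots = robotsParser(robotsUrl, robotsTxt);

// Called for every candidate URL before it is added to the request queue.
function isCrawlAllowed(url: string): boolean {
    // isAllowed() returns undefined for URLs outside the robots file's origin;
    // treat those as allowed here.
    return robots.isAllowed(url, 'fast-sitemap-generator') ?? true;
}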

Performance & Results

Since its release, the Fast Sitemap Generator has achieved:

  • 100% Success Rate: On all runs on the Apify platform.
  • Speed: Capable of crawling 1,000 pages in under 120 seconds (network dependent).
  • Efficiency: Runs on the lowest tier of Apify compute units (0.1 CPU), making it extremely cost-effective.

Users have reported using it to generate sitemaps for single-page applications (SPAs) and legacy PHP sites alike, bridging the gap between development and marketing teams.

Technical Deep Dive: Generating the XML

Writing the XML file itself posed a memory challenge. For a 50,000-page site, storing a massive array of URL objects in memory is risky.

I implemented a stream-based approach. As the crawler progresses, it writes results to a Key-Value Store. Once the crawl finishes, a separate "Merge & Export" step iterates through the dataset and streams the lines into the final sitemap.xml file.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2023-10-10</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  ...
</urlset>

This decoupling of crawling and generating ensures that the process is scalable and fault-tolerant.
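
As a rough illustration of that export step (store layout and field names are assumptions, not the actor's exact code), the merge phase can iterate the crawl dataset and write each entry straight to a file stream:

// Sketch of a streaming "Merge & Export" step: iterate the crawl dataset and
// write each <url> entry directly to sitemap.xml instead of building the
// whole document in memory.
import { createWriteStream } from 'node:fs';
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
const out = createWriteStream('sitemap.xml');

out.write('<?xml version="1.0" encoding="UTF-8"?>\n');
out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n');

await dataset.forEach(async (item) => {
    out.write(`  <url>\n    <loc>${item.url}</loc>\n  </url>\n`);
});

out.write('</urlset>\n');
out.end();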

Future Roadmap

The tool is functional and stable, but I have ambitious plans for v2:

  1. Broken Link Reporter: Since we are already visiting every page, why not report 404s? This turns the tool into a site health auditor.
  2. Visual Site Tree: Generating a JSON graph of the site structure to visualize internal linking silos.
  3. Proxy Configuration: Adding support for residential proxies to crawl sites with strict firewalls (WAFs).

Conclusion

Building the Fast Sitemap Generator taught me that "simple" tools often require complex architectural decisions to be reliable at scale. By leveraging Node.js streams and Crawlee's robust queueing, I built a tool that saves developers hours of manual work.

If you are looking to automate your technical SEO or need a reliable crawler for your next project, give it a spin.


Ready to Optimize Your SEO?

  • Try it Live: Run the Actor on Apify

Key Takeaways

"Mastered the complexities of web crawling politeness policies (robots.txt), optimizing BFS traversal for depth control, and handling large datasets in memory efficiently using streams."

Technologies Used

Node.js, TypeScript, Crawlee, Apify SDK, XML, Cheerio

My Role

Lead Full Stack Developer & Architect
