

A scalable, customizable sitemap generator capable of crawling thousands of pages while respecting robots.txt and regex filters. Built to automate technical SEO tasks.
In the world of Technical SEO, discoverability is everything. You can have the best content on the web, but if search engines can't find it, it doesn't exist. The humble XML Sitemap is the roadmap we give to Google, Bing, and other crawlers to ensure every piece of valuable content is indexed.
However, for large, dynamic, or legacy websites, maintaining an up-to-date sitemap is a logistical nightmare. Plugin-based solutions often fail on decoupled CMS architectures, and enterprise SaaS tools can be prohibitively expensive.
I recognized a gap in the market for a developer-first, highly configurable automation tool that could crawl any website—regardless of its tech stack—and generate compliant XML, HTML, and TXT sitemaps.
This case study breaks down how I built the Fast Sitemap Generator, a high-performance actor on the Apify platform, designed to solve the problem of SEO visibility at scale.
Building a sitemap generator sounds simple on paper: visit a page, grab links, repeat. But when you move from a 10-page blog to a 10,000-page e-commerce site, complexity explodes.
I faced three core technical hurdles:
- URL chaos: duplicate pages, query-parameter variants (e.g., ?price=low&color=red), and relative link structures that can trap a naive crawler in an infinite loop.
- Crawl politeness: respecting robots.txt. Ignoring standard exclusion protocols can get IPs banned or negatively impact the target site's server load.
- Scale and fault tolerance: a crawl spanning thousands of pages has to survive crashes and memory pressure without starting over.

My goal was to build a tool that was robust enough for enterprise sites but simple enough for a junior dev to configure.
To tackle these challenges, I chose a stack optimized for speed and reliability:
- Node.js with the Apify SDK as the actor runtime.
- Crawlee for crawling (specifically the CheerioCrawler).

I opted for Crawlee because of its intelligent queue management. Unlike a simple recursive function, Crawlee handles the RequestQueue persistently. If the crawler crashes (or the server restarts) after 5,000 pages, the state is saved, and it resumes exactly where it left off.
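A minimal sketch of that persistence (the queue name sitemap-crawl is illustrative, not the actor's actual configuration):

import { RequestQueue } from 'crawlee';

// Opening a named queue persists its state (to disk locally, or to the
// Apify platform when running as an actor). Re-running after a crash
// reuses the queue, so already-handled requests are skipped.
const requestQueue = await RequestQueue.open('sitemap-crawl');
await requestQueue.addRequest({ url: 'https://example.com/' });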
I specifically chose CheerioCrawler over PlaywrightCrawler or PuppeteerCrawler.
Headless browsers can render JavaScript-heavy pages, but they are far more resource-intensive. For sitemap generation, most sites expose their links in plain static-HTML <a href> tags, making Cheerio the efficient choice.

// Simplified view of the crawler setup ('input' is the actor's input,
// e.g. from Actor.getInput())
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    requestQueue,
    maxRequestsPerCrawl: input.maxPagesPerCrawl,
    requestHandler: async ({ request, $, log, enqueueLinks }) => {
        const title = $('title').text();
        log.info(`Processing ${request.url}...`);

        // Extract links and queue them if they match the user's patterns
        await enqueueLinks({
            globs: input.includePatterns,
            exclude: input.excludePatterns,
        });

        // Store data for sitemap generation
        await Dataset.pushData({ url: request.url, title });
    },
});
The biggest UX challenge was giving users control over what to crawl. A simple "match domain" rule isn't enough. Users needed to exclude /admin panels, /login pages, or dynamic tags like /tag/*/feed.
I implemented a dual-layer filtering system using regular expressions (regex) within the enqueueLinks strategy.
The breakthrough came when I realized I could map the user's simple configuration directly to Crawlee's globs and exclude options, but I had to sanitize them to prevent ReDoS (Regex Denial of Service) attacks. By strictly typing the input schema and validating regex patterns before the crawl starts, I ensured the actor fails fast with a helpful error message rather than hanging indefinitely on a bad pattern.
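A sketch of that fail-fast validation (validatePatterns is a hypothetical helper, and the length cap is an illustrative guard, not the actor's exact rule):

// Pre-flight check: compile each user-supplied pattern once, and reject
// suspiciously long inputs, before any request is enqueued. (A length cap
// is a crude heuristic against ReDoS-prone patterns, not a full defense.)
function validatePatterns(patterns) {
    const MAX_PATTERN_LENGTH = 200;
    for (const pattern of patterns) {
        if (pattern.length > MAX_PATTERN_LENGTH) {
            throw new Error(`Pattern too long (max ${MAX_PATTERN_LENGTH} chars): ${pattern}`);
        }
        try {
            new RegExp(pattern); // throws on invalid regex syntax
        } catch (err) {
            throw new Error(`Invalid pattern "${pattern}": ${err.message}`);
        }
    }
}

validatePatterns(['/admin/.*', '/tag/.*/feed']); // passes
// validatePatterns(['[unclosed']);              // fails fast with a clear error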
One of the most complex features was respecting robots.txt. I integrated a parsing library that fetches the robots file at the start of the run, creates a rule set, and checks every potential URL against it before adding it to the queue. This ensures the generator is a "good citizen" of the web.
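A minimal sketch of that check, assuming the robots-parser package from npm (the actual library and the user-agent string are assumptions):

import robotsParser from 'robots-parser';

// Fetch robots.txt once at the start of the run and build a rule set.
const robotsUrl = 'https://example.com/robots.txt';
const response = await fetch(robotsUrl);
const robots = robotsParser(robotsUrl, await response.text());

// Check every candidate URL before adding it to the queue. isAllowed()
// returns undefined when no rule applies, so default to allowed.
const candidate = 'https://example.com/admin/settings';
if (robots.isAllowed(candidate, 'FastSitemapGenerator') ?? true) {
    // safe to enqueue
}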
Since its release, the Fast Sitemap Generator has found steady use on the Apify platform. Users have reported using it to generate sitemaps for single-page applications (SPAs) and legacy PHP sites alike, bridging the gap between development and marketing teams.
Writing the XML file itself posed a memory challenge. For a 50,000-page site, storing a massive array of URL objects in memory is risky.
I implemented a stream-based approach. As the crawler progresses, it writes results to a Key-Value Store. Once the crawl finishes, a separate "Merge & Export" step iterates through the dataset and streams the lines into the final sitemap.xml file.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2023-10-10</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  ...
</urlset>
This decoupling of crawling and generating ensures that the process is scalable and fault-tolerant.
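A sketch of that "Merge & Export" step, assuming the crawl results live in the default Crawlee dataset and the file is written locally (the actor's storage layout may differ, and real code would also XML-escape the URLs):

import { Dataset } from 'crawlee';
import { createWriteStream } from 'node:fs';

// Stream the sitemap line by line instead of building one giant string,
// so memory use stays flat even for 50,000+ URLs.
const out = createWriteStream('sitemap.xml');
out.write('<?xml version="1.0" encoding="UTF-8"?>\n');
out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n');

const dataset = await Dataset.open();
await dataset.forEach(async (item) => {
    out.write(`  <url>\n    <loc>${item.url}</loc>\n  </url>\n`);
});

out.write('</urlset>\n');
out.end();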
The tool is functional and stable, but I have ambitious plans for v2.
Building the Fast Sitemap Generator taught me that "simple" tools often require complex architectural decisions to be reliable at scale. By leveraging Node.js streams and Crawlee's robust queueing, I built a tool that saves developers hours of manual work.
If you are looking to automate your technical SEO or need a reliable crawler for your next project, give it a spin.
