

A high-performance Python scraper built on Apify that extracts massive datasets of LinkedIn job postings, bypassing complex anti-bot measures and rate limits with residential proxies.
Recruiters, HR tech startups, and analysts need real-time, granular job market data, but LinkedIn's aggressive anti-scraping defenses (infinite scroll, auth-walls, and IP bans) make manual or basic automated extraction impossible at scale.
I engineered a resilient, serverless Actor using Python and the Apify SDK. It leverages rotating residential proxies and smart DOM traversal to emulate human behavior, bypassing rate limits and parsing dynamic content with 100% reliability.
Lead Full Stack Engineer & Maintainer
This project deepened my expertise in anti-bot evasion strategies, specifically the necessity of residential proxy rotation and the efficiency of headless HTML parsing over full browser automation for high-volume data tasks.
In the era of data-driven decision-making, real-time labor market intelligence is a goldmine. Whether it's a recruitment agency automating lead generation, a university researcher analyzing shifts in skill demand, or an HR tech startup training LLMs on job descriptions, the need for raw, structured job data is insatiable.
However, accessing this data is a formidable engineering challenge. LinkedIn, the world's largest professional repository, guards its data behind one of the most sophisticated anti-scraping infrastructures on the web. It is a fortress of dynamic class names, infinite scrolling feeds, and aggressive IP-based rate limiting.
This case study details the journey of building the LinkedIn Job Postings Scraper, a high-performance extraction tool hosted on the Apify platform. It serves as not just a utility but a study in resilience engineering—demonstrating how to reliably extract structured data from a hostile, ever-changing environment. What started as a personal script has grown into a community-trusted tool with a perfect success rate, helping users automate what used to be weeks of manual copying and pasting.
Building a scraper for a static blog is trivial; building one for a heavily defended Single Page Application (SPA) like LinkedIn is a war of attrition. The complexity wasn't just in getting the data, but in getting it consistently and safely.
Unlike traditional pagination where ?page=2 yields predictable results, LinkedIn uses an infinite scroll mechanism. Content is dynamically injected into the DOM as the user reaches the bottom of the viewport.
A plain GET request only retrieves the initial skeleton of the page (metadata and a few headers), not the actual job list. LinkedIn also employs rigorous fingerprinting: if a request lacks the correct headers or cookies, or originates from a known data center IP (like AWS or DigitalOcean), it is immediately redirected to an authentication wall or served a 999 error code.
The HTML structure of LinkedIn is designed to thwart parsers. Class names are often generated dynamically (e.g., job-card-list__title--verified) or obfuscated. Furthermore, the layout shifts subtly between "Promoted" jobs and organic listings, often breaking rigid XPath or CSS selectors. Data normalization was another hurdle; converting "Posted 3 days ago" into a concrete ISO 8601 timestamp requires context-aware parsing logic.
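To make that normalization concrete, here is a minimal sketch of turning a relative phrase like "Posted 3 days ago" into an ISO 8601 timestamp; the function name, regex, and supported units are illustrative rather than the exact logic shipped in the Actor.
# Snippet: Normalizing relative dates into ISO 8601 (illustrative)
import re
from datetime import datetime, timedelta, timezone

# Map LinkedIn-style units to timedelta keyword arguments.
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days", "week": "weeks"}

def normalize_posted_date(raw_text, now=None):
    """Convert a phrase like 'Posted 3 days ago' into an ISO 8601 timestamp."""
    now = now or datetime.now(timezone.utc)
    match = re.search(r"(\d+)\s+(minute|hour|day|week)s?\s+ago", raw_text.lower())
    if not match:
        return None  # Unknown format: leave the field empty rather than guess
    value, unit = int(match.group(1)), match.group(2)
    return (now - timedelta(**{_UNITS[unit]: value})).isoformat()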
To solve these problems, I architected a solution that prioritized stealth over speed and efficiency over raw power.
The toolchain combines Python libraries for HTML parsing (Beautiful Soup) and data manipulation (Pandas).
1. The Hybrid Extraction Approach
Instead of relying solely on heavy browser automation tools like Selenium for the entire lifecycle, I adopted a hybrid approach: the browser handles navigation and rendering of the dynamic feed, while the captured HTML is handed off to Beautiful Soup for parsing. This is significantly faster and less CPU-intensive than asking a browser to query the DOM thousands of times.
2. Intelligent Proxy Rotation
This is the backbone of the scraper's reliability. I implemented a rotation strategy using Apify's Residential Proxy pool.
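As a minimal sketch of how that rotation can be wired up, assuming the Apify Python SDK, a recent httpx, and Beautiful Soup, the snippet below routes a request through the residential pool and hands the returned HTML to the parser; the search URL, headers, and selector are illustrative placeholders, not the production values.
# Snippet: Routing a request through Apify's residential proxy pool (illustrative)
import asyncio
import httpx
from apify import Actor
from bs4 import BeautifulSoup

# Browser-like headers; placeholder values, not the production fingerprint.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

async def main():
    async with Actor:
        # Each new_url() call yields a fresh residential exit node from the pool.
        proxy_config = await Actor.create_proxy_configuration(groups=["RESIDENTIAL"])
        proxy_url = await proxy_config.new_url()
        async with httpx.AsyncClient(proxy=proxy_url, headers=HEADERS, timeout=30) as client:
            response = await client.get("https://www.linkedin.com/jobs/search/?keywords=python")
            response.raise_for_status()
        # Parse the captured HTML offline instead of querying a live browser DOM.
        soup = BeautifulSoup(response.text, "html.parser")
        print([tag.get_text(strip=True) for tag in soup.select("h3")])  # selector is illustrative

asyncio.run(main())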
3. Dockerized Deployment
The entire application is containerized using Docker. This ensures environment consistency—preventing the "it works on my machine" syndrome—and allows the scraper to scale horizontally on the Apify cloud. If a user wants to scrape 10,000 jobs, the platform can spin up multiple containers to divide and conquer the workload.
# Snippet: Robust parsing with fallback mechanisms
import logging

def extract_salary(soup_element):
    # clean_currency_string and extract_regex_salary are project helper functions
    try:
        # Attempt to find the primary salary tag
        salary_tag = soup_element.find('span', class_='compensation-text')
        if salary_tag:
            return clean_currency_string(salary_tag.text)
        # Fallback: look for salary in the metadata text
        metadata = soup_element.find('div', class_='metadata-card')
        if metadata and '$' in metadata.text:
            return extract_regex_salary(metadata.text)
    except AttributeError:
        # Log a warning but do not crash the scraper
        logging.warning("Salary structure changed or missing.")
    return None
The biggest breakthrough came when I stopped trying to "look" at the page and started listening to it.
During the initial development, the infinite scroll was flaky. It would scroll, but sometimes new items wouldn't load, or the scraper would think it reached the end prematurely. I opened the Chrome DevTools Network tab and analyzed the XHR requests triggered during a scroll event.
I discovered that LinkedIn fetches the next batch of jobs via a specific internal API endpoint passing a start parameter (e.g., start=25, start=50).
The pivot: Instead of simulating a user physically scrolling (which is error-prone and slow), I rewrote the scraper to intercept and identify the logic behind these requests. While I couldn't hit the internal API directly without complex auth tokens, I optimized the browser automation to trigger these specific events more reliably. I essentially taught the bot to "know" exactly when the network request finished, rather than just waiting a completely arbitrary 3 seconds. This reduced the runtime by approx 40% and eliminated "partial load" errors.
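The case study does not name the browser automation library, but as a rough sketch of the idea, here is how waiting on the pagination response (rather than sleeping for a fixed interval) might look with Playwright; the URL, scroll distance, and batch count are assumptions.
# Snippet: Waiting on the pagination response instead of a fixed sleep (illustrative, Playwright)
from playwright.sync_api import sync_playwright

def scroll_and_wait(page):
    # Register the expectation first, then trigger the scroll, so the wait
    # ends exactly when the XHR carrying the next batch (start=25, start=50, ...) lands.
    with page.expect_response(lambda r: "start=" in r.url and r.status == 200):
        page.mouse.wheel(0, 4000)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.linkedin.com/jobs/search/")  # illustrative entry point
    for _ in range(5):  # load five additional batches of results
        scroll_and_wait(page)
    html = page.content()  # accumulated DOM, ready for offline parsing
    browser.close()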
The LinkedIn Job Postings Scraper has evolved into a robust tool effectively used by the community.
Each listing is returned with structured fields: Job Title, Company URL, Logo, Posted Date (normalized), Salary Range, and the full Description.
One significant user story involved a recruitment firm that needed to track hiring surges in the Fintech sector. Using this scraper, they automated the daily collection of 500+ job postings. They piped this data directly into a JSON file, which was then ingested by their internal dashboard. This automation saved their team approximately 15 hours of manual work per week, allowing them to focus on candidate outreach rather than data entry.
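For context, one output record with the fields listed above might look like the sketch below; the snake_case field names and all values are invented for illustration. In the Actor, such a record is pushed to the run's default dataset, which users can export as JSON or CSV.
# Snippet: Shape of one output record (field names and values are illustrative)
record = {
    "job_title": "Senior Backend Engineer",
    "company_url": "https://www.linkedin.com/company/example-co",
    "logo": "https://example.com/logo.png",
    "posted_date": "2024-05-14T09:30:00+00:00",  # normalized to ISO 8601
    "salary_range": "$140,000 - $170,000",
    "description": "Full job description text goes here.",
}
# Inside the Actor's async context: await Actor.push_data(record)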
While the current version is stable and performant, the roadmap involves deeper intelligence integration:
The next milestone is applying natural-language analysis to the job_description text. This will automatically tag listings with standardized skills (e.g., extracting "React", "Python", "AWS" into a clean array) and seniority levels, even if the original poster didn't explicitly populate those fields.
Building the LinkedIn Job Postings Scraper was a masterclass in modern web scraping defense and offense. It reinforced the importance of residential proxies, the power of hybrid browser/parser architectures, and the need for robust error handling.
For developers and companies looking to harness the power of web data, this tool offers a glimpse into what's possible when you move beyond curl and start engineering for resilience.
