Building a Python Web Scraper using only Natural Language
Web scraping used to require deep knowledge of HTML parsing, HTTP requests, and CSS selectors. You had to understand the DOM, handle pagination, and deal with rate limiting.
With Vibe Coding, you can build a working web scraper simply by describing the data you want to extract.
In this tutorial, we'll build a scraper that extracts job listings from a website and exports them to a CSV file—using only natural language prompts in Cursor.
What We're Building
Scraper Name: JobHunter
Target: A job board website (we'll use a practice site)
Data to Extract:
* Job title
* Company name
* Location
* Salary range
* Posted date
Output: CSV file with all the data
Tech Stack:
* Python 3.10+
* BeautifulSoup4 (for HTML parsing)
* Requests (for HTTP)
* Pandas (for CSV export)
Step 1: Set Up the Project
Create a New Directory
```bash
mkdir job-scraper
cd job-scraper
```
Create a Virtual Environment
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
Open in Cursor
```bash
cursor .
```
Step 2: The Initial Prompt
Open Cursor's Composer (`Cmd+I`, or `Ctrl+I` on Windows/Linux) and paste this prompt:
> "Create a Python web scraper that:
> 1. Scrapes job listings from 'https://realpython.github.io/fake-jobs/' (a practice scraping site)
> 2. Extracts: job title, company name, location, and posted date
> 3. Stores the data in a list of dictionaries
> 4. Exports the data to a CSV file called 'jobs.csv'
> 5. Uses BeautifulSoup4 and requests
> 6. Includes error handling for network issues
> 7. Adds a 1-second delay between requests to be polite
> 8. Prints progress as it scrapes
>
> Create a file called `scraper.py` with all the necessary code."
Press Enter and wait.
Step 3: Review the Generated Code
Cursor will generate `scraper.py` with something like this:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time


def scrape_jobs():
    url = "https://realpython.github.io/fake-jobs/"
    print(f"Fetching {url}...")

    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error fetching the page: {e}")
        return []

    soup = BeautifulSoup(response.content, 'html.parser')
    job_cards = soup.find_all('div', class_='card-content')

    jobs = []
    print(f"Found {len(job_cards)} job listings. Scraping...")

    for i, card in enumerate(job_cards, 1):
        try:
            title = card.find('h2', class_='title').text.strip()
            company = card.find('h3', class_='company').text.strip()
            location = card.find('p', class_='location').text.strip()
            posted = card.find('time').text.strip()

            job = {
                'title': title,
                'company': company,
                'location': location,
                'posted_date': posted
            }
            jobs.append(job)
            print(f"  [{i}/{len(job_cards)}] Scraped: {title} at {company}")
            time.sleep(1)  # Be polite
        except AttributeError as e:
            print(f"  Error parsing job card {i}: {e}")
            continue

    return jobs


def save_to_csv(jobs, filename='jobs.csv'):
    if not jobs:
        print("No jobs to save.")
        return

    df = pd.DataFrame(jobs)
    df.to_csv(filename, index=False)
    print(f"\nSaved {len(jobs)} jobs to {filename}")


if __name__ == "__main__":
    jobs = scrape_jobs()
    save_to_csv(jobs)
```
Step 4: Install Dependencies
Cursor might also create a `requirements.txt`:
```
beautifulsoup4==4.12.2
requests==2.31.0
pandas==2.1.0
```
Install them:
```bash
pip install -r requirements.txt
```
Step 5: Run the Scraper
```bash
python scraper.py
```
Output:
```
Fetching https://realpython.github.io/fake-jobs/...
Found 100 job listings. Scraping...
  [1/100] Scraped: Senior Python Developer at Payne, Roberts and Davis
  [2/100] Scraped: Energy engineer at Vasquez-Davidson
  ...

Saved 100 jobs to jobs.csv
```
Step 6: Iterate with Vibes
The basic scraper works, but let's make it better.
Iteration 1: Add Filtering
Prompt:
> "Update the scraper to only save jobs that contain 'Python' in the title."
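After Cursor applies the change, the filtering logic might look something like this (a minimal sketch, assuming the `jobs` list of dictionaries from Step 3):

```python
# Keep only jobs whose title mentions Python, case-insensitively.
python_jobs = [job for job in jobs if 'python' in job['title'].lower()]
save_to_csv(python_jobs)
```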
Iteration 2: Add Pagination
Prompt:
> "The website has multiple pages. Update the scraper to handle pagination. The next page URL is in a link with class 'next'."
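One plausible shape for the result, sketched under the assumption that a `parse_job_cards()` helper (hypothetical here) extracts the listings from a single page:

```python
import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_all_pages(start_url):
    """Follow 'next' links until there are none left."""
    jobs = []
    url = start_url
    while url:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        jobs.extend(parse_job_cards(soup))  # hypothetical per-page parsing helper
        next_link = soup.find('a', class_='next')
        # Resolve relative hrefs against the current page; stop when no link is found.
        url = urljoin(url, next_link['href']) if next_link else None
        time.sleep(1)  # stay polite between page requests
    return jobs
```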
Iteration 3: Add Salary Extraction
Prompt:
> "Some job cards include a salary range in a separate tag. Extract it if it exists, otherwise set it to 'Not specified'."
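Inside the card-parsing loop, the guarded lookup might look like this (the `p.salary` selector is an assumption; check the site's actual markup):

```python
# Salary is optional, so guard the lookup instead of assuming the tag exists.
salary_tag = card.find('p', class_='salary')  # assumed selector, not confirmed
job['salary'] = salary_tag.text.strip() if salary_tag else 'Not specified'
```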
Iteration 4: Add Logging
Prompt:
> "Replace print statements with proper logging using Python's logging module. Save logs to 'scraper.log'."
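The core of that change is a one-time configuration; a minimal sketch:

```python
import logging

# Send log output to both scraper.log and the console, with timestamps.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
    handlers=[logging.FileHandler('scraper.log'), logging.StreamHandler()],
)
logger = logging.getLogger(__name__)

# Each print then becomes a call at the appropriate level, e.g.:
# print(f"Fetching {url}...")   ->  logger.info("Fetching %s...", url)
# print(f"  Error parsing...")  ->  logger.error("Error parsing job card %d: %s", i, e)
```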
Each iteration takes 30-60 seconds.
Step 7: Handle Edge Cases
Real-world scraping has challenges. Let's address them.
Challenge 1: Dynamic Content (JavaScript-rendered)
Prompt:
> "If the website uses JavaScript to load content, update the scraper to use Selenium instead of requests."
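A minimal sketch of what the Selenium version's fetch step could look like (assumes the `selenium` package and Chrome are installed; Selenium 4+ downloads a matching driver automatically):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless=new')  # render pages without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://realpython.github.io/fake-jobs/')
    # page_source is the DOM after JavaScript has run, so the
    # existing BeautifulSoup parsing code works unchanged.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
finally:
    driver.quit()
```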
Challenge 2: Rate Limiting
Prompt:
> "Add exponential backoff if we get a 429 (Too Many Requests) response."
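The pattern Cursor should produce looks roughly like this (a sketch; the retry count and delays are illustrative):

```python
import time
import requests

def fetch_with_backoff(url, max_retries=5):
    """Retry on HTTP 429, doubling the wait each attempt (1s, 2s, 4s, ...)."""
    delay = 1
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Honor the server's Retry-After header if present, else back off exponentially.
        wait = int(response.headers.get('Retry-After', delay))
        print(f"Rate limited (attempt {attempt + 1}/{max_retries}), waiting {wait}s...")
        time.sleep(wait)
        delay *= 2
    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")
```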
Challenge 3: User-Agent Spoofing
Prompt:
> "Some websites block scrapers. Add a realistic User-Agent header to the requests."
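The change itself is small; something like the following (the User-Agent string below is just an example of a browser-like value):

```python
import requests

url = 'https://realpython.github.io/fake-jobs/'
# requests identifies itself as 'python-requests/x.y' by default,
# which some servers reject; a browser-like header avoids that.
headers = {
    'User-Agent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/120.0.0.0 Safari/537.36'
    )
}
response = requests.get(url, headers=headers, timeout=10)
```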
Advanced: Building a Multi-Site Scraper
Once you've mastered single-site scraping, you can build a scraper that handles multiple job boards.
Prompt:
> "Create a scraper that can scrape jobs from multiple websites. It should:
> - Accept a list of URLs
> - Detect the site structure automatically (or use site-specific parsers)
> - Combine all results into a single CSV
> - Run scrapers in parallel using threading"
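Sketched at a high level, with `scrape_jobs_for()` standing in as a hypothetical dispatcher that picks the right site-specific parser:

```python
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

urls = [
    'https://realpython.github.io/fake-jobs/',
    # ...one entry per job board, each needing its own selectors
]

def scrape_site(url):
    jobs = scrape_jobs_for(url)  # hypothetical: dispatch to a site-specific parser
    for job in jobs:
        job['source'] = url  # record which board each row came from
    return jobs

# Threads work well here because scraping is I/O-bound, not CPU-bound.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = pool.map(scrape_site, urls)

all_jobs = [job for site_jobs in results for job in site_jobs]
pd.DataFrame(all_jobs).to_csv('all_jobs.csv', index=False)
```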
Legal and Ethical Considerations
Important: Always check a website's `robots.txt` file and Terms of Service before scraping.
Prompt to Cursor:
> "Add a function that checks if scraping is allowed by reading the robots.txt file."
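The standard library already covers this; a minimal sketch using `urllib.robotparser`:

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def scraping_allowed(url, user_agent='JobHunter'):
    """Return True if the site's robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.set_url(urljoin(url, '/robots.txt'))
    parser.read()
    return parser.can_fetch(user_agent, url)

if not scraping_allowed('https://realpython.github.io/fake-jobs/'):
    raise SystemExit("robots.txt disallows scraping this page.")
```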
Conclusion
Building web scrapers used to require expertise in HTML parsing and HTTP protocols. With Vibe Coding, you just describe what data you want, and the AI handles the implementation.
At BYS Marketing, we use AI-powered scraping to gather competitive intelligence, monitor pricing, and track industry trends for our clients.
---
Need custom data extraction?
Contact BYS Marketing. We build intelligent scrapers that respect website policies and deliver clean data.