Extract URLs and page metadata from website sitemaps into CSV format. Harvests titles, descriptions, keywords, author info, canonical URLs, and Open Graph data automatically.
git clone https://github.com/meysam81/sitemap-harvester.git
Sitemap Harvester is a Python tool that crawls website sitemaps and extracts page metadata into CSV format. It harvests page titles, meta descriptions, keywords, author information, canonical URLs, and Open Graph social media data. The tool deduplicates URLs automatically and reports progress in real time during extraction. It's designed for marketers, SEO professionals, and web analysts who need to audit and analyze website structure and metadata at scale.
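To make the description above concrete, here is a minimal sketch of the general technique: read the `<loc>` entries from a sitemap, drop duplicate URLs, and pull a few metadata fields out of a page's HTML. It uses only the Python standard library and is not taken from Sitemap Harvester's source; the names `extract_urls` and `MetaParser` are hypothetical, and the tool's own implementation may differ.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return the <loc> values of a sitemap, deduplicated, in first-seen order."""
    root = ET.fromstring(sitemap_xml)
    seen: dict[str, None] = {}  # dicts preserve insertion order
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        if url:
            seen.setdefault(url, None)
    return list(seen)


class MetaParser(HTMLParser):
    """Collect the title, <meta> name/property values, and canonical link."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Covers name="description" as well as property="og:title".
            key = a.get("name") or a.get("property")
            if key and a.get("content") is not None:
                self.meta[key.lower()] = a["content"]
        elif tag == "link" and (a.get("rel") or "").lower() == "canonical":
            self.meta["canonical"] = a.get("href") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = self.meta.get("title", "") + data.strip()
```

In this sketch a dictionary doubles as an ordered set, so duplicates are dropped without reordering the sitemap. Fetching each URL over HTTP and writing the rows to CSV are left out to keep the example short.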
Install via pip with `pip install sitemap-harvester`. Run `sitemap-harvester --url https://example.com` to harvest a website's sitemap into CSV. Use `--output` to specify a custom filename and `--timeout` to adjust timeout for slower websites or large sitemaps.
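If you want to drive the command above from a script, a thin wrapper around `subprocess` is one option. The sketch below uses only the three flags named in this section (`--url`, `--output`, `--timeout`); the helper names `build_command` and `harvest` are hypothetical, and the timeout unit is not stated here, so the value used in the example is purely illustrative.

```python
import subprocess


def build_command(url, output=None, timeout=None):
    """Assemble the sitemap-harvester argument list from the documented flags."""
    cmd = ["sitemap-harvester", "--url", url]
    if output is not None:
        cmd += ["--output", output]
    if timeout is not None:
        # Unit is an assumption; check the tool's own help text.
        cmd += ["--timeout", str(timeout)]
    return cmd


def harvest(url, output=None, timeout=None):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(url, output, timeout), check=True)
```

For example, `harvest("https://example.com", output="pages.csv", timeout=30)` would run `sitemap-harvester --url https://example.com --output pages.csv --timeout 30`, assuming the package is installed and on your `PATH`.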
SEO audits: Extract metadata from entire website sitemaps for analysis
Content inventory: Create comprehensive CSV catalogs of all website pages and their metadata
Competitive analysis: Harvest competitor website sitemap data for comparison
Migration planning: Document all pages and metadata before website restructuring
git clone https://github.com/meysam81/sitemap-harvester
Copy the install command above and run it in your terminal.
Launch Claude Code, Cursor, or your preferred AI coding agent.
Use the prompt template or examples below to test the skill.
Adapt the skill to your specific use case and workflow.
Crawl the sitemap of [WEBSITE_URL] and export the metadata of its pages recursively into a CSV file. Include the following metadata for each page: URL, title, meta description, h1, h2, and last modified date. Save the CSV file as [FILE_NAME].
# Sitemap Metadata Harvesting Report

## Summary

The sitemap of `example.com` was successfully crawled and analyzed. A total of 1,245 pages were indexed, with the following metadata extracted:

- **Total Pages**: 1,245
- **Pages with Missing Titles**: 42
- **Pages with Missing Meta Descriptions**: 118
- **Average Word Count**: 842

## Key Findings

### Top 5 Most Linked Pages

1. `/about-us` - 142 internal links
2. `/contact` - 98 internal links
3. `/products` - 89 internal links
4. `/blog` - 76 internal links
5. `/services` - 65 internal links

### Pages with Missing Critical Metadata

- `/products/widget-a` - Missing meta description
- `/blog/post-123` - Missing H1 tag
- `/services/consulting` - Missing meta description

## CSV Export

The complete dataset has been exported to `example_com_sitemap_metadata.csv`. The file includes the following columns:

- `URL`
- `Title`
- `Meta Description`
- `H1`
- `H2`
- `Last Modified Date`

For further analysis, you can open the CSV file in Excel or any data analysis tool.
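Counts like the missing-title and missing-description figures in the sample report can be reproduced from the exported CSV with a few lines of Python. The sketch below assumes the column headers match the ones listed in the sample (`Title`, `Meta Description`, and so on); the function name `count_missing` is hypothetical, and a real export may use different headers.

```python
import csv
import io


def count_missing(csv_text: str, columns: list[str]) -> dict[str, int]:
    """Count rows whose value is empty or whitespace for each given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {column: 0 for column in columns}
    for row in reader:
        for column in columns:
            # A column absent from the file counts as missing for every row.
            if not (row.get(column) or "").strip():
                missing[column] += 1
    return missing
```

To run it against a file on disk, read the file first, for example `count_missing(open("example_com_sitemap_metadata.csv", encoding="utf-8", newline="").read(), ["Title", "Meta Description"])`.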