Utilities for detecting errors in websites and inspecting site graphs.
It might evolve into an open source alternative to Screaming Frog, at least for some SEO checks.
Things you can do with this:
- detect errors in your website: broken links, 500 errors, redirect loops
- crawl a website's links and metadata into a generic JSON representation that's easy to process further
- clone this repo
- create a virtualenv: `python3 -m venv env`
- install requirements: `env/bin/pip install -r requirements.txt`
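Optionally, you can sanity-check the environment before crawling. This isn't part of the project's documented setup; it just confirms Scrapy is importable from the virtualenv (run it with `env/bin/python`):

```python
# Quick sanity check that the virtualenv has Scrapy installed.
import scrapy

print(scrapy.__version__)
```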
This will scrape the https://blog.scrapinghub.com site into a site.json file (if site.json already exists, Scrapy will append lines to it, so delete it to start from a clean slate):

```
env/bin/scrapy crawl checks \
    -a start_url=https://blog.scrapinghub.com \
    --output=site.json \
    --output-format jsonlines
```
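Because the output is JSON lines (one object per line), it's easy to post-process directly. Below is a minimal sketch of that kind of further processing; the `url` and `status` field names are assumptions about the site.json schema, not something the project guarantees. Run it with `env/bin/python` once site.json exists:

```python
import json
from collections import Counter

# Minimal post-processing sketch for the JSON-lines crawl output.
# NOTE: the "url" and "status" field names are assumptions about the
# site.json schema, not guaranteed by site2graph.
statuses = Counter()
with open("site.json") as f:
    for line in f:
        record = json.loads(line)
        status = int(record.get("status", 0))
        statuses[status] += 1
        if status >= 400:
            print(status, record.get("url"))

print(statuses.most_common())
```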
We can pipe the site.json file to the get_errors utility; this will report things like 404 links, 500 errors and redirect loops, and can output CSV or a friendly format:

```
cat site.json | env/bin/python -m site2graph.get_errors --output_format friendly
```

The CSV output format option is more convenient for dealing with large dumps:
```
cat site.json | env/bin/python -m site2graph.get_errors --output_format csv
```

- `make check` runs typechecks and linters
- `make fmt` runs formatters on the source
- `make test` runs unit tests