site2graph


Utilities for detecting errors in websites and inspecting site graphs.

It might evolve into an open-source alternative to Screaming Frog, at least for some classes of SEO errors.

Overview

Things you can do with this:

  • detect errors in your website: broken links, 500 responses, redirect loops
  • crawl a website's links and metadata into a generic JSON representation that is easy to work with for further processing

Usage

Install

  1. clone this repo
  2. create a virtualenv: python3 -m venv env
  3. install requirements: env/bin/pip install -r requirements.txt

Scrape a site

This will scrape the https://blog.scrapinghub.com site into a site.json file:

env/bin/scrapy crawl checks \
    -a start_url=https://blog.scrapinghub.com \
    --output=site.json \
    --output-format jsonlines

If site.json already exists, Scrapy will append lines to it, so delete the file first to start from a clean slate.
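Because the output format is jsonlines, each line of site.json is a standalone JSON object, which keeps large crawls easy to post-process. Below is a minimal Python sketch of loading the dump; the field names it prints (url, status) are assumptions for illustration, not a documented schema, so inspect your own site.json for the actual keys:

import json

# Load the jsonlines dump produced by the crawl above: one JSON object per line.
with open("site.json") as f:
    records = [json.loads(line) for line in f if line.strip()]

# "url" and "status" are hypothetical field names; check the real keys
# in your own dump before relying on them.
for record in records:
    print(record.get("url"), record.get("status"))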

Detect errors

We can pipe the site.json file to the get_errors utility; it reports problems such as 404 links, 500 errors, and redirect loops, and can output CSV or a friendly human-readable format:

cat site.json | env/bin/python -m site2graph.get_errors --output_format friendly

The CSV output format is more convenient for large dumps:

cat site.json | env/bin/python -m site2graph.get_errors --output_format csv
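Since the CSV report goes to stdout, it can be piped straight into further tooling. Here is a minimal Python sketch of consuming it; the column names (error_type, url) and the script name are assumptions for illustration, since the actual header row is whatever get_errors emits, so check the first line of the real output:

import csv
import sys

# Reads the CSV error report on stdin, e.g.:
#   cat site.json | env/bin/python -m site2graph.get_errors --output_format csv \
#       | env/bin/python filter_404s.py
# filter_404s.py is a hypothetical script name, and the "error_type" and
# "url" columns are guesses; adapt them to the real header row.
reader = csv.DictReader(sys.stdin)
for row in reader:
    if row.get("error_type") == "404":
        print(row.get("url"))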

Development

  • make check runs typechecks and linters
  • make fmt runs formatters on the source
  • make test runs unit tests
