COWS - Corion's Own Web Scraper
use 5.020;
use COWS 'scrape';
my $html = '...';
my $rules = [
    {
        name  => 'link',
        query => 'a@href',
        munge => ['absolute'],
    },
];
my $res = scrape( $html, $rules, { url => 'https://example.com/' } );
say "url: $_" for $res->{link}->@*;
scrape( $html, $rules, $options );
Each query is a hashref that accepts the following keys:
- name
  The name of this query. This will be used as a key in the resulting hashref.

- query
  The CSS selector or XPath query to search for.

- fields
  Queries that should be matched below this node. The results will get merged into a hashref. Duplicate names are not allowed (duh).

      fields => [
          { name => 'foo', ... },
          { name => 'bar', ... },
          { name => 'baz', ... },
      ]

  results in

      { foo => ..., bar => ..., baz => ... }

- debug
  Output progress while stepping through this query. This is convenient for finding out why a specific query doesn't return what you think it should.

- single
  Expect only a single item; the result will be the value of the query. If the query has a fields field, the result will be a hashref of the fields.

  If this key is missing, the result will always be an arrayref.

- index
  Use the n-th node as the result. The result will always be a hashref or scalar value. This could be done in XPath, but sometimes it's easier to do it here.

- discard
  Do not use this intermediate value but replace it by the arrayref of its fields value.

- html
  Include the value of this node as an HTML string.

- munge
  Apply the functions in the listed order to the value.

- tag

      tag => 'foo=bar'
      tag => 'baz:bat'

  Add the key/value to the resulting hash. If the separator is =, the result will be a plain scalar:

      foo => 'bar',

  If the separator is :, there can be multiple tags and they are collected in an arrayref:

      baz => [ 'bat' ],
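The keys above can be combined into nested rules. The following is a minimal sketch, not taken from the module's own documentation: the HTML snippet, the div.post selector and the post/title/link names are made up for illustration, and it assumes that nested queries accept the same 'a@href' syntax and 'absolute' munger as the synopsis. The call to scrape() is guarded so that the script also runs (without output) when COWS is not installed.

```perl
use 5.020;
use warnings;

# Hypothetical sample markup, made up for this sketch.
my $html = <<'HTML';
<div class="post">
  <h2>First post</h2>
  <a href="/posts/1">Read more</a>
</div>
<div class="post">
  <h2>Second post</h2>
  <a href="/posts/2">Read more</a>
</div>
HTML

# One outer query per post. The nested 'fields' queries are matched
# below each matching node and merged into one hashref per post.
my $rules = [
    {
        name   => 'post',
        query  => 'div.post',
        fields => [
            # 'single' asks for a plain value instead of an arrayref
            { name => 'title', query => 'h2', single => 1 },
            {
                name   => 'link',
                query  => 'a@href',
                single => 1,
                munge  => ['absolute'],    # relies on the 'url' option below
            },
        ],
    },
];

# Base URL passed as in the synopsis; the host is a placeholder.
my $options = { url => 'https://example.com/' };

# Only scrape when COWS can be loaded. Calling the function by its fully
# qualified name assumes it is defined in package COWS, which the
# "use COWS 'scrape'" import in the synopsis suggests.
if ( eval { require COWS; 1 } ) {
    my $res = COWS::scrape( $html, $rules, $options );
    for my $post ( @{ $res->{post} // [] } ) {
        say "$post->{title}: $post->{link}";
    }
}
```

If the keys behave as described above, $res->{post} should be an arrayref with one hashref per post, each holding a title and a link key. Adding debug => 1 to a rule is the documented way to see why a query does not match.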