
01 Introduction

javidsho edited this page Jun 1, 2022 · 4 revisions

What's SharpGrabber?

SharpGrabber is a .NET Standard library that defines abstractions for writing a multi-purpose scraping bot. Such a bot takes the address of a web resource and extracts useful information from it, such as details about images, videos, or anything else of interest.

The core library itself, however, only defines these abstractions; it doesn't do any scraping. The actual scraping job is done by grabbers, which implement the IGrabber interface.

Grabber

A grabber targets a certain type of resource or a certain website. It accepts a URI and tests whether it supports it. If it does, it proceeds with the scraping and finally returns the requested data.

For example, a YouTube grabber that is able to get information about a video works like this:

  • Accepts a URI, tests if it refers to a YouTube video.
  • If it doesn't, lets the caller know that the URI is not supported.
  • If it does, downloads the YouTube page, extracts the number of views, likes, dislikes etc. and returns the information.
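The three steps above can be sketched roughly as follows. This is a minimal illustration, not the library's actual API: `ISketchGrabber`, `SketchVideoGrabber`, and `GrabTitleAsync` are hypothetical names, the page download is an injected delegate so the sketch runs without network access, and only the page title is extracted instead of views or likes.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical, simplified stand-in for IGrabber; not the real signatures.
public interface ISketchGrabber
{
    bool Supports(Uri uri);
    Task<string> GrabTitleAsync(Uri uri); // null means "URI not supported"
}

public sealed class SketchVideoGrabber : ISketchGrabber
{
    private readonly Func<Uri, Task<string>> _download;

    // The page downloader is injected (an assumption of this sketch).
    public SketchVideoGrabber(Func<Uri, Task<string>> download)
    {
        _download = download;
    }

    // Step 1: test whether the URI refers to a YouTube watch page.
    public bool Supports(Uri uri)
    {
        if (!uri.IsAbsoluteUri) return false;
        string host = uri.Host.ToLowerInvariant();
        bool isYouTube = host == "youtube.com"
            || host.EndsWith(".youtube.com", StringComparison.Ordinal);
        return isYouTube && uri.AbsolutePath == "/watch";
    }

    public async Task<string> GrabTitleAsync(Uri uri)
    {
        // Step 2: let the caller know that the URI is not supported.
        if (!Supports(uri)) return null;

        // Step 3: download the page and extract the information of interest.
        string html = await _download(uri);
        Match match = Regex.Match(html, "<title>(.*?)</title>", RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value.Trim() : string.Empty;
    }
}

public static class Program
{
    public static async Task Main()
    {
        // Placeholder page content standing in for a real download.
        var grabber = new SketchVideoGrabber(
            _ => Task.FromResult("<html><title>Example video</title></html>"));

        string title = await grabber.GrabTitleAsync(new Uri("https://www.youtube.com/watch?v=abc"));
        Console.WriteLine(title); // prints "Example video"

        string other = await grabber.GrabTitleAsync(new Uri("https://example.com/"));
        Console.WriteLine(other ?? "(not supported)"); // prints "(not supported)"
    }
}
```

Returning null for an unsupported URI is just one way to signal it; a real grabber may use a different convention.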

Grabbers act according to the options sent with the grab request. The caller can specify what kinds of resources it needs, and the grabber should not return any others. This keeps the grabber from doing work that the caller has no use for.
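One way to picture option-driven grabbing is a set of flags that the grabber checks before doing each piece of work. The sketch below is an assumption for illustration only: `SketchGrabFlags` and `SketchOptionAwareGrabber` are hypothetical names and are not taken from the library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical flags describing which kinds of resources the caller wants.
[Flags]
public enum SketchGrabFlags
{
    None = 0,
    Info = 1,
    Images = 2,
    Media = 4,
    All = Info | Images | Media
}

public static class SketchOptionAwareGrabber
{
    // Returns the names of the resource kinds this grab would produce.
    public static List<string> Grab(SketchGrabFlags flags)
    {
        var produced = new List<string>();

        // Each branch stands in for scraping work that is skipped
        // entirely when the caller did not ask for that kind of resource.
        if (flags.HasFlag(SketchGrabFlags.Info)) produced.Add("info");
        if (flags.HasFlag(SketchGrabFlags.Images)) produced.Add("images");
        if (flags.HasFlag(SketchGrabFlags.Media)) produced.Add("media");

        return produced;
    }
}

public static class Program
{
    public static void Main()
    {
        var produced = SketchOptionAwareGrabber.Grab(SketchGrabFlags.Info | SketchGrabFlags.Media);
        Console.WriteLine(string.Join(", ", produced)); // prints "info, media"
    }
}
```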

Multi-Grabber

As mentioned before, a grabber is responsible for a single target. For instance, an Instagram grabber can only grab from Instagram. That works just fine when we're always working with Instagram and don't care about any other grabbers. But what if we don't know what service the URI refers to? The multi-grabber saves the day!

The multi-grabber itself also implements IGrabber just like a normal grabber. The difference is that it has an internal list of grabbers. When a grab is requested, it iterates through its registered grabbers and tests which one supports the target URI. The first grabber that supports the URI will handle the request.
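The first-match behavior described above could look roughly like this. It is a sketch under stated assumptions: `ISketchGrabber`, `HostGrabber`, and `SketchMultiGrabber` are hypothetical stand-ins rather than the library's types, and each grabber simply returns its own name instead of scraping anything.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical, simplified stand-in for IGrabber.
public interface ISketchGrabber
{
    string Name { get; }
    bool Supports(Uri uri);
    Task<string> GrabAsync(Uri uri);
}

// A trivial grabber that supports a single host and returns its own name.
public sealed class HostGrabber : ISketchGrabber
{
    private readonly string _host;

    public HostGrabber(string name, string host)
    {
        Name = name;
        _host = host;
    }

    public string Name { get; }

    public bool Supports(Uri uri) =>
        uri.IsAbsoluteUri && string.Equals(uri.Host, _host, StringComparison.OrdinalIgnoreCase);

    public Task<string> GrabAsync(Uri uri) => Task.FromResult(Name);
}

// Implements the same interface as the grabbers it wraps.
public sealed class SketchMultiGrabber : ISketchGrabber
{
    private readonly List<ISketchGrabber> _grabbers = new List<ISketchGrabber>();

    public string Name => "multi";

    public void Register(ISketchGrabber grabber) => _grabbers.Add(grabber);

    public bool Supports(Uri uri) => _grabbers.Exists(g => g.Supports(uri));

    public Task<string> GrabAsync(Uri uri)
    {
        // Iterate in registration order; the first supporting grabber wins.
        foreach (ISketchGrabber grabber in _grabbers)
        {
            if (grabber.Supports(uri)) return grabber.GrabAsync(uri);
        }
        return Task.FromResult<string>(null); // no registered grabber supports the URI
    }
}

public static class Program
{
    public static async Task Main()
    {
        var multi = new SketchMultiGrabber();
        multi.Register(new HostGrabber("first", "example.com"));
        multi.Register(new HostGrabber("second", "example.com"));
        multi.Register(new HostGrabber("other", "example.org"));

        Console.WriteLine(await multi.GrabAsync(new Uri("https://example.com/page"))); // prints "first"
    }
}
```

Because the wrapper exposes the same interface, calling code does not need to know whether it is talking to a single grabber or to a multi-grabber.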

If this is not the behavior you want, you can always write your own custom multi-grabber. Suppose several grabbers may support the same URI, and you want your multi-grabber to collect all of the returned values. You can implement IMultiGrabber to achieve this. Note that this differs from the default multi-grabber, which only uses the first grabber that supports the URI.
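A collect-all strategy might be sketched like this. The types here (`ISketchGrabber`, `LabelGrabber`, `CollectAllMultiGrabber`) are hypothetical and deliberately do not implement the real IMultiGrabber, whose members are not shown on this page; the point is only the difference from first-match delegation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical, simplified stand-in for IGrabber.
public interface ISketchGrabber
{
    bool Supports(Uri uri);
    Task<string> GrabAsync(Uri uri);
}

// A trivial grabber that supports a single host and returns a fixed label.
public sealed class LabelGrabber : ISketchGrabber
{
    private readonly string _label;
    private readonly string _host;

    public LabelGrabber(string label, string host)
    {
        _label = label;
        _host = host;
    }

    public bool Supports(Uri uri) =>
        uri.IsAbsoluteUri && string.Equals(uri.Host, _host, StringComparison.OrdinalIgnoreCase);

    public Task<string> GrabAsync(Uri uri) => Task.FromResult(_label);
}

public sealed class CollectAllMultiGrabber
{
    private readonly IReadOnlyList<ISketchGrabber> _grabbers;

    public CollectAllMultiGrabber(params ISketchGrabber[] grabbers)
    {
        _grabbers = grabbers;
    }

    // Runs every grabber that supports the URI and returns all results,
    // in registration order, instead of stopping at the first match.
    public async Task<IReadOnlyList<string>> GrabAllAsync(Uri uri)
    {
        IEnumerable<Task<string>> tasks = _grabbers
            .Where(g => g.Supports(uri))
            .Select(g => g.GrabAsync(uri));
        return await Task.WhenAll(tasks);
    }
}

public static class Program
{
    public static async Task Main()
    {
        var multi = new CollectAllMultiGrabber(
            new LabelGrabber("a", "example.com"),
            new LabelGrabber("b", "example.com"),
            new LabelGrabber("c", "example.org"));

        IReadOnlyList<string> results = await multi.GrabAllAsync(new Uri("https://example.com/page"));
        Console.WriteLine(string.Join(", ", results)); // prints "a, b"
    }
}
```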

Grab Result

The result of a grab operation is returned as an instance of GrabResult. The object contains the most basic information, such as the title and content creation date, along with a collection of smaller pieces of information called resources.

Grabbed Resources

Any piece of data grabbed from a URL is wrapped in an object implementing IGrabbed. This includes basic information, media files, playlists etc.
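To make the relationship between a result and its resources concrete, here is a small sketch. All names in it (`ISketchGrabbed`, `SketchInfo`, `SketchMedia`, `SketchResult`, `ResourcesOf`) are hypothetical stand-ins for IGrabbed, the resource classes, and GrabResult; the real members may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical marker interface standing in for IGrabbed.
public interface ISketchGrabbed { }

public sealed class SketchInfo : ISketchGrabbed
{
    public string Author { get; set; }
}

public sealed class SketchMedia : ISketchGrabbed
{
    public Uri ResourceUri { get; set; }
    public string Format { get; set; }
}

// Hypothetical stand-in for GrabResult: basic metadata plus a list of resources.
public sealed class SketchResult
{
    public string Title { get; set; }
    public DateTime? CreationDate { get; set; }
    public List<ISketchGrabbed> Resources { get; } = new List<ISketchGrabbed>();

    // Returns only the resources of the requested type.
    public IEnumerable<T> ResourcesOf<T>() where T : ISketchGrabbed => Resources.OfType<T>();
}

public static class Program
{
    public static void Main()
    {
        // Placeholder values for illustration only.
        var result = new SketchResult { Title = "Example" };
        result.Resources.Add(new SketchInfo { Author = "someone" });
        result.Resources.Add(new SketchMedia { ResourceUri = new Uri("https://example.com/a.mp4"), Format = "mp4" });
        result.Resources.Add(new SketchMedia { ResourceUri = new Uri("https://example.com/a.mp3"), Format = "mp3" });

        // Pick out only the media resources and ignore everything else.
        foreach (SketchMedia media in result.ResourcesOf<SketchMedia>())
        {
            Console.WriteLine(media.Format);
        }
    }
}
```

Keeping every resource behind one common interface lets a single result carry any mix of resource types, while callers filter for the types they care about.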

You can write your own classes implementing IGrabbed where necessary. Now let's go through some of the built-in resource types.

GrabbedInfo

Contains common information found on the page, such as the author, total views, and length.

GrabbedImage

Provides information about an image that relates to the subject of the grabbed content.

GrabbedMedia

Provides information about a single media file. The file can contain video or audio; its properties tell you which. The URL, format, bit rate, and many other properties are also available.

GrabbedStream

Provides information grabbed from an HLS stream, such as length and segments.
