GUIDE: Scrape multi-page content with Scrapy


Scrapy is really amazing. It is a very handy framework for scraping. In this guide, I will illustrate how to create a spider that extracts content spread across multiple pages. I assume that you already know Scrapy and have covered the official tutorial.

I also assume that you're familiar with XPath; if not, please get yourself familiar with it on the w3schools site.

There are multiple ways to do this, but I found that using a CrawlSpider is the fastest. A CrawlSpider keeps following links based on a set of rules. In this guide I will cover the spider part only; defining the item is very straightforward (a minimal items.py is sketched after the imports below).

Let's assume we have a site called example.com. Example.com lists 10 items – say, articles – per page, with a next link at the bottom, except on the last page of course. We want to scrape the article titles across all pages. Below is the HTML of the pages we want to scrape.
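
(A minimal sketch of the markup, reconstructed from the XPaths used later in this guide: an unordered list of articles, and the next link inside a div with id "pg".)

<html>
  <body>
    <ul>
      <li>Article title 1</li>
      <li>Article title 2</li>
      <!-- ... up to 10 articles per page ... -->
    </ul>
    <div id="pg">
      <a href="next.html">Next</a>
    </div>
  </body>
</html>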

Ready? Let's start writing our spider. First we should import the following:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scraper.items import MyItem  # the item I defined in items.py
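
If you haven't defined the item yet, a minimal items.py for this guide could be as simple as the sketch below (MyItem and its title field are just the names this guide assumes):

from scrapy.item import Item, Field

class MyItem(Item):
    title = Field()  # the only field we scrape in this guide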

Then create your spider class, which inherits from CrawlSpider, and define its main attributes:

class MultiPagesSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com',]

So far our spider is just like the spider covered in the official tutorial, with one exception: our spider inherits from CrawlSpider, not BaseSpider.

Next, define the rules. rules is a tuple that defines which links to follow while scraping. The three main parts of each rule are:

  1. SgmlLinkExtractor: this is where you define the links you want the spider to follow
    1. allow: a regular expression (or a list of them) that the link href must match
    2. restrict_xpaths: restricts link extraction to a certain XPath. In our example we want the spider to follow the next link, which resides in the div with id "pg". Here we might not even need to define it, as we have only one link; however, if the page had multiple links and this value were not defined, the spider would follow all of them!
  2. callback: the callback function to be called with each scraped page. Don't name it parse: CrawlSpider uses parse internally, so overriding it would break the spider.
  3. follow: instructs the spider to keep following the link through the pages. The spider keeps following the link defined in the SgmlLinkExtractor as long as it exists!

The rest of the spider code should look something like the following:

rules = (
    Rule(SgmlLinkExtractor(allow=(r'next\.html',),
                           restrict_xpaths=('//div[@id="pg"]',)),
         callback='parse_item', follow=True),
)

def parse_item(self, response):
    hxs = HtmlXPathSelector(response)
    articles = hxs.select('/html/body/ul/li')  # one <li> per article
    scraped_items = []
    for article in articles:
        scraped_item = MyItem()  # the item object you defined in the items file
        scraped_item['title'] = article.select('text()').extract()  # assuming your item has a "title" field
        scraped_items.append(scraped_item)
    return scraped_items

That's it, we're done! Run the spider with scrapy crawl example.com. Now you can move to your pipeline and write the code to handle the scraped items. Happy scraping!
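
As a starting point, a minimal pipeline might look like the sketch below; it just strips whitespace from the scraped titles (remember to enable it through ITEM_PIPELINES in settings.py):

class MyPipeline(object):
    def process_item(self, item, spider):
        # clean up the scraped titles before they are stored
        item['title'] = [t.strip() for t in item['title']]
        return item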


18 thoughts on "GUIDE: Scrape multi-page content with Scrapy"

  1. Very nice tutorial.
    Can we add another layer: instead of …, we have to go to the page where our items are.
    Is that possible with CrawlSpider, or do we have to write our own spider?
    Thanks (can you please tell me how to do that? I am trying to scrape something like Google search results with multiple pages)

      • Thank you for replying, here is my point:
        The first page does not have the data to scrape, it just has links to the detail pages,
        and you have to do the following:
        1- Find the links on the search page
        2- Follow them to the detail pages and do the scraping job
        3- Once done, go to the next search page via the "next >>" link
        4- Redo it all over again
        I hope my idea is clearer now :)
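
        That pattern maps onto two rules, roughly like the sketch below (the link patterns here are assumptions for illustration only):

        rules = (
            # follow the paginated search pages without parsing them
            Rule(SgmlLinkExtractor(allow=(r'search\?page=\d+',)), follow=True),
            # follow the links to the detail pages and scrape them there
            Rule(SgmlLinkExtractor(allow=(r'/details/',)), callback='parse_item'),
        )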

  2. What if there is more than one link on a page, and I want to follow all the links, and the links on the subsequent pages as well, without repeating links once visited?

    How can I do that? This approach is not helping me do that.
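
    One way is a rule with an unrestricted SgmlLinkExtractor, sketched below: it extracts every link on every page, and Scrapy's built-in duplicates filter already skips URLs it has visited.

    rules = (
        # no allow/restrict_xpaths: extract and follow every link;
        # the duplicates filter avoids revisiting the same URL
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
    )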
