GUIDE: Scrape multi-pages content with Scrapy

Posted on February 13, 2011 by abuhijleh

Scrapy is really amazing. It is a very handy framework for scraping. In this guide, I will show how to create a spider that extracts content spread across multiple pages. I assume that you already know Scrapy and have worked through the official tutorial.

I also assume that you’re familiar with XPath. If not, please familiarize yourself with it on the w3schools site.

There are multiple ways to do this, but I found that using a CrawlSpider is the fastest. A CrawlSpider keeps following links based on a set of rules. In this guide I will cover the spider part only; I believe defining the item is very straightforward.

Let’s assume we have a site called example.com. Example.com lists 10 items (say, articles) per page, with a next link at the bottom, except on the last page of course. We want to scrape the article titles across all pages. Below is the HTML of the pages we want to scrape
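For reference, a page consistent with the XPaths used later in this guide could look like the hypothetical sample below: a ul of titles directly under body, and a next link inside a div with id "pg". The markup is an assumption inferred from those XPaths, not a copy of a real page, and the snippet only uses the Python standard library to check that the paths match it.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the structure is inferred from the XPaths used in the
# spider (/html/body/ul, li, //div[@id="pg"]), not taken from a real site.
SAMPLE_PAGE = """<html>
  <body>
    <ul>
      <li>Article title 1</li>
      <li>Article title 2</li>
    </ul>
    <div id="pg">
      <a href="next.html">next &gt;&gt;</a>
    </div>
  </body>
</html>"""

root = ET.fromstring(SAMPLE_PAGE)  # root is the <html> element

# Titles live in the <li> children of the <ul> directly under <body>.
titles = [li.text for li in root.findall("./body/ul/li")]

# The pagination link is the <a> inside <div id="pg">.
next_links = [a.get("href") for a in root.findall(".//div[@id='pg']/a")]

print(titles)
print(next_links)
```

On the last page the div would simply have no next link, which is what eventually stops the crawl.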

Ready? Let’s start writing our spider. First we should import the following:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scraper.items import MyItem  # This is the item I defined in items.py

Then create your spider class, which inherits from CrawlSpider, and define its main attributes:

class MultiPagesSpider(CrawlSpider):
    domain_name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

So far our spider is just like the spider covered in the official tutorial, with one exception: our spider inherits from CrawlSpider and not BaseSpider.

Next, define the rules. The rules attribute is a tuple of Rule objects that define which links to follow while scraping. The 3 main attributes of each rule are:

SgmlLinkExtractor: this is where you define the links you want the spider to follow
1. allow: a regular expression (or a tuple of them) that the link href must match
2. restrict_xpaths: restricts link extraction to the parts of the page selected by the given XPath. In our example we want the spider to follow the next link, which resides in the div with id "pg". Here we might not need to define it, as only one link matches. However, if the page has multiple links that match allow and this value is not defined, the spider will follow all of them!
callback: the callback function to be called with each page reached through the extracted links
follow: instructs the spider to keep extracting links from the pages it reaches. The spider keeps following the link defined in the SgmlLinkExtractor as long as it exists!

The rest of the spider code should look something like below:

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'next\.html',), restrict_xpaths=('//div[@id="pg"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = hxs.select('/html/body/ul')
        scraped_items = []
        for item in items:
            scraped_item = MyItem()  # this is the item object you defined in the items file
            # assuming your item object has a "title" field; extract() returns a list of strings
            scraped_item["title"] = item.select('li/text()').extract()
            scraped_items.append(scraped_item)
        return scraped_items  # return the populated items, not the raw selectors

spider = MultiPagesSpider()

That’s it! We’re done. Now you can move on to your pipeline and write the code to handle the scraped items. Happy scraping!

23 thoughts on “GUIDE: Scrape multi-pages content with Scrapy”

Lowell

February 15, 2011 at 12:39 am

I truly knew about most of this, but that being said, I still believed it had been useful. Nice post!

1. abuhijleh
  
  March 1, 2011 at 9:43 am
  
  Thanks Lowell,
  
  Others might not know about this :). Thanks for your comment
  
Daniel

May 15, 2011 at 11:25 pm

Hi Abuhijleh,

thanks for your post. I have always missed the new instance of the spider on the end… 🙂

1. abuhijleh
  
  May 16, 2011 at 3:37 pm
  
  You’re welcome Daniel
  
David

June 20, 2011 at 6:11 pm

You can improve your code:
SgmlLinkExtractor(restrict_xpaths=("//a[.='next >>']"))

1. abuhijleh
  
  June 20, 2011 at 6:41 pm
  
  Thanks David
  
Alan

September 25, 2011 at 3:32 am

Wow – exactly what I was looking for – Thank you!

1. abuhijleh
  
  September 25, 2011 at 11:45 am
  
  My pleasure!
  
Amine Hmida

December 4, 2011 at 4:24 am

Very nice tuto.
Can we add an other layer :
instead of … we have … to give as the page where are our items.
Is it possible with CrawlSpider or we have to write our own spider ?
Thx (can you please tell me how to do that, I am trying to scrap something like google search results with multi-pages)

1. abuhijleh
  
  December 6, 2011 at 10:16 pm
  
  Sorry Amine,
  
  I don’t seem to get your point. Can you please elaborate?
  
  1. Amine Hmida
    
    December 6, 2011 at 10:28 pm
    
    Thank you for replying, here is my point:
    The first page does not have the data to scrape, it just has links to details pages,
    and you have to do the following:
    1- Find links in the search page
    2- Follow them to the details pages and do the scraping job
    3- Once done, go to the next search page: the “next >>” link
    4- Redo it all over again
    I hope my idea is more clear now 🙂
Amine Hmida

December 4, 2011 at 4:28 am

Sorry, I mean \ with \
Thx again.

Siddharth Saha

December 5, 2011 at 9:01 am

What if there is more than one link in a page and I want to follow all the links, and the links on the subsequent pages as well, without repeating links once visited?

How can I do that? This approach is not helping me do that.

1. abuhijleh
  
  December 6, 2011 at 10:25 pm
  
  You can define multiple rules. For each rule you can define a different callback. Check this link for more details
  http://doc.scrapy.org/en/0.14/topics/spiders.html#crawling-rules
  
captcher

July 21, 2012 at 5:14 pm

Hello, I check your blogs daily. Your writing style is awesome, keep
doing what you’re doing!

1. abuhijleh
  
  July 21, 2012 at 7:30 pm
  
  Thanks a lot
  
kokku

December 2, 2013 at 8:46 pm

it is too clear .. it saved my time alot and gave me clear view on crawl spider .. thanks alot

Data Scraping

July 30, 2014 at 3:08 pm

i m scraper and scraping web using core php and .Net…now i m gonna try scrapy for scraping web…nice post thanks for sharing…

chaibapat

June 3, 2016 at 4:38 pm

Thanks for the information. Correction: the latest version of Scrapy insists on getting rid of SgmlLinkExtractor (deprecated) and using the LinkExtractor class instead.

1. abuhijleh
  
  June 22, 2016 at 6:30 pm
  
  Thanks buddy. I haven’t kept myself up to date with Scrapy for a while :).
  
Guilherme

June 22, 2016 at 4:41 pm

How can I scrape a page, for example: domain.com?search=WORDS_TO_SEARCH

And return the data for each search?

1. abuhijleh
  
  June 22, 2016 at 6:31 pm
  
  Haven’t played with Scrapy for a while. Sorry, cannot help :)
  
sutarno2001

January 26, 2018 at 12:07 pm

Thanks

Gudang Skripsi
