Common Practices

    You can use the API to run Scrapy from a script, instead of the typical way of running Scrapy via the scrapy crawl command.

    Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor.

    The first utility you can use to run your spiders is scrapy.crawler.CrawlerProcess. This class will start a Twisted reactor for you, configuring the logging and setting shutdown handlers. This class is the one used by all Scrapy commands.

    Here’s an example showing how to run a single spider with it.
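    A minimal sketch of that usage follows; the MySpider class, its name, and the FEEDS export setting are placeholders rather than part of any particular project:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        name = "myspider"  # placeholder name
        # Your spider definition
        ...

    process = CrawlerProcess(
        settings={
            "FEEDS": {
                "items.json": {"format": "json"},
            },
        }
    )

    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished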

    Define settings within a dictionary passed to CrawlerProcess. Make sure to check the CrawlerProcess documentation to get acquainted with its usage details.

    If you are inside a Scrapy project there are some additional helpers you can use to import those components within the project. You can automatically import your spiders by passing their name to CrawlerProcess.crawl, and use get_project_settings to get a Settings instance with your project settings.

    What follows is a working example of how to do that, using the testspiders project as an example.

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())

    # 'followall' is the name of one of the spiders of the project.
    process.crawl('followall', domain='scrapy.org')
    process.start()  # the script will block here until the crawling is finished

    There’s another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won’t start or interfere with existing reactors in any way.

    When using this class, the reactor should be run explicitly after scheduling your spiders. It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.

    Note that you will also have to shut down the Twisted reactor yourself after the spider is finished. This can be achieved by adding callbacks to the deferred returned by the CrawlerRunner.crawl method.

    Here’s an example of its usage, along with a callback to manually stop the reactor after MySpider has finished running.
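    A sketch of that pattern, assuming a placeholder MySpider class; the LOG_FORMAT value passed to configure_logging is just an example:

    from twisted.internet import reactor

    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    class MySpider(scrapy.Spider):
        name = "myspider"  # placeholder name
        # Your spider definition
        ...

    configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
    runner = CrawlerRunner()

    d = runner.crawl(MySpider)
    d.addBoth(lambda _: reactor.stop())  # stop the reactor whether the crawl succeeds or fails
    reactor.run()  # the script will block here until the crawling is finished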

    Running multiple spiders in the same process

    By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.

    Here is an example that runs multiple spiders simultaneously:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()  # the script will block here until all crawling jobs are finished

    Same example using CrawlerRunner:
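    A sketch of the same example with CrawlerRunner (the spider classes are placeholders, as above); runner.join() returns a deferred that fires once all scheduled crawls have finished:

    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    configure_logging()
    settings = get_project_settings()
    runner = CrawlerRunner(settings)
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()  # deferred that fires when all crawls have finished
    d.addBoth(lambda _: reactor.stop())

    reactor.run()  # the script will block here until all crawling jobs are finished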

    Same example but running the spiders sequentially by chaining the deferreds:

    import scrapy
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    configure_logging()
    settings = get_project_settings()
    runner = CrawlerRunner(settings)

    @defer.inlineCallbacks
    def crawl():
        yield runner.crawl(MySpider1)
        yield runner.crawl(MySpider2)
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until the last crawl call is finished

    Different spiders can set different values for the same setting, but when they run in the same process it may be impossible, by design or because of some limitations, to use these different values. What happens in practice is different for different settings (a short sketch follows the list below):

    • SPIDER_LOADER_CLASS and the ones used by its value (SPIDER_MODULES and SPIDER_LOADER_WARN_ONLY for the default one) cannot be read from the per-spider settings. These are applied when the CrawlerRunner or CrawlerProcess object is created.

    • For TWISTED_REACTOR and ASYNCIO_EVENT_LOOP the first available value is used, and if a spider requests a different reactor an exception will be raised. These are applied when the reactor is installed.

    • For REACTOR_THREADPOOL_MAXSIZE, DNS_RESOLVER and the ones used by the resolver (DNSCACHE_ENABLED, DNSCACHE_SIZE and DNS_TIMEOUT for the ones included in Scrapy) the first available value is used. These are applied when the reactor is started.
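    As an illustration of the per-spider side of this, a setting such as DOWNLOAD_DELAY defined in custom_settings is applied per crawler, so each spider keeps its own value, unlike the reactor-level settings listed above. A minimal sketch, with hypothetical spider names and delay values:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    class SlowSpider(scrapy.Spider):
        name = "slow"
        custom_settings = {"DOWNLOAD_DELAY": 5}  # applies only to this spider
        # Your spider definition
        ...

    class FastSpider(scrapy.Spider):
        name = "fast"
        custom_settings = {"DOWNLOAD_DELAY": 0.5}  # applies only to this spider
        # Your spider definition
        ...

    process = CrawlerProcess(get_project_settings())
    process.crawl(SlowSpider)
    process.crawl(FastSpider)
    process.start()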

    See also

    Run Scrapy from a script.

    Distributed crawls

    If you have many spiders, the obvious way to distribute the load is to set up many Scrapyd instances and distribute spider runs among those.

    If you instead want to run a single (big) spider through many machines, what you usually do is partition the URLs to crawl and send them to each separate spider. Here is a concrete example:

    First, you prepare the list of URLs to crawl and put them into separate files/URLs.
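    For instance, the partitioned lists might live at URLs like these (hostnames and paths are hypothetical):

    http://somedomain.com/urls-to-crawl/spider1/part1.list
    http://somedomain.com/urls-to-crawl/spider1/part2.list
    http://somedomain.com/urls-to-crawl/spider1/part3.list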

    Then you fire a spider run on 3 different Scrapyd servers. The spider would receive a (spider) argument named part with the number of the partition to crawl:

    curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
    curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
    curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3

    Avoiding getting banned

    Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.

    Here are some tips to keep in mind when dealing with these kinds of sites (a short settings sketch follows the list):

    • rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)

    • disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour

    • use download delays (2 or higher). See DOWNLOAD_DELAY setting.

    • if possible, use Google cache to fetch pages, instead of hitting the sites directly

    • use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh. An open source alternative is scrapoxy, a super proxy that you can attach your own proxies to.
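    A minimal sketch of how a few of these tips translate into project settings; the values and the user agent string are illustrative, not recommendations:

    # settings.py (illustrative values only)
    COOKIES_ENABLED = False  # disable cookies
    DOWNLOAD_DELAY = 2  # wait at least 2 seconds between requests to the same site
    # Pick user agents from a pool of well-known browser strings; this is a single
    # hypothetical example, and rotation would be handled by a downloader middleware.
    USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"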

    If you are still unable to prevent your bot getting banned, consider contacting commercial support.