MongoDB pipeline for Scrapy

January 07, 2013


I just released a MongoDB pipeline for Scrapy called scrapy-mongodb. The module supports both single MongoDB instances and replica sets. Items are written to MongoDB as the spider processes them, so you can see them in the database while the crawl is still running. This post shows how to use it in your Scrapy project.

See the scrapy-mongodb GitHub page for source code and additional documentation.

Installing scrapy-mongodb

The installation is straightforward. Install scrapy-mongodb using pip:

pip install scrapy-mongodb

Note that, depending on your setup, you might need to run pip with administrator privileges.

Option 1: Configuring scrapy-mongodb for single MongoDB instances

scrapy-mongodb needs some details about the MongoDB database where your items will be stored. Add the following to your Scrapy settings.py:

MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DATABASE = 'myDatabaseName'
MONGODB_COLLECTION = 'myCollectionName'

If you want scrapy-mongodb to create and use a unique key for your items, add the following setting as well:

MONGODB_UNIQUE_KEY = 'keyName'

scrapy-mongodb will automatically ensure an index on that key.
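To make the effect of a unique key concrete, here is a small in-memory sketch in plain Python. It is not scrapy-mongodb's code and does not touch MongoDB. It assumes that two items sharing the same value for the key should end up as one stored document, with the later item replacing the earlier one. The function name upsert_by_key and the sample items are made up for illustration.

```python
def upsert_by_key(store, item, key):
    """Store item under item[key], replacing any earlier item with that value.

    Hypothetical helper: `store` is a plain dict standing in for a
    collection that has a unique index on `key`.
    """
    store[item[key]] = dict(item)
    return store


store = {}
upsert_by_key(store, {'keyName': 'a1', 'title': 'First version'}, 'keyName')
upsert_by_key(store, {'keyName': 'a1', 'title': 'Second version'}, 'keyName')
upsert_by_key(store, {'keyName': 'b2', 'title': 'Another item'}, 'keyName')

# Two distinct key values, so two stored items; 'a1' holds the later version.
print(len(store))
print(store['a1']['title'])
```

In practice this means the field named in MONGODB_UNIQUE_KEY should be present on every item and should identify it, for example a URL or an external ID.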

Then we need to tell Scrapy to use the new pipeline. Add the following to your settings.py file:

ITEM_PIPELINES = [
    'scrapy_mongodb.MongoDBPipeline',
]

Additional configuration options can be found at https://github.com/sebdah/scrapy-mongodb.
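Newer Scrapy releases expect ITEM_PIPELINES to be a dict that maps each pipeline path to an integer order (conventionally 0 to 1000, lower values run first) rather than a list. If your Scrapy version rejects or warns about the list form, the equivalent setting would look like the sketch below. The value 300 is an arbitrary choice for illustration, not something scrapy-mongodb requires.

```python
# settings.py (dict form used by newer Scrapy releases)
ITEM_PIPELINES = {
    # The integer sets the order among pipelines; 300 is an arbitrary example.
    'scrapy_mongodb.MongoDBPipeline': 300,
}
```

The same form applies whether you configure a single instance or a replica set.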

Option 2: Configuring scrapy-mongodb for MongoDB replica sets

If you are storing the items in a MongoDB replica set, you need to make scrapy-mongodb replica-set aware. Update your Scrapy settings.py with the following:

MONGODB_REPLICA_SET = 'replicaSetName'
MONGODB_REPLICA_HOSTS = 'h1.example.com,h2.example.com,h3.example.com'
MONGODB_DATABASE = 'myDatabaseName'
MONGODB_COLLECTION = 'myCollectionName'

If you want scrapy-mongodb to create and use a unique key for your items, add the following setting as well:

MONGODB_UNIQUE_KEY = 'keyName'

scrapy-mongodb will automatically ensure an index on that key.

Then we need to tell Scrapy to use the new pipeline. Add the following to your settings.py file:

ITEM_PIPELINES = [
    'scrapy_mongodb.MongoDBPipeline',
]

Additional configuration options can be found at https://github.com/sebdah/scrapy-mongodb.
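If you want to check the replica set from a Python shell, the MONGODB_REPLICA_SET and MONGODB_REPLICA_HOSTS values can be turned into a standard MongoDB connection string. The helper below is only a sketch: build_replica_uri is a hypothetical name, and this is not necessarily how scrapy-mongodb connects internally.

```python
def build_replica_uri(hosts, replica_set):
    """Build a mongodb:// URI from a comma-separated host string."""
    # Strip stray whitespace and drop empty entries such as a trailing comma.
    host_list = [h.strip() for h in hosts.split(',') if h.strip()]
    if not host_list:
        raise ValueError('at least one host is required')
    return 'mongodb://{}/?replicaSet={}'.format(','.join(host_list), replica_set)


MONGODB_REPLICA_SET = 'replicaSetName'
MONGODB_REPLICA_HOSTS = 'h1.example.com,h2.example.com,h3.example.com'

uri = build_replica_uri(MONGODB_REPLICA_HOSTS, MONGODB_REPLICA_SET)
print(uri)
```

The resulting string can be passed to pymongo's MongoClient. Hosts without an explicit port use MongoDB's default port, 27017.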

Summary

Done! Now start your spider as usual and have a look in MongoDB for your items. They will show up as soon as the spider has found and processed them, so you can follow the progress as the spider crawls :).
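One way to watch that progress from Python rather than the mongo shell is to count the documents in the collection while the spider runs. The sketch below assumes pymongo 3.7 or later (for count_documents) and the database and collection names used in the settings above. count_items is a hypothetical helper name.

```python
def count_items(client, database, collection):
    """Return the number of documents currently in the given collection."""
    # An empty filter matches every document in the collection.
    return client[database][collection].count_documents({})


# Usage against a local MongoDB (requires pymongo and a running server):
#
#   from pymongo import MongoClient
#   client = MongoClient('localhost', 27017)
#   print(count_items(client, 'myDatabaseName', 'myCollectionName'))
```

Running the commented usage a few times during a crawl should show the count growing as items are processed.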