Scrapy Getting Started Tutorial

1. Development setup

1.1 Installing Python

1.2 Installing a Python IDE: PyCharm

# Run the jar file
java -jar xxx.jar

1.3 Installing Scrapy on macOS

2.1 Creating a project

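scrapy startproject tutorial  # create a Scrapy project named tutorial from the command line

scrapy.cfg: the project's configuration file
tutorial/: the project's Python module; you will add your code here later
tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file
tutorial/settings.py: the project's settings file
tutorial/spiders/: the directory where spider code goes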

Our first Spider

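Spiders are classes that you define and that Scrapy uses to extract information from a website (or a group of websites).
They must subclass scrapy.Spider.
They can optionally define how to follow links in the pages, and how to parse the downloaded page content to extract data.

This is the code for our first spider.
Save it in a file named quotes_spider.py under the tutorial/spiders directory of your project.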

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

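name: identifies the Spider. It must be unique within a project; that is, you cannot set the same name for different Spiders.
start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) from which the Spider will begin to crawl. Subsequent requests are generated successively from these initial requests.
parse(): a method of the spider that is called to handle the response downloaded for each request. The response parameter is an instance of TextResponse that holds the page content and has further methods for handling it. The parse() method usually parses the response, extracts the scraped data as dicts, finds new URLs to follow, and creates new requests from them.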
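The way parse() derives a file name from the response URL can be checked without running a crawl. The following is a minimal standalone sketch, not part of the spider itself; the helper name page_filename is hypothetical and only mirrors the two lines of parse() that build the file name.

```python
def page_filename(url: str) -> str:
    """Hypothetical helper mirroring the file-name logic used in parse()."""
    # For '.../page/1/' the split yields [..., 'page', '1', ''],
    # so index -2 is the page number.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


print(page_filename("http://quotes.toscrape.com/page/1/"))
print(page_filename("http://quotes.toscrape.com/page/2/"))
```

Note that this relies on the URL ending with a trailing slash; without it, index -2 would pick up the word "page" instead of the number.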

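2.2 How to run our spider

Go to the project's top-level directory and run the following command to start the spider:

scrapy crawl quotes

This command runs the spider named quotes that we just added, which sends some requests to quotes.toscrape.com. You will get output similar to this: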

2018-08-10 10:48:09 [scrapy.core.engine] INFO: Spider opened
...
2018-08-10 10:48:11 [scrapy.core.engine] INFO: Spider closed (finished)

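Now, check the files in the current directory.
You should notice that two new files have been created, quotes-1.html and quotes-2.html, with the content of the respective URLs, as our parse method instructs.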

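What just happened under the hood?

Scrapy schedules the scrapy.Request objects returned by the Spider's start_requests method. Upon receiving a response for each one, it instantiates a Response object and calls the callback method associated with the request (in this case, the parse method), passing the response as an argument.

A shortcut to the start_requests method

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs.
The default implementation of start_requests() will then use this list to create the initial requests for your spider: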

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

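The parse() method will be called to handle each of the requests for those URLs, even though we haven't explicitly told Scrapy to do so.
This happens because parse() is Scrapy's default callback method, which is called for requests without an explicitly assigned callback.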
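To make the default-callback behaviour concrete, here is a deliberately simplified model in plain Python. It is not Scrapy's implementation: FakeRequest, MiniSpider and dispatch are hypothetical names invented for this sketch, and the real engine schedules and downloads requests asynchronously.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus an optional callback."""
    url: str
    callback: Optional[Callable[[str], str]] = None


class MiniSpider:
    """Hypothetical base class that models only the start_urls shortcut."""
    start_urls: List[str] = []

    def start_requests(self) -> Iterator[FakeRequest]:
        # In this model: one request per start URL, with no callback set.
        for url in self.start_urls:
            yield FakeRequest(url)

    def parse(self, response: str) -> str:
        return "parsed " + response


def dispatch(spider: MiniSpider, request: FakeRequest, response: str) -> str:
    # Fall back to spider.parse when the request carries no explicit callback.
    callback = request.callback or spider.parse
    return callback(response)


class QuotesSpider(MiniSpider):
    start_urls = [
        "http://quotes.toscrape.com/page/1/",
        "http://quotes.toscrape.com/page/2/",
    ]


spider = QuotesSpider()
results = [
    dispatch(spider, request, "<response for %s>" % request.url)
    for request in spider.start_requests()
]
print(results)
```

The point of the sketch is the fallback in dispatch: a request that names no callback ends up in parse(), which is why the shortened spider still saves both pages.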

Extracting data

The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell.
Run:
scrapy shell 'http://quotes.toscrape.com/page/1/'

On Windows, use double quotes instead:
scrapy shell "http://quotes.toscrape.com/page/1/"
You will see output like this:

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x104f0cc18>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x105ca4748>
[s]   spider     <DefaultSpider 'default' at 0x105f5de10>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

Using the shell, you can try selecting elements using CSS with the response object:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

XPath: a brief intro

Extracting quotes and authors

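Now that you know a bit about selection and extraction, let's complete our spider by writing the code to extract the quotes from the web page.
Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this: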

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let's open up the Scrapy shell and play a bit to find out how to extract the data we want:
$ scrapy shell 'http://quotes.toscrape.com'

We get a list of selectors for the quote HTML elements with:

>>> response.css("div.quote")

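Each of the selectors returned by the query above allows us to run further queries over its sub-elements.
Let's assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:

>>> quote = response.css("div.quote")[0]

Now, let's extract the title, author and tags from that quote using the quote object we just created: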

>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'

Given that the tags are a list of strings, we can use the .extract() method to get all of them:

>>> tag = quote.css("div.tags a.tag::text").extract()
>>> tag
['change', 'deep-thoughts', 'thinking', 'world']

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary:

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tag = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tag=tag))
{'tag': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tag': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity
>>>
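The same extraction can be tried offline. The sketch below swaps in the standard library's xml.etree.ElementTree and its limited XPath support instead of Scrapy selectors, so it only approximates what the shell session does. It assumes a trimmed, well-formed copy of a single quote element (quote text joined onto one line, two of the four tags kept); QUOTE_HTML and item are names made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of one quote element; it happens to be well-formed XML,
# which is what lets ElementTree parse it.
QUOTE_HTML = """<div class="quote">
    <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
    </div>
</div>"""

quote = ET.fromstring(QUOTE_HTML)

# Rough counterparts of span.text::text, small.author::text and a.tag::text.
text = quote.find(".//span[@class='text']").text
author = quote.find(".//small[@class='author']").text
tags = [a.text for a in quote.findall(".//a[@class='tag']")]

item = dict(text=text, author=author, tags=tags)
print(item)
```

Real pages are rarely well-formed XML, which is one reason to use Scrapy's selectors rather than ElementTree in an actual spider.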

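Extracting data in our spider

Storing the scraped data

The simplest way to store the scraped data is to use Feed exports, with the following command:
scrapy crawl quotes -o quotes.json
This generates a quotes.json file containing all scraped items, serialized in JSON.
For historic reasons, Scrapy appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second run, you will end up with a broken JSON file.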

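You can also use other formats, such as JSON Lines:
scrapy crawl quotes -o quotes.jl

The JSON Lines format is useful because it is stream-like: you can easily append new records to it. It does not have the same problem as JSON when you run the command twice. Also, because each record is a separate line, you can process big files without having to fit everything in memory.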
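The difference between the two formats under repeated appends can be reproduced with the standard json module alone. This is a simulation of the append behaviour described above, not Scrapy's exporter: the helpers append_json_array and append_json_lines, the sample items and the temporary file paths are all assumptions made for the sketch.

```python
import json
import os
import tempfile

items = [
    {"text": "quote one", "author": "A"},
    {"text": "quote two", "author": "B"},
]


def append_json_array(path, records):
    # Appending a second JSON array after the first leaves two top-level values.
    with open(path, "a", encoding="utf-8") as f:
        json.dump(records, f)


def append_json_lines(path, records):
    # One JSON document per line, so appending never breaks earlier records.
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


workdir = tempfile.mkdtemp()
json_path = os.path.join(workdir, "quotes.json")
jl_path = os.path.join(workdir, "quotes.jl")

for _ in range(2):  # simulate running the export twice
    append_json_array(json_path, items)
    append_json_lines(jl_path, items)

try:
    with open(json_path, encoding="utf-8") as f:
        json.load(f)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# JSON Lines can be read back one record at a time.
with open(jl_path, encoding="utf-8") as f:
    jl_records = [json.loads(line) for line in f]

print(json_ok, len(jl_records))
```

Reading the .jl file line by line is also what makes it possible to process large exports without loading the whole file.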

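In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex operations on the scraped items, you can write an Item Pipeline.
A placeholder file for item pipelines was set up for you in tutorial/pipelines.py when the project was created.
You don't need to implement any item pipeline if you just want to store the scraped items, though.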
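As a rough idea of what such a pipeline might look like, here is a small sketch. The class name CleanQuotePipeline and its cleaning rule are hypothetical; the only Scrapy convention it relies on is that a pipeline is a plain class with a process_item(item, spider) method that returns the item.

```python
class CleanQuotePipeline:
    """Hypothetical item pipeline that strips the curly quotes around the text."""

    def process_item(self, item, spider):
        # Scrapy calls process_item() for every item the spider yields.
        item["text"] = item["text"].strip("“”")
        return item


# To enable it, the class would be registered in tutorial/settings.py, for example:
# ITEM_PIPELINES = {"tutorial.pipelines.CleanQuotePipeline": 300}

# Quick check outside Scrapy; spider is unused here, so None is passed.
cleaned = CleanQuotePipeline().process_item(
    {"text": "“A sample quote.”", "author": "Someone"}, spider=None
)
print(cleaned["text"])
```

Because the class does not import anything from Scrapy, it can be unit-tested on plain dicts before it is wired into the project.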

Following Links

A shortcut for creating Request

More examples and patterns

Using spider arguments

Next steps

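Adding MySQL support to a Scrapy crawler

1. Adding MySQL support on a Mac

1.1 Download and install MySQL from the official website

Add the MySQL binaries to the shell PATH via .bash_profile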

```bash
vim ~/.bash_profile
# Append one line; /usr/local/mysql is the default MySQL install directory
export PATH=/usr/local/mysql/bin:$PATH
# Save the file, then reload it
source ~/.bash_profile
```

### 1.2 Using the mysql command line

```bash
# Call the client through its default bin directory
/usr/local/mysql/bin/mysql -u root -p
# After configuring .bash_profile, run it directly
mysql -u root -p
```

1.3 Connecting to and working with a MySQL database from Python 3 (PyMySQL driver)

1. SQL query operations

# coding=utf-8
import pymysql

if __name__ == '__main__':
    # 1. Open the database connection (no database is selected yet; step 3 creates it)
    connection = pymysql.connect(host="localhost",
                                 user="root",
                                 password="91499419",
                                 port=3306)
    # 2. Get a cursor object
    cursor = connection.cursor()
    sql = "select version()"  # query the server version
    cursor.execute(sql)
    data = cursor.fetchone()
    print(data)
    # 3. Creating and dropping databases
    ## Create the database if it does not exist, then switch to it
    sql_create_db = "create database if not exists database_chain_news charset utf8;"
    cursor.execute(sql_create_db)
    sql_use_db = "use database_chain_news;"
    cursor.execute(sql_use_db)
    ## Drop a database if it exists (shown for reference only, never executed here):
    sql_drop_db = "drop database if exists database_name;"
    # 4. Creating and dropping tables
    ## Create the table if it does not exist
    try:
        sql_create_table = "create table if not exists tl_babtc_flash(babtc_id int primary key , source varchar(100) , title varchar(100) , babtc_content varchar(100) , babtc_post_date int, babtc_views int, babtc_post_name varchar(100) , babtc_desc varchar(100));"
        cursor.execute(sql_create_table)
        # 5. Run SQL queries
        # 6. Use fetchone() to fetch a single row
        # 7. Insert into the table
        ### 1. With sqlite3, the placeholder in an insert statement is "?":
        ###    cur.execute("insert into user values(?,?,?)", (1, 2, "zhang"))
        ### 2. With pymysql connected to MySQL, the placeholder is "%s":
        ###    cur.execute("insert into user values(%s,%s,%s)", (1, 2, "zhang"))
        sql_insert_flash = ("insert into tl_babtc_flash(babtc_id, source, title, babtc_content, babtc_post_date, "
                            "babtc_views, babtc_post_name, babtc_desc)"
                            " values (%s, %s, %s, %s, %s, %s, %s, %s);")
        cursor.execute(sql_insert_flash, (1, "source", "title", "content", 100, 99, "post_name", "desc"))
        connection.commit()
        # 8. Update rows
        # 9. Query rows
        # 10. Delete rows
    # 11. Transaction handling: undo the uncommitted changes on any error
    except Exception:
        connection.rollback()
        # 12. Exception handling: re-raise so the failure is not silently swallowed
        raise
    # 13. Close the connection
    finally:
        connection.close()
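The "?" placeholder style mentioned in the comments can be tried without a MySQL server, because sqlite3 ships with Python. The following is a minimal sketch using an in-memory database; the user table and its column names (id, age, name) are assumptions chosen to fit the three-value insert shown in the comments.

```python
import sqlite3

# An in-memory database avoids touching any file or server.
connection = sqlite3.connect(":memory:")
try:
    cursor = connection.cursor()
    cursor.execute(
        "create table if not exists user(id integer primary key, age integer, name text)"
    )
    # sqlite3 uses "?" placeholders; pymysql would use "%s" in the same positions.
    cursor.execute("insert into user values(?,?,?)", (1, 2, "zhang"))
    connection.commit()

    cursor.execute("select id, age, name from user where name = ?", ("zhang",))
    row = cursor.fetchone()
    print(row)
finally:
    connection.close()
```

In both drivers the values are passed as a separate tuple rather than formatted into the SQL string, which lets the driver handle quoting and escaping.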

2. Ubuntu server side

# coding=utf-8
import pymysql

if __name__ == '__main__':
    # 1. Open the database connection (no database is selected yet; step 3 creates it)
    connection = pymysql.connect(
        host="localhost",
        user="root",
        password="91499419",
        port=3306)
    # 2. Get a cursor object
    cursor = connection.cursor()
    sql = "select version()"  # query the server version
    cursor.execute(sql)
    data = cursor.fetchone()
    print(data)
    # 3. Creating and dropping databases
    ## Create the database if it does not exist, then switch to it
    sql_create_db = "create database if not exists database_chain_news charset utf8;"
    cursor.execute(sql_create_db)
    sql_use_db = "use database_chain_news;"
    cursor.execute(sql_use_db)
    ## Drop a database if it exists (shown for reference only, never executed here):
    sql_drop_db = "drop database if exists database_name;"
    # 4. Creating and dropping tables
    ## Create the table if it does not exist
    try:
        sql_create_table = "create table if not exists tl_babtc_flash(babtc_id int primary key , source varchar(100) , title varchar(100) , babtc_content varchar(100) , babtc_post_date int, babtc_views int, babtc_post_name varchar(100) , babtc_desc varchar(100));"
        cursor.execute(sql_create_table)

        # 5. Run SQL queries
        # 6. Use fetchone() to fetch a single row
        # 7. Insert into the table
        ### 1. With sqlite3, the placeholder in an insert statement is "?":
        ###    cur.execute("insert into user values(?,?,?)", (1, 2, "zhang"))

        ### 2. With pymysql connected to MySQL, the placeholder is "%s":
        ###    cur.execute("insert into user values(%s,%s,%s)", (1, 2, "zhang"))

        sql_insert_flash = "insert into tl_babtc_flash(babtc_id, source, title, babtc_content, babtc_post_date, babtc_views, babtc_post_name, babtc_desc) values (%s, %s, %s, %s, %s, %s, %s, %s);"

        cursor.execute(sql_insert_flash, (1, "source", "title", "content", 100, 99, "post_name", "desc"))

        connection.commit()
        # 8. Update rows
        # 9. Query rows
        # 10. Delete rows
    # 11. Transaction handling: undo the uncommitted changes on any error
    except Exception:
        connection.rollback()
        # 12. Exception handling: re-raise so the failure is not silently swallowed
        raise
    # 13. Close the connection
    finally:
        connection.close()
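The commit and rollback pattern in the script can be observed without a MySQL server. The sketch below uses sqlite3 as a stand-in for pymysql, so it illustrates the transaction idea rather than MySQL's exact behaviour; the flash table and the insert_batch helper are hypothetical names introduced for the example.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table flash(id integer primary key, title text)")
connection.commit()


def insert_batch(rows):
    """Insert all rows in one transaction; roll everything back on a key conflict."""
    try:
        cursor.executemany("insert into flash(id, title) values (?, ?)", rows)
        connection.commit()
        return True
    except sqlite3.IntegrityError:
        connection.rollback()
        return False


first = insert_batch([(1, "a"), (2, "b")])
# The second batch reuses id 1, so the whole batch is undone, including id 3.
second = insert_batch([(3, "c"), (1, "duplicate")])

cursor.execute("select count(*) from flash")
count = cursor.fetchone()[0]
print(first, second, count)
connection.close()
```

This also hints at what happens when the script above is run a second time: inserting babtc_id 1 again would violate the primary key, so the insert fails and the rollback branch runs.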