
1. Installing BeautifulSoup4
Installing with easy_install (easy_install itself must be installed first):

    easy_install beautifulsoup4

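Installing with pip (pip must also be installed first). Note that PyPI also hosts a package named BeautifulSoup; that is the Beautiful Soup 3 release, and installing it is not recommended here.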

    pip install beautifulsoup4

Installing on Debian or Ubuntu:

    apt-get install python-bs4

You can also install from source: download the BS4 source and run:

    python setup.py install
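
Whichever installation method you choose, it can help to confirm that Python can actually find the package. The sketch below is only one way to check; the helper name is_installed is made up for this illustration, and it relies on nothing but the standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# Beautiful Soup 4 is imported under the name "bs4".
print('bs4 installed:', is_installed('bs4'))
```

If this prints False, the package was probably installed for a different Python interpreter than the one running the check.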

2. A quick trial run

    # coding=utf-8
    '''
    Download the images of a Baidu Tieba thread with BeautifulSoup
    '''
    import urllib.request

    from bs4 import BeautifulSoup

    url = 'http://tieba.baidu.com/p/3537654215'

    # Download the page
    html = urllib.request.urlopen(url)
    content = html.read()
    html.close()

    # Match the images with BeautifulSoup (the parser is named explicitly)
    html_soup = BeautifulSoup(content, 'html.parser')

    # The image markup was already analyzed in "Python爬蟲基礎1--urllib"
    # (http://blog.xiaolud.com/2015/01/22/spider-1st/)
    # Compared with matching by regular expression, BeautifulSoup offers
    # a simpler and more flexible approach
    all_img_links = html_soup.find_all('img', class_='BDE_Image')

    # What follows is the usual image download loop
    img_counter = 1
    for img_link in all_img_links:
        img_name = '%s.jpg' % img_counter
        urllib.request.urlretrieve(img_link['src'], img_name)
        img_counter += 1

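That is all there is to it, and the code comments explain each step. Compared with regular expressions, BeautifulSoup offers a simpler and more flexible way to analyze a site's source and get at the image links faster.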
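
To try the matching step without touching the network, the same lookup can be run against an inline HTML string. This is a minimal sketch: the sample markup and the helper name extract_image_links are invented for illustration, and only the class name BDE_Image comes from the script above. It uses find_all(), the current spelling of findAll().

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates a thread page; only the class name
# BDE_Image is taken from the script above.
SAMPLE_HTML = """
<div>
  <img class="BDE_Image" src="http://example.com/a.jpg">
  <img class="avatar" src="http://example.com/face.png">
  <img class="BDE_Image" src="http://example.com/b.jpg">
</div>
"""


def extract_image_links(html, css_class='BDE_Image'):
    """Return the src of every <img> tag that carries the given class."""
    soup = BeautifulSoup(html, 'html.parser')
    images = soup.find_all('img', class_=css_class)
    # Skip tags without a src attribute instead of raising KeyError.
    return [img['src'] for img in images if img.has_attr('src')]


print(extract_image_links(SAMPLE_HTML))
```

Only the two images with the BDE_Image class are returned; the avatar image is ignored.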

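3. A scraping example
3.1 Basic scraping technique
When writing a scraper script, the first thing to do is inspect the target page by hand to work out how the data can be located.

To begin with, let's look at the list of PyCon videos at http://pyvideo.org/category/50/pycon-us-2014. Inspecting the page's HTML source shows that the video list is structured roughly like this: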

    <div id="video-summary-content">
      <div class="video-summary">  <!-- first video -->
        <div class="thumbnail-data">...</div>
        <div class="video-summary-data">
          <div>
            <strong><a href="#link to video page#">#title#</a></strong>
          </div>
        </div>
      </div>
      <div class="video-summary">  <!-- second video -->
        ...
      </div>
      ...
    </div>

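So the first task is to load this page and extract the link to each individual video page, because the links to the YouTube videos live on those individual pages.

Loading a web page with requests is very simple: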

    import requests

    response = requests.get('http://pyvideo.org/category/50/pycon-us-2014')

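That's it! Once this function returns, the page's HTML is available in response.text.

The next task is to extract the links to the individual video pages. BeautifulSoup can do this with CSS selector syntax, which will feel familiar if you do client-side development.

To get these links we use a selector that matches every <a> element inside each <div> whose class is video-summary-data. Since each video has several <a> elements, we keep only those whose URL starts with /video, which uniquely identify the individual video pages. The CSS selector that implements these criteria is div.video-summary-data a[href^="/video"] (the attribute value is quoted because it contains a slash). The following snippet uses this selector with BeautifulSoup to get the <a> elements that point to video pages: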

    import bs4

    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    links = soup.select('div.video-summary-data a[href^="/video"]')

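Since what we really care about is the link itself rather than the <a> element that contains it, we can improve the code above with a list comprehension:

    links = [a.attrs.get('href') for a in soup.select('div.video-summary-data a[href^="/video"]')]

We now have a list of all the links, each pointing to an individual video page.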
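
The selector can be tried offline against a small HTML string shaped like the listing shown earlier. This is only a sketch: the sample markup, its URLs, and the helper name video_links are invented for illustration and are not taken from the real site.

```python
import bs4

# Hypothetical markup following the structure of the video list above.
SAMPLE_INDEX = """
<div id="video-summary-content">
  <div class="video-summary">
    <div class="thumbnail-data"><a href="/video/1/first-talk">thumbnail</a></div>
    <div class="video-summary-data">
      <div><strong><a href="/video/1/first-talk">First talk</a></strong></div>
      <div><a href="/speaker/9/someone">Someone</a></div>
    </div>
  </div>
  <div class="video-summary">
    <div class="video-summary-data">
      <div><strong><a href="/video/2/second-talk">Second talk</a></strong></div>
    </div>
  </div>
</div>
"""


def video_links(html):
    """Return the href of each video link inside a video-summary-data div."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    anchors = soup.select('div.video-summary-data a[href^="/video"]')
    return [a.attrs.get('href') for a in anchors]


print(video_links(SAMPLE_INDEX))
```

The thumbnail link is skipped because it sits outside div.video-summary-data, and the speaker link is skipped because its href does not start with /video.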

The following script puts together everything covered so far:

    import requests
    import bs4

    root_url = 'http://pyvideo.org'
    index_url = root_url + '/category/50/pycon-us-2014'


    def get_video_page_urls():
        # Load the index page and collect the relative URL of every video page.
        response = requests.get(index_url)
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        return [a.attrs.get('href') for a in soup.select('div.video-summary-data a[href^="/video"]')]


    print(get_video_page_urls())

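If you run the script above you will get a list full of URLs. Next we need to parse each URL to get more information about each PyCon session.

3.2 Scraping linked pages
The next step is to load every page in our URL list. If you want to see what these pages look like, here is an example: http://pyvideo.org/video/2668/writing-restful-web-services-with-flask. Yes, that's me, that's one of my sessions!

From these pages we can scrape the session title, which appears at the top of the page. We can also get the speakers' names and the YouTube link from the sidebar, which sits to the lower right of the embedded video. The code that gets these elements is shown below: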

    def get_video_data(video_page_url):
        video_data = {}
        # The scraped URL is relative, so prepend root_url.
        response = requests.get(root_url + video_page_url)
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        video_data['title'] = soup.select('div#videobox h3')[0].get_text()
        video_data['speakers'] = [a.get_text() for a in soup.select('div#sidebar a[href^="/speaker"]')]
        # The text of this link is the YouTube URL itself.
        video_data['youtube_url'] = soup.select('div#sidebar a[href^="http://www.youtube.com"]')[0].get_text()

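A few things to note about this function:

The URLs scraped from the index page are relative paths, so root_url has to be prepended.
The session title is taken from the <h3> element inside the <div> with id videobox. Note that the [0] is required, because select() returns a list even when there is only one match.
The speakers' names and the YouTube link are obtained in much the same way as the links on the index page.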
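
The first two notes can be illustrated with the standard library alone. The sketch below swaps the string concatenation for urllib.parse.urljoin, which is a different technique from the one in the function above, and adds a guard for empty select() results. Both helper names, absolute_url and first_or_none, are made up for this example.

```python
from urllib.parse import urljoin

root_url = 'http://pyvideo.org'


def absolute_url(path):
    """Resolve a scraped path against root_url; absolute URLs pass through."""
    return urljoin(root_url, path)


def first_or_none(matches):
    """Return the first element of a select()-style list, or None if empty."""
    return matches[0] if matches else None


print(absolute_url('/video/2668/writing-restful-web-services-with-flask'))
print(first_or_none([]))
```

Using a guard such as first_or_none avoids an IndexError when a page is missing the expected element, at the cost of having to handle None afterwards.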

Now all that remains is to scrape the view count from each video's YouTube page. Extending the function above to do so is quite simple. In the same way, we can also scrape the like and dislike counts.

    def get_video_data(video_page_url):
        # ...
        # This part relies on `import re` at the top of the script.
        response = requests.get(video_data['youtube_url'])
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        video_data['views'] = int(re.sub('[^0-9]', '',
                                         soup.select('.watch-view-count')[0].get_text().split()[0]))
        video_data['likes'] = int(re.sub('[^0-9]', '',
                                         soup.select('.likes-count')[0].get_text().split()[0]))
        video_data['dislikes'] = int(re.sub('[^0-9]', '',
                                            soup.select('.dislikes-count')[0].get_text().split()[0]))
        return video_data

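The soup.select() calls above use selectors for specific class names to collect the video statistics. The element text needs some processing before it can become a number, though. Take the view count: YouTube displays it as "1,344 views". After splitting the number from the text on whitespace, only the first part is useful. Because the number contains a comma, a regular expression is used to filter out every character that is not a digit.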
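
The text-to-number step can be isolated into a small helper and tried on its own. This is a minimal sketch of the same split-then-filter approach; the function name parse_count is made up for this example.

```python
import re


def parse_count(text):
    """Turn a label such as "1,344 views" into the integer 1344."""
    # Keep only the first whitespace-separated token, e.g. "1,344".
    first_token = text.split()[0]
    # Drop every character that is not a digit, e.g. the thousands comma.
    digits = re.sub('[^0-9]', '', first_token)
    return int(digits)


print(parse_count('1,344 views'))
```

Note that an empty string raises IndexError and a token without digits raises ValueError, so the caller should expect both if the page layout changes.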

To complete the scraper, the following function invokes all the code shown so far:

    def show_video_stats():
        video_page_urls = get_video_page_urls()
        for video_page_url in video_page_urls:
            print(get_video_data(video_page_url))

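3.3 Parallel processing
The script so far works well, but with over a hundred videos it takes a while to run. We are actually doing very little work; most of the time is spent downloading pages, and the script is blocked during that time. It would be more efficient if the script could run several downloads at the same time, wouldn't it?

Back when I wrote a scraping article using Node.js, concurrency came for free with the asynchronous nature of JavaScript. The same can be done in Python, but it has to be specified explicitly. For this example I start a pool of 8 worker processes that can run in parallel. The code is surprisingly concise: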

    from multiprocessing import Pool

    def show_video_stats(options):
        pool = Pool(8)
        video_page_urls = get_video_page_urls()
        results = pool.map(get_video_data, video_page_urls)

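The multiprocessing.Pool class starts 8 worker processes that wait to be given tasks to run. Why 8? It is twice the number of cores on my computer. While experimenting with different pool sizes I found this to be the sweet spot: fewer than 8 makes the script run too slowly, and more than 8 does not make it any faster.

The call to pool.map() is similar to the regular map(): it calls the function given as the first argument once for each element of the iterable given as the second argument. The big difference is that it sends these calls to the processes owned by the pool, so in this example eight tasks run in parallel.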
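
The relationship between map() and pool.map() can be seen in a small offline experiment. Note the substitution: this sketch uses multiprocessing.pool.ThreadPool, a thread-based pool with the same map() interface, rather than the process pool used in the article, because threads are enough to overlap waiting time and are easier to run in any environment. The function fake_download and its sleep are stand-ins for a real page download.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_download(url):
    """Stand-in for a page download: wait briefly, then return a result."""
    time.sleep(0.05)
    return len(url)


def run_sequential(urls):
    # The regular map() calls the function once per element, one at a time.
    return list(map(fake_download, urls))


def run_pooled(urls, workers=8):
    # pool.map() makes the same calls but hands them to the pool's workers.
    with ThreadPool(workers) as pool:
        return pool.map(fake_download, urls)


if __name__ == '__main__':
    urls = ['/video/%d' % i for i in range(16)]
    start = time.time()
    sequential = run_sequential(urls)
    middle = time.time()
    pooled = run_pooled(urls)
    end = time.time()
    print('same results:', sequential == pooled)
    print('sequential: %.2fs, pooled: %.2fs' % (middle - start, end - middle))
```

Both versions return the results in input order; only the elapsed time differs.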

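The time savings are considerable. On my computer the first version of the script finished in 75 seconds, while the pool version did the same work in just 16 seconds!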

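3.4 The complete scraper script
My final version of the scraper does a few more things after getting the data.

I added a --sort command-line option to specify the sort criterion, which can be views, likes or dislikes. The script sorts the result list in descending order by the chosen attribute. Another option, --max, sets the number of results to show, in case you only want to see the top few. Finally, I added a --csv option that prints the data in CSV format instead of as an aligned table, which makes it easy to import into a spreadsheet. The script also takes a --workers option that sets the pool size (8 by default).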
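
The sorting and truncation behind --sort and --max, and the CSV output behind --csv, can be sketched separately from the scraping. The three sample records reuse figures from the article's own output listing; the helper names top_entries and to_csv are made up, and the use of the standard csv module is a substitution for the manual string formatting in the full script, chosen because it escapes quotes inside titles.

```python
import csv
import io

# Three records taken from the output listing in this article.
results = [
    {'title': 'Flask by Example', 'speakers': ['Miguel Grinberg'],
     'views': 1402, 'likes': 12, 'dislikes': 0},
    {'title': 'Keynote - Guido Van Rossum', 'speakers': ['Guido Van Rossum'],
     'views': 3002, 'likes': 27, 'dislikes': 0},
    {'title': 'Analyzing Rap Lyrics with Python', 'speakers': ['Julie Lavoie'],
     'views': 2165, 'likes': 27, 'dislikes': 6},
]


def top_entries(videos, sort_field='views', max_entries=None):
    """Sort in descending order by sort_field and keep the first max_entries."""
    ordered = sorted(videos, key=lambda video: video[sort_field], reverse=True)
    return ordered if max_entries is None else ordered[:max_entries]


def to_csv(videos):
    """Render the records as CSV text, quoting every non-numeric field."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(['title', 'speakers', 'views', 'likes', 'dislikes'])
    for video in videos:
        writer.writerow([video['title'], ', '.join(video['speakers']),
                         video['views'], video['likes'], video['dislikes']])
    return buffer.getvalue()


print(to_csv(top_entries(results, 'views', 2)))
```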

The complete script is shown below:

    import argparse
    import re
    from multiprocessing import Pool

    import requests
    import bs4

    root_url = 'http://pyvideo.org'
    index_url = root_url + '/category/50/pycon-us-2014'


    def get_video_page_urls():
        # Load the index page and collect the relative URL of every video page.
        response = requests.get(index_url)
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        return [a.attrs.get('href') for a in soup.select('div.video-summary-data a[href^="/video"]')]


    def get_video_data(video_page_url):
        # Scrape title, speakers and YouTube link from the video page.
        video_data = {}
        response = requests.get(root_url + video_page_url)
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        video_data['title'] = soup.select('div#videobox h3')[0].get_text()
        video_data['speakers'] = [a.get_text() for a in soup.select('div#sidebar a[href^="/speaker"]')]
        video_data['youtube_url'] = soup.select('div#sidebar a[href^="http://www.youtube.com"]')[0].get_text()
        # Scrape the statistics from the YouTube page.
        response = requests.get(video_data['youtube_url'])
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        video_data['views'] = int(re.sub('[^0-9]', '',
                                         soup.select('.watch-view-count')[0].get_text().split()[0]))
        video_data['likes'] = int(re.sub('[^0-9]', '',
                                         soup.select('.likes-count')[0].get_text().split()[0]))
        video_data['dislikes'] = int(re.sub('[^0-9]', '',
                                            soup.select('.dislikes-count')[0].get_text().split()[0]))
        return video_data


    def parse_args():
        parser = argparse.ArgumentParser(description='Show PyCon 2014 video statistics.')
        parser.add_argument('--sort', metavar='FIELD', choices=['views', 'likes', 'dislikes'],
                            default='views',
                            help='sort by the specified field. Options are views, likes and dislikes.')
        parser.add_argument('--max', metavar='MAX', type=int, help='show the top MAX entries only.')
        parser.add_argument('--csv', action='store_true', default=False,
                            help='output the data in CSV format.')
        parser.add_argument('--workers', type=int, default=8,
                            help='number of workers to use, 8 by default.')
        return parser.parse_args()


    def show_video_stats(options):
        pool = Pool(options.workers)
        video_page_urls = get_video_page_urls()
        # Download all pages in parallel, then sort in descending order.
        results = sorted(pool.map(get_video_data, video_page_urls),
                         key=lambda video: video[options.sort], reverse=True)
        # Named max_entries so that the built-in max() is not shadowed.
        max_entries = options.max
        if max_entries is None or max_entries > len(results):
            max_entries = len(results)
        if options.csv:
            print(u'"title","speakers","views","likes","dislikes"')
        else:
            print(u'Views +1 -1 Title (Speakers)')
        for i in range(max_entries):
            if options.csv:
                print(u'"{0}","{1}",{2},{3},{4}'.format(
                    results[i]['title'], ', '.join(results[i]['speakers']), results[i]['views'],
                    results[i]['likes'], results[i]['dislikes']))
            else:
                print(u'{0:5d} {1:3d} {2:3d} {3} ({4})'.format(
                    results[i]['views'], results[i]['likes'], results[i]['dislikes'], results[i]['title'],
                    ', '.join(results[i]['speakers'])))


    if __name__ == '__main__':
        show_video_stats(parse_args())

Below are the 25 most viewed sessions at the time I finished writing the code:

    (venv) $ python pycon-scraper.py --sort views --max 25 --workers 8
    Views +1 -1 Title (Speakers)
     3002  27   0 Keynote - Guido Van Rossum (Guido Van Rossum)
     2564  21   0 Computer science fundamentals for self-taught programmers (Justin Abrahms)
     2369  17   0 Ansible - Python-Powered Radically Simple IT Automation (Michael Dehaan)
     2165  27   6 Analyzing Rap Lyrics with Python (Julie Lavoie)
     2158  24   3 Exploring Machine Learning with Scikit-learn (Jake Vanderplas, Olivier Grisel)
     2065  13   0 Fast Python, Slow Python (Alex Gaynor)
     2024  24   0 Getting Started with Django, a crash course (Kenneth Love)
     1986  47   0 It's Dangerous to Go Alone: Battling the Invisible Monsters in Tech (Julie Pagano)
     1843  24   0 Discovering Python (David Beazley)
     1672  22   0 All Your Ducks In A Row: Data Structures in the Standard Library and Beyond (Brandon Rhodes)
     1558  17   1 Keynote - Fernando Pérez (Fernando Pérez)
     1449   6   0 Descriptors and Metaclasses - Understanding and Using Python's More Advanced Features (Mike Müller)
     1402  12   0 Flask by Example (Miguel Grinberg)
     1342   6   0 Python Epiphanies (Stuart Williams)
     1219   5   0 0 to 00111100 with web2py (G. Clifford Williams)
     1169  18   0 Cheap Helicopters In My Living Room (Ned Jackson Lovely)
     1146  11   0 IPython in depth: high productivity interactive and parallel python (Fernando Perez)
     1127   5   0 2D/3D graphics with Python on mobile platforms (Niko Skrypnik)
     1081   8   0 Generators: The Final Frontier (David Beazley)
     1067  12   0 Designing Poetic APIs (Erik Rose)
     1064   6   0 Keynote - John Perry Barlow (John Perry Barlow)
     1029  10   0 What Is Async, How Does It Work, And When Should I Use It? (A. Jesse Jiryu Davis)
      981  11   0 The Sorry State of SSL (Hynek Schlawack)
      961  12   2 Farewell and Welcome Home: Python in Two Genders (Naomi Ceder)
      958   6   0 Getting Started Testing (Ned Batchelder)