dayjournal | Python #006 - Easy Web Scraping

This article is the day 23 entry of the "python Advent Calendar 2017".

A quick memo on web scraping with minimal effort.

Python has various modules for web scraping; this time I use "lxml", together with "requests" to fetch the page.

First, inspect the HTML elements you want to scrape (for example, with the browser's developer tools).
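To get a feel for which elements an XPath expression picks up before writing the real script, it can help to try it on a small fragment first. The sketch below is only an illustration: SAMPLE_HTML is a hypothetical fragment shaped to match the post-tags -> a -> span structure targeted later in script.py, not the actual markup of the site.

```python
import lxml.html

# Hypothetical HTML fragment (an assumption, not the real page markup),
# shaped like the post-tags -> a -> span structure that script.py targets.
SAMPLE_HTML = """
<html><body>
  <div class="post-tags">
    <a href="/tags/python"><span>Python</span></a>
    <a href="/tags/leaflet"><span>Leaflet</span></a>
  </div>
  <div class="other">
    <a href="/about"><span>About</span></a>
  </div>
</body></html>
"""

root = lxml.html.fromstring(SAMPLE_HTML)

# Select each <span> that is a child of an <a> directly inside
# a <div> whose class attribute is exactly "post-tags".
spans = root.xpath('//div[@class="post-tags"]/a/span')
tags = [span.text for span in spans]
print(tags)
```

Note that @class="post-tags" compares the whole attribute value, so a div with several classes (such as class="post-tags small") would not match this expression.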

Write the code. If a module is missing, install it with pip (for example: pip install requests lxml).

script.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Import the required modules
import requests
import lxml.html

# Target URL
URL = "https://day-journal.com/memo"

# Fetch the web page (extract the contents of post-tags -> a -> span)
req = requests.get(URL)
root = lxml.html.fromstring(req.text)
codeV = root.xpath('//div[@class="post-tags"]/a/span')

# Open the text file to save to (it is closed automatically when the block ends)
with open("./example.txt", "w") as f:
    # Print each value and write it to the text file, one per line
    for code in codeV:
        output = code.text
        print(output)
        f.write(str(output))
        f.write('\n')
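As a possible variation on script.py, the fetching, parsing, and saving steps can be split into small functions, which makes the parsing part easy to try on a local HTML string. This is only a sketch under some assumptions: the function names fetch_html, extract_tags, and save_tags are hypothetical, and the 10-second timeout and explicit UTF-8 encoding are choices made here, not part of the original script.

```python
import requests
import lxml.html


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


def extract_tags(html_text):
    """Return the text of each post-tags -> a -> span element."""
    root = lxml.html.fromstring(html_text)
    spans = root.xpath('//div[@class="post-tags"]/a/span')
    # text_content() also covers spans that contain nested markup
    return [span.text_content().strip() for span in spans]


def save_tags(tags, path):
    """Write one tag per line, with the file encoding stated explicitly."""
    with open(path, "w", encoding="utf-8") as f:
        for tag in tags:
            f.write(tag + "\n")
```

With these helpers, save_tags(extract_tags(fetch_html(URL)), "./example.txt") would follow the same flow as script.py, with URL being the same address used there.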

Run the command in the directory that contains the script.


python script.py

The scraping results are written to the text file.

example.txt


Try
Try
Try
Python
webpack
Leaflet
Riot.js
webpack
Leaflet
webpack
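The same tag appears several times in example.txt because it is attached to more than one post. If a per-tag tally is more useful than the raw list, something like the following sketch could be used. The lines list simply repeats the ten lines of output shown above; reading them back from example.txt would work the same way.

```python
from collections import Counter

# The ten lines of example.txt from the run above, copied in as a list.
lines = [
    "Try", "Try", "Try", "Python", "webpack",
    "Leaflet", "Riot.js", "webpack", "Leaflet", "webpack",
]

# Count how many times each tag occurs.
counts = Counter(lines)

# Print the tags from most to least frequent.
for tag, n in counts.most_common():
    print(tag, n)
```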

References
Python
lxml
Fetching web information with "Requests" and "lxml"


Yasunori Kirimoto 2017/12/23
