Wget and Curl
Category: /knowledge /linux
Tags: linux
Wget
Exclude a specific directory
wget --exclude-directories=/media/ -m $web_url
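To skip more than one directory, --exclude-directories (short form -X) accepts a comma-separated list. A minimal sketch: the helper name mirror_excluding, the example.com URL, and the /static path are made up for illustration, and the function only prints the command so it can be checked before running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a mirror command that skips several directories.
# Usage: mirror_excluding URL DIR [DIR...]
mirror_excluding() {
    local url=$1; shift
    local IFS=,    # "$*" joins the remaining arguments with the first char of IFS
    printf 'wget -m --exclude-directories=%s %s\n' "$*" "$url"
}

mirror_excluding http://example.com/ /media /static
```

Pipe the printed line to sh, or copy it, once it looks right.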
Download a single page
The trick here is to get the images referenced from CSS. This requires wget 1.12 or later.
wget -E -H -k -K -p url
or
wget -pk url
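-p/--page-requisites
: The most important option. It fetches everything the page needs to display, such as inlined images, sounds, and imported CSS.
-k/--convert-links
: The second most important. Converts the links in the downloaded files for local viewing.
-E/--adjust-extension
: Appends a .html suffix to downloaded HTML files that lack one; useful for URLs ending in .asp and the like.
-H/--span-hosts
: Allows the download to span across hosts.
-K/--backup-converted
: Keeps a backup of each converted file with a .orig suffix.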
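For scripts, the long option names are easier to read than -E -H -k -K -p. The sketch below spells out the same single-page command with long options; the function name single_page_cmd and the example.com URL are assumptions, and it prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the single-page download command with long options.
single_page_cmd() {
    local url=$1
    local args=(
        --adjust-extension    # -E
        --span-hosts          # -H
        --convert-links       # -k
        --backup-converted    # -K
        --page-requisites     # -p
    )
    printf '%s ' wget "${args[@]}"   # command name and options, space-separated
    printf '%s\n' "$url"             # URL last, then a newline
}

single_page_cmd http://example.com/page.html
```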
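Forge a User-Agent
Some sites do not like wget, curl, and similar tools crawling their content. In that case you need to set the User-Agent and other header fields. It is best to put them in .wgetrc.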
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
header = Accept-Encoding: gzip,deflate
header = Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
header = Keep-Alive: 300
user_agent = Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.13) Gecko/2009080316 Ubuntu/8.04 (hardy) Firefox/3.0.13
referer = http://www.google.com
Or pass them on the command line:
wget --referer="http://www.google.com" \
--user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" \
--header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" \
--header="Accept-Language: en-us,en;q=0.5" \
--header="Accept-Encoding: gzip,deflate" \
--header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" \
--header="Keep-Alive: 300" \
url
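To try browser-like settings without touching the default ~/.wgetrc, one option is a separate startup file selected through the WGETRC environment variable, which wget reads in place of the user's default file. A minimal sketch: the file name browser.wgetrc and its location are assumptions, only three of the settings are repeated here, and the last line prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical location for a separate wget startup file.
rc=${TMPDIR:-/tmp}/browser.wgetrc

# Write a few of the browser-like settings shown above.
cat > "$rc" <<'EOF'
header = Accept-Language: en-us,en;q=0.5
user_agent = Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.13) Gecko/2009080316 Ubuntu/8.04 (hardy) Firefox/3.0.13
referer = http://www.google.com
EOF

# Use the file for a single invocation only (printed, not executed).
echo "WGETRC=$rc wget url"
```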
curl
I always thought that with wget around, curl was of little use. Later I found it quite handy for inspecting HTTP headers.
Installation
aptitude install curl python-pycurl
Show HTTP headers with curl
curl -I http://google.com
HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/
Content-Type: text/html; charset=UTF-8
Date: Mon, 12 Sep 2011 19:32:47 GMT
Expires: Wed, 12 Oct 2011 19:32:47 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 219
X-XSS-Protection: 1; mode=block
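When only one header matters, the output above can be filtered. A small sketch, assuming the usual "Name: value" layout with a space after the colon; the function name header_value is hypothetical, and the sample input is cut down from the response shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `curl -I` output on stdin and print one header's value.
# Header names are matched case-insensitively.
header_value() {
    local name=$1
    tr -d '\r' | awk -v want="$name" '
        BEGIN { want = tolower(want) ":" }
        tolower($1) == want {
            sub(/^[^:]*:[ \t]*/, "")   # drop "Name: " and keep the rest of the line
            print
        }
    '
}

# Offline sample; live use would be: curl -sI http://google.com | header_value content-type
printf 'HTTP/1.1 301 Moved Permanently\r\nContent-Type: text/html; charset=UTF-8\r\nServer: gws\r\n\r\n' \
    | header_value content-type
```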
Follow redirects automatically
curl -L url
-L/--location
: follow a 3xx response to the URL given in its Location header.
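Combining -I with -L prints the headers of every hop, which makes the redirect chain visible. A rough sketch that condenses such output to one line per hop; the function name redirect_chain is an assumption, and the sample input is a shortened, made-up two-hop response.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `curl -sIL` output on stdin and print the redirect chain.
redirect_chain() {
    tr -d '\r' | awk '
        /^HTTP\// { code = $2 }                                  # status line of each hop
        tolower($1) == "location:" { print code " -> " $2 }      # hop that redirects
        END { print code " (final)" }                            # status of the last response
    '
}

# Offline sample; live use would be: curl -sIL url | redirect_chain
printf 'HTTP/1.1 301 Moved Permanently\r\nLocation: http://www.google.com/\r\n\r\nHTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n' \
    | redirect_chain
```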