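In April I set myself the task of writing a series on web crawlers, going from theory to practice, so that readers learn not only what works but also why it works. I hope to keep it thorough yet accessible. With crawlers as the main thread, the series covers the HTTP protocol, regular expressions, the Scrapy crawler framework, message queues, databases, and more. Follow the WeChat official account "Python之禅" for the latest posts.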
Crawling is the process of imitating a browser by sending HTTP requests.
What is HTTP?
Every web page you visit is delivered over HTTP. HTTP is the protocol that clients (browsers) and servers use to exchange data in web applications. It specifies the format in which a client must send a request to the server, and it also fixes the format of the response the server returns.
As long as everyone sends requests and returns responses the way the protocol prescribes, anyone can implement their own web client (a browser, a crawler) or web server (Nginx, Apache, and so on) on top of HTTP.
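HTTP itself is very simple. It stipulates that only the client may initiate a request; the server receives the request, processes it, and returns a response. HTTP is also a stateless protocol: the protocol itself keeps no record of a client's earlier requests.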
So how does HTTP define the request and response formats? In other words, what format must a client follow to send a valid HTTP request, and what format must the server use so that the client can parse the response correctly?
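HTTP request
An HTTP request has three parts: the request line, the request headers, and the request body. The headers and the body are optional in the message format and not every request needs them, although an HTTP/1.1 request is expected to carry at least a Host header.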
Request line
The request line is the one part every request must have. It consists of three fields separated by spaces: the request method, the request URL (URI), and the HTTP protocol version.
The most commonly used request methods are GET, POST, PUT, and DELETE. GET retrieves a resource from the server, and roughly 90% of crawlers fetch their data with GET requests.
The request URL is the path of the resource on the server. In the example in the figure above, the client wants the resource index.html, which is located under the root directory (/) of the server foofish.net.
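To make the three-field layout concrete, here is a minimal sketch that splits a request line on spaces. The function name parse_request_line is made up for illustration, and the sample line simply mirrors the index.html example.

```python
def parse_request_line(line: str) -> dict:
    """Split an HTTP request line into method, target, and version."""
    # The three fields are separated by single spaces.
    method, target, version = line.strip().split(" ", 2)
    return {"method": method, "target": target, "version": version}


# Sample request line for the index.html resource (illustrative only).
parts = parse_request_line("GET /index.html HTTP/1.1")
print(parts)
```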
Request headers
The request line can carry very little information, so everything else the client wants to tell the server has to go into the request headers. Headers give the server additional information. For example, User-Agent identifies the client, so the server can tell whether a request comes from a browser or a crawler, and whether it comes from Chrome or Firefox. HTTP/1.1 defines 47 header fields. A header field looks much like an entry in a Python dict: a key-value pair separated by a colon. For example:
User-Agent: Mozilla/5.0
The data (message) a client sends is just one long string, so something has to mark where the headers end and the body begins. An empty line does this: the first empty line signals the end of the headers and the start of the request body.
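The dict analogy and the empty-line rule can be sketched together. The helper build_request below is hypothetical, not part of any library; it turns a dict into header lines and ends the header block with an empty line, and the header values are only sample data.

```python
def build_request(request_line: str, headers: dict, body: bytes = b"") -> bytes:
    """Assemble a request message: request line, headers, empty line, body."""
    lines = [request_line]
    # Each dict entry becomes one "Name: value" header line.
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # CRLF ends every line; the extra CRLF is the empty line after the headers.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


message = build_request(
    "GET /index.html HTTP/1.1",
    {"Host": "foofish.net", "User-Agent": "Mozilla/5.0"},
)
# Everything before the first empty line is the request line plus headers.
head, _, body = message.partition(b"\r\n\r\n")
print(head.decode("ascii"))
```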
Request body
The request body is the actual content the client submits to the server: the username and password sent at login, the data of a file upload, or the form fields submitted when registering a user.
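As an illustration of a request that carries a body, the sketch below builds a login form submission. The host example.com, the /login path, and the credentials are all placeholders; the point is only that the body follows the empty line and that Content-Length states its size in bytes.

```python
from urllib.parse import urlencode

# Hypothetical form fields; a real site defines its own field names.
form = {"username": "alice", "password": "secret"}
body = urlencode(form).encode("ascii")

head = "\r\n".join([
    "POST /login HTTP/1.1",
    "Host: example.com",
    "Content-Type: application/x-www-form-urlencoded",
    f"Content-Length: {len(body)}",
    "",  # ends the last header line
    "",  # the empty line that separates headers from body
])
message = head.encode("ascii") + body
print(message)
```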
Now let's use the lowest-level API Python offers, the socket module, to send an HTTP request to a server by hand:
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    # 1. Connect to the server
    s.connect(("www.seriot.ch", 80))
    # 2. Build the request line; the requested resource is index.php
    request_line = b"GET /index.php HTTP/1.1"
    # 3. Build the request headers; specify the host name
    headers = b"Host: seriot.ch"
    # 4. An empty line marks the end of the request headers
    blank_line = b"\r\n"
    # Join the request line, headers, and empty line with CRLF to form
    # the request message, then send it to the server
    message = b"\r\n".join([request_line, headers, blank_line])
    s.sendall(message)
    # Read the first 1024 bytes of the response; it is analyzed later
    response = s.recv(1024)
    print(response)
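HTTP response
After the server receives and processes a request, it returns a response to the client. The response, too, must follow a fixed format so that the browser can parse it. An HTTP response also has three parts, mirroring the request format: the response line, the response headers, and the response body.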
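Response line
The response line (also called the status line) likewise has three fields: the HTTP version the server supports, the status code, and a short reason phrase describing the status code.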
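The status code is an important field of the response line: it tells the client whether the server handled the request normally. 200 means the request was processed successfully, 500 means the server hit an error while processing it, and 404 means the requested resource could not be found on the server. HTTP defines many other status codes as well, but they are outside the scope of this article.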
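The response line can be taken apart the same way as the request line. This is a rough sketch with a made-up helper name, parse_status_line; the standard library's http.HTTPStatus is used only to look up the usual reason phrases for the codes mentioned here.

```python
from http import HTTPStatus


def parse_status_line(line: bytes):
    """Return (version, status code, reason phrase) from a response line."""
    parts = line.decode("ascii").rstrip("\r\n").split(" ", 2)
    version, code = parts[0], int(parts[1])
    # The reason phrase may be missing, so fall back to an empty string.
    reason = parts[2] if len(parts) > 2 else ""
    return version, code, reason


version, code, reason = parse_status_line(b"HTTP/1.1 200 OK\r\n")
print(version, code, reason)

# Standard reason phrases for the status codes discussed in the text.
for status in (200, 404, 500):
    print(status, HTTPStatus(status).phrase)
```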
Response headers
Response headers work like request headers and supplement the response. They can tell the client what data type the response body is, when the response was generated, whether the body is compressed, and when the body was last modified.
Response body
The response body is the actual content returned by the server. It can be an HTML page, an image, a video, and so on.
Continuing with the earlier example, let's look at what the server returned. Because I only read the first 1024 bytes, part of the response is not shown.
b'HTTP/1.1 200 OK\r\n
Date: Tue, 04 Apr 2017 16:22:35 GMT\r\n
Server: Apache\r\n
Expires: Thu, 19 Nov 1981 08:52:00 GMT\r\n
Set-Cookie: PHPSESSID=66bea0a1f7cb572584745f9ce6984b7e; path=/\r\n
Transfer-Encoding: chunked\r\n
Content-Type: text/html; charset=UTF-8\r\n\r\n118d\r\n
\n\n
\n
\n\t
    \n\t
  \n\t
...
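The output is cut off because a single recv(1024) call returns at most 1024 bytes. One way to read everything is to keep calling recv until it returns empty bytes, as sketched below. recv_all is a hypothetical helper, and the loop only ends once the peer closes the connection, so with a real server the request would likely need a "Connection: close" header. The demonstration uses a local socket pair in place of a real server.

```python
import socket


def recv_all(sock: socket.socket, bufsize: int = 1024) -> bytes:
    """Read from sock until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:  # empty bytes means the peer has closed
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Local stand-in for a server: one end writes a response and closes.
server, client = socket.socketpair()
payload = b"HTTP/1.1 200 OK\r\n\r\n" + b"x" * 3000
server.sendall(payload)
server.close()

data = recv_all(client)
client.close()
print(len(data))
```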