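In April I'm setting myself the task of writing a series on web crawlers, going from theory to practice, so that readers learn not only what works but also why it works. I hope to keep it thorough yet easy to follow. With crawlers as the main thread, the series will cover the HTTP protocol, regular expressions, the Scrapy crawler framework, message queues, databases, and more.
Subscribe to the WeChat official account "Python之禅" for the latest posts.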
Crawling is the process of imitating a browser making HTTP requests.
What is the HTTP protocol?
Every web page you browse is delivered over HTTP. HTTP is the protocol that clients (browsers) and servers use to exchange data in Internet applications. It specifies the format in which a client must send a request to the server, and it also fixes the format of the response the server sends back.
As long as everyone sends requests and returns responses the way the protocol prescribes, anyone can build their own Web client (a browser, a crawler) or Web server (Nginx, Apache, and so on) on top of HTTP.
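HTTP itself is very simple. It specifies that only the client may initiate a request; the server receives the request, processes it, and returns a response. HTTP is also a stateless protocol: the protocol itself keeps no record of a client's earlier requests.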
So how does HTTP define the request and response formats? In other words, what format must a client follow to send a valid HTTP request, and what format must the server follow so that the client can parse the response correctly?
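HTTP Request
An HTTP request consists of three parts: the request line, the request headers, and the request body. The headers and the body are optional in the sense that not every request needs them (although HTTP/1.1 does require a Host header).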
Request line
The request line is the one part every request must have. It consists of three fields separated by spaces: the request method, the request URL (URI), and the HTTP protocol version.
The most commonly used HTTP request methods are GET, POST, PUT, and DELETE. The GET method retrieves a resource from the server, and around 90% of crawlers fetch their data with GET requests.
The request URL is the path of the resource on the server. In the example in the figure above, the client wants the resource index.html, which sits under the root directory (/) of the server foofish.net.
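To make the three fields of the request line concrete, here is a minimal sketch that splits one apart. The function name parse_request_line is made up for this illustration, and the sample line assumes the index.html request described above.

```python
def parse_request_line(line: str) -> tuple:
    """Split an HTTP request line into (method, target, version)."""
    # The three fields are separated by single spaces.
    parts = line.rstrip("\r\n").split(" ")
    if len(parts) != 3:
        raise ValueError(f"malformed request line: {line!r}")
    method, target, version = parts
    return method, target, version


print(parse_request_line("GET /index.html HTTP/1.1"))
# ('GET', '/index.html', 'HTTP/1.1')
```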
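Request headers
The request line can carry very little information, so everything else the client wants to tell the server has to go into the request headers. Headers give the server additional information. For example, User-Agent identifies the client, letting the server know whether a request comes from a browser or a crawler, and from Chrome or Firefox. HTTP/1.1 defines 47 header fields. The format of a header field looks much like a Python dictionary: it is a key-value pair, with a colon between the key and the value. For example: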
User-Agent: Mozilla/5.0
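Since header fields resemble dictionary entries, a Python dict can be turned into header lines almost directly. The sketch below assumes a hypothetical helper named format_headers; the Host value is only an example.

```python
def format_headers(headers: dict) -> bytes:
    """Serialize a dict of header fields into 'Name: value' lines."""
    # Each header line is terminated by CRLF.
    lines = [f"{name}: {value}\r\n" for name, value in headers.items()]
    return "".join(lines).encode("ascii")


headers = {"Host": "foofish.net", "User-Agent": "Mozilla/5.0"}
print(format_headers(headers))
# b'Host: foofish.net\r\nUser-Agent: Mozilla/5.0\r\n'
```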
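Because the data a client sends (the message) is just one long string, something has to mark where the headers end and the request body begins. That marker is a blank line: when a blank line is encountered, the headers are over and the request body starts.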
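The role of the blank line is easy to see when a raw message is cut at it. This is a rough sketch only: split_head_and_body is a hypothetical name, and the /search request is invented sample data.

```python
def split_head_and_body(raw: bytes):
    """Cut a raw HTTP message at the first blank line."""
    # CRLF CRLF = end of the last header line + the blank line itself.
    head, separator, body = raw.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no blank line found: the header section is incomplete")
    lines = head.decode("iso-8859-1").split("\r\n")
    # First line is the request line, the rest are header lines.
    return lines[0], lines[1:], body


raw = (
    b"POST /search HTTP/1.1\r\n"
    b"Host: foofish.net\r\n"
    b"Content-Length: 8\r\n"
    b"\r\n"
    b"q=python"
)
start_line, header_lines, body = split_head_and_body(raw)
print(start_line)
print(header_lines)
print(body)
```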
Request body
The request body is the actual content the client submits to the server: the username and password sent at login, the data of a file upload, or the form fields submitted when registering a user.
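As an illustration of a request that carries a body, the sketch below assembles a login form submission by hand. The helper build_post_request, the /login path, and the credentials are all assumptions made up for this example; Content-Type and Content-Length are the standard headers that describe the body.

```python
from urllib.parse import urlencode


def build_post_request(host: str, path: str, form: dict) -> bytes:
    """Assemble a POST request whose body is URL-encoded form data."""
    body = urlencode(form).encode("ascii")
    head = "\r\n".join([
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        # Content-Length tells the server how many body bytes follow.
        f"Content-Length: {len(body)}",
        "",  # ends the last header line
        "",  # the blank line separating headers from body
    ])
    return head.encode("ascii") + body


message = build_post_request(
    "foofish.net", "/login", {"username": "alice", "password": "secret"}
)
print(message.decode("ascii"))
```

Nothing is sent over the network here; the function only builds the bytes that a client would write to the socket.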
Now let's use the socket module, the most primitive API Python offers, to simulate sending an HTTP request to a server:
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    # 1. Establish a connection to the server
    s.connect(("www.seriot.ch", 80))
    # 2. Build the request line; the requested resource is index.php
    request_line = b"GET /index.php HTTP/1.1"
    # 3. Build the request headers, specifying the host name
    headers = b"Host: seriot.ch"
    # 4. A blank line marks the end of the request headers
    blank_line = b"\r\n"
    # Join the request line, headers, and blank line with CRLF
    # to form the request message, then send it to the server
    message = b"\r\n".join([request_line, headers, blank_line])
    s.sendall(message)
    # Read up to 1024 bytes of the server's response;
    # its content is analysed later
    response = s.recv(1024)
    print(response)
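A single recv(1024) call returns at most 1024 bytes, so a longer response is cut short. One possible way to read everything is to loop until the peer closes the connection, as sketched below. The helper recv_all is hypothetical, and the approach assumes the server actually closes the connection when it is done (for instance because the request carried a "Connection: close" header); otherwise the loop would block. The demo uses a local socket pair instead of a real server.

```python
import socket


def recv_all(sock: socket.socket, chunk_size: int = 4096) -> bytes:
    """Read from sock until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:  # empty bytes: the peer has closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Local demonstration: one end plays the server, the other the client.
server_end, client_end = socket.socketpair()
server_end.sendall(b"HTTP/1.1 200 OK\r\n\r\nhello")
server_end.close()
print(recv_all(client_end))
client_end.close()
```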
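HTTP Response
After the server receives and processes the request, it returns the response to the client. The response likewise has to follow a fixed format so that the browser can parse it correctly. An HTTP response also consists of three parts: the response line (status line), the response headers, and the response body, mirroring the format of an HTTP request.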
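To show how the three parts of a response line up with those of a request, here is a minimal parsing sketch. The function parse_response and the sample response bytes are invented for illustration; a real client would also have to handle details such as chunked bodies and repeated header fields.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into status line fields, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    # Status line: protocol version, status code, optional reason phrase.
    parts = status_line.split(" ", 2)
    version, status_code = parts[0], int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalise them to lowercase.
        headers[name.strip().lower()] = value.strip()
    return version, status_code, reason, headers, body


raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
version, status_code, reason, headers, body = parse_response(raw)
print(version, status_code, reason)
print(headers)
print(body)
```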