  • A Small Crawler Exercise

    Modules used: requests, lxml, pymongo, time, BeautifulSoup
    
    First, grab the category URLs for all the products:
    
    def step():
        try:
            headers = {
                # ... (request headers elided in the original post)
            }
            # headers must be passed as a keyword argument, otherwise
            # requests treats it as the params argument
            r = requests.get(url, headers=headers, timeout=30)
            html = r.content
            soup = BeautifulSoup(html, "lxml")
            # the tag pattern/regex is elided in the original post; also,
            # don't reuse the name `url` here or the concatenation below breaks
            blocks = soup.find_all(pattern)
            for i in blocks:
                for j in i.find_all('a'):
                    step1url = url + j['href']
                    print step1url
                    step2(step1url)
        except Exception, e:
            print e
     
    
    While walking the product categories we also need to determine whether the address we are visiting is a product page or yet another page of category links (so we test whether the page contains the marker checked in the if branch):
    
    def step2(step1url):
        try:
            headers = {
                # ... (request headers elided in the original post)
            }
            r = requests.get(step1url, headers=headers, timeout=30)
            html = r.content
            soup = BeautifulSoup(html, "lxml")
            # this div only appears when the page is yet another category list
            a = soup.find('div', id='divTbl')
            if a:
                tabs = soup.find_all('td', class_='S-ITabs')
                for i in tabs:
                    for j in i.find_all('a'):
                        # build the next url from the page address,
                        # not from the find_all() result
                        step2url = step1url + j['href']
                        #print step2url
                        step3(step2url)
            else:
                postdata(step1url)
        except Exception, e:
            print e
    When the if test is true we collect the next level of category URLs (back to the first step); otherwise we run the postdata function, which scrapes the product URLs off the page (the scraping itself is done by producturl, shown below):
    
    def producturl(url):
        try:
            # `doc` is the page parsed with lxml.html (see below); both
            # XPath expressions are elided in the original post
            p1url = doc.xpath(row_xpath)
            for i in xrange(1, len(p1url) + 1):
                p2url = doc.xpath(link_xpath)
                if len(p2url) > 0:
                    # don't shadow the function name with the local variable
                    product_url = url + p2url[0].get('href')
                    # skip urls that are already stored
                    count = db[table].find({'url': product_url}).count()
                    if count <= 0:
                        sn = getNewsn()
                        db[table].insert({"sn": sn, "url": product_url})
                        print str(sn) + ' inserted successfully'
                    else:
                        print 'url exists'
        except Exception, e:
            print e
    The product URLs we collected are stored into MongoDB, with sn serving as a new id for each address.
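
    getNewsn() itself never appears in the post. A minimal sketch of one way to implement it, assuming a MongoDB counters collection (the collection name, the key, and the pymongo counter pattern are my assumptions, not the author's code):

    from pymongo import ReturnDocument

    def getNewsn():
        # atomically increment and return the next sequence number;
        # 'counters' and 'product_sn' are hypothetical names
        ret = db['counters'].find_one_and_update(
            {'_id': 'product_sn'},
            {'$inc': {'seq': 1}},
            upsert=True,
            return_document=ReturnDocument.AFTER)
        return ret['seq']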
    
    Next we need to read the URLs back out of MongoDB by that new id index, visit each one, parse and scrape the product data, and update the database with it!
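
    The post never shows that read-back loop, so here is a minimal sketch of what it could look like, assuming the same db/table globals and the parser() function defined below (run_parser is a name of mine):

    import pymongo

    def run_parser():
        # walk every stored product url in sn order and hand it to parser()
        for doc in db[table].find({}, {'sn': 1, 'url': 1}).sort('sn', pymongo.ASCENDING):
            parser(doc['sn'], doc['url'])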
    
    BeautifulSoup is the module used most here, but it is clumsy for valuable data that lives inside JavaScript, so for data in JS I recommend XPath; to use XPath you first have to parse the page with lxml's html.document_fromstring() (note that it parses the downloaded HTML string, not the URL).
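
    For illustration, this is roughly what the lxml + XPath route looks like; the URL, headers, and XPath expression below are placeholders of mine, since the post elides the real ones:

    import requests
    from lxml import html as lxml_html

    url = 'http://www.example.com/product'        # hypothetical
    headers = {'User-Agent': 'Mozilla/5.0'}       # hypothetical

    r = requests.get(url, headers=headers, timeout=30)
    # document_fromstring() parses the downloaded HTML string, not the URL itself
    doc = lxml_html.document_fromstring(r.content)
    # hypothetical expression -- the real XPath is elided in the post
    links = doc.xpath('//div[@id="divTbl"]//a/@href')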
    
    Be very careful when extracting valuable data with XPath! If you want to learn more about XPath, leave a comment below and I will answer as soon as I can!
    
    def parser(sn, url):
        try:
            headers = {
                # ... (request headers elided in the original post)
            }
            r = requests.get(url, headers=headers, timeout=30)
            html = r.content
            soup = BeautifulSoup(html, "lxml")
            dt = {}
            # part number
            a = soup.find("meta", itemprop="mpn")
            if a:
                dt['partno'] = a['content']
            # manufacturer
            b = soup.find("meta", itemprop="manufacturer")
            if b:
                dt['manufacturer'] = b['content']
            # description
            c = soup.find("span", itemprop="description")
            if c:
                dt['description'] = c.get_text().strip()
            # price table: quantity -> unit price
            price = soup.find("table", class_="table table-condensed occalc_pa_table")
            if price:
                cost = {}
                for i in price.find_all('tr'):
                    if len(i) > 1:
                        td = i.find_all('td')
                        key = td[0].get_text().strip().replace(',', '')
                        # strip the euro sign from the price
                        val = td[1].get_text().replace(u'\u20ac', '').strip()
                        if key and val:
                            cost[key] = val
                if cost:
                    dt['cost'] = cost
                    dt['currency'] = 'EUR'
            # quantity
            d = soup.find("input", id="ItemQuantity")
            if d:
                dt['quantity'] = d['value']
            # specs: zip the <dt> labels with the <dd> values
            e = soup.find("div", class_="row parameter-container")
            if e:
                key1 = []
                val1 = []
                for k in e.find_all('dt'):
                    key = k.get_text().strip().strip('.')
                    if key:
                        key1.append(key)
                for i in e.find_all('dd'):
                    val = i.get_text().strip()
                    if val:
                        val1.append(val)
                # keep this inside `if e:` so specs is never unbound
                specs = dict(zip(key1, val1))
                if specs:
                    dt['specs'] = specs
                    print dt

            if dt:
                db[table].update({'sn': sn}, {'$set': dt})
                print str(sn) + ' updated successfully'
                time.sleep(3)
            else:
                error(str(sn) + '\t' + url)
        except Exception, e:
            error(str(sn) + '\t' + url)
            print 'No data!'
    Finally, run the whole program: the valuable data is parsed, processed, and stored into the database!
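
    The snippets above all rely on a few globals and helpers (url, db, table, error()) that the post never shows. A minimal sketch of that glue plus an entry point, with every name and connection detail assumed by me:

    # -*- coding: utf-8 -*-
    import pymongo

    client = pymongo.MongoClient('localhost', 27017)   # hypothetical server
    db = client['crawler']                             # hypothetical database name
    table = 'products'                                 # hypothetical collection name
    url = 'http://www.example.com'                     # hypothetical start url

    def error(msg):
        # append the failed sn/url pair to a log file for a later retry
        with open('error.log', 'a') as f:
            f.write(msg + '\n')

    if __name__ == '__main__':
        step()         # crawl the category tree and store the product urls
        run_parser()   # then parse every stored product page (sketched earlier)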
    

      
