
Querying Google keyword rankings with Python and shell

Source: 岁月联盟  Editor: exp  Date: 2012-05-16

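Recently my wife's company gave her the job of looking up where websites rank on Google for given keywords. The company does SEO, so there are a great many keywords and sites to check. Watching her repeat the same queries by hand was painful, so I spent some time writing a Python script that looks up a site's ranking for a keyword.
Before writing it, I searched online for existing scripts that check Google rankings. Many of them use Google's API, but when I tested them the results were inaccurate. So I wrote my own.
The script is below. (I picked a few keywords from the web at random for testing.) It is written for Python 2, since it relies on urllib2 and cookielib.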
#vim keyword.py
import urllib,urllib2,cookielib,re,sys,os,time,random
cj = cookielib.CookieJar()
vibramkey=['cheap+five+fingers','vibram+five+fingers']
beatskey=['beats+by+dre','beats+by+dre+cheap']
vibramweb=['vibramforshoes.com','vibramfivetoeshoes.net','vibramfivefingersshoesx.com']
beatsweb=['beatsbydre.com','justlovebeats.com']
allweb=['vibramweb','beatsweb']
def serchkey(key,start):
    url="http://www.google.com/search?hl=en&q=%s&revid=33815775&sa=X&ei=X6CbT4GrIoOeiQfth43GAw&ved=0CIgBENUCKAY&start=%s" %(key,start)
    try:
        opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        opener.addheaders = [('User-agent', 'Opera/9.23')]
        urllib2.install_opener(opener)
        req=urllib2.Request(url)
        response =urllib2.urlopen(req)
        content = response.read()
        f=open('google','w')
        f.write(content)
        f.close()   # flush to disk before grep reads the file
        tiqu=os.popen("grep -ioP '(?<=<cite>).*?(?=</cite>)' google|sed -r 's/(<*\/*cite>|<\/*b>)//g'").readlines()
    except Exception:
        changeip()
    else:
        for yuming in pinpai:
            a=1
            for shouyuming in tiqu:
                real=shouyuming.find(yuming)
                if real>=0:   # find() returns -1 when there is no match; 0 is a match at the start
                    if start==0:
                        page=1
                    elif start==10:
                        page=2
                    elif start==20:
                        page=3
                    elif start==30:
                        page=4
                    else:
                        page=5
                    lastkey=key.replace("+"," ")
                    xinxi="%s\t\t %s\t\t page%s,%s<br>\n" %(yuming,lastkey,page,a)
                    xinxifile=open('index.html','a')
                    xinxifile.write(xinxi)
                    xinxifile.close()
                a=a+1
def changeip():
    ip=random.randint(0,2)
    de="route del -host google.com"
    add="route add -host google.com eth1:%s" %ip
    os.system(de)
    os.system(add)
    print "changip to %s" %ip
pinpaiid=0
for x in vibramkey,beatskey:
    if pinpaiid == 0:
        pinpai=vibramweb
    elif pinpaiid == 1:
        pinpai=beatsweb
    pinpaiid=pinpaiid+1
    for key in x:
        for start in 0,10,20,30,40:
            serchkey(key,start)
    changeip()
os.system("sh paiban.sh")
#vim paiban.sh
#!/bin/bash
sort index.html -o index.html
line=`wc -l index.html|awk '{print $1}'`
yuming2=`sed -n 1p index.html|awk '{print $1}'`
for i in `seq 2 $line`
do
    yuming=`sed -n "$i"p index.html|awk '{print $1}'`
    if [ "$yuming" = "$yuming2" ];then
        sed -i "${i}s/$yuming/\t\t/g" index.html
    else
        yuming2=$yuming
    fi
done
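The script has two parts. The first part is the Python code, which searches Google for each keyword. My wife said the first 5 pages for each keyword are enough, so only those are queried.
The second part formats the results. That is the job of paiban.sh, called at the very end, which arranges the final output like this: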
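As a rough illustration of the "first 5 pages" idea, here is a minimal Python 3 sketch (the original script is Python 2). The names `build_search_url`, `page_for_start` and `RESULTS_PER_PAGE` are hypothetical and do not appear in the original; the sketch also assumes ten organic results per page and keeps only the `hl`, `q` and `start` parameters of the URL.

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # assumption: ten organic results per page


def build_search_url(keyword, start):
    """Build a search URL for `keyword` beginning at result offset `start`."""
    # urlencode turns spaces into '+', so keywords can be written naturally.
    query = urlencode({"hl": "en", "q": keyword, "start": start})
    return "http://www.google.com/search?" + query


def page_for_start(start):
    """Map a result offset (0, 10, 20, ...) to a 1-based page number."""
    return start // RESULTS_PER_PAGE + 1


if __name__ == "__main__":
    # Offsets 0, 10, 20, 30, 40 cover the first five pages.
    for start in range(0, 5 * RESULTS_PER_PAGE, RESULTS_PER_PAGE):
        print(page_for_start(start), build_search_url("beats by dre", start))
```

Computing the page number from the offset gives the same values as an if/elif chain over 0, 10, 20, 30 and 40, and keeps working if more pages are queried later.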
site1        keyword1   page N, rank M
             keyword2   page N, rank M
             keyword3   page N, rank M
site2        keyword1   page N, rank M
             keyword2   page N, rank M
             keyword3   page N, rank M
Now let's walk through the program.
import urllib,urllib2,cookielib,re,sys,os,time,random   # load the modules
cj = cookielib.CookieJar()
vibramkey=['cheap+five+fingers','vibram+five+fingers'] # keyword group 1; each quoted string is one keyword to query
beatskey=['beats+by+dre','beats+by+dre+cheap']        # keyword group 2, another set of keywords
vibramweb=['vibramforshoes.com','vibramfivetoeshoes.net','vibramfivefingersshoesx.com']
# the sites to look for with keyword group 1
beatsweb=['beatsbydre.com','justlovebeats.com'] # the sites to look for with keyword group 2
allweb=['vibramweb','beatsweb']   # a group of all the site lists, for convenience later
# key is the keyword to query and start is the result offset. Apart from ads, a Google
# results page holds only ten records: start=0 shows records 1-10 of page one,
# start=10 shows records 1-10 of page two, and so on.
def serchkey(key,start):
    url="http://www.google.com/search?hl=en&q=%s&revid=33815775&sa=X&ei=X6CbT4GrIoOeiQfth43GAw&ved=0CIgBENUCKAY&start=%s" %(key,start)   # the query URL
    try:
        opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        opener.addheaders = [('User-agent', 'Opera/9.23')] # pretend to be a browser
        urllib2.install_opener(opener)
        req=urllib2.Request(url) # request the URL with urllib2
        response =urllib2.urlopen(req)
        content = response.read() # fetch the page as a browser would and read its source
        f=open('google','w')
        f.write(content) # save what was read to a file named google
        f.close()   # flush to disk before grep reads the file
        # Use system commands here: a regex with zero-width assertions (lookarounds)
        # pulls out the domains of results 1 to 10 directly.
        tiqu=os.popen("grep -ioP '(?<=<cite>).*?(?=</cite>)' google|sed -r 's/(<*\/*cite>|<\/*b>)//g'").readlines()
    except Exception:
        changeip() # Google may block us after too many requests, so if the try fails, switch IP (defined below)
    else:
        for yuming in pinpai:        # loop over the sites we are looking for
            a=1
            for shouyuming in tiqu:   # loop over the sites found in the results
                real=shouyuming.find(yuming)   # compare each found site with the wanted site
                if real>=0:   # find() returns -1 when there is no match; 0 is a match at the start
                    # work out which results page the domain is on
                    if start==0:
                        page=1
                    elif start==10:
                        page=2
                    elif start==20:
                        page=3
                    elif start==30:
                        page=4
                    else:
                        page=5
                    lastkey=key.replace("+"," ") # replace the plus signs in the keyword with spaces
                    print yuming,lastkey,page,a
                    xinxi="%s\t\t %s\t\t page %s,rank %s\n" %(yuming,lastkey,page,a)
                    xinxifile=open('index.html','a')
                    xinxifile.write(xinxi)
                    xinxifile.close() # append the result to index.html
                a=a+1
def changeip():    # switches IP between queries; not needed if the machine has only one IP
    ip=random.randint(0,10)                      # pick a random number from 0 to 10
    de="route del -host google.com"              # command that deletes the route
    add="route add -host google.com eth1:%s" %ip # command that adds the route
    os.system(de)                                # run the delete-route command
    os.system(add)                               # run the add-route command
    print "changip to %s" %ip                    # report the route change
pinpaiid=0
for x in vibramkey,beatskey:          # loop over all keyword groups
    if pinpaiid == 0:                 # match each keyword group with its site group
        pinpai=vibramweb
    elif pinpaiid == 1:
        pinpai=beatsweb
    pinpaiid=pinpaiid+1
    for key in x:                     # loop over the keywords in the group
        for start in 0,10,20,30,40:   # the Google result pages to search
            serchkey(key,start)
    changeip()                        # switch IP after each keyword group is finished
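The grep and sed step in the listing can also be done inside Python, which avoids the temporary file. The following is only a sketch in Python 3, not the author's method: `extract_cites` and `find_ranks` are hypothetical names, and the sketch assumes, as the article does, that each result's domain sits inside a `<cite>` element. Google's markup may well differ today.

```python
import re

# Assumption: result domains appear inside <cite>...</cite>, possibly with <b> tags inside.
CITE_RE = re.compile(r"<cite[^>]*>(.*?)</cite>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def extract_cites(html):
    """Return the text of every <cite> element with nested tags removed."""
    return [TAG_RE.sub("", inner).strip() for inner in CITE_RE.findall(html)]


def find_ranks(cites, domain):
    """Return the 1-based positions at which `domain` appears in the cite list."""
    return [pos for pos, cite in enumerate(cites, start=1) if domain in cite]


if __name__ == "__main__":
    sample = (
        "<li><cite>www.<b>beatsbydre</b>.com/</cite></li>"
        "<li><cite>example.org/page</cite></li>"
        "<li><CITE>beatsbydre.com/shop</CITE></li>"
    )
    cites = extract_cites(sample)
    print(cites)
    print(find_ranks(cites, "beatsbydre.com"))
```

Using `domain in cite` also catches a match at the very start of the string, a case that a `find(...) > 0` test would miss.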
After running the script, let's look at the contents of index.html:
# cat index.html
vibramforshoes.com               cheap five fingers              page 1,rank 3
vibramfivetoeshoes.net           cheap five fingers              page 5,rank 5
vibramforshoes.com               vibram five fingers             page 1,rank 6
vibramfivetoeshoes.net           vibram five fingers             page 5,rank 10
beatsbydre.com                   beats by dre                    page 1,rank 1
justlovebeats.com                beats by dre                    page 5,rank 7
beatsbydre.com                   beats by dre cheap              page 2,rank 2
beatsbydre.com                   beats by dre cheap              page 2,rank 3
beatsbydre.com                   beats by dre cheap              page 5,rank 10

That looks messy. So how do we get the format described earlier, with each site followed by its keywords? This is where the small paiban.sh script comes in. It is placed at the end of the py program, so it runs once the py program finishes. Since paiban.sh is already called from the py program, it does not need to be run separately. Here I mainly wanted to show the difference, so I commented that call out in the py program and ran the script by hand.
# sh paiban.sh
# cat index.html
beatsbydre.com                   beats by dre cheap              page 2,rank 2
                                 beats by dre cheap              page 2,rank 3
                                 beats by dre cheap              page 5,rank 10
                                 beats by dre                    page 1,rank 1
justlovebeats.com                beats by dre                    page 5,rank 7
vibramfivetoeshoes.net           cheap five fingers              page 5,rank 5
                                 vibram five fingers             page 5,rank 10
vibramforshoes.com               cheap five fingers              page 1,rank 3
                                 vibram five fingers             page 1,rank 6

This gives the result described earlier. The layout is clear: which site matches which keyword, on which page and at which position, all at a glance.
Let's also explain the paiban.sh script.
#vim paiban.sh
#!/bin/bash
sort index.html -o index.html                     # sort index.html and write the result back to index.html
line=`wc -l index.html|awk '{print $1}'`          # count the lines
yuming2=`sed -n 1p index.html|awk '{print $1}'`   # put the domain from line 1 into yuming2
for i in `seq 2 $line`                            # read the domains starting from line 2
do
    yuming=`sed -n "$i"p index.html|awk '{print $1}'`
    if [ "$yuming" = "$yuming2" ];then
        sed -i "${i}s/$yuming/\t\t/g" index.html  # if this line's domain equals yuming2, replace it with tabs so it shows as blank
    else
        yuming2=$yuming                           # otherwise store this line's domain in yuming2
    fi
done
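For comparison, the same grouping can be sketched in Python 3 without calling sed once per line. This is an illustrative alternative, not part of the original script. `group_by_site` is a hypothetical name, the sketch assumes each line starts with the domain followed by a tab, and it pads a repeated domain with spaces rather than tabs. Python's `sorted` compares by code point, so the order may differ from `sort` under some locales.

```python
def group_by_site(lines):
    """Sort result lines and blank out a domain that repeats the line above it."""
    grouped = []
    previous = None
    for line in sorted(lines):
        # Assumption: the domain is everything before the first tab.
        domain, sep, rest = line.partition("\t")
        if domain == previous:
            # Same site as the line above: pad with spaces instead of repeating it.
            grouped.append(" " * len(domain) + sep + rest)
        else:
            grouped.append(line)
            previous = domain
    return grouped


if __name__ == "__main__":
    rows = [
        "b.com\t\tkw2\t\tpage 1,rank 3",
        "a.com\t\tkw1\t\tpage 2,rank 1",
        "b.com\t\tkw1\t\tpage 1,rank 2",
    ]
    for row in group_by_site(rows):
        print(row)
```

Working on a list in memory means the file is read and written once, whereas the shell loop rereads index.html for every line.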

Excerpted from 运维人生
