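A friend recently wanted to export the bill records from WeChat. It turned out that WeChat's anti-scraping measures are quite strong, so I spent some time analyzing them.
I. Scraping with plain simulated HTTP requests
The main URL to fetch is https://wx.tenpay.com/userroll/userrolllist, followed by three query parameters (see the code for the details). Of these, exportkey expires. The userroll_encryption and userroll_pass_ticket values have to be taken from the cookie and appear to act as the identifier for fetching data. Packet captures give no hint of how they are produced, so they are presumably generated inside the WeChat client. After logging in with the WeChat developer tools, visiting the URL directly sometimes returns data, but only for a short period. Once a "session timed out" response comes back, further requests from the web page are blocked and keep reporting a session timeout. My guess is that exportkey has different lifetime and request-count limits on the web and on mobile.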
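As a rough illustration of how the request described above is put together, the sketch below builds the first-page URL and the Cookie header. The helper names build_bill_url and build_cookie_header are made up for this example, the parameter name count is an assumption inferred from the variable of the same name in the script, and the cookie values shown are placeholders rather than real credentials.

```python
from urllib.parse import unquote, urlencode

BASE_URL = "https://wx.tenpay.com/userroll/userrolllist"


def build_bill_url(export_key, count=20, sort_type=1):
    """Build the first-page URL (hypothetical helper).

    export_key may be pasted straight from a capture, where it is usually
    percent-encoded, so it is decoded first to avoid double encoding.
    """
    params = {
        "count": count,  # assumed parameter name for the page size
        "exportkey": unquote(export_key),
        "sort_type": sort_type,
    }
    return BASE_URL + "?" + urlencode(params)


def build_cookie_header(encryption, pass_ticket):
    """Join the two cookie values into a single Cookie header string."""
    return ("userroll_encryption=" + encryption
            + "; userroll_pass_ticket=" + pass_ticket)


url = build_bill_url("A%2BsIJaTGZksgZWPLtSKiyos%3D")
cookie = build_cookie_header("ENCRYPTION_PLACEHOLDER", "TICKET_PLACEHOLDER")
print(url)
print(cookie)
```

Going through urlencode keeps characters such as + and = in the key safely escaped, which is easy to get wrong with manual string concatenation.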
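After that I looked into cracking the session and concluded it is not feasible: it would mean breaking wx.login, which WeChat itself provides, and I don't need to spell out how hard that would be.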
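II. Obtaining the exportkey and the Cookie
Tools needed:
1. An Android or Apple phone
2. Fiddler (packet capture tool)
Anyone who has written scrapers knows Fiddler, so I won't go through the steps. With the proxy configured and Fiddler running, capture the exportkey from the URL and the corresponding Cookie; both are used for the data fetching that follows.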
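Once a request has been captured, the two pieces of information can be pulled out of the raw strings programmatically instead of by hand. The following is a minimal sketch under that assumption; extract_exportkey and parse_cookie_header are hypothetical helper names, and the sample URL and cookie values are invented for the demonstration.

```python
from urllib.parse import parse_qs, urlsplit


def extract_exportkey(captured_url):
    """Return the decoded exportkey from a captured URL, or None if absent."""
    query = parse_qs(urlsplit(captured_url).query)
    values = query.get("exportkey")
    return values[0] if values else None


def parse_cookie_header(cookie_header):
    """Split a raw Cookie header into a dict.

    Only the first '=' separates name and value, so base64-style values
    that end in '=' or '==' are kept intact.
    """
    cookies = {}
    for part in cookie_header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep:
            cookies[name] = value
    return cookies


# Invented sample data standing in for what a capture might show.
sample_url = ("https://wx.tenpay.com/userroll/userrolllist"
              "?count=20&exportkey=abc%2Bdef%3D&sort_type=1")
sample_cookie = "userroll_encryption=AAA+bbb/ccc==; userroll_pass_ticket=XYZ/123"

export_key = extract_exportkey(sample_url)
cookies = parse_cookie_header(sample_cookie)
print(export_key)
print(cookies["userroll_encryption"], cookies["userroll_pass_ticket"])
```

Note that extract_exportkey returns the decoded value, so it has to be percent-encoded again (for example with urllib.parse.quote or urlencode) before being placed back into a URL.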
III. The code
The code isn't written very well; if you spot mistakes, corrections are welcome.
# coding: utf-8
import datetime
import io
import json
import socket
import ssl
import sys
import time
import urllib.error
import urllib.request

from DBController import DBController  # the author's own database helper

# Set the console output encoding
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
# Work around untrusted SSL certificates on HTTPS (this disables verification)
ssl._create_default_https_context = ssl._create_unverified_context

BASE_URL = "https://wx.tenpay.com/userroll/userrolllist"


class MainCode:
    def __init__(self, url=""):
        self.url = url
        self.dbController = DBController()  # database controller
        self.userroll_encryption = "uoxQXsCenowxj0G0ppRKBg8iHRPZwZKaUZB0ka1Y5apUuQnKkZTsA/2RMhBPGyMdiHS8QXk8y2JeLgqTPqZPU9fkrCUp+TIQPkHH/uExAwKeBFLute0ztdHaC6GJUJ2+/R8NGWGe16hSKc6L1+LvAw=="
        self.userroll_pass_ticket = "V7oum4glDbdaAwibC8mcuTizGIKmC9A/Y/V12qASuDALdRMveHcRHv1QXamFk27Z"
        self.last_item = {}  # paging cursor: fields of the last record seen
        self.num = 0

    # Fetch the URL and return the parsed JSON (an empty dict on failure)
    def get_html(self, url, maxTryNum=5):
        obj = {}
        result = None
        # Accept-Encoding is left out on purpose: urllib does not decompress
        # gzip/br responses, so requesting them would break decode() below.
        header = {
            "Accept": 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            "Accept-Language": 'zh-CN,zh;q=0.8',
            "Cache-Control": 'max-age=0',
            "Connection": "keep-alive",
            "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_2 like Mac OS X) AppleWebKit/602.3.12 (KHTML, like Gecko) Mobile/14C92 Safari/601.1 wechatdevtools/1.02.1810240 MicroMessenger/6.5.7 Language/zh_CN webview/15415760070117398 webdebugger port/32594",
            "Cookie": "userroll_encryption=" + self.userroll_encryption + "; userroll_pass_ticket=" + self.userroll_pass_ticket,
            "Host": "wx.tenpay.com",
            "Upgrade-Insecure-Requests": "1",
        }
        for tryNum in range(maxTryNum):
            try:
                req = urllib.request.Request(url=url, headers=header)
                result = urllib.request.urlopen(req, timeout=5).read()
                break
            except (urllib.error.URLError, socket.timeout):
                # URLError also covers HTTPError; timeouts are retried as well
                if tryNum < (maxTryNum - 1):
                    print("Retrying request, attempt " + str(tryNum + 1))
                    time.sleep(5)
                else:
                    print('Internet Connect Error!', "Error URL:" + url)
        if result is not None:
            obj = json.loads(result.decode('utf-8'))
        else:
            print("--------------------------")
        return obj

    # Save one record to the database, skipping trans_ids already stored
    def save_info_to_db(self, item):
        # NOTE: values are interpolated directly into the SQL text, as in the
        # original; prefer parameterized queries if DBController supports them.
        select_sql = "SELECT count(*)as num FROM wx_order2 where trans_id = '%s'" % (item["trans_id"])
        results = self.dbController.ExecuteSQL_Select(select_sql)
        if int(results[0][0]) == 0:
            sql = "INSERT INTO wx_order2 (bill_id, bill_type, classify_type, fee, fee_type, out_trade_no, pay_bank_name, payer_remark, remark, order_time, title, total_refund_fee, trans_id,fee_attr) VALUES ( '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s','%s','%s')" % (
                str(item['bill_id']), str(item['bill_type']), str(item['classify_type']),
                str(item['fee']), str(item['fee_type']), str(item['out_trade_no']),
                str(item['pay_bank_name']), str(item['payer_remark']), str(item['remark']),
                str(item['order_time']), str(item['title']), str(item['total_refund_fee']),
                str(item['trans_id']), str(item['fee_attr'])
            )
            try:
                self.dbController.ExecuteSQL_Insert(sql)
            except Exception as e:
                print("save_info_to_db:", e)

    # Extract the needed fields from the response; returns the record count
    def get_data(self, url):
        res_obj = self.get_html(url)
        this_page_num = 0
        # ret_code == 0 means the data was fetched successfully
        if res_obj.get('ret_code') == 0:
            record_list = res_obj['record']
            self.last_bill_id = res_obj['last_bill_id']
            self.last_bill_type = res_obj['last_bill_type']
            self.last_create_time = res_obj['last_create_time']
            self.last_trans_id = res_obj['last_trans_id']
            num = 1
            this_page_num = len(record_list)
            # bill_type -> label: payment, top-up, transfer, red packet
            pay_types = {1: "支付", 2: "充值", 4: "转账", 6: "红包"}
            # fee_attr -> label: income, expense, withdrawal
            # ("negtive" is spelled as in the original code)
            fee_attrs = {"positive": "收入", "negtive": "支出", "neutral": "提现"}
            for order in record_list:
                bill_id = order['bill_id']
                bill_type = order['bill_type']
                trans_id = order['trans_id']
                fee = round(order['fee'] * 0.01, 2)  # amount, kept to two decimals
                order_time = datetime.datetime.fromtimestamp(order['timestamp'])
                # Strip ASCII commas, periods and single quotes from the title
                title = order['title'].replace(',', '').replace('.', '').replace("'", '')
                pay_type = pay_types.get(bill_type, str(bill_type))
                fee_attr = fee_attrs.get(order['fee_attr'], order['fee_attr'])
                item = {
                    'bill_id': bill_id,
                    'bill_type': bill_type,
                    'classify_type': order['classify_type'],
                    'fee': fee,
                    'fee_type': order['fee_type'],            # amount type
                    'out_trade_no': order['out_trade_no'],    # bill number
                    'pay_bank_name': order['pay_bank_name'],  # paying bank
                    'payer_remark': order['payer_remark'],    # payment note
                    'remark': order['remark'],                # bill note
                    'order_time': order_time,
                    'title': title,
                    'total_refund_fee': "0",
                    'trans_id': trans_id,
                    'fee_attr': fee_attr,
                }
                if bill_id != '':
                    self.last_item['last_bill_id'] = bill_id
                    self.last_item['last_bill_type'] = bill_type
                    self.last_item['last_create_time'] = order['timestamp']
                    self.last_item['last_trans_id'] = trans_id
                try:
                    print(str(self.num), self.last_item, end='\n')
                    self.num += 1
                    time.sleep(0.2)
                    self.save_info_to_db(item)
                    # print(str(num) + " time:" + str(order_time) + " title:" + title + " type:" + str(pay_type) + " amount:" + str(fee))
                except Exception as e:
                    print(e, end='\n')
                num = num + 1
        else:  # fetching failed: print the response to show why
            print(res_obj)
        return this_page_num


# Instantiate
maincode = MainCode()
# Set the Cookie parameters
maincode.userroll_encryption = "6Ow68aKrAz70mEczqeevA2gOXbr9H2a7+2ite6uuyWFdB6j1+SLhlaCNpYA6RjmaOI7IfCi9PXjQsrZPFIs1SMn38Uxr04GJsxMuSO/9wG+eBFLute0ztdHaC6GJUJ2+vmo+JIw351su8RiFxSagwA=="
maincode.userroll_pass_ticket = "i0Co+55KSEjmFjfFZqMG14hasW4qtKFtbj0FiErcSzHY0afkFqHGib3YfsAZWcaG"
# Used when starting from a page other than the first
# maincode.last_item['last_bill_id'] = "2ce3d65b20a10700b2048d68"
# maincode.last_item['last_bill_type'] = "4"
# maincode.last_item['last_create_time'] = "1540809516"
# maincode.last_item['last_trans_id'] = "1000050201201810290100731805325"
# Number of records returned per request
count = "20"
# exportkey has to be captured with Fiddler and is only valid for a limited time
exportkey = "A%2BsIJaTGZksgZWPLtSKiyos%3D"
# URL of the first page (the original concatenation was missing "?count=")
first_page_url = BASE_URL + "?count=" + count + "&exportkey=" + exportkey + "&sort_type=1"
for page in range(0, 10):
    # Number of records returned for the current page
    this_page_num = 0
    if page == 0:
        # First page
        this_page_num = maincode.get_data(first_page_url)
    else:
        # From the second page on, pass fields of the previous page's last item
        url = (first_page_url
               + "&last_bill_id=" + str(maincode.last_item['last_bill_id'])
               + "&last_bill_type=" + str(maincode.last_item['last_bill_type'])
               + "&last_create_time=" + str(maincode.last_item['last_create_time'])
               + "&last_trans_id=" + str(maincode.last_item['last_trans_id'])
               + "&start_time=" + str(maincode.last_item['last_create_time']))
        print(url)
        this_page_num = maincode.get_data(url)
    # Fewer than 20 records means this was the last page, so stop
    if this_page_num < 20:
        break
    time.sleep(0.5)
print(maincode.last_item)
Since I was only scraping this for a friend, getting it to work was enough. I'll optimize the code further if the need comes up.
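One possible tidy-up is to separate the paging logic from the network and database code. The script pages by passing fields of the last record it has seen (last_bill_id, last_bill_type, last_create_time, last_trans_id, start_time) to the next request and stops at the first short page. The sketch below isolates that scheme; next_page_params, fetch_all and fake_fetch are hypothetical names, and fake_fetch is an in-memory stand-in for the real request so the loop can be run without WeChat credentials.

```python
def next_page_params(base_params, last_record):
    """Return the query parameters for the page that follows last_record."""
    params = dict(base_params)
    params.update({
        "last_bill_id": last_record["bill_id"],
        "last_bill_type": last_record["bill_type"],
        "last_create_time": last_record["timestamp"],
        "last_trans_id": last_record["trans_id"],
        "start_time": last_record["timestamp"],
    })
    return params


def fetch_all(fetch_page, base_params, page_size=20, max_pages=10):
    """Collect records page by page; stop at a short page or after max_pages."""
    records = []
    params = dict(base_params)
    for _ in range(max_pages):
        page = fetch_page(params)
        records.extend(page)
        if len(page) < page_size:
            break
        # Like the script, use the last record that has a non-empty bill_id.
        cursor = next((r for r in reversed(page) if r["bill_id"] != ""), None)
        if cursor is None:
            break
        params = next_page_params(base_params, cursor)
    return records


# In-memory stand-in for the server: 45 fake records served 20 at a time.
FAKE_RECORDS = [
    {"bill_id": "b%d" % i, "bill_type": 1, "timestamp": 1000 - i, "trans_id": "t%d" % i}
    for i in range(45)
]
calls = []


def fake_fetch(params):
    """Return the 20 records that follow params['last_bill_id'], if given."""
    calls.append(dict(params))
    start = 0
    if "last_bill_id" in params:
        ids = [r["bill_id"] for r in FAKE_RECORDS]
        start = ids.index(params["last_bill_id"]) + 1
    return FAKE_RECORDS[start:start + 20]


all_records = fetch_all(fake_fetch, {"count": 20, "sort_type": 1})
print(len(all_records), len(calls))
```

Passing the fetch function in as an argument means the same loop can be exercised against fake data, as here, and then pointed at the real request function once a valid exportkey and Cookie are available.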