
Counting total lines of code on Luogu with the Python requests library

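Overall idea: starting from your personal judge record list (https://www.luogu.com.cn/record/list?user=...), collect the record ID of each problem, then open each record page and add up the number of newline characters in the submitted code.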

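Fetching the page source:

import requests

def getHTMLText(url, cookie):
    # header is the global request-header dict defined below
    try:
        response = requests.get(url, headers=header, cookies=cookie)
        return response.text
    except requests.RequestException:
        # Return an empty string on any network error
        return ''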

The header argument is initialized as a dict:

header = {'User-Agent': 'Mozilla/5.0'}

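To initialize cookie, look in the browser's developer tools (Edge is used here; other browsers are similar):

1. Press F12 to open the developer tools and find the "Application" tab.
2. While logged in to Luogu, click Cookies.
3. Store each name and value in the code as a dict entry.
(The code shows only one pair; put every cookie into the dict.)
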
cookie = {'login_referer':'https%3A%2F%2Fwww.luogu.com.cn%2F'}
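
Typing every cookie by hand is tedious. As a minimal sketch, assuming you copy the raw Cookie request header from the Network tab of the developer tools instead, a small helper can turn it into the dict. The function name parse_cookie_header and the sample values are hypothetical and not part of the original script.

```python
def parse_cookie_header(raw):
    """Turn a raw 'name1=value1; name2=value2' Cookie header into a dict."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair or '=' not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split('=', 1)  # split on the first '=' only
        cookies[name.strip()] = value.strip()
    return cookies


# Hypothetical header value; paste your own here
raw_header = 'login_referer=https%3A%2F%2Fwww.luogu.com.cn%2F; example_name=example_value'
cookie = parse_cookie_header(raw_header)
print(cookie)
```

The resulting dict can be passed to requests.get through the cookies argument in the same way as the hand-written one.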

Now we can look at what the record list page returns (set user= to your own ID):

url = r'https://www.luogu.com.cn/record/list?user=...&page=1'  # change user=
print(getHTMLText(url, cookie))

Inside a <script> tag there is this line:

<script>window._feInjection = JSON.parse(decodeURIComponent("%7B%22code%22%3A200%....

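The argument of decodeURIComponent contains the information we need. Python's execjs library can run the decodeURIComponent function to decode it. (Install the library with pip first; PyExecJS also relies on a JavaScript runtime such as Node.js being available on the machine.)
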
pip install PyExecJS

The decoding part:

import execjs

# Compile a tiny JavaScript wrapper around decodeURIComponent
ctx = execjs.compile("function decode(str){return decodeURIComponent(str);}")

def decode(s):
    return ctx.call("decode", s)
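
If installing a JavaScript runtime for execjs is inconvenient, the standard library's urllib.parse.unquote performs percent-decoding too. The sketch below is an alternative, not the method used in this post: it pulls the argument of decodeURIComponent out of the page with a regular expression, decodes it, and parses it as JSON. The function name extract_injection and the sample HTML are made up for illustration, and the pattern assumes the encoded payload contains no literal double quote.

```python
import json
import re
from urllib.parse import unquote


def extract_injection(html):
    """Return the decoded _feInjection payload as a Python object, or None."""
    # Capture the percent-encoded string passed to decodeURIComponent("...")
    match = re.search(r'decodeURIComponent\("([^"]*)"\)', html)
    if match is None:
        return None
    return json.loads(unquote(match.group(1)))


# Made-up page fragment in the same shape as the real one
sample_html = ('<script>window._feInjection = JSON.parse('
               'decodeURIComponent("%7B%22code%22%3A200%7D"));</script>')
print(extract_injection(sample_html))  # {'code': 200}
```

Working on the parsed object rather than on the raw text also means later steps can read fields by key instead of by pattern matching.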

Decoding the string fetched above gives text of this form:

"problem":{"pid":"P7074","title":"[CSP-J2020] \u65b9\u683c\u53d6\u6570","difficulty":3,"fullScore":100,"type":"P"},"contest":null,"sourceCodeLength":894,"submitTime":1627265394,"language":3,"user":{"uid":xxx,"name":"xxx","slogan":"","badge":null,"isAdmin":false,"isBanned":false,"color":"xxx","ccfLevel":0},"id":xxx,"status":12,"enableO2":false,"score":100},{"time":xxx,"memory":xxx,

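So a regular expression can find every pid and id, and the line counts can be added up once per distinct pid.

First, the code that counts the lines of a given record page
(record pages look like https://www.luogu.com.cn/record/....):

def getLineOf(url):
    text = getHTMLText(url, cookie)
    # Newlines in the code appear as a literal backslash + n in the decoded text
    return decode(text).count(r'\n')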
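
The count uses the raw string r'\n' because the decoded page is still JSON text, in which every newline of the submitted code appears as the two characters backslash and n. A small illustration with made-up source code follows; the key name sourceCode is an assumption about the record page and is not confirmed by the text above.

```python
import json

# Made-up record; the key name "sourceCode" is an assumption
source = 'int main() {\n    printf("hi\\n");\n}\n'
json_text = json.dumps({"sourceCode": source})  # stands in for the decoded page text

escaped = json_text.count(r'\n')                        # backslash + n pairs in the JSON text
real = json.loads(json_text)["sourceCode"].count('\n')  # real newlines after parsing

print(escaped, real)  # 4 3
```

The two counts differ when the code itself contains the escape \n inside a string literal, so parsing the JSON first gives the more exact figure.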

Next, the code that counts the lines for one page of the record list (again, change the user ID to your own):

import re

vis = []  # pids that have already been counted

def getLineOfPage(page):
    url = r'https://www.luogu.com.cn/record/list?user=xxx&page=' + str(page)  # change user=...
    text = getHTMLText(url, cookie)
    text = decode(text)
    # print(text)
    m = re.findall(r'"id":(.+?),', text)    # record IDs
    n = re.findall(r'"pid":(.+?),', text)   # problem IDs
    ret = 0

    for pointer in range(len(n)):
        if n[pointer] in vis:
            continue  # this problem has already been counted
        eachUrl = r'https://www.luogu.com.cn/record/' + m[pointer]
        vis.append(n[pointer])
        t = getLineOf(eachUrl)
        print("%s--%d Lines" % (n[pointer], t))
        ret = ret + t

    return ret

After matching, the id list should have one more entry than the pid list, because of one stray item at the very end (probably theme information):

"currentTheme":{"id":1,"header":{"imagePath":null,"color":...

So the loop simply runs over the length of the list n in the code.
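
Pairing two separately matched lists works only while they stay aligned. As an alternative sketch, assuming the decoded text has been parsed with json.loads, each record can be located as a dict that carries both a "problem" entry and an "id", so unrelated objects such as currentTheme are skipped automatically. The helper name iter_records and the sample data are illustrative; the exact nesting of the real payload is not assumed.

```python
def iter_records(node):
    """Yield (pid, record_id) for every dict that looks like a judge record."""
    if isinstance(node, dict):
        problem = node.get("problem")
        if isinstance(problem, dict) and "pid" in problem and "id" in node:
            yield problem["pid"], node["id"]
        for value in node.values():
            yield from iter_records(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_records(item)


# Made-up payload shaped like the decoded sample above
payload = {
    "records": [
        {"problem": {"pid": "P7074"}, "user": {"uid": 1}, "id": 111},
        {"problem": {"pid": "P1001"}, "user": {"uid": 1}, "id": 222},
    ],
    "currentTheme": {"id": 1, "header": {"color": None}},
}
print(list(iter_records(payload)))  # [('P7074', 111), ('P1001', 222)]
```

Because each pid is read from the same dict as its id, the two values cannot drift apart the way two regex result lists can.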
Complete code (written in a hurry and not polished; before running, replace every user=... mentioned in the comments, the maximum page number, and the cookie with your own):

import re

import execjs
import requests

vis = []  # pids that have already been counted
ctx = execjs.compile("function decode(str){return decodeURIComponent(str);}")
cookie = {}  # fill in your own cookies
header = {'User-Agent': 'Mozilla/5.0'}


def decode(s):
    return ctx.call("decode", s)


def getHTMLText(url, cookie):
    try:
        response = requests.get(url, headers=header, cookies=cookie)
        return response.text
    except requests.RequestException:
        return ''


def getLineOf(url):
    text = getHTMLText(url, cookie)
    return decode(text).count(r'\n')


def getLineOfPage(page):
    url = r'https://www.luogu.com.cn/record/list?user=...&page=' + str(page)  # change user
    text = getHTMLText(url, cookie)
    text = decode(text)
    # print(text)
    m = re.findall(r'"id":(.+?),', text)
    n = re.findall(r'"pid":(.+?),', text)
    ret = 0

    for pointer in range(len(n)):
        if n[pointer] in vis:
            continue
        eachUrl = r'https://www.luogu.com.cn/record/' + m[pointer]
        vis.append(n[pointer])
        t = getLineOf(eachUrl)
        print("%s--%d Lines" % (n[pointer], t))
        ret = ret + t

    return ret


maxpage = 1  # set to the number of pages shown on your record list

now = 1
cnt = 0

while now <= maxpage:
    cnt = cnt + getLineOfPage(now)
    now = now + 1

print('Total lines of code:')
print(cnt)

Sample run (only one page was crawled, and one problem on that page had been submitted many times, so there are only five entries).