Commit
Site updated: 2023-07-05 14:01:14
cxzlw committed Jul 5, 2023
1 parent 33773b7 commit be86817
Showing 21 changed files with 50 additions and 504 deletions.
@@ -7,8 +7,8 @@

<head>
<meta charset="UTF-8">
<link rel="apple-touch-icon" sizes="76x76" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="icon" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="apple-touch-icon" sizes="76x76" href="/favicon.ico">
<link rel="icon" href="/favicon.ico">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=5.0, shrink-to-fit=no">
<meta http-equiv="x-ua-compatible" content="ie=edge">

@@ -19,18 +19,18 @@
<meta name="description" content="近些阵子,知乎上线了针对专栏中盐选文章的反爬系统,随后该系统也被运用在知乎回答页面中的盐选文章上。具体表现为爬取的文章内容中出现大量的错乱词汇。而在本篇文章中,我们将一步步带领各位解开这些乱码。在这个过程中,我们将对字体反爬有更深入的认识,并学到运用字体反爬时需要注意的问题。">
<meta property="og:type" content="article">
<meta property="og:title" content="聊聊知乎盐选反爬 (回答页篇)">
<meta property="og:url" content="https://blog.cxzlw.top/2023/07/05/zhihu-aac-old/index.html">
<meta property="og:url" content="https://blog.cxzlw.top/2023/07/04/zhihu-aac-old/index.html">
<meta property="og:site_name" content="创新者.老王的博客">
<meta property="og:description" content="近些阵子,知乎上线了针对专栏中盐选文章的反爬系统,随后该系统也被运用在知乎回答页面中的盐选文章上。具体表现为爬取的文章内容中出现大量的错乱词汇。而在本篇文章中,我们将一步步带领各位解开这些乱码。在这个过程中,我们将对字体反爬有更深入的认识,并学到运用字体反爬时需要注意的问题。">
<meta property="og:locale" content="zh_CN">
<meta property="og:image" content="https://blog.cxzlw.top/imgs/image.png">
<meta property="article:published_time" content="2023-07-05T01:49:31.000Z">
<meta property="article:modified_time" content="2023-07-05T13:41:49.432Z">
<meta property="og:image" content="https://blog.cxzlw.top/img/image.png">
<meta property="article:published_time" content="2023-07-04T17:49:31.000Z">
<meta property="article:modified_time" content="2023-07-05T14:00:59.248Z">
<meta property="article:author" content="cxzlw">
<meta property="article:tag" content="知乎">
<meta property="article:tag" content="反爬">
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:image" content="https://blog.cxzlw.top/imgs/image.png">
<meta name="twitter:image" content="https://blog.cxzlw.top/img/image.png">


<meta name="referrer" content="no-referrer-when-downgrade">
@@ -288,7 +288,7 @@ <h1 id="seo-header">A look at Zhihu Yanxuan anti-scraping (answer page edition)</h1>

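<p>Recently, Zhihu rolled out an anti-scraping system for Yanxuan (盐选) articles in its columns<sup id="fnref:1" class="footnote-ref"><a href="#fn:1" rel="footnote"><span class="hint--top hint--rounded" aria-label="专栏反爬现已更新,故本文只以回答反爬为演示。">[1]</span></a></sup>, and the same system was later applied to Yanxuan articles on Zhihu answer pages. The visible symptom is that scraped article text contains a large number of scrambled words. In this post we walk through undoing that scrambling step by step. Along the way we will get a deeper understanding of font-based anti-scraping and learn what to watch out for when deploying it.</p>
<h2 id="一、知乎反爬效果"><a href="#一、知乎反爬效果" class="headerlink" title="一、知乎反爬效果"></a>1. What Zhihu's anti-scraping looks like</h2><p>From the Zhihu answer <a target="_blank" rel="noopener" href="https://www.zhihu.com/question/41922324/answer/3073556909">不被爱是一种什么样的感受? - 知乎</a></p>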
<p><img src="/../imgs/image.png" srcset="/img/loading.gif" lazyload alt="乱码示意图"> </p>
<p><img src="/../img/image.png" srcset="/img/loading.gif" lazyload alt="乱码示意图"> </p>
<p>As the screenshot shows, the page source contains many scrambled characters, for example (original character -&gt; wrong character):<sup id="fnref:2" class="footnote-ref"><a href="#fn:2" rel="footnote"><span class="hint--top hint--rounded" aria-label="由于知乎回答页反爬使用了两套字体,故本文所有截图,代码运行结果等内容可能与实际不符。你可以选择以实际为主或刷新页面直到页面显示的内容与本文一致。">[2]</span></a></sup></p>
<ul>
<li>中 -&gt; 在</li>
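@@ -298,17 +298,17 @@ <h2 id="一、知乎反爬效果"><a href="#一、知乎反爬效果" class="hea
<p>This scrambling makes the scraped article far less readable. So how is it produced, and how can we undo it?</p>
<h2 id="二、找寻乱码真凶"><a href="#二、找寻乱码真凶" class="headerlink" title="二、找寻乱码真凶"></a>2. Tracking down the culprit</h2><p>Looking at the behaviour above, characters that are wrong in the page source turn into the correct characters once they are rendered on the page. Our first guess is therefore that Zhihu uses font-based anti-scraping on this page.</p>
<p>Next, open F12 -&gt; Network, select the Font filter, and look at the fonts Zhihu loads.</p>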
<p><img src="/../imgs/image-1.png" srcset="/img/loading.gif" lazyload alt="知乎加载的字体"></p>
<p><img src="/../img/image-1.png" srcset="/img/loading.gif" lazyload alt="知乎加载的字体"></p>
<p>Right-click the font and choose Open in new tab to save it.</p>
<p><img src="/../imgs/image-2.png" srcset="/img/loading.gif" lazyload alt="下载的字体文件"></p>
<p><img src="/../img/image-2.png" srcset="/img/loading.gif" lazyload alt="下载的字体文件"></p>
<p>Change the file extension to .ttf <sup id="fnref:3" class="footnote-ref"><a href="#fn:3" rel="footnote"><span class="hint--top hint--rounded" aria-label=".ttf 是因为 `data:font/ttf;...` 代表该字体是 ttf 格式的。">[3]</span></a></sup> and open it.</p>
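<p>As an alternative to saving the font by hand, the step above can be sketched in Python. This is only a sketch: it assumes the font arrives as a base64-encoded <code>data:font/ttf;base64,...</code> URI (the source only shows the <code>data:font/ttf;...</code> prefix), and the helper name <code>decode_data_uri</code> is hypothetical.</p>

```python
import base64


def decode_data_uri(data_uri: str) -> bytes:
    """Return the raw bytes carried by a base64-encoded data: URI.

    Assumption: the payload is base64-encoded, i.e. the header ends
    with ";base64". Anything else is rejected rather than guessed at.
    """
    header, _, payload = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data: URI")
    return base64.b64decode(payload)


# Tiny demonstration with a 4-byte payload (the TrueType version tag),
# not a real font.
font_bytes = decode_data_uri("data:font/ttf;base64,AAEAAA==")
print(font_bytes)  # → b'\x00\x01\x00\x00'
```

<p>Writing the returned bytes to a file with a .ttf extension gives the same result as the manual rename.</p>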
<div class="group-image-container"><div class="group-image-row"><div class="group-image-wrap"><img src="/../imgs/image-3.png" srcset="/img/loading.gif" lazyload alt="正常字体"></div><div class="group-image-wrap"><img src="/../imgs/image-4.png" srcset="/img/loading.gif" lazyload alt="反爬字体"></div></div></div>
<div class="group-image-container"><div class="group-image-row"><div class="group-image-wrap"><img src="/../img/image-3.png" srcset="/img/loading.gif" lazyload alt="正常字体"></div><div class="group-image-wrap"><img src="/../img/image-4.png" srcset="/img/loading.gif" lazyload alt="反爬字体"></div></div></div>
<figcaption aria-hidden="true" class="image-caption">Left: normal font. Right: anti-scraping font</figcaption>

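<p>Compared with the normal font, the font we downloaded has clearly had some of its glyphs replaced; this is the font Zhihu uses for anti-scraping. Next we analyse this font and work out a countermeasure.</p>
<h2 id="三、致命缺陷"><a href="#三、致命缺陷" class="headerlink" title="三、致命缺陷"></a>3. The fatal flaw</h2><p>The basic principle of font-based anti-scraping is to replace each original character with a different character, then use a font to render that new character with the original character's shape. A program only ever sees the new character, while the user still sees the original content. The protection is therefore defeated as soon as we find the correspondence between new and original characters, and to find it, pinning down a distinguishing feature of each glyph in the font is an essential step.</p>
<p>Open <a target="_blank" rel="noopener" href="https://fontdrop.info/">FontDrop!</a>, load the font, scroll down, and inspect the glyphs.</p>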
<p><img src="/../imgs/image-5.png" srcset="/img/loading.gif" lazyload alt="字体中的字形"></p>
<p><img src="/../img/image-5.png" srcset="/img/loading.gif" lazyload alt="字体中的字形"></p>
<p>We find a glyph whose Glyph name is uni662F but whose Unicode is 65F6. Let's look up the characters these two hexadecimal numbers correspond to:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><code class="hljs python">glyph = <span class="hljs-string">&quot;\u662F&quot;</span><br>unicode = <span class="hljs-string">&quot;\u65F6&quot;</span><br><span class="hljs-built_in">print</span>(glyph, unicode)<br><span class="hljs-comment"># output: 是 时</span><br></code></pre></td></tr></table></figure>
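<p>The lookup above suggests a general recipe, sketched below under one assumption: every replaced glyph keeps a name of the form <code>uniXXXX</code> that encodes the character it really draws, as in the single example shown. The function names <code>build_mapping</code> and <code>decode</code> and the sample table are hypothetical; the input is a plain codepoint-to-glyph-name dict.</p>

```python
import re

# Glyph names such as "uni662F": "uni" followed by four hex digits.
_UNI_NAME = re.compile(r"^uni([0-9A-Fa-f]{4})$")


def build_mapping(cmap: dict[int, str]) -> dict[str, str]:
    """Map each character seen in the page source to the character drawn.

    cmap maps a Unicode codepoint to a glyph name. Entries whose glyph
    name does not follow the uniXXXX pattern, or whose name agrees with
    the codepoint, are skipped.
    """
    mapping = {}
    for codepoint, glyph_name in cmap.items():
        match = _UNI_NAME.match(glyph_name)
        if match is None:
            continue
        in_source = chr(codepoint)
        drawn = chr(int(match.group(1), 16))
        if drawn != in_source:
            mapping[in_source] = drawn
    return mapping


def decode(text: str, mapping: dict[str, str]) -> str:
    """Replace every scrambled character in text with the drawn one."""
    return text.translate(str.maketrans(mapping))


# Hand-written sample table based on the glyph inspected above.
sample_cmap = {0x65F6: "uni662F", 0x4E00: "uni4E00"}
mapping = build_mapping(sample_cmap)
print(decode("\u65F6", mapping))  # → 是
```

<p>For a real font file, a table of this shape can be read with fontTools via <code>TTFont(path).getBestCmap()</code>. Because the answer page uses two sets of fonts, the mapping would have to be rebuilt for whichever font the page actually loaded.</p>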

@@ -382,7 +382,7 @@ <h2 id="注"><a href="#注" class="headerlink" title="注"></a>Notes</h2><section
<div class="license-box my-3">
<div class="license-title">
<div>A look at Zhihu Yanxuan anti-scraping (answer page edition)</div>
<div>https://blog.cxzlw.top/2023/07/05/zhihu-aac-old/</div>
<div>https://blog.cxzlw.top/2023/07/04/zhihu-aac-old/</div>
</div>
<div class="license-meta">

@@ -444,8 +444,8 @@ <h2 id="注"><a href="#注" class="headerlink" title="注"></a>Notes</h2><section
<div id="cusdis_thread"
data-host="https://cusdis.com"
data-app-id="bd220f7c-6b55-463a-912f-6e5a10b9b460"
data-page-id="e75cd7136d10caf689f9a6c27efb2705"
data-page-url="2023/07/05/zhihu-aac-old/"
data-page-id="b60ceaf6da9d12e6bf3d9297d120c8dd"
data-page-url="2023/07/04/zhihu-aac-old/"
data-page-title="聊聊知乎盐选反爬 (回答页篇)"
data-theme="auto"
>
4 changes: 2 additions & 2 deletions 404.html
@@ -7,8 +7,8 @@

<head>
<meta charset="UTF-8">
<link rel="apple-touch-icon" sizes="76x76" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="icon" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="apple-touch-icon" sizes="76x76" href="/favicon.ico">
<link rel="icon" href="/favicon.ico">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=5.0, shrink-to-fit=no">
<meta http-equiv="x-ua-compatible" content="ie=edge">

8 changes: 4 additions & 4 deletions about/index.html
@@ -7,8 +7,8 @@

<head>
<meta charset="UTF-8">
<link rel="apple-touch-icon" sizes="76x76" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="icon" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="apple-touch-icon" sizes="76x76" href="/favicon.ico">
<link rel="icon" href="/favicon.ico">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=5.0, shrink-to-fit=no">
<meta http-equiv="x-ua-compatible" content="ie=edge">

@@ -23,8 +23,8 @@
<meta property="og:site_name" content="创新者.老王的博客">
<meta property="og:description" content="欢迎来到我的博客">
<meta property="og:locale" content="zh_CN">
<meta property="article:published_time" content="2023-07-03T01:01:41.000Z">
<meta property="article:modified_time" content="2023-07-05T13:41:49.432Z">
<meta property="article:published_time" content="2023-07-02T17:01:41.000Z">
<meta property="article:modified_time" content="2023-07-05T14:00:59.248Z">
<meta property="article:author" content="cxzlw">
<meta name="twitter:card" content="summary_large_image">

6 changes: 3 additions & 3 deletions archives/2023/07/index.html
@@ -7,8 +7,8 @@

<head>
<meta charset="UTF-8">
<link rel="apple-touch-icon" sizes="76x76" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="icon" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="apple-touch-icon" sizes="76x76" href="/favicon.ico">
<link rel="icon" href="/favicon.ico">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=5.0, shrink-to-fit=no">
<meta http-equiv="x-ua-compatible" content="ie=edge">

@@ -224,7 +224,7 @@

<p class="h5">2023</p>

<a href="/2023/07/05/zhihu-aac-old/" class="list-group-item list-group-item-action">
<a href="/2023/07/04/zhihu-aac-old/" class="list-group-item list-group-item-action">
<time>07-05</time>
<div class="list-group-item-title">A look at Zhihu Yanxuan anti-scraping (answer page edition)</div>
</a>
6 changes: 3 additions & 3 deletions archives/2023/index.html
@@ -7,8 +7,8 @@

<head>
<meta charset="UTF-8">
<link rel="apple-touch-icon" sizes="76x76" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="icon" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="apple-touch-icon" sizes="76x76" href="/favicon.ico">
<link rel="icon" href="/favicon.ico">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=5.0, shrink-to-fit=no">
<meta http-equiv="x-ua-compatible" content="ie=edge">

@@ -224,7 +224,7 @@

<p class="h5">2023</p>

<a href="/2023/07/05/zhihu-aac-old/" class="list-group-item list-group-item-action">
<a href="/2023/07/04/zhihu-aac-old/" class="list-group-item list-group-item-action">
<time>07-05</time>
<div class="list-group-item-title">A look at Zhihu Yanxuan anti-scraping (answer page edition)</div>
</a>
6 changes: 3 additions & 3 deletions archives/index.html
@@ -7,8 +7,8 @@

<head>
<meta charset="UTF-8">
<link rel="apple-touch-icon" sizes="76x76" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="icon" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="apple-touch-icon" sizes="76x76" href="/favicon.ico">
<link rel="icon" href="/favicon.ico">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=5.0, shrink-to-fit=no">
<meta http-equiv="x-ua-compatible" content="ie=edge">

@@ -224,7 +224,7 @@

<p class="h5">2023</p>

<a href="/2023/07/05/zhihu-aac-old/" class="list-group-item list-group-item-action">
<a href="/2023/07/04/zhihu-aac-old/" class="list-group-item list-group-item-action">
<time>07-05</time>
<div class="list-group-item-title">A look at Zhihu Yanxuan anti-scraping (answer page edition)</div>
</a>
4 changes: 2 additions & 2 deletions categories/index.html
@@ -7,8 +7,8 @@

<head>
<meta charset="UTF-8">
<link rel="apple-touch-icon" sizes="76x76" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="icon" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="apple-touch-icon" sizes="76x76" href="/favicon.ico">
<link rel="icon" href="/favicon.ico">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=5.0, shrink-to-fit=no">
<meta http-equiv="x-ua-compatible" content="ie=edge">

Binary file added favicon.ico
Binary file not shown.
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
12 changes: 6 additions & 6 deletions index.html
@@ -7,8 +7,8 @@

<head>
<meta charset="UTF-8">
<link rel="apple-touch-icon" sizes="76x76" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="icon" href="https://avatars.githubusercontent.com/u/55052188?v=4">
<link rel="apple-touch-icon" sizes="76x76" href="/favicon.ico">
<link rel="icon" href="/favicon.ico">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=5.0, shrink-to-fit=no">
<meta http-equiv="x-ua-compatible" content="ie=edge">

@@ -224,21 +224,21 @@


<div class="col-12 col-md-4 m-auto index-img">
<a href="/2023/07/05/zhihu-aac-old/" target="_self">
<img src="/imgs/image.png" srcset="/img/loading.gif" lazyload alt="聊聊知乎盐选反爬 (回答页篇)">
<a href="/2023/07/04/zhihu-aac-old/" target="_self">
<img src="/img/image.png" srcset="/img/loading.gif" lazyload alt="聊聊知乎盐选反爬 (回答页篇)">
</a>
</div>

<article class="col-12 col-md-8 mx-auto index-info">
<h2 class="index-header">

<a href="/2023/07/05/zhihu-aac-old/" target="_self">
<a href="/2023/07/04/zhihu-aac-old/" target="_self">
A look at Zhihu Yanxuan anti-scraping (answer page edition)
</a>
</h2>


<a class="index-excerpt " href="/2023/07/05/zhihu-aac-old/" target="_self">
<a class="index-excerpt " href="/2023/07/04/zhihu-aac-old/" target="_self">
<div>
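Recently, Zhihu rolled out an anti-scraping system for Yanxuan (盐选) articles in its columns, and the same system was later applied to Yanxuan articles on Zhihu answer pages. The visible symptom is that scraped article text contains a large number of scrambled words. In this post we walk through undoing that scrambling step by step. Along the way we will get a deeper understanding of font-based anti-scraping and learn what to watch out for when deploying it.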
</div>