<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Murrough Foley</title>
    <description>The latest articles on Forem by Murrough Foley (@murroughfoley).</description>
    <link>https://forem.com/murroughfoley</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F405413%2F1ae7b6e4-9915-4827-a69e-5ce8e147b1ae.jpg</url>
      <title>Forem: Murrough Foley</title>
      <link>https://forem.com/murroughfoley</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/murroughfoley"/>
    <language>en</language>
    <item>
      <title>How to Use rs-trafilatura with Scrapy</title>
      <dc:creator>Murrough Foley</dc:creator>
      <pubDate>Fri, 03 Apr 2026 14:23:44 +0000</pubDate>
      <link>https://forem.com/murroughfoley/how-to-use-rs-trafilatura-with-scrapy-1i9b</link>
      <guid>https://forem.com/murroughfoley/how-to-use-rs-trafilatura-with-scrapy-1i9b</guid>
      <description>&lt;p&gt;&lt;a href="https://scrapy.org" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; is the standard Python framework for web scraping. It handles crawling, scheduling, and data pipelines. rs-trafilatura plugs into Scrapy as an item pipeline — your spider yields items with HTML, and the pipeline adds structured extraction results automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;rs-trafilatura scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;Add the pipeline to your Scrapy project's &lt;code&gt;settings.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ITEM_PIPELINES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rs_trafilatura.scrapy.RsTrafilaturaPipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Every item that passes through the pipeline with a &lt;code&gt;body&lt;/code&gt; (bytes) or &lt;code&gt;html&lt;/code&gt; (string) field will get an &lt;code&gt;extraction&lt;/code&gt; dict added to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing the Spider
&lt;/h2&gt;

&lt;p&gt;Your spider yields items with the response body and URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ContentSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# raw bytes — rs-trafilatura auto-detects encoding
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;# Follow links
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;href&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;href&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline picks up &lt;code&gt;body&lt;/code&gt; (bytes) or &lt;code&gt;html&lt;/code&gt; (string). When it finds one, it runs extraction and adds the results under &lt;code&gt;item["extraction"]&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Pipeline Adds
&lt;/h2&gt;

&lt;p&gt;Each processed item gets an &lt;code&gt;extraction&lt;/code&gt; dict:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/blog/post&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;html&amp;gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extraction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Blog Post Title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John Doe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-01-15T00:00:00+00:00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main_content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The full extracted text...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;# Blog Post Title&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;The full extracted text...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extraction_quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sitename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Example Blog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A blog post about...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Enabling Markdown Output
&lt;/h2&gt;

&lt;p&gt;Add to &lt;code&gt;settings.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;RS_TRAFILATURA_MARKDOWN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This populates &lt;code&gt;item["extraction"]["content_markdown"]&lt;/code&gt; with GitHub Flavored Markdown.&lt;/p&gt;
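&lt;p&gt;Downstream pipelines can then consume that field like any other item key. A minimal sketch that writes each item's Markdown to disk; the pipeline class, directory layout, and filename scheme here are illustrative, not part of rs-trafilatura:&lt;/p&gt;

```python
import hashlib
from pathlib import Path


class MarkdownWriter:
    """Hypothetical downstream pipeline: saves content_markdown to disk.

    Assumes the item shape documented above (an "extraction" dict with a
    "content_markdown" key). Schedule it after RsTrafilaturaPipeline,
    e.g. at priority 500.
    """

    def __init__(self, out_dir="markdown"):
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(parents=True, exist_ok=True)

    def process_item(self, item, spider):
        md = item.get("extraction", {}).get("content_markdown")
        if md:
            # Derive a stable filename from the URL so re-crawls overwrite
            # rather than duplicate.
            name = hashlib.sha256(item["url"].encode()).hexdigest()[:16]
            (self.out_dir / f"{name}.md").write_text(md, encoding="utf-8")
        return item
```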

&lt;h2&gt;
  
  
  Filtering by Page Type
&lt;/h2&gt;

&lt;p&gt;The page type classification lets you route items differently based on what kind of page they are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ContentSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;custom_settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ITEM_PIPELINES&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rs_trafilatura.scrapy.RsTrafilaturaPipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myproject.pipelines.PageTypeRouter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;href&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;href&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# myproject/pipelines.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PageTypeRouter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;ext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extraction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
        &lt;span class="n"&gt;page_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;page_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Save to products table
&lt;/span&gt;            &lt;span class="nf"&gt;save_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;page_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;forum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Save to discussions table
&lt;/span&gt;            &lt;span class="nf"&gt;save_forum_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;page_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Save to articles table
&lt;/span&gt;            &lt;span class="nf"&gt;save_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Default handling
&lt;/span&gt;            &lt;span class="nf"&gt;save_generic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Filtering by Extraction Quality
&lt;/h2&gt;

&lt;p&gt;Drop items where extraction quality is low:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;QualityFilter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;ext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extraction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
        &lt;span class="n"&gt;quality&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extraction_quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DropItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Low extraction quality (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add it before the router:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ITEM_PIPELINES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rs_trafilatura.scrapy.RsTrafilaturaPipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myproject.pipelines.QualityFilter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;350&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myproject.pipelines.PageTypeRouter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Exporting to JSON Lines
&lt;/h2&gt;

&lt;p&gt;Scrapy's built-in feed exports work out of the box:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy crawl content &lt;span class="nt"&gt;-o&lt;/span&gt; output.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each line in &lt;code&gt;output.jsonl&lt;/code&gt; will contain the full item including the &lt;code&gt;extraction&lt;/code&gt; dict. You can then process it with any tool that reads JSON Lines.&lt;/p&gt;
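&lt;p&gt;For example, a short post-processing script can load the export and keep only high-quality articles. The function name and threshold below are illustrative; the item shape matches the &lt;code&gt;extraction&lt;/code&gt; dict shown earlier:&lt;/p&gt;

```python
import json


def load_articles(path, min_quality=0.5):
    """Read a Scrapy JSON Lines export, keeping high-quality articles."""
    articles = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            item = json.loads(line)
            ext = item.get("extraction", {})
            # Keep only items classified as articles above the threshold
            if ext.get("page_type") == "article" and ext.get("extraction_quality", 0) >= min_quality:
                articles.append(ext)
    return articles
```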

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;rs-trafilatura extracts a page in roughly 44ms via compiled Rust (PyO3, no subprocess). On a typical Scrapy crawl that is negligible next to network latency. The pipeline runs synchronously in the Scrapy reactor thread, so each extraction does briefly block the event loop; at ~44ms per page, though, the pauses are short enough that downloads are rarely starved.&lt;/p&gt;

&lt;p&gt;For very high-throughput crawls (1000+ pages/second), consider splitting the work: have the spider dump raw HTML to disk or a queue, and run extraction afterwards as a separate step spread across worker processes.&lt;/p&gt;
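&lt;p&gt;A sketch of such a two-phase setup: the crawl saves raw HTML to disk, and a separate script fans the saved files out across processes. The worker below is a placeholder (it only measures each file) standing in for the actual rs-trafilatura call, so the sketch stays self-contained:&lt;/p&gt;

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def extract_one(path):
    """Placeholder worker: swap the body for the real extraction call."""
    html = Path(path).read_text(encoding="utf-8")
    return {"path": str(path), "length": len(html)}


def extract_all(html_dir, workers=4):
    # Extraction is CPU-bound, so process (not thread) parallelism is
    # what pays off here; results come back in input order.
    paths = sorted(Path(html_dir).glob("*.html"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_one, paths))
```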

&lt;h2&gt;
  
  
  Items Without HTML
&lt;/h2&gt;

&lt;p&gt;If an item doesn't have &lt;code&gt;body&lt;/code&gt; or &lt;code&gt;html&lt;/code&gt;, the pipeline passes it through unchanged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This item has no HTML — pipeline ignores it
&lt;/span&gt;&lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;custom_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;something&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;# → No "extraction" key added, item passes through as-is
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;rs-trafilatura Python package&lt;/strong&gt;: &lt;a href="https://pypi.org/project/rs-trafilatura" rel="noopener noreferrer"&gt;pypi.org/project/rs-trafilatura&lt;/a&gt; · &lt;a href="https://github.com/Murrough-Foley/rs-trafilatura-python" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rust crate&lt;/strong&gt;: &lt;a href="https://crates.io/crates/rs-trafilatura" rel="noopener noreferrer"&gt;crates.io/crates/rs-trafilatura&lt;/a&gt; · &lt;a href="https://github.com/Murrough-Foley/rs-trafilatura" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scrapy&lt;/strong&gt;: &lt;a href="https://scrapy.org" rel="noopener noreferrer"&gt;scrapy.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark&lt;/strong&gt;: &lt;a href="https://webcontentextraction.org" rel="noopener noreferrer"&gt;webcontentextraction.org WCXB&lt;/a&gt; · &lt;a href="https://github.com/Murrough-Foley/web-content-extraction-benchmark" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://doi.org/10.5281/zenodo.19316874" rel="noopener noreferrer"&gt;Zenodo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webcontentextraction</category>
      <category>rust</category>
      <category>scrapy</category>
    </item>
    <item>
      <title>How to Use rs-trafilatura with Firecrawl</title>
      <dc:creator>Murrough Foley</dc:creator>
      <pubDate>Fri, 03 Apr 2026 14:22:45 +0000</pubDate>
      <link>https://forem.com/murroughfoley/how-to-use-rs-trafilatura-with-firecrawl-36p9</link>
      <guid>https://forem.com/murroughfoley/how-to-use-rs-trafilatura-with-firecrawl-36p9</guid>
      <description>&lt;p&gt;&lt;a href="https://firecrawl.dev" rel="noopener noreferrer"&gt;Firecrawl&lt;/a&gt; is an API service for scraping web pages. It handles JavaScript rendering, anti-bot bypass, and rate limiting — you send it a URL, it gives you back the page content. By default, Firecrawl returns Markdown. But if you request the raw HTML, you can run rs-trafilatura on it for page-type-aware extraction with quality scoring.&lt;/p&gt;

&lt;p&gt;This is useful when you need structured metadata (title, author, date, page type) or when you want to know how confident the extraction is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;rs-trafilatura firecrawl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You also need a Firecrawl API key from &lt;a href="https://firecrawl.dev" rel="noopener noreferrer"&gt;firecrawl.dev&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic Usage
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;firecrawl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FirecrawlApp&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rs_trafilatura.firecrawl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_firecrawl_result&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FirecrawlApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fc-your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Request HTML format (required for rs-trafilatura)
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scrape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/blog/post&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;formats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Extract with rs-trafilatura
&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_firecrawl_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Title: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Author: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Page type: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quality: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extraction_quality&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main_content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is &lt;code&gt;formats=["html"]&lt;/code&gt;: it tells Firecrawl to return the raw HTML alongside whatever else it produces. Without it, Firecrawl returns only Markdown, and rs-trafilatura has no HTML to extract from.&lt;/p&gt;
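&lt;p&gt;A small guard makes that failure mode explicit instead of surfacing as a confusing downstream error. This is a sketch: it assumes the scrape result exposes the raw HTML as an &lt;code&gt;html&lt;/code&gt; attribute, and the helper name is ours, not part of either library:&lt;/p&gt;

```python
def ensure_html(result):
    """Fail fast when a scrape was made without formats=["html"].

    Assumes the scrape result exposes raw HTML as an `html` attribute,
    mirroring the `result.markdown` access shown in the examples.
    """
    html = getattr(result, "html", None)
    if not html:
        raise ValueError('no HTML in result; scrape with formats=["html"]')
    return html
```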

&lt;h2&gt;
  
  
  Why Not Just Use Firecrawl's Markdown?
&lt;/h2&gt;

&lt;p&gt;Firecrawl's built-in Markdown output is good for articles. The difference shows on non-article pages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Product pages&lt;/strong&gt;: Firecrawl may include navigation, filters, and "related products" sections in its Markdown. rs-trafilatura recognises the page type and extracts just the product description, falling back to JSON-LD structured data when needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forums&lt;/strong&gt;: Firecrawl treats the entire page as content. rs-trafilatura identifies user posts and excludes voting controls, user profile panels, and moderation UI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service pages&lt;/strong&gt;: Firecrawl may over-extract or under-extract multi-section layouts. rs-trafilatura's multi-candidate merge handles hero + features + testimonials + pricing sections.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The other advantage is the quality score. Firecrawl doesn't tell you how confident it is. rs-trafilatura's &lt;code&gt;extraction_quality&lt;/code&gt; field gives you a 0.0–1.0 score so you can flag unreliable extractions.&lt;/p&gt;
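&lt;p&gt;One way to use that score is a simple routing helper: anything below a cutoff goes to a review queue instead of straight into your dataset. The 0.5 threshold here is an arbitrary example, not a library default:&lt;/p&gt;

```python
def route_by_quality(quality, threshold=0.5):
    """Bucket an extraction by its 0.0-1.0 confidence score.

    The 0.5 cutoff is an arbitrary example, not a library default.
    """
    if quality is None:
        return "review"  # no score at all: treat as unreliable
    return "accept" if quality >= threshold else "review"
```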

&lt;h2&gt;
  
  
  Getting Both Firecrawl Markdown and rs-trafilatura Extraction
&lt;/h2&gt;

&lt;p&gt;You can request both and compare:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scrape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;formats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Firecrawl's own Markdown
&lt;/span&gt;&lt;span class="n"&gt;firecrawl_markdown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;markdown&lt;/span&gt;

&lt;span class="c1"&gt;# rs-trafilatura extraction
&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_firecrawl_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_markdown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rs_markdown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extracted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content_markdown&lt;/span&gt;
&lt;span class="n"&gt;rs_quality&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extracted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extraction_quality&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Firecrawl markdown: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;firecrawl_markdown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rs-trafilatura markdown: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs_markdown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extraction quality: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rs_quality&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Batch Scraping
&lt;/h2&gt;

&lt;p&gt;Firecrawl supports batch scraping. Combine it with rs-trafilatura for structured extraction at scale:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;firecrawl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FirecrawlApp&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rs_trafilatura.firecrawl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_firecrawl_result&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FirecrawlApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fc-your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/products/widget&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/docs/getting-started&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/blog/announcement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://forum.example.com/thread/help&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch_scrape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;formats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;extracted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_firecrawl_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (quality: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extraction_quality&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: the batch API returns a result object with a &lt;code&gt;.data&lt;/code&gt; attribute containing a list of Document objects. The &lt;code&gt;extract_firecrawl_result&lt;/code&gt; adapter handles both Document objects (v4) and legacy dicts (v1).&lt;/p&gt;

&lt;h2&gt;
  
  
  Options
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Stricter filtering — less noise
&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_firecrawl_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;favor_precision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# More inclusive — captures more content
&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_firecrawl_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;favor_recall&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get Markdown output
&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_firecrawl_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_markdown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What You Get
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;extract_firecrawl_result&lt;/code&gt; returns an &lt;code&gt;ExtractResult&lt;/code&gt; with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;title&lt;/code&gt;, &lt;code&gt;author&lt;/code&gt;, &lt;code&gt;date&lt;/code&gt; — structured metadata&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;main_content&lt;/code&gt; — clean extracted text&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;content_markdown&lt;/code&gt; — GFM Markdown (when enabled)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;page_type&lt;/code&gt; — article, forum, product, collection, listing, documentation, service&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;extraction_quality&lt;/code&gt; — 0.0–1.0 confidence score&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;language&lt;/code&gt;, &lt;code&gt;sitename&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt; — additional metadata&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;images&lt;/code&gt; — extracted image data with src, alt, caption&lt;/li&gt;
&lt;/ul&gt;
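&lt;p&gt;For storage or logging, it's handy to flatten those fields into a plain dict. This sketch reads the attributes listed above off any result object; fields the result doesn't carry default to &lt;code&gt;None&lt;/code&gt;:&lt;/p&gt;

```python
def result_to_record(extracted):
    """Flatten an extraction result into a plain dict for storage.

    Field names follow the ExtractResult attributes listed above;
    attributes the result does not have default to None.
    """
    fields = (
        "title", "author", "date", "main_content", "content_markdown",
        "page_type", "extraction_quality", "language", "sitename",
        "description",
    )
    return {name: getattr(extracted, name, None) for name in fields}
```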

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;rs-trafilatura Python package&lt;/strong&gt;: &lt;a href="https://pypi.org/project/rs-trafilatura" rel="noopener noreferrer"&gt;pypi.org/project/rs-trafilatura&lt;/a&gt; · &lt;a href="https://github.com/Murrough-Foley/rs-trafilatura-python" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rust crate&lt;/strong&gt;: &lt;a href="https://crates.io/crates/rs-trafilatura" rel="noopener noreferrer"&gt;crates.io/crates/rs-trafilatura&lt;/a&gt; · &lt;a href="https://github.com/Murrough-Foley/rs-trafilatura" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Firecrawl&lt;/strong&gt;: &lt;a href="https://firecrawl.dev" rel="noopener noreferrer"&gt;firecrawl.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WCXB Benchmark&lt;/strong&gt;: &lt;a href="https://webcontentextraction.org" rel="noopener noreferrer"&gt;webcontentextraction.org&lt;/a&gt; · &lt;a href="https://github.com/Murrough-Foley/web-content-extraction-benchmark" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://doi.org/10.5281/zenodo.19316874" rel="noopener noreferrer"&gt;Zenodo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>scraping</category>
      <category>firecrawl</category>
      <category>webcontentextraction</category>
      <category>rust</category>
    </item>
    <item>
      <title>How to Use rs-trafilatura with spider-rs</title>
      <dc:creator>Murrough Foley</dc:creator>
      <pubDate>Fri, 03 Apr 2026 14:21:31 +0000</pubDate>
      <link>https://forem.com/murroughfoley/how-to-use-rs-trafilatura-with-spider-rs-de4</link>
      <guid>https://forem.com/murroughfoley/how-to-use-rs-trafilatura-with-spider-rs-de4</guid>
      <description>&lt;p&gt;&lt;a href="https://crates.io/crates/spider" rel="noopener noreferrer"&gt;spider&lt;/a&gt; is a high-performance async web crawler written in Rust. It discovers, fetches, and queues URLs — but content extraction is left to you. rs-trafilatura slots in as the extraction layer, giving you page-type-aware content extraction with quality scoring on every crawled page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;Add both crates to your &lt;code&gt;Cargo.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[dependencies]&lt;/span&gt;
&lt;span class="py"&gt;rs-trafilatura&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"spider"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="py"&gt;spider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"2"&lt;/span&gt;
&lt;span class="py"&gt;tokio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"full"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;spider&lt;/code&gt; feature flag enables &lt;code&gt;rs_trafilatura::spider_integration&lt;/code&gt;, which provides convenience functions that accept spider's &lt;code&gt;Page&lt;/code&gt; type directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic: Crawl Then Extract
&lt;/h2&gt;

&lt;p&gt;The simplest approach — crawl a site, then extract content from every page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;website&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Website&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;rs_trafilatura&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;spider_integration&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;extract_page&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;website&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Website&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://example.com"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;website&lt;/span&gt;&lt;span class="nf"&gt;.crawl&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;website&lt;/span&gt;&lt;span class="nf"&gt;.get_pages&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.into_iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.flatten&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="nf"&gt;extract_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[{}] {} (confidence: {:.2})"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.metadata.page_type&lt;/span&gt;&lt;span class="nf"&gt;.unwrap_or_default&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.metadata.title&lt;/span&gt;&lt;span class="nf"&gt;.unwrap_or_default&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.extraction_quality&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"  Content: {} chars"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.content_text&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nd"&gt;eprintln!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"  Extraction failed: {e}"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;extract_page&lt;/code&gt; takes a &lt;code&gt;&amp;amp;Page&lt;/code&gt; and returns &lt;code&gt;Result&amp;lt;ExtractResult&amp;gt;&lt;/code&gt;. The page URL is automatically passed to the classifier for page type detection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Streaming: Extract As Pages Arrive
&lt;/h2&gt;

&lt;p&gt;For large crawls, you don't want to wait until everything is fetched. spider's subscribe channel lets you process pages as they arrive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;website&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Website&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;rs_trafilatura&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;spider_integration&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;extract_page&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;website&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Website&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://example.com"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;rx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;website&lt;/span&gt;&lt;span class="nf"&gt;.subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rx&lt;/span&gt;&lt;span class="nf"&gt;.recv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[{count}] {} → {} ({:.2})"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="nf"&gt;.get_url&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.metadata.page_type&lt;/span&gt;&lt;span class="nf"&gt;.unwrap_or_default&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.extraction_quality&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Extracted {count} pages"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="n"&gt;website&lt;/span&gt;&lt;span class="nf"&gt;.crawl&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;website&lt;/span&gt;&lt;span class="nf"&gt;.unsubscribe&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each page is extracted in the spawned task as soon as spider fetches it. Extraction takes ~44ms per page (roughly 22 pages per second per core), so it easily keeps up with typical crawl rates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Custom Options
&lt;/h2&gt;

&lt;p&gt;Use &lt;code&gt;extract_page_with_options&lt;/code&gt; for fine-grained control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;rs_trafilatura&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;spider_integration&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;extract_page_with_options&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;rs_trafilatura&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;page_type&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;PageType&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Options&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;output_markdown&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;// Get GFM Markdown output&lt;/span&gt;
    &lt;span class="n"&gt;include_images&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Extract image metadata&lt;/span&gt;
    &lt;span class="n"&gt;favor_precision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;// Stricter filtering&lt;/span&gt;
    &lt;span class="n"&gt;page_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;PageType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// Force page type&lt;/span&gt;
    &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="nn"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_page_with_options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.content_markdown&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Markdown:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;md&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.images&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Image: {} (hero: {})"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="py"&gt;.src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="py"&gt;.is_hero&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you provide &lt;code&gt;url&lt;/code&gt; in the options, it takes precedence over the page URL for classification. If you don't, the page URL is used automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quality-Gated Processing
&lt;/h2&gt;

&lt;p&gt;The extraction quality score lets you filter or flag low-confidence results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;website&lt;/span&gt;&lt;span class="nf"&gt;.get_pages&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.into_iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.flatten&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="nf"&gt;.get_url&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.extraction_quality&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.80&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nd"&gt;eprintln!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"⚠ Low confidence on {url}: {:.2}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.extraction_quality&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="c1"&gt;// Log for manual review, or route to a fallback extractor&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Process high-confidence extractions&lt;/span&gt;
    &lt;span class="nf"&gt;save_to_database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the WCXB benchmark, about 8% of pages score below 0.80. These are typically product pages with content in JSON-LD, forums with unusual markup, or service pages with highly distributed content.&lt;/p&gt;

&lt;h2&gt;
  
  
  What extract_page Returns
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;ExtractResult&lt;/code&gt; gives you:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;content_text&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;String&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Main content as plain text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;content_markdown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Option&amp;lt;String&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GFM Markdown (when enabled)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;content_html&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Option&amp;lt;String&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Extracted content as HTML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metadata.title&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Option&amp;lt;String&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Page title&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metadata.author&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Option&amp;lt;String&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Author name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metadata.date&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Option&amp;lt;DateTime&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Publication date&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metadata.page_type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Option&amp;lt;String&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Detected page type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;extraction_quality&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;f64&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.0–1.0 confidence score&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;images&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Vec&amp;lt;ImageData&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Image URLs, alt text, captions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why Not spider_transformations?
&lt;/h2&gt;

&lt;p&gt;spider ships with its own &lt;code&gt;spider_transformations&lt;/code&gt; crate that can convert pages to Markdown or plain text. It works, but it's a basic readability-style extractor without:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML page type classification&lt;/li&gt;
&lt;li&gt;Type-specific extraction profiles (forum comment handling, multi-section merge, JSON-LD fallback)&lt;/li&gt;
&lt;li&gt;Extraction quality scoring&lt;/li&gt;
&lt;li&gt;Structured metadata extraction from JSON-LD, Open Graph, and Dublin Core&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;rs-trafilatura gives you all of these. For article-heavy crawls, &lt;code&gt;spider_transformations&lt;/code&gt; is fine. For crawls that hit diverse page types, rs-trafilatura produces substantially better results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;rs-trafilatura&lt;/strong&gt;: &lt;a href="https://crates.io/crates/rs-trafilatura" rel="noopener noreferrer"&gt;crates.io/crates/rs-trafilatura&lt;/a&gt; · &lt;a href="https://github.com/Murrough-Foley/rs-trafilatura" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python package&lt;/strong&gt;: &lt;a href="https://pypi.org/project/rs-trafilatura" rel="noopener noreferrer"&gt;pypi.org/project/rs-trafilatura&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;spider&lt;/strong&gt;: &lt;a href="https://crates.io/crates/spider" rel="noopener noreferrer"&gt;crates.io/crates/spider&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark&lt;/strong&gt;: &lt;a href="https://webcontentextraction.org" rel="noopener noreferrer"&gt;webcontentextraction.org WCXB&lt;/a&gt; · &lt;a href="https://github.com/Murrough-Foley/web-content-extraction-benchmark" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://doi.org/10.5281/zenodo.19316874" rel="noopener noreferrer"&gt;Zenodo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webcontentextraction</category>
      <category>rust</category>
      <category>scraping</category>
    </item>
    <item>
      <title>How to Use rs-trafilatura with crawl4ai</title>
      <dc:creator>Murrough Foley</dc:creator>
      <pubDate>Fri, 03 Apr 2026 14:19:29 +0000</pubDate>
      <link>https://forem.com/murroughfoley/how-to-use-rs-trafilatura-with-crawl4ai-3nfd</link>
      <guid>https://forem.com/murroughfoley/how-to-use-rs-trafilatura-with-crawl4ai-3nfd</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/unclecode/crawl4ai" rel="noopener noreferrer"&gt;crawl4ai&lt;/a&gt; is an async web crawler built for producing LLM-friendly output. By default, it converts pages to Markdown using its own scraping pipeline. But if you want page-type-aware content extraction with quality scoring, you can swap in rs-trafilatura as the extraction strategy.&lt;/p&gt;

&lt;p&gt;This tutorial shows how to set that up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;rs-trafilatura crawl4ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this is your first time with crawl4ai, you also need Playwright browsers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; playwright &lt;span class="nb"&gt;install &lt;/span&gt;chromium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Basic Usage
&lt;/h2&gt;

&lt;p&gt;rs-trafilatura provides &lt;code&gt;RsTrafilaturaStrategy&lt;/code&gt;, a drop-in replacement for crawl4ai's built-in extraction strategies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawl4ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncWebCrawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CrawlerRunConfig&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rs_trafilatura.crawl4ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RsTrafilaturaStrategy&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RsTrafilaturaStrategy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrawlerRunConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extraction_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncWebCrawler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extracted_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Title: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Page type: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quality: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extraction_quality&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;main_content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extracted content is a JSON array with one item containing the extraction result. crawl4ai serializes it automatically — you just &lt;code&gt;json.loads()&lt;/code&gt; the &lt;code&gt;extracted_content&lt;/code&gt; field.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Get Back
&lt;/h2&gt;

&lt;p&gt;Each extraction result is a dict with these fields:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;title&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Page title&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;author&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Author name (if detected)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;date&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Publication date (ISO 8601)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;main_content&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Clean extracted text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;content_markdown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Markdown output (if enabled)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;page_type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;article, forum, product, collection, listing, documentation, service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;extraction_quality&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.0–1.0 confidence score&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;language&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Detected language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sitename&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Site name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;description&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Meta description&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
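&lt;p&gt;The decoded fields are plain JSON, so downstream code needs nothing from crawl4ai to work with them. As a quick illustration, here is a small helper that summarizes one extraction dict; it is pure Python, the field names follow the table above, and the sample dict is hypothetical (real values come from decoding &lt;code&gt;extracted_content&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def summarize_item(item):
    """Build a one-line summary from an extraction dict.

    Uses .get() throughout so missing optional fields
    (author, date, etc.) fall back to placeholders.
    """
    title = item.get("title") or "(untitled)"
    page_type = item.get("page_type") or "unknown"
    quality = item.get("extraction_quality", 0.0)
    words = len((item.get("main_content") or "").split())
    return f"[{page_type}] {title}, quality {quality:.2f}, {words} words"

# Hypothetical sample for illustration only
sample = {
    "title": "Getting Started",
    "page_type": "documentation",
    "extraction_quality": 0.94,
    "main_content": "Install the package and run the crawler.",
}
print(summarize_item(sample))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;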

&lt;h2&gt;
  
  
  Enabling Markdown Output
&lt;/h2&gt;

&lt;p&gt;Pass &lt;code&gt;output_markdown=True&lt;/code&gt; to get Markdown alongside plain text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RsTrafilaturaStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_markdown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrawlerRunConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extraction_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncWebCrawler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extracted_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;markdown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you GitHub Flavored Markdown with headings, lists, tables, bold/italic, code blocks, and links preserved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Precision vs Recall
&lt;/h2&gt;

&lt;p&gt;By default, rs-trafilatura balances precision and recall. You can tip the scale:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Stricter filtering — less noise, may miss some content
&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RsTrafilaturaStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;favor_precision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# More inclusive — captures more content, may include some boilerplate
&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RsTrafilaturaStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;favor_recall&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Crawling Multiple Pages
&lt;/h2&gt;

&lt;p&gt;crawl4ai handles concurrency. rs-trafilatura runs extraction in a thread per page, so it doesn't block the async crawl loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RsTrafilaturaStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_markdown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrawlerRunConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extraction_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/blog/post-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/products/widget&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/docs/getting-started&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://forum.example.com/thread/123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncWebCrawler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extracted_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (quality: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extraction_quality&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each page gets classified into its type and extracted with the appropriate profile. A product page gets JSON-LD fallback. A forum thread gets comment-as-content handling. A docs page gets sidebar removal. All automatic.&lt;/p&gt;
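&lt;p&gt;One straightforward way to act on the classification downstream is a dispatch table keyed on &lt;code&gt;page_type&lt;/code&gt;. This is a sketch in pure Python — the handler functions are hypothetical, and &lt;code&gt;item&lt;/code&gt; is the decoded dict from &lt;code&gt;extracted_content&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical handlers; each receives one decoded extraction dict.
def handle_product(item):
    return ("product", item.get("title"))

def handle_forum(item):
    return ("forum", item.get("title"))

def handle_default(item):
    return ("generic", item.get("title"))

HANDLERS = {
    "product": handle_product,
    "forum": handle_forum,
}

def route(item):
    # Unknown or missing page_type falls through to the default handler.
    handler = HANDLERS.get(item.get("page_type"), handle_default)
    return handler(item)

print(route({"page_type": "forum", "title": "Thread 123"}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;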

&lt;h2&gt;
  
  
  Using the Quality Score for Hybrid Pipelines
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;extraction_quality&lt;/code&gt; field tells you how confident rs-trafilatura is in its extraction. You can use this to build a hybrid pipeline — fast heuristic extraction for most pages, with LLM fallback for the hard cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawl4ai.extraction_strategy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMExtractionStrategy&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extracted_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extraction_quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.80&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Low confidence — use crawl4ai's built-in LLM extraction as fallback
&lt;/span&gt;        &lt;span class="n"&gt;llm_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrawlerRunConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;extraction_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;LLMExtractionStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extracted_content&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main_content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the WCXB benchmark, about 8% of pages score below 0.80. Routing just those pages to a neural fallback improves the overall F1 from 0.859 to 0.862 on the development set and from 0.893 to 0.910 on the held-out test set.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works Under the Hood
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;RsTrafilaturaStrategy&lt;/code&gt; inherits from crawl4ai's &lt;code&gt;ExtractionStrategy&lt;/code&gt; when crawl4ai is installed, so it passes the &lt;code&gt;isinstance()&lt;/code&gt; check in &lt;code&gt;CrawlerRunConfig&lt;/code&gt;. It sets &lt;code&gt;input_format="html"&lt;/code&gt;, which tells crawl4ai to pass raw HTML (not Markdown) and to skip chunking. The extraction runs in Rust via PyO3 — no subprocess, no binary to find.&lt;/p&gt;
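&lt;p&gt;That conditional-base-class pattern is easy to reuse for your own strategies: inherit from &lt;code&gt;ExtractionStrategy&lt;/code&gt; when crawl4ai is present, fall back to &lt;code&gt;object&lt;/code&gt; otherwise. A minimal sketch — the &lt;code&gt;EchoStrategy&lt;/code&gt; class and its behaviour are illustrative, not part of rs-trafilatura:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;try:
    from crawl4ai.extraction_strategy import ExtractionStrategy as _Base
except ImportError:
    _Base = object  # still importable without crawl4ai installed

class EchoStrategy(_Base):
    """Toy strategy: returns a one-item list, mirroring the
    shape crawl4ai expects back from extract()."""

    def __init__(self):
        if _Base is not object:
            # Ask crawl4ai for raw HTML rather than Markdown.
            super().__init__(input_format="html")

    def extract(self, url, html, *args, **kwargs):
        return [{"url": url, "html_length": len(html)}]

items = EchoStrategy().extract("https://example.com", "hello")
print(items[0]["html_length"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;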

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;rs-trafilatura Python package&lt;/strong&gt;: &lt;a href="https://pypi.org/project/rs-trafilatura" rel="noopener noreferrer"&gt;pypi.org/project/rs-trafilatura&lt;/a&gt; · &lt;a href="https://github.com/Murrough-Foley/rs-trafilatura-python" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rust crate&lt;/strong&gt;: &lt;a href="https://crates.io/crates/rs-trafilatura" rel="noopener noreferrer"&gt;crates.io/crates/rs-trafilatura&lt;/a&gt; · &lt;a href="https://github.com/Murrough-Foley/rs-trafilatura" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;crawl4ai&lt;/strong&gt;: &lt;a href="https://github.com/unclecode/crawl4ai" rel="noopener noreferrer"&gt;github.com/unclecode/crawl4ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark&lt;/strong&gt;: &lt;a href="https://webcontentextraction.org" rel="noopener noreferrer"&gt;webcontentextraction.org WCXB&lt;/a&gt; · &lt;a href="https://github.com/Murrough-Foley/web-content-extraction-benchmark" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://doi.org/10.5281/zenodo.19316874" rel="noopener noreferrer"&gt;Zenodo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webcontentextraction</category>
      <category>rust</category>
      <category>scraping</category>
    </item>
    <item>
      <title>rs-trafilatura: Page-Type-Aware Web Content Extraction in Rust</title>
      <dc:creator>Murrough Foley</dc:creator>
      <pubDate>Fri, 03 Apr 2026 14:17:29 +0000</pubDate>
      <link>https://forem.com/murroughfoley/rs-trafilatura-page-type-aware-web-content-extraction-in-rust-2ppf</link>
      <guid>https://forem.com/murroughfoley/rs-trafilatura-page-type-aware-web-content-extraction-in-rust-2ppf</guid>
      <description>&lt;p&gt;Web content extraction is the task of isolating the main content of a web page from its surrounding boilerplate — navigation menus, cookie banners, ads, sidebars, footers, and the other 80% of a page that isn't the actual content. If you process web pages at scale, you need it. Search engines use it for indexing. RAG pipelines use it to feed clean context to LLMs. SEO practitioners use it to approximate what Google sees when it evaluates a page.&lt;/p&gt;

&lt;p&gt;The open-source ecosystem for this is strong. Trafilatura (Python), Readability (JavaScript), jusText, BoilerPy3 — all solid tools that work well on news articles and blog posts. On articles, the top systems all converge above F1 = 0.90. The problem is largely solved.&lt;/p&gt;

&lt;p&gt;But the web is not just articles.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Everything That Isn't an Article
&lt;/h2&gt;

&lt;p&gt;When I was running SEO audits across thousands of competitor pages from search results, I kept hitting the same issue. The extraction tools worked on articles but fell apart on everything else:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Product pages&lt;/strong&gt; encode descriptions in JSON-LD structured data rather than the visible DOM. Extractors that only parse what they see in the HTML miss the content entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forum pages&lt;/strong&gt; wrap user posts in CSS classes like &lt;code&gt;comment&lt;/code&gt; and &lt;code&gt;reply&lt;/code&gt; — the exact patterns that article extractors have been trained to strip as boilerplate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service pages&lt;/strong&gt; spread content across 5 to 15 independent &lt;code&gt;&amp;lt;section&amp;gt;&lt;/code&gt; elements. An extractor that picks a single best node grabs the hero section and throws away the rest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation pages&lt;/strong&gt; embed content alongside sidebar navigation, version pickers, and table-of-contents panels that extractors include as if they were content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't edge cases. In the WCXB dataset, 47% of pages are non-articles. And the failures are architectural — no amount of parameter tuning within an article-focused extractor can fix them. You need a different extraction strategy for each page type.&lt;/p&gt;

&lt;h2&gt;
  
  
  What rs-trafilatura Does Differently
&lt;/h2&gt;

&lt;p&gt;rs-trafilatura is a Rust library that classifies pages into seven types (article, forum, product, collection, listing, documentation, service) and applies type-specific extraction profiles. The classifier is a three-stage pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;URL heuristics&lt;/strong&gt; — fast pattern matching on domain and path. URLs containing &lt;code&gt;/forum/&lt;/code&gt;, &lt;code&gt;/products/&lt;/code&gt;, or &lt;code&gt;docs.&lt;/code&gt; subdomains resolve immediately. This handles ~63% of pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTML signal analysis&lt;/strong&gt; — JSON-LD &lt;code&gt;@type&lt;/code&gt; values, Open Graph meta tags, DOM patterns like product grids or code blocks. This catches ~15% more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XGBoost ML classifier&lt;/strong&gt; — 200 trees over 181 features (DOM structure, vocabulary density, link ratios) for the remaining ambiguous pages. 86.6% accuracy.&lt;/li&gt;
&lt;/ol&gt;
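&lt;p&gt;The cascade is easy to picture in code. The sketch below is an illustrative re-implementation in Python, not the library's actual code; the URL patterns, signal strings, and stubbed ML stage are simplified stand-ins:&lt;/p&gt;

```python
import re
from urllib.parse import urlparse

# Stage 1: URL heuristics -- cheap patterns that resolve most pages outright.
URL_RULES = [
    (re.compile(r"/forums?/|/threads?/"), "forum"),
    (re.compile(r"/products?/"), "product"),
    (re.compile(r"^docs\."), "documentation"),
]

def classify_by_url(url):
    parsed = urlparse(url)
    for pattern, page_type in URL_RULES:
        if pattern.search(parsed.path) or pattern.search(parsed.netloc):
            return page_type
    return None

# Stage 2: HTML signals -- JSON-LD @type values, Open Graph tags, DOM patterns.
# Only the JSON-LD check is sketched here, as a naive substring match.
JSONLD_SIGNALS = {"Product": "product", "DiscussionForumPosting": "forum", "Article": "article"}

def classify_by_html(html):
    for jsonld_type, page_type in JSONLD_SIGNALS.items():
        if f'"@type": "{jsonld_type}"' in html:
            return page_type
    return None

# Stage 3 in the real pipeline is the XGBoost model; here it is a stub.
def classify(url, html, ml_fallback=lambda url, html: "article"):
    return classify_by_url(url) or classify_by_html(html) or ml_fallback(url, html)
```

Each stage only runs when the previous one fails to resolve, so the expensive model is reserved for the genuinely ambiguous remainder.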

&lt;p&gt;Once the page type is known, extraction uses a type-specific profile:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Forums get &lt;code&gt;comments_are_content = true&lt;/code&gt; and platform-specific selectors for XenForo, vBulletin, Discourse, phpBB, and others.&lt;/li&gt;
&lt;li&gt;Service pages get multi-candidate content merging — selecting the top-scoring sections and concatenating them.&lt;/li&gt;
&lt;li&gt;Products get JSON-LD structured data fallback when DOM extraction produces poor results.&lt;/li&gt;
&lt;li&gt;Documentation gets framework-specific boilerplate removal for Sphinx, Rustdoc, MDN, and ReadTheDocs.&lt;/li&gt;
&lt;/ul&gt;
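&lt;p&gt;Conceptually, each page type maps to a bundle of extraction options. A minimal sketch of that idea, with hypothetical field names that are not the crate's real configuration API:&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical profile shape -- the field names are illustrative,
# not the crate's real configuration API.
@dataclass
class ExtractionProfile:
    comments_are_content: bool = False
    merge_top_sections: int = 1            # 1 = pick a single best node
    jsonld_fallback: bool = False
    boilerplate_selectors: tuple = ()

PROFILES = {
    "forum": ExtractionProfile(comments_are_content=True),
    "service": ExtractionProfile(merge_top_sections=5),
    "product": ExtractionProfile(jsonld_fallback=True),
    "documentation": ExtractionProfile(
        boilerplate_selectors=(".sidebar", ".toctree", ".version-picker")
    ),
}

def profile_for(page_type):
    # Unknown or article-like types fall back to the default profile.
    return PROFILES.get(page_type, ExtractionProfile())
```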

&lt;h2&gt;
  
  
  Extraction Quality Predictor
&lt;/h2&gt;

&lt;p&gt;After extraction, an ML quality predictor estimates how reliable the result is. It's a 27-feature XGBoost regression model that predicts the expected F1 score. Pages scoring below 0.80 are candidates for LLM fallback — you can route them to MinerU-HTML or another neural extractor while keeping the fast heuristic path for the 92% of pages where it works well.&lt;/p&gt;

&lt;p&gt;This is the basis for hybrid extraction pipelines: heuristic speed on most pages, neural quality on the hard cases.&lt;/p&gt;
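&lt;p&gt;The routing logic itself is a few lines. In this sketch, &lt;code&gt;extract_fast&lt;/code&gt; and &lt;code&gt;extract_neural&lt;/code&gt; are placeholder callables, not real APIs; the 0.80 cutoff is the threshold described above:&lt;/p&gt;

```python
# Hybrid routing sketch. extract_fast and extract_neural are placeholder
# callables (stand-ins for rs-trafilatura and a neural extractor), not real APIs.
QUALITY_THRESHOLD = 0.80

def extract_hybrid(html, extract_fast, extract_neural):
    result = extract_fast(html)
    if result["extraction_quality"] < QUALITY_THRESHOLD:
        # Low predicted F1: route this page down the slow neural path.
        return extract_neural(html)
    # High predicted F1: keep the fast heuristic result.
    return result
```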

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;Benchmarked on the &lt;a href="https://webcontentextraction.org" rel="noopener noreferrer"&gt;Web Content eXtraction Benchmark (WCXB)&lt;/a&gt; — 1,497 pages from 1,295 domains across 7 page types:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;F1&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;rs-trafilatura&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.859&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;44 ms/page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MinerU-HTML (0.6B)&lt;/td&gt;
&lt;td&gt;0.827&lt;/td&gt;
&lt;td&gt;1,570 ms/page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trafilatura (Python)&lt;/td&gt;
&lt;td&gt;0.791&lt;/td&gt;
&lt;td&gt;94 ms/page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ReaderLM-v2 (1.5B)&lt;/td&gt;
&lt;td&gt;0.741&lt;/td&gt;
&lt;td&gt;10,410 ms/page&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On a separate 511-page held-out test set (never used during development), rs-trafilatura achieves F1 = 0.893, confirming that results generalise. With a hybrid pipeline routing low-confidence pages to MinerU-HTML, the held-out F1 reaches 0.910.&lt;/p&gt;

&lt;p&gt;On articles, every top system converges around F1 = 0.93 — the differences are marginal. The gap opens on non-article types:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Page Type&lt;/th&gt;
&lt;th&gt;rs-trafilatura&lt;/th&gt;
&lt;th&gt;Trafilatura&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Article&lt;/td&gt;
&lt;td&gt;0.932&lt;/td&gt;
&lt;td&gt;0.926&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Documentation&lt;/td&gt;
&lt;td&gt;0.931&lt;/td&gt;
&lt;td&gt;0.888&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service&lt;/td&gt;
&lt;td&gt;0.843&lt;/td&gt;
&lt;td&gt;0.763&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forum&lt;/td&gt;
&lt;td&gt;0.792&lt;/td&gt;
&lt;td&gt;0.585&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Collection&lt;/td&gt;
&lt;td&gt;0.713&lt;/td&gt;
&lt;td&gt;0.553&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Listing&lt;/td&gt;
&lt;td&gt;0.704&lt;/td&gt;
&lt;td&gt;0.589&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Product&lt;/td&gt;
&lt;td&gt;0.670&lt;/td&gt;
&lt;td&gt;0.567&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Forums: +0.207 over Trafilatura. Collections: +0.160. These aren't marginal improvements — they're the difference between getting most of the content and getting almost none.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using rs-trafilatura
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Rust
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;rs_trafilatura&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Title: {:?}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.metadata.title&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Content: {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.content_text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Page type: {:?}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.metadata.page_type&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Confidence: {:.2}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.extraction_quality&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[dependencies]&lt;/span&gt;
&lt;span class="py"&gt;rs-trafilatura&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.2"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rs_trafilatura&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs_trafilatura&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main_content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extraction_quality&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;rs-trafilatura
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Python package bundles four Rust crates into a single native extension: content extraction, page type classification, HTML cleaning, and Markdown conversion. No subprocess overhead — it's compiled Rust called directly from Python via PyO3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spider-rs Integration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;website&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Website&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;rs_trafilatura&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;spider_integration&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;extract_page&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;website&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Website&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://example.com"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;website&lt;/span&gt;&lt;span class="nf"&gt;.crawl&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;website&lt;/span&gt;&lt;span class="nf"&gt;.get_pages&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.into_iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.flatten&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[{}] {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.metadata.page_type&lt;/span&gt;&lt;span class="nf"&gt;.unwrap_or_default&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.metadata.title&lt;/span&gt;&lt;span class="nf"&gt;.unwrap_or_default&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[dependencies]&lt;/span&gt;
&lt;span class="py"&gt;rs-trafilatura&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"spider"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Benchmark
&lt;/h2&gt;

&lt;p&gt;WCXB is the dataset behind these numbers — 2,008 pages from 1,613 domains, with a dev/test split and ground truth annotations for all seven page types. It's the first benchmark that measures extraction quality across structurally different page types rather than just articles.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Website&lt;/strong&gt;: &lt;a href="https://webcontentextraction.org" rel="noopener noreferrer"&gt;webcontentextraction.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="https://github.com/Murrough-Foley/web-content-extraction-benchmark" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://doi.org/10.5281/zenodo.19316874" rel="noopener noreferrer"&gt;Zenodo&lt;/a&gt; · &lt;a href="https://huggingface.co/datasets/murrough-foley/web-content-extraction-benchmark" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rust crate&lt;/strong&gt;: &lt;a href="https://crates.io/crates/rs-trafilatura" rel="noopener noreferrer"&gt;crates.io/crates/rs-trafilatura&lt;/a&gt; · &lt;a href="https://github.com/Murrough-Foley/rs-trafilatura" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python package&lt;/strong&gt;: &lt;a href="https://pypi.org/project/rs-trafilatura" rel="noopener noreferrer"&gt;pypi.org/project/rs-trafilatura&lt;/a&gt; · &lt;a href="https://github.com/Murrough-Foley/rs-trafilatura-python" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're extracting content at scale and need reliability beyond articles, give it a try. If you run it on your own data, I'd love to hear how it performs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Murrough Foley&lt;/strong&gt; — Technical SEO consultant and researcher&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://murroughfoley.com" rel="noopener noreferrer"&gt;murroughfoley.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Murrough-Foley" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://linkedin.com/in/m-foley-seo/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://orcid.org/0009-0008-3127-2101" rel="noopener noreferrer"&gt;ORCID&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://crates.io/crates/rs-trafilatura" rel="noopener noreferrer"&gt;crates.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/rs-trafilatura/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://webcontentextraction.org" rel="noopener noreferrer"&gt;Web Content Extraction Benchmark&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webcontentextraction</category>
      <category>scraping</category>
      <category>rust</category>
    </item>
    <item>
      <title>6 Things Dev.To Can Do To Improve Organic Traffic</title>
      <dc:creator>Murrough Foley</dc:creator>
      <pubDate>Wed, 03 Aug 2022 08:45:00 +0000</pubDate>
      <link>https://forem.com/murroughfoley/6-things-devto-can-do-to-improve-organic-traffic-4922</link>
      <guid>https://forem.com/murroughfoley/6-things-devto-can-do-to-improve-organic-traffic-4922</guid>
      <description>&lt;p&gt;Dev.to is a great site, with a great community. It's grown rapidly since its beginnings in 2016, amassing over 5.2 million backlinks from almost 65 thousand domains and cementing itself as one of the most authoritative coding community sites on the net.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But why has its organic traffic stalled, and does it matter?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has Traffic Plateaued?&lt;/li&gt;
&lt;li&gt;Is It Important?&lt;/li&gt;
&lt;li&gt;So what's going on?&lt;/li&gt;
&lt;li&gt;What Are The Issues &amp;amp; How To Fix Them?&lt;/li&gt;
&lt;li&gt;#1 - Dev.to &amp;amp; The Bloated Index&lt;/li&gt;
&lt;li&gt;#2 - Lonely Pages &amp;amp; Limited Internal Interlinking&lt;/li&gt;
&lt;li&gt;#3 - Tag Pages Need Lovin Too&lt;/li&gt;
&lt;li&gt;#4 - Spam Abuse &amp;amp; Content Moderation&lt;/li&gt;
&lt;li&gt;#5 - Smaller Considerations That Don't Deserve A Number&lt;/li&gt;
&lt;li&gt;#6 - Curated content to highlight the community posts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Notes, Exceptions and Caveats: If you find this beneficial and have an open-source or commercial project you want me to take a look at, connect with me, Murrough Foley, on &lt;a href="https://www.linkedin.com/in/m-foley-seo/" rel="noopener noreferrer"&gt;Linkedin&lt;/a&gt;. I've moved notes and asides to the bottom. It's there for the curious and interested and linked via the numbers.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Dev.to is first and foremost a community site built with UX/UI in mind. Ben Halpern, Jess Lee, Peter Frank and the Forem team have done a fantastic job at building a safe platform where developers are encouraged to share their knowledge and discuss each other's content respectfully.&lt;/p&gt;

&lt;p&gt;As a corporate enterprise, they must balance a host of aspects such as moderation, technical maintenance, features vs ease of management and the commercial side of the business too.&lt;/p&gt;

&lt;p&gt;To be clear, I am an SEO guy and am only looking at the Dev.to site in this article. I do not have the wider considerations of the open-source platform and its commercial aspects in mind. For example, a change to dev.to for more visibility may not fit with the Forem commercial partners, etc.&lt;/p&gt;

&lt;h1&gt;
  
  
  Has Traffic Plateaued?
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ngyj2avsbtugjakuvqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ngyj2avsbtugjakuvqz.png" alt="dev-to-seo-case-study" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I am only looking at organic traffic and have no inside baseball as to the direct or social traffic. Ahrefs traffic estimates aren't great, but they generally do a good job of showing the overall trend, and it does look like organic traffic took a big hit during the January '22 rollback of the December '21 update.&lt;/p&gt;

&lt;p&gt;Organic growth looks steady, but it's no longer on the explosive path it started on.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Is This Important?
&lt;/h1&gt;

&lt;p&gt;I would say it's very important. &lt;/p&gt;

&lt;p&gt;I know that Dev.to is an open-source project and that its main focus is the community. But an essential part of driving the community is page views, interaction and discussion: that's the dopamine hit that encourages people to set aside time and share their endeavors. It's the ability to reach others with similar interests and share ideas.&lt;/p&gt;

&lt;p&gt;For Dev.to it's also a force multiplier. Added visibility in the search engine results pages (SERPs) encourages more people to join and get involved.&lt;/p&gt;

&lt;p&gt;Increased visibility also puts Dev.to on the map (even more) for its corporate partners and advertises the platform to new parties, all without a major marketing drive.&lt;/p&gt;

&lt;h1&gt;
  
  
  So what's going on?
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjo8aujnp1dsbk5k7s1rf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjo8aujnp1dsbk5k7s1rf.png" alt="dev-to-seo-case-study" width="596" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The site is making good progress and has accrued a good number of quality links in recent months.&lt;/p&gt;

&lt;p&gt;But there are some SEO fundamentals that are holding the site back.&lt;/p&gt;

&lt;p&gt;Fixing the first two and a half of these issues should bring a great improvement and put the site back on its original trajectory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Okay get on with it&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  1 - Dev.to &amp;amp; The Bloated Index
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; A serious problem for Dev.to is that there are too many low-quality pages available for Google to crawl. These consist mainly of comment pages that contain little content, and those same comments create a duplicate content issue because they also appear on the article page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wasting Google's Resources
&lt;/h2&gt;

&lt;p&gt;Google's spiders are constantly roaming the net, visiting pages and deciding whether or not to index a page's content. Each time Googlebot visits a site, there is a cost associated with that.&lt;/p&gt;

&lt;p&gt;Now think of the exponential growth of content across the web and think of a bunch of engineers at Google, looking at the problem of resources and costs.&lt;/p&gt;

&lt;p&gt;Google has long encouraged sites to have a slim and trim sitemap, with multiple statements about &lt;a href="https://developers.google.com/search/docs/advanced/guidelines/thin-content" rel="noopener noreferrer"&gt;thin content&lt;/a&gt;, &lt;a href="https://developers.google.com/search/docs/advanced/guidelines/doorway-pages" rel="noopener noreferrer"&gt;low-quality doorway pages&lt;/a&gt; and the need to noindex pages that aren't useful to the public.&lt;/p&gt;

&lt;p&gt;So the idea here is pretty simple: be Google's friend, don't waste their resources, and do your bit for the planet too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Okay, so how bad is the problem?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Welllll, it's pretty bad.&lt;/p&gt;

&lt;p&gt;Let's take a quick look at Dev.to in the Google Index using Google dorks/advanced operators.&lt;/p&gt;

&lt;p&gt;The two operators we'll be using are &lt;em&gt;site:&lt;/em&gt; and &lt;em&gt;inurl:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's in the index?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using site:dev.to we can show the number of results in the index.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;site:dev.to&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujao7l98xcg1ur7wxm22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujao7l98xcg1ur7wxm22.png" alt="dev-to-seo-case-study" width="704" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This number is unreliable as mentioned in the notes, but it's useful nonetheless.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;site:dev.to/t&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hqd7pk5wzcz9ysvjsb5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hqd7pk5wzcz9ysvjsb5.png" alt="dev-to-seo-case-study" width="704" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Shows the number of &lt;strong&gt;tag pages&lt;/strong&gt; in the index. It's important to be aware of how many there are, and I'll come back to this later.&lt;/p&gt;

&lt;p&gt;But here is the kicker. Using the operator below, we are asking Google to show us the results from the site dev.to that include the word 'comment' in the url structure. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;site:dev.to/ inurl:comment&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafsfqxxwbmns4s4wsilh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafsfqxxwbmns4s4wsilh.png" alt="dev-to-seo-case-study" width="712" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the &lt;a href="https://www.google.com/search?q=site%3Adev.to+inurl%3Acomment"&gt;first 50 results&lt;/a&gt; are articles where the url contains the word "comment", but all the other results are user comment pages that follow the url structure below.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dev.to/user1/comment/1pp5k&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Google estimates that there are &lt;strong&gt;961,000&lt;/strong&gt; of these pages in its index.&lt;/p&gt;

&lt;p&gt;I've randomly picked a comment page out of the SERPs to give you an idea of what a low-quality page that Google doesn't like looks like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7r8eeo9axmfpqjlde6s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7r8eeo9axmfpqjlde6s.png" alt="dev-to-seo-case-study" width="800" height="667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This page has about 400 words, no real topic, doesn't answer a question and holds no value without the article that it accompanies.&lt;/p&gt;

&lt;p&gt;So if it's useless to the user (coming at it from the search results page) and Google can't understand it, why is it in the index taking up space?&lt;/p&gt;

&lt;p&gt;Which brings us neatly to the next issue these pages create.&lt;/p&gt;

&lt;h2&gt;
  
  
  Duplicate Content
&lt;/h2&gt;

&lt;p&gt;All the comments on these discussion pages mirror the comments left under the articles. &lt;/p&gt;

&lt;p&gt;Duplicate content has been a contentious issue in the SEO world for years and, without rehashing the arguments, there are many (me included) who believe it is disregarded, or attributed to a single page and then ignored.&lt;/p&gt;

&lt;p&gt;Perhaps these discussion pages are a great resource for account holders to view all their discussions in one place, but there is no reason for them to be accessible via the search engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;The issue is easy to remedy: add a noindex tag to these pages, or block them via the robots.txt file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User-agent: *
Disallow: /*/actions_panel*
Disallow: /users/auth/twitter*
Disallow: /users/auth/github*
Disallow: /report-abuse?url=*
Disallow: /connect/@*
Disallow: /search?q=*
Disallow: /search/?q=*
Disallow: /search/feed_content?*
Disallow: /listings*?q=*
Disallow: /mod/*
Disallow: /mod?*
Disallow: /admin/*
Disallow: /reactions?*
Disallow: /async_info/base_data

Sitemap: https://dev.to/sitemap-index.xml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
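&lt;p&gt;As a sketch of what that fix could look like (the &lt;code&gt;/comment/&lt;/code&gt; path pattern is assumed from the URL structure shown earlier; it is not in the current robots.txt):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Added under the existing User-agent: * group in robots.txt
Disallow: /*/comment/*

# Or, per page, a meta robots tag in the &amp;lt;head&amp;gt;:
# &amp;lt;meta name="robots" content="noindex, follow"&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One caveat: robots.txt only blocks crawling, and already-indexed URLs can linger in the index. To actively wash these pages out, the noindex tag (which requires the page to stay crawlable) is the safer option.&lt;/p&gt;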



&lt;p&gt;&lt;strong&gt;Would there be an immediate positive move in the SERPS?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's difficult to say how long it would take for these pages to wash out of the index and for the rewards to show. It depends on the crawl budget, how proactive the team are at getting the site recrawled, and other factors. The Forem team might even have to wait for the next update (usually November or December) to see the benefit.&lt;/p&gt;

&lt;h1&gt;
  
  
  2 - Lonely Pages &amp;amp; Limited Internal Interlinking
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Google rewards experts in a subject matter. Currently Dev.to has an enormous amount of high-quality content around different technologies, but the interlinking is extremely limited and inconsistent (the links generated to other articles are random to each user/session).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I can hear people saying, "What does that mean?".&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Google rewards subject matter experts. When Google understands that a site or site section is about a singular topic and covers it to exhaustion, Google rewards that site with better positions and more traffic. &lt;/p&gt;

&lt;p&gt;Let me take a mundane example from the SERPs to explain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu355gahdfc9ssbf6wf3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu355gahdfc9ssbf6wf3m.png" alt="dev-to-seo-case-study" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My choice of keyword is about as boring as you can get. The image above is taken from Ahrefs for the search term "best dehumidifier" using a UK proxy. I've highlighted the little affiliate site choosedehumidifier.co.uk in red. In green you can see how authoritative the sites are. &lt;/p&gt;

&lt;p&gt;That site is ranking alongside huge national newspapers because all of its content is about a single topic, and all the pages are interlinked in a coherent, hierarchical way.&lt;/p&gt;

&lt;p&gt;So what's the situation with Dev.to?&lt;/p&gt;

&lt;p&gt;Good question. Authors have the option to cross-link other relevant pages on Dev.to, but there is no reminder (visual cue) and no incentive to do this.&lt;/p&gt;

&lt;p&gt;The team have done a good job of prompting users to add "Alt text" to images, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn3d4iy21dl4334b8xsb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn3d4iy21dl4334b8xsb.png" alt="dev-to-seo-case-study" width="707" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Currently, cross-linking is reliant on the "Read Next" box/module at the end of each article. &lt;/p&gt;

&lt;p&gt;And this has a serious issue too!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro013ftal9pym3wqy0ng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro013ftal9pym3wqy0ng.png" alt="dev-to-seo-case-study" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The "Read Next" module (probably) uses the tags and keywords of users to generate links to recent articles. Refresh the page in the same browser and by magic you'll get the same suggestions. Check the page in a private browser (with a fresh session) and you'll get a new set of suggestions.&lt;/p&gt;

&lt;p&gt;All this is perfect for the user, but what about Google's crawler - Googlebot?&lt;/p&gt;

&lt;p&gt;As it stands, every time Googlebot crawls the site it sees a different set of links to somewhat related topics, and they change again on every return visit to the page.&lt;/p&gt;

&lt;p&gt;How can Google understand these silos of inter-related content without permanent links telling it that these pages are related and share a similar topic?&lt;/p&gt;

&lt;p&gt;Without permanent links between closely related articles, Google can never understand that dev.to has a huge amount of resources around &lt;a href="https://www.google.com/search?q=site%3Adev.to+javascript+tutorial"&gt;Javascript tutorials&lt;/a&gt; or CSS tips and tricks or any of the incredible number of topics that are covered here.&lt;/p&gt;

&lt;p&gt;It gains &lt;strong&gt;no bonus&lt;/strong&gt; for all the specialised knowledge within each of the tags or topics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solutions
&lt;/h2&gt;

&lt;p&gt;Incentivise the user to cross link relevant articles on the site or other good resources. &lt;/p&gt;

&lt;p&gt;Visually nudge them with a reminder. The team have done it with image alt-tags, why not prompt the user to reward other good content on the site with a link?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Well, that's all well and good pal, but what about the million great pieces already on the site?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yeah, that is a conundrum. But there are a couple of ways you could tackle that. &lt;/p&gt;

&lt;p&gt;I'm not going to go into them here as it would add a couple of thousand words and I'm already tired, but basically you are looking at categorising and subcategorising content into topic clusters, and then adding permanent links in a second "related content" module.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After fixing this, how long would it take to see a benefit in search engine visibility?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Applying good interlinking is pretty fundamental to helping Google understand content and its relationship to other content. There should be an almost immediate improvement in rankings.&lt;/p&gt;

&lt;h1&gt;
  
  
  3 - Tag Pages Need Lovin Too
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; The tag pages have been built for UX purposes without consideration of how Google views them. They are great for getting new content crawled, but due to the lack of static content Google doesn't understand the context of these hub pages and doesn't rank them outside of "tag title + community".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzictsh90dxpqh51vgdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzictsh90dxpqh51vgdi.png" alt="dev-to-seo-case-study" width="713" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most used tag on Dev.to (I think) is &lt;strong&gt;javascript&lt;/strong&gt;, and the site has a huge amount of content around this topic and all its subtopics, with &lt;strong&gt;62,980&lt;/strong&gt; posts at the time of writing. So why does it perform so poorly in the SERPs?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80w222braozelvacxdbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80w222braozelvacxdbq.png" alt="dev-to-seo-case-study" width="788" height="691"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Outside of the dynamic content, there are only about 250 words that are there consistently (each time Googlebot comes to visit) that give an indication as to the purpose of the page.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solutions
&lt;/h2&gt;

&lt;p&gt;The solutions here really depend on the goal. A solid block of static written content would broaden the page's visibility across hundreds of keywords associated with JavaScript.&lt;/p&gt;

&lt;p&gt;Contextual cross-linking of sub-topic hub pages would also be of great benefit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F64nu9yk5l7gplhzr9rt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F64nu9yk5l7gplhzr9rt1.png" alt="dev-to-seo-case-study" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So you might have an expandable box at the top of the page instead of the highlighted text, with an article linking to the front-end frameworks, Node, TypeScript and related tag pages. &lt;/p&gt;

&lt;p&gt;These pages are the most linked pages internally (via tags) and could be huge drivers of organic traffic with a little love.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seriously, Dev.to is on page 6 for the term "JavaScript". Why?
&lt;/h2&gt;

&lt;p&gt;It bounces from page 6 to page 3. But you are right, there must be more to it.&lt;/p&gt;

&lt;p&gt;And there is.&lt;/p&gt;

&lt;p&gt;Dev.to has thousands of posts about JavaScript, and each of these pages links to the hub page &lt;a href="https://dev.to/t/javascript"&gt;https://dev.to/t/javascript&lt;/a&gt; via a tag just below the page title. So all that link equity from related pages should place it better than page 6 or 3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfo88hlzr3lhrqph0i5u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfo88hlzr3lhrqph0i5u.png" alt="dev-to-seo-case-study" width="716" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we link to a page in HTML, it's pretty simple stuff.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;a href="url"&amp;gt;link text&amp;lt;/a&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The URL is the target page and the link text is the anchor. The anchor gives Google context: it tells Google what the target page is about, and this is one of the primary ranking factors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So what's the issue?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The issue is that Dev.to tag links carry no anchor text or any other indication as to the topic of the target page. The text in our example is "javascript", which sits in a child span tag, so how can Google understand the link and its topic?&lt;/p&gt;
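&lt;p&gt;For illustration, the difference looks something like this (markup simplified from what's described above, not Dev.to's exact source):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&amp;lt;!-- Tag link with the text buried in child markup --&amp;gt;
&amp;lt;a href="/t/javascript"&amp;gt;&amp;lt;span&amp;gt;#&amp;lt;/span&amp;gt;&amp;lt;span&amp;gt;javascript&amp;lt;/span&amp;gt;&amp;lt;/a&amp;gt;

&amp;lt;!-- Plain anchor text gives the link clear context --&amp;gt;
&amp;lt;a href="/t/javascript"&amp;gt;javascript&amp;lt;/a&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;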

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxa9slj6gjeah0bvdfzr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxa9slj6gjeah0bvdfzr.png" alt="dev-to-seo-case-study" width="562" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's not forget that anchor text is also used for accessibility by those who are sight-impaired.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If Dev.to changed how the tags are coded, would they see a jump in SERP position for these tag pages?&lt;/strong&gt;&lt;br&gt;
Yes, it would be a change from Google's point of view as you are giving context to thousands of links. I would expect rankings to jump around for a few weeks before settling into a higher position.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What about that thing about adding a hidden article in a dropdown box at the top of the tag pages, what would be the benefit of that?&lt;/strong&gt;&lt;br&gt;
By adding more static content to the page and cross-linking these tag pages, Google should start to rank the page for a whole host (several hundred) of related keywords.&lt;/p&gt;

&lt;h1&gt;
  
  
  4 - Spam Abuse &amp;amp; Content Moderation
&lt;/h1&gt;

&lt;p&gt;There are a lot of shady tactics in SEO. Some work for a while, then fall out of fashion as the search engines address the issue and sometimes, inexplicably start working again years later.&lt;/p&gt;

&lt;p&gt;But one tactic that hasn't really gone away is webspam to a tier 1 property. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You are speaking gibberish again!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the image below you can see a spam profile page that links to a Malaysian Lotto page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn76m7pjjirtmvoy3osl9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn76m7pjjirtmvoy3osl9.png" alt="dev-to-seo-case-study" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The idea is to build low-quality links to this spam profile page and "wash" the link equity through the authoritative domain, in this case Dev.to.&lt;/p&gt;

&lt;p&gt;In the image below, I've pulled some of the top linked pages as seen by Ahrefs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fztrrdfuxmfoyxpd1lz2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fztrrdfuxmfoyxpd1lz2f.png" alt="dev-to-seo-case-study" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Starting from the bottom in purple, we have the Lotto site who have built 1,106 do-follow links and 623 no-follow links from 390 different domains.&lt;/p&gt;

&lt;p&gt;I'll not go through the others; in any case, the issue is not as widespread here as it is on other platforms.&lt;/p&gt;

&lt;p&gt;Ideally, the solution would be to delete these accounts, serve a 410 (Gone) status for their URLs, and add the toxic links to a disavow file. There are ongoing arguments as to whether this is necessary though, as Google is pretty good at ignoring spam links.&lt;/p&gt;

&lt;p&gt;This tactic still works in some non-English geos.&lt;/p&gt;

&lt;h1&gt;
  
  
  5 - Smaller Considerations That Don't Deserve A Number
&lt;/h1&gt;

&lt;p&gt;There are a number of smaller areas that could be considered when looking to boost the site's visibility on Google and the other search engines. I'll briefly touch on those here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Different Languages
&lt;/h2&gt;

&lt;p&gt;Due to an overzealous rate limit imposed by the dev.to server on my scans, I don't have hard figures on the amount of multilingual content or articles in other languages present on the site. It may not amount to much as a percentage, but as a driver of future growth, dev.to could make a small change to help Google find these.&lt;/p&gt;

&lt;p&gt;No hard figures, but anecdotally I have come across quite a bit of Spanish and Portuguese on the site, with the Ahrefs traffic estimates showing about 5% from Brazil.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogumhuobr9dm2grs6kgm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogumhuobr9dm2grs6kgm.png" alt="dev-to-seo-case-study" width="651" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One way to help the search engines digest content on a multilingual site is to explicitly tell them the language and/or region being targeted.&lt;/p&gt;

&lt;p&gt;To my knowledge there is no way to add a &lt;strong&gt;hreflang tag&lt;/strong&gt; to a Dev.to article in the markup, and the html lang attribute is set to a single sitewide default.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfraz9r9e36jlhc3ksei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfraz9r9e36jlhc3ksei.png" alt="dev-to-seo-case-study" width="800" height="296"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Would this be difficult to implement? Would it be worth it?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is one of those questions that I can't answer. Every year the internet is a more diverse place, with more and more users coming online across the globe. Perhaps different languages aren't a priority for the Forem team, or maybe they have a different solution. Smart people make smart decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Considerations
&lt;/h2&gt;

&lt;p&gt;I think this article is already long enough and I still have one other thing to cover. &lt;/p&gt;

&lt;h1&gt;
  
  
  6 - Curated content to highlight the community posts
&lt;/h1&gt;

&lt;p&gt;Dev.to has some great stuff written each week by the community. It has a points system that allows users to highlight great articles to the admins.&lt;/p&gt;

&lt;p&gt;And the community admins do a great job of picking out quality pieces every week. These are &lt;a href="https://dev.to/devteam/top-7-featured-dev-posts-from-the-past-week-2nie"&gt;featured&lt;/a&gt; in &lt;a href="https://dev.to/devteam/discussion-and-comment-of-the-week-v12-1jn5"&gt;roundups&lt;/a&gt; pulled out by &lt;a class="mentioned-user" href="https://dev.to/graciegregory"&gt;@graciegregory&lt;/a&gt; and &lt;a class="mentioned-user" href="https://dev.to/michaeltharrington"&gt;@michaeltharrington&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So if everything is great, what's your point?&lt;/strong&gt;&lt;br&gt;
These posts are put together for an internal audience. If you want to boost organic traffic you need to cater to an audience coming from search engines too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I follow you, but how would Dev.to do that?&lt;/strong&gt;&lt;br&gt;
It's pretty simple really. You take a look at the great user-generated content and create roundup posts around it, tailored to search traffic.&lt;/p&gt;

&lt;p&gt;Let me give you an example or two.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;SVG &amp;amp; CSS Animation Tutorials [56 Cool Guides]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And you build an article that categorises and introduces the best dev.to has to offer &lt;a href="https://www.google.com/search?q=site%3Adev.to+SVG+css+animations"&gt;on this topic&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Dev.to articles are often hyper focused on a problem or topic, but rounding up related articles into more generalised categories and serving them to Google would be targeting more exploratory search terms with much higher search volume.&lt;/p&gt;

&lt;p&gt;Time for one more example. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.google.com/search?q=site%3Adev.to+node.js+security"&gt;Secure Node.js | 18 Tutorials To Harden Node Server&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This title is going after the search terms "secure node.js" and "harden node server".&lt;/p&gt;

&lt;p&gt;I think you get the idea.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is pretty time intensive and would require a lot of curation, how long would it take to see the benefits?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yeah, no pain, no gain. It would take time and some knowledge to curate the best content on dev.to and produce these types of roundup posts. But you would be broadening the types of search queries that dev.to ranks for, highlighting the great efforts of the community and creating a positive feedback loop of fresh eyes on the site and its content.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Okay, that's it. If you read all the way to here, pat yourself on the back. If you understood everything, give yourself a round of applause; SEO lingo can be difficult to get your head around.&lt;/p&gt;

&lt;p&gt;For any corrections, comments or complaints, reach out to &lt;a href="https://www.linkedin.com/in/m-foley-seo/" rel="noopener noreferrer"&gt;M Foley on Linkedin&lt;/a&gt;. And if you have a web-facing open-source project, a commercial project or a question, please reach out. I'm always working.&lt;/p&gt;

&lt;h1&gt;
  
  
  Notes, Exceptions &amp;amp; Caveats
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Dev.to server rate-limiting:&lt;/strong&gt; There were a number of technical SEO checks I wanted to run, but the server kept throwing a 429 error every now and then. Things that would have been interesting to know include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Number of articles under 400 words (thin content).&lt;/li&gt;
&lt;li&gt;Number of articles in different languages.&lt;/li&gt;
&lt;li&gt;Max crawl depth (maximum number of hops to a piece of content).&lt;/li&gt;
&lt;li&gt;Number of canonicalised pages.&lt;/li&gt;
&lt;li&gt;Average number of internal cross-links per article (not in the "Read Next" module).&lt;/li&gt;
&lt;li&gt;How series posts perform in the SERPs vs regular ole posts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Ahrefs Traffic Estimates:&lt;/strong&gt; It should be noted that all traffic-estimate sites produce wildly inaccurate data at the best of times. I'd hazard that this is particularly true for dev.to due to the nature of the topics. I'd also say that a lot of organic traffic comes from highly specific long-tail search queries which will never turn up in these tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced Google Operators:&lt;/strong&gt; The site: operator has become less and less useful over the years, and without access to the Search Console data it can't prove the point conclusively, but it does highlight the issue, so I've included it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Different Languages&lt;/strong&gt;: As a side note, cross-linking topically relevant articles within the same language would probably be of more benefit than adding the hreflang tags. This, of course, would be its own headache to implement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Murrough Foley&lt;/strong&gt; — Technical SEO consultant and researcher&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://murroughfoley.com" rel="noopener noreferrer"&gt;murroughfoley.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Murrough-Foley" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://linkedin.com/in/m-foley-seo/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://orcid.org/0009-0008-3127-2101" rel="noopener noreferrer"&gt;ORCID&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://crates.io/crates/rs-trafilatura" rel="noopener noreferrer"&gt;crates.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/rs-trafilatura/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://webcontentextraction.org" rel="noopener noreferrer"&gt;Web Content Extraction Benchmark&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>seo</category>
      <category>devto</category>
      <category>meta</category>
    </item>
    <item>
      <title>How BBC News Uses This One Hack to Steal Hundreds of Thousands of Visitors Every Month</title>
      <dc:creator>Murrough Foley</dc:creator>
      <pubDate>Fri, 31 Jul 2020 22:52:19 +0000</pubDate>
      <link>https://forem.com/murroughfoley/how-bbc-news-uses-this-one-hack-to-steal-hundreds-thousands-of-visitors-every-month-from-54kd</link>
      <guid>https://forem.com/murroughfoley/how-bbc-news-uses-this-one-hack-to-steal-hundreds-thousands-of-visitors-every-month-from-54kd</guid>
      <description>&lt;h3&gt;
  
  
  How BBC News Uses This One Hack to Steal Hundreds of Thousands of Visitors Every Month from Competitors
&lt;/h3&gt;

&lt;p&gt;An awful headline, I know, but it’s true. I was curious how the coronavirus had affected traffic for some of the large media publications in the UK, and whilst taking a look at the top 10 news sites I came across something interesting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5z4qi5hg8mg1xmfrv2sc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5z4qi5hg8mg1xmfrv2sc.jpeg" alt="canonical-hack-seo" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Is the BBC using &lt;em&gt;shady tactics&lt;/em&gt; to grab more organic traffic or is it an SEO quick fix for something that has gone unaddressed for years?&lt;/p&gt;

&lt;p&gt;Anyone who juggles clients or who works for a large company knows that there is always work to do. Much of a task schedule is a matter of prioritisation and triage. What’s the most important thing that needs to be done yesterday? Are the BBC using a trick to generate more organic visitors and, if so, should the UK national broadcaster be operating in such a manner?&lt;/p&gt;

&lt;p&gt;So what are BBC News doing to grab hundreds of thousands of extra visitors per month?&lt;/p&gt;

&lt;h3&gt;
  
  
  Ranking The Same Page Twice For Numerous Terms
&lt;/h3&gt;

&lt;p&gt;The BBC’s server setup has all of its category pages duplicated and ranking for multiple terms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.bbc.com/news//world
https://www.bbc.com/news/world

https://www.bbc.com/news//england
https://www.bbc.com/news/england

https://www.bbc.com/news//uk
https://www.bbc.com/news/uk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And as far as I can see, it’s been like this for a long time. Judging from the number of &lt;a href="https://web.archive.org/web/20140701000000*/https://www.bbc.com/news//world" rel="noopener noreferrer"&gt;301s shown in the Wayback Machine&lt;/a&gt;, it looks like they have tested changing things at various points.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AsuDqlc1ntGoOMCU5" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AsuDqlc1ntGoOMCU5" width="1024" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The duplicate URL is canonicalised to the main URL as shown below, but still tagged &lt;em&gt;index&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A9U2lwXtIKFtXQL3K" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A9U2lwXtIKFtXQL3K" width="1024" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The canonical tag doesn’t stop them from ranking for some very large keywords. Below, Ahrefs shows that the canonicalised version is sitting at position 6 for the keyword “today’s news” (70,000 monthly searches) in the UK.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ATly6vljbKmKUGyd0" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ATly6vljbKmKUGyd0" width="1024" height="95"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same goes for the larger keyword “world news” (222,000 monthly) in position 6.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AdOc8yGwROLLpjFVX" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AdOc8yGwROLLpjFVX" width="1024" height="108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On closer inspection, the BBC are taking up both position #1 and a lower position 5/6 for &lt;em&gt;thousands of keywords&lt;/em&gt; using this trick/bug/exploit/mistake.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AenFdKT4c6X4Lzljr" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AenFdKT4c6X4Lzljr" width="1024" height="786"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If I was doing SEO for the BBC, would I want to fix an issue that could be generating an extra 8–12% of traffic from all these searches? Probably not.&lt;/p&gt;

&lt;p&gt;The canonical is supposed to indicate to the search engines that one page’s content is similar to another’s, and to mark the primary source to be ranked. Maybe when you reach the levels of trust and authority that BBC.com has, things work differently.&lt;/p&gt;

&lt;p&gt;As a public service, should the BBC be limiting the choice of the public by taking up two positions via this hack?&lt;/p&gt;

&lt;p&gt;And can you do the same for your site? The only way to know for sure is to test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Also by Murrough Foley&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hackernoon.com/seo-data-science-and-correlative-analysis-for-google-organic-traffic-qrq3ukg" rel="noopener noreferrer"&gt;&lt;strong&gt;SEO, Data Science &amp;amp; Correlative Analysis For Google Organic Traffic&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Murrough Foley&lt;/strong&gt; — Technical SEO consultant and researcher&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://murroughfoley.com" rel="noopener noreferrer"&gt;murroughfoley.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Murrough-Foley" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://linkedin.com/in/m-foley-seo/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://orcid.org/0009-0008-3127-2101" rel="noopener noreferrer"&gt;ORCID&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://crates.io/crates/rs-trafilatura" rel="noopener noreferrer"&gt;crates.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/rs-trafilatura/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://webcontentextraction.org" rel="noopener noreferrer"&gt;Web Content Extraction Benchmark&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bbc</category>
      <category>searchengineoptimiza</category>
      <category>marketing</category>
    </item>
    <item>
      <title>11 Sneaky Negative SEO Attacks &amp; How To Protect Against Them</title>
      <dc:creator>Murrough Foley</dc:creator>
      <pubDate>Tue, 28 Jul 2020 20:53:51 +0000</pubDate>
      <link>https://forem.com/murroughfoley/11-sneaky-negative-seo-attacks-how-to-protect-against-them-25lm</link>
      <guid>https://forem.com/murroughfoley/11-sneaky-negative-seo-attacks-how-to-protect-against-them-25lm</guid>
      <description>&lt;p&gt;Negative SEO is nothing new. As long as there have been ways to improve your position in the organic results, there have been malicious actors who target their competition with negative search engine optimisation techniques.&lt;/p&gt;

&lt;p&gt;This article will cover the common types of SEO attack, one or two less common types and a couple of sneaky ways bad actors can disrupt your business.&lt;/p&gt;

&lt;p&gt;It is important to note at the outset that the best way to protect your sites or your clients’ businesses is to be aware of these various strategies and to closely monitor what is going on.&lt;/p&gt;

&lt;p&gt;If you think your site has been the victim of an attack, a good SEO specialist should be able to diagnose the problem quickly and, in most cases, remediate the situation. But as with all things, prevention is better than cure, and some of the attacks listed below require security measures to be put in place at the outset.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Spam Links / Link Farms / Manual Penalties&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Anchor Text Ratio Unbalancing&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Purchased Links / Cheap PBNs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Link Removal&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Email Blacklisting Or Reducing Sender Score&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hot Linked Images&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scraped Content&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DDOS&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Slow Loris&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Poisoned Canonical&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Straight Up Hacking&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;If good search engine optimisation is knowing what and where the boundaries are, then negative SEO is knowing how to push the target into and over those boundaries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Strictly speaking, the term negative SEO should only really be applied to attacks that aim to reduce your SERP visibility, but I’ve come to look at it more holistically as any digital attack meant to disrupt your online business operations. That can include &lt;em&gt;brand sabotage&lt;/em&gt;, some variations of &lt;em&gt;hacking&lt;/em&gt; or &lt;em&gt;disruption of communications&lt;/em&gt;. This view may seem excessively broad, but as we go through the examples, you may start to agree with me.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spam Links / Link Farms / Manual Penalties
&lt;/h3&gt;

&lt;p&gt;Spam links are low quality links that can be built to negatively impact a site’s SERP visibility. Within this category of attack, there’s a lot of scope to do damage in multiple different ways. But to understand spam links, it’s best to understand a little bit of the history.&lt;/p&gt;

&lt;p&gt;During the heyday of automated tools, when Blackhat SEOs were able to rank poor-quality sites for competitive keywords with ease, Google needed a way to improve the quality of organic results and filter these automatically generated links out of their link index.&lt;/p&gt;

&lt;p&gt;It’s my view that Google’s web spam team were ready to throw a machine learning algorithm at the problem but needed a data source to separate the good links from the bad.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cue the arrival of Google’s disavow tool.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Google introduced the disavow tool under the auspices of giving webmasters the ability to stop poor-quality (purchased or automatically built) links from negatively affecting their sites.&lt;/p&gt;

&lt;p&gt;It was the proverbial confessional box for any nervous webmaster and promised a swift return to page 1 for those that had sinned.&lt;/p&gt;

&lt;p&gt;Couple this technical “Mea Culpa” machine with a hearty dose of manual penalties applied to the most egregious offenders, and Google had the perfect climate of fear, uncertainty and doubt to fuel submissions to the disavow tool.&lt;/p&gt;

&lt;p&gt;This fear drove countless webmasters to use the tool and feed Google’s Machine Learning system.&lt;/p&gt;

&lt;p&gt;Using this constantly growing bank of submitted links, built by tools such as GSA SER, Xrumer and SEnuke, Google was able to identify and devalue these links. Pretty quickly, these automated tools fell out of widespread use with all but the most creative users.&lt;/p&gt;

&lt;p&gt;How the disavow tool works is open to debate. Is it a complicated machine learning algorithm that identifies common variables from the submitted pages? Or is it a far simpler, weighted list of bad domains with poor user registration controls and moderation?&lt;/p&gt;

&lt;p&gt;Due to the sheer amount of computing power needed to analyse the common factors of a “bad site”, and then apply these identified signals to the entire web, my money is on a simpler type of solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So when&lt;/strong&gt; &lt;a href="https://twitter.com/JohnMu/status/1197604755784261633?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;&lt;strong&gt;John Mueller&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;says:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;FWIW looking at the site, I don’t see any negative seo effects there. Like the others in the forums posted, most of those links have no effect at all — sites collect links from all kinds of weird &amp;amp; spammy places over time,&lt;/em&gt; &lt;strong&gt;&lt;em&gt;we’re pretty good at ignoring them&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He is not completely lying. But he is doing his usual thing of telling half-truths.&lt;/p&gt;

&lt;p&gt;Whilst a massive number of low-quality article-type links showing up one day in your backlink analysis tool might look shocking, it’s probably not going to do too much damage. You ought to disavow them anyway.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Ax8Mrsdaru3ByW0z0" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Ax8Mrsdaru3ByW0z0" width="1024" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google can filter out many automated links, such as those that turn up for purchase on link lists. However, more creative operators will have built their own engines to post in more niche areas, or in a way that blends in with user-generated content much better. This can’t be filtered.&lt;/p&gt;

&lt;p&gt;Modern use of low-quality links for negative SEO is more likely to be much slower, stretched out over several months so as not to arouse suspicion. The aim is to take down your site without alerting you to the process, so that you never take action.&lt;/p&gt;

&lt;p&gt;A large part of it is manipulating the link anchors that are pointed to your site.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anchor Text Ratio Unbalancing
&lt;/h3&gt;

&lt;p&gt;The Penguin update targeted webmasters who had over-optimised anchor text, that is to say, webmasters who were gaming the system by using targeted key phrases in their anchors. It spawned the use of the full stop and the comma as anchor text, as SEOs tried to bring their ratios back within acceptable levels.&lt;/p&gt;

&lt;p&gt;Unbalancing anchor text happens when a malicious actor builds ‘good links’ with over-optimised anchors to one (or all) of your pages. Initially this can look like a positive signal, as if your page is generating natural links, but if not monitored it can push you over the edge into a penalty.&lt;/p&gt;

&lt;p&gt;Things get more complicated when you understand that anchor text ratios flow through do-follow links.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Situation 1 — New tier 1 exact match do follow links are built to your site.&lt;/strong&gt; A good SEO can easily monitor new inbound links on tier 1 and the effects to anchor text. Then react and disavow links when he/she feels things are getting critical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Situation 2 — Exact match anchors are built on tier 2 to do follow tier 1 links.&lt;/strong&gt; This is more difficult to identify and deal with, as most backlink tools are just not set up to easily provide you with the anchor used on tier 2. So without spending the time to dig into your website’s tier 2 anchors, you may never know why your rankings have dropped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How To Mitigate Or Prevent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The only way to prevent anchor text unbalancing is to regularly do anchor text audits that look at Tier 2 and maybe even Tier 3 anchors. This is a proactive step that most businesses won’t be able to afford.&lt;/p&gt;

&lt;h3&gt;
  
  
  Obviously Purchased Links
&lt;/h3&gt;

&lt;p&gt;Buying links is against Google’s terms of service. Most search engine optimisation professionals will give a wide berth to any site which is clearly selling “guest posts” or indicates that a post was sponsored.&lt;/p&gt;

&lt;p&gt;A bad actor can use this to their advantage, seeking out domains that have obviously been penalised and dropped dramatically in the rankings, then buying a link there and pointing it at your site.&lt;/p&gt;

&lt;p&gt;The other variation of this would be buying links on the cheapest PBNs available on Blackhat SEO marketplaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Private/Public Blog Networks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A PBN, unless built very carefully, will leave a footprint that Google is well versed in spotting; the simplest example would be all the sites sitting on the same IP address, but there are many more.&lt;/p&gt;

&lt;p&gt;If a large number of your links are coming from a PBN with an easily discernible footprint, your site may be at risk of a penalty.&lt;/p&gt;

&lt;p&gt;The problem you may face here is that many PBN operators cloak their sites from &lt;a href="https://medium.com/@foley.murrough/an-almost-complete-list-of-seo-tools-2020-e598939497a0#8df0" rel="noopener noreferrer"&gt;backlink analysis tools&lt;/a&gt; such as Majestic and Ahrefs, so you may have to trawl through your Search Console data to spot this type of attack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How To Mitigate Or Prevent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There is no way to completely protect yourself against this type of attack but there are a number of steps to mitigate the risk of a malicious actor being successful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#1.&lt;/strong&gt; Monitor your backlink profile and investigate any new links to see if they have natural anchors and check the quality of the domain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#2.&lt;/strong&gt; Use the disavow tool for anything that you are uncertain about.&lt;/p&gt;

&lt;h3&gt;
  
  
  Link Removal
&lt;/h3&gt;

&lt;p&gt;As you may have guessed, links are a major part of how Google identifies and rewards good sites. And large chunks of a company’s marketing budget go into raising brand awareness and the natural links that accompany that spend.&lt;/p&gt;

&lt;p&gt;Those links are hard earned and what makes this type of attack really sneaky is that you may never be aware it’s going on.&lt;/p&gt;

&lt;p&gt;An attacker might set up a branded domain with a different TLD or CCTLD for the purpose of impersonating someone from your company and requesting good links be removed.&lt;/p&gt;

&lt;p&gt;All the pages linking to your site can be pulled from the various backlink databases, and the contact details of webmasters can be scraped or manually retrieved.&lt;/p&gt;

&lt;p&gt;It’s at this point the negative outreach campaign begins, from an email that may look official:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Hey {first-name/}&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It’s Jane Bloggs from YourCompany. I just have a quick request. We are facing some SEO issue that, to be honest, I don’t really understand. Anyway, I’ve been tasked with removing some of our links that we’ve accumulated over the years. Can I get you to remove the link on the page {target-for-link-removal}. I know it would only take a minute and it would really help me out. Thanks!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Regards&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jane Bloggs&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The emails don’t have to come from a similarly branded domain, but obviously it’s the kind of thing that adds some authenticity to the email request.&lt;/p&gt;

&lt;p&gt;These link removal requests can also be perpetrated from social media accounts that impersonate your brand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How To Mitigate Or Prevent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There is no way to completely protect yourself against this type of attack but there are a number of steps to mitigate the risk of a malicious actor being successful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#1.&lt;/strong&gt; The first would be to own all the top-level domains that match your main domain. So if your site is &lt;strong&gt;greatcompany.com&lt;/strong&gt;, then buy the .net, .org and the main country TLD where you operate, whether it be .co.uk or .ru.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#2.&lt;/strong&gt; The next step would be claiming all of your brand accounts on social media. Having a legitimate account with a little bit of activity on each of the major social media platforms should be enough to raise alarm bells if someone tries link removal outreach via a different account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#3.&lt;/strong&gt; Finally, adding some security to your email service such as SPF, DKIM and DMARC records will stop anyone from spoofing your company email addresses.&lt;/p&gt;

&lt;p&gt;Which brings us on to email. Most people wouldn’t classify this as a negative SEO attack, but &lt;a href="https://original.bluehatseo.com/2010/06/" rel="noopener noreferrer"&gt;at one time&lt;/a&gt;, spam reports could have your server IP pulled.&lt;/p&gt;

&lt;h3&gt;
  
  
  Email Blacklisting Or Reducing Sender Score
&lt;/h3&gt;

&lt;p&gt;Email is the lifeblood of most businesses’ communication: it’s used for everyday back and forth, in the sales process, in the generation of leads, and to deliver all sorts of essential information between systems.&lt;/p&gt;

&lt;p&gt;Disrupting a business’s communications could lead to a massive decrease in revenue and might not be identified until it’s too late.&lt;/p&gt;

&lt;p&gt;One very important aspect that affects &lt;a href="https://dev.to/murroughfoley/increase-email-outreach-campaign-effectiveness-with-these-tips-185k"&gt;email deliverability&lt;/a&gt; is a domain’s ‘sender score’. It’s calculated by the ISP (or Email Service Provider) based on a number of factors, including email open rates and spam reports. If an attacker can lower this score, your company’s emails will take a hit.&lt;/p&gt;

&lt;p&gt;This can be done by sending spammy emails purporting to be from your domain.&lt;/p&gt;

&lt;p&gt;Imagine hundreds of thousands of emails being sent from your spoofed domain with subject lines like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Buy Cheap C*alis — International Delivery!&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;College E**ays | Best Prices | Quick TAT&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Beautiful Slavic M*il O*der Br*des — From The Russian Steppes To Your Doorstep&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A sustained attack of this nature could get your domain blacklisted and stop all of your email reaching its intended target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How To Mitigate Or Prevent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#1.&lt;/strong&gt; The main way to prevent this type of attack is by installing &lt;a href="https://dev.to/murroughfoley/increase-email-outreach-campaign-effectiveness-with-these-tips-185k#authenticate-email"&gt;SPF, DKIM and DMARC&lt;/a&gt; records for your email server. You might also consider moving your email to a subdomain of your main domain.&lt;/p&gt;
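
&lt;p&gt;As a minimal sketch, the SPF and DMARC parts are just DNS TXT records (example.com and the included mail host below are placeholders; the real values depend on your email provider, and the DKIM key is generated by that provider):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;; Only the listed mail host may send as example.com; reject everything else
example.com.        IN TXT "v=spf1 include:_spf.google.com -all"

; Quarantine mail that fails SPF/DKIM alignment and send aggregate reports
_dmarc.example.com. IN TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;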

&lt;p&gt;&lt;strong&gt;#2.&lt;/strong&gt; Checking your domain’s &lt;a href="https://talosintelligence.com/" rel="noopener noreferrer"&gt;sender score&lt;/a&gt; and whether your domain can be found on the &lt;a href="https://mxtoolbox.com/blacklists.aspx" rel="noopener noreferrer"&gt;email blacklists&lt;/a&gt; might be overkill unless you notice issues.&lt;/p&gt;

&lt;p&gt;This is not a common type of attack, as far as I am aware, but it is one that could be devastating to many companies. It’s likely that the issue wouldn’t be diagnosed for weeks, and in the meantime all those missed sales would be gone.&lt;/p&gt;

&lt;p&gt;Setting up the needed records in your DNS and with your email provider shouldn’t be overlooked.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hot Linked Images
&lt;/h3&gt;

&lt;p&gt;Image hot linking is when another site embeds your image, served from your server, on their own pages. More often than not they will not have permission, and the low-quality site will look something like this, with thousands of pages covering various niches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AcB43I5erHtN_SVzW" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AcB43I5erHtN_SVzW" width="1024" height="655"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The category pages are tagged ‘index’, whilst the individual pages for each image are tagged ‘no-index’. Very often these sites are built on subdomains of sites with TLDs like .tk or .xyz, and also on Blogspot sites.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AUGa236bIVbXOctK_" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AUGa236bIVbXOctK_" width="1024" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’ve seen sites ranking locally and nationally in Ireland, the UK and the US powered mainly by these types of links.&lt;/p&gt;

&lt;p&gt;The problem is that once your images are being hot-linked by a couple of these sites, you will get caught up in more and more over time. When your site tips over some unknown ratio, you’ll see rankings and traffic plummet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AkH0DhNrqH2nHaKkC" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AkH0DhNrqH2nHaKkC" width="1024" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Is this a directed negative SEO attack? In most cases, probably not. But it is an attack vector that can be prevented.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How To Mitigate Or Prevent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#1.&lt;/strong&gt; Preventing hot linking of images is relatively simple. It just requires adding a little bit of code to your .htaccess file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RewriteEngine on
# Allow requests with an empty referer (direct visits, some proxies)
RewriteCond %{HTTP_REFERER} !^$
# Allow requests referred by your own site, with or without www
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?yoursite\.com/ [NC]
# Forbid everything else requesting these file types
RewriteRule \.(gif|jpg|jpeg|bmp|zip|rar|mp3|flv|swf|xml|php|png|css|pdf)$ - [F]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will of course need to change ‘yoursite.com’ to your own domain name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#2.&lt;/strong&gt; This blocks the hot linking of the images, but a more preventative method would be to block all unwanted bots in both your website’s htaccess/nginx.conf file and the robots.txt.&lt;/p&gt;

&lt;p&gt;Blocking all malicious bots has a couple of benefits, including limiting the amount of information about your site that is available to your competitors, and is always a good idea.&lt;/p&gt;

&lt;p&gt;But this brings us neatly along to the next topic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scraped Content
&lt;/h3&gt;

&lt;p&gt;Imperva / Incapsula reported earlier in 2020 that up to &lt;a href="https://www.imperva.com/blog/bad-bot-report-2020-bad-bots-strike-back/" rel="noopener noreferrer"&gt;37% of all internet traffic is some form of automation&lt;/a&gt;. Personally, I’d like to see exactly how they arrived at that figure, but as proxy providers have entered the corporate world servicing big data, content scraping is becoming more pervasive.&lt;/p&gt;

&lt;p&gt;Whilst content scrapers and other bots pollute your analytics, from a marketing perspective, it’s not the major problem that it once was.&lt;/p&gt;

&lt;p&gt;Google, and the other search engines to an extent too, have become much better at assigning ownership to the original content author.&lt;/p&gt;

&lt;p&gt;The main problem arises if your site (or page) goes offline for a time. This could be down to a server error, a misconfigured index tag, or unpublishing a page by mistake.&lt;/p&gt;

&lt;p&gt;Given enough time, Google will reassign the content’s authorship to the next most authoritative source. In this case, a low quality scraper site would now own your content.&lt;/p&gt;

&lt;p&gt;To check how many sites are scraping your content, grab a random sentence and Google it surrounded by double quotes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7f2aiqkbnj6v8lmvixwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7f2aiqkbnj6v8lmvixwv.png" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How To Mitigate Or Prevent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A determined adversary will be able to circumvent any controls you try to put in place, but most content scraping bots are simple and not targeted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#1.&lt;/strong&gt; &lt;strong&gt;Block bots&lt;/strong&gt;. Stopping bots that declare their user agent is the first and most obvious step. This just requires editing your .htaccess or nginx.conf files. CCarter has a regularly updated &lt;a href="https://www.buildersociety.com/threads/block-unwanted-bots-on-apache-nginx-constantly-updated.1898/" rel="noopener noreferrer"&gt;bot list on BuSo&lt;/a&gt;, with &lt;a href="https://github.com/mitchellkrogza" rel="noopener noreferrer"&gt;Mitchell Krog&lt;/a&gt; being another good option.&lt;/p&gt;
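
&lt;p&gt;As an illustration of the .htaccess approach, the snippet below returns a 403 to a few declared bot user agents (the names here are examples only, not a vetted blocklist; use a maintained list such as those linked above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RewriteEngine on
# Match any of the listed user agents, case-insensitively
RewriteCond %{HTTP_USER_AGENT} (mj12bot|ahrefsbot|semrushbot) [NC]
# Forbid the request and stop processing further rules
RewriteRule .* - [F,L]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;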

&lt;p&gt;&lt;strong&gt;#2.&lt;/strong&gt; &lt;strong&gt;Invest in a Web Application Firewall (WAF)&lt;/strong&gt; — Whether you opt for a paid service such as CloudFlare or AWS, or go for an open source solution such as ModSecurity, a firewall will prevent a large number of bots from hitting your sites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#3.&lt;/strong&gt;  &lt;strong&gt;Modify your RSS Feeds&lt;/strong&gt;  — RSS feeds have a range of different uses but can be exploited by scraper sites. Modify your RSS feeds to serve only a summary that doesn’t include any hyperlinks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#4.&lt;/strong&gt;  &lt;strong&gt;DMCA Request &lt;/strong&gt; — If scraper sites are causing you issues, your last resort may be a &lt;a href="https://www.dmca.com/faq/European-DMCA-Takedown-process" rel="noopener noreferrer"&gt;DMCA request&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#5.&lt;/strong&gt;  &lt;strong&gt;Google Copyright Infringement Tool&lt;/strong&gt; — Google provides &lt;a href="https://www.google.com/webmasters/tools/dmca-notice" rel="noopener noreferrer"&gt;a tool for copyright infringement&lt;/a&gt;. If your report is successful, the infringing party is notified via Google Search Console and the infringing page is delisted. This tool is itself open to abuse by bad actors for negative SEO.&lt;/p&gt;

&lt;h3&gt;
  
  
  (Distributed) Denial Of Service Attack
&lt;/h3&gt;

&lt;p&gt;Picture this: it’s Black Friday, the busiest day of the year for online sales, and your e-commerce site is under constant attack. A DOS attack will affect your site’s responsiveness, and a slow server means visitors and customers will drop off like flies.&lt;/p&gt;

&lt;p&gt;A sustained attack could overwhelm your server under the load of connections, robbing you of sales for hours.&lt;/p&gt;

&lt;p&gt;Depending on the traffic to your site, prevention is probably better than cure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How To Mitigate Or Prevent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Like other automated attacks, the best way to mitigate a DDOS is by paying for enterprise level protection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#1. Invest in a WAF &lt;/strong&gt; — A Web Application Firewall will protect your site against all but the most leveraged types of Denial of Service Attack.&lt;/p&gt;

&lt;p&gt;The next type of attack is related to a DDOS but with a twist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Slow Loris
&lt;/h3&gt;

&lt;p&gt;A slow loris attack’s main aim is to slow down your server by limiting the number of connections available to real visitors and Googlebot. The attacker sends partial HTTP requests that are never completed and, one by one, takes over all of your server’s sockets.&lt;/p&gt;

&lt;p&gt;If Googlebot is unable to fully crawl your site several times in a row, your rankings will drop site-wide.&lt;/p&gt;

&lt;p&gt;Most modern Apache servers are configured to prevent this type of attack out of the box, and whilst I’ve heard of a modified version called Gloris used against Nginx servers, I’m not familiar with it.&lt;/p&gt;
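
&lt;p&gt;On Apache, the relevant protection is mod_reqtimeout, which drops connections that feed in their headers or body too slowly. A sketch of the directive (these values happen to be the Apache 2.4 defaults, but check your own distribution’s configuration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Allow 20s for request headers, extended up to 40s
# as long as data keeps arriving at 500 bytes/s or more;
# apply the same minimum rate to the request body.
RequestReadTimeout header=20-40,MinRate=500 body=20,MinRate=500
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;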

&lt;h3&gt;
  
  
  Poisoned Canonical
&lt;/h3&gt;

&lt;p&gt;Anyone who has bought domains on the secondary market is aware of how diligent you have to be before buying a dropped domain. You have to check the existing and historical links and anchors for indications of spam.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Aa8A-VMt_Bv9vBy7M" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Aa8A-VMt_Bv9vBy7M" width="1024" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You have to eyeball the domain’s archived history in the Wayback Machine to see if it was used to sell knockoff Nikes or Ugg boots at any stage, and then sleuth for a missing year, just in case the spammers were blocking the Internet Archive’s bot “ia_archiver”.&lt;/p&gt;

&lt;p&gt;But it’s par for the course that after doing all the checks, you’ll pick up a domain, load it up on a server, register it in the Search Console only to see a penalty of some sort appear after a few days. Or worse still, it’ll just never perform.&lt;/p&gt;

&lt;p&gt;The point is that penalised domains are not uncommon. They can be bought and used for negative SEO. That might be a redirect toward a target’s site, but more recently bad actors have been exploiting the canonical tag.&lt;/p&gt;

&lt;p&gt;The canonical tag tells the search engines which is the master copy of a page. It combines similar pages into one in the eyes of Google, including the links associated with each canonicalised page.&lt;/p&gt;

&lt;p&gt;Using the canonical tag, an attacker can tell Google that your domain is the master copy for the penalised domain and Google will share the penalty with your site.&lt;/p&gt;
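
&lt;p&gt;As an illustration with hypothetical domains, the attacker simply adds something like this to the head of every page on the penalised domain, pointing at the victim:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!-- On penalised-domain.com, claiming victim-site.com is the "master copy" --&amp;gt;
&amp;lt;link rel="canonical" href="https://victim-site.com/" /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;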

&lt;h3&gt;
  
  
  Straight Up Hacking
&lt;/h3&gt;

&lt;p&gt;If someone is determined and patient, they will be able to breach your security at some stage, whether through a simple password or a 0-day vulnerability. The key is not to leave the key in the lock: take basic precautions.&lt;/p&gt;

&lt;p&gt;Once an attacker has control of your site, there is a range of things they can do, from defacing the site, to adding links to gambling domains, or changing your robots.txt file to disallow all.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Whilst negative SEO attacks are unscrupulous, I don’t believe they are against the law, with the clear exception of hacking. Add to this the fact that they can be difficult to identify and, when done right, impossible to trace, and you have a recipe for disaster.&lt;/p&gt;

&lt;p&gt;Most SEOs are not in the business of destroying the competition out of spite and want to compete on a level playing field. Negative SEO is far less common than you’d think.&lt;/p&gt;

&lt;p&gt;If you are considering negative SEO as a strategy for beating the competition, understand that it’s not an exact science and you may end up strengthening the domain that you were trying to impact.&lt;/p&gt;

&lt;p&gt;Also consider the moral issues: if you negatively affect someone’s business, you are not just hurting the owner, but also the employees and their families. Plot these effects on a timeline and you might be casting a very long shadow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Murrough Foley&lt;/strong&gt; — Technical SEO consultant and researcher&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://murroughfoley.com" rel="noopener noreferrer"&gt;murroughfoley.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Murrough-Foley" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://linkedin.com/in/m-foley-seo/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://orcid.org/0009-0008-3127-2101" rel="noopener noreferrer"&gt;ORCID&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://crates.io/crates/rs-trafilatura" rel="noopener noreferrer"&gt;crates.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/rs-trafilatura/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://webcontentextraction.org" rel="noopener noreferrer"&gt;Web Content Extraction Benchmark&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>seo</category>
      <category>searchenginemarketin</category>
      <category>search</category>
      <category>marketing</category>
    </item>
    <item>
      <title>Increase Email Outreach Campaign Effectiveness With These Tips</title>
      <dc:creator>Murrough Foley</dc:creator>
      <pubDate>Wed, 15 Jul 2020 18:50:00 +0000</pubDate>
      <link>https://forem.com/murroughfoley/increase-email-outreach-campaign-effectiveness-with-these-tips-185k</link>
      <guid>https://forem.com/murroughfoley/increase-email-outreach-campaign-effectiveness-with-these-tips-185k</guid>
      <description>&lt;p&gt;There's nothing worse than spending all that time setting up your email campaign only to discover that nobody is opening your carefully crafted mail. So, where did you go wrong? Here's a list of simple steps and things to bear in mind before pressing the send button.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table Of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Check Your Domain Against Email Blacklists&lt;/li&gt;
&lt;li&gt;Authenticate Your Email With SPF, DKIM &amp;amp; DMARC&lt;/li&gt;
&lt;li&gt;Sender Policy Framework&lt;/li&gt;
&lt;li&gt;DomainKeys Identified Mail&lt;/li&gt;
&lt;li&gt;Domain-based Message Authentication, Reporting and Conformance&lt;/li&gt;
&lt;li&gt;Add A Feedback Loop To Your Email Header&lt;/li&gt;
&lt;li&gt;Avoid Spam Traps&lt;/li&gt;
&lt;li&gt;Use Trusted Email Servers&lt;/li&gt;
&lt;li&gt;Manage Your Email Lists&lt;/li&gt;
&lt;li&gt;Warm Up A New Email Account&lt;/li&gt;
&lt;li&gt;Avoid Spammy Subject Lines &amp;amp; Content&lt;/li&gt;
&lt;li&gt;Creative Subject Lines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of the following steps can be broken up into two main categories, technical issues, and content mistakes. Let's deal with the technical aspect first.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check Your Domain Against Email Blacklists &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;If you've purchased a domain with a history, it's a good idea to check to make sure that it wasn't used for email spam in the past. This is a relatively easy step and just requires you to plug your domain name into one of the many email blacklist checkers that can be found online. &lt;/p&gt;

&lt;p&gt;My favorite tool for this is &lt;a href="https://mxtoolbox.com/blacklists.aspx" rel="noopener noreferrer"&gt;MXTools&lt;/a&gt;, which checks around 100 different blacklists within seconds.&lt;/p&gt;

&lt;p&gt;Anytime you see your email deliverability drop, the first check should always be the blacklists, so you can rule them out and zero in on what the problem is.&lt;/p&gt;
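&lt;p&gt;For illustration, most of these blacklists (DNSBLs) can also be queried directly over DNS: you reverse the IPv4 octets, append the blacklist zone, and an A record in the response means the IP is listed. A minimal Python sketch, using zen.spamhaus.org as just one example zone:&lt;/p&gt;

```python
import socket

def dnsbl_query(ip, zone="zen.spamhaus.org"):
    # DNSBL convention: reverse the IPv4 octets and append the blacklist zone
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip, zone="zen.spamhaus.org"):
    # An A record for the query hostname means the IP is on this blacklist
    try:
        socket.gethostbyname(dnsbl_query(ip, zone))
        return True
    except socket.gaierror:
        return False
```

&lt;p&gt;A tool like MXTools essentially repeats this lookup across dozens of zones at once.&lt;/p&gt;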

&lt;h3&gt;
  
  
  Authenticate Your Email With SPF, DKIM &amp;amp; DMARC &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Just like regular post, protecting the privacy and ensuring the integrity of email is essential. And as the threats to the mail system have become more sophisticated, so too have the protocols that protect us.&lt;/p&gt;

&lt;h4&gt;
  
  
  Sender Policy Framework &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Sender Policy Framework was probably one of the first protocols used to authenticate the sender. Simply put, the SPF record allows the receiving email server to check that the sender is on a list of authorised IP addresses. You can think of this like checking the return address on a regular letter.&lt;/p&gt;

&lt;p&gt;To set up an SPF record you'll need to add a TXT record to your DNS. &lt;a href="https://www.spfwizard.net" rel="noopener noreferrer"&gt;SPFWizard&lt;/a&gt; is a handy little tool to assist in creating the TXT record.&lt;/p&gt;
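&lt;p&gt;For reference, a finished SPF record is a single TXT entry; the IP and include below are placeholders for your own sending server and provider:&lt;/p&gt;

```text
example.com.  IN  TXT  "v=spf1 ip4:203.0.113.10 include:_spf.google.com ~all"
```

&lt;p&gt;The &lt;code&gt;~all&lt;/code&gt; suffix soft-fails mail from any server not listed; &lt;code&gt;-all&lt;/code&gt; rejects it outright.&lt;/p&gt;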

&lt;h4&gt;
  
  
  DomainKeys Identified Mail &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;DomainKeys Identified Mail or DKIM is a newer and more robust way to ensure the authenticity of the sender and the integrity of the email contents, using a public/private key cryptographic system. You can think of DKIM-authenticated email as akin to registered post.&lt;/p&gt;

&lt;p&gt;Setting up and configuring DKIM can be a more involved process, depending on your email server setup, but essentially it can be broken down into the following steps.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate your key pair&lt;/li&gt;
&lt;li&gt;Add the public key as a TXT record in DNS&lt;/li&gt;
&lt;li&gt;Generate and save the DKIM-Signature&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.zerobounce.net/services/dkim-generator.html" rel="noopener noreferrer"&gt;Zerobounce&lt;/a&gt; have a great little tool to help generate the needed keys. Ideally, for best security practice, you should be doing this on your local machine.&lt;/p&gt;

&lt;p&gt;If you are using GSuite for email, there's a pretty concise tutorial below.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/o9wvgfZbR8o"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h4&gt;
  
  
  Domain-based Message Authentication, Reporting and Conformance&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Domain-based Message Authentication, Reporting and Conformance, or DMARC for short, is the latest protocol and it relies on SPF and DKIM. You can think of it as the final check on whether the email is coming from the verified sender; if anything is amiss, the receiving email server will drop the message and let the purported sender know of the issue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dmarc.org/resources/deployment-tools/" rel="noopener noreferrer"&gt;Dmarc.org&lt;/a&gt;, have a great list of tools and resources to help with the process, but you shouldn't have too much difficulty with the process. You might consider setting up a specific email address to store and monitor the DMARC daily emails.&lt;/p&gt;

&lt;p&gt;By setting up email authentication, you should improve mail deliverability, particularly for new domains. But this won't ensure that all your emails get to where they are going.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check Your Email Sender Reputation Score &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Your email sender reputation score is assigned by each of the ISPs and indicates how trusted your email is. Each ISP will have different weighted values for different attributes giving you a different score for each ISP.&lt;/p&gt;

&lt;p&gt;The sender reputation score is an essential part of email deliverability. Here are some services to check your email sender score.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.senderscore.org" rel="noopener noreferrer"&gt;Sender Score&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://talosintelligence.com" rel="noopener noreferrer"&gt;Talos Intelligence&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.reputationauthority.org" rel="noopener noreferrer"&gt;Watch Guard's Reputation Authority&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Add A Feedback Loop To Your Email Header &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;An email Feedback Loop or FBL is a system that reports when an email is marked as spam in a user's inbox. It can be used to keep an email marketing list clean by removing users who indicate they are not interested in your email.&lt;/p&gt;

&lt;p&gt;This type of service is provided by many of the large ISPs and some of the more specialized services such as SendGrid have done all the heavy lifting for you. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;NB&lt;/em&gt;&lt;/strong&gt; Google offers their FBL to high volume senders.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Avoid Spam Traps &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Spam traps are the honeypots of unwanted email. They are email accounts set up specifically so that the only senders who find them are spammers who scrape email addresses from sites or compile their lists in some other questionable way.&lt;/p&gt;

&lt;p&gt;To avoid spam traps, don't buy email lists. Make sure you use an email validation service before sending emails and clean your list every few months.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Trusted Email Servers &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Google aren't the only game in town, but they are the largest with over 1.8 billion users worldwide and one recent report stating that they are responsible for around &lt;a href="https://techjury.net/blog/gmail-statistics/" rel="noopener noreferrer"&gt;43% of email opens&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Google trusts its own systems and some argue that by using Google's own email servers, you'll see a higher level of deliverability. &lt;/p&gt;

&lt;p&gt;Personally, I have only used Google and AWS SES, so can't compare.&lt;/p&gt;

&lt;p&gt;Other email providers include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.mailgun.com" rel="noopener noreferrer"&gt;Mail Gun&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/ses/" rel="noopener noreferrer"&gt;Amazon AWS SES&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//www.microsoft.com/en-ie/microsoft-365/business/affiliate-program-buy-microsoft-365-business-standard"&gt;Office365&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sendgrid.com" rel="noopener noreferrer"&gt;Send Grid&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many more, each with their own unique offering that will suit different types of use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Manage Your Email Lists &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Managing your email list is an essential part of ensuring high deliverability. The more recipients who don't open your email or, worse, consign it to the spam folder, the worse your future campaigns will be affected.&lt;/p&gt;

&lt;p&gt;Managing your email list once a quarter should be part of any digital marketing calendar. The cleaner your list the better, but there are lots of ways to go about this task.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove anyone who hasn't interacted in the last 90 days&lt;/li&gt;
&lt;li&gt;Remove duplicate email addresses&lt;/li&gt;
&lt;li&gt;Remove emails that have a history of a 'hard bounce'&lt;/li&gt;
&lt;li&gt;Remove any clearly spammy email addresses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many marketers will advise sending a re-engagement email before cleaning your lists.&lt;/p&gt;
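&lt;p&gt;The cleaning rules above are easy to script. A minimal sketch in Python (in practice the hard-bounce list would come from your email provider's reports):&lt;/p&gt;

```python
def clean_list(subscribers, hard_bounced=()):
    """Normalise, deduplicate, and drop hard-bounced addresses."""
    bounced = {b.strip().lower() for b in hard_bounced}
    seen = set()
    cleaned = []
    for email in subscribers:
        addr = email.strip().lower()
        if addr in seen or addr in bounced:
            continue  # skip duplicates and known hard bounces
        seen.add(addr)
        cleaned.append(addr)
    return cleaned
```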

&lt;h3&gt;
  
  
  Validate Your Email List Before Sending &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Many email apps will have a function to validate the list of email addresses before sending. For larger lists, this can get expensive but is always a good idea. &lt;/p&gt;

&lt;p&gt;There are a range of different services that offer this feature, some use an API to work with your existing mail application, others are baked into the service such as Mail Gun.&lt;/p&gt;
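&lt;p&gt;Before paying for a validation service, a cheap first pass is a purely syntactic filter that catches obvious junk. This sketch is no substitute for a real service, which also checks MX records and mailbox existence:&lt;/p&gt;

```python
import re

# Deliberately loose pattern: one @, no whitespace, a dot in the domain
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_valid(address):
    return bool(EMAIL_RE.match(address.strip()))
```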

&lt;h2&gt;
  
  
  Warm Up A New Email Account &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Any sudden spike in email activity will trip spam alerts. It's always best to start slow and build up the number of emails you send per day. This is especially important on a new email account on a brand new domain. &lt;/p&gt;

&lt;p&gt;Historically, spammers would set up and start mass emailing from an account within hours. If you mimic this type of activity, even by mistake, you'll be subject to the same low rates of email deliverability.&lt;/p&gt;

&lt;p&gt;Warm up the account by progressively increasing the number of emails you send each day and don't be overly aggressive. &lt;/p&gt;
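&lt;p&gt;One simple way to put this into practice is a fixed growth schedule; the numbers below are illustrative defaults, not a recommendation:&lt;/p&gt;

```python
def warmup_schedule(start=20, growth=1.3, cap=500, days=14):
    """Daily send volumes that ramp up gradually instead of spiking."""
    volumes = []
    volume = float(start)
    for _ in range(days):
        volumes.append(min(cap, int(volume)))
        volume *= growth
    return volumes
```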

&lt;h3&gt;
  
  
  Avoid Spammy Subject Lines &amp;amp; Content &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;This could be a blog post in itself. There are any number of key phrases which, if included in your subject line, may result in your email being filtered to spam. Take a look at your own inbox for things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I need your urgent reply&lt;/li&gt;
&lt;li&gt;Win Big!!!&lt;/li&gt;
&lt;li&gt;Free Discounts!&lt;/li&gt;
&lt;li&gt;High DA Domains&lt;/li&gt;
&lt;/ul&gt;
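&lt;p&gt;If you want a quick sanity check before sending, even a tiny phrase filter goes a long way. The phrase list here is illustrative; real spam filters use far more sophisticated scoring:&lt;/p&gt;

```python
SPAM_PHRASES = ("urgent reply", "win big", "free discount", "high da domain")

def spam_flags(subject):
    """Return the known spammy phrases found in a subject line."""
    lowered = subject.lower()
    return [phrase for phrase in SPAM_PHRASES if phrase in lowered]
```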

&lt;h3&gt;
  
  
  Creative Subject Lines &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;A well crafted subject line will result in more email opens. These opens improve your sender reputation so care should always be taken.&lt;/p&gt;

&lt;p&gt;Also, be aware that the Google email client shows the first couple of words of the email body, so you can often elicit curiosity within this space.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;That's it for improving email deliverability, if you think I've missed anything, I'm always looking to learn so be sure to drop a comment.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Murrough Foley&lt;/strong&gt; — Technical SEO consultant and researcher&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://murroughfoley.com" rel="noopener noreferrer"&gt;murroughfoley.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Murrough-Foley" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://linkedin.com/in/m-foley-seo/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://orcid.org/0009-0008-3127-2101" rel="noopener noreferrer"&gt;ORCID&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://crates.io/crates/rs-trafilatura" rel="noopener noreferrer"&gt;crates.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/rs-trafilatura/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://webcontentextraction.org" rel="noopener noreferrer"&gt;Web Content Extraction Benchmark&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>email</category>
      <category>seo</category>
      <category>dns</category>
      <category>webdev</category>
    </item>
    <item>
      <title>An Almost Complete List Of SEO Tools</title>
      <dc:creator>Murrough Foley</dc:creator>
      <pubDate>Sun, 05 Jul 2020 20:16:00 +0000</pubDate>
      <link>https://forem.com/murroughfoley/an-almost-complete-list-of-seo-tools-2020-1oeh</link>
      <guid>https://forem.com/murroughfoley/an-almost-complete-list-of-seo-tools-2020-1oeh</guid>
      <description>&lt;h3&gt;
  
  
  A List Of SEO Tools (2020)
&lt;/h3&gt;

&lt;p&gt;There are a huge number of great paid tools on the market, from suites that try and do everything like SEMrush, to the highly specialised ones that focus on a single task such as Copyscape. I’ve collected a few here that are worth investigating for junior search engine optimisation professionals.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keyword Research&lt;/li&gt;
&lt;li&gt;Browser Extensions&lt;/li&gt;
&lt;li&gt;WordPress Plugins&lt;/li&gt;
&lt;li&gt;Conversion Rate Optimisation&lt;/li&gt;
&lt;li&gt;Email Tools&lt;/li&gt;
&lt;li&gt;Rank Trackers&lt;/li&gt;
&lt;li&gt;Automation&lt;/li&gt;
&lt;li&gt;Link Analysis Tools / Backlink Checkers&lt;/li&gt;
&lt;li&gt;Crawlers&lt;/li&gt;
&lt;li&gt;Backlink Monitoring Tools&lt;/li&gt;
&lt;li&gt;Content Analysis Tools&lt;/li&gt;
&lt;li&gt;Productivity &amp;amp; Project Management Tools&lt;/li&gt;
&lt;li&gt;OSINT &amp;amp; Research Tools&lt;/li&gt;
&lt;li&gt;Citation Resources &amp;amp; Services&lt;/li&gt;
&lt;li&gt;Proxy Providers&lt;/li&gt;
&lt;li&gt;Content Delivery Networks&lt;/li&gt;
&lt;li&gt;Search Engine Provided Tools &amp;amp; Services&lt;/li&gt;
&lt;li&gt;Search Engine Tools / Hacks&lt;/li&gt;
&lt;li&gt;Quick &amp;amp; Dirty Tools&lt;/li&gt;
&lt;li&gt;Python Libraries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Keyword Research&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The most important aspect of search engine optimisation. Without the correct targets, how can you have the correct goals? Keyword research should lead the site architecture and give you an in-depth insight into how to group your content. It will also indicate how best to interlink that content and uncover the most profitable keywords and phrases.&lt;/p&gt;

&lt;p&gt;Keyword volume is one of the key metrics you’ll be using to prioritise or rank your keywords, but care should be taken. Many of the keyword research tools below just pull the data straight from Google’s Keyword Planner, and this data is not reliable, particularly when looking at keywords with low volume. Ahrefs is slightly more reliable, as they have an internal process that references data from Alexa/Clickstream. But the important takeaway is that keyword volume is a guideline and not a reliable metric of the traffic you can expect to see.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Keyword Sheeter&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Originally under a slightly different spelling, this great tool &lt;em&gt;sheets&lt;/em&gt; out keywords from a variety of sources and is a great resource for the time-intensive process of keyword research.&lt;/p&gt;

&lt;p&gt;You will need to filter and prioritise these results as the amount of keywords you’ll get back can be intimidating.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SECockPit&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Swiss Made Marketing came out with this expensive tool in 2014/2015 and simplified the arduous task of completing comprehensive keyword research. Whilst I don’t use it so much anymore, it’s still worth a look.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Browser Extensions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Depending on what task you are doing, there will be a number of add-ons or extensions that will give you more insight. So probably the most important extension is your extension manager, which will allow you to activate and deactivate these plugins as needed. This is not an exhaustive list but more a place to get started.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SEO Quake&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A long-time staple of search engine optimisation folks, this browser plugin provides a wealth of information at the page and domain level. There are literally too many features to list: it can be used to tighten up content for on-page SEO, is a handy way to pull the top 100 SERP results into a CSV file, plugs into many backlink checkers, and more.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Detailed — SEO Extension&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not quite a replacement for SEO Quake but a much more streamlined plugin with many of the same features. Things like the robots.txt, &lt;a href="https://it.toolbox.com/blogs/murroughfoley/schema-for-local-businesses-022318" rel="noopener noreferrer"&gt;schema&lt;/a&gt; and archive.org are only a click away.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://chrome.google.com/webstore/detail/remove-breadcrumbs/banhponphmmpnpogmfaahcgkgbgkcoka/related?hl=en" rel="noopener noreferrer"&gt;&lt;strong&gt;Detailed — Remove Bredcrumbs&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another one from Viperchill’s big boy pants company. This removes the breadcrumbs that obscure the URL path in Google’s SERPs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Browseo V3&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a browser extension but a customised browser that allows you to maintain multiple personas each with an individualized proxy attached. There are a large number of reasons why this can come in handy and whilst this is a great tool, the recurring fee makes it a little too expensive unless you have a particular use-case. And there are other options.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What Runs Where&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What Runs Where gives a basic overview of the web technologies used to build a site or web-app. If you are checking the competence of your competition, this is a good place to start.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Builtwith&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Built With Site Profiler provides a huge amount of useful data including connected websites that are using the same tags, redirected domains, a much more in-depth look at the technologies being used on the site and meta information.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Mangools SEO Extension&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A like-for-like replacement for SEO Quake; you may prefer the interface. It’s not as feature-rich, with things like a link to the domain WHOIS and archive.org missing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Disable Javascript&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There will be multiple occasions when you’ll need to see how a page is rendered before javascript gets involved. This is a handy little switch that gives you an immediate view of the vanilla HTML and CSS.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SEERobots&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I highly recommend this extension, as checking the X-Robots tags is needed on an ongoing basis, whether you need to check if a link is on a no-indexed page, or whether you have correctly switched a transferred domain back to the live version.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MagicCSS&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A handy little extension that allows you to edit the CSS on a live page.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Grammarly&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Poor spelling and grammar is something that can be easily avoided. There are privacy concerns with this tool due to their rumoured deal with ClickStream but it does one thing very well.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ghostery&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whilst marketed as a way to block tracking scripts around the web, it’s also very useful for checking that your own Facebook pixel / Google Ads / Remarketing cookies have been installed correctly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;WordPress Plugins&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;There is a sea of WordPress plugins that have value, but ideally we should all be avoiding plugins as much as possible due to their inherent security risk. One increasing trend is the use of plugins as a backdoor to place links on your site, so take care and do some research before installing something.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RankMath&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After Yoast proved themselves completely unreliable in recent years, with bug after bug that negatively affected rankings, a better alternative is RankMath, who are adding functionality to their core plugin rather than reducing features.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Redirection&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Includes a host of useful features, including 404 error logs from within WP, which allow anybody to find and fix misconfigured inbound links and lost link juice.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;WP Tag Manager&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a number of ways to integrate Google’s Tag Manager into a WP site; the best option is to include the code in a child theme, but this plugin is useful if that’s not an option.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conversion Rate Optimisation
&lt;/h3&gt;

&lt;p&gt;Conversion rate optimisation, or what HR like to advertise as “post click analysis”, can have a huge impact on your ROI per visitor when done right.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;VWO&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feature-rich and at a great price point, VWO allows A/B testing, segmentation, and mutually exclusive groups, all on an intuitive platform. Highly recommended for sites that have the traffic needed to gain useful insights. Customer support is also excellent.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Effective Experiments&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm not very familiar with this platform yet, but a number of people have recommended it. It’s on my to-do list.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hotjar&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A handy platform; the heat-mapping function in particular allows you to pinpoint any CRO errors on specific pages. It has a generous free tier and should be implemented on any landing pages.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Email Tools&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Email is the bread and butter of any digital marketer and is probably one of the biggest time-sinks there is, if not done right. Here are a couple of tools that are useful at reducing the amount of time you’ll spend on email.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hunter.io&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a lot of tools that provide a similar service with more being brought to market all the time. Hunter.io has a pretty reliable database of domain and email addresses and has a handy little tool baked in for deducing the email structure of a company.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GMass&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the first mass emailing services I tried that allows you to use Gmail’s email servers. When I used it, it wasn’t a mature product and lacked a number of important features, the most important being error checks. It was possible to send several hundred (or thousand) emails by mistake, and not have a fallback for the first name. I’m sure it’s been improved since I looked at it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;StreakCRM&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Streak CRM allows for open and click tracking, automation of leads through the sales process, integrates with Google Sheets and is pretty intuitive to use.&lt;/p&gt;

&lt;p&gt;Despite the advertised limitations of the free plan, there is no limit on the number of open trackers per month, or at least there wasn’t in April of 2019 when I last tested.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Mailshake&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Send email using Google’s trusted email servers allowing for a higher level of deliverability. It has integrated automated email follow-up and helpful functions that stop you emailing the same person multiple times.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rank Trackers
&lt;/h3&gt;

&lt;p&gt;Rank Trackers are pretty boring, generally you need them to do one thing and one thing well. There has been some innovation in this space, but not a whole lot.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SERPWoo&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not really a rank tracker, at least not in the conventional sense. It tracks the SERP and not just your position within it. This allows for detailed analysis of changes after updates, and investigation of competitors who are making gains.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ahrefs&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you are tracking keywords for a large project conventional rank trackers can get prohibitively expensive. Ahrefs can give you a top down view of your site with delayed results. It does have an integrated “Project Tracker”, but like the more conventional trackers out there, this can get very expensive.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SEMrush&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SEMrush tries to do it all and in some areas fails; the rank-tracking part of their setup is quite robust and feature-rich, though.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SERPRobot&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best bang for your buck rank tracker out there. A no frills service that does one thing really well.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Automation&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;UBot Studio&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;UBot Studio is not cheap and, to be honest, is pretty buggy. It does give you the ability to quickly automate almost any task, from building a Twitter bot to manipulating RSS feeds. Over the years I find I use it less and less, as many things can be done with a rudimentary understanding of Python.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scrapebox&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scrapebox has a bad reputation due to its overuse by comment spammers, but it has 101 small features that always come in handy, including the quick URL dedupe feature.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GSA SER&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not particularly relevant anymore but like Scrapebox, can still have its uses.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GSA Email Finder&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Due to changes in EU legislation, scraping personal information such as email addresses without a lawful basis is now illegal. A quick look at any server logs will show that this is not adhered to by many, and enforcing the new law appears to be impossible, but the fact remains that tools such as GSA’s are now on the wrong side of it.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Link Analysis Tools / Backlink Checkers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Backlink profile analysis is one of the fundamental aspects of SEO, and a true understanding of why some sites rank and others tank often comes down to careful investigation of a domain’s link profile. Whilst proprietary metrics such as Moz’s DA and Ahrefs’ DR give an indication as to a site’s authority, the devil is in the detail and you have to look at the links on tier 1 and tier 2 to really understand what is happening.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ahrefs&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Considered by many to be the best tool in the game. Whilst Ahrefs don’t have the biggest link index, they seem to filter low-quality links better than the competition, new links appear sooner, and the company have steadily improved the feature set over the years. Plus, the interface allows you to drill down to get at useful information without exporting the data to a spreadsheet.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Moz Open Site Explorer&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The industry standard for many years, Moz’s metrics were too easily manipulated and they really stopped trying in 2015/2016 allowing others to move ahead. They have an impressive link index but for everyday use, just don’t seem to find new links for months and bad data makes for bad decisions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SEMRush&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SEMrush have recently boasted of the largest link index in the industry but, again like Moz, new links just don’t appear in their system quickly enough to be relied upon. With its wide range of tools, SEMrush may be a good option for small digital marketing shops, but for link analysis and understanding a domain’s link profile, I personally find it difficult to work with.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Majestic&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Majestic’s two proprietary metrics, TrustFlow and CitationFlow, were modelled closely after one of Google’s original patents and give a useful view of the power and trust of a link or domain. They’re still open to manipulation but are more useful than many other proprietary metrics. Majestic’s link index is comprehensive and their crawler surfaces links quicker than much of the competition, even turning up links that Ahrefs fails to.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Search Console&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Google’s Search Console contains a selection of links that Google’s crawler has come across but, like most data Google provides, it is purposefully incomplete.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Linkody&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Linkody is a meta-service that lets you combine backlink sources such as Moz and Ahrefs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Link Research Tools&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I haven’t used LRT’s platform as it has always been outside my budget. They pull data from a number of sources and deduplicate the results, and they have an excellent reputation with many and a great blog.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Netpeak Checker&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A very useful tool: you plug in your API keys for services such as Ahrefs and Majestic and pull only the specific data you need.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Crawlers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Technical SEO is an essential part of any search optimisation role and it usually begins with a crawl.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Screaming Frog&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The long-time industry standard continues to add features as technical SEO evolves, with recent integrations of page &lt;a href="https://marketing.toolbox.com/blogs/murroughfoley/10-tips-to-improve-site-speed-before-googles-speed-update-022318" rel="noopener noreferrer"&gt;speed tests&lt;/a&gt;, schema markup checks and an updated server log file analyser. Some argue the UI could do with a refresh, but why change it when everyone is used to it as it is?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SiteBulb&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SiteBulb has a similar feature set to Screaming Frog but gives you that information in a way that is easier to act upon. It’s also more beginner friendly, in that issues are categorised and ranked in “perceived importance”.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Xenu Link Sleuth&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A bit of a throwback, this Win32 crawler is free and could still be useful to some.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Deepcrawl&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deepcrawl is an enterprise-level SaaS targeting sites with a footprint larger than 500,000 pages. I have worked on sites of at most 380,000 pages and never needed to employ Deepcrawl, but it has a good reputation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Backlink Monitoring Tools&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When you’ve worked hard to earn your links, you don’t want to give them up without a fight, or at least an email. Like all things nowadays, the trend is toward recurring-fee SaaS applications. This is far from ideal: whilst some applications benefit from a SaaS set-up, it is unnecessary for backlink monitoring.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Inspyder Backlink Monitor&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A great little tool that can be run when needed; it checks backlinks and notifies you when you lose a link. You can set it up for tiered links too.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Content Analysis Tools&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The term ‘Scientific SEO’ has been thrown around in recent years but it’s really just a sales pitch for an industry that has to juggle so many ranking factors. That’s not to say that content analysis tools aren’t important.&lt;/p&gt;

&lt;p&gt;Content analysis tools allow you to statistically investigate the factors that correlate with a high position for a given search term, surfacing aspects of your page that you may have overlooked and that need improvement. More tools offering these features come to market all the time, but here are the most popular.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://surferseo.com" rel="noopener noreferrer"&gt;&lt;strong&gt;SurferSEO&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reasonably priced, SurferSEO has the most streamlined user interface, which lets you focus on improving the content. It’s easy for writers to understand and make the most of, and you still have the data when you need to look under the hood.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CoraSEO&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CoraSEO provides the greatest amount of raw data. The developer has recently gone to some lengths to present that data in an easily digestible way, but you are still working with a spreadsheet. And whilst you can gain the most insight from this tool, it’s not something you can hand off to a writer or junior SEO: the data has to be interpreted and the SERP analysed in context.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Page Optimiser Pro&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I was an early adopter of POP after seeing the results Kyle Roof got with his now-infamous local &lt;em&gt;lorem ipsum&lt;/em&gt; test. The UI passed through several iterations as features were added, and after trying SurferSEO, I made the switch.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Siteliner&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a great tool to help quickly diagnose the amount of duplicate content on your site. You won’t need a paid account to quickly get an estimate of how serious an issue you have. Remember the more unique your page, the higher the chance that Google will index it and keep it indexed.&lt;/p&gt;
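&lt;p&gt;For a crude offline version of the same check, Python’s standard library can approximate how similar two pages are. This is only a sketch of the idea (the sample texts are made up, and the ratio is not Siteliner’s actual metric):&lt;/p&gt;

```python
from difflib import SequenceMatcher

def similarity(text_a: str, text_b: str) -> float:
    """Return a rough 0..1 similarity ratio between two page texts."""
    return SequenceMatcher(None, text_a, text_b).ratio()

page_a = "Our widgets are hand made in Ireland and ship worldwide."
page_b = "Our widgets are hand made in Ireland and ship to Europe."

# Near-duplicate pages score close to 1.0; unrelated pages score near 0.
print(round(similarity(page_a, page_b), 2))
```

&lt;p&gt;Running this across every pair of pages on a small site gives a quick first pass at the duplicate-content problem before reaching for a paid tool.&lt;/p&gt;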

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Copyscape&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the first checks you perform on any content you receive from writers is a plagiarism check. Whilst several new services have appeared in recent years, many of them use Copyscape’s API, so you might as well get the service from the source and save in the process.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Productivity &amp;amp; Project Management Tools&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OmniFocus&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OmniFocus is the Omni Group’s task manager for Mac and iOS, built around the Getting Things Done methodology. Handy for keeping on top of the long tail of small recurring SEO tasks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OmniPlan&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OmniPlan is the Omni Group’s project management app, with Gantt charts and resource scheduling that suit longer campaigns.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Asana&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Asana is a web-based work management platform, well suited to teams that need to assign and track tasks across a campaign.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;OSINT &amp;amp; Research Tools&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Open source intelligence tools may seem like overkill but are important if you need to find a competitor’s PBN or want to map a business’s various web properties to understand the full breadth of their business.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;TinEye&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An excellent reverse image lookup service with numerous applications.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;DomainIQ&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Serves some data on domains but is not very reliable; the data should be verified using other services.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ipfingerprints&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A standard reverse IP lookup service.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;DomainTools Whois&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Comprehensive whois service with paid historical data.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Citation Resources &amp;amp; Services&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You may think that citations aren’t relevant for international SEO, but they have a variety of uses. In some countries you need a minimum number of links from local ccTLDs, and they can also be used as pillowing links to dilute your backlink profile.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Brightlocal&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A managed service based in the UK with a nice portal for keeping an eye on progress and indexation. Worth the money if you prefer a more hands-off approach.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;WhiteSpark&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similar to Brightlocal but with a greater database of niche-specific directory sites. It won’t suit every type of business, but niches like dentists and lawyers are well covered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Proxy Providers
&lt;/h3&gt;

&lt;p&gt;There are a variety of situations where you may need proxies. Describing the different types of proxies and their uses is outside the scope of this article.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Infatica&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Residential and mobile proxies with a global network. Pricing is per GB and of course, not cheap.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;BuyProxies&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I haven’t used them in a number of years; they were the go-to provider for a long time, but I can’t speak to their proxy quality or reliability recently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Content Delivery Networks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CloudFlare&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A free service with an integrated WAF to protect against DDoS attacks. CloudFlare greatly simplifies the process of updating or switching servers, with immediate DNS record updates.&lt;/p&gt;

&lt;p&gt;They provide a free SSL cert but the increasing number of sites that use CloudFlare and the way the free SSL cert is implemented mean you are trusting them with your data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Amazon Web Services / CloudFront&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An enterprise-level content delivery network that can greatly reduce a server’s time to first byte. It integrates with S3 and the Amazon WAF, meaning faster and more secure sites, at a price though.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Incapsula / Imperva&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have yet to use Incapsula / Imperva, but I’ve heard good things and would like to take a closer look when the opportunity presents itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  E-Commerce Tools &amp;amp; Resources
&lt;/h3&gt;

&lt;p&gt;Out of the box, most e-commerce platforms are horribly optimised for SEO, with WooCommerce being the worst offender.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CartFlows&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cartflows doesn’t solve all of WooCommerce’s issues, but it does remedy a couple of the major shortfalls of the default cart.&lt;/p&gt;

&lt;h3&gt;
  
  
  Search Engine Provided Tools &amp;amp; Services
&lt;/h3&gt;

&lt;p&gt;The various search engines provide some useful tools for digital marketing: everything from a customised search engine that notifies you when your brand is mentioned, to help identifying global trends.&lt;/p&gt;

&lt;p&gt;There is one caveat: the data should be taken as an indication rather than the cold hard truth, with Google’s keyword volume the prime suspect. Google has no responsibility to provide accurate data, nor would it be in their best interests to provide anything more than an approximation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Google Trends&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good way to identify seasonal trends for a topic or product. It’s always best to compare your target with a topic or product that you have hard data on so that you can infer the general levels of interest.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Google AdWords Keyword Tool&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The source data for most keyword volume tools, but it is wildly inaccurate. Keyword groupings can provide some insight.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Google Alerts&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a number of SaaS projects that resell Google Alerts at ridiculous prices with a nice front end. Google Alerts lets you set up notifications for keywords: when Google’s crawler encounters your keyword, you get an alert. A very handy tool for keeping an eye on your brand name and competitors.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Google Analytics&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Industry standard analytics program, very powerful but with a couple of missing features, or features that require a deep dive into the config panels.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Google Custom Search Engine&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Google allows you to “create” your own search engine and only include your competitors. This helps to identify new posts or major changes within your niche.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Google Search Console&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Allows you to mine a plethora of information on your progress in organic search, and it often turns up keywords that you’d miss via traditional keyword research.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Bing Webmaster Tools&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Bing handling less than 3% of searches, it’s ignored by most SEOs and digital marketers, as the results are too easily manipulated. Its webmaster tools used to turn up a number of inbound links that did not appear in Google’s offering, and this data can come in handy when you are disavowing links and want the largest dataset to work with.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Searx&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not particularly useful for SEO, but it’s a really interesting project: a meta-search engine that you self-host, with numerous engines, a range of customisations and integrated proxy support.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Search Engine Tools / Hacks&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;USearchFrom.Com&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dynamic nature of SERPs means that the Google results you are seeing are not always the same as what your clients or competitors are seeing. Whilst there are ways around this, one quick hack to see the results from a different country is usearchfrom.com. The creator built it out of frustration with a similar site that failed to implement some basic but essential upgrades.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Penguin Analysis Tool&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A handy little tool for identifying the nature of a penalty, though less useful in recent years with the rolling core updates.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Quick &amp;amp; Dirty Tools&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Text Mechanic&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whilst many of the features of this web app can be found in a good text editor, Text Mechanic’s ease of use makes it a big winner. It started as a small weekend project years ago; the creator checked his analytics one day, shocked to find it was seeing high traffic and that SEOs and webmasters had come to rely on it. He has since monetised it, and the restrictions are a bit of an annoyance, but hey, it’s still a great tool in a pinch.&lt;/p&gt;
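&lt;p&gt;Text Mechanic’s staple line-based operations, such as adding a prefix or suffix to every line and removing duplicate lines, each have one-line equivalents in Python. A quick sketch with made-up keyword data:&lt;/p&gt;

```python
lines = ["blue widgets", "red widgets", "blue widgets"]

# Add a prefix and suffix to every line.
prefixed = [f"buy {line} online" for line in lines]

# Remove duplicate lines while preserving order
# (dict keys are unique and insertion-ordered).
deduped = list(dict.fromkeys(lines))

print(prefixed)
print(deduped)  # ['blue widgets', 'red widgets']
```

&lt;p&gt;Handy when a list is too large for a web tool’s free-tier limits.&lt;/p&gt;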

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Mergewords&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A handy tool when doing AdWords or keyword research.&lt;/p&gt;
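&lt;p&gt;Mergewords simply produces every combination of two or three word lists. &lt;code&gt;itertools.product&lt;/code&gt; does the same thing locally (the example lists are arbitrary):&lt;/p&gt;

```python
from itertools import product

modifiers = ["best", "cheap"]
keywords = ["running shoes", "trail shoes"]

# Every modifier paired with every keyword, in order.
combos = [f"{m} {k}" for m, k in product(modifiers, keywords)]
print(combos)
# ['best running shoes', 'best trail shoes', 'cheap running shoes', 'cheap trail shoes']
```

&lt;p&gt;Adding a third list is just another argument to &lt;code&gt;product&lt;/code&gt;.&lt;/p&gt;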

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Httpstatus.io&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When performing a 301 redirect, installing an SSL cert or a million other small tasks, you may need to check that all versions of a URL have been correctly configured. This little tool comes in handy.&lt;/p&gt;
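&lt;p&gt;The four URL variants you typically need to verify after a redirect or SSL change can also be generated and checked from Python. A sketch using only the standard library; the domain is a placeholder, and the network helper is illustrative rather than exercised here:&lt;/p&gt;

```python
import urllib.request
from urllib.error import HTTPError

def url_variants(domain: str) -> list[str]:
    """The four common forms of a domain that should all resolve to one canonical URL."""
    return [
        f"http://{domain}/",
        f"http://www.{domain}/",
        f"https://{domain}/",
        f"https://www.{domain}/",
    ]

def first_status(url: str) -> int:
    """Fetch a URL without following redirects and return the first status code."""
    class NoRedirect(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, *args, **kwargs):
            return None  # stop urllib from following 3xx responses

    opener = urllib.request.build_opener(NoRedirect())
    try:
        return opener.open(url, timeout=10).status
    except HTTPError as e:  # unfollowed 3xx (and 4xx/5xx) are raised as errors
        return e.code

for url in url_variants("example.com"):
    print(url)  # three of these should 301 to the single canonical version
```

&lt;p&gt;Feeding each variant to &lt;code&gt;first_status&lt;/code&gt; shows at a glance whether the redirects are in place.&lt;/p&gt;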

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Wordcounter.io&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another quick single-use tool for when you need to check a word count on the fly.&lt;/p&gt;
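&lt;p&gt;Of course, a rough word count is a one-liner in Python, using the same whitespace-splitting metric most online counters use:&lt;/p&gt;

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())

print(word_count("The quick brown fox jumps over the lazy dog"))  # 9
```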

&lt;h3&gt;
  
  
  &lt;strong&gt;Python Libraries&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Python is a great language and one I’m still learning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CloudFail&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A Python script for finding IP addresses hidden behind CloudFlare’s name servers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Beautiful Soup&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Worth taking a look at, but Beautiful Soup is a parsing library rather than a full scraping framework, so it’s best suited to small, one-off extraction jobs.&lt;/p&gt;
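&lt;p&gt;For those one-off jobs it is genuinely pleasant to use. A minimal example pulling every link out of a snippet of HTML (the snippet itself is made up):&lt;/p&gt;

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <a href="https://example.com/a">First</a>
  <a href="https://example.com/b">Second</a>
</body></html>
"""

# Parse with the stdlib html.parser backend; lxml is a faster alternative.
soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.find_all("a")]
print(links)  # ['https://example.com/a', 'https://example.com/b']
```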

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scrapy&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A far more robust and feature-rich framework, capable of crawling at scale and querying APIs.&lt;/p&gt;

&lt;p&gt;You can find Murrough Foley on &lt;a href="https://twitter.com/MFoleySEO" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/m-foley-seo/" rel="noopener noreferrer"&gt;Linkedin&lt;/a&gt; or at his personal blog &lt;a href="https://www.mfoley.ie" rel="noopener noreferrer"&gt;mfoley.ie&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
