<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jonathan Grandperrin</title>
    <description>The latest articles on Forem by Jonathan Grandperrin (@jgrandperrin).</description>
    <link>https://forem.com/jgrandperrin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F462773%2F51c250bf-1267-4bb2-9edd-2a12dcf21bfc.jpg</url>
      <title>Forem: Jonathan Grandperrin</title>
      <link>https://forem.com/jgrandperrin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jgrandperrin"/>
    <language>en</language>
    <item>
      <title>Document Classification API</title>
      <dc:creator>Jonathan Grandperrin</dc:creator>
      <pubDate>Sun, 28 Mar 2021 10:27:19 +0000</pubDate>
      <link>https://forem.com/mindee/document-classification-api-8jl</link>
      <guid>https://forem.com/mindee/document-classification-api-8jl</guid>
      <description>&lt;p&gt;Workflows often involve document processing, and sometimes you need to classify those documents automatically in your software. For example, your users may upload many kinds of documents through a single flow, or upload a single PDF that bundles several different documents. Depending on your use case, automating this can be tricky.&lt;/p&gt;

&lt;p&gt;In this article, we’ll show you how to build an accurate document classification API that fits your needs exactly. Within minutes, your API will be up and running, and you’ll be able to process millions of documents synchronously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our use case
&lt;/h2&gt;

&lt;p&gt;Let’s take an example where your users upload documents to a single endpoint of your backend, and you want to classify them into 5 categories:&lt;/p&gt;

&lt;p&gt;-W9&lt;br&gt;
-1040 Forms&lt;br&gt;
-Invoices&lt;br&gt;
-Payslip&lt;br&gt;
-Other&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NSHtJ1mr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gj42rqozg7f7udi0i4an.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NSHtJ1mr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gj42rqozg7f7udi0i4an.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once our API is trained, we’ll be able to launch specific workflows on those different types of documents.&lt;/p&gt;
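
&lt;p&gt;As a minimal sketch of what launching specific workflows could look like in your backend, the Python snippet below routes a predicted class to a handler. The handler functions and the class-name strings are hypothetical placeholders, not part of the API; adapt them to the labels you define.&lt;/p&gt;

```python
from typing import Callable, Dict


# Hypothetical handlers: replace each one with your real workflow.
def process_w9(path: str) -> str:
    return f"W9 workflow started for {path}"


def process_1040(path: str) -> str:
    return f"1040 workflow started for {path}"


def process_invoice(path: str) -> str:
    return f"Invoice workflow started for {path}"


def process_payslip(path: str) -> str:
    return f"Payslip workflow started for {path}"


def process_other(path: str) -> str:
    return f"Manual review queued for {path}"


# Class names are assumed to mirror the five categories listed above.
WORKFLOWS: Dict[str, Callable[[str], str]] = {
    "W9": process_w9,
    "1040": process_1040,
    "Invoice": process_invoice,
    "Payslip": process_payslip,
    "Other": process_other,
}


def route_document(predicted_class: str, path: str) -> str:
    """Run the workflow for the predicted class, falling back to 'Other'."""
    handler = WORKFLOWS.get(predicted_class, process_other)
    return handler(path)
```

&lt;p&gt;Falling back to the “Other” handler for unknown labels is a design choice in this sketch: it keeps the endpoint from failing if a class is renamed later.&lt;/p&gt;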

&lt;h2&gt;
  
  
  Set up our document classification API
&lt;/h2&gt;

&lt;p&gt;Create an account here: &lt;a href="https://platform.mindee.com/signup"&gt;https://platform.mindee.com/signup&lt;/a&gt; and sign in. You’ll land on our home page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bOek-n6b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/39pb1sofzt712rkoywq8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bOek-n6b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/39pb1sofzt712rkoywq8.png" alt="image"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Click on the &lt;strong&gt;“Create a new API”&lt;/strong&gt; button.&lt;/p&gt;

&lt;p&gt;Fill in some information about your API: give it a name, a description, and a cover image if you want.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T7wItBDL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e338kva9wtiw7v9zyhyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T7wItBDL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e338kva9wtiw7v9zyhyj.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then click on &lt;strong&gt;“Next”&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8dlza423--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uq4tj8888eszxiw7avqh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8dlza423--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uq4tj8888eszxiw7avqh.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the step where you are going to define your classes. Add a Classification field:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pEdGsSX_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/73mv8ns68g8khxyjlpeo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pEdGsSX_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/73mv8ns68g8khxyjlpeo.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A popup appears where we enter the possible classes. Let’s fill in the form with the classes defined earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tdcG5DB5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/68khke6be7un2fyed7e2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tdcG5DB5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/68khke6be7un2fyed7e2.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you click the &lt;strong&gt;"Add this classification field"&lt;/strong&gt; button, we are all set. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9bXcs7sV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0t0k6ns3njen5g4flk3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9bXcs7sV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0t0k6ns3njen5g4flk3v.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click the &lt;strong&gt;“Start training your model”&lt;/strong&gt; button.&lt;/p&gt;

&lt;h2&gt;
  
  
  Train your document classifier
&lt;/h2&gt;

&lt;p&gt;Your API was just deployed! Now we need to &lt;strong&gt;train the model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To do so, we’ll need data. &lt;strong&gt;15 samples for each type&lt;/strong&gt; should be enough to get very high performance, but you can train with more if you want to. Once your data is uploaded, annotating it should take you &lt;strong&gt;no more than 10 minutes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The training interface looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wYNT2zZB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s7s7em4h3ktp7tw02pls.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wYNT2zZB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s7s7em4h3ktp7tw02pls.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the left part of the screen, you can &lt;strong&gt;upload images, PDFs, or zip archives&lt;/strong&gt;. If you have all your training data in a folder on your laptop, just zip it and drag and drop it onto the upload interface. You can mix PDFs and images; our backend handles both.&lt;/p&gt;

&lt;p&gt;Gathering your samples for training is actually the most boring part of the process.&lt;/p&gt;

&lt;p&gt;In my example, I have a total of 90 documents, equally distributed across the classes. As it’s a dummy example, I used random documents for the “other” class, but in a real-world use case, it’s better to use real data from your flow that you’d consider “other”.&lt;/p&gt;
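
&lt;p&gt;Before zipping your samples, it can help to check that every class has enough of them. The Python sketch below is one way to do that from a plain list of labels. The class names are assumed, and the threshold of 15 comes from the suggestion earlier in this article; neither is enforced by the platform.&lt;/p&gt;

```python
from collections import Counter
from typing import Dict, Iterable, List

# Assumed label names for the five categories of this example.
CLASSES = ["W9", "1040", "Invoice", "Payslip", "Other"]
# Per-class figure suggested in this article; adjust as needed.
MIN_SAMPLES = 15


def count_per_class(labels: Iterable[str]) -> Dict[str, int]:
    """Count samples per class, reporting 0 for classes with no sample."""
    counts = Counter(labels)
    return {name: counts.get(name, 0) for name in CLASSES}


def underrepresented(labels: Iterable[str], minimum: int = MIN_SAMPLES) -> List[str]:
    """Return the classes that have fewer than `minimum` samples."""
    return [name for name, n in count_per_class(labels).items() if minimum > n]
```

&lt;p&gt;An empty list returned by &lt;code&gt;underrepresented&lt;/code&gt; means every class reaches the threshold.&lt;/p&gt;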

&lt;p&gt;My zip file is ready. When I drag and drop the file on the left part of the screen, the data management pane opens:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MSL4SF2X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/okizyu72ymgkmena0eev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MSL4SF2X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/okizyu72ymgkmena0eev.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each document appears in the pane automatically when it’s ready for annotation.&lt;/p&gt;

&lt;p&gt;To make the annotation process easier, click on the &lt;strong&gt;settings icon in the header&lt;/strong&gt;, and enable automatic data loading:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Lx4piRkp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9076ju7kjrcky4oukqdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Lx4piRkp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9076ju7kjrcky4oukqdx.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s start annotating the data.&lt;/p&gt;

&lt;p&gt;Click on &lt;strong&gt;“Your data set”&lt;/strong&gt; on the left part of the screen, and click on the first document you see in the list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Mte_QScU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tmbj3fkhvubc1d1mhvqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Mte_QScU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tmbj3fkhvubc1d1mhvqp.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, it’s very simple. For each document, click on the desired class on the right part of the interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y-fvo0yf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bv8ku84k5niq5y1ae3xn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y-fvo0yf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bv8ku84k5niq5y1ae3xn.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate, and repeat&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It took me &lt;strong&gt;4 minutes and 51 seconds to annotate my 89 documents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A new model is trained every 20 annotated documents, and each one is automatically deployed on your API as a new version:&lt;/p&gt;

&lt;p&gt;V1.0 = no model&lt;br&gt;
V1.1 = 1st model (20 documents)&lt;br&gt;
V1.2 = 2nd model (40 documents)&lt;br&gt;
…&lt;/p&gt;
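
&lt;p&gt;The arithmetic behind this schedule can be sketched in a few lines of Python. This assumes training is triggered exactly at each multiple of 20 annotated documents, as described above; the function names are made up for illustration.&lt;/p&gt;

```python
# From the article: a model is trained every 20 annotated documents.
DOCS_PER_MODEL = 20


def models_trained(annotated_docs: int, docs_per_model: int = DOCS_PER_MODEL) -> int:
    """Number of models trained so far for a given annotation count."""
    return annotated_docs // docs_per_model


def docs_until_next_model(annotated_docs: int, docs_per_model: int = DOCS_PER_MODEL) -> int:
    """Annotations still needed before the next model is trained."""
    return docs_per_model - annotated_docs % docs_per_model
```

&lt;p&gt;With the 89 annotations from this example, &lt;code&gt;models_trained(89)&lt;/code&gt; gives 4, and &lt;code&gt;docs_until_next_model(89)&lt;/code&gt; gives 11.&lt;/p&gt;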

&lt;p&gt;You get an email when a model is deployed. My last model was deployed 15 minutes after I finished my 89 annotations. The first one was ready before I finished.&lt;/p&gt;

&lt;p&gt;To find out how your model performs, ask us in the chat and we’ll give you its accuracy. I got an overall accuracy of 96%, with the confusion coming from invoices being classified as “other”. Adding a few more invoices and “other” documents would fix this.&lt;/p&gt;
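
&lt;p&gt;If you keep a small labeled test set of your own, you can compute the same kind of figures yourself. The Python sketch below shows one way to get the overall accuracy and the confused class pairs from a list of (true class, predicted class) pairs. It is an illustration only, not the evaluation used by the platform, and the sample pairs in the usage note are made up.&lt;/p&gt;

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]  # (true class, predicted class)


def accuracy(pairs: List[Pair]) -> float:
    """Share of documents whose predicted class equals the true class."""
    if not pairs:
        raise ValueError("at least one (true, predicted) pair is required")
    correct = sum(1 for truth, predicted in pairs if truth == predicted)
    return correct / len(pairs)


def confusions(pairs: List[Pair]) -> Counter:
    """Count each (true, predicted) pair where the prediction was wrong."""
    return Counter((truth, predicted) for truth, predicted in pairs if truth != predicted)
```

&lt;p&gt;For example, with four correct predictions and one invoice predicted as “Other”, &lt;code&gt;accuracy&lt;/code&gt; returns 0.8 and &lt;code&gt;confusions&lt;/code&gt; reports one (“Invoice”, “Other”) pair.&lt;/p&gt;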

&lt;h2&gt;
  
  
  Use the API
&lt;/h2&gt;

&lt;p&gt;Once your first model is deployed, you can test it right away with new data.&lt;/p&gt;

&lt;p&gt;Hit the &lt;strong&gt;“Live interface”&lt;/strong&gt; button in the sidebar, then drag and drop a document. You should see something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x0lNS-u7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dtgi5kg184zs105w1vlg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x0lNS-u7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dtgi5kg184zs105w1vlg.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The latest version of your API (i.e., the latest trained model) is automatically selected for the live interface.&lt;/p&gt;

&lt;p&gt;To integrate your API in your application, you can now hit the &lt;strong&gt;“Documentation”&lt;/strong&gt; button in the sidebar.&lt;/p&gt;

&lt;p&gt;It contains everything you need to use the API:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BmALSJ6e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kdr2hx55igk93trgo7ma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BmALSJ6e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kdr2hx55igk93trgo7ma.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API Reference: base URL, request body, headers, and sample code&lt;/li&gt;
&lt;li&gt;Response schema&lt;/li&gt;
&lt;li&gt;Limitations: technical limits on payload size and request rate&lt;/li&gt;
&lt;li&gt;OpenAPI: you can download the OpenAPI configuration to build your Swagger collection, automatically generate an SDK, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In under an hour, we’ve trained and deployed an API that classifies documents into 5 different classes. Whether you need to process a few hundred documents per month or tens of millions, you can safely use your API in your production environment. Our whole architecture scales automatically as the number of requests grows.&lt;/p&gt;

&lt;p&gt;Feel free to contact us using the chat on &lt;a href="https://mindee.com"&gt;https://mindee.com&lt;/a&gt; if you have any questions or if you just want to chat and understand how our algorithm works.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>api</category>
    </item>
  </channel>
</rss>
