<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Murilo Azevedo</title>
    <description>The latest articles on Forem by Murilo Azevedo (@muriloazevedo).</description>
    <link>https://forem.com/muriloazevedo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1215681%2F0193b02d-7258-474e-8760-c2072b750153.jpeg</url>
      <title>Forem: Murilo Azevedo</title>
      <link>https://forem.com/muriloazevedo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/muriloazevedo"/>
    <language>en</language>
    <item>
      <title>Vamos construir um painel dos blocos de carnaval de São Paulo usando python? - Parte I</title>
      <dc:creator>Murilo Azevedo</dc:creator>
      <pubDate>Thu, 01 Feb 2024 01:19:55 +0000</pubDate>
      <link>https://forem.com/muriloazevedo/vamos-construir-um-painel-dos-blocos-de-sao-paulo-usando-python-parte-i-2c5l</link>
      <guid>https://forem.com/muriloazevedo/vamos-construir-um-painel-dos-blocos-de-sao-paulo-usando-python-parte-i-2c5l</guid>
      <description>&lt;p&gt;Lidar com dados não estruturados é sempre um desafio. Esse tipo de dado pode ter os mais variados formatos e as fontes de dados rudimentares.&lt;/p&gt;

&lt;p&gt;Dados não estruturados são em resumo, aqueles que não possuem uma padronização de formatação e não são facilmente categorizados, como por exemplo uma página web de um site. Você provavelmente não controla esse site e extrair dados dele pode ser um desafio. Outro exemplo são dados do poder público que não necessariamente tem o compromisso de serem acessíveis.&lt;/p&gt;

&lt;h2&gt;
  
  
  A agenda de blocos do carnaval de São Paulo
&lt;/h2&gt;

&lt;p&gt;Anualmente a prefeitura de São Paulo divulga uma agenda com os blocos de carnaval com data, horário, nome e itinerário, entre outras informações.&lt;/p&gt;

&lt;p&gt;Nesta série de artigos, quero trazer os seguintes recursos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extrair dados de um PDF e transformá-los em tabulares&lt;/li&gt;
&lt;li&gt;Limpeza básica de dados&lt;/li&gt;
&lt;li&gt;Transformar alguns dados brutos em entidades que depois podem ser usadas de várias maneiras&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Primeiro, vamos baixar o pdf de 2024 neste &lt;a href="https://diariooficial.prefeitura.sp.gov.br/md_epubli_visualizar.php?kHdgtACkKWJxjOVDE7BkQy24RK_w0WYcDMI4xNfLbdk5LQ0S2WkkSGqARrYgZ4ufqAeRzWFkFbOj_VQhDUemV7REuH1qaEri4G5JmdIyrMhE8x1eUwy8MthWoUFce6FD"&gt;endereço&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;O documento possui 35 páginas com uma tabela assim:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yqtq9tz2fdvqd1yy6b5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yqtq9tz2fdvqd1yy6b5.png" alt="Tabela com os dados de cada bloco" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;O primeiro passo será nós transformarmos esse pdf em uma &lt;a href="https://en.wikipedia.org/wiki/Pandas_(software)#DataFrames"&gt;DataFrame&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;camelot.io&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;camelot&lt;/span&gt;

&lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blocos.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;camelot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No código acima estamos fazendo 3 coisas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uso da biblioteca chamada &lt;code&gt;camelot&lt;/code&gt;, ela irá fazer o trabalho pesado para nós.&lt;/li&gt;
&lt;li&gt;Lendo o arquivo com todas as páginas&lt;/li&gt;
&lt;li&gt;Mostrando o dataframe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;O resultado foi o seguinte:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;       0                               1                            2                            3                                                  4             5         6          7
0  BLOCO                                     DATA DO &lt;span class="se"&gt;\n&lt;/span&gt;DESFILE &lt;span class="se"&gt;\n&lt;/span&gt;2024    SUBPREFEITURA &lt;span class="se"&gt;\n&lt;/span&gt;DE ORIGEM                                    ITINERÁRIO 2023  Concentração   Desfile  Dispersão
1      1                50 tons de Pinga  17/02/2024 - &lt;span class="se"&gt;\n&lt;/span&gt;Pós Carnaval                        Mooca  R. Airi, R. Platina, R. Serra &lt;span class="k"&gt;do &lt;/span&gt;Japi, R. Serr...      12:00:00  13:00:00   17:00:00
2      2                A Bruxa ta Solta  03/02/2024 - &lt;span class="se"&gt;\n&lt;/span&gt;Pré carnaval  Vila Maria/Vila &lt;span class="se"&gt;\n&lt;/span&gt;Guilherme                                R. Quedas 500 a 200      14:00:00  15:00:00   19:00:00
3      3  A Ema Gemeu de &lt;span class="se"&gt;\n&lt;/span&gt;Canto a Canto  17/02/2024 - &lt;span class="se"&gt;\n&lt;/span&gt;Pós Carnaval                    Freguesia  Av. Brigadeiro Faria Lima 364, R. Padre Garcia...      12:00:00  13:00:00   17:00:00
4      4               A Madonna Tá Aqui  03/02/2024 - &lt;span class="se"&gt;\n&lt;/span&gt;Pré carnaval                           Sé  Pateo &lt;span class="k"&gt;do &lt;/span&gt;Colégio, R. Boa Vista, R. Libero Bada...      11:00:00  12:00:00   16:00:00
5      5                 A praça é nossa  03/02/2024 - &lt;span class="se"&gt;\n&lt;/span&gt;Pré carnaval                    Jabaquara  R. Alba, R. das Cavas, Av. João Barreto de Men...      14:00:00  15:00:00   19:00:00
6      6                Abacaxi de Irará  17/02/2024 - &lt;span class="se"&gt;\n&lt;/span&gt;Pós Carnaval                         Lapa  R. Minerva 188, R. Itapicuru, R. Ministro Godó...      10:00:00  11:00:00   15:00:00
7      7           Acadêmicos da &lt;span class="se"&gt;\n&lt;/span&gt;Ursal  03/02/2024 - &lt;span class="se"&gt;\n&lt;/span&gt;Pré carnaval                           Sé  R. Fortunato, 64, R. Canuto &lt;span class="k"&gt;do &lt;/span&gt;Val, R.  Martim...      11:00:00  12:00:00   16:00:00
8      8   Acadêmicos &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="se"&gt;\n&lt;/span&gt;Baixo Augusta  04/02/2024 - &lt;span class="se"&gt;\n&lt;/span&gt;Pré Carnaval                           Sé                                      R. Consolação      14:00:00  15:00:00   19:00:00
9      9         Acadêmicos &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="se"&gt;\n&lt;/span&gt;Ipanema  03/02/2024 - &lt;span class="se"&gt;\n&lt;/span&gt;Pré carnaval                     Itaquera  R. Filippo Juvara, 296, R. Gondarem, R. Jose S...      14:00:00  15:00:00   19:00:00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Observe que ainda temos espaço para fazer algumas melhorias como retirar &lt;code&gt;\n&lt;/code&gt; que pode atrapalhar a formatação futuramente e o itinerário.&lt;/p&gt;

&lt;p&gt;Para retirar as quebra de linha que vieram do pdf, vamos fazer o seguinte:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataFrame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;O próximo passo é remover os espaços das colunas&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# remove os espaços no nome das colunas
&lt;/span&gt;&lt;span class="n"&gt;new_header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;results_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;results_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_header&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Vamos atribuir a coluna que estava com nome de espaço em branco para bloco
&lt;/span&gt;&lt;span class="n"&gt;results_raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BLOCO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results_raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;results_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;columns&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;E por fim, vamos extrair a data para uma nova coluna&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# extract date
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\d{2}\/\d{2}\/\d{4}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%d/%m/%Y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;

&lt;span class="n"&gt;results_raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data_desfile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results_raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data_do_desfile_2024&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extract_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Desse modo já temos os dados limpos e podemos exportar para CSV, ficando mais fácil de disponibilizar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;carnaval_sp_2024.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Na próxima parte desta série, vamos trabalhar melhor esses dados fazendo algumas transformações como: datas em &lt;a href="https://pt.wikipedia.org/wiki/ISO_8601"&gt;ISO 8601&lt;/a&gt;, construir itinerário dos blocos e outras coisas.&lt;/p&gt;

</description>
      <category>pandas</category>
      <category>python</category>
      <category>data</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
