<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: ksn-developer</title>
    <description>The latest articles on Forem by ksn-developer (@ksndeveloper).</description>
    <link>https://forem.com/ksndeveloper</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F944486%2F4098fa87-6a48-4152-9a4d-45a0c91878b8.png</url>
      <title>Forem: ksn-developer</title>
      <link>https://forem.com/ksndeveloper</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ksndeveloper"/>
    <language>en</language>
    <item>
      <title>Webscraping using pandas</title>
      <dc:creator>ksn-developer</dc:creator>
      <pubDate>Wed, 01 Mar 2023 18:38:12 +0000</pubDate>
      <link>https://forem.com/ksndeveloper/webscraping-using-pandas-fph</link>
      <guid>https://forem.com/ksndeveloper/webscraping-using-pandas-fph</guid>
      <description>&lt;p&gt;Web scraping refers to the process of extracting data from websites using automated tools and scripts. Web scraping can be used for a variety of purposes, such as market research, competitor analysis, and data analysis.&lt;/p&gt;

&lt;p&gt;Pandas is a popular data analysis library in Python that provides powerful tools for working with structured data. In this article, we will explore how to use Pandas for web scraping and how it can make the process easier and more efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Pandas &lt;code&gt;read_html()&lt;/code&gt; Function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the key features of Pandas for web scraping is the &lt;code&gt;read_html()&lt;/code&gt; function. This function allows you to read HTML tables from web pages and convert them into Pandas DataFrames. The &lt;code&gt;read_html()&lt;/code&gt; function takes a URL as input and returns a list of all HTML tables found on the page.&lt;/p&gt;

&lt;p&gt;Here's an example of how to use &lt;code&gt;read_html()&lt;/code&gt; to scrape a table from a web page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt

# Wikipedia page for total wealth data
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_total_wealth'

# read HTML tables from URL
tables = pd.read_html(url)

# extract the first table (which contains the wealth data)
wealth_table = tables[0]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we first import the Pandas and Matplotlib libraries and specify the URL of the page we want to scrape. We then call the &lt;code&gt;read_html()&lt;/code&gt; function with the URL as input, which returns a list of all tables found on the page, and extract the first table from that list by indexing it with &lt;code&gt;[0]&lt;/code&gt;.&lt;/p&gt;
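&lt;p&gt;Because &lt;code&gt;read_html()&lt;/code&gt; returns a plain Python list, it is worth checking how many tables were found before indexing. The function also accepts a file-like object containing raw HTML, which makes it easy to experiment without a network connection. The small table below is invented for illustration:&lt;/p&gt;

```python
import pandas as pd
from io import StringIO

# A tiny hand-written HTML table (invented for illustration)
html = """
<table>
  <tr><th>Country</th><th>Wealth</th></tr>
  <tr><td>A</td><td>100</td></tr>
  <tr><td>B</td><td>200</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per table found
tables = pd.read_html(StringIO(html))
print(len(tables))      # how many tables were found
print(tables[0].shape)  # (rows, columns) of the first table
```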

&lt;p&gt;&lt;strong&gt;Data Cleaning and Manipulation with Pandas&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once you have scraped the data from a web page into a Pandas DataFrame, you can use the full power of Pandas to clean, manipulate, and analyze the data.&lt;/p&gt;

&lt;p&gt;Here's an example of how to clean and manipulate data in a scraped DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wealth_table["Total wealth (USD bn)"] = wealth_table['Total wealth (USD bn)'].replace("—",pd.NA)

# remove unnecessary columns

wealth_table = wealth_table[['Country (or area)', 'Total wealth (USD bn)']]

# remove rows with missing values
wealth_table = wealth_table.dropna()

top10 = wealth_table.head(10)

# plot a bar chart of the top 10 countries by total wealth
plt.bar(top10['Country (or area)'], top10['Total wealth (USD bn)'])
plt.xticks(rotation=90)
plt.ylabel('Total wealth (USD bn)')
plt.title('Top 10 Countries by Total Wealth')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the &lt;code&gt;read_html()&lt;/code&gt; function may not work for all web pages, especially those with complex or dynamic HTML structures.&lt;/p&gt;
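&lt;p&gt;When a page does contain many tables, the &lt;code&gt;match&lt;/code&gt; parameter of &lt;code&gt;read_html()&lt;/code&gt; narrows the result to tables whose text matches a string or regular expression. Here it is shown on a small invented HTML snippet rather than a live page:&lt;/p&gt;

```python
import pandas as pd
from io import StringIO

# Two invented tables; only the second mentions "Total wealth"
html = """
<table><tr><th>Name</th></tr><tr><td>x</td></tr></table>
<table><tr><th>Total wealth</th></tr><tr><td>42</td></tr></table>
"""

# match= keeps only the tables whose text matches the pattern
tables = pd.read_html(StringIO(html), match="Total wealth")
print(len(tables))
```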

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Web scraping with Pandas can be a powerful tool for extracting and analyzing data from web pages. The &lt;code&gt;read_html()&lt;/code&gt; function provides an easy way to scrape HTML tables, and Pandas provides a wide range of tools for cleaning, manipulating, and analyzing the data. However, it's important to be mindful of the legal and ethical implications of web scraping, as some websites may prohibit or restrict scraping activities. &lt;/p&gt;

&lt;p&gt;GitHub link: &lt;a href="https://gist.github.com/ksn-developer/bb541c1aa2c13b423cdef188b2444661" rel="noopener noreferrer"&gt;https://gist.github.com/ksn-developer/bb541c1aa2c13b423cdef188b2444661&lt;/a&gt;&lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>Parsing nginx logs using python</title>
      <dc:creator>ksn-developer</dc:creator>
      <pubDate>Sun, 26 Feb 2023 19:13:29 +0000</pubDate>
      <link>https://forem.com/ksndeveloper/parsing-nginx-logs-using-python-1m6k</link>
      <guid>https://forem.com/ksndeveloper/parsing-nginx-logs-using-python-1m6k</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nginx is a popular web server software used to serve web pages and other content on the internet. Nginx produces logs that contain information about the requests it receives and the responses it sends. Parsing these logs can provide valuable insights into website traffic and usage patterns. In this article, we will explore how to parse Nginx logs using Python.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Understanding Nginx Log Format&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nginx logs are stored in a file, usually located in the /var/log/nginx directory. The log format can be configured using the nginx.conf file. The default log format for Nginx is the Combined Log Format, which includes the following fields:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The remote IP address
The time of the request
The request method (GET, POST, etc.)
The requested URL
The HTTP version
The HTTP status code
The size of the response sent to the client
The referrer URL
The user agent string
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The log format can be customized to include or exclude specific fields, or to add custom fields.&lt;/p&gt;
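&lt;p&gt;For reference, a single combined-format line looks like this (all values invented for illustration):&lt;/p&gt;

```
203.0.113.7 - - [26/Feb/2023:19:13:29 +0000] "GET /index.html HTTP/1.1" 200 512 "https://example.com/" "Mozilla/5.0 (X11; Linux x86_64)"
```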

&lt;p&gt;&lt;strong&gt;Step 2: Installing Required Libraries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To parse Nginx logs using Python, we need to install the following library:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pandas: used for data manipulation and analysis.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can install it using the following command:&lt;br&gt;
   &lt;code&gt;pip install pandas&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Parsing Nginx Logs Using Python&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To parse Nginx logs using Python, we can use the pandas library. The pandas library provides a powerful data structure called a DataFrame that allows us to manipulate and analyze data easily.&lt;/p&gt;

&lt;p&gt;Here's an example Python script that reads an Nginx log file and creates a DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re
import shlex
import pandas as pd

class Parser:
    # token positions in a combined-format log line after shlex.split()
    IP = 0
    TIME = 3
    TIME_ZONE = 4
    REQUESTED_URL = 5
    STATUS_CODE = 6
    USER_AGENT = 9

    def parse_line(self, line):
        try:
            line = re.sub(r"[\[\]]", "", line)
            data = shlex.split(line)
            result = {
                "ip": data[self.IP],
                "time": data[self.TIME],
                "status_code": data[self.STATUS_CODE],
                "requested_url": data[self.REQUESTED_URL],
                "user_agent": data[self.USER_AGENT],
            }
            return result
        except (IndexError, ValueError) as e:
            raise ValueError(f"could not parse log line: {line!r}") from e


if __name__ == '__main__':
    parser = Parser()
    LOG_FILE = "access.log"
    with open(LOG_FILE, "r") as f:
        log_entries = [parser.parse_line(line) for line in f]

    logs_df = pd.DataFrame(log_entries)
    print(logs_df.head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Data Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once we have the Nginx log data in a DataFrame, we can perform various data analysis tasks. For example:&lt;/p&gt;

&lt;p&gt;All requests with status code 404&lt;br&gt;
&lt;code&gt;logs_df.loc[(logs_df["status_code"] == "404")]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Requests from unique ip addresses&lt;br&gt;
&lt;code&gt;logs_df["ip"].unique()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Get all distinct user agents&lt;br&gt;
&lt;code&gt;logs_df["user_agent"].unique()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Get most requested urls&lt;br&gt;
&lt;code&gt;logs_df["requested_url"].value_counts().to_dict()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Parsing Nginx logs using Python can provide valuable insights into website traffic and usage patterns. By using the pandas library, we can easily read and manipulate the log data. With the right analysis, we can gain insights into website performance, user behavior, and potential security threats.&lt;/p&gt;

&lt;p&gt;GitHub link: &lt;a href="https://gist.github.com/ksn-developer/4072a9e092bccf68559c21f1c5ac2de2" rel="noopener noreferrer"&gt;https://gist.github.com/ksn-developer/4072a9e092bccf68559c21f1c5ac2de2&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>scalability</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Extracting text from pdf files using pyPDF3</title>
      <dc:creator>ksn-developer</dc:creator>
      <pubDate>Sun, 26 Feb 2023 17:26:33 +0000</pubDate>
      <link>https://forem.com/ksndeveloper/extracting-text-from-pdf-files-using-pypdf3-2e72</link>
      <guid>https://forem.com/ksndeveloper/extracting-text-from-pdf-files-using-pypdf3-2e72</guid>
      <description>&lt;p&gt;PyPDF3 is a Python library for working with PDF files that builds upon the PyPDF2 library. It provides an easy-to-use interface for reading and writing PDF files, and it includes tools for extracting text from PDF files. In this article, we will explore how to use PyPDF3 to extract text from PDF documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To use PyPDF3, you need to install it using pip. You can do this by running the following command in your command prompt or terminal:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install PyPDF3&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once you have installed PyPDF3, you can import it in your Python script using the following line of code:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import PyPDF3&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extracting Text from PDF Documents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To extract text from a PDF document using PyPDF3, you first need to open the PDF file in binary mode using Python's built-in open() function. You can then create a &lt;code&gt;PdfFileReader&lt;/code&gt; object using PyPDF3, which allows you to read the contents of the PDF file. Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   import PyPDF3
   with open('sample.pdf', 'rb') as pdf_file:
     pdf_reader = PyPDF3.PdfFileReader(pdf_file)
     text = ''
     for page_num in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_num)
        text += page.extractText()
   print(text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
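&lt;p&gt;Once the text has been collected, it is often useful to save it for later processing. The snippet below stands alone, with a dummy string in place of the &lt;code&gt;text&lt;/code&gt; variable built by the extraction loop above:&lt;/p&gt;

```python
from pathlib import Path

# Dummy value standing in for the `text` built by the extraction loop above
text = "Extracted PDF text..."

# save the extracted text next to the PDF as a plain .txt file
out_path = Path("sample.txt")
out_path.write_text(text, encoding="utf-8")
print(out_path.read_text(encoding="utf-8"))
```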



</description>
      <category>snippet</category>
      <category>markdown</category>
    </item>
  </channel>
</rss>
