<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: ksn-developer</title>
    <description>The latest articles on Forem by ksn-developer (@ksndeveloper).</description>
    <link>https://forem.com/ksndeveloper</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F944486%2F4098fa87-6a48-4152-9a4d-45a0c91878b8.png</url>
      <title>Forem: ksn-developer</title>
      <link>https://forem.com/ksndeveloper</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ksndeveloper"/>
    <language>en</language>
    <item>
      <title>Webscraping using pandas</title>
      <dc:creator>ksn-developer</dc:creator>
      <pubDate>Wed, 01 Mar 2023 18:38:12 +0000</pubDate>
      <link>https://forem.com/ksndeveloper/webscraping-using-pandas-fph</link>
      <guid>https://forem.com/ksndeveloper/webscraping-using-pandas-fph</guid>
      <description>&lt;p&gt;Web scraping refers to the process of extracting data from websites using automated tools and scripts. Web scraping can be used for a variety of purposes, such as market research, competitor analysis, and data analysis.&lt;/p&gt;

&lt;p&gt;Pandas is a popular data analysis library in Python that provides powerful tools for working with structured data. In this article, we will explore how to use Pandas for web scraping and how it can make the process easier and more efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Pandas &lt;code&gt;read_html()&lt;/code&gt; Function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the key features of Pandas for web scraping is the &lt;code&gt;read_html()&lt;/code&gt; function. This function allows you to read HTML tables from web pages and convert them into Pandas DataFrames. The &lt;code&gt;read_html()&lt;/code&gt; function takes a URL as input and returns a list of all HTML tables found on the page.&lt;/p&gt;

&lt;p&gt;Here's an example of how to use &lt;code&gt;read_html()&lt;/code&gt; to scrape a table from a web page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt

# Wikipedia page for total wealth data
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_total_wealth'

# read HTML tables from URL
tables = pd.read_html(url)

# extract the first table (which contains the wealth data)
wealth_table = tables[0]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we first import the Pandas and Matplotlib libraries and specify the URL of the page we want to scrape. We then call the &lt;code&gt;read_html()&lt;/code&gt; function with the URL as input, which returns a list of all tables found on the page, and extract the first table from that list by indexing it with &lt;code&gt;[0]&lt;/code&gt;.&lt;/p&gt;
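&lt;p&gt;Because &lt;code&gt;read_html()&lt;/code&gt; returns a plain Python list, it is worth checking how many tables were found before indexing. The function also accepts a file-like object containing raw HTML, which makes it easy to experiment without a network connection. The small table below is invented for illustration:&lt;/p&gt;

```python
import pandas as pd
from io import StringIO

# A tiny hand-written HTML table (invented for illustration)
html = """
<table>
  <tr><th>Country</th><th>Wealth</th></tr>
  <tr><td>A</td><td>100</td></tr>
  <tr><td>B</td><td>200</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per table found
tables = pd.read_html(StringIO(html))
print(len(tables))      # how many tables were found
print(tables[0].shape)  # (rows, columns) of the first table
```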

&lt;p&gt;&lt;strong&gt;Data Cleaning and Manipulation with Pandas&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once you have scraped the data from a web page into a Pandas DataFrame, you can use the full power of Pandas to clean, manipulate, and analyze the data.&lt;/p&gt;

&lt;p&gt;Here's an example of how to clean and manipulate data in a scraped DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wealth_table["Total wealth (USD bn)"] = wealth_table['Total wealth (USD bn)'].replace("—",pd.NA)

# remove unnecessary columns

wealth_table = wealth_table[['Country (or area)', 'Total wealth (USD bn)']]

# remove rows with missing values
wealth_table = wealth_table.dropna()

top10 = wealth_table.head(10)

# plot a bar chart of the top 10 countries by total wealth
plt.bar(top10['Country (or area)'], top10['Total wealth (USD bn)'])
plt.xticks(rotation=90)
plt.ylabel('Total wealth (USD bn)')
plt.title('Top 10 Countries by Total Wealth')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the &lt;code&gt;read_html()&lt;/code&gt; function may not work for all web pages, especially those with complex or dynamic HTML structures.&lt;/p&gt;
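&lt;p&gt;When a page does contain many tables, the &lt;code&gt;match&lt;/code&gt; parameter of &lt;code&gt;read_html()&lt;/code&gt; narrows the result to tables whose text matches a string or regular expression. Here it is shown on a small invented HTML snippet rather than a live page:&lt;/p&gt;

```python
import pandas as pd
from io import StringIO

# Two invented tables; only the second mentions "Total wealth"
html = """
<table><tr><th>Name</th></tr><tr><td>x</td></tr></table>
<table><tr><th>Total wealth</th></tr><tr><td>42</td></tr></table>
"""

# match= keeps only the tables whose text matches the pattern
tables = pd.read_html(StringIO(html), match="Total wealth")
print(len(tables))
```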

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Web scraping with Pandas can be a powerful tool for extracting and analyzing data from web pages. The &lt;code&gt;read_html()&lt;/code&gt; function provides an easy way to scrape HTML tables, and Pandas provides a wide range of tools for cleaning, manipulating, and analyzing the data. However, it's important to be mindful of the legal and ethical implications of web scraping, as some websites may prohibit or restrict scraping activities. &lt;/p&gt;

&lt;p&gt;GitHub link: &lt;a href="https://gist.github.com/ksn-developer/bb541c1aa2c13b423cdef188b2444661" rel="noopener noreferrer"&gt;https://gist.github.com/ksn-developer/bb541c1aa2c13b423cdef188b2444661&lt;/a&gt;&lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>Parsing nginx logs using python</title>
      <dc:creator>ksn-developer</dc:creator>
      <pubDate>Sun, 26 Feb 2023 19:13:29 +0000</pubDate>
      <link>https://forem.com/ksndeveloper/parsing-nginx-logs-using-python-1m6k</link>
      <guid>https://forem.com/ksndeveloper/parsing-nginx-logs-using-python-1m6k</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nginx is a popular web server software used to serve web pages and other content on the internet. Nginx produces logs that contain information about the requests it receives and the responses it sends. Parsing these logs can provide valuable insights into website traffic and usage patterns. In this article, we will explore how to parse Nginx logs using Python.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Understanding Nginx Log Format&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nginx logs are stored in a file, usually located in the /var/log/nginx directory. The log format can be configured using the nginx.conf file. The default log format for Nginx is the Combined Log Format, which includes the following fields:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The remote IP address
The time of the request
The request method (GET, POST, etc.)
The requested URL
The HTTP version
The HTTP status code
The size of the response sent to the client
The referrer URL
The user agent string
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The log format can be customized to include or exclude specific fields, or to add custom fields.&lt;/p&gt;
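&lt;p&gt;For reference, a single combined-format line looks like this (all values invented for illustration):&lt;/p&gt;

```
203.0.113.7 - - [26/Feb/2023:19:13:29 +0000] "GET /index.html HTTP/1.1" 200 512 "https://example.com/" "Mozilla/5.0 (X11; Linux x86_64)"
```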

&lt;p&gt;&lt;strong&gt;Step 2: Installing Required Libraries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To parse Nginx logs using Python, we need to install the following library:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pandas: used for data manipulation and analysis.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can install it using the following command:&lt;br&gt;
   &lt;code&gt;pip install pandas&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Parsing Nginx Logs Using Python&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To parse Nginx logs using Python, we can use the pandas library. The pandas library provides a powerful data structure called a DataFrame that allows us to manipulate and analyze data easily.&lt;/p&gt;

&lt;p&gt;Here's an example Python script that reads an Nginx log file and creates a DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re
import shlex
import pandas as pd

class Parser:
    # token positions in a combined-format log line after shlex.split()
    IP = 0
    TIME = 3
    TIME_ZONE = 4
    REQUESTED_URL = 5
    STATUS_CODE = 6
    USER_AGENT = 9

    def parse_line(self, line):
        try:
            line = re.sub(r"[\[\]]", "", line)
            data = shlex.split(line)
            result = {
                "ip": data[self.IP],
                "time": data[self.TIME],
                "status_code": data[self.STATUS_CODE],
                "requested_url": data[self.REQUESTED_URL],
                "user_agent": data[self.USER_AGENT],
            }
            return result
        except (IndexError, ValueError) as e:
            raise ValueError(f"could not parse log line: {line!r}") from e


if __name__ == '__main__':
    parser = Parser()
    LOG_FILE = "access.log"
    with open(LOG_FILE, "r") as f:
        log_entries = [parser.parse_line(line) for line in f]

    logs_df = pd.DataFrame(log_entries)
    print(logs_df.head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Data Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once we have the Nginx log data in a DataFrame, we can perform various data analysis tasks. For example:&lt;/p&gt;

&lt;p&gt;All requests with status code 404&lt;br&gt;
&lt;code&gt;logs_df.loc[(logs_df["status_code"] == "404")]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Requests from unique ip addresses&lt;br&gt;
&lt;code&gt;logs_df["ip"].unique()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Get all distinct user agents&lt;br&gt;
&lt;code&gt;logs_df["user_agent"].unique()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Get most requested urls&lt;br&gt;
&lt;code&gt;logs_df["requested_url"].value_counts().to_dict()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Parsing Nginx logs using Python can provide valuable insights into website traffic and usage patterns. By using the pandas library, we can easily read and manipulate the log data. With the right analysis, we can gain insights into website performance, user behavior, and potential security threats.&lt;/p&gt;

&lt;p&gt;GitHub link: &lt;a href="https://gist.github.com/ksn-developer/4072a9e092bccf68559c21f1c5ac2de2" rel="noopener noreferrer"&gt;https://gist.github.com/ksn-developer/4072a9e092bccf68559c21f1c5ac2de2&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>scalability</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Extracting text from pdf files using pyPDF3</title>
      <dc:creator>ksn-developer</dc:creator>
      <pubDate>Sun, 26 Feb 2023 17:26:33 +0000</pubDate>
      <link>https://forem.com/ksndeveloper/extracting-text-from-pdf-files-using-pypdf3-2e72</link>
      <guid>https://forem.com/ksndeveloper/extracting-text-from-pdf-files-using-pypdf3-2e72</guid>
      <description>&lt;p&gt;PyPDF3 is a Python library for working with PDF files that builds upon the PyPDF2 library. It provides an easy-to-use interface for reading and writing PDF files, and it includes tools for extracting text from PDF files. In this article, we will explore how to use PyPDF3 to extract text from PDF documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To use PyPDF3, you need to install it using pip. You can do this by running the following command in your command prompt or terminal:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install PyPDF3&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once you have installed PyPDF3, you can import it in your Python script using the following line of code:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import PyPDF3&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extracting Text from PDF Documents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To extract text from a PDF document using PyPDF3, you first need to open the PDF file in binary mode using Python's built-in open() function. You can then create a &lt;code&gt;PdfFileReader&lt;/code&gt; object using PyPDF3, which allows you to read the contents of the PDF file. Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   import PyPDF3
   with open('sample.pdf', 'rb') as pdf_file:
     pdf_reader = PyPDF3.PdfFileReader(pdf_file)
     text = ''
     for page_num in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_num)
        text += page.extractText()
   print(text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
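&lt;p&gt;Once the text has been collected, it is often useful to save it for later processing. The snippet below stands alone, with a dummy string in place of the &lt;code&gt;text&lt;/code&gt; variable built by the extraction loop above:&lt;/p&gt;

```python
from pathlib import Path

# Dummy value standing in for the `text` built by the extraction loop above
text = "Extracted PDF text..."

# save the extracted text next to the PDF as a plain .txt file
out_path = Path("sample.txt")
out_path.write_text(text, encoding="utf-8")
print(out_path.read_text(encoding="utf-8"))
```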



</description>
      <category>snippet</category>
      <category>markdown</category>
    </item>
  </channel>
</rss>
