<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Muchamad Faiz</title>
    <description>The latest articles on Forem by Muchamad Faiz (@muchamadfaiz).</description>
    <link>https://forem.com/muchamadfaiz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F967443%2F78ec4bd7-884f-4bbc-986a-80bc702a9611.jpg</url>
      <title>Forem: Muchamad Faiz</title>
      <link>https://forem.com/muchamadfaiz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/muchamadfaiz"/>
    <language>en</language>
    <item>
      <title>Scraping Top Repositories for GitHub Topics</title>
      <dc:creator>Muchamad Faiz</dc:creator>
      <pubDate>Sun, 18 Dec 2022 23:08:09 +0000</pubDate>
      <link>https://forem.com/zettasoft/scrapping-top-repositories-for-github-topics-21l9</link>
      <guid>https://forem.com/zettasoft/scrapping-top-repositories-for-github-topics-21l9</guid>
      <description>&lt;p&gt;GitHub is a popular website for sharing open source projects and code repositories. For example, the tensorflow repository contains the entire source code of the Tensorflow deep learning framework.&lt;/p&gt;

&lt;p&gt;Repositories in GitHub can be tagged using topics. For example, the tensorflow repository has the topics python, machine-learning, deep-learning etc.&lt;/p&gt;

&lt;p&gt;The page &lt;a href="https://github.com/topics" rel="noopener noreferrer"&gt;https://github.com/topics&lt;/a&gt; provides a list of the top topics on GitHub. In this project, we'll retrieve information from this page using web scraping: the process of extracting information from a website in an automated fashion using code. We'll use the Python libraries Requests and Beautiful Soup to scrape data from this page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Outline
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;We're going to scrape &lt;a href="https://github.com/topics" rel="noopener noreferrer"&gt;https://github.com/topics&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description&lt;/li&gt;
&lt;li&gt;For each topic, we'll get the top 25 repositories in the topic from the topic page&lt;/li&gt;
&lt;li&gt;For each repository, we'll grab the repo name, username, stars and repo URL&lt;/li&gt;
&lt;li&gt;By the end of the project, we'll create CSV and XLSX files in the following format:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy80s9nc99b51mdfwuzsj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy80s9nc99b51mdfwuzsj.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Install and import the required libraries
&lt;/h3&gt;

&lt;p&gt;Before we begin, let's install the libraries with pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install requests
pip install beautifulsoup4 
pip install pandas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, let's import all the libraries in our code editor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests as req
from bs4 import BeautifulSoup
import pandas as pd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Write a function to download the page
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_topic_link(base_url):
    response = req.get(url=base_url)
    page_content = response.text
    print(response.status_code)

    soup = BeautifulSoup(page_content, "html.parser")
    tags = soup.find_all("div", class_ = "py-4 border-bottom d-flex flex-justify-between")
    topic_links = []
    for tag in tags:
        url_end = tag.find("a")["href"]
        topic_link = f"https://www.github.com{url_end}"
        topic_links.append(topic_link)
    return topic_links
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
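Since the hrefs scraped from the topics page are root-relative (e.g. /topics/3d), plain string concatenation works; the standard library's urljoin is a slightly more robust alternative that also copes with absolute hrefs:

```python
from urllib.parse import urljoin

base = "https://github.com/topics"

# root-relative href, as found on the topics page
print(urljoin(base, "/topics/3d"))            # https://github.com/topics/3d
# absolute hrefs pass through unchanged
print(urljoin(base, "https://github.com/x"))  # https://github.com/x
```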



&lt;h2&gt;
  
  
  Write a function to extract information
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_info_topic(topic_link):
    response1 = req.get(topic_link)
    topic_soup = BeautifulSoup(response1.text, "html.parser")
    topic_tag = topic_soup.find("h1").text.strip() #3D
    topic_desc = topic_soup.find("p").text
    info_topic = {
        "title" : topic_tag,
        "desc" : topic_desc
    }
    return info_topic


def get_info_tags(topic_link):
    response = req.get(topic_link)
    info_soup = BeautifulSoup(response.text, "html.parser")
    repo_tags = info_soup.find_all("div", class_ = "d-flex flex-justify-between flex-items-start flex-wrap gap-2 my-3")
    return repo_tags


def get_info(tag):
    repo_username = tag.find_all("a")[0]
    repo_name = tag.find_all("a")[1]

    url_end = tag.find("a")["href"]
    repo_url = f"https://www.github.com{url_end}"

    repo_star = tag.find("span",{"id":"repo-stars-counter-star"}).text.strip()
    repo_value = int(float(repo_star[:-1]) * 1000)

    topics_data = {
        "repo_name" : "repo_name",
        "repo_username" : repo_username.text.strip(),
        "repo_name" : repo_name.text.strip(),
        "repo_url" : repo_url,
        "repo_star" : repo_value,
        }
    return topics_data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
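get_info has to convert GitHub's abbreviated star counts (e.g. "88.3k") into integers. Pulled out as a small helper (a sketch; the exact display strings depend on GitHub's UI), the conversion is easy to test on its own. Note that round avoids float truncation, where 88.3 * 1000 can come out just below 88300:

```python
def parse_star_count(text):
    """Convert an abbreviated star count such as '88.3k' to an integer."""
    text = text.strip().lower()
    if text.endswith("k"):
        return round(float(text[:-1]) * 1_000)
    if text.endswith("m"):
        return round(float(text[:-1]) * 1_000_000)
    return int(text.replace(",", ""))

print(parse_star_count("88.3k"))  # 88300
print(parse_star_count("950"))    # 950
```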



&lt;h2&gt;
  
  
  Create CSV file(s) with the extracted information
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def save_CSV(results):
    df = pd.DataFrame(results)
    df.to_csv("github.csv", index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
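save_CSV leans on pandas, but the write itself is a plain header-plus-rows dump; the same layout can be produced with the standard library's csv module (the record below is made up for illustration):

```python
import csv
import io

records = [
    {"title": "3D", "repo_username": "mrdoob", "repo_name": "three.js",
     "repo_url": "https://github.com/mrdoob/three.js", "repo_star": 88300},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=records[0].keys())
writer.writeheader()
writer.writerows(records)

print(buf.getvalue().splitlines()[0])  # title,repo_username,repo_name,repo_url,repo_star
```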



&lt;h2&gt;
  
  
  Create XLSX file(s) with the extracted information
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def save_XLX(results):
    df = pd.DataFrame(results)
    df.to_excel("github.xlsx", index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;We have a function to get the list of topics&lt;/li&gt;
&lt;li&gt;We have a function to create a CSV file for the repos scraped from a topic page&lt;/li&gt;
&lt;li&gt;Let's create a function to put them together
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def main():
    base_url = "https://github.com/topics"
    topic_links = get_topic_link(base_url) # list of url ex: https://github.com/topics/3d, https://github.com/topics/AJAX, etc
    result2 = []
    for topic_link in topic_links: # https://github.com/topics/3d
        print(f"getting info {topic_link}")
        topic_tags = get_info_topic(topic_link) #title, desc
        repo_tags = get_info_tags(topic_link) # some repo tags, so we can use for loop
        result1 = []
        for tag in repo_tags:
            repo_info = get_info(tag)
            result1.append(repo_info)
        for x in result1:
            gabungan = topic_tags | x
            result2.append(gabungan)
        save_CSV(result2)
        save_XLX(result2)


if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
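A note on the merge step in main(): the `|` operator for combining dictionaries was added in Python 3.9. On older versions the same merge can be written with unpacking:

```python
topic_info = {"title": "3D", "desc": "3D modeling"}
repo_info = {"repo_name": "three.js", "repo_star": 88300}

# later keys win on conflict, same as topic_info | repo_info
merged = {**topic_info, **repo_info}
print(merged)  # {'title': '3D', 'desc': '3D modeling', 'repo_name': 'three.js', 'repo_star': 88300}
```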



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We are done here. I hope this simple project can be valuable practice for web scraping in Python.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/muchamadfaiz" rel="noopener noreferrer"&gt;https://github.com/muchamadfaiz&lt;/a&gt;&lt;br&gt;
Email : &lt;a href="mailto:muchamadfaiz@gmail.com"&gt;muchamadfaiz@gmail.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How to manage browser driver with Web Driver Manager</title>
      <dc:creator>Muchamad Faiz</dc:creator>
      <pubDate>Thu, 15 Dec 2022 21:09:28 +0000</pubDate>
      <link>https://forem.com/zettasoft/how-to-manage-browser-driver-with-web-driver-manager-3k53</link>
      <guid>https://forem.com/zettasoft/how-to-manage-browser-driver-with-web-driver-manager-3k53</guid>
      <description>&lt;p&gt;In this article i will explain how to use Web Driver Manager to manage browser driver easily&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;When we build a project with the Selenium library, every time there is a newer version of the browser driver we need to download it and set the corresponding executable path explicitly. Only then can we create the driver object and continue with the code we want to run. These steps become tedious because we have to repeat them every time the version changes. Therefore, we use WebDriverManager to handle them for us.&lt;/p&gt;

&lt;h2&gt;
  
  
  Task
&lt;/h2&gt;

&lt;p&gt;Create a simple session that opens google.com and searches for the keyword "web scraping".&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Assuming Python is installed on your machine or in a virtual environment, you must first install WebDriverManager and Selenium:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install webdriver-manager
pip install selenium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, in the code editor, let's import all the libraries we need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we start the session by instantiating the driver object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;driver = webdriver.Chrome(options=Options, service=ChromeService(ChromeDriverManager().install()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script above will download the matching browser driver automatically.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf0g46gd0akvsup3m6wd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf0g46gd0akvsup3m6wd.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
Then, let's use the driver object to open google.com:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;driver.get("https:www.google.com")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's find the search element and type "web scraping" into it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;search_el = driver.find_element(By.CSS_SELECTOR,"input[title='Search']")
search_el.send_keys("web scraping")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have made it to this step, your screen should look like this image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7brlrtd5j2kom6pchlii.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7brlrtd5j2kom6pchlii.jpg" alt="Web Scraping"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now we know that WebDriverManager automates the browser driver setup in Selenium code. It turns the tedious chore of keeping the driver version up to date into an automatic step, so we can focus on our script rather than on browser settings.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/muchamadfaiz" rel="noopener noreferrer"&gt;https://github.com/muchamadfaiz&lt;/a&gt;&lt;br&gt;
Email : &lt;a href="mailto:muchamadfaiz@gmail.com"&gt;muchamadfaiz@gmail.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>beginners</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
    <item>
      <title>Google Image Scraping Using Selenium - Part 1</title>
      <dc:creator>Muchamad Faiz</dc:creator>
      <pubDate>Tue, 13 Dec 2022 09:35:06 +0000</pubDate>
      <link>https://forem.com/zettasoft/google-image-scraping-using-selenium-part-1-1nl7</link>
      <guid>https://forem.com/zettasoft/google-image-scraping-using-selenium-part-1-1nl7</guid>
      <description>&lt;p&gt;If you wanna learn automation scrapping with selenium, then this simple project can be the starting point of your journey. In this tutorial i will explain how to scrape image from google using selenium.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;In my case I want to scrape images from Google for some keyword, let's say "cat", and then store the links in CSV files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Selenium
&lt;/h3&gt;

&lt;p&gt;Before we go any further, we must know what Selenium is. Selenium is a tool for controlling web browsers through programs and performing browser automation. It is mainly used as a cross-browser testing framework. However, Selenium is also a very capable tool for general web automation, as we can program it to do what a human user can do in a browser (in this case, to programmatically download images from Google).&lt;/p&gt;

&lt;h3&gt;
  
  
  Scraping with Selenium
&lt;/h3&gt;

&lt;p&gt;So how exactly does Selenium work? Well, Selenium provides mechanisms to locate elements on a web page and mimic user behaviour. Here is a table of the most used attributes and locators:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ch5FrzJH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/17xhe3nt2a2kxqb39uee.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ch5FrzJH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/17xhe3nt2a2kxqb39uee.jpg" alt="test" width="184" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These elements can be found with the Developer Tools feature of web browsers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---dbpyfNk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/74m2kjk5ozctooxo6kkr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---dbpyfNk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/74m2kjk5ozctooxo6kkr.jpg" alt="Developer Tools" width="880" height="365"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;And now, let's start coding!&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the necessary libraries for the script
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install selenium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Import the libraries. For this tutorial I will be using Google Chrome:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Go to google.com, then search with the keyword "cat"
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;driver = Chrome()
driver.get("https://www.google.com/")
search_el = driver.find_element(By.XPATH, "//input[@title='Search']")
search_el.send_keys("cat")
search_el.send_keys(Keys.ENTER)  # press the Enter key, not the literal string
image_el = driver.find_element(By.XPATH, "//a[@href]")
image_el.click()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
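One XPath detail worth noting: //a[@href] (with @) selects anchors that have an href attribute, while //a[href] selects anchors that contain an href child element, which is almost never what you want in HTML. The difference can be demonstrated with the limited XPath support in Python's standard library:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('<body><a href="/imghp">Images</a><a>plain anchor</a></body>')

# [@href] matches elements that HAVE an href attribute
print(len(doc.findall(".//a[@href]")))  # 1

# [href] matches elements that CONTAIN an href child element (none here)
print(len(doc.findall(".//a[href]")))   # 0
```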



&lt;p&gt;The Google window will pop up with cat images.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N8k6jjeA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pnycc2zam9v6606f2sk1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N8k6jjeA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pnycc2zam9v6606f2sk1.jpg" alt="test2" width="880" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations! You have successfully opened a browser and navigated to cat images automatically. Next, we will scroll the page and extract the image URLs; I will cover that in the next part.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/muchamadfaiz"&gt;https://github.com/muchamadfaiz&lt;/a&gt;&lt;br&gt;
Email : &lt;a href="mailto:muchamadfaiz@gmail.com"&gt;muchamadfaiz@gmail.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>scrapping</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>These 5 Uses of Web Scraping May Benefit Your Business</title>
      <dc:creator>Muchamad Faiz</dc:creator>
      <pubDate>Thu, 10 Nov 2022 04:59:18 +0000</pubDate>
      <link>https://forem.com/zettasoft/this-5-reasons-of-web-scraping-may-benefit-your-business-3e4h</link>
      <guid>https://forem.com/zettasoft/this-5-reasons-of-web-scraping-may-benefit-your-business-3e4h</guid>
      <description>&lt;p&gt;As the digital economy expands, the role of web scraping becomes ever more important. In this article i will explain how web scraping can benefit your business&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Real Estate Listing Scraping
&lt;/h3&gt;

&lt;p&gt;In this era, the housing industry is a complicated system of housing sites holding big data about travel destinations, reviews of places to stay, comments, and user profiles.&lt;br&gt;
Many real estate agents use web scraping to populate their database of available properties for sale or for rent.&lt;/p&gt;

&lt;p&gt;For example, a real estate agency will scrape MLS (private databases that are created, maintained and paid for by real estate professionals to help their clients buy and sell property) listings to build an API that directly populates this information onto their website. This way, they get to act as the agent for the property when someone finds this listing on their site.&lt;/p&gt;

&lt;p&gt;Most listings that you will find on a Real Estate website are automatically generated by an API.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Shopping Sites Comparison
&lt;/h3&gt;

&lt;p&gt;Some websites and applications can help you to easily compare pricing between several retailers for the same product.&lt;/p&gt;

&lt;p&gt;One way that these websites work is by using web scrapers to scrape product data and pricing from each retailer daily. This way, they can provide their users with the comparison data they need.&lt;/p&gt;

&lt;p&gt;High-quality web scraped data obtained in large volumes can be very helpful in analyzing the product trends and forecasting the price.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Lead Generation
&lt;/h3&gt;

&lt;p&gt;Web scraping is used by many companies to collect contact information about potential customers or clients. This is incredibly common in the business-to-business space, where potential customers post their business information publicly online. You can take your target persona (education, company, job title, etc.) and then find relevant websites in your niche: physicians from health care providers; plumbers, dry cleaners, restaurants, etc. from Yell.com; KOLs (key opinion leaders) from big-league start-up companies.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Investment Decisions
&lt;/h3&gt;

&lt;p&gt;Investment decisions are complex, as they usually involve a series of steps before a sound decision can be made: setting up a hypothetical thesis, experimenting, and researching. The most effective way to test an investment thesis is through historical data analysis. It allows you to gain insights into the root cause of past failures or successes, pitfalls you should have avoided, and future investment returns you might gain.&lt;/p&gt;

&lt;p&gt;However, web scraping does not guarantee improved returns on its own. With some financial knowledge and a little effort for data gathering and analysis, it can prove to be an invaluable tool in today’s information-driven economy.&lt;/p&gt;

&lt;p&gt;Web data has no limitation. It’s ever-growing and full of information that can influence the market. Knowing how to harness the power of a web scraping tool can pave the way to better returns. &lt;/p&gt;

&lt;h3&gt;
  
  
  5. Product Optimization
&lt;/h3&gt;

&lt;p&gt;Suppose you want to collect customers’ feedback to cross-examine your product and make improvements. The sentiment analysis technique is widely used to analyze customers’ attitudes, whether it is positive, neutral or negative. However, the analysis needs a considerable amount of data. Web scraping can automate the extraction process faster which saves tons of time and effort for such work.&lt;/p&gt;

&lt;p&gt;That’s all. If you have any comments or questions, please don’t hesitate to share them.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
  </channel>
</rss>
