Forem: spkibe

The Silent Bug That Exposed All Tenant Data in Databricks Unity Catalog

spkibe — Mon, 11 May 2026 07:39:39 +0000

We were building a multi-tenant data platform on Databricks. Multiple organisations sharing the same physical tables — each one should see only their own rows. Standard stuff.
We implemented it using Unity Catalog's row-level security and column masking. The functions compiled. The filter showed as applied in DESCRIBE EXTENDED. Every test from the admin account looked perfect.
Then we logged in as a real tenant user.
They could see every tenant's data.

What Row-Level Security and Column Masking Actually Do
Before getting to the bug, a quick primer on how Unity Catalog security works — because understanding the mechanism is what makes the bug obvious in hindsight.

Row-Level Security — Row Filters
A row filter is a SQL function you attach to a table. Unity Catalog calls it automatically on every query, passing the value of a specified column from each row. If the function returns TRUE, the row is shown. If it returns FALSE, the row is completely hidden — not counted, not visible, not even hinted at.

-- Attach a row filter to a table
ALTER TABLE my_catalog.my_schema.my_table
  SET ROW FILTER my_catalog.governance.filter_by_tenant
  ON (TENANT_KEY);

The user never writes a WHERE clause for this. They cannot remove it. It fires invisibly on every query from every tool — SQL editor, notebook, BI connection, API call.

Column-Level Masking — Column Masks
A column mask is a SQL function attached to a specific column. Instead of hiding rows, it transforms values at query time. The row is visible but sensitive fields are replaced, generalized, or redacted based on who is asking.

-- Attach a column mask
ALTER TABLE my_catalog.my_schema.my_table
  ALTER COLUMN FIRST_NAME
  SET MASK my_catalog.governance.mask_name;

The same SELECT returns different values depending on the user's group membership.

One table. One query. Different results per role. Platform-enforced.

Why This Matters
The old approach (dynamic views, one per tenant per role) requires you to trust that every developer always queries the right view, that views stay in sync with schema changes, and that no one ever accidentally gets direct table access. Unity Catalog removes that trust dependency: the filter and mask are attached to the table itself and enforced by the platform on every query, not left to whichever SQL a developer happens to write.

The Bug
Here is the row filter function we wrote:

CREATE OR REPLACE FUNCTION
my_catalog.governance.filter_by_tenant(tenant_key BIGINT)
RETURNS BOOLEAN
RETURN
  IS_ACCOUNT_GROUP_MEMBER('admin_group')
  OR
  EXISTS (
    SELECT 1
    FROM my_catalog.governance.tenant_group_mapping tgm
    WHERE IS_ACCOUNT_GROUP_MEMBER(tgm.group_name)
      AND CAST(tgm.tenant_key AS BIGINT) = tenant_key
  );

Read it carefully.
The function parameter is named tenant_key.
The mapping table column is also named tenant_key.
In the WHERE clause:

AND CAST(tgm.tenant_key AS BIGINT) = tenant_key
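SQL sees two references to tenant_key: the qualified tgm.tenant_key and the unqualified tenant_key. Inside the subquery, the unqualified name resolves to the nearest column in scope, which is tgm.tenant_key. The function parameter is completely ignored.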

The comparison becomes:

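tgm.tenant_key = tgm.tenant_key

That comparison is true for every mapping row with a non-null key, so EXISTS succeeds for any user who belongs to at least one mapped group, no matter which tenant's row is being checked.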
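
The same scoping rule can be reproduced outside Databricks. The sketch below uses Python's built-in sqlite3 module as a stand-in, so it is an analogy rather than Unity Catalog itself: the outer members.tenant_key column plays the role of the function parameter, and the hard-coded group name 'tenant_3' is an assumption standing in for IS_ACCOUNT_GROUP_MEMBER. The table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (member_key INTEGER, tenant_key INTEGER);
    CREATE TABLE tenant_group_mapping (group_name TEXT, tenant_key INTEGER);
    INSERT INTO members VALUES (10, 1), (20, 2), (30, 3);
    INSERT INTO tenant_group_mapping VALUES
        ('tenant_1', 1), ('tenant_2', 2), ('tenant_3', 3);
""")

# Buggy: the unqualified tenant_key binds to tgm.tenant_key (innermost scope),
# so the predicate compares the mapping column with itself.
buggy = conn.execute("""
    SELECT m.tenant_key FROM members m
    WHERE EXISTS (
        SELECT 1 FROM tenant_group_mapping tgm
        WHERE tgm.group_name = 'tenant_3'
          AND tgm.tenant_key = tenant_key
    )
    ORDER BY m.tenant_key
""").fetchall()

# Fixed: the outer value is referenced by an unambiguous name.
fixed = conn.execute("""
    SELECT m.tenant_key FROM members m
    WHERE EXISTS (
        SELECT 1 FROM tenant_group_mapping tgm
        WHERE tgm.group_name = 'tenant_3'
          AND tgm.tenant_key = m.tenant_key
    )
    ORDER BY m.tenant_key
""").fetchall()

print(buggy)  # [(1,), (2,), (3,)]  every tenant leaks
print(fixed)  # [(3,)]              only tenant 3
```

Both queries run without any error or warning, which is what makes the collision so easy to miss.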

Why It Was So Hard to Spot

No error was thrown. The function compiled without warnings. Unity Catalog reported it as valid SQL.
DESCRIBE EXTENDED showed the filter was applied. Row Filter: my_catalog.governance.filter_by_tenant(TENANT_KEY)

Everything looked correct at the metadata level. The filter was attached. The problem was invisible in the schema description.

Admin tests passed. Our initial testing was done from an admin account. The admin bypass (IS_ACCOUNT_GROUP_MEMBER('admin_group')) fires before the EXISTS check, so it returned TRUE for the correct reason. We never noticed the EXISTS was broken.
The bug fails open, not closed. Unity Catalog evaluated the filter exactly as written; the predicate simply returned TRUE for every row, so the failure showed rows rather than blocking them. A broken filter that silently shows everything is much harder to detect than a broken filter that throws an error.

The Diagnosis
The key test was running the filter function directly as the tenant user:

-- Run as the tenant user, not the admin
SELECT
  my_catalog.governance.filter_by_tenant(1) AS can_see_tenant_1,
  my_catalog.governance.filter_by_tenant(2) AS can_see_tenant_2,
  my_catalog.governance.filter_by_tenant(3) AS can_see_tenant_3;

Result:

can_see_tenant_1 = true
can_see_tenant_2 = true
can_see_tenant_3 = true

A user who should only see tenant 3 could see all three. The function was returning true everywhere regardless of tenant key. That confirmed the EXISTS logic was broken — and pointed directly to the parameter name collision.

The Fix — Rename the Parameter

CREATE OR REPLACE FUNCTION
my_catalog.governance.filter_by_tenant(p_tenant_key BIGINT)
RETURNS BOOLEAN
RETURN
  CASE
    -- Null tenant keys are always hidden
    WHEN p_tenant_key IS NULL THEN FALSE

    -- Admin bypass
    WHEN IS_ACCOUNT_GROUP_MEMBER('admin_group') THEN TRUE

    -- Tenant check — p_tenant_key is the parameter
    -- tgm.tenant_key is the table column
    -- SQL can now distinguish between them
    WHEN EXISTS (
      SELECT 1
      FROM my_catalog.governance.tenant_group_mapping tgm
      WHERE IS_ACCOUNT_GROUP_MEMBER(tgm.group_name)
        AND CAST(tgm.tenant_key AS BIGINT) = p_tenant_key
    ) THEN TRUE

    -- Explicit deny — everything else sees zero rows
    ELSE FALSE
  END;

Two changes:

Parameter renamed from tenant_key to p_tenant_key — eliminates the name collision
CASE structure with explicit ELSE FALSE — makes the deny-by-default behaviour visible and intentional

After recreating the function and reapplying the row filter, the same test returned:

can_see_tenant_1 = false
can_see_tenant_2 = false
can_see_tenant_3 = true
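
The intended semantics of the CASE expression can also be pinned down in a plain unit test. The following is a pure-Python model, not Databricks code: the user_groups set and the mapping list are assumptions standing in for IS_ACCOUNT_GROUP_MEMBER and the tenant_group_mapping table, and the group names are hypothetical.

```python
def filter_by_tenant(p_tenant_key, user_groups, mapping, admin_group="admin_group"):
    """Pure-Python model of the fixed row filter (illustrative only)."""
    # Null tenant keys are always hidden
    if p_tenant_key is None:
        return False
    # Admin bypass
    if admin_group in user_groups:
        return True
    # Tenant check: the user must belong to a group mapped to this tenant key;
    # any() returns False when nothing matches, which is the explicit deny
    return any(
        group in user_groups and int(tenant) == p_tenant_key
        for group, tenant in mapping
    )


# Hypothetical mapping rows: (group_name, tenant_key)
mapping = [("tenant_1_group", 1), ("tenant_2_group", 2), ("tenant_3_group", 3)]
tenant3_user = {"tenant_3_group"}

results = [filter_by_tenant(k, tenant3_user, mapping) for k in (1, 2, 3)]
print(results)  # [False, False, True]
```

A model like this does not replace running the real function as a tenant user, but it documents the expected truth table next to the SQL.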

Drop and Reapply After Fixing
In our case, updating the function was not enough on its own. We also dropped and reapplied the row filter so the table picked up the new function definition:

ALTER TABLE my_catalog.my_schema.my_table
  DROP ROW FILTER;

ALTER TABLE my_catalog.my_schema.my_table
  SET ROW FILTER my_catalog.governance.filter_by_tenant
  ON (TENANT_KEY);

The Column Masking Side
For completeness — column masking uses the same pattern and has the same naming risk. Here is what a safe masking function looks like with the p_ prefix convention applied:

CREATE OR REPLACE FUNCTION
my_catalog.governance.mask_name(p_name STRING)
RETURNS STRING
RETURN CASE
  WHEN IS_ACCOUNT_GROUP_MEMBER('full_access_group') THEN p_name
  WHEN IS_ACCOUNT_GROUP_MEMBER('admin_group')       THEN p_name
  WHEN IS_ACCOUNT_GROUP_MEMBER('partial_access_group')
    THEN CONCAT(LEFT(p_name, 1), '***')
  ELSE '#### MASKED ####'
END;
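
To make the first-match-wins behaviour of the CASE branches concrete, here is a small pure-Python model of mask_name. It is only a sketch: user_groups is an assumed stand-in for IS_ACCOUNT_GROUP_MEMBER, and the None handling mirrors the fact that CONCAT returns NULL when an argument is NULL.

```python
def mask_name(p_name, user_groups):
    """Pure-Python model of the mask_name CASE expression (illustrative only)."""
    if "full_access_group" in user_groups or "admin_group" in user_groups:
        return p_name
    if "partial_access_group" in user_groups:
        # CONCAT(LEFT(p_name, 1), '***') yields NULL for a NULL input
        return None if p_name is None else p_name[:1] + "***"
    return "#### MASKED ####"


print(mask_name("Alice", {"full_access_group"}))     # Alice
print(mask_name("Alice", {"partial_access_group"}))  # A***
print(mask_name("Alice", set()))                     # #### MASKED ####
```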

Apply it inline at table creation to avoid broken dependencies later:

CREATE TABLE IF NOT EXISTS my_catalog.my_schema.members
(
    MEMBER_KEY   BIGINT  NOT NULL,
    TENANT_KEY   BIGINT  NOT NULL,
    FIRST_NAME   STRING  MASK my_catalog.governance.mask_name,
    LAST_NAME    STRING  MASK my_catalog.governance.mask_name,
    DATE_OF_BIRTH DATE   MASK my_catalog.governance.mask_dob
)
USING DELTA;

-- Row filter applied separately
ALTER TABLE my_catalog.my_schema.members
  SET ROW FILTER my_catalog.governance.filter_by_tenant
  ON (TENANT_KEY);

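Declaring masks inline means they are restored whenever the CREATE TABLE statement is rerun, so they survive DROP TABLE / CREATE TABLE cycles. A row filter applied through a separate ALTER TABLE does not; always reapply it after recreating a table.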

The Rule

Never name a row filter function parameter the same as a column in any table the function queries.

Prefix all function parameters with p_. It is two characters. It prevents this entire class of silent security failure.

filter_by_tenant(tenant_key BIGINT)   ← dangerous
filter_by_tenant(p_tenant_key BIGINT) ← safe

Full Verification Checklist
Run these in order before trusting any row filter in production:

-- 1. Confirm groups are account-level (not workspace-level)
--    Run as the target user:
SELECT IS_ACCOUNT_GROUP_MEMBER('your_tenant_group');
-- Expected: true

-- 2. Confirm filter function returns correct values per tenant
SELECT
  my_catalog.governance.filter_by_tenant(1) AS t1,
  my_catalog.governance.filter_by_tenant(2) AS t2,
  my_catalog.governance.filter_by_tenant(3) AS t3;
-- Expected: false, false, true (for a tenant 3 user)

-- 3. Confirm filter is attached to the table
DESCRIBE EXTENDED my_catalog.my_schema.my_table;
-- Look for: Row Filter: my_catalog.governance.filter_by_tenant(TENANT_KEY)

-- 4. Confirm mapping table has correct data
SELECT * FROM my_catalog.governance.tenant_group_mapping;

-- 5. Confirm the EXISTS subquery works correctly
SELECT EXISTS (
  SELECT 1
  FROM my_catalog.governance.tenant_group_mapping tgm
  WHERE IS_ACCOUNT_GROUP_MEMBER(tgm.group_name)
    AND tgm.tenant_key = 3
) AS exists_result;
-- Expected: true (for tenant 3 user)

-- 6. Run query as target user and confirm only their rows appear
SELECT COUNT(*), TENANT_KEY
FROM my_catalog.my_schema.my_table
GROUP BY TENANT_KEY;
-- Expected: only their tenant_key in results
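
Step 6 is easy to automate. As a rough sketch, assume the result rows have already been fetched into Python as (row_count, tenant_key) tuples; how you fetch them depends on your client and is not shown. The helper name leaked_tenants and the row counts below are made up for illustration.

```python
def leaked_tenants(rows, allowed_tenant_keys):
    """Return tenant keys present in the result that the user must not see.

    rows: iterable of (row_count, tenant_key) tuples from the step 6 query.
    """
    allowed = set(allowed_tenant_keys)
    return sorted({tenant for _count, tenant in rows if tenant not in allowed})


# Illustrative results for a tenant 3 user, before and after the fix
rows_before_fix = [(120, 1), (95, 2), (40, 3)]
rows_after_fix = [(40, 3)]

print(leaked_tenants(rows_before_fix, [3]))  # [1, 2]
print(leaked_tenants(rows_after_fix, [3]))   # []
```

An empty list is the only acceptable outcome; anything else means the filter is showing rows it should hide.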

Other Gotchas We Hit Along the Way
While we are here — these are the other issues that burned us during the same implementation:
Workspace groups vs account groups. IS_ACCOUNT_GROUP_MEMBER() only recognises account-level groups created in the Databricks Account Console, not workspace-level groups. A workspace group always returns false. This one caused hours of confusion.
Cluster identity. Notebooks attached to a cluster run queries as the cluster owner's identity, not the logged-in user. IS_ACCOUNT_GROUP_MEMBER() checks the cluster owner's groups. Switch to a SQL Warehouse, which always evaluates per the logged-in user.
Broken dependencies after catalog deletion. Column masks hold references to functions by their fully-qualified path. Delete the catalog containing a masking function without first dropping the masks, and every table with that mask becomes unqueryable with UC_DEPENDENCY_DOES_NOT_EXIST. Always drop masks before dropping catalogs.
Row filter lost after DROP TABLE. When you drop and recreate a table, inline column masks are preserved in the CREATE TABLE statement. Row filters are not. Always reapply ALTER TABLE SET ROW FILTER after recreating any filtered table.

Summary
Unity Catalog row-level security and column masking are genuinely powerful. One filter function and one masking function replace hundreds of views, a duplicate encrypted schema, and developer-discipline-as-security-policy.
But the parameter name collision bug is subtle enough that it will catch you if you are not looking for it. The function looks right. It compiles cleanly. It attaches without errors. And it silently hands every user a complete view of every tenant's data.
Prefix your parameters. Always.

Star Schema: The Data Warehouse Hack So Simple, Experts Hate It

spkibe — Sun, 16 Mar 2025 20:45:58 +0000

Let’s cut through the jargon: a star schema is the easiest, most badass way to build a data warehouse. Picture a fact table (say, sales) sitting in the center like a king, surrounded by dimension tables (products, time, customers) like loyal knights. That’s it. No convoluted hierarchies, no endless joins, just a radial, denormalized beauty built for speed and simplicity. The usual definition says as much: “A dimensional model in a star configuration… de-normalized for better performance and easier understanding.” It’s the data equivalent of a cheat code.

Why’s it genius? Because it’s optimized for OLAP queries—those big, juicy SELECT statements that slice and dice your data into insights. Fewer joins mean faster results. Hierarchies like “year → quarter → month” live right in the dimension table, so you’re not chasing parent tables across a database swamp. I’ve seen star schemas turn a 10-minute report into a 10-second one—try that with a normalized mess. Plus, it’s extensible: slap on a new dimension or tweak a hierarchy, and your old queries still work. It’s like Lego for data nerds.
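
Here is a minimal sketch of that idea using Python's built-in sqlite3 module. The table names (fact_sales, dim_date, dim_product) and all values are made up for illustration. Because year, quarter, and month all live in dim_date, a sales-by-quarter report needs a single join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: denormalized, hierarchy columns side by side
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT, month TEXT);
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- Fact table: foreign keys to each dimension plus the measure
    CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, amount REAL);

    INSERT INTO dim_date VALUES
        (1, 2025, 'Q1', 'Jan'), (2, 2025, 'Q1', 'Feb'), (3, 2025, 'Q2', 'Apr');
    INSERT INTO dim_product VALUES
        (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
    INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 2, 50.0), (3, 1, 75.0);
""")

# One join from the fact table to the date dimension answers the question
sales_by_quarter = conn.execute("""
    SELECT d.year, d.quarter, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.year, d.quarter
    ORDER BY d.year, d.quarter
""").fetchall()

print(sales_by_quarter)  # [(2025, 'Q1', 150.0), (2025, 'Q2', 75.0)]
```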

But here’s the controversial twist: some “experts” overcomplicate it because they can’t handle its simplicity. They’ll snowflake it up (more on that tomorrow) or drown it in metadata until it’s unrecognizable. I once worked with a team that spent weeks debating surrogate keys for a star schema that was already humming—meanwhile, the business was begging for actionable numbers. The truth? Star schemas shine when you keep them lean. Over-engineer it, and you’re just flexing for the wrong crowd.

Sure, it’s not perfect. Denormalization means redundancy—updating a customer’s address might hit multiple rows, and that can sting. But in a data warehouse, where reads outnumber writes 100-to-1, who cares? It’s built for analysts, not accountants. So, next time someone smirks at your star schema’s “simplicity,” tell them: “It’s not basic—it’s brilliant.” Then watch their report lag while yours sings.

"Data Warehouses: The Silent Powerhouse Your Boss Doesn’t Understand"

spkibe — Wed, 12 Mar 2025 08:12:10 +0000

Picture this: mountains of transactional data piling up in your company’s systems—sales, clicks, shipments, complaints—all screaming for attention. But your fancy OLTP database is choking, optimized for quick inserts, not big-picture insights. Enter the data warehouse: the relational (or multidimensional) beast built for query and analysis, not petty transaction processing. It’s the unsung hero of business decisions, and most executives don’t even know it exists.

So What Is a Data Warehouse?

It’s “a subject-oriented, integrated, time-variant, and non-volatile collection of historical data to support management decision-making.” Translation? It’s a time machine for your business, hoarding years of data from every source imaginable (ERP, CRM, that sketchy Excel sheet your intern made) and turning it into something useful. Non-volatile means once it’s in, it stays in. Time-variant means every record is tied to a point in time, so you can compare this quarter with the same quarter five years ago. Subject-oriented? It’s laser-focused on what matters: sales trends, customer behavior, profit margins.

But here’s the controversial bit: data warehouses are overkill for 90% of startups and small fries. You don’t need a Ferrari to drive to the corner store. I’ve seen companies sink millions into a warehouse only to query it once a quarter for a PowerPoint slide. Meanwhile, big players—like retail giants or banks—live or die by them, slicing and dicing historical data to predict the next big move.

The real kicker? It separates analysis from transactions, so your operational systems don’t crash when the CEO demands a 5-year sales report. It’s not sexy, but it’s the backbone of every KPI dashboard you’ve ever bragged about. So next time someone asks, “Why do we even have this?”—tell them it’s the difference between guessing and knowing.

Uses of Snowflake Schema

spkibe — Wed, 15 Jan 2025 09:09:37 +0000

The snowflake schema is a type of database schema that organizes data into a centralized fact table surrounded by normalized dimensions. Unlike a star schema, where dimensions are typically denormalized into flat tables, the snowflake schema splits dimensions into related sub-dimensions, reducing data redundancy and improving storage efficiency.

Dimensions with hierarchies can be decomposed into a snowflake structure when queries run against an aggregate of the fact table and you want to avoid joining to a large dimension table. For example, if you separate brand information out of a product dimension table, the resulting brand snowflake holds a single row per brand and therefore contains significantly fewer rows than the product dimension table. The following figure shows a snowflake structure for the brand and product line elements and the brand_agg aggregate table.

Snowflake schemas are especially useful in scenarios where certain attributes apply only to subsets of a dimension, leading to sparse data and inefficiencies in traditional denormalized structures.

Below are three practical use cases where the snowflake schema is applied, along with clear data models to demonstrate how it works.

Use Case 1: Large Customer Dimension

Scenario:

In businesses such as online marketing, there are two types of customers:

Anonymous Visitors: Identified only by cookie data, with minimal attributes.

Registered Customers: Have detailed information, including demographics, address, and payment history.

Storing these two types of entities in a single table results in inefficiencies as most attributes remain null for anonymous visitors.

Solution:

Using a snowflake schema:

The base Customer Dimension holds common attributes for both visitors and registered customers.

Separate sub-dimensions store specific attributes for Visitors and Registered Customers.

Data Model:

+---------------------+        +--------------------------+
|  Customer (Base)    |        | Customer Details (Snow)  |
+---------------------+        +--------------------------+
| Customer_ID (PK)    |--------| Customer_ID (FK)         |
| Customer_Type       |        | Demographics             |
| Last_Visit_Date     |        | Address                  |
| Signup_Date         |        | Payment_History          |
+---------------------+        +--------------------------+

       |
       |
       V
+---------------------+
| Visitors (Snow)     |
+---------------------+
| Visitor_ID (PK)     |
| Cookie_ID           |
| Visit_Frequency     |
| Browsing_History    |
+---------------------+
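
As a rough sketch of this model, the snippet below builds the base table and the registered-customer sub-dimension with Python's built-in sqlite3 module. Column names follow the diagram loosely, and the two sample rows are made up for illustration. The anonymous visitor simply has no row in customer_details, instead of a row full of nulls in one wide table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base dimension: attributes shared by visitors and registered customers
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        customer_type TEXT,
        last_visit_date TEXT);
    -- Sub-dimension: attributes that only registered customers have
    CREATE TABLE customer_details (
        customer_id INTEGER REFERENCES customer(customer_id),
        address TEXT);

    INSERT INTO customer VALUES
        (1, 'registered', '2025-01-10'), (2, 'visitor', '2025-01-12');
    INSERT INTO customer_details VALUES (1, 'Nairobi');
""")

# LEFT JOIN keeps visitors in the result even though they have no details row
rows = conn.execute("""
    SELECT c.customer_id, c.customer_type, d.address
    FROM customer c
    LEFT JOIN customer_details d ON d.customer_id = c.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(rows)  # [(1, 'registered', 'Nairobi'), (2, 'visitor', None)]
```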

Use Case 2: Financial Product Dimension

Scenario:

In financial services, different product types (e.g., loans and insurance) have distinct attributes. Attempting to store all attributes in one dimension results in sparse data, as many attributes will not apply to all products.

Solution:

Using a snowflake schema:

The base Product Dimension contains attributes common to all products.

Separate sub-dimensions store specialized attributes for different product types.

Data Model:

+----------------------+        +-----------------------------+
| Product (Base)       |        | Product Details (Snowflake) |
+----------------------+        +-----------------------------+
| Product_ID (PK)      |--------| Product_ID (FK)             |
| Product_Type         |        | Specialized_Attribute_1     |
| Core_Attribute       |        | Specialized_Attribute_2     |
+----------------------+        +-----------------------------+

       |
       |
       V
+----------------------+
| Loan Products (Snow) |
+----------------------+
| Loan_ID (PK)         |
| Interest_Rate        |
| Loan_Term            |
| Collateral_Type      |
+----------------------+

       |
       |
       V
+-----------------------+
| Insurance Products    |
+-----------------------+
| Insurance_ID (PK)     |
| Coverage_Type         |
| Premium_Amount        |
| Policy_Duration       |
+-----------------------+

Use Case 3: Multi-Enterprise Calendar Dimension

Scenario:

In international businesses, calendars vary by country. For example:

The US might have specific fiscal quarters and national holidays.

The UK might have unique bank holidays.

India might have a calendar with festival-specific dates.

Storing all attributes in one table leads to complexity and inefficiency.

Solution:

Using a snowflake schema:

The base Calendar Dimension contains attributes common to all countries.

Separate sub-dimensions store country-specific calendar attributes.

Data Model:

+----------------------+        +---------------------------+
| Calendar (Base)      |        | US Calendar (Snowflake)   |
+----------------------+        +---------------------------+
| Calendar_ID (PK)     |--------| Calendar_ID (FK)          |
| Date                 |        | National_Holiday          |
| Week_Number          |        | US_Fiscal_Quarter         |
| Fiscal_Year          |        | US_Specific_Attribute     |
+----------------------+        +---------------------------+

       |
       |
       V
+----------------------------+
| UK Calendar (Snowflake)    |
+----------------------------+
| Calendar_ID (FK)           |
| National_Holiday           |
| Bank_Holiday               |
| UK_Fiscal_Quarter          |
+----------------------------+

Conclusion

The snowflake schema is an efficient and organized approach to handling complex dimensions with sparse data. By breaking down dimensions into smaller, logical sub-dimensions, it:

Reduces storage requirements.
Improves query performance for specific attribute groups.
Enhances clarity in schema design.

The examples above highlight how snowflake schemas can be applied to real-world scenarios, such as customer data, financial products, and multi-country calendars, ensuring data is both accessible and efficiently structured.

Web Scraping - Buy rent kenya website

spkibe — Mon, 26 Feb 2024 02:35:42 +0000

Web scraping is the process of extracting data contained in a website's HTML tags.
Here is the link to the website we'll be scraping: https://www.buyrentkenya.com/. This website lists the projects and houses available for sale or rent in Kenya.
I used BeautifulSoup, a Python library, to scrape this site, along with requests and pandas. The first thing to do is to install the necessary libraries.
To install these Python libraries cleanly, it's wise to create a Python virtual environment that will contain them.
We set up the environment with the virtualenv tool. First, install the tool:
pip install virtualenv

To use a virtual environment in your project, create a new project folder in your terminal, cd into it, and run the following commands (python -m venv uses Python's built-in venv module):

mkdir web_scraping # create a new folder
cd web_scraping
python -m venv venv # create an environment named venv

To activate the environment, use:
source venv/bin/activate - for Linux and Mac users
venv\Scripts\activate - for Windows users.

After activating the environment, install the required libraries with pip install requests pandas beautifulsoup4.

Now that the environment is ready, we can begin the web scraping process.

Getting Started:
Before we dive into the code, let's understand the goal. We want to collect data on houses for rent, including details such as title, location, number of bedrooms and bathrooms, description, and price. We'll be scraping data from multiple pages to create a comprehensive dataset.

Create a Python script; I named mine buy_rent_kenya.py.
The code follows these main steps:

1. Send a GET request to the initial URL.
2. Use BeautifulSoup to parse the HTML content.
3. Extract information from each listing on the page.
4. Iterate through multiple pages, repeating the process.
5. Store the collected data in a Pandas DataFrame.
6. Save the DataFrame to a CSV file.

Import the necessary libraries for our task as below.

import pandas as pd
from bs4 import BeautifulSoup
import requests

Next, get your browser's user agent. Search for "what is my browser agent" in your browser and the result will show it.

The user agent matters in web scraping because it makes the request look like it comes from a regular browser. That lowers the chance of being flagged as a scraper and gives you the same content a browser would receive.

url = "https://www.buyrentkenya.com/houses-for-rent"
agent = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

These lines set the target URL and the user agent, which makes the request look like it comes from a web browser. This helps avoid potential blocking or restrictions imposed by the website.
Note: Replace the agent with your own, obtained from the search above.

# Set headers for the HTTP request
HEADERS = ({'User-Agent':agent,'Accept-Language':'en-US, en;q=0.5'})
# Send a GET request to the URL
response = requests.get(url,headers=HEADERS)

Here, headers are defined for the HTTP request, and a GET request is made to the specified URL using the requests library.

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(response.content,'html.parser')

The HTML content of the page is parsed using BeautifulSoup, making it easier to navigate and extract information.

# Initialize lists to store data
titles = []
locations = []
no_of_bathrooms = []
no_of_bedrooms = []
descriptions = []
prices = []
links= []

Empty lists are initialized to store the extracted data.

# Find all listing cards on the page
houses = soup.find_all("div",class_="listing-card")

This line locates all HTML elements with the class "listing-card," which corresponds to individual housing listings on the page.

# Extract information from each listing
for house in houses:
    # Extract title
    title = house.find("span",class_="relative top-[2px] hidden md:inline").text.strip()
    # Extract location
    location = house.find("p",class_="ml-1 truncate text-sm font-normal capitalize text-grey-650").text.strip()
    # Extract number of bedrooms and bathrooms
    no_of_bedroom = house.find_all("span",class_="whitespace-nowrap font-normal")[0].text.strip()
    no_of_bathroom = house.find_all("span",class_="whitespace-nowrap font-normal")[1].text.strip()
    # Extract description
    description = house.find("a",class_="block truncate text-grey-500 no-underline").text.strip()
    # Extract price
    price = house.find("p",class_="text-xl font-bold leading-7 text-grey-900").text.strip()
    # Extract link
    link = house.find("a",class_="text-black no-underline").get("href")

    # Append extracted data to respective lists
    titles.append(title)
    locations.append(location)
    no_of_bathrooms.append(no_of_bathroom)
    no_of_bedrooms.append(no_of_bedroom)
    descriptions.append(description)
    prices.append(price)
    links.append(link)

In this loop, data such as title, location, number of bedrooms and bathrooms, description, price, and link are extracted from each listing and appended to their respective lists.

The title is found within a span tag with the class "relative top-[2px] hidden md:inline".
The location is found within a p tag with the class "ml-1 truncate text-sm font-normal capitalize text-grey-650".
Both the number of bedrooms and the number of bathrooms are within span tags with the same class, "whitespace-nowrap font-normal". We therefore use BeautifulSoup's find_all(), which returns them as a list, and index into it: [0] corresponds to bedrooms and [1] to bathrooms.
The description is found within an a tag with the class "block truncate text-grey-500 no-underline".
The price is located within a p tag with the class "text-xl font-bold leading-7 text-grey-900".
The link is obtained from an a tag with the class "text-black no-underline" using the get("href") method.

Note: Make sure to inspect the HTML structure of the website you are scraping to adapt these identifiers accordingly. If the website structure changes, you may need to update these selectors accordingly.
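
One consequence of changing selectors is that find() returns None when nothing matches, and calling .text on None raises an AttributeError that stops the whole run. A small guard like the hypothetical safe_text helper below is one way to soften that. FakeTag is only a stand-in for a BeautifulSoup tag so the sketch runs without the site.

```python
def safe_text(element, default=None):
    """Return the stripped text of a find() result, or default if nothing matched."""
    if element is None:
        return default
    return element.text.strip()


class FakeTag:
    """Stand-in for a BeautifulSoup Tag, used only to demonstrate the helper."""

    def __init__(self, text):
        self.text = text


print(safe_text(FakeTag("  3 Bedroom House  ")))  # 3 Bedroom House
print(safe_text(None, default="N/A"))             # N/A
```

In the scraping loop this would be used as title = safe_text(house.find("span", class_="...")), so a listing with a missing field produces a placeholder instead of a crash.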

# Display the number of houses extracted from the first page
print(f"The First Page No Of Titles is {len(titles)}")

This prints out the number of titles on the first page.

The website paginates its listings after the first page. The page number appears in the URL and increments in the format below:

url = f"https://www.buyrentkenya.com/houses-for-rent?page={page}"

The code to extract the listings from the paginated URLs looks like this:

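# Iterate through multiple pages
for page in range(2,56):
    url = f"https://www.buyrentkenya.com/houses-for-rent?page={page}"
    # Make a GET request for each page
    response = requests.get(url,headers=HEADERS)
    print(url)
    soup = BeautifulSoup(response.content,'html.parser')  # re-parse, or soup still holds page 1
    houses = soup.find_all("div",class_="listing-card")
    for house in houses:
        # Repeat the process of extracting data from each listing
        rooms = house.find_all("span",class_="whitespace-nowrap font-normal")
        titles.append(house.find("span",class_="relative top-[2px] hidden md:inline").text.strip())
        locations.append(house.find("p",class_="ml-1 truncate text-sm font-normal capitalize text-grey-650").text.strip())
        no_of_bedrooms.append(rooms[0].text.strip())
        no_of_bathrooms.append(rooms[1].text.strip())
        descriptions.append(house.find("a",class_="block truncate text-grey-500 no-underline").text.strip())
        prices.append(house.find("p",class_="text-xl font-bold leading-7 text-grey-900").text.strip())
        links.append(house.find("a",class_="text-black no-underline").get("href"))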

These lines iterate through multiple pages, updating the URL for each page, making a GET request, and extracting data from each listing on the page.
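
If you prefer to keep the page range in one place, the URL building can be pulled out of the loop. This is only a sketch: page_urls is a hypothetical helper, and it builds URLs without making any request, so it can be checked without touching the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.buyrentkenya.com/houses-for-rent"


def page_urls(first_page, last_page, base_url=BASE_URL):
    """Yield the paginated listing URLs from first_page to last_page inclusive."""
    for page in range(first_page, last_page + 1):
        yield f"{base_url}?{urlencode({'page': page})}"


urls = list(page_urls(2, 4))
print(urls[0])    # https://www.buyrentkenya.com/houses-for-rent?page=2
print(len(urls))  # 3
```

The scraping loop can then iterate over page_urls(2, 55) and fetch each URL in turn.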

# Display the total number of titles scraped
print(f"The total number of titles we have scraped is {len(titles)}")

This prints out the total number of titles scraped from all pages.

The last part is to save our extracted data into a CSV file:

# Organize data into a DataFrame
data = {
    "Titles": titles,
    "Locations": locations,
    "No Of Bathrooms": no_of_bathrooms,
    "No Of Bedrooms": no_of_bedrooms,
    "Prices": prices,
    "Description": descriptions
}
df = pd.DataFrame(data)
print(df.shape)
#print(df.head(10))

The extracted data is organized into a Pandas DataFrame for better structure and analysis.

# Save DataFrame to a CSV file
df.to_csv("buy_rent_kenya.csv",index=False)

Finally, the DataFrame is saved as a CSV file named "buy_rent_kenya.csv". The index=False parameter ensures that the DataFrame index is not included in the CSV file.

Conclusion

Web scraping is a powerful tool for extracting valuable information from websites. This Python script provides a glimpse into the process of scraping rental property listings from Buy Rent Kenya. Keep in mind that web scraping should be done responsibly and in compliance with the terms of service of the website being scraped.

Feel free to explore, modify, and adapt the code for your specific needs. Happy coding and may your data exploration endeavors be fruitful.
Here is the link to the full code on github

Be on the lookout for our next article, which automates the scraping process above using Airflow.

Introduction to Data Structures and Algorithms.

spkibe — Mon, 20 Jun 2022 19:05:25 +0000

A carpenter has several tools, yet each one has a specific purpose in completing a task. A programmer is similar: depending on the task at hand, they need the appropriate tool to handle a certain challenge.
Data structures are programmers' tools, and each one serves a specific purpose. Many companies use data structure challenges in their interviews to see if a programmer is a strong problem solver.
Data structures are classified into two major categories:
1. Linear data structures → store data in a sequential manner.
2. Non-linear data structures → store data in a non-sequential manner.

Types of Linear data structures:

1) Arrays → store values in contiguous memory. The kinds of elements an array can hold depend on the programming language.
2) Stacks → store elements according to the LIFO (last in, first out) principle, which means that the last element added is the first one removed.
3) Queues → store elements using the FIFO (first in, first out) principle, which means that the first element inserted is the first one removed.
4) Linked lists → organize data as a chain of nodes, with each node containing the address of the next node.
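
The linear structures above can be sketched in a few lines of Python. This is one common way to model them, not the only one: a list serves as the stack, collections.deque as the queue, and the Node class is a minimal hand-written linked list for illustration.

```python
from collections import deque

# Stack: last in, first out
stack = []
stack.append(1)
stack.append(2)
stack.append(3)
last_in = stack.pop()        # removes 3, the most recently added element

# Queue: first in, first out
queue = deque([1, 2, 3])
first_in = queue.popleft()   # removes 1, the earliest added element


# Linked list: each node holds a value and a reference to the next node
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next_node = next_node


head = Node("a", Node("b", Node("c")))

# Walk the chain from the head until there is no next node
values = []
node = head
while node is not None:
    values.append(node.value)
    node = node.next_node

print(last_in, first_in, values)  # 3 1 ['a', 'b', 'c']
```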

Types of Non-Linear data structures:

1) Graphs → made up of nodes (vertices) and edges. Each edge connects two nodes.
2) Trees → store data hierarchically in a tree-like structure arranged in multiple levels. A tree has a root node (the topmost node) from which all other nodes descend.
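
As a small illustration, a graph can be stored as an adjacency dictionary and a tree as nested dictionaries. Both representations, and the node names, are just assumptions chosen to keep the sketch short.

```python
# Graph: each node maps to the list of nodes it shares an edge with
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B"],
}

# Tree: every node has a value and a list of child nodes
tree = {
    "value": "root",
    "children": [
        {"value": "left", "children": []},
        {"value": "right", "children": [{"value": "leaf", "children": []}]},
    ],
}


def depth(node):
    """Number of levels in the tree rooted at node."""
    return 1 + max((depth(child) for child in node["children"]), default=0)


print(graph["A"])   # ['B', 'C']
print(depth(tree))  # 3
```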

Algorithms
An algorithm is a set of instructions or a procedure for solving a specific problem. You can think of it as a recipe or a blueprint for solving the problem.
Types of Algorithms:
1) Sort algorithms
2) Search algorithms
3) Hashing

Every algorithm takes some time and memory to run. This is where algorithm complexity comes into play: it measures the amount of time and space required to execute an algorithm.
Space complexity → the overall amount of memory taken up by the algorithm in relation to the input size.
Time complexity → the amount of time an algorithm takes to run in relation to the input size. It is mostly expressed using big O notation (an asymptotic notation). For example, for a problem of size n:
1. O(1) is a constant-time function
2. O(n) is a linear-time function
3. O(n^2) is a quadratic-time function
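
To see the difference in practice, the sketch below counts how many steps three simple functions take on a list of 100 items. The function names are made up for illustration, and the step counters exist only to make the growth visible.

```python
def first_item(items):
    """O(1): one lookup, no matter how long the list is."""
    return items[0]


def linear_search(items, target):
    """O(n): in the worst case every item is inspected once."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps


def count_pairs(items):
    """O(n^2): every item is paired with every item."""
    steps = 0
    for _a in items:
        for _b in items:
            steps += 1
    return steps


data = list(range(100))
print(first_item(data))         # 0
print(linear_search(data, 99))  # 100
print(count_pairs(data))        # 10000
```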