<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ruma Sinha</title>
    <description>The latest articles on Forem by Ruma Sinha (@rumsinha).</description>
    <link>https://forem.com/rumsinha</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F991671%2F2e553b4c-edbf-4e8a-ae88-0383073c761f.png</url>
      <title>Forem: Ruma Sinha</title>
      <link>https://forem.com/rumsinha</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rumsinha"/>
    <language>en</language>
    <item>
      <title>Pandas Dataframe to AVRO</title>
      <dc:creator>Ruma Sinha</dc:creator>
      <pubDate>Sun, 05 Feb 2023 01:00:45 +0000</pubDate>
      <link>https://forem.com/rumsinha/pandas-dataframe-to-avro-1hll</link>
      <guid>https://forem.com/rumsinha/pandas-dataframe-to-avro-1hll</guid>
      <description>&lt;p&gt;Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format.&lt;br&gt;
Moving data from a source to a destination involves serialization and deserialization. Serialization means encoding data from a source into data structures suitable for transmission and intermediate storage.&lt;br&gt;
Avro provides a data serialization service and stores the data definition together with the data in a single message or file.&lt;br&gt;
Avro relies on schemas. When Avro data is read, the schema used when writing it is always present.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04lxvb97hzjg8rc2pzlb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04lxvb97hzjg8rc2pzlb.png" alt="Image description" width="800" height="208"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from fastavro import writer, parse_schema, reader
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdonicroct9yri35586kt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdonicroct9yri35586kt.png" alt="Image description" width="661" height="404"&gt;&lt;/a&gt;&lt;br&gt;
Using the mall customers dataset from Kaggle, we will read the data into a pandas dataframe, create the Avro schema, and convert the dataframe into records. We then write the data in the Avro file format.&lt;br&gt;
Finally, we validate the Avro file by reading it back into a pandas dataframe.&lt;br&gt;
&lt;/p&gt;
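&lt;p&gt;For illustration, a tiny stand-in dataframe can be built directly in code instead of reading the CSV; the rows and values below are invented, but the column names match the Avro schema used in this post:&lt;/p&gt;

```python
import pandas as pd

# A tiny stand-in for the Kaggle mall customers CSV (rows and values
# here are invented for illustration); the column names match the
# Avro schema defined for this post.
df = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Gender': ['Male', 'Female', 'Female'],
    'Age': [19, 21, 20],
    'Income': [15.0, 16.0, 17.0],
    'SpendingScore': [39.0, 81.0, 6.0],
})
print(df.shape)  # → (3, 5)
```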

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# specifying the avro schema
schema = {
    'doc': 'malldata',
    'name': 'malldata',
    'namespace': 'malldata',
    'type': 'record',
    'fields': [
        {'name': 'CustomerID', 'type': 'int'},
        {'name': 'Gender', 'type': 'string'},
        {'name': 'Age', 'type': 'int'},
        {'name': 'Income', 'type': 'float'},
        {'name': 'SpendingScore', 'type': 'float'}
    ]
}
parsed_schema = parse_schema(schema)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# converting dataframe to records
records = df.to_dict('records')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# writing to avro file format
with open('malldata.avro', 'wb') as out:
    writer(out, parsed_schema, records)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# reading it back into pandas dataframe
avro_records = []

#Read the Avro file
with open('/content/malldata.avro', 'rb') as fo:
    avro_reader = reader(fo)
    for record in avro_reader:
        avro_records.append(record)

#Convert to pd.DataFrame
df_avro = pd.DataFrame(avro_records)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
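&lt;p&gt;As a quick sanity check of the records round trip itself (independent of the Avro binary encoding, so fastavro is not needed here), a dataframe converted to records and rebuilt should match the original. A minimal sketch with a toy dataframe:&lt;/p&gt;

```python
import pandas as pd

# Toy dataframe (invented values) mirroring the schema's column types.
df = pd.DataFrame({
    'CustomerID': [1, 2],
    'Gender': ['Male', 'Female'],
    'Age': [19, 21],
    'Income': [15.0, 16.0],
    'SpendingScore': [39.0, 81.0],
})

records = df.to_dict('records')       # list with one dict per row
df_roundtrip = pd.DataFrame(records)  # rebuild a dataframe from the records

# Columns and values survive the round trip unchanged.
pd.testing.assert_frame_equal(df, df_roundtrip)
print(len(records))  # → 2
```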



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fenjybl37ecbxrvit8t5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fenjybl37ecbxrvit8t5m.png" alt="Image description" width="542" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's upload the Avro file to Google Cloud Storage and create a BigQuery table from it.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GCS bucket&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69nkhhv6bw6079ak729w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69nkhhv6bw6079ak729w.png" alt="Image description" width="800" height="30"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Python code snippet for the table data load
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

table_id = "&amp;lt;GCP Project&amp;gt;.avro.demo_avro_tbl"

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
uri = "gs://&amp;lt;bucket&amp;gt;/malldata.avro"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;In the Cloud Shell, run the Python file with python3 avro_file_load.py.
It prints "Loaded 200 rows." on successful completion.&lt;/li&gt;
&lt;li&gt;In the BigQuery console, we can view the table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn9xu8jlt66l4zlb0ly3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn9xu8jlt66l4zlb0ly3.png" alt="Image description" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbc8zs27dn5wmr8ggrum.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbc8zs27dn5wmr8ggrum.png" alt="Image description" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>writing</category>
    </item>
    <item>
      <title>Analytics Hub Data Exchange Platform</title>
      <dc:creator>Ruma Sinha</dc:creator>
      <pubDate>Sun, 22 Jan 2023 22:28:45 +0000</pubDate>
      <link>https://forem.com/rumsinha/analytics-hub-data-exchange-platform-3gc</link>
      <guid>https://forem.com/rumsinha/analytics-hub-data-exchange-platform-3gc</guid>
      <description>&lt;p&gt;Analytics Hub is a data exchange platform that enables sharing of data and insights.&lt;br&gt;
The core components of Analytics Hub are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Analytics Hub Publisher: Publishes data and shares it in real time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Listings: Enable sharing of data without replicating the shared data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analytics Hub Subscriber: Discovers the data one is looking for, combines it with existing datasets, and leverages BigQuery for various analytics. When one subscribes to a listing, a linked dataset is created in the project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analytics Hub Viewer: Browses the datasets that are accessible in Analytics Hub.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analytics Hub Administrator: Creates data exchanges that enable data sharing, and grants data publishers and subscribers permission to access them.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Architecture Flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kca0cli7nyx48lnklbc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kca0cli7nyx48lnklbc.png" alt="Image description" width="" height=""&gt;&lt;/a&gt;&lt;br&gt;
Shared datasets are the collections of tables and views defined by the data publisher.&lt;br&gt;
Data subscribers get a read-only linked dataset inside their project and VPC perimeter, which they can then combine with their own datasets.&lt;/p&gt;

&lt;p&gt;Steps for creating and subscribing to datasets in Analytics Hub:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Enable Analytics Hub API&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Create exchange by clicking on the Create Exchange link&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbn1ur9lwstyqvapy1ta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbn1ur9lwstyqvapy1ta.png" alt="Image description" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhsdwyfhb2ssvf7etl05.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhsdwyfhb2ssvf7etl05.png" alt="Image description" width="800" height="365"&gt;&lt;/a&gt;&lt;br&gt;
Complete the details by providing the region, display name, and so on, and set permissions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yyicnemoyxcvovv8aqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yyicnemoyxcvovv8aqw.png" alt="Image description" width="800" height="730"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Search for available listings and subscribe to the dataset&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh08gguwc3cuyte7n9mg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh08gguwc3cuyte7n9mg.png" alt="Image description" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Filtering for Trends, we find the Google Trends data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19gkji00b5phxv8tk5u3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19gkji00b5phxv8tk5u3.png" alt="Image description" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkdwv99qa9sryq4ipghh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkdwv99qa9sryq4ipghh.png" alt="Image description" width="800" height="467"&gt;&lt;/a&gt;&lt;br&gt;
Add the dataset to our project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmn1x5d4xdy4fk4zq1qq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmn1x5d4xdy4fk4zq1qq.png" alt="Image description" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can then run queries and build various analytics and dashboards.&lt;/p&gt;

&lt;p&gt;Use Cases: Data sharing in real time.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://cloud.google.com/bigquery/docs/analytics-hub-introduction" rel="noopener noreferrer"&gt;https://cloud.google.com/bigquery/docs/analytics-hub-introduction&lt;/a&gt;&lt;/p&gt;

</description>
      <category>career</category>
      <category>hiring</category>
      <category>community</category>
    </item>
    <item>
      <title>DataWarehouse and BigQuery</title>
      <dc:creator>Ruma Sinha</dc:creator>
      <pubDate>Mon, 16 Jan 2023 13:35:59 +0000</pubDate>
      <link>https://forem.com/rumsinha/datawarehouse-and-bigquery-5g9b</link>
      <guid>https://forem.com/rumsinha/datawarehouse-and-bigquery-5g9b</guid>
      <description>&lt;p&gt;A data warehouse is the single source of truth where data extracted from multiple sources is either loaded first and then transformed (ELT), or transformed per the business requirements and then loaded (ETL).&lt;br&gt;
The primary use cases of a data warehouse are data science and BI.&lt;br&gt;
BigQuery is GCP's serverless data warehouse. In BigQuery, storage and compute are separated and can scale independently.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mz9w3247i95h36eg8hw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mz9w3247i95h36eg8hw.png" alt="Image description" width="800" height="328"&gt;&lt;/a&gt;&lt;br&gt;
Designing and building a data warehouse starts with business requirements: what problems are we trying to solve, and how best can we solve them? For an organization moving on-prem data to the cloud, the initial goal might be to run the OLTP workload on Cloud SQL and then build data pipelines to bring the data into BigQuery for analytics and machine learning. Visualizations and dashboards can be built with Looker connected to BigQuery.&lt;br&gt;
End users are never given access to the raw tables in BigQuery. Say we have Order, Products, and Customers data in BigQuery. We may first build denormalized tables satisfying the various business requirements, then create views on top of them that expose a subset of the data, and grant access to various users so they can answer business questions by querying the views. Loading the tables in BigQuery and then creating denormalized tables, views, or materialized views falls under an Extract-Load-Transform pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jcg8y0herfwr4nwgqnd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jcg8y0herfwr4nwgqnd.png" alt="Image description" width="800" height="213"&gt;&lt;/a&gt;&lt;br&gt;
In BigQuery, tables are created under datasets. Datasets are containers that organize and control access to tables and views.&lt;br&gt;
A dataset is created in a specific GCP project and in a specified geographic location.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtd8fdr6sex0fwt0a4eu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtd8fdr6sex0fwt0a4eu.png" alt="Image description" width="624" height="526"&gt;&lt;/a&gt;&lt;br&gt;
An ELT pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ymu7ntqpqzirxqjil41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ymu7ntqpqzirxqjil41.png" alt="Image description" width="800" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating the dataset that will be the container for the table.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# with bq command
bq --location=US mk --description "Dataset to store diabetes table" demodataset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3yejjpxkvse4obcgo5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3yejjpxkvse4obcgo5t.png" alt="Image description" width="479" height="63"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve8rhurj0pi0005znpwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve8rhurj0pi0005znpwn.png" alt="Image description" width="687" height="264"&gt;&lt;/a&gt;&lt;br&gt;
Creating the BigQuery dataset with Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

dataset_id = "{}.demods".format(client.project)

# Construct a full Dataset object to send to the API.
dataset = bigquery.Dataset(dataset_id)

# Specify the geographic location where the dataset should reside.
dataset.location = "US"

dataset = client.create_dataset(dataset, timeout=30)  # Make an API request.
print("Created dataset {}.{}".format(client.project, dataset.dataset_id))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzu03rf8566czcrznfpna.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzu03rf8566czcrznfpna.png" alt="Image description" width="449" height="115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copying the extracted CSV file to the GCS bucket:
we have created a GCS bucket and uploaded diabetes.csv to the
bucket. GCS is the staging area.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhwwfhd3mmm99r8q1qaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhwwfhd3mmm99r8q1qaj.png" alt="Image description" width="800" height="56"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create the BigQuery table from this extracted data
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#The schema.json file is created to store the table schema
[
  {
    "mode": "NULLABLE",
    "name": "PatientID",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "Pregnancies",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "PlasmaGlucose",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "DiastolicBloodPressure",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "TricepsThickness",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "SerumInsulin",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "BMI",
    "type": "FLOAT"
  },
  {
    "mode": "NULLABLE",
    "name": "DiabetesPedigree",
    "type": "FLOAT"
  },
  {
    "mode": "NULLABLE",
    "name": "Age",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "Diabetic",
    "type": "INTEGER"
  }
]
#Empty table is created with this schema in the demodataset
bq mk --table --description "Table containing diabetes data" --label organization:dev demodataset.mytable tblschema.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cq8vxic74ybpqb3cx2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cq8vxic74ybpqb3cx2m.png" alt="Image description" width="519" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuurgamqhncpnksy39dgy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuurgamqhncpnksy39dgy.png" alt="Image description" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Loading the CSV file into this empty table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# skip_leading_rows=1 for ignoring the header
bq load --source_format=CSV --skip_leading_rows=1 demodataset.mytable gs://demo_bq_bucket9282/diabetes.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrontkjyytk2ax5rk8jh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrontkjyytk2ax5rk8jh.png" alt="Image description" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating the empty table and loading the CSV data with Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

table_id = "&amp;lt;&amp;lt;project&amp;gt;&amp;gt;.demods.demotbl"

schema = [
  {
    "mode": "NULLABLE",
    "name": "PatientID",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "Pregnancies",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "PlasmaGlucose",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "DiastolicBloodPressure",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "TricepsThickness",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "SerumInsulin",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "BMI",
    "type": "FLOAT"
  },
  {
    "mode": "NULLABLE",
    "name": "DiabetesPedigree",
    "type": "FLOAT"
  },
  {
    "mode": "NULLABLE",
    "name": "Age",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "Diabetic",
    "type": "INTEGER"
  }
]

table = bigquery.Table(table_id, schema=schema)
table = client.create_table(table)  # Make an API request.
print(
    "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)
# data load
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

table_id =  "&amp;lt;&amp;lt;project&amp;gt;&amp;gt;.demods.demotbl"

job_config = bigquery.LoadJobConfig(
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
)
uri = "gs://demo_bq_bucket9282/diabetes.csv"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)  # Make an API request.
print("Loaded {} rows.".format(destination_table.num_rows))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhirfhfed6epzv12j0m4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhirfhfed6epzv12j0m4c.png" alt="Image description" width="465" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpzsp52vzbjgorxjdsr7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpzsp52vzbjgorxjdsr7.png" alt="Image description" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As part of a data migration from on-prem to BigQuery, we build the ETL/ELT data pipeline. First the initial data load is completed, then incremental loads are scheduled daily or weekly based on business needs.&lt;br&gt;
BigQuery has three writeDisposition options: WRITE_APPEND, WRITE_EMPTY, and WRITE_TRUNCATE. writeDisposition determines how data gets written to the table.&lt;/p&gt;
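&lt;p&gt;A sketch of how the write disposition can be set on a load job configuration with the same google-cloud-bigquery client library used above; this only constructs the configuration, and the choice of WRITE_TRUNCATE here is illustrative:&lt;/p&gt;

```python
from google.cloud import bigquery

# Configuration-only sketch: the write disposition is passed on the
# LoadJobConfig used by load_table_from_uri.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    # WRITE_APPEND:   add rows to any existing table data.
    # WRITE_EMPTY:    fail if the table already contains data.
    # WRITE_TRUNCATE: replace the existing table data.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
```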

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>DevOps SRE Observability</title>
      <dc:creator>Ruma Sinha</dc:creator>
      <pubDate>Sun, 01 Jan 2023 14:11:01 +0000</pubDate>
      <link>https://forem.com/rumsinha/devops-sre-observability-2a5m</link>
      <guid>https://forem.com/rumsinha/devops-sre-observability-2a5m</guid>
      <description>&lt;h4&gt;
  
  
  DevOps and SRE
&lt;/h4&gt;

&lt;p&gt;DevOps is a set of practices that combines software development and operations. DevOps influences the application lifecycle throughout its plan, develop, deliver, and operate phases. Site Reliability Engineering (SRE) is a practical way to implement DevOps practices and principles.&lt;br&gt;
SRE implements DevOps practices via SLIs, SLOs, SLAs, and error budgets.&lt;/p&gt;

&lt;p&gt;A Service Level Indicator (SLI) is a quantitative measure of the level of service provided over a period. SLIs are metrics defined by the user journey for a service, for example availability, latency, and throughput.&lt;/p&gt;

&lt;p&gt;A Service Level Objective (SLO) is a numerical target that defines the reliability of a system. An SLO is measured using SLIs.&lt;/p&gt;

&lt;p&gt;A Service Level Agreement (SLA) is a commitment that the availability and reliability of the service will meet a certain level of expectation.&lt;/p&gt;

&lt;p&gt;The error budget tells us how unreliable our service is allowed to be: error budget = 100% - SLO.&lt;/p&gt;
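&lt;p&gt;A quick worked example of that arithmetic: with a 99.9% availability SLO over a 30-day window, the error budget is 0.1% of the window, which comes to about 43 minutes of allowed downtime:&lt;/p&gt;

```python
# Error budget for a 99.9% availability SLO over a 30-day window.
slo = 0.999
error_budget = 1.0 - slo           # 0.1% of the window may be "bad"
window_minutes = 30 * 24 * 60      # 43,200 minutes in 30 days
allowed_downtime = window_minutes * error_budget
print(round(allowed_downtime, 1))  # → 43.2 minutes
```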

&lt;h6&gt;
  
  
  The DevOps lifecycle:
&lt;/h6&gt;

&lt;p&gt;CI/CD is a key DevOps practice.&lt;br&gt;
Continuous Integration:&lt;br&gt;
A software development practice where all developers merge code changes into a central repository multiple times a day. Tools that help include Cloud Source Repositories, Cloud Build, and Artifact Registry.&lt;br&gt;
Continuous Delivery:&lt;br&gt;
The practice of automating the entire software release process.&lt;br&gt;
Tools that help include GKE, GKE On-Prem, and Cloud Run.&lt;/p&gt;

&lt;h6&gt;
  
  
  What is observability?
&lt;/h6&gt;

&lt;p&gt;Reliability is the most important feature of a service, and setting SLOs allows monitoring systems to capture how the service is performing.&lt;br&gt;
System reliability is tracked by SLOs, and SLOs require SLIs, i.e. specific metrics to monitor.&lt;br&gt;
Monitoring is the process of collecting, processing, aggregating, and displaying real-time quantitative data about a system.&lt;br&gt;
With monitoring, one can understand trends in application usage patterns, which helps with health checks of the system as well as diagnosing when things go wrong.&lt;br&gt;
Key areas of operations include gathering logs, metrics, and traces; dashboards for visualization; and triggering alerts and error reporting.&lt;br&gt;
Operations are covered by tools such as Cloud Monitoring, Cloud Logging, and Error Reporting, and application performance management by tools like Debugger, Profiler, and Trace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlxjwwzpzx9rroivush5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlxjwwzpzx9rroivush5.png" alt="Image description" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;References:&lt;br&gt;
Google Cloud DevOps certification preparation with A Cloud Guru.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Working with Map() function in Python, Pyspark and Apache Beam</title>
      <dc:creator>Ruma Sinha</dc:creator>
      <pubDate>Sun, 25 Dec 2022 14:15:08 +0000</pubDate>
      <link>https://forem.com/rumsinha/working-with-map-function-in-python-pyspark-and-apache-beam-2c1b</link>
      <guid>https://forem.com/rumsinha/working-with-map-function-in-python-pyspark-and-apache-beam-2c1b</guid>
      <description>&lt;h2&gt;
  
  
  Map() function
&lt;/h2&gt;

&lt;p&gt;map() is a built-in function in Python. It executes a specified function for each item in an iterable, such as a list, set, dictionary or tuple.&lt;br&gt;
The syntax of the map function with a list iterable is:&lt;br&gt;
&lt;em&gt;&lt;strong&gt;map(funcA, [item1, item2, item3, ..., itemN])&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
The function funcA() is applied to each list element, and the result is a new iterator. This process is known as mapping.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example with a Python code
&lt;/h4&gt;
&lt;h6&gt;
  
  
  For loop
&lt;/h6&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;list_of_states = ['new jersey','new york','texas','california','florida']

# for loop to convert each element into upper case
modified_list_of_states = []
for idx in range(len(list_of_states)):
  modified_list_of_states.append(str.upper(list_of_states[idx]))

modified_list_of_states
['NEW JERSEY', 'NEW YORK', 'TEXAS', 'CALIFORNIA', 'FLORIDA']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
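For comparison (an addition to the original example), the same uppercase transformation is often written as a list comprehension, which iterates over the elements directly instead of over indices:

```python
list_of_states = ['new jersey', 'new york', 'texas', 'california', 'florida']

# List comprehension: no index bookkeeping needed
modified_list_of_states = [state.upper() for state in list_of_states]

print(modified_list_of_states)
# ['NEW JERSEY', 'NEW YORK', 'TEXAS', 'CALIFORNIA', 'FLORIDA']
```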


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckqcgcgly5l9k6l34o6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckqcgcgly5l9k6l34o6k.png" alt="Image description" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h6&gt;
  
  
  Lambda and Map function
&lt;/h6&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;modified_list_of_states = map(lambda list_of_states: str.upper(list_of_states), list_of_states)

list(modified_list_of_states)

['NEW JERSEY', 'NEW YORK', 'TEXAS', 'CALIFORNIA', 'FLORIDA']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94hx82uph880yys5a8f3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94hx82uph880yys5a8f3.png" alt="Image description" width="800" height="102"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h6&gt;
  
  
  Map function with a defined function that takes as input each list item and returns the modified value
&lt;/h6&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def convert_case_f(val):
   return str.upper(val)

modified_list_of_states = map(convert_case_f, list_of_states)
modified_list_of_states # creates an iterator. To display the items we use the list()
&amp;lt;map at 0x7f08a15d6ee0&amp;gt;

list(modified_list_of_states)
['NEW JERSEY', 'NEW YORK', 'TEXAS', 'CALIFORNIA', 'FLORIDA']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98fnfyg7enf5v4pbxq3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98fnfyg7enf5v4pbxq3x.png" alt="Image description" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Map function in Pyspark
&lt;/h4&gt;

&lt;p&gt;We have the movie ratings dataset from Kaggle. A sample of the data is shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuejf8lcab96lt69bsp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuejf8lcab96lt69bsp0.png" alt="Image description" width="428" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The third column holds the movie rating, and our use case is to find the total count per rating.&lt;br&gt;
Using the map() function, we can extract the ratings column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F850v4pi98x4n68bgs0v5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F850v4pi98x4n68bgs0v5.png" alt="Image description" width="800" height="835"&gt;&lt;/a&gt;&lt;br&gt;
For every record, the lambda function splits the row on whitespace. The third column, the rating, is extracted from every data record, and this transformation is applied to every row in the dataset with the map() function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark import SparkConf, SparkContext
import collections

# Setting up the SparkContext object
conf = SparkConf().setMaster("local").setAppName("MovieRatingsData")
sc = SparkContext(conf = conf)

# Loading data
movie_ratings_data = sc.textFile("/content/u.data")

# Extracting the ratings data with the map() function
ratings = movie_ratings_data.map(lambda x: x.split()[2])

# Count the total per rating
ratings_count = ratings.countByValue()

# Display the result
collections.OrderedDict(sorted(ratings_count.items()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Festx5p6gio0uj9crxu5h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Festx5p6gio0uj9crxu5h.png" alt="Image description" width="800" height="630"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Map() function with Apache Beam
&lt;/h4&gt;

&lt;p&gt;With the same use case, let us see a working example with Apache Beam. First install the apache-beam library. Next create the pipeline p and a PCollection movie_data that stores the results of all the transformations.&lt;br&gt;
Pipeline p applies all the transformations in sequence. It first reads the data file with a read transform into a collection and then splits each row into columns. Next, we make a key-value pair for each rating, where the key is the rating and the value assigned is 1. These pairs are then combined per key and summed up, and the final results are written out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import apache_beam as beam

p = beam.Pipeline()

movie_data = ( 
                      p 
                      | 'Read from data file' &amp;gt;&amp;gt; beam.io.ReadFromText('/content/u.data')
                      | 'Split rows' &amp;gt;&amp;gt; beam.Map(lambda record: record.split('\t'))
                      | 'Fetch ratings data' &amp;gt;&amp;gt; beam.Map(lambda record: (record[2], 1))
                      | 'Total count per rating' &amp;gt;&amp;gt; beam.CombinePerKey(sum)
                      | 'Write results for rating' &amp;gt;&amp;gt; beam.io.WriteToText('results')
                   )

p.run() # creates the result output file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62bgxs2xv1f6bhkrhqnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62bgxs2xv1f6bhkrhqnc.png" alt="Image description" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  References
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://realpython.com/python-map-function/" rel="noopener noreferrer"&gt;https://realpython.com/python-map-function/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.macrometa.com/event-stream-processing/apache-beam-tutorial" rel="noopener noreferrer"&gt;https://www.macrometa.com/event-stream-processing/apache-beam-tutorial&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Taming Big Data with Apache Spark and Python Udemy&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>tooling</category>
    </item>
    <item>
      <title>Python zip() function</title>
      <dc:creator>Ruma Sinha</dc:creator>
      <pubDate>Fri, 16 Dec 2022 22:57:37 +0000</pubDate>
      <link>https://forem.com/rumsinha/python-zip-function-2hje</link>
      <guid>https://forem.com/rumsinha/python-zip-function-2hje</guid>
<description>&lt;p&gt;The Python zip() function takes iterables as arguments and returns an iterator that aggregates elements from each of the iterables.&lt;br&gt;
A simple working example.&lt;/p&gt;

&lt;h3&gt;
  
  
  list of columns
&lt;/h3&gt;

&lt;p&gt;columns = ["id","name","age","city","country"]&lt;/p&gt;

&lt;h3&gt;
  
  
  list of values
&lt;/h3&gt;

&lt;p&gt;values = [111,"Mary John",35,"New York","USA"]&lt;/p&gt;

&lt;h3&gt;
  
  
  create an iterator with the zip function that takes in columns, values as the input parameters
&lt;/h3&gt;

&lt;p&gt;zipped = zip(columns, values)&lt;/p&gt;

&lt;p&gt;type(zipped) returns the class zip&lt;/p&gt;

&lt;h3&gt;
  
  
  display the values
&lt;/h3&gt;

&lt;p&gt;list(zipped)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a2fXb9q6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vja148xml5lq6f8gp0l6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a2fXb9q6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vja148xml5lq6f8gp0l6.png" alt="Image description" width="521" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Empty zip declaration
&lt;/h3&gt;

&lt;p&gt;zipped = zip()&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Geg14rXB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rg7zkwiqfu8watb69w0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Geg14rXB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rg7zkwiqfu8watb69w0s.png" alt="Image description" width="306" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>zip</category>
      <category>iterator</category>
    </item>
  </channel>
</rss>
