<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ruma Sinha</title>
    <description>The latest articles on Forem by Ruma Sinha (@rumsinha).</description>
    <link>https://forem.com/rumsinha</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F991671%2F2e553b4c-edbf-4e8a-ae88-0383073c761f.png</url>
      <title>Forem: Ruma Sinha</title>
      <link>https://forem.com/rumsinha</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rumsinha"/>
    <language>en</language>
    <item>
      <title>Pandas Dataframe to AVRO</title>
      <dc:creator>Ruma Sinha</dc:creator>
      <pubDate>Sun, 05 Feb 2023 01:00:45 +0000</pubDate>
      <link>https://forem.com/rumsinha/pandas-dataframe-to-avro-1hll</link>
      <guid>https://forem.com/rumsinha/pandas-dataframe-to-avro-1hll</guid>
      <description>&lt;p&gt;Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format.&lt;br&gt;
Moving data from a source to a destination involves serialization and deserialization. Serialization means encoding data from a source into data structures suitable for transmission and intermediate storage.&lt;br&gt;
Avro provides a data serialization service and stores the data definition together with the data in a single message or file.&lt;br&gt;
Avro relies on schemas. When Avro data is read, the schema used when writing it is always present.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04lxvb97hzjg8rc2pzlb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04lxvb97hzjg8rc2pzlb.png" alt="Image description" width="800" height="208"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from fastavro import writer, parse_schema, reader
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdonicroct9yri35586kt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdonicroct9yri35586kt.png" alt="Image description" width="661" height="404"&gt;&lt;/a&gt;&lt;br&gt;
Using the mall customers dataset from Kaggle, we will read the data into a pandas dataframe, create the Avro schema, and convert the dataframe into records. We then write the data in the Avro file format.&lt;br&gt;
Finally, we validate the Avro file by reading it back into a pandas dataframe.&lt;br&gt;
&lt;/p&gt;
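&lt;p&gt;For illustration, a tiny stand-in dataframe can be built directly in code instead of reading the CSV; the rows and values below are invented, but the column names match the Avro schema used in this post:&lt;/p&gt;

```python
import pandas as pd

# A tiny stand-in for the Kaggle mall customers CSV (rows and values
# here are invented for illustration); the column names match the
# Avro schema defined for this post.
df = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Gender': ['Male', 'Female', 'Female'],
    'Age': [19, 21, 20],
    'Income': [15.0, 16.0, 17.0],
    'SpendingScore': [39.0, 81.0, 6.0],
})
print(df.shape)  # → (3, 5)
```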

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# specifying the avro schema
schema = {
    'doc': 'malldata',
    'name': 'malldata',
    'namespace': 'malldata',
    'type': 'record',
    'fields': [
        {'name': 'CustomerID', 'type': 'int'},
        {'name': 'Gender', 'type': 'string'},
        {'name': 'Age', 'type': 'int'},
        {'name': 'Income', 'type': 'float'},
        {'name': 'SpendingScore', 'type': 'float'}
    ]
}
parsed_schema = parse_schema(schema)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# converting dataframe to records
records = df.to_dict('records')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# writing to avro file format
with open('malldata.avro', 'wb') as out:
    writer(out, parsed_schema, records)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# reading it back into pandas dataframe
avro_records = []

#Read the Avro file
with open('/content/malldata.avro', 'rb') as fo:
    avro_reader = reader(fo)
    for record in avro_reader:
        avro_records.append(record)

#Convert to pd.DataFrame
df_avro = pd.DataFrame(avro_records)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
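&lt;p&gt;As a quick sanity check of the records round trip itself (independent of the Avro binary encoding, so fastavro is not needed here), a dataframe converted to records and rebuilt should match the original. A minimal sketch with a toy dataframe:&lt;/p&gt;

```python
import pandas as pd

# Toy dataframe (invented values) mirroring the schema's column types.
df = pd.DataFrame({
    'CustomerID': [1, 2],
    'Gender': ['Male', 'Female'],
    'Age': [19, 21],
    'Income': [15.0, 16.0],
    'SpendingScore': [39.0, 81.0],
})

records = df.to_dict('records')       # list with one dict per row
df_roundtrip = pd.DataFrame(records)  # rebuild a dataframe from the records

# Columns and values survive the round trip unchanged.
pd.testing.assert_frame_equal(df, df_roundtrip)
print(len(records))  # → 2
```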



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fenjybl37ecbxrvit8t5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fenjybl37ecbxrvit8t5m.png" alt="Image description" width="542" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's upload the Avro file to Google Cloud Storage and create a BigQuery table from it.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GCS bucket&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69nkhhv6bw6079ak729w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69nkhhv6bw6079ak729w.png" alt="Image description" width="800" height="30"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Python code snippet for the table data load
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

table_id = "&amp;lt;GCP Project&amp;gt;.avro.demo_avro_tbl"

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
uri = "gs://&amp;lt;bucket&amp;gt;/malldata.avro"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;In the Cloud Shell, run the Python file with python3 avro_file_load.py.
It prints "Loaded 200 rows." on successful completion.&lt;/li&gt;
&lt;li&gt;In the BigQuery console, we can view the table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn9xu8jlt66l4zlb0ly3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn9xu8jlt66l4zlb0ly3.png" alt="Image description" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbc8zs27dn5wmr8ggrum.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbc8zs27dn5wmr8ggrum.png" alt="Image description" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>writing</category>
    </item>
    <item>
      <title>Analytics Hub Data Exchange Platform</title>
      <dc:creator>Ruma Sinha</dc:creator>
      <pubDate>Sun, 22 Jan 2023 22:28:45 +0000</pubDate>
      <link>https://forem.com/rumsinha/analytics-hub-data-exchange-platform-3gc</link>
      <guid>https://forem.com/rumsinha/analytics-hub-data-exchange-platform-3gc</guid>
      <description>&lt;p&gt;Analytics Hub is a data exchange platform that enables sharing of data and insights.&lt;br&gt;
The core components of Analytics Hub are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Analytics Hub Publisher: Publishes data and shares it in real time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Listings: Enable sharing of data without replicating the shared data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analytics Hub Subscriber: Discovers the data one is looking for, combines it with existing datasets, and leverages BigQuery for various analytics. When one subscribes to a listing, a linked dataset is created in the project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analytics Hub Viewer: Browses the datasets that are accessible in Analytics Hub.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analytics Hub Administrator: Creates data exchanges that enable data sharing, and grants data publishers and subscribers permission to access them.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Architecture Flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kca0cli7nyx48lnklbc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kca0cli7nyx48lnklbc.png" alt="Image description" width="" height=""&gt;&lt;/a&gt;&lt;br&gt;
Shared datasets are the collections of tables and views defined by the data publisher.&lt;br&gt;
Data subscribers get a read-only linked dataset inside their project and VPC perimeter, which they can then combine with their own datasets.&lt;/p&gt;

&lt;p&gt;Steps for creating and subscribing to datasets in Analytics Hub:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Enable Analytics Hub API&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Create exchange by clicking on the Create Exchange link&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbn1ur9lwstyqvapy1ta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbn1ur9lwstyqvapy1ta.png" alt="Image description" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhsdwyfhb2ssvf7etl05.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhsdwyfhb2ssvf7etl05.png" alt="Image description" width="800" height="365"&gt;&lt;/a&gt;&lt;br&gt;
Complete the details by providing the region, display name, and so on, and set permissions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yyicnemoyxcvovv8aqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yyicnemoyxcvovv8aqw.png" alt="Image description" width="800" height="730"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Search for available listings and subscribe to the dataset&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh08gguwc3cuyte7n9mg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh08gguwc3cuyte7n9mg.png" alt="Image description" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Filtering for Trends, we find the Google Trends data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19gkji00b5phxv8tk5u3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19gkji00b5phxv8tk5u3.png" alt="Image description" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkdwv99qa9sryq4ipghh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkdwv99qa9sryq4ipghh.png" alt="Image description" width="800" height="467"&gt;&lt;/a&gt;&lt;br&gt;
Add the dataset to our project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmn1x5d4xdy4fk4zq1qq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmn1x5d4xdy4fk4zq1qq.png" alt="Image description" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can then run queries and build various analytics and dashboards.&lt;/p&gt;

&lt;p&gt;Use Cases: Data sharing in real time.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://cloud.google.com/bigquery/docs/analytics-hub-introduction" rel="noopener noreferrer"&gt;https://cloud.google.com/bigquery/docs/analytics-hub-introduction&lt;/a&gt;&lt;/p&gt;

</description>
      <category>career</category>
      <category>hiring</category>
      <category>community</category>
    </item>
    <item>
      <title>DataWarehouse and BigQuery</title>
      <dc:creator>Ruma Sinha</dc:creator>
      <pubDate>Mon, 16 Jan 2023 13:35:59 +0000</pubDate>
      <link>https://forem.com/rumsinha/datawarehouse-and-bigquery-5g9b</link>
      <guid>https://forem.com/rumsinha/datawarehouse-and-bigquery-5g9b</guid>
      <description>&lt;p&gt;A data warehouse is the single source of truth where data extracted from multiple sources is either loaded first and then transformed (ELT), or transformed per the business requirements and then loaded (ETL).&lt;br&gt;
The primary use cases of a data warehouse are data science and BI.&lt;br&gt;
BigQuery is GCP's serverless data warehouse. In BigQuery, storage and compute are separated and can scale independently.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mz9w3247i95h36eg8hw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mz9w3247i95h36eg8hw.png" alt="Image description" width="800" height="328"&gt;&lt;/a&gt;&lt;br&gt;
Designing and building a data warehouse starts with business requirements: what problems are we trying to solve, and how best can we solve them? For an organization moving on-prem data to the cloud, the initial goal might be to run the OLTP workload on Cloud SQL and then build data pipelines to bring the data into BigQuery for analytics and machine learning. Visualizations and dashboards can be built with Looker connected to BigQuery.&lt;br&gt;
End users are never given access to the raw tables in BigQuery. Say we have Order, Products, and Customers data in BigQuery. We may first build denormalized tables satisfying the various business requirements, then create views on top of them that expose a subset of the data, and grant access to various users so they can answer business questions by querying the views. Loading the tables in BigQuery and then creating denormalized tables, views, or materialized views falls under an Extract-Load-Transform pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jcg8y0herfwr4nwgqnd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jcg8y0herfwr4nwgqnd.png" alt="Image description" width="800" height="213"&gt;&lt;/a&gt;&lt;br&gt;
In BigQuery, tables are created under datasets. Datasets are containers that organize and control access to tables and views.&lt;br&gt;
A dataset is created in a specific GCP project and in a specified geographic location.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtd8fdr6sex0fwt0a4eu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtd8fdr6sex0fwt0a4eu.png" alt="Image description" width="624" height="526"&gt;&lt;/a&gt;&lt;br&gt;
An ELT pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ymu7ntqpqzirxqjil41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ymu7ntqpqzirxqjil41.png" alt="Image description" width="800" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating the dataset that will be the container for the table.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# with bq command
bq --location=US mk --description "Dataset to store diabetes table" demodataset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3yejjpxkvse4obcgo5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3yejjpxkvse4obcgo5t.png" alt="Image description" width="479" height="63"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve8rhurj0pi0005znpwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve8rhurj0pi0005znpwn.png" alt="Image description" width="687" height="264"&gt;&lt;/a&gt;&lt;br&gt;
Creating the BigQuery dataset with Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

dataset_id = "{}.demods".format(client.project)

# Construct a full Dataset object to send to the API.
dataset = bigquery.Dataset(dataset_id)

# Specify the geographic location where the dataset should reside.
dataset.location = "US"

dataset = client.create_dataset(dataset, timeout=30)  # Make an API request.
print("Created dataset {}.{}".format(client.project, dataset.dataset_id))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzu03rf8566czcrznfpna.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzu03rf8566czcrznfpna.png" alt="Image description" width="449" height="115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copying the extracted CSV file to the GCS bucket:
we have created a GCS bucket and uploaded diabetes.csv to the
bucket. GCS is the staging area.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhwwfhd3mmm99r8q1qaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhwwfhd3mmm99r8q1qaj.png" alt="Image description" width="800" height="56"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create the BigQuery table from this extracted data
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#The schema.json file is created to store the table schema
[
  {
    "mode": "NULLABLE",
    "name": "PatientID",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "Pregnancies",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "PlasmaGlucose",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "DiastolicBloodPressure",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "TricepsThickness",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "SerumInsulin",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "BMI",
    "type": "FLOAT"
  },
  {
    "mode": "NULLABLE",
    "name": "DiabetesPedigree",
    "type": "FLOAT"
  },
  {
    "mode": "NULLABLE",
    "name": "Age",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "Diabetic",
    "type": "INTEGER"
  }
]
#Empty table is created with this schema in the demodataset
bq mk --table --description "Table containing diabetes data" --label organization:dev demodataset.mytable tblschema.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cq8vxic74ybpqb3cx2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cq8vxic74ybpqb3cx2m.png" alt="Image description" width="519" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuurgamqhncpnksy39dgy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuurgamqhncpnksy39dgy.png" alt="Image description" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Loading the CSV file into this empty table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# skip_leading_rows=1 for ignoring the header
bq load --source_format=CSV --skip_leading_rows=1 demodataset.mytable gs://demo_bq_bucket9282/diabetes.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrontkjyytk2ax5rk8jh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrontkjyytk2ax5rk8jh.png" alt="Image description" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating the empty table and loading the CSV data with Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

table_id = "&amp;lt;&amp;lt;project&amp;gt;&amp;gt;.demods.demotbl"

schema = [
  {
    "mode": "NULLABLE",
    "name": "PatientID",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "Pregnancies",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "PlasmaGlucose",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "DiastolicBloodPressure",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "TricepsThickness",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "SerumInsulin",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "BMI",
    "type": "FLOAT"
  },
  {
    "mode": "NULLABLE",
    "name": "DiabetesPedigree",
    "type": "FLOAT"
  },
  {
    "mode": "NULLABLE",
    "name": "Age",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "Diabetic",
    "type": "INTEGER"
  }
]

table = bigquery.Table(table_id, schema=schema)
table = client.create_table(table)  # Make an API request.
print(
    "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)
# data load
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

table_id =  "&amp;lt;&amp;lt;project&amp;gt;&amp;gt;.demods.demotbl"

job_config = bigquery.LoadJobConfig(
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
)
uri = "gs://demo_bq_bucket9282/diabetes.csv"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)  # Make an API request.
print("Loaded {} rows.".format(destination_table.num_rows))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhirfhfed6epzv12j0m4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhirfhfed6epzv12j0m4c.png" alt="Image description" width="465" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpzsp52vzbjgorxjdsr7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpzsp52vzbjgorxjdsr7.png" alt="Image description" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As part of a data migration from on-prem to BigQuery, we build the ETL/ELT data pipeline. First the initial data load is completed, then incremental loads are scheduled daily or weekly based on business needs.&lt;br&gt;
BigQuery has three writeDisposition options: WRITE_APPEND, WRITE_EMPTY, and WRITE_TRUNCATE. writeDisposition determines how data gets written to the table.&lt;/p&gt;
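&lt;p&gt;A sketch of how the write disposition can be set on a load job configuration with the same google-cloud-bigquery client library used above; this only constructs the configuration, and the choice of WRITE_TRUNCATE here is illustrative:&lt;/p&gt;

```python
from google.cloud import bigquery

# Configuration-only sketch: the write disposition is passed on the
# LoadJobConfig used by load_table_from_uri.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    # WRITE_APPEND:   add rows to any existing table data.
    # WRITE_EMPTY:    fail if the table already contains data.
    # WRITE_TRUNCATE: replace the existing table data.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
```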

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>DevOps SRE Observability</title>
      <dc:creator>Ruma Sinha</dc:creator>
      <pubDate>Sun, 01 Jan 2023 14:11:01 +0000</pubDate>
      <link>https://forem.com/rumsinha/devops-sre-observability-2a5m</link>
      <guid>https://forem.com/rumsinha/devops-sre-observability-2a5m</guid>
      <description>&lt;h4&gt;
  
  
  DevOps and SRE
&lt;/h4&gt;

&lt;p&gt;DevOps is a set of practices that combines software development and operations. DevOps influences the application lifecycle throughout its plan, develop, deliver, and operate phases. Site Reliability Engineering (SRE) is a practical way to implement DevOps practices and principles.&lt;br&gt;
SRE implements DevOps practices via SLIs, SLOs, SLAs, and error budgets.&lt;/p&gt;

&lt;p&gt;A Service Level Indicator (SLI) is a quantitative measure of the level of service provided over a period. SLIs are metrics defined by the user journey for a service, for example availability, latency, and throughput.&lt;/p&gt;

&lt;p&gt;A Service Level Objective (SLO) is a numerical target that defines the reliability of a system. An SLO is measured using SLIs.&lt;/p&gt;

&lt;p&gt;A Service Level Agreement (SLA) is a commitment that the availability and reliability of the service will meet a certain level of expectation.&lt;/p&gt;

&lt;p&gt;The error budget tells us how unreliable our service is allowed to be: error budget = 100% - SLO.&lt;/p&gt;
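&lt;p&gt;A quick worked example of that arithmetic: with a 99.9% availability SLO over a 30-day window, the error budget is 0.1% of the window, which comes to about 43 minutes of allowed downtime:&lt;/p&gt;

```python
# Error budget for a 99.9% availability SLO over a 30-day window.
slo = 0.999
error_budget = 1.0 - slo           # 0.1% of the window may be "bad"
window_minutes = 30 * 24 * 60      # 43,200 minutes in 30 days
allowed_downtime = window_minutes * error_budget
print(round(allowed_downtime, 1))  # → 43.2 minutes
```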

&lt;h6&gt;
  
  
  The DevOps lifecycle:
&lt;/h6&gt;

&lt;p&gt;CI/CD is a key DevOps practice.&lt;br&gt;
Continuous Integration:&lt;br&gt;
A software development practice where all developers merge code changes into a central repository multiple times a day. Tools that help include Cloud Source Repositories, Cloud Build, and Artifact Registry.&lt;br&gt;
Continuous Delivery:&lt;br&gt;
The practice of automating the entire software release process.&lt;br&gt;
Tools that help include GKE, GKE On-Prem, and Cloud Run.&lt;/p&gt;

&lt;h6&gt;
  
  
  What is observability?
&lt;/h6&gt;

&lt;p&gt;Reliability is the most important feature of a service, and setting SLOs allows monitoring systems to capture how the service is performing.&lt;br&gt;
System reliability is tracked by SLOs, and SLOs require SLIs, i.e. specific metrics to monitor.&lt;br&gt;
Monitoring is the process of collecting, processing, aggregating, and displaying real-time quantitative data about a system.&lt;br&gt;
With monitoring, one can understand trends in application usage patterns, which helps with health checks of the system as well as diagnosing when things go wrong.&lt;br&gt;
Key areas of operations include gathering logs, metrics, and traces; dashboards for visualization; and triggering alerts and error reporting.&lt;br&gt;
Operations are covered by tools such as Cloud Monitoring, Cloud Logging, and Error Reporting, and application performance management by tools like Debugger, Profiler, and Trace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlxjwwzpzx9rroivush5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlxjwwzpzx9rroivush5.png" alt="Image description" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;References:&lt;br&gt;
Google Cloud DevOps certification preparation with A Cloud Guru.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Working with Map() function in Python, Pyspark and Apache Beam</title>
      <dc:creator>Ruma Sinha</dc:creator>
      <pubDate>Sun, 25 Dec 2022 14:15:08 +0000</pubDate>
      <link>https://forem.com/rumsinha/working-with-map-function-in-python-pyspark-and-apache-beam-2c1b</link>
      <guid>https://forem.com/rumsinha/working-with-map-function-in-python-pyspark-and-apache-beam-2c1b</guid>
      <description>&lt;h2&gt;
  
  
  Map() function
&lt;/h2&gt;

&lt;p&gt;map() is a built-in function in Python. It executes a specified function for each item in an iterable, such as a list, set, dictionary or tuple.&lt;br&gt;
The syntax of the map function with a list iterable is:&lt;br&gt;
&lt;em&gt;&lt;strong&gt;map(funcA, [item1, item2, item3, ..., itemN])&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
The function funcA() is applied to each list element, and the result is a new iterator. This process is known as mapping.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example with a Python code
&lt;/h4&gt;
&lt;h6&gt;
  
  
  For loop
&lt;/h6&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;list_of_states = ['new jersey','new york','texas','california','florida']

# for loop to convert each element into upper case
modified_list_of_states = []
for idx in range(len(list_of_states)):
  modified_list_of_states.append(str.upper(list_of_states[idx]))

modified_list_of_states
['NEW JERSEY', 'NEW YORK', 'TEXAS', 'CALIFORNIA', 'FLORIDA']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
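For comparison (an addition to the original example), the same uppercase transformation is often written as a list comprehension, which iterates over the elements directly instead of over indices:

```python
list_of_states = ['new jersey', 'new york', 'texas', 'california', 'florida']

# List comprehension: no index bookkeeping needed
modified_list_of_states = [state.upper() for state in list_of_states]

print(modified_list_of_states)
# ['NEW JERSEY', 'NEW YORK', 'TEXAS', 'CALIFORNIA', 'FLORIDA']
```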


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckqcgcgly5l9k6l34o6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckqcgcgly5l9k6l34o6k.png" alt="Image description" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h6&gt;
  
  
  Lambda and Map function
&lt;/h6&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;modified_list_of_states = map(lambda list_of_states: str.upper(list_of_states), list_of_states)

list(modified_list_of_states)

['NEW JERSEY', 'NEW YORK', 'TEXAS', 'CALIFORNIA', 'FLORIDA']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94hx82uph880yys5a8f3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94hx82uph880yys5a8f3.png" alt="Image description" width="800" height="102"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h6&gt;
  
  
  Map function with a defined function that takes as input each list item and returns the modified value
&lt;/h6&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def convert_case_f(val):
   return str.upper(val)

modified_list_of_states = map(convert_case_f, list_of_states)
modified_list_of_states # creates an iterator. To display the items we use the list()
&amp;lt;map at 0x7f08a15d6ee0&amp;gt;

list(modified_list_of_states)
['NEW JERSEY', 'NEW YORK', 'TEXAS', 'CALIFORNIA', 'FLORIDA']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98fnfyg7enf5v4pbxq3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98fnfyg7enf5v4pbxq3x.png" alt="Image description" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Map function in Pyspark
&lt;/h4&gt;

&lt;p&gt;We have the movie ratings dataset from Kaggle. A sample of the data is shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuejf8lcab96lt69bsp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuejf8lcab96lt69bsp0.png" alt="Image description" width="428" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The third column holds the movie rating, and our use case is to find the total count per rating.&lt;br&gt;
Using the map() function, we can extract the ratings column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F850v4pi98x4n68bgs0v5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F850v4pi98x4n68bgs0v5.png" alt="Image description" width="800" height="835"&gt;&lt;/a&gt;&lt;br&gt;
For every record, the lambda function splits the row on whitespace. The third column, the rating, is extracted from every data record, and this transformation is applied to every row in the dataset with the map() function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark import SparkConf, SparkContext
import collections

# Setting up the SparkContext object
conf = SparkConf().setMaster("local").setAppName("MovieRatingsData")
sc = SparkContext(conf = conf)

# Loading data
movie_ratings_data = sc.textFile("/content/u.data")

# Extracting the ratings data with the map() function
ratings = movie_ratings_data.map(lambda x: x.split()[2])

# Count the total per rating
ratings_count = ratings.countByValue()

# Display the result
collections.OrderedDict(sorted(ratings_count.items()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Festx5p6gio0uj9crxu5h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Festx5p6gio0uj9crxu5h.png" alt="Image description" width="800" height="630"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Map() function with Apache Beam
&lt;/h4&gt;

&lt;p&gt;With the same use case, let us see a working example with Apache Beam. First install the apache-beam library. Next create the pipeline p and a PCollection movie_data that stores the results of all the transformations.&lt;br&gt;
Pipeline p applies all the transformations in sequence. It first reads the data file with a read transform into a collection and then splits each row into columns. Next, we make a key-value pair for each rating, where the key is the rating and the value assigned is 1. These pairs are then combined per key and summed up, and the final results are written out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import apache_beam as beam

p = beam.Pipeline()

movie_data = ( 
                      p 
                      | 'Read from data file' &amp;gt;&amp;gt; beam.io.ReadFromText('/content/u.data')
                      | 'Split rows' &amp;gt;&amp;gt; beam.Map(lambda record: record.split('\t'))
                      | 'Fetch ratings data' &amp;gt;&amp;gt; beam.Map(lambda record: (record[2], 1))
                      | 'Total count per rating' &amp;gt;&amp;gt; beam.CombinePerKey(sum)
                      | 'Write results for rating' &amp;gt;&amp;gt; beam.io.WriteToText('results')
                   )

p.run() # creates the result output file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62bgxs2xv1f6bhkrhqnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62bgxs2xv1f6bhkrhqnc.png" alt="Image description" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  References
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://realpython.com/python-map-function/" rel="noopener noreferrer"&gt;https://realpython.com/python-map-function/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.macrometa.com/event-stream-processing/apache-beam-tutorial" rel="noopener noreferrer"&gt;https://www.macrometa.com/event-stream-processing/apache-beam-tutorial&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Taming Big Data with Apache Spark and Python Udemy&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>tooling</category>
    </item>
    <item>
      <title>Python zip() function</title>
      <dc:creator>Ruma Sinha</dc:creator>
      <pubDate>Fri, 16 Dec 2022 22:57:37 +0000</pubDate>
      <link>https://forem.com/rumsinha/python-zip-function-2hje</link>
      <guid>https://forem.com/rumsinha/python-zip-function-2hje</guid>
<description>&lt;p&gt;The Python zip() function takes iterables as arguments and returns an iterator that aggregates elements from each of the iterables.&lt;br&gt;
A simple working example.&lt;/p&gt;

&lt;h3&gt;
  
  
  list of columns
&lt;/h3&gt;

&lt;p&gt;columns = ["id","name","age","city","country"]&lt;/p&gt;

&lt;h3&gt;
  
  
  list of values
&lt;/h3&gt;

&lt;p&gt;values = [111,"Mary John",35,"New York","USA"]&lt;/p&gt;

&lt;h3&gt;
  
  
  create an iterator with the zip function that takes in columns, values as the input parameters
&lt;/h3&gt;

&lt;p&gt;zipped = zip(columns, values)&lt;/p&gt;

&lt;p&gt;type(zipped) returns the class zip&lt;/p&gt;

&lt;h3&gt;
  
  
  display the values
&lt;/h3&gt;

&lt;p&gt;list(zipped)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a2fXb9q6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vja148xml5lq6f8gp0l6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a2fXb9q6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vja148xml5lq6f8gp0l6.png" alt="Image description" width="521" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Empty zip declaration
&lt;/h3&gt;

&lt;p&gt;zipped = zip()&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Geg14rXB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rg7zkwiqfu8watb69w0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Geg14rXB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rg7zkwiqfu8watb69w0s.png" alt="Image description" width="306" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>zip</category>
      <category>iterator</category>
    </item>
  </channel>
</rss>
