<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Pratik Singh</title>
    <description>The latest articles on Forem by Pratik Singh (@kitarp29).</description>
    <link>https://forem.com/kitarp29</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F429722%2F42a2e2d8-e835-4781-9fd7-6c9da021a43f.jpg</url>
      <title>Forem: Pratik Singh</title>
      <link>https://forem.com/kitarp29</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kitarp29"/>
    <language>en</language>
    <item>
      <title>[Boost]</title>
      <dc:creator>Pratik Singh</dc:creator>
      <pubDate>Wed, 25 Dec 2024 17:23:22 +0000</pubDate>
      <link>https://forem.com/kitarp29/-1im0</link>
      <guid>https://forem.com/kitarp29/-1im0</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/kitarp29" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F429722%2F42a2e2d8-e835-4781-9fd7-6c9da021a43f.jpg" alt="kitarp29"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/kitarp29/say-goodbye-to-tedious-code-reviews-341" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Say Goodbye to tedious Code Reviews&lt;/h2&gt;
      &lt;h3&gt;Pratik Singh ・ Nov 30&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#programming&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#devops&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#productivity&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Say Goodbye to tedious Code Reviews</title>
      <dc:creator>Pratik Singh</dc:creator>
      <pubDate>Sat, 30 Nov 2024 15:36:29 +0000</pubDate>
      <link>https://forem.com/kitarp29/say-goodbye-to-tedious-code-reviews-341</link>
      <guid>https://forem.com/kitarp29/say-goodbye-to-tedious-code-reviews-341</guid>
      <description>&lt;p&gt;This article will cover how GenAI could be leveraged for Code reviews.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites ✅
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Basic understanding of Python&lt;/li&gt;
&lt;li&gt;Access to a GPU-based machine (*optional)&lt;/li&gt;
&lt;li&gt;Trauma from endless Code reviews (*helps to appreciate the idea)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🤔 Understanding the Problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Code reviews:&lt;/strong&gt; the necessary evil we all love to hate. Every developer has experienced the pain of endlessly checking code for inconsistencies, missed standards, or overlooked best practices.&lt;/p&gt;

&lt;p&gt;After working at over six tech companies, you start to see patterns. One common problem I've noticed is:&lt;br&gt;
"&lt;strong&gt;Code reviews are time-consuming&lt;/strong&gt;".&lt;/p&gt;

&lt;p&gt;We can all agree that even if code compiles, generates the desired output, and passes all the test cases, it still isn't enough to push to Production. Otherwise, CI/CD pipelines alone would suffice.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1852694648831332382-47" src="https://platform.twitter.com/embed/Tweet.html?id=1852694648831332382"&gt;
&lt;/iframe&gt;&lt;/p&gt;

&lt;p&gt;So we can agree that even after the various stages of a pipeline, there is an inherent need for human intervention: a Developer who comments the holier-than-thou words of ✨&lt;strong&gt;"LGTM ✅"&lt;/strong&gt;✨ on an MR, a person who'll be held even more accountable for the changes than the developer who made them!&lt;br&gt;
And as we all know, human intervention means chances of human error.&lt;/p&gt;

&lt;p&gt;What if AI could take care of it at &lt;strong&gt;some level&lt;/strong&gt;?&lt;br&gt;
AI can't replace developers, but it can surely assist them!&lt;/p&gt;


&lt;h2&gt;
  
  
  💡 Solution
&lt;/h2&gt;

&lt;p&gt;In my experience, there are two types of coding standards and rules that a Developer follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Rules written in books&lt;/li&gt;
&lt;li&gt;Rules your Seniors follow (and thus the team follows)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What if we document all of the rules the entire team decides to follow? (&lt;em&gt;takes what, 30 mins max?&lt;/em&gt;)&lt;br&gt;
Then, whenever new code is written, it is checked by AI against all of these rules.&lt;/p&gt;

&lt;p&gt;A few examples of such rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logging structure to be consistent across the repos.&lt;/li&gt;
&lt;li&gt;Fail-over approach for network calls&lt;/li&gt;
&lt;li&gt;Even naming of variables (camelCase or kebab-case)&lt;/li&gt;
&lt;li&gt;Error codes&lt;/li&gt;
&lt;li&gt;Cases of panic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvdronak7w9cvg2vjwix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvdronak7w9cvg2vjwix.png" alt="Meme" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I guess this is enough for you to get a sense of why these rules need to be written down (even with no AI involved).&lt;/p&gt;

&lt;p&gt;In my project, the AI sends something like this as a Code Review: &lt;a href="https://gitlab.com/kitarp29/ai-code-review/-/jobs/8428032411#L2495" rel="noopener noreferrer"&gt;See Logs here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fciku2ct4puj8nqv7wjb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fciku2ct4puj8nqv7wjb9.png" alt="Screenshot of what a code review by AI can look like" width="800" height="590"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, if you feel the need, let's dig into the implementation!&lt;/p&gt;


&lt;h2&gt;
  
  
  🛠️ Implementation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Talk is cheap, show me the code&lt;/strong&gt; : &lt;a href="https://gitlab.com/kitarp29/ai-code-review" rel="noopener noreferrer"&gt;Here you go!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the project link shared above, I tried to implement the idea as a stage of the CI/CD pipeline itself. There are other ways to implement the idea, which will be discussed later.&lt;/p&gt;

&lt;p&gt;Let's go step-by-step :&lt;/p&gt;
&lt;h3&gt;
  
  
  1. RULES.md 📖
&lt;/h3&gt;

&lt;p&gt;As mentioned before, you need to list all of the rules against which you want your code to be checked in one place. It could be a Markdown file or a .txt file. It doesn't matter, as long as it's easily accessible to everyone.&lt;/p&gt;
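&lt;p&gt;For illustration, a hypothetical RULES.md (the specific rules here are made-up examples, not the ones from my project) could look like this:&lt;/p&gt;

```markdown
# Team Coding Rules

1. All log lines use structured logging: logger.info("event_name", key=value).
2. Every network call has a retry with exponential backoff (max 3 attempts).
3. Variables are camelCase; file names are kebab-case.
4. Error codes follow the APP-NNNN format.
5. Library code never panics; errors are returned to the caller.
```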
&lt;h3&gt;
  
  
  2. Right Machine 🖥️
&lt;/h3&gt;

&lt;p&gt;(* &lt;em&gt;Optional step&lt;/em&gt;)&lt;br&gt;
Ensure you have a GPU-based machine, or any machine with enough RAM to run an LLM well! Since I am broke, you can see I'm stealing the GPU resources provided by &lt;a href="//www.gitlab.com"&gt;GitLab&lt;/a&gt; CI/CD Runners.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Choose a LLM provider 🔌
&lt;/h3&gt;

&lt;p&gt;There are multiple ways to interact with an LLM in 2024!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;API Method&lt;/strong&gt;&lt;br&gt;
Google provides APIs like &lt;a href="https://aistudio.google.com/prompts/new_chat" rel="noopener noreferrer"&gt;AI Studio&lt;/a&gt; and &lt;a href="https://cloud.google.com/vertex-ai?hl=en" rel="noopener noreferrer"&gt;Vertex AI&lt;/a&gt;, OpenAI has its API, and Microsoft has a few offerings.&lt;br&gt;
If you decide to use an externally hosted LLM, then the right machine is not necessary! At the end of the day, it's an API call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;br&gt;
As a DevOps Engineer, I'm trained to think of cloud-agnostic solutions. So if you have a machine (or VM) at your disposal, you can look into &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;. It has been one of my favourite dev tools for the past few months. It can run an LLM on your machine and expose an API endpoint for interacting with it as well!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
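&lt;p&gt;As a rough sketch of the Ollama route (assuming Ollama's default local port 11434 and its &lt;em&gt;/api/generate&lt;/em&gt; endpoint; the model name is just an example), a standard-library-only client could look like:&lt;/p&gt;

```python
import json
import urllib.request

# Ollama's default local endpoint (assumption: a default install on port 11434)
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Package a prompt as a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def ask_ollama(model: str, prompt: str) -> str:
    """Send the prompt to the local Ollama server and return the response text."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

&lt;p&gt;With an Ollama server running locally, &lt;em&gt;ask_ollama("llama3", prompt)&lt;/em&gt; then returns the model's review text.&lt;/p&gt;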
&lt;h3&gt;
  
  
  4. Choose the Right LLM 🧐
&lt;/h3&gt;

&lt;p&gt;The response from every LLM can be different. Other factors that should be considered are: response time, context length, size of the model, and more. If you are looking towards fine-tuning LLMs, be my guest and go crazy!&lt;br&gt;
Mainly, you've got to try things out.&lt;/p&gt;

&lt;p&gt;For the Ollama approach, you can check the available LLMs: &lt;a href="https://ollama.com/library" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  5. Perfect Prompt 🪄
&lt;/h3&gt;

&lt;p&gt;The pain of working with an LLM in any capacity is writing the perfect prompt. It's hit-and-miss, to be honest. But more importantly, you need to prioritize the bare minimum you need from the response.&lt;/p&gt;

&lt;p&gt;In my experience, make sure you escape the string you pass to any LLM. Don't forget to play with the temperature for clear answers.&lt;/p&gt;
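&lt;p&gt;As a hedged example of such escaping (the template and the &lt;em&gt;build_review_prompt&lt;/em&gt; helper are my own illustration, not from the repo):&lt;/p&gt;

```python
def build_review_prompt(rules: str, diff: str) -> str:
    """Wrap the team rules and a code diff into one review prompt.

    Backslashes and triple quotes in the diff are escaped so the code
    cannot break out of the quoted block in the template.
    """
    safe_diff = diff.replace("\\", "\\\\").replace('"""', '\\"\\"\\"')
    return (
        "You are a strict code reviewer. Check the DIFF only against the RULES.\n"
        "For each violation, quote the offending line and name the rule it breaks.\n"
        "If there are no violations, reply with exactly: LGTM\n\n"
        f"RULES:\n{rules}\n\n"
        f'DIFF:\n"""\n{safe_diff}\n"""'
    )
```

&lt;p&gt;For more deterministic answers, you can also lower the temperature (for example, &lt;em&gt;"options": {"temperature": 0}&lt;/em&gt; in Ollama's API).&lt;/p&gt;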
&lt;h3&gt;
  
  
  6. Script 🐍
&lt;/h3&gt;

&lt;p&gt;I tend to use Python for scripting anything on the application level. You can use any other language as well.&lt;br&gt;
We write a script to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the rules from RULES.md file&lt;/li&gt;
&lt;li&gt;Check the changes made to the files of your interest&lt;/li&gt;
&lt;li&gt;Send a prompt to the LLM over an API call (Ollama or cloud)&lt;/li&gt;
&lt;li&gt;Print the response in case of success&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To quench your curiosity, kindly see this &lt;a href="https://gitlab.com/kitarp29/ai-code-review/-/blob/main/buaji-container.py?ref_type=heads" rel="noopener noreferrer"&gt;Python File&lt;/a&gt;&lt;/p&gt;
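&lt;p&gt;A minimal sketch of such a script (the function names and the &lt;em&gt;ask&lt;/em&gt; callable are illustrative, not the ones from my repo) could look like this:&lt;/p&gt;

```python
import pathlib
import subprocess

def read_rules(path: str = "RULES.md") -> str:
    """Step 1: load the team's written rules from their shared file."""
    return pathlib.Path(path).read_text(encoding="utf-8")

def changed_code(base: str = "origin/main") -> str:
    """Step 2: collect the diff of the files you care about (here, Python files)."""
    result = subprocess.run(
        ["git", "diff", base, "--", "*.py"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def review(ask, rules: str, diff: str) -> str:
    """Steps 3-4: send one prompt through the `ask` callable, return the review."""
    prompt = f"Review this DIFF against the RULES.\nRULES:\n{rules}\nDIFF:\n{diff}"
    return ask(prompt)
```

&lt;p&gt;Wire &lt;em&gt;ask&lt;/em&gt; to whichever LLM client you chose (Ollama or a cloud API) and print the result of &lt;em&gt;review(ask, read_rules(), changed_code())&lt;/em&gt; in your pipeline stage.&lt;/p&gt;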


&lt;h2&gt;
  
  
  Scope of the idea 🔍
&lt;/h2&gt;

&lt;p&gt;I have kept it as a stage in the CI/CD Pipeline to ensure GPU-based machines are easily utilized. For better optimization, you can change the point at which the workflow is triggered. For example, you can run it only when an MR targets the master branch instead of on each commit.&lt;br&gt;
It could be a CRON Job as well.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;AI Code review on your local machine&lt;/strong&gt;&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;It would be great if this idea were executed on local machines rather than in CI/CD pipelines.&lt;br&gt;
Nowadays almost all new laptops are capable enough to run an LLM on Ollama. Taking the advice from Arnav, why not utilize the power sitting literally at your fingertips?&lt;br&gt;
&lt;iframe class="tweet-embed" id="tweet-1857546282794959046-353" src="https://platform.twitter.com/embed/Tweet.html?id=1857546282794959046"&gt;
&lt;/iframe&gt;&lt;/p&gt;

&lt;p&gt;You can run Ollama on your machine. Serve the model on a port. &lt;br&gt;
Whenever you build your project locally, trigger the Python script to utilize the port and have a review once the code compiles!&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations 😞
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;NO&lt;/strong&gt;, it can't abolish Human Code reviews. This will just assist Developers in doing code reviews faster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The right LLM, Prompt, and set of rules will be refined only after multiple iterations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Everything we work on as engineers is a &lt;strong&gt;Tradeoff&lt;/strong&gt;! Resources and time usually pull in opposite directions, but both indirectly translate to money.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reliable answers might not always be achieved. Hallucination is a topic beyond the scope of this article.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IF&lt;/strong&gt; you use the API approach of a third party, please do understand you will be sending your proprietary code to that service! (Ollama FTW!)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  My Take on the Issue 😸
&lt;/h2&gt;

&lt;p&gt;Code reviews are the most important job a Developer does! AI or no AI, that will remain the same. With this idea, you can assist not only the reviewer but also the Developer making the changes. Developers can see for themselves the scope for improvement in their code. &lt;br&gt;
Something like this will ensure the code repo has common rules followed across the team, which makes it easier for a new dev to get on board.&lt;/p&gt;




&lt;p&gt;If you liked this content you can follow me here or on Twitter at &lt;a href="//www.x.com/kitarp29"&gt;kitarp29&lt;/a&gt; for more!&lt;/p&gt;

&lt;p&gt;Thanks for reading my article :)&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>My journey to GitLab</title>
      <dc:creator>Pratik Singh</dc:creator>
      <pubDate>Mon, 14 Oct 2024 12:27:00 +0000</pubDate>
      <link>https://forem.com/kitarp29/my-journey-to-gitlab-71n</link>
      <guid>https://forem.com/kitarp29/my-journey-to-gitlab-71n</guid>
      <description>&lt;p&gt;In this article, I'll share my interview process at &lt;a href="//gitlab.com"&gt;GitLab&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitLab Inc. is an open-core company that operates GitLab, a DevOps software package for developing, securing, and operating software. GitLab includes distributed version control based on Git, with features such as access control, bug tracking, software feature requests, task management, and wikis for every project, as well as snippets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro 😃
&lt;/h2&gt;

&lt;p&gt;I am &lt;a href="https://www.linkedin.com/in/kitarp29/" rel="noopener noreferrer"&gt;Pratik Singh&lt;/a&gt;, a DevOps and Go developer. &lt;br&gt;
Here is my &lt;a href="//www.x.com/kitarp29"&gt;Twitter&lt;/a&gt; and &lt;a href="//www.github.com/kitarp29"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vrwbhn0q4qr9vhiqla5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vrwbhn0q4qr9vhiqla5.jpg" alt="kitarp29" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I go by the name kitarp29 online.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  My story ✒️
&lt;/h2&gt;

&lt;p&gt;Being in DevOps, &lt;strong&gt;GitHub&lt;/strong&gt;, and &lt;strong&gt;GitLab&lt;/strong&gt; both have been on my bucket list of companies. I was lucky enough to work with GitHub as an extern back in 2022. I was very close to getting into GitLab back in 2022. Why I didn't get into GitLab at that time is a whole other story. But you can understand the obsession grew stronger.&lt;/p&gt;

&lt;p&gt;I was happy at my job at Nasdaq as a Senior Software Developer, but I started to look for a completely remote job because of my family circumstances. GitLab was one of the companies I applied to.&lt;/p&gt;

&lt;p&gt;The whole interview process was spread across 2 months. &lt;/p&gt;




&lt;h2&gt;
  
  
  Interview Process 👨🏻‍💻
&lt;/h2&gt;

&lt;p&gt;Before I get started, there is already a resource from GitLab for this: &lt;a href="https://handbook.gitlab.com/handbook/hiring/interviewing/infrastructure-interview/" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I think this answers most of the questions one might have. I will try to just add my personal experience to it.&lt;/p&gt;

&lt;p&gt;Let's dig deeper!&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Application 📝
&lt;/h3&gt;

&lt;p&gt;I am not sure if I found the job application on Linkedin Jobs or from their career page. I follow the &lt;a href="https://medium.com/@anferneeck/the-red-car-theory-steering-your-way-to-unseen-opportunities-a45dd1f10cf3#:~:text=What%20is%20the%20Red%20Car,now%20attuned%20to%20noticing%20them." rel="noopener noreferrer"&gt;red car theory&lt;/a&gt; so it's harder to keep track. The form takes barely 5 minutes to fill. But I tried to apply for only relevant jobs.&lt;br&gt;
GitLab career page: &lt;a href="https://about.gitlab.com/jobs/all-jobs/" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Screening round 🧐
&lt;/h3&gt;

&lt;p&gt;This was a 30-minute call with the Recruiter, where we exchanged basic information about my profile and the job role. I was briefed about the interview process, the compensation, and my role in detail. I don't think it's an elimination round unless you have diametrically opposite opinions.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Technical Round 🤓
&lt;/h3&gt;

&lt;p&gt;This was one of the longest and, in my opinion, the second-hardest round of the entire process. It was a 90-minute call. After the initial chitchat, I was briefed about the project I'd be working on and told about 2 different tasks. For each task, I was given ~45 mins to go through the docs and give a possible solution on the spot. The interviewers were there to answer my questions and nudge me in the right direction. I was not supposed to code within the call but to explain my approach. &lt;br&gt;
After the call, I was assigned both issues and asked to complete them within 3 days. The interviewers were there to help me with comments on the tasks.&lt;br&gt;
Luckily, I was able to complete one of the tasks flawlessly. For the other task, I had made code changes that seemed correct, but unfortunately they didn't compile on my system. I worked on it enough that the interviewers were satisfied with my work.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Manager Round 👩🏻‍💼
&lt;/h3&gt;

&lt;p&gt;My manager is the sweetest interviewer I have ever dealt with. It was a 60-minute call. She made me feel comfortable at the start. After the basic intro, we went deep into my work: she asked me about my job and my internships. Other than the tech questions, we also discussed behavioural questions. &lt;br&gt;
After her round of questions, it was my turn to ask. I asked about my day-to-day responsibilities, my team, and opportunities. I was confident that I had cleared this round.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Tech/Behavioural round 🫡
&lt;/h3&gt;

&lt;p&gt;This was a 60-minute call with two of my other teammates. It mixed tech and behavioural questions, and was more about how I work with my team. One mistake I kept making was phrasing all my answers as &lt;strong&gt;"We"&lt;/strong&gt;: what we did as a team, or how we handled situations as a team. They were interested in learning more about me, though. I don't think there is much preparation needed for it; be honest and communicate clearly. &lt;br&gt;
Read up on the &lt;a href="https://www.vawizard.org/wiz-pdf/STAR_Method_Interviews.pdf" rel="noopener noreferrer"&gt;STAR approach&lt;/a&gt; to answer such questions.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Leadership round 🙄
&lt;/h3&gt;

&lt;p&gt;This was the hardest round, in my opinion. It was a 60-minute call. The interviewer started with technical chitchat about my work. He went through my resume and was curious about my projects. After a couple of behavioural and Agile-related questions, the hardest part started!&lt;br&gt;
Yes, it was the system design part. I won't discuss the questions themselves. But man, they were such open-ended questions! I guess it was the first call where I was silent for more than 5 minutes thinking of a solution. I was terrified that I would never clear this round!&lt;br&gt;
Under the circumstances of my life, I was not able to prepare for it well. But I strongly urge you to prepare for it well.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;My sister had just given birth to my nephew a few days before this call.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  7. HR Round 🖊️
&lt;/h3&gt;

&lt;p&gt;By this point, I was assured my team wanted to recruit me. Now it was all discussions about compensation and references.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The night I got this email. I recall I was with my Mom and sister. We all were SO happy! 🥹&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  8. References 🫱🏻‍🫲🏼
&lt;/h3&gt;

&lt;p&gt;One has to provide three references, at least one of whom must be your manager at one of your past companies. This took a week.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hired! 😇
&lt;/h2&gt;




&lt;h2&gt;
  
  
  My suggestions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keep applying&lt;/strong&gt;. I have been applying since 2022. I once got very close to an offer, but after 2 years I finally got in!&lt;/li&gt;
&lt;li&gt;Be thorough with your resume.&lt;/li&gt;
&lt;li&gt;Practice STAR pattern-based questions.&lt;/li&gt;
&lt;li&gt;It's important to be good at System Design questions.&lt;/li&gt;
&lt;li&gt;Each round is almost 1 to 2 weeks apart, so it's important to lower your anxiety. My suggestion: go on trips in the meantime.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7z9dhz6abvk3dl4628o.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7z9dhz6abvk3dl4628o.jpeg" alt="Seattle stories" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I traveled to Seattle during this duration&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjduzwnyipw1zzrqjdmg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjduzwnyipw1zzrqjdmg.jpeg" alt="Way too much fun at Wayyanad" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;My friends and I took a road trip to Wayanad&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;If you liked this content you can follow me here or on Twitter at &lt;a href="//www.x.com/kitarp29"&gt;kitarp29&lt;/a&gt; for more!&lt;/p&gt;

&lt;p&gt;Thanks for reading my article :)&lt;/p&gt;

</description>
      <category>codenewbie</category>
      <category>beginners</category>
      <category>devops</category>
      <category>career</category>
    </item>
    <item>
      <title>My NASDAQ Experience</title>
      <dc:creator>Pratik Singh</dc:creator>
      <pubDate>Thu, 03 Oct 2024 11:00:00 +0000</pubDate>
      <link>https://forem.com/kitarp29/my-nasdaq-experience-af</link>
      <guid>https://forem.com/kitarp29/my-nasdaq-experience-af</guid>
      <description>&lt;p&gt;This article is about my experience working at Nasdaq. It might be a little lengthy :)&lt;/p&gt;

&lt;p&gt;I am &lt;a href="https://www.linkedin.com/in/kitarp29/" rel="noopener noreferrer"&gt;&lt;strong&gt;Pratik Singh&lt;/strong&gt;&lt;/a&gt;, and I used to work at &lt;a href="//nasdaq.com"&gt;&lt;strong&gt;Nasdaq&lt;/strong&gt;&lt;/a&gt; as a &lt;strong&gt;Senior Software Developer&lt;/strong&gt; in the WebProperties Team. &lt;br&gt;
So let's get started!&lt;/p&gt;


&lt;h2&gt;
  
  
  How I got selected 🙇🏻 ?
&lt;/h2&gt;

&lt;p&gt;Nasdaq approached me with this job role on LinkedIn. There were multiple tech, managerial, and HR rounds. You can find more details in this article: &lt;a href="https://dev.to/kitarp29/my-journey-to-nasdaq-2b8o"&gt;Here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1675851481163988993-993" src="https://platform.twitter.com/embed/Tweet.html?id=1675851481163988993"&gt;
&lt;/iframe&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;It's been almost 2 years. I still wonder why they chose me&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Work experience 👨🏻‍💻
&lt;/h2&gt;

&lt;p&gt;We were the &lt;strong&gt;Platforms team&lt;/strong&gt; behind the &lt;a href="//nasdaq.com"&gt;nasdaq.com&lt;/a&gt; website. My job role was basically a mix of two things. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building &lt;strong&gt;Go&lt;/strong&gt; &amp;amp; &lt;strong&gt;Python&lt;/strong&gt; microservices&lt;/li&gt;
&lt;li&gt;Building and maintaining the &lt;strong&gt;CI/CD&lt;/strong&gt; pipelines &amp;amp; the different environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also included occasional on-call duty for Production issues.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's dig deeper...&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I started working at Nasdaq as a &lt;strong&gt;Student Worker&lt;/strong&gt; (paid internship). Initially, I was assigned to work on some bugs in the Go microservices. The task forced me to look beyond our code: for the first time, I started to read a dependency's code to understand its functions better. This work helped us save memory, and I got awarded for it :)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpriz0317uuonxylggpn6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpriz0317uuonxylggpn6.jpg" alt="Award from Nasdaq" width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;u&gt;Interesting Learning&lt;/u&gt; : &lt;a href="https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/" rel="noopener noreferrer"&gt;What is pprof ?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Meanwhile, I started to learn about the different CI/CD pipelines built on GitLab. It helped me to learn git in depth. I was added to the release team. I was managing multiple clusters and various deployments spread across different namespaces.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;u&gt;Interesting Learning&lt;/u&gt;: If you have to manage multiple Kubernetes clusters, instead of using the &lt;em&gt;--kubeconfig&lt;/em&gt; flag to pass a different context each time, set up Linux aliases for each of them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The responsibilities helped me learn deployment strategies other than &lt;strong&gt;Kubernetes&lt;/strong&gt;. I learned about &lt;strong&gt;CMS&lt;/strong&gt; and &lt;strong&gt;IIS&lt;/strong&gt; servers. We were developing &lt;strong&gt;Helm&lt;/strong&gt; charts for the new services while maintaining and upgrading the older deployments. The scale we worked at was amazing! &lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1735640347823452315-630" src="https://platform.twitter.com/embed/Tweet.html?id=1735640347823452315"&gt;
&lt;/iframe&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;On a normal day it crosses &lt;strong&gt;1 Billion&lt;/strong&gt; requests across our systems!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Moving on, we had to ship a certain set of new features within a deadline. We had war-room calls. I learned how to build under pressure, and how Seniors ping-pong ideas off each other to work better.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;u&gt;Interesting Learning&lt;/u&gt;: As a CI/CD person, you must ensure developers can deploy their changes on lower environments without your help.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Newer challenges awaited me on this path. For a certain problem statement, we needed a Machine Learning model. Even though I had no prior experience, my manager asked me to learn and implement it. I recall him saying to me: &lt;br&gt;
&lt;em&gt;"Understanding Machine Learning Models will not be optional in the next 5 years"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It was the first project I was heading. It had to be perfect! I did my research and found solutions that fit our requirements. With every iteration, my manager pushed me to build better solutions. I got way better at &lt;strong&gt;Python&lt;/strong&gt; and at using machine learning models. To reduce response time, I learned various algorithms like &lt;strong&gt;Cosine Similarity&lt;/strong&gt;, Neural Networks, &lt;strong&gt;KNN&lt;/strong&gt;, &lt;strong&gt;ANN&lt;/strong&gt;, and much more. I built my own &lt;strong&gt;Vector search&lt;/strong&gt;! I built the pipelines and deployments for this, and had the fun of containerizing the ML project within the size constraints.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;u&gt;Interesting Learning&lt;/u&gt;: My manager told me once: "&lt;strong&gt;Resilience beats every other thing in Production&lt;/strong&gt;"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I was fixing CVEs across the different parts of our systems, and learned a little PHP in the process. This may not have been the most interesting task, but it was definitely the one that taught me to write better code. It was one of the steps towards &lt;strong&gt;Shift Left&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Interesting Learning: &lt;a href="https://learn.snyk.io/lesson/cve/" rel="noopener noreferrer"&gt;What the heck is a CVE?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Around this time, the AI wave started to reach the shores at Nasdaq. Discussions on AI projects were going on all around. My team got involved in making amazing projects that use multiple ML models: setting up CI/CD pipelines, data pipelines, optimized code, better ranking systems, and reliable caches. I was not the developer directly working on these systems; I was supporting the project purely as a DevOps engineer.&lt;/p&gt;

&lt;p&gt;Lastly, I learned how to improve the performance of our web pages: how to track it, and why to improve it. I tried to understand how it impacts the business. The insights helped Developers improve their code. My manager taught me:&lt;br&gt;
&lt;strong&gt;"Always think about the end user. You will never be wrong"&lt;/strong&gt;&lt;br&gt;
&lt;iframe class="tweet-embed" id="tweet-1753760244822589603-294" src="https://platform.twitter.com/embed/Tweet.html?id=1753760244822589603"&gt;
&lt;/iframe&gt;

&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Interesting Learning: &lt;a href="https://web.dev/articles/vitals" rel="noopener noreferrer"&gt;What are Web Vitals?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I am sure you didn't read it through 😂&lt;br&gt;
It's fine! I am just blogging my work :)&lt;/p&gt;


&lt;h2&gt;
  
  
  Team and Culture 💪🏻
&lt;/h2&gt;

&lt;p&gt;It was one of the best teams I have worked with. We had people spread across continents in a remote setup, though most of my team members were based out of Bangalore. We had a fully remote setup till January 2024.&lt;br&gt;
Earlier, we came to the office only for events, war rooms, meetings, and the like. Since this year, we have had a hybrid setup with ~2 days in the office. The team supported me when I was facing the hardest phase of my personal life; they asked me to take leave for weeks to take care of my family and myself.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1753409152100311495-167" src="https://platform.twitter.com/embed/Tweet.html?id=1753409152100311495"&gt;
&lt;/iframe&gt;

&lt;/p&gt;

&lt;p&gt;Lots of teams work only to get features shipped. I loved that once a week my team connected just to discuss tech, share ideas, and trade insights about business impact.&lt;/p&gt;

&lt;p&gt;I would like to take this moment to appreciate my manager. I believe the culture of a team is set by its manager. He is a person with a technical background who not only understands our work but also gives us the insight to do it better. I think I have learned the most from him, whether on the technical or the business front. A supportive manager who believes in you was my blessing at Nasdaq!&lt;br&gt;
I am incredibly grateful to my seniors as well. They helped me, taught me, and even tolerated my stupid queries.&lt;/p&gt;


&lt;h2&gt;
  
  
  Job and beyond ✨
&lt;/h2&gt;

&lt;p&gt;As a developer, your job is not limited to shipping features. When I joined Nasdaq, I had planned to achieve things that fulfil my personal goals as well.&lt;/p&gt;

&lt;p&gt;Within a couple of months, I became part of the Developer Community. We started to do workshops and events with other developer communities in Bangalore.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1692913560144482461-530" src="https://platform.twitter.com/embed/Tweet.html?id=1692913560144482461"&gt;
&lt;/iframe&gt;

&lt;/p&gt;

&lt;p&gt;Our team collectively ran nearly 15 developer-focused events within a year. I was part of the core team that accepted communities and organized the entire event. Apart from hosting communities in our office, we also sponsored a couple of events. My favorite was when my team was part of &lt;strong&gt;Google DevFest Bangalore 2023&lt;/strong&gt;.&lt;br&gt;
&lt;iframe class="tweet-embed" id="tweet-1736360724203794620-323" src="https://platform.twitter.com/embed/Tweet.html?id=1736360724203794620"&gt;
&lt;/iframe&gt;

 &lt;/p&gt;

&lt;p&gt;My manager encouraged me to speak at developer events. With a hint of luck, over the last year I was able to speak at a few of them.&lt;br&gt;
One of them was at the Google office, as a speaker for the Google Cloud Community Bangalore.&lt;br&gt;
&lt;iframe class="tweet-embed" id="tweet-1756312469189087718-495" src="https://platform.twitter.com/embed/Tweet.html?id=1756312469189087718"&gt;
&lt;/iframe&gt;

&lt;/p&gt;

&lt;p&gt;With my mom's blessings, I was selected as a speaker at &lt;strong&gt;GitOpsCon 2024, North America&lt;/strong&gt;. With the support of the entire team at Nasdaq, I was able to travel halfway across the world. I am grateful for the support of my manager, the Nasdaq India head, and my team. A lovely thing I will always remember is that Smitha (one of the heads at Nasdaq India) wished me luck at 3 AM before my talk.&lt;br&gt;
&lt;iframe class="tweet-embed" id="tweet-1780156584943067265-287" src="https://platform.twitter.com/embed/Tweet.html?id=1780156584943067265"&gt;
&lt;/iframe&gt;

&lt;/p&gt;

&lt;p&gt;Other than merging MRs, fixing production, and building pipelines, these were my best memories at Nasdaq.&lt;br&gt;
All of my work and determination were recognized at the 10th-anniversary event of Nasdaq India.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1788916768456487252-486" src="https://platform.twitter.com/embed/Tweet.html?id=1788916768456487252"&gt;
&lt;/iframe&gt;

&lt;/p&gt;




&lt;h2&gt;
  
  
  Fun 🥳
&lt;/h2&gt;

&lt;p&gt;The office building is equipped with facilities like a gym, table tennis, 8-ball pool, napping areas, carrom, and an amazing terrace!&lt;br&gt;
The parties at Nasdaq are wild! The very first time I met my team was at the year-end party, before I had even joined the company. They had booked the entire RCB Cafe in Bangalore! We danced till late at night.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkcvw0k9xmkpvib0hs97.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkcvw0k9xmkpvib0hs97.JPG" alt="Party at RCB Cafe" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Beyond the big events, we also had great fun at the all-hands gatherings organized from time to time. The best was the 10th-anniversary event of Nasdaq India: a week-long series of parties. On the day of the main event, we partied in our office building till 3 in the morning! One of the best parties of my life.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxaiqwcjebcb4ry99a0p8.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxaiqwcjebcb4ry99a0p8.JPG" alt="My team with our CEO" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;My team with our CEO&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All that being said, I had a wonderful time working at Nasdaq. To anyone reading this: Nasdaq is a great place to work.&lt;br&gt;
The circumstances of my personal life and goals were the reasons I had to part ways with the company. But I will always keep rooting for the team from the sidelines!&lt;/p&gt;




&lt;p&gt;If you liked this content you can follow me here or on Twitter at &lt;a href="//www.x.com/kitarp29"&gt;kitarp29&lt;/a&gt; for more!&lt;/p&gt;

&lt;p&gt;Thanks for reading my article :)&lt;/p&gt;

</description>
      <category>devops</category>
      <category>go</category>
      <category>beginners</category>
      <category>codenewbie</category>
    </item>
    <item>
      <title>Deploy your first Java Application on K8s</title>
      <dc:creator>Pratik Singh</dc:creator>
      <pubDate>Sun, 07 Apr 2024 05:19:44 +0000</pubDate>
      <link>https://forem.com/kitarp29/deploy-your-first-java-application-on-k8s-14ke</link>
      <guid>https://forem.com/kitarp29/deploy-your-first-java-application-on-k8s-14ke</guid>
      <description>&lt;p&gt;This article will help you deploy a Java Application on Kubernetes.&lt;/p&gt;

&lt;p&gt;Prerequisites :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Java application that compiles&lt;/li&gt;
&lt;li&gt;Basics of Docker&lt;/li&gt;
&lt;li&gt;Basics of Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In 2023, it's hard to recall a time without Java; by now, it is even harder to imagine one without Kubernetes. So today we will look at how to combine the most trusted runtime with the most optimized deployment solution.&lt;/p&gt;

&lt;h1&gt;
  
  
  Overview
&lt;/h1&gt;

&lt;p&gt;We are planning to utilize Kubernetes to deploy a Java application. To do this, we will containerize your application using Docker and create an image that can be pushed to DockerHub or any other image repository. Following that, we will create a Kubernetes cluster and deploy this Docker image to it.&lt;/p&gt;




&lt;p&gt;Let's get started!&lt;/p&gt;

&lt;h1&gt;
  
  
  Steps to follow:
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. &lt;strong&gt;Build your Application&lt;/strong&gt;:
&lt;/h2&gt;

&lt;p&gt;First of all, ensure the Java Application you built compiles and builds on your system.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. &lt;strong&gt;Creating a DockerFile&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the &lt;em&gt;most&lt;/em&gt; crucial part of the process. As a developer, you create the &lt;em&gt;Dockerfile&lt;/em&gt;. This file is like a recipe for anyone in the world who wants to run (cook) your code.&lt;/p&gt;

&lt;p&gt;Make a file in your project directory called Dockerfile. Please make sure this file has no extension; it is called exactly &lt;strong&gt;Dockerfile&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This can be used as a base Dockerfile to build on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM &amp;lt;choose your OS&amp;gt;

# Copy local code to the container image.
WORKDIR /app
COPY . .

# Build a release artifact.
RUN mvn package -DskipTests

# Run the web service on container startup.
CMD ["java", "-jar", "/app/target/my-app-1.0-SNAPSHOT.jar"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can refer to the official docs: &lt;a href="https://docs.docker.com/language/java/build-images/#create-a-dockerfile-for-java" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Docker has simplified the process further with the &lt;strong&gt;docker init&lt;/strong&gt; command, which automates creating a Dockerfile while following best practices.&lt;br&gt;
You can refer to the official docs &lt;a href="https://docs.docker.com/engine/reference/commandline/init/" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  3. &lt;strong&gt;Building the Docker Image&lt;/strong&gt;:
&lt;/h2&gt;

&lt;p&gt;Now you create a Docker image out of the Dockerfile you just wrote. If the Dockerfile was the recipe for your dish, the image is a sample dish made by following it.&lt;/p&gt;

&lt;p&gt;Run this command in the same directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t my-java-app . 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Replace my-java-app with a name you wish&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your Docker image is now built and stored locally. Verify this by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; docker images
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see my-java-app in the output.&lt;br&gt;
If you see errors at this step, ensure the base image is available on your system; in most cases, though, it is the Dockerfile that has the error.&lt;/p&gt;
&lt;h2&gt;
  
  
  4. &lt;strong&gt;Pushing the Image&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Just like your code is saved in code repositories like GitHub or GitLab, Docker images have something similar called an &lt;em&gt;image repository&lt;/em&gt;. Connect to the image repository of your choice.&lt;br&gt;
For personal use, most devs prefer DockerHub. The official docs to connect to DockerHub: &lt;a href="https://docs.docker.com/docker-hub/quickstart/#:~:text=Run%20docker%20run%20%3Cyour_username%3E%2Fmy-private-repo%20to%20test%20your%20Docker,Docker%20Hub.%20You%20should%20see%20output%20similar%20to%3A" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Create a Tag for the Image you created in the last step using this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker tag my-java-app:latest yourusername/my-java-app:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Push this Image to the image repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker push yourusername/my-java-app:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now anyone in the world can pull and run your application. Next, we deploy this Docker image on K8s.&lt;/p&gt;
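&lt;p&gt;As a quick sanity check, the pushed image can be run directly. This is a sketch; it assumes your application listens on port 8080 and uses the yourusername/my-java-app:latest tag from the previous step:&lt;/p&gt;

```shell
# Pull the image from the registry and run it, mapping container
# port 8080 to the same port on the host (adjust to your app's port)
docker run --rm -p 8080:8080 yourusername/my-java-app:latest
```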

&lt;h2&gt;
  
  
  5. &lt;strong&gt;Creating a Kubernetes Cluster&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Create a Cluster with &lt;a href="https://cloud.google.com/kubernetes-engine/" rel="noopener noreferrer"&gt;Google's GKE&lt;/a&gt;, &lt;a href="https://learn.microsoft.com/en-us/azure/aks/" rel="noopener noreferrer"&gt;Azure's AKS&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html" rel="noopener noreferrer"&gt;AWS's EKS&lt;/a&gt; or something on local system with help of &lt;a href="https://kind.sigs.k8s.io/" rel="noopener noreferrer"&gt;kind&lt;/a&gt; or &lt;a href="https://minikube.sigs.k8s.io/docs/" rel="noopener noreferrer"&gt;minikube&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Create a Kubernetes cluster and establish a connection so you can run kubectl commands; see the docs of your respective Kubernetes provider. With kind or minikube, you don't need that step.&lt;/p&gt;
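&lt;p&gt;For a local experiment, the cluster step can be sketched with kind (assuming kind and kubectl are installed; the cluster name java-demo is arbitrary):&lt;/p&gt;

```shell
# Create a local Kubernetes cluster running inside Docker
kind create cluster --name java-demo

# kind adds a kubectl context named "kind-" plus the cluster name;
# verify connectivity to the new cluster
kubectl cluster-info --context kind-java-demo
```

&lt;p&gt;With kind you can also skip the registry push entirely by loading the local image into the cluster: kind load docker-image my-java-app:latest --name java-demo.&lt;/p&gt;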

&lt;h2&gt;
  
  
  6. &lt;strong&gt;Creating a Deployment yaml&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You write this YAML file to tell Kubernetes how to run your application. You provide the location of the Docker Image and specify other deployment parameters in this file.&lt;/p&gt;

&lt;p&gt;There is always a lot you can add to a Deployment YAML in Kubernetes. Make a file ending in &lt;em&gt;.yaml&lt;/em&gt; and write these parameters in it.&lt;br&gt;
This is close to the most simplified YAML for a basic Deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-java-app-deployment
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: my-java-app
        image: yourusername/my-java-app:latest
        ports:
        - containerPort: 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have made the required changes, save the file and run this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f &amp;lt;your-file-name&amp;gt;.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;Congratulations!&lt;/p&gt;

&lt;p&gt;You just deployed your first Java application on a Kubernetes Cluster.&lt;/p&gt;
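&lt;p&gt;To confirm the rollout actually succeeded, a few standard kubectl commands help. This is a sketch; the Deployment name matches the YAML above, and the label selector assumes your Pod template carries an app: my-java-app label:&lt;/p&gt;

```shell
# Wait until the Deployment reports all replicas ready
kubectl rollout status deployment/my-java-app-deployment

# Inspect the Pods it created (assumes an app: my-java-app label on the template)
kubectl get pods -l app=my-java-app

# Forward a local port to the app for a quick manual test
kubectl port-forward deployment/my-java-app-deployment 8080:8080
```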

&lt;p&gt;There are several best practices for the Dockerfile and the YAML that we skipped in this article; they matter for building production-grade systems.&lt;/p&gt;

&lt;p&gt;Hope this helps :)&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What is GDPR compliance?</title>
      <dc:creator>Pratik Singh</dc:creator>
      <pubDate>Tue, 20 Feb 2024 19:27:53 +0000</pubDate>
      <link>https://forem.com/kitarp29/what-is-gdpr-compliance-48cn</link>
      <guid>https://forem.com/kitarp29/what-is-gdpr-compliance-48cn</guid>
      <description>&lt;p&gt;In this post, we will talk about GDPR. If you are a student, this is a good to know but not a must! This article is more for working professionals.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is GDPR?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GDPR&lt;/strong&gt; or General Data Protection Regulation is a law protecting the use of personal data of EU Citizens. It applies to any company that does business with people in the European Union, even if the company itself isn't located there.&lt;/p&gt;

&lt;p&gt;Read the official docs: &lt;a href="https://gdpr-info.eu/" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why should you care?
&lt;/h2&gt;

&lt;p&gt;"What happens if your company/product is not GDPR compliant?", am I right? Even I did not care much about it until now. I am a developer, why should I care? Let me just ensure my API calls are fast enough and the deployments are reliable.&lt;/p&gt;

&lt;p&gt;But wait... I will explain the stakes here.&lt;/p&gt;

&lt;p&gt;Let me just put a line from their official law here:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There are two tiers of penalties, which max out at €20 million or 4% of global revenue (whichever is higher), plus data subjects have the right to seek compensation for damages.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Yes €20 million is the fine 🤯🤯!!
&lt;/h3&gt;




&lt;blockquote&gt;
&lt;p&gt;Now that I have your attention let's dig deeper:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  GDPR in simple words:
&lt;/h2&gt;

&lt;p&gt;Yes, GDPR is not just legal mumbo jumbo. I feel it's more like a design approach for your software.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Imagine building software with privacy woven in, not bolted on. &lt;/li&gt;
&lt;li&gt;Forget collecting everything; minimize data like it's gold.&lt;/li&gt;
&lt;li&gt;Secure it fiercely with encryption and access controls.&lt;/li&gt;
&lt;li&gt;Keep your software updated and scan regularly for vulnerabilities in your infra.&lt;/li&gt;
&lt;li&gt;Be transparent about what you do with it.&lt;/li&gt;
&lt;li&gt;Share minimum data with third-party software&lt;/li&gt;
&lt;li&gt;Empower users with clear consent and deletion rights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No matter whether you do front-end, backend, or DevOps, you need to keep these points in your head.&lt;/p&gt;

&lt;p&gt;This is my crude interpretation of the law. You can read the whole in the last link (Hehe I know you didn't read it).&lt;/p&gt;

&lt;p&gt;So for you this is a simpler explanation of GDPR: &lt;a href="https://gdpr.eu/what-is-gdpr/" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Do Developers need to check on this?
&lt;/h4&gt;

&lt;p&gt;Honestly, it depends on your organisation. Some tools and consultants can do it for you. If you are a really small company, buckle up kiddo, there is a code refactor pending. Or, you know, hire a really good lawyer!&lt;/p&gt;

&lt;h4&gt;
  
  
  2. If my company doesn't operate in the EU?
&lt;/h4&gt;

&lt;p&gt;Strictly speaking, GDPR applies to you only if you handle the data of people in the EU. But understand that most countries have come up with similar laws. More on that later!&lt;br&gt;
And if you are GDPR compliant, you will not have issues with most of the other laws in this space.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Do we need to do a regular audit for it?
&lt;/h4&gt;

&lt;p&gt;Well, as far as I understand it, as a developer you need to keep this in mind while building software. Audits are usually done occasionally, depending on your company and country.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Who will complain against your company/product?
&lt;/h4&gt;

&lt;p&gt;Well, in the EU each country has its own body regulating and enforcing this for most companies that EU citizens use. Even an individual user who can prove a violation can file a complaint!&lt;/p&gt;




&lt;h2&gt;
  
  
  Extra
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;You can choose to completely ignore this. It goes beyond the scope of GDPR&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;While learning about GDPR, I found that even the Indian Government came up with the &lt;strong&gt;Digital Personal Data Protection (DPDP) Act, 2023&lt;/strong&gt; 🇮🇳🫡. So if you are building within India, read: &lt;a href="https://www.meity.gov.in/writereaddata/files/Digital%20Personal%20Data%20Protection%20Act%202023.pdf" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I would like to thank &lt;a href="https://twitter.com/sunnykgupta" rel="noopener noreferrer"&gt;Sunny Sir&lt;/a&gt;; he asked me some good questions on GDPR that I wasn't able to answer. I learned not to simply accept the explanations people give at face value. Maybe Google it once!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you liked this content you can follow me here or on Twitter at &lt;a href="https://twitter.com/kitarp29" rel="noopener noreferrer"&gt;kitarp29&lt;/a&gt; for more!&lt;/p&gt;

&lt;p&gt;Thanks for reading my article :)&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>learning</category>
      <category>codenewbie</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>How to stop useless PRs on Open Source!</title>
      <dc:creator>Pratik Singh</dc:creator>
      <pubDate>Thu, 08 Feb 2024 12:46:06 +0000</pubDate>
      <link>https://forem.com/kitarp29/stop-useless-prs-on-open-source-39in</link>
      <guid>https://forem.com/kitarp29/stop-useless-prs-on-open-source-39in</guid>
      <description>&lt;p&gt;In this article, I will try to develop a couple of Solutions to stop useless PRs in Open Source.&lt;/p&gt;

&lt;h2&gt;
  
  
  🤔 Understanding the Problem
&lt;/h2&gt;

&lt;p&gt;I recently came across a post by &lt;a href="https://twitter.com/arpit_bhayani" rel="noopener noreferrer"&gt;&lt;strong&gt;Arpit Bhayani&lt;/strong&gt;&lt;/a&gt;. I am sure you might have seen this controversy that recently happened:&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1754862825342943739-975" src="https://platform.twitter.com/embed/Tweet.html?id=1754862825342943739"&gt;
&lt;/iframe&gt;

&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There are many PRs made to Open Source with useless changes.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;In this blog, we will go from the basic way to the Machine Learning approach to reduce this issue.&lt;/p&gt;

&lt;p&gt;I am &lt;a href="//www.twitter.com/kitarp29"&gt;Pratik Singh&lt;/a&gt;, a Senior Software Developer at &lt;a href="//www.nasdaq.com"&gt;Nasdaq&lt;/a&gt;. A major part of my job is building and maintaining CI/CD Pipelines. Let me share a few solutions to fix this.&lt;/p&gt;

&lt;h2&gt;
  
  
  ✨ Possible solutions
&lt;/h2&gt;

&lt;p&gt;This is a problem that doesn't usually come up inside a company. Treat this article more as a free space to discuss your ideas in the comments.&lt;br&gt;
I will start with basic implementations and we will move up the ladder. Let's go!&lt;/p&gt;
&lt;h3&gt;
  
  
  1. KISS Approach 😉
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;K&lt;/strong&gt;eep &lt;strong&gt;I&lt;/strong&gt;t &lt;strong&gt;S&lt;/strong&gt;imple &lt;strong&gt;S&lt;/strong&gt;tupid!&lt;br&gt;
The very first approach to this issue is to restrict user access. GitHub has a built-in feature to limit interactions.&lt;br&gt;
Go to: &lt;strong&gt;GitHub Repo Settings -&amp;gt; Moderation Options -&amp;gt; Interaction Limits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This will help to stop newcomers from making useless PRs!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgaejt8xe4tpf66614bxo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgaejt8xe4tpf66614bxo.png" alt="Moderation Options on Github Repo Settings" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But what if the user is a prior contributor?&lt;/em&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  2. Not all can edit Docs!
&lt;/h3&gt;

&lt;p&gt;Next, we move to different kinds of CI jobs to tackle this problem. The idea is to define a set of users who are allowed to change the .md files (or any file, for that matter), and to have this job fail the entire CI pipeline otherwise!&lt;/p&gt;

&lt;p&gt;Kubernetes has a dedicated set of people who work on the docs. I know not all repos can do that, but you can certainly assign a few specific people to it!&lt;/p&gt;

&lt;p&gt;The Job would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: Checking for authorized Doc changes

on:
  pull_request:
    paths:
      - '**/*.md' 
jobs:
  restrict_md_changes:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2   # git history is needed for git log below

      - name: Check commit author
        id: check_author
        run: |
          # Get the author of the latest commit
          AUTHOR=$(git log -1 --pretty=format:'%an')

          # List of allowed authors (replace with your own)
          ALLOWED_AUTHORS="kitarp29 user1 user2 "

          # Check if the author is allowed
          if [[ ! $ALLOWED_AUTHORS =~ (^| )$AUTHOR($| ) ]]; then
            echo "Unauthorized commit by $AUTHOR. Only specific accounts are allowed."
            echo "If you see a problem in the Docs, please raise an Issue"
            exit 1
          fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see it working on one of my pet projects: &lt;a href="https://github.com/kitarp29/kube-ez/blob/main/.github/workflows/don't_mess_with_my_docs" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  3. PR should have an Assigned Issue
&lt;/h3&gt;

&lt;p&gt;The ideal way to do Open Source is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create an Issue&lt;/li&gt;
&lt;li&gt;Get it assigned to you&lt;/li&gt;
&lt;li&gt;Build it&lt;/li&gt;
&lt;li&gt;Make a PR to solve the issue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why not enforce this?&lt;/strong&gt; This CI job will ensure that the PR raised by the user has an Issue related to it. Also, it is assigned to them.&lt;/p&gt;

&lt;p&gt;I understand there will be some requirements you need to declare in CONTRIBUTING.md for this. But the CI would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: Check PR Issue Assignment

on:
  pull_request:
    types:
      - opened
      - synchronize

jobs:
  check-issue:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2   # needed so git log works below

      - name: Check if PR has an issue
        id: check-issue
        run: |
          # Extract the issue number from the PR title
          ISSUE_NUMBER=$(echo "${{ github.event.pull_request.title }}" | grep -oE '#[0-9]+' | sed 's/#//')
          if [ -z "$ISSUE_NUMBER" ]; then
            echo "No issue found in the PR title."
            exit 1
          fi

          # Get the issue details
          ISSUE_DETAILS=$(curl -s -H "Authorization: Bearer $GITHUB_TOKEN" "https://api.github.com/repos/${{ github.repository }}/issues/$ISSUE_NUMBER")
          ISSUE_ASSIGNEE=$(echo "$ISSUE_DETAILS" | jq -r '.assignee.login')

          # Get the user making the commit
          COMMITTER=$(git log -1 --pretty=format:"%an")

          # Check if the issue is assigned to the committer
          if [ "$ISSUE_ASSIGNEE" != "$COMMITTER" ]; then
            echo "Issue #$ISSUE_NUMBER is not assigned to $COMMITTER."
            exit 1
          fi
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Every GitHub Actions runner has its own GITHUB_TOKEN, so there is no extra charge here.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  4. gh-cli approach
&lt;/h3&gt;

&lt;p&gt;If you are still here I am sure you are intrigued by the idea. So let's dig deep from here.&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://cli.github.com/" rel="noopener noreferrer"&gt;gh-cli&lt;/a&gt;; it's mostly overkill when the GitHub UI is so good. But if you add it to a GitHub Actions runner, you can automate almost every aspect of being a maintainer with it. You can report such spam users as well!&lt;/p&gt;

&lt;p&gt;I will not give an exact job here, as this idea needs to be tailor-made.&lt;/p&gt;
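&lt;p&gt;Still, as a minimal illustration (assuming gh is authenticated, e.g. via GITHUB_TOKEN, and run inside the repo; the PR number 123 is a placeholder):&lt;/p&gt;

```shell
# Inspect which files a PR touches (123 is a placeholder PR number)
gh pr view 123 --json files --jq '.files[].path'

# Close a spam PR with an explanatory comment
gh pr close 123 --comment "Please get an issue assigned before opening doc-only PRs."
```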




&lt;h3&gt;
  
  
  5. Initial Idea
&lt;/h3&gt;

&lt;p&gt;For me, the first idea was this:&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1754867914732126382-87" src="https://platform.twitter.com/embed/Tweet.html?id=1754867914732126382"&gt;
&lt;/iframe&gt;

&lt;/p&gt;

&lt;p&gt;Irrespective of all the comments on it, I still think it would be an easy fix for the problem.&lt;/p&gt;

&lt;p&gt;The CI Job could look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: Check PR Markdown Changes

on:
  pull_request:
    types:
      - opened
      - synchronize
    paths:
      - '**/*.md' # Include all .md files

jobs:
  check-md-changes:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2
        with:
          fetch-depth: 0 # full history is needed for the diffs below

      - name: Get changed Markdown files
        id: changed-md-files
        run: |
          CHANGED_FILES=$(git diff --name-only ${{ github.event.pull_request.base.sha }} ${{ github.sha }} | grep '\.md$' | tr '\n' ' ')
          echo "changed_md_files=$CHANGED_FILES" &amp;gt;&amp;gt; "$GITHUB_OUTPUT"

      - name: Count lines changed in Markdown files
        id: count-lines
        run: |
          LINES_CHANGED=0
          for FILE in ${{ steps.changed-md-files.outputs.changed_md_files }}; do
            LINES_CHANGED=$((LINES_CHANGED + $(git diff ${{ github.event.pull_request.base.sha }} ${{ github.sha }} -- $FILE | wc -l)))
          done
          echo "lines_changed=$LINES_CHANGED" &amp;gt;&amp;gt; "$GITHUB_OUTPUT"

      - name: Fail if too few Markdown lines changed
        run: |
          if [ ${{ steps.count-lines.outputs.lines_changed }} -lt 50 ]; then
            echo "Lines changed in Markdown files: ${{ steps.count-lines.outputs.lines_changed }}"
            exit 1
          fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, I know you will raise the point of "&lt;em&gt;False Positives&lt;/em&gt;", among others. I will address them towards the end.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Machine Learning Approach
&lt;/h3&gt;

&lt;p&gt;The moment you have all been reading for!&lt;br&gt;
Can &lt;strong&gt;Machine Learning&lt;/strong&gt; be used to fix this? Yes.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1754868424986026096-753" src="https://platform.twitter.com/embed/Tweet.html?id=1754868424986026096"&gt;
&lt;/iframe&gt;&lt;/p&gt;

&lt;p&gt;Is it overkill? Also &lt;strong&gt;Yes&lt;/strong&gt; 😂&lt;br&gt;
But whether it counts as "overkill" depends on the cost it incurs versus the magnitude of the problem it solves.&lt;/p&gt;

&lt;p&gt;We know some models can do this. We can run them within the CI runner or maybe create a microservice for it 😂&lt;/p&gt;

&lt;p&gt;You can take the original .md file and the new one, and send both as string inputs to your Python script to get the results back.&lt;/p&gt;

&lt;p&gt;For the Python code, you can take any of these three approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://docs.python.org/3/library/difflib.html#:~:text=class%20difflib.-,SequenceMatcher,-This%20is%20a" rel="noopener noreferrer"&gt;&lt;strong&gt;ndiff&lt;/strong&gt;&lt;/a&gt;: A very basic approach; not Machine Learning, but it works here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External Service&lt;/strong&gt;: With the rise of AI, Azure, Google, and various other providers offer such APIs at this point. You can subscribe, and your CI will call the service to check whether the changes are semantically the same. You can check for spelling and grammar mistakes as well.&lt;/li&gt;
&lt;li&gt;Using a &lt;strong&gt;Machine Learning Model&lt;/strong&gt;: For such a use case, the &lt;a href="https://huggingface.co/blog/bert-101" rel="noopener noreferrer"&gt;BERT&lt;/a&gt; model seems to be the perfect fit. I have worked with this model at scale and can vouch for its accuracy.&lt;/li&gt;
&lt;/ol&gt;
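
&lt;p&gt;As a quick illustration of the first option, here is a minimal sketch (not the exact code a CI would run) that counts changed lines with difflib:&lt;/p&gt;

```python
from difflib import ndiff

def changed_line_count(old_md: str, new_md: str) -> int:
    """Count lines added or removed between two Markdown documents,
    using difflib.ndiff (the 'very basic' option above)."""
    return sum(
        1
        for line in ndiff(old_md.splitlines(), new_md.splitlines())
        if line.startswith(("+ ", "- "))
    )

# A one-line "typo fix" shows up as just two changed lines (old + new).
print(changed_line_count("Hello wrld\nSecond line", "Hello world\nSecond line"))
```

&lt;p&gt;A PR whose changed files each score only a couple of changed lines is a good candidate for manual review.&lt;/p&gt;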

&lt;p&gt;This is a sample CI Job template covering all three approaches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: Markdown Similarity Check

on:
  pull_request:
    paths:
      - '**/*.md' # Only trigger on changes to Markdown files

jobs:
  similarity-check:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history, so both commits are available locally

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x' # Choose the appropriate Python version

      - name: Install dependencies
        run: pip install -r requirements.txt # Add any required dependencies

      - name: Get old and new Markdown content
        run: |
          # List the Markdown files changed by this PR
          git diff --name-only ${{ github.event.pull_request.base.sha }} ${{ github.sha }} | grep '\.md$' &amp;gt; changed_files.txt || true
          # Compare the old and new version of each changed file
          while read -r file; do
            old_content=$(git show ${{ github.event.pull_request.base.sha }}:"$file")
            new_content=$(git show ${{ github.sha }}:"$file")
            python calculate_similarity.py "$old_content" "$new_content"
          done &amp;lt; changed_files.txt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
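
&lt;p&gt;The workflow above calls a calculate_similarity.py script. Here is a minimal sketch of what it could contain, using difflib's SequenceMatcher (the 0.95 threshold is an assumption to tune per repository; a BERT or external-service version would swap out the similarity function):&lt;/p&gt;

```python
# calculate_similarity.py (sketch): fail the build when the old and new
# Markdown content are nearly identical, i.e. a likely trivial change.
import sys
from difflib import SequenceMatcher

def similarity(old: str, new: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, old, new).ratio()

if __name__ == "__main__" and len(sys.argv) == 3:
    ratio = similarity(sys.argv[1], sys.argv[2])
    print(f"similarity: {ratio:.3f}")
    if ratio > 0.95:  # assumed threshold, not from this article
        sys.exit(1)
```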






&lt;h2&gt;
  
  
  My Take on the Issue
&lt;/h2&gt;

&lt;p&gt;I understand that desperation and misdirection can lead people to do bad things. Students, please understand that Open Source devs are generally polite and take the extra step to help you out.&lt;/p&gt;

&lt;p&gt;But I am no politician set to change the world. &lt;br&gt;
I am a Developer, &lt;strong&gt;I prefer Code over talk!&lt;/strong&gt;&lt;br&gt;
These ideas can help to reduce the magnitude of the issue. &lt;/p&gt;

&lt;p&gt;Coming back to the &lt;strong&gt;"False Positives"&lt;/strong&gt;: I agree there will be some. But the problem mostly hits huge repos, and such repos don't change docs frequently; they have releases. If one has to fix any doc, create an issue for it first!&lt;br&gt;
There are some existing solutions around this: &lt;a href="https://docs.mergify.com/" rel="noopener noreferrer"&gt;Check Here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you liked this content, you can follow me here or on Twitter at &lt;a href="//www.twitter.com/kitarp29"&gt;kitarp29&lt;/a&gt; for more!&lt;/p&gt;

&lt;p&gt;Thanks for reading my article :)&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>cicd</category>
      <category>devops</category>
      <category>codenewbie</category>
    </item>
    <item>
      <title>A programming language coding in a grid</title>
      <dc:creator>Pratik Singh</dc:creator>
      <pubDate>Wed, 13 Dec 2023 14:14:24 +0000</pubDate>
      <link>https://forem.com/kitarp29/a-programming-language-coding-in-a-grid-3e84</link>
      <guid>https://forem.com/kitarp29/a-programming-language-coding-in-a-grid-3e84</guid>
      <description>&lt;p&gt;What? A programming language coding in a grid?&lt;/p&gt;

&lt;p&gt;Yes, you read that right: SPL (Structured Process Language) is a programming language that codes in a grid and is specially designed for processing structured data.&lt;/p&gt;

&lt;p&gt;We know that almost all programming languages write code as text, so what does SPL code look like, and what is the difference between the grid-style code and the text-style code? Let's take a look at the programming environment of SPL first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code in a grid
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu640coocc849wpn7ouxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu640coocc849wpn7ouxn.png" alt="Image description" width="800" height="417"&gt;&lt;/a&gt;&lt;br&gt;
The middle part is the grid-style code of SPL.&lt;/p&gt;


&lt;p&gt;What are the benefits of writing code in a grid?&lt;/p&gt;

&lt;p&gt;When programming, we always need intermediate variables and have to name them. However, when we program in SPL, naming variables is often unnecessary: SPL allows us to reference a previous cell's name (such as A1) directly in subsequent steps and get that cell's calculation result, for example,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=A2.align@a(A6:~,date(Datetime))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this way, we avoid racking our brains to define variables (a variable has to be given a meaningful name, which is annoying). Of course, SPL also supports defining variables; there is no need to declare the variable's type, so we can name a variable and use it directly, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=cod=A4.new(ID:Commodity,0:Stock,:OosTime,0:TotalOosTime)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition, we can temporarily define a variable in an expression, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;= A1.group@o(a+=if(x,1,0)).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You may worry that using cell names as variable names causes a problem: a cell's name (position) changes after inserting or deleting a row or column, so the original name would reference the wrong cell. Don't worry, this problem is already solved in SPL's IDE: cell names update automatically when a row is inserted or deleted. For example, when inserting a row, the names of the referenced cells (red underlined) change accordingly. Isn't that convenient?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fappdvdor1vi7j6ecucti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fappdvdor1vi7j6ecucti.png" alt="Image description" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The grid-style code feels very neat. Because code is written in cells, it is naturally aligned. For instance, the indentation of cells marks a code block (the for loop statement from A12 to A18) without any extra delimiter, which looks clean and intuitive. Moreover, when the code in a cell grows long while handling a complicated task, it still occupies only one cell and does not disturb the structure of the rest of the program: a long cell never spills past its boundary, so the cells to its right and below remain readable. In contrast, text-style code does not have this advantage, because it has to display the entire code.&lt;/p&gt;

&lt;p&gt;Besides, note the collapse button at the row where the for loop is located; it collapses the entire code block. Although such a button is available in the IDEs of many text-style languages, in SPL it makes the whole program even neater and easier to read.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixpmtui3p9h3padx4kle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixpmtui3p9h3padx4kle.png" alt="Image description" width="800" height="112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's look at debugging. In SPL's IDE, the upper toolbar provides multiple execution/debugging buttons, including run, debug, run to cursor, and step in, as well as buttons to set breakpoints and calculate the current cell, which fully covers the needs of editing and debugging a program. Each step executes one cell, so the position of a breakpoint is always clear. Debugging text-style code is different: one line may contain multiple actions that are hard to separate, and a breakpoint is hard to place when a long statement has to be split across multiple lines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbotyfk1i9d001z7i9umf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbotyfk1i9d001z7i9umf.png" alt="Image description" width="800" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also note the result panel on the right. Because SPL adopts grid-style programming, the result of each step (cell) is retained after execution. Clicking a cell shows its calculation result in real time, so it is immediately clear whether that step is correct. This makes debugging even more convenient, since there is no need to export results manually to inspect them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu6hdhplc5ufv0otb3or.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu6hdhplc5ufv0otb3or.png" alt="Image description" width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The benefits don't stop at grid-style programming
&lt;/h2&gt;

&lt;p&gt;Writing code in cells will make programming convenient, for example, it’s easier to edit or debug. However, it will not simplify writing each statement. Let's take a look at SPL syntax itself.&lt;/p&gt;

&lt;p&gt;When processing data, especially in complex scenarios, we will definitely use loops and branches, the basic facilities of any programming language. SPL of course provides them. In addition, SPL offers features such as function options, cascaded parameters, and advanced Lambda syntax.&lt;/p&gt;

&lt;h3&gt;
  
  
  Function option
&lt;/h3&gt;

&lt;p&gt;Every programming language has a large number of built-in library functions, and the richer the library, the more convenient it is to implement functionality. Functions are distinguished by name or by parameters (and parameter types). Sometimes, however, even parameter types cannot distinguish them, and we need to explicitly pass an option parameter to tell the compiler or interpreter what we want to do. For example, processing files in Java involves multiple OpenOptions. When we want to create a new file and fail if it already exists, the code is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Files.write(path, DUMMY_TEXT.getBytes(), StandardOpenOption.CREATE_NEW);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we want to open an existing file, or create it when it does not exist, the code is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Files.write(path, DUMMY_TEXT.getBytes(), StandardOpenOption.CREATE); 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we want to append data to a file and ensure that the data will not be lost if the system crashes, the code is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Files.write(path,ANOTHER_DUMMY_TEXT.getBytes(), StandardOpenOption.APPEND, StandardOpenOption.WRITE, StandardOpenOption.SYNC)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see from the code above, implementing different functionality with the same function requires selecting different options. Usually an option is treated as just another parameter, but this complicates usage and often obscures the real purpose of the parameters; and for functions with a variable number of parameters, there is no way to represent an option with a parameter at all.&lt;/p&gt;

&lt;p&gt;SPL provides a unique function-option mechanism, which allows functions with similar functionality to share one function name, with the differences distinguished by options. In form it is a two-layer classification, making functions easy both to remember and to use. For example, the pos function searches for the position of a substring in a parent string; to search from back to front, we can use the option @z:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pos@z("abcdeffdef","def")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To perform a case-insensitive search, we can use the option @c:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pos@c("abcdef","Def")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two options can also be used in combination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pos@zc("abcdeffdef","Def")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With function options, we need to be familiar with fewer functions. When we use the same function for different functionality, we just find the corresponding option; in effect, SPL classifies its functions into layers, which makes them easier to look up and use.&lt;/p&gt;
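
&lt;p&gt;For readers more familiar with mainstream languages, the three pos calls above compute roughly the following (a Python sketch; note that SPL itself is 1-based, while Python's find is 0-based and returns -1 when the substring is absent):&lt;/p&gt;

```python
def pos(s: str, sub: str, z: bool = False, c: bool = False) -> int:
    """Rough analogue of SPL's pos with the @z (search backwards)
    and @c (case-insensitive) options, 0-based like Python."""
    if c:  # @c: compare case-insensitively
        s, sub = s.casefold(), sub.casefold()
    return s.rfind(sub) if z else s.find(sub)

print(pos("abcdeffdef", "def", z=True))          # last occurrence
print(pos("abcdef", "Def", c=True))              # ignore case
print(pos("abcdeffdef", "Def", z=True, c=True))  # both options combined
```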

&lt;h3&gt;
  
  
  Cascaded parameter
&lt;/h3&gt;

&lt;p&gt;The parameters of some functions are very complex and may be divided into multiple layers. Conventional programming languages have no special syntax for this situation and can only build multi-layer structured data objects and pass them in, which is very troublesome. For example, the following Java code performs a join operation (an inner join between the Orders table and the Employee table):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Map&amp;lt;Integer, Employee&amp;gt; EIds = Employees.collect(Collectors.toMap(Employee::EId, Function.identity()));
record OrderRelation(int OrderID, String Client, Employee SellerId, double Amount, Date OrderDate){}
Stream&amp;lt;OrderRelation&amp;gt; ORS=Orders.map(r -&amp;gt; {
   Employee e=EIds.get(r.SellerId);
   OrderRelation or=new OrderRelation(r.OrderID,r.Client,e,r.Amount,r.OrderDate);
   return or;
}).filter(e-&amp;gt;e.SellerId!=null);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It can be seen that a multi-layer (multi-part) parameter must be passed in to perform the association, which is hard to read, let alone write. If we add a little more calculation (other operations often follow an association), for example grouping by Employee.Dept and summing Orders.Amount, the code becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Map&amp;lt;String,DoubleSummaryStatistics&amp;gt;c=ORS.collect(Collectors.groupingBy(r-&amp;gt;r.SellerId.Dept,Collectors. .summarizingDouble(r-&amp;gt;r.Amount)));

for(String dept:c.keySet()){
 DoubleSummaryStatistics r =c.get(dept);
 System.out.println("group(dept): "+dept+" sum(Amount): "+r.getSum());
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no need to explain the complexity of such code; programmers have felt it first-hand. In contrast, SQL is more intuitive and simpler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Dept,sum(Amount) from Orders r inner join Employee e on r.SellerId=e. SellerId group by Dept
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SQL employs keywords (from, join, etc.) to divide the calculation into several parts, which can be understood as a multi-layer parameter, just disguised as English for easy reading. However, this approach is much less universal, because special keywords have to be chosen for each statement, resulting in inconsistent statement structure.&lt;/p&gt;

&lt;p&gt;Instead of separating parameters with keywords like SQL, or nesting multiple layers like Java, SPL invents the cascaded parameter. SPL supports three layers of parameters, separated by semicolons, commas, and colons respectively: the semicolon marks the top level, and the parameters it separates form groups; within a group, parameters are separated by commas; a third layer within those is separated by colons. Implementing the above association calculation in SPL looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;join(Orders:o,SellerId ; Employees:e,EId).groups(e.Dept;sum(o.Amount))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code is simple and straightforward, with no nested layers and no inconsistent statement structure. Practice shows that three layers basically meet all requirements; we hardly ever meet a parameter relationship that cannot be clearly described in three layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced Lambda syntax
&lt;/h2&gt;

&lt;p&gt;We know that Lambda syntax can simplify coding, and many programming languages now support it. For example, counting the number of empty strings in Java 8 or later can be coded like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;List&amp;lt;String&amp;gt;strings = Arrays.asList("abc", "", "bc", "efg", "abcd","", "jkl");
long count = strings.stream().filter(string -&amp;gt; string.isEmpty()).count();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This “(parameter)-&amp;gt; function body” Lambda expression can simplify the definition of anonymous function and is easy to use.&lt;/p&gt;

&lt;p&gt;Nevertheless, for some slightly complex calculations, the code will be longer. For example, perform a grouping and aggregating calculation on two fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Calendar cal=Calendar.getInstance();

Map&amp;lt;Object, DoubleSummaryStatistics&amp;gt; c=Orders.collect(Collectors.groupingBy(
   r-&amp;gt;{
     cal.setTime(r.OrderDate);
     return cal.get(Calendar.YEAR)+"_"+r.SellerId;
   },

   Collectors.summarizingDouble(r-&amp;gt;{
     return r.Amount;
   })
   )
);

for(Object sellerid:c.keySet()){
  DoubleSummaryStatistics r =c.get(sellerid);
  String year_sellerid[]=((String)sellerid).split("_");
  System.out.println("group is (year):"+year_sellerid[0]+"\t(sellerid):"+year_sellerid[1]+"\t sum is: "+r.getSum()+"\t count is: "+r.getCount());
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this code, every field name must be preceded by a table name ("table name.field name"); the table name cannot be omitted. The anonymous-function syntax is complex, and the complexity grows rapidly with the amount of code; two anonymous functions form nested code, which is even harder to understand. Implementing one grouping and aggregation involves many functions and classes, including groupingBy, collect, Collectors, summarizingDouble, and DoubleSummaryStatistics; the complexity is very high.&lt;/p&gt;
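
&lt;p&gt;For comparison, the same two-field grouping and aggregation written in plain Python (a sketch with illustrative field names and made-up data, not the article's tables) is already noticeably shorter than the Java version, though still longer than the SPL one shown next:&lt;/p&gt;

```python
from collections import defaultdict

def group_orders(orders):
    """orders: iterable of (year, seller_id, amount) tuples.
    Returns {(year, seller_id): (sum_of_amount, count)}."""
    acc = defaultdict(lambda: [0.0, 0])
    for year, seller_id, amount in orders:
        acc[(year, seller_id)][0] += amount
        acc[(year, seller_id)][1] += 1
    return {key: tuple(v) for key, v in acc.items()}

sample = [(2023, 7, 100.0), (2023, 7, 50.0), (2024, 9, 30.0)]
print(group_orders(sample))
```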

&lt;p&gt;SPL also supports Lambda syntax, and its support is more thorough than that of languages like Java. Now let's perform the above calculations in SPL.&lt;/p&gt;

&lt;p&gt;Count the number of empty strings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=["abc", "", "bc", "efg", "abcd","", "jkl"].count(~=="")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL directly simplifies A.(x).count() to A.count(x), which is more convenient. However, this code doesn't seem to differ much from the Java version. Let's look at another calculation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=Orders.groups(year(OrderDate),Client; sum(Amount),count(1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the difference? Grouping and aggregating in SPL has many advantages: i) there is no need to define a data structure in advance; ii) there are no redundant functions in the code; iii) the use of sum and count is simple and easy to understand; it is even hard to notice that a nested anonymous function is involved.&lt;/p&gt;

&lt;p&gt;Let's look at another example:&lt;/p&gt;

&lt;p&gt;There is a set in which a company's sales from January to December are stored. Based on this set, we can do the following calculations:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fff2plezlds31rjmpaczl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fff2plezlds31rjmpaczl.png" alt="Image description" width="800" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A2: filter out the data of even-numbered months; A3: calculate the growth value of monthly sales.&lt;/p&gt;

&lt;p&gt;Here we use # and [-1]: the former represents the current member's sequence number, and the latter references the previous member. Similarly, to compare the current member with the next one, we can use [1]. The symbols #, [x], and ~ (the current member) are what make SPL's enhanced Lambda syntax unique. With them, almost any calculation can be expressed without extra parameter definitions; the expressive power is stronger, and the code is easier to write and understand.&lt;/p&gt;
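
&lt;p&gt;To make A2 and A3 concrete, the same two calculations can be mimicked in Python, where enumerate plays the role of # and index arithmetic plays the role of [-1] (the monthly sales values are made up for illustration):&lt;/p&gt;

```python
sales = [120, 110, 130, 125, 140, 135, 150, 160, 155, 170, 165, 180]

# A2: keep even-numbered months (# in SPL is 1-based, hence i + 1)
even_months = [v for i, v in enumerate(sales) if (i + 1) % 2 == 0]

# A3: month-over-month growth, i.e. the current member minus [-1] in SPL terms
growth = [cur - prev for prev, cur in zip(sales, sales[1:])]

print(even_months)
print(growth)
```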

&lt;p&gt;Function option, multi-layer parameters and advanced Lambda syntax are another aspect that sets SPL apart.&lt;/p&gt;

&lt;h2&gt;
  
  
  Structured data computing ability comparable to SQL
&lt;/h2&gt;

&lt;p&gt;SPL's grid-style coding and language features (function options, cascaded parameters, Lambda syntax) make SPL look interesting. However, SPL was not invented to attract attention but to process data efficiently. For this purpose, SPL provides a specialized structured data object, the table sequence (of records), along with a rich computing library built on it and support for dynamic data structures, giving SPL the same complete structured data processing ability as SQL.&lt;/p&gt;

&lt;p&gt;In contrast, Java, as a compiled language, is very cumbersome for data calculation because it lacks the necessary structured data objects. Moreover, since Java doesn't support dynamic data structures, data structures cannot be generated on the fly during computation and have to be defined in advance; this problem was not well solved even after the emergence of Stream. In short, the base of Java doesn't provide sufficient support.&lt;/p&gt;

&lt;p&gt;SPL provides rich calculation functions that let us compute on structured data conveniently. They include, but are not limited to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=Orders.sort(Amount)   // sort
=Orders.select(Amount*Quantity&amp;gt;3000 &amp;amp;&amp;amp; like(Client,"*S*"))   // filter
=Orders.groups(Client; sum(Amount))   // group
=Orders.id(Client)   // distinct
=join(Orders:o,SellerId ; Employees:e,EId)   // join
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
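
&lt;p&gt;For orientation, rough Python counterparts of a few of these operations (with made-up sample data; the SPL versions above stay one short expression each) could look like:&lt;/p&gt;

```python
orders = [
    {"Client": "ARO", "SellerId": 1, "Amount": 1600.0, "Quantity": 2},
    {"Client": "BSF", "SellerId": 2, "Amount": 1200.0, "Quantity": 1},
    {"Client": "SJCH", "SellerId": 1, "Amount": 2000.0, "Quantity": 3},
]

by_amount = sorted(orders, key=lambda o: o["Amount"])  # sort(Amount)
big_s = [o for o in orders                             # select(...)
         if o["Amount"] * o["Quantity"] > 3000 and "S" in o["Client"]]
clients = sorted({o["Client"] for o in orders})        # id(Client)

print([o["Client"] for o in by_amount])
print([o["Client"] for o in big_s])
print(clients)
```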



&lt;p&gt;Now let’s see a double-field sorting example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqjplo50xi5jymvz5r9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqjplo50xi5jymvz5r9k.png" alt="Image description" width="800" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this code, @t means that the first row is read as field names, and subsequent calculations can use the field names directly; -Client means reverse (descending) order.&lt;/p&gt;

&lt;p&gt;The code can also be written in one line without hurting readability, which makes it shorter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=file("Orders.txt").import@t().sort(-Client, Amount)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Recall the example in the previous section: to group and aggregate on two fields, Java needs a long, two-layer nested piece of code, which raises the cost of learning and use. In SPL, the same calculation reads like SQL, whether grouping by one field or several:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=Orders.groups(year(OrderDate),Client; sum(Amount))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, for an inner join (followed by aggregation), SPL is much simpler than other high-level languages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=join(Orders:o,SellerId ; Employees:e,EId).groups(e.Dept; sum(o.Amount))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to SQL, SPL can change the association type with a small modification and no changes elsewhere. For example, join@1 means left join, and join@f means full join.&lt;/p&gt;

&lt;p&gt;Rich data objects and libraries give SPL not only data processing ability comparable to SQL but also some good features of high-level languages (such as procedural computing), making data processing easy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Computing abilities that surpass SQL
&lt;/h2&gt;

&lt;p&gt;From what we've discussed above (grid-style programming, features like function options, and complete structured data objects and libraries), we know that SPL has structured data processing ability comparable to SQL, allowing programmers to perform many structured data processing and computation tasks without a database.&lt;/p&gt;

&lt;p&gt;Then, does SPL merely play the role of “SQL” without database?&lt;/p&gt;

&lt;p&gt;Not really! SPL's ability is more than that. In fact, SPL has many advantages over SQL in terms of structured data computation.&lt;/p&gt;

&lt;p&gt;In practice, we often meet scenarios that are difficult to express in SQL, and multi-level nested SQL statements of over a thousand lines are very common. Such code is not only hard to write but also hard to modify and maintain.&lt;/p&gt;

&lt;p&gt;Why does this happen?&lt;/p&gt;

&lt;p&gt;This is because SQL doesn't support certain features well, or doesn't support them at all. Let's look at a few examples comparing SPL and SQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ordered computing
&lt;/h3&gt;

&lt;p&gt;Calculate the maximum number of consecutive trading days that a stock keeps rising, based on a stock transaction record table.&lt;/p&gt;

&lt;p&gt;Coding in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select max(continuousDays)-1
from (select count(*) continuousDays
   from (select sum(changeSign) over(order by tradeDate) unRiseDays
   from (select tradeDate,
     case when closePrice&amp;gt;lag(closePrice) over(order by tradeDate)
     then 0 else 1 end changeSign
     from stock) )
group by unRiseDays)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code nests three layers. First, a window function marks each record with a rising/falling flag (0 if the price rises, 1 otherwise); then the flags are accumulated by date so that each rising interval gets the same accumulated value (the value changes whenever the price fails to rise); finally, the records are grouped by that value and counted, and the maximum count, minus one for the non-rising day that opens each interval, is the answer.&lt;/p&gt;

&lt;p&gt;How does this code feel? Tortuous? Does it take a while to understand? This is not a very complicated task, yet it is this hard to write and to read. Why? Because SQL's sets are unordered and members cannot be accessed by sequence number (or relative position), and SQL provides no ordered grouping operation. Although some databases support window functions and thus order-related operations to a certain extent, that is far from enough, as this example shows.&lt;/p&gt;

&lt;p&gt;Actually, the task has a much simpler implementation: sort by date; compare each day's price with the previous day's (an order-related operation), adding 1 to a counter if the price rises and resetting it to 0 otherwise; then take the maximum of the counter.&lt;/p&gt;
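&lt;p&gt;The natural procedure just described can be sketched in a few lines of Python (illustrative only; the record fields are hypothetical):&lt;/p&gt;

```python
# Max number of consecutive rising days: sort by date, walk once,
# increment a counter when the price rises, reset otherwise.
def max_rising_days(records):
    records = sorted(records, key=lambda r: r["date"])
    best = run = 0
    prev = None
    for r in records:
        run = run + 1 if prev is not None and r["close"] > prev else 0
        best = max(best, run)
        prev = r["close"]
    return best

prices = [{"date": d, "close": c}
          for d, c in enumerate([10, 11, 12, 11, 12, 13, 14, 13])]
```

&lt;p&gt;On the sample closes above the longest rising streak is three days (11→12→13→14).&lt;/p&gt;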

&lt;p&gt;SPL directly supports ordered data set, and naturally supports order-related operations, allowing us to code according to natural thinking:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqokg1mss847pl4246b9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqokg1mss847pl4246b9.png" alt="Image description" width="800" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Backed by ordered operations and procedural computing, SPL expresses the task easily, and the code is easy both to write and to understand.&lt;/p&gt;

&lt;p&gt;Even if we follow the thinking of the SQL solution above, the SPL code is still simpler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stock.sort(trade_date).group@i(close_price&amp;lt;close_price [-1]).max(~.len())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code also exploits orderliness: whenever a record meets the condition (the price does not rise), a new group starts, so each rising interval falls into its own group. We then simply take the size of the largest group. The thinking is the same as the SQL solution's, but the code is much simpler.&lt;/p&gt;
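&lt;p&gt;The semantics of group@i (start a new group whenever the condition holds against the previous record) can be mimicked in plain Python — a sketch under that assumption, not the real SPL runtime:&lt;/p&gt;

```python
# Mimic SPL's group@i: start a new group whenever cond(prev, cur) is true.
def group_i(seq, cond):
    groups = []
    for i, x in enumerate(seq):
        if i == 0 or cond(seq[i - 1], x):
            groups.append([x])        # condition holds: open a new group
        else:
            groups[-1].append(x)
    return groups

closes = [10, 11, 12, 11, 12, 13, 14, 13]
runs = group_i(closes, lambda prev, cur: cur < prev)  # new group when the price falls
```

&lt;p&gt;Each non-falling run becomes one group; the longest group's length minus one is the number of consecutive rises.&lt;/p&gt;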

&lt;h2&gt;
  
  
  Understanding of grouping
&lt;/h2&gt;

&lt;p&gt;List, for each user, the interval between their last two logins, based on a user login record table:&lt;/p&gt;

&lt;p&gt;Coding in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH TT AS
 (SELECT RANK() OVER(PARTITION BY uid ORDER BY logtime DESC) rk, T.* FROM t_loginT)
SELECT uid,(SELECT TT.logtime FROM TT where TT.uid=TTT.uid and TT.rk=1)
 -(SELET TT.logtim FROM TT WHERE TT.uid=TTT.uid and TT.rk=2) interval
FROM t_login TTT GROUP BY uid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To calculate the interval, each user's last two login records are required, which is essentially an in-group top-N operation. However, SQL forces aggregation right after grouping, so it has to resort to self-association to implement the calculation indirectly.&lt;/p&gt;

&lt;p&gt;Coding in SPL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryqqimtk709qs3qtfxpc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryqqimtk709qs3qtfxpc.png" alt="Image description" width="800" height="95"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPL takes a new view of aggregation: besides single values like SUM, COUNT, MAX and MIN, an aggregation result can be a set. For example, SPL treats the common top-N calculation as an aggregation just like SUM and COUNT, one that can be performed either on a whole set or on each grouped subset (as in this example).&lt;/p&gt;
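&lt;p&gt;Treating top-N as an in-group aggregate is easy to show in Python (an illustrative sketch; the login data and names are made up):&lt;/p&gt;

```python
# Last login interval per user: group by uid, take the top 2 login times
# (an in-group "top-N" aggregation), and subtract.
import heapq
from collections import defaultdict

def last_login_intervals(logins):
    by_uid = defaultdict(list)
    for uid, t in logins:
        by_uid[uid].append(t)
    out = {}
    for uid, times in by_uid.items():
        top2 = heapq.nlargest(2, times)   # top-N as an aggregate over the group
        out[uid] = top2[0] - top2[1] if len(top2) == 2 else None
    return out

logins = [("u1", 10), ("u1", 17), ("u1", 20), ("u2", 5)]
```

&lt;p&gt;Note that no global sort and no self-join are needed: the top-2 is computed per group, like any other aggregate.&lt;/p&gt;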

&lt;p&gt;In contrast, SQL does not regard top-N as aggregation. On a whole set, it can only take the first N rows after sorting the result set; on grouped subsets, it is hard for SQL to do at all except by the roundabout trick of generating sequence numbers. Because SPL regards top-N as aggregation, such calculations become easy once the orderliness of the data is exploited, and in practice this approach also avoids sorting all the data, achieving higher performance.&lt;/p&gt;

&lt;p&gt;Furthermore, grouping in SPL need not be followed by aggregation: the grouping result (the grouped subsets, i.e., a set of sets) can be retained, so that operations between group members can be performed.&lt;/p&gt;

&lt;p&gt;SQL, by contrast, has no explicit set data type and cannot return a value such as a set of sets. Since it cannot express grouping independently, grouping and aggregation have to be bound together.&lt;/p&gt;

&lt;p&gt;The two examples above show SPL's advantages in ordered and grouped computation. In fact, many of SPL's features are built on a deep understanding of structured data processing. Discreteness lets us take records out of a data table and compute on them independently and repeatedly; the universal set allows a set composed of any data to participate in computation; the join operation distinguishes three different types of join, letting us choose the one that fits the actual situation; cursor support gives SPL the ability to process big data... With these features, data processing becomes easier and more efficient.&lt;/p&gt;

&lt;p&gt;For more information, refer to: &lt;a href="https://c.scudata.com/article/1668134856969" rel="noopener noreferrer"&gt;SPL Operations for Beginners&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Unexpectedly, SPL can also serve as a data warehouse
&lt;/h2&gt;

&lt;p&gt;Supporting both in-memory and external-storage computing means SPL can also process big data, with higher performance than traditional technologies. SPL provides dozens of lower-complexity, high-performance algorithms to guarantee computing performance, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In-memory computing: binary search, sequence number positioning, position index, hash index, multi-layer sequence number positioning...&lt;/li&gt;
&lt;li&gt;External storage search: binary search, hash index, sorting index, index-with-values, full-text retrieval...&lt;/li&gt;
&lt;li&gt;Traversal computing: delayed cursor, multipurpose traversal, parallel multi-cursor, ordered grouping and aggregating, sequence number grouping...&lt;/li&gt;
&lt;li&gt;Foreign key association: foreign key addressization, foreign key sequence-numberization, index reuse, aligned sequence, one-side partitioning...&lt;/li&gt;
&lt;li&gt;Merge and join: ordered merging, merge by segment, association positioning, attached table...&lt;/li&gt;
&lt;li&gt;Multidimensional analysis: partial pre-aggregation, time period pre-aggregation, redundant sorting, boolean dimension sequence, tag bit dimension...&lt;/li&gt;
&lt;li&gt;Cluster computing: cluster multi-zone composite table, duplicate dimension table, segmented dimension table, redundancy-pattern and spare-wheel-pattern fault tolerance, load balancing...&lt;/li&gt;
&lt;/ul&gt;
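&lt;p&gt;Of these, ordered merging is the easiest to illustrate: when both tables are already stored sorted by the join key, they can be joined in a single pass with no index and no hashing. A simplified Python sketch (assuming 1:1 keys; data and names are hypothetical):&lt;/p&gt;

```python
# One-pass merge join of two tables already stored in order of the join
# key (1:1 keys assumed for simplicity).
def merge_join(a, b, key):
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        ka, kb = a[i][key], b[j][key]
        if ka == kb:
            out.append({**a[i], **b[j]})
            i += 1
            j += 1
        elif ka < kb:
            i += 1                       # advance the side with the smaller key
        else:
            j += 1
    return out

a = [{"k": 1, "x": 1}, {"k": 2, "x": 2}, {"k": 4, "x": 4}]
b = [{"k": 2, "y": 20}, {"k": 3, "y": 30}, {"k": 4, "y": 40}]
```

&lt;p&gt;Because each list is scanned once, the join costs O(n+m), which is why ordered storage matters for this class of algorithm.&lt;/p&gt;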

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7q0qxwepfp9w8ta36yb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7q0qxwepfp9w8ta36yb.png" alt="Image description" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, SPL provides a great many algorithms (some pioneered in the industry) and corresponding guarantee mechanisms for different computing scenarios. As a programming language, SPL offers not only abilities that used to be unique to databases but others beyond them, which fully guarantees computing performance.&lt;/p&gt;

&lt;p&gt;Besides these algorithms (functions), storage deserves mention. Some high-performance algorithms only work when the data is stored in a specific form; for example, the ordered merging and one-side partitioning algorithms above require the data to be stored in order. To ensure computing performance, SPL designs a specialized binary file storage. With mechanisms such as code compression, columnar storage and parallel segmentation, plus approaches like sorting and indexing, this storage maximizes the effectiveness of the high-performance algorithms, achieving still higher computing performance.&lt;/p&gt;

&lt;p&gt;High-performance algorithms plus specialized storage give SPL all the key abilities of a data warehouse, making it easy to replace traditional relational data warehouses and big data platforms like Hadoop at lower cost and higher efficiency.&lt;/p&gt;

&lt;p&gt;In practice, SPL as a data warehouse does show very different performance from traditional solutions. In an e-commerce funnel analysis scenario, SPL was nearly 20 times faster than Snowflake even on a lower-specification server; in an NAOC scenario of clustering celestial bodies, SPL on a single server ran 2000 times faster than a cluster of a certain top distributed database. In many similar scenarios, SPL delivers speedups from several times to dozens of times.&lt;/p&gt;




&lt;p&gt;In summary, SPL, as a specialized data processing language, adopts a very unconventional grid-style programming model that brings convenience in format, debugging and more (though those used to text-style programming will need to adapt). In syntax, SPL incorporates new features such as options, multi-layer parameters and Lambda syntax, which make it interesting to use. These features all serve data computing, and stem from SPL's understanding of structured data computation, which is deeper and more complete than SQL's. Only with that understanding could such features be developed, and only with these features can data be processed more simply, conveniently and quickly. Simpler coding and faster execution is what SPL aims at, and along the way the application framework is improved as well (not detailed here).&lt;/p&gt;

&lt;p&gt;In short, SPL is a programming language that is worth trying.&lt;/p&gt;

&lt;p&gt;GitHub Link: &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>SQL is consuming the lives of data scientists</title>
      <dc:creator>Pratik Singh</dc:creator>
      <pubDate>Fri, 10 Nov 2023 12:03:00 +0000</pubDate>
      <link>https://forem.com/kitarp29/interpreting-low-code-from-the-perspective-of-open-source-spl-1k8i</link>
      <guid>https://forem.com/kitarp29/interpreting-low-code-from-the-perspective-of-open-source-spl-1k8i</guid>
      <description>&lt;p&gt;SQL is widely used, and data scientists (analysts) often need to use SQL to query and process data in their daily work. Many enterprises hold the view that as long as the IT department builds a data warehouse (data platform) and provides SQL, data scientists can freely query and analyze enterprise data.&lt;/p&gt;

&lt;p&gt;This view seems plausible, since SQL does enable data scientists to query and calculate data. Moreover, SQL looks very much like English and seems easy to get started with; some simple SQL statements can even be read directly as English.&lt;/p&gt;

&lt;p&gt;For example, the SQL statement for filtering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Select id,name from T where id=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This statement reads almost like the English sentence “Query id and name from T if id equals 1”.&lt;/p&gt;

&lt;p&gt;Another example: the SQL statement for grouping and aggregating:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Select area,sum(amount) from T group by area
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This statement is also very similar to the English “Summarize amount by area from T”.&lt;/p&gt;

&lt;p&gt;Looking like English (natural language) has a significant benefit: it is simple to code. Implementing data queries in a near-natural language makes it possible even for business personnel (data scientists are often familiar with business but not proficient in IT) to master, which is exactly the original intention behind SQL's design: to let ordinary business personnel use it.&lt;/p&gt;

&lt;p&gt;But how does it go in practice?&lt;/p&gt;

&lt;p&gt;If all calculation tasks were as simple as filtering and grouping, most business personnel could indeed master SQL, and coding would be easy. However, the business scenarios data scientists face are often not that simple. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find out the top n customers whose sales account for 50% and 80% of the total sales based on the sales data, so as to carry out precision marketing;&lt;/li&gt;
&lt;li&gt;Analyze which restaurants are most popular, which time periods are the busiest, and which dishes are the most popular based on the number of customers, consumption amount, consumption time, consumption location and other data of each chain restaurant;&lt;/li&gt;
&lt;li&gt;Calculate each model’s sales, sales volume, average price and sales regions, etc., based on car sales data, so as to analyze which models are the hottest and which models need to adjust price and improve design;&lt;/li&gt;
&lt;li&gt;Find out stocks that have experienced a rise by the daily limit for three consecutive trading days (rising rate &amp;gt;=10%) based on the stock trading data to construct an investment portfolio;&lt;/li&gt;
&lt;li&gt;Calculate the maximum number of days that a certain stock keeps rising based on its market data to evaluate its historical performance;&lt;/li&gt;
&lt;li&gt;Conduct a user analysis based on game login data, listing the first login records of users and the time interval from the last login, and counting users’ number of logins within three days prior to the last login;&lt;/li&gt;
&lt;li&gt;Evaluate whether a customer will default on the loan based on some data of his/her account like balance, transaction history and credit rating, and identify which customers are most likely to default;&lt;/li&gt;
&lt;li&gt;Determine which patients are most in need of prevention and treatment based on their medical records, diagnostic results and treatment plans;&lt;/li&gt;
&lt;li&gt;Calculate each user's monthly average call duration, data consumption and peak consumption time period, and identify high-consumption users, based on operator data such as call records, SMS records and data usage;&lt;/li&gt;
&lt;li&gt;Perform a funnel analysis based on e-commerce users’ behavior data, calculating the user churn rate after each event such as page browsing, adding to cart, placing order, and paying;&lt;/li&gt;
&lt;li&gt;Divide customers into different groups such as the group having strong purchasing power, the group preferring women's clothing, the group preferring men's clothing based on the e-commerce company's customer data such as purchase history and preferences to facilitate developing different promotional activities for different groups;&lt;/li&gt;
&lt;li&gt;…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These examples are only a tiny fraction of real calculation tasks, but they show that most data analysis with real business meaning is somewhat complex, rather than simple filtering and grouping. For such analysis, coding in SQL is not easy, and sometimes nearly impossible. Let's attempt several of these tasks in SQL to see how difficult they are.&lt;/p&gt;

&lt;h4&gt;
  
  
  Find out the top n customers whose cumulative sales account for half of the total sales, and sort them by sales in descending order:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with A as
 (select customer, amount, row_number() over(order by amount) ranking
    from orders)
  select customer, amount
    from (select customer,
                 amount,
                 sum(amount) over(order by ranking) cumulative_amount
            from A)
   where cumulative_amount &amp;gt; (select sum(amount) / 2 from orders)
   order by amount desc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
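&lt;p&gt;For contrast, the stepwise logic behind this task — sort by sales descending, accumulate until half of the total is covered — reads naturally in an ordinary procedural language. A Python sketch with made-up data:&lt;/p&gt;

```python
# Top customers whose cumulative sales reach half of the total:
# sort descending, accumulate, stop at 50%.
def top_half_customers(orders):
    total = sum(amount for _, amount in orders)
    picked, running = [], 0
    for customer, amount in sorted(orders, key=lambda o: -o[1]):
        if running >= total / 2:
            break
        picked.append(customer)
        running += amount
    return picked

orders = [("a", 90), ("b", 60), ("c", 30), ("d", 20)]
```

&lt;p&gt;Each step is a separate, checkable statement, which is exactly what the nested SQL above cannot offer.&lt;/p&gt;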



&lt;h4&gt;
  
  
  Find out stocks that have experienced a rise by the daily limit for three consecutive trading days (rising rate &amp;gt;=10%):
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with A as
 (select code,
         trade_date,
         close_price / lag(close_price) over(partition by code order by trade_date) - 1 rising_range
    from stock_price),
B as
 (select code,
         Case
           when rising_range &amp;gt;= 0.1 and lag(rising_range)
            over(partition by code order by trade_date) &amp;gt;= 0.1 and
                lag(rising_range, 2)
            over(partition by code order by trade_date) &amp;gt;= 0.1 then
            1
           Else
            0
         end rising_three_days
    from A)

select distinct code from B where rising_three_days = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
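&lt;p&gt;The same requirement, written stepwise — group by stock code, order by date, count the streak of ≥10% rises in one pass — looks like this in Python (an illustrative sketch; the quote tuples are hypothetical):&lt;/p&gt;

```python
# Stocks that rose at least 10% on three consecutive trading days:
# group by code, order by date, and count the streak in one pass.
def limit_up_3_days(quotes):
    by_code = {}
    for code, date, close in sorted(quotes):     # ordered by code, then date
        by_code.setdefault(code, []).append(close)
    result = set()
    for code, closes in by_code.items():
        streak = 0
        for prev, cur in zip(closes, closes[1:]):
            streak = streak + 1 if cur / prev - 1 >= 0.1 else 0
            if streak >= 3:
                result.add(code)
                break
    return result

quotes = [("S1", d, p) for d, p in enumerate([10.0, 11.5, 13.3, 15.4])] + \
         [("S2", d, p) for d, p in enumerate([10.0, 11.5, 13.3, 14.0])]
```
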



&lt;h4&gt;
  
  
  Calculate the maximum number of trading days that a certain stock keeps rising:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT max(consecutive_day)
  FROM (SELECT count(*) consecutive_day
          FROM (SELECT sum(rise_or_fall) OVER(ORDER BY trade_date) day_no_gain
                  FROM (SELECT trade_date,
                               CASE
                                 when close_price &amp;gt; lag(close_price)
                                  OVER(ORDER BY trade_date) then
                                  0
                                 Else
                                  1
                               end rise_or_fall
                          FROM stock_price))
         GROUP BY day_no_gain)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  E-commerce funnel analysis (this code only counts the users of three steps: page browsing, adding to cart, and placing an order):
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with e1 as
 (select uid, 1 as step1, min(etime) as t1
from event
   where etime &amp;gt;= to_date('2021-01-10')
     and etime &amp;lt; to_date('2021-01-25')
     and eventtype = 'eventtype1'
     and …
   group by 1),
e2 as
 (select uid, 1 as step2, min(e1.t1) as t1, min(e2.etime) as t2
from event as e2
   inner join e1
      on e2.uid = e1.uid
   where e2.etime &amp;gt;= to_date('2021-01-10')
     and e2.etime &amp;lt; to_date('2021-01-25')
     and e2.etime &amp;gt; t1
     and e2.etime &amp;lt; t1 + 7
     and eventtype = 'eventtype2'
     and …
   group by 1),
e3 as
 (select uid, 1 as step3, min(e2.t1) as t1, min(e3.etime) as t3
from event as e3
   inner join e2
      on e3.uid = e2.uid
   where e3.etime &amp;gt;= to_date('2021-01-10')
     and e3.etime &amp;lt; to_date('2021-01-25')
     and e3.etime &amp;gt; t2
     and e3.etime &amp;lt; t1 + 7
     and eventtype = 'eventtype3'
     and …
   group by 1)
select sum(step1) as step1, sum(step2) as step2, sum(step3) as step3
  from e1
  left join e2
    on e1.uid = e2.uid
  left join e3
    on e2.uid = e3.uid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every one of these tasks nests multiple layers of subqueries. Some of the SQL is not long but still hard to understand (like the example of the maximum number of rising days), let alone write; some tasks are so hard that they are almost impossible to code (like the funnel analysis).&lt;/p&gt;

&lt;p&gt;Coding simple calculations in SQL is indeed easy and convenient, but it stops being easy once the task gets slightly complex, and real computing tasks, especially those data scientists face, are mostly quite complex. Moreover, simple tasks hardly require a data scientist to write code at all, since many BI tools provide a visual interface from which simple queries can be dragged out directly. We can therefore basically conclude that:&lt;/p&gt;

&lt;h4&gt;
  
  
  The SQL code that needs data scientists to write is not simple!
&lt;/h4&gt;

&lt;p&gt;What consequences will this cause?&lt;/p&gt;

&lt;p&gt;It directly leads to a situation where data scientists spend a great deal of time and energy writing complex SQL, resulting in low work efficiency. In short, &lt;strong&gt;SQL is consuming the lives of data scientists.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How SQL consumes the lives of data scientists
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Difficult to code when encountering complex tasks
&lt;/h4&gt;

&lt;p&gt;As the SQL examples above show, some code is short yet hard to understand and even harder to write. One reason for this is that &lt;strong&gt;English-like SQL makes stepwise computing difficult.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SQL was designed as an English-like language so that business (non-technical) personnel could use it too. As discussed earlier, that goal can indeed be achieved for simple calculations. But for professional analysts like data scientists, the computing scenarios are much more complex, and once a calculation task becomes complex, this design causes difficulties instead of bringing convenience.&lt;/p&gt;

&lt;p&gt;One strength of natural language is that it can express things fuzzily, yet SQL must follow very strict syntax, and the interpreter rejects any minor non-compliance. As a result, instead of benefiting from being English-like, SQL suffers serious disadvantages from it. A syntax designed to look like natural language seems easy to master, but the opposite is true.&lt;/p&gt;

&lt;p&gt;To make a whole SQL statement conform to English habits, many unnecessary prepositions have to be added. For example, FROM, which names the main operand of a statement, is placed at the end; a redundant BY follows GROUP.&lt;/p&gt;

&lt;p&gt;The main disadvantage of resembling natural language is the &lt;strong&gt;lack of procedurality&lt;/strong&gt;. Stepwise computing is an effective way to handle complex calculations, and almost every high-level language supports it. Natural language does not: it relies on a few pronouns to link sentences, and since pronouns cannot describe relationships sufficiently and accurately, the common practice is to cram as much as possible into one sentence, producing a pile of subordinate clauses in complex situations. Manifested in SQL, this means multiple actions such as SELECT, WHERE and GROUP must go into a single statement; WHERE and HAVING, for example, mean the same thing yet both must be used to mark the difference in position. Consequently, when query requirements grow complex, one SQL statement ends up nesting many layers of subqueries, which inevitably makes it hard to write and to understand. And this is exactly what happens in practice: the complex SQL statements analysts face are rarely measured in lines but often in KBs. Writing the same 100 lines of code as 100 statements or as one statement are completely different things in complexity. Such a statement is hard to understand; even the programmers who worked it out with great effort may have no idea what it means two months later.&lt;/p&gt;

&lt;p&gt;Beyond the lack of procedurality, a more important reason SQL is hard to code in is a defect in its theoretical basis: relational algebra, born 50 years ago, &lt;strong&gt;lacks necessary data types and operations&lt;/strong&gt;, making it very hard to support modern data analysis.&lt;/p&gt;

&lt;p&gt;Although the SQL system has the concept of a record, it has no explicit record data type: a single record is treated as a temporary table with one row, i.e., a single-member set. This lack of &lt;strong&gt;discreteness&lt;/strong&gt; prevents data scientists from handling analytical tasks in a natural way of thinking, causing serious difficulty in understanding and coding.&lt;/p&gt;

&lt;p&gt;Take the funnel analysis above. Although CTE syntax gives SQL a degree of stepwise computing ability, the code is still very complicated: every subquery must join the original table with the result of the previous subquery, and such roundabout JOINs are hard to write, beyond the ability of many data scientists. The natural approach is simply to group by user, sort each group by time, and traverse each group (user) of data separately, with the specific steps depending on actual requirements. Treating qualifying rows or grouped data as discrete records to compute on (i.e., discreteness) greatly simplifies funnel analysis. Unfortunately, SQL cannot provide such ordered calculation because it lacks discreteness, so it has to join repeatedly, which is both hard to code and slow to run.&lt;/p&gt;
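&lt;p&gt;The natural procedure — group events by user, sort by time, walk the steps in order — can be sketched in Python. This is an illustrative sketch: the event names, the three-step funnel and the 7-day window are made-up assumptions, not the article's actual schema:&lt;/p&gt;

```python
# Funnel count following the natural procedure: group events by user,
# sort each user's events by time, then walk the steps in order.
from collections import defaultdict

FUNNEL = ["view", "cart", "order"]   # hypothetical step names

def funnel_counts(events, window=7):
    by_user = defaultdict(list)
    for uid, etype, t in events:
        by_user[uid].append((t, etype))
    counts = [0] * len(FUNNEL)
    for evs in by_user.values():
        evs.sort()                         # sort each user's events by time
        step, t0, last = 0, None, None
        for t, etype in evs:
            if step == len(FUNNEL):
                break
            if etype != FUNNEL[step]:
                continue
            if step == 0:
                t0 = t
            elif not (last < t < t0 + window):
                continue                   # must follow the prior step within the window
            counts[step] += 1
            last = t
            step += 1
    return counts

events = [
    ("u1", "view", 1), ("u1", "cart", 2), ("u1", "order", 3),
    ("u2", "view", 1), ("u2", "cart", 4),
    ("u3", "view", 1), ("u3", "order", 2),   # skipped the cart step
    ("u4", "view", 1), ("u4", "cart", 9),    # cart falls outside the window
]
```

&lt;p&gt;Each user is handled independently and in time order, with no self-joins at all.&lt;/p&gt;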

&lt;p&gt;In fact, the concept of discrete records is commonplace in high-level languages such as Java and C++, but SQL does not support it. Relational algebra defines rich set operations but is poor in discreteness, so SQL struggles to describe complex multi-step operations (and performs poorly on them). Such a theoretical defect cannot be fixed by engineering means.&lt;/p&gt;

&lt;p&gt;For more information about the defects of SQL in data types and operations, visit: &lt;a href="https://blog.scudata.com/why-a-sql-statement-often-consists-of-hundreds-of-lines-measured-by-kbs%ef%bc%9f-2/" rel="noopener noreferrer"&gt;Why a SQL Statement Often Consists of Hundreds of Lines, Measured by KBs?&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Difficult to debug
&lt;/h4&gt;

&lt;p&gt;In addition to being difficult to code, it is difficult to debug SQL code, which exacerbates the phenomenon “consuming the lives of data scientists”.&lt;/p&gt;

&lt;p&gt;Debugging SQL is notoriously difficult. The more complex the code, the harder it is to debug, yet complex SQL is exactly the code most in need of debugging, since correctness must always come first.&lt;/p&gt;

&lt;p&gt;When we execute a long SQL code that nests sub-queries and find the result is incorrect, how should we debug? Under normal circumstances, the only thing we can do is to split the code and execute layer by layer to ascertain the problem.&lt;/p&gt;

&lt;p&gt;However, when the SQL statement is too complex, this debugging method becomes very time-consuming and difficult, because the statement may contain a large number of nested subqueries and association queries that are not easy to split.&lt;/p&gt;

&lt;p&gt;Despite the fact that many SQL editors provide an interactive development interface, it does not help much with debugging complex statements. The difficulty of debugging further drags down development efficiency.&lt;/p&gt;

&lt;h4&gt;
  
  
  Low performance
&lt;/h4&gt;

&lt;p&gt;Besides the two shortcomings above, complex SQL code often runs slowly. Low performance means waiting; in some big data scenarios, data scientists wait for hours or even a day, and their lives are consumed in the process. Worse, if the result turns out to be wrong after the long wait, the whole process has to be repeated, multiplying the time cost.&lt;/p&gt;

&lt;p&gt;Why does complex SQL code run slowly?&lt;/p&gt;

&lt;p&gt;The query performance of complex SQL depends mainly on the database's optimization engine. A good database adopts more efficient algorithms based on the calculation goal rather than executing the literally expressed logic of the SQL. However, automatic optimization often fails in complex situations, and the overly transparent mechanism makes it hard for us to intervene in the execution path manually, let alone make SQL execute an algorithm we specify.&lt;/p&gt;

&lt;p&gt;Let's take a simple example: take the top 10 out of 100 million rows of data. The SQL is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT TOP 10 x FROM T ORDER BY x DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although this code contains sorting words (ORDER BY), the database's optimization engine will not actually do a big sort (sorting big data is very slow) and will choose a more efficient algorithm instead.&lt;/p&gt;
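&lt;p&gt;The kind of more efficient algorithm an engine substitutes here is a small-heap scan: keep only the current best 10 and never sort the full dataset. A Python sketch of the idea (illustrative, with random data):&lt;/p&gt;

```python
# Top 10 of a large dataset without a full sort: keep a min-heap of the
# best 10 seen so far and scan the data once.
import heapq
import random

def top_n(values, n=10):
    heap = []
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)   # evict the smallest of the current top n
    return sorted(heap, reverse=True)

random.seed(0)
data = [random.randrange(1_000_000) for _ in range(100_000)]
```

&lt;p&gt;The scan costs O(N log n) instead of the O(N log N) of a full sort, which is the whole point of the optimization.&lt;/p&gt;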

&lt;p&gt;If we make slight changes to the task above: calculate the top 10 in each group, then SQL code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount
  DESC) rn
FROM Orders )
WHERE rn&amp;lt;=10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although this code is not much more complex, the optimization engines of most databases get confused: they cannot identify its real intention and fall back to executing the literally expressed logic (ORDER BY still appears in the statement), carrying out a sort. As a result, performance drops sharply.&lt;/p&gt;
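&lt;p&gt;Yet the grouped variant admits exactly the same trick: one small min-heap per group, filled in a single pass. A Python sketch (illustrative data and field names):&lt;/p&gt;

```python
# Grouped top 10 in one pass: a small min-heap per group, instead of
# the global sort the window-function SQL tends to trigger.
import heapq
from collections import defaultdict

def top_n_by_group(rows, n=10):
    heaps = defaultdict(list)
    for area, amount in rows:
        h = heaps[area]
        if len(h) < n:
            heapq.heappush(h, amount)
        elif amount > h[0]:
            heapq.heapreplace(h, amount)
    return {area: sorted(h, reverse=True) for area, h in heaps.items()}

rows = [("east", i) for i in range(100)] + [("west", 2 * i) for i in range(50)]
```

&lt;p&gt;Nothing in the task itself forces a sort; the sort comes from how SQL has to phrase it.&lt;/p&gt;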

&lt;p&gt;SQL code in real-world business is much more complex than this, and failure to identify the real intention of the code is quite common for a database's optimization engine. For example, the SQL statement for funnel analysis mentioned earlier needs to join repeatedly, which makes it hard to write and extremely slow to execute.&lt;/p&gt;

&lt;p&gt;Of course, user-defined functions (UDFs) can extend SQL's ability, allowing us to implement the algorithms we want. However, this method is often unrealistic. Leaving aside the fact that the database's storage cannot guarantee performance when the algorithm changes, the difficulty of implementing a UDF is beyond the technical ability of the vast majority of data scientists. Even if a UDF is implemented with great effort, it still faces the complexity problem mentioned earlier, and performance is often not guaranteed.&lt;/p&gt;

&lt;h4&gt;
  
  
  Closedness
&lt;/h4&gt;

&lt;p&gt;The shortcomings of SQL don't stop there.&lt;/p&gt;

&lt;p&gt;SQL is the formal language of the database, yet the database is closed, which makes data processing difficult. This closedness means that the data to be computed must be loaded into the database in advance; there is a clear boundary between data inside and outside the database.&lt;/p&gt;

&lt;p&gt;In practice, however, data analysts often need to process data from other sources, including text files, Excel, program interfaces, web crawlers and so on. Some of these data are only used temporarily, and if they must be loaded into the database before every use, they not only occupy database space, but the ETL process also consumes a lot of time. In addition, loading data into a database is usually constrained: non-compliant data cannot be written, so analysts have to spend time and effort organizing the data first, and then write the organized data to the database (which is itself time-consuming). Once time is wasted, life is wasted.&lt;/p&gt;

&lt;p&gt;Of course, besides SQL, data scientists have other tools like Java and Python. Then, do these tools work?&lt;/p&gt;

&lt;p&gt;Java supports procedural calculation and has good discreteness, but its support for set operations is poor. Java lacks the basic data types and computing libraries for structured data, which makes data processing extremely cumbersome. For example, the grouping and aggregation that is easy to express in SQL is not easy in Java, and filtering, joins and mixed operations are more difficult still. Moreover, Java is too heavyweight for data analysts and has very poor interactivity, so even if any calculation can in theory be implemented in Java, it is not usable in practice.&lt;/p&gt;

&lt;p&gt;Compared with Java, Python is a little better. Python has richer computing libraries and is simpler for implementing the same calculation (its computing ability is often comparable to SQL). However, for complex calculations it is also cumbersome, and Python does not have much advantage over SQL. Moreover, Python's interactivity is not good either (it still needs to print intermediate results manually). And, due to the lack of a true parallel computing mechanism and storage guarantees, Python also faces performance issues in big data computing.&lt;/p&gt;

&lt;p&gt;Is there any other choice?&lt;/p&gt;

&lt;h3&gt;
  
  
  esProc SPL, a tool that rescues data scientists
&lt;/h3&gt;

&lt;p&gt;For data scientists who often process structured data, esProc SPL is a tool that is well worth adding to their data analysis “arsenal”.&lt;/p&gt;

&lt;p&gt;esProc is a tool specifically designed for processing structured data, and its formal language, SPL (Structured Process Language), offers data processing abilities completely different from SQL and Java.&lt;/p&gt;

&lt;h4&gt;
  
  
  Simpler in coding
&lt;/h4&gt;

&lt;p&gt;Firstly, let's take a look at how SPL differs from SQL in accomplishing the tasks mentioned above:&lt;/p&gt;

&lt;h4&gt;
  
  
  Find out the top n customers whose cumulative sales account for half of the total sales, and sort them by sales in descending order:
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx7x916ify6yngai8ilg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx7x916ify6yngai8ilg.png" alt="Image description" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Calculate the maximum number of days that a stock keeps rising:
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn2ncdhjtisg0gplqjqq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn2ncdhjtisg0gplqjqq.png" alt="Image description" width="800" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Find out stocks that have experienced a rise by the daily limit for three consecutive trading days (rising rate &amp;gt;=10%):
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1y4n36hqxs6ncj74wsrw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1y4n36hqxs6ncj74wsrw.png" alt="Image description" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Funnel analysis of an e-commerce business:
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flf8nvnbho4h1f7uq4tix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flf8nvnbho4h1f7uq4tix.png" alt="Image description" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the SPL code above, we can see that SPL is simpler than SQL. Even those who don't know SPL syntax can basically understand this code, and once they familiarize themselves with the syntax, implementing these calculations is not difficult.&lt;/p&gt;

&lt;p&gt;The reason why SPL is simpler is that it &lt;strong&gt;naturally supports procedural calculation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As mentioned earlier, procedural calculation can effectively reduce the implementation difficulty of complex business logic, and the resulting improvement in development efficiency helps data scientists create more value. Although CTE syntax and stored procedures give SQL procedural computing ability to a certain extent, it is far from enough. In contrast, SPL naturally supports procedural calculation and can divide a complex calculation into multiple steps, thereby reducing implementation difficulty.&lt;/p&gt;

&lt;p&gt;For example, for calculating the number of days that a stock keeps rising, SPL lets us follow the natural train of thought: sort by trading day first, then compare each day's closing price with the previous day's (if it is greater, increment an intermediate counter, otherwise reset it to zero), and finally take the maximum value in the sequence, which is the answer. The entire calculation requires no nesting and can be implemented step by step along natural thinking; these are the benefits of procedural calculation. Likewise, for funnel analysis, stepwise computing reduces the implementation difficulty, and the code is more universal: it can handle a funnel with any number of steps (only a parameter needs to change).&lt;/p&gt;

&lt;p&gt;Another reason why SPL is simpler is that it &lt;strong&gt;provides richer data types and computing libraries&lt;/strong&gt;, which can further simplify calculation.&lt;/p&gt;

&lt;p&gt;SPL provides a professional structured data object, the &lt;strong&gt;table sequence&lt;/strong&gt;, and offers a rich computing library based on it, giving SPL complete and simple structured data processing ability.&lt;/p&gt;

&lt;p&gt;Below are some of the conventional calculations in SPL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Orders.sort(Amount) // sort
Orders.select(Amount*Quantity&amp;gt;3000 &amp;amp;&amp;amp; like(Client,"*S*")) // filter
Orders.groups(Client; sum(Amount)) // group
Orders.id(Client) // distinct
join(Orders:o,SellerId ; Employees:e,EId) // join
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By means of procedural calculation and the table sequence, SPL can implement many more calculations. For example, SPL supports &lt;strong&gt;ordered operations&lt;/strong&gt; more directly and thoroughly. In the SPL code above for calculating the number of days that a stock rises, [-1] references the previous record to compare stock prices. If we want to calculate a moving average, we can write avg(price[-1:1]). Through ordered operations, the maximum number of consecutive rising days can be coded this way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stock.sort(trade_date).group@i(close_price&amp;lt;close_price[-1]).max(~.len())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the &lt;strong&gt;grouping operation&lt;/strong&gt;, SPL can retain the grouped subsets, i.e., a set of sets, which makes it convenient to perform further operations on the grouped result. In contrast, SQL has no explicit set data type and cannot return a set of sets; since SQL cannot perform grouping independently, grouping and aggregation have to be bound together.&lt;/p&gt;
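&lt;p&gt;As an illustrative sketch (the Area and Amount field names are assumptions, not from the article), a grouping that keeps its subsets could look like this in SPL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Orders.group(Area)                    // a set of sets: one subset per area
Orders.group(Area).(~.maxp(Amount))   // operate further on each subset (~)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;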

&lt;p&gt;In addition, SPL has a new understanding of &lt;strong&gt;aggregation operations&lt;/strong&gt;. Besides common single-value aggregates like SUM, COUNT, MAX and MIN, the aggregation result can be a set. For example, SPL regards the common TOPN as an aggregation just like SUM and COUNT, which can be performed either on a whole set or on grouped subsets.&lt;/p&gt;
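&lt;p&gt;For illustration (field names are assumed), TOPN as an aggregation might be written like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Orders.top(-10;Amount)                // top 10 by amount over the whole set
Orders.groups(Area;top(-10;Amount))   // top 10 within each group, no full sort
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;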

&lt;p&gt;In fact, SPL has many other features, making it more complete than SQL and richer than Java/Python. For example, the &lt;strong&gt;discreteness&lt;/strong&gt; allows the records that make up a data table to exist dissociatively and be computed repeatedly; the &lt;strong&gt;universal set&lt;/strong&gt; supports the set composed of any data, and allows such set to participate in computation; the &lt;strong&gt;join operation&lt;/strong&gt; distinguishes three different types of joins, allowing us to choose an appropriate one according to actual situation...&lt;/p&gt;

&lt;p&gt;These features enable data scientists to process data more simply and efficiently, putting an end to the waste of lives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Easy to edit and debug
&lt;/h3&gt;

&lt;p&gt;Another factor affecting development efficiency is the debugging environment. How to let data scientists debug code and interact with data more conveniently is also a key consideration for SPL. For this purpose, SPL provides an independent IDE:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjnof6draahq8etcmo49m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjnof6draahq8etcmo49m.png" alt="Image description" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unlike programming languages that use plain text, SPL adopts grid-style code. The grid has some natural advantages, mainly in three aspects. First, there is &lt;strong&gt;no need to define variables&lt;/strong&gt; when coding: by referencing the name of a previous cell (such as A1) directly in subsequent steps, we can use that cell's calculation result, avoiding the effort of naming variables (SPL also supports defining variables, of course). Second, the &lt;strong&gt;grid-style code looks very neat&lt;/strong&gt;: even if the code in a cell is very long, it occupies only that cell and does not affect the structure of the whole program, making code easier to read. Third, the IDE of SPL provides &lt;strong&gt;multiple debugging ways&lt;/strong&gt;, such as run, debug, and run to cursor. In short, easy-to-use editing and debugging functionality improves coding efficiency.&lt;/p&gt;

&lt;p&gt;Moreover, on the right side of the IDE there is a &lt;strong&gt;result panel&lt;/strong&gt;, which displays the calculation result of each cell in real time. Viewing the result of each step in real time further improves the convenience of debugging. With this feature, data scientists can not only implement routine data analysis easily, but also conduct interactive analysis, deciding what to do next based on the result of the previous step. It also makes it convenient to review the result of any intermediate step.&lt;/p&gt;

&lt;h3&gt;
  
  
  High performance
&lt;/h3&gt;

&lt;p&gt;Supporting procedural calculation and providing rich computing libraries allow SPL to quickly accomplish data analysis tasks, and its easy-to-use IDE further improves development efficiency. Besides, what about the performance of SPL? After all, the computing performance is also crucial for data scientists.&lt;/p&gt;

&lt;p&gt;To cope with big data computing scenarios where the amount of data exceeds memory capacity, SPL offers a cursor-based computing method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=file("orders.txt").cursor@t(area,amount).groups(area;sum(amount):amount)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Moreover, SPL provides parallel computing support for both in-memory and external storage calculations. By adding just the @m option, parallel computing is enabled and the advantages of multi-core CPUs are fully utilized, which is very convenient.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=file("orders.txt").cursor@tm(area,amount;4).groups(area;sum(amount):amount)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to cursors and parallel computing, SPL offers many built-in high-performance algorithms. For example, since SPL treats the previously mentioned TOPN as an ordinary aggregation operation, the corresponding statement avoids a full sort, so execution is more efficient.&lt;/p&gt;

&lt;p&gt;Similarly, SPL provides many such high-performance algorithms, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In-memory computing: binary search, sequence number positioning, position index, hash index, multi-layer sequence number positioning...&lt;/li&gt;
&lt;li&gt;External storage search: binary search, hash index, sorting index, index-with-values, full-text retrieval...&lt;/li&gt;
&lt;li&gt;Traversal computing: delayed cursor, multipurpose traversal, parallel multi-cursor, ordered grouping and aggregating, sequence number grouping...&lt;/li&gt;
&lt;li&gt;Foreign key association: foreign key addressization, foreign key sequence-numberization, index reuse, aligned sequence, one-side partitioning...&lt;/li&gt;
&lt;li&gt;Merge and join: ordered merging, merge by segment, association positioning, attached table...&lt;/li&gt;
&lt;li&gt;Multidimensional analysis: partial pre-aggregation, time period pre-aggregation, redundant sorting, boolean dimension sequence, tag bit dimension...&lt;/li&gt;
&lt;li&gt;Cluster computing: cluster multi-zone composite table, duplicate dimension table, segmented dimension table, redundancy-pattern fault tolerance and spare-wheel-pattern fault tolerance, load balancing...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To give full play to these high-performance algorithms, SPL also designs high-performance file storage, adopting multiple performance assurance mechanisms such as code compression, columnar storage, indexing and segmentation. With flexible and efficient storage available, data scientists can design the storage form (such as sorting, index and attached table) based on the calculation to be performed and the characteristics of the data, and then adopt more efficient algorithms based on that storage form so as to obtain extreme performance. Saving time is saving lives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Openness
&lt;/h3&gt;

&lt;p&gt;Unlike a database, which requires loading data before calculation (closedness), SPL can compute directly against diverse data sources, and hence has good openness.&lt;/p&gt;

&lt;p&gt;SPL does not have the "base" concept of traditional data warehouses, nor the concept of metadata, let alone constraints. Any accessible data source can be regarded as SPL data and computed directly. Importing data into a database before calculation is not required, nor is deliberately exporting it afterwards; the calculation result can be written to the target data source through an interface.&lt;/p&gt;

&lt;p&gt;SPL encapsulates access interfaces for common data sources such as relational databases (via JDBC), MongoDB, HBase, HDFS, HTTP/Restful, Salesforce and SAP BW. Logically, these data sources have basically the same status and can be computed separately or in combination once accessed; the only differences are the access interfaces themselves and their performance.&lt;/p&gt;

&lt;p&gt;With this openness, data scientists can process data from diverse sources directly and quickly, saving the time spent organizing data and importing it into or exporting it from a database, and improving data processing efficiency.&lt;/p&gt;
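&lt;p&gt;As a hedged sketch (the data source name "mysql", the file name and the field names are illustrative assumptions, not from the article), mixed computation over a database table and a CSV file could look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=connect("mysql").query("select * from Orders")   // database table (assumed source name)
=file("clients.csv").import@tc()                  // local CSV file, no loading into a database
=join(A1:o,ClientId; A2:c,Id)                     // mixed calculation across the two sources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;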

&lt;p&gt;Overall, SPL provides data scientists with comprehensive structured data processing ability, and structured data is currently the top priority of data analysis. With SPL, not only is analysis faster, but performance is also sufficiently guaranteed. Only with a tool that is simple to code, fast to run, and good in openness and interactivity will the lives of data scientists not be wasted.&lt;/p&gt;

&lt;p&gt;More reference: &lt;a href="https://blog.scudata.com/sql-is-consuming-the-lives-of-data-scientists/" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The link to their GitHub repo: &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>My journey to NASDAQ</title>
      <dc:creator>Pratik Singh</dc:creator>
      <pubDate>Sun, 13 Aug 2023 11:22:01 +0000</pubDate>
      <link>https://forem.com/kitarp29/my-journey-to-nasdaq-2b8o</link>
      <guid>https://forem.com/kitarp29/my-journey-to-nasdaq-2b8o</guid>
      <description>&lt;p&gt;This is my journey to becoming a Senior Software Developer at &lt;a href="//nasdaq.com"&gt;NASDAQ&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Nasdaq Stock Market is an American stock exchange based in New York City. It is the most active stock trading venue in the US by volume and ranked second on the list of stock exchanges by market capitalization of shares traded, behind the New York Stock Exchange. The exchange platform is owned by Nasdaq, Inc., which also owns the Nasdaq Nordic stock market network and several U.S.-based stock and options exchanges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;I am &lt;a href="https://www.linkedin.com/in/kitarp29/" rel="noopener noreferrer"&gt;&lt;strong&gt;Pratik Singh&lt;/strong&gt;&lt;/a&gt;, just a recent engineering pass-out from Bangalore.&lt;br&gt;
In this blog post, I will describe the application process, the interview process, and my experience working at this company.&lt;/p&gt;


&lt;h2&gt;
  
  
  My Story
&lt;/h2&gt;

&lt;p&gt;In this section, I will discuss my story with NASDAQ.&lt;br&gt;
One random day I got a message from HR at NASDAQ about an opening at the company. It was a role related to Golang and DevOps. I expressed my interest and shared my resume.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgdvu600335hdjoeddlx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgdvu600335hdjoeddlx.jpg" alt="Image description" width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Selection Process
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interview with Manager&lt;/strong&gt; 🙇🏻 :&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For me, there was no online assignment at first. I am not sure of the reason, but I guess since they had handpicked my profile, my manager wanted to know me better.&lt;br&gt;
It was a technical interview. We talked a lot about DevOps and system design. He focused on the tools and architecture they use at work. What helped me were the experience and learnings from my previous internships.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coding Interview&lt;/strong&gt; ⚔️ :&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yes! There was a DSA round for this job role. I revised the basics of various data structures. I was a bit rusty, to be honest, but my interviewer was very helpful. The first question was very easy, while the second was based on Golang API calls.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;P.S.: I gave this interview from a Google office 😂&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interview with Manager's Manager&lt;/strong&gt; 🤺 :&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was scary, honestly! I was not sure I would be able to answer all the questions that might be asked of me. But to my surprise, it was a very lighthearted discussion. We discussed tech for a while, then startups and my past experiences.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HR Round&lt;/strong&gt; 👩🏻‍🦳:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not much to tell about this. My manager was impressed with me. He was interested in hiring me for his team. There were a lot of discussions about compensation and other stuff in this meeting.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hired as a &lt;strong&gt;&lt;u&gt;Senior Software Engineer &lt;/u&gt;✌️😊&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1675851481163988993-995" src="https://platform.twitter.com/embed/Tweet.html?id=1675851481163988993"&gt;
&lt;/iframe&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  My experience so far
&lt;/h2&gt;

&lt;p&gt;I love my job at NASDAQ! Our team uses Go and deploys on Kubernetes Clusters. My responsibilities include building and maintaining CI pipelines and completing releases. My manager's technical background is a blessing because we discuss work and bounce ideas together. The seniors on the team are helpful and have included me in several tasks. I am excited about the learning opportunities and experiences ahead at this company.&lt;/p&gt;

&lt;p&gt;If you liked this content, you can follow me on Twitter at &lt;a href="https://twitter.com/kitarp29" rel="noopener noreferrer"&gt;kitarp29&lt;/a&gt; for more!&lt;/p&gt;

&lt;p&gt;Thanks for reading my article :)&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>career</category>
      <category>devops</category>
    </item>
    <item>
      <title>Declutter your Kubernetes Cluster</title>
      <dc:creator>Pratik Singh</dc:creator>
      <pubDate>Mon, 24 Jul 2023 11:30:00 +0000</pubDate>
      <link>https://forem.com/kitarp29/declutter-your-kubernetes-cluster-51d1</link>
      <guid>https://forem.com/kitarp29/declutter-your-kubernetes-cluster-51d1</guid>
      <description>&lt;p&gt;This article will cover the basics to clean up your Kubernetes Cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites ✅
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Basic understanding of Kubernetes&lt;/li&gt;
&lt;li&gt;Understanding of Docker: &lt;a href="https://dev.to/kitarp29/defining-docker-4l06"&gt;Here&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🤔 Understanding the Problem
&lt;/h2&gt;

&lt;p&gt;A Kubernetes Cluster is the home for your applications. This platform is responsible for serving your users with little to no downtime. It only makes sense to keep it clean. Here are the three major reasons why, as I see them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Easier to manage&lt;/strong&gt;:
I have been an SRE, and I assure you a decluttered cluster is easier to manage. In times of fire, each second you can save counts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced resource utilization&lt;/strong&gt;:
Unused deployments take up resources that could be helpful to other deployments, and the scheduler has to work harder to find resources for pods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced cost&lt;/strong&gt;:
Save money. FinOps is a big task, and these unwanted resources add to it!&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Ways to Clean up your Cluster 💡
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgr6higf1prmlf2wsomvy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgr6higf1prmlf2wsomvy.gif" alt="Image description" width="498" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let's get started!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before we take a deep dive, you have to keep in mind these points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Never perform any of these actions without consulting the team.&lt;/li&gt;
&lt;li&gt;Resource optimization should not hamper functionality.&lt;/li&gt;
&lt;li&gt;Cache saves time.&lt;/li&gt;
&lt;li&gt;Have a backup strategy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let's talk about the ways to clean up your Cluster.&lt;br&gt;
Like most of the tasks in Kubernetes, there are three ways to achieve this.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manual Kubectl commands&lt;/li&gt;
&lt;li&gt;In-built service&lt;/li&gt;
&lt;li&gt;Third-party services&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  I. Manual Kubectl commands
&lt;/h3&gt;

&lt;p&gt;In this section, I will explain different strategies that can be used to optimize and clean your cluster. The remaining methods are abstractions on top of these. &lt;/p&gt;

&lt;p&gt;i. &lt;strong&gt;Remove deadweight&lt;/strong&gt;:&lt;br&gt;
Delete pods that are in the &lt;em&gt;Evicted / Error / Completed&lt;/em&gt; state. Be careful to check for stateful pods before deleting them.&lt;br&gt;
You can use this kubectl command to find such resources:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get pods --all-namespaces -o wide | grep Evicted&lt;/code&gt;&lt;/p&gt;
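&lt;p&gt;As a hedged follow-up sketch for the cleanup itself (review the matched pods first; failed-phase covers Evicted and Error pods, but not Completed ones):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Delete pods whose phase is Failed, across all namespaces
kubectl delete pods --field-selector=status.phase=Failed --all-namespaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;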

&lt;p&gt;ii. &lt;strong&gt;Use of in-built autoscaling&lt;/strong&gt;:&lt;br&gt;
HPA or &lt;em&gt;Horizontal Pod Autoscaling&lt;/em&gt; ensures resources are allocated efficiently and reduces the need for manual intervention.&lt;br&gt;
VPA or &lt;em&gt;Vertical Pod Autoscaling&lt;/em&gt; helps to avoid overprovisioning and makes the cluster cleaner by optimizing resource allocations.&lt;/p&gt;
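&lt;p&gt;For instance, an HPA can be created imperatively with kubectl; the deployment name and thresholds below are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Scale my-app between 2 and 10 replicas, targeting 70% average CPU
kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;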

&lt;p&gt;iii. &lt;strong&gt;Tidy ConfigMaps and Secrets&lt;/strong&gt;:&lt;br&gt;
Update or delete ConfigMaps that are not in use. It doesn't save a lot of space, but it certainly makes the life of DevOps folks easier.&lt;br&gt;
As far as secrets are concerned, the industry standard is to keep Kubernetes Secrets in secret vaults like GitLab Secrets, AWS KMS, or HashiCorp Vault.&lt;/p&gt;

&lt;p&gt;iv. &lt;strong&gt;Use DaemonSets and PDBs wisely&lt;/strong&gt;:&lt;br&gt;
DaemonSets create a pod on every node, and a PodDisruptionBudget can make pods hard to evict. Use both wisely!&lt;/p&gt;

&lt;p&gt;v. &lt;strong&gt;Labels and Annotations&lt;/strong&gt;:&lt;br&gt;
Label your resources; it becomes easier for DevOps folks to manage the cluster. Annotations help DevOps understand the ownership of a pod and contact the right person for debugging.&lt;/p&gt;

&lt;p&gt;I could add more steps, but they would be more about organizing your cluster than decluttering it. Moving on to the next method!&lt;/p&gt;

&lt;h3&gt;
  
  
  II. Kubernetes Garbage Collector
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqyu3tnsywr4squzmoaa.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqyu3tnsywr4squzmoaa.gif" alt="Image description" width="350" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes has an in-built garbage collector; read about it &lt;a href="https://kubernetes.io/docs/concepts/architecture/garbage-collection/" rel="noopener noreferrer"&gt;Here&lt;/a&gt;. There are multiple options available in the garbage collector, and several flags you can set in the Kubernetes configuration. Digging deep into this is the scope of another article.&lt;/p&gt;

&lt;p&gt;I will add the references for this at the end.&lt;/p&gt;

&lt;h3&gt;
  
  
  III. Third-party apps
&lt;/h3&gt;

&lt;p&gt;It is advisable to avoid third-party apps with permission to delete or clean resources. But to list some services I found for this task:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/cilium/cilium-cli" rel="noopener noreferrer"&gt;Cilium&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/werf/kubedog" rel="noopener noreferrer"&gt;KubeDog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Some less popular tools like Kleaner and more are there.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;I have never used any of these services on any of my clusters. I strongly advise against using them without doing a POC first.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Prefer the Kubernetes Garbage Collector over running manual kubectl commands. GC is handled by cluster components themselves (the kubelet for containers and images, and a controller for owner references), so any resulting drift is registered in etcd. Running multiple kubectl commands to delete a bunch of resources is never a good idea. Third-party apps should be the last resort; ideally, no external service should have admin access to your cluster. There is one more option: building a Kubernetes Operator from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Great commands for the use case: &lt;a href="https://www.sobyte.net/post/2022-03/common-commands-for-cleaning-up-kubernetes-cluster-resources/" rel="noopener noreferrer"&gt;Here&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Source code of K8s/garbagecollector: &lt;a href="https://github.com/kubernetes/kubernetes/tree/6cedc0853faa118df0ba3d41b48b993422ad3df6/pkg/controller/garbagecollector" rel="noopener noreferrer"&gt;Here&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Good read to understand GC: &lt;a href="https://medium.com/sparsecode/garbage-collection-in-kubernetes-88a60d6a5409" rel="noopener noreferrer"&gt;Here&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you liked this content you can follow me on Twitter at &lt;a href="https://twitter.com/kitarp29" rel="noopener noreferrer"&gt;kitarp29&lt;/a&gt; for more!&lt;/p&gt;

&lt;p&gt;Thanks for reading my article :)&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>docker</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Mercari Internship Experience</title>
      <dc:creator>Pratik Singh</dc:creator>
      <pubDate>Sat, 15 Jul 2023 12:00:00 +0000</pubDate>
      <link>https://forem.com/kitarp29/mercari-internship-experience-2f90</link>
      <guid>https://forem.com/kitarp29/mercari-internship-experience-2f90</guid>
<description>&lt;p&gt;This article is about my internship at &lt;a href="https://www.mercari.com/" rel="noopener noreferrer"&gt;Mercari Japan&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Mercari, Inc. is a Japanese e-commerce company founded in 2013. Its main product, the Mercari marketplace app, first launched in Japan in July 2013 and has since grown into Japan's largest community-powered marketplace, with over JPY 10 billion in transactions carried out on the platform each month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;I am &lt;a href="https://www.linkedin.com/in/kitarp29/" rel="noopener noreferrer"&gt;&lt;strong&gt;Pratik Singh&lt;/strong&gt;&lt;/a&gt;, a recent engineering graduate from Bangalore.&lt;br&gt;
In this blog post, I will describe the application process, the interview process, and my experience working at this company.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Story
&lt;/h2&gt;

&lt;p&gt;In this section, I will share my story with Mercari. Back in 2021, I decided I wanted to work as a DevOps engineer rather than an SDE, and after that decision I started focusing only on such openings.&lt;/p&gt;

&lt;p&gt;While scrolling LinkedIn, I came across a post about this company. Lucky for me, while stalking the company I found an opening that fit me.&lt;/p&gt;




&lt;h2&gt;
  
  
  Selection Process
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;u&gt;The Application&lt;/u&gt;&lt;/strong&gt; 🗒 : &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I applied for CI/CD Engineer Opening from LinkedIn: &lt;a href="https://www.linkedin.com/jobs/view/2884177350/?refId=a9ba2bc4-3b26-487d-b39b-209562c9eb95&amp;amp;trackingId=wY%2FDLjmjTvu4TE3OTUt0xA%3D%3D&amp;amp;trk=flagship3_job_home_appliedjobs" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;br&gt;
No referrals, I applied directly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;u&gt;Coding Test&lt;/u&gt;&lt;/strong&gt; 💻  :&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The application was followed by a test link from HackerRank (my last employer 😅). There were two questions, as far as I recall. The time provided was reasonable and the questions were of an appropriate level (Codeforces A and B level). Nothing related to DevOps had been asked up to this point.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;u&gt;Technical Interview&lt;/u&gt;&lt;/strong&gt; ⚔️ :&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first technical round was scheduled for 90 minutes. It was mostly DevOps questions. We started by discussing my projects, followed by my past internships; my experience did help me out. We focused our conversation on tools like GitHub Actions, Terraform, Kubernetes, Datadog, and more. It went on for about 2 hours, and it was a great interview.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n7dff4t764rfq734z9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n7dff4t764rfq734z9t.png" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I like technical interviews. If I can answer their questions, my confidence boosts. If I don't know the answers, I learn something new.&lt;br&gt;
P.S.: Don't be afraid of interviews.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;u&gt;Interview with Manager&lt;/u&gt;&lt;/strong&gt; 🙇🏻 :&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next round was scheduled a couple of days later with the Manager, and lasted 90 minutes or so. He is a very wise person with a comforting nature. The interview was a mix of technical and cultural-fit questions. I was asked about my experience with DevOps and CI tools, and to share instances from my own experience. It was not very technical, but I sure had to think a lot before answering. Reading the blogs my manager had published helped me drive a better conversation 😉.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;u&gt;HR Round&lt;/u&gt;&lt;/strong&gt; 👩🏻‍🦳:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Well, not much to tell here. She was impressed that I had worked for HackerRank. HR just discussed my availability over the 3 months and whether I was willing to relocate to Japan. It was more a simple conversation than an interview, to be honest.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;u&gt;Selected&lt;/u&gt;!&lt;/strong&gt;🎉🔥
&lt;a href="https://twitter.com/kitarp29/status/1627649005508505602" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ryqi6vbq0mhowkzn0yy.png" alt="Image description" width="600" height="820"&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  My Experience
&lt;/h2&gt;

&lt;p&gt;During my 3 months at Mercari, I worked on the CI/CD team, which sits under the Platform team. We focused on building tools and pipelines to ease the work of developers at Mercari. My initial work was getting to know the different tools and the company's setup. I implemented several things in Go and helped with FinOps. My primary project was building a migration tool, which helped the team migrate an entire CI pipeline from CI tool A to tool B (I can't officially share names). It not only migrated the pipeline but also helped the team monitor and observe which projects had been migrated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/kitarp29/status/1628390438930612228?s=20" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87avdk6kon0ikcqgslew.png" alt="Image description" width="800" height="718"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Working with an international team was an awesome experience! I went to Japan for a week and worked remotely from India for the rest. There was a time zone difference, but my team was very considerate and helped me a lot. Tasks were discussed with me before being assigned, and I could pick the ones I wanted based on the learning involved! I learned a lot and had the experience of a lifetime. Speaking of money, you can find the numbers on the Internet. All I can say is that what I earned in an entire 6-month internship was less than their monthly stipend 😅&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://twitter.com/kitarp29/status/1637068583388708865?s=20" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1zrxu6085nhzaofvr7c.png" alt="Image description" width="800" height="694"&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  My Conclusion
&lt;/h2&gt;

&lt;p&gt;Mercari is an amazing place to work! You will learn a lot of DevOps practices. Honestly, I had never seen, or even read about, the kind of IaC usage and other implementations they have. The culture lets people experiment, but never at the cost of security. Japan and the company will always have a special place in my life!&lt;/p&gt;

</description>
      <category>career</category>
      <category>codenewbie</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
