Forem: Federico Trotta

How to Use Proxies in Python

Federico Trotta — Thu, 14 Nov 2024 19:59:56 +0000

If you've been working with Python for a bit, especially in the particular case of data scraping, you've probably encountered situations where you are blocked while trying to retrieve the data you want. In such a situation, knowing how to use a proxy is a handy skill to have.

In this article, we'll explore what proxies are, why they're useful, and how you can use them with the requests library in Python.

What is a Proxy?

Let’s start from the beginning by defining what a proxy is.

You can think of a proxy server as a “middleman” between your computer and the internet. When you send a request to a website, the request goes through the proxy server first. The proxy then forwards your request to the website, receives the response, and sends it back to you. This process masks your IP address, making it appear as if the request is coming from the proxy server instead of your own device.

This has several practical uses. For example, a proxy can help you bypass some pesky IP restrictions or maintain anonymity.

Why use a proxy in web scraping?

So, why might proxies be helpful while scraping data? We already hinted at one reason: you can use them to bypass some restrictions.

So, in the particular case of web scraping, they can be useful for the following reasons:

Avoiding IP blocking: websites often monitor for suspicious activity, like a single IP making numerous requests in a short time. Using proxies helps distribute your requests across multiple IPs, reducing the chance of being blocked.
Bypassing geo-restrictions: some content is only accessible from certain locations and proxies can help you appear as if you're accessing the site from a different country.
Enhancing privacy: proxies are useful to keep your scraping activities anonymous by hiding your real IP address.

How to use a proxy in Python using `requests`

The requests library is a popular choice for making HTTP requests in Python and incorporating proxies into your requests is straightforward.

Let’s see how!

Getting Valid Proxies

First things first: you have to get valid proxies before actually using them. To do so, you have two options:

Free proxies: you can get proxies for free from websites like Free Proxy List. They're easily accessible, but they can be unreliable or slow.
Paid proxies: services like Bright Data or ScraperAPI provide reliable proxies with better performance and support, but you have to pay.

Using Proxies with `requests`

Now that you have your list of proxies you can start using them. For example, you can create a dictionary like so:

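proxies = {
    'http': 'http://proxy_ip:proxy_port',
    # If your proxy only accepts plain-HTTP connections (common for free
    # proxies), use an 'http://' URL for the 'https' key as well
    'https': 'https://proxy_ip:proxy_port',
}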

Now you can make a request using the proxies:

import requests

proxies = {
    'http': 'http://your_proxy_ip:proxy_port',
    'https': 'https://your_proxy_ip:proxy_port',
}

response = requests.get('https://httpbin.org/ip', proxies=proxies)

To see the outcome of your request, you can print the response:

print(response.status_code)  # Should return 200 if successful
print(response.text)         # Prints the content of the response

Note that, if everything went smoothly, the response should display the IP address of the proxy server, not yours.

Proxy Authentication Using `requests`: Username and Password

If your proxy requires authentication, you can handle it in a couple of ways.

Method 1: including Credentials in the Proxy URL
To handle authentication, you can include the username and password directly in the proxy URL:

proxies = {
    'http': 'http://username:password@proxy_ip:proxy_port',
    'https': 'https://username:password@proxy_ip:proxy_port',
}
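One caveat with this method: if the username or password contains characters such as @, : or /, they need to be percent-encoded before going into the URL. The following is a minimal sketch of a helper that does this. The function name build_proxies, its parameters, and the example address and credentials are hypothetical and only serve as an illustration.

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None, scheme="http"):
    """Return a requests-style proxies dict, percent-encoding any credentials."""
    auth = ""
    if username is not None and password is not None:
        # safe="" makes quote() encode "/" too; "@" and ":" are always encoded
        auth = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    proxy_url = f"{scheme}://{auth}{host}:{port}"
    # The same proxy URL is used for both plain-HTTP and HTTPS traffic
    return {"http": proxy_url, "https": proxy_url}


# Hypothetical proxy address and credentials
proxies = build_proxies("203.0.113.10", 8080, "user", "p@ss:word/1")
print(proxies["http"])
```

The resulting dictionary can be passed to requests.get() through the proxies argument, exactly like the hand-written one above.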

Method 2: using HTTPProxyAuth
Alternatively, you can use the HTTPProxyAuth class to handle authentication like so:

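import requests
from requests.auth import HTTPProxyAuth

proxies = {
    'http': 'http://proxy_ip:proxy_port',
    'https': 'https://proxy_ip:proxy_port',
}

auth = HTTPProxyAuth('username', 'password')

# HTTPProxyAuth adds a Proxy-Authorization header to the request itself.
# For https:// targets that header may not reach the proxy, so Method 1
# is usually the safer choice there.
response = requests.get('https://httpbin.org/ip', proxies=proxies, auth=auth)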

How to Use a Rotating Proxy with `requests`

Using a single proxy might not be sufficient if you're making numerous requests. In this case, you can use a rotating proxy: this changes the proxy IP address at regular intervals or per request.

If you’d like to test this solution, you have two options: rotate proxies manually using a list, or use a proxy rotation service.

Let’s see both approaches!

Using a List of Proxies

If you have a list of proxies, you can rotate them manually like so:

import random
import requests

proxies_list = [
    'http://proxy1_ip:port',
    'http://proxy2_ip:port',
    'http://proxy3_ip:port',
    # Add more proxies as needed
]

def get_random_proxy():
    proxy = random.choice(proxies_list)
    return {
        'http': proxy,
        'https': proxy,
    }

for i in range(10):
    proxy = get_random_proxy()
    response = requests.get('https://httpbin.org/ip', proxies=proxy)
    print(response.text)
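Random choice can pick the same proxy twice in a row, and a dead proxy will simply raise an exception. As a rough sketch of one alternative, the helper below walks through the proxies in round-robin order and moves on to the next one when a request fails. The names fetch_with_rotation, rotation, and get are hypothetical; get is assumed to behave like requests.get.

```python
from itertools import cycle


def fetch_with_rotation(url, rotation, get, max_attempts=3, timeout=10):
    """Try the request through successive proxies until one succeeds."""
    last_error = None
    for _ in range(max_attempts):
        proxy = next(rotation)  # next proxy in round-robin order
        try:
            response = get(url, proxies={"http": proxy, "https": proxy}, timeout=timeout)
            response.raise_for_status()
            return response
        except Exception as error:  # with requests, narrow this to requests.RequestException
            last_error = error
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error


# Hypothetical proxy addresses; cycle() repeats the list endlessly
rotation = cycle([
    "http://proxy1_ip:port",
    "http://proxy2_ip:port",
    "http://proxy3_ip:port",
])
```

With requests installed, a call would look like fetch_with_rotation('https://httpbin.org/ip', rotation, requests.get). Passing the getter as an argument also makes the helper easy to try out without network access.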

Using a Proxy Rotation Service

Services like ScraperAPI handle proxy rotation for you. You typically just need to plug the proxy URL they provide into the proxies dictionary, like so:

proxies = {
    'http': 'http://your_service_proxy_url',
    'https': 'https://your_service_proxy_url',
}

response = requests.get('https://httpbin.org/ip', proxies=proxies)

Conclusions

Using a proxy in Python is a valuable technique for web scraping, testing, and accessing geo-restricted content. As we’ve seen, integrating proxies into your HTTP requests is straightforward using the requests library.

A few parting tips when scraping data from the web:

Respect website policies: always check the website's robots.txt file and terms of service.
Handle exceptions: network operations can fail for various reasons, so make sure to handle exceptions and implement retries if necessary.
Secure your credentials: if you're using authenticated proxies, keep your credentials safe and avoid hardcoding them into your scripts.
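As an illustration of the last tip, here is a minimal sketch that reads the proxy settings from environment variables instead of hardcoding them. The variable names (PROXY_USER, PROXY_PASS, PROXY_HOST, PROXY_PORT) and the function name are hypothetical; use whatever naming your deployment already follows.

```python
import os

REQUIRED = ("PROXY_USER", "PROXY_PASS", "PROXY_HOST", "PROXY_PORT")


def proxy_url_from_env(environ=None):
    """Build the proxy URL from environment variables, failing early if any is missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED if name not in environ]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    user = environ["PROXY_USER"]
    password = environ["PROXY_PASS"]
    host = environ["PROXY_HOST"]
    port = environ["PROXY_PORT"]
    return f"http://{user}:{password}@{host}:{port}"


# Example with an explicit mapping instead of the real environment
example_env = {
    "PROXY_USER": "user",
    "PROXY_PASS": "secret",
    "PROXY_HOST": "203.0.113.10",
    "PROXY_PORT": "8080",
}
print(proxy_url_from_env(example_env))
```

In a real script you would call proxy_url_from_env() with no argument, so the values come from the process environment and never appear in the source code.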

Happy coding!

How to Use Lambda Functions in Python

Federico Trotta — Wed, 30 Oct 2024 16:03:10 +0000

Lambda functions in Python are a powerful way to create small, anonymous functions on the fly. These functions are typically used for short, simple operations where the overhead of a full function definition would be unnecessary.

While traditional functions are defined using the def keyword, Lambda functions are defined using the lambda keyword and are directly integrated into lines of code. In particular, they are often used as arguments for built-in functions. They enable developers to write clean and readable code by eliminating the need for temporary function definitions.

In this article, we'll cover what Lambda functions do and their syntax. We'll also provide some examples and best practices for using them, and discuss their pros and cons.

Prerequisites

Lambda functions have been a part of Python since version 2.0, so you'll need:

Minimum Python version: 2.0.
Recommended Python version: 3.10 or later.

In this tutorial, we'll see how to use Lambda functions with the library Pandas: a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation library. If you don't have it installed, run the following:

pip install pandas

Syntax and Basics of Lambda Functions for Python

First, let's define the syntax developers must use to create Lambda functions.

A Lambda function is defined using the lambda keyword, followed by one or more arguments and an expression:

lambda arguments: expression

Let's imagine we want to create a Lambda function that adds up two numbers:

add = lambda x, y: x + y

Run the following:

result = add(3, 5)
print(result)

This results in:

8

We've created an anonymous function that takes two arguments, x and y. Unlike traditional functions, Lambda functions don't have a name: that's why we say they are "anonymous."

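Also, we don't use the return statement, as we do in regular Python functions: the value of the expression is returned automatically. And because a Lambda function is itself an expression, we can use it at will: it can be called right away, stored in a variable (as we did in this case), passed as an argument, etc.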

Now let's see some common use cases for Lambda functions.

Common Use Cases for Lambda Functions

Lambda functions are particularly useful in situations where we need a simple, temporary function. In particular, they are commonly used as arguments for higher-order functions.

Let's see some practical examples.

Using Lambda Functions with the `map()` Function

map() is a built-in function that applies a given function to each item of an iterable and returns a map object with the results.

For example, let's say we want to calculate the square of each number in a list. We could use a Lambda function like so:

# Define the list of numbers
numbers = [1, 2, 3, 4]

# Calculate square values and print results
squared = list(map(lambda x: x ** 2, numbers))
print(squared)

This results in:

[1, 4, 9, 16]

We now have a list containing the squares of the initial numbers.

As we can see, Lambda functions make it easy to apply one-off functions that don't need to be reused later.

Using Lambda Functions with the `filter()` Function

Now, suppose we have a list of numbers and want to keep only the even ones.

We can use a Lambda function as follows:

# Create a list of numbers
numbers = [1, 2, 3, 4]

# Filter for even numbers and print results
even = list(filter(lambda x: x % 2 == 0, numbers))
print(even)

This results in:

[2, 4]

Using Lambda Functions with the `sorted()` Function

The sorted() function in Python returns a new sorted list from the elements of any iterable. Using Lambda functions, we can apply specific sorting criteria to these lists.

For example, suppose we have a list of points in two dimensions: (x,y). We want to sort the points by their y values in ascending order.

We can do it like so:

# Create a list of points
points = [(1, 2), (3, 1), (5, -1)]

# Sort the points and print
points_sorted = sorted(points, key=lambda point: point[1])
print(points_sorted)

And we get:

[(5, -1), (3, 1), (1, 2)]

Using Lambda Functions in List Comprehensions

Given their conciseness, Lambda functions can be embedded in list comprehensions for on-the-fly computations.

Suppose we have a list of numbers. We want to:

Iterate over the whole list
Calculate and print the square of each initial value.

Here's how we can do that:

# Create a list of numbers
numbers = [1, 2, 3, 4]

# Calculate and print the square of each one
squared = [(lambda x: x ** 2)(x) for x in numbers]
print(squared)

And we obtain:

[1, 4, 9, 16]

Advantages of Using Lambda Functions

Given the examples we've explored, let's run through some advantages of using Lambda functions:

Conciseness and readability where the logic is simple: Lambda functions allow for concise code, reducing the need for standard function definitions. This improves readability in cases where function logic is simple.
Enhanced functional programming capabilities: Lambda functions align well with functional programming principles, enabling functional constructs in Python code. In particular, they facilitate the use of higher-order functions and the application of functions as first-class objects.
When and why to prefer Lambda functions: Lambda functions are particularly advantageous when defining short, "throwaway" functions that don't need to be reused elsewhere in code. So they are ideal for inline use, such as arguments to higher-order functions.

Limitations and Drawbacks

Let's briefly discuss some limitations and drawbacks of Lambda functions in Python:

Readability challenges in complex expressions: While Lambda functions are concise, they can become difficult to read and understand when used for complex expressions. This can lead to code that is harder to maintain and debug.
Limitations in error handling and debugging: As Lambda functions can only contain a single expression, they can't include statements, like the try-except block for error handling. This limitation makes them unsuitable for complex operations that require these features.
Restricted functionality: Since Lambda functions can only contain a single expression, they are less versatile than standard functions. This by-design restriction limits their use to simple operations and transformations.
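To make the error-handling limitation concrete, here is a small sketch. A Lambda function that calls int() has no way to catch the ValueError raised by invalid input, so the logic is moved into a regular function instead. The function name to_int_or_none and the sample values are made up for illustration.

```python
values = ["1", "2", "oops", "4"]

# A lambda can't contain try/except, so this version would raise
# ValueError as soon as it reaches "oops":
# to_int = lambda text: int(text)


def to_int_or_none(text):
    """Convert text to int, returning None when it isn't a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None


converted = list(map(to_int_or_none, values))
print(converted)
```

The regular function still works as an argument to map(), so nothing is lost by giving up the Lambda here.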

Best Practices for Using Lambda Functions

Now that we've considered some pros and cons, let's define some best practices for using Lambda functions effectively:

Keep them simple: To maintain readability and simplicity, Lambda functions should be kept short and limited to straightforward operations. Functions with complex logic should be refactored into standard functions.
Avoid overuse: While Lambda functions are convenient for numerous situations, overusing them can lead to code that is difficult to read and maintain. Use them judiciously and opt for standard functions when clarity is fundamental.
Combine Lambda functions with other Python features: As we've seen, Lambda functions can be effectively combined with other Python features, such as list comprehensions and higher-order functions. This can result in more expressive and concise code when used appropriately.

Advanced Techniques with Lambda Functions

In certain cases, more advanced Lambda function techniques can be of help.

Let's see some examples.

Nested Lambda Functions

Lambda functions can be nested for complex operations.

This technique is useful in scenarios where you need to have multiple small transformations in a sequence.

For example, suppose you want to create a function that calculates the square of a number and then adds 1. Here's how you can use Lambda functions to do so:

# Create a nested lambda function
nested_lambda = lambda x: (lambda y: y ** 2)(x) + 1

# Print the result for the value 3
print(nested_lambda(3))

You get:

10

Integration with Python Libraries for Advanced Functionality

Many Python libraries leverage Lambda functions to simplify complex data processing tasks.

For example, Lambda functions can be used with Pandas and NumPy to simplify data manipulation and transformation.

Suppose we have a data frame with two columns. We want to create another column that is the sum of the other two. In this case, we can use Lambda functions as follows:

import pandas as pd

# Create the columns' data
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}

# Create data frame
df = pd.DataFrame(data)

# Create column C as A+B and print the dataframe
df['C'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
print(df)

And we get:

   A  B  C
0  1  4  5
1  2  5  7
2  3  6  9

That's it for our whistle-stop tour of Lambda functions in Python!

Wrapping Up

In this article, we've seen how to use Lambda functions in Python, explored their pros and cons, some best practices, and touched on a couple of advanced use cases.

Happy coding!


Accelerating Polars with RAPIDS cuDF

Federico Trotta — Tue, 17 Sep 2024 15:51:31 +0000

If you’re a data scientist who migrated from Pandas to Polars because of its performance, you may be happy to hear that Polars has powered up even further thanks to NVIDIA’s cuDF.

Did I get your attention? Well, read along!

Introducing Polars

In today’s analytics world, data frames are the backbone of most data work. Whether you're cleaning data, transforming it, or running complex analyses, data frames let you organize and manipulate data in a way that feels intuitive. This is mainly because data frames:

Are versatile: DataFrame APIs are less verbose than SQL for complex queries.
Provide easy integration: Data frames integrate well with existing software solutions (for example, with plotting and ML libraries).
Provide a single format for data science and engineering: Data frames support the workflows of both data engineers and data scientists.

In the last few years, Pandas has been king in this space, but with more data than ever and growing needs for performance, tools like Polars are stepping in to meet those demands without sacrificing the simplicity we’ve come to love from data frames.

In case you didn't know, Polars is a Python library for data analysis that’s gaining popularity as a speedier alternative to Pandas. While Pandas is the go-to for most data scientists and engineers, it can get sluggish when handling really big datasets.

Polars, on the other hand, is built with performance in mind and it’s optimized to handle massive datasets much faster, thanks to its use of parallelization and a more modern backend. So, if you've ever felt like Pandas was holding you back with long processing times, Polars might be the upgrade you’re looking for.

However, if you already use Polars, you may have noticed that its superpowers may not be good enough for very large datasets, especially in distributed systems:

(Image from NVIDIA/Polars)

So, let’s see the solution that has been implemented and what to expect from it.

Accelerating Polars with RAPIDS

When it comes to processing very large datasets, as in industries like quantitative finance, healthcare research, and similar, performance needs to be even higher because of the sheer amount of data.

That’s why NVIDIA has accelerated Polars with the RAPIDS cuDF library.

Here’s what they’ve done:

The RAPIDS cuDF library accelerates Polars workflows up to 13x+ using NVIDIA GPUs. In particular, it’s directly integrated into the Polars Lazy API, so you don’t need to change your code.
It has been designed to make processing 100s of millions of rows of data feel interactive with just a single GPU.
The library is fully compatible with the ecosystem of tools built for Polars, thus reducing overhead.
It gracefully falls back to the CPU for unsupported queries.

In particular, moving to RAPIDS cuDF if you have already written Polars code is pretty straightforward, as you only need to pass the engine="gpu" argument when collecting a lazy query.

For example, here is a query written in plain Polars:

And here’s Polars accelerated with RAPIDS:

What to expect?

First of all, using Polars on a GPU should feel the same as using it on the CPU: just faster for many workflows.

The GPU engine, in fact, fully utilizes the Polars optimizer to ensure efficient execution and minimal memory usage.

Also, as the team was working on accelerating Polars, they benchmarked it with industry standards and found that, as the data scaled, the performance of Polars (accelerated) scaled too:

(The Benchmark made by NVIDIA)

This is perfectly expected, as Polars is accelerated on GPUs (note that the benchmark was run on an NVIDIA H100).

How to use it?

To accelerate Polars with cuDF, you first need to install it in an environment that allows you to use GPUs, for example in Google Colaboratory:

pip install "polars[gpu]" --extra-index-url=https://pypi.nvidia.com

The following example is taken from a 22GB dataset (link at the end of the article to test it).

Here’s the time needed for an operation with “standard” Polars:

And here’s the time needed for the same operation, with accelerated Polars:

So, the same operation took:

12 seconds with Polars.
0.34 seconds with accelerated Polars (roughly a 35x speedup).

Conclusions

With this new Polars GPU engine, you can potentially reach high performance with huge datasets, maintaining the same Polars code you are already using.

So, why not give it a try? You can easily test it using a Colab notebook!

Want to read more? Here are all the details about that release directly on the Polars website.

Serverless Cost Optimization: Three Key Strategies

Federico Trotta — Tue, 30 Jul 2024 11:44:29 +0000

Serverless computing has revolutionized the way developers build and deploy applications, offering significant benefits such as reduced operational complexity, automatic scaling, and a pay-as-you-go pricing model.

However, while serverless architectures can help you save on costs, they are not free. So, managing their costs effectively requires careful planning and optimization.

This article explores three key techniques for serverless cost optimization, helping you improve your serverless applications and avoid unnecessary expenses.

Serverless computing: an introduction for developers

Before presenting the strategies for serverless cost optimization, we want to briefly introduce what serverless computing is and why you may need it.

Introducing serverless computing

Serverless computing is a cloud-native development model that allows developers to build and run applications without managing the infrastructure. In a serverless setup, in fact, cloud service providers automatically allocate and manage servers to execute code in response to events, such as HTTP requests, database changes, or message queue activities. This allows developers to focus only on writing and deploying code, rather than worrying about server provisioning, scaling, and maintenance.

Also, serverless architecture is particularly appealing for several reasons like the following:

Payment model. Serverless offers a true pay-as-you-go model, where you only pay for the compute time you consume. This can lead to significant cost savings, especially for applications with variable or unpredictable workloads. A typical use case today is training big AI models, like Deep Neural Networks or Large Language Models, which requires GPUs. To save on GPU costs, you can use serverless so that you only pay while the models are training.
Automatic scaling. Serverless provides automatic scaling, which means your application can handle the variation of loads seamlessly without manual intervention. When demand spikes, the serverless platform automatically scales out; when demand drops, it scales back, ensuring optimal resource usage. This also helps save on costs, since you pay-as-you-use the service, without the need to buy extensive hardware or to pay a monthly fee to a cloud service.

Comparing serverless to other technologies

Serverless advantages can be compared to other methodologies like traditional server-based (virtual machines or dedicated servers), Platform as a Service (PaaS), and containerization:

Traditional server-based models. This solution requires developers to manage the entire stack, from the physical or virtual server to the application code. This includes tasks like OS updates, patching, and capacity planning, which can be time-consuming and prone to errors.
PaaS. These solutions simplify some of the tasks needed with traditional server-based models by providing a managed environment for application deployment, but developers still need to handle aspects like scaling and environment configuration.
Containerization. Containers, often implemented with technologies like Docker and Kubernetes, offer another layer of abstraction by packaging applications and their dependencies into containers. This approach provides greater flexibility and scalability compared to traditional servers and PaaS. However, managing container orchestration, scaling, and networking can still be complex and resource-intensive.

Serverless, on the other hand, abstracts all infrastructure management tasks, allowing developers to deploy individual functions that execute in response to specific triggers. This model reduces operational issues, speeds up development cycles, and improves application resilience by leveraging the cloud provider’s infrastructure. It also integrates seamlessly with other cloud services, enabling the creation of highly scalable, event-driven applications with minimal effort.

So, serverless solutions should be preferred to the others mentioned in the following cases:

Variable or unpredictable workloads. Serverless is ideal for applications with workloads that vary significantly or are difficult to predict, thanks to its automatic scaling feature.
Event-driven applications. Applications that are inherently event-driven, such as those responding to HTTP requests, processing files in object storage, reacting to database changes, or stream processing, are well-suited for serverless. The event-driven nature of serverless platforms, in fact, allows functions to execute in response to specific triggers, making it efficient and straightforward to build such applications.
Rapid development and deployment situations. When speed to market is crucial, serverless can accelerate development cycles. By eliminating the need to manage infrastructure, in fact, developers can focus only on writing and deploying code. This may be particularly beneficial for startups or projects requiring rapid iteration and deployment.

So, while serverless computing can help you save time and money compared to the other methodologies described, it still comes with its own costs. Let's continue this article with three strategies for serverless cost optimization.

Serverless cost optimization strategy 1: optimizing function execution time

One of the most direct ways to reduce serverless costs is by minimizing function execution time, which is the duration from when a serverless function starts executing until it finishes.

Serverless providers such as AWS Lambda, Azure Functions, and Google Cloud Functions charge based on the time it takes for the function to execute. The billing is typically calculated in milliseconds and, combined with the memory allocated to the function, determines the overall cost.
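To see how duration and memory combine, here is a rough sketch of the arithmetic. It assumes compute is billed per GB-second plus a small per-request fee, which is a common scheme but not universal. The prices and workload figures are hypothetical placeholders, so check your provider's current price list.

```python
def estimate_monthly_cost(invocations, avg_duration_ms, memory_mb,
                          price_per_gb_second, price_per_million_requests):
    """Estimate a monthly bill from request count, duration, and allocated memory."""
    # Compute is billed in GB-seconds: duration (s) x allocated memory (GB)
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute_cost = gb_seconds * price_per_gb_second
    request_cost = (invocations / 1_000_000) * price_per_million_requests
    return compute_cost + request_cost


# Hypothetical prices, for illustration only
PRICE_PER_GB_SECOND = 0.0000167
PRICE_PER_MILLION_REQUESTS = 0.20

# Same workload (5 million calls, 512 MB), before and after halving the duration
before = estimate_monthly_cost(5_000_000, 400, 512,
                               PRICE_PER_GB_SECOND, PRICE_PER_MILLION_REQUESTS)
after = estimate_monthly_cost(5_000_000, 200, 512,
                              PRICE_PER_GB_SECOND, PRICE_PER_MILLION_REQUESTS)
print(f"400 ms per call: ${before:.2f}")
print(f"200 ms per call: ${after:.2f}")
```

Under these assumptions, halving the execution time halves the compute part of the bill, while the per-request part stays the same.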

Here are some best practices to optimize the function execution time:

Write efficient code. Ensure that your code is optimized for performance by avoiding unnecessary computations, and using efficient algorithms. For example, prefer in-memory operations over database queries where possible.
Asynchronous processing. Utilize asynchronous processing to handle tasks that can be performed in parallel or do not require immediate completion. This can reduce the time your functions spend waiting, thus lowering execution time and costs. For instance, background tasks such as sending emails or processing logs can typically be handled asynchronously.
Memory Allocation. Choose the appropriate memory allocation for your functions. Allocating more memory can sometimes speed up execution due to higher CPU availability, but over-allocating memory leads to higher costs. Use monitoring tools to analyze your functions' performance and adjust memory settings accordingly.
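As a small illustration of the asynchronous-processing point, the sketch below runs three simulated I/O-bound tasks concurrently with asyncio. The task names and the 0.2-second delays are stand-ins for real calls such as sending an email or writing a log.

```python
import asyncio
import time


async def fake_io_task(name, seconds):
    """Stand-in for an I/O-bound call such as sending an email."""
    await asyncio.sleep(seconds)
    return name


async def handler():
    # The three waits overlap, so the total time is close to the longest one
    return await asyncio.gather(
        fake_io_task("send_email", 0.2),
        fake_io_task("write_log", 0.2),
        fake_io_task("update_cache", 0.2),
    )


start = time.perf_counter()
results = asyncio.run(handler())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Run one after the other, the same three tasks would take about 0.6 seconds. Since the function is billed for the time it spends waiting, overlapping the waits directly reduces the billed duration.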

A simple example of more efficient Python code that reduces memory usage could be the following:

Inefficient:

def sum_of_squares_inefficient(numbers):
    # Use list comprehension inside a sum function
    return sum([x * x for x in numbers])

# Example usage
numbers = list(range(1, 10001))
result = sum_of_squares_inefficient(numbers)
print(result)

Efficient:

def sum_of_squares_efficient(numbers):
    # Use generator expression inside a sum function
    return sum(x * x for x in numbers)

# Example usage
numbers = list(range(1, 10001))
result = sum_of_squares_efficient(numbers)
print(result)

The inefficient version uses a list comprehension inside the sum() function. This creates an intermediate list in memory, which can be memory-intensive and slow, especially for large lists.

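The efficient version uses a generator expression inside the sum() function. This avoids creating an intermediate list, yielding elements one by one. This approach is more memory efficient for large datasets. Note that it is not always faster in raw CPU time, so measure before relying on a speedup: the main gain is a smaller memory footprint, which matters because the memory allocated to a function is part of the bill.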
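If you want to check the memory difference yourself, one option is the standard-library tracemalloc module. The following is a minimal sketch. The helper peak_memory and the input size are arbitrary choices for illustration.

```python
import tracemalloc


def peak_memory(func, *args):
    """Return (result, peak bytes allocated) while running func(*args)."""
    tracemalloc.start()
    result = func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


def with_list(numbers):
    # Builds the full list of squares before summing
    return sum([x * x for x in numbers])


def with_generator(numbers):
    # Produces squares one at a time
    return sum(x * x for x in numbers)


numbers = range(1, 100_001)
list_result, list_peak = peak_memory(with_list, numbers)
gen_result, gen_peak = peak_memory(with_generator, numbers)
print(f"list comprehension peak: {list_peak} bytes")
print(f"generator expression peak: {gen_peak} bytes")
```

Both versions return the same sum. The exact byte counts depend on the Python version, but the list-based peak should be far higher.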

Serverless cost optimization strategy 2: implementing auto-scaling and scheduled scaling

Auto-scaling is a fundamental feature provided by serverless platforms that automatically adjusts the number of function instances based on demand. However, without proper configuration, auto-scaling can lead to cost overruns.

So, here are some best practices to implement the auto-scaling feature in serverless and save on costs:

Demand-based auto-scaling. Set up auto-scaling policies that align with your application's usage patterns. Configure thresholds for scaling up and down based on metrics such as CPU usage, memory usage, or custom application metrics. This ensures that you are only using the resources you need, when you need them. Note that the best-known serverless providers support auto-scaling. For example, AWS Lambda provides AWS Auto Scaling, Azure Functions provides Azure Monitor, while Google Cloud Functions can be configured for auto-scaling with the help of Google Cloud Monitoring.

Also, consider using concurrency autoscaling. This refers to the automatic adjustment of the number of concurrent executions or instances of a serverless function based on the current demand. This helps ensure that the function can handle incoming requests efficiently without being overwhelmed, while also controlling costs by scaling down when demand is low. While the best-known serverless providers support concurrency autoscaling, you can also use dedicated packages for your serverless projects, such as the NPM package.
Scheduled scaling. For applications with predictable traffic patterns, scheduled scaling can be highly effective. By scheduling scaling events to match peak and off-peak times, you can ensure that your application has sufficient resources during high-demand periods while saving costs during low-demand periods. For example, if you know that your application experiences high traffic during business hours, you can schedule additional instances to be available during those times.
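As a sketch of what scheduled scaling could look like on AWS, the code below builds the parameters for two scheduled actions on a Lambda function's provisioned concurrency: one that scales up at the start of business hours and one that scales down afterwards. It assumes Application Auto Scaling's put_scheduled_action call via boto3. The function name, the alias live, the capacities, and the schedule times are all hypothetical.

```python
def business_hours_schedule(function_name, alias, peak_capacity, off_peak_capacity):
    """Build keyword arguments for two put_scheduled_action calls."""
    common = {
        "ServiceNamespace": "lambda",
        "ResourceId": f"function:{function_name}:{alias}",
        "ScalableDimension": "lambda:function:ProvisionedConcurrency",
    }
    return [
        {
            **common,
            "ScheduledActionName": "scale-up-business-hours",
            "Schedule": "cron(0 8 ? * MON-FRI *)",  # 08:00 UTC on weekdays
            "ScalableTargetAction": {
                "MinCapacity": peak_capacity,
                "MaxCapacity": peak_capacity,
            },
        },
        {
            **common,
            "ScheduledActionName": "scale-down-after-hours",
            "Schedule": "cron(0 18 ? * MON-FRI *)",  # 18:00 UTC on weekdays
            "ScalableTargetAction": {
                "MinCapacity": off_peak_capacity,
                "MaxCapacity": off_peak_capacity,
            },
        },
    ]


def apply_schedule(actions):
    """Send the actions to AWS (needs boto3, credentials, and a registered scalable target)."""
    import boto3

    client = boto3.client("application-autoscaling")
    for action in actions:
        client.put_scheduled_action(**action)


actions = business_hours_schedule("my_lambda_function", "live", 10, 1)
for action in actions:
    print(action["ScheduledActionName"], action["Schedule"])
```

Keeping the parameter-building step separate from the AWS call makes it possible to review or test the schedule without touching a real account. Calling apply_schedule(actions) is what actually registers it.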

Implementation example

NOTE: All the code described in this section is available in this repository.

Let's take AWS as an example to illustrate a simple implementation: we want to deploy a Lambda function on AWS using Semaphore CI.

Before that, install the Python package boto3, if you haven't already:

pip install boto3

Now, let's create an example that sets up a demand-based auto-scaling solution by dynamically adjusting the resources allocated to your serverless functions, based on real-time usage metrics in Python (see /function/function.py in the linked repository).

First of all, define the metrics that reflect your application's performance and load, such as CPU usage or request latency:

import boto3
cloudwatch = boto3.client('cloudwatch')

response = cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[
        {
            'MetricName': 'CPUUsage',
            'Dimensions': [
                {
                    'Name': 'FunctionName',
                    'Value': 'my_lambda_function'
                },
            ],
            'Value': 70.0,
            'Unit': 'Percent'
        },
    ]
)

Then, create alarms based on these metrics to trigger scaling actions:

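response = cloudwatch.put_metric_alarm(
    AlarmName='HighCPUUsageAlarm',
    MetricName='CPUUsage',
    Namespace='MyApp',
    # Use the same dimension as the published metric,
    # otherwise the alarm watches a different series
    Dimensions=[
        {
            'Name': 'FunctionName',
            'Value': 'my_lambda_function'
        },
    ],
    Statistic='Average',
    Period=300,
    EvaluationPeriods=1,
    Threshold=75.0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=[
        # Placeholder ARN: replace it with the ARN of your own scaling policy
        'arn:aws:autoscaling:us-west-2:123456789012:scalingPolicy:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx:autoScalingGroupName/my-asg:policyName/MyScalingPolicy'
    ]
)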

Finally, link your Lambda function with Application Auto Scaling to adjust concurrency based on the CloudWatch alarms.

appscaling = boto3.client('application-autoscaling')

response = appscaling.register_scalable_target(
    ServiceNamespace='lambda',
    ResourceId='function:my_lambda_function',
    ScalableDimension='lambda:function:ProvisionedConcurrency',
    MinCapacity=1,
    MaxCapacity=10
)

response = appscaling.put_scaling_policy(
    PolicyName='MyScalingPolicy',
    ServiceNamespace='lambda',
    ResourceId='function:my_lambda_function',
    ScalableDimension='lambda:function:ProvisionedConcurrency',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 0.75,  # utilization is expressed as a fraction between 0 and 1
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'LambdaProvisionedConcurrencyUtilization'
        },
        'ScaleOutCooldown': 60,
        'ScaleInCooldown': 60
    }
)

NOTE: For simplicity, the code is reported here as plain statements. In the repository, the Python code is the same, but it is wrapped in a function, because it is deployed as a Lambda function on AWS.
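As a rough idea of what that wrapping could look like, here is a minimal sketch. The handler name lambda_handler matches the HANDLER value used by the deployment script shown later; the helper publish_cpu_metric, the cpu_percent event key, and the optional cloudwatch argument are assumptions made for illustration and are not taken from the repository.

```python
def publish_cpu_metric(cloudwatch, function_name, cpu_percent):
    """Publish one CPUUsage data point for the given function (hypothetical helper)."""
    return cloudwatch.put_metric_data(
        Namespace='MyApp',
        MetricData=[
            {
                'MetricName': 'CPUUsage',
                'Dimensions': [
                    {'Name': 'FunctionName', 'Value': function_name},
                ],
                'Value': cpu_percent,
                'Unit': 'Percent',
            },
        ],
    )


def lambda_handler(event, context, cloudwatch=None):
    """Entry point that AWS Lambda calls with (event, context)."""
    if cloudwatch is None:
        # Imported lazily so the sketch can be exercised locally with a stub client
        import boto3
        cloudwatch = boto3.client('cloudwatch')

    # 'cpu_percent' is a hypothetical event key; 70.0 mirrors the value used above
    cpu_percent = float(event.get('cpu_percent', 70.0))
    publish_cpu_metric(cloudwatch, context.function_name, cpu_percent)
    return {'statusCode': 200, 'body': f'Published CPUUsage={cpu_percent}'}
```

Accepting the client as an optional argument is only a convenience for local testing: Lambda itself passes just the event and the context.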

We created the function in Python. Now we need to write the code that executes the deployment:

#!/bin/bash

# Define variables
FUNCTION_NAME="my_lambda_function"
ZIP_FILE="function.zip"
HANDLER="function.lambda_handler"
ROLE_ARN="arn:aws:iam::123456789012:role/my-lambda-role"
RUNTIME="python3.8"
TIMEOUT=30

# Go to /function folder
cd function

# Install requirements and pack the Python function
pip install -r requirements.txt -t .  
zip -r ../$ZIP_FILE .  

# Go to main directory
cd ..

# Verify if lambda function already exists
aws lambda get-function --function-name $FUNCTION_NAME

if [ $? -eq 0 ]; then
  echo "Updating existing function..."
  aws lambda update-function-code \
    --function-name $FUNCTION_NAME \
    --zip-file fileb://$ZIP_FILE
else
  echo "Creating new function..."
  aws lambda create-function \
    --function-name $FUNCTION_NAME \
    --zip-file fileb://$ZIP_FILE \
    --handler $HANDLER \
    --runtime $RUNTIME \
    --role $ROLE_ARN \
    --timeout $TIMEOUT
fi

# Remove zip file after upload
rm $ZIP_FILE

The deploy.sh bash script does the following:

Goes into the function directory and installs the Python dependencies listed in the requirements.txt file into that directory.
Creates a .zip file that contains the Python function and its dependencies.
Verifies whether the Lambda function already exists:
- If it exists, it updates the code with the new one contained in the .zip file.
- If it does not exist, it creates a new Lambda function using the .zip file.
Removes the .zip file after the deployment has finished.

Finally, the file semaphore.yaml defines the CI/CD pipeline:

version: v1.0
name: Initial Pipeline
agent:
  machine:
    type: e1-standard-2
    os_image: ubuntu2004
blocks:
  - name: Install Dependencies
    task:
      jobs:
        - name: Install AWS CLI
          commands:
            - sudo apt-get update
            - sudo apt-get install -y python3-pip
            - pip3 install awscli
            - pip3 install boto3
      prologue:
        commands:
          - checkout
  - name: Deploy to AWS
    task:
      jobs:
        - name: aws_credentials
          commands:
            - chmod +x deploy.sh
            - ./deploy.sh
      prologue:
        commands:
          - checkout

The semaphore.yaml does the following:

In the initial part, it specifies a version, a name for the pipeline, a machine type (machine), and an OS image.
The Install Dependencies block:
- Downloads the latest version of the code with checkout.
- Updates the Ubuntu packages.
- Installs pip to manage Python packages.
- Installs awscli and boto3.
The Deploy to AWS block:
- Defines the credentials to make the deployment through secrets.
- Deploys the Lambda function by running deploy.sh on the latest version of the code (retrieved with checkout).

Note that, to make the code work, you need to configure the secrets in Semaphore CI, including:

AWS_ACCESS_KEY_ID: this is the ID to access AWS.
AWS_SECRET_ACCESS_KEY: this is the secret key to access AWS.

NOTE: To learn more about how to use yaml in Semaphore, read the documentation.

Serverless cost optimization strategy 3: monitoring and right-sizing resource usage

Continuous monitoring of serverless functions is another useful way of maintaining cost efficiency in serverless solutions. By regularly reviewing performance metrics and resource usage, you can make informed decisions about resource allocation and configuration, and make adjustments accordingly.

Here are some best practices to implement as a reference:

Use monitoring tools. Use monitoring tools provided by your serverless platform or third-party solutions to track function performance, execution times, and resource usage. Tools like AWS CloudWatch, Azure Monitor, or Google Cloud Monitoring offer insights into how your functions are performing and where inefficiencies may lie.
Analyze metrics. Analyze metrics regularly to identify patterns and anomalies. For example, look for functions with consistently high execution times or memory usage and investigate potential causes. This can help you pinpoint areas where optimizations are needed.

Also, if you work in a CI/CD environment, you can consider using Semaphore, as it streamlines issue detection and addresses error-prone tasks and unpredictable tests that could cause sporadic build failures.
Right-size resources. Based on your analyses, right-size your functions to ensure they have the appropriate resources. This might involve reducing memory allocation for functions that do not require it, or splitting larger functions into smaller, more efficient ones. Right-sizing helps avoid over-provisioning and ensures that you are not paying for unused resources.
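To make the right-sizing idea concrete, here is a minimal sketch that compares two memory configurations, assuming a pay-per-GB-second billing model like the one AWS Lambda uses. The price, the invocation count, and the durations are hypothetical values chosen for illustration, and the estimate ignores per-request charges and free tiers.

```python
def estimate_compute_cost(memory_mb, avg_duration_ms, invocations, price_per_gb_second):
    """Estimate compute cost as GB-seconds multiplied by a unit price."""
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000) * invocations
    return gb_seconds * price_per_gb_second


# Hypothetical numbers for illustration only; check your provider's price list
PRICE = 0.0000167          # assumed price per GB-second
INVOCATIONS = 1_000_000    # assumed monthly invocations

# Assumed measurements: halving the memory makes the function slightly slower
over_provisioned = estimate_compute_cost(1024, 200, INVOCATIONS, PRICE)
right_sized = estimate_compute_cost(512, 220, INVOCATIONS, PRICE)

print(f"1024 MB: {over_provisioned:.2f}")
print(f" 512 MB: {right_sized:.2f}")
print(f"Saving:  {over_provisioned - right_sized:.2f}")
```

The point of the comparison is that a smaller allocation only pays off if the execution time does not grow proportionally, which is why the monitoring data matters before changing anything.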

Conclusions

In this article, we've shown that serverless cost optimization involves a combination of optimizing function execution time, implementing intelligent scaling strategies, and continuously monitoring and right-sizing resource usage.

By adopting these strategies, you can ensure that your serverless applications run efficiently and cost-effectively.

Pandas reset_index(): How To Reset Indexes in Pandas

Federico Trotta — Sat, 27 Apr 2024 14:34:12 +0000

In data analysis, managing the structure and layout of data before analyzing them is crucial. Python offers versatile tools to manipulate data, including the often-used Pandas reset_index() method.

This article provides an in-depth exploration of the Pandas reset_index() method, explaining its importance, usage, and the scenarios where it’s useful.

What is Pandas reset_index() and when to use it?

![Pandas reset_index() visualized as real pandas playing in the trees, by Federico Trotta](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6u9e4roresb0i6tuu344.png)
(Pandas playing in the trees. Image by Federico Trotta.)

In Pandas, each DataFrame and Series has an index, which is a set of labels used for identifying each row or item uniquely.

The reset_index() method is used to reset the index of the DataFrame or Series, which can involve turning the index into a regular column, or discarding it entirely. This is particularly useful when the index needs reorganizing, or when integrating the index into DataFrame columns for further analysis.

The reset_index() method is typically used in the following scenarios:

Reverting an index after group operations. Post-grouping operations might leave you with grouped or multi-level indexes which are sometimes inconvenient for further analysis.
Integrating the index as a feature. If the index itself carries valuable data (e.g., time stamps or unique identifiers), you might want to move it into a DataFrame column to use as a feature in data analysis or machine learning models.
Resetting after sorting or filtering. Sorting or filtering can alter the order or number of rows, and resetting the index can be necessary to maintain a contiguous, integer index.
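The first and third scenarios can be sketched in a few lines. The DataFrame below is hypothetical data made up for illustration.

```python
import pandas as pd

# Hypothetical data: scores by team
df = pd.DataFrame({
    'team': ['red', 'blue', 'red', 'blue'],
    'score': [10, 20, 30, 40],
})

# After a group operation, the grouping key becomes the index
totals = df.groupby('team')['score'].sum()
totals_flat = totals.reset_index()  # 'team' is a regular column again
print(totals_flat)

# After filtering, the surviving rows keep their old labels (1 and 3 here)
blue = df[df['team'] == 'blue']
blue_clean = blue.reset_index(drop=True)  # contiguous 0..n-1 index, old labels discarded
print(blue_clean)
```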

How to use Pandas reset_index()

The basic syntax of reset_index() is as follows:

DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')

Each parameter has a specific function:

level. Specifies which index levels to reset (for MultiIndex).
drop. If True, the old index is discarded and not added as a column in the new DataFrame.
inplace. If True, modifies the DataFrame in-place; otherwise, a new DataFrame is returned.
col_level, col_fill. Used when the columns are a MultiIndex: col_level selects the column level into which the index labels are inserted, and col_fill sets the name used for the other levels.
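The less obvious parameters are easier to grasp with a short sketch. The data, the index names, and the 'meta' label below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# inplace=True modifies the DataFrame itself and returns None
df = pd.DataFrame({'Data': [10, 20]}, index=pd.Index(['a', 'b'], name='idx'))
result = df.reset_index(inplace=True)
print(result)               # None
print(df.columns.tolist())  # the old index is now the 'idx' column

# col_level and col_fill only matter when the columns are a MultiIndex
cols = pd.MultiIndex.from_tuples([('speed', 'max'), ('species', 'type')])
df_cols = pd.DataFrame(
    [[389.0, 'fly'], [24.0, 'run']],
    index=pd.Index(['falcon', 'lion'], name='name'),
    columns=cols,
)

# Put the index name on column level 1 and label level 0 with 'meta'
moved = df_cols.reset_index(col_level=1, col_fill='meta')
print(moved.columns.tolist())
```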

Usage examples
Basic reset:

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Data': [10, 20, 30, 40],
}, index=['a', 'b', 'c', 'd'])

print("Original DataFrame:")
print(df)

# Reset the index
reset_df = df.reset_index()

print("\nDataFrame after reset_index():")
print(reset_df)

That results in:

Original DataFrame:
   Data
a    10
b    20
c    30
d    40

DataFrame after reset_index():
  index  Data
0     a    10
1     b    20
2     c    30
3     d    40

Dropping an index

If the index is irrelevant and not needed as a column, set the parameter drop=True:

reset_df_drop = df.reset_index(drop=True)
print(reset_df_drop)

That results in:

   Data
0    10
1    20
2    30
3    40

Multi-index reset

# Create a MultiIndex DataFrame
mindex = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')], names=['first', 'second'])
df_multi = pd.DataFrame({'Data': [100, 200, 300, 400]}, index=mindex)

print("Original MultiIndex DataFrame:")
print(df_multi)

# Reset the 'second' level of the index
reset_multi_df = df_multi.reset_index(level='second')

print("\nDataFrame after resetting 'second' level:")
print(reset_multi_df)

That results in:

Original MultiIndex DataFrame:
              Data
first second      
1     a        100
      b        200
2     a        300
      b        400

DataFrame after resetting 'second' level:
      second  Data
first             
1          a   100
1          b   200
2          a   300
2          b   400

Conclusions

Pandas reset_index() is a versatile tool in the Pandas library that provides essential functionality for DataFrame and Series index manipulation. Whether you’re preparing data for analysis, integrating index data as a feature, or simply organizing data post-transformation, understanding how it works will speed your processes up.

Hi, my name is Federico and I am a freelance Technical Writer.

Do you want to start a documentation project, collaborating with me? Contact me!

Do you want to know more about my work? You can start with my portfolio.

--
The article "Pandas reset_index(): How To Reset Indexes in Pandas" was first published in my blog.

How To Easily Remove a Password From a PDF file

Federico Trotta — Wed, 24 Apr 2024 13:36:54 +0000

Imagine a scenario (which happened to me): your employer gives you the documentation relating to your financial situation and, at a certain point, you have to give it to your financial advisor.

Problem: the document is in PDF and is password-protected (which is good), but you can give it to your financial advisor only by loading it on a platform. How do you tell your financial advisor the password?

These are the scenarios I imagined to solve the problem:

You can load the file on the platform and tell the financial advisor the password via email or phone.

You can name the file as password_****** so that they can understand what the password is.

You can print the file, scan it, and create a new PDF (problem: in my case, these were 30 pages!!).

Now, as a Cyber Security enthusiast, I didn’t want to share the password with anyone, even if it was a unique one, for obvious reasons. Also, I didn’t want to waste time and paper to print 30 pages just to create another PDF.

So, the only (right) solution was to find something to remove the password from the file, so that it could be opened by my financial advisor.

In this article, I show you how you can remove a password from a PDF in a couple of minutes, and for free.

How to remove a password from a PDF file

Before arriving at the solution I’m showing you, I searched the Internet extensively.

Let me tell you one thing: you can find a lot of online services that can remove a password from a PDF file. In my case, though, I didn’t want to use any of them for a simple reason: the file was about my financial situation, so I didn't want to leave a trace or a record of it in a third-party database (other than my financial advisor’s one).

So, in this scenario, I found qpdf, a command-line tool and library that can convert a PDF file into an equivalent one. It has only one disadvantage: it can be used only via a terminal.

But don’t worry: it is very easy to use.

Let’s see.

qpdf can be installed and used on any OS, but here we’ll see the procedure to do so on Ubuntu.
(If you are a Windows user, don’t worry: you can install WSL.)

So, in a Linux environment, install qpdf by typing:

$ sudo apt-get install qpdf

Now, at this point, we have to be careful if we don’t want to make mistakes. If you are using Ubuntu on Windows (via WSL) you have to move your PDF file to the environment where Ubuntu actually works. You should see something like the following:

The Linux environment on Windows created by WSL. Image by Federico Trotta.

So, as you can see, I have Ubuntu 18.04 under the Linux environment. This folder is created by WSL when you install it.

So, at this point suppose that:

Your file is called my_file and it is located on the following path: C:/Home/Linux/Ubuntu-18.04/my_file.pdf
my_file is protected with the following password: my_fileExample

Qpdf will create a copy of the file named my_file_free located in C:/Home/Linux/Ubuntu-18.04/my_file_free.pdf

Now, via terminal, you just need to type the following:

$ qpdf --decrypt --password=my_fileExample C:/Home/Linux/Ubuntu-18.04/my_file.pdf C:/Home/Linux/Ubuntu-18.04/my_file_free.pdf

And the job is done. So the scheme is:

$ qpdf --decrypt --password=<YOUR PASSWORD> file.PDF new_file.PDF
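If you prefer to run the same command from Python, for example to process several files, a minimal sketch could look like the following. It only wraps the qpdf command line shown above with the standard subprocess module; the function names are hypothetical, and qpdf still has to be installed on your system.

```python
import shutil
import subprocess


def build_qpdf_command(password, src, dst):
    """Build the qpdf argument list that follows the scheme above."""
    return ['qpdf', '--decrypt', f'--password={password}', src, dst]


def remove_pdf_password(password, src, dst):
    """Run qpdf to write an unprotected copy of src to dst."""
    if shutil.which('qpdf') is None:
        raise RuntimeError('qpdf is not installed or not on PATH')
    # Passing a list (no shell) keeps special characters in the password intact
    subprocess.run(build_qpdf_command(password, src, dst), check=True)


# Example call, using the file names and password assumed in this article:
# remove_pdf_password('my_fileExample', 'my_file.pdf', 'my_file_free.pdf')
```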

Conclusions

I hope this article helps you remove a password from a PDF.
Let me tell you that protecting your files with passwords is very important to protect your data, but in special cases, this procedure may cause you a lot of headaches.

The article "How To Easily Remove a Password From a PDF file" was first published on my blog.

How To Create a Repository in GitHub

Federico Trotta — Wed, 24 Apr 2024 13:03:49 +0000

As I’ve worked for several years with documents, I understand the need to define the revision index of a document when it needs to be changed. So, when I first read about version control (and Git and GitHub), I could immediately understand its importance, even if this is not exactly the same thing as a document revision.

One good thing to do when learning to program is to get comfortable with version control, especially for two reasons:

When a project needs to be revised, version control gives you the possibility to see all the previous versions.

A version control platform, like GitHub for example, gives you the possibility to work locally on your project and also store the versions online, so that you can show your projects to the world, share knowledge, etc.

So, let’s see how to create a repository on GitHub.

Create a GitHub account and install Git on your PC

The first step is to create a GitHub account. You can do it here.

After that, you have to install Git on your PC. You can download it here.

Now, open the terminal and set up your email by typing the following code:

git config --global user.email YOUR_EMAIL

Of course, YOUR_EMAIL is your complete email address. Make sure you use the same email address for setting up Git and for signing up on GitHub.

Now, set up your Git name by typing this code:

git config --global user.name YOUR_NAME

Where YOUR_NAME can be your complete name or a nickname: you decide.

Git is now set up and configured. The next step is the creation of an SSH key.

Create an SSH key

An SSH key is a pair of cryptographic keys (one private, one public) that authenticates you when you access a remote machine, without sending a password over the network.

So, let’s see how to set an SSH key.

Open another terminal and type this code:

ssh-keygen -t ed25519 -C "YOUR_EMAIL"

In the above code, use the email you’ve used in the previous steps, and keep the double quotes exactly as typed.

After you type this command, you will see a lot of output, as it generates a private key and a public key.

Before going on, start the SSH agent to check that everything works fine:

eval `ssh-agent`

If you get something like:

Agent pid 125746

it means that everything works fine and we can go on.

Now, you can get the public SSH key. In the output generated by the ssh-keygen -t ed25519 -C "YOUR_EMAIL" command, you can easily find a line like this:

Your public key has been saved in /home/YOUR_PC/.ssh/id_ed25519.pub

This line tells you where your public key is stored. As said before, we need just the public key. To get it, we can type the following in the terminal (this is for Linux users; if you are not a Linux user, you can navigate to the file and open it):

cat ~/.ssh/id_ed25519.pub

and we get:

ssh-ed25519 NUMBERS_AND_LETTERS

Now, copy this result: we’ll paste it on GitHub.

Go to GitHub and log in to your account. Go to Settings > SSH and GPG keys and click on New SSH key. Here's what you'll see:

(Adding SSH key on GitHub. Image by Federico Trotta.)

Give it a title and paste the key (ssh-ed25519 NUMBERS_AND_LETTERS) into the Key field.

Now, we can create our first local repository and connect it to a remote repository on GitHub.

Your first GitHub repository

Now, let’s create our remote repository on GitHub. Click on New repository and give it a name. Let’s say we call it example and set it to be public (so it can be seen by anyone). This is what you see:

(Your first repository. Image by Federico Trotta.)

The only thing you have to do now is copy the last three lines of code:

git remote add origin git@github.com:t-YOURNAME/example.git
git branch -M main
git push -u origin main

Get those lines copied: we will paste them into the terminal in a moment.

Now, on your PC, create a local folder where the files will be. We’ll call the folder example so that the local and remote repositories have the same name.

Now, we are going to use Git.

In the terminal, move into that folder and type the following:

git init

This will make your local folder a repository. Then, you can see its status by typing:

git status

And you will see the status of the repository. Since it is a new one, the terminal will tell you that there are no commits yet. Moreover, it will list all the files in the folder as untracked.

Let’s say you have a file called my_file.py that you want to keep in the repository (both the local one and the remote one on GitHub). You can choose to add specific files to the commit, or all of them. In this case, we just have one, so we type:

git add my_file.py

(In case you want to add all the files, type: git add . with a space before the dot.)

Now, commit it:

git commit -m 'initial version'

This means that the message describing this revision is initial version, but you can write whatever you want.

Finally, paste the lines copied from GitHub. Here they are again:

git remote add origin git@github.com:t-YOURNAME/example.git
git branch -M main
git push -u origin main

And we are done!

If you now open your GitHub repository (called example), you will see the my_file.py file.

Conclusions

In this article, we've shown how to create a new repository on GitHub.

I hope you find it useful!

The article "How To Create a Repository in GitHub" was first published on my blog.

What if ChatGPT Had Already Reached its Glory?

Federico Trotta — Tue, 09 Apr 2024 07:31:14 +0000

Since ChatGPT "was born" nearly a year and a half ago, everyone out there has been telling us that AI, especially LLMs, will steal our jobs. No more writers, no more developers, no more marketers, no more operators. The future seems to see AI as the king of the world.

While there's no doubt that AI is here to stay and support us in our daily jobs, it is not so easy to say that it will replace jobs (and which jobs, in particular).

Also, as a user of ChatGPT, I noticed a degradation in its performance since it came on the market.

So, in this article, I'd like to raise some questions - and eventually, create discussions - about this topic:

what if ChatGPT has already seen its glory days? Will it really increase its capabilities in a short period of time, or will it take years to make great improvements?

I'll do so by introducing well-known procedures, like model training, to guide you through the things you need to take into account to reason about the topic.

NOTE: In this article, I will talk about ChatGPT for simplicity as it's the most famous (and maybe used) LLM, but the considerations apply to other similar software.

An introduction to training and evaluating an ML/DL model

When training and evaluating a Machine Learning (ML) or a Deep Learning (DL) model, data scientists always do the same thing: they get the available dataset and split it into the train and the test set.

This operation is done to find the model that best fits the data. To do so, data scientists train different models on the train set and calculate some performance metrics. Then, they calculate the same performance metrics using the test set and find the best-performing model.

So, the importance of this methodology is that ML and DL models have to be evaluated on new and unseen data to verify that they are generalizing well what they have learned in the train set.
This is an important introduction to keep in mind for the subsequent part of this article.
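As a minimal sketch of this split, using only the standard library: the function name, the 20% test fraction, and the seed are assumptions chosen for illustration (libraries such as scikit-learn offer ready-made equivalents).

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(100))  # stand-in for 100 labelled samples
train, test = train_test_split(data)

print(len(train), len(test))        # 80 20
print(set(train).isdisjoint(test))  # True: no sample appears in both sets
```

The disjointness check is the important part: the test set is only useful if the model has never seen its samples during training.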

The sets of combinations

In mathematics - in particular, in linear algebra - we talk about the sets of combinations.

The most famous one is the set of linear combinations (also called the "span"). To define a linear combination, we use Wikipedia:

In mathematics, a linear combination is an expression constructed from a set of terms by multiplying each term by a constant and adding the results.

So, for example, the linear combination of two variables x and y could be created as:

$z = a x + b y$

Where a and b are two constant values (two numbers).

Anyway, the sets of combinations can also be non-linear. This means that a variable z can be created as a non-linear combination of x and y. It could be quadratic, cubic, or it could have another mathematical form.

Of course: this applies to variables as well as to all mathematical entities, like datasets.
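To see the difference on actual numbers, here is a small sketch that builds a linear and a non-linear combination of two short lists. The values of x, y, a, and b, and the particular quadratic form, are arbitrary choices made for illustration.

```python
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
a, b = 2.0, 0.5

# Linear combination: each term is only multiplied by a constant, then summed
z_linear = [a * xi + b * yi for xi, yi in zip(x, y)]

# Non-linear combination: squares and products of the variables appear
z_nonlinear = [a * xi ** 2 + b * xi * yi for xi, yi in zip(x, y)]

print(z_linear)     # [4.0, 6.5, 9.0]
print(z_nonlinear)  # [4.0, 13.0, 27.0]
```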

How sets of combinations influence the performance of a model

Now, considering what we've defined until now, a question may arise:

What happens if the test set is a combination of the train set?

In this case, two major events may occur (both or one of the two):

Data leakage. "Data leakage is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know and in turn, invalidate the estimated performance of the model being constructed."
Overfitting. This phenomenon occurs when the model has learned patterns that are specific to the training data (including its noise). This results in low performance on the unseen data in the test set.

When these two phenomena occur, the model generalizes poorly. This means that the model may perform well on the test set because it reflects the combinations present in the training set, but its ability to perform on entirely new data is still unknown.

In other words, the model has an evaluation bias. This means that the evaluation metrics calculated on the test set may not reflect the model's actual performance (thus, they are biased). This could lead to misguided confidence in the model's abilities.
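A toy experiment can make this evaluation bias visible. The sketch below is an illustration under strong assumptions, not a model of how LLMs are evaluated: the labels are pure noise, a 1-nearest-neighbour predictor stands in for a model that memorizes its training data, and the "leaky" test set uses the simplest possible combination of the train set, namely copies of its points.

```python
import random

rng = random.Random(0)


def make_samples(n):
    """Random 1-D points with labels that carry no real signal."""
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]


def predict(train, x):
    """1-nearest-neighbour: return the label of the closest training point."""
    return min(train, key=lambda sample: abs(sample[0] - x))[1]


def accuracy(train, test):
    """Fraction of test points whose predicted label matches the true one."""
    hits = sum(predict(train, x) == label for x, label in test)
    return hits / len(test)


train = make_samples(200)
leaky_test = rng.sample(train, 50)   # test points copied from the train set
fresh_test = make_samples(200)       # genuinely unseen points

print("accuracy on leaky test set:", accuracy(train, leaky_test))
print("accuracy on fresh test set:", accuracy(train, fresh_test))
```

Since there is nothing to learn from random labels, the perfect score on the leaky test set says nothing about the model, while the score on the fresh set stays around chance level.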

Possible sets of combinations in training LLMs

Now, our interest here is in LLMs. As is well known, these models are trained on a vast amount of data. So, the more data there is, the higher the chance of getting biased data.

Also, we don't know the actual data used to train ChatGPT, but we know that a significant proportion of the training data came from the Internet.

So, first of all, on the Internet (but this is a consideration that applies, in general, to books and other sources) a lot of websites describe the same topic. This may lead to possible sets of combinations between the train and the test sets used.

Also, ML and DL models often need retraining to evaluate whether the model still applies to new incoming data.

Imagine that ChatGPT was trained and evaluated in September 2022 for the first time (only using the Internet, for simplicity).

Imagine that the first retraining and re-evaluation was made in March 2023 (still only using the Internet, for simplicity). Some questions that may arise are:

How much new content was generated on the Internet between the first release of ChatGPT in November 2022 and the first retraining?
How much of the new content on the Internet was AI-generated between the first release of ChatGPT in November 2022 and the first retraining?

These questions are interesting to understand if LLMs are really improving or not because - as stated at the beginning of this article - I've seen more of a degradation in the performance, rather than an improvement (but, sure: I may be biased).

So, if the point of training (and re-training) is to evaluate a model on new unseen data, we can state that AI-generated content probably creates a subset of combinations of the data previously used to train the model, thus exposing the model to data leakage, even though the training may be done with proper techniques to avoid it.

Will LLMs need years to get improvements in performance with other training?

When trying to find the model that best fits the data, we know that data quality has a higher impact than using a "better model" or better-fine-tuned hyperparameters of the same model.

So data quality is more important than model tuning. This particularly applies to LLMs that need a vast amount of data to be trained.

So, another question may be:

Given the fact that AI-generated text on the Internet may lead to creating new data that are a subset of combinations of the train data, how much time should pass before a great and new amount of content is generated so that the performance can increase?

In other words: will LLMs need years to make actual improvements, contrary to what we read daily in the news?

Or should we ask: what if ChatGPT cannot actually get better than we know it today? Has it already seen its glory days?

Conclusions

In this article, I wanted to reason about how ChatGPT is evolving in terms of performance.

I believe that re-training it using the Internet may lead to data leakage, and thus to a degradation of the performance, because AI-generated content is no longer a small proportion of the whole content existing on the Internet, and this content is a subset of combinations of content previously created.

I'd like this article to create a genuine discussion on this topic, hoping to generate a positive and constructive one. Please: share your thoughts in the comments!

How Python’s Argparse Can Be Useful in Data Science

Federico Trotta — Mon, 01 Apr 2024 16:37:27 +0000

When I first approached Python’s Argparse, I had great difficulty understanding how it works because I had never programmed before.

Also, I asked myself: “How can a command-line interface be useful in Data Science?”. Well, I’ll show you with a practical example.

But first, let’s explain what Argparse is.

What is Python’s Argparse?

Python’s Argparse is a library that gives you the possibility to pass arguments via the command-line interface. It is not the only module you can use (you can also use sys.argv), but it is definitely the most complete.

As we can see in its documentation:

The argparse module makes it easy to write user-friendly command-line interfaces. […]. The argparse module also automatically generates help and usage messages and issues errors when users give the program invalid arguments.

How Python’s Argparse can be useful in your Data Science projects: a practical example

Let’s say you have an empirical way to calculate a parameter, and this empirical method requires you to enter a value to achieve a result that is “considered good”. The problem is that you have to find the right value iteratively. If you work in Jupyter Notebooks, you’ll need to find the exact line of code to modify the parameter, each time.

For the purpose of this article, I’ve created a dataset with simulated data which reflects the reality of typical distributions, in real cases. Let’s say that our data are measured times in minutes; let’s import the data and see the data frame:

import pandas as pd

# Import data and show head
df = pd.read_excel('example.xlsx')
df.head(10)

Here’s the data frame:

The purpose of the exercise is to find the measured time that best fits the distribution

Let’s say that those measurements are times related to athletes running a fixed distance; let’s say 1 km.

We want to evaluate the athletes based on the time they need to run 1 km. But how can we fix a reasonable value of time to be achieved? Is one minute a good time? Can the majority of the athletes run 1 km in one minute? When can an athlete be considered too slow, and when too fast?

Answering these questions is the purpose of this study.

As often happens in these cases, the mean value is typically far away from being a good value, because, often, the data are not normally distributed. So we need a different metric, but this metric can rely on the mean time.

To find the metric, we have to empirically find a factor that, multiplied by the mean time, gives a value that is one of the most frequent values.

Let’s show a plot for a better understanding:

As you can see, the mean time (4.4 min) is not a good value to use to evaluate the athletes because the majority of them run 1 km in 3 or 4 minutes. In similar cases, I found that a good value is “0.85*mean time”; but this ‘0.85’ factor is an empirical value and sometimes it can be more, sometimes less (depending on how skewed the data distribution is). So the goal of using Argparse is to modify just the multiplication factor to reach a good final result (a time against which to evaluate your athletes on running 1 km).

So, let’s see a bit of code:

import argparse

# Create parser
parser = argparse.ArgumentParser()

# Specify the arguments that have to be passed
parser.add_argument('multiple', type=float, help='multiplication factor (0.85 is typical)')

# Parse and validate the arguments
args = parser.parse_args()

# Define the multiplication factor
fac = args.multiple

With the above code, I’ve created the parser, specified the arguments to parse (in this case, just one: the multiplication factor), and, after parsing and validating the arguments, I’ve stored the factor in a variable (fac = args.multiple). The work is done, and in the end, we can calculate the mean time and the adjusted time (as the mean time multiplied by the factor):

# Calculate mean values
mean = df['measures [min]'].mean(axis=0) #mean

# Define adjusted value
adj = mean*fac
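To try the same logic without the Excel file, here is a self-contained sketch. The list of times is hypothetical data standing in for the dataset, the helper names are made up for illustration, and passing a list to parse_args() simply simulates typing the factor on the command line.

```python
import argparse
from statistics import mean


def build_parser():
    """Create a parser with one positional argument: the multiplication factor."""
    parser = argparse.ArgumentParser(description='Adjust the mean time by a factor')
    parser.add_argument('multiple', type=float,
                        help='multiplication factor (0.85 is typical)')
    return parser


def adjusted_time(times, factor):
    """Return the mean of the measured times multiplied by the factor."""
    return mean(times) * factor


# Hypothetical measured times in minutes, standing in for the Excel data
times = [3.0, 3.5, 4.0, 4.0, 7.5]

# Passing a list to parse_args() simulates "python3 exercise.py 0.85"
args = build_parser().parse_args(['0.85'])
print(f"mean: {mean(times):.2f}  adjusted: {adjusted_time(times, args.multiple):.2f}")
```

In a real script you would call parse_args() with no arguments, so that the factor is read from the command line.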

We can now plot the graph:

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import os

# Define figure size in inches and font scale
plt.rcParams['figure.figsize'] = 15, 10
sns.set(font_scale=1)

# Plot the frequencies
sns.histplot(df, x="measures [min]", binwidth=1, color='red')

# Add the mean time and the adjusted time
plt.axvline(x=adj, color="blue") # Vertical line at the "adjusted" value
plt.axvline(x=mean, color="green") # Vertical line at the "mean" value

# Create labels
plt.title("FREQUENCIES OF THE MEASURED VALUES", fontsize=18)
plt.xlabel("VALUES", fontsize=12)
plt.ylabel("FREQUENCIES", fontsize=12)

# Define the legend entries for the mean and adjusted values
blu_line = mpatches.Patch(color="blue", label=f"adjusted value: {adj:.1f}")
green_line = mpatches.Patch(color="green", label=f"mean value: {mean:.1f}")
plt.legend(handles=[blu_line, green_line], prop={"size":15})

# Show the plot (needed when the code runs as a script rather than in a Notebook)
plt.show()

The result is:

As you can see, the adjusted time (3.7 min) is a better value for evaluating the athletes than the mean time (4.4 min), since it is close to the tallest bars, which correspond to the most frequently measured times. And how can we use Argparse to arrive here?

First of all, save your Jupyter Notebook with the .py extension. Let’s call it exercise.py and save it in a directory. Open a terminal in that directory and type:

python3 exercise.py --help

This shows the help:

So, if you want to play with the multiplication factor and start from “0.85”, you just need to write this in the terminal:

python3 exercise.py 0.85

With the above command, Python will display the plot, and if “0.85” is not a good fit, you can change it quickly and easily, without searching through your whole Notebook for the exact line of code to modify!
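As a possible variation, the factor could be made optional so that the script falls back to 0.85 when no value is given. The following is only a sketch under that assumption: the helper names build_parser and adjusted_time are hypothetical and not part of the original script, and the mean time of 4.4 min is taken from the example above.

```python
import argparse


def build_parser():
    """Create a parser whose single positional argument is optional."""
    parser = argparse.ArgumentParser(description="Adjust the mean time by a factor")
    # nargs="?" makes the positional argument optional; default is the typical 0.85
    parser.add_argument(
        "multiple",
        type=float,
        nargs="?",
        default=0.85,
        help="multiplication factor (default: 0.85)",
    )
    return parser


def adjusted_time(mean_time, factor):
    """Return the mean time multiplied by the factor."""
    return mean_time * factor


# parse_args also accepts an explicit list, which is handy for quick experiments
args = build_parser().parse_args(["0.85"])
print(f"{adjusted_time(4.4, args.multiple):.1f}")  # prints 3.7
```

Passing a list to parse_args lets you try several factors without leaving Python; calling parse_args() with no argument reads the command line as in the original script.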

Conclusions

This article shows how Python’s Argparse can be useful even in Data Science projects.

Sometimes, in fact, when analyzing data you may need to adjust some parameters: in that case, Python’s Argparse can be the best library you can choose.

The article "How Python’s Argparse Can Be Useful in Data Science" was originally created for my blog.

How to calculate RGB values in Python

Federico Trotta — Mon, 01 Apr 2024 15:41:00 +0000

When managing images, a good exercise is to calculate RGB values in Python.

If you’re asking yourself “What does RGB mean?”, don’t worry: this was the first question I asked myself before coding this exercise.

So, before the code, let’s talk about RGB.

Black and white, RGB, Alpha level: basic information about images

RGB stands for “Red, Green, Blue” and it's an “additive color” model: summing the three colors at full intensity results in white. In particular, it's the model used in electronic displays to render the pixels of an image.

This means that, when analyzing a colored image, Python gives us three numbers for each pixel: one for Red, one for Green, and one for Blue.

Of course, this means that, from a black-and-white image, we can calculate just one value.

The alpha level, instead, represents transparency: for an image with an alpha channel we can calculate four values (one for R, one for G, one for B, and one for the alpha level).
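To make the "one, three, or four values" idea concrete, here is a minimal sketch that uses PIL's image modes. It assumes the black-and-white image is stored in the single-channel mode "L" (a real file may use a different mode), and it builds tiny blank images in memory instead of loading files.

```python
from PIL import Image

# Build a 1x1 blank image for each mode and list its channels (bands)
band_counts = {}
for mode in ("L", "RGB", "RGBA"):
    bands = Image.new(mode, (1, 1)).getbands()
    band_counts[mode] = len(bands)
    print(mode, bands, len(bands))
```

The loop prints one band for "L", three for "RGB", and four for "RGBA", matching the number of values described above.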

RGB values in Python: a preliminary study

All right, let’s use some code here!

Let’s say we have a folder called images.

In this folder, we have three images:

One in black and white, named bw.png.
One colored, named daffodil.jpg.
One colored with the alpha level, called eclipse.png.

We want to calculate the RGB values in Python for each one of these images.

To do so, we can use the PIL library to load the images, and NumPy to convert them into NumPy arrays.

So, let’s import the libraries:

from PIL import Image
import numpy as np
import os

# Folder that contains the images
dst_img = "images"

Now, before coding, we have to understand some topics.

We’ll use NumPy arrays so, first of all, let's remember that the shape of an array is the number of elements along each dimension. Moreover, the ndim attribute returns the number of dimensions of an array.

So, let’s calculate the shape and the ndim for each array.

For the black and white image (bw.png) we have:

arr = np.array(Image.open(os.path.join(dst_img, "bw.png")))
print(arr.shape)
print(arr.ndim)

The result is:

(512, 512)
2

So, this image has a height and a width equal to 512 px. The array has just 2 dimensions, which is consistent with the fact that the image is in black and white: there is a single value per pixel, so no channel axis is needed.

For the RGB image (daffodil.jpg) we have:

arr = np.array(Image.open(os.path.join(dst_img, "daffodil.jpg")))
print(arr.shape)
print(arr.ndim)

The result is:

(500, 335, 3)
3

So, this image has a height equal to 500 px and a width equal to 335 px. The array has 3 dimensions, which is consistent with the fact that it is an RGB image: the third axis holds the 3 color channels.

In the end, for the RGB+alpha (eclipse.png) image we have:

arr = np.array(Image.open(os.path.join(dst_img, "eclipse.png")))
print(arr.shape)
print(arr.ndim)

The result is:

(256, 256, 4)
3

This image has a height and a width equal to 256 px. The array also has 3 dimensions, as for an RGB image, but the third axis now holds 4 channels!
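If you don't have the three files at hand, the same pattern can be reproduced with blank images created in memory. This is only a sketch: the 4x2 size is arbitrary, and the mode "L" is an assumption for how a black-and-white image is stored.

```python
from PIL import Image
import numpy as np

# Image.new takes the size as (width, height);
# NumPy reports the shape as (height, width[, channels])
shapes = {}
for mode in ("L", "RGB", "RGBA"):
    arr = np.array(Image.new(mode, (4, 2)))
    shapes[mode] = arr.shape
    print(mode, arr.shape, arr.ndim)
```

The greyscale array has 2 dimensions, while the RGB and RGBA arrays both have 3 dimensions and differ only in the length of the last axis (3 versus 4).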

Using NumPy’s mean function, we can calculate the mean value for each color channel. Now, we can create a loop to calculate our values.

Let’s see the whole code and then I’ll explain some details.

How to calculate RGB values in Python: an exercise

This is the code I’ve used to derive the information we’ve seen before from three images. Of course, this is just one way to do it!

from PIL import Image
import numpy as np
import os

# Folder that contains the images
dst_img = "images"

# Iterate over the folder to load each image as an array
list_img = os.listdir(dst_img)
for image in sorted(list_img):
    file_name, ext = os.path.splitext(image)  # Split file name from extension
    arr = np.array(Image.open(os.path.join(dst_img, image)))  # Load the image as an array

    # Calculate height and width for each image
    h, w = arr.shape[0:2]

    # Calculate the number of dimensions for each array
    arr_dim = arr.ndim

    # Calculate the shape for each array
    arr_shape = arr.shape

    if arr_dim == 2:  # GREYSCALE CASE
        arr_mean = np.mean(arr)
        print(f'[{file_name}, greyscale={arr_mean:.1f}]')
    else:
        arr_mean = np.mean(arr, axis=(0, 1))
        if len(arr_mean) == 3:  # RGB CASE
            print(f'[{file_name}, R={arr_mean[0]:.1f}, G={arr_mean[1]:.1f}, B={arr_mean[2]:.1f} ]')
        else:  # ALPHA CASE
            print(f'[{file_name}, R={arr_mean[0]:.1f}, G={arr_mean[1]:.1f}, B={arr_mean[2]:.1f}, ALPHA={arr_mean[3]:.1f}]')

The result is:

[bw, greyscale=21.5]
[daffodil, R=109.3, G=85.6, B=5.0 ]
[eclipse, R=109.0, G=109.5, B=39.8, ALPHA=133.6]

Some code explanations

An important part of coding is trying to generalize, so that we can reuse the code in the future if needed (with the due changes, of course).

In this case, I’ve decided to differentiate the images by studying the shape and the ndim values derived from NumPy.

At the beginning, I wanted to gather general information about all the images, which is why I loaded them all with PIL and NumPy; I could then immediately get the file name and the dimensions of each image, because this information is available for every image regardless of its type.

Then, I handled the black-and-white image with the if arr_dim == 2 statement: the black-and-white image, as I said before, has just two dimensions.
After that, I handled the RGB image and the RGB image with the ALPHA channel.

For these, I first generalized the calculation of the mean values with arr_mean = np.mean(arr, axis=(0,1)); I had to use axis=(0,1) because those arrays have 3 dimensions, and the mean has to be computed over the height and width axes (axes 0 and 1 in NumPy), leaving one value per channel.

Finally, I differentiated the RGB image from the RGB+ALPHA image with len(arr_mean): since the RGB image has 3 channels, len(arr_mean) equals 3, while for the RGB+ALPHA image, which has 4 channels, it equals 4. Hence the if/else shown in the code above.
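The same branching can be wrapped in a small reusable function. The sketch below is one possible way to do it, not the code used in the exercise: the name channel_means is hypothetical, and it assumes the array is either greyscale (2 dimensions) or has its channels ordered as R, G, B and optionally alpha.

```python
import numpy as np


def channel_means(arr):
    """Return the mean value per channel for a greyscale, RGB, or RGBA array."""
    if arr.ndim == 2:
        # Greyscale: a single mean over all the pixels
        return {"greyscale": float(arr.mean())}
    # Average over height (axis 0) and width (axis 1): one value per channel
    means = arr.mean(axis=(0, 1))
    names = ["R", "G", "B", "ALPHA"]
    # zip stops at the shorter sequence, so RGB arrays simply skip "ALPHA"
    return {name: float(value) for name, value in zip(names, means)}


# A tiny synthetic 2x2 RGB array: red = 200, green = 100, blue = 0 everywhere
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[..., 0] = 200
rgb[..., 1] = 100
print(channel_means(rgb))  # {'R': 200.0, 'G': 100.0, 'B': 0.0}
```

Returning a dictionary instead of printing keeps the calculation separate from the formatting, so the result can be reused, for example to build a table.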

Conclusions

This article describes what RGB values are and how to calculate RGB values in Python.

The best thing you can do now is to try it with your images.

The post "How to calculate RGB values in Python" was originally created for my blog.
