<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Luca Liu</title>
    <description>The latest articles on Forem by Luca Liu (@luca1iu).</description>
    <link>https://forem.com/luca1iu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1230121%2F2521cc84-ad7d-458c-99e5-b4d82f625a88.jpg</url>
      <title>Forem: Luca Liu</title>
      <link>https://forem.com/luca1iu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/luca1iu"/>
    <language>en</language>
    <item>
      <title>Stop Using Spark for Your Small Data - Why Azure Functions is the Right Tool for the Job</title>
      <dc:creator>Luca Liu</dc:creator>
      <pubDate>Wed, 06 May 2026 09:22:57 +0000</pubDate>
      <link>https://forem.com/luca1iu/stop-using-spark-for-your-small-data-why-azure-functions-is-the-right-tool-for-the-job-4j66</link>
      <guid>https://forem.com/luca1iu/stop-using-spark-for-your-small-data-why-azure-functions-is-the-right-tool-for-the-job-4j66</guid>
      <description>&lt;p&gt;As a data analyst, my job is to get data from A to B, cleaned and ready for use. A common workflow for my team involves users uploading Excel files to a &lt;a href="https://www.microsoft.com/de-de/microsoft-365/onedrive/online-cloud-storage?market=de" rel="noopener noreferrer"&gt;OneDrive&lt;/a&gt; folder. A &lt;a href="//microsoft.com/de-de/power-platform/products/power-automate"&gt;Power Automate&lt;/a&gt; flow then syncs these files daily to a container in our &lt;a href="https://learn.microsoft.com/en-us/azure/storage/common/storage-account-overview" rel="noopener noreferrer"&gt;Azure Storage Account&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;From there, my responsibility begins:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the new Excel file from Blob Storage using Python.&lt;/li&gt;
&lt;li&gt;Process the data (clean, transform, apply business logic).&lt;/li&gt;
&lt;li&gt;Write the final data to an Azure SQL Database.&lt;/li&gt;
&lt;/ol&gt;
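&lt;p&gt;In outline, those three steps can be sketched as below. The &lt;code&gt;clean_rows&lt;/code&gt; helper and the names in the comments (container, table) are invented for illustration; the real Blob and SQL steps need the Azure SDKs and pandas, so they appear only as comments:&lt;/p&gt;

```python
# Illustrative sketch of the three steps; helper and names are examples only.

def clean_rows(rows):
    """Step 2: drop fully empty rows and normalise header names."""
    cleaned = []
    for row in rows:
        if all(value in (None, "") for value in row.values()):
            continue  # skip blank Excel rows
        cleaned.append({key.strip().lower().replace(" ", "_"): value
                        for key, value in row.items()})
    return cleaned

# Step 1: read the new Excel file from Blob Storage, roughly:
#   blob = BlobClient.from_connection_string(conn_str, "uploads", filename)
#   rows = pd.read_excel(io.BytesIO(blob.download_blob().readall()))
# Step 3: write the cleaned data to Azure SQL, roughly:
#   frame.to_sql("StagingTable", engine, if_exists="append", index=False)

print(clean_rows([{" Product Name": "Laptop", "Price": 999.99},
                  {"Product Name": "", "Price": None}]))
```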

&lt;p&gt;I needed this to run on two triggers: a &lt;strong&gt;time schedule&lt;/strong&gt; (e.g., every morning at 7 AM) and an &lt;strong&gt;event-driven&lt;/strong&gt; trigger (i.e., as soon as a new file lands in the container).&lt;/p&gt;

&lt;p&gt;My first thought was to use the "big data" tools I'd heard of: &lt;a href="https://azure.microsoft.com/de-de/products/databricks" rel="noopener noreferrer"&gt;&lt;strong&gt;Azure Databricks&lt;/strong&gt;&lt;/a&gt; or &lt;a href="https://azure.microsoft.com/de-de/products/synapse-analytics" rel="noopener noreferrer"&gt;&lt;strong&gt;Azure Synapse Analytics&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  The "Big Tool" Trap
&lt;/h1&gt;

&lt;p&gt;On the surface, Databricks and Synapse are perfect.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They let me write Python in a &lt;strong&gt;Notebook&lt;/strong&gt;, which I'm very comfortable with.&lt;/li&gt;
&lt;li&gt;They have easy-to-use &lt;strong&gt;trigger&lt;/strong&gt; and &lt;strong&gt;monitoring&lt;/strong&gt; tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I set up a proof of concept, and it worked. But I quickly ran into a problem: my Excel files are 10MB, not 10TB.&lt;/p&gt;

&lt;p&gt;Using a full Spark cluster (which is what both Databricks and Synapse Notebooks run on) was like &lt;strong&gt;using a sledgehammer to crack a nut&lt;/strong&gt;. I was paying for a powerful, multi-node cluster (which took 5-10 minutes to "cold start") just to run a Python script that finished in 30 seconds. The cost was going to be far too high for such a simple task.&lt;/p&gt;

&lt;h1&gt;
  
  
  The "Right Tool": Azure Functions
&lt;/h1&gt;

&lt;p&gt;After some research, I found the perfect tool for small-to-medium data tasks: &lt;strong&gt;Azure Functions&lt;/strong&gt;.&lt;br&gt;
Azure Functions, when used on a "Consumption Plan," is a true "serverless" service. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It's cheap:&lt;/strong&gt; You get a generous free grant every month, and after that, you pay &lt;em&gt;only&lt;/em&gt; for the seconds your code is actually running. For my task, the cost is practically $0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's fast:&lt;/strong&gt; It starts in seconds (or less), not minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's perfect for triggers:&lt;/strong&gt; It has built-in triggers for exactly my needs (Timer and Blob Storage).&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  The (Small) Learning Curve
&lt;/h1&gt;

&lt;p&gt;The one trade-off is that it's &lt;em&gt;slightly&lt;/em&gt; more complex than a notebook. You can't just write and run your code in a web browser. The modern, recommended workflow is to use &lt;strong&gt;Visual Studio Code (VS Code)&lt;/strong&gt; to develop your code locally and then "deploy" (push) it to the cloud.&lt;/p&gt;

&lt;p&gt;This "local development" workflow is a best practice. It means you have a copy of your code, can use source control (like Git), and can test everything on your machine before it goes live.&lt;/p&gt;
&lt;h1&gt;
  
  
  More Than Just Timers
&lt;/h1&gt;

&lt;p&gt;My needs were simple, but Azure Functions has triggers for almost anything. The most popular ones include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timer Trigger:&lt;/strong&gt; Runs on a schedule. Azure Functions uses six-field NCRONTAB expressions with a leading seconds field (e.g., &lt;code&gt;0 0 7 * * 1&lt;/code&gt; for 7 AM every Monday).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blob Trigger:&lt;/strong&gt; Runs when a new file is uploaded to a storage container.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Trigger:&lt;/strong&gt; Runs when it receives a web request (creating a simple API).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queue Trigger:&lt;/strong&gt; Runs when a new message is added to a storage queue.&lt;/li&gt;
&lt;/ul&gt;
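&lt;p&gt;Each trigger is declared in the function's configuration. As a sketch, a Blob trigger watching a hypothetical &lt;code&gt;uploads&lt;/code&gt; container looks like this in &lt;code&gt;function.json&lt;/code&gt; (the container name and connection setting are placeholders):&lt;/p&gt;

```json
{
  "bindings": [
    {
      "name": "myblob",
      "type": "blobTrigger",
      "direction": "in",
      "path": "uploads/{name}",
      "connection": "AzureWebJobsStorage"
    }
  ]
}
```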

&lt;p&gt;You can see the full list on the official &lt;a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-triggers-bindings" rel="noopener noreferrer"&gt;Microsoft Azure Functions Triggers and Bindings documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Databricks and Synapse are amazing, powerful tools, but they are not the answer for everything. For our team's daily Excel processing, using them was costing us time and money.&lt;/p&gt;

&lt;p&gt;By investing a little time to learn the VS Code + Azure Functions workflow, we built a solution that is faster, more efficient, and costs a fraction of the price. &lt;strong&gt;Don't pay for a Spark cluster when all you need is a 30-second Python script.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Explore more
&lt;/h2&gt;


&lt;div class="ltag__user ltag__user__id__1230121"&gt;
    &lt;a href="/luca1iu" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=150,height=150,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1230121%2F2521cc84-ad7d-458c-99e5-b4d82f625a88.jpg" alt="luca1iu image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/luca1iu"&gt;Luca Liu&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/luca1iu"&gt;Hello there! 👋 I'm Luca, a Business Intelligence Developer with passion for all things data. Proficient in Python, SQL, Power BI, Tableau&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Thank you for taking the time to explore data-related insights with me. I appreciate your engagement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/lucaliu-data" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Connect with me on LinkedIn&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/Luca_DataTeam" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Connect with me on X&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>azure</category>
      <category>dataanalyst</category>
      <category>functions</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Data Analyst: Does Your Work Actually Matter?</title>
      <dc:creator>Luca Liu</dc:creator>
      <pubDate>Wed, 06 May 2026 09:22:37 +0000</pubDate>
      <link>https://forem.com/luca1iu/data-analyst-does-your-work-actually-matter-3in2</link>
      <guid>https://forem.com/luca1iu/data-analyst-does-your-work-actually-matter-3in2</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;I recently saw a question on Reddit that stopped me in my tracks: "Do you feel your work in data analysis is valuable to the organization you work for?"&lt;/p&gt;

&lt;p&gt;It is the question that haunts every data analyst.&lt;/p&gt;

&lt;p&gt;We spend hours cleaning data and building complex dashboards. We send them out into the void. And then... silence. We wonder: Is anyone actually reading this? Does this dashboard change anything?&lt;/p&gt;

&lt;p&gt;If you are just answering ad-hoc requests, the answer is often "no."&lt;/p&gt;

&lt;h1&gt;
  
  
  The Trap of "Saving Time"
&lt;/h1&gt;

&lt;p&gt;Many analysts get stuck in the "automation trap." A colleague from another department asks you to automate their manual workflow. You do it. They are happy because they save two hours a week.&lt;/p&gt;

&lt;p&gt;You feel useful. But does the company see the value?&lt;/p&gt;

&lt;p&gt;Often, they don't. From a management perspective, that colleague’s salary is already paid. Unless that saved time is directly used to generate new revenue, your automation didn't change the company's bottom line. You just made someone's life easier.&lt;/p&gt;

&lt;p&gt;That is nice, but it isn't necessarily &lt;em&gt;valuable&lt;/em&gt; in a way leaders notice.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Shift: Stop Doing Projects, Start Building Products
&lt;/h1&gt;

&lt;p&gt;If you want your work to matter, you need to stop acting like an IT support desk and start acting like a Product Owner.&lt;/p&gt;

&lt;p&gt;What is the difference?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A Data Project&lt;/strong&gt; has a start and an end date. It is usually a one-time request. The goal is "delivery." Once you hand over the dashboard or report, you are done. It quickly becomes outdated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A Data Product&lt;/strong&gt; is a living tool. It doesn't just report the past; it helps shape future decisions. It evolves. Its goal is not "delivery," but measurable "business impact" (like saving money or reducing risk).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Real-World Example: The SpendCube
&lt;/h1&gt;

&lt;p&gt;Let’s look at a real example from my work with a purchasing department.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Project" Approach:&lt;/strong&gt; &lt;br&gt;
The department asks for a report on last month's spending. I pull the data, send an Excel file, and close the ticket. &lt;br&gt;
&lt;em&gt;Result:&lt;/em&gt; They look at what happened. Nothing changes. The value is low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Product" Approach (The SpendCube Dashboard):&lt;/strong&gt; &lt;br&gt;
I build a live dashboard that doesn't just show &lt;em&gt;what&lt;/em&gt; was spent, but actively highlights &lt;em&gt;where&lt;/em&gt; we are overspending against budget in real-time. It identifies specific suppliers where we could negotiate better contracts tomorrow. &lt;br&gt;
&lt;em&gt;Result:&lt;/em&gt; The dashboard isn't just a report; it is a tool they use to actively save the company money. It contributes directly to the P&amp;amp;L (Profit and Loss).&lt;/p&gt;
&lt;h1&gt;
  
  
  How to Make Your Work Valuable
&lt;/h1&gt;

&lt;p&gt;If you are tired of wondering if your work matters, change your approach.&lt;/p&gt;

&lt;p&gt;Don't just accept tasks. When someone asks for a dashboard, ask them: "What decision will you make with this data?" If they can't answer, the dashboard probably isn't necessary.&lt;/p&gt;

&lt;p&gt;Move away from automating tasks and start building data products that solve real business problems. When your work directly helps the company save money or make money, you never have to ask if you are valuable. You already know the answer.&lt;/p&gt;



</description>
      <category>analytics</category>
      <category>career</category>
      <category>data</category>
      <category>dataanalyst</category>
    </item>
    <item>
      <title>How to Fix "command 'claude-vscode.editor.openLast' not found" in VS Code</title>
      <dc:creator>Luca Liu</dc:creator>
      <pubDate>Wed, 06 May 2026 08:06:22 +0000</pubDate>
      <link>https://forem.com/luca1iu/how-to-fix-command-claude-vscodeeditoropenlast-not-found-in-vs-code-13e9</link>
      <guid>https://forem.com/luca1iu/how-to-fix-command-claude-vscodeeditoropenlast-not-found-in-vs-code-13e9</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;When trying to use the Claude Code extension in VS Code (version 2.1.129), you might run into this error, which prevents the extension from opening:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;command 'claude-vscode.editor.openLast' not found&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;The fix is simple: you need to downgrade the extension to a specific stable version (2.1.128).&lt;/p&gt;

&lt;p&gt;Here are the exact steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Uninstall your current Claude VS Code extension.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click the Gear (Settings) icon on the Claude extension page in VS Code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select "Install Another Version..." from the dropdown menu.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Choose version 2.1.128 from the list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reload VS Code.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it! The error should be gone and Claude will work properly again.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vscode</category>
      <category>claude</category>
    </item>
    <item>
      <title>How to Store JSON and XML in SQL Databases</title>
      <dc:creator>Luca Liu</dc:creator>
      <pubDate>Fri, 13 Mar 2026 15:37:17 +0000</pubDate>
      <link>https://forem.com/luca1iu/how-to-store-json-and-xml-in-sql-databases-491m</link>
      <guid>https://forem.com/luca1iu/how-to-store-json-and-xml-in-sql-databases-491m</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the era of big data and diverse data formats, the ability to store and query semi-structured data like JSON (JavaScript Object Notation) and XML (eXtensible Markup Language) in SQL databases has become increasingly important. This article explores how to effectively store and manage JSON and XML data in SQL databases, along with the pros and cons of each approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding JSON and XML
&lt;/h2&gt;

&lt;h4&gt;
  
  
  JSON
&lt;/h4&gt;

&lt;p&gt;JSON is a lightweight data interchange format that is easy for humans to read and write, and easy for machines to parse and generate. It is often used in web applications for data exchange between clients and servers.&lt;/p&gt;

&lt;h4&gt;
  
  
  XML
&lt;/h4&gt;

&lt;p&gt;XML is a markup language that defines rules for encoding documents in a format that is both human-readable and machine-readable. It is widely used for data representation and exchange, especially in web services.&lt;/p&gt;
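&lt;p&gt;To make the comparison concrete, here is the same product record handled in both formats using only Python's standard library (a minimal illustration, independent of any database):&lt;/p&gt;

```python
import json
import xml.etree.ElementTree as ET

# Parse a JSON document into a dict.
record_from_json = json.loads('{"name": "Laptop", "price": 999.99}')

# Build the equivalent XML document in code and read the same fields back.
root = ET.Element("product")
ET.SubElement(root, "name").text = "Laptop"
ET.SubElement(root, "price").text = "999.99"
record_from_xml = {"name": root.findtext("name"),
                   "price": float(root.findtext("price"))}

print(record_from_json == record_from_xml)  # prints True
```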

&lt;h2&gt;
  
  
  Storing JSON in SQL Databases
&lt;/h2&gt;

&lt;p&gt;Many modern SQL databases, such as PostgreSQL, MySQL, and SQL Server, provide native support for JSON data types.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Store JSON
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Using JSON Data Type: Some databases allow you to define a column with a JSON data type.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;Products&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;ProductID&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ProductData&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
   &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Inserting JSON Data:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;Products&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ProductID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProductData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "Laptop", "price": 999.99}'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Querying JSON Data
&lt;/h3&gt;

&lt;p&gt;You can use built-in functions to query JSON data.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ProductData&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'name'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ProductName&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Products&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ProductID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
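&lt;p&gt;Note that the &lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt; operator above is PostgreSQL syntax. In SQL Server, the equivalent lookup uses the &lt;code&gt;JSON_VALUE&lt;/code&gt; function:&lt;/p&gt;

```sql
-- SQL Server equivalent of the PostgreSQL query above
SELECT JSON_VALUE(ProductData, '$.name') AS ProductName
FROM Products
WHERE ProductID = 1;
```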

&lt;h2&gt;
  
  
  Storing XML in SQL Databases
&lt;/h2&gt;

&lt;p&gt;SQL databases also support XML data types, allowing you to store and query XML documents.&lt;/p&gt;
&lt;h4&gt;
  
  
  How to Store XML
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Using XML Data Type: Define a column with an XML data type.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;Orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;OrderID&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;OrderDetails&lt;/span&gt; &lt;span class="n"&gt;xml&lt;/span&gt;
   &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Inserting XML Data:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;Orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OrderID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OrderDetails&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;order&amp;gt;&amp;lt;item&amp;gt;Book&amp;lt;/item&amp;gt;&amp;lt;quantity&amp;gt;2&amp;lt;/quantity&amp;gt;&amp;lt;/order&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Querying XML Data
&lt;/h4&gt;

&lt;p&gt;You can use XPath and XQuery to extract data from XML columns; the &lt;code&gt;value()&lt;/code&gt; method shown below is SQL Server syntax.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;OrderDetails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'(/order/item)[1]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'varchar(100)'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ItemName&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;OrderID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Pros and Cons of Storing JSON and XML
&lt;/h2&gt;
&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Flexibility: Both JSON and XML allow for flexible data structures, making it easy to store complex data.&lt;/li&gt;
&lt;li&gt;Interoperability: They are widely used formats, making it easier to integrate with other systems and APIs.&lt;/li&gt;
&lt;li&gt;Schema-less: You can store data without a predefined schema, which is useful for evolving data models.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Performance: Querying semi-structured data can be slower than querying structured data, especially for large datasets.&lt;/li&gt;
&lt;li&gt;Complexity: Managing and querying JSON and XML data can add complexity to your database operations.&lt;/li&gt;
&lt;li&gt;Storage Overhead: JSON and XML formats can consume more storage space compared to traditional relational data.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Storing JSON and XML in SQL databases provides a powerful way to handle semi-structured data. By leveraging the native support for these formats in modern SQL databases, you can efficiently store, query, and manage complex data structures. Understanding the advantages and limitations of each format will help you make informed decisions about how to best utilize them in your applications.&lt;/p&gt;



</description>
      <category>sql</category>
      <category>database</category>
      <category>tutorial</category>
      <category>data</category>
    </item>
    <item>
      <title>Fixing Azure SQL Connection Errors in Azure Scheduled Python Job</title>
      <dc:creator>Luca Liu</dc:creator>
      <pubDate>Fri, 27 Feb 2026 13:37:00 +0000</pubDate>
      <link>https://forem.com/luca1iu/fixing-azure-sql-connection-errors-in-azure-scheduled-python-job-3ldk</link>
      <guid>https://forem.com/luca1iu/fixing-azure-sql-connection-errors-in-azure-scheduled-python-job-3ldk</guid>
      <description>&lt;p&gt;As a Data Analyst, I recently faced a frustrating issue while automating a daily data processing task in Azure.&lt;/p&gt;

&lt;p&gt;The goal was simple: run a scheduled job every morning to process data and sync it to an Azure SQL Database. When I ran the code manually, it worked perfectly. But when the scheduled job (via Azure Functions or Synapse) triggered at 6:00 AM, it crashed immediately.&lt;/p&gt;

&lt;p&gt;Here is how to fix the "Database not available" error without increasing your Azure bill.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Problem
&lt;/h1&gt;

&lt;p&gt;The job failed consistently with &lt;strong&gt;Error 40613&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(pyodbc.Error) ('HY000', "[HY000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Database 'xxxxxxx' on server 'xxxxxxxxxxxxxxxxxx' is not currently available. Please retry the connection later. If the problem persists, contact customer support, and provide them the session tracing ID of '{...}'. (40613) (SQLDriverConnect)") (Background on this error at: https://sqlalche.me/e/20/dbapi)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Why this happens
&lt;/h2&gt;

&lt;p&gt;I am using the &lt;strong&gt;Azure SQL Database Serverless&lt;/strong&gt; tier. To save costs, this tier features &lt;strong&gt;Auto-pause&lt;/strong&gt;. If no one uses the database for a set period (e.g., 1 hour), Azure puts it to sleep.&lt;/p&gt;

&lt;p&gt;When my scheduled job runs in the morning, the database is cold. It takes approximately &lt;strong&gt;60 to 90 seconds&lt;/strong&gt; for Azure to spin the compute back up. The default Python connection string gives up before the database is ready.&lt;/p&gt;
&lt;h1&gt;
  
  
  The Expensive Fix (Don't do this)
&lt;/h1&gt;

&lt;p&gt;My first instinct was to disable Auto-pause.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Azure Portal&lt;/strong&gt; &amp;gt; &lt;strong&gt;SQL Database&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Compute + storage&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Uncheck &lt;strong&gt;Enable auto-pause&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; The error stopped, but my costs tripled. I was paying for compute 24/7 for a job that only runs for 10 minutes a day. This is not efficient.&lt;/p&gt;
&lt;h1&gt;
  
  
  The Smart Fix: Intelligent Retry Logic
&lt;/h1&gt;

&lt;p&gt;Instead of keeping the server running all night, we should write code that is patient enough to wait for the server to wake up.&lt;/p&gt;

&lt;p&gt;I wrote a custom wrapper for the SQLAlchemy engine that handles the specific behavior of Azure Serverless cold starts.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;

&lt;p&gt;Here is the robust connection function. It attempts to connect, and if it detects the database is sleeping, it waits and retries until the server is back online.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy.exc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OperationalError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;InterfaceError&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;connect_sql_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Attempts to connect to the database. 
    If the database is in serverless pause state, it retries until it wakes up.

    max_retries: Default 10. Covers ~5 minutes of startup time.
    delay_seconds: Default 30s. Wait time between attempts.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Replace with your credentials or use Environment Variables (Recommended)
&lt;/span&gt;    &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your-server.database.windows.net&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your-database&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your-username&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your-password&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; 

    &lt;span class="c1"&gt;# LoginTimeout=30 gives the driver time to negotiate the handshake
&lt;/span&gt;    &lt;span class="n"&gt;connection_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mssql+pyodbc://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?driver=ODBC+Driver+18+for+SQL+Server&amp;amp;LoginTimeout=30&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create the engine with connection pooling enabled
&lt;/span&gt;    &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;connection_string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;fast_executemany&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Optimized for bulk inserts
&lt;/span&gt;        &lt;span class="n"&gt;pool_pre_ping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Checks connection health before usage
&lt;/span&gt;        &lt;span class="n"&gt;pool_recycle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1800&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Attempting to connect to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Try to execute a simple query to wake the DB
&lt;/span&gt;            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;gt;&amp;gt; Success: Database is connected and awake!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;

        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OperationalError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;InterfaceError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Attempt &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; failed. Database might be auto-paused.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error details: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Waiting &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;delay_seconds&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds for wake-up...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# If we reach here, the database is genuinely down or credentials are wrong
&lt;/span&gt;    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;gt;&amp;gt; Failed to wake up the database after multiple attempts.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Loop:&lt;/strong&gt; It tries to run &lt;code&gt;SELECT 1&lt;/code&gt;. This is a lightweight query that forces Azure to trigger the resume process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Trap:&lt;/strong&gt; If it catches an &lt;code&gt;OperationalError&lt;/code&gt; (which covers the 40613 code), it pauses the script for 30 seconds using &lt;code&gt;time.sleep()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Success:&lt;/strong&gt; Once Azure allocates the compute (usually after attempt 2 or 3), the connection succeeds, and the function returns the active &lt;code&gt;engine&lt;/code&gt; object for your pipeline to use.&lt;/li&gt;
&lt;/ol&gt;
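&lt;p&gt;Stripped of the SQLAlchemy specifics, the wake-up logic above is just a bounded retry loop. Here is a minimal, self-contained sketch (the function name and the use of &lt;code&gt;ConnectionError&lt;/code&gt; are illustrative, not taken from the code above):&lt;/p&gt;

```python
import time

def wake_with_retries(probe, max_retries=5, delay_seconds=30, sleep=time.sleep):
    """Call `probe` until it succeeds or retries run out.

    `probe` stands in for connecting and running SELECT 1; `sleep` is
    injectable so the loop can be exercised without real waiting.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return probe()  # first successful call returns the live engine
        except ConnectionError:
            if attempt == max_retries:
                break
            sleep(delay_seconds)  # give Azure time to allocate compute
    raise RuntimeError("Failed to wake up the database after multiple attempts.")
```

&lt;p&gt;A paused database typically answers on the second or third attempt, so five retries with a 30-second delay leaves comfortable headroom.&lt;/p&gt;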
&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Don't change your infrastructure to fit your code; change your code to fit the infrastructure. By handling the "cold start" in Python, you keep the cost benefits of Serverless architecture while maintaining the reliability of a Production environment.&lt;/p&gt;

&lt;p&gt;Happy coding!&lt;/p&gt;


&lt;h2&gt;
  
  
  Explore more
&lt;/h2&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag__user ltag__user__id__1230121"&gt;
    &lt;a href="/luca1iu" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1230121%2F2521cc84-ad7d-458c-99e5-b4d82f625a88.jpg" alt="luca1iu image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/luca1iu"&gt;Luca Liu&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/luca1iu"&gt;Hello there! 👋 I'm Luca, a Business Intelligence Developer with passion for all things data. Proficient in Python, SQL, Power BI, Tableau&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;Thank you for taking the time to explore data-related insights with me. I appreciate your engagement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/lucaliu-data" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;🚀 Connect with me on LinkedIn&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/Luca_DataTeam" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;🎃 Connect with me on X&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>azure</category>
      <category>database</category>
      <category>automation</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Install Python Package in Azure Synapse for Apache Spark pools</title>
      <dc:creator>Luca Liu</dc:creator>
      <pubDate>Tue, 06 Jan 2026 21:58:00 +0000</pubDate>
      <link>https://forem.com/luca1iu/how-to-install-python-package-in-azure-synapse-for-apache-spark-pools-4pjj</link>
      <guid>https://forem.com/luca1iu/how-to-install-python-package-in-azure-synapse-for-apache-spark-pools-4pjj</guid>
      <description>&lt;h2&gt;
  
  
  Efficiently Installing Python Packages in Azure Synapse Analytics
&lt;/h2&gt;

&lt;p&gt;When working in Azure Synapse notebooks, you can use the &lt;code&gt;%pip&lt;/code&gt; magic command (e.g., &lt;code&gt;%pip install pandas&lt;/code&gt;) in a code cell to install packages. However, this method is temporary: the package is installed only for the current notebook session and must be reinstalled every time a new session starts.&lt;/p&gt;

&lt;p&gt;This repetition can lead to significant delays in notebook execution and is inefficient for frequently run jobs.&lt;/p&gt;

&lt;p&gt;A more permanent and efficient solution is to install packages directly onto the Apache Spark pool. This approach ensures the libraries are pre-installed and automatically available in every session attached to that pool.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Install Packages at the Spark Pool Level
&lt;/h2&gt;

&lt;p&gt;This method involves uploading a &lt;code&gt;requirements.txt&lt;/code&gt; file that specifies the packages and versions you need.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your Azure Synapse workspace in the Azure portal.&lt;/li&gt;
&lt;li&gt;Navigate to the "Manage" section on the left-hand side.&lt;/li&gt;
&lt;li&gt;Select "Apache Spark pools" under the "Analytics pools" section.&lt;/li&gt;
&lt;li&gt;Choose the Spark pool where you want to install the package.&lt;/li&gt;
&lt;li&gt;Hover over the three dots on the right side of the Spark pool and click "Packages".&lt;/li&gt;
&lt;li&gt;Upload a &lt;code&gt;requirements.txt&lt;/code&gt; file containing the list of packages you want to install.&lt;/li&gt;
&lt;li&gt;Click "Apply" to save the changes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futjmsqs39tv57h4az884.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futjmsqs39tv57h4az884.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Spark pool will update and automatically install the specified packages. This may take a few minutes. Once complete, all notebooks attached to this pool will have access to these libraries by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to generate &lt;code&gt;requirements.txt&lt;/code&gt; file
&lt;/h2&gt;

&lt;p&gt;The requirements.txt file is a simple text file that lists the packages to be installed. You can easily generate this file from your local Python environment.&lt;/p&gt;

&lt;p&gt;Open your terminal or command prompt and run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip freeze &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This command captures all packages and their exact versions from your current environment and saves them into a file named requirements.txt. Uploading this file ensures that the exact same package versions are installed in your Synapse environment, providing consistency and preventing dependency conflicts.&lt;/p&gt;
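&lt;p&gt;One caveat: &lt;code&gt;pip freeze&lt;/code&gt; pins everything in your local environment, which can make the pool update slower than necessary. As a hedged sketch (the helper name and package names are examples, not part of the workflow above), you can trim the dump down to just the packages your notebooks need:&lt;/p&gt;

```python
def trim_requirements(freeze_text, keep):
    """Keep only `name==version` lines whose package name is in `keep`.

    Editable installs and URL requirements are skipped; names are
    compared case-insensitively, as pip does.
    """
    wanted = {name.lower() for name in keep}
    kept_lines = []
    for line in freeze_text.splitlines():
        if "==" not in line:
            continue  # skip editable/URL requirements in this sketch
        name = line.split("==", 1)[0].strip().lower()
        if name in wanted:
            kept_lines.append(line.strip())
    return "\n".join(kept_lines)
```

&lt;p&gt;Run &lt;code&gt;pip freeze&lt;/code&gt; first, then pass the file's contents through a filter like this before uploading, keeping the exact pinned versions only for what your Synapse notebooks actually import.&lt;/p&gt;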


&lt;h2&gt;
  
  
  Explore more
&lt;/h2&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag__user ltag__user__id__1230121"&gt;
    &lt;a href="/luca1iu" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1230121%2F2521cc84-ad7d-458c-99e5-b4d82f625a88.jpg" alt="luca1iu image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/luca1iu"&gt;Luca Liu&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/luca1iu"&gt;Hello there! 👋 I'm Luca, a Business Intelligence Developer with passion for all things data. Proficient in Python, SQL, Power BI, Tableau&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;Thank you for taking the time to explore data-related insights with me. I appreciate your engagement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/lucaliu-data" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;🚀 Connect with me on LinkedIn&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/Luca_DataTeam" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;🎃 Connect with me on X&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>azure</category>
      <category>tutorial</category>
      <category>python</category>
      <category>data</category>
    </item>
    <item>
      <title>How to Calculate a Dynamic Truncated Mean in Power BI Using DAX</title>
      <dc:creator>Luca Liu</dc:creator>
      <pubDate>Tue, 06 Jan 2026 21:57:00 +0000</pubDate>
      <link>https://forem.com/luca1iu/how-to-calculate-a-dynamic-truncated-mean-in-power-bi-using-dax-gij</link>
      <guid>https://forem.com/luca1iu/how-to-calculate-a-dynamic-truncated-mean-in-power-bi-using-dax-gij</guid>
      <description>&lt;h2&gt;
  
  
  Why You Need a Truncated Mean
&lt;/h2&gt;

&lt;p&gt;In data analysis, the standard AVERAGE function is a workhorse, but it has a significant weakness: it is highly susceptible to distortion from outliers. A single extreme value, whether high or low, can skew the entire result, misrepresenting the data's true central tendency.&lt;/p&gt;

&lt;p&gt;This is where the truncated mean becomes essential. It provides a more robust measure of average by excluding a specified percentage of the smallest and largest values from the calculation.&lt;/p&gt;

&lt;p&gt;While modern Power BI models have a built-in TRIMMEAN function, this function is often unavailable when using a Live Connection to an older Analysis Services (SSAS) model. This article provides a robust, manual DAX pattern that replicates this functionality and remains fully dynamic, responding to all slicers and filters in your report.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DAX Solution for a Dynamic Truncated Mean
&lt;/h2&gt;

&lt;p&gt;This measure calculates a 20% truncated mean by removing the bottom 10% and top 10% of values before averaging the remaining 80%.&lt;/p&gt;

&lt;p&gt;You can paste this code directly into the "New Measure" formula bar.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Trimmed Mean (20%) = 
VAR TargetTable = 'FactTable'
VAR TargetColumn = 'FactTable'[MeasureColumn]
VAR LowerPercentile = 0.10 // Defines the bottom 10% to trim
VAR UpperPercentile = 0.90 // Defines the top 10% to trim (1.0 - 0.10)

// 1. Find the value at the 10th percentile
VAR MinThreshold =
    PERCENTILEX.INC(
        FILTER( 
            TargetTable, 
            NOT( ISBLANK( TargetColumn ) ) 
        ),
        TargetColumn,
        LowerPercentile
    )

// 2. Find the value at the 90th percentile
VAR MaxThreshold =
    PERCENTILEX.INC(
        FILTER( 
            TargetTable, 
            NOT( ISBLANK( TargetColumn ) ) 
        ),
        TargetColumn,
        UpperPercentile
    )

// 3. Calculate the average, including only values between the thresholds
RETURN
CALCULATE(
    AVERAGEX(
        FILTER(
            TargetTable,
            TargetColumn &amp;gt;= MinThreshold &amp;amp;&amp;amp;
            TargetColumn &amp;lt;= MaxThreshold
        ),
        TargetColumn
    )
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Deconstructing the DAX Logic
&lt;/h2&gt;

&lt;p&gt;This formula works in three distinct steps, all of which execute within the current filter context (e.g., whatever slicers the user has selected).&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Define Key Variables
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;'FactTable'[MeasureColumn]&lt;/code&gt;: You must change the table and column references to match your own data model.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LowerPercentile&lt;/code&gt; / &lt;code&gt;UpperPercentile&lt;/code&gt;: We define the boundaries. 0.10 and 0.90 mean we are trimming the bottom 10% and top 10%. To trim 5% from each end (a 10% total trim), you would use 0.05 and 0.95.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  2. Find the Percentile Thresholds
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MinThreshold&lt;/code&gt; &amp;amp; &lt;code&gt;MaxThreshold&lt;/code&gt;: These variables store the actual values that correspond to our percentile boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PERCENTILEX.INC&lt;/code&gt;: We use this "iterator" function because it allows us to first FILTER the table.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FILTER(..., NOT(ISBLANK(...)))&lt;/code&gt;: This is a crucial step. We calculate the percentiles only for rows where our target column is not blank. This prevents BLANK() values from skewing the percentile calculation.&lt;/li&gt;
&lt;li&gt;The result is that &lt;code&gt;MinThreshold&lt;/code&gt; holds the value of the 10th percentile (e.g., 4.5) and &lt;code&gt;MaxThreshold&lt;/code&gt; holds the value of the 90th percentile (e.g., 88.2) for the currently visible data.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  3. Calculate the Final Average
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;RETURN CALCULATE(...)&lt;/code&gt;: The CALCULATE function is the key to making the measure dynamic. It ensures the entire calculation respects the filters applied by any slicers or visuals in the report.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;AVERAGEX(FILTER(...))&lt;/code&gt;: The core of the calculation. We use AVERAGEX to iterate over a table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;FILTER(...)&lt;/code&gt;: We filter the fact table a final time. This filter is the "trim." It keeps only the rows where the measure column's value is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Greater than or equal to&lt;/strong&gt; our MinThreshold&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AND&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Less than or equal to&lt;/strong&gt; our MaxThreshold&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;AVERAGEX&lt;/code&gt; then calculates the simple average of the measure column for only the rows that passed the filter.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
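&lt;p&gt;To sanity-check the measure outside Power BI, the same trim can be replicated in plain Python. This is a sketch of the logic, assuming inclusive percentiles with linear interpolation (PERCENTILEX.INC-style), not the DAX engine's exact implementation:&lt;/p&gt;

```python
def percentile_inc(values, p):
    """Inclusive percentile with linear interpolation (PERCENTILE.INC-style)."""
    data = sorted(values)
    rank = p * (len(data) - 1)
    lower = int(rank)
    frac = rank - lower
    if lower + 1 < len(data):
        return data[lower] + frac * (data[lower + 1] - data[lower])
    return data[lower]

def trimmed_mean(values, trim=0.10):
    """Average only the values between the `trim` and `1 - trim` percentiles."""
    low = percentile_inc(values, trim)
    high = percentile_inc(values, 1 - trim)
    kept = [v for v in values if low <= v <= high]
    return sum(kept) / len(kept)
```

&lt;p&gt;For example, for the list [1, 2, ..., 9, 100] the raw mean is 14.5, while the 10% trimmed mean stays at 5.5 because the outlier falls above the 90th-percentile threshold.&lt;/p&gt;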
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By implementing this DAX pattern, you create a robust, dynamic, and outlier-resistant KPI. This measure provides a more accurate picture of your data's central tendency and will correctly re-calculate on the fly as users interact with your Power BI report.&lt;/p&gt;


&lt;h2&gt;
  
  
  Explore more
&lt;/h2&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag__user ltag__user__id__1230121"&gt;
    &lt;a href="/luca1iu" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1230121%2F2521cc84-ad7d-458c-99e5-b4d82f625a88.jpg" alt="luca1iu image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/luca1iu"&gt;Luca Liu&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/luca1iu"&gt;Hello there! 👋 I'm Luca, a Business Intelligence Developer with passion for all things data. Proficient in Python, SQL, Power BI, Tableau&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;Thank you for taking the time to explore data-related insights with me. I appreciate your engagement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/lucaliu-data" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;🚀 Connect with me on LinkedIn&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/Luca_DataTeam" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;🎃 Connect with me on X&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>powerbi</category>
      <category>tutorial</category>
      <category>dax</category>
      <category>data</category>
    </item>
    <item>
      <title>Data Security in SQL: Encryption, Roles, and Permissions</title>
      <dc:creator>Luca Liu</dc:creator>
      <pubDate>Tue, 09 Dec 2025 16:45:00 +0000</pubDate>
      <link>https://forem.com/luca1iu/data-security-in-sql-encryption-roles-and-permissions-17g</link>
      <guid>https://forem.com/luca1iu/data-security-in-sql-encryption-roles-and-permissions-17g</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In today's digital age, data security is paramount. SQL databases often store sensitive information, making it crucial to implement robust security measures. This article explores three key strategies for securing data in SQL: encryption, roles, and permissions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Encrypting Sensitive Columns
&lt;/h2&gt;

&lt;p&gt;Encryption is the process of converting data into a coded format to prevent unauthorized access. In SQL, encrypting sensitive columns such as passwords and credit card data is essential.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Encrypt Data in SQL
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Choose an Encryption Algorithm&lt;/strong&gt;: Common algorithms include AES (Advanced Encryption Standard) and RSA (Rivest-Shamir-Adleman).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement Column-Level Encryption&lt;/strong&gt;: Use SQL commands to encrypt specific columns. For example:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;UserID&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;Username&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
       &lt;span class="n"&gt;Password&lt;/span&gt; &lt;span class="nb"&gt;varbinary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ENCRYPTED&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;ENCRYPTION&lt;/span&gt;
   &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Manage Encryption Keys&lt;/strong&gt;: Store and manage encryption keys securely, using a key management system.&lt;/li&gt;
&lt;/ol&gt;
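&lt;p&gt;For password columns specifically, a one-way salted hash is usually preferable to reversible encryption, because the application never needs the plaintext back. A minimal standard-library sketch (the iteration count is illustrative; tune it to your hardware and threat model):&lt;/p&gt;

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None, iterations=600_000):
    """Derive a salted PBKDF2-HMAC-SHA256 digest suitable for a binary column."""
    salt = os.urandom(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest

def verify_password(password, salt, expected, iterations=600_000):
    """Re-derive the digest and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, expected)
```

&lt;p&gt;Both the salt and the digest would be stored (e.g., in varbinary columns); only the verification function ever touches the user-supplied password again.&lt;/p&gt;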
&lt;h2&gt;
  
  
  Using Roles and Permissions Effectively
&lt;/h2&gt;

&lt;p&gt;Roles and permissions control who can access or modify data within the database. Properly configured roles and permissions are vital for data security.&lt;/p&gt;
&lt;h4&gt;
  
  
  Setting Up Roles
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Define Roles: Identify different user roles (e.g., admin, user, guest) and their access needs.&lt;/li&gt;
&lt;li&gt;Create Roles in SQL:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="k"&gt;admin&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Assigning Permissions
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Grant Permissions&lt;/strong&gt;: Assign specific permissions to roles. For example:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;admin&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Review and Update Regularly&lt;/strong&gt;: Regularly audit permissions to ensure they align with current security policies.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Masking Sensitive Data with Views
&lt;/h2&gt;

&lt;p&gt;Data masking involves creating a version of the data that obscures sensitive information, allowing users to work with data without exposing sensitive details.&lt;/p&gt;

&lt;h4&gt;
  
  
  Implementing Data Masking
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Create Views: Use SQL views to present masked data. For example:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;    &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;MaskedUsers&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;UserID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'****'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;Password&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Control Access to Views: Ensure only authorized users can access the views.&lt;/li&gt;
&lt;/ol&gt;
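&lt;p&gt;When a view is not an option, the same masking idea can live in the application layer. A small illustrative helper (the name and defaults are assumptions, not a standard API):&lt;/p&gt;

```python
def mask(value, visible=4, mask_char="*"):
    """Mask all but the last `visible` characters of a string."""
    value = str(value)
    if len(value) <= visible:
        return mask_char * len(value)
    keep = value[-visible:] if visible > 0 else ""
    return mask_char * (len(value) - visible) + keep
```

&lt;p&gt;This keeps just enough of the value (e.g., the last four digits of a card number) for users to recognize a record without exposing the full secret.&lt;/p&gt;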
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Securing data in SQL databases requires a multi-faceted approach. By encrypting sensitive columns, using roles and permissions effectively, and masking data with views, you can significantly enhance your database's security. Implement these strategies to protect your data from unauthorized access and breaches.&lt;/p&gt;


&lt;h2&gt;
  
  
  Explore more
&lt;/h2&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag__user ltag__user__id__1230121"&gt;
    &lt;a href="/luca1iu" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1230121%2F2521cc84-ad7d-458c-99e5-b4d82f625a88.jpg" alt="luca1iu image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/luca1iu"&gt;Luca Liu&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/luca1iu"&gt;Hello there! 👋 I'm Luca, a Business Intelligence Developer with passion for all things data. Proficient in Python, SQL, Power BI, Tableau&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;Thank you for taking the time to explore data-related insights with me. I appreciate your engagement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/lucaliu-data" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;🚀 Connect with me on LinkedIn&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>database</category>
      <category>tutorial</category>
      <category>sql</category>
      <category>data</category>
    </item>
    <item>
      <title>Stuck in a Version Trap - How I Used Azure ML to Deploy an Azure Function</title>
      <dc:creator>Luca Liu</dc:creator>
      <pubDate>Mon, 08 Dec 2025 09:52:00 +0000</pubDate>
      <link>https://forem.com/luca1iu/stuck-in-a-version-trap-how-i-used-azure-ml-to-deploy-an-azure-function-19ke</link>
      <guid>https://forem.com/luca1iu/stuck-in-a-version-trap-how-i-used-azure-ml-to-deploy-an-azure-function-19ke</guid>
      <description>&lt;p&gt;As a developer, there is no worse feeling than being completely blocked. This is the story of how I got stuck in a "version trap" between my company PC, VS Code, and Azure... and how I used a cloud VM to escape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Date:&lt;/strong&gt; November 17, 2025&lt;/p&gt;

&lt;h1&gt;
  
  
  The Version Trap
&lt;/h1&gt;

&lt;p&gt;My goal was to create a new Azure Function in Python. I checked the Azure Portal, and I was excited to see that the Function App runtime now &lt;strong&gt;supports Python 3.13&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;My company laptop has Python 3.13 installed, so I thought this would be easy. I opened VS Code, installed the Azure Functions extension, and tried to create a new project.&lt;/p&gt;

&lt;p&gt;When the extension asked me to select my Python interpreter, I pointed it to my &lt;code&gt;Python313\python.exe&lt;/code&gt;. Immediately, I hit a wall:&lt;/p&gt;

&lt;p&gt;Error: &lt;code&gt;Python version 3.13.8 does not match supported versions...&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;The problem is that the &lt;strong&gt;cloud runtime&lt;/strong&gt; (in Azure) is updated &lt;em&gt;before&lt;/em&gt; the &lt;strong&gt;local development tools&lt;/strong&gt; (the VS Code extension and Core Tools). My local tools were out of sync with the cloud and didn't recognize 3.13 as valid yet.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Real-World Constraint: The Corporate PC
&lt;/h1&gt;

&lt;p&gt;The standard solution is simple: "Just install a supported version, like Python 3.11."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My problem:&lt;/strong&gt; I can't. This is a locked-down company laptop. Installing new software requires a multi-day approval process with the IT department. (My &lt;em&gt;other&lt;/em&gt; local Python 3.11 installation was also broken and missing key modules like &lt;code&gt;pip&lt;/code&gt; and &lt;code&gt;venv&lt;/code&gt;, but I couldn't get admin rights to fix it.)&lt;/p&gt;

&lt;p&gt;I was completely blocked. I couldn't develop locally.&lt;/p&gt;

&lt;h1&gt;
  
  
  The "Aha!" Moment: Use a Cloud Dev Box
&lt;/h1&gt;

&lt;p&gt;As a Data Analyst, I already have access to an &lt;strong&gt;Azure ML (Machine Learning) Compute Instance&lt;/strong&gt;. I realized: &lt;em&gt;that compute instance is just a fully-featured Linux VM in the cloud that I control.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What if I treated my Azure ML instance as my &lt;em&gt;new&lt;/em&gt; "local" development machine?&lt;/p&gt;

&lt;h1&gt;
  
  
  The Solution: Deploying from Azure ML to Azure Functions
&lt;/h1&gt;

&lt;p&gt;This workflow completely bypassed my locked-down company PC and was surprisingly simple.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Connect VS Code to the Azure ML Instance&lt;/strong&gt; This is the most important step. In VS Code, I installed the &lt;strong&gt;Azure Machine Learning&lt;/strong&gt; extension. In its panel, I found my Compute Instance, right-clicked, and selected "Connect to Compute Instance." VS Code reloaded in a "Remote SSH" session, and my VS Code terminal was now a terminal &lt;em&gt;inside my cloud VM&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Create the Project &lt;em&gt;on the ML Instance&lt;/em&gt;&lt;/strong&gt; Now, inside this remote session, I opened a folder &lt;em&gt;on the ML instance&lt;/em&gt; and ran the &lt;code&gt;F1&lt;/code&gt; &amp;gt; &lt;code&gt;Azure Functions: Create New Project...&lt;/code&gt; command. The VM already had Python 3.10 installed, so the tools were perfectly happy. I also created my &lt;code&gt;TimerTrigger&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Set Up the Environment (The "F5" Fix)&lt;/strong&gt; My code needs &lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;pyodbc&lt;/code&gt;. I opened the VS Code terminal (which is connected to my ML instance) and ran these commands to create a virtual environment and install my packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a virtual environment using the VM's Python 3.10&lt;/span&gt;
python3.10 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv

&lt;span class="c"&gt;# Activate it&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate

&lt;span class="c"&gt;# Install my packages&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
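&lt;p&gt;For reference, the &lt;code&gt;requirements.txt&lt;/code&gt; used above might look roughly like this. Only &lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;pyodbc&lt;/code&gt; are named in this post; the &lt;code&gt;azure-functions&lt;/code&gt; entry is the package every Python Functions project needs, and the lack of version pins is just to keep the sketch minimal:&lt;/p&gt;

```text
# requirements.txt (illustrative sketch, not the author's actual file)
azure-functions
pandas
pyodbc
```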


&lt;p&gt;&lt;strong&gt;Step 4: Debug "Remotely"&lt;/strong&gt; This is the magic part. I pressed &lt;strong&gt;F5&lt;/strong&gt;. The code &lt;em&gt;ran on the ML instance&lt;/em&gt;, but the debugger connected to my local VS Code. I could set breakpoints and inspect variables just as if it were running on my own laptop. I successfully debugged my function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Deploy from Cloud to Cloud&lt;/strong&gt; Once I was happy with my code, I clicked on the Azure extension icon (inside my remote VS Code session). I found my target Function App, right-clicked, and selected &lt;strong&gt;"Deploy to Function App..."&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;VS Code packaged all the code &lt;em&gt;from my Azure ML instance&lt;/em&gt; and deployed it directly &lt;em&gt;to my Azure Functions app&lt;/em&gt;. My local PC was just a "thin client" for the whole process.&lt;/p&gt;
&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Don't let a locked-down corporate PC block you from getting work done. If your local tools are out of date or broken, you can use any cloud VM (like an Azure ML Compute Instance) as a powerful, modern development environment. By using the VS Code Remote-SSH features, you can get the best of both worlds.&lt;/p&gt;


&lt;h2&gt;
  
  
  Explore more
&lt;/h2&gt;


&lt;div class="ltag__user ltag__user__id__1230121"&gt;
    &lt;a href="/luca1iu" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1230121%2F2521cc84-ad7d-458c-99e5-b4d82f625a88.jpg" alt="luca1iu image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/luca1iu"&gt;Luca Liu&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/luca1iu"&gt;Hello there! 👋 I'm Luca, a Business Intelligence Developer with passion for all things data. Proficient in Python, SQL, Power BI, Tableau&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;Thank you for taking the time to explore data-related insights with me. I appreciate your engagement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/lucaliu-data" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🚀 Connect with me on LinkedIn&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>azure</category>
      <category>dataanalyst</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>10 Essential Data Science Algorithms &amp; Techniques</title>
      <dc:creator>Luca Liu</dc:creator>
      <pubDate>Mon, 08 Dec 2025 09:51:00 +0000</pubDate>
      <link>https://forem.com/luca1iu/10-essential-data-science-algorithms-techniques-58bp</link>
      <guid>https://forem.com/luca1iu/10-essential-data-science-algorithms-techniques-58bp</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;The world of data science can seem intimidating, filled with complex equations and advanced statistical concepts. Many aspiring data scientists feel they need to be a "math master" before even beginning. But here's a secret: while a deep understanding of the mathematical foundations of every algorithm is certainly powerful, it's not a prerequisite to becoming an effective data scientist.&lt;/p&gt;

&lt;p&gt;What truly matters is developing an intuitive understanding of what these powerful algorithms do, when to unleash them, and why one might be chosen over another. Think of it less like building an engine from scratch, and more like knowing which tool to pick from a well-stocked toolbox to get the job done right. This article will cut through the jargon and introduce you to 10 essential algorithms and techniques—the workhorses of data science—equipping you with the practical knowledge you need to start building intelligent solutions today.&lt;/p&gt;

&lt;h1&gt;
  
  
  I. Foundational Supervised Learning
&lt;/h1&gt;

&lt;p&gt;Supervised Learning is the most common type of machine learning. It's like learning with a teacher or flashcards. You give the algorithm a dataset where you already know the correct answers (called "labels").&lt;/p&gt;
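&lt;p&gt;The snippets below all reuse names like &lt;code&gt;X_train&lt;/code&gt; and &lt;code&gt;X_test&lt;/code&gt;. They come from a standard train/test split, sketched here with a tiny synthetic dataset (the data itself is purely illustrative):&lt;/p&gt;

```python
# Sketch of the train/test split assumed by every snippet in this article.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]          # one numeric feature
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # the known "teacher's answers" (labels)

# Hold out 20% of the data so the model is evaluated on examples it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2
```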

&lt;h2&gt;
  
  
  1. Linear Regression
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Linear Regression is a fundamental algorithm that finds the best-fit straight line showing the relationship between variables. Its goal is to predict a continuous numerical value (e.g., a house price, a person's weight, or sales) based on one or more input features (e.g., house size, a person's height, or ad spending).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When your goal is to predict a continuous number (e.g., forecasting sales, estimating a price).&lt;/li&gt;
&lt;li&gt;When you need to understand the strength and direction of the relationship between variables (e.g., "How much does ad spending really impact sales?").&lt;/li&gt;
&lt;li&gt;As a simple, fast baseline to compare against more complex models.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Data Scientist's "Sense":&lt;/strong&gt; You should think of Linear Regression immediately when your primary question is "How much...?" or "What value...?" and you have a numerical target to predict. If you suspect the relationship between your inputs and output is relatively simple (e.g., "more square footage = higher house price"), and you value speed and interpretability (it's easy to explain why it made a prediction), it's your perfect starting point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python Package &amp;amp; Code: The most common tool is Scikit-learn (sklearn).&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;

&lt;span class="c1"&gt;# X = your features (e.g., [[square_feet, num_bedrooms]])
# y = your target (e.g., [price])
&lt;/span&gt;
&lt;span class="c1"&gt;# 1. Create the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Train the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Get predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Check the relationship (e.g., the slope of the line)
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Coefficients: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coef_&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  2. Logistic Regression
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; Despite its name, Logistic Regression is used for classification tasks. Its goal is to predict the probability that an input belongs to a specific category (e.g., spam vs. not spam, disease vs. no disease) based on input features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When your goal is to predict a category (e.g., spam/not spam, fraud/not fraud, pass/fail). This is most common for binary problems.&lt;/li&gt;
&lt;li&gt;When you need the probability of an outcome (e.g., what is the likelihood this customer will click the ad?).&lt;/li&gt;
&lt;li&gt;As a simple, fast, and highly interpretable baseline for classification.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Data Scientist's "Sense":&lt;/strong&gt; You should think of Logistic Regression immediately when your primary question is "Is it A or B?", "Will this happen?", or "What's the probability of...?" for a categorical outcome. It's the classification equivalent of Linear Regression—your first, most straightforward tool for the job. Its ability to provide probabilities makes it more useful than just a "yes" or "no" answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python Package &amp;amp; Code: The most common tool is Scikit-learn (sklearn).&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;

&lt;span class="c1"&gt;# X = your features (e.g., [[hours_studied, past_failures]])
# y = your target (e.g., [pass, fail])
&lt;/span&gt;
&lt;span class="c1"&gt;# 1. Create the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Train the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Get class predictions (e.g., 'pass' or 'fail')
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Get the probabilities
&lt;/span&gt;&lt;span class="n"&gt;probabilities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  3. K-Nearest Neighbors (KNN)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; KNN is a simple and intuitive algorithm that classifies a new data point based on its 'neighbors': it finds the 'k' closest data points from the training set and makes a prediction based on their majority vote. If k=5 and 3 out of 5 neighbors are 'spam', the new point is classified as 'spam'.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For classification (and regression) tasks where the underlying data relationships are complex but "similarity" is a good predictor (e.g., "birds of a feather flock together").&lt;/li&gt;
&lt;li&gt;As a simple, "non-parametric" or "lazy" model, meaning it makes no assumptions about the underlying data distribution. It doesn't "learn" a line; it just memorizes the data.&lt;/li&gt;
&lt;li&gt;For tasks like recommendation engines (e.g., "users similar to you also liked...").&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Data Scientist's "Sense"&lt;/strong&gt;: You should think of KNN when your features are on a similar scale (e.g., all numbers from 1-10) and you believe the core idea "tell me who your friends are, and I'll tell you who you are" applies to your data. It's great when you have well-defined, distinct clusters in your data. It's often outperformed by more advanced models but is a fantastic, simple baseline, especially if you don't have a lot of features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python Package &amp;amp; Code: The most common tool is Scikit-learn (sklearn).&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.neighbors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;

&lt;span class="c1"&gt;# X = your features
# y = your target classes
&lt;/span&gt;
&lt;span class="c1"&gt;# 1. Create the model (e.g., we'll look at 5 neighbors)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KNeighborsClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_neighbors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Train the model (it just stores the data)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Get class predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
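&lt;p&gt;One practical caveat worth making concrete: because KNN relies on distances, a feature measured in the hundreds will drown out a feature measured in single digits. A minimal sketch (with made-up data) of standardizing features before KNN:&lt;/p&gt;

```python
# Scale features before KNN so both columns contribute to the distance.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Two features on very different scales (synthetic, illustrative data)
X_train = [[1, 200], [2, 180], [3, 210], [10, 900], [11, 950], [12, 880]]
y_train = [0, 0, 0, 1, 1, 1]

# The pipeline standardizes each feature, then runs KNN on the scaled values.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)
print(model.predict([[2.5, 195]]))  # lands in the first cluster -> [0]
```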

&lt;h2&gt;
  
  
  4. Support Vector Machines (SVM)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; SVM is a powerful classification algorithm that finds the optimal "hyperplane" (a boundary line) that best separates data points into different classes. Its main goal is to find the line that has the largest possible "margin" or buffer zone between the closest points of each class. These closest points are called the "support vectors."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For complex classification tasks where classes are well-defined but may not be separable by a simple straight line.&lt;/li&gt;
&lt;li&gt;In high-dimensional spaces (data with many features), such as text classification (where every word is a feature) or image recognition.&lt;/li&gt;
&lt;li&gt;When you need a model that is robust against overfitting, especially in cases with many features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Data Scientist's "Sense":&lt;/strong&gt; You should think of SVM when you need a highly accurate classifier and believe a clear separating boundary exists, even if it's complex. If Logistic Regression is too simple, but a Neural Network seems like overkill, SVM is your strong, sophisticated middle-ground. It's particularly powerful for text classification and other "wide" data problems (more columns/features than rows).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python Package &amp;amp; Code: The most common tool is Scikit-learn (sklearn).&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.svm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SVC&lt;/span&gt;

&lt;span class="c1"&gt;# X = your features
# y = your target classes
&lt;/span&gt;
&lt;span class="c1"&gt;# 1. Create the model
# (kernel='linear' is a straight line, 'rbf' is more complex)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SVC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rbf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="c1"&gt;# 2. Train the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Get class predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h1&gt;
  
  
  II. Ensemble Methods (The Power-Players)
&lt;/h1&gt;

&lt;p&gt;Ensemble Methods are techniques that combine multiple machine learning models to produce one superior model. Instead of relying on a single "expert," this approach gets the "opinion" (prediction) from a diverse group of models and combines them.&lt;/p&gt;
&lt;h2&gt;
  
  
  5. Decision Trees
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; A Decision Tree is an intuitive algorithm that works like a flowchart. It asks a series of sequential "if-then-else" questions about your data's features, splitting the data at each step. This process continues until it reaches a "leaf node" that provides a final prediction (either a class or a numerical value).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For both classification (e.g., "survived" or "died") and regression (e.g., "predict price") tasks.&lt;/li&gt;
&lt;li&gt;When the most important requirement is interpretability. You can visually see and explain every step the model took to reach its decision.&lt;/li&gt;
&lt;li&gt;As the fundamental building block for more powerful ensemble models like Random Forests and XGBoost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Data Scientist's "Sense":&lt;/strong&gt; You should think of a Decision Tree whenever a non-technical stakeholder needs to understand why a prediction is being made. It's the "white-box" model. While often not the most accurate on its own (it can easily "overfit" or memorize the data), it's the perfect tool for explaining complex relationships in a simple, visual way and serves as a great baseline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python Package &amp;amp; Code: The most common tool is Scikit-learn (sklearn).&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# For classification:
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.tree&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt;

&lt;span class="c1"&gt;# For regression:
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.tree&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeRegressor&lt;/span&gt;

&lt;span class="c1"&gt;# X = your features
# y = your target classes
&lt;/span&gt;
&lt;span class="c1"&gt;# 1. Create the model (e.g., limit depth to prevent overfitting)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Train the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Get class predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
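&lt;p&gt;The interpretability claim is easy to verify in code: scikit-learn can print the fitted tree's "flowchart" as plain-text if-then rules. A small sketch, using invented feature names and data:&lt;/p&gt;

```python
# Print the learned if-then rules of a fitted tree as plain text.
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: [age, has_ticket] -> survived (0/1); names are illustrative
X_train = [[25, 0], [30, 1], [45, 1], [50, 0], [23, 1], [60, 1]]
y_train = [0, 1, 1, 0, 1, 1]

model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# Every split and leaf becomes a readable rule you can show a stakeholder.
print(export_text(model, feature_names=["age", "has_ticket"]))
```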

&lt;h2&gt;
  
  
  6. Random Forests
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; A Random Forest is an ensemble algorithm. It builds a large number of individual Decision Trees during training. For a new prediction, each tree "votes," and the Random Forest outputs the most popular class (for classification) or the average (for regression) from all the trees. It uses randomness when building the trees to ensure they are all different, which makes the combined model much more powerful and accurate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For both classification and regression tasks where you need high accuracy and robustness.&lt;/li&gt;
&lt;li&gt;When you want to prevent overfitting, which is a common problem with single Decision Trees.&lt;/li&gt;
&lt;li&gt;To get a good "out-of-the-box" model with very little tuning required.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Data Scientist's "Sense":&lt;/strong&gt; This is the go-to, workhorse algorithm. You should think of Random Forest when a single Decision Tree isn't accurate enough. It's the "wisdom of the crowd" approach—one tree might be wrong, but the average of 1,000 trees is highly reliable. It's almost always a strong first choice when you need a high-performance model and don't want to spend a lot of time on complex tuning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python Package &amp;amp; Code: The most common tool is Scikit-learn (sklearn).&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# For classification:
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;

&lt;span class="c1"&gt;# For regression:
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;

&lt;span class="c1"&gt;# X = your features
# y = your target
&lt;/span&gt;
&lt;span class="c1"&gt;# 1. Create the model (e.g., build 100 trees)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Train the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Get class predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  7. Gradient Boosting Machines (GBM)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; GBM is a powerful ensemble technique that builds models (typically decision trees) sequentially. Unlike Random Forest which builds trees independently, GBM builds one tree at a time, where each new tree's job is to correct the errors and weaknesses of all the trees that came before it. It's a "boosting" method because it incrementally "boosts" the model's performance by focusing on its past mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For classification and regression tasks where high accuracy is the top priority.&lt;/li&gt;
&lt;li&gt;When you are willing to spend more time tuning parameters to get the best possible performance.&lt;/li&gt;
&lt;li&gt;When a Random Forest model is performing well, but you need an extra performance boost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Data Scientist's "Sense":&lt;/strong&gt; You should think of GBM when "good" isn't good enough and you need "great." It's the "team of experts" approach: the first tree makes a guess, the second tree corrects the first tree's mistakes, the third corrects the remaining mistakes, and so on. It's extremely powerful but can overfit if not tuned carefully (e.g., by limiting the number of trees or their depth). It's the direct predecessor to XGBoost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python Package &amp;amp; Code: The most common tool is Scikit-learn (sklearn).&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# For classification:
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GradientBoostingClassifier&lt;/span&gt;

&lt;span class="c1"&gt;# For regression:
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GradientBoostingRegressor&lt;/span&gt;

&lt;span class="c1"&gt;# X = your features
# y = your target
&lt;/span&gt;
&lt;span class="c1"&gt;# 1. Create the model (e.g., build 100 trees sequentially)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GradientBoostingClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Train the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Get class predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
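&lt;p&gt;To make the "tuning" point concrete, here is one hedged sketch of a small cross-validated grid search over two key GBM parameters. The grid values and the generated dataset are illustrative, not recommendations:&lt;/p&gt;

```python
# Tune a GBM's learning rate and tree depth with a cross-validated grid search.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic classification data, just for demonstration
X, y = make_classification(n_samples=200, random_state=42)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={"learning_rate": [0.05, 0.1], "max_depth": [2, 3]},
    cv=3,  # 3-fold cross-validation for each parameter combination
)
grid.fit(X, y)
print(grid.best_params_)  # the combination with the best CV score
```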

&lt;h2&gt;
  
  
  8. XGBoost (Extreme Gradient Boosting)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; XGBoost is not a new algorithm, but a specific implementation of Gradient Boosting (GBM) that has been heavily optimized for speed, efficiency, and performance. Like GBM, it builds trees sequentially to correct errors, but it includes several clever tricks (like parallel processing and built-in "regularization") that make it faster and generally more accurate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When maximum predictive accuracy is the absolute top priority.&lt;/li&gt;
&lt;li&gt;On structured or tabular data (like spreadsheets or database tables).&lt;/li&gt;
&lt;li&gt;In data science competitions (like Kaggle), where it is famous for being a dominant, winning algorithm.&lt;/li&gt;
&lt;li&gt;When you need a model that's both high-performing and computationally efficient (faster than standard GBM).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Data Scientist's "Sense":&lt;/strong&gt; You should think of XGBoost as the default "go-to" algorithm for high-performance modeling on tabular data. It's the "race car" version of Gradient Boosting. If your Random Forest or basic GBM model is good, XGBoost is what you use to make it great. It's the first thing most data scientists try when they are serious about winning a competition or squeezing every last drop of accuracy out of their data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python Package &amp;amp; Code: It uses its own dedicated library, xgboost.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xgboost&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;

&lt;span class="c1"&gt;# For classification:
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;XGBClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# For regression:
# model = xgb.XGBRegressor()
&lt;/span&gt;
&lt;span class="c1"&gt;# X = your features
# y = your target
&lt;/span&gt;
&lt;span class="c1"&gt;# 1. Create the model
# (XGBoost has many tuning parameters, but defaults work well)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;XGBClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_label_encoder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eval_metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;logloss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Train the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Get class predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
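&lt;p&gt;The snippets in this post assume ready-made &lt;code&gt;X_train&lt;/code&gt;, &lt;code&gt;X_test&lt;/code&gt;, &lt;code&gt;y_train&lt;/code&gt; and &lt;code&gt;y_test&lt;/code&gt; variables. A minimal sketch of producing such a split with scikit-learn's &lt;code&gt;train_test_split&lt;/code&gt; (the synthetic dataset and the 25% hold-out here are illustrative assumptions, not part of the original example):&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic tabular dataset (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Hold out 25% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)  # (150, 5) (50, 5)
```

&lt;p&gt;Fixing &lt;code&gt;random_state&lt;/code&gt; simply makes the split reproducible across runs.&lt;/p&gt;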

&lt;h1&gt;
  
  
  III. Unsupervised Learning &amp;amp; Deep Learning
&lt;/h1&gt;

&lt;p&gt;Unsupervised Learning is a type of machine learning where the algorithm is given data without any labels or correct answers. It's like "learning without a teacher."&lt;/p&gt;

&lt;p&gt;Deep Learning is a specific, advanced subfield of machine learning that uses "deep" Neural Networks&amp;mdash;networks with many layers. These layers allow the model to learn incredibly complex, hierarchical patterns directly from raw data.&lt;/p&gt;
&lt;h2&gt;
  
  
  9. K-Means Clustering
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; K-Means is the most popular unsupervised algorithm. This means it's used when you don't have a target variable or pre-defined labels. Its goal is to find hidden structures in data by automatically grouping similar data points into "K" (a number you choose) distinct clusters. It works by finding "centroids" (the center point of a cluster) and assigning each data point to the nearest one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When you have unlabeled data and want to discover its natural groupings.&lt;/li&gt;
&lt;li&gt;For customer segmentation (e.g., finding different types of shoppers).&lt;/li&gt;
&lt;li&gt;For anomaly detection (points far from any cluster center can be outliers).&lt;/li&gt;
&lt;li&gt;To simplify a dataset by grouping similar items.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Data Scientist's "Sense":&lt;/strong&gt; You should think of K-Means immediately when your primary question is "What are the natural groups in my data?" or "How can I segment this?" It's not for predicting a known answer, but for discovering unknown patterns. It's the go-to tool for exploratory analysis when you need to understand your data's inherent structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python Package &amp;amp; Code: The most common tool is Scikit-learn (sklearn).&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.cluster&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KMeans&lt;/span&gt;

&lt;span class="c1"&gt;# X = your features (unlabeled data)
&lt;/span&gt;
&lt;span class="c1"&gt;# 1. Create the model (e.g., we want to find 3 clusters)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KMeans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_clusters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Train the model (it finds the clusters)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Get the cluster labels for each data point
&lt;/span&gt;&lt;span class="n"&gt;cluster_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;labels_&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Get the center point of each cluster
&lt;/span&gt;&lt;span class="n"&gt;centroids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster_centers_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
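&lt;p&gt;A natural follow-up question is how to pick K. A common heuristic is the "elbow method": fit K-Means for several values of K and watch the inertia (the sum of squared distances from each point to its nearest centroid). A minimal sketch using scikit-learn with synthetic data (the dataset and the K range here are illustrative assumptions, not part of the original example):&lt;/p&gt;

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 natural groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-Means for a range of K and record the inertia
# (sum of squared distances from points to their nearest centroid)
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    inertias[k] = model.inertia_

# Inertia always shrinks as K grows; the "elbow" is where the
# improvement flattens out (for this data, around K=3)
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

&lt;p&gt;In practice you would plot inertia against K and pick the bend in the curve by eye.&lt;/p&gt;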

&lt;h2&gt;
  
  
  10. Neural Networks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; A Neural Network is a powerful algorithm inspired by the structure of the human brain. It's built from layers of interconnected "nodes" or "neurons" that process information. "Deep Learning" simply refers to Neural Networks that have many layers ("deep" networks), allowing them to learn extremely complex, hierarchical patterns from vast amounts of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When working with unstructured data like images (e.g., object recognition), text (e.g., translation, sentiment analysis), and audio (e.g., speech-to-text).&lt;/li&gt;
&lt;li&gt;For highly complex problems where other models (like XGBoost) are not powerful enough.&lt;/li&gt;
&lt;li&gt;When peak performance is the primary goal, and "explainability" (interpretability) is less of a concern.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Data Scientist's "Sense":&lt;/strong&gt; You should think of Neural Networks as your heavy-duty, specialized tool. While XGBoost dominates on tabular (spreadsheet) data, Deep Learning is the undisputed champion for perception and language tasks. If your problem involves "seeing" (images), "hearing" (audio), or "understanding" (text), a Neural Network is almost always the right choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python Package &amp;amp; Code: The most popular libraries are Keras (often with TensorFlow) and PyTorch.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# A simple example using Keras (with TensorFlow backend)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;

&lt;span class="c1"&gt;# X = your features
# y = your target
&lt;/span&gt;
&lt;span class="c1"&gt;# 1. Create the model (a simple, sequential stack of layers)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],)))&lt;/span&gt; &lt;span class="c1"&gt;# Input layer
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;                            &lt;span class="c1"&gt;# Hidden layer
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sigmoid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;                          &lt;span class="c1"&gt;# Output layer (for classification)
&lt;/span&gt;
&lt;span class="c1"&gt;# 2. Compile the model (set up the learning process)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;adam&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;binary_crossentropy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Train the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Get predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
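&lt;p&gt;One caveat worth knowing: with a sigmoid output layer, Keras's &lt;code&gt;model.predict&lt;/code&gt; returns probabilities between 0 and 1, not class labels. A minimal sketch of thresholding them into hard labels (pure NumPy; the probability values are made up for illustration):&lt;/p&gt;

```python
import numpy as np

# Example probabilities as returned by model.predict(X_test)
# for a sigmoid output layer (values are made up for illustration)
probabilities = np.array([[0.92], [0.07], [0.51], [0.49]])

# Threshold at 0.5 to obtain hard 0/1 class labels
class_labels = (probabilities > 0.5).astype(int)

print(class_labels.ravel())  # [1 0 1 0]
```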

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;We've journeyed through 10 essential algorithms and techniques, from the foundational simplicity of Linear Regression to the advanced power of Deep Learning. Remember, the goal isn't to become a theoretical mathematician overnight, but to cultivate a practical intuition for these tools.&lt;/p&gt;


&lt;h2&gt;
  
  
  Explore more
&lt;/h2&gt;


&lt;div class="ltag__user ltag__user__id__1230121"&gt;
    &lt;a href="/luca1iu" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1230121%2F2521cc84-ad7d-458c-99e5-b4d82f625a88.jpg" alt="luca1iu image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/luca1iu"&gt;Luca Liu&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/luca1iu"&gt;Hello there! 👋 I'm Luca, a Business Intelligence Developer with passion for all things data. Proficient in Python, SQL, Power BI, Tableau&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;Thank you for taking the time to explore data-related insights with me. I appreciate your engagement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/lucaliu-data" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🚀 Connect with me on LinkedIn&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>algorithms</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Essential Services for Newcomers in Germany - Personal Recommendations</title>
      <dc:creator>Luca Liu</dc:creator>
      <pubDate>Tue, 18 Nov 2025 15:44:00 +0000</pubDate>
      <link>https://forem.com/luca1iu/essential-services-for-newcomers-in-germany-personal-recommendations-4612</link>
      <guid>https://forem.com/luca1iu/essential-services-for-newcomers-in-germany-personal-recommendations-4612</guid>
      <description>&lt;p&gt;Before diving into essential services in Germany, you might want to learn how to obtain the German Opportunity Card. I've previously written a detailed guide on &lt;a href="https://blog.luca-liu.com/article/opportunity-card-germany-my-first-hand-experience-and-complete-guide" rel="noopener noreferrer"&gt;The Opportunity Card Germany: My First-Hand Experience and Complete Guide&lt;/a&gt;, which shares the application process, required documents, and experiences after arriving in Germany. If you're planning to apply for the Opportunity Card or have already been approved, the recommended services below will help you settle smoothly in Germany.&lt;/p&gt;

&lt;h1&gt;
  
  
  Essential Services for Newcomers in Germany: My Personal Recommendations
&lt;/h1&gt;

&lt;p&gt;If you've recently arrived in Germany with an Opportunity Card or are planning to come soon, you'll need to set up essential services to make your transition smoother. Based on my personal experience, I've compiled a list of recommended services that will help you get settled quickly. By using my referral links, we can both benefit from special bonuses!&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Expatrio - Your Blocked Account Solution
&lt;/h2&gt;

&lt;p&gt;When preparing for your visa application, you'll need to deposit your financial proof into a &lt;strong&gt;Blocked Account&lt;/strong&gt;. I personally used Expatrio's services, which made the process incredibly convenient. After arriving in Germany, Expatrio automatically transfers the monthly unfrozen amount to your designated account.&lt;/p&gt;

&lt;p&gt;Sign up using my link to get started: &lt;a href="https://www.expatrio.com/?f=xianjingl1" rel="noopener noreferrer"&gt;https://www.expatrio.com/?f=xianjingl1&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. N26 - Modern Digital Banking
&lt;/h2&gt;

&lt;p&gt;N26 is a leading digital bank in Germany that offers a fully mobile banking experience. Their services include a free basic account, easy international transfers, and investment options for stocks and funds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My experience:&lt;/strong&gt; N26 has been extremely convenient - you can open an account remotely via video verification, link it to Apple Pay and Google Pay, and even invest in stocks and funds. I use N26 as my salary account, and the standard plan meets all my basic needs. Transfers between friends are instant!&lt;/p&gt;

&lt;p&gt;Join N26 today using my invitation link: &lt;a href="https://n26.com/r/xianjinl1671" rel="noopener noreferrer"&gt;https://n26.com/r/xianjinl1671&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Telekom - Premium Mobile and Internet Service
&lt;/h2&gt;

&lt;p&gt;Telekom is Germany's largest telecommunications provider, offering mobile plans, home internet, and TV services with excellent coverage throughout the country.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My experience:&lt;/strong&gt; I use Telekom's network and the signal quality is outstanding. Their coverage extends to rural areas where other providers might have weak signals.&lt;/p&gt;

&lt;p&gt;Use my referral link to sign up and we can both receive a cash bonus of up to €90: &lt;a href="https://www.telekom-empfehlen.de/PcT4hPGk" rel="noopener noreferrer"&gt;https://www.telekom-empfehlen.de/PcT4hPGk&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Ostrom - Smart Green Energy Provider
&lt;/h2&gt;

&lt;p&gt;Ostrom is an innovative green energy provider offering flexible monthly electricity contracts without long-term commitments. Their smart app allows you to track your energy usage in real-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My experience:&lt;/strong&gt; The mobile app makes it extremely convenient to monitor electricity usage, and their green energy focus aligns with my environmental values.&lt;/p&gt;

&lt;p&gt;You can save up to 35% on your electricity bill (approximately €500 per year on average) with Ostrom. Sign up with my referral code to receive a €50 bonus or €100 store credit: &lt;a href="https://join.ostrom.de/?referralCode=XIANEJCXJC" rel="noopener noreferrer"&gt;https://join.ostrom.de/?referralCode=XIANEJCXJC&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Payback - Germany's Popular Loyalty Program
&lt;/h2&gt;

&lt;p&gt;Payback is Germany's largest loyalty program, partnering with numerous retailers including supermarkets, drug stores, gas stations, and online shops. You collect points with every purchase that can be redeemed for cash or rewards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My experience:&lt;/strong&gt; This is a money-saving essential in Germany! The program includes many stores you'll visit regularly. Simply scan your Payback code after the cashier scans your items to collect points instantly. I earn about €200 cash back per year through this program.&lt;/p&gt;

&lt;p&gt;Register using my link to receive 200 bonus points: &lt;a href="https://www.payback.de/anmelden/freunde-werben?mgm-ref=c6d1ccf5-362e-4fc0-8adf-6677707797c6&amp;amp;excid=mgm&amp;amp;incid=mgm" rel="noopener noreferrer"&gt;https://www.payback.de/anmelden/freunde-werben?mgm-ref=c6d1ccf5-362e-4fc0-8adf-6677707797c6&amp;amp;excid=mgm&amp;amp;incid=mgm&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. American Express - Premium Credit Cards
&lt;/h2&gt;

&lt;p&gt;American Express offers various credit cards in Germany with benefits ranging from travel insurance to rewards points and exclusive offers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My experience:&lt;/strong&gt; I got the rose gold metal card after finding employment in Germany. While the €20 monthly fee isn't cheap, the points can be exchanged for airline miles and the card itself is beautifully designed.&lt;/p&gt;

&lt;p&gt;Apply through my referral link: &lt;a href="https://americanexpress.com/de-de/referral/gold?ref=xIAOYQMATA&amp;amp;XL=MIANS" rel="noopener noreferrer"&gt;https://americanexpress.com/de-de/referral/gold?ref=xIAOYQMATA&amp;amp;XL=MIANS&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  7. American Express Payback Card - No Annual Fee Option
&lt;/h2&gt;

&lt;p&gt;This American Express card is co-branded with Payback, allowing you to collect additional points on your Payback account with every purchase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My experience:&lt;/strong&gt; The biggest advantage is that this card has no annual fee. You can earn extra points when shopping at Payback partner stores: every €3 spent earns 1 point. Using my link, you can receive an additional 2,000 Payback points (equivalent to €20).&lt;/p&gt;

&lt;p&gt;Apply for the Amex Payback card using my link: &lt;a href="https://americanexpress.com/de-de/referral/payback?ref=xIANJL6aY9&amp;amp;XL=MIMNS" rel="noopener noreferrer"&gt;https://americanexpress.com/de-de/referral/payback?ref=xIANJL6aY9&amp;amp;XL=MIMNS&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Setting up these essential services will make your transition to life in Germany much smoother. Using referral links not only helps you get started quickly but also provides additional bonuses for both of us. Welcome to Germany, and I hope these recommendations help you settle in comfortably!&lt;/p&gt;




&lt;h2&gt;
  
  
  Explore more
&lt;/h2&gt;


&lt;div class="ltag__user ltag__user__id__1230121"&gt;
    &lt;a href="/luca1iu" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1230121%2F2521cc84-ad7d-458c-99e5-b4d82f625a88.jpg" alt="luca1iu image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/luca1iu"&gt;Luca Liu&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/luca1iu"&gt;Hello there! 👋 I'm Luca, a Business Intelligence Developer with passion for all things data. Proficient in Python, SQL, Power BI, Tableau&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Thank you for taking the time to explore data-related insights with me. I appreciate your engagement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/lucaliu-data" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;🚀 Connect with me on LinkedIn&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>germany</category>
      <category>job</category>
      <category>career</category>
    </item>
    <item>
      <title>English Speaking Companies in Germany</title>
      <dc:creator>Luca Liu</dc:creator>
      <pubDate>Tue, 18 Nov 2025 15:43:00 +0000</pubDate>
      <link>https://forem.com/luca1iu/english-speaking-companies-in-germany-2loe</link>
      <guid>https://forem.com/luca1iu/english-speaking-companies-in-germany-2loe</guid>
      <description>&lt;h2&gt;
  
  
  English-Speaking Companies in Germany: A Guide for Job Seekers
&lt;/h2&gt;

&lt;p&gt;As a foreigner who spent 8 months searching for a job in Germany, I understand the challenges of finding English-speaking opportunities in a predominantly German-speaking country. After applying to over 500 positions and interviewing with numerous companies, I've compiled this list of companies where English is the primary working language, to help fellow job seekers who aren't fluent in German.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. &lt;a href="https://jobs.sap.com/" rel="noopener noreferrer"&gt;SAP&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;SAP is one of Germany's largest software companies and a global leader in enterprise application software. With approximately 107,000 employees worldwide and headquartered in Walldorf, Baden-Württemberg, SAP has a market value of over €160 billion.&lt;/p&gt;

&lt;p&gt;My experience: During my interview process, English was used throughout. The manager explicitly mentioned that while German language skills are a plus, they are not mandatory. SAP's international environment makes it an excellent choice for non-German speakers in the tech industry.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. &lt;a href="https://www.google.com/aclk?sa=l&amp;amp;ai=DChsSEwjm4Nv60oKPAxUOmIMHHR1IDNIYACICCAEQABoCZWY&amp;amp;co=1&amp;amp;ase=2&amp;amp;gclid=Cj0KCQjwqebEBhD9ARIsAFZMbfxLCq0jSjIE5ffwcWHYV_KrZoZDeaxFkNs5k2Fa596spe5OrxdCS1EaAuVuEALw_wcB&amp;amp;category=acrcp_v1_48&amp;amp;sig=AOD64_0_7Ny472YFvJtBkUmJ9CO76aqTWA&amp;amp;q&amp;amp;nis=4&amp;amp;adurl&amp;amp;ved=2ahUKEwiAsdP60oKPAxUs6wIHHQ7CJNIQ0Qx6BAgOEAE" rel="noopener noreferrer"&gt;ALDI Süd&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;ALDI Süd is a global discount supermarket chain with its headquarters in North Rhine-Westphalia. With over 6,500 stores worldwide and approximately 155,000 employees, it's one of the largest retailers in Germany.&lt;/p&gt;

&lt;p&gt;My experience: I had two interviews with ALDI Süd, and both were conducted entirely in English. The interviewers didn't even ask if I preferred English or German, indicating their comfort with an international workforce. Their IT department particularly operates in an English-speaking environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. &lt;a href="https://kaufland-ecommerce.com/karriere/jobs/" rel="noopener noreferrer"&gt;Kaufland E-commerce&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Kaufland is a German hypermarket chain with a growing e-commerce division. Part of the Schwarz Group (which also owns Lidl), Kaufland has approximately 132,000 employees across Europe and a strong presence in the digital retail space.&lt;/p&gt;

&lt;p&gt;My experience: I met their recruiters and team members at the ITCS Tech Conference in Cologne. When I inquired about German language requirements, the manager confirmed they operate in an English-speaking environment. Although I didn't receive an interview opportunity, this information was surprising and valuable.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. &lt;a href="https://www.free-now.com/career/jobs/" rel="noopener noreferrer"&gt;Freenow&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Freenow is a mobility service provider headquartered in Hamburg. With around 1,000 employees, it's one of Europe's leading mobility platforms operating in over 100 European cities.&lt;/p&gt;

&lt;p&gt;Freenow has established itself as a technology-driven company with a diverse, international team. Their working language is English, making it accessible for international tech professionals. The company offers various roles in software development, data science, and product management.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. &lt;a href="https://www.allianzgi.com/en/our-firm/career" rel="noopener noreferrer"&gt;&lt;strong&gt;Allianz Global Investors&lt;/strong&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Allianz Global Investors (AllianzGI) is a major asset management company headquartered in Frankfurt. As part of the Allianz Group, one of the world's largest financial services providers, AllianzGI manages approximately €582 billion in assets for institutional and retail investors worldwide. With over 25 offices globally and around 2,500 employees, the company has a truly international presence.&lt;/p&gt;

&lt;p&gt;My experience: I advanced to the second round of interviews. The first round was with an HR representative located in Romania, and the second round involved three managers based in Germany, two of whom didn't speak German. This confirms their full English working environment, especially in their investment division.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. &lt;a href="https://www.holidaycheckgroup.com/karriere/" rel="noopener noreferrer"&gt;HolidayCheck&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;HolidayCheck is a leading online travel agency and review site headquartered in Munich. With approximately 300 employees, it's a significant player in the European travel tech sector.&lt;/p&gt;

&lt;p&gt;As one of Germany's most popular travel platforms, HolidayCheck maintains an English-speaking work environment to accommodate its international team. The company focuses on technology and user experience, offering various positions for software engineers, product managers, and data specialists.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. &lt;a href="https://www.uniper.energy/de/karriere/stellenangebote?gad_source=1&amp;amp;gad_campaignid=18308266038&amp;amp;gbraid=0AAAAADf56GV-Af-xtlRYYldCJn0ESg-a-&amp;amp;gclid=Cj0KCQjwqebEBhD9ARIsAFZMbfxv6ACF8orvRFkGiwehWGwIF6EarDgz3NgrczDeKC5xXQo4doj3BfUaAsm8EALw_wcB" rel="noopener noreferrer"&gt;Uniper&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Uniper is a German energy company that focuses on power generation, global energy trading, and energy services. Uniper’s headquarters is located in Düsseldorf, Germany. The company operates across multiple countries, with key markets in Europe, Russia, and other parts of the globe. Uniper employs roughly 7,000 people worldwide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Supporting Each Other in the German Job Market
&lt;/h2&gt;

&lt;p&gt;Finding a job in Germany without fluent German skills can be challenging. I hope this list helps reduce uncertainty in your job search journey.&lt;/p&gt;

&lt;p&gt;If you know other English-speaking companies in Germany, please share in the comments. Together, we can build a comprehensive resource for international job seekers navigating the German job market.&lt;/p&gt;

&lt;p&gt;Your contributions could significantly ease someone else's job search in Germany!&lt;/p&gt;




&lt;h2&gt;
  
  
  Explore more
&lt;/h2&gt;

&lt;div class="ltag__user ltag__user__id__1230121"&gt;
    &lt;a href="/luca1iu" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1230121%2F2521cc84-ad7d-458c-99e5-b4d82f625a88.jpg" alt="luca1iu image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/luca1iu"&gt;Luca Liu&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/luca1iu"&gt;Hello there! 👋 I'm Luca, a Business Intelligence Developer with passion for all things data. Proficient in Python, SQL, Power BI, Tableau&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;




&lt;p&gt;Thank you for taking the time to explore data-related insights with me. I appreciate your engagement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/lucaliu-data" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;🚀 Connect with me on LinkedIn&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>germany</category>
      <category>job</category>
      <category>career</category>
    </item>
  </channel>
</rss>
