<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Brenda .. </title>
    <description>The latest articles on Forem by Brenda ..  (@b-qt).</description>
    <link>https://forem.com/b-qt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F431349%2F07f88f4f-345c-438f-be8e-33ed2f623f8e.png</url>
      <title>Forem: Brenda .. </title>
      <link>https://forem.com/b-qt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/b-qt"/>
    <language>en</language>
    <item>
      <title>Rebuilding the national narrative with AI and Docker</title>
      <dc:creator>Brenda .. </dc:creator>
      <pubDate>Tue, 21 Apr 2026 10:50:34 +0000</pubDate>
      <link>https://forem.com/b-qt/rebuilding-the-national-narrative-with-ai-and-docker-16jj</link>
      <guid>https://forem.com/b-qt/rebuilding-the-national-narrative-with-ai-and-docker-16jj</guid>
      <description>&lt;p&gt;Whether you’re a &lt;strong&gt;Founder&lt;/strong&gt; in the 22@ district trying to track market shifts, or a &lt;strong&gt;Tech Enthusiast&lt;/strong&gt; looking for your first break in the industry, the problem is identical: &lt;em&gt;the "Information Sludge."&lt;/em&gt; Every day, Spain generates headlines across Finance, Tech, and Real Estate. &lt;br&gt;
Most of it is noise... I want the signal.&lt;/p&gt;

&lt;p&gt;So I built a refinery.&lt;/p&gt;

&lt;p&gt;The refinery isn't just a "scrape the news" script. It's an automated engine that reads the national sentiment and distills it into actionable insights.&lt;/p&gt;




&lt;h2&gt;
  
  
  🍽️  The "Data Kitchen" Architecture
&lt;/h2&gt;

&lt;p&gt;To explain this to my non-tech friends, I tell them to imagine a high-end restaurant in Poblenou:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Ingestion (Mage AI): My Head Chef... Every 6 hours, he hits the digital markets (Google News) to find the freshest raw ingredients.&lt;/li&gt;
&lt;li&gt;The Brain (Robertuito NLP): My Specialist Saucier... This is a specialized AI model trained on native Spanish text. It doesn't just "read"; it understands the cultural nuance to detect if the "vibe" is Positive, Neutral, or Negative.&lt;/li&gt;
&lt;li&gt;The Transformation (dbt): My Bouncer. He runs integrity tests. If a headline is a duplicate or a "ghost" with no link, it doesn't get past the velvet rope.&lt;/li&gt;
&lt;li&gt;The Showroom (Streamlit): My Waiter. He serves the final, high-purity insights on interactive charts.&lt;/li&gt;
&lt;/ul&gt;
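&lt;p&gt;The four stages above can be sketched as plain Python functions. This is a toy, in-memory sketch of the flow, not the actual pipeline; the class, function names and sample headlines are all illustrative:&lt;/p&gt;

```python
from dataclasses import dataclass

# Toy sketch of the four "kitchen" stages; names and data are illustrative.
@dataclass
class Headline:
    title: str
    link: str
    sentiment: str = "NEU"

def ingest():
    # Mage AI's role: fetch raw headlines (hard-coded here for the sketch).
    return [
        Headline("La banca sube", "https://example.com/1"),
        Headline("La banca sube", "https://example.com/1"),  # duplicate
        Headline("Sin enlace", ""),                          # "ghost" with no link
    ]

def analyze(items):
    # Robertuito's role: tag sentiment (stubbed with a keyword check).
    for h in items:
        h.sentiment = "POS" if "sube" in h.title else "NEU"
    return items

def transform(items):
    # dbt's role: drop duplicates and link-less "ghosts".
    seen, clean = set(), []
    for h in items:
        if h.link and h.link not in seen:
            seen.add(h.link)
            clean.append(h)
    return clean

def serve(items):
    # Streamlit's role: here reduced to a sentiment-count summary.
    return {h.sentiment: sum(1 for x in items if x.sentiment == h.sentiment)
            for h in items}

result = serve(transform(analyze(ingest())))
```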




&lt;h2&gt;
  
  
  🏗️ For the Docker Wizards: Forging the "Steel"
&lt;/h2&gt;

&lt;p&gt;If you live in the terminal, you know that "Integrity" isn't a buzzword—it’s a configuration. Building this refinery required overcoming some hurdles that nearly melted my CPU.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The Cartesian Explosion (The "Freeze" Bug) 🧊&lt;br&gt;
Early on, I joined my raw data to my "date spine" too early in the pipeline. It created a Cartesian product that multiplied rows exponentially. It was like trying to fit the entire crowd of a Barça match into a single tiny bar in the Raval.&lt;br&gt;
The Fix: Pre-Aggregation. I moved a lot of the "cooking" into CTEs, shrinking the data to its daily grain before the joins. The machine finally stopped freezing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Docker Path Labyrinth (The GPS Mismatch) 🧭&lt;br&gt;
Containers are isolated worlds. My code was looking for database files in a local Barcelona folder, while the container was sitting in its own "Virtual Madrid."&lt;br&gt;
The Fix: Absolute Path Sovereignty. I enforced deterministic paths starting from the root (/home/src/). The system is now self-healing; if the environment resets, the refinery knows exactly how to rebuild its own infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Runtime vs. Build-time War ⚔️&lt;br&gt;
Installing heavy AI libraries like PyTorch every time the container started made the "cold start" slower than a siesta in August.&lt;br&gt;
The Fix: I shifted from runtime installation to Build-time Provisioning. By "baking" the heavy dependencies into the Docker image, the refinery is now Ready to Serve the second it turns on.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
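&lt;p&gt;A minimal sketch of the build-time provisioning idea. The base image, package list and entrypoint are illustrative assumptions, not the project's actual Dockerfile:&lt;/p&gt;

```dockerfile
# Heavy dependencies are "baked" into the image at build time, so they
# live in a cached layer instead of being re-installed on every start.
FROM python:3.11-slim
RUN pip install --no-cache-dir torch transformers

# Deterministic root path, matching the absolute-path fix above.
COPY . /home/src/
WORKDIR /home/src

# Hypothetical entrypoint; the container is Ready to Serve on boot.
CMD ["python", "pipeline.py"]
```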




&lt;p&gt;In Data Engineering, your Lineage Graph is your bond. Inspired by the principle of being an "ensample in purity," I codify my ethics into the SQL.&lt;/p&gt;

&lt;p&gt;By using TRY_CAST and NULLIF patterns, I neutralized data "sludge" (like literal 'null' strings in CSVs) before it could poison the metrics. I standardized codes and fixed typos. &lt;br&gt;
Why? Because Clean Data is the highest form of professional honesty.&lt;/p&gt;
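&lt;p&gt;A minimal sketch of the pattern, using SQLite with a user-defined function standing in for TRY_CAST (warehouses such as DuckDB and Snowflake provide TRY_CAST natively; the column names here are invented):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE staging (raw_score TEXT);
INSERT INTO staging VALUES ('0.8'), ('null'), (''), ('n/a');
""")

def try_cast_float(value):
    # Stand-in for TRY_CAST: return NULL instead of failing on bad input.
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

con.create_function("try_cast_float", 1, try_cast_float)

# NULLIF turns literal 'null' strings and empty strings into real NULLs;
# the cast then either succeeds or yields NULL -- it never poisons metrics.
scores = [r[0] for r in con.execute("""
SELECT try_cast_float(NULLIF(NULLIF(raw_score, 'null'), ''))
FROM staging
""")]
```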




&lt;h2&gt;
  
  
  🚀 The Roadmap: What’s Next?
&lt;/h2&gt;

&lt;p&gt;I have set a deadline to land my next role here in Barcelona, and the refinery is my primary proof of work. But a true architect never stops building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Market Deep-Listening: Expanding the "Loophole Ingestor" to catch specific industrial shifts, predicting economic trends before they hit the mainstream.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏁 The Bottom Line
&lt;/h2&gt;

&lt;p&gt;I didn’t just build a dashboard; I built an engine. In a world of fake news and fragmented data, the Spanish Pulse is a reminder that we can use technology to find the truth—and have a little Joie de Vivre while doing it.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/b-qt/news-monitor/blob/main/README.md" rel="noopener noreferrer"&gt;repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>opensource</category>
      <category>docker</category>
    </item>
    <item>
      <title>Analysis of a Reviews dataset</title>
      <dc:creator>Brenda .. </dc:creator>
      <pubDate>Mon, 11 Dec 2023 16:00:03 +0000</pubDate>
      <link>https://forem.com/b-qt/analysis-of-a-reviews-dataset-15lc</link>
      <guid>https://forem.com/b-qt/analysis-of-a-reviews-dataset-15lc</guid>
      <description>&lt;p&gt;Looking through kaggle, there's a lot of projects you and I can do when deeping your feet into the scary world of data science. Of all the datasets, discussions and notebooks on the platform is the &lt;em&gt;Amazon Reviews&lt;/em&gt; dataset; but alas why take a project without an understanding of future deliverables, defining project process.&lt;br&gt;&lt;br&gt;
In undertaking this project, I wondered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why is this project important?&lt;/li&gt;
&lt;li&gt;How will this project help in the real-world? &lt;/li&gt;
&lt;li&gt;What does a potential employer see by adding this project to my portfolio? What do I want them to see? &lt;/li&gt;
&lt;li&gt;How do I showcase this project in my portfolio? &lt;/li&gt;
&lt;li&gt;What is the main takeaway from this project for myself, and anyone who comes across it?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Defining the project
&lt;/h3&gt;

&lt;p&gt;By defining the project, we are able to manage expectations, set a time frame and plan steps to execute the project. Generally, for most data projects, the outline is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data collection and Initial analysis&lt;/li&gt;
&lt;li&gt;Pre-processing&lt;/li&gt;
&lt;li&gt;Feature Extraction&lt;/li&gt;
&lt;li&gt;Model selection and Evaluation &lt;/li&gt;
&lt;li&gt;Deployment
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm not looking to re-invent the wheel so I'm using the same approach. &lt;/p&gt;

&lt;h3&gt;
  
  
  Process
&lt;/h3&gt;

&lt;p&gt;For this project, the &lt;strong&gt;data is taken&lt;/strong&gt; from the &lt;a href="https://drive.google.com/file/d/0Bz8a_Dbh9QhbZVhsUnRWRDhETzA/view?resourcekey=0-Rp0ynafmZGZ5MflGmvwLGg" rel="noopener noreferrer"&gt;amazon reviews data source&lt;/a&gt;; in the source, the data was split into train and test files, but I merged them so I could randomize the split myself.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy620bpgzm5xz6dlus6yc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy620bpgzm5xz6dlus6yc.png" alt="Loading the data" width="576" height="279"&gt;&lt;/a&gt;&lt;br&gt;
Looking through the data, I realized there were 6 columns: 2 numerical and 4 textual, which I noticed were fragments of split reviews.&lt;/p&gt;

&lt;p&gt;To &lt;strong&gt;preprocess the data&lt;/strong&gt;, I merged the textual data into a single review and got a singular rating value for each review.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fre1j4agef3v9owpa6fes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fre1j4agef3v9owpa6fes.png" alt="Data Preprocessing" width="576" height="261"&gt;&lt;/a&gt;&lt;br&gt;
I was interested in &lt;em&gt;looking at the general ratings&lt;/em&gt; and &lt;em&gt;identifying the main topic of each review&lt;/em&gt;. &lt;br&gt;
But why would this knowledge be important? Who needs insights such as these? &lt;em&gt;Retailers need to understand which products are selling and which aren't&lt;/em&gt;. It would have been more insightful to have location information to better help retailers, but alas, such data wasn't included in the dataset.&lt;/p&gt;
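&lt;p&gt;The merge step can be sketched like this; the column layout (two numeric rating fields followed by four review fragments) is my assumption for illustration, not the dataset's exact schema:&lt;/p&gt;

```python
import statistics

def merge_row(row):
    """Collapse a split row into (rating, full_review).

    Assumes the first two fields are numeric ratings and the rest are
    fragments of one review -- an illustrative layout, not the real schema.
    """
    nums = [float(v) for v in row[:2]]
    text = " ".join(part.strip() for part in row[2:] if part.strip())
    return statistics.mean(nums), text

rating, review = merge_row(
    ["4", "5", "Great book,", "arrived fast.", "", "Would buy again."]
)
```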

&lt;p&gt;After preparing the data, the next step was to get the topics from the sentiments (&lt;em&gt;Feature Extraction&lt;/em&gt;). &lt;/p&gt;

&lt;p&gt;Hugging Face is known for its Transformers Python library, which simplifies the process of downloading and training machine learning models. I used 2 of these models to 'predict' the ratings of the reviews and compared the predictions to the ratings provided in the dataset.&lt;br&gt;
To evaluate the models' predictions, I employed an evaluate_models function.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cijw0zx1tsx5g810cq1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cijw0zx1tsx5g810cq1.png" alt="Evaluate Models" width="592" height="217"&gt;&lt;/a&gt;&lt;br&gt;
I found that the AutoTokenizer-based model is less accurate than Hugging Face's DistilBERT; regardless, I left it in as a visible lesson.&lt;/p&gt;
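&lt;p&gt;A toy version of what an evaluate_models function might look like; the "models" here are trivial lambdas standing in for the real Hugging Face pipelines, and all names are illustrative:&lt;/p&gt;

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions matching the provided ratings.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def evaluate_models(models, reviews, true_ratings):
    """Score each model's predicted ratings against the provided ones.

    `models` maps a name to a callable review -> rating; in the real
    project these wrap Hugging Face models.
    """
    return {name: accuracy(true_ratings, [predict(r) for r in reviews])
            for name, predict in models.items()}

models = {
    "distilbert": lambda r: 5 if "great" in r.lower() else 1,
    "baseline":   lambda r: 5,  # always predicts the top rating
}
scores = evaluate_models(models, ["Great book", "Terrible"], [5, 1])
```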

&lt;p&gt;Finally, I used CountVectorizer to convert the text data into numerical representations that machine learning models can understand. CountVectorizer tokenizes the text, removes common English stop words, and builds a vocabulary of known words. The result is a fixed-length vector of numbers representing the occurrences of words in the text.&lt;br&gt;
After vectorizing the data, I used LDA, NMF and SVD to &lt;strong&gt;get topics from each review&lt;/strong&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x302dq4mp4eavihlpcw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x302dq4mp4eavihlpcw.png" alt="Topic Extraction" width="577" height="409"&gt;&lt;/a&gt;&lt;br&gt;
From these columns, I derived an additional &lt;em&gt;column of common topics&lt;/em&gt; shared across all the topic-extraction columns.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrqh288ucbqioyjc6q03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrqh288ucbqioyjc6q03.png" alt="Common Topics" width="471" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generating these columns makes me want to actually &lt;em&gt;look at these&lt;/em&gt; ratings and their topics.&lt;br&gt;
The data we ended up with after all this manipulation looks like this:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pv5d4eetndu7kfrulzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pv5d4eetndu7kfrulzj.png" alt="Final data" width="800" height="241"&gt;&lt;/a&gt;&lt;br&gt;
And from the data, these charts were produced&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57ss7kmk8g9rgu7zi6uo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57ss7kmk8g9rgu7zi6uo.png" alt="Common Topics and Ratings" width="729" height="1325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivy43d1c1g58shf202pk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivy43d1c1g58shf202pk.png" alt="Common Topics and Average Ratings" width="800" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges
&lt;/h3&gt;

&lt;p&gt;While working on this project, I faced the issue of having too much data; it was hard to manipulate due to its volume.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To overcome this, I sampled the data and then processed the samples in concurrent batches. &lt;em&gt;Concurrency abstracts away the complexity of multi-threading and multi-processing and enables asynchronous execution&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
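&lt;p&gt;The sample-then-batch approach can be sketched with the standard library's concurrent.futures; the scoring function below is a stub standing in for the real (slow) model call:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def batched(items, size):
    # Split a list into consecutive chunks of at most `size` items.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def score_batch(batch):
    # Stub for the real model inference, which is the slow part.
    return [len(review) for review in batch]

reviews = [f"review {i}" for i in range(10)]

# map() preserves order, so results line up with the input reviews.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = [r for chunk in pool.map(score_batch, batched(reviews, 3))
               for r in chunk]
```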

&lt;h3&gt;
  
  
  Lessons learned
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;As data scientists, we are accustomed to working in notebooks rather than scripts. But these days, a data scientist needs to be familiar and comfortable with writing and working with scripts. &lt;/li&gt;
&lt;li&gt;Before commencing a project, &lt;em&gt;especially a portfolio project&lt;/em&gt;, think about what you want to showcase and what you are learning, and always practice time management. Be sure to look at the data and think about where it comes from (in the future, where would this data come from, and who would collect it?); then think about how this can be automated.&lt;/li&gt;
&lt;li&gt;What is your final product? Where does it live? What about updates? I chose to deploy the results on &lt;a href="https://amazonreviews-prediction.streamlit.app/" rel="noopener noreferrer"&gt;streamlit&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Future progress
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In the future, I want to pull data from an API. The data would then be more relevant and could include other attributes, such as location, which would be important to retailers. The model would be updated accordingly.&lt;/li&gt;
&lt;li&gt;If I do use an API, I could use Apache Airflow to build a pipeline that automates this via a DAG in a script.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;These are my thoughts on a beginner data science project: data analysis in Python and sentiment analysis of an Amazon reviews dataset collected over a span of 18 years. They should be relevant to a beginner data scientist or anyone wanting to get into the data science field.&lt;br&gt;
Hopefully, you learnt something from my ramblings.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>beginners</category>
      <category>dataanalysis</category>
    </item>
  </channel>
</rss>
