<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Cesar Mostacero</title>
    <description>The latest articles on Forem by Cesar Mostacero (@cesar_mostacero).</description>
    <link>https://forem.com/cesar_mostacero</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F141855%2F3c0360e7-abe9-43b3-9f48-b0b44189c927.png</url>
      <title>Forem: Cesar Mostacero</title>
      <link>https://forem.com/cesar_mostacero</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/cesar_mostacero"/>
    <language>en</language>
    <item>
      <title>In BIG Data, SMALL Things Matter</title>
      <dc:creator>Cesar Mostacero</dc:creator>
      <pubDate>Tue, 02 Jan 2024 23:40:52 +0000</pubDate>
      <link>https://forem.com/cesar_mostacero/in-big-data-small-things-matter-gi7</link>
      <guid>https://forem.com/cesar_mostacero/in-big-data-small-things-matter-gi7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the ever-accelerating world of big data, where technological advancements unfold at an unprecedented pace, it's easy to become enamored with cutting-edge solutions. However, amidst the excitement, the critical significance of meticulous planning, the dangers of over-engineering, and the enduring principles of core programming often take a back seat. This article aims to shed light on the often-overlooked aspects that can make or break a big data project — the small things.&lt;/p&gt;

&lt;h2&gt;
  
  
  Good Planning and the Imperative of Automated Processes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dw_kCGgC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ohuyni4v2y0ovexvl678.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dw_kCGgC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ohuyni4v2y0ovexvl678.jpg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the realm of big data, where processes tend to move at a slower pace compared to other development fields, quality planning is non-negotiable. The sheer volume of data not only impacts processing capabilities but also involves a complex web of cross-dependent teams, resources, and systems. Every change must undergo meticulous analysis, considering not only the happy path but all possible scenarios, including edge cases. Automation has become the backbone of this process, extending from the development lifecycle to observability and, where feasible, even to support and maintenance layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;

&lt;p&gt;Consider a data engineering project aiming to build a robust ETL (Extract, Transform, Load) pipeline to handle large volumes of diverse data sources. In the era of big data, meticulous quality planning is paramount to ensure the reliability and efficiency of such pipelines.&lt;/p&gt;

&lt;p&gt;Now, imagine a scenario where the team relies on manual processes for data validation and testing. Each change in the pipeline triggers a manual review, involving multiple teams and extensive coordination. This process is not only time-consuming but also prone to human error, especially when dealing with intricate data transformations and dependencies.&lt;/p&gt;

&lt;p&gt;The true power of quality planning shines through when automation becomes the cornerstone of the process. By implementing automated testing at every stage of the ETL pipeline, from data ingestion to transformation and loading, the team can catch potential issues early in the development lifecycle. Automated tests simulate real-world scenarios, ensuring that the pipeline not only performs efficiently but also maintains data integrity across diverse datasets.&lt;/p&gt;

&lt;p&gt;In contrast to the manual approach, automation accelerates the development cycle, reduces the risk of errors, and enhances the overall quality of the data engineering project. The mantra becomes clear: in the realm of data engineering, where precision is paramount, automation is not just a convenience; it's a necessity for ensuring the success of complex and dynamic pipelines.&lt;/p&gt;
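
&lt;p&gt;To make this concrete, here is a minimal plain-Python sketch of an automated validation gate. The schema and rules (an &lt;code&gt;id&lt;/code&gt; field and a non-negative &lt;code&gt;amount&lt;/code&gt;) are hypothetical, chosen only to illustrate the idea: every run applies the same checks, with no manual review involved.&lt;/p&gt;

```python
# Minimal sketch of an automated validation stage in an ETL pipeline.
# The record schema and rules below are hypothetical illustrations.

def validate_record(record):
    """Return True only if the record passes all quality rules."""
    has_id = record.get("id") is not None
    amount = record.get("amount")
    amount_ok = isinstance(amount, (int, float)) and amount >= 0
    return has_id and amount_ok

def run_validation_stage(records):
    """Split records into valid rows and rejects, as an automated gate."""
    valid = [r for r in records if validate_record(r)]
    rejects = [r for r in records if not validate_record(r)]
    return valid, rejects

rows = [
    {"id": 1, "amount": 10.5},
    {"id": None, "amount": 3.0},   # broken row: missing id
    {"id": 2, "amount": -1},       # broken row: negative amount
]
valid, rejects = run_validation_stage(rows)
print(len(valid), len(rejects))  # 1 valid row, 2 rejected
```

&lt;p&gt;In a real pipeline, a stage like this would run automatically on every change, and the rejected rows would feed observability dashboards or alerts instead of waiting for a manual review.&lt;/p&gt;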

&lt;h2&gt;
  
  
  Over-Engineering: Technology Alignment with Business Needs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Bbu0w5h7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yplo746dzek1fyr80zsh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Bbu0w5h7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yplo746dzek1fyr80zsh.jpg" alt="Image description" width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The plethora of technologies available to solve similar big data problems presents both opportunities and challenges. The ideal scenario involves selecting a technology that aligns seamlessly with the specific business problem at hand. Technology should be a flexible tool that adapts to the business, not a constraint that dictates how the business operates. Choosing a technology based solely on trends can lead to complex implementations, turning what should be a smooth process into a development nightmare.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;

&lt;p&gt;In the labyrinth of big data technologies, choosing the right tool for the job is crucial. Let's delve into the world of data engineering, where the choice of technology can significantly impact project success. Consider a scenario where a team, enticed by the latest and most complex data processing framework, opts for a solution that promises unparalleled performance and scalability. However, the catch is that this technology comes with a steep learning curve and demands extensive customization.&lt;/p&gt;

&lt;p&gt;Despite its technical prowess, the chosen technology does not align seamlessly with the specific business problem at hand. In this case, the business requirements were relatively straightforward: process and analyze incoming data streams for real-time insights. The chosen complex framework, while technically impressive, introduced unnecessary complexities and increased the project's time-to-market.&lt;/p&gt;

&lt;p&gt;Opting for a technology solely based on trends and technical capabilities can lead to over-engineering, turning what should be a streamlined data engineering process into a cumbersome development nightmare. A more pragmatic approach would have involved selecting a technology that aligns closely with the business needs, offering the necessary features without introducing unnecessary complexities. This way, the team could have delivered a robust solution more efficiently, meeting both technical and business objectives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logic Is the Core: The Power of Unit Testing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2ethOW4w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7q1ebjbzsad1nd88lmyg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2ethOW4w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7q1ebjbzsad1nd88lmyg.jpg" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the realm of development, especially for those new to object-oriented programming (OOP), the value of unit testing is often underestimated. A fundamental principle every engineer should remember is that logic issues need to be addressed within unit tests, not in higher or shared environments. Robust unit testing not only ensures the reliability of the code but also significantly reduces the time spent on support tasks, a crucial advantage in a field where processes tend to move at a slower pace.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;

&lt;p&gt;Imagine a scenario where a critical data pipeline was not thoroughly tested before deployment. In the rush to meet tight deadlines, the team skipped comprehensive unit testing, relying solely on the assumption that the code worked as intended. The oversight went unnoticed until the system was in production, and a seemingly minor logic error caused a cascading failure. The beauty of unit testing lies in catching such errors before they reach the production environment. In this case, had the team tested the pipeline thoroughly on a local environment, they could have replicated and rectified the issue, avoiding the costly consequences of a failure in a live system.&lt;/p&gt;
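
&lt;p&gt;A tiny illustration of the point, in plain Python with a hypothetical transformation: a unit test like this runs locally in seconds and would catch a logic error long before it reaches production.&lt;/p&gt;

```python
# Minimal sketch of a unit test catching a logic error locally.
# The transformation is hypothetical; a bug such as an inverted
# condition here would silently corrupt data in production, and a
# local unit test is the cheapest place to catch it.

def apply_discount(price, is_member):
    """Members get 10% off; everyone else pays full price."""
    if is_member:
        return round(price * 0.9, 2)
    return price

# A tiny unit test, run locally long before the pipeline is deployed:
def test_apply_discount():
    assert apply_discount(100.0, True) == 90.0    # member pays less
    assert apply_discount(100.0, False) == 100.0  # non-member pays full

test_apply_discount()
print("all checks passed")
```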

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8Aa0cIpr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/odchaej7s4ueeumg5j23.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8Aa0cIpr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/odchaej7s4ueeumg5j23.jpg" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the vast landscape of big data projects, it's imperative not to underestimate the impact of seemingly "small" things. Quality planning, steering clear of over-engineering, and adhering to core programming principles are not mere niceties but the pillars supporting successful and sustainable big data endeavors. As we navigate the complexities of big data, let's not forget that sometimes, the smallest details make the biggest difference.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>development</category>
      <category>community</category>
      <category>data</category>
    </item>
    <item>
      <title>Certified !== qualified?</title>
      <dc:creator>Cesar Mostacero</dc:creator>
      <pubDate>Wed, 19 Oct 2022 00:57:28 +0000</pubDate>
      <link>https://forem.com/cesar_mostacero/certified-qualified-4ipm</link>
      <guid>https://forem.com/cesar_mostacero/certified-qualified-4ipm</guid>
      <description>&lt;p&gt;Recently I had the chance to get some certifications in different technologies, following the plan I designed at the beginning of the year... Combined with a discussion I saw on LinkedIn about whether they are worth it, I wanted to share my opinion on this topic.&lt;/p&gt;

&lt;p&gt;While I completely agree that &lt;code&gt;certified !== qualified&lt;/code&gt;, I personally think certifications are still worth it, but we need to be clear about the real benefits we can get from them to avoid eventual disappointment.&lt;/p&gt;

&lt;p&gt;Below are the main points I think we should consider regarding these processes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Certifications might open a door, but they're not the complete solution
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z3GLxe4O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ynxtld2n9w5v5klc074m.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z3GLxe4O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ynxtld2n9w5v5klc074m.jpeg" alt="Certifications as differentiators" width="640" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are always different motives behind wanting to take a certification; one of the most common is when someone is looking to apply for a new job or position.&lt;/p&gt;

&lt;p&gt;Depending on the position, the posting might list some “nice to have” items, and certifications are usually found in that section. However, a certification is just one part of the expected profile. This should be implicit and obvious, but for beginners it is worth clarifying: we cannot expect that getting certified will shorten the recruitment process.&lt;/p&gt;

&lt;p&gt;Something similar applies when someone expects to be promoted to the next role by getting a certification. Each company has guidelines (or it should have), with expected skills and levels to claim a promotion; but again, a certification is just one feature of the complete profile, so we need to be clear about that.&lt;/p&gt;

&lt;p&gt;Certifications are a nice differentiator in this competitive world, because not everyone has them. But they are not the only bullet on the list; keep that in mind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Knowledge is the core
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jz3f0Z3X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wvh3hic5vrfc3n7v2sok.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jz3f0Z3X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wvh3hic5vrfc3n7v2sok.jpeg" alt="Rubik cube" width="639" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How many times did you see someone at school cheat on an exam and pass anyway? That’s one of the reasons proctored exams exist. But even with that, we cannot assume that someone who gets certified is qualified enough for a certain position.&lt;/p&gt;

&lt;p&gt;The badge you can display on LinkedIn is nice; however, it is just that: an institution verifying that you passed an exam.&lt;/p&gt;

&lt;p&gt;The real value behind a certification, which is sometimes ignored or underestimated, is the knowledge gained from the process: the institution verifies your approaches, concepts, and understanding of a technology, and during the preparation for the exam you either confirm your experience or learn new, valid concepts.&lt;/p&gt;

&lt;p&gt;Again, once you get the certification, you can share that fancy badge on social media, and that’s great! However, the day you face a problem to solve with technology X, the badge won’t help; the time you spent on preparation will count as an investment if you gained the knowledge, or as a waste of time if you only studied to pass the exam.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimize your time
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WVcC4VPb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gbb8tst5caipi3zic7bb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WVcC4VPb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gbb8tst5caipi3zic7bb.jpeg" alt="Time" width="640" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Connecting with the above point, there is something we need to consider at almost every step of our career: time.&lt;/p&gt;

&lt;p&gt;Who doesn’t want to know everything? Or in this case, get “all” (or many) certifications…&lt;/p&gt;

&lt;p&gt;The most valuable asset is always time, and we need to use it smartly.&lt;/p&gt;

&lt;p&gt;As a recommendation, plan as much as you can and make decisions based on that plan. Yes, it can be modified along the way (new technologies, methodologies, etc.). However, it’s quite important to at least define a career path; from there, it becomes much easier to decide whether or not you need a certification (or any other process, asset, etc.).&lt;/p&gt;

&lt;p&gt;You will work with different technologies throughout the projects in your career; some of them will be aligned with what you want, and some others will not.&lt;/p&gt;

&lt;p&gt;Let’s look at an example:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Suppose you’re a mid-level software engineer, currently working on a temporary project that demands Java as the main programming language and AWS as the cloud provider, and your plan is to become a software architect.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So, taking the current assignment and trying to take advantage of the experience you’re getting right now, it would be better to go for some entry-level certifications on the given cloud, in this case AWS, rather than a Java-programmer certification, right?&lt;/p&gt;

&lt;p&gt;With this, I’m not saying that a Java-programmer certification is not worth it, but in the context of the example, we need to prioritize the short-, mid-, and long-term impact of the invested time. It would be different if the person in the example wanted to specialize, in the short or mid term, in a Java solution and/or framework; in that case, I would lean the opposite way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some additional recommendations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take what you need and will apply in the short term; in the end, if you don’t apply it, you’ll forget it.&lt;/li&gt;
&lt;li&gt;Consider expiration dates: not all, but some certifications (AWS, GCP, etc.) have an expiration date.&lt;/li&gt;
&lt;li&gt;Define a preparation plan covering time and costs. Then evaluate whether your current context (project and personal time) will allow you to prepare well for the test.&lt;/li&gt;
&lt;li&gt;Take certifications based on a career plan.&lt;/li&gt;
&lt;li&gt;Play smart. Some certifications list others as prerequisites, while others just list them as recommended, without blocking you from taking the exam; GCP is one example. If you feel ready to attempt a Professional level directly, go for it. But if you’re just starting with GCP, choose an entry-level one. Avoid unnecessary frustration.&lt;/li&gt;
&lt;li&gt;Decide for yourself. Everything is relative to context, so listen to and read all opinions, but analyze them based on what you want.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These were the main points I considered in that discussion about whether certifications are still worth it… but, &lt;strong&gt;what do you think?&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>career</category>
      <category>certifications</category>
    </item>
    <item>
      <title>Apache-Spark introduction for SQL developers</title>
      <dc:creator>Cesar Mostacero</dc:creator>
      <pubDate>Thu, 29 Sep 2022 00:30:32 +0000</pubDate>
      <link>https://forem.com/cesar_mostacero/apache-spark-introduction-for-sql-developers-3j9i</link>
      <guid>https://forem.com/cesar_mostacero/apache-spark-introduction-for-sql-developers-3j9i</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When I started to work on data engineering projects, my first assignment was developing ETLs with Scala and Apache Spark. At that point, my background was as a software engineer, in roles like web developer with JavaScript, with a little bit of native Android development using Java.&lt;/p&gt;

&lt;p&gt;Talking about the learning curve for Apache Spark, the reality is that it was not complex, thanks to that background: Java, as a strongly typed language, helped with the first steps in Scala; JavaScript helped with functional programming; and SQL helped with Spark transformations.&lt;/p&gt;

&lt;p&gt;I mention the above context because, during these years, I have had the chance to work as a mentor for people who want or need to learn Apache Spark.&lt;/p&gt;

&lt;p&gt;From this experience, I saw there are two common “profiles”, depending on background: people with a software engineering background, and people working as SQL developers (for ETL).&lt;/p&gt;

&lt;p&gt;For people with software development experience, the learning curve is usually shorter compared with SQL-only people. This is because that profile is already familiar with programming languages like Python, Scala, and Java (the ones used to work with Spark), with terms like immutability, and with paradigms like functional programming.&lt;/p&gt;

&lt;p&gt;SQL developers (not all of them) usually do not have that context or practical experience; therefore, the learning curve covers not only Apache Spark itself, but also programming languages, terms, paradigms, etc.&lt;/p&gt;

&lt;p&gt;One key to the learning process is constant progress while avoiding frustration; the last section of this article lists some recommendations to help keep the learning process smooth.&lt;/p&gt;

&lt;p&gt;Let’s try to illustrate some core concepts of how Apache Spark works, no matter the programming language, focused on SQL developers who want to start with this technology.&lt;/p&gt;

&lt;h2&gt;
  
  
  RDDs and DataFrames
&lt;/h2&gt;

&lt;p&gt;These are the basic data structures that Spark uses. While the RDD is the basic (and original) data structure, the DataFrame is an optimized one.&lt;/p&gt;

&lt;p&gt;Let’s talk about DataFrames.&lt;/p&gt;

&lt;p&gt;A DataFrame is an object with a tabular structure, defined by a schema that can be provided by the developer or inferred by Spark. An easy comparison, for better understanding, is a SQL table: a collection of rows with a defined schema.&lt;/p&gt;

&lt;p&gt;So, how is the DataFrame “populated” with data? Well, there are multiple ways to do it, but a real example would be reading a SQL table, or some files in cloud storage, using Spark. Let’s take a look at the illustration below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UuEajdUo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kmus0cmeox9e2ixje6np.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UuEajdUo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kmus0cmeox9e2ixje6np.png" alt="Image description" width="880" height="601"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The illustration above is a representation of how this object can be created. The data is read into DataFrame objects in the Spark session, which lives in the memory of your cluster.&lt;/p&gt;

&lt;p&gt;The data in a DataFrame, after a read, is just a copy of the original source, meaning that operations performed on the Spark DataFrame won’t affect the real source data. For example, if we uppercase a string column, the change applies only to the DataFrame values, not to the SQL table that was used in the read statement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Immutability
&lt;/h2&gt;

&lt;p&gt;In the previous example, we mentioned a scenario of converting a string column’s values to uppercase; let’s use the same scenario to illustrate a programming concept used by objects like DataFrames and RDDs.&lt;/p&gt;

&lt;p&gt;There are multiple ways to do it, but for this case, let’s consider the example below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EZkfAuUH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ua6uemmsxml52yq1qnc0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EZkfAuUH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ua6uemmsxml52yq1qnc0.png" alt="Image description" width="880" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, there is a transformation (a select statement) applied to the initial DataFrame (df1). This function takes three columns; for one of them (column 1), it generates an uppercase value for each row, and the corresponding output column gets an alias. For the other two columns, no change is applied to either the values or the column names. The result (return value) is a new DataFrame: df2.&lt;/p&gt;

&lt;p&gt;This example represents how DataFrame transformations work.&lt;/p&gt;

&lt;p&gt;These kinds of functions, where we manipulate the data, are not executed at the moment they are declared; instead, Spark performs some optimizations, grouping sets of transformations that can be executed together.&lt;/p&gt;

&lt;p&gt;These groups of transformations are known as stages (also seen as sets of Spark tasks).&lt;/p&gt;

&lt;p&gt;For now, what we need to keep in mind is that we cannot directly manipulate the data of a DataFrame; it cannot be updated like a database table. But we can create new DataFrames from existing ones.&lt;/p&gt;

&lt;p&gt;There is no restriction on creating as many DataFrames as we want; in the end, the limit is not the number, but the memory we consume.&lt;/p&gt;
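
&lt;p&gt;Since the mechanics matter more than the Spark API here, the following plain-Python analogy (no Spark required) mimics the select-with-uppercase scenario: the “transformation” returns a new collection and never mutates its input.&lt;/p&gt;

```python
# Plain-Python analogy of DataFrame immutability (no Spark required).
# A "transformation" builds and returns a new collection; the input
# collection is never modified in place.

df1 = [
    {"col1": "alice", "col2": 1, "col3": "x"},
    {"col1": "bob",   "col2": 2, "col3": "y"},
]

def select_upper(rows):
    """Mimic: select(upper(col1).alias('col1_upper'), col2, col3)."""
    return [
        {"col1_upper": r["col1"].upper(), "col2": r["col2"], "col3": r["col3"]}
        for r in rows
    ]

df2 = select_upper(df1)

print(df2[0]["col1_upper"])  # ALICE
print(df1[0]["col1"])        # alice  (df1 is untouched)
```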

&lt;h2&gt;
  
  
  Transformations and actions
&lt;/h2&gt;

&lt;p&gt;In the previous section, we talked about transformations. Now, let’s talk about actions and the differences between the two.&lt;/p&gt;

&lt;p&gt;In Spark, there are two categories of DataFrame functions: transformations and actions.&lt;/p&gt;

&lt;p&gt;While transformations are related to business logic (e.g. select, filter, groupBy), actions are the functions that trigger execution of themselves and the previous stages (e.g. show, collect).&lt;/p&gt;

&lt;p&gt;There is a concept in Spark known as “lazy evaluation”, which basically means that Spark won’t execute the functions (transformations) until there is a trigger (an action). With that, Spark has pre-execution and execution steps where it can optimize the logic internally (see more on the optimizer: &lt;a href="https://data-flair.training/blogs/spark-sql-optimization/"&gt;https://data-flair.training/blogs/spark-sql-optimization/&lt;/a&gt;).&lt;/p&gt;
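
&lt;p&gt;Lazy evaluation can be mimicked in plain Python with generators (this is only an analogy, not the Spark API): the “transformations” describe work without doing it, and nothing executes until an “action” consumes the result.&lt;/p&gt;

```python
# Plain-Python analogy of lazy evaluation (no Spark required).
# Generator expressions, like Spark transformations, describe work
# without performing it; nothing runs until something consumes them.

log = []

def numbers():
    for n in [1, 2, 3, 4]:
        log.append("read %d" % n)
        yield n

# "Transformations": nothing has executed yet, so the log is empty.
doubled = (n * 2 for n in numbers())
filtered = (n for n in doubled if n > 4)
assert log == []

# "Action": consuming the pipeline triggers the whole chain at once.
result = list(filtered)
print(result)    # [6, 8]
print(len(log))  # all 4 reads happened only now
```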

&lt;h2&gt;
  
  
  Shuffling
&lt;/h2&gt;

&lt;p&gt;One thing to remember is that Spark is intended to be executed on a cluster (it also has the option to run locally, for development purposes), so the data is distributed among the nodes to take advantage of distributed processing, making the process faster.&lt;/p&gt;

&lt;p&gt;While optimization is an advanced topic, we need to know at least the principle of how data is moved within the cluster: shuffling.&lt;/p&gt;

&lt;p&gt;When a DataFrame is created, its data is distributed across the workers (cluster nodes). For operations like filter or map, every worker already has enough data to process its own rows into the next stage (row-level operations).&lt;/p&gt;

&lt;p&gt;However, for operations like groupBy, the whole dataset must be compared in order to group rows according to the established conditions. This data movement is known as shuffling, since data is transferred over the network within our cluster. For a better understanding, let’s see the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Gd1uyJRI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jhm53ate3zqiqfmthyk1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Gd1uyJRI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jhm53ate3zqiqfmthyk1.png" alt="Image description" width="699" height="663"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the illustration above, we can see that each worker has a partition of df1, meaning a portion of the data lives on each of those workers. Therefore, if we want to apply a groupBy, the workers must transfer data among the cluster nodes to group the rows.&lt;/p&gt;
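
&lt;p&gt;A plain-Python sketch of the same idea (the hash partitioning below is a simplification, not the exact Spark partitioner): each row must be routed to the worker that owns its key before the groups can be computed.&lt;/p&gt;

```python
# Plain-Python sketch of why groupBy needs a shuffle (no Spark required).
# Rows start out split across two hypothetical workers; to group by key,
# every row must first be routed to the worker that owns that key.

worker_0 = [("a", 1), ("b", 2)]  # partition living on worker 0
worker_1 = [("a", 3), ("b", 4)]  # partition living on worker 1

def route(key, num_workers=2):
    """Decide which worker owns a key (simple hash partitioning)."""
    return sum(ord(c) for c in key) % num_workers

# The "shuffle": rows travel over the network to the owning worker.
partitions = [[], []]
for row in worker_0 + worker_1:
    partitions[route(row[0])].append(row)

# After the shuffle, each worker can aggregate its keys locally.
grouped = []
for part in partitions:
    totals = {}
    for key, value in part:
        totals[key] = totals.get(key, 0) + value
    grouped.append(totals)

print(grouped)
```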

&lt;p&gt;Another common scenario where shuffling occurs is the join statement: records from both DataFrames need to be compared in order to join them.&lt;/p&gt;

&lt;p&gt;An advanced topic, and a common interview question, concerns the best practice for joining a large DataFrame with a tiny one.&lt;/p&gt;

&lt;p&gt;The main purpose of this question is to find out whether you know how to manage shuffling in order to optimize performance. There is a feature in Spark that allows you to “send” the tiny DataFrame to each of the nodes in the cluster: broadcasting. This replaces the comparison of all the records across the cluster, reducing the network traffic load and increasing performance.&lt;/p&gt;

&lt;p&gt;Obviously, this comes with the condition that the tiny DataFrame fits into worker memory (remember, Spark works in memory). While this can be done manually (explicitly, by the programmer), Spark 3 already integrates an optimization feature (enabled by default) that internally performs the broadcast join even when it’s not explicitly declared in the code.&lt;/p&gt;
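
&lt;p&gt;The following plain-Python sketch (no Spark required, and with hypothetical tables) shows why broadcasting removes the shuffle: the tiny table is copied to every worker as a local dictionary, so each worker joins its own rows without moving them.&lt;/p&gt;

```python
# Plain-Python sketch of a broadcast join (no Spark required).
# The tiny table is copied ("broadcast") to every worker as a dict,
# so the large table can be joined locally, with no shuffle at all.

tiny_table = {"US": "United States", "MX": "Mexico"}  # broadcast copy

large_partition = [  # one worker's slice of the big table
    {"order_id": 1, "country": "US"},
    {"order_id": 2, "country": "MX"},
    {"order_id": 3, "country": "US"},
]

# Each worker joins its own rows against its local copy of tiny_table.
joined = [
    dict(row, country_name=tiny_table.get(row["country"]))
    for row in large_partition
]

print(joined[0]["country_name"])  # United States
```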

&lt;h2&gt;
  
  
  Data Persistence
&lt;/h2&gt;

&lt;p&gt;All data loaded into a Spark session, whether a DataFrame or an RDD, lives in memory and reaches the end of its lifecycle once the application finishes. This means that if the data is not persisted explicitly, it won’t be available to future applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;If you’re a beginner with Apache Spark, this list is a set of recommendations that might help you in your first steps with this technology:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Learning by doing:&lt;/strong&gt; Courses, tutorials, and documentation are great resources for the learning process. However, you cannot become a professional soccer player just by reading the rulebook or watching matches every day, right? The same goes for programming. Practice is the key! If something is not clear enough, don’t ignore it; get more examples and practice until you get it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If possible, start with a strongly typed language (preferably Scala):&lt;/strong&gt; This might be especially helpful if you don’t have much experience with programming (beyond SQL), because a strongly typed language will help you debug initial errors while practicing data transformations, and it plugs into IDEs for code snippets, auto-completion, etc. In the end, once you have learned Spark’s core concepts, the same knowledge can be applied in the rest of the languages (adapting to the API restrictions of each specific language).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At the beginning, use notebooks:&lt;/strong&gt; It’s important to remove frustration from the initial learning curve, so as much as we can, we need to avoid extra steps and keep the focus on the goal, in this case learning Apache Spark. Notebooks let you go directly to the practice step, without any additional configuration and/or setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unit testing:&lt;/strong&gt; Take advantage of Spark and its programming languages and add unit tests for your code; try to keep it clean, and follow principles like single responsibility to ensure the quality of your pipeline.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databricks Community Edition:&lt;/strong&gt; Related to the above point, Databricks Community Edition provides a pre-configured Spark environment, including notebooks, storage (to create tables by loading files), etc. With a few clicks, you can create a Spark cluster, tables, and notebooks for practice, without worrying about installation and configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid reinventing the wheel, and keep it simple:&lt;/strong&gt; Spark, in the end, is for data processing, similar to what you may have done with SQL (but cleaner and easier, ha!). Try to map your existing knowledge onto the new technology. Example: Spark functions like filter and where are the same as SQL’s WHERE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progressive learning:&lt;/strong&gt; Avoid going deeper than needed, at least in the beginning. Try to understand the general basics first before dealing with the advanced corners of a specific topic. Example: at some point you’ll need to configure a Spark session with integrations, advanced configs for optimization, etc. At the beginning, however, that’s not relevant; it’s more important to master data manipulation before moving to the configuration stage.&lt;/li&gt;
&lt;/ul&gt;
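&lt;p&gt;To make the filter/where mapping above concrete, here is a tiny, hedged sketch in plain Python (lists of dicts stand in for a DataFrame, since a real Spark session isn’t needed to see the idea; the function and data are hypothetical). The same single-responsibility shape is also what makes such transformations easy to unit test:&lt;/p&gt;

```python
# Plain-Python stand-in for a Spark-style transformation (hypothetical names).
# keep_adults mirrors:
#   Spark:  df.filter(col("age") >= 18)   /  df.where("age >= 18")
#   SQL:    SELECT * FROM people WHERE age >= 18

def keep_adults(rows):
    """Single-responsibility transformation: keep rows with age >= 18."""
    return [row for row in rows if row["age"] >= 18]

data = [
    {"name": "Ana", "age": 21},
    {"name": "Luis", "age": 15},
    {"name": "Mia", "age": 30},
]

print([row["name"] for row in keep_adults(data)])  # ['Ana', 'Mia']
```

&lt;p&gt;Because the logic lives in one small function, a unit test only needs a list literal and an assertion, no cluster required.&lt;/p&gt;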

</description>
      <category>apachespark</category>
      <category>dataengineering</category>
      <category>beginners</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Developing data-pipelines: Quality is not negotiable</title>
      <dc:creator>Cesar Mostacero</dc:creator>
      <pubDate>Wed, 21 Sep 2022 01:02:36 +0000</pubDate>
      <link>https://forem.com/cesar_mostacero/developing-data-pipelines-quality-is-not-negotiable-31a0</link>
      <guid>https://forem.com/cesar_mostacero/developing-data-pipelines-quality-is-not-negotiable-31a0</guid>
      <description>&lt;p&gt;I know, I know… this is not only applicable for data pipelines, but I wanted to elaborate this with a set specific points, some of them might be shared among different development areas, but other will be directly related to data engineering.&lt;/p&gt;

&lt;p&gt;The final deliverable of a data engineering project resides not only in processing and storing some data, but also in its quality. In the end, data is nothing if it’s not meaningful; why should we pay for storage of something that does not generate a relevant insight?&lt;/p&gt;

&lt;p&gt;The size of the data in these kinds of projects carries an associated cost (especially if we’re working in the cloud), mainly storage and computing. So, every execution of a pipeline costs money. That’s one reason we should use resources properly and avoid unexpected charges from executing low-quality code or unoptimized processes. &lt;/p&gt;

&lt;p&gt;Some weeks ago I was in a conversation where a great sentence came up: “quality is not negotiable” - and it’s totally true!&lt;/p&gt;

&lt;p&gt;And now you might think: that’s obvious, no? It should be! However, there are some practices that impact it indirectly: the planning and estimation steps.&lt;/p&gt;

&lt;p&gt;Today, with agile methodologies, we need to give due importance to phases like design/planning and testing, which are sometimes underestimated and are actually key to future processes: support and maintenance.&lt;/p&gt;

&lt;p&gt;Estimation and negotiation are fundamental stages of every project. You’ll need the soft skills to justify the estimated development time, and that estimate must cover the tasks required to ensure our solution has enough quality before deploying it to production.&lt;/p&gt;

&lt;p&gt;There are two points in time where you can ensure (or fix) the quality of your data: during development (the happy time…) or during a prod fix, where, if there was not enough time before, the situation is even worse.&lt;/p&gt;

&lt;p&gt;While data engineering usually refers to developing data pipelines, that’s just one part of the job. The solution is much more than the ETL source code: we can integrate elements like orchestration services, monitoring tools and other components to handle quality in our solution, so that the final output data meets the desired standards and expectations.&lt;/p&gt;

&lt;p&gt;Here is a list of recommendations that can help you design a data pipeline solution with quality as a priority, with the intention of reducing post-deployment time spent on tasks like hot fixes, maintenance, adjustments, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explicit instead of implicit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Suppose we need to extract data from a table composed of 3 columns, but this table is not owned by our team. We then need to load this data into a target table with the same number and order of columns. A read statement like “select *” would work with the initial configuration; however, your solution then depends on the source and target never changing over time, because any change would break it. In this example, since the source is not owned by your team, the chances of a change happening only increase.&lt;/p&gt;
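&lt;p&gt;A minimal, hypothetical sketch of the explicit approach: instead of “select *”, we name the three agreed columns, so a surprise column upstream is ignored and a removed column fails loudly instead of silently misaligning the load:&lt;/p&gt;

```python
# Hypothetical sketch: explicit column selection instead of "select *".

EXPECTED_COLUMNS = ["id", "name", "amount"]  # the 3 columns in the agreed contract

def extract(row):
    """Copy only the agreed columns, failing loudly if one disappears."""
    missing = [c for c in EXPECTED_COLUMNS if c not in row]
    if missing:
        raise ValueError(f"source schema changed, missing: {missing}")
    return {c: row[c] for c in EXPECTED_COLUMNS}

# the source grew an unexpected column; the explicit extract simply ignores it
source_row = {"id": 1, "name": "abc", "amount": 9.5, "new_col": "surprise"}
print(extract(source_row))  # {'id': 1, 'name': 'abc', 'amount': 9.5}
```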

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--McWFsaee--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2bl6bwpc81rf6lm2qvun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--McWFsaee--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2bl6bwpc81rf6lm2qvun.png" alt="Image description" width="880" height="652"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identify dependencies and ownerships&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We usually work with different teams and stakeholders. All projects have dependencies, not only downstream but also upstream. Ideally, we shouldn’t start coding against a source or destination entity/service without knowing the information relevant to our process. &lt;/p&gt;

&lt;p&gt;So, identify all the dependencies that the pipeline will have, internal and external. This will help you find the points of contact and the communication/information channels to get accurate information about those entities. &lt;/p&gt;

&lt;p&gt;For example, if we’re going to consume an external API, we’ll need to treat it as an external input dependency, for which we can gather information like authentication requirements, rates, limits, etc. &lt;/p&gt;

&lt;p&gt;That way, you will know what to expect from the source to be consumed: how much data you can extract, its format, and all the information around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid assumptions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Related to the example above: we can’t think only of the ideal scenario or happy path. We need to know all the specifications of our dependencies, from input to output, to define and plan a solution. &lt;/p&gt;

&lt;p&gt;Illustrating this in a use case:&lt;br&gt;
We need to create an ETL that consumes an API every day based on a timestamp field (for example: get all the data since the previous execution). This ETL will be scheduled, but we also want a backfill for the last year of data. After the ETL had been scheduled for some days, we didn’t see any error; but once we ran it for the backfill, the job suddenly failed. The reason? The API has an api-call limit, and since we ran it for a considerable number of days at once, it reached the maximum and started returning error messages.&lt;/p&gt;

&lt;p&gt;The scenario above is something we could have predicted, right?&lt;/p&gt;

&lt;p&gt;If we have listed all the conditions of our source, we can design a flow based on what we want to achieve. There are multiple possible solutions for this use case, from an automated retry once the limit is reached, to performing the backfill in batches, etc. But to design it, the relevant information must be documented and taken into consideration during the design phase.&lt;/p&gt;
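&lt;p&gt;One of those solutions, backfilling in batches that respect the documented call limit, can be sketched as follows (the limit value and function names are hypothetical):&lt;/p&gt;

```python
# Hedged sketch of a rate-limit-aware backfill plan.
from datetime import date, timedelta

API_DAILY_CALL_LIMIT = 100  # assumption: limit documented by the source API

def plan_backfill(start, end, calls_per_day=1, limit=API_DAILY_CALL_LIMIT):
    """Split the inclusive [start, end] range into batches under the call limit."""
    days = []
    current = start
    while current != end + timedelta(days=1):
        days.append(current)
        current = current + timedelta(days=1)
    batch_size = max(1, limit // calls_per_day)
    return [days[i:i + batch_size] for i in range(0, len(days), batch_size)]

# 10 days at 25 calls each: batches of 4 days stay under the 100-call limit
batches = plan_backfill(date(2022, 1, 1), date(2022, 1, 10), calls_per_day=25)
print([len(b) for b in batches])  # [4, 4, 2]
```

&lt;p&gt;Each batch would then run (and be retried) independently, instead of one giant run that hits the limit halfway through.&lt;/p&gt;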

&lt;p&gt;Make sure that everything coded is based on real facts instead of assumptions. This might sound obvious; however, the opposite happens quite often.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identify edge cases and define action items&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Considering the amount of data that the pipeline will process, the chances of facing unexpected data increase: errors in the data format, changes to source names, etc.&lt;/p&gt;

&lt;p&gt;However, this is not just about the data and catching malformed records. Let’s list some of the possible scenarios we’ll need to think about:&lt;br&gt;
    - source data: nulls, data formats, schemas, etc.&lt;br&gt;
    - sources: timeouts, heavy data loads, network issues, etc.&lt;br&gt;
    - transformations/logic: empty inputs, large inputs, etc.&lt;br&gt;
    - execution: empty outputs, etc.&lt;/p&gt;
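&lt;p&gt;A small, hypothetical pre-flight check illustrates how several of these cases (empty input, nulls, missing fields) can be detected explicitly before the expensive transformation runs:&lt;/p&gt;

```python
# Hypothetical sketch: validate a batch before processing it.

def validate_batch(rows, required_fields):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if not rows:
        problems.append("empty input")
    for i, row in enumerate(rows):
        for field in required_fields:
            if field not in row:
                problems.append(f"row {i}: missing field {field!r}")
            elif row[field] is None:
                problems.append(f"row {i}: null value in {field!r}")
    return problems

rows = [{"id": 1, "ts": None}, {"id": 2}]
print(validate_batch(rows, ["id", "ts"]))
# ["row 0: null value in 'ts'", "row 1: missing field 'ts'"]
```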

&lt;p&gt;&lt;strong&gt;Make the possible errors part of the solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scenarios like a timeout error while reading the source, or an unauthorized-access error due to a temporary session lock, as well as some of the points above, will result in a runtime exception, and we don’t want that, right?&lt;/p&gt;

&lt;p&gt;We need to keep all of those edge cases in mind and identify every alternative path our pipeline should handle. Anticipating the errors and designing the flow around them helps us ensure quality in the output data. Enabling auto-retries, working with alerts, etc. will make the solution smart enough to take action depending on the type of error.&lt;/p&gt;
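&lt;p&gt;As a minimal sketch of this idea (all names hypothetical), a retry wrapper can treat transient errors such as timeouts as part of the flow, retrying with backoff, while anything still failing after the last attempt surfaces and would trigger an alert:&lt;/p&gt;

```python
# Hedged sketch: auto-retry for transient errors, with exponential backoff.
import time

class TransientError(Exception):
    """Stand-in for timeouts, rate limits, temporary session locks, etc."""

def run_with_retries(step, max_attempts=3, base_delay=0.01):
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts:
                raise  # exhausted: let it surface and trigger an alert
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

calls = {"n": 0}

def flaky_read():
    """Simulated source read that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] != 3:
        raise TransientError("timeout reading source")
    return "data"

print(run_with_retries(flaky_read))  # 'data', after 2 retried failures
```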

&lt;p&gt;Start from the basics: it’s really helpful to draw the problem to visualize and understand it better before coding. Creating a flow diagram also helps. Do not underestimate these resources; in the end, the code is just the final step of the development, a translation into a programming language of the solution you already designed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor as much as you can&lt;/strong&gt;&lt;br&gt;
This is one of my favorite parts of data pipeline development, where you can add integrations to get a real and accurate status of your project. &lt;/p&gt;

&lt;p&gt;Usually, data engineering projects are composed of a set of data pipelines. So, it’s not just complicated, but almost impossible, to manually check the status of every single item in our solution. We therefore need to identify the information that gives us relevant insights into our project.&lt;/p&gt;

&lt;p&gt;For example, suppose we have a simple ETL that reads from table A and writes into table B. For some reason, our team was not informed that table A is no longer updated and that we should now point to table C as our source. If we don’t monitor application metrics, like output rows or similar, we will never know that our ETL is not working as expected: it will keep reading something that is not being updated and therefore generate empty output. This is definitely not good.&lt;/p&gt;

&lt;p&gt;When working with orchestration services like GCP Composer or GCP GKE, we can schedule our pipeline and eventually know its final status. However, the previous example wouldn’t be caught, producing a false positive.&lt;/p&gt;

&lt;p&gt;By introducing monitors on meaningful metrics (like output rows), we can integrate alerts that are triggered if we produce empty outputs. Even better, we can integrate a monitoring service (e.g. Datadog) for data visualization, adding thresholds, warnings, alerts, etc. This is really powerful: being alerted in Slack when a warning threshold is hit, or creating a ticket in Jira in case of an error, every possible scenario matched with the corresponding action.&lt;/p&gt;
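&lt;p&gt;The core of such a monitor can be as small as the following sketch (the threshold and actions are hypothetical; a real setup would wire these return values to Datadog, Slack or Jira):&lt;/p&gt;

```python
# Hedged sketch: classify each run by a meaningful metric (output row count).

WARN_THRESHOLD = 1000  # assumption: the usual daily volume rarely drops below this

def check_output_metric(output_rows):
    """Map the metric to an action level."""
    if output_rows == 0:
        return "alert"    # e.g. page on-call / open a Jira ticket
    if output_rows >= WARN_THRESHOLD:
        return "ok"
    return "warning"      # e.g. post to a Slack channel

print(check_output_metric(0))     # alert  (empty output: stale source?)
print(check_output_metric(200))   # warning
print(check_output_metric(5000))  # ok
```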

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9-SlsGR7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dbqitun7q2q7z3oprg3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9-SlsGR7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dbqitun7q2q7z3oprg3t.png" alt="Image description" width="880" height="652"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be ready for a change&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Requirements might change during development, and even after our solution is running in prod. Developing a flexible and scalable solution will save a lot of time on those kinds of changes. Some helpful actions:&lt;br&gt;
    - Avoid hardcoding - yes, this goes for everything!&lt;br&gt;
    - Parametrize inputs and configurations&lt;br&gt;
    - Try to abstract the logic so it works in different possible ways (scheduled, ad hoc, etc.)&lt;/p&gt;
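&lt;p&gt;A sketch of what parametrized inputs can look like (the argument and table names are hypothetical): one entry point serves scheduled, ad-hoc and backfill runs with nothing hardcoded:&lt;/p&gt;

```python
# Hedged sketch: a parametrized pipeline entry point using argparse.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="example pipeline entry point")
    parser.add_argument("--source-table", required=True)
    parser.add_argument("--target-table", required=True)
    parser.add_argument("--run-date", required=True, help="partition to process")
    parser.add_argument("--mode", choices=["scheduled", "adhoc", "backfill"],
                        default="scheduled")
    return parser

# simulate a backfill invocation
args = build_parser().parse_args(
    ["--source-table", "raw.events", "--target-table", "clean.events",
     "--run-date", "2022-09-21", "--mode", "backfill"]
)
print(args.mode, args.run_date)  # backfill 2022-09-21
```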

&lt;p&gt;&lt;strong&gt;Avoid manual processes&lt;/strong&gt;&lt;br&gt;
Automation is key. Depending on the team structure, you may or may not own tasks like CI/CD or similar. If they are part of your role, try as much as you can to reduce manual work, replacing repetitive and ad-hoc tasks with automated processes.&lt;/p&gt;

&lt;p&gt;If for some reason you need to run some SQL statements over an entity (table, schema, etc.) on a regular basis, try to automate those processes even if they are not going to be scheduled. A parametrized script that handles the logic will reduce the chance of human error in those kinds of actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing&lt;/strong&gt;&lt;br&gt;
Yes, it’s part of the development scope. The time a fix takes directly depends on the time it took to identify the bug in the first place (plus other factors like complexity, impact, etc.). We can reduce that time by adding different layers of testing: unit testing, integration testing, etc. &lt;/p&gt;

&lt;p&gt;And yes, this is one of the points shared among the different development areas. It’s relevant because these layers help us identify errors faster: all logic, syntax and static issues should be caught by unit tests, dependency errors by integration tests, etc. All of these quality filters act before our code reaches production.&lt;/p&gt;
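&lt;p&gt;For illustration, here is what a unit test of a pipeline’s pure logic can look like (the function and business rule are hypothetical; a real project would run this under pytest or unittest):&lt;/p&gt;

```python
# Hedged sketch: unit-testing pure business logic before it ever sees real data.

def normalize_amount(value):
    """Business rule under test: parse an amount string; empty or null means zero."""
    if value is None or value == "":
        return 0.0
    return round(float(value), 2)

def test_normalize_amount():
    assert normalize_amount("12.50") == 12.5   # regular value
    assert normalize_amount("") == 0.0         # edge case: empty string
    assert normalize_amount(None) == 0.0       # edge case: null

test_normalize_amount()
print("all tests passed")
```

&lt;p&gt;Integration tests would then cover the pieces a unit test cannot: real connections, schemas and dependencies.&lt;/p&gt;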

&lt;p&gt;&lt;strong&gt;Documentation&lt;/strong&gt;&lt;br&gt;
And again, yes. This is a very significant part of the development process. The code is not the only deliverable; think of the deliverable as a solution. If you buy a TV, laptop or cellphone, it comes with a manual. &lt;/p&gt;

&lt;p&gt;Similarly, our solution must have proper documentation covering the design, development, change log, points of contact, etc. - all the relevant information, so that new team members can get a good KT just by reading it. How is this applicable to data quality? Well, eventually we will have more people working together, so there should be a single source of truth: the documentation. And even if we work alone on the project, will you remember how and why you developed that ETL two years back? Sometimes we don’t even remember what we did yesterday, so better to have it documented somewhere, right? &lt;/p&gt;

&lt;p&gt;And that’s all! &lt;/p&gt;

&lt;p&gt;While each and every project has its own circumstances and cases, these are some of the common practices that I think can be applied to keep the quality in your data.&lt;/p&gt;

&lt;p&gt;Hope this helps!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Resilience: a hidden skill that makes the difference</title>
      <dc:creator>Cesar Mostacero</dc:creator>
      <pubDate>Tue, 28 Sep 2021 03:08:21 +0000</pubDate>
      <link>https://forem.com/cesar_mostacero/resilience-a-hidden-skill-that-makes-the-difference-35n</link>
      <guid>https://forem.com/cesar_mostacero/resilience-a-hidden-skill-that-makes-the-difference-35n</guid>
      <description>&lt;p&gt;&lt;code&gt;The only constant in life is the change&lt;/code&gt; - a famous quote that’s really true, more in a software development career, every day there are a ton of new things and we’re not going to have enough time to learn all of them.&lt;/p&gt;

&lt;p&gt;This is a simple post where I would like to share why I think resilience is a key skill for most people, but especially in your career.&lt;/p&gt;

&lt;p&gt;Software development is currently one of the best-compensated careers, but there is a lot behind that. One part of it is the constant need to keep up to date, and that’s not an easy task.&lt;/p&gt;

&lt;p&gt;Every project ends; even a certification has an expiration date (either explicit or implicit), and we need to turn the page and continue. We cannot assume that every scenario will be the same as the previous one, or that the same key will open all doors.&lt;/p&gt;

&lt;p&gt;Yes, experience matters, and every new lesson helps us grow. But that does not mean we’ll reach a point where learning new things is no longer needed (at least not if we want to stay competitive).&lt;/p&gt;

&lt;p&gt;We cannot control each and every thing around us, and in our career it’s the same. Not only at a huge scale (technology trends, new programming languages, etc.), but mainly within our short scope: our projects, clients, changing requirements, etc.&lt;/p&gt;

&lt;p&gt;Even when we’re experts in some specific technology, we’re not exempt from a struggle.&lt;/p&gt;

&lt;p&gt;We don’t know exactly what we’re going to face in our next project or at the next company, and on top of that, we can’t know everything technically.&lt;/p&gt;

&lt;p&gt;But we can “train” ourselves for those situations: learn and adapt.&lt;/p&gt;

&lt;p&gt;One of the best activities, at least from my point of view, is: problem solving. &lt;/p&gt;

&lt;p&gt;The importance of problem solving or competitive programming is not just submitting code and seeing a green button with the title “resolved”. What really matters is to understand and learn from the process of solving it. &lt;/p&gt;

&lt;p&gt;Most of the time these will be new challenges, and that’s exactly the situation we need to face: how to deal with new things and how to solve them. How to manage time, find alternatives or possible solutions, face the anxiety of a hard challenge and handle the pressure.&lt;/p&gt;

&lt;p&gt;It’s not only about learning algorithms; it’s about training ourselves to experience (on a minor scale) unknown challenges.&lt;/p&gt;

&lt;p&gt;Another good practice is learning a new language, not with the target of becoming an expert, but with the goal of improving our learning process and making it faster.&lt;/p&gt;

&lt;p&gt;Yes, there are a lot of unknowns in the future; we don’t know which technology, programming language or tool we’re going to use. But what we can “control” is how easy it will be for us to learn those new things, and it’s always better to be ready.&lt;/p&gt;

</description>
      <category>career</category>
      <category>growth</category>
    </item>
    <item>
      <title>Why problem solving must be the strongest skill for developers?</title>
      <dc:creator>Cesar Mostacero</dc:creator>
      <pubDate>Mon, 04 Mar 2019 01:50:51 +0000</pubDate>
      <link>https://forem.com/cesar_mostacero/why-problem-solving-must-be-the-strongest-skill-for-developers-582a</link>
      <guid>https://forem.com/cesar_mostacero/why-problem-solving-must-be-the-strongest-skill-for-developers-582a</guid>
      <description>&lt;p&gt;Problem solving is in my opinion, the best asset in a developer, the ability to not only code, but also design solutions gives a differentiating element among the developer population.&lt;/p&gt;

&lt;p&gt;Being proficient in a framework, a specific programming language or a technology matters; that is really important. But problem solving should be the complement to this knowledge.&lt;/p&gt;

&lt;p&gt;Solutions should be independent of the language. Depending on the programming language, you might need to re-write an algorithm or instead use an already built function with the same behavior. But the core knowledge should be how to design an optimal solution for every problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E-D5cPnk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ae3ccsmmfrmnj2bycr5u.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E-D5cPnk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ae3ccsmmfrmnj2bycr5u.jpeg" alt="Problem Solving... the best skill of a developer."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The brain is a muscle, and a muscle should be trained&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a soccer player wants to perform well in the weekend match, he should train during the week in order to be physically and mentally ready for it. The same goes for developers.&lt;/p&gt;

&lt;p&gt;We cannot always just search the internet to find the solution to a given problem. Developers should be able to identify a problem, design a solution and solve it. You can certainly lean on online resources or your team, but the ability to create a solution should live in every developer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is it really needed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, from my experience, I have been in only one technical interview where I was asked to code and solve 3 problems during the session. So whether it’s needed to be hired will depend on the interviewers; but in real work, developers face logical problems on a daily basis, so it is better to be ready. So, my opinion and answer to the question is: yes, definitely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A well defined process&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not all problems are alike, but for all of them we can split the process into 3 basic steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Identify the problem: We cannot fix/solve something that we don’t understand. The first step must always be to understand the problem and be able to describe it yourself. A good exercise is to draw it on paper: try to visualize, on your own, how the code will look, and identify the inputs/outputs/rules/etc. that the problem involves. Then try to describe it. If you are working with someone else, try to explain the problem; if he/she is able to understand it, you probably have enough knowledge and understanding of the problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Design multiple solutions: Not code itself; more like pseudocode and algorithms. Designing only one solution and trying to make it work could be dangerous if the nature of the algorithm/approach we implement is not ideal for that specific problem. Hence the importance of thinking before coding... and thinking well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Choose and implement the optimal solution: This is the last step (ideally), where we have a few candidate solutions and we “only” need to choose one and implement it. Why is it important to choose the ideal solution? Two words: code once. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Code should always be the last step of problem solving: after identifying the issue and thinking through and designing solutions, we finally come to code. &lt;/p&gt;

&lt;p&gt;So, if code is the final step, is testing not needed? In my opinion, the very first “testing stage” happens during the second step, where we design the solution; but every stage must be tested, avoiding assumptions. &lt;/p&gt;

&lt;p&gt;Let’s say that we want to fix a ranking function for students based on scores and some other rules. When we start to design the solution (mostly on paper), all the possible scenarios should be considered: scores, rules, exceptions to the rules, data types, error handling, data size, etc. And not only sorting functions.&lt;/p&gt;

&lt;p&gt;Sometimes we settle for temporary solutions where we know that for some specific (valid) scenario our code will fail, because covering it too would make the code more complex. That’s the real importance of this final step: be aware that all scenarios might happen in production (even the ones you don’t imagine). If you already know that something could break your code, don’t choose that solution; invest more time in the second step, design an optimal solution, and only after that, code it.&lt;/p&gt;
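&lt;p&gt;To ground the ranking example above, here is a minimal sketch (the rules are hypothetical) that treats ties, empty input and invalid scores as part of the design rather than afterthoughts:&lt;/p&gt;

```python
# Hedged sketch of the ranking example: sorting plus the edge cases.

def rank_students(scores):
    """Return (name, rank) pairs; equal scores share a rank (competition style)."""
    for name, score in scores.items():
        if not isinstance(score, (int, float)):
            raise TypeError(f"invalid score for {name!r}: {score!r}")
    ordered = sorted(scores.items(), key=lambda item: -item[1])
    rank_of_score = {}
    for position, (name, score) in enumerate(ordered, start=1):
        # the first student at a given score fixes the rank for all ties
        if score not in rank_of_score:
            rank_of_score[score] = position
    return [(name, rank_of_score[score]) for name, score in ordered]

print(rank_students({"Ana": 90, "Luis": 75, "Mia": 90}))
# [('Ana', 1), ('Mia', 1), ('Luis', 3)]
```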

&lt;p&gt;&lt;strong&gt;Time is the most valuable resource for a developer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3QAuXGhU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/f3ilehcoxrp6rz0pueab.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3QAuXGhU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/f3ilehcoxrp6rz0pueab.jpeg" alt="Thinks twice (or more), code once."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Usually, as developers, we don’t have enough time and we try to code as fast as possible. The truth is that if you code something that is not good enough, you will need to invest (more) time when the “fix” comes back after some time with a new bug.&lt;br&gt;
As one of the best programming quotes says: think twice (or even more), code once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practice as much as you can&lt;/strong&gt;&lt;br&gt;
Maybe on the resume it won’t be as valuable as mastery of some framework, but believe me, at work it is really helpful. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can I get started?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To improve this skill, there are many sites (online programming judges) that provide sets of problems for all levels. Depending on the site, you will have the chance to work with different programming languages, so you can focus only on solving the problems and not on learning a new specific language.&lt;br&gt;
These are the ones that I used to visit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.hackerrank.com/"&gt;https://www.hackerrank.com/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://coj.uci.cu/index.xhtml"&gt;http://coj.uci.cu/index.xhtml&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start with the basics. All of us started with A+B before moving on to data structures, string processing, encryption, etc. Avoid frustration and make solid progress; it doesn’t matter if it is slow, as long as it is learned well.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is my first post on dev.to, I hope you liked it...&lt;br&gt;
Happy coding!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>problemsolving</category>
      <category>career</category>
      <category>learning</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
