Forem: Kenzo Castañeda

Creating a GraphQL API from a Data Lake: Data Engineering (Part 1)

Kenzo Castañeda — Wed, 19 Jul 2023 07:00:20 +0000

I’m a Solution Architect and Data Engineer in our Integrations team at work, and I own one of our API products, a Data Lake GraphQL API. In this article, I wanted to share with you the process of how I built it — starting from the data side.

This article assumes you’re familiar with Azure services such as Data Lake Storage, Data Factory, and Databricks, and have some general knowledge of GraphQL, Python/PySpark, and SQL.

Problem

Data lakes are designed to store large amounts of data usually meant for data analytics and reporting, but one of our use cases at work was to turn that Data Lake into a source for a GraphQL API. Sort of like a Reverse ETL but applications (both Internal and SaaS) will call our API to get the data they need.

Why? Well, it’s a central data source that contains everything our users need, and the data is highly relational which is perfect for GraphQL.

There are a few issues though, primarily with the data:

We have no control over the data lake and the data that goes in there since another group owns that. So we take what they can give.
Data is stored as parquet files in the lake. We need to move that to an SQL database to be able to use an ORM and take advantage of that relational goodness.
In our Data Lake, there’s no enforcement of primary key uniqueness, meaning that data is sometimes ingested without deduplication.
No fields can be used to signify that the data is new or updated, such as an updated_at field, so there was no easy way to get the delta.
Our data lake uses the Star Schema model, and some fact tables can get really huge (~70 million rows), so ingesting them from scratch every time would take almost 8 hours.

Design

Our API’s high-level architecture looks like this:

ADLS2 (Data Lake Blob Storage) → Data Factory/Databricks → SQL Server →
Azure Function/GraphQL Server

In Part 1 of this series, I’ll discuss the first three stages and tackle the data engineering aspect first.

Let’s go through every issue above and see what we can do about it.

Data Quality Control - we can create our own pipelines to have more control over the data we get from the lake.
Parquet to SQL Transformation - we can do this in Databricks.
Deduplication - we can also do this in Databricks.
Change Data Capture - I needed to implement our own CDC system due to the lack of any versioning, update timestamps, or logs from our source. What worked for us is creating a hash of all the columns and storing that as a separate column. Basically, a checksum column. If a single column’s value changes, the hash would also change.
Large Data Volume - I needed to create a delta load pipeline that can detect and extract only the new and updated data since the last load to minimize the volume of data written to our database. This pipeline should include a mechanism for identifying changes in the data (inserts, updates, and deletes) and reflect that in our SQL database.

Let’s break the above into just three concrete steps for our pipeline, which is the result of months of testing and development:

Sanitation
Deduplication
Delta Loading

Sanitation

We need to sanitize and modify the source tables to ensure data quality before ingesting it into the SQL database. This includes filtering records, filling in null values, and removing data quality issues.

Below are snippets of code from our Databricks notebook (which runs on Apache Spark):

The source table is read from ADLS2 and returned as a DataFrame.

# Read the source table
dfSource = (spark.read
           .format('parquet')
           .option("mode", "FAILFAST")
           .load(adls2sourcepath))

The DataFrame is sanitized depending on what table it is.

# do source table modifications
if pTableName == 'fact_table_1':
  dfSource = dfSource.filter(f"fiscal_year_num >= '{datetime.datetime.now().year - 1}'") # filter records to current year-1
  dfSource = dfSource.fillna(value='',subset=["tdcval_code"]) #tdcval_code is part of the primary key and cannot be null
elif pTableName == 'fact_table_2':
  dfSource = dfSource.filter("contract_id is not null") # remove empty Contract id's that were data issues
  dfSource = dfSource.fillna(value=0,subset=["allocation_id"]) #allocation_id is part of the primary key and cannot be null
elif pTableName == 'fact_table_3':
  dfSource = dfSource.filter(f"doc_post_date >= '{datetime.datetime.now().year - 1}'") # filter records to current year-1
elif pTableName == 'fact_table_4':
  dfSource = dfSource.filter(f"partition_key > '201712'") # filter for data quality issues. If you remove that filter data before that time is not of good quality.

Deduplication

To deduplicate, we need to know the primary keys. To do this dynamically, I created a function in SQL Server that outputs the table definition.

This returns each column’s name, type, length, precision, scale, and if it’s nullable or a primary key.

CREATE FUNCTION [config].[func_get_table_definition]
(
    @SchemaName [NVARCHAR](128),@TableName [NVARCHAR](225)
)
RETURNS TABLE
AS
RETURN
(
    SELECT
        c.name 'column_name',
        t.Name 'data_type',
        c.max_length/2 'length',
        c.precision ,
        c.scale ,
        c.is_nullable,
        ISNULL(i.is_primary_key, 0) 'is_pk'
    FROM
        sys.columns c
    INNER JOIN
        sys.types t ON c.user_type_id = t.user_type_id
    -- OUTER APPLY selects a single row that matches each row from the left table.
        OUTER APPLY
        (
            SELECT TOP 1 *
            FROM sys.index_columns M
            WHERE M.object_id = c.object_id AND M.column_id = c.column_id
        ) m
    LEFT OUTER JOIN
        sys.indexes i ON m.object_id = i.object_id AND m.index_id = i.index_id
    WHERE
        c.object_id = OBJECT_ID(@SchemaName + '.' + @TableName)
)

Reference: https://stackoverflow.com/questions/2418527/sql-server-query-to-get-the-list-of-columns-in-a-table-along-with-data-types-no

Example output for a workflow_status_dim table, which I’ll be using in the rest of the examples here:

column_name	data_type	length	precision	scale	is_nullable	is_pk
workflow_status	nvarchar	10	0	0	FALSE	TRUE
workflow_status_description	nvarchar	35	0	0	TRUE	FALSE

To run the function in Databricks, we just read it like any other table via a SELECT statement:

# Read target table definition
dfTargetDef = spark.read \
        .format("jdbc") \
        .option("url", jdbcUrl) \
        .option("query", "select * from config.func_get_table_definition('{0}', '{1}')".format(pSchemaName, pTableName)) \
        .option("user", dbUser) \
        .option("password", dbPass).load()

Once we have the primary keys, we can deduplicate using PySpark’s dropDuplicates method:

# dedup source data according to primary key columns

pkColumns = []

pkColumnsDf = dfTargetDef.filter(dfTargetDef.is_pk == 'true')
for row in pkColumnsDf.collect():
  pkColumns.append(f"{row.column_name}")
  if('char' in row.data_type): dfSource = dfSource.withColumn(row.column_name, trim(col(row.column_name))) # trim string primary columns, remove leading and trailing spaces. Spaces cause issues in merging/primary keys

display(dfSource.groupBy(pkColumns).agg(count(
  '*').alias("count_duplicates")).filter(
  col('count_duplicates') >= 2)) # display the duplicate records

dfSourceCount = dfSource.count()
dfSource = dfSource.dropDuplicates(pkColumns)
dfSourceDedupCount = dfSource.count()
dupCount = dfSourceCount - dfSourceDedupCount

print(dupCount)

# Create a temp view of the source table
dfSource.createOrReplaceTempView("SourceTable")

Aside from deduplication, we also do a minor but important sanitation step which is trimming the char-type primary key columns. Rogue spaces in keys can cause issues when merging the data (unfortunately, this was based on experience).

Delta Loading

Generating the Checksum

The first step in the delta-loading process is creating the checksum column in the source.

The code below dynamically generates an SQL statement that creates a delta table named {pTableName}_src within the Databricks HMS (Hive Metastore). This contains the source table columns plus additional columns such as:

check_sum which contains the hash of all columns created using the built-in HASH function of PySpark
date columns (modified and created date)
flag columns (is_modified and is_deleted) to be able to filter out the changes

# Create source table with check_sum hash
spark.sql("REFRESH TABLE SourceTable") # Invalidates the cached entries
spark.sql(f"DROP TABLE IF EXISTS {pTableName}_src")

entries = [f"CREATE TABLE {pTableName}_src AS SELECT"]
for row in dfTargetDef.collect():
  entries.append(f"SourceTable.{row.column_name},")

entries.append("HASH (")
for idx,row in dfTargetDef.toPandas().iterrows():
  entries.append(f"SourceTable.{row.column_name}{'' if idx == len(dfTargetDef.toPandas().index)-1 else ','}")

entries.append(") as check_sum,")
entries.append("current_timestamp() as created_date,\ncurrent_timestamp() as modified_date,\nfalse as is_modified,\nfalse as is_deleted")
entries.append("FROM SourceTable;")

sql_comm = "\n".join(str(x) for x in entries)
spark.sql(sql_comm)

This generates an SQL statement like below:

CREATE TABLE workflow_status_dim_src AS SELECT
SourceTable.workflow_status,
SourceTable.workflow_status_description,
HASH (
SourceTable.workflow_status,
SourceTable.workflow_status_description
) as check_sum,
current_timestamp() as created_date,
current_timestamp() as modified_date,
false as is_modified,
false as is_deleted
FROM SourceTable;

Merging the Data

We first read the target table in our SQL database (our working tables are in the api schema). We store this as {pTableName}_dlt in Databricks.

dfTarget = spark.read \
          .format("com.microsoft.sqlserver.jdbc.spark") \
          .option("url", jdbcUrl) \
          .option("dbtable", "api." + pTableName) \
          .option("user", dbUser) \
          .option("password", dbPass).load()

dfTarget.createOrReplaceTempView("TargetTable")

# Create the delta/managed tables. This initially loads the rows from the source and target tables.
spark.sql(f"CREATE TABLE IF NOT EXISTS {pTableName}_dlt USING delta AS SELECT * FROM TargetTable;")

Next is to merge the source with the target table. The below code dynamically generates a MERGE statement to merge {pTableName}_src into {pTableName}_dlt.

# Merge source into target table

entries = [f"MERGE INTO {pTableName}_dlt tgt USING {pTableName}_src src ON ("]

# Get primary keys
df_pk = dfTargetDef.filter(dfTargetDef.is_pk == 'true').toPandas()
for idx,row in df_pk.iterrows():
  entries.append(f"src.{row.column_name} = tgt.{row.column_name}{'' if idx == len(df_pk.index)-1 else ' AND'}")

entries.append(') WHEN MATCHED AND tgt.check_sum <> src.check_sum THEN UPDATE SET')

# https://www.geeksforgeeks.org/how-to-iterate-over-rows-and-columns-in-pyspark-dataframe/
for row in dfTargetDef.collect():
  entries.append(f"tgt.{row.column_name} = src.{row.column_name},")

entries.append('tgt.check_sum = src.check_sum,\ntgt.modified_date = src.modified_date,\ntgt.is_modified = true,\ntgt.is_deleted = false')

entries.append('WHEN NOT MATCHED BY TARGET THEN INSERT (')
for row in dfTargetDef.collect():
  entries.append(f"{row.column_name},")

entries.append('check_sum,\ncreated_date,\nmodified_date,\nis_modified,\nis_deleted')
entries.append(') VALUES (')

for row in dfTargetDef.collect():
  entries.append(f"src.{row.column_name},")

entries.append('src.check_sum,\nsrc.created_date,\nsrc.modified_date,\ntrue,\nfalse')
entries.append(') WHEN NOT MATCHED BY SOURCE THEN UPDATE SET tgt.is_deleted = true;')

sql_comm = "\n".join(str(x) for x in entries)
spark.sql(sql_comm)

This generates an SQL statement like below:

MERGE INTO workflow_status_dim_dlt tgt USING workflow_status_dim_src src ON (
src.workflow_status = tgt.workflow_status
) WHEN MATCHED AND tgt.check_sum <> src.check_sum THEN UPDATE SET
tgt.workflow_status = COALESCE(src.workflow_status, tgt.workflow_status),
tgt.workflow_status_description = COALESCE(src.workflow_status_description, tgt.workflow_status_description),
tgt.check_sum = src.check_sum,
tgt.modified_date = src.modified_date,
tgt.is_modified = true,
tgt.is_deleted = false
WHEN NOT MATCHED BY TARGET THEN INSERT (
workflow_status,
workflow_status_description,
check_sum,
created_date,
modified_date,
is_modified,
is_deleted
) VALUES (
src.workflow_status,
src.workflow_status_description,
src.check_sum,
src.created_date,
src.modified_date,
true,
false
) WHEN NOT MATCHED BY SOURCE THEN UPDATE SET tgt.is_deleted = true;

If the primary keys match and the checksums don’t, that means the record has been updated. Set is_modified to True.
When no primary keys match, that means it’s a new record. Insert and set is_modified to True.
When the primary key isn’t present in the source, then it’s a deleted record. Set is_deleted to True.

Writing the Changes

Once the delta tables have been merged, we write the changes (those with True is_modified or is_deleted) to our SQL database in another schema (we use delta).

We use Microsoft’s custom Spark connector for SQL Server to take advantage of the BULK INSERT functionality and make our writes a lot quicker.

dfWrite = spark.sql(f"SELECT * FROM {pTableName}_dlt where is_modified=True OR is_deleted=True;")

dfWrite.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", jdbcUrl) \
        .option("dbtable", "delta." + pTableName) \
        .option("reliabilityLevel", "BEST_EFFORT") \
        .option("batchsize", 100000) \
        .option("tableLock", "true")  \
        .option("user", dbUser) \
        .option("password", dbPass) \
        .option("schemaCheckEnabled", "false") \
        .save()

Since we write the changes to another table, we have to merge that to our main table in the api schema. This time, the MERGE statement runs in a stored procedure in the SQL database, not in Databricks.

Since columns get changed from time to time in our data lake tables and there are multiple tables in the lake, I had to make the stored procedure dynamic. To do that, I created the procedure via Databricks.

I tapped directly into the database connection object to be able to create a stored procedure in the SQL database.

# Create stored procedure code to merge delta table to api table in database
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
con = driver_manager.getConnection(jdbcUrl, dbUser, dbPass)

exec_statement = con.prepareCall(f"IF (OBJECT_ID('config.delta_load_{pTableName}', 'P') IS NOT NULL) DROP PROCEDURE [config].[delta_load_{pTableName}];")
exec_statement.execute()

entries = [f"CREATE PROCEDURE [config].[delta_load_{pTableName}] AS BEGIN"]

entries.append(f"MERGE INTO [api].[{pTableName}] AS tgt USING [delta].[{pTableName}] src ON (") # https://stackoverflow.com/questions/10724348/merge-violation-of-primary-key-constraint

# Get primary keys
df_pk = dfTargetDef.filter(dfTargetDef.is_pk == 'true').toPandas()
for idx,row in df_pk.iterrows():
  entries.append(f"src.{row.column_name} = tgt.{row.column_name}{'' if idx == len(df_pk.index)-1 else ' AND'}")

entries.append(') WHEN MATCHED AND src.is_modified = 1 THEN UPDATE SET')

for row in dfTargetDef.collect():
  entries.append(f"tgt.{row.column_name} = src.{row.column_name},")

entries.append('tgt.check_sum = src.check_sum,\ntgt.modified_date = src.modified_date')
entries.append('WHEN MATCHED AND src.is_deleted = 1 THEN DELETE WHEN NOT MATCHED THEN INSERT (')

for row in dfTargetDef.collect():
  entries.append(f"{row.column_name},")

entries.append('check_sum,\ncreated_date,\nmodified_date')
entries.append(') VALUES (')

for row in dfTargetDef.collect():
  entries.append(f"src.{row.column_name},")

entries.append('src.check_sum,\nsrc.created_date,\nsrc.modified_date')
entries.append(f'); DROP TABLE [delta].{pTableName}; END')

sql_proc = "\n".join(str(x) for x in entries)

# create stored procedure in database
exec_statement = con.prepareCall(sql_proc)
exec_statement.execute()

# Close connections
exec_statement.close()
con.close()

This generates an SQL statement like below:

CREATE PROCEDURE [config].[delta_load_workflow_status_dim] AS BEGIN
MERGE INTO [api].[workflow_status_dim] tgt USING [delta].[workflow_status_dim] src ON (
src.workflow_status = tgt.workflow_status
) WHEN MATCHED AND src.is_modified = 1 THEN UPDATE SET
tgt.workflow_status = COALESCE(src.workflow_status, tgt.workflow_status),
tgt.workflow_status_description = COALESCE(src.workflow_status_description, tgt.workflow_status_description),
tgt.check_sum = src.check_sum,
tgt.modified_date = src.modified_date
WHEN MATCHED AND src.is_deleted = 1 THEN DELETE WHEN NOT MATCHED THEN INSERT (
workflow_status,
workflow_status_description,
check_sum,
created_date,
modified_date
) VALUES (
src.workflow_status,
src.workflow_status_description,
src.check_sum,
src.created_date,
src.modified_date
); DROP TABLE [delta].workflow_status_dim; END

If the primary keys match and is_modified is True, that means the record has been updated → UPDATE.
If the primary keys match and is_deleted is True, that means the record has been deleted→ DELETE.
When no primary keys match, that means it’s a new record → INSERT.
Once the merge is done, we drop the table in the delta schema.

Orchestration

Last but not the least is to orchestrate the whole process. For this, we use Data Factory.

The above flow is what we use for each table in the Data lake.

Record the start date
Run the Databricks notebook to load the changes to the SQL DB
If there are changes, run the stored procedure to merge the changes in the SQL DB
If steps 2 or 3 fail, roll back the tables in Databricks using RESTORE TABLE to make sure their states are the same as the tables in the SQL DB.
Finally, run a stored procedure to save metadata for debugging & analytics purposes (errors, number of rows added/deleted/changed, etc.). We maintain this in another table in the SQL DB.

Results

With our delta load implementation, we’ve seen a 70% decrease in data load times vs. loading each table from scratch from the lake to our SQL database every time (which was our very first iteration).

Before, our data pipelines took 7-10 hours to complete each day, but now it just takes 1-2 hours. This highly depends on your compute capacity, but in our case (Standard_DS3_v2 2-8 worker Databricks cluster and GP_S_Gen5_12 SQL database) assuming a change of 10% or less in the datasets:

A table in the tens of thousands of rows just takes a minute to load the new data.
A table with a million rows takes about 10 minutes or less.
A table with about ~80 million rows takes around an hour.

Let me know if you have comments, questions, or suggestions below, or connect with me at hello@kenzojrc.com or my socials. If this article helped you in any way, it would also be great to hear it!

Stay tuned for the next.

10 Tips for Creating a Product on Your Own - Building Keepsake

Kenzo Castañeda — Sat, 10 Jul 2021 07:54:27 +0000

Hey there! Kenzo here.

This is the first of hopefully many articles on my product-building journey with Keepsake.

Let's start with its origin story.

The Story Behind Keepsake

March 2020 - it was the start of the pandemic. Newly minted bakers, plantitos and plantitas, coffee enthusiasts, TikTok dancers, and content creators have just begun their humble journeys during what would be the longest lockdown in the world.

So like everyone else, I tried to make myself productive as a distraction from all the chaos happening around me. I figured, why not learn React? It's always been on my "To Learn" list but I never found the time. Well, there really was no silver lining to the pandemic, but we ended up with a lot of time stuck at home, so I tried to "make the most out of it."

Personally, I learn most effectively when I work on an actual project, so I thought of things that would be simple enough to build but at the same time be something I'd personally use.

We're all looking forward to better days in a pandemic like this, so the idea started with mementos that would make people smile and look back at the good in life. In high school, we had a day called friendship day where we'd write each other physical letters as tokens of friendship. I still have all of the letters I got tucked safely in an envelope. When I feel a bit down or sentimental, reading them never fails to lighten my mood.

My envelope of letters

So I thought of bringing that experience back virtually and making it a more regular thing, especially in these trying times. You may be thinking... can't we just send an email directly? Sure, nothing's stopping you from sending a meaningful & long message through email, your favorite messaging app, or even by hand, but really (and sadly), who does that nowadays? There was no platform dedicated to meaningful letters yet, so I built one.

I'm a fan of one-word URLs, but the tricky part is finding a fitting domain. Memen.to wasn't available so I thought of synonyms. Luckily, keepsa.ke was available, and that's how Keepsake was born 👶

Should You Build a Product?

Now let's get to the product-building part. Keep in mind I'm writing in the context of building digital products (ex. SaaS).

Creating a product is hard. Creating and launching a product on your own is even harder. I bet you've heard this a lot already, but let me reiterate: you have to wear many hats 🤠

🧠 Product Manager. What's your product's vision? What features will you prioritize?
🎨 UX/UI Designer. What's your product going to look and feel like? How is it going to work?
🖼 Front-end Engineer. How will you implement the interface of your product?
🛠 Back-end Engineer. How will you implement all the logic happening behind the scenes?
💾 Database Engineer. How will you store, organize, and fetch data efficiently?
💥 QA/Test Engineer. How will you make sure your product is reliable?
☁ Cloud Engineer. How will you deploy your product and make sure it scales?
😸 Marketing Manager. How will you get users & customers? What's your messaging?

If you don't go crazy while building it on your own, then congrats! It's definitely worth it in the end, and without a doubt, you will gain important skills in each function above (and more).

Behind Keepsake is more or less a thousand hours made up of research, design, development, and lots of frustration in between. After my day job, I'd work on it. On weekends, I'd work on it. On Holidays, I'd also work on it. But you definitely don't have to work on your product 24/7. What's important (for any side project) is you work at your own pace, make small but steady progress, and actually enjoy what you're doing.

Ten Product-Building Tips

In this article, I'll be sharing 10 tips from what I've learned in the process of building Keepsake, so I hope you find a thing or two useful from it!

1. Learning can get overwhelming, but don't let it consume you.

In programming (and most likely in other fields as well), once you learn something, there's always something new that comes up. You can never truly know everything. For every new use case, bug, or fancy package you use in your app, you will have to read through tons of documentation and learn it.

It can get pretty overwhelming, but that's actually a good thing. It means you'll be learning a lot. What's important is you don't let that feeling of being overwhelmed consume you and stop you from moving forward. You probably won't get it perfect the first time, and that's okay. Start small, then keep going. And going. Eventually, you'll be surprised by how much you've already built and learned in the process.

When I first started with Keepsake, I treated it as a learning project. I had no plans to monetize it or treat it as a side-gig. I initially just wanted to learn React, but that was not enough to build a full-blown product. I also had to learn Next.js, Node.js, TypeScript, GraphQL, AWS Lambda, Postgres, and a lot of other stuff. I didn't learn all of them at the same time though. Again, just take small steps. Watch videos (if you're interested in the same tech stack, I suggest Ben Awad's videos). Look at sample code. Scour Stack Overflow. Learn enough just to make progress.

Although, it does help to have prior experience. Before learning React, I've been building APIs and websites with Django and WordPress, which both use different languages, Python and PHP, but I don't consider myself an expert in any of them.

2. Be aware of your bias and design with empathy.

When creating a totally new product or MVP (Minimum Viable Product), you usually assume a lot of things based on your own intuition. In my opinion, this is okay when you're starting out, and it's usually the quickest way to build an MVP, but be aware that your assumptions are biased. Once you have actual users, be sure to gather feedback from them & validate your product. I built Keepsake mostly based on intuition at first, but once actual people started using it, I iterated based on their feedback.

In everything you design, it would be best to always put yourself in the shoes of the user. What seems understandable for you may not be for your users.

An early mockup of Keepsake's inbox page designed with Adobe XD

3. Solve a problem you personally experience.

It's much easier to build something that tackles a problem you personally experience. Related to Tip #2, you build based on intuition at first, and if you understand the problem you're tackling, your intuition has a much better chance of putting you in the right direction. You will better understand your users and their pain points, and so it will be easier for you to design with empathy.

In any case, even if your app doesn't take off, at least you could be your own user when what you've built solves your own problem 😉

4. You will get ideas at the most random times. Jot them down.

When this happens, don't forget to jot those ideas down! Bring out your favorite notes app and take down at least the gist of your idea because it may never come again. Trust me.

Personally, I usually get the best ideas when I exercise, take a shower, or before I sleep. I thought of Keepsake while working out. Maybe my mind is more relaxed or focused at these times? I don't really know 😅

Keepsake started as "Timebox website" in my To-do list in the middle of a workout

5. Track your work.

Related to Tip #4, jot down everything you can about your project - from your tasks to things you learn in the process. Even though it's simply a hobby project, building things alone can get pretty messy, so it's best to force yourself to organize tasks via your favorite task management tool. I personally use Trello to track development tasks and Notion for everything else. Your future self will thank you for taking notes and being organized.

A snippet of my Trello board (but you can do the same in Notion)

Notion's pretty good for timelining and organizing ideas

6. Know when to stop adding features.

I get it, we always want our product to be better. There will always be a new feature to add. It's a never-ending cycle, so try not to get yourself sucked into that black hole.

What's important is you get your product to your users as soon as possible to validate your idea and find out what features they actually need. From there, you can prioritize features and build a product roadmap.

7. Take it easy on yourself.

Product-building is a very long and arduous journey. Before reaching glorious product-market fit, you will definitely come to a point where you'll question yourself if you should even continue your project. You put so much effort into something, but you do not seem to get the results you expected, and if you stop now, your efforts are basically sunk cost. That's when burnout happens, and I've been there.

What's nice about a side project is that you're not pressured by any external entity. Take a break and just build at your own pace. Make small but steady progress, and enjoy the journey.

This is my GitHub contribution activity. I started developing Keepsake in April 2020 then launched in August 2020. I took a break from working on it after (see mostly black boxes after Aug).

8. Celebrate even the smallest milestones.

Getting your first paying user will feel REALLY good. Hitting 500 users? Definitely something to celebrate. Reaching product-market fit? That'll probably take a long time, but it's something to look forward to and work hard for!

It's important to celebrate even the smallest milestones to give yourself a sense of pride in your work and to push yourself to keep going. If you can build in public, that would be even better!

Give yourself a pat on the back for making it this far. You deserve it.

Keepsake was recently accepted into AWS Activate. It provides $1000 in AWS credits!

Getting featured on Product Hunt's Twitter page felt awesome!

But personally, the best milestones are the positive feedback I've received from people who use Keepsake:

9. Focus on the product, not the tool.

You can use the most popular programming languages and the fanciest frameworks to build your product, but from an average user's perspective, they wouldn't really care much about the tech. They care if the product works well for them.

Yes, I know I built Keepsake because I wanted to learn a new front-end JavaScript library in the first place, but you and I have to remember what truly matters to the end-user is the product you're creating. Don't obsess with finding the most optimized way to do things or the best framework to use, just start making it with what you already know then iterate from there.

10. Strive for sustainability.

Don't be afraid to monetize your work. Your project will most likely have costs to build and maintain, so you need to find a way to make it sustainable. Also, people paying for what you've built is the ultimate validation your product could receive. It means they find real value in it.

My goal is to have Keepsake pay for itself. My dream would be to scale it up enough for it to pay for my living expenses. Today, many successful startups started as side projects, so we all probably want our side projects to have the same happy ending, but of course, it's a very steep but certainly not impossible mountain to climb.

Those are my 10 tips. I wish you the best of luck on your own product-building journey! 😊

Promo

If you're looking for a gift for someone special or just want to send some letters to your friends or to yourself in the future, I'd appreciate it if you gave Keepsake a try! Here's a promo code to get you started: sendthelove 💌