Forem: Behalf Inc.

Event Ordering With Apache Kafka

Amit Goldstein — Mon, 19 Apr 2021 13:04:34 +0000

About Behalf

Behalf is a digital platform that facilitates payments by extending net terms and financing to businesses. Increase revenue and get paid the next business day* when you sell with Behalf.

In a series of articles, we uncover our journey towards Event-Driven microservices.

The Ordering Problem

Stepping out of the comfort zone of the Monolith and into the wilderness of Event-Driven Microservices, you realize how many things we just take for granted. The natural order of things, for example.

In the Monolith, we didn’t have to think about ordering that much. Many actions that now span several microservices could have been done in one method, where the order of statements determined the order of execution. Or we could use locking - pessimistic or optimistic (even a JVM lock in a single instance monolith). Global locking over microservices is a terrible idea since it creates the kind of tight coupling we are trying to avoid.

Partial Ordering

In Set Theory, we differentiate between Total Order - where all elements have a before/after relationship - and Partial Order - when only a subset of the elements have a relationship. When looking at the set of all events in our system, if we can identify a partial order between our events we only need to take care of ordering these events among themselves.

For example, events related to Customer A must be in order between themselves - we can’t allow a customer to submit a transaction before the customer signed up, and we wouldn’t want to handle the TransactionSubmitted event before we handled the CustomerChangedBank event. None of these events, however, relate to events on Customer B.

Ordering Guarantee with Apache Kafka

We selected Kafka as our event bus for many reasons - it is durable, reliable, scalable, has a great ecosystem and support in Java and Spring Boot. And it comes with an ordering guarantee. It is described in the Confluent Kafka Definitive Guide:

“Apache Kafka preserves the order of messages within a partition. This means that if messages were sent from the producer in a specific order, the broker will write them to a partition in that order and all consumers will read them in that order.”

So with Kafka, we get partial ordering - over a single partition in a single topic. We need to choose a topic topology that represents the partial orders that we identified in our system.

Note - To get strict ordering, you need to either disable producer retries (not recommended) or set the property max.in.flight.requests.per.connection to 1. This is explained in the documentation.
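
As a minimal sketch of the second option, the producer settings could be assembled as shown here. The two property names are Kafka producer configuration keys; the class name, the method name, and the broker address passed in are illustrative assumptions, and a real producer would also need serializers and other settings.

```java
import java.util.Properties;

// Hypothetical helper: builds producer settings that favor strict ordering.
final class OrderedProducerSettings {

    static Properties strictOrdering(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Allow only one unacknowledged request per connection, so a retried
        // batch can never be written after a later batch.
        props.setProperty("max.in.flight.requests.per.connection", "1");
        return props;
    }
}
```

Calling OrderedProducerSettings.strictOrdering("localhost:9092") (a placeholder address) returns a Properties object that can be extended before it is handed to the producer. The trade-off is throughput, since requests on a connection are no longer pipelined.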

Topic Per Event Type

With this topology, we will create a topic per each event type. This is a very common topology, and by reading various tutorials and guides, you might come to the wrong conclusion that this is a “best practice” for Kafka. Indeed this topology has many benefits:

A consumer can choose which topic to subscribe to according to events it needs to handle. This reduces the “noise” of unwanted events.
It makes it easy to enforce the event schema since all events in a topic have the same schema. Historically, Confluent Schema Registry required a single schema per topic. This has changed with the introduction of schema references in Confluent Platform 5.5.
Combined with log compaction, this is a great way to store value in Kafka, similar to DB records. Kafka will only hold one record per entity id which saves on storage and reduces access time. This is not very helpful in our case since business events are not subject to modification.

However, when it comes to ordering, this topology is not very helpful, at least not in our use case. It will only give us ordering over a specific event type, which is not what we need.

Topic As Feed

The next topology we considered was to have each service publish all its events into a single topic - the service “feed” - that subscribers can subscribe to in order to get the “latest news” from this service. This is very similar to Atom feeds, but implemented with Kafka. The benefits of this topology:

Less operational cost for the producer service - it only needs to manage a single topic, no need for creating a new topic when introducing new event types.
Although we have more noise here when compared to the “Topic Per Event Type” topology, at least subscribers can choose which service to listen to, and they don’t get events from services they don’t care about.

In terms of ordering, this might be sufficient in many cases, but it doesn’t help our use case, where a chain of events of the same flow can span microservices.

Topic As Flow

A more granular approach will be to create several topics that represent known flows. For example in our case we will send the events to the “SubmitTransactionFlow” topic, but other events can go to the “CustomerOnboardingFlow” topic. Let’s compare this approach:

It’s obviously more noisy than the “Topic Per Event Type” approach, but not necessarily more noisy than the “Topic As Feed” approach. It depends on how granular the flow is and how involved the subscriber is with the flow - it’s possible that the subscriber only cares about a single event in this flow, yet has to listen to all events.
Operational cost - we don’t need to create a topic per each event, but we do need to create a topic per each flow, so not improving that much unless we have a small and consistent number of flows.

And in our specific use case, it will solve the ordering.

The problem begins when the system becomes more and more convoluted - as systems tend to get. This puts an additional burden on the developer to understand exactly which flow a new event belongs to or figure out when a new flow needs to be created. More and more cases of events crossing flows arise, which makes producer and consumer development more coupled.

For example, the CustomerSignedUp event may be conceptually part of the CustomerOnboarding flow, as well as the SubmitTransaction flow, but we can only publish it to one topic, in which it will be ordered.

Topic As Event Store (single topic)

In an Event Sourcing architecture, there is a single Event Store that holds all the events in the system. Events serve as a single point of truth for the system state. You can implement this in Kafka with a single “main” topic that holds all events that make up the system’s state.

This, in fact, is a special case of the “topic as flow” approach, where you consider all events in the system to belong to a single flow.

This is the noisiest approach - subscribers must listen to all events from all producers. If you have a firehose of events - for example user clicks - this might not scale well.
It is best in terms of operational cost - it should be very easy to configure your framework once, and you don’t need to touch it again. In Behalf we use Spring Boot Data Stream along with Spring Cloud Config, which allows us to configure the topic in a single file that configures all services.
It accelerates the development process - producer and consumer teams can work in parallel on the solution once the schema is negotiated. The consumer is decoupled from the producer, which theoretically does not even need to know who is the producing service.

And of course, since all events go to the same queue, we get a total order of all events in the system (more accurately - partial ordering per partition - more on this later).

The Topic As Event Store is the architecture we adopted in Behalf for our main product. It allowed us to dramatically speed up our development process, while the ordering guarantee provided us with the data integrity needed for a financial product. It made it easier to migrate code from the monolith to microservices more quickly and safely.

What about the noise?

An optimized Kafka consumer can consume hundreds and even thousands of messages per second, assuming all you need to do is deserialize the message to extract the event name from it and toss it away if you are not interested in that event. So to understand if the noise problem is a real hurdle, you need to answer the following questions:

What is the noise/signal ratio? Does your consumer care about 1% of events or 50%?
What is an acceptable latency for handling an event? Any event-driven system is eventually consistent, and some latency is expected. Still, you might be in trouble if every event takes seconds or minutes to be processed, especially if you have long event chains.
What is the expected volume?
Are you paying for CPU/RAM? For example, when using a serverless cloud framework with usage-based pricing. In that case, the overhead of processing unwanted events might be costly.

At Behalf we found that the noise ratio was not a problem, even with services that are interested in a fraction of the messages. We also populate the event type as a Kafka header, so consumers can drop such messages without having to deserialize them. But of course, as the system scales, this should be monitored carefully.
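
The header-based filtering described above can be sketched in plain Java. This is a simplified model, not the actual Behalf code: RawEvent stands in for a consumed Kafka record, and the header name "eventType" is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for a consumed record: headers plus an opaque payload.
final class RawEvent {
    final Map<String, byte[]> headers = new HashMap<>();
    final byte[] payload;

    RawEvent(String eventType, byte[] payload) {
        // "eventType" is a hypothetical header name.
        headers.put("eventType", eventType.getBytes(StandardCharsets.UTF_8));
        this.payload = payload;
    }
}

final class HeaderFilter {
    private final Set<String> interestingTypes;

    HeaderFilter(String... types) {
        this.interestingTypes = new HashSet<>(Arrays.asList(types));
    }

    // Decide from the header alone; the payload is never deserialized here.
    boolean shouldHandle(RawEvent event) {
        byte[] type = event.headers.get("eventType");
        return type != null
                && interestingTypes.contains(new String(type, StandardCharsets.UTF_8));
    }
}
```

A consumer loop would call shouldHandle first and only deserialize the payload of events that pass, which keeps the cost of the noise close to the cost of reading one small header.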

Partitioning

With the single-topic approach, our events are all queued up in a nice single line. Our customers, however, do not. They access our system in parallel, which means our streams of state change should run in parallel as well, if we don’t want our system to grind to a halt.

Kafka allows us to partition a topic so it can be consumed in parallel by several consumers in a group. Ordering is guaranteed over a single partition. By default, the partition is determined using a hash function over the message key, which means that we need all events that belong to the same partial order to have the same message key. In our case, the message key can be the customer id, since all messages relevant to the same customer should be consumed in sequence. The same flow for a different customer can proceed in parallel. We must pick a consistent message key that we can relate most events to, like customer Id, user Id, session Id etc. (however, when selecting session Id, you need to consider the implications of multiple sessions per user). If no such key presents itself, it could mean that the single topic topology is not a good choice.
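
To illustrate why a shared message key preserves per-customer order, here is a small sketch. It is not Kafka code: Kafka's default partitioner hashes the serialized key with murmur2, while this example uses String.hashCode() as a stand-in, and the "customerId:eventName" string format is an assumption made for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: String.hashCode() stands in for Kafka's murmur2 key hash.
final class KeyPartitioning {

    static int partitionFor(String messageKey, int numPartitions) {
        return Math.floorMod(messageKey.hashCode(), numPartitions);
    }

    // Splits "customerId:eventName" entries into per-partition lists,
    // keeping the original relative order inside each partition.
    static List<List<String>> route(List<String> events, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String event : events) {
            String customerId = event.substring(0, event.indexOf(':'));
            partitions.get(partitionFor(customerId, numPartitions)).add(event);
        }
        return partitions;
    }
}
```

Because every event of customer A hashes to the same partition, the events of A stay in their published order there, while events of customer B may land in another partition and be consumed in parallel.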

We also need to be careful when selecting the number of partitions. Decreasing the number of partitions is not possible without destroying the topic. Each partition carries a small overhead - in producer memory and rebalancing time, so don’t shoot for the moon. Choose wisely according to the number of parallel sessions you need to support. Increasing the number of partitions is possible, but it might cause ordering issues since messages with the same key can be found in different partitions during the transition time. Therefore it’s best to plan ahead and start with a large enough number. As a rule of thumb, you will probably need special optimizations for 10,000 partitions, and anything under 100 is probably not worth the trouble of scaling up later.
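
The ordering risk of adding partitions can be seen in a few lines. This sketch again uses String.hashCode() as a stand-in for the real key hash, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class RepartitionCheck {

    // Stand-in for the key hash; Kafka itself uses murmur2 over the key bytes.
    static int partitionFor(String messageKey, int numPartitions) {
        return Math.floorMod(messageKey.hashCode(), numPartitions);
    }

    // Returns the keys whose partition changes when the partition count grows.
    static List<String> movedKeys(List<String> keys, int oldCount, int newCount) {
        List<String> moved = new ArrayList<>();
        for (String key : keys) {
            if (partitionFor(key, oldCount) != partitionFor(key, newCount)) {
                moved.add(key);
            }
        }
        return moved;
    }
}
```

Any key reported by movedKeys has older events in one partition and newer events in another, so a consumer of the new partition may process a later event before the earlier ones have been handled.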

The last note about the number of partitions is that the number of consumer threads should not exceed the number of partitions, otherwise you will have idle consumers that do not get any assignments. When there are more partitions than consumers, Kafka distributes the partitions such that each consumer gets some. A rebalance also happens when consumers leave and join the group. So while the number of partitions is mostly fixed, services can still scale up and down according to traffic and load. But when designing services for auto scale-out, keep in mind that you are capped by the number of partitions.
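
The cap can be illustrated with a simplified round-robin assignment. Kafka's real assignors are pluggable and more involved; this sketch only shows the counting argument, and it assumes at least one consumer.

```java
import java.util.ArrayList;
import java.util.List;

final class AssignmentSketch {

    // Round-robin assignment of partition numbers to consumers
    // (a simplification of what a Kafka assignor does).
    static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> byConsumer = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) {
            byConsumer.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            byConsumer.get(p % numConsumers).add(p);
        }
        return byConsumer;
    }

    // Counts consumers that received no partition at all.
    static long idleConsumers(int numPartitions, int numConsumers) {
        return assign(numPartitions, numConsumers).stream()
                .filter(List::isEmpty)
                .count();
    }
}
```

With 4 partitions and 6 consumers, two consumers end up with an empty assignment; with 4 partitions and 2 consumers, each consumer gets two partitions.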

Conclusion

Ordering events in a distributed system is not an easy task. Apache Kafka’s ordering guarantee can solve this problem, as long as you pick the correct topic topology and partitioning.

In the next posts we will dive deeper into our own take on Event Sourcing and the differences between Business and Data events in our system.

*Subject to underwriting and approval criteria. Approval occurs at the transaction checkpoint. Merchants generally get paid same-day if by virtual card, or next business day if by ACH (ACH subject to cut-off time of Thursday at 4:45 p.m. PST). Processing delays could occur due to unforeseen circumstances, e.g. when more information is required.

UML Diagrams: Sequence Diagram Overview

Gene Zeiniss — Sat, 20 Feb 2021 09:56:24 +0000

This article was originally published on Medium.

The UML diagram I use most (and my favorite) is the Sequence diagram. Implementation designs of each new feature I’m working on are loaded with Sequence diagrams. I mean it, totally chock-full.

In general, a Sequence diagram describes how and in what order objects in our system interact with each other, arranged in time sequence. In this article, I’m about to show you the notations of these interactions through “The Little Prince” narrator’s interactions with the world around him. Have you read Antoine de Saint-Exupéry’s novel?

I will summarise only the part that is relevant to the current article. It will be connected, I promise.

...

Okay, briefly, when he was a child, the Narrator read a book on jungles and was fascinated by the fact that a boa constrictor swallows its prey whole and then sleeps for a few months while digesting the meal. He drew a picture of the boa in this state.

Here is an X-ray of the boa, if you have any doubts.

When he showed it to adults, they all thought it was a lumpy hat and suggested he stop messing about and study the “important subjects” in school instead. The adults’ criticism killed the artist. The child grew up, became a pilot, and flew worldwide (not the worst). As an adult, he used the boa drawing to decide what topics to talk about with other adults. If an adult saw a lumpy hat, the Narrator spoke about golf and bridge and politics; he never mentioned jungles or boa constrictors or stars.

...

Now, back to the Sequence Diagram.

Sequence Diagram

In my opinion, a sequence diagram is best for representing cross-service end-to-end scenarios or a specific flow (or part of it) within a particular service. I create this diagram before I start designing the implementation, just to understand the requirements. It’s a contract between me and the other guys involved, such as the product manager, software architect, team members, etc. These diagrams are the VIP guests of each feature-related meeting.

The sequence diagram depicts the objects involved in the scenario and the sequence of messages exchanged between them required to execute the functionality. It’s pretty cool and handy.

...

Let’s start with an overview of basic symbols:

Basic Symbols

The sequence diagram has only two basic symbols: object and focus of control.

Object

The Sequence diagram object is notated by a box with a name and a dashed line extending down from it. While the object can represent a specific class (depending on what kind of logic you describe), I mostly use it to define a microservice.

However, today, it will represent a Narrator.

The dashed line, also known as the timeline, shows the sequential events that occur to an object during the charted process. Note that time in a Sequence diagram is all about ordering. The timeline says nothing about the duration of an interaction.

Focus of Control

The focus of control is the block that overlays the object’s timeline. This is the time in which the object is actually being utilized.

...

Now we are jumping to the exciting part — the interaction between objects!

Interactions

The objects in software systems can interact synchronously or asynchronously.

Synchronous Message

Synchronous messaging is when the caller makes a call, expecting the object being called to do some processing and return control to the caller at some point in time.

Back to our Narrator. Remember the boa drawing? He showed the drawing of the boa to adults and actually waited to hear what they saw:

The synchronous message is notated by a solid line with a filled-in and enclosed arrow, pointing from the caller (Narrator) to the callee (Adult). And you can see that it starts a new focus of control on the callee.

Once the call that was made completes, control is returned back to the caller. This is represented by the return line, which is a dashed line with an open arrow pointing to the caller.

If you describe the flow between classes, I mean, the Narrator is calling the Adult’s method, you can place the method name on the line. It’s also possible to put the passed parameters (depending on the level of detail you are looking for).

The return line can be identified with return values that are coming back to the caller. Again, according to the level of the details you decided on. It’s often common to not label the return line at all.

“Less is more” — Ludwig Mies van der Rohe

If the Narrator and the Adult are services, the message can be labeled with the endpoint name.

Here, you can see my return is pretty detailed. In my diagrams, I often return values that are relevant to the flow or to the contract.

“God is in the details” — Ludwig Mies van der Rohe

...

Another special kind of message is the self-message.

Self-Message

Self-messaging, obviously, is a call that an object makes to itself. When an Adult is asked by the Narrator, the Adult first calls on his own cognitive abilities to answer the question.

In the case of self-message, you don’t need a return on it.

Asynchronous Message

The beauty of the asynchronous message is that the caller is not expecting a return message, so it’s not blocked.

Imagine that the Narrator just told the Adult that he drew a boa.

The async message is represented by a solid line with an open arrow.
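
For readers who think in code, the two message kinds could look roughly like this in Java. The class and method names are made up for the story, and the CompletableFuture is returned only so that a caller can wait for the hand-off if it wants to.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical callee from the running example.
final class Adult {
    String whatDoYouSee(String drawing) {
        return "a hat";
    }

    void hear(String statement) {
        // The Adult processes the statement; nothing is returned to the caller.
    }
}

final class Narrator {
    private final Adult adult = new Adult();

    // Synchronous message: the caller blocks until the return value arrives.
    String askSynchronously() {
        return adult.whatDoYouSee("boa drawing");
    }

    // Asynchronous message: the call is handed off and the caller moves on.
    CompletableFuture<Void> tellAsynchronously() {
        return CompletableFuture.runAsync(() -> adult.hear("I drew a boa"));
    }
}
```

The first method corresponds to the solid line with a return line; the second corresponds to the open-arrow line with no return.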

...

So, what we saw so far works great for basic simple flows. Although, we all know that the software we build includes many more operations and much more complexity.

Let’s take a look at how we can model structural controls using Sequence Diagrams.

Structural Controls

The first structural control I want to show you is Conditional control.

Alternative (Conditional) Control

The alternative is a choice (that is usually mutually exclusive) between message sequences. This is modeled by switch or if/else logic in our code:

The Narrator showed the boa drawing and asked, “what is it?”. If an adult saw a hat, the Narrator talked about golf; if an adult saw a boa, he felt free to speak about boa constrictors.

Alternative control is represented by a labeled rectangle shape with a dashed line inside.
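
In code, the two sections of the alternative map to the branches of an if/else. A tiny sketch, with a made-up class and method name:

```java
final class Conversation {

    // Alternative control: exactly one branch runs, chosen by the guard.
    static String chooseTopic(String adultAnswer) {
        if ("a hat".equals(adultAnswer)) {
            return "golf";
        } else {
            return "boa constrictors";
        }
    }
}
```

Each branch corresponds to one section of the rectangle, and the condition is what you would write as the guard of that section.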

...

Okay, to show you the controls, I will refresh the novel slightly more for you.

The Narrator, who was a pilot (remember?), crashes in the Sahara desert. While he tries to repair his engine, a little boy (a Little Prince) appears out of nowhere and simply asks him to draw a sheep.

Now, let’s talk about Parallel Control.

Parallel Control

So, our Narrator is an extraordinary person (to put it mildly). However, to his credit, he is multitasking. He can (though with mixed success) repair the engine and communicate with a Little Prince in parallel. I must say, he is a genius!

As you can see, Parallel control notation is pretty similar to the Alternative one. It is represented by a labeled rectangle with a dividing dashed line. However, here each section must be processed, and processes should run in parallel.
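
A rough Java equivalent of the parallel block, assuming two independent tasks with made-up names, could be:

```java
import java.util.concurrent.CompletableFuture;

final class ParallelNarrator {

    static String repairEngine() {
        return "engine repaired";
    }

    static String talkToLittlePrince() {
        return "talked to the Little Prince";
    }

    // Parallel control: both sections are started and both must complete.
    static String doBoth() {
        CompletableFuture<String> repair =
                CompletableFuture.supplyAsync(ParallelNarrator::repairEngine);
        CompletableFuture<String> talk =
                CompletableFuture.supplyAsync(ParallelNarrator::talkToLittlePrince);
        return repair.join() + "; " + talk.join();
    }
}
```

Unlike the alternative, no section is skipped here: the result is only available once both tasks have finished.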

...

The last, but not least, structural control is Looping.

Looping Control

When Little Prince appeared in the story, he asked the Narrator to draw a sheep.

Think about it for a moment. A little boy in a desert, alone… He’s not thirsty or hungry. He wants a sheep. Fata Morgana? Genius Narrator stops the world and draws a sheep. So far, so good?

As it turned out, the boy was very fussy. He asked for the sheep to be drawn again and again (once it was too big, once too old), until the Narrator drew a box and told him that the sheep was inside.

The looping control shows that message interaction is going to happen multiple times. There is a guard text underneath the inverted tab that explains the loop.
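
The same idea as a loop in Java, where the guard is "the Little Prince is not satisfied yet". The method name and the list of drawings are illustrative assumptions.

```java
import java.util.Iterator;
import java.util.List;

final class SheepLoop {

    // Loop control: the "draw a sheep" exchange repeats while the guard
    // [not satisfied] holds. Returns the number of drawings that were made.
    static int drawUntilAccepted(List<String> drawings, String acceptable) {
        int attempts = 0;
        boolean satisfied = false;
        Iterator<String> it = drawings.iterator();
        while (!satisfied && it.hasNext()) {
            String drawing = it.next();
            attempts++;
            satisfied = drawing.equals(acceptable);
        }
        return attempts;
    }
}
```

The while condition plays the role of the guard text in the diagram, and the loop body is the message exchange inside the rectangle.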

That’s all! 🐑

...

To look at some of the other UML diagrams I wrote about, head on over to my profile.

Authentication & Authorization in Microservices Architecture - Part I

Tzachi Strugo — Sun, 14 Feb 2021 07:42:05 +0000

About Behalf

Behalf facilitates in-purchase financing for B2B buyers and sellers. As a financial institution, it is critical to maintain the customer's trust by ensuring that only qualified customers can access their Behalf Account.

Background

Behalf is based on microservices architecture, meaning that each service is loosely coupled and has its own enclosed, well-defined bounded context.
Moving from a Monolith to a microservice architecture has a lot of advantages:

Working with small components creates room to scale the service in separate parts.
Each microservice has its own autonomy & provides flexibility on the technology that will be used.
Productivity & velocity can increase by allowing different development teams to work on various components simultaneously without impacting each other.
Development teams are focused and organized around the business functionality.
Developers have the freedom to work more independently and autonomously without having a dependency on other teams.

However, as engineers and architects, we face security challenges in a distributed architecture. Microservices expose endpoints to the public audience, which are usually referred to as APIs.

A monolith needs to secure only itself, and that's manageable. Microservices have a bigger attack surface, which means a larger number of services comes with more significant risks. Each one needs to take care of all the weaknesses that might be exposed.
In a monolithic architecture, components invoke one another via a method call. In contrast, microservices may expose internal APIs (synchronous calls) to communicate with each other. This requires more effort and attention to secure.
In a monolith, all internal components share the same user session context. In a microservices architecture, nothing is shared between them, so sharing user context is harder and must be explicitly handled from one microservice to another.

According to the above security challenges, we conclude that a microservice's security needs to be tackled differently from the monolith’s.

This article will run you through the challenges and decisions to implement flexible, secure, and efficient authentication and authorization layers in a microservice architecture.

The Difference between Authentication and Authorization

Both terms will come to mind when talking about securing applications. And yet, there might be people that confuse the meaning of these terms.

In the authentication process, the identity of the user is checked to provide access to the system. This process verifies ‘who you are?’ so the user needs to supply login details to authenticate.
Authorization is the process of verifying if the authenticated user is authorized to access specific information or be allowed to execute a certain operation. This process determines which permissions the user has.

Authentication Strategy in a Microservice Architecture

While moving from monolith to microservices architecture, it is important to manage security and access control by understanding how to implement authentication and authorization in the microservices world.
There are several approaches:

Authentication & Authorization on each service

Each microservice needs to implement its own independent security and enforce it on each entry-point. This approach gives the microservice team autonomy to decide how to implement their security solution. However, there are several downsides about this approach:

The security logic needs to be implemented repeatedly in each microservice. This causes code duplication between the services.
It distracts the development team from focusing on their main domain service.
Each microservice depends on user authentication data, which it doesn't own.
It’s hard to maintain and monitor.
Authentication should be a global solution and handled as a cross-cutting concern.

One option to refine this solution would be to use a shared authentication library loaded on each microservice. This will prevent code duplication, and the development team will focus only on their business domain. However, there are still downsides that this refinement can’t solve.

Global Authentication & Authorization Service

In this strategy, a dedicated microservice will handle authentication and authorization concerns. Each business service must authenticate the request before processing it by forwarding it to the authentication service. However, there are several downsides about this approach:

The authorization check is a business concern. What specific user roles are allowed to do on the service is governed by business rules. Therefore, the authorization concern should not be handled in the global authentication service.
This strategy increases the latency of processing requests.

Global Authentication (API Gateway) and authorization per service

When moving to a microservice architecture, one of the questions that need to be answered is how an application’s clients communicate with the microservices. One approach would be to use direct access between client and microservice. This approach suffers from a strong coupling between clients and microservices.

The API gateway is a single endpoint entry for all requests. It provides flexibility by acting as a central interface for clients using these microservices. Instead of having access to multiple services, a client sends a request to the API gateway responsible for routing it to the downstream service.

Because the API gateway is a single endpoint entry, it is an excellent candidate to enforce authentication concerns. It reduces the latency (call to Authentication service) and ensures the authentication process is consistent across the application. After successful authentication, the security component will enrich the request with the user/security context (identity details of the logged-in user) and route the request to a downstream service that enforces the authorization check.
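
The split of responsibilities can be sketched in a few lines of Java. All names here (UserContext, ApiGateway, PaymentService, the "BUYER" role) are hypothetical, and a real system would carry the user context in HTTP headers or a token rather than as an in-memory object.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical user/security context that the gateway attaches to a request.
final class UserContext {
    final String userId;
    final Set<String> roles;

    UserContext(String userId, String... roles) {
        this.userId = userId;
        this.roles = new HashSet<>(Arrays.asList(roles));
    }
}

final class ApiGateway {
    private final Map<String, UserContext> usersByCredential = new HashMap<>();

    void register(String credential, UserContext user) {
        usersByCredential.put(credential, user);
    }

    // Authentication happens once, at the edge. Returns null when the
    // credential is unknown, in which case the request is not routed.
    UserContext authenticate(String credential) {
        return usersByCredential.get(credential);
    }
}

final class PaymentService {
    // Authorization stays with the service, because it is a business rule.
    boolean maySubmitTransaction(UserContext user) {
        return user != null && user.roles.contains("BUYER");
    }
}
```

The gateway answers "who are you?" once, and each downstream service answers "are you allowed to do this?" using the context it receives.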

Authentication Types: Stateful vs. Stateless

In Stateful authentication, the server creates a session for the user after successfully authenticating. The session id is then stored as a cookie in the user's browser, and the user session is stored in a cache or database. When the client tries to access the server with a given session id, the server attempts to load the user session context from the session store, checks if the session is valid, and decides whether the client may access the desired resource or the request must be rejected.
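
A minimal sketch of the stateful flow, with an in-memory map standing in for the cache or database. The class and method names are assumptions made for illustration.

```java
import java.time.Instant;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the cache or database that holds sessions.
final class SessionStore {

    private static final class Session {
        final String userId;
        final Instant expiresAt;

        Session(String userId, Instant expiresAt) {
            this.userId = userId;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Session> sessions = new ConcurrentHashMap<>();

    // Called after a successful login; the returned id goes into the cookie.
    String create(String userId, Instant expiresAt) {
        String sessionId = UUID.randomUUID().toString();
        sessions.put(sessionId, new Session(userId, expiresAt));
        return sessionId;
    }

    // Returns the user id for a valid session, or null to reject the request.
    String resolve(String sessionId, Instant now) {
        if (sessionId == null) {
            return null;
        }
        Session session = sessions.get(sessionId);
        if (session == null || !now.isBefore(session.expiresAt)) {
            return null;
        }
        return session.userId;
    }
}
```

Every request costs one lookup in the store, which is exactly the state that the stateless approach avoids.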

Stateless authentication stores the user session on the client-side. A cryptographic algorithm signs the user session to ensure the session data’s integrity and authenticity.
Each time the client requests a resource from the server, the server is responsible for verifying the token claims sent as a cookie.

Since the user session is stored on the client-side, this approach frees the overhead to maintain the user session state, and scaling doesn’t require additional effort.

Brief Introduction to JSON Web Token (JWT)

JWT is an open standard (RFC 7519) that defines a mechanism for securely transmitting information between two parties. The JWT is a signed JSON object that contains a list of claims which allow the receiver to validate the sender's identity.

The purpose of using the JWT token is for a stateless authentication mechanism. Stateless authentication stores the user session on the client-side.

JWT Structure

The JSON Web token is composed of three parts separated by periods.

The header contains the algorithm used for signing.
The payload is the session data, also referred to as ‘claims’. There are two types of claims:
- The JWT specifications define reserved claims that are recommended to use while generating the JWT token.
- Custom claims
The signature is the most critical part. It is calculated by Base64URL-encoding the header and the payload and joining them with a period. The result is then signed using a secret key and the cryptographic algorithm specified in the header section. The signature is used to verify that the token has not been modified.
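
To make the three parts concrete, here is an educational sketch that builds and verifies an HS256-signed token using only the Java standard library. The class name is hypothetical and the code is not a replacement for a maintained JWT library; it accepts HS256 only and does not inspect any claims.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Educational HS256 sketch: header.payload.signature
final class Hs256Jwt {
    private static final Base64.Encoder URL = Base64.getUrlEncoder().withoutPadding();

    static String sign(String payloadJson, byte[] secret) {
        String header = "{\"alg\":\"HS256\",\"typ\":\"JWT\"}";
        String signingInput = encode(header) + "." + encode(payloadJson);
        return signingInput + "." + URL.encodeToString(hmac(signingInput, secret));
    }

    static boolean verify(String token, byte[] secret) {
        int lastDot = token.lastIndexOf('.');
        if (lastDot < 0) {
            return false;
        }
        // Recompute the signature over "header.payload" and compare.
        byte[] expected = hmac(token.substring(0, lastDot), secret);
        byte[] actual;
        try {
            actual = Base64.getUrlDecoder().decode(token.substring(lastDot + 1));
        } catch (IllegalArgumentException e) {
            return false;
        }
        // MessageDigest.isEqual compares in constant time.
        return MessageDigest.isEqual(expected, actual);
    }

    private static String encode(String json) {
        return URL.encodeToString(json.getBytes(StandardCharsets.UTF_8));
    }

    private static byte[] hmac(String data, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If anyone changes the header or the payload, the recomputed signature no longer matches the third part and verification fails.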

JWT Best Practices and Pitfalls

Always use the HTTPS protocol to offer better protection. This way, all data sent between the client browser and a server are encrypted.
Keep the token size as small as possible. The JWT can be either a signed token, using JSON Web Signature (JWS), or an encrypted token with a more secure level of protection, using JSON Web Encryption (JWE). Either way, as a rule of thumb, the token should not contain sensitive data.
Several attacks rely on ambiguity in the API of certain JWT libraries. Make sure the JWT library is protected against them by validating the algorithm name explicitly.
Make sure to use a strong secret key that is at least as long as the output of the hash algorithm.
Give the JWT a short expiration period to reduce the probability of it being used maliciously.
The JWT is automatically valid until it expires. If an attacker gets the token, the only way to “kill” the session is by a stateful solution that explicitly detects and rejects those tokens.
Using JWT does not prevent CSRF attacks. Cross-site request forgery (CSRF) is a web security vulnerability that allows an attacker to induce users to perform actions that they do not intend to perform. Use a synchronizer token pattern to prevent it.
For additional JWT Best practices, read The JSON Web Token Current Best Practices.
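
Several of the checks above can be expressed as one small validation step. This sketch assumes the header and claims have already been parsed into maps and the signature has already been verified; "alg", "exp", and "jti" are names from the JWT specifications, while the class name and the revocation set are hypothetical.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Set;

final class TokenChecks {

    static boolean isAcceptable(Map<String, Object> header,
                                Map<String, Object> claims,
                                Set<String> revokedTokenIds,
                                Instant now) {
        // Accept only the algorithm this service expects, never "none".
        if (!"HS256".equals(header.get("alg"))) {
            return false;
        }
        // "exp" is the expiration time, in seconds since the epoch.
        Object exp = claims.get("exp");
        if (!(exp instanceof Number)) {
            return false;
        }
        if (!now.isBefore(Instant.ofEpochSecond(((Number) exp).longValue()))) {
            return false;
        }
        // "jti" is the token id; the revocation set is the stateful part that
        // lets a stolen token be rejected before it expires.
        Object jti = claims.get("jti");
        return !(jti instanceof String && revokedTokenIds.contains(jti));
    }
}
```

Note that the revocation set reintroduces a small amount of server-side state, which is the price of being able to "kill" a token early.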

Conclusion

Authentication and Authorization are critical core components for applications. This article explains what needs to be considered while building a clean and robust authentication solution under microservice architecture.
In the next article, we are going to explain our implementation to achieve this goal.

Observability is not just about the tool

Liran Keren — Fri, 29 Jan 2021 16:43:14 +0000

Logs have been with us since the dawn of coding, and with good reason.
Logging helps us debug during development, understand failures in pre-production and production environments, and observe our systems' proper functionality in general.

With the recent microservices and cloud computing trend, there was a need to create centralized logging solutions that would process logs from hundreds of microservices in real-time.
There are many solutions out there. Some are open-source like ELK; others are paid solutions like Sumo Logic, Coralogix, Logz.io, Loggly, and more.

These services indeed solve many of the issues related to log management and the monitoring of distributed systems. However, how we formulate logs is up to us, the developers.

Logs are, in fact, streams of events.
Reading events from the beginning to a certain point helps us understand the state of a specific snapshot in time.
Log events have metadata derived from the log technical context.
Some examples include: host, application name (e.g. microservice name), and receive time.
Another metadata point is the log level, which is set by us, the developers.
Finally, there's the "payload" – a schemaless blob of data – a conversation between the developer and their future self.
But should it be schemaless?

Here’s an example of a log (example 1):

log.debug("Fetched {} email records from DB. Query={}, Params={}",
     new Object[] { emails.size(), query, query.getParams() });

The log event that is the output of this code repeats itself every time; however, the values will most likely change.
While it looks like a schemaless string, it has a repeating pattern.
Using Sumo Logic, ELK, Coralogix, or even good old Grep, we can consider "Fetched {} email records from DB." as a Number field - emailRecordsCount, "Query={}," as a String field called query and Params can be parsed to a list of String values.
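
To show what such parsing involves, here is a small Java sketch that extracts the three fields with a regular expression. The exact rendering of the log line (timestamp prefix, how Params is printed) is an assumption, and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailFetchLogParser {
    private static final Pattern LINE = Pattern.compile(
            "Fetched (\\d+) email records from DB\\. Query=(.*), Params=(.*)");

    // Returns {emailRecordsCount, query, params}, or null if the line
    // does not contain the expected pattern.
    static String[] parse(String logLine) {
        Matcher m = LINE.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }
}
```

The pattern works, but it is tightly coupled to the wording of the message: rename "email records" in the code and the parser silently stops matching.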

Parsing logs into fields in any logging system is the key feature that lets us achieve observability.
After parsing logs into fields, the stream of events transforms into a table. Using this table, we can perform aggregations and analyze things like error count, event streaming latency, login statistics, and many more.
With aggregations, we can build dashboards and set alerts, which are essential to our business continuity.

While we established that every log event is a repeating pattern, there are serious issues with the simple logging approach:

Hard to parse - While there is a pattern, it is hard to parse because the developers don’t declare their intentions. Sometimes you'll need a complex regex to parse it.
Language & Context - Sometimes, it's hard to understand the meaning and context of a log. Day-to-day logging is like a conversation between a developer and their future self. In many cases, only the developer understands and knows the context of the log event. As time goes by, a developer needs to constantly check their code to recall the original context, the format, and their original intentions when writing the log event.
A tendency for change - Most developers don’t cover their logging with unit tests. Changing a log event never changes a feature's behavior, so end-to-end tests don’t cover it either. We can build alerts and meaningful dashboards in services like Sumo Logic, but there's an inherent possibility that they will break, and we won't know it.

Here's how we at Behalf solved these issues:

Using formats like key=value or JSON makes the job of parsing into fields much more straightforward, and with many services (Sumo Logic, Loggly, and others), you gain auto-parsing abilities.

Just choosing a format is not enough, though.
Take the log event from the example above.
Writing it in JSON format would look something like this:

{
    "emailRecordsCount": 123,
    "query": "SELECT email from people",
    "params": {"firstName": "John"}
}

But while one developer will write it this way on one microservice, another developer will probably use different names for the fields, like emailCount or query_params, on their microservice.
A log event should not be private to the developer who wrote it.
For this purpose at Behalf, we distinguish between "Debug logs" and "Audit logs."

Debug logs
These are the logs similar to the log in example 1.
In our methodology, these kinds of logs have Debug level and are used during development.
There's no need for a particular structure; they can be very verbose, and they tell a story that the developer who wrote them knows.
They will be seen in development environments and CI (continuous integration) runs, but they won't appear in Staging and Production.
They can change without any worry since they won't appear in production in the first place, and no one will establish any monitoring aspects on top of them.
Pros:

It's straightforward and fast to code these lines.

Cons:

Everything I wrote above (hard to parse, language and context, a tendency to change)

Audit Logs
We created a core library service called AuditService that can be used by all microservices.
The AuditService is a simple abstract class that expects any extension of a BaseAuditRecord.

public abstract class AuditService<R extends BaseAuditRecord> {
   public void audit(R auditRecord) {
      // Serializes auditRecord to JSON and writes it to the log (body omitted here).
   }
}

The BaseAuditRecord and its extensions are basic POJOs that represent the schema of the event that we want to audit.
The AuditService.audit() method receives the record object and logs it in JSON format (it serializes the record into JSON).
While the BaseAuditRecord provides some basic properties like status, duration, etc., we still want to create different audit layers for various purposes.
First and foremost, we use AuditService extensions in our different Core libraries and platforms that run all our microservices.
For example, to audit our persistence layer, we extend BaseAuditRecord with AuditDbRecord.
AuditDbRecord has the following properties:

String operation (e.g., SELECT, UPDATE...)
String query
int numRows

We extend AuditService with AuditDb:
public class AuditDb extends AuditService<AuditDbRecord> {}

The output appears like this:
2021-01-21 21:30:23.981 INFO 20 [sessionId=, principalId=, flowId=, traceId=5a139a230e15de3c, spanId=5a139a230e15de3c] --- [io-8092-exec-15] c.behalf.core.persistence.audit.AuditDb : {"status":"OK","duration":2,"operation":"SELECT","type":"READ","numRows":0,"table":"agreement","queryName":"findByCustomerId"}
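
The same record-as-schema idea can be sketched outside Java as well. The Python analogue below is only an illustration, not Behalf's library: the class names mirror the article, while the snake_case field name num_rows, the millisecond unit, the sample query, and the standalone audit function are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BaseAuditRecord:
    # Properties shared by every audit record.
    status: str
    duration: int  # assumed to be milliseconds


@dataclass
class AuditDbRecord(BaseAuditRecord):
    # Schema for persistence-layer events.
    operation: str
    query: str
    num_rows: int


def audit(record):
    """Serialize an audit record to one JSON line; a real service would log it."""
    return json.dumps(asdict(record), sort_keys=True)


line = audit(AuditDbRecord(status="OK", duration=2, operation="SELECT",
                           query="SELECT * FROM agreement WHERE customer_id = ?",
                           num_rows=0))
print(line)
```

Because the schema lives in a class rather than in a format string, renaming a field is an explicit code change that a unit test can catch.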

Pros:

Testability - it can be easily unit tested. Schema changes are significant and apparent. If we need to change the lines that call AuditService.audit(), we first check whether an alert or a dashboard uses them.
The need to create the schema forces the developer to consider observability and monitoring in post-production while in the early development phases. The developer has to determine what tech support teams and DevOps teams will get in terms of logs when the feature goes live.
Platform thinking - Creates a uniform understanding of the auditing layers, especially if they belong to the platform. If I'm a developer in team A, I would easily understand team B's service logs because the format, language, and context are the same. The same goes for quality engineers, tech support, and even Product teams - everyone can understand the platform audit language.
Searching, parsing, and aggregating becomes easy. For example, at Behalf, we use Sumo Logic. If I want to check the average latency on event consumption on all microservices in production, I will do this search in Sumo:

_sourceCategory=prod* AuditBusinessEvents* 
| json auto
| avg (latency)

It searches production logs for all log lines created by the AuditBusinessEvents class, using Sumo's 'json auto' feature to auto parse the JSON fields, and then it aggregates to get the average latency within the time frame.
Again, no need to be a coder to search and process these logs!!

Cons:

It takes a little bit more time to implement than regular logs.
People might find the output not verbose enough (well, that’s the purpose).

We use the AuditService to log any fundamental process activity, whether it's a core/platform process or a microservice business logic layer process.
With our structured, testable logs, we gain complete observability, we recognize weak spots, understand our business flows, and easily create alerts and dashboards that don't break.

We established an observability practice that doesn't break and always extends as more features are introduced into our system.

log.error ("Oh no, we reached the end of the post");

FuzzyWuzzy — the Before and After

Naomi Kriger — Wed, 13 Jan 2021 12:50:49 +0000

In the previous article, I introduced the FuzzyWuzzy library, which calculates a 0–100 matching score for a pair of strings. The different FuzzyWuzzy functions enable us to choose the one that would most accurately fit our needs.

However, conducting a successful project is much more than just calculating scores. We need to clean the data before we start working on it, choose the best method to calculate our scores, learn how to work not only with a pair of strings but with tables of data, and eventually know how to use the scores we received to make the most out of our results.

So, without further ado, let’s dive into some best practices we should be familiar with.

Using a Table With Pandas In order To Compare Multiple Strings

As discussed earlier, FuzzyWuzzy functions calculate matching scores for two strings. But when working with “real life” data, we will probably want to compare at least two sets of strings. This means working with a table, or when speaking in Pandas terms — working with DataFrames.

A good table will resemble this one:

The table above contains two comparison columns, each with a relevant header, where the strings to be compared are in parallel rows.

Given such a dataset, we can read the table to a DataFrame using a relevant function. The example below reads directly from a CSV, but if you are interested in using other formats — you can check out the following documentation.

>>> my_data_frame = pd.read_csv("my_folder/my_file_name.csv")

Data Preprocessing — Cleaning the Data Before Analysis

Before we choose our FuzzyWuzzy function and start comparing strings, we want to clean the data to ensure that our results will be as accurate as possible.

Cleaning the data means removing irrelevant strings, and thus improving the functions’ performance.

For example, let’s assume we compare strings of two addresses, where one address is “Joe Boulevard” and the other is “Jule Boulevard”. The matching score will be relatively high, but mostly due to the existence of “Boulevard” in both strings. Removing it and recalculating will result in a much lower matching score:

>>> fuzz.ratio("Joe Boulevard", "Jule Boulevard")
89
>>> fuzz.ratio("Joe", "Jule")
57

The type of cleaning required for your data depends on your domain.
We saw an example of the required cleaning for addresses. Similarly, when comparing phone numbers — we will probably want to remove parentheses and dashes that have no added value. It is also recommended to normalize all of your strings to lowercase since some FuzzyWuzzy functions treat differently-capitalized letters as different strings.
So, look at your data, and decide what should be modified in order to make it clean and ready for processing.

Data Pre-Processing — Let’s Get Technical

Now, let’s define a function with the relevant logic, and iteratively run it on each of the relevant columns in the DataFrame.

** The example below was simplified in order to keep the explanation clear. For best results, it is recommended to use regular expressions (regex), which are beyond the scope of this article. Note that strings_to_remove, in its current form, may lead to imperfect results after the cleanup.

>>> strings_to_remove = [' ave ', ' ave. ', 'avenue', ' lane ', ' ln', 'blvd', 'boulevard', ' rd. ', 'road', 'street', ' st. ', 'str ', ' dr. ', 'drive', ' apt ', 'apartment', 'valley', 'city', '.', ',']

>>> comparison_table = comparison_table.astype(str).apply(lambda x: x.str.lower())

>>> for current_string in strings_to_remove:
...     # regex=False: treat every entry (including '.') as a literal string
...     comparison_table = comparison_table.apply(
...         lambda x: x.str.replace(current_string, ' ', regex=False))

>>> # collapse the runs of spaces left behind by the removals
>>> comparison_table = comparison_table.apply(
...     lambda x: x.str.replace(' +', ' ', regex=True))

And — voilà!
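
For readers who want to try the regex route recommended in the note above, here is a minimal standard-library sketch. The shortened NOISE_WORDS list and the clean_address helper are hypothetical and would need to be adapted to your own data; the point is only to show how word boundaries avoid removing noise words from the middle of real words.

```python
import re

# Hypothetical, shortened list of address "noise" words; extend it for your data.
NOISE_WORDS = ["ave", "avenue", "lane", "ln", "blvd", "boulevard", "rd", "road",
               "street", "st", "dr", "drive", "apt", "apartment"]

# \b anchors each alternative at word boundaries, so "st" is not removed from "stone".
NOISE_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, NOISE_WORDS)) + r")\b\.?"
)


def clean_address(text):
    """Lowercase, drop noise words and punctuation, and collapse whitespace."""
    text = NOISE_PATTERN.sub(" ", text.lower())
    text = re.sub(r"[.,]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_address("12 Stone St., Apt 4"))
print(clean_address("Jule Boulevard"))
```

With pandas, a function like this could be applied to each comparison column, for example via Series.map.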

Adding a Score Column And Comparing

All that’s left now is to add an empty column named ‘score’ to the DataFrame, calculate the matching scores using our chosen FuzzyWuzzy function, and populate the DataFrame with those scores.

Here is an example of how to do that -

>>> comparison_table['score'] = ""
>>> comparison_table['score'] = comparison_table.apply(
...     lambda row: fuzz.token_set_ratio(row['col_a_addresses'], row['col_b_addresses']),
...     axis=1)

Let’s compare the results with those we would have received if we had run the FuzzyWuzzy function on an unprocessed DataFrame:

Before Cleaning -

After Cleaning -

So, what actually happened after cleaning the data?
The matching scores became more accurate — either increased or decreased based on the cleaning.

Let’s look at row 3, where the score decreased after cleaning. In this case, the word “Lane”, which appeared in both addresses before cleaning, falsely increased the matching score. After removing it, we were able to see that the addresses are not that similar.
Let’s look at row 9, where the score increased after cleaning. While “Lane” and “ln.” have the same meaning, they are different strings with different capitalization. Once the noise was cleaned out, we received a much better score, one that more accurately reflects the similarity level between those strings.
It is also interesting to see that the cleaned strings in row 9 are not identical. “85” appears only in col_b_addresses, yet the matching score is 100. Why? Because the strings are “close enough” to be determined as a perfect match by the algorithm, a decision that would likely have been the same if a human being had to make it.

Choosing a FuzzyWuzzy Function — In a Nutshell

One method to choose the best FuzzyWuzzy function to work with is based on the logic/purpose of the different functions and determining which function seems most relevant for your purposes.

However, if you cannot decide which function may retrieve the most accurate results, you can conduct a small study to determine what to work with.

The method I would recommend using would be to take a sample of your data set and run each of the relevant functions against it. Then, for each of the results — manually decide if the value in each row is true positive / false positive / true negative / false negative.

Once this is done, you can either choose the function whose TP/FP rate is most satisfactory, or go ahead and calculate accuracy* and sensitivity* as well, and use these values to make your decision.
For each project, our goals may differ, and the false-positive / false-negative rates we are willing to accept will be different.

* Both accuracy and sensitivity are used in data science and are beyond the scope of this article. The formulas for each of these can be found online.
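
For quick reference, the standard definitions are accuracy = (TP + TN) / (TP + FP + TN + FN) and sensitivity = TP / (TP + FN). The sketch below applies them to made-up counts from a hypothetical manually labeled sample; the numbers are illustrative only.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all labeled rows that the function classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(tp, fn):
    """Share of truly matching rows that the function detected (recall)."""
    return tp / (tp + fn)


# Made-up counts from a manually labeled sample of 100 rows.
tp, fp, tn, fn = 40, 5, 45, 10

print(accuracy(tp, fp, tn, fn))
print(sensitivity(tp, fn))
```

Running the same calculation for each candidate FuzzyWuzzy function on the same sample gives comparable numbers to base the choice on.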

Choosing A Threshold Score — In a Nutshell

My pair of strings returned a matching score of 82. Is it good? Is it bad?

The answer depends on our target, and there are many relevant questions to ask, such as: are we interested in strings that are very similar to one another, or in different ones? What is the maximal false-positive rate we are willing to accept? What is the minimal true-positive rate we want to work with?

For the same set of strings, we can come up with two different threshold scores — minimal score for similar strings (for example 85), and maximal score for different strings (for example 72).
There can be a whole range between these threshold scores that will be deemed “inconclusive”.
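
The two-threshold idea can be sketched as a small helper. The function name classify is hypothetical, and the default values 85 and 72 are simply the example thresholds from the paragraph above, not recommended settings.

```python
def classify(score, similar_min=85, different_max=72):
    """Map a 0-100 matching score to one of three labels using two thresholds."""
    if score >= similar_min:
        return "similar"
    if score <= different_max:
        return "different"
    return "inconclusive"


print(classify(82))
print(classify(91))
print(classify(60))
```

With these example thresholds, the score of 82 from the opening question falls in the inconclusive band and would need a manual look.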

There are different methods to define a threshold score, and we won’t dig into them in this article. I will, however, mention that choosing a threshold score will require some manual work, similar to the one mentioned above regarding how to choose the best FuzzyWuzzy function to work with — taking a sample set of strings with final scores, determining true-positive and false-positive for the results, and eventually deciding where our threshold stands.

Using FuzzyWuzzy for string comparison, as well as pre-processing the data and eventually analyzing the results, is fascinating work. There is always more to do, and different ways to improve the process.

In this article, we explored some of the practices that make this process useful and comfortable.

If you enjoyed this article, and/or the previous one, let me know! Share in the comments below a takeaway note for your next project.

I’d love to hear from you :)

Comparing Strings Is Easy With FuzzyWuzzy

Naomi Kriger — Wed, 13 Jan 2021 12:09:57 +0000

About a year ago, I saw a colleague of mine working on a large data set, aiming to measure the similarity level between each pair of strings. My colleague started to develop a method that calculated the “distance” between the strings, with a set of rules he came up with.

Since I was familiar with the FuzzyWuzzy python library, I knew there was a faster and more efficient way to determine if a pair of strings was similar or different. No tedious calculation needed, no reinventing the wheel, just familiarity with FuzzyWuzzy, and with a few other tips that I will share with you today.

What Is String Comparison, And How Can FuzzyWuzzy Help?

FuzzyWuzzy is a Python library that calculates a similarity score for two given strings.

The similarity score is given on a scale of 0 (completely unrelated) to 100 (a close match).

Of course, if all you need is exact string comparison, Python has got you covered:

>>> "check out this example" == "check out this example"
True
>>> "check out this example" == "something completely different"
False

But what if your strings are not necessarily exactly the same, yet you still need to know how similar they are?

>>> "check out this example" == "check out this exampel"
False

This isn’t very helpful.

But check this out:

>>> fuzz.ratio("check out this example", "check out this exampel")
95

Much better.

Getting Started

If the relevant libraries are not yet installed in your virtual environment and imported, you will need to run the following:

In a command line:

pip install pandas
pip install fuzzywuzzy
# or, preferably:
pip install fuzzywuzzy[speedup]
# [speedup] installs the python-Levenshtein library for better performance

Inside your IDE:

>>> import pandas as pd
>>> from fuzzywuzzy import fuzz

Levenshtein Distance — Behind the Scenes of FuzzyWuzzy

The different FuzzyWuzzy functions use Levenshtein distance, a popular algorithm that calculates the distance between two strings. The “further” those strings are from one another, the greater the distance score (the opposite of FuzzyWuzzy's output, where a higher score means a closer match).

However, Levenshtein distance has a major disadvantage:
It has one logic, while in real life there are a few different ways to define similarity of strings.
Levenshtein distance cannot cover different cases with one piece of logic. Some cases would be defined as similar strings by a human being, yet missed by Levenshtein.

For example:

>>> import Levenshtein  # provided by the python-Levenshtein package
>>> Levenshtein.distance("measuring with Levenshtein", "measuring differently")
11
>>> Levenshtein.distance("Levenshtein distance is here", "here is distance Levenshtein")
17

If someone had to decide which pair of strings is more similar, they would probably pick the second option. However, we can see that it actually got a higher distance score.
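
To see why word order hurts the score, it helps to look at how the distance is computed. Below is a minimal pure-Python sketch of the textbook dynamic-programming algorithm with unit costs; it is an illustration, not the python-Levenshtein implementation, and the function name levenshtein is an assumption.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

The algorithm compares characters position by position and has no notion of words, so moving a whole word is paid for one character at a time.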

We need something better than that.

Getting to Know the Functions

FuzzyWuzzy has a few different functions. Each of these functions takes a pair of strings and returns a 0–100 match score. But each of these functions has a slightly different logic, so we can select the most appropriate function to meet our needs.

Let’s get to know some prominent FuzzyWuzzy functions:

fuzz.ratio:
The simplest function. Calculates the score based on the following logic:
Given two strings, where T is the total number of elements in both sequences, and M is the number of matches, the similarity between those strings is:
2*(M / T)*100

>>> fuzz.ratio("great", "green")
60

In this example -
T = len("great") + len("green") = 10
M = 3
So the formula is 2*(3/10)*100 = 60
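
The formula can be reproduced with the standard library alone. In the sketch below, M is counted with difflib.SequenceMatcher; treating that as the way matches are found is an assumption, and FuzzyWuzzy's own scores may differ slightly depending on which backend is installed. The name simple_ratio is hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Return 2 * (M / T) * 100, rounded to the nearest integer."""
    matcher = SequenceMatcher(None, a, b)
    # M: total number of characters inside the matching blocks.
    m = sum(block.size for block in matcher.get_matching_blocks())
    # T: total number of characters in both strings.
    t = len(a) + len(b)
    if t == 0:
        return 100
    return int(round(2 * m / t * 100))


print(simple_ratio("great", "green"))
```

For "great" and "green" the only matching block is "gre", which gives M = 3 and T = 10 as in the calculation above.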

fuzz.partial_ratio:
For two strings, A and B, where len(A) < len(B) and len(A) = n, this function runs the ratio function between A and every n-length substring of B and returns the highest of all the scores calculated in this process. Let’s take a look at the following three examples, and then discuss them:

>>> fuzz.partial_ratio("let's compare strings", "strings")
100
>>> fuzz.partial_ratio("let's compare strings", "stings")
83
>>> fuzz.ratio("let's compare strings", "strings")
50

The examples above demonstrate the following:
1. The shorter string (“strings”) is fully contained in the longer one, therefore the matching score is 100.
2. The shorter string (“stings”) is almost fully contained in the longer one, except for a “typo”, therefore the matching score is 83.
3. Same strings as the first example, but using the ratio function instead of partial_ratio. Since the ratio function is not designed to handle the substring case, the score is lower.
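
The sliding-window description above can be sketched with the standard library. This follows the simplified explanation in this article (try every n-length window) rather than FuzzyWuzzy's actual implementation, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """0-100 similarity of two strings, based on difflib's ratio."""
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))


def simple_partial_ratio(a, b):
    """Best simple_ratio between the shorter string and any same-length window."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    n = len(shorter)
    # Slide a window of length n over the longer string and keep the best score.
    return max(simple_ratio(shorter, longer[i:i + n])
               for i in range(len(longer) - n + 1))


print(simple_partial_ratio("let's compare strings", "strings"))
print(simple_partial_ratio("let's compare strings", "stings"))
print(simple_ratio("let's compare strings", "strings"))
```

Trying every window is quadratic in the worst case, which is acceptable for short strings like addresses but worth keeping in mind for long texts.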

fuzz.token_sort_ratio

Sorts the tokens inside both strings (usually split into individual words) and then compares them. This will retrieve a 100 matching score for strings A, B, where A and B contain the same tokens but in different orders.
See the examples below, once with token_sort_ratio function, and once with ratio function, to see the different results.

>>> fuzz.token_sort_ratio("let's compare strings",
"strings compare let's")
100
>>> fuzz.ratio("let's compare strings", "strings compare let's")
57

What about the same token but different counts?

>>> fuzz.token_sort_ratio("let's compare", "let's compare compare")
76

Well, this isn’t the function’s specialty. For this case, we have token_set_ratio coming to the rescue.
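
The sort-then-compare idea can be sketched with the standard library. This is a simplification: the tokenization here is a plain lowercase whitespace split, which is an assumption, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """0-100 similarity of two strings, based on difflib's ratio."""
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))


def simple_token_sort_ratio(a, b):
    """Sort the words of each string alphabetically, then compare the results."""
    def sort_tokens(text):
        return " ".join(sorted(text.lower().split()))
    return simple_ratio(sort_tokens(a), sort_tokens(b))


print(simple_token_sort_ratio("let's compare strings", "strings compare let's"))
print(simple_token_sort_ratio("let's compare", "let's compare compare"))
```

Sorting removes order differences, but a repeated word still makes one string longer than the other, which is why the second score drops.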

fuzz.token_set_ratio
The main purpose of token_set_ratio is to ignore duplicates and order differences between the given tokens.

We can see that the two examples below got the same matching score, although in the second example some tokens have been duplicated and their order was changed.

>>> fuzz.token_set_ratio("we compare strings together",
"together we talk")
81
>>> fuzz.token_set_ratio("strings we we we compare together", "together together talk we")
81

The logic behind this function, after simplifying it a bit, is as follows:
Given strings A, B (let’s have A = ‘hi hey ho’ and B = ‘yo yay ho’):
C = intersection of A and B (‘ho’)
D = C + remainder of A (‘ho hi hey’)
E = C + remainder of B (‘ho yo yay’)
So token_set_ratio runs ratio(C, D), ratio(C, E), ratio(D, E) and returns the highest score of the three.

The full logic contains additional normalization, such as applying set() on A and B, applying sort() on C, and more.
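
The simplified C/D/E logic above can be sketched as follows. It uses difflib for the ratio and a plain lowercase whitespace split for tokens, both of which are assumptions; the real function applies more normalization, so treat this as an illustration of the idea only.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """0-100 similarity of two strings, based on difflib's ratio."""
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))


def simple_token_set_ratio(a, b):
    """Compare the shared tokens against each string's sorted, de-duplicated form."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    c = " ".join(sorted(tokens_a & tokens_b))                    # intersection
    d = (c + " " + " ".join(sorted(tokens_a - tokens_b))).strip()  # C + rest of A
    e = (c + " " + " ".join(sorted(tokens_b - tokens_a))).strip()  # C + rest of B
    return max(simple_ratio(c, d), simple_ratio(c, e), simple_ratio(d, e))


print(simple_token_set_ratio("we compare strings together", "together we talk"))
print(simple_token_set_ratio("strings we we we compare together",
                             "together together talk we"))
```

Both calls reduce to the same token sets, so duplicates and word order have no effect on the result.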

The code implemented by each of the functions described above, as well as other useful FuzzyWuzzy functions, can be found here.

Getting To The Nitty Gritty

In this article, we covered the basics of the FuzzyWuzzy library and its functions.

However, getting the best results for our project does not begin and end with knowing which functions are available out there and how to use them.

Doing it like a pro means knowing how to clean data before working with it, how to compare not only pairs of strings but big tables as well, which threshold score (or scores) to use and so much more.

To learn all of these, I invite you to read the follow-up article called FuzzyWuzzy — the Before and After.

I would love to hear about your experience with FuzzyWuzzy: when did you use it? For what purposes do you plan to use it in the future? What makes working with FuzzyWuzzy fun and comfortable for you?

Comment below and share your thoughts!

My first day at Behalf

Dorin Feigenbaum — Thu, 07 Jan 2021 15:26:34 +0000

They say the first impression is one's initial perception of another person. First impressions are also important when you come to your first day at a new job. The way the new environment perceives you, but also the way you see the new environment.

Prerequisite

My first impression of Behalf started before that. It started with a phone call from a really nice and energetic HR representative, followed by another from a happy-sounding manager. We scheduled an interview very quickly, right after that 2 second call. After the interview process, which was both efficient and quick with a lot of transparency, came my first day.
I can't say I wasn't nervous. Although I already started to learn and was preparing myself for the role, I was still not sure I was ready. I wasn't really sure what to expect. On a quick note, all the interviews were online, and I had never been to the office before my first day, since it was COVID time.

The first day!

Finally my first day arrived, and I went to the office. At first glance the office looked new to me. Everything is clearly labeled and the design is clean and high-tech. I was warmly welcomed by my manager (who is a lovely person and deserves her own post). She escorted me to my seat, so I could put down my stuff, and started a quick office tour.
From there we got to HR, where we met yet another amazing person. She welcomed me and the other new guy who came with me, with everything that we needed for our first day. She gave us a really comprehensive office tour, including all the coffee machines and how they work. We got our key to the elevator, got some documents signed and that’s it, I was an official Behalf worker.
The first time I saw my team was then, on Zoom. They had their daily meeting and I sat by my manager (with masks) saying hello. Their daily looked very pleasant and I couldn't wait to start working with these guys. I felt very welcomed.
The IT guy arrived by then, and off we went to get our computers. It was my first time with a Mac, and the IT guy was very patient in helping me. He helped me start up my computer and set up accounts for all sorts of things. Once I got to my place, he made sure I would get all the equipment I needed. He even brought me two mice to pick from and a very wide screen that I can take home (due to COVID).
After a Corona-time-long-distance lunch, which is always weird but somehow also pleasant, I finally got to open my brand new computer.
I already had about 50 emails in my new inbox. My manager structured my onboarding plan for the next 3 weeks, and scheduled all the meetings I had to go through in advance. She even included a field trip with the team the next day! I had some onboarding processes in my life, but I would have never expected a startup onboarding to be this thorough.
Waiting in my inbox was a very well-organized list of links and tips to help me get started. So I started the installation process and began getting to know my new Mac.

In conclusion

Overall, after the first day I felt both overwhelmed and excited with the quantity of materials that I was about to start and learn. On the other hand I was amazed by how things can be so simple and easy with just the right amount of effort. The longer I am here the more I see this implemented in all the aspects of my new work environment.
All in all, looks like first impressions worked very well, both for me and for Behalf. I can't wait to see what will be next.

UML Diagrams: Component Diagram Overview

Gene Zeiniss — Wed, 06 Jan 2021 14:19:58 +0000

This article was originally published on Medium.

Think of pizza. The imagined smell of saucy, cheesy, spicy pizza baking in the oven is overwhelming me. Sometimes you want new toppings, other times, you want a different base. Many different ingredients create it. Each of these ingredients is a separate component, but they interact and complement one another. They all exist within the greater system, in this case, the pizza.

You got that straight. In this article, we will deal with UML’s Component Diagram.

Component Diagram

Component diagrams are used to visualize how a system’s components interact (gee!) and what relationships they have among them. For the purpose of UML, the term “component” refers to a module of classes that represent independent systems with the ability to interface with the rest of the system. As we’re able to identify these interfaces, we’re able to find parts of the system that can be replaced. In other words, we are able to find plugin components.

Finding plugin components allows us to reuse these components in other projects. It also helps us structure our work by dividing it among a few developers or independently running sub-teams.

Back to the pizza, but now more specifically. Think about Hawaiian pizza on a restaurant menu. Probably, you will not find the pizza’s preparation and cooking methods. The menu just states the pizza ingredients, such as ham, pineapple, and bacon pieces over a red sauce (to be perfectly specific). By listing all of the ingredients, the customer can get a better idea of whether or not they’d enjoy the pizza. It is the same for the Component diagram. Unlike the other diagrams, the component ones are focused on high-level structure and not their methods and specific implementations. It’s a kind of a bird’s-eye view of your software system.

...

When you are building a Component diagram, the first step is to identify the basic components used in the system.

Basic Component

A component is a logical unit block of the system, a slightly higher abstraction than classes. It’s a subsystem if you wish. When subsystems are connected, it creates a single system.

Let’s assume we want to create a small family Pizzeria system. Our basic components will be:

Pizzaiolo: a professional pizza maker, as claimed by Oxford.
Customer — the actual pizza’s consumer (pizza eater, the same one that possibly will enjoy Hawaiian pizza).
Waiter — is “The One Who Waits”. Never mind, I’m just playing associations.

In UML, a component is represented as another box with this puzzle symbol in the top right-hand corner.

Here you can see the basic components of our Pizzeria. Cool? Not really. This information is not enough to tell us what we really need to know. Each component has a particular relationship to the other component through the interface it provides. The existence of these interfaces is the most interesting part.

Let’s expand our diagram a little bit. We’ll add some adornments to our Customer component.

Take a look at the diagram. I’ll start by talking about the Order and Payment. When a customer comes to the pizzeria, first, he should select and order the pizza. Right? The pizza party usually finishes with paying the bill. In Component language, Order and Payment are interfaces that our Customer provides (realizes or implements). The purpose of a provided interface is to show that a component offers an interface for others to interact with. This kind of interface is represented by a solid line with a lollipop at the end and a name over it.

Now, to provide the specific payment, the customer needs to know how much he needs to pay. The half-circle (also called “socket”) at the end of the connecting line represents the required interface. The meaning of it is that our honest customer expects to receive a check, provided by some other component (Waiter!) to be able to achieve its responsibilities.

To summarize, the interface describes a group of operations used (required) or created (provided) by components.

...

Our basic components are defined. Next, identify all of the relevant libraries needed for your system.

Libraries and Third-Party Dependencies

Component diagrams do not purely focus on what you implement. All third-party implementation dependencies must also be identified and integrated into the diagram where relevant. I will expand on third-parties later.

...

Finally, come up with the connections found between all these components. See how components can plug together. This is exactly the rationale behind this design.

Connections Between Components

The components can be connected loosely or tightly, like a belt on trousers before (and after) eating pizza.

Let’s start by talking about how we can loosely connect different components together.

Loose Connection

It’s the main connection type of the Component diagram. This is where we can define the pluggable parts of our system. When one component’s provided interface matches another component’s required interface, it’s called an assembly relationship.

Here you can see Customer and Waiter are connected like a puzzle. The components are plugging together. But you can also unplug them easily. The connection between them is loose. You can pop out the Waiter service and replace it with something else, such as Shift Supervisor, or maybe the Pizzaiolo in the flesh will take the order. This is a replaceable piece of our system.
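
In code, a loose connection of this kind often shows up as a dependency on an interface rather than on a concrete class. The Python sketch below is only an analogy for the diagram: the CheckProvider protocol, the get_check method, and the fixed price are all hypothetical names and values invented for illustration.

```python
from typing import Protocol


class CheckProvider(Protocol):
    """Required interface (the "socket"): anything that can hand over a check."""

    def get_check(self, order: str) -> float:
        ...


class Waiter:
    """One provider of the check (the "lollipop")."""

    def get_check(self, order: str) -> float:
        return 12.5  # made-up price


class ShiftSupervisor:
    """A drop-in replacement that offers the same interface."""

    def get_check(self, order: str) -> float:
        return 12.5  # made-up price


class Customer:
    """Depends only on the CheckProvider interface, not on a concrete class."""

    def __init__(self, check_provider: CheckProvider) -> None:
        self.check_provider = check_provider

    def pay(self, order: str) -> str:
        amount = self.check_provider.get_check(order)
        return f"paid {amount:.2f} for {order}"


print(Customer(Waiter()).pay("hawaiian"))
print(Customer(ShiftSupervisor()).pay("hawaiian"))
```

Because Customer never names Waiter directly, swapping in ShiftSupervisor requires no change to Customer, which is the replaceability the assembly relationship expresses.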

Tight Connection

The tight connection between components is represented by a solid line between two components. It indicates that it’s not going to be so easy to replace one of these components. They are tightly coupled together. Let’s think of our example, assume that the customer can select pizza only from the pizza menu. His order is tightly dependent on the Menu component.

A Customer cannot provide the order without a Menu. Getting rid of one of them will probably require some rework to implement the replacement.

...

Basic Component Diagram

Now, let’s take a look at a complete, though rather small, Component diagram.

You can see that the diagram has some new elements that we have not yet looked at. First of all, we have an outer component here, called Pizzeria. It’s our whole system. The other components sitting inside of this box are parts that make up the Pizzeria. It consists of Pizzaiolo, Waiter, and Customer. The Customer is tightly connected to the pizzeria’s menu.

Another thing to look at is how the Pizzeria interacts with the outside world. Notice the small squares sitting on the boundary of the Pizzeria component. These squares are called ports, and each one specifies a separate interaction point between the component and its environment.

In the Pizzeria example, ports represent third parties, such as a Cashier Service, used by the Waiter; a Grocery Store, which provides the groceries required by the Pizzaiolo; and a Pizzeria Ranking Service, which depends on the reviews provided by the satisfied (or not) Customer.

Since we don’t implement these services, it’s enough to show only the interaction point and the dependency direction (incoming or outgoing). Dependencies are represented by dashed lines linking one component (or element) to another.

...

That’s all. Ciao ciao!

To look at some of the other UML diagrams I wrote about, head on over to my profile.

UML Diagrams: Class Diagram Overview

Gene Zeiniss — Wed, 06 Jan 2021 14:01:47 +0000

This article was originally published on Medium.

Decoding UML Diagrams: The Common Tongue of Software Modeling | by Gene Zeiniss | The Startup

Gene Zeiniss ・ Aug 8, 2024 ・
Medium

In my line of work, I’m often using UML diagrams. Daily, I daresay. Regardless, I never understood the subtleties of the diagram’s components: when to use a full arrow and when a hollow one, whether the line should be solid or dashed, and which direction it should point. I’m done living in ignorance. So, here I am, about to start the UML diagrams series of articles.
...

Briefly about UML

UML (Unified Modeling Language) is a common software engineering modeling language that is used to solve a wide variety of problems. It’s “the Common Tongue” of Westeros, for Game of Thrones fans. It helps you specify, visualize, and document models of software systems, including their structure and design, before coding (with an emphasis on “before”). As the old proverb says: “The carpenter measures twice and cuts once”.

There are several types of UML diagrams and each one of them serves a different purpose. The two broadest categories, which encompass all other types, are Structural and Behavioral diagrams.

Structural (or Static) diagrams visualize the system’s static structure through objects, attributes, operations, and relationships. I have a cat, Lola. She has some stuff, such as a food bowl, claw sharpener, and the poop-house. So, the static aspects of Lola encompass the existence of her stuff. By the way, Lola herself will be described as an object (from an object-oriented view, right?).

The Structural diagrams are: Class Diagram, Object Diagram, Component Diagram, Composite Structure Diagram, Package Diagram, Deployment Diagram, and Profile Diagram.

Behavioral (or Dynamic) diagrams are used to visualize the dynamic aspects of a system. The Behavioral category includes a few general types of behavior, specifically the Use Case Diagram, Activity Diagram, and State Machine Diagram; and types that represent the different aspects of interaction (aka “Interaction Diagrams”): Sequence Diagram, Communication Diagram, Timing Diagram, and Interaction Overview Diagram.
In other words, they show how the system interacts with external entities and users, how it responds to input or events, and what constraints it operates under. Here is an example of an interaction between me and my cat. Lola meows hysterically. It triggers me to put food in her bowl. She eats while keeping the “thank yous” to herself.

Not all of the 14 different types of UML diagrams are used regularly when documenting systems and/or architectures. I’m about to write about the most useful diagrams, in my humble opinion, and will start the series with Class Diagrams.
...

Class Diagram

Since most software being created nowadays is still based on the object-oriented programming paradigm, using Class diagrams to document the software turns out to be a common-sense solution. A Class Diagram describes the structure of a system by showing its classes, their attributes, methods, and the relations between them. It’s a vocabulary of the system, a common language between all members of the team.

Structural features (attributes) define what objects of the class “know”. Cat knows her eye color, coat, weight range and is acutely aware of her superiority.

Behavioral features (operations, aka methods) define what objects of the class “can do”. A cat can meow, eat, and poop (a very superficial idea of what a cat can do).

Classes are described as boxes in a Class Diagram. Each box has a title that represents the name of the class. Under the title, there are two sections:

The middle part of the box includes attributes. An attribute notation is attribute:type = defaultValue (optional)

The bottom part includes operations. An operation notation is operation(params):returnType

The following notations specify the visibility of a class member (i.e. any attribute or operation):

+ public
- private
# protected
~ package

Are we cool? This is a Cat class example. Let’s take a look at what we have here:

“Cat” is a class name (title).
The Cat has some public attributes, such as “eyes color” of type enum with default value “Yellow” (let’s say), and value object “coat” of type Coat.
Cats can meow and eat treats in public, however, they prefer to defecate in privacy (operations).

When you create a class diagram, omit details that aren’t required. I mean, try to keep the diagram clean by showing only relevant details. Class diagrams get cluttered really fast. Take a look at the middle part of the box: the displayed attributes are marked as public. It means that other classes have access to these attributes. As a notation, it saves me the need to indicate the obvious getters and setters we’d expect here. Verstehen?
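If it helps, here is roughly how that Cat box could translate to Java. Treat it as a sketch under assumptions: the EyeColor values other than Yellow, the contents of Coat, and the parameter and return types of the operations are all invented for illustration.

```java
// Hypothetical enum; the article only mentions the default value "Yellow".
enum EyeColor { YELLOW, GREEN, BLUE }

// Hypothetical value object; its contents are not described in the article.
class Coat {
    final String pattern;

    Coat(String pattern) {
        this.pattern = pattern;
    }
}

class Cat {
    // + eyesColor : EyeColor = YELLOW
    public EyeColor eyesColor = EyeColor.YELLOW;

    // + coat : Coat
    public Coat coat;

    // + meow() : String
    public String meow() {
        return "Meow!";
    }

    // + eat(treat : String) : String
    public String eat(String treat) {
        return "Ate " + treat;
    }

    // - defecate() : void (private, as cats prefer)
    private void defecate() {
        // Intentionally left empty.
    }
}
```

The comments above each member use the attribute and operation notation described earlier, with the visibility marker in front.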
...

Relationships Between Classes

A relationship is a general term covering the types of logical connections found on class and object diagrams. A class may be involved in one or more relationships with other classes or instances.
...

The class-level relationships cover the object-oriented paradigm key-factors: interface implementation (realization) and inheritance (generalization).

Realization (Interface implementation)

In UML modeling, the realization is a relationship between two model elements, in which one model element (the client) implements the behavior that the other model element (the supplier) specifies.

Now, the same stuff in human language. The Cat is a domestic species of the Felidae family. I could say, Cat is a realization of Felidae. Or, in an object-oriented language, Cat is a specific implementation of the Felidae interface.

Note that the interface box has only one part under the title. It lists the operations, which can have various implementations.

Class “Cat” implements the “Felidae” interface. We can recognize it by the connecting line between their boxes: a dashed line with a hollow arrow pointing back to the interface that’s being implemented.

Notice that in the Cat box, we are not showing any operations that are part of “Felidae”. Cat implements these operations, no need to duplicate them (keep diagrams as clean as possible, remember?).
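In Java, realization maps naturally onto the implements keyword. A minimal sketch follows; the hunt and purr operations are hypothetical, since the operations listed in the Felidae box aren’t spelled out in the text.

```java
// The supplier: specifies behavior without implementing it.
interface Felidae {
    String hunt();   // hypothetical operation
    String purr();   // hypothetical operation
}

// The client: one specific implementation of the Felidae interface.
class Cat implements Felidae {
    @Override
    public String hunt() {
        return "Cat stalks a toy mouse";
    }

    @Override
    public String purr() {
        return "Purrr";
    }
}
```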

Generalization (Inheritance)

Generalization represents an “IS A” relationship between a general class (Cat) and a more specific implementation of this class (Cat Breed). Russian Blue is a cat breed. Each specific cat breed acquires all the attributes and operations of a Cat, but can also have its own. For example, cats of the Scottish Fold breed can stand up on their hind legs.

The generalization relationship is notated in UML by a solid line with an enclosed hollow arrow, pointing back to the general (base) class.

Another couple of items to keep in mind here are the concepts of Abstract and Concrete Classes.

An abstract class is a class that we will never instantiate. In UML, this class’s name should be italicized.

A concrete class is a class that we actually instantiate. It’s a default class, without any special decorations of its title.
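Here is a small Java sketch of generalization together with the abstract/concrete distinction. Making Cat abstract is my assumption for the sake of the example, and the method names are invented.

```java
// The general (base) class. Declared abstract here as an assumption:
// we never instantiate a plain Cat, only a specific breed.
abstract class Cat {
    // Inherited by every breed.
    String meow() {
        return "Meow!";
    }

    // Each breed must say what it is.
    abstract String breed();
}

// A concrete class: it IS A Cat and can be instantiated.
class RussianBlue extends Cat {
    @Override
    String breed() {
        return "Russian Blue";
    }
}

// Another concrete class, with an operation of its own.
class ScottishFold extends Cat {
    @Override
    String breed() {
        return "Scottish Fold";
    }

    String standOnHindLegs() {
        return "Standing up";
    }
}
```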

...
The next thing we want to look at in our Class Diagram is modeling relationships.

Basic Relationship (Association)

This association relationship is usually described as a “HAS A” relationship. It indicates that at least one of the two related classes refers to the other. This relationship is represented simply by a solid line (without an arrow) that connects two classes.

Now, often, it’s important to understand how many items on each side of a relationship can exist, which we call “cardinality”. The cardinality is expressed in terms of: “one to one”, “one to many” and “many to many”. In UML it’s notated as follows:

Take a look at this diagram:

In the current example, a cat has a human (the chosen one, right). The human has a few cats (of both the Russian Blue (wannabe) and Scottish Fold breeds; true story). It’s a “many to one” relationship.
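A “many to one” association like this one could be sketched in Java as follows. The adopt method and the accessor names are hypothetical; the point is only that each Cat refers to one Human, while a Human refers to many Cats.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class Human {
    // The "many" side: one human, a few cats.
    private final List<Cat> cats = new ArrayList<>();

    // Hypothetical helper that keeps both ends of the association in sync.
    void adopt(Cat cat) {
        cats.add(cat);
        cat.setHuman(this);
    }

    List<Cat> getCats() {
        return Collections.unmodifiableList(cats);
    }
}

class Cat {
    private final String name;

    // The "one" side: each cat has a single chosen human.
    private Human human;

    Cat(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    Human getHuman() {
        return human;
    }

    void setHuman(Human human) {
        this.human = human;
    }
}
```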

Aggregation Relationship

Aggregation is a variant of the “HAS A” relationship, but it’s more specific than an association. Aggregation represents a one-way association, called a “PART-WHOLE” relationship. I mean, one entity works as an owner (a whole) of another (a part of the whole), but the owned classes do not have a strong lifecycle dependency on the owner.

Back to our Cat. Each cat must have a poop-house. So, the implied relationship between “Cat” and “Poop-House” will be a “HAS A” type relationship. But the reverse may not be true. A poop-house doesn’t need to be owned by any cat. Now it’s easy to recognize that Cat is the owner entity and this is an aggregation relationship.

In UML, the connecting line has a hollow diamond, pointing to the owner class. The Poop-House is related to the Cat, but can exist without it.
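One common way to express aggregation in Java is to hand the part to the owner from the outside, so the part’s lifecycle isn’t controlled by the owner. This is a sketch under that assumption; the location field is made up for illustration.

```java
// The part: it can exist before, and without, any Cat.
class PoopHouse {
    private final String location; // hypothetical attribute

    PoopHouse(String location) {
        this.location = location;
    }

    String getLocation() {
        return location;
    }
}

// The owner (the whole): it references a PoopHouse it did not create.
class Cat {
    private final PoopHouse poopHouse;

    Cat(PoopHouse poopHouse) {
        this.poopHouse = poopHouse;
    }

    PoopHouse getPoopHouse() {
        return poopHouse;
    }
}
```

Because the PoopHouse is created outside the Cat, it is still around even when no Cat refers to it anymore.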

Composition Relationship

The composition is a “PART OF” relationship, a special type of aggregation, where parts don’t stand on their own so much. The owner class here is a kind of container, and related entities are its contents. All entities depend on each other. For example, “litter box and clumping litter are part of the poop-house”. The house cannot exist without the box or the litter, can it?

Here we have a solid line with a filled diamond, pointing to the owner class.
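In Java, composition is often sketched by letting the owner create its parts itself, so they live and die with it. The class bodies below are hypothetical placeholders.

```java
// Parts: empty placeholders for illustration.
class LitterBox { }

class ClumpingLitter { }

// The owner (container): it creates its parts itself.
class PoopHouse {
    private final LitterBox litterBox = new LitterBox();
    private final ClumpingLitter litter = new ClumpingLitter();

    LitterBox getLitterBox() {
        return litterBox;
    }

    ClumpingLitter getLitter() {
        return litter;
    }
}
```

Compare this with the aggregation case: there the part is passed in from outside, here nobody else ever constructs the parts.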

Uses Relationship (Dependency)

The Uses relationship is a somewhat softer tie between the classes. This relationship indicates that one class depends on another because it uses it at some point in time.

For example, other family members can communicate with the Cat and the Human (which the Cat doesn’t welcome much).

The Uses relationship notation is a dashed line with an open arrow, pointing to the used classes. It helps us to understand that changes in Cat or Human can affect Other Family Members, and so we should be careful. We’re good?
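A dependency usually shows up in Java as a parameter or local variable rather than a field. Here is a minimal sketch; the talkTo and reactTo methods are invented names.

```java
class Cat {
    String reactTo(String action) {
        return "Cat tolerates " + action;
    }
}

class Human {
    String reactTo(String action) {
        return "Human enjoys " + action;
    }
}

class OtherFamilyMember {
    // Cat and Human appear only as parameters. There are no fields,
    // so the class merely uses them at some point in time.
    String talkTo(Cat cat) {
        return cat.reactTo("small talk");
    }

    String talkTo(Human human) {
        return human.reactTo("small talk");
    }
}
```

If reactTo changes its signature in Cat or Human, OtherFamilyMember breaks, which is exactly the risk the dashed arrow warns about.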
...
We are done. Now, let’s check the completed overall view of the basic Class Diagram.

That’s all.
