<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mary Sinaida Omukami</title>
    <description>The latest articles on Forem by Mary Sinaida Omukami (@somukamim).</description>
    <link>https://forem.com/somukamim</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1174609%2F915ecc9c-e153-4b60-a3aa-bf8d586ddf65.jpeg</url>
      <title>Forem: Mary Sinaida Omukami</title>
      <link>https://forem.com/somukamim</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/somukamim"/>
    <language>en</language>
    <item>
      <title>Topic modelling</title>
      <dc:creator>Mary Sinaida Omukami</dc:creator>
      <pubDate>Thu, 30 Nov 2023 13:40:55 +0000</pubDate>
      <link>https://forem.com/somukamim/topic-modelling-5c90</link>
      <guid>https://forem.com/somukamim/topic-modelling-5c90</guid>
<description>&lt;p&gt;&lt;strong&gt;What is topic modelling?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Topic modeling is a type of statistical modeling that uses unsupervised machine learning to identify clusters of similar words within a body of text. It analyzes documents to identify common themes and group them into coherent clusters. In Natural Language Processing, topic modeling identifies and extracts abstract topics from large collections of text documents. The two main approaches are Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latent Semantic Analysis&lt;/strong&gt;&lt;br&gt;
LSA is based on the principle that words that are close in meaning tend to be used together in context. It links words semantically by context and word frequency, automatically forming separate topics from the input documents. It assumes that documents sharing consistent word frequencies and ordering exhibit the same underlying patterns, which relates closely to how humans learn, understand and judge language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latent Dirichlet Allocation&lt;/strong&gt;&lt;br&gt;
Topic modelling uses algorithms such as Latent Dirichlet Allocation (LDA) to identify latent topics in text and represent documents as mixtures of these topics. LDA analyzes large text collections to categorize topics, provide valuable insights, and support better decision-making.&lt;/p&gt;

&lt;p&gt;The German mathematician Peter Gustav Lejeune Dirichlet gave his name to Dirichlet processes, which in probability theory are “a family of stochastic processes whose realizations are probability distributions.”&lt;/p&gt;

&lt;p&gt;The Dirichlet model describes patterns of words that frequently occur together and are similar to each other. This stochastic process uses Bayesian inference to encode “the prior knowledge about the distribution of random variables”, estimating the chance that words spread across a document will occur again. The model builds data points and estimates probabilities, making LDA a generative probabilistic model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The LDA makes two key assumptions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Documents are a mixture of topics, and&lt;/li&gt;
&lt;li&gt;Topics are a mixture of tokens (or words)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In statistics, the documents are known as the probability density (or distribution) of topics, and the topics are the probability density (or distribution) of words.&lt;/p&gt;
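
&lt;p&gt;To see these two assumptions in action, here is a minimal sketch using scikit-learn's LatentDirichletAllocation on a tiny made-up corpus (the documents and the choice of two topics are illustrative assumptions):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal LDA sketch; the corpus and topic count are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "cats and dogs are popular pets",
    "dogs chase cats around the yard",
    "stock prices rose as markets rallied",
    "investors watch markets and stock prices",
]

# Turn documents into word-count vectors.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Fit LDA with two latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is a document's distribution over topics.
print(doc_topics.round(2))

# Top words per topic (topics are distributions over words).
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-3:]]
    print(f"Topic {i}: {top}")
&lt;/code&gt;&lt;/pre&gt;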

&lt;p&gt;&lt;strong&gt;Topic modeling applications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document classification&lt;/strong&gt;&lt;br&gt;
As an unsupervised machine learning technique, topic modeling uses Natural Language Processing to understand the context of new documents and label them. It automatically tags each document with the topic it most closely resembles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyzing customer feedback&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With so many reviews, analyzing customer feedback manually can be cumbersome, and sifting through huge quantities of feedback is costly and time-consuming. With topic models, you can evaluate customer feedback at scale.&lt;/p&gt;

&lt;p&gt;Customer feedback is assessed and labels are created based on what customers say.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Data Engineering for Beginners: A Step-by-Step Guide</title>
      <dc:creator>Mary Sinaida Omukami</dc:creator>
      <pubDate>Thu, 26 Oct 2023 12:34:04 +0000</pubDate>
      <link>https://forem.com/somukamim/data-engineering-for-beginners-a-step-by-step-guide-47ec</link>
      <guid>https://forem.com/somukamim/data-engineering-for-beginners-a-step-by-step-guide-47ec</guid>
<description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Even before a data scientist's analysis begins, there is the data engineering part. Data engineers are a vital part of any data science project: the engineer puts a framework in place for the data scientist. Data engineering is defined less by a fixed career path and more by the skills one needs in order to practice it.&lt;/p&gt;

&lt;p&gt;Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale.&lt;br&gt;
A data engineer is a technology professional who builds storage solutions for vast amounts of data. The ability to design and build data warehouses is among the top skills clients look for in data engineers. Data warehouses reduce the cost and effort of the tasks a data scientist performs.&lt;br&gt;
ETL (Extract, Transform, Load) describes the steps a data engineer follows to build data pipelines. ETL is a blueprint for how collected raw data is processed and transformed into data ready for analysis.&lt;/p&gt;
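
&lt;p&gt;As a rough illustration of the ETL blueprint (the file, table and column names below are hypothetical), a small pipeline with pandas could look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal ETL sketch; file paths and column names are hypothetical.
import pandas as pd
import sqlite3

# Extract: read raw data from a source file.
raw = pd.read_csv("raw_orders.csv")

# Transform: clean and reshape into analysis-ready form.
clean = raw.dropna(subset=["order_id"])
clean["order_date"] = pd.to_datetime(clean["order_date"])
clean["total"] = clean["quantity"] * clean["unit_price"]

# Load: write the result into a warehouse table.
conn = sqlite3.connect("warehouse.db")
clean.to_sql("orders", conn, if_exists="replace", index=False)
conn.close()
&lt;/code&gt;&lt;/pre&gt;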

&lt;p&gt;&lt;strong&gt;Roles in Data Engineering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Work on Data Architecture&lt;/strong&gt;&lt;br&gt;
They use a systematic approach to plan, create, and maintain data architectures while keeping them aligned with business requirements. This role requires knowledge of tools like SQL, XML, Hive, Pig, Spark, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database Administrator&lt;/strong&gt;&lt;br&gt;
A person working in this role requires extensive knowledge of databases. Responsibilities entail ensuring the databases are available to all required users, are maintained properly, and function seamlessly when new features are added.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Engineer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data engineers don’t rely on theoretical database concepts alone. They must have the knowledge and prowess to work in any development environment, whatever the programming language. Similarly, they must keep themselves up-to-date with machine learning and its algorithms, such as random forests, decision trees, and k-means. They are proficient in analytics tools like Tableau, Knime, and Apache Spark, and use these tools to generate valuable business insights for all types of industries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Engineer Roles and Responsibilities&lt;/strong&gt;&lt;br&gt;
Here is the list of some roles and responsibilities a data engineer might be expected to perform:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Work on Data Architecture&lt;/strong&gt;&lt;br&gt;
They use a systematic approach to plan, create, and maintain data architectures while keeping them aligned with business requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Collect Data&lt;/strong&gt;&lt;br&gt;
Before initiating any work on the database, they have to obtain data from the right sources. After formulating a set of data-handling processes, data engineers store the data in optimized form.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Conduct Research&lt;/strong&gt;&lt;br&gt;
Data engineers conduct research in the industry to address any issues that can arise while tackling a business problem. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Improve Skills&lt;/strong&gt;&lt;br&gt;
Data engineers must keep themselves up-to-date with machine learning and its algorithms, such as random forests, decision trees, and k-means.&lt;/p&gt;

&lt;p&gt;They should be proficient in analytics tools like Tableau, Knime, and Apache Spark. They use these tools to generate valuable business insights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills Required to Become a Data Engineer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. SQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SQL serves as the fundamental skill set for data engineers. You cannot manage a relational database management system without mastering SQL. You will need to work through an extensive range of queries and learn how to issue optimized ones.&lt;/p&gt;
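
&lt;p&gt;As a small illustration of issuing queries (the table and data here are hypothetical), Python's built-in sqlite3 module lets you practice SQL without installing a server:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical example: issuing SQL from Python via sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 50.0)],
)

# Aggregate query: total sales per region.
for row in cur.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
):
    print(row)
conn.close()
&lt;/code&gt;&lt;/pre&gt;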

&lt;p&gt;&lt;strong&gt;2. Data Warehousing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Get a grasp of building and working with a data warehouse. Data warehousing helps data engineers aggregate unstructured data collected from multiple sources, which is then compared and assessed to improve the efficiency of business operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Data Architecture&lt;/strong&gt;&lt;br&gt;
Data engineers must have the required knowledge to build complex database systems for businesses. Data architecture covers the operations used to handle data in motion, data at rest, datasets, and the relationships between data-dependent processes and applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Coding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To link your database and work with all types of applications – web, mobile, desktop, IoT – you must improve your programming skills. Learn an enterprise language like Java or C#. The former is useful in open source tech stacks, while the latter can help you with data engineering in a Microsoft-based stack. The most necessary ones are Python and R.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Operating System&lt;/strong&gt;&lt;br&gt;
You need to become well-versed in operating systems like UNIX, Linux, Solaris, and Windows. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Apache Hadoop-Based Analytics&lt;/strong&gt;&lt;br&gt;
Apache Hadoop is an open-source platform for distributed processing and storage of large datasets. Hadoop-based tools assist in a wide range of operations, such as data processing, access, storage, governance, security, and operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to Become a Data Engineer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Below are some of the ways one can use to become a data engineer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Certifications&lt;/strong&gt;&lt;br&gt;
Consider obtaining certifications in data engineering, such as AWS Certified Big Data - Specialty, Google Cloud Professional Data Engineer, or Microsoft Certified: Azure Data Engineer Associate. This will help you to demonstrate your expertise to potential employers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Education&lt;/strong&gt;&lt;br&gt;
Most data engineering roles require a bachelor's degree in computer science, software engineering, or a related field. A degree in mathematics or statistics can also be helpful. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build a Portfolio of Data Engineering Projects&lt;/strong&gt;&lt;br&gt;
Gain hands-on experience working on data engineering projects. You can start with open-source projects or participate in hackathons and coding competitions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical Skills&lt;/strong&gt;&lt;br&gt;
You need to be proficient in programming languages like Python, Java, and SQL. You must also be familiar with big data technologies like Hadoop, Spark, and Kafka, and have experience with cloud computing platforms.&lt;/p&gt;

&lt;p&gt;In conclusion, data engineering, like any other data field, requires grit, a willingness to learn, and persistence. In the long run it will pay off.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>data</category>
    </item>
    <item>
      <title>The Complete Guide to Time Series Models</title>
      <dc:creator>Mary Sinaida Omukami</dc:creator>
      <pubDate>Fri, 20 Oct 2023 12:33:37 +0000</pubDate>
      <link>https://forem.com/somukamim/the-complete-guide-to-time-series-models-28g3</link>
      <guid>https://forem.com/somukamim/the-complete-guide-to-time-series-models-28g3</guid>
<description>&lt;p&gt;A time series is a series of data points ordered in time, where time is the independent variable; a time series model is built on such data and used to analyze it and forecast the future.&lt;br&gt;
Time series analysis is about inferring what has happened to a series of data points in the past and attempting to predict what will happen to it in the future.&lt;br&gt;
Characteristics of time series models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stationarity&lt;/li&gt;
&lt;li&gt;Seasonality&lt;/li&gt;
&lt;li&gt;Autocorrelation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can use time series models to produce forecasts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Characteristics of Time Series Models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autocorrelation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Autocorrelation is the degree of correlation of the same variable between two successive time intervals. It measures how a lagged version of a variable's values relates to the original version in a time series. For a time series, the autocorrelation is the correlation of that series with itself at two different points in time, separated by what is known as a lag. In other words, we measure the time series against a lagged version of itself.&lt;/p&gt;
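
&lt;p&gt;As a quick sketch (the series below is made up), pandas can measure this correlation between a series and its lagged self directly:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Autocorrelation at several lags for a toy series.
import pandas as pd

s = pd.Series([3, 4, 6, 7, 9, 10, 12, 13, 15, 16])
for lag in (1, 2, 3):
    print(lag, round(s.autocorr(lag=lag), 3))
&lt;/code&gt;&lt;/pre&gt;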

&lt;p&gt;&lt;strong&gt;Seasonality&lt;/strong&gt;&lt;br&gt;
Time series data may contain seasonal variation. Seasonal variation, or seasonality, refers to cycles that repeat regularly over time: a repeating pattern within each year, or more generally within any fixed period.&lt;/p&gt;
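
&lt;p&gt;As a sketch (the monthly numbers are invented), statsmodels can decompose a series into trend, seasonal and residual components, making the repeating pattern explicit:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Splitting a toy series into trend, seasonal and residual parts.
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Two years of monthly data with a repeating within-year pattern.
data = pd.Series(
    [10, 12, 15, 13, 11, 9, 8, 9, 11, 13, 14, 12] * 2,
    index=pd.date_range("2021-01-01", periods=24, freq="MS"),
)
parts = seasonal_decompose(data, model="additive", period=12)
print(parts.seasonal.head(12))  # the repeating seasonal component
&lt;/code&gt;&lt;/pre&gt;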

&lt;p&gt;&lt;strong&gt;Stationarity&lt;/strong&gt;&lt;br&gt;
A stationary time series is one whose properties do not depend on the time at which the series is observed. This does not mean that the series never changes over time, just that the way it changes does not itself change over time. Time series with trends or seasonality are therefore not stationary, since the trend and seasonality affect the value of the series at different times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing Whether a Process is Stationary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visualizations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most basic methods for stationarity detection rely on plotting the data and visually checking for trend and seasonal components. Determining whether a stationary process generated a time series just by looking at its plot is difficult. However, there are some basic properties of non-stationary data that we can look for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Statistical Tests&lt;/strong&gt;&lt;br&gt;
One statistical test we will go into is the Augmented Dickey-Fuller test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Augmented Dickey-Fuller (ADF) Test&lt;/strong&gt;&lt;br&gt;
Statistical tests make strong assumptions about the available data. They can be used to inform whether a null hypothesis can be rejected or not.&lt;/p&gt;

&lt;p&gt;They provide evidence that the time series is stationary or non-stationary. The ADF test evaluates the null hypothesis that a unit root is present, i.e. that the process is non-stationary. If the p-value is large (conventionally above 0.05), we fail to reject the null and treat the process as non-stationary; if the p-value is small (near 0), we reject the null and treat the process as stationary.&lt;/p&gt;
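
&lt;p&gt;Here is a minimal sketch of the ADF test using statsmodels (the random-walk series and the 0.05 significance level are illustrative assumptions):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Augmented Dickey-Fuller test with statsmodels.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = rng.normal(size=200).cumsum()  # a random walk: non-stationary

stat, pvalue = adfuller(series)[:2]
print("ADF statistic:", round(stat, 3))
print("p-value:", round(pvalue, 3))
if pvalue &amp;lt; 0.05:
    print("Reject the null: series looks stationary")
else:
    print("Fail to reject the null: series looks non-stationary")
&lt;/code&gt;&lt;/pre&gt;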

&lt;p&gt;&lt;strong&gt;How to Build Time Series Models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most popular ways to model time series are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Moving average.&lt;/li&gt;
&lt;li&gt;Exponential smoothing.&lt;/li&gt;
&lt;li&gt;Double exponential smoothing.&lt;/li&gt;
&lt;li&gt;Triple exponential smoothing.&lt;/li&gt;
&lt;li&gt;Seasonal autoregressive integrated moving average (SARIMA).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Moving Averages&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The moving average is a common approach for modeling univariate time series. The moving-average (MA) model specifies that the output variable depends linearly on the current and past values of a stochastic (imperfectly predictable) error term.&lt;/p&gt;
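
&lt;p&gt;As a simple sketch (the numbers are made up), a rolling mean in pandas shows the everyday smoothing sense of a moving average, which is distinct from the MA error-term model described above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Simple moving average (rolling mean) with pandas.
import pandas as pd

s = pd.Series([10, 12, 13, 12, 15, 16, 18, 17, 19, 21])
print(s.rolling(window=3).mean())
&lt;/code&gt;&lt;/pre&gt;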

&lt;p&gt;&lt;strong&gt;Exponential Smoothing&lt;/strong&gt;&lt;br&gt;
Exponential smoothing is a method for forecasting univariate time series data. It is based on the principle that a prediction is a weighted linear sum of past observations or lags. The method assigns exponentially decreasing weights to past observations: less importance is given to observations the further they are from the present.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Double Exponential Smoothing&lt;/strong&gt;&lt;br&gt;
Double exponential smoothing, also known as Holt's trend model or second-order exponential smoothing, is used in time-series forecasting when the data has a linear trend but no seasonal pattern. The basic idea is to introduce a term that accounts for the possibility of the series exhibiting a trend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Triple Exponential Smoothing&lt;/strong&gt;&lt;br&gt;
Triple exponential smoothing is used when there is a trend in the data along with seasonal variations. This method is based on three smoothing equations: level (the stationary component), trend, and seasonality. Both the trend and seasonal components can be additive or multiplicative.&lt;/p&gt;
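
&lt;p&gt;A sketch of all three smoothing variants using statsmodels (the toy series and the seasonal period of 4 are assumptions): simple smoothing handles the level only, double (Holt) adds trend, and triple (Holt-Winters) adds seasonality:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Single, double and triple exponential smoothing with statsmodels.
import pandas as pd
from statsmodels.tsa.holtwinters import (
    SimpleExpSmoothing,
    ExponentialSmoothing,
)

# Toy series with a trend and a period-4 seasonal pattern.
data = pd.Series([10, 14, 12, 16, 14, 18, 16, 20, 18, 22, 20, 24])

single = SimpleExpSmoothing(data).fit()                 # level only
double = ExponentialSmoothing(data, trend="add").fit()  # level + trend
triple = ExponentialSmoothing(
    data, trend="add", seasonal="add", seasonal_periods=4
).fit()                                                 # + seasonality

print(triple.forecast(4))  # forecast the next 4 points
&lt;/code&gt;&lt;/pre&gt;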

&lt;p&gt;&lt;strong&gt;Seasonal Autoregressive Integrated Moving Average Model (SARIMA)&lt;/strong&gt;&lt;br&gt;
SARIMA is an effective and popular time series model for predicting future values of time series data. It is a useful tool for predicting recurring trends since it is specifically developed to capture seasonality. SARIMA models combine autoregressive (AR) models, moving average (MA) models, and differencing.&lt;/p&gt;
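
&lt;p&gt;As a hedged sketch, statsmodels exposes SARIMA through its SARIMAX class; the toy data and the (p, d, q) and seasonal orders below are illustrative, not tuned:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Fitting a SARIMA model with statsmodels (orders are illustrative).
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

data = pd.Series([112, 118, 132, 129, 121, 135, 148, 148,
                  136, 119, 104, 118, 115, 126, 141, 135])

# order=(p, d, q), seasonal_order=(P, D, Q, s) with s the season length.
model = SARIMAX(data, order=(1, 1, 1), seasonal_order=(1, 0, 0, 4))
fitted = model.fit(disp=False)
print(fitted.forecast(steps=4))
&lt;/code&gt;&lt;/pre&gt;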

&lt;p&gt;&lt;strong&gt;Application of Time Series Models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Determining patterns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Time series analysis helps organizations understand the underlying causes of trends or systemic patterns over time. Using data visualizations, business users can see seasonal trends and dig deeper into why these trends occur.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forecasting and future trends&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ultimate goal of time series forecasting is to use historical data to understand future outcomes. Uses include making better strategic business decisions, anticipating shifting trends, and pivoting approaches accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detecting anomalies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anomaly detection in time series has become an increasingly vital task, with applications such as fraud detection and intrusion monitoring. Tackling this problem requires an array of approaches, including statistical analysis, machine learning, and deep learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples of Forecasting with time series models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Healthcare&lt;/strong&gt;&lt;br&gt;
Time series models can be used to monitor the spread of diseases by observing how many people transmit a disease and how many people die after being infected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agriculture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Time series models take into account seasonal temperatures, the number of rainy days each month and other variables over the course of years, allowing agricultural workers to assess environmental conditions and ensure a successful harvest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finance&lt;/strong&gt;&lt;br&gt;
Financial analysts can leverage time series models to track sales numbers for each month and predict potential stock market behavior.&lt;br&gt;
&lt;strong&gt;Retail&lt;/strong&gt;&lt;br&gt;
Retailers may apply time series models to study how other companies’ prices and the number of customer purchases change over time, helping them optimize prices.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>data</category>
    </item>
    <item>
      <title>Introduction to data modeling.</title>
      <dc:creator>Mary Sinaida Omukami</dc:creator>
      <pubDate>Sun, 15 Oct 2023 18:16:39 +0000</pubDate>
      <link>https://forem.com/somukamim/introduction-to-data-modeling-2fo6</link>
      <guid>https://forem.com/somukamim/introduction-to-data-modeling-2fo6</guid>
<description>&lt;p&gt;Data modelling is the process of creating a conceptual representation of data objects and their relationships to one another. Data models are built around business needs. The process begins by collecting information about business requirements from stakeholders and end users; these requirements are then translated into data structures to formulate a concrete database design. A data model facilitates a deeper understanding of what is being designed, and data models evolve along with changing business needs.&lt;br&gt;
Types of data models:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conceptual data models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This data model defines what the system inherently contains and offers an abstract view, including entity classes, their characteristics and constraints, the relationships between them, and relevant security and data integrity requirements. Business stakeholders and data architects typically create conceptual data models with the intent to organize and define various business concepts and rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logical data models&lt;/strong&gt;&lt;br&gt;
 The goal of creating a logical data model is to develop a highly technical map of underlying rules and data structures. It provides detail about concepts and relationships. Logical data models don’t specify any technical system requirements. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Physical data models&lt;/strong&gt;&lt;br&gt;
The physical data model pertains to how the system will be implemented and factors in the specific database management system. This model, typically created by developers, defines how the actual database will be built and used for business purposes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Modeling process&lt;/strong&gt;&lt;br&gt;
Data modelling techniques provide formalized workflows that include a sequence of tasks to be performed in an iterative manner. Those workflows include:&lt;/p&gt;

&lt;p&gt;1. Identify the entities: the things, events or concepts represented in the relevant data set. Each entity should be cohesive and logically discrete from all others.&lt;/p&gt;

&lt;p&gt;2. Identify key properties of each entity. Each entity type can be differentiated from all others because it has one or more distinguishing attributes.&lt;/p&gt;

&lt;p&gt;3. Identify relationships among entities. Specify the nature of the relationships each entity has with the others. These relationships are usually documented via the Unified Modeling Language (UML).&lt;/p&gt;

&lt;p&gt;4. Map attributes to entities completely, to ensure the model reflects how the business will use the data.&lt;/p&gt;

&lt;p&gt;5. Assign keys as needed, and decide on a degree of normalization that balances the need to reduce redundancy against performance requirements. Normalization tends to reduce the amount of storage space a database will require, but it can come at a cost to query performance.&lt;/p&gt;

&lt;p&gt;6. Finalize and validate the data model. Data modeling is an iterative process that should be repeated and refined as business needs change.&lt;/p&gt;
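
&lt;p&gt;To make steps 1 to 5 concrete, here is a hypothetical physical model for two entities (customers and orders), with a primary key per entity and a foreign key capturing their one-to-many relationship:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical physical model for two entities and their relationship.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- key attribute of the entity
    name        TEXT NOT NULL,
    email       TEXT UNIQUE            -- constraint from the model
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    order_date  TEXT,
    total       REAL
);
""")
conn.close()
&lt;/code&gt;&lt;/pre&gt;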

&lt;p&gt;&lt;strong&gt;Types of data modelling&lt;/strong&gt;&lt;br&gt;
Below are several model types:&lt;/p&gt;

&lt;p&gt;Hierarchical data models represent one-to-many relationships in a treelike format. In this type of model, each record has a single root or parent which maps to one or more child tables.  Though this approach is less efficient than more recently developed database models, it’s still used in Extensible Markup Language (XML) systems and geographic information systems (GISs).&lt;/p&gt;

&lt;p&gt;Relational data models are still implemented today in the many different relational databases commonly used in enterprise computing. Relational data modeling doesn’t require a detailed understanding of the physical properties of the data storage being used. Data segments are explicitly joined through the use of tables, reducing database complexity. Relational databases frequently employ Structured Query Language (SQL) for data management.&lt;br&gt;
Entity-relationship (ER) data models use formal diagrams to represent the relationships between entities in a database. They are popularly used by data architects to create visual maps that convey database design objectives.&lt;/p&gt;

&lt;p&gt;In object-oriented data models, objects are grouped into class hierarchies and have associated features. Object-oriented databases can incorporate tables, but can also support more complex data relationships. This approach is employed in multimedia and hypertext databases as well as other use cases.&lt;/p&gt;

&lt;p&gt;Dimensional data models were designed to optimize data retrieval speeds for analytic purposes in a data warehouse. While relational and ER models emphasize efficient storage, dimensional models increase redundancy in order to make it easier to locate information for reporting and retrieval. This modeling is typically used across OLAP systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits of data modeling&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get everyone on the same page&lt;/li&gt;
&lt;li&gt;Clarify your project scope&lt;/li&gt;
&lt;li&gt;Improve data quality&lt;/li&gt;
&lt;li&gt;Save time and money&lt;/li&gt;
&lt;li&gt;Improve database performance&lt;/li&gt;
&lt;li&gt;Enable better documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data modeling tools&lt;/strong&gt;&lt;br&gt;
Numerous commercial and open source data modeling solutions are widely used today; below are examples:&lt;/p&gt;

&lt;p&gt;- erwin Data Modeler&lt;br&gt;
- Enterprise Architect&lt;br&gt;
- ER/Studio&lt;br&gt;
- Free data modeling tools include open source solutions such as Open ModelSphere.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>learning</category>
      <category>datascience</category>
      <category>data</category>
    </item>
    <item>
      <title>Exploratory Data Analysis using Data Visualization Techniques.</title>
      <dc:creator>Mary Sinaida Omukami</dc:creator>
      <pubDate>Fri, 06 Oct 2023 13:14:24 +0000</pubDate>
      <link>https://forem.com/somukamim/exploratory-data-analysis-using-data-visualization-techniques-32ap</link>
      <guid>https://forem.com/somukamim/exploratory-data-analysis-using-data-visualization-techniques-32ap</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is exploratory data analysis?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is an approach to analyzing data sets that summarizes their main characteristics using statistical graphics and other data visualization methods. This involves inspecting the dataset from many angles, describing &amp;amp; summarizing it without making any assumptions about its contents. It's an important step to take before diving into statistical modelling and machine learning models.&lt;/p&gt;

&lt;p&gt;Raw data can contain outliers, missing values and skewed distributions; used as-is, it can produce inaccurate models even when the right model is applied, simply because the data wasn't thoroughly worked on during the preparation stage.&lt;br&gt;
Exploratory data analysis has four steps, as illustrated below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Data Cleaning: Handling missing values, removing outliers, and ensuring data quality. When you get data in its raw form you should inspect it to see whether it's actually clean; human error in the collection process often introduces quality issues that need to be addressed. You can check whether the data has outliers by making box plots, and remove them using the statistical z-score method. Missing values are checked using isnull(). For unguessable values like a missing name you may need to delete the whole row; for columns with numeric continuous variables you can impute the mean, median or mode of the column (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Exploration: Examining summary statistics, visualizing data distributions, and identifying patterns. Use the df.describe() function to get descriptive statistics, and visualize distributions with univariate plots and bivariate/multivariate plots that analyze one variable or compare two or more variables. From the visualizations and summary statistics you can identify patterns in the dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feature Engineering: Transforming variables, creating new features, or selecting relevant variables for analysis. This includes converting variables to the correct data types and creating features that will be important to the model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Visualization: Presenting insights through plots, charts, and graphs to communicate findings effectively. Keep it simple, choose the right visuals, provide context and make it actionable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
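
&lt;p&gt;Here is a minimal sketch of steps 1 and 2 with pandas (the DataFrame is made up, and the z-score cutoff of 2 is chosen for this tiny sample; 3 is a common default):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Cleaning and exploring a toy dataset with pandas.
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 25, None, 24, 99, 22],
    "income": [40, 42, 41, None, 43, 39],
})

# Step 1: inspect missing values, then impute with the median.
print(df.isnull().sum())
df = df.fillna(df.median(numeric_only=True))

# Step 1: drop z-score outliers (cutoff of 2 for this tiny sample).
z = (df - df.mean()) / df.std()
df = df[(z.abs() &amp;lt; 2).all(axis=1)]

# Step 2: summary statistics for the cleaned data.
print(df.describe())
&lt;/code&gt;&lt;/pre&gt;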

&lt;p&gt;Below are examples of tools that are useful for EDA:&lt;br&gt;
1. Box plot&lt;br&gt;
2. Histogram&lt;br&gt;
3. Multi-vari chart&lt;br&gt;
4. Run chart&lt;br&gt;
5. Scatter plot (2D/3D)&lt;br&gt;
6. Stem-and-leaf plot&lt;br&gt;
7. Odds ratio&lt;br&gt;
8. Heat map&lt;br&gt;
9. Bar chart&lt;br&gt;
10. Horizon graph&lt;/p&gt;

&lt;p&gt;In summary, exploratory data analysis is an important stage in the data analysis process: it can determine whether you get accurate final results from your analysis.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>learning</category>
      <category>data</category>
    </item>
    <item>
      <title>How to become a data scientist in 2023/24.</title>
      <dc:creator>Mary Sinaida Omukami</dc:creator>
      <pubDate>Mon, 02 Oct 2023 17:51:06 +0000</pubDate>
      <link>https://forem.com/somukamim/how-to-become-a-data-scientist-in-202324-2df5</link>
      <guid>https://forem.com/somukamim/how-to-become-a-data-scientist-in-202324-2df5</guid>
      <description>&lt;p&gt;&lt;strong&gt;Who is a data scientist?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is someone who uses data to understand and explain the phenomena around them and to help organizations make better decisions.&lt;br&gt;
A data scientist has to have hard and soft skills. The hard skills include Python, R, SQL, statistics and math, data visualization, machine learning, big data and cloud computing, while the soft skills include communication, storytelling, problem solving and critical thinking.&lt;/p&gt;

&lt;p&gt;As a beginner you can follow the steps highlighted below if you want to become a data scientist:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Learn how to code if you don't.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditionally, data science roles do require coding skills. Python and R are the most common languages used in the data field due to their versatility. Data is stored in databases, and SQL is a popular language for querying them, so learning it will benefit you whenever you interact with them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Practice your math and machine learning skills.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Math is an important part of data science because it underpins machine learning algorithms and the insights drawn from them. You will need to cover linear algebra, calculus and statistics. You will interact with probability, regression modelling, matrices and vectors, so studying those topics will help you understand real-life data science applications better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Understand databases.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data scientists work with data, and most organizations store their data in databases. You will need to design, create, and interact with databases on most of the projects you work on, so it's important for a data scientist to understand databases and know how to connect to and interact with data using programming languages like Python.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Learn how to wrangle data, visualize it and make reports.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To make sense of data, for yourself and for others, you need to wrangle it, which involves cleaning and organizing raw data into data you can work with.&lt;br&gt;
To present your analysis in a way others can understand, you'll need to make good reports and visualizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Do lots of practice and get some experience.&lt;/strong&gt;&lt;br&gt;
Practice by working on projects and developing data science skills by solving real-world problems. Take an internship to be able to showcase what you've learned and applied in your projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Engage with the tech community.&lt;/strong&gt;&lt;br&gt;
Those already working in data science can offer insights on what the industry is like and how to maneuver within it, through resources like blogs, articles, webinars, and online courses.&lt;br&gt;
Participate in challenges and keep learning what is new in your field.&lt;/p&gt;

&lt;p&gt;Alternatively, there is the route of doing a degree or postgraduate degree in data science.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>learning</category>
      <category>data</category>
    </item>
  </channel>
</rss>
