Forem: Philemonkipkirui

The Ultimate Guide to Data Engineering.

Philemonkipkirui — Sun, 25 Aug 2024 20:52:22 +0000

Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale. It is a broad field with applications in just about every industry. This article provides a step-by-step guide to becoming a data engineer.
Most data engineers have a bachelor's degree in computer science or a related field, where fundamentals such as cloud computing, coding, and database design are taught.
The first step is to develop core data engineering skills:
Coding. Common languages in data engineering are SQL, Python, Java, R and Scala, along with the query interfaces of NoSQL databases. Proficiency in these languages is essential to this role.
Relational and non-relational databases. Databases rank among the most common solutions for data storage.
Extract, transform and load (ETL) systems. ETL is the process by which data is moved from databases and other sources into a single repository such as a data warehouse.
Big data tools. Data engineers often work with datasets that are too large or fast-moving for conventional tools. Tools and technologies keep evolving and vary by company, but some popular ones include Hadoop, MongoDB, and Kafka.
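The ETL item above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the CSV content, the `orders` table and the function names are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file, an API or a source database.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,carol,7.25
"""

def extract(text):
    """Extract: read raw rows from a CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy customer names, cast types and drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append((int(row["order_id"]), row["customer"].strip().title(), float(amount)))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a single target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)
```

Keeping the three stages as separate functions makes each one easy to test and swap out, for example replacing the CSV reader with a database query.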
The second step is getting certified. Certifications validate one's skills to employers. Common certifications include Big Data Engineer, Cloudera Certified Professional Data Engineer, IBM Certified Data Engineer and Google Cloud Certified Professional Data Engineer.
The third step is building a portfolio of data engineering projects. A portfolio is often a key component in a job search, as it shows recruiters, hiring managers and potential employers what you can do.

Feature Engineering. The Ultimate Guide.

Philemonkipkirui — Sun, 18 Aug 2024 19:14:04 +0000

We start by defining a feature. In data science and machine learning, a feature is an individual measurable property or characteristic of a data point that is used as input for a machine learning algorithm.
Feature engineering, on the other hand, refers to the process of transforming raw data into features that are suitable for machine learning models. The success of machine learning models depends heavily on the quality of the features used to train them.
To get into feature engineering, one must be conversant with these five areas:

Feature Creation. This is the generation of new features based on domain knowledge. A deep understanding of the intended field of application is required so that the new features are useful and meet the need at hand. Knowledge of mathematical operations such as mean, mode, median, sum and difference is essential.

Feature Transformation. This refers to the process of transforming a feature into a representation more suitable for the machine learning model. This is done by applying techniques such as normalization, scaling and encoding.

Feature Extraction. This refers to the creation of new features from existing ones, mostly to improve model performance or to broaden the usefulness of the features.

Feature Selection. This is the selection of a subset of relevant features from the dataset to be used in a machine learning model. There are several types of selection methods: filter methods, wrapper methods and embedded methods.

Feature Scaling. This refers to the process of transforming features so that they have a similar scale. Feature scaling is sometimes referred to as feature normalization. Commonly used techniques include min-max scaling and standardization (variance scaling).
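To make the transformation and scaling techniques above concrete, here is a minimal sketch in plain Python. The function names and the sample values are made up for illustration; libraries such as scikit-learn provide equivalent, more robust tools.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Min-max scaling: map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def one_hot(categories):
    """Encoding: turn each category into a 0/1 vector, one column per sorted level."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]

ages = [20, 30, 40, 50, 60]                # hypothetical numeric feature
colours = ["red", "blue", "red"]           # hypothetical categorical feature

print(min_max_scale(ages))                 # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(ages))                   # values centred on 0 with unit spread
print(one_hot(colours))                    # [[0, 1], [1, 0], [0, 1]]
```

Min-max scaling keeps the shape of the distribution but is sensitive to outliers, while standardization is usually preferred when a model assumes roughly centred inputs.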

Exploratory Data Analysis.

Philemonkipkirui — Sun, 11 Aug 2024 20:06:29 +0000

Exploratory Data Analysis (EDA) is a technique for investigating and analysing data sets in order to extract important features and trends. Insights from EDA are essential for machine learning and deep learning models in data science. EDA normally comes after cleaning (removing discrepancies in the data) and after analysts have a proper understanding of the data sets.
EDA provides a better understanding of the data, summarizes its main characteristics and, most importantly, uncovers relationships between variables. It is also used to check the assumptions required for model fitting and hypothesis testing.
There are four major types of EDA: univariate non-graphical, univariate graphical, multivariate non-graphical and multivariate graphical.
Univariate Non-graphical. This is the simplest form of data analysis; as the name suggests, just one variable is considered. The main goal is to understand the underlying sample distribution and make observations about the population. It involves determining central tendency (mean, mode and median), spread and skewness.
Univariate Graphical. This involves applying graphical tools and techniques to provide a full "picture" of the single-variable data set. Graphics used include stem-and-leaf plots, box plots, histograms and quantile-normal plots.
Multivariate Non-graphical. These techniques show relationships between two or more variables using either cross-tabulation (a two-way table whose column headings are the levels of one variable and whose row headings are the levels of the other) or statistics.
Multivariate Graphical. Here, graphics are used to display the relationship between two or more sets of data. The outcome can depend on several variables, and there can also be multiple explanatory (change-causing) variables. Some common types of multivariate graphics include scatter plots, multivariate charts, run charts, bubble charts and heat maps.
To perform exploratory data analysis, Python is commonly used together with its numerous mathematical and visualization libraries.
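As a rough illustration of the non-graphical techniques, the sketch below uses only Python's standard library. The sample data and the helper names (`describe`, `skewness`, `crosstab`) are hypothetical; in practice pandas offers similar summaries, and plotting libraries cover the graphical types.

```python
from collections import Counter
from statistics import mean, median, mode, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

def describe(values):
    """Univariate non-graphical EDA: central tendency, spread and skewness."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "std": pstdev(values),
        "skew": skewness(values),
    }

def crosstab(xs, ys):
    """Multivariate non-graphical EDA: count how often each (x, y) pair occurs."""
    return Counter(zip(xs, ys))

scores = [2, 3, 3, 4, 10]                    # hypothetical single variable
print(describe(scores))                      # positive skew: one large value pulls the mean up

region = ["north", "north", "south", "south", "south"]
bought = ["yes", "no", "yes", "yes", "no"]
for (r, b), count in sorted(crosstab(region, bought).items()):
    print(r, b, count)
```

A mean noticeably above the median, as in the sample scores, is a quick hint that the distribution is right-skewed and worth plotting.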

Data Science

Philemonkipkirui — Sun, 11 Aug 2024 19:07:36 +0000

Data science is a field concerned with uncovering actionable insights from data. This is done by combining mathematics, statistics, programming, advanced analytics, artificial intelligence and machine learning.
The life cycle of data science consists of the following stages:
Data ingestion. This involves the collection of raw data, either structured or unstructured, from various sources.
Data storage and processing. Depending on the format or nature of the data, specific storage requirements are considered. This stage also involves cleaning, deduplicating, transforming and combining data.
Data analysis. This involves examining data to identify patterns, biases, ranges and distributions.
Communication. This is the last stage of the life cycle, where the meaningful insights obtained are presented. Such insights allow for sound decision making.
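The stages above can be walked through on a toy example. The following is a sketch only: the two sources, the `temp_c` field and the function names are invented for illustration, and each function stands in for what would be a much larger step in a real project.

```python
from statistics import mean

# Ingestion: hypothetical raw records gathered from two sources.
source_a = [{"id": 1, "temp_c": "21.5"}, {"id": 2, "temp_c": "19.0"}]
source_b = [{"id": 2, "temp_c": "19.0"}, {"id": 3, "temp_c": None}]

def combine_and_clean(*sources):
    """Processing: combine sources, drop duplicates and missing values, cast types."""
    seen, cleaned = set(), []
    for record in (r for src in sources for r in src):
        if record["id"] in seen or record["temp_c"] is None:
            continue
        seen.add(record["id"])
        cleaned.append({"id": record["id"], "temp_c": float(record["temp_c"])})
    return cleaned

def analyse(records):
    """Analysis: summarise the range and average of the cleaned values."""
    temps = [r["temp_c"] for r in records]
    return {"count": len(temps), "min": min(temps), "max": max(temps), "mean": mean(temps)}

def report(summary):
    """Communication: turn the summary into a sentence a decision maker can read."""
    return (f"{summary['count']} readings, range {summary['min']}-{summary['max']} C, "
            f"mean {summary['mean']:.2f} C")

records = combine_and_clean(source_a, source_b)
print(report(analyse(records)))
```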
Education in data science is mostly provided in institutions of higher learning, either as an independent major or as a complement to majors such as economics, statistics and even engineering. Such courses are also offered by private entities whose objectives align with proficiency in data analysis and manipulation.
The tools commonly used for data science are mostly programming languages with prebuilt statistical modelling, machine learning and graphical capabilities. Common tools include RStudio and Python with its numerous libraries such as NumPy, pandas and Matplotlib.

Data Science

Philemonkipkirui — Sun, 04 Aug 2024 21:17:13 +0000