<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Lians </title>
    <description>The latest articles on Forem by Lians  (@lians).</description>
    <link>https://forem.com/lians</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F794727%2Fa6140651-7985-4f9e-b682-7b48ebc1303d.jpeg</url>
      <title>Forem: Lians </title>
      <link>https://forem.com/lians</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lians"/>
    <language>en</language>
    <item>
      <title>Getting Started with Freelancing in Data Science</title>
      <dc:creator>Lians </dc:creator>
      <pubDate>Wed, 26 Jul 2023 12:48:46 +0000</pubDate>
      <link>https://forem.com/lians/getting-started-with-freelancing-in-data-science-i97</link>
      <guid>https://forem.com/lians/getting-started-with-freelancing-in-data-science-i97</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;It's challenging to land your first office job in data science. For example, some job postings by recruiters appear to have high criteria, stating that an entry-level position requires a master's degree! However, who is to say that earning money in Data requires a real job? It's 2023, we have weathered a pandemic that has taught us more about money than any recession has! Everyone wants to be able to work from home and set their own hours, and freelancing is one of the finest ways to achieve this. And no, I'm not only referring to platforms like Upwork or Fiverr; instead, let's go through some of the simplest ways to make money in this industry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Side Hustles in Data Science
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Participating in Competitions and Hackathons
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aV5ZzbOS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kqe92xplfphvos6z63x2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aV5ZzbOS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kqe92xplfphvos6z63x2.jpg" alt="Hackathons and Competitions" width="612" height="612"&gt;&lt;/a&gt;&lt;br&gt;
Given its popularity in the industry, data has become one of the tech occupations with the highest salaries. Data science competitions are frequently held on websites like &lt;a href="https://zindi.africa/"&gt;Zindi&lt;/a&gt; and &lt;a href="https://www.kaggle.com/"&gt;Kaggle&lt;/a&gt;. Some of these hackathons offer cash awards of up to $5,000 or even job interviews as incentives. You can earn money from these events as long as you have the necessary skills and are willing to put in the work. Because these are well-known platforms in the data industry, they also let you build a portfolio.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Tutoring and Teaching
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3tRw9cfM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ar3axgk03drbo2105cil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3tRw9cfM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ar3axgk03drbo2105cil.png" alt="Mentorship" width="328" height="154"&gt;&lt;/a&gt;&lt;br&gt;
As previously mentioned, the field of data is expanding, and more people are making career changes to enter it. A simple way to freelance is to provide training sessions, either physically in training facilities or as online courses. You might consider coaching and tutoring those looking to enter the industry, and selling your knowledge as courses with educational chapters tailored to your tech stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Freelancing Sites
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DQpcTvPG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lr7rrpi0jru4p0hcxdlz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DQpcTvPG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lr7rrpi0jru4p0hcxdlz.jpg" alt="Freelancing Sites" width="381" height="132"&gt;&lt;/a&gt;&lt;br&gt;
Getting started on data science freelancing sites requires a thorough approach and several crucial steps. First, create a well-structured profile that shows your data science abilities, know-how, and pertinent work history. Second, show off an impressive portfolio of prior projects and case studies to convince prospective clients of your skills. Third, regularly participate in relevant data science discussions on the freelancing platform to network and enhance your reputation. Fourth, carefully define your project scope and pricing to draw clients while ensuring fair compensation for your services. Finally, maintain a high level of professionalism, attentiveness, and fast delivery in order to earn great ratings and establish long-lasting relationships on freelancing platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Technical and Content Writing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hw-2FXWT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5glkrd8e48qewp93jh7k.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hw-2FXWT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5glkrd8e48qewp93jh7k.gif" alt="Technical Writing" width="550" height="250"&gt;&lt;/a&gt;&lt;br&gt;
For data scientists, technical writing is a valuable side job that entails producing clear and simple documentation for algorithms, codebases, and data pipelines. This skill helps data scientists collaborate and convey their work to both technical and non-technical stakeholders. Writing instructional blog posts, tutorials, and articles as a side job in data science, on the other hand, entails showcasing data-driven insights, best practices, and market trends. Data scientists may build their personal brands, exchange expertise, and possibly make extra money by producing technical and educational content as a side job. Here are links to lists of sites you can pitch topics to or cold-email for technical writing work: &lt;a href="https://github.com/tigthor/PaidCommunityWriterPrograms"&gt;Paid Community&lt;/a&gt; and &lt;a href="https://github.com/malgamves/CommunityWriterPrograms"&gt;Community Writers Program&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Consultancy Services
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7nvJRRga--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0yduanaoh0nujzlrnybb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7nvJRRga--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0yduanaoh0nujzlrnybb.png" alt="Consultancy" width="291" height="173"&gt;&lt;/a&gt;&lt;br&gt;
Many start-up companies looking to establish a data science department could benefit from consulting based on their software stack and business plan. Consultancy, like the other gigs mentioned, requires the expertise of a data scientist. On the other hand, building projects and machine learning algorithms for these companies could help you make money quickly! These projects could be rented out or sold for a profit, for example, creating a facial recognition algorithm for a client. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JAU6VVQA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qpqwimue3amjhyn0tszk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JAU6VVQA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qpqwimue3amjhyn0tszk.png" alt="Improve your skills" width="280" height="180"&gt;&lt;/a&gt;&lt;br&gt;
These are some of the best-paying and easiest-to-learn freelancing opportunities for data scientists. However, it is strongly advised that before applying, one should increase their skill set. Freelancing takes time and requires you to put in the effort and show up on a daily basis. Fortunately, there is a wealth of information available online on how to get started with each of the concepts suggested above.&lt;/p&gt;

&lt;p&gt;Follow me on &lt;a href="https://twitter.com/lians___"&gt;Twitter&lt;/a&gt; for more information on data science.&lt;/p&gt;

</description>
      <category>freelancing</category>
      <category>datascience</category>
      <category>data</category>
      <category>hackathon</category>
    </item>
    <item>
      <title>Introduction to Object Detection with YOLOv5</title>
      <dc:creator>Lians </dc:creator>
      <pubDate>Wed, 02 Nov 2022 07:49:18 +0000</pubDate>
      <link>https://forem.com/lians/introduction-to-object-detection-with-yolov5-436p</link>
      <guid>https://forem.com/lians/introduction-to-object-detection-with-yolov5-436p</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the field of artificial intelligence, object detection has established itself as a household term. Devices that unlock phones using facial recognition, self-driving cars, and picture search capabilities are just a few examples. Object detection is a subset of computer vision that is gradually enabling machines to visually perceive the outside world. So, how do you begin with object detection? The first step is to classify objects using image classification. This entails teaching a computer model to categorize particular real-world items into different groups, for instance, a person, an animal, a car, or a plane.&lt;/p&gt;

&lt;p&gt;You might be wondering what the distinction is between object localization and object detection, which would be the second stage. Basically, object localization involves teaching a computer model to recognize the existence of a single object in a given image and to determine its location. In contrast, object detection involves teaching a model to identify and locate various items in an image or video.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to practice object localization
&lt;/h2&gt;

&lt;p&gt;Consider a situation where we are attempting to detect two different item types, a bicycle and a car. We will feed an image to a convolutional neural network such as ResNet or VGG. The network then predicts the two classes using bounding boxes and a prediction confidence score.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wJhh9XIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yet0oxk8a2y8k4l7urrp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wJhh9XIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yet0oxk8a2y8k4l7urrp.png" alt="Object Localization" width="792" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A rectangle known as a &lt;em&gt;bounding box&lt;/em&gt; surrounds an object and specifies its position, class, and confidence. There are two techniques for producing a bounding box:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Construct four numbers x1, y1, x2 and y2, where (x1, y1) and (x2, y2) represent the upper-left and bottom-right corner points, respectively.&lt;/li&gt;
&lt;li&gt;Create one point (x, y) to indicate the object's upper-left corner point and two values (h, w) to indicate the object's height and width.&lt;/li&gt;
&lt;/ol&gt;
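As a rough illustration, the two formats can be converted into each other. Here is a minimal sketch in plain Python; the function and variable names are my own, not from any library:

```python
def corners_to_xywh(x1, y1, x2, y2):
    """Convert two corner points (x1, y1), (x2, y2) into (x, y, w, h)."""
    w = x2 - x1          # width of the box
    h = y2 - y1          # height of the box
    return x1, y1, w, h  # (x, y) is the upper-left corner point

def xywh_to_corners(x, y, w, h):
    """Convert (x, y, w, h) back into the two corner points."""
    return x, y, x + w, y + h
```

Either representation carries the same information, which is why detection frameworks freely pick one or the other.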

&lt;p&gt;We'll discuss how to build bounding boxes for object detection models using the YOLOv5 format in another session. As we can see, object localization requires just one class per image, so it does not require much.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approaches to achieving object detection
&lt;/h2&gt;

&lt;p&gt;With the advancement of technology, more and more methods have been developed to make object detection easier to achieve. These include:&lt;br&gt;
&lt;strong&gt;- Sliding windows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pmTW1XPP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8vkzyrqey279wzbnfgbh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pmTW1XPP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8vkzyrqey279wzbnfgbh.png" alt="Sliding Windows" width="472" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Working with Sliding Windows is one of the original methods of object detection. Here we create a bounding box, which is normally a square and use this box to resize the image into numerous crops, trying to see if there is an instance of the classes being detected. In our above image, there is a sliding window that tries to see if the car appears in various crops of the image. This method is tedious, it involves a lot of computation and as seen, there are a lot of bounding boxes being created for the same object&lt;br&gt;
&lt;strong&gt;- Regional based network&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vFyhIo8F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/viuh6djtvo8ivouhp1hw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vFyhIo8F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/viuh6djtvo8ivouhp1hw.png" alt="Regional Based network" width="503" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider a scenario in which we are attempting to develop a model to determine if an object is a human or an animal and we have an image of a man riding a bicycle in a park. For each object, we will draw bounding boxes, extract all potential regions, compute CNN features, and then finally classify the regions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- YOLO (You Only Look Once)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1TCtmN-9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/httmuqx4tikkrshsx74h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1TCtmN-9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/httmuqx4tikkrshsx74h.png" alt="YOLOv5" width="360" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;YOLO is a pretrained model available in PyTorch that is used for object detection. It is based on regression: instead of selecting interesting parts of an image, it predicts classes and bounding boxes for the whole image in one run of the algorithm. YOLO uses a single CNN to do both object detection and localization, which makes it faster than R-CNN.&lt;br&gt;
YOLO divides every input image into an S * S grid system, and each grid cell is responsible for object detection. Each cell predicts whether there is a bounding box in that cell and with what confidence the box predicts a certain class. For every box, we have 5 main attributes;&lt;br&gt;
• X and y coordinates for the corner points&lt;br&gt;
• The height and width of the object&lt;br&gt;
• The confidence score for the probability that the box contains a certain object &lt;/p&gt;
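The grid idea can be sketched as follows. This is an illustrative simplification, not the exact YOLOv5 encoding; the function name and the (x, y, w, h) pixel box format are my own assumptions:

```python
def yolo_target(box, img_w, img_h, S=7):
    """Build the 5 main attributes for one box on an S x S grid.

    box is (x, y, w, h) in pixels. Coordinates are normalised to [0, 1],
    and the responsible grid cell is the one containing the box centre.
    (Illustrative sketch only, not the exact YOLOv5 target encoding.)
    """
    x, y, w, h = box
    cx = (x + w / 2) / img_w           # normalised centre x
    cy = (y + h / 2) / img_h           # normalised centre y
    cell = (int(cy * S), int(cx * S))  # (row, col) of the responsible cell
    # The 5 attributes: x, y, width, height, and a confidence score.
    attributes = [cx, cy, w / img_w, h / img_h, 1.0]
    return cell, attributes
```

For a 400 x 400 image split into a 4 x 4 grid, a box at (100, 100) with size 200 x 100 has its centre in row 1, column 2, and that cell alone is responsible for predicting it.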

</description>
    </item>
    <item>
      <title>Introduction to Data Engineering</title>
      <dc:creator>Lians </dc:creator>
      <pubDate>Sat, 20 Aug 2022 06:07:29 +0000</pubDate>
      <link>https://forem.com/lians/introduction-to-data-engineering-3bj5</link>
      <guid>https://forem.com/lians/introduction-to-data-engineering-3bj5</guid>
      <description>&lt;p&gt;Big data is data so large or complex that you must think carefully about how to handle it. Big data is distinguished by its volume, velocity, value, variety, and veracity. That is why data engineers exist. A data engineer is in charge of data ingestion, collection, and storage.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;The process of ingesting and storing data so that it is accessible and ready for analysis is known as data engineering. In other words, they design and build large-scale data collection, storage, and analysis systems.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Engineering Roles
&lt;/h2&gt;

&lt;p&gt;To better understand the role of a data engineer, I'll provide a brief overview of the data science process, using the image below as a guideline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nLCVEVl2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gpe0y547rwj0ygwhhiiy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nLCVEVl2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gpe0y547rwj0ygwhhiiy.jpg" alt="The Data Science workflow" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The first step is to collect and store raw data from around the world.&lt;/li&gt;
&lt;li&gt;The following step is data pre-processing, which includes cleaning, filtering, querying, and aggregation.&lt;/li&gt;
&lt;li&gt;Following that, we will look at visualization and EDA (Exploratory Data Analysis), which will assist a machine learning engineer in deciding which models and algorithms to use.&lt;/li&gt;
&lt;li&gt;Finally, we make decisions, such as forecasts and predictions, and generate reports based on the data collected.&lt;/li&gt;
&lt;/ol&gt;
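Steps 1 and 2 above, collecting raw data and pre-processing it, can be sketched in a few lines of plain Python; the records here are hypothetical:

```python
# Hypothetical raw records, as a data engineer might ingest them.
raw = [
    {"city": "Nairobi", "sales": "120"},
    {"city": "Mombasa", "sales": ""},   # a missing value
    {"city": "Nairobi", "sales": "80"},
]

# Pre-processing: clean (drop records with missing values), cast types,
# then aggregate so the data is ready for analysis.
clean = [r for r in raw if r["sales"]]
totals = {}
for r in clean:
    totals[r["city"]] = totals.get(r["city"], 0) + int(r["sales"])
```

Real pipelines do this at scale with dedicated tools, but the shape of the work, ingest, clean, aggregate, is the same.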

&lt;p&gt;&lt;strong&gt;&lt;em&gt;A data Engineer is responsible for the first two steps in the Data Science Workflow. Data engineers store and ingest data, they are also responsible for setting up a database and building data pipelines.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do we need Data Engineers
&lt;/h2&gt;

&lt;p&gt;As previously stated, data engineers are in charge of the first two steps in the data science workflow. Without them, machine learning engineers, data analysts, and data scientists would be forced to work with raw, unprocessed data. Engineers are required because they process and prepare data for further analysis. Data engineers also assist in the collection, storage, and optimization of data for usability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Engineer Vs Data Scientists
&lt;/h2&gt;

&lt;p&gt;Data engineers transform unstructured data into a more usable format for data analysts, who then search through the data for insights. They develop data warehouses for sizable databases, maintain the architecture and design of the data, and construct queries.&lt;/p&gt;

&lt;p&gt;On the other hand, data scientists gather and organize unstructured data, develop models for using big data, and perform big data analysis.&lt;/p&gt;

&lt;p&gt;Strong programming skills are required for data engineering, whereas strong analytical skills are required for data science.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Analysis Tools</title>
      <dc:creator>Lians </dc:creator>
      <pubDate>Mon, 14 Feb 2022 07:12:14 +0000</pubDate>
      <link>https://forem.com/lians/data-analysis-tools-2d3a</link>
      <guid>https://forem.com/lians/data-analysis-tools-2d3a</guid>
      <description>&lt;p&gt;In this post, I would like to share my journey in becoming a junior Data Analyst. The previous discussion on &lt;a href="https://dev.to/lians/a-guide-into-data-science-2mm9"&gt;A guide into Data Science&lt;/a&gt; classifies data analysis as one of the careers in the field. A Data Analyst is a person who aims to draw conclusions from available data by cleaning, inspecting, manipulating, modeling, and transforming it. In general, a data analyst uses previous information records to answer questions such as: What products sell the most? Which state records the highest number of crimes? Like any other data scientist, a data analyst has a list of instruments they must have in their laboratory.&lt;/p&gt;

&lt;p&gt;My journey began with thorough research on what Data Analysis entails. I also looked up multiple companies that offer job opportunities and, of course, the minimum pay for an analyst. My favorite information source was a YouTuber, &lt;a href="https://www.youtube.com/c/LukeBarousse"&gt;Luke Barousse&lt;/a&gt;. He offers extensive information on Data Science, and I find his learning tactics quite interactive. The next step was to begin understanding what skills I needed. Without further explanation, let’s get into the necessary skills.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools of a Data Analyst
&lt;/h2&gt;

&lt;p&gt;In my learning experience, I have worked on four categories of skills that I thought were needed in the field.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Technical skills&lt;/li&gt;
&lt;li&gt;Business Intelligence Tools&lt;/li&gt;
&lt;li&gt;Analytical Skills&lt;/li&gt;
&lt;li&gt;Soft Skills&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I arranged the skills in the numbered list above in the order I tackled them when I began my journey into Data Analysis. However, that does not mean learning must follow this order; it is self-paced, and a person may start with whatever feels comfortable. Now let’s explain deeply what each of these four categories entails.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Skills
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VEmJR39h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pqnlwmpgglhbg4zz06ek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VEmJR39h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pqnlwmpgglhbg4zz06ek.png" alt="## Technical Skills" width="700" height="342"&gt;&lt;/a&gt;&lt;br&gt;
As the name suggests, these are coding skills. There are many technical skills a data analyst could work on, but only a few are widely used in the industry: Programming, Spreadsheets, and Database Management. Here’s a list of the resources I combined for this category.&lt;/p&gt;

&lt;p&gt;Programming Languages - R, Python&lt;/p&gt;

&lt;p&gt;Spreadsheet - Microsoft Excel&lt;/p&gt;

&lt;p&gt;Databases - SQL, NoSQL, MySQL.&lt;/p&gt;

&lt;p&gt;I have improved my Python, Excel, and SQL experience. One thing to clarify: when dealing with SQL, data analysis is not just creating tables and queries. There is more to SQL than basic manipulation; one has to work on aggregation, mathematical functions, string functions, and more. The same can be said for Microsoft Excel. Later, we’ll discuss topics to work on for each technical skill mentioned above.&lt;/p&gt;
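For instance, aggregation in SQL goes beyond a basic SELECT. Here is a minimal sketch using Python's built-in sqlite3 module; the sales table and its values are hypothetical:

```python
import sqlite3

# In-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("tea", 10.0), ("tea", 15.0), ("coffee", 8.0)])

# Aggregation: grouping rows and applying SUM and AVG per group.
rows = conn.execute(
    "SELECT product, SUM(amount), AVG(amount) FROM sales "
    "GROUP BY product ORDER BY product"
).fetchall()
```

GROUP BY with aggregate functions is exactly the kind of SQL an analyst reaches for daily, well past plain table creation.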

&lt;h2&gt;
  
  
  Business Intelligence Tools
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8JkuDZcb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/is56xcyeb00yjwa08c29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8JkuDZcb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/is56xcyeb00yjwa08c29.png" alt="## Business Intelligence Tools" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Business Intelligence (BI) refers to the tactics and technology utilized by businesses for data analysis and management of business information. Other people would call this category Domain Knowledge. The most common tools are Tableau and Power BI. I have worked with Tableau, and it offers fantastic visualization and representation tools. I have not yet explored Power BI; however, my research shows it is widespread in this career.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analytical Skills
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1JmQeFQh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mxg000nlrktwe1f22nxk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1JmQeFQh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mxg000nlrktwe1f22nxk.jpg" alt="## Analytical Skills" width="685" height="461"&gt;&lt;/a&gt;&lt;br&gt;
I believe that analysis focuses on mathematical foundations: algebra, calculus, and probability. These units teach you to calculate regressions, data metrics, and forecasts. Most high schools and universities teach these units to their students, and online courses from platforms such as Coursera and YouTube may help perfect these mathematical skills. Additionally, analytical skills include problem-solving, critical thinking, and informed knowledge of subjects. To meet this criterion, you should join communities that discuss Data Analysis on major social media pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Soft Skills
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k9QhppGW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/43ddkmsqalkco89olsko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k9QhppGW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/43ddkmsqalkco89olsko.png" alt="## Soft Skills&amp;lt;br&amp;gt;
" width="278" height="182"&gt;&lt;/a&gt;&lt;br&gt;
This skill is one of the main criteria used by recruiters when hiring. Companies look for a goal-oriented person who has good communication skills, is a leader, and is a team player, among many other qualities. Interpersonal soft skills help maintain a healthy work environment. These skills can be improved through writing, interacting with others, and asking for feedback.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;These four categories of tools have been my major priority in my journey as a Data Analyst. Fortunately, resources are plentiful, and it’s never too late to learn. I’d advise you to find something to start with, work on it every day, and watch yourself improve from a junior to a senior Data Analyst. I hope you find this piece helpful. Feel free to leave your comments in the discussion section.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Pandas Walkthrough</title>
      <dc:creator>Lians </dc:creator>
      <pubDate>Thu, 27 Jan 2022 09:16:00 +0000</pubDate>
      <link>https://forem.com/lians/pandas-walkthrough-5bh6</link>
      <guid>https://forem.com/lians/pandas-walkthrough-5bh6</guid>
      <description>&lt;p&gt;If you're thinking about a career in data science, this is one of the first tools you should learn. Pandas is an excellent tool, particularly for data analysis. Pandas is said to be derived from the term "panel data," and it stands for Python Data Analysis Library in full.&lt;/p&gt;

&lt;p&gt;The package supports SQL, Excel (.xlsx), and JSON files in addition to .csv files. As a result, there is no need to bother with converting files to CSV format. Pandas' ability to re-arrange data into appealing rows and columns is a fascinating feature!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Let’s get started&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You must have the Anaconda application installed on your PC in order to install pandas; alternatively, use &lt;a href="https://colab.research.google.com/"&gt;Google Colab&lt;/a&gt;. You may access Jupyter Notebook by typing Jupyter Notebook into your search bar after installing Anaconda. The application will open a kernel and a localhost page in your machine's browser. If you've completed this successfully, you're ready to begin coding with Pandas. Create a new Python 3 notebook and type the following in the code cell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you've prepared a dataset, the next step is to import it into your notebook. This can be accomplished using;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv(r'_file location_')
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, if you want to use your own data, we can do so by creating a variable inside the code cell. Consider the following scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = {'first': ["Lians", "Shem", "Zainab"],
        'last': ["Wanjiku", "Githinji", "Buno"],
        'email': ["lianswanjiku@gmail.com", "smaina@gmail.com", "zainab@gmail.com"]}
df = pd.DataFrame(data)
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Isn't it straightforward? The next step would be to learn how to use the Pandas tools that are required. In this guide, I'll go through a list of topics that you must study and comprehend in order to be proficient with Pandas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Series and DataFrames
&lt;/h2&gt;

&lt;p&gt;A pandas Series is a one-dimensional data structure made up of key-value pairs, whereas a DataFrame is a combination of many Series. A DataFrame is a data set that has been imported into pandas; if you call out a single column, you get a Series.&lt;/p&gt;
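The distinction is easy to check in code. A short sketch; the small DataFrame here is a hypothetical example:

```python
import pandas as pd

# A small hypothetical DataFrame with two columns.
df = pd.DataFrame({"first": ["Lians", "Shem"], "last": ["Wanjiku", "Githinji"]})

frame_type = type(df).__name__            # the whole table is a DataFrame
column_type = type(df["first"]).__name__  # a single column is a Series
```

Calling out `df["first"]` gives back a Series, confirming that a DataFrame is just a collection of Series sharing one index.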

&lt;h2&gt;
  
  
  Indexes
&lt;/h2&gt;

&lt;p&gt;An index is a unique identifier for locating a series or dataframe. It is necessary to understand how to create an index, how to use it to call out a row or column and filter data, and how to reset the index. iloc and loc are two tools for indexing. The distinction between the two is that iloc locates data by integer position, whereas loc uses labels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# iloc: returns the rows at the given integer positions
df.iloc[[0, 1]]

# loc: returns the row with the given index label
df.loc[2]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
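With a default integer index, loc and iloc can look interchangeable, so here is a short sketch with a labelled index; the labels 'a' and 'b' and the DataFrame are my own example:

```python
import pandas as pd

# A hypothetical DataFrame whose index uses string labels.
df = pd.DataFrame({"first": ["Lians", "Shem"], "last": ["Wanjiku", "Githinji"]},
                  index=["a", "b"])

by_label = df.loc["b", "first"]    # loc: look up by index label
by_position = df.iloc[1]["first"]  # iloc: look up by integer position
```

Both lines reach the same row here, but only because label "b" happens to sit at position 1; with a shuffled or sliced DataFrame the two would diverge.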



&lt;h2&gt;
  
  
  Filtering
&lt;/h2&gt;

&lt;p&gt;Filtering is used to pick out select rows and columns from the entire dataset during data cleaning. For example, if you’re trying to predict house pricing based on a property’s features, it’s best to filter the dataset down to those feature columns; that gives the analyst a simpler basis for forecasting a price. In Pandas, the filter can be applied to a label, for example, ‘Country’.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;           df.filter(items=['one', 'three'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Updating, adding and removing rows and columns
&lt;/h2&gt;

&lt;p&gt;These operations are also part of data cleansing. You can, for example, rename labels in a data set if they appear to be confusing, and when data contains missing values, you are often expected to remove (drop) those rows or columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. #Combining rows and columns
       df['full_name'] = df['first'] + ' ' + df['last']                      
2. #let's try to remove a bunch of columns from the dataFrame
      df.drop(columns = ['first', 'last'], inplace =True)**
3. #We can add new elements to the dataframe using the append function
      df.append({'first': 'Tony'}, ignore_index = True)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order to add new rows, we can append a second dataframe to the existing one, as shown below.&lt;br&gt;
Let’s start by creating a new dataframe and calling it df1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;people = {'first': ["Stella", "Natalie"],
         'last': ["Smith", "Mikaelson"],
         'email': ["[stelasmith@gmail.com] 
          (mailto:stelasmith@gmail.com)", "[mnatalie@gmail.com] 
           (mailto:mnatalie@gmail.com)"]}
 df1 = pd.DataFrame(people)
 df1

#The we can append it to the already existing one df,
  df.append(df1, ignore_index = True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
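
&lt;p&gt;Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a modern equivalent of the snippet above (rebuilt here with shortened, made-up frames) uses pd.concat:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({'first': ["Lians"], 'last': ["Wanjiku"]})
df1 = pd.DataFrame({'first': ["Stella", "Natalie"], 'last': ["Smith", "Mikaelson"]})

# pd.concat stacks the frames; ignore_index renumbers the rows 0..n-1
combined = pd.concat([df, df1], ignore_index=True)
print(combined)
```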



&lt;h2&gt;
  
  
  Sorting
&lt;/h2&gt;

&lt;p&gt;Sorting data entails arranging it according to a set of criteria. Data can be sorted in Pandas by index values, by column labels, or by a hybrid of the two. Let's take a look at a code snippet for data sorting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         #Let's arrange the df in the order of alphabetical order in ascending order
              df.sort_values(by='last')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, we arrange the values alphabetically by last name; in other words, we sort by a column label.&lt;br&gt;
The following example sorts data by its index position instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;       df.sort_index()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
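
&lt;p&gt;sort_values also accepts several columns at once plus an ascending flag; a minimal sketch with made-up names:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({'last': ["Smith", "Adams", "Smith"],
                   'first': ["Zoe", "Ann", "Amy"]})

# Sort by last name, breaking ties with first name, both descending
ordered = df.sort_values(by=['last', 'first'], ascending=False)
print(ordered)
```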



&lt;h2&gt;
  
  
  Aggregation and grouping
&lt;/h2&gt;

&lt;p&gt;Like filtering, this is a basic approach for segregating relevant rows and columns in order to reach a conclusion faster: grouping splits the rows by the values of a column, and aggregation then computes a summary per group. I won't go into great depth about this because I prefer filtering to grouping and aggregation. However, here's a bit of code for the operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#It can be achieved through column labels, index position or a combination of both.

  gender_grp = df.groupby(['GENDER'])
  gender_grp.get_group('Female')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
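
&lt;p&gt;The aggregation half of the story can be sketched like this (the GENDER and SALARY columns here are hypothetical, not from a real data set):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({'GENDER': ["Female", "Male", "Female"],
                   'SALARY': [100, 80, 120]})

# groupby splits the rows per gender; mean() then aggregates each group
means = df.groupby('GENDER')['SALARY'].mean()
print(means)
```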



&lt;h2&gt;
  
  
  Date and Time
&lt;/h2&gt;

&lt;p&gt;If you're working with a data set that includes a date or time column, you'll need to know how to manipulate it. First, as shown below, we'll define a date parser for pandas to use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  d_parser = lambda x: pd.datetime.strptime(x ,'%Y-%m-%d %I-%p')
df = pd.read_csv(r'C:\Users\lian.s\Downloads\ETH_1h.csv', parse_dates= ['Date'] , date_parser = d_parser)
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also create a new column that describes what day it was on a specific date;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df['DayOfWeek'] = df['Date'].dt.day_name()
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
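
&lt;p&gt;Once the Date column is parsed, you can also filter rows by date; a minimal sketch with made-up prices rather than the ETH_1h.csv file:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2020-01-01', '2020-06-15', '2021-03-01']),
                   'Close': [130.0, 230.0, 1500.0]})

# Keep only the rows whose Date falls in 2020
in_2020 = df[df['Date'].dt.year == 2020]
print(in_2020)
```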



&lt;h2&gt;
  
  
  File handling
&lt;/h2&gt;

&lt;p&gt;As previously stated, there are several ways to read files into Pandas; in this tutorial we focused primarily on reading CSV files. It is recommended that you go through each of the other supported file formats as you learn Pandas: SQL, Excel, and JSON, to be specific.&lt;/p&gt;
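
&lt;p&gt;As a taste of those formats, pandas exposes matching read_*/to_* pairs (read_excel, read_json, read_sql, and so on). A small JSON round-trip with made-up data:&lt;/p&gt;

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'first': ["Lians", "Shem"], 'last': ["Wanjiku", "Githinji"]})

# Write the frame out as JSON text, then read it straight back in
json_text = df.to_json(orient='records')
df_back = pd.read_json(StringIO(json_text), orient='records')
print(df_back)
```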

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The tutorial has picked out the most important aspects to study when learning Pandas, but there are many more concepts to learn. I recommend that you look through the &lt;a href="https://pandas.pydata.org/docs/getting_started/index.html#getting-started"&gt;Pandas documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I hope you find this guide useful as you embark on your Data Science journey. Don't forget to share your thoughts and comments in the section below.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Detecting Fake News Project</title>
      <dc:creator>Lians </dc:creator>
      <pubDate>Fri, 21 Jan 2022 09:55:01 +0000</pubDate>
      <link>https://forem.com/lians/detecting-fake-news-project-1n8d</link>
      <guid>https://forem.com/lians/detecting-fake-news-project-1n8d</guid>
      <description>&lt;p&gt;I always had Python debates with my friends; there was always something about me and the language that didn't match up. Who'd have guessed that two years later, I'd be considering a career in Data Science centered on Python? This is a step-by-step guide to completing a data science project that detects fake news. For this project,  I collected a Data Set from (&lt;a href="https://drive.google.com/file/d/1er9NJTLUA3qnRuyhfzuN0XUsoIC4a-_q/view" rel="noopener noreferrer"&gt;https://drive.google.com/file/d/1er9NJTLUA3qnRuyhfzuN0XUsoIC4a-_q/view&lt;/a&gt;)&lt;br&gt;
To begin, you will need the following installed on your computer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Jupyter Notebook that can be installed using Anaconda&lt;/li&gt;
&lt;li&gt; Python 3&lt;/li&gt;
&lt;li&gt; Download the csv file of the data set from the link shared above.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Let's get started&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We'll design a TfidfVectorizer with sklearn and then establish a PassiveAggressiveClassifier to help fit the model.&lt;br&gt;
You'll need to install a few prerequisites before you can use your Jupyter notebook. We begin by installing numpy, pandas and scikit-learn.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install numpy pandas sklearn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your application will install the required tools and create a fresh input space for you to type your next code.&lt;br&gt;
You'll need to use the codes below to make the appropriate imports for this project;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After executing the cell, the screen will show no output; that means the imports succeeded and the data set is ready to be read into your notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#news is just a variable name we assigned to the project to simply our access to the dataset, you are free to select another name.
news = pd.read_csv(r'C:\Users\lian.s\Desktop\Sign Recognition\news.csv')
#We print out the number of rows and columns
news.shape
#Prints out the top 5 rows in a dataframe or series
news.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result will be; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8crnz6turnmgk3nqxaot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8crnz6turnmgk3nqxaot.png" alt="If you run the program, you'll notice something like this."&gt;&lt;/a&gt;&lt;br&gt;
That means we successfully read the dataset into our notebook and displayed its top 5 rows. Now we can call the labels from the data set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;labels= news.label
labels.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Congratulations, your project is off to a solid start.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8a8m6kxnsz5ka4dlfl8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8a8m6kxnsz5ka4dlfl8i.png" alt="Congratulations, your project is off to a solid start"&gt;&lt;/a&gt;&lt;br&gt;
You've undoubtedly dealt with training and testing as a young data scientist. This will be the next phase in the development of our project.&lt;br&gt;
We split the data into a training set, the subset used to fit the model, and a test set that is held back so we can check how well the fitted model predicts outcomes it hasn't seen.&lt;br&gt;
To accomplish this, we employ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x_train,x_test,y_train,y_test=train_test_split(news['text'], labels, test_size=0.2, random_state=7)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We continue by creating a TfidfVectorizer.&lt;br&gt;
Term Frequency (TF) counts how often a word appears in a document, whereas Inverse Document Frequency (IDF) down-weights words that appear across many documents, since those carry little distinguishing information. The vectorizer builds its TF-IDF matrix from the raw text: we fit and transform it on the train set, and only transform the test set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train=tfidf_vectorizer.fit_transform(x_train) 
tfidf_test=tfidf_vectorizer.transform(x_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the help of sklearn's PassiveAggressiveClassifier, which operates by reacting passively to accurate classifications and aggressively to any misclassifications, we'll calculate the accuracy of our model on the test set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Initialize a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)
#Predict on the test set and calculate accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we have an accuracy result of 92.9%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4trcnypn4l24c77qmao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4trcnypn4l24c77qmao.png" alt="an accuracy result of 92.9%"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, we print out a matrix of how many fake and real news items were classified correctly and incorrectly. This is what we call a confusion matrix.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;confusion_matrix(y_test, y_pred, labels=['FAKE','REAL'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fll095rzirm07ax3b3471.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fll095rzirm07ax3b3471.png" alt="Results"&gt;&lt;/a&gt;&lt;br&gt;
From our test, we have 588 true positives, 40 false positives, 589 true negatives and 50 false negatives.&lt;/p&gt;
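
&lt;p&gt;As a sanity check, the accuracy can be recomputed by hand from these four counts, and the arithmetic agrees with the score printed earlier:&lt;/p&gt;

```python
# Counts read off the confusion matrix above
tp, fp, tn, fn = 588, 40, 589, 50

accuracy = (tp + tn) / (tp + fp + tn + fn)
print(f'Accuracy: {round(accuracy * 100, 2)}%')  # Accuracy: 92.9%
```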

&lt;p&gt;There you have it. For more practice, you can use the data set to examine over- and underfitting. I hope you found this post useful; please share your thoughts in the comments section below.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A guide into Data Science</title>
      <dc:creator>Lians </dc:creator>
      <pubDate>Wed, 19 Jan 2022 08:05:29 +0000</pubDate>
      <link>https://forem.com/lians/a-guide-into-data-science-2mm9</link>
      <guid>https://forem.com/lians/a-guide-into-data-science-2mm9</guid>
      <description>&lt;p&gt;Data Science appears to be one of history's most misunderstood terms. The term data scientist, like stoicism, minimalism, and rationality, has been stripped of its original meaning. So, let's take a look at what becoming a data scientist entails. The word "data science" refers to the collecting, analysis, manipulation, and interpretation of data. Three different vocations can be derived from its definition. Data science, data engineering, and data analysis. Let's talk about the three jobs and how we got them from the definition of data science and get a chance to go into why data science is a profession from the name data science. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Data Engineer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To be able to comprehend and handle data, one must first acquire it; this is precisely what a data engineer does. A data engineer's (DE) ultimate purpose is to make information accessible to analysts and scientists. In that sense, data engineering is the first stage of the data science workflow. A DE's other duty is to create mechanisms that make it easier to obtain pertinent data. What qualifications are required to begin a career in data engineering? &lt;/p&gt;

&lt;p&gt;·         Coding - Expertise in languages such as SQL, Python, NoSQL, and R, the most commonly used in this industry, is required.&lt;/p&gt;

&lt;p&gt;·         Database administration - this entails becoming familiar with both relational and non-relational databases.&lt;/p&gt;

&lt;p&gt;·         Machine Learning - a burgeoning subject that can be applied within data science, for example to build systems that help you collect and process data sets.&lt;/p&gt;

&lt;p&gt;·         ETL systems - Extract, Transform, and Load solutions allow you to move data sets from one location to another. Stitch and Alooma are examples of such tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data Analyst&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A data analyst analyzes and transforms the acquired data after receiving it from an engineer. Cleaning data sets to improve interpretation could be part of the job, as is establishing patterns in prior company data that can help firms make more profitable decisions. Data analysts frequently convert large data sets into useful formats such as reports or presentations. That involves learning skills like: &lt;/p&gt;

&lt;p&gt;• Data visualization with Tableau, Numpy, Pandas, and Excel&lt;/p&gt;

&lt;p&gt;• Coding - You should know SQL, Python, NoSQL, and R, which are the most common languages in this profession.&lt;/p&gt;

&lt;p&gt;• Statistical Programming&lt;/p&gt;

&lt;p&gt;• Machine Learning&lt;/p&gt;

&lt;p&gt;• Presentation and Communication Skills                                            &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Data Scientist&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A data scientist is expected to estimate future outcomes based on the hypotheses developed during an investigation. A data scientist's primary purpose, then, is to forecast future results from the data available. They are also expected to develop data models and algorithms to help the company resolve complex issues. &lt;/p&gt;

&lt;p&gt;Data scientists must have a combination of skills such as:&lt;/p&gt;

&lt;p&gt;·         Knowledge of algorithms and analytic capabilities&lt;/p&gt;

&lt;p&gt;·         Machine learning&lt;/p&gt;

&lt;p&gt;·         Data mining&lt;/p&gt;

&lt;p&gt;·         Statistical skills&lt;/p&gt;

&lt;p&gt;·         Coding with languages such as R, SAS, Python, MatLab, SQL, Hive, Pig, and Spark&lt;/p&gt;

&lt;p&gt;That isn't to say that there are only three data science career paths; rather, the article focuses on the main ones. Business analysts, data architects, machine learning engineers, database administrators, and so on are among the others. I hope this post has helped you understand what data science is all about.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
