<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ian Daniel</title>
    <description>The latest articles on Forem by Ian Daniel (@iandaniel).</description>
    <link>https://forem.com/iandaniel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F441806%2F3bccfae3-a441-4e1b-86f5-d6f14e4d4f9e.png</url>
      <title>Forem: Ian Daniel</title>
      <link>https://forem.com/iandaniel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/iandaniel"/>
    <language>en</language>
    <item>
      <title>A Beginner's Guide to Reading and Writing Files with Pandas</title>
      <dc:creator>Ian Daniel</dc:creator>
      <pubDate>Wed, 15 Feb 2023 19:18:08 +0000</pubDate>
      <link>https://forem.com/iandaniel/a-beginners-guide-to-reading-and-writing-files-with-pandas-3ma8</link>
      <guid>https://forem.com/iandaniel/a-beginners-guide-to-reading-and-writing-files-with-pandas-3ma8</guid>
      <description>&lt;p&gt;In this article we will look at the several ways pandas allows us to read and write files. Pandas is a powerful python library that helps in the manipulation of data, data representation , handling large data sets, making work with data more flexible and customizable etc.&lt;/p&gt;

&lt;p&gt;Installing Pandas - &lt;code&gt;pip install pandas&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reading data from files in pandas&lt;/strong&gt;&lt;br&gt;
There are several methods provided by pandas to read data from files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. read_csv&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df= pd.read_csv('data.csv')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The read_csv() method is used to read a CSV (comma-separated values) file and create a DataFrame object from it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;parameters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;delimiter&lt;/code&gt; - This is commonly used when your CSV uses a delimiter other than a comma. Common delimiters in CSVs are the comma, tab, space, colon, semicolon, and pipe. The example below shows a CSV that uses a semicolon as its delimiter.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv('data.csv', delimiter=';')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;na_values&lt;/code&gt; - This parameter specifies additional values that should be treated as missing values in the resulting DataFrame. By default, pandas treats values such as &lt;code&gt;NaN, NA, N/A, null&lt;/code&gt; and empty strings as missing. If a column contains a value like "unknown", you can list that string in the &lt;code&gt;na_values&lt;/code&gt; parameter.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv('data.csv', na_values=['unknown'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rows containing these missing values can then be dropped using the dropna() method in pandas.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv('data.csv', na_values=['unknown'])
df = df.dropna()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;skiprows&lt;/code&gt;- If you need to skip certain rows in your CSV file, you can use the &lt;code&gt;skiprows&lt;/code&gt; parameter to indicate the number of rows to skip. For instance, if the first three rows of your CSV file contain metadata or other irrelevant information, you can skip reading that portion by:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv('my_data.csv', skiprows=3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are just a few of the common parameters used in the read_csv method; you can look up others like &lt;code&gt;header&lt;/code&gt; and &lt;code&gt;usecols&lt;/code&gt; &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
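As a quick sketch of those two parameters, the snippet below reads from an in-memory buffer (so the example is self-contained; the column names are made up for illustration, and with a real file you would pass a path instead):

```python
import io

import pandas as pd

# Sample CSV held in memory so the example is self-contained.
csv_data = io.StringIO(
    "Make,Model,Engine Size,Fuel\n"
    "Audi,Rs6,4.2,Diesel\n"
    "Toyota,Supra,3.0,Petrol\n"
)

# header=0 uses the first row as column names (the default);
# usecols keeps only the listed columns.
df = pd.read_csv(csv_data, header=0, usecols=["Make", "Model"])
print(df.columns.tolist())  # ['Make', 'Model']
```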

&lt;p&gt;&lt;strong&gt;2. read_excel&lt;/strong&gt;&lt;br&gt;
The read_excel() method is used to read Excel files (.xls and .xlsx) and returns a pandas DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df=pd.read_excel('mydata.xlsx')
df=pd.read_excel('mydata.xls')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sheet_name&lt;/code&gt; - This parameter specifies the name or index of the sheet to read from the Excel file. If the file has multiple sheets, you can use this parameter to select a specific sheet. By default, it reads the first sheet. Let's say my Excel file has three sheets named &lt;code&gt;Audi, Toyota, Volkswagen&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df=pd.read_excel('mydata.xlsx', sheetname='Audi')
df=pd.read_excel('mydata.xlsx', sheetname='Toyota')
df=pd.read_excel('mydata.xlsx', sheetname='Volkswagen')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unlike read_csv(), Excel files do not have a delimiter parameter because the data is not separated by a delimiter. Instead, read_excel() uses the sheet name or index, a range of cells, or column and row labels to select the data to be read.&lt;/p&gt;
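As a minimal round-trip sketch of this kind of selection (the file name and data are made up, and reading/writing .xlsx files assumes the openpyxl engine is installed alongside pandas):

```python
import pandas as pd

# Build a small workbook so the example has something to read back.
pd.DataFrame({
    "Make": ["Audi", "Toyota"],
    "Model": ["Rs6", "Supra"],
    "Fuel": ["Diesel", "Petrol"],
}).to_excel("cars.xlsx", sheet_name="Cars", index=False)

# sheet_name picks the sheet; usecols selects columns by
# Excel letter range (here, the first two columns).
df = pd.read_excel("cars.xlsx", sheet_name="Cars", usecols="A:B")
print(df.columns.tolist())  # ['Make', 'Model']
```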

&lt;p&gt;You can check out other parameters related to read_excel &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. read_json&lt;/strong&gt;&lt;br&gt;
read_json is a method in the pandas library that is used to read data from a JSON file into a pandas DataFrame. It is especially useful for reading a nested JSON file and extracting the essential data from it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_json('filename.json')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;orient&lt;/code&gt; - This parameter specifies the orientation of the JSON file. The default value is &lt;code&gt;columns&lt;/code&gt;, which assumes that the JSON file has a column-based structure. Other options include &lt;code&gt;index&lt;/code&gt;, which assumes a row-based structure, and &lt;code&gt;records&lt;/code&gt;, which assumes that the JSON file is a list of records. Suppose we have this as our JSON file:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Make": ["Audi", "Mercedes Benz", "Volkswagen"],
    "Model": ["Rs6", "S560", "Touareg"],
    "Engine Size": [4.2, 3.0, 4.2],
    "Fuel": ["Diesel", "Diesel", "CNG"],
    "Year Of Manufacture": [2005, 2018, 2015]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_json("data.json", orient="columns", index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OUTPUT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsivkxpoh8ns72wb56fk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsivkxpoh8ns72wb56fk.png" alt="output of the json" width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, we set the orient option to 'columns' to indicate to pandas that the data in the JSON file is ordered by columns. As a result, the method generates a DataFrame with the columns &lt;code&gt;Make, Model, Engine Size, Fuel, Year Of Manufacture&lt;/code&gt; and the corresponding data from the JSON file.&lt;/p&gt;

&lt;p&gt;There are other parameters like &lt;code&gt;typ&lt;/code&gt;, which specifies the type of object to create from the JSON file. You can read about more parameters of read_json &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.read_json.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
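The &lt;code&gt;records&lt;/code&gt; orientation mentioned above can be sketched like this, using an in-memory buffer with made-up data so the example is self-contained:

```python
import io

import pandas as pd

# A JSON array of records: one object per row.
records_json = io.StringIO(
    '[{"Make": "Audi", "Model": "Rs6"},'
    ' {"Make": "Toyota", "Model": "Supra"}]'
)

# orient="records" tells pandas each list element is one row.
df = pd.read_json(records_json, orient="records")
print(df.shape)  # (2, 2)
```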

&lt;p&gt;&lt;strong&gt;4. read_sql&lt;/strong&gt;&lt;br&gt;
This method in pandas allows you to read data from a SQL database into a pandas DataFrame. I'm going to use SQLite to illustrate examples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import sqlite3

conn = sqlite3.connect('data.db')
df = pd.read_sql(sql_query, conn)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;sql_query&lt;/code&gt;: A string containing the SQL query that you want to execute. This can be a SELECT statement, a JOIN, or any other valid SQL command that returns data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;con&lt;/code&gt;: A database connection object. This is an instance of any Python library that provides a connection to a database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;params&lt;/code&gt;: A list of parameters to pass to the SQL query. This is useful for executing parameterized queries that contain placeholders for user input.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import sqlite3

conn = sqlite3.connect('orders.db')
orderID = 233

df = pd.read_sql('SELECT * FROM orders WHERE orderID = ?', conn, params=[orderID])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we define a variable orderID that contains the Order Id that we want to retrieve from the &lt;code&gt;orders&lt;/code&gt; table. We then define a SQL query that uses a parameter placeholder (?) in place of the orderID. We pass the SQL query to the read_sql() method, along with the conn object that represents our connection to the SQLite database.&lt;/p&gt;

&lt;p&gt;There are other parameters you can use in the read_sql method, read more about them &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;
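One of those extra parameters is &lt;code&gt;index_col&lt;/code&gt;, which promotes a result column to the DataFrame index. A minimal self-contained sketch, using an in-memory SQLite database with made-up order data:

```python
import sqlite3

import pandas as pd

# In-memory database so the example needs no external file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (orderID INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(233, 19.99), (234, 5.50)])

# index_col turns the orderID column into the DataFrame index.
df = pd.read_sql("SELECT * FROM orders", conn, index_col="orderID")
print(df.loc[233, "total"])  # 19.99
conn.close()
```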

&lt;p&gt;Those are the common methods used to read data in pandas; there are others like &lt;code&gt;read_html, read_pickle, read_table&lt;/code&gt;, etc.&lt;/p&gt;

&lt;h2&gt;
  Writing to files using pandas
&lt;/h2&gt;

&lt;p&gt;Pandas offers various methods to export data from a DataFrame to a file. The supported file types include Excel workbooks, CSV files, SQL databases, JSON files, and others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. to_excel&lt;/strong&gt;&lt;br&gt;
This method allows you to write a DataFrame to an Excel file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.to_excel('data.xlsx', index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;sheet_name&lt;/code&gt;: This is the name of the sheet to write the DataFrame to. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;index&lt;/code&gt;: A Boolean value that indicates whether to include the DataFrame's index in the output file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;excel_writer&lt;/code&gt;: The name of the Excel file to write to, or an instance of ExcelWriter that allows you to write to multiple sheets in the same file.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;writer = pd.ExcelWriter('data.xlsx', engine='xlsxwriter')
orders_df.to_excel(writer, sheet_name='Orders', index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;you can read more about to_excel parameter &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
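To make the multi-sheet use of ExcelWriter concrete, here is a small sketch (the file name and DataFrames are made up, and writing .xlsx files assumes an Excel engine such as openpyxl is installed):

```python
import pandas as pd

orders_df = pd.DataFrame({"orderID": [233, 234]})
customers_df = pd.DataFrame({"name": ["Ada", "Linus"]})

# One ExcelWriter, several sheets in the same workbook;
# the context manager saves and closes the file on exit.
with pd.ExcelWriter("report.xlsx") as writer:
    orders_df.to_excel(writer, sheet_name="Orders", index=False)
    customers_df.to_excel(writer, sheet_name="Customers", index=False)

sheets = pd.ExcelFile("report.xlsx").sheet_names
print(sheets)  # ['Orders', 'Customers']
```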

&lt;p&gt;&lt;strong&gt;2. to_csv&lt;/strong&gt;&lt;br&gt;
This method is used to write a pandas DataFrame to a CSV file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;sep&lt;/code&gt;: The delimiter that will be used between fields in the output file. The default is a comma (','). You can use others such as &lt;code&gt;;&lt;/code&gt;, &lt;code&gt;|&lt;/code&gt;, or a space.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;encoding&lt;/code&gt;: The character encoding to use for the output file. The default is 'utf-8'. Other encodings include ascii, utf_32, etc. You can find the full list &lt;a href="https://docs.python.org/3/library/codecs.html#standard-encodings" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;mode&lt;/code&gt;: The mode to open the file in. The default is &lt;code&gt;w&lt;/code&gt;, which overwrites any existing file. You can also use &lt;code&gt;a&lt;/code&gt;, which appends data to the file each time you write to it.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.to_csv('data.csv', sep=';', index=True, header=True, encoding='utf-8', mode='w')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pandas also provides other methods to write data, like to_json, to_sql, to_dict, to_numpy, and many more. You can check the rest of the methods in the pandas &lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
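As one small sketch of those writers, to_json mirrors read_json's &lt;code&gt;orient&lt;/code&gt; options (the DataFrame here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Make": ["Audi", "Toyota"],
                   "Model": ["Rs6", "Supra"]})

# orient="records" writes one JSON object per row; with no
# path argument, to_json returns the JSON as a string.
json_text = df.to_json(orient="records")
print(json_text)
```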

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Methods used to manipulate Data:&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;head()&lt;/code&gt;: This method returns the first n rows of the DataFrame. By default, it returns the first 5 rows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;tail()&lt;/code&gt;: This method returns the last n rows of the DataFrame. By default, it returns the last 5 rows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;dropna()&lt;/code&gt;: This method removes any rows that contain missing values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;fillna(value)&lt;/code&gt;:This method replaces missing values with the specified value.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;groupby(column)&lt;/code&gt;: This method groups the DataFrame by the specified column and returns a GroupBy object. You can then use this object to perform various aggregation functions on the groups.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are other DataFrame methods, such as sort_values(), pivot(), and merge(), that you can combine with data read via read_csv and read_excel. Take a look at them &lt;a href="https://pandas.pydata.org/docs/reference/general_functions.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_excel('data.xlsx', sheet_name='Audi', header=0, usecols=[0, 1, 3], skiprows=2, nrows=100)

_# print the first 5 rows_
print(df.head())

_# print summary statistics for each column_
print(df.describe())

_# remove any rows with missing values_
df = df.dropna()

# fill missing values with 0
df = df.fillna(0)

#groupby category and calculate the mean 
grouped_df = df.groupby('category').mean()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is just the beginning of the many impressive capabilities of this library. It offers numerous distinctive features that simplify the task of manipulating, analyzing, and visualizing data for data analysts, data scientists, and anyone else working with data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Data is like garbage. You'd better know what you are going to do with it before you collect it." - Mark Twain&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>automation</category>
      <category>career</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
