Forem: Patman17

Beginner Guide to Optimizing Pandas Calculations

Patman17 — Mon, 04 Nov 2019 03:04:00 +0000

Recently, I started competing in some Kaggle competitions, and one of them had time restrictions that made me look into optimizing my Pandas calculations.

As a first-year student in data science and Python, I had limited experience with Pandas and its inner workings. However, one thing I noticed firsthand was how slowly it ran when using any sort of "for loop". This was a problem because I was only used to old C++-style coding, which uses "for loops" pretty frequently. Thus, I started to look at different ways to implement calculations on Pandas data frames.

In this post, I will walk through several ways to do calculations on Pandas data frames and give my picks for optimizing them. I will be using a dataset called df_train that has 509762 rows to demonstrate the calculations.
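
The df_train dataset itself is not included in this post. To follow along, a stand-in with the same two columns used below (S and Dir) could be built roughly like this. The value ranges and the random seed are my assumptions, not properties of the real competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train: same row count as in the post,
# but the values are random, not the real competition data.
N_ROWS = 509762
rng = np.random.default_rng(seed=0)

df_train = pd.DataFrame({
    "S": rng.uniform(0.0, 10.0, size=N_ROWS),     # assumed: a non-negative magnitude
    "Dir": rng.uniform(0.0, 360.0, size=N_ROWS),  # assumed: a direction in degrees
})

print(df_train.shape)
```

Timings on random data like this will not match the numbers below exactly, but the relative ordering of the methods should be comparable.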

Method 1: "for loop" over every element of a Pandas series.

This is the worst way to do any calculation, but it can be conceptually the easiest way to implement a solution. I suggest you avoid this method.

%%time
import math

dx = []
for element in range(len(df_train['Dir'])):
    # scale S by the cosine of Dir, converting Dir from degrees to radians
    x = df_train['S'][element] * math.cos(df_train['Dir'][element] * math.pi / 180.0)
    dx.append(x)
df_train['dx'] = dx

CPU times: user 10.5 s, sys: 91.2 ms, total: 10.6 s
Wall time: 10.8 s

Method 2: Use "iterrows" for row operations.

If you want to iterate over each row of a dataframe, this is the method built for that purpose in Pandas. Again, I would avoid it, as I actually got a slower computation time (about 5x slower than Method 1).

%%time
dx = []
for index, row in df_train.iterrows():
    # each row is a Series, so columns are looked up by name
    x = row['S'] * math.cos(row['Dir'] * math.pi / 180.0)
    dx.append(x)
df_train['dx'] = dx

CPU times: user 49.3 s, sys: 1.22 s, total: 50.6 s
Wall time: 51.9 s

Method 3: Use apply function.

This would be my first choice among the looping methods, as it is conceptually easy and requires the least amount of code. The downside, again, is that it does not use the vectorization of the underlying numpy array. It still iterates over each row, but does so with a number of internal optimizations, such as using iterators in Cython. In my example, however, it was slightly slower than the plain for loop in Method 1.

%%time
df_train['dx'] = df_train.apply(lambda row: row.S*math.cos((row.Dir)*math.pi/180.0), axis =1)

CPU times: user 14.2 s, sys: 455 ms, total: 14.7 s
Wall time: 15 s

Method 4: Vectorization over Pandas Series

Vectorization means performing the calculation over the whole numpy array, i.e. doing the whole column calculation at once. This takes full advantage of the numpy library, and it is the method we want to strive for.

%%time
df_train['dx'] = df_train.S*np.cos((df_train.Dir)*np.pi/180.0)

CPU times: user 13.7 ms, sys: 3.66 ms, total: 17.4 ms
Wall time: 16.3 ms

As we can see, the code is much simpler and it is several hundred times faster than the row-iterative methods (16.3 ms versus 10.8 s for the plain for loop).
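
Before trusting the faster version, it is worth checking that it gives the same numbers as the row-by-row version. A minimal sketch on a small made-up frame (the values are mine; only the column names S and Dir come from the post):

```python
import math

import numpy as np
import pandas as pd

# Small made-up frame; S and Dir mirror the column names used in the post.
df = pd.DataFrame({"S": [1.0, 2.0, 3.0, 4.0], "Dir": [0.0, 60.0, 90.0, 180.0]})

# Row-by-row version (Method 3 style).
dx_apply = df.apply(lambda row: row.S * math.cos(row.Dir * math.pi / 180.0), axis=1)

# Whole-column version (Method 4 style).
dx_vector = df.S * np.cos(df.Dir * np.pi / 180.0)

# Compare with a tolerance because both results are floating point.
same = np.allclose(dx_apply, dx_vector)
print(same)
```

The comparison uses np.allclose rather than == so that tiny floating-point differences between math.cos and np.cos do not count as a mismatch.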

Method 5: Vectorization over Numpy array

The final improvement is to convert each Pandas Series to an actual numpy array with .values before doing the calculation. It gives a slight boost in calculation speed.

%%time
df_train['dx'] = df_train.S.values*np.cos((df_train.Dir.values)*np.pi/180.0)

CPU times: user 7.87 ms, sys: 3.03 ms, total: 10.9 ms
Wall time: 12.1 ms

Again, this method is not necessary, but it gave a modest speed boost in my run (12.1 ms versus 16.3 ms).
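
As a side note, pandas 0.24 and later provide .to_numpy(), which the pandas documentation recommends over .values. Here is a minimal sketch of the same idea written that way, on a small made-up frame, with np.deg2rad standing in for the manual * np.pi / 180.0 conversion. I have not timed this variant, so treat it as an alternative spelling rather than a faster method.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names S and Dir come from the post.
df = pd.DataFrame({"S": [2.0, 2.0, 2.0], "Dir": [0.0, 90.0, 180.0]})

# Pull the raw numpy arrays out of the Series.
s = df["S"].to_numpy()
direction = df["Dir"].to_numpy()

# Same calculation as Method 5, with np.deg2rad doing the unit conversion.
df["dx"] = s * np.cos(np.deg2rad(direction))
```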

Conclusion

We want to avoid any row-iterative method when performing calculations on a Pandas dataframe. However, for certain transformations like string manipulations, I found that .apply is my go-to, as it is simple and gets the job done.

But if we are using only numpy-based operations, we should try Method 4 or 5, as they truly use the power of the numpy-based dataframe.
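
For the string case mentioned above, here is a minimal sketch of using .apply. The column name and the cleanup rule are invented for illustration. pandas also has a .str accessor that covers many common string operations, shown alongside for comparison.

```python
import pandas as pd

# Made-up column of messy strings.
df = pd.DataFrame({"team": [" dallas cowboys", "NEW YORK giants ", "Green Bay Packers"]})

def clean_name(text):
    """Trim surrounding whitespace and capitalize each word."""
    return text.strip().title()

# .apply calls the Python function once per value.
df["team_apply"] = df["team"].apply(clean_name)

# The .str accessor expresses the same cleanup without a custom function.
df["team_str"] = df["team"].str.strip().str.title()

print(df["team_apply"].tolist())
```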

Intro to Web Scraping a Table

Patman17 — Fri, 27 Sep 2019 19:06:19 +0000

What is web scraping? From my limited knowledge, web scraping is getting information off a webpage by using the underlying HTML code that contains it.

Today, I am going to walk through the general process to web scrape a table off the internet. In the process, I hope to answer a random question: Does cold weather affect quarterback play?

What you need:

requests (HTTP library)
BeautifulSoup4 (Parser library)
pandas (data manipulation library)

Use this link for more setup details

Overview of Web Scraping

1) Get the response
2) Find the object
3) Parse and store the object
4) Finalize the data

1) Get the response.

This is the easy part. You use 'requests' to get the response from the URL, and then you turn the response into a 'soup' with BS4 that you can then navigate and parse.

from requests import get
from bs4 import BeautifulSoup as soup

url ='http://www.espn.com/nfl/qbr/_/type/player-week/week/3'
response = get(url) #getting url response
nfl = soup(response.content, 'html.parser') #turning the response into soup

2) Find the object (the hard part)

For beginners this is the hard part, because HTML is a little daunting to decipher at first, but there are some underlying principles that help.

With BS4, you can navigate the HTML soup in two ways.

1) Parent, Child, Sibling Hierarchy
HTML is structured with higher-level 'Parent' tags that encompass 'Child' tags. BS4 has methods to move manually, tag by tag, through the HTML if you need to fine-tune where you are.

2) .find( ) and .find_all( )

This is the easier method. You can tell BS4 to find a specific type of tag, along with the 'name' (such as a class or id) you are looking for.

How do you know what name you want? Easy: navigate to the webpage, right-click the object/table you want to parse, and 'inspect' it. This should open a window that directs you to the corresponding HTML code.
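
The two ways of navigating can be sketched on a tiny made-up page. The tag names, class names, and id here are invented for illustration and have nothing to do with the ESPN page.

```python
from bs4 import BeautifulSoup

# Tiny made-up page, written without whitespace between tags so that
# sibling navigation lands on tags rather than on newline strings.
html = (
    '<div id="stats">'
    '<h2 class="title">QBR</h2>'
    '<p class="note">Week 3</p>'
    '</div>'
)
page = BeautifulSoup(html, "html.parser")

# Way 1: walk the hierarchy by hand.
heading = page.div.h2                     # down to a child of the <div>
parent_id = heading.parent["id"]          # back up to the parent
sibling_text = heading.next_sibling.text  # across to the next tag

# Way 2: ask for a tag by type and by the class found with 'inspect'.
note = page.find("p", class_="note")
all_paragraphs = page.find_all("p")

print(parent_id, sibling_text, note.text, len(all_paragraphs))
```

On a real page there is usually whitespace between tags, so next_sibling may return a newline string first. That is one reason I lean on .find( ) and .find_all( ) instead.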

Tables are even easier, because they are structured in a fairly similar way on most webpages.

1) First find the 'table' tag. This will grab the whole table object.
2) Next, find_all row tags ('tr'). This will grab all rows in the table.
3) Next, find_all cell tags ('td') in each row. This will grab all cell items in each row.
4) Sometimes the header row is tagged as 'th'. This is useful if you want to label your columns the same as the headers.
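
The four steps above can be sketched on a small inline table. The HTML and its values are made up for illustration, and a real page will usually need some trial and error first.

```python
from bs4 import BeautifulSoup

# Made-up table for illustration; the values are not real ESPN data.
html = """
<table>
  <tr><th>Player</th><th>QBR</th></tr>
  <tr><td>Player A</td><td>71.2</td></tr>
  <tr><td>Player B</td><td>48.5</td></tr>
</table>
"""
page = BeautifulSoup(html, "html.parser")

table = page.find("table")   # step 1: the whole table object
rows = table.find_all("tr")  # step 2: every row in the table

# Step 4: the first row holds the 'th' header cells.
headers = [th.text for th in rows[0].find_all("th")]

# Step 3: the remaining rows hold the 'td' data cells.
cells = [[td.text for td in row.find_all("td")] for row in rows[1:]]

print(headers)
print(cells)
```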

I found that slowly working through each object to find my place in the HTML code worked best. Many times I would index into different objects to see the response and adjust accordingly. It is a pretty iterative process in the beginning, and you might have to backtrack to move forward. Referring back to the original web page can also help you find where you are in the HTML code.

General work flow:

1) Attempt to access an object/tag
2) Count/verify # of objects
3) See the response
4) Verify with web page
5) Go further into the object or repeat process.

Also, if you get stuck accessing info or navigating, use this cheatsheet.

http://akul.me/blog/2016/beautifulsoup-cheatsheet/

Below is the code summarizing this process:

tables = nfl.findAll('table') # finding all tables in this soup (lucky for me, only one table)
len(tables) # checking the number of tables; as predicted, only 1
rows = tables[0].findAll('tr') # from that table I now look for rows; 'tr' is a row in HTML
len(rows) # verifying how many rows I am trying to get
first_row = rows[1] # inspect one row
first_row.findAll('td')[1].text # looking at one row and one element

3) Parse and store the object.

Once you have found a way to get the object and info you want from the table, just apply the same process to each row. I find that a function that parses one row, combined with a list comprehension to loop over all rows, works best. Below is the code:

import pandas as pd

def parse_row(row):
    # return the text of every 'td' cell in one row
    return [x.text for x in row.findAll('td')]

list_parsed_rows = [parse_row(row) for row in rows] # parse every row in the table
df = pd.DataFrame(list_parsed_rows)

4) Finalize the data.

Use pandas to wrangle the data to whatever your heart desires. I found that some columns held multiple pieces of information and needed to be parsed out even further.
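
As an example of parsing a column out further, suppose a scraped column holds a player name and a team code in one string, such as "Player A, KC". That layout is an assumption for illustration, and the actual ESPN table may differ.

```python
import pandas as pd

# Made-up rows; every cell arrives as a string after scraping.
df = pd.DataFrame({"player": ["Player A, KC", "Player B, NE"], "qbr": ["71.2", "48.5"]})

# Split the combined column into two new columns.
df[["name", "team"]] = df["player"].str.split(", ", expand=True)

# Convert the rating from text to a number so it can be plotted or averaged.
df["qbr"] = pd.to_numeric(df["qbr"])

print(df[["name", "team", "qbr"]])
```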

Below is the result I got from web scraping ESPN for the quarterback rating (QBR) and another weather site for temperature. The QBR is normalized from 0 to 100, with 50 being the average QB score/effectiveness. In general, there seems to be a hindrance, as seen from the absence of high QBRs at low temps.

Cheers!

Why Data Science?

Patman17 — Fri, 06 Sep 2019 18:37:06 +0000

Intro & Background

Hello, my name is Patrick Ly. I have been working in the oil and gas industry for 5 years. First, I held a position as an Operations Engineer which is an intermediary technical position between the office and oilfield operations. I helped improve the effectiveness and efficiency of our drilling and completions programs. Later in my career I got promoted to Site Supervisor. In this position I was responsible for managing the oilfield service equipment, asset wells, and personnel onsite in the field by delegating and facilitating the work flow.

This year, my company was acquired by another upstream oil company and I was laid off. In truth, it was a blessing in disguise because I had contemplated quitting my position for a career shift earlier in the year. I enjoyed the people I worked with, the satisfaction of having a major impact on the business, and the monetary rewards. However, the operational side of the oil and gas industry did not stimulate my technical interests, nor did I see it aligning with my long-term career goals.

The Search

I started my job search to find a role that could utilize my business knowledge of the oil and gas industry but had a more technical focus. I found that a traditional drilling and completions engineer role would be the best fit from my prior experience but again the technical part of the role did not appeal to me. As an operations engineer you are primarily a project manager and then a business analyst. Thus, there is some limitation on how much higher level analysis/analytics you can do. The priority is executing and managing the operation to be within budget.

As I progressed in my job search, I noticed the growing number of jobs related to analytics and data science. This was when I did more research into this sector. One main source of information was one of my friends, who had just finished a master's program in data science. He talked about how relatively new the field is and how the demand was growing (especially in oil & gas). He also talked about the advanced applications of data science, which intrigued me.

The Reason

With this interest, I began to examine the reasons I want to pursue data science. The main points are summarized below:

1) Great fit - Besides having the technical capability as an engineer, data science fits my personality, as I am a very logical and analytical person who likes to look at the subject at hand objectively.
2) Improving technical skills - Data science will improve many technical skills such as programming, statistical analysis, problem solving, etc. It will provide a platform where I can become a forever learner.
3) Practicality & scope - I can envision numerous applications of data science that I could use not only in my career but in my personal life as well.

Overall, I feel that data science could be a great way to expand my technical abilities and segue my career into another sector that I could be passionate about.