<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: jpetoskey</title>
    <description>The latest articles on Forem by jpetoskey (@jpetoskey).</description>
    <link>https://forem.com/jpetoskey</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F764553%2Fa9daa6f3-ed4a-4f77-9830-4350e3a92ecf.png</url>
      <title>Forem: jpetoskey</title>
      <link>https://forem.com/jpetoskey</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jpetoskey"/>
    <language>en</language>
    <item>
      <title>Learn to Kaggle</title>
      <dc:creator>jpetoskey</dc:creator>
      <pubDate>Thu, 09 Jun 2022 17:20:02 +0000</pubDate>
      <link>https://forem.com/jpetoskey/learn-to-kaggle-c86</link>
      <guid>https://forem.com/jpetoskey/learn-to-kaggle-c86</guid>
      <description></description>
    </item>
    <item>
      <title>Time Series Decomposition - Spotting Seasonality</title>
      <dc:creator>jpetoskey</dc:creator>
      <pubDate>Mon, 02 May 2022 20:39:55 +0000</pubDate>
      <link>https://forem.com/jpetoskey/time-series-decomposition-spotting-seasonality-4e71</link>
      <guid>https://forem.com/jpetoskey/time-series-decomposition-spotting-seasonality-4e71</guid>
      <description>&lt;p&gt;The results fascinated me -- seasonal patterns when none were visible in the raw data.&lt;/p&gt;

&lt;p&gt;I had looked at trends in this visualization of the median price of homes by zip code: &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wtTH8UI6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kcqia3xoztukbsy36pxl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wtTH8UI6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kcqia3xoztukbsy36pxl.png" alt="median price of homes" width="880" height="589"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And, mainly noticed the upward and downward trend in the median price.&lt;/p&gt;

&lt;p&gt;However, when I performed and plotted a seasonal decomposition with statsmodels.tsa.seasonal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import and apply seasonal_decompose()
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(np.log(percent1))

# Gather the trend, seasonality, and residuals 
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

# Plot gathered statistics
plt.figure(figsize=(12,8))
plt.subplot(411)
plt.plot(np.log(m1), label='Original', color='blue')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend', color='blue')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal,label='Seasonality', color='blue')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals', color='blue')
plt.legend(loc='best')
plt.tight_layout()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I noticed something surprising in the decomposition output - seasonality! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pVesiVmP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mm54anw4oo7hn948t0qt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pVesiVmP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mm54anw4oo7hn948t0qt.png" alt="seasonality! :)" width="863" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I enjoyed seeing the seasonal pattern.  I had always heard that prices run slightly higher in the spring and summer months, and it is affirming to see that confirmed in the decomposition.&lt;/p&gt;

&lt;p&gt;I am now wondering about the magnitude of the seasonal swing.  I know it can't be too large, but I'm not sure how large.  &lt;/p&gt;

&lt;p&gt;I am curious about how I would figure this out.&lt;/p&gt;

&lt;p&gt;So far I have tried reversing the log transformation that was part of the decomposition, but exponentiating the seasonal values, which oscillate between -0.001 and 0.0005, leaves me with factors of roughly 1.  I'm guessing the seasonal difference is greater than $1 in a real estate market where median prices for this zip code range from $120,000 to $700,000.&lt;/p&gt;
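&lt;p&gt;One way I could sanity-check the magnitude: the decomposition ran on log(price), so a seasonal value is an additive offset in log space, and exponentiating it gives a multiplicative factor on price.  A rough sketch (the median price here is just an illustrative stand-in):&lt;/p&gt;

```python
import numpy as np

# Seasonal range observed in the decomposition (log scale)
seasonal_low, seasonal_high = -0.001, 0.0005

# Exponentiate to get multiplicative factors, then percent deviation from trend
pct_low = (np.exp(seasonal_low) - 1) * 100    # about -0.100%
pct_high = (np.exp(seasonal_high) - 1) * 100  # about +0.050%

# Dollar effect at an illustrative median price (hypothetical figure)
median_price = 400_000
print(f"seasonal swing: {pct_low:.3f}% to {pct_high:.3f}%")
print(f"at {median_price:,} dollars: roughly {median_price * (pct_low / 100):,.0f} "
      f"to {median_price * (pct_high / 100):,.0f} dollars")
```

&lt;p&gt;Read this way, the swing is on the order of a few hundred dollars rather than one dollar; a factor of roughly 1 means the multiplicative effect is small, not that it is worth $1.&lt;/p&gt;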

&lt;p&gt;If anyone reads this and has answers, please message me or comment - &lt;a href="mailto:Jim.Petoskey.146@gmail.com"&gt;Jim.Petoskey.146@gmail.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>datascience</category>
      <category>patterns</category>
    </item>
    <item>
      <title>Use Random Forest Feature Importance for Feature Selection</title>
      <dc:creator>jpetoskey</dc:creator>
      <pubDate>Tue, 29 Mar 2022 16:43:13 +0000</pubDate>
      <link>https://forem.com/jpetoskey/use-random-forest-feature-importance-for-feature-selection-1l38</link>
      <guid>https://forem.com/jpetoskey/use-random-forest-feature-importance-for-feature-selection-1l38</guid>
      <description>&lt;p&gt;Saving on computation is a priority for me, as I am practicing data science on an old machine, and don't currently have access to cloud computing software.&lt;/p&gt;

&lt;p&gt;So, what to do when recursive feature elimination (RFE) runs for 20 minutes while I take a snack break and is still spinning when I return?&lt;/p&gt;

&lt;p&gt;The answer is scikit-learn's model.feature_importances_ attribute.&lt;/p&gt;

&lt;p&gt;My Random Forest models are among my best, so this feature is a really nice way to save on computation and still return a high-performing model.&lt;/p&gt;

&lt;p&gt;After fitting, predicting, running a confusion matrix, and a classification report, I turn to feature importance so I am able to iterate and improve my model by running it on fewer features, or report the feature importance to my superiors.&lt;/p&gt;
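&lt;p&gt;As a minimal sketch of that workflow (synthetic data from make_classification standing in for the real training set, and the model name is a placeholder):&lt;/p&gt;

```python
# Minimal sketch: fit a random forest and read feature_importances_
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# One importance score per feature; the scores sum to 1
print(forest.feature_importances_)
```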

&lt;p&gt;See this bit of documentation code from &lt;a href="https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html" rel="noopener noreferrer"&gt;sklearn's Feature importances with a forest of trees web page&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of my favorite things to do with this information is to produce a top-ten-features bar chart that plots the features by importance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwc1d4hxoq2haaw9v3asz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwc1d4hxoq2haaw9v3asz.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And, here is the code I copied or wrote to produce it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Feature importance
features = pd.DataFrame(forest6.feature_importances_)
features['Feature'] = X_train.columns.values
features['Feature Importance'] = features[0]
features = features.drop(0, axis=1)
features = features.sort_values(by=['Feature Importance'], ascending=True)
features = features.nlargest(n=10, columns=['Feature Importance'])
features

import matplotlib.style as style
# style.available
style.use('fivethirtyeight')

plt.figure(figsize=(8,8))
plt.barh(range(10), features['Feature Importance'], align='center') 
plt.yticks(np.arange(10), features['Feature']) 
plt.xlabel('Feature importance (Weight)')
plt.ylabel('Feature')
plt.title('Top Ten Features by Importance')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I hope this helps you enjoy scikit-learn's feature_importances_!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Interpreting Coefficients  to Reap Return on Renovation</title>
      <dc:creator>jpetoskey</dc:creator>
      <pubDate>Fri, 25 Feb 2022 20:09:29 +0000</pubDate>
      <link>https://forem.com/jpetoskey/interpreting-coefficients-1c35</link>
      <guid>https://forem.com/jpetoskey/interpreting-coefficients-1c35</guid>
      <description>&lt;p&gt;Summary: This post is about the benefits and drawbacks of using raw, logged, and normalized data for interpreting coefficients of multiple linear regressions.&lt;/p&gt;

&lt;p&gt;Coefficients, in this case, are the numbers in front of X, represented by the letter B, in a multiple linear regression equation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Txt_Nh06--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m8ndwq04bklrdtplyg52.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Txt_Nh06--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m8ndwq04bklrdtplyg52.jpeg" alt="Image description" width="516" height="98"&gt;&lt;/a&gt;&lt;br&gt;
-image from towardsdatascience.com&lt;/p&gt;

&lt;p&gt;See an example of a statsmodels Ordinary Least Squares Regression Analysis with coefficients below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WsBdp2c8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/blre5pnw886kkoaieb5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WsBdp2c8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/blre5pnw886kkoaieb5m.png" alt="Image description" width="599" height="723"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The coefficients provide insight into how certain features impact the slope of the trend line in a multiple linear regression.  They are important for understanding which features correlate well with the dependent variable and how much they influence the slope of the model.&lt;/p&gt;

&lt;p&gt;However, interpreting the coefficients is not simple and they shouldn't be accepted at face value.&lt;/p&gt;

&lt;p&gt;For example, after logging, normalizing, and modeling numerical data from the King County housing data set, I noticed that an increase in square footage strongly influenced the price of a house in positive terms, while adding a bedroom influenced the price in negative terms.  Taken at face value, this would say that adding bedrooms decreases the price of a home.  Knowing that this is likely untrue, and that adding a bedroom typically adds square footage, the negative coefficient for bedrooms is more plausibly the result of a confounding variable, such as location.  I imagine many high-priced homes with fewer bedrooms near King County's city centers driving this trend.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For this model, I was thinking about how dividing an existing room into two bedrooms would decrease price, as square footage would stay constant. So, maybe don't divide an existing room into two bedrooms in your next home remodel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a good example of how understanding source data of the coefficients allows them to be interpreted accurately.&lt;/p&gt;

&lt;p&gt;In addition, linear models can be built from a combination of raw, logged, and normalized data, so coefficients need to be interpreted differently in each case. I prefer models where all predictors are raw, logged, or normalized, and I like having one of each, because each type of data has its own benefits.&lt;/p&gt;

&lt;p&gt;Raw data provides the most insight into actual or real change in the dependent and independent variables because the coefficient can be read to mean a certain value change in the independent variable for a given change in the dependent variable.  For example, in the King County housing data set I referred to earlier, the price of a house increased by $211.46 for each additional square foot.  This coefficient is easy to use in a non-technical presentation.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oBLrNAEZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ifwxyglbiic8r3nf7zzt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oBLrNAEZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ifwxyglbiic8r3nf7zzt.png" alt="Image description" width="679" height="727"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I used logged data, it allowed me to determine  the percent change.  &lt;a href="https://stats.oarc.ucla.edu/sas/faq/how-can-i-interpret-log-transformed-variables-in-terms-of-percent-change-in-linear-regression"&gt;See this web page by UCLA's Advanced Research Computing, Statistical Methods and Data Analytics team.&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;For example, I could say that an increase of 0.66% in square footage would lead to a 1% increase in price.  Or, in simpler terms, an increase in square footage of 7% would increase the price of a home by more than 10%.  This could also be a helpful metric for a non-technical audience.&lt;/p&gt;
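&lt;p&gt;To sketch the arithmetic: in a log-log model the coefficient is an elasticity, the percent change in price per 1 percent change in square footage (the beta below is a hypothetical value implied by the 0.66%-to-1% reading above, not a coefficient from my actual model output):&lt;/p&gt;

```python
# Hypothetical elasticity implied by "0.66% sqft per 1% price"
beta = 1 / 0.66

pct_sqft = 6.7               # a 6.7% increase in square footage
pct_price = beta * pct_sqft  # roughly a 10.2% increase in price
print(f"{pct_sqft}% more sqft gives about {pct_price:.1f}% higher price")
```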

&lt;p&gt;However, when building a predictive model, that doesn't need to produce coefficients for a non-technical audience, it is often best to produce models with logged and normalized data.  The manipulated data allows the model to fit the assumptions of linearity, especially in regards to normality and heteroscedasticity, and for the magnitude of the coefficients to be directly compared to each other.&lt;/p&gt;

&lt;p&gt;See how the assumptions of linearity can only be met in this model with logged and normalized data.&lt;/p&gt;

&lt;p&gt;Logged and Normalized Data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2CsqBjAY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/80ek66rp81h9zy7t9d10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2CsqBjAY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/80ek66rp81h9zy7t9d10.png" alt="Image description" width="690" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--riEM98NQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iqb7faaailq688ahcso1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--riEM98NQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iqb7faaailq688ahcso1.png" alt="Image description" width="595" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Raw Data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NbQgBBZN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/91m6ppjx4k2tzou00jmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NbQgBBZN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/91m6ppjx4k2tzou00jmc.png" alt="Image description" width="704" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VWeoXTSH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uxsz6v74xd53onyqg1wl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VWeoXTSH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uxsz6v74xd53onyqg1wl.png" alt="Image description" width="478" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is apparent, above, that this is true for the King County housing data set, in which the logged and normalized data led to a model that met the assumptions of linearity and produced coefficients that showed square footage had a more positive influence on price than bathrooms, bedrooms, year built, or floors.  However, it should be noted that this model did not provide a way to determine how much square footage would need to be added to equal the addition of one bathroom.  A model built on raw data would be better for interpreting real value changes in the dependent and independent variables.&lt;/p&gt;

&lt;p&gt;In sum, models that are built on raw, logged, and normalized data each have different strengths, weaknesses, and methods for interpretation.  Using a combination of the data types is best when trying to build inferential and predictive models as each model can communicate the influence of features in different terms.  And, having background knowledge that helps explain certain unexpected trends is helpful, if not critical.&lt;/p&gt;

&lt;p&gt;After writing this post, I realized that people would likely want to know what they should do to improve the price of their home.  Most surprising for me was that interior upgrades, especially in regards to the condition of the whole home, can have a dramatically positive influence on home price.  Other remodel tips gleaned from the data set were to create a view - maybe add or increase the size of a window or build a deck, and to add livable square footage - maybe by converting a garage to living space.&lt;/p&gt;

&lt;p&gt;In terms of metrics from the raw coefficients:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For every additional square foot, the price increased by 211.46 dollars, which works out to 105,730 dollars for every 500 square feet.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In log terms, a 6.7% increase in square footage corresponded to a 10% increase in home price.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For every additional bedroom, the price decreased by 42,200 dollars. This is an issue in the model because homes with fewer bedrooms are likely to cost more than homes with more bedrooms - likely confounded by other variables such as location.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each additional bathroom, the price of a home increased by $55,570.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
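&lt;p&gt;For readers who want to check the arithmetic, the raw coefficients above combine directly (a quick sketch; the figures come from the bullets, not a fresh model run):&lt;/p&gt;

```python
# Raw coefficients quoted in the bullets above
per_sqft = 211.46       # dollars per additional square foot
per_bathroom = 55_570   # dollars per additional bathroom

# 500 extra square feet
print(f"500 sq ft: {per_sqft * 500:,.2f} dollars")

# Square footage with the same raw-model effect as one bathroom
print(f"one bathroom is worth about {per_bathroom / per_sqft:,.0f} sq ft")
```

&lt;p&gt;Under the raw model, one bathroom carries roughly the same price effect as about 263 square feet of living space, the comparison the logged-and-normalized model could not provide directly.&lt;/p&gt;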

</description>
      <category>python</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Pick a genre, many genres?</title>
      <dc:creator>jpetoskey</dc:creator>
      <pubDate>Sun, 05 Dec 2021 21:38:41 +0000</pubDate>
      <link>https://forem.com/jpetoskey/pick-a-genre-many-genres-212h</link>
      <guid>https://forem.com/jpetoskey/pick-a-genre-many-genres-212h</guid>
      <description>&lt;p&gt;My first performance task in my data science program with Flatiron School has been to produce recommendations for the theoretical Microsoft Studios about their first movie.  After looking through the data, I decided to explore genre first, to see if I noticed any trends.&lt;/p&gt;

&lt;p&gt;As I was joining genre to the movie titles in a separate csv, I was surprised to see that some genre pairings were more common than others.  For example, adventure, the most popular genre in the top 50 grossing movies, is most often paired with action.  I can remember my father talking about which movies he liked when I was a kid, and his descriptions often involved action and adventure.  When he dragged me to Disney's Pocahontas, I was pleasantly surprised with action and adventure, despite my preconception of it being a romance.&lt;/p&gt;

&lt;p&gt;However, counting movies by genre pairs or combinations didn't give me enough reliable information to make recommendations to theoretical Microsoft Studios, as you can see in this histogram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsql1y8kf6jyc7j2qvc8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsql1y8kf6jyc7j2qvc8f.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thinking of a movie as one particular genre can be rather limiting. I think this is why we are currently observing the trend that Jourdan Alderidge describes in The Beat.  They write, "The majority of content produced in the last several decades are often genre hybrids, using the rules of genre theory to produce new, unique, and different stories" (&lt;a href="https://www.premiumbeat.com/blog/guide-to-basic-film-genres/" rel="noopener noreferrer"&gt;The Beat&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;However, I was determined to pull individual genres, because understanding the pattern of individual genres would be important when theoretical Microsoft Studios plans their first movie.  This presented some difficulty: the genres were stored as comma-separated strings (the object data type), and to explode them into distinct rows I would need them as lists.&lt;/p&gt;

&lt;p&gt;This &lt;a href="https://towardsdatascience.com/how-to-quickly-create-and-unpack-lists-with-pandas-d0e78e487c75" rel="noopener noreferrer"&gt;resource&lt;/a&gt; was invaluable and provided the code needed to convert each string into a list, splitting at each comma. Called without an argument, split() splits a string on whitespace; since our genre strings are comma-separated with no whitespace, we pass the comma explicitly: split(',').&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df3['Genre'] = df3['Genre'].str.split(',')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I was then able to use the explode method and pull each genre into a new row.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df2 = df2.explode('Genre')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
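&lt;p&gt;Put together on a toy frame (the titles and genres here are made up for illustration):&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for the movie data
df = pd.DataFrame({
    'Title': ['Movie A', 'Movie B'],
    'Genre': ['Action,Adventure', 'Drama'],
})

df['Genre'] = df['Genre'].str.split(',')  # string to list
df = df.explode('Genre')                  # one genre per row
print(df['Genre'].tolist())  # ['Action', 'Adventure', 'Drama']
```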



&lt;p&gt;Once all the genres were in separate rows, I was able to produce a histogram of the genres of the top 50 grossing movies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpj0wfmo0ufzenre2w7hc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpj0wfmo0ufzenre2w7hc.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This made me happy and seemed to give me more relevant information to make recommendations to Microsoft.  And, going back to my Dad's preferences, it seems action/adventure is a good place to start.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>firstyearincode</category>
    </item>
  </channel>
</rss>
