Cleaning up our vehicle weight data#
Before we can analyze our data, we'll need to combine vehicle weights with makes and models, as well as clean up the results a bit.
import glob
import pandas as pd
Read in all of the vehicle weight csv files#
filenames = glob.glob("vehicle-data/*.csv")
filenames[:5]
len(filenames)
df = pd.concat([pd.read_csv(filename) for filename in filenames], sort=False, ignore_index=True)
df.head(2)
Cut it to be just the columns we're interested in#
df = df[['Curb Weight-automatic', 'Curb Weight-manual', 'Make', 'Model', 'Model Year']].copy()
df.head()
Remove 'lbs' from the weight columns, then average the two#
df['Curb Weight-automatic'] = df['Curb Weight-automatic'].str.replace(' lbs', '')
df['Curb Weight-manual'] = df['Curb Weight-manual'].str.replace(' lbs', '')
df.head(2)
# Average the automatic and manual weights
df['weight'] = df[['Curb Weight-automatic','Curb Weight-automatic']].astype(float).mean(axis=1)
df.head(2)
df.drop(columns=['Curb Weight-automatic', 'Curb Weight-manual'], inplace=True)
df.head()
Maybe there are multiples per make/model/year, so we'll average those, too#
# Now let's group by make, model and year and get the average weight
weights = df.groupby(['Make', 'Model', 'Model Year']).weight.mean().to_frame().reset_index()
# Remove anything that doesn't have a weight
weights.dropna(subset=['weight'], inplace=True)
# Make the weights integers because we don't need that specificity
weights.weight = weights.weight.astype(int)
# What do we have?
weights.head()
weights.shape
Standardization#
We'll need this to match content later
- Rename Model Year column
- Capitalize Make/Model
weights.rename(columns={'Model Year': 'ModelYear'}, inplace=True)
weights.ModelYear = weights.ModelYear.astype(str)
weights.Make = weights.Make.str.upper()
weights.Model = weights.Model.str.upper()
Interpolate missing values#
Sometimes we have 2005 and 2007 but not 2006. We'll interpolate those.
# Pivot so each type of car is a column
weights_pivot = weights.pivot_table(values="weight", index='ModelYear', columns=['Make', 'Model'])
weights_pivot.head(10)
# Interpolate missing values in those columns
weights_pivot = weights_pivot.interpolate(limit_direction='both')
weights_pivot.head(10)
# Convert it back into the normal dataframe
filled_in = weights_pivot.T.stack(dropna=False).reset_index()
filled_in.rename(columns={0: 'weight'}, inplace=True)
filled_in.dropna(inplace=True)
filled_in.weight = filled_in.weight.astype(int)
filled_in.head(10)
filled_in.shape
Add in manual work#
manual_weights = pd.read_csv("weights_needed_completed.csv", na_values='x')
manual_filled = manual_weights.set_index(['make', 'model']).dropna(how='all').T.interpolate(limit_direction='both')
manual_realigned = manual_filled.T.stack(dropna=False).reset_index().rename(columns={
'make': 'Make',
'model': 'Model',
'level_2': 'ModelYear',
0: 'weight'
})
filled_in = pd.concat([filled_in, manual_realigned], sort=False, ignore_index=True)
filled_in.head()
filled_in.shape
Save the output#
# Let's just save it right to the data folder
filled_in.to_csv("data/weights.csv", index=False)