Cleaning up our vehicle weight data#

Before we can analyze our data, we'll need to combine vehicle weights with makes and models, as well as clean up the results a bit.

import glob
import pandas as pd

Read in all of the vehicle weight csv files#

filenames = glob.glob("vehicle-data/*.csv")
filenames[:5]
['vehicle-data/MERCURY.csv',
 'vehicle-data/LINCOLN.csv',
 'vehicle-data/DELOREAN.csv',
 'vehicle-data/LAMBORGHINI.csv',
 'vehicle-data/SUZUKI.csv']
len(filenames)
60
df = pd.concat([pd.read_csv(filename) for filename in filenames], sort=False, ignore_index=True)
df.head(2)
Anti-Brake System Curb Weight-automatic Curb Weight-manual Engine Type Front Brake Type Fuel Economy-city Fuel Economy-highway Ground Clearance Make Model Model Year Rear Brake Type Tires Transmission-long Wheelbase
0 Non-ABS 2500 lbs 2423 lbs 1.6L L4 DOHC 16V Disc 23 - 25 miles/gallon 26 - 31 miles/gallon NaN Mercury Capri 1994 Disc 185/60R14 4-Speed Automatic | 5-Speed Manual 94.70 in.
1 Non-ABS 2500 lbs 2423 lbs 1.6L L4 DOHC 16V Disc 23 - 25 miles/gallon 26 - 31 miles/gallon NaN Mercury Capri 1994 Disc 185/60R14 4-Speed Automatic | 5-Speed Manual 94.70 in.

Cut it to be just the columns we're interested in#

df = df[['Curb Weight-automatic', 'Curb Weight-manual', 'Make', 'Model', 'Model Year']].copy()
df.head()
Curb Weight-automatic Curb Weight-manual Make Model Model Year
0 2500 lbs 2423 lbs Mercury Capri 1994
1 2500 lbs 2423 lbs Mercury Capri 1994
2 2492 lbs 2404 lbs Mercury Capri 1993
3 2492 lbs 2404 lbs Mercury Capri 1993
4 2492 lbs 2404 lbs Mercury Capri 1992

Remove 'lbs' from the weight columns, then average the two#

df['Curb Weight-automatic'] = df['Curb Weight-automatic'].str.replace(' lbs', '')
df['Curb Weight-manual'] = df['Curb Weight-manual'].str.replace(' lbs', '')
df.head(2)
Curb Weight-automatic Curb Weight-manual Make Model Model Year
0 2500 2423 Mercury Capri 1994
1 2500 2423 Mercury Capri 1994
# Average the automatic and manual weights
df['weight'] = df[['Curb Weight-automatic','Curb Weight-automatic']].astype(float).mean(axis=1)
df.head(2)
Curb Weight-automatic Curb Weight-manual Make Model Model Year weight
0 2500 2423 Mercury Capri 1994 2500.0
1 2500 2423 Mercury Capri 1994 2500.0
df.drop(columns=['Curb Weight-automatic', 'Curb Weight-manual'], inplace=True)
df.head()
Make Model Model Year weight
0 Mercury Capri 1994 2500.0
1 Mercury Capri 1994 2500.0
2 Mercury Capri 1993 2492.0
3 Mercury Capri 1993 2492.0
4 Mercury Capri 1992 2492.0

Maybe there are multiples per make/model/year, so we'll average those, too#

# Now let's group by make, model and year and get the average weight
weights = df.groupby(['Make', 'Model', 'Model Year']).weight.mean().to_frame().reset_index()

# Remove anything that doesn't have a weight
weights.dropna(subset=['weight'], inplace=True)

# Make the weights integers because we don't need that specificity
weights.weight = weights.weight.astype(int)

# What do we have?
weights.head()
Make Model Model Year weight
0 Acura CL 1997 3009
1 Acura CL 1998 3215
2 Acura CL 1999 3285
3 Acura CL 2001 3470
4 Acura CL 2002 3470
weights.shape
(4323, 4)

Standardization#

We'll need this to match content later

  • Rename Model Year column
  • Capitalize Make/Model
weights.rename(columns={'Model Year': 'ModelYear'}, inplace=True)
weights.ModelYear = weights.ModelYear.astype(str)
weights.Make = weights.Make.str.upper()
weights.Model = weights.Model.str.upper()

Interpolate missing values#

Sometimes we have 2005 and 2007 but not 2006. We'll interpolate those.

# Pivot so each type of car is a column
weights_pivot = weights.pivot_table(values="weight", index='ModelYear', columns=['Make', 'Model'])
weights_pivot.head(10)
Make ACURA ... VOLVO
Model CL ILX INTEGRA LEGEND MDX NSX RDX RL RLX SLX ... S70 S80 S90 V40 V50 V60 V70 XC60 XC70 XC90
ModelYear
1988 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1989 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1990 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1991 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992 NaN NaN 2615.0 3486.0 NaN 3098.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1993 NaN NaN 2615.0 3446.0 NaN 3097.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1994 NaN NaN 2670.0 3560.0 NaN 3109.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1995 NaN NaN 2670.0 3560.0 NaN 3208.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1996 NaN NaN 2670.0 NaN NaN 3208.0 NaN 3660.0 NaN 4315.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1997 3009.0 NaN 2703.0 NaN NaN 3069.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10 rows × 649 columns

# Interpolate missing values in those columns
weights_pivot = weights_pivot.interpolate(limit_direction='both')
weights_pivot.head(10)
Make ACURA ... VOLVO
Model CL ILX INTEGRA LEGEND MDX NSX RDX RL RLX SLX ... S70 S80 S90 V40 V50 V60 V70 XC60 XC70 XC90
ModelYear
1988 3009.0 2910.0 2615.0 3486.0 4323.0 3098.0 3968.0 3660.0 3933.0 4315.0 ... 3333.0 3618.0 4012.0 2998.0 3570.0 3527.0 3433.0 4387.0 3699.0 4493.0
1989 3009.0 2910.0 2615.0 3486.0 4323.0 3098.0 3968.0 3660.0 3933.0 4315.0 ... 3333.0 3618.0 4012.0 2998.0 3570.0 3527.0 3433.0 4387.0 3699.0 4493.0
1990 3009.0 2910.0 2615.0 3486.0 4323.0 3098.0 3968.0 3660.0 3933.0 4315.0 ... 3333.0 3618.0 4012.0 2998.0 3570.0 3527.0 3433.0 4387.0 3699.0 4493.0
1991 3009.0 2910.0 2615.0 3486.0 4323.0 3098.0 3968.0 3660.0 3933.0 4315.0 ... 3333.0 3618.0 4012.0 2998.0 3570.0 3527.0 3433.0 4387.0 3699.0 4493.0
1992 3009.0 2910.0 2615.0 3486.0 4323.0 3098.0 3968.0 3660.0 3933.0 4315.0 ... 3333.0 3618.0 4012.0 2998.0 3570.0 3527.0 3433.0 4387.0 3699.0 4493.0
1993 3009.0 2910.0 2615.0 3446.0 4323.0 3097.0 3968.0 3660.0 3933.0 4315.0 ... 3333.0 3618.0 4012.0 2998.0 3570.0 3527.0 3433.0 4387.0 3699.0 4493.0
1994 3009.0 2910.0 2670.0 3560.0 4323.0 3109.0 3968.0 3660.0 3933.0 4315.0 ... 3333.0 3618.0 4012.0 2998.0 3570.0 3527.0 3433.0 4387.0 3699.0 4493.0
1995 3009.0 2910.0 2670.0 3560.0 4323.0 3208.0 3968.0 3660.0 3933.0 4315.0 ... 3333.0 3618.0 4012.0 2998.0 3570.0 3527.0 3433.0 4387.0 3699.0 4493.0
1996 3009.0 2910.0 2670.0 3560.0 4323.0 3208.0 3968.0 3660.0 3933.0 4315.0 ... 3333.0 3618.0 4012.0 2998.0 3570.0 3527.0 3433.0 4387.0 3699.0 4493.0
1997 3009.0 2910.0 2703.0 3560.0 4323.0 3069.0 3968.0 3609.0 3933.0 4465.0 ... 3333.0 3618.0 4012.0 2998.0 3570.0 3527.0 3433.0 4387.0 3699.0 4493.0

10 rows × 649 columns

# Convert it back into the normal dataframe
filled_in = weights_pivot.T.stack(dropna=False).reset_index()
filled_in.rename(columns={0: 'weight'}, inplace=True)
filled_in.dropna(inplace=True)
filled_in.weight = filled_in.weight.astype(int)
filled_in.head(10)
Make Model ModelYear weight
0 ACURA CL 1988 3009
1 ACURA CL 1989 3009
2 ACURA CL 1990 3009
3 ACURA CL 1991 3009
4 ACURA CL 1992 3009
5 ACURA CL 1993 3009
6 ACURA CL 1994 3009
7 ACURA CL 1995 3009
8 ACURA CL 1996 3009
9 ACURA CL 1997 3009
filled_in.shape
(19470, 4)

Add in manual work#

manual_weights = pd.read_csv("weights_needed_completed.csv", na_values='x')
manual_filled = manual_weights.set_index(['make', 'model']).dropna(how='all').T.interpolate(limit_direction='both')
manual_realigned = manual_filled.T.stack(dropna=False).reset_index().rename(columns={
    'make': 'Make',
    'model': 'Model',
    'level_2': 'ModelYear',
    0: 'weight'
})
filled_in = pd.concat([filled_in, manual_realigned], sort=False, ignore_index=True)
filled_in.head()
Make Model ModelYear weight
0 ACURA CL 1988 3009.0
1 ACURA CL 1989 3009.0
2 ACURA CL 1990 3009.0
3 ACURA CL 1991 3009.0
4 ACURA CL 1992 3009.0
filled_in.shape
(19877, 4)

Save the output#

# Let's just save it right to the data folder
filled_in.to_csv("data/weights.csv", index=False)
 
make CHEVROLET DODGE LEXUS HYUNDAI LEXUS CHRYSLER TOYOTA FORD GMC LINCOLN CHEVROLET
model SILVERADO RAM ES SANTA FE RX TOWN AND COUNTRY 4-RUNNER E-350 SIERRA TOWN CAR GMT-400
1983 4200.0 4800.0 3300.0 3600.0 3700.0 4200.0 3800.0 5000.0 4200.0 4000.0 4300.0
1984 4200.0 4800.0 3300.0 3600.0 3700.0 4200.0 3800.0 5000.0 4200.0 4000.0 4300.0
1985 4200.0 4800.0 3300.0 3600.0 3700.0 4200.0 3800.0 5000.0 4200.0 4000.0 4300.0
1986 4200.0 4800.0 3300.0 3600.0 3700.0 4200.0 3800.0 5000.0 4200.0 4000.0 4300.0
1987 4200.0 4800.0 3300.0 3600.0 3700.0 4200.0 3800.0 5000.0 4200.0 4000.0 4300.0
 
Make Model ModelYear weight
0 CHEVROLET SILVERADO 1983 4200.0
1 CHEVROLET SILVERADO 1984 4200.0
2 CHEVROLET SILVERADO 1985 4200.0
3 CHEVROLET SILVERADO 1986 4200.0
4 CHEVROLET SILVERADO 1987 4200.0
5 CHEVROLET SILVERADO 1988 4200.0
6 CHEVROLET SILVERADO 1989 4200.0
7 CHEVROLET SILVERADO 1990 4200.0
8 CHEVROLET SILVERADO 1991 4200.0
9 CHEVROLET SILVERADO 1992 4200.0
10 CHEVROLET SILVERADO 1993 4200.0
11 CHEVROLET SILVERADO 1994 4200.0
12 CHEVROLET SILVERADO 1995 4200.0
13 CHEVROLET SILVERADO 1996 4200.0
14 CHEVROLET SILVERADO 1997 4200.0
15 CHEVROLET SILVERADO 1998 4200.0
16 CHEVROLET SILVERADO 1999 4200.0
17 CHEVROLET SILVERADO 2000 4280.0
18 CHEVROLET SILVERADO 2001 4360.0
19 CHEVROLET SILVERADO 2002 4440.0
20 CHEVROLET SILVERADO 2003 4520.0
21 CHEVROLET SILVERADO 2004 4600.0
22 CHEVROLET SILVERADO 2005 4700.0
23 CHEVROLET SILVERADO 2006 4800.0
24 CHEVROLET SILVERADO 2007 4900.0
25 CHEVROLET SILVERADO 2008 5000.0
26 CHEVROLET SILVERADO 2009 5000.0
27 CHEVROLET SILVERADO 2010 5000.0
28 CHEVROLET SILVERADO 2011 5000.0
29 CHEVROLET SILVERADO 2012 5000.0
... ... ... ... ...
377 CHEVROLET GMT-400 1990 4340.0
378 CHEVROLET GMT-400 1991 4360.0
379 CHEVROLET GMT-400 1992 4380.0
380 CHEVROLET GMT-400 1993 4400.0
381 CHEVROLET GMT-400 1994 4420.0
382 CHEVROLET GMT-400 1995 4440.0
383 CHEVROLET GMT-400 1996 4460.0
384 CHEVROLET GMT-400 1997 4480.0
385 CHEVROLET GMT-400 1998 4500.0
386 CHEVROLET GMT-400 1999 4500.0
387 CHEVROLET GMT-400 2000 4500.0
388 CHEVROLET GMT-400 2001 4500.0
389 CHEVROLET GMT-400 2002 4500.0
390 CHEVROLET GMT-400 2003 4500.0
391 CHEVROLET GMT-400 2004 4500.0
392 CHEVROLET GMT-400 2005 4500.0
393 CHEVROLET GMT-400 2006 4500.0
394 CHEVROLET GMT-400 2007 4500.0
395 CHEVROLET GMT-400 2008 4500.0
396 CHEVROLET GMT-400 2009 4500.0
397 CHEVROLET GMT-400 2010 4500.0
398 CHEVROLET GMT-400 2011 4500.0
399 CHEVROLET GMT-400 2012 4500.0
400 CHEVROLET GMT-400 2013 4500.0
401 CHEVROLET GMT-400 2014 4500.0
402 CHEVROLET GMT-400 2015 4500.0
403 CHEVROLET GMT-400 2016 4500.0
404 CHEVROLET GMT-400 2017 4500.0
405 CHEVROLET GMT-400 2018 4500.0
406 CHEVROLET GMT-400 2019 4500.0

407 rows × 4 columns