{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A final cleaning, combining, and filtering of our vehicle data\n", "\n", "After combining from so many sources, we need to filter out the car crashes we're interested in. We're curious about 2-car accidents that happen between light vehicles (no tractor-trailers)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<p class=\"reading-options\">\n <a class=\"btn\" href=\"/car-crashes-weight-regression/05-clean-combine-and-filter-data\">\n <i class=\"fa fa-sm fa-book\"></i>\n Read online\n </a>\n <a class=\"btn\" href=\"/car-crashes-weight-regression/notebooks/05 - Clean combine and filter data.ipynb\">\n <i class=\"fa fa-sm fa-download\"></i>\n Download notebook\n </a>\n <a class=\"btn\" href=\"https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/car-crashes-weight-regression/notebooks/05 - Clean combine and filter data.ipynb\" target=\"_new\">\n <i class=\"fa fa-sm fa-laptop\"></i>\n Interactive version\n </a>\n</p>" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "pd.set_option(\"display.max_columns\", 100)\n", "pd.set_option(\"display.max_rows\", 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Read in our data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The vehicles involved in the crash" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>AREA_DAMAGED_CODE1</th>\n", " <th>AREA_DAMAGED_CODE2</th>\n", " <th>AREA_DAMAGED_CODE3</th>\n", " <th>AREA_DAMAGED_CODE_IMP1</th>\n", " <th>AREA_DAMAGED_CODE_MAIN</th>\n", " <th>BODY_TYPE_CODE</th>\n", " <th>COMMERCIAL_FLAG</th>\n", " <th>CONTI_DIRECTION_CODE</th>\n", " <th>CV_BODY_TYPE_CODE</th>\n", " <th>DAMAGE_CODE</th>\n", " <th>DRIVERLESS_FLAG</th>\n", " <th>FIRE_FLAG</th>\n", " <th>GOING_DIRECTION_CODE</th>\n", " <th>GVW_CODE</th>\n", " <th>HARM_EVENT_CODE</th>\n", " <th>HAZMAT_SPILL_FLAG</th>\n", " <th>HIT_AND_RUN_FLAG</th>\n", " <th>HZM_NUM</th>\n", " <th>MOVEMENT_CODE</th>\n", " <th>NUM_AXLES</th>\n", " <th>PARKED_FLAG</th>\n", " <th>REPORT_NO</th>\n", " <th>SPEED_LIMIT</th>\n", " <th>TOWED_AWAY_FLAG</th>\n", " <th>TOWED_VEHICLE_CONFIG_CODE</th>\n", " <th>VEHICLE_ID</th>\n", " <th>VEH_MAKE</th>\n", " <th>VEH_MODEL</th>\n", " <th>VEH_YEAR</th>\n", " <th>VIN_NO</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>8.0</td>\n", " <td>9.0</td>\n", " <td>10.0</td>\n", " <td>10.0</td>\n", " <td>10.0</td>\n", " <td>23.08</td>\n", " <td>N</td>\n", " <td>E</td>\n", " <td>NaN</td>\n", " <td>5</td>\n", " <td>N</td>\n", " <td>N</td>\n", " <td>E</td>\n", " <td>NaN</td>\n", " <td>9.0</td>\n", " <td>NaN</td>\n", " <td>N</td>\n", " <td>NaN</td>\n", " <td>1.0</td>\n", " <td>NaN</td>\n", " <td>N</td>\n", " <td>ADJ487004H</td>\n", " <td>30</td>\n", " <td>Y</td>\n", " <td>0</td>\n", " <td>000238fd-44fa-4cd5-8eb7-41ab30500bec</td>\n", " <td>CHEVY</td>\n", " <td>TAHOE</td>\n", " <td>2005.0</td>\n", " <td>1GNEK13Q2J285593</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>12.0</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>12.0</td>\n", " <td>12.0</td>\n", " <td>2.00</td>\n", " <td>N</td>\n", " <td>E</td>\n", " <td>NaN</td>\n", " <td>4</td>\n", " <td>N</td>\n", " <td>N</td>\n", " <td>E</td>\n", " <td>NaN</td>\n", " <td>1.0</td>\n", " <td>NaN</td>\n", " <td>N</td>\n", " <td>NaN</td>\n", " <td>1.0</td>\n", " <td>NaN</td>\n", " <td>N</td>\n", " <td>MCP2487000M</td>\n", " <td>40</td>\n", " <td>Y</td>\n", " <td>0</td>\n", " <td>00038116-1bf9-48cc-b317-4f4375d14b60</td>\n", " <td>INFI</td>\n", " <td>4S</td>\n", " <td>2003.0</td>\n", " <td>JNKCV51E63M013580</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " AREA_DAMAGED_CODE1 AREA_DAMAGED_CODE2 AREA_DAMAGED_CODE3 \\\n", "0 8.0 9.0 10.0 \n", "1 12.0 NaN NaN \n", "\n", " AREA_DAMAGED_CODE_IMP1 AREA_DAMAGED_CODE_MAIN BODY_TYPE_CODE \\\n", "0 10.0 10.0 23.08 \n", "1 12.0 12.0 2.00 \n", "\n", " COMMERCIAL_FLAG CONTI_DIRECTION_CODE CV_BODY_TYPE_CODE DAMAGE_CODE \\\n", "0 N E NaN 5 \n", "1 N E NaN 4 \n", "\n", " DRIVERLESS_FLAG FIRE_FLAG GOING_DIRECTION_CODE GVW_CODE HARM_EVENT_CODE \\\n", "0 N N E NaN 9.0 \n", "1 N N E NaN 1.0 \n", "\n", " HAZMAT_SPILL_FLAG HIT_AND_RUN_FLAG HZM_NUM MOVEMENT_CODE NUM_AXLES \\\n", "0 NaN N NaN 1.0 NaN \n", "1 NaN N NaN 1.0 NaN \n", "\n", " PARKED_FLAG REPORT_NO SPEED_LIMIT TOWED_AWAY_FLAG \\\n", "0 N ADJ487004H 30 Y \n", "1 N MCP2487000M 40 Y \n", "\n", " TOWED_VEHICLE_CONFIG_CODE VEHICLE_ID VEH_MAKE \\\n", "0 0 000238fd-44fa-4cd5-8eb7-41ab30500bec CHEVY \n", "1 0 00038116-1bf9-48cc-b317-4f4375d14b60 INFI \n", "\n", " VEH_MODEL VEH_YEAR VIN_NO \n", "0 TAHOE 2005.0 1GNEK13Q2J285593 \n", "1 4S 2003.0 JNKCV51E63M013580 " ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vehicles = pd.read_csv(\"data/cleaned/combined-vehicle-data.csv\")\n", "vehicles.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are we interested in for this dataset? We probably don't need all those columns!" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>REPORT_NO</th>\n", " <th>SPEED_LIMIT</th>\n", " <th>VEH_MAKE</th>\n", " <th>VEH_MODEL</th>\n", " <th>VEH_YEAR</th>\n", " <th>VIN_NO</th>\n", " <th>VEHICLE_ID</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ADJ487004H</td>\n", " <td>30</td>\n", " <td>CHEVY</td>\n", " <td>TAHOE</td>\n", " <td>2005.0</td>\n", " <td>1GNEK13Q2J285593</td>\n", " <td>000238fd-44fa-4cd5-8eb7-41ab30500bec</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>MCP2487000M</td>\n", " <td>40</td>\n", " <td>INFI</td>\n", " <td>4S</td>\n", " <td>2003.0</td>\n", " <td>JNKCV51E63M013580</td>\n", " <td>00038116-1bf9-48cc-b317-4f4375d14b60</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>CB5190006B</td>\n", " <td>55</td>\n", " <td>TOYT</td>\n", " <td>TK</td>\n", " <td>2011.0</td>\n", " <td>5TFUY5F1XBX167340</td>\n", " <td>0003b659-2785-4868-8877-0b786a284827</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>ADJ4590035</td>\n", " <td>5</td>\n", " <td>HONDA</td>\n", " <td>CIVIC</td>\n", " <td>2015.0</td>\n", " <td>2HGFG4A59FH702545</td>\n", " <td>00050484-d08f-4b6e-bc7e-9ec270e94660</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>ADJ849000Z</td>\n", " <td>10</td>\n", " <td>HONDA</td>\n", " <td>ACCORD</td>\n", " <td>2003.0</td>\n", " <td>1HGCM66313A037175</td>\n", " <td>00057af4-d848-4cee-b854-707f57581f4e</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " REPORT_NO SPEED_LIMIT VEH_MAKE VEH_MODEL VEH_YEAR VIN_NO \\\n", "0 ADJ487004H 30 CHEVY TAHOE 2005.0 1GNEK13Q2J285593 \n", "1 MCP2487000M 40 INFI 4S 2003.0 JNKCV51E63M013580 \n", "2 CB5190006B 55 TOYT TK 2011.0 5TFUY5F1XBX167340 \n", "3 ADJ4590035 5 HONDA CIVIC 2015.0 2HGFG4A59FH702545 \n", "4 ADJ849000Z 10 HONDA ACCORD 2003.0 1HGCM66313A037175 \n", "\n", " VEHICLE_ID \n", "0 000238fd-44fa-4cd5-8eb7-41ab30500bec \n", "1 00038116-1bf9-48cc-b317-4f4375d14b60 \n", "2 0003b659-2785-4868-8877-0b786a284827 \n", "3 00050484-d08f-4b6e-bc7e-9ec270e94660 \n", "4 00057af4-d848-4cee-b854-707f57581f4e " ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vehicles = vehicles[['REPORT_NO', 'SPEED_LIMIT', 'VEH_MAKE', 'VEH_MODEL', 'VEH_YEAR', 'VIN_NO', 'VEHICLE_ID']]\n", "vehicles.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Details about the crash itself" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>ACC_DATE</th>\n", " <th>ACC_TIME</th>\n", " <th>AGENCY_CODE</th>\n", " <th>AREA_CODE</th>\n", " <th>COLLISION_TYPE_CODE</th>\n", " <th>COUNTY_NO</th>\n", " <th>C_M_ZONE_FLAG</th>\n", " <th>DISTANCE</th>\n", " <th>DISTANCE_DIR_FLAG</th>\n", " <th>FEET_MILES_FLAG</th>\n", " <th>FIX_OBJ_CODE</th>\n", " <th>HARM_EVENT_CODE1</th>\n", " <th>HARM_EVENT_CODE2</th>\n", " <th>JUNCTION_CODE</th>\n", " <th>LANE_CODE</th>\n", " <th>LATITUDE</th>\n", " <th>LIGHT_CODE</th>\n", " <th>LOC_CODE</th>\n", " <th>LOGMILE_DIR_FLAG</th>\n", " <th>LOG_MILE</th>\n", " <th>LONGITUDE</th>\n", " <th>MAINROAD_NAME</th>\n", " <th>MUNI_CODE</th>\n", " <th>RD_COND_CODE</th>\n", " <th>RD_DIV_CODE</th>\n", " <th>REFERENCE_NO</th>\n", " <th>REFERENCE_ROAD_NAME</th>\n", " <th>REFERENCE_SUFFIX</th>\n", " <th>REFERENCE_TYPE_CODE</th>\n", " <th>REPORT_NO</th>\n", " <th>REPORT_TYPE</th>\n", " <th>ROUTE_TYPE_CODE</th>\n", " <th>RTE_NO</th>\n", " <th>RTE_SUFFIX</th>\n", " <th>SIGNAL_FLAG</th>\n", " <th>SURF_COND_CODE</th>\n", " <th>WEATHER_CODE</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2018-04-10 00:00:00</td>\n", " <td>01:50:00</td>\n", " <td>MSP</td>\n", " <td>UNK</td>\n", " <td>17</td>\n", " <td>18.0</td>\n", " <td>N</td>\n", " <td>0.0</td>\n", " <td>N</td>\n", " <td>F</td>\n", " <td>22.03</td>\n", " <td>16.0</td>\n", " <td>11.0</td>\n", " <td>1.0</td>\n", " <td>NaN</td>\n", " <td>38.277230</td>\n", " <td>4.0</td>\n", " <td>NaN</td>\n", " <td>N</td>\n", " <td>2.09</td>\n", " <td>-76.682876</td>\n", " <td>NEWTOWNE NECK RD</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>1.0</td>\n", " <td>166.0</td>\n", " <td>ROSEBANK RD</td>\n", " <td>NaN</td>\n", " <td>CO</td>\n", " <td>MSP6188002Q</td>\n", " <td>Injury Crash</td>\n", " <td>MD</td>\n", " <td>243.0</td>\n", " <td>NaN</td>\n", " <td>N</td>\n", " <td>2.0</td>\n", " <td>6.01</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2018-05-02 00:00:00</td>\n", " <td>11:06:00</td>\n", " <td>ELKTON</td>\n", " <td>UNK</td>\n", " <td>5</td>\n", " <td>7.0</td>\n", " <td>N</td>\n", " <td>15.0</td>\n", " <td>S</td>\n", " <td>F</td>\n", " <td>0.00</td>\n", " <td>1.0</td>\n", " <td>1.0</td>\n", " <td>3.0</td>\n", " <td>NaN</td>\n", " <td>39.613747</td>\n", " <td>1.0</td>\n", " <td>NaN</td>\n", " <td>N</td>\n", " <td>19.41</td>\n", " <td>-75.837454</td>\n", " <td>N BRIDGE ST</td>\n", " <td>52.0</td>\n", " <td>1.0</td>\n", " <td>3.0</td>\n", " <td>362.0</td>\n", " <td>LAUREL DR</td>\n", " <td>NaN</td>\n", " <td>MU</td>\n", " <td>BK0227001M</td>\n", " <td>Injury Crash</td>\n", " <td>MD</td>\n", " <td>213.0</td>\n", " <td>NaN</td>\n", " <td>N</td>\n", " <td>2.0</td>\n", " <td>6.01</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " ACC_DATE ACC_TIME AGENCY_CODE AREA_CODE COLLISION_TYPE_CODE \\\n", "0 2018-04-10 00:00:00 01:50:00 MSP UNK 17 \n", "1 2018-05-02 00:00:00 11:06:00 ELKTON UNK 5 \n", "\n", " COUNTY_NO C_M_ZONE_FLAG DISTANCE DISTANCE_DIR_FLAG FEET_MILES_FLAG \\\n", "0 18.0 N 0.0 N F \n", "1 7.0 N 15.0 S F \n", "\n", " FIX_OBJ_CODE HARM_EVENT_CODE1 HARM_EVENT_CODE2 JUNCTION_CODE LANE_CODE \\\n", "0 22.03 16.0 11.0 1.0 NaN \n", "1 0.00 1.0 1.0 3.0 NaN \n", "\n", " LATITUDE LIGHT_CODE LOC_CODE LOGMILE_DIR_FLAG LOG_MILE LONGITUDE \\\n", "0 38.277230 4.0 NaN N 2.09 -76.682876 \n", "1 39.613747 1.0 NaN N 19.41 -75.837454 \n", "\n", " MAINROAD_NAME MUNI_CODE RD_COND_CODE RD_DIV_CODE REFERENCE_NO \\\n", "0 NEWTOWNE NECK RD 0.0 1.0 1.0 166.0 \n", "1 N BRIDGE ST 52.0 1.0 3.0 362.0 \n", "\n", " REFERENCE_ROAD_NAME REFERENCE_SUFFIX REFERENCE_TYPE_CODE REPORT_NO \\\n", "0 ROSEBANK RD NaN CO MSP6188002Q \n", "1 LAUREL DR NaN MU BK0227001M \n", "\n", " REPORT_TYPE ROUTE_TYPE_CODE RTE_NO RTE_SUFFIX SIGNAL_FLAG \\\n", "0 Injury Crash MD 243.0 NaN N \n", "1 Injury Crash MD 213.0 NaN N \n", "\n", " SURF_COND_CODE WEATHER_CODE \n", "0 2.0 6.01 \n", "1 2.0 6.01 " ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crashes = pd.read_csv(\"data/cleaned/combined-crash-data.csv\")\n", "crashes.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do we need all of those fields? Again - probably not! What fields are we interested in?" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>REPORT_NO</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>MSP6188002Q</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>BK0227001M</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>ZU8005001W</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>MCP2891005S</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>MSP6743004Q</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " REPORT_NO\n", "0 MSP6188002Q\n", "1 BK0227001M\n", "2 ZU8005001W\n", "3 MCP2891005S\n", "4 MSP6743004Q" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crashes = crashes[['REPORT_NO']]\n", "crashes.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Weights of the cars" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>VIN</th>\n", " <th>Make</th>\n", " <th>Model</th>\n", " <th>ModelYear</th>\n", " <th>weight</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2FMDA5143TBB45576</td>\n", " <td>FORD</td>\n", " <td>WINDSTAR</td>\n", " <td>1996</td>\n", " <td>3733.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2G1WC5E37E1120089</td>\n", " <td>CHEVROLET</td>\n", " <td>IMPALA</td>\n", " <td>2014</td>\n", " <td>3618.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " VIN Make Model ModelYear weight\n", "0 2FMDA5143TBB45576 FORD WINDSTAR 1996 3733.0\n", "1 2G1WC5E37E1120089 CHEVROLET IMPALA 2014 3618.0" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weights = pd.read_csv(\"data/cleaned/vins_and_weights.csv\")\n", "weights.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Seems good to me!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Person data" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>AIRBAG_DEPLOYED</th>\n", " <th>ALCOHOL_TESTTYPE_CODE</th>\n", " <th>ALCOHOL_TEST_CODE</th>\n", " <th>BAC_CODE</th>\n", " <th>CDL_FLAG</th>\n", " <th>CLASS</th>\n", " <th>CONDITION_CODE</th>\n", " <th>DATE_OF_BIRTH</th>\n", " <th>DRUG_TESTRESULT_CODE</th>\n", " <th>DRUG_TEST_CODE</th>\n", " <th>EJECT_CODE</th>\n", " <th>EMS_UNIT_LABEL</th>\n", " <th>EQUIP_PROB_CODE</th>\n", " <th>FAULT_FLAG</th>\n", " <th>INJ_SEVER_CODE</th>\n", " <th>LICENSE_STATE_CODE</th>\n", " <th>MOVEMENT_CODE</th>\n", " <th>OCC_SEAT_POS_CODE</th>\n", " <th>PED_LOCATION_CODE</th>\n", " <th>PED_OBEY_CODE</th>\n", " <th>PED_TYPE_CODE</th>\n", " <th>PED_VISIBLE_CODE</th>\n", " <th>PERSON_ID</th>\n", " <th>PERSON_TYPE</th>\n", " <th>REPORT_NO</th>\n", " <th>SAF_EQUIP_CODE</th>\n", " <th>SEX_CODE</th>\n", " <th>VEHICLE_ID</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.0</td>\n", " <td>NaN</td>\n", " <td>0.0</td>\n", " <td>NaN</td>\n", " <td>N</td>\n", " <td>C</td>\n", " <td>0.0</td>\n", " <td>1952-04-20 00:00:00</td>\n", " <td>NaN</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>NaN</td>\n", " <td>1.0</td>\n", " <td>N</td>\n", " <td>1</td>\n", " <td>PA</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>48dd00ee-e033-47e7-ad1e-0b734020301b</td>\n", " <td>D</td>\n", " <td>AB4284000S</td>\n", " <td>13.0</td>\n", " <td>F</td>\n", " <td>eb6aadb8-dacb-4744-a1a7-ab812c96f27f</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1.0</td>\n", " <td>NaN</td>\n", " <td>0.0</td>\n", " <td>NaN</td>\n", " <td>N</td>\n", " <td>NaN</td>\n", " <td>0.0</td>\n", " <td>1985-05-28 00:00:00</td>\n", " <td>NaN</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>NaN</td>\n", " <td>0.0</td>\n", " <td>N</td>\n", " <td>1</td>\n", " <td>MD</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>166296bd-ffd3-4c16-aa74-4f4bf4139d8d</td>\n", " <td>D</td>\n", " <td>AB4313000X</td>\n", " <td>13.0</td>\n", " <td>F</td>\n", " <td>b463eb20-2f01-4200-9d6f-b18888ce2593</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " AIRBAG_DEPLOYED ALCOHOL_TESTTYPE_CODE ALCOHOL_TEST_CODE BAC_CODE \\\n", "0 1.0 NaN 0.0 NaN \n", "1 1.0 NaN 0.0 NaN \n", "\n", " CDL_FLAG CLASS CONDITION_CODE DATE_OF_BIRTH DRUG_TESTRESULT_CODE \\\n", "0 N C 0.0 1952-04-20 00:00:00 NaN \n", "1 N NaN 0.0 1985-05-28 00:00:00 NaN \n", "\n", " DRUG_TEST_CODE EJECT_CODE EMS_UNIT_LABEL EQUIP_PROB_CODE FAULT_FLAG \\\n", "0 0.0 1.0 NaN 1.0 N \n", "1 0.0 0.0 NaN 0.0 N \n", "\n", " INJ_SEVER_CODE LICENSE_STATE_CODE MOVEMENT_CODE OCC_SEAT_POS_CODE \\\n", "0 1 PA NaN NaN \n", "1 1 MD NaN NaN \n", "\n", " PED_LOCATION_CODE PED_OBEY_CODE PED_TYPE_CODE PED_VISIBLE_CODE \\\n", "0 NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN \n", "\n", " PERSON_ID PERSON_TYPE REPORT_NO \\\n", "0 48dd00ee-e033-47e7-ad1e-0b734020301b D AB4284000S \n", "1 166296bd-ffd3-4c16-aa74-4f4bf4139d8d D AB4313000X \n", "\n", " SAF_EQUIP_CODE SEX_CODE VEHICLE_ID \n", "0 13.0 F eb6aadb8-dacb-4744-a1a7-ab812c96f27f \n", "1 13.0 F b463eb20-2f01-4200-9d6f-b18888ce2593 " ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "people = pd.read_csv(\"data/cleaned/combined-person-data.csv\")\n", "people.head(2)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>INJ_SEVER_CODE</th>\n", " <th>VEHICLE_ID</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>eb6aadb8-dacb-4744-a1a7-ab812c96f27f</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>b463eb20-2f01-4200-9d6f-b18888ce2593</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1</td>\n", " <td>3c8629d0-d524-47c1-bfbc-b18e07f3087e</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1</td>\n", " <td>c4628cdb-f295-4a24-8a4b-653741ac6ae7</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1</td>\n", " <td>cdda1580-fd79-4358-8819-c2250f494591</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " INJ_SEVER_CODE VEHICLE_ID\n", "0 1 eb6aadb8-dacb-4744-a1a7-ab812c96f27f\n", "1 1 b463eb20-2f01-4200-9d6f-b18888ce2593\n", "2 1 3c8629d0-d524-47c1-bfbc-b18e07f3087e\n", "3 1 c4628cdb-f295-4a24-8a4b-653741ac6ae7\n", "4 1 cdda1580-fd79-4358-8819-c2250f494591" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "people = people[['INJ_SEVER_CODE', 'VEHICLE_ID']]\n", "people.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Merging our data" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>REPORT_NO</th>\n", " <th>SPEED_LIMIT</th>\n", " <th>VEH_MAKE</th>\n", " <th>VEH_MODEL</th>\n", " <th>VEH_YEAR</th>\n", " <th>VIN_NO</th>\n", " <th>VEHICLE_ID</th>\n", " <th>VIN</th>\n", " <th>Make</th>\n", " <th>Model</th>\n", " <th>ModelYear</th>\n", " <th>weight</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ADJ487004H</td>\n", " <td>30</td>\n", " <td>CHEVY</td>\n", " <td>TAHOE</td>\n", " <td>2005.0</td>\n", " <td>1GNEK13Q2J285593</td>\n", " <td>000238fd-44fa-4cd5-8eb7-41ab30500bec</td>\n", " <td>1GNEK13Q2J285593</td>\n", " <td>CHEVROLET</td>\n", " <td>GMT-400</td>\n", " <td>1988.0</td>\n", " <td>4300.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>MCP2487000M</td>\n", " <td>40</td>\n", " <td>INFI</td>\n", " <td>4S</td>\n", " <td>2003.0</td>\n", " <td>JNKCV51E63M013580</td>\n", " <td>00038116-1bf9-48cc-b317-4f4375d14b60</td>\n", " <td>JNKCV51E63M013580</td>\n", " <td>INFINITI</td>\n", " <td>G35</td>\n", " <td>2003.0</td>\n", " <td>3468.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>CB5190006B</td>\n", " <td>55</td>\n", " <td>TOYT</td>\n", " <td>TK</td>\n", " <td>2011.0</td>\n", " <td>5TFUY5F1XBX167340</td>\n", " <td>0003b659-2785-4868-8877-0b786a284827</td>\n", " <td>5TFUY5F1XBX167340</td>\n", " <td>TOYOTA</td>\n", " <td>TUNDRA</td>\n", " <td>2011.0</td>\n", " <td>5480.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>ADJ4590035</td>\n", " <td>5</td>\n", " <td>HONDA</td>\n", " <td>CIVIC</td>\n", " <td>2015.0</td>\n", " <td>2HGFG4A59FH702545</td>\n", " <td>00050484-d08f-4b6e-bc7e-9ec270e94660</td>\n", " <td>2HGFG4A59FH702545</td>\n", " <td>HONDA</td>\n", " <td>CIVIC</td>\n", " <td>2015.0</td>\n", " <td>2754.0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>ADJ849000Z</td>\n", " <td>10</td>\n", " <td>HONDA</td>\n", " <td>ACCORD</td>\n", " <td>2003.0</td>\n", " <td>1HGCM66313A037175</td>\n", " <td>00057af4-d848-4cee-b854-707f57581f4e</td>\n", " <td>1HGCM66313A037175</td>\n", " <td>HONDA</td>\n", " <td>ACCORD</td>\n", " <td>2003.0</td>\n", " <td>3023.0</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>AE5207008Z</td>\n", " <td>30</td>\n", " <td>FORD</td>\n", " <td>4D</td>\n", " <td>2016.0</td>\n", " <td>1FADP3K28GL258987</td>\n", " <td>00089d4a-7038-4693-9e02-b402676631af</td>\n", " <td>1FADP3K28GL258987</td>\n", " <td>FORD</td>\n", " <td>FOCUS</td>\n", " <td>2016.0</td>\n", " <td>2932.0</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>MDTA1150000D</td>\n", " <td>30</td>\n", " <td>GILL</td>\n", " <td>BUS</td>\n", " <td>2004.0</td>\n", " <td>15GGD211X41076128</td>\n", " <td>000abd04-058d-4e14-9f1c-4a100493a305</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>MSP5973003Q</td>\n", " <td>35</td>\n", " <td>CHEVROLET</td>\n", " <td>HHR</td>\n", " <td>2008.0</td>\n", " <td>3GNDA23DX8S558149</td>\n", " <td>000e3b5d-b4ae-44a1-af02-df6cce381fc1</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>MCP27070015</td>\n", " <td>25</td>\n", " <td>ACUR</td>\n", " <td>TSX</td>\n", " <td>2008.0</td>\n", " <td>JH4CL96848C021626</td>\n", " <td>000f45d8-bc0e-4f9c-820a-474212e669cd</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>ADJ859000Y</td>\n", " <td>15</td>\n", " <td>HONDA</td>\n", " <td>CIVIC</td>\n", " <td>2009.0</td>\n", " <td>2HGFA16689H357624</td>\n", " <td>0010076b-0a45-45f3-8796-f387b39cd85d</td>\n", " <td>2HGFA16689H357624</td>\n", " <td>HONDA</td>\n", " <td>CIVIC</td>\n", " <td>2009.0</td>\n", " <td>2678.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " REPORT_NO SPEED_LIMIT VEH_MAKE VEH_MODEL VEH_YEAR \\\n", "0 ADJ487004H 30 CHEVY TAHOE 2005.0 \n", "1 MCP2487000M 40 INFI 4S 2003.0 \n", "2 CB5190006B 55 TOYT TK 2011.0 \n", "3 ADJ4590035 5 HONDA CIVIC 2015.0 \n", "4 ADJ849000Z 10 HONDA ACCORD 2003.0 \n", "5 AE5207008Z 30 FORD 4D 2016.0 \n", "6 MDTA1150000D 30 GILL BUS 2004.0 \n", "7 MSP5973003Q 35 CHEVROLET HHR 2008.0 \n", "8 MCP27070015 25 ACUR TSX 2008.0 \n", "9 ADJ859000Y 15 HONDA CIVIC 2009.0 \n", "\n", " VIN_NO VEHICLE_ID VIN \\\n", "0 1GNEK13Q2J285593 000238fd-44fa-4cd5-8eb7-41ab30500bec 1GNEK13Q2J285593 \n", "1 JNKCV51E63M013580 00038116-1bf9-48cc-b317-4f4375d14b60 JNKCV51E63M013580 \n", "2 5TFUY5F1XBX167340 0003b659-2785-4868-8877-0b786a284827 5TFUY5F1XBX167340 \n", "3 2HGFG4A59FH702545 00050484-d08f-4b6e-bc7e-9ec270e94660 2HGFG4A59FH702545 \n", "4 1HGCM66313A037175 00057af4-d848-4cee-b854-707f57581f4e 1HGCM66313A037175 \n", "5 1FADP3K28GL258987 00089d4a-7038-4693-9e02-b402676631af 1FADP3K28GL258987 \n", "6 15GGD211X41076128 000abd04-058d-4e14-9f1c-4a100493a305 NaN \n", "7 3GNDA23DX8S558149 000e3b5d-b4ae-44a1-af02-df6cce381fc1 NaN \n", "8 JH4CL96848C021626 000f45d8-bc0e-4f9c-820a-474212e669cd NaN \n", "9 2HGFA16689H357624 0010076b-0a45-45f3-8796-f387b39cd85d 2HGFA16689H357624 \n", "\n", " Make Model ModelYear weight \n", "0 CHEVROLET GMT-400 1988.0 4300.0 \n", "1 INFINITI G35 2003.0 3468.0 \n", "2 TOYOTA TUNDRA 2011.0 5480.0 \n", "3 HONDA CIVIC 2015.0 2754.0 \n", "4 HONDA ACCORD 2003.0 3023.0 \n", "5 FORD FOCUS 2016.0 2932.0 \n", "6 NaN NaN NaN NaN \n", "7 NaN NaN NaN NaN \n", "8 NaN NaN NaN NaN \n", "9 HONDA CIVIC 2009.0 2678.0 " ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We'll keep the ones without weights so we can drop those accidents from the analysis\n", "vehicles_with_weights = vehicles.merge(weights,\n", " how='left',\n", " left_on='VIN_NO',\n", " right_on='VIN')\n", "vehicles_with_weights.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do the `VEH_MAKE` and `Make` columns match up, generally speaking? **Do we trust what we did by searching out the VIN?**" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>VEH_MAKE</th>\n", " <th>Make</th>\n", " <th>VEH_MODEL</th>\n", " <th>Model</th>\n", " <th>VEH_YEAR</th>\n", " <th>ModelYear</th>\n", " <th>weight</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>CHEVY</td>\n", " <td>CHEVROLET</td>\n", " <td>TAHOE</td>\n", " <td>GMT-400</td>\n", " <td>2005.0</td>\n", " <td>1988.0</td>\n", " <td>4300.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>INFI</td>\n", " <td>INFINITI</td>\n", " <td>4S</td>\n", " <td>G35</td>\n", " <td>2003.0</td>\n", " <td>2003.0</td>\n", " <td>3468.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>TOYT</td>\n", " <td>TOYOTA</td>\n", " <td>TK</td>\n", " <td>TUNDRA</td>\n", " <td>2011.0</td>\n", " <td>2011.0</td>\n", " <td>5480.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>HONDA</td>\n", " <td>HONDA</td>\n", " <td>CIVIC</td>\n", " <td>CIVIC</td>\n", " <td>2015.0</td>\n", " <td>2015.0</td>\n", " <td>2754.0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>HONDA</td>\n", " <td>HONDA</td>\n", " <td>ACCORD</td>\n", " <td>ACCORD</td>\n", " <td>2003.0</td>\n", " <td>2003.0</td>\n", " <td>3023.0</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>FORD</td>\n", " <td>FORD</td>\n", " <td>4D</td>\n", " <td>FOCUS</td>\n", " <td>2016.0</td>\n", " <td>2016.0</td>\n", " <td>2932.0</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>GILL</td>\n", " <td>NaN</td>\n", " <td>BUS</td>\n", " <td>NaN</td>\n", " <td>2004.0</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>CHEVROLET</td>\n", " <td>NaN</td>\n", " <td>HHR</td>\n", " <td>NaN</td>\n", " <td>2008.0</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>ACUR</td>\n", " <td>NaN</td>\n", " <td>TSX</td>\n", " <td>NaN</td>\n", " <td>2008.0</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>HONDA</td>\n", " <td>HONDA</td>\n", " <td>CIVIC</td>\n", " <td>CIVIC</td>\n", " <td>2009.0</td>\n", " <td>2009.0</td>\n", " <td>2678.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " VEH_MAKE Make VEH_MODEL Model VEH_YEAR ModelYear weight\n", "0 CHEVY CHEVROLET TAHOE GMT-400 2005.0 1988.0 4300.0\n", "1 INFI INFINITI 4S G35 2003.0 2003.0 3468.0\n", "2 TOYT TOYOTA TK TUNDRA 2011.0 2011.0 5480.0\n", "3 HONDA HONDA CIVIC CIVIC 2015.0 2015.0 2754.0\n", "4 HONDA HONDA ACCORD ACCORD 2003.0 2003.0 3023.0\n", "5 FORD FORD 4D FOCUS 2016.0 2016.0 2932.0\n", "6 GILL NaN BUS NaN 2004.0 NaN NaN\n", "7 CHEVROLET NaN HHR NaN 2008.0 NaN NaN\n", "8 ACUR NaN TSX NaN 2008.0 NaN NaN\n", "9 HONDA HONDA CIVIC CIVIC 2009.0 2009.0 2678.0" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vehicles_with_weights[['VEH_MAKE', 'Make', 'VEH_MODEL', 'Model', 'VEH_YEAR', 'ModelYear', 'weight']].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks some of these have weights, while others don't. How many do we have weights for?" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True 534436\n", "False 207697\n", "Name: weight, dtype: int64" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# isna vs notna\n", "vehicles_with_weights.weight.notna().value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some of our vehicles don't have weights or years!\n", "\n", "Let's get the report numbers of vehicles without weights and/or years, then exclude those reports (crashes) from our analysis." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>REPORT_NO</th>\n", " <th>SPEED_LIMIT</th>\n", " <th>VEH_MAKE</th>\n", " <th>VEH_MODEL</th>\n", " <th>VEH_YEAR</th>\n", " <th>VIN_NO</th>\n", " <th>VEHICLE_ID</th>\n", " <th>VIN</th>\n", " <th>Make</th>\n", " <th>Model</th>\n", " <th>ModelYear</th>\n", " <th>weight</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>6</th>\n", " <td>MDTA1150000D</td>\n", " <td>30</td>\n", " <td>GILL</td>\n", " <td>BUS</td>\n", " <td>2004.0</td>\n", " <td>15GGD211X41076128</td>\n", " <td>000abd04-058d-4e14-9f1c-4a100493a305</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>MSP5973003Q</td>\n", " <td>35</td>\n", " <td>CHEVROLET</td>\n", " <td>HHR</td>\n", " <td>2008.0</td>\n", " <td>3GNDA23DX8S558149</td>\n", " <td>000e3b5d-b4ae-44a1-af02-df6cce381fc1</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>MCP27070015</td>\n", " <td>25</td>\n", " <td>ACUR</td>\n", " <td>TSX</td>\n", " <td>2008.0</td>\n", " <td>JH4CL96848C021626</td>\n", " <td>000f45d8-bc0e-4f9c-820a-474212e669cd</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>DA3276000V</td>\n", " <td>45</td>\n", " <td>TOYT</td>\n", " <td>CAMRY</td>\n", " <td>2018.0</td>\n", " <td>JTNB11HK2J3004953</td>\n", " <td>0018b4d3-b19b-4327-b5f9-f0725fd06844</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>CB5712001X</td>\n", " <td>50</td>\n", " <td>JEEP</td>\n", " <td>COMPASS</td>\n", " <td>2015.0</td>\n", " <td>W452866115351</td>\n", " <td>001968e1-88be-424b-b7cf-bd816ffe340a</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " REPORT_NO SPEED_LIMIT VEH_MAKE VEH_MODEL VEH_YEAR \\\n", "6 MDTA1150000D 30 GILL BUS 2004.0 \n", "7 MSP5973003Q 35 CHEVROLET HHR 2008.0 \n", "8 MCP27070015 25 ACUR TSX 2008.0 \n", "14 DA3276000V 45 TOYT CAMRY 2018.0 \n", "15 CB5712001X 50 JEEP COMPASS 2015.0 \n", "\n", " VIN_NO VEHICLE_ID VIN Make Model \\\n", "6 15GGD211X41076128 000abd04-058d-4e14-9f1c-4a100493a305 NaN NaN NaN \n", "7 3GNDA23DX8S558149 000e3b5d-b4ae-44a1-af02-df6cce381fc1 NaN NaN NaN \n", "8 JH4CL96848C021626 000f45d8-bc0e-4f9c-820a-474212e669cd NaN NaN NaN \n", "14 JTNB11HK2J3004953 0018b4d3-b19b-4327-b5f9-f0725fd06844 NaN NaN NaN \n", "15 W452866115351 001968e1-88be-424b-b7cf-bd816ffe340a NaN NaN NaN \n", "\n", " ModelYear weight \n", "6 NaN NaN \n", "7 NaN NaN \n", "8 NaN NaN \n", "14 NaN NaN \n", "15 NaN NaN " ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find the ones missing weights\n", "missing_weight = vehicles_with_weights[vehicles_with_weights.weight.isna()]\n", "missing_weight.head()" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6 MDTA1150000D\n", "7 MSP5973003Q\n", "8 MCP27070015\n", "14 DA3276000V\n", "15 CB5712001X\n", "Name: REPORT_NO, dtype: object" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get their report numbers\n", "missing_weight.REPORT_NO.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before (396419, 1)\n", "After (218513, 1)\n" ] } ], "source": [ "# Filter those out of our 'crashes' dataset\n", "# We need to outline this real bigtime beforehand, \"what can we keep?\"\n", "# because if we're only looking at our own car, then we don't need to drop these\n", "print(\"Before\", crashes.shape)\n", "crashes = crashes[~crashes.REPORT_NO.isin(missing_weight.REPORT_NO)]\n", "print(\"After\", crashes.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we're getting rid of half of our data, is it still okay?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# We only want two-vehicle crashes\n", "\n", "> For analytic purposes, we decompose the data set into three sub-sa three-vehicle crashes, and single-vehicle crashes. The two-vehicle crash data set is the focus of most of our analyses\n", "\n", "Each vehicle has a `REPORT_NO`. If two vehicles show up with the same `REPORT_NO`, we know there were two vehicles in the crash." ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>REPORT_NO</th>\n", " <th>SPEED_LIMIT</th>\n", " <th>VEH_MAKE</th>\n", " <th>VEH_MODEL</th>\n", " <th>VEH_YEAR</th>\n", " <th>VIN_NO</th>\n", " <th>VEHICLE_ID</th>\n", " <th>VIN</th>\n", " <th>Make</th>\n", " <th>Model</th>\n", " <th>ModelYear</th>\n", " <th>weight</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ADJ487004H</td>\n", " <td>30</td>\n", " <td>CHEVY</td>\n", " <td>TAHOE</td>\n", " <td>2005.0</td>\n", " <td>1GNEK13Q2J285593</td>\n", " <td>000238fd-44fa-4cd5-8eb7-41ab30500bec</td>\n", " <td>1GNEK13Q2J285593</td>\n", " <td>CHEVROLET</td>\n", " <td>GMT-400</td>\n", " <td>1988.0</td>\n", " <td>4300.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>MCP2487000M</td>\n", " <td>40</td>\n", " <td>INFI</td>\n", " <td>4S</td>\n", " <td>2003.0</td>\n", " <td>JNKCV51E63M013580</td>\n", " <td>00038116-1bf9-48cc-b317-4f4375d14b60</td>\n", " <td>JNKCV51E63M013580</td>\n", " <td>INFINITI</td>\n", " <td>G35</td>\n", " <td>2003.0</td>\n", " <td>3468.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>CB5190006B</td>\n", " <td>55</td>\n", " <td>TOYT</td>\n", " <td>TK</td>\n", " <td>2011.0</td>\n", " <td>5TFUY5F1XBX167340</td>\n", " <td>0003b659-2785-4868-8877-0b786a284827</td>\n", " <td>5TFUY5F1XBX167340</td>\n", " <td>TOYOTA</td>\n", " <td>TUNDRA</td>\n", " <td>2011.0</td>\n", " <td>5480.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>ADJ4590035</td>\n", " <td>5</td>\n", " <td>HONDA</td>\n", " <td>CIVIC</td>\n", " <td>2015.0</td>\n", " <td>2HGFG4A59FH702545</td>\n", " <td>00050484-d08f-4b6e-bc7e-9ec270e94660</td>\n", " <td>2HGFG4A59FH702545</td>\n", " <td>HONDA</td>\n", " <td>CIVIC</td>\n", " <td>2015.0</td>\n", " <td>2754.0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>ADJ849000Z</td>\n", " <td>10</td>\n", " <td>HONDA</td>\n", " <td>ACCORD</td>\n", " <td>2003.0</td>\n", " <td>1HGCM66313A037175</td>\n", " <td>00057af4-d848-4cee-b854-707f57581f4e</td>\n", " <td>1HGCM66313A037175</td>\n", " <td>HONDA</td>\n", " <td>ACCORD</td>\n", " <td>2003.0</td>\n", " <td>3023.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " REPORT_NO SPEED_LIMIT VEH_MAKE VEH_MODEL VEH_YEAR VIN_NO \\\n", "0 ADJ487004H 30 CHEVY TAHOE 2005.0 1GNEK13Q2J285593 \n", "1 MCP2487000M 40 INFI 4S 2003.0 JNKCV51E63M013580 \n", "2 CB5190006B 55 TOYT TK 2011.0 5TFUY5F1XBX167340 \n", "3 ADJ4590035 5 HONDA CIVIC 2015.0 2HGFG4A59FH702545 \n", "4 ADJ849000Z 10 HONDA ACCORD 2003.0 1HGCM66313A037175 \n", "\n", " VEHICLE_ID VIN Make \\\n", "0 000238fd-44fa-4cd5-8eb7-41ab30500bec 1GNEK13Q2J285593 CHEVROLET \n", "1 00038116-1bf9-48cc-b317-4f4375d14b60 JNKCV51E63M013580 INFINITI \n", "2 0003b659-2785-4868-8877-0b786a284827 5TFUY5F1XBX167340 TOYOTA \n", "3 00050484-d08f-4b6e-bc7e-9ec270e94660 2HGFG4A59FH702545 HONDA \n", "4 00057af4-d848-4cee-b854-707f57581f4e 1HGCM66313A037175 HONDA \n", "\n", " Model ModelYear weight \n", "0 GMT-400 1988.0 4300.0 \n", "1 G35 2003.0 3468.0 \n", "2 TUNDRA 2011.0 5480.0 \n", "3 CIVIC 2015.0 2754.0 \n", "4 ACCORD 2003.0 3023.0 " ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vehicles_with_weights.head()" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MDTA1229000H 69\n", "CB53480003 35\n", "MSP67460095 18\n", "CE4636002C 17\n", "MSP6063005N 17\n", "Name: REPORT_NO, dtype: int64" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Which technique?\n", "\n", "# counted = vehicles_with_weights.groupby('REPORT_NO').size().reset_index(name='car_count')\n", "# counted.head()\n", "\n", "counted = vehicles_with_weights.REPORT_NO.value_counts()\n", "counted.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There was a 69-car crash? That's pretty interesting. Better look that one up. To get the ones that have two cars, we just ask if `== 2` - since it's a series, we don't have to ask for the column name." ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['MSP528100BT', 'MSP5179001M', 'AE5869002N', 'ADH2680025', 'MSP2312000H',\n", " 'HS98300005', 'AC12830021', 'DA3892000P', 'AC2040003L', 'AC2127000B',\n", " ...\n", " 'ADI179000X', 'DA3889002W', 'ADI946000L', 'ADJ380001Y', 'AC2034002L',\n", " 'MSP61740086', 'AK00410003', 'MCP1366006C', 'ADJ186004B', 'ZH0339000W'],\n", " dtype='object', length=253011)" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#two_car_report_nos = counted[counted.car_count == 2].REPORT_NO\n", "two_car_report_nos = counted[counted == 2].index\n", "two_car_report_nos" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Filter for only light vehicles\n", "\n", "> this data set contains all collisions involving two light vehicles built after 1980. We define a light vehicle as any car, pickup truck, SUV, or minivan that weights between 1500 and 6000 pounds." ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(217074, 1)" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "is_not_light_vehicle = (vehicles_with_weights.weight <= 1500) | (vehicles_with_weights.weight > 6000)\n", "is_before_1980 = vehicles_with_weights.ModelYear.astype(float) < 1980\n", "not_light_vehicle_crash = vehicles_with_weights[is_not_light_vehicle | is_before_1980].REPORT_NO\n", "crashes = crashes[~crashes.REPORT_NO.isin(not_light_vehicle_crash)]\n", "crashes.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }