Feature engineering with Buzzfeed planes#

When you're working on flight data, you don't get a nice description of where and how the plane flies - you're stuck with a series of individual measurements. In this notebook we'll work on engineering some descriptive features about each plane's "normal" way of flying.

Read online Download notebook Interactive version

You might need to install pyproj and shapely libraries for this one. You can uncomment the lines below and run them to (hopefully) get the installation done.

# !pip install pyproj
# !pip install shapely

Imports#

# Normal imports for data analysis
import pandas as pd 
import numpy as np

# Extra geographic bits, only used for for calculating fly zone area
from shapely.geometry import Polygon
from pyproj import Transformer                                                                                                                

%matplotlib inline

Read in our dataset#

First we'll read in the dataset. Each organization you can get data from - OpenSky, FlightRadar24, etc - comes with slightly different columns. This set is from OpenSky.

df = pd.read_csv('data/flights_data4-1529539200-cleaned.csv', dtype={'icao24': 'str'})
df.head()

	icao24	callsign	day	firstseen	lastseen	estdepartureairport	estarrivalairport	time	altitude	latitude	longitude	heading
0	a09909	NaN	1529539200	1.529574e+09	1529574562	KPWK	IN93	1529574561	609.6	41.702100	-86.469939	88.232170
1	a09909	NaN	1529539200	1.529574e+09	1529574562	KPWK	IN93	1529574559	609.6	41.702042	-86.472140	88.253716
2	a09909	NaN	1529539200	1.529574e+09	1529574562	KPWK	IN93	1529574490	914.4	41.700657	-86.553040	88.451843
3	a09909	NaN	1529539200	1.529574e+09	1529574562	KPWK	IN93	1529574466	1219.2	41.703683	-86.587595	111.037514
4	a09909	NaN	1529539200	1.529574e+09	1529574562	KPWK	IN93	1529574465	1219.2	41.703962	-86.588681	111.037514

When we sit down to do our machine learning, our question is going to be: is this plane a surveillance plane or not? Probably a bunch of rows, one per plane, each one having a prediction of 0 (not surveillance) or 1 (surveillance). It seems like a silly thing to call out, but... take a look at our data.

Each row isn't a plane, it's a single reading from that plane in flight! This is going to cause some trouble. We can't predict whether each reading is a spy plane or not, because each reading doesn't have enough information in it. Each row has a time, altitude, coordinates, a heading, and a few other columns, but saying "you're flying north at these coordinates" is definitely not good enough to say whether something is a surveillance plane or not.

We'll somehow need to summarize all of these readings into a single row, to illustrate how the plane flies in a way that we can tell apart surveillance and non-surveillance planes. This calls for feature engineering!

Feature engineering#

Most of the time when we're doing machine learning, we have a list of columns we can plug into our machine learning algorithm. Maybe we divide a column by a thousand, or turn it into a percent instead of a raw number, but nothing really too crazy. Our main goal is feature selection, which is picking which columns to use.

In this case, though, we need to actually design all of these features. We'll pick apart each plane's flight paths using the coordinates, and turn it into a single row.

Picking features to engineer for surveillance plane detection#

What might make a spy plane unique? Buzzfeed had some ideas:

How fast the plane goes
Where the plane flies
How large of an area the plane flies over
How often the plane changes direction
How high each flight usually is
How long each flight is
What type of plane it is

Your first instinct might be to calculate an average for each one of these. For example, how long is an average flight? and how fast does the plane usually go? Instead of sticking at this level, Peter went a bit further.

Feature engineering example: Altitude quantiles#

Let's see Peter's alternative, using altitude as an example. For example, let's say we're interested in plane a4f179.

df[df.icao24 == 'a4f179'].head()

	icao24	callsign	day	firstseen	lastseen	estdepartureairport	estarrivalairport	time	altitude	latitude	longitude	heading
1096564	a4f179	MSQT714	1529539200	1.529614e+09	1529614495	2AZ7	2AZ3	1529614493	1219.2	33.748993	-112.408242	299.500732
1096565	a4f179	MSQT714	1529539200	1.529614e+09	1529614495	2AZ7	2AZ3	1529614491	1219.2	33.748489	-112.407177	300.398743
1096566	a4f179	MSQT714	1529539200	1.529614e+09	1529614495	2AZ7	2AZ3	1529614463	1219.2	33.742310	-112.395406	311.185913
1096567	a4f179	MSQT714	1529539200	1.529614e+09	1529614495	2AZ7	2AZ3	1529614461	1219.2	33.741574	-112.394371	311.128754
1096568	a4f179	MSQT714	1529539200	1.529614e+09	1529614495	2AZ7	2AZ3	1529614460	1219.2	33.741295	-112.393970	307.715973

How many readings do we have for this one? It's important to know what we're generally working with.

df[df.icao24 == 'a4f179'].shape

(3723, 12)

Wow, almost four thousand! That's a lot of info. Since we're focused on breaking it down to one row, though, we could ask "how high does this plane usually fly?"

df[df.icao24 == 'a4f179'].altitude.mean()

1001.4272360999194

That number might not mean much on its own, as we don't know if it's high or low. We'd need to compare it to other planes to see what that speed really looks like (...yeah, yeah, this isn't exactly true, but bear with me).

Let's get the mean altitude of all readings.

df.altitude.mean()

4684.43290572469

Okay, cool, so our plane seems to generally fly a lot lower than other planes. But I'm conflicted - we took 3800 altitude datapoints and turned it into one! It seems like like we're throwing out a lot of data that might prove useful in tracking down planes. Instead of reducing a plane to one number, let's see if we have any alternatives.

Think about all the questions you might ask about how planes fly. Do almost all planes fly at the same altitude? When they're flying, do they usually stay at one altitude? Or do they go up and down, in a way that measuring the mean won't capture?

We don't know anything about planes, so maybe we should ask what a "normal" flight looks like beyond just the mean. A histogram could help us there!

df.altitude.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x10df415f8>

Wow, interesting! It seems like there's a lot of low-flying planes, and then a big peak at 10,000, but plenty between and above. It might be useful to capture that numerically. We could ask for quartiles, but let's go a little more complex: what are the 20%, 40%, 60% and 80% boundaries for altitude?

quantiles = df.altitude.quantile([0.2, 0.4, 0.6, 0.8])
quantiles

0.2      914.4
0.4     1828.8
0.6     4572.0
0.8    10058.4
Name: altitude, dtype: float64

Remember how we were "throwing out a lot of data" by reducing our altitude measurements to just an average? The way Peter got around this way to count each measurement separately.

His question became: compared to an average plane, how often was our plane in each altitude group? How often was between the the 0-20% altitude? or the 20-40% altitude? or the 40-60% latitude?

A very average plane will always be in the 40-60% range. A plane that's always flying high is in the 80-100% group. A low-flying plane would be in 0-20%. If you split your time between high and low, you'll have entries in both 0-20% and 80-100%. This is a great way to summarize your data without throwing it all out.

Let's perform that analysis with our a4f179 plane, seeing how much time it spends in each bucket.

# Create buckets based on the quantiles of all planes,
# reusing our variable from above
bin_quantiles = [-np.inf, *quantiles, np.inf]

# altitude1: -infinity to 914.4
# altitude2: 914.4     to 1,828.8
# altitude3: 1,828.8   to 4,572.0
# altitude4: 4,572.0   to 10,058.4
# altitude5: 10,058.4  to infinity
bin_labels = ['altitude1','altitude2','altitude3','altitude4','altitude5']

# Get the altitudes for just this plane
altitudes = df[df.icao24 == 'a4f179'].altitude

# Count how many altitude reading are in each bucket
# normalize=True to give us percentages
# bins=bin_labels to name them
# sort=False so we can see them in order of 1-5
# * 100 to convert to percentages because it's easier to read
pd.cut(altitudes, 
       bins=bin_quantiles,
       labels=bin_labels).value_counts(sort=False, normalize=True) * 100

altitude1    18.614021
altitude2    81.385979
altitude3     0.000000
altitude4     0.000000
altitude5     0.000000
Name: altitude, dtype: float64

So our plane spends about 18% of its time flying low, and 80% of its time flying medium-low. Let's compare that with another plane, a97db4.

altitudes = df[df.icao24 == 'a97db4'].altitude

pd.cut(altitudes, 
       bins=bin_quantiles,
       labels=bin_labels).value_counts(sort=False, normalize=True) * 100

altitude1     1.923788
altitude2     8.694044
altitude3     5.438402
altitude4     8.213097
altitude5    75.730670
Name: altitude, dtype: float64

This plane is a completely different beast! Instead of spending most of its time low, it works its way up to the highest altitudes and spends around 76% of its time there. It's a very different pattern than our first plane, and gives a good summary that we might not have gotten if we just took the mean.

Calculating other feature quantiles#

Now that we have an idea of what we're going to be doing, let's get started! And to make things a bit easier, Buzzfeed didn't actually process these quantiles for all of the planes in its dataset:

Then we took a random sample of 500 aircraft and calculated the following for each one:

So we can start by pulling a sample of planes. Note that we can't just ask for 500 random rows - we have to ask for 500 unique plane identifiers and then use those to select our rows.

unique_icao24 = df.icao24.unique()

sample_icao24 = np.random.choice(unique_icao24, 500)
sample_icao24[:5]

array(['a06a7e', 'a9df59', 'a9e391', 'ae0232', 'a5ddea'], dtype=object)

Now we can select our readings for the sample planes.

sample_df = df[df.icao24.isin(sample_icao24)]
sample_df.shape

(239188, 12)

Calculate quantiles#

Then we took a random sample of 500 aircraft and calculated the following for each one:

duration of each flight segment recorded by Flightradar24, in minutes.

boxes Area of a rectangular bounding box drawn around each flight segment, in square kilometers.

Altitude quantiles#

Since we basically did altitude already, though, we'll throw that in there.

# Get a list of altitudes
altitudes = df.set_index('icao24').altitude

# Calculate quantiles from the sample and calculate each plane's behavior
quantiles = altitudes.loc[sample_icao24].quantile([0.2, 0.4, 0.6, 0.8])

bin_quantiles = [-np.inf, *quantiles, np.inf]
bin_labels = ['altitude1','altitude2','altitude3','altitude4','altitude5']

binned = pd.cut(altitudes,
                bins=bin_quantiles,
                labels=bin_labels)

altitude_features = binned.groupby(['icao24']).value_counts(normalize=True).unstack(fill_value=0) * 100
altitude_features.head()

altitude	altitude1	altitude2	altitude3	altitude4	altitude5
icao24
000001	70.951157	28.149100	0.899743	0.000000	0.000000
009f55	0.000000	0.000000	0.000000	100.000000	0.000000
00b0ec	17.757009	26.168224	32.710280	14.953271	8.411215
00b1fa	37.500000	24.038462	9.134615	20.192308	9.134615
00b208	13.580247	32.098765	17.283951	33.333333	3.703704

Flight duration quantiles#

Calculate flight durations for all of our planes. We're grouping by the plane ID and the firstseen column, which indicates a new flight.

flight_times = df.groupby(['icao24', 'firstseen']).lastseen.last().reset_index(level=1)
flight_times.head()

	firstseen	lastseen
icao24
000001	1.529539e+09	1529539737
000001	1.529541e+09	1529571558
000001	1.529595e+09	1529610553
009f55	1.529586e+09	1529587568
00b0ec	1.529523e+09	1529576455

We can then subtract the two to get the duration for each flight.

durations = flight_times.lastseen - flight_times.firstseen
durations.head()

icao24
000001      894.0
000001    30940.0
000001    15530.0
009f55     1695.0
00b0ec    53862.0
dtype: float64

# Calculate quantiles from the sample and calculate each plane's behavior
quantiles = durations.loc[sample_icao24].quantile([0.2, 0.4, 0.6, 0.8])

bin_quantiles = [-np.inf, *quantiles, np.inf]
bin_labels = ['duration1','duration2','duration3','duration4','duration5']

binned = pd.cut(durations,
                bins=bin_quantiles,
                labels=bin_labels)

duration_features = binned.groupby(['icao24']).value_counts(normalize=True).unstack(fill_value=0) * 100
duration_features.head()

	duration1	duration2	duration3	duration4	duration5
icao24
000001	33.333333	0.0	0.0	0.0	66.666667
009f55	0.000000	100.0	0.0	0.0	0.000000
00b0ec	0.000000	0.0	0.0	0.0	100.000000
00b1fa	0.000000	0.0	0.0	0.0	100.000000
00b208	0.000000	0.0	0.0	0.0	100.000000

Speed quantiles#

TODO speed!!!

df.head(2)

	icao24	callsign	day	firstseen	lastseen	estdepartureairport	estarrivalairport	time	altitude	latitude	longitude	heading
0	a09909	NaN	1529539200	1.529574e+09	1529574562	KPWK	IN93	1529574561	609.6	41.702100	-86.469939	88.232170
1	a09909	NaN	1529539200	1.529574e+09	1529574562	KPWK	IN93	1529574559	609.6	41.702042	-86.472140	88.253716

Bounding box area quintiles#

We'll calculate for everyone, then get quintiles from the sample

# Calculate min/max for each flight's lat/lon
coords = df.groupby(['icao24', 'firstseen']).agg({
    'latitude': ['min', 'max'],
    'longitude': ['min', 'max']
}).reset_index(level=1, drop=True)
coords.head()

	latitude		longitude
	min	max	min	max
icao24
000001	34.875229	34.885409	-86.809593	-86.707801
000001	34.777222	34.881821	-86.809312	-86.692085
000001	33.032364	33.679613	-111.994124	-111.361322
009f55	34.681184	37.396591	-80.488294	-78.992672
00b0ec	37.990173	40.612033	-73.915267	-68.287626

# Calculate the bounding box in km for those
# Yeah, sorry this is complicated

transformer = Transformer.from_crs('epsg:4326', 'epsg:3857')

def get_area(row):
    coords_y = [
        row['longitude']['min'],
        row['longitude']['max'],
        row['longitude']['max'],
        row['longitude']['min'],
    ]
    coords_x = [
        row['latitude']['max'],
        row['latitude']['max'],
        row['latitude']['min'],
        row['latitude']['min'],
    ]

    # This is so much faster than going the shapely route
    # but also slower than it used to be?
    lon, lat = transformer.transform(coords_x, coords_y)
    square_m = Polygon(zip(lon, lat)).area
    square_km = square_m / (1000 * 1000)
    return square_km

areas = coords.apply(get_area, axis=1)
areas.head()

icao24
000001        15.653545
000001       185.110905
000001      6076.579771
009f55     62250.036094
00b0ec    236331.915381
dtype: float64

# Calculate quantiles from the sample and calculate each plane's behavior
quantiles = areas.loc[sample_icao24].quantile([0.2, 0.4, 0.6, 0.8])

bin_quantiles = [-np.inf, *quantiles, np.inf]
bin_labels = ['area1','area2','area3','area4','area5']

binned = pd.cut(areas,
                bins=bin_quantiles,
                labels=bin_labels)

area_features = binned.groupby(['icao24']).value_counts(normalize=True).unstack(fill_value=0) * 100
area_features.head()

	area1	area2	area3	area4	area5
icao24
000001	100.0	0.0	0.0	0.0	0.0
009f55	0.0	100.0	0.0	0.0	0.0
00b0ec	0.0	0.0	100.0	0.0	0.0
00b1fa	0.0	0.0	100.0	0.0	0.0
00b208	0.0	0.0	100.0	0.0	0.0

Compute the steering#

Each measurement has a heading, the direction the plane is going. By calculating the difference between one heading and the next, we can calculate in which direction the plane was steered.

df['steering'] = df.groupby(['icao24', 'firstseen']).heading.diff()
df.head()

	icao24	callsign	day	firstseen	lastseen	estdepartureairport	estarrivalairport	time	altitude	latitude	longitude	heading	steering
0	a09909	NaN	1529539200	1.529574e+09	1529574562	KPWK	IN93	1529574561	609.6	41.702100	-86.469939	88.232170	NaN
1	a09909	NaN	1529539200	1.529574e+09	1529574562	KPWK	IN93	1529574559	609.6	41.702042	-86.472140	88.253716	0.021545
2	a09909	NaN	1529539200	1.529574e+09	1529574562	KPWK	IN93	1529574490	914.4	41.700657	-86.553040	88.451843	0.198128
3	a09909	NaN	1529539200	1.529574e+09	1529574562	KPWK	IN93	1529574466	1219.2	41.703683	-86.587595	111.037514	22.585670
4	a09909	NaN	1529539200	1.529574e+09	1529574562	KPWK	IN93	1529574465	1219.2	41.703962	-86.588681	111.037514	0.000000

steerings = df.set_index('icao24').steering
steerings.head()

icao24
a09909          NaN
a09909     0.021545
a09909     0.198128
a09909    22.585670
a09909     0.000000
Name: steering, dtype: float64

Since we aren't quite sure how much steering goes into flying a plane, we can take a look at a histogram to try to understand it.

steerings.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x11c5cad30>

Notice those +/- 300-something turns? It's a side effect of our calculation - think about what a 359 degree turn to the right is (Hint: it's just ONE DEGREE to the left). Let's get rid of them.

# Just in case it crossed the 360 barrier, we can only turn 180* L or R
steerings = steerings.apply(lambda direction: direction if direction <= 180 else direction - 360)
steerings = steerings.apply(lambda direction: direction if direction >= -180 else direction + 360)

steerings.head()

icao24
a09909          NaN
a09909     0.021545
a09909     0.198128
a09909    22.585670
a09909     0.000000
Name: steering, dtype: float64

Does the distribution still have those 300s in there?

steerings.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x125f1c550>

That looks more reasonable - now we can run some calculations. Peter didn't use quantiles for this one, he used his own boundaries.

# Calculate quantiles from the sample and calculate each plane's behavior
#quantiles = areas.loc[sample_icao24].quantile([0.2, 0.4, 0.6, 0.8])
quantiles = [-25, -10, -1, 0, 1, 22, 45]

bin_quantiles = [-np.inf, *quantiles, np.inf]
bin_labels = ['steer1','steer2','steer3','steer4','steer5', 'steer6', 'steer7', 'steer8']

binned = pd.cut(steerings,
                bins=bin_quantiles,
                labels=bin_labels)

steer_features = binned.groupby(['icao24']).value_counts(normalize=True).unstack(fill_value=0) * 100
steer_features.head()

steering	steer1	steer2	steer3	steer4	steer5	steer6	steer7	steer8
icao24
000001	1.609788	2.318094	56.213780	6.696716	3.090792	27.817128	1.094656	1.159047
009f55	0.000000	0.000000	29.411765	23.529412	41.176471	5.882353	0.000000	0.000000
00b0ec	0.000000	0.943396	48.113208	8.490566	16.037736	26.415094	0.000000	0.000000
00b1fa	0.000000	0.000000	42.028986	12.077295	10.628019	35.265700	0.000000	0.000000
00b208	0.000000	0.000000	63.975155	16.770186	5.590062	13.043478	0.621118	0.000000

Observations: How many times did each plane show up in our dataset?#

count_features = df.groupby('icao24').size().to_frame(name='observations')
count_features.head(2)

	observations
icao24
000001	1556
009f55	18

Combine everything into a features set#

Now that we have multiple dataframes of features, we can concatenate them all together based on the iaco24 index and have a nice big set of features.

features = pd.concat([
    altitude_features,
    duration_features,
    area_features,
    steer_features,
    count_features
], axis=1, sort=False)
features.head(2)

	altitude1	altitude2	altitude3	altitude4	altitude5	duration1	duration2	duration3	duration4	duration5	...	area5	steer1	steer2	steer3	steer4	steer5	steer6	steer7	steer8	observations
000001	70.951157	28.1491	0.899743	0.0	0.0	33.333333	0.0	0.0	0.0	66.666667	...	0.0	1.609788	2.318094	56.213780	6.696716	3.090792	27.817128	1.094656	1.159047	1556
009f55	0.000000	0.0000	0.000000	100.0	0.0	0.000000	100.0	0.0	0.0	0.000000	...	0.0	0.000000	0.000000	29.411765	23.529412	41.176471	5.882353	0.000000	0.000000	18

2 rows × 24 columns

We'll also turn the icao24 back into a column, and capitalize it to match the other popular datasets on the topic.

features = features.reset_index().rename(columns={ 'index': 'icao24'})
features.icao24 = features.icao24.str.upper()
features.head()

	icao24	altitude1	altitude2	altitude3	altitude4	altitude5	duration1	duration2	...	steer1	steer2	steer3	steer4	steer5	steer6	steer7	steer8	observations
0	000001	70.951157	28.149100	0.899743	0.000000	0.000000	33.333333	0.0	...	1.609788	2.318094	56.213780	6.696716	3.090792	27.817128	1.094656	1.159047	1556
1	009F55	0.000000	0.000000	0.000000	100.000000	0.000000	0.000000	100.0	...	0.000000	0.000000	29.411765	23.529412	41.176471	5.882353	0.000000	0.000000	18
2	00B0EC	17.757009	26.168224	32.710280	14.953271	8.411215	0.000000	0.0	...	0.000000	0.943396	48.113208	8.490566	16.037736	26.415094	0.000000	0.000000	107
3	00B1FA	37.500000	24.038462	9.134615	20.192308	9.134615	0.000000	0.0	...	0.000000	0.000000	42.028986	12.077295	10.628019	35.265700	0.000000	0.000000	208
4	00B208	13.580247	32.098765	17.283951	33.333333	3.703704	0.000000	0.0	...	0.000000	0.000000	63.975155	16.770186	5.590062	13.043478	0.621118	0.000000	162

5 rows × 25 columns

Filtering for meaning and saving#

Peter only kept planes that had over 500 measurements.

We have far fewer planes than he did (and a much smaller dataset), but for propriety's sake we'll do the same.

print("Before filtering", features.shape)
subset = features[features.observations > 500]
print("After filtering", subset.shape)

Before filtering (13840, 25)
After filtering (3965, 25)

# Save the filtered version
subset.to_csv("data/opensky-features.csv", index=False)

# Save a copy of the full set just for experimentation
features.to_csv("data/opensky-features-full.csv", index=False)

Review#

We started off with a list of coordinates and other information for our planes, but needed to turn that into a description of how the plane behaves in the air. Adjusting your data to create new descriptive elements is called feature engineering, and is an important part of machine learning.

While we could have just used simple features like the average altitude and speed of our planes, we felt like that was throwing out a lot of our data. Instead, we divided the "average" behavior of a plane into 5 quantiles - 0-20%, 20-40%, 40-60%, 60-80%, and 80-100% - and saw how much time each plane spent in each quantile. For example, a slow, low-flying plane might spend most of its time in the 0-20% quantile for altitude and speed.

Although it was complicated, this method allowed us to create a more complete set of features describing how each plane flies. The hope is that these detailed features will allow us to better identify planes that are flown similarly to one another.

Discussion topics#

Do you think these features do a better job of describing a plane than just the median or mean? Is that answer the same for all of the measurements we took?
Which feature do you predict works best to describe surveillance planes?
Why do we not use quantiles for steering? It might be interesting to check what the quantiles would have been.
Can you think of other features that might be used to describe surveillance planes?
Some organizations that publish this sort of data have agreements to not include military or government planes. Is that a reasonable request?
What might be some uses of this data?
How is this data collected? What are the requirements to be able to obtain it on your own?