Feature engineering with Buzzfeed planes#
When you're working on flight data, you don't get a nice description of where and how the plane flies - you're stuck with a series of individual measurements. In this notebook we'll work on engineering some descriptive features about each plane's "normal" way of flying.
You might need to install pyproj
and shapely
libraries for this one. You can uncomment the lines below and run them to (hopefully) get the installation done.
# !pip install pyproj
# !pip install shapely
Imports#
# Normal imports for data analysis
import pandas as pd
import numpy as np
# Extra geographic bits, only used for for calculating fly zone area
from shapely.geometry import Polygon
from pyproj import Transformer
%matplotlib inline
Read in our dataset#
First we'll read in the dataset. Each organization you can get data from - OpenSky, FlightRadar24, etc - comes with slightly different columns. This set is from OpenSky.
df = pd.read_csv('data/flights_data4-1529539200-cleaned.csv', dtype={'icao24': 'str'})
df.head()
When we sit down to do our machine learning, our question is going to be: is this plane a surveillance plane or not? Probably a bunch of rows, one per plane, each one having a prediction of 0
(not surveillance) or 1
(surveillance). It seems like a silly thing to call out, but... take a look at our data.
Each row isn't a plane, it's a single reading from that plane in flight! This is going to cause some trouble. We can't predict whether each reading is a spy plane or not, because each reading doesn't have enough information in it. Each row has a time, altitude, coordinates, a heading, and a few other columns, but saying "you're flying north at these coordinates" is definitely not good enough to say whether something is a surveillance plane or not.
We'll somehow need to summarize all of these readings into a single row, to illustrate how the plane flies in a way that we can tell apart surveillance and non-surveillance planes. This calls for feature engineering!
Feature engineering#
Most of the time when we're doing machine learning, we have a list of columns we can plug into our machine learning algorithm. Maybe we divide a column by a thousand, or turn it into a percent instead of a raw number, but nothing really too crazy. Our main goal is feature selection, which is picking which columns to use.
In this case, though, we need to actually design all of these features. We'll pick apart each plane's flight paths using the coordinates, and turn it into a single row.
Picking features to engineer for surveillance plane detection#
What might make a spy plane unique? Buzzfeed had some ideas:
- How fast the plane goes
- Where the plane flies
- How large of an area the plane flies over
- How often the plane changes direction
- How high each flight usually is
- How long each flight is
- What type of plane it is
Your first instinct might be to calculate an average for each one of these. For example, how long is an average flight? and how fast does the plane usually go? Instead of sticking at this level, Peter went a bit further.
Feature engineering example: Altitude quantiles#
Let's see Peter's alternative, using altitude as an example. For example, let's say we're interested in plane a4f179
.
df[df.icao24 == 'a4f179'].head()
How many readings do we have for this one? It's important to know what we're generally working with.
df[df.icao24 == 'a4f179'].shape
Wow, almost four thousand! That's a lot of info. Since we're focused on breaking it down to one row, though, we could ask "how high does this plane usually fly?"
df[df.icao24 == 'a4f179'].altitude.mean()
That number might not mean much on its own, as we don't know if it's high or low. We'd need to compare it to other planes to see what that speed really looks like (...yeah, yeah, this isn't exactly true, but bear with me).
Let's get the mean altitude of all readings.
df.altitude.mean()
Okay, cool, so our plane seems to generally fly a lot lower than other planes. But I'm conflicted - we took 3800 altitude datapoints and turned it into one! It seems like like we're throwing out a lot of data that might prove useful in tracking down planes. Instead of reducing a plane to one number, let's see if we have any alternatives.
Think about all the questions you might ask about how planes fly. Do almost all planes fly at the same altitude? When they're flying, do they usually stay at one altitude? Or do they go up and down, in a way that measuring the mean won't capture?
We don't know anything about planes, so maybe we should ask what a "normal" flight looks like beyond just the mean. A histogram could help us there!
df.altitude.hist()
Wow, interesting! It seems like there's a lot of low-flying planes, and then a big peak at 10,000, but plenty between and above. It might be useful to capture that numerically. We could ask for quartiles, but let's go a little more complex: what are the 20%, 40%, 60% and 80% boundaries for altitude?
quantiles = df.altitude.quantile([0.2, 0.4, 0.6, 0.8])
quantiles
Remember how we were "throwing out a lot of data" by reducing our altitude measurements to just an average? The way Peter got around this way to count each measurement separately.
His question became: compared to an average plane, how often was our plane in each altitude group? How often was between the the 0-20% altitude? or the 20-40% altitude? or the 40-60% latitude?
A very average plane will always be in the 40-60% range. A plane that's always flying high is in the 80-100% group. A low-flying plane would be in 0-20%. If you split your time between high and low, you'll have entries in both 0-20% and 80-100%. This is a great way to summarize your data without throwing it all out.
Let's perform that analysis with our a4f179
plane, seeing how much time it spends in each bucket.
# Create buckets based on the quantiles of all planes,
# reusing our variable from above
bin_quantiles = [-np.inf, *quantiles, np.inf]
# altitude1: -infinity to 914.4
# altitude2: 914.4 to 1,828.8
# altitude3: 1,828.8 to 4,572.0
# altitude4: 4,572.0 to 10,058.4
# altitude5: 10,058.4 to infinity
bin_labels = ['altitude1','altitude2','altitude3','altitude4','altitude5']
# Get the altitudes for just this plane
altitudes = df[df.icao24 == 'a4f179'].altitude
# Count how many altitude reading are in each bucket
# normalize=True to give us percentages
# bins=bin_labels to name them
# sort=False so we can see them in order of 1-5
# * 100 to convert to percentages because it's easier to read
pd.cut(altitudes,
bins=bin_quantiles,
labels=bin_labels).value_counts(sort=False, normalize=True) * 100
So our plane spends about 18% of its time flying low, and 80% of its time flying medium-low. Let's compare that with another plane, a97db4
.
altitudes = df[df.icao24 == 'a97db4'].altitude
pd.cut(altitudes,
bins=bin_quantiles,
labels=bin_labels).value_counts(sort=False, normalize=True) * 100
This plane is a completely different beast! Instead of spending most of its time low, it works its way up to the highest altitudes and spends around 76% of its time there. It's a very different pattern than our first plane, and gives a good summary that we might not have gotten if we just took the mean.
Calculating other feature quantiles#
Now that we have an idea of what we're going to be doing, let's get started! And to make things a bit easier, Buzzfeed didn't actually process these quantiles for all of the planes in its dataset:
Then we took a random sample of 500 aircraft and calculated the following for each one:
So we can start by pulling a sample of planes. Note that we can't just ask for 500 random rows - we have to ask for 500 unique plane identifiers and then use those to select our rows.
unique_icao24 = df.icao24.unique()
sample_icao24 = np.random.choice(unique_icao24, 500)
sample_icao24[:5]
Now we can select our readings for the sample planes.
sample_df = df[df.icao24.isin(sample_icao24)]
sample_df.shape
Calculate quantiles#
Then we took a random sample of 500 aircraft and calculated the following for each one:
- duration of each flight segment recorded by Flightradar24, in minutes.
- boxes Area of a rectangular bounding box drawn around each flight segment, in square kilometers.
Altitude quantiles#
Since we basically did altitude already, though, we'll throw that in there.
# Get a list of altitudes
altitudes = df.set_index('icao24').altitude
# Calculate quantiles from the sample and calculate each plane's behavior
quantiles = altitudes.loc[sample_icao24].quantile([0.2, 0.4, 0.6, 0.8])
bin_quantiles = [-np.inf, *quantiles, np.inf]
bin_labels = ['altitude1','altitude2','altitude3','altitude4','altitude5']
binned = pd.cut(altitudes,
bins=bin_quantiles,
labels=bin_labels)
altitude_features = binned.groupby(['icao24']).value_counts(normalize=True).unstack(fill_value=0) * 100
altitude_features.head()
Flight duration quantiles#
Calculate flight durations for all of our planes. We're grouping by the plane ID and the firstseen
column, which indicates a new flight.
flight_times = df.groupby(['icao24', 'firstseen']).lastseen.last().reset_index(level=1)
flight_times.head()
We can then subtract the two to get the duration for each flight.
durations = flight_times.lastseen - flight_times.firstseen
durations.head()
# Calculate quantiles from the sample and calculate each plane's behavior
quantiles = durations.loc[sample_icao24].quantile([0.2, 0.4, 0.6, 0.8])
bin_quantiles = [-np.inf, *quantiles, np.inf]
bin_labels = ['duration1','duration2','duration3','duration4','duration5']
binned = pd.cut(durations,
bins=bin_quantiles,
labels=bin_labels)
duration_features = binned.groupby(['icao24']).value_counts(normalize=True).unstack(fill_value=0) * 100
duration_features.head()
Speed quantiles#
TODO speed!!!
df.head(2)
Bounding box area quintiles#
We'll calculate for everyone, then get quintiles from the sample
# Calculate min/max for each flight's lat/lon
coords = df.groupby(['icao24', 'firstseen']).agg({
'latitude': ['min', 'max'],
'longitude': ['min', 'max']
}).reset_index(level=1, drop=True)
coords.head()
# Calculate the bounding box in km for those
# Yeah, sorry this is complicated
transformer = Transformer.from_crs('epsg:4326', 'epsg:3857')
def get_area(row):
coords_y = [
row['longitude']['min'],
row['longitude']['max'],
row['longitude']['max'],
row['longitude']['min'],
]
coords_x = [
row['latitude']['max'],
row['latitude']['max'],
row['latitude']['min'],
row['latitude']['min'],
]
# This is so much faster than going the shapely route
# but also slower than it used to be?
lon, lat = transformer.transform(coords_x, coords_y)
square_m = Polygon(zip(lon, lat)).area
square_km = square_m / (1000 * 1000)
return square_km
areas = coords.apply(get_area, axis=1)
areas.head()
# Calculate quantiles from the sample and calculate each plane's behavior
quantiles = areas.loc[sample_icao24].quantile([0.2, 0.4, 0.6, 0.8])
bin_quantiles = [-np.inf, *quantiles, np.inf]
bin_labels = ['area1','area2','area3','area4','area5']
binned = pd.cut(areas,
bins=bin_quantiles,
labels=bin_labels)
area_features = binned.groupby(['icao24']).value_counts(normalize=True).unstack(fill_value=0) * 100
area_features.head()
Compute the steering#
Each measurement has a heading, the direction the plane is going. By calculating the difference between one heading and the next, we can calculate in which direction the plane was steered.
df['steering'] = df.groupby(['icao24', 'firstseen']).heading.diff()
df.head()
steerings = df.set_index('icao24').steering
steerings.head()
Since we aren't quite sure how much steering goes into flying a plane, we can take a look at a histogram to try to understand it.
steerings.hist()
Notice those +/- 300-something turns? It's a side effect of our calculation - think about what a 359 degree turn to the right is (Hint: it's just ONE DEGREE to the left). Let's get rid of them.
# Just in case it crossed the 360 barrier, we can only turn 180* L or R
steerings = steerings.apply(lambda direction: direction if direction <= 180 else direction - 360)
steerings = steerings.apply(lambda direction: direction if direction >= -180 else direction + 360)
steerings.head()
Does the distribution still have those 300s in there?
steerings.hist()
That looks more reasonable - now we can run some calculations. Peter didn't use quantiles for this one, he used his own boundaries.
# Calculate quantiles from the sample and calculate each plane's behavior
#quantiles = areas.loc[sample_icao24].quantile([0.2, 0.4, 0.6, 0.8])
quantiles = [-25, -10, -1, 0, 1, 22, 45]
bin_quantiles = [-np.inf, *quantiles, np.inf]
bin_labels = ['steer1','steer2','steer3','steer4','steer5', 'steer6', 'steer7', 'steer8']
binned = pd.cut(steerings,
bins=bin_quantiles,
labels=bin_labels)
steer_features = binned.groupby(['icao24']).value_counts(normalize=True).unstack(fill_value=0) * 100
steer_features.head()
Observations: How many times did each plane show up in our dataset?#
count_features = df.groupby('icao24').size().to_frame(name='observations')
count_features.head(2)
Combine everything into a features set#
Now that we have multiple dataframes of features, we can concatenate them all together based on the iaco24
index and have a nice big set of features.
features = pd.concat([
altitude_features,
duration_features,
area_features,
steer_features,
count_features
], axis=1, sort=False)
features.head(2)
We'll also turn the icao24
back into a column, and capitalize it to match the other popular datasets on the topic.
features = features.reset_index().rename(columns={ 'index': 'icao24'})
features.icao24 = features.icao24.str.upper()
features.head()
Filtering for meaning and saving#
Peter only kept planes that had over 500 measurements.
We have far fewer planes than he did (and a much smaller dataset), but for propriety's sake we'll do the same.
print("Before filtering", features.shape)
subset = features[features.observations > 500]
print("After filtering", subset.shape)
# Save the filtered version
subset.to_csv("data/opensky-features.csv", index=False)
# Save a copy of the full set just for experimentation
features.to_csv("data/opensky-features-full.csv", index=False)
Review#
We started off with a list of coordinates and other information for our planes, but needed to turn that into a description of how the plane behaves in the air. Adjusting your data to create new descriptive elements is called feature engineering, and is an important part of machine learning.
While we could have just used simple features like the average altitude and speed of our planes, we felt like that was throwing out a lot of our data. Instead, we divided the "average" behavior of a plane into 5 quantiles - 0-20%, 20-40%, 40-60%, 60-80%, and 80-100% - and saw how much time each plane spent in each quantile. For example, a slow, low-flying plane might spend most of its time in the 0-20% quantile for altitude and speed.
Although it was complicated, this method allowed us to create a more complete set of features describing how each plane flies. The hope is that these detailed features will allow us to better identify planes that are flown similarly to one another.
Discussion topics#
- Do you think these features do a better job of describing a plane than just the median or mean? Is that answer the same for all of the measurements we took?
- Which feature do you predict works best to describe surveillance planes?
- Why do we not use quantiles for steering? It might be interesting to check what the quantiles would have been.
- Can you think of other features that might be used to describe surveillance planes?
- Some organizations that publish this sort of data have agreements to not include military or government planes. Is that a reasonable request?
- What might be some uses of this data?
- How is this data collected? What are the requirements to be able to obtain it on your own?