Combining datasets and cleaning data for jury selection analysis
Before we perform our logistic regression on jury selection data, we'll need to do a bit of cleaning.
import pandas as pd
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
pd.set_option('display.width', 200)
Read in the files
The dataset comes in a few sections: the jurors themselves, their answers to the questions, and data about the trial.
jurors = pd.read_csv("data/jurors.csv")
jurors.head(2)
| | id | trial | trial__id | race | gender | race_source | gender_source | struck_by | strike_eligibility |
---|---|---|---|---|---|---|---|---|---|
0 | 35 | 1993-9826--Terry L. Landingham | 1 | White | Male | Jury strike sheet | Jury strike sheet | Struck for cause | NaN |
1 | 38 | 1993-9826--Terry L. Landingham | 1 | Black | Female | Jury strike sheet | Jury strike sheet | Struck for cause | NaN |
answers = pd.read_csv("data/voir_dire_answers.csv")
answers.head(2)
| | id | juror_id | juror_id__trial__id | no_responses | married | children | religious | education | leans_state | leans_defense | leans_ambi | moral_hardship | job_hardship | caretaker | communication | medical | employed | social | prior_jury | crime_victim | fam_crime_victim | accused | fam_accused | eyewitness | fam_eyewitness | military | law_enforcement | fam_law_enforcement | premature_verdict | premature_guilt | premature_innocence | def_race | vic_race | def_gender | vic_gender | def_social | vic_social | def_age | vic_age | def_sexpref | vic_sexpref | def_incarcerated | vic_incarcerated | beliefs | other_biases | innocence | take_stand | arrest_is_guilt | cant_decide | cant_affirm | cant_decide_evidence | cant_follow | know_def | know_vic | know_wit | know_attny | civil_plantiff | civil_def | civil_witness | witness_defense | witness_state | prior_info | death_hesitation | no_death | no_life | no_cops | yes_cops | legally_disqualified | witness_ambi | notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1521 | 107.0 | 3.0 | False | unknown | unknown | unknown | unknown | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | NaN |
1 | 1524 | 108.0 | 3.0 | False | unknown | unknown | unknown | unknown | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | NaN |
trials = pd.read_csv("data/trials.csv")
trials.head(2)
| | id | defendant_name | cause_number | state_strikes | defense_strikes | county | defendant_race | second_defendant_race | third_defendant_race | fourth_defendant_race | more_than_four_defendants | judge | prosecutor_1 | prosecutor_2 | prosecutor_3 | prosecutors_more_than_three | def_attny_1 | def_attny_2 | def_attny_3 | def_attnys_more_than_three | offense_code_1 | offense_title_1 | offense_code_2 | offense_title_2 | offense_code_3 | offense_title_3 | offense_code_4 | offense_title_4 | offense_code_5 | offense_title_5 | offense_code_6 | offense_title_6 | more_than_six | verdict | case_appealed | batson_claim_by_defense | batson_claim_by_state | voir_dire_present |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Terry L. Landingham | 1993-9826 | False | False | Attala | Black | NaN | NaN | NaN | False | Joseph Loper, Jr | Kevin Horan | NaN | NaN | False | James H. Powell, III | NaN | NaN | False | 97-3-7(2)(b) | Aggravated Assault | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False | Guilty on at least one offense | True | False | False | True |
1 | 2 | Donovan Johnson | 2009-0023 | False | True | Attala | Black | NaN | NaN | NaN | False | Joseph Loper, Jr | Ryan M. Berry | Mike Howie | NaN | False | Rosalind H. Jordan | NaN | NaN | False | 41-29-139(a)(1)(b)(1) | sale of cocaine | 41-29-139(a)(1)(b)(1) | sale of cocaine | 41-29-139(a)(1)(b)(1) | sale of cocaine | NaN | NaN | NaN | NaN | NaN | NaN | False | Guilty on at least one offense | True | False | False | True |
Combine
We'll combine the datasets in two steps: first we match each set of answers to its juror using the juror's id, then we attach the trial data using the id of the trial the juror was called for.
df = answers.merge(jurors, left_on='juror_id', right_on='id')
df = df.merge(trials, left_on='trial__id', right_on='id')
df.head(2)
| | id_x | juror_id | juror_id__trial__id | no_responses | married | children | religious | education | leans_state | leans_defense | leans_ambi | moral_hardship | job_hardship | caretaker | communication | medical | employed | social | prior_jury | crime_victim | fam_crime_victim | accused | fam_accused | eyewitness | fam_eyewitness | military | law_enforcement | fam_law_enforcement | premature_verdict | premature_guilt | premature_innocence | def_race | vic_race | def_gender | vic_gender | def_social | vic_social | def_age | vic_age | def_sexpref | vic_sexpref | def_incarcerated | vic_incarcerated | beliefs | other_biases | innocence | take_stand | arrest_is_guilt | cant_decide | cant_affirm | cant_decide_evidence | cant_follow | know_def | know_vic | know_wit | know_attny | civil_plantiff | civil_def | civil_witness | witness_defense | witness_state | prior_info | death_hesitation | no_death | no_life | no_cops | yes_cops | legally_disqualified | witness_ambi | notes | id_y | trial | trial__id | race | gender | race_source | gender_source | struck_by | strike_eligibility | id | defendant_name | cause_number | state_strikes | defense_strikes | county | defendant_race | second_defendant_race | third_defendant_race | fourth_defendant_race | more_than_four_defendants | judge | prosecutor_1 | prosecutor_2 | prosecutor_3 | prosecutors_more_than_three | def_attny_1 | def_attny_2 | def_attny_3 | def_attnys_more_than_three | offense_code_1 | offense_title_1 | offense_code_2 | offense_title_2 | offense_code_3 | offense_title_3 | offense_code_4 | offense_title_4 | offense_code_5 | offense_title_5 | offense_code_6 | offense_title_6 | more_than_six | verdict | case_appealed | batson_claim_by_defense | batson_claim_by_state | voir_dire_present |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1521 | 107.0 | 3.0 | False | unknown | unknown | unknown | unknown | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | NaN | 107 | 2004-0257--Sparky Watson | 3 | White | Male | Jury strike sheet | Jury strike sheet | Struck by the defense | Both State and Defense | 3 | Sparky Watson | 2004-0257 | True | True | Grenada | Black | NaN | NaN | NaN | False | C. Morgan, III | Susan Denley | Ryan Berry | NaN | False | M. Kevin Horan | Elizabeth Davis | NaN | False | 41-29-139(a)(1)(b)(3) | sale of marihuana (less than 30 grams) | 41-29-139(a)(1)(b)(1) | sale of cocaine | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False | Guilty on at least one offense | True | False | False | True |
1 | 1524 | 108.0 | 3.0 | False | unknown | unknown | unknown | unknown | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | NaN | 108 | 2004-0257--Sparky Watson | 3 | Black | Female | Jury strike sheet | Jury strike sheet | Struck by the state | State | 3 | Sparky Watson | 2004-0257 | True | True | Grenada | Black | NaN | NaN | NaN | False | C. Morgan, III | Susan Denley | Ryan Berry | NaN | False | M. Kevin Horan | Elizabeth Davis | NaN | False | 41-29-139(a)(1)(b)(3) | sale of marihuana (less than 30 grams) | 41-29-139(a)(1)(b)(1) | sale of cocaine | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False | Guilty on at least one offense | True | False | False | True |
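A merge on id columns can silently drop rows that have no match, or duplicate rows when a key repeats. The sketch below is one way to check for that before trusting the combined table. It uses tiny made-up dataframes rather than the real files, and the unmatched `juror_id` of 999 is a hypothetical value added only to show what a failed match looks like.

```python
import pandas as pd

# Toy stand-ins for the real files; all values are made up for illustration.
jurors = pd.DataFrame({
    "id": [107, 108, 109],
    "trial__id": [3, 3, 3],
    "race": ["White", "Black", "Black"],
})
answers = pd.DataFrame({
    "id": [1521, 1524, 1525, 1600],
    "juror_id": [107, 108, 109, 999],  # 999 is hypothetical: no such juror
    "married": ["unknown", "unknown", "unknown", "unknown"],
})
trials = pd.DataFrame({"id": [3], "county": ["Grenada"]})

# A left merge keeps every answers row, and indicator=True adds a _merge
# column that says whether each row found a partner in jurors.
check = answers.merge(
    jurors, left_on="juror_id", right_on="id",
    how="left", indicator=True, validate="many_to_one",
)
unmatched_count = int((check["_merge"] == "left_only").sum())
print(unmatched_count, "answer row(s) had no matching juror")

# The same two merges as above, with validate= added so pandas raises an
# error if a juror id or trial id is unexpectedly duplicated on the right.
df = answers.merge(jurors, left_on="juror_id", right_on="id",
                   validate="many_to_one")
df = df.merge(trials, left_on="trial__id", right_on="id",
              validate="many_to_one")
print(df.columns.tolist())
```

Because both `answers` and `jurors` have a column called `id`, pandas renames them `id_x` and `id_y`, which is where those column names in the output above come from.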
Filter
We'll now need to label the jurors as struck or not. We'll keep only the jurors the state was eligible to strike, whether by the state alone or by both the state and the defense, and then label each of them as being struck by the state or not.
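# .copy() gives us an independent dataframe, so adding a column below doesn't raise SettingWithCopyWarning
df = df[(df.strike_eligibility == 'Both State and Defense') | (df.strike_eligibility == 'State')].copy()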
df.state_strikes.value_counts()
0    1647
1     648
Name: state_strikes, dtype: int64
df['struck_by_state'] = df.struck_by == 'Struck by the state'
df.struck_by_state.value_counts()
False    1722
True      573
Name: struck_by_state, dtype: int64
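To see the filter and the label working on something small enough to read at a glance, here is a minimal sketch on made-up rows. The category labels are copied from the data above, except for `'Defense'` and the missing value, which are assumptions added to show rows that get filtered out. It uses `isin`, which should behave the same as chaining the two `==` comparisons with `|`.

```python
import pandas as pd

# Made-up rows; 'Defense' is an assumed label for jurors only the defense could strike.
toy = pd.DataFrame({
    "strike_eligibility": ["Both State and Defense", "State", "Defense", None],
    "struck_by": [
        "Struck by the defense",
        "Struck by the state",
        "Struck by the defense",
        "Struck for cause",
    ],
})

# Keep only jurors the state had a chance to strike. Rows with a missing
# eligibility value are dropped too, because isin() returns False for them.
eligible = toy[toy.strike_eligibility.isin(["Both State and Defense", "State"])].copy()

# The comparison produces True/False for every remaining juror.
eligible["struck_by_state"] = eligible.struck_by == "Struck by the state"

print(eligible.struck_by_state.value_counts())
```

Only the first two toy rows survive the filter, and only the second one is labeled as struck by the state.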
Turn into numbers
Our dataset is absolutely full of True and False values! Machine learning likes 0 and 1 values a lot more, so we'll do a search and replace across our entire dataframe.
df = df.replace({
    True: 1,
    False: 0
})
df.head(3)
| | id_x | juror_id | juror_id__trial__id | no_responses | married | children | religious | education | leans_state | leans_defense | leans_ambi | moral_hardship | job_hardship | caretaker | communication | medical | employed | social | prior_jury | crime_victim | fam_crime_victim | accused | fam_accused | eyewitness | fam_eyewitness | military | law_enforcement | fam_law_enforcement | premature_verdict | premature_guilt | premature_innocence | def_race | vic_race | def_gender | vic_gender | def_social | vic_social | def_age | vic_age | def_sexpref | vic_sexpref | def_incarcerated | vic_incarcerated | beliefs | other_biases | innocence | take_stand | arrest_is_guilt | cant_decide | cant_affirm | cant_decide_evidence | cant_follow | know_def | know_vic | know_wit | know_attny | civil_plantiff | civil_def | civil_witness | witness_defense | witness_state | prior_info | death_hesitation | no_death | no_life | no_cops | yes_cops | legally_disqualified | witness_ambi | notes | id_y | trial | trial__id | race | gender | race_source | gender_source | struck_by | strike_eligibility | id | defendant_name | cause_number | state_strikes | defense_strikes | county | defendant_race | second_defendant_race | third_defendant_race | fourth_defendant_race | more_than_four_defendants | judge | prosecutor_1 | prosecutor_2 | prosecutor_3 | prosecutors_more_than_three | def_attny_1 | def_attny_2 | def_attny_3 | def_attnys_more_than_three | offense_code_1 | offense_title_1 | offense_code_2 | offense_title_2 | offense_code_3 | offense_title_3 | offense_code_4 | offense_title_4 | offense_code_5 | offense_title_5 | offense_code_6 | offense_title_6 | more_than_six | verdict | case_appealed | batson_claim_by_defense | batson_claim_by_state | voir_dire_present | struck_by_state |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1521 | 107.0 | 3.0 | 0 | unknown | unknown | unknown | unknown | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 107 | 2004-0257--Sparky Watson | 3 | White | Male | Jury strike sheet | Jury strike sheet | Struck by the defense | Both State and Defense | 3 | Sparky Watson | 2004-0257 | 1 | 1 | Grenada | Black | NaN | NaN | NaN | 0 | C. Morgan, III | Susan Denley | Ryan Berry | NaN | 0 | M. Kevin Horan | Elizabeth Davis | NaN | 0 | 41-29-139(a)(1)(b)(3) | sale of marihuana (less than 30 grams) | 41-29-139(a)(1)(b)(1) | sale of cocaine | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | Guilty on at least one offense | 1 | 0 | 0 | 1 | 0 |
1 | 1524 | 108.0 | 3.0 | 0 | unknown | unknown | unknown | unknown | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 108 | 2004-0257--Sparky Watson | 3 | Black | Female | Jury strike sheet | Jury strike sheet | Struck by the state | State | 3 | Sparky Watson | 2004-0257 | 1 | 1 | Grenada | Black | NaN | NaN | NaN | 0 | C. Morgan, III | Susan Denley | Ryan Berry | NaN | 0 | M. Kevin Horan | Elizabeth Davis | NaN | 0 | 41-29-139(a)(1)(b)(3) | sale of marihuana (less than 30 grams) | 41-29-139(a)(1)(b)(1) | sale of cocaine | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | Guilty on at least one offense | 1 | 0 | 0 | 1 | 1 |
2 | 1525 | 109.0 | 3.0 | 1 | unknown | unknown | unknown | unknown | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 109 | 2004-0257--Sparky Watson | 3 | Black | Female | Jury strike sheet | Jury strike sheet | Juror chosen to serve on jury | Both State and Defense | 3 | Sparky Watson | 2004-0257 | 1 | 1 | Grenada | Black | NaN | NaN | NaN | 0 | C. Morgan, III | Susan Denley | Ryan Berry | NaN | 0 | M. Kevin Horan | Elizabeth Davis | NaN | 0 | 41-29-139(a)(1)(b)(3) | sale of marihuana (less than 30 grams) | 41-29-139(a)(1)(b)(1) | sale of cocaine | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | Guilty on at least one offense | 1 | 0 | 0 | 1 | 0 |
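The search and replace above looks at every cell in the dataframe. If you would rather convert only the columns that are purely True/False, one alternative is to pick them out by dtype and cast them. This is a sketch on a made-up dataframe, not the method used above. Note that a column mixing True/False with missing values is not stored as a boolean column, so this approach would leave it alone.

```python
import pandas as pd

# Made-up rows: one text column and two True/False columns.
toy = pd.DataFrame({
    "married": ["unknown", "married", "unknown"],
    "prior_jury": [False, True, False],
    "struck_by_state": [True, False, False],
})

# Find the columns whose dtype is bool, then cast just those to integers.
bool_cols = toy.select_dtypes(include="bool").columns
converted = toy.astype({col: "int64" for col in bool_cols})

print(converted.dtypes)
print(converted)
```

The text column is untouched, while `prior_jury` and `struck_by_state` become integer columns of 0s and 1s.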
Save
Now that we're all done, it's time to save the results!
df.to_csv("data/jury-cleaned.csv", index=False)
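The `index=False` argument matters when the file is read back in later. Here is a small sketch of the difference, using a made-up dataframe and an in-memory buffer in place of a real file.

```python
import io
import pandas as pd

# Made-up stand-in for the cleaned dataframe.
cleaned = pd.DataFrame({
    "juror_id": [107, 108],
    "race": ["White", "Black"],
    "struck_by_state": [0, 1],
})

# With index=False, only the real columns are written out.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())

# Without it, the row index is saved as an extra unnamed column.
buffer_with_index = io.StringIO()
cleaned.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)
print(with_index.columns.tolist())
```

The second version comes back with a leading `Unnamed: 0` column, which is easy to feed into a model as a feature by accident.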
Discussion topics
Why did we run the filter below?
df = df[(df.strike_eligibility == 'Both State and Defense') | (df.strike_eligibility == 'State')]
About the site
Hi, I'm Soma, welcome to Data Science for Journalism a.k.a. investigate.ai!
Thanks to Columbia Journalism School, the Knight Foundation, and many others.