In [2]:
# --- Imports ---
import sys,os
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline

SF drug geography

Data pulling and cleaning -

Let's pull all SF crime data provided by the SF open data portal and peek at the schema.

In [3]:
d_crime=pd.read_csv('/Users/lmartin/Desktop/Misc/Blog_Archive/GIS/Map__Crime_Incidents_-_from_1_Jan_2003.csv')
print(d_crime.shape)
d_crime.head(1)
(1723961, 12)
Out[3]:
IncidntNum Category Descript DayOfWeek Date Time PdDistrict Resolution Address X Y Location
0 50436712 ASSAULT BATTERY Wednesday 04/20/2005 12:00:00 AM 04:00 MISSION NONE 18TH ST / CASTRO ST -122.435003 37.760888 (37.7608878061245, -122.435002864271)

It provides ~1.7M records with 12 fields: incident number, category, description, day of week, date, time, police district, resolution, address, and coordinates.

Here's a nice map of the districts: http://sf-police.org/index.aspx?page=796

Let's create an easy handle (days since the first record) for time-series analysis.

In [4]:
date=pd.to_datetime(d_crime['Date'])
print(date.min())
print(date.max())
t_delta=(date-date.min()).astype('timedelta64[D]')
d_crime['days']=t_delta
d_crime.head(1)
2003-01-01 00:00:00
2015-02-13 00:00:00
Out[4]:
IncidntNum Category Descript DayOfWeek Date Time PdDistrict Resolution Address X Y Location days
0 50436712 ASSAULT BATTERY Wednesday 04/20/2005 12:00:00 AM 04:00 MISSION NONE 18TH ST / CASTRO ST -122.435003 37.760888 (37.7608878061245, -122.435002864271) 840

The first recorded request is 2003-01-01 and most recent is 2015-02-13. Nice.
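The day-offset arithmetic can be sanity-checked on a toy series (a standalone sketch; `.dt.days` is the modern equivalent of the `astype('timedelta64[D]')` cast used above):

```python
import pandas as pd

# Two dates in the dataset's MM/DD/YYYY format
date = pd.to_datetime(pd.Series(['01/01/2003', '04/20/2005']), format='%m/%d/%Y')

# Whole days elapsed since the earliest record
t_delta = (date - date.min()).dt.days
print(list(t_delta))  # [0, 840] -- matching the 'days' value in the row above
```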

Distributions and exploration -

In [411]:
def plotdat(data,cat):
    l=data.groupby(cat).size().sort_values()   # sort_values in newer pandas; was in-place l.sort()
    plt.figure(figsize=(10,5))
    plt.yticks(fontsize=8)
    l.plot(kind='bar',fontsize=12,color='k')
    plt.xlabel('')
    plt.ylabel('Number of reports',fontsize=10)

plotdat(d_crime,'Category')
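plotdat boils down to a groupby-size count; on a toy frame (category names illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Category': ['ASSAULT', 'THEFT', 'THEFT', 'DRUG/NARCOTIC', 'THEFT']})

# Count reports per category, smallest first (the order plotdat plots in)
counts = df.groupby('Category').size().sort_values()
print(counts.to_dict())  # {'ASSAULT': 1, 'DRUG/NARCOTIC': 1, 'THEFT': 3}
```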

Let's get a more detailed view by examining Descript, which is the particular crime type.

In [405]:
l=d_crime.groupby('Descript').size().sort_values()
print(l.shape)
(912,)

Since there are 912 different crime types, let's slice by percentile and peek at the top crime types for each PdDistrict.

In [409]:
def types_districts(d_crime,per):
    
    # Group by crime type and district 
    hoods_per_type=d_crime.groupby('Descript').PdDistrict.value_counts(sort=True)
    t=hoods_per_type.unstack().fillna(0)
    
    # Sort districts by total report count
    hood_sum=t.sum(axis=0).sort_values(ascending=False)
    t=t[hood_sum.index]
    
    # Total reports per crime type
    crime_sum=t.sum(axis=1)
    
    # Large number of types, so slice at the given percentile.
    p=np.percentile(crime_sum,per)
    ix=crime_sum[crime_sum>p]
    t=t.loc[ix.index]
    return t
    
t=types_districts(d_crime,96)
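The percentile slicing inside types_districts works like this (toy counts, illustrative labels):

```python
import numpy as np
import pandas as pd

# Report totals per crime type
crime_sum = pd.Series([5, 10, 200, 3000, 12000],
                      index=['type_a', 'type_b', 'type_c', 'type_d', 'type_e'])

# Keep only types above the 60th percentile of totals
p = np.percentile(crime_sum, 60)
top = crime_sum[crime_sum > p]
print(set(top.index))  # {'type_d', 'type_e'}
```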

Cluster the non-normalized data across the top percentile reports and each PdDistrict.

In [9]:
sns.clustermap(t) 
Out[9]:
<seaborn.matrix.ClusterGrid at 0x10299dd10>

Normalize vertically (per column) across PdDistrict.

In [18]:
sns.clustermap(t,standard_scale=1) 
Out[18]:
<seaborn.matrix.ClusterGrid at 0x10a782c10>

Normalize horizontally (per row) across crime types.

In [10]:
sns.clustermap(t,standard_scale=0) 
Out[10]:
<seaborn.matrix.ClusterGrid at 0x1096d1110>
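For reference, clustermap's standard_scale is a min-max rescale along the chosen axis (0 = rows, 1 = columns), not a z-score; the arithmetic on a toy array:

```python
import numpy as np

t = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [5.0, 50.0]])

# standard_scale=1: scale each column to [0, 1]
col_scaled = (t - t.min(axis=0)) / (t.max(axis=0) - t.min(axis=0))

# standard_scale=0: the same, per row
row_scaled = ((t.T - t.min(axis=1)) / (t.max(axis=1) - t.min(axis=1))).T
```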

(1) Grand theft auto is the most common crime in most PdDistricts.

  • Tenderloin is an outlier, enriched in base/rock cocaine and narcotics offenses.

(2) For the distribution of crime across areas:

  • Southern: Theft, including theft from auto.
  • Tenderloin: Base/rock cocaine and narcotics.
  • Bayview: Violence and threats.

Now, let's drill down on a specific question -

Let's re-examine the crime types.

In [242]:
plotdat(d_crime,'Category')

I'm interested in DRUG/NARCOTIC:

  • I think it will show some interesting dynamics.
  • I think different areas of the city will have different distributions.
In [80]:
# Let's drill down onto one category
cat=d_crime[d_crime['Category']=='DRUG/NARCOTIC']
c=cat['Descript'].value_counts()   # value_counts already sorts descending
c.head(10)
Out[80]:
POSSESSION OF NARCOTICS PARAPHERNALIA       19795
POSSESSION OF BASE/ROCK COCAINE             14003
POSSESSION OF MARIJUANA                     10815
SALE OF BASE/ROCK COCAINE                    8691
POSSESSION OF BASE/ROCK COCAINE FOR SALE     7204
POSSESSION OF METH-AMPHETAMINE               7112
POSSESSION OF MARIJUANA FOR SALES            5456
POSSESSION OF CONTROLLED SUBSTANCE           4123
POSSESSION OF HEROIN                         3954
POSSESSION OF COCAINE                        2897
dtype: int64

We can reuse types_districts from above; we simply slice the input data to a single category first.

In [81]:
t=types_districts(cat,70)
In [82]:
sns.clustermap(t)
Out[82]:
<seaborn.matrix.ClusterGrid at 0x141545450>
In [83]:
sns.clustermap(t,standard_scale=1)
Out[83]:
<seaborn.matrix.ClusterGrid at 0x151752f90>
In [84]:
sns.clustermap(t,standard_scale=0)
Out[84]:
<seaborn.matrix.ClusterGrid at 0x1127195d0>

Nice. We could study these for a while.

But, here's the point:

I think we can simplify this if we compress the different drug types into groups.

Then, we can examine both temporal and spatial profiles.

Drug dynamics -

We'll create a 30 day window.

In [376]:
# Let's drill down onto one category
cat=d_crime[d_crime['Category']=='DRUG/NARCOTIC'].copy()   # .copy() avoids SettingWithCopyWarning

# Bin crime by 30 day window
cat['Month']=np.floor(cat['days']/30) # Approximate month (30 day window)

# Default
district='All'
In [377]:
def timeseries(dat,per):
    ''' Category grouped by month '''
    
    # Group by crime type and month
    cat_per_time=dat.groupby('Month').Descript.value_counts(sort=True)
    t=cat_per_time.unstack().fillna(0)
        
    # Total reports per crime type
    crime_sum=t.sum(axis=0)
    
    # Large number of types, so slice at the given percentile.
    p=np.percentile(crime_sum,per)
    ix=crime_sum[crime_sum>p]
    t=t[ix.index]
    return t
    
t_all=timeseries(cat,0)
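The 30-day binning feeding this function can be checked directly (a toy sketch):

```python
import numpy as np
import pandas as pd

days = pd.Series([0, 29, 30, 59, 60, 365])

# Approximate month index: floor division into 30-day windows
month = np.floor(days / 30)
print(list(month))  # [0.0, 0.0, 1.0, 1.0, 2.0, 12.0]
```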

Let's group the drug categories to make this easier to examine.

In [190]:
barituate_features=['SALE OF BARBITUATES',   # NB: 'BARBITUATES' is the dataset's own spelling
                    'POSSESSION OF BARBITUATES FOR SALES',
                    'ENCOURAGE MINOR TO USE BARBITUATES',
                    'POSSESSION OF BARBITUATES'] 
In [212]:
coke_features=['ENCOURAGING MINOR TO USE COCAINE',
               'SALES COCAINE BASE/SCHOOLYARD TRAFFICKING ACT VIO',
               'TRANSPORTATION OF COCAINE',
               'SALE OF COCAINE',
               'POSSESSION OF COCAINE FOR SALES',
               'POSSESSION OF COCAINE']
In [210]:
weed_features=['ENCOURAGING MINOR TO USE MARIJUANA',
               'FURNISHING MARIJUANA',
               'PLANTING/CULTIVATING MARIJUANA',
               'TRANSPORTATION OF MARIJUANA',
               'SALE OF MARIJUANA',
               'POSSESSION OF MARIJUANA FOR SALES',
               'POSSESSION OF MARIJUANA']
In [171]:
metadone_features=['TRANSPORTATION OF METHADONE',
                   'SALE OF METHADONE',
                   'POSSESSION OF METHADONE FOR SALES',
                   'POSSESSION OF METHADONE']
In [172]:
hallu_features=['TRANSPORTATION OF OPIATES',   # duplicates removed: repeated labels double-count columns
                'SALE OF HALLUCINOGENIC',
                'POSSESSION OF OPIUM',
                'POSSESSION OF OPIUM DERIVATIVE',
                'SALE OF OPIUM',
                'SALE OF OPIUM DERIVATIVE',
                'POSSESSION OF OPIUM FOR SALES',
                'TRANSPORTATION OF HALLUCINOGENIC',
                'POSSESSION OF OPIUM DERIVATIVE FOR SALES',
                'SALE OF OPIATES',
                'POSSESSION OF OPIATES FOR SALES',
                'POSSESSION OF HALLUCINOGENIC FOR SALES',
                'POSSESSION OF OPIATES',
                'POSSESSION OF HALLUCINOGENIC']
In [173]:
meth_features=['TRANSPORTATION OF AMPHETAMINE',
               'SALE OF AMPHETAMINE',
               'POSSESSION OF AMPHETAMINE',
               'SALE OF METH-AMPHETAMINE',
               'TRANSPORTATION OF METH-AMPHETAMINE',
               'POSSESSION OF AMPHETAMINE FOR SALES',
               'POSSESSION OF METH-AMPHETAMINE FOR SALE',
               'POSSESSION OF METH-AMPHETAMINE']
In [175]:
heroin_features=['SALE OF HEROIN',   # duplicates removed
                 'POSSESSION OF HEROIN',
                 'POSSESSION OF HEROIN FOR SALES',
                 'TRANSPORTATION OF HEROIN']
In [174]:
crack_features=['POSSESSION OF BASE/ROCK COCAINE FOR SALE',
                'SALE OF BASE/ROCK COCAINE',
                'POSSESSION OF BASE/ROCK COCAINE']
In [283]:
# Let's use real dates for plotting
days_from_start=pd.Series(t_all.index*30).astype('timedelta64[D]')
dates_for_plot=date.min()+days_from_start
time_labels=dates_for_plot.map(lambda x: str(x.year)+'-'+str(x.month))
In [397]:
def drug_analysis(t,district,plot):
    # Sum each group of Descript columns into a single drug column
    t['BARBITUATES']=t[barituate_features].sum(axis=1)
    t['HEROIN']=t[heroin_features].sum(axis=1)
    t['HALLUCINOGENIC']=t[hallu_features].sum(axis=1)
    t['METH']=t[meth_features].sum(axis=1)
    t['WEED']=t[weed_features].sum(axis=1)
    t['COKE']=t[coke_features].sum(axis=1)
    t['METHADONE']=t[metadone_features].sum(axis=1)
    t['CRACK']=t[crack_features].sum(axis=1)
    drugs=t[['HEROIN','HALLUCINOGENIC','METH','WEED','COKE','CRACK']]
    if plot:
        drugs.index=[int(i) for i in drugs.index]
        colors=plt.cm.jet(np.linspace(0,1,drugs.shape[1]))
        drugs.plot(kind='bar',stacked=True,figsize=(20,5),color=colors,width=1,title=district,fontsize=6)
    return drugs

drug_df_all=drug_analysis(t_all,district,True)
In [398]:
def drug_analysis_rescale(t,district,plot):
    # Sum each group of Descript columns into a single drug column
    t['BARBITUATES']=t[barituate_features].sum(axis=1)
    t['HEROIN']=t[heroin_features].sum(axis=1)
    t['HALLUCINOGENIC']=t[hallu_features].sum(axis=1)
    t['METH']=t[meth_features].sum(axis=1)
    t['WEED']=t[weed_features].sum(axis=1)
    t['COKE']=t[coke_features].sum(axis=1)
    t['METHADONE']=t[metadone_features].sum(axis=1)
    t['CRACK']=t[crack_features].sum(axis=1)
    drugs=t[['HEROIN','HALLUCINOGENIC','METH','WEED','COKE','CRACK']]
    if plot:
        drugs=drugs.div(drugs.sum(axis=1),axis=0)   # rescale each month to proportions
        drugs.index=[int(i) for i in drugs.index]
        colors=plt.cm.jet(np.linspace(0,1,drugs.shape[1]))
        drugs.plot(kind='bar',stacked=True,figsize=(20,5),color=colors,width=1,title=district,legend=False)
        plt.ylim([0,1])
    return drugs

drug_df_all=drug_analysis_rescale(t_all,district,True)
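The rescaling step is a row-wise normalization: each 30-day bin becomes a composition summing to 1 (toy sketch):

```python
import pandas as pd

drugs = pd.DataFrame({'CRACK': [8.0, 2.0], 'WEED': [2.0, 8.0]})

# Divide each row by its total: monthly counts -> monthly proportions
shares = drugs.div(drugs.sum(axis=1), axis=0)
print(shares.iloc[0].to_dict())  # {'CRACK': 0.8, 'WEED': 0.2}
```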

Let's add the real dates.

In [375]:
dates_for_plot.index=dates_for_plot
sns.set_context(rc={"figure.figsize": (12.5,5.5)})
for d,c in zip(['METH','CRACK','HEROIN','WEED'],['b','r','c','g']):
    plt.plot(dates_for_plot.index,drug_df_all[d],'o-',color=c,ms=6,mew=1.5,mec='white',linewidth=0.5,label=d,alpha=0.75)
plt.legend(loc='upper left',scatterpoints=1,prop={'size':8})
Out[375]:
<matplotlib.legend.Legend at 0x156c6f290>

Let's iterate through each district.

In [399]:
stor=[]
stor_time=[]

for d in set(d_crime['PdDistrict']):
    # Specify district and group by time
    dist_dat=cat[cat['PdDistrict']==d]
    t=timeseries(dist_dat,0)
    # Merge to ensure all categories are preserved!
    t_merge=pd.DataFrame(columns=t_all.columns)
    m=pd.concat([t_merge,t],axis=0).fillna(0)
    m.reset_index(inplace=True)
    # Plot
    drug_df=drug_analysis(m,d,True)
    plt.show()
    s=drug_df.sum(axis=0)
    stor=stor+[s]
    drug_df.columns=[c+"_%s"%d for c in drug_df.columns]
    stor_time=stor_time+[drug_df]
    
drug_dat_time=pd.concat(stor_time,axis=1)
drug_dat=pd.concat(stor,axis=1)
drug_dat.columns=list(set(d_crime['PdDistrict']))
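The empty-frame concat trick in the loop is worth isolating: a district slice may be missing crime types seen city-wide, and concatenating against an empty frame carrying the full column set reinstates them as zero-filled columns (standalone sketch with illustrative columns):

```python
import pandas as pd

t_all_columns = ['TYPE_A', 'TYPE_B', 'TYPE_C']   # city-wide column set

# This district only recorded TYPE_A
t = pd.DataFrame({'TYPE_A': [1, 2]})

# Concat against an empty frame with the full column set, then zero-fill
t_merge = pd.DataFrame(columns=t_all_columns)
m = pd.concat([t_merge, t], axis=0).fillna(0)
print(sorted(m.columns))  # ['TYPE_A', 'TYPE_B', 'TYPE_C']
```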

We can also look at correlations between areas for different drugs.

In [232]:
sns.set_context(rc={"figure.figsize": (20,20)})
sns.corrplot(drug_dat_time, annot=False, sig_stars=False,diag_names=False)
Out[232]:
<matplotlib.axes.AxesSubplot at 0x1397f6110>
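sns.corrplot was removed in later seaborn releases; the same view comes from DataFrame.corr() fed to sns.heatmap. A sketch on synthetic series with made-up column names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=100)
drug_dat_time = pd.DataFrame({
    'CRACK_TENDERLOIN': base,
    'CRACK_MISSION': base + rng.normal(scale=0.1, size=100),  # strongly related
    'WEED_PARK': rng.normal(size=100),                        # unrelated
})

corr = drug_dat_time.corr()
# sns.heatmap(corr) reproduces the corrplot-style matrix in modern seaborn
```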

With this in mind, we can examine select timeseries data.

In [295]:
drug_dat_time.index=dates_for_plot
sns.set_context(rc={"figure.figsize": (7.5,5)})
for d,c in zip(['METH_MISSION','CRACK_MISSION'],['b','r']):
    plt.plot(drug_dat_time.index,drug_dat_time[d],'o-',color=c,ms=6,mew=1,mec='white',linewidth=0.5,label=d,alpha=0.75)
plt.legend(loc='upper left',scatterpoints=1,prop={'size':8})
Out[295]:
<matplotlib.legend.Legend at 0x158375790>
In [374]:
sns.set_context(rc={"figure.figsize": (7.5,5)})
for d,c in zip(['CRACK_MISSION','CRACK_TENDERLOIN','CRACK_BAYVIEW','CRACK_SOUTHERN'],['b','r','g','c']):
    plt.plot(drug_dat_time.index,drug_dat_time[d],'o-',color=c,ms=6,mew=1,mec='white',linewidth=0.5,label=d,alpha=0.75)
plt.legend(loc='upper left',scatterpoints=1,prop={'size':8})
Out[374]:
<matplotlib.legend.Legend at 0x1172d5950>

Spatial relationships -

Let's redo what we did above, but rescale each district's monthly counts to proportions.

In [412]:
stor=[]
stor_time=[]

for d in set(d_crime['PdDistrict']):
    # Specify district and group by time
    dist_dat=cat[cat['PdDistrict']==d]
    t=timeseries(dist_dat,0)
    # Merge to ensure all categories are preserved!
    t_merge=pd.DataFrame(columns=t_all.columns)
    m=pd.concat([t_merge,t],axis=0).fillna(0)
    m.reset_index(inplace=True)
    # Plot
    drug_df=drug_analysis_rescale(m,d,True)
    plt.show()
    s=drug_df.sum(axis=0)
    stor=stor+[s]
    drug_df.columns=[c+"_%s"%d for c in drug_df.columns]
    stor_time=stor_time+[drug_df]
    
drug_dat_time_rescale=pd.concat(stor_time,axis=1)
drug_dat_rescale=pd.concat(stor,axis=1)
drug_dat_rescale.columns=list(set(d_crime['PdDistrict']))

We can now summarize this data using clustered heatmaps.

In [250]:
sns.clustermap(drug_dat,standard_scale=0)
Out[250]:
<seaborn.matrix.ClusterGrid at 0x139555710>
In [400]:
sns.clustermap(drug_dat,standard_scale=1)
Out[400]:
<seaborn.matrix.ClusterGrid at 0x165731a10>

Mapping relationships -

Let's isolate all crack-related records.

In [297]:
tmp=cat.copy()
tmp.set_index('Descript',inplace=True)
In [300]:
crack_dat=tmp.loc[crack_features]
crack_pts=crack_dat[['X','Y','Month']]

Plot the crack regimes.

In [414]:
d=pd.DataFrame(crack_pts.groupby('Month').size())
d.index=dates_for_plot
d.columns=['Count']
In [423]:
diff=len(d.index)-120
In [438]:
plt.plot(d.index,d['Count'],'o-',color='k',ms=6,mew=1,mec='white',linewidth=0.5,label=d,alpha=0.75)
plt.axvspan(d.index[40-diff],d.index[40],color='cyan',alpha=0.5)
plt.axvspan(d.index[80-diff],d.index[80],color='red',alpha=0.5)
plt.axvspan(d.index[120],d.index[-1],color='green',alpha=0.5)
Out[438]:
<matplotlib.patches.Polygon at 0x13d3c6c10>
In [439]:
oldest_crack_sums=d.loc[(d.index>d.index[40-diff]) & (d.index<d.index[40])]
old_crack_sums=d.loc[(d.index>d.index[80-diff]) & (d.index<d.index[80])]
new_crack_sums=d.loc[d.index>d.index[120]]

Fold-difference in means between the old (red) and new (green) regimes.

In [440]:
old_crack_sums['Count'].mean()/float(new_crack_sums['Count'].mean())
Out[440]:
4.6624129930394425

Extract the report coordinates within each regime.

In [441]:
oldest_crack=crack_pts[(crack_pts['Month']>(40-diff)) & (crack_pts['Month']<40)]
oldest_crack.columns=['longitude','latitude','time']
old_crack=crack_pts[(crack_pts['Month']>(80-diff)) & (crack_pts['Month']<80)]
old_crack.columns=['longitude','latitude','time']
new_crack=crack_pts[crack_pts['Month']>120]
new_crack.columns=['longitude','latitude','time']

We can look at this spatially.

Use a shapefile for Neighborhoods in SF to overlay the data onto a map.

https://data.sfgov.org/Geographic-Locations-and-Boundaries/Neighborhoods/ejmn-jyk6

Basemap can be used to render this. I drew from some nice work at this link:

http://sensitivecities.com/so-youd-like-to-make-a-map-using-python-EN.html

In [314]:
from mpl_toolkits.basemap import Basemap
import fiona
from shapely.geometry import shape, mapping
from pyproj import Proj, transform
from fiona.crs import from_epsg
from itertools import chain

shapefile="/Users/lmartin/Desktop/Misc/Blog_Archive/GIS/sffind_neighborhoods/sffind_neighborhoods.shp"
shp = fiona.open(shapefile)
bds = shp.bounds
print (bds)

shp.close()
extra = 0.01
ll = (bds[0], bds[1])
ur = (bds[2], bds[3])
coords = list(chain(ll, ur))
w, h = coords[2] - coords[0], coords[3] - coords[1]
(-122.5148972319999, 37.708089209000036, -122.35698198799994, 37.83239597600004)

Now set up a Basemap instance and read in the shapefile.

In [433]:
m = Basemap(
    projection='tmerc',
    lon_0=-122.,
    lat_0=37.7,
    ellps = 'WGS84',
    llcrnrlon=coords[0] - extra * w,
    llcrnrlat=coords[1] - extra + 0.01 * h,
    urcrnrlon=coords[2] + extra * w,
    urcrnrlat=coords[3] + extra + 0.01 * h,
    lat_ts=0,
    resolution='i',
    suppress_ticks=True)

m.readshapefile(
    '/Users/lmartin/Desktop/Misc/Blog_Archive/GIS/sffind_neighborhoods/sffind_neighborhoods',
    'SF',
    color='none',
    zorder=2)
Out[433]:
(117,
 5,
 [-122.5148972319999, 37.708089209000036, 0.0, 0.0],
 [-122.35698198799994, 37.83239597600004, 0.0, 0.0],
 <matplotlib.collections.LineCollection at 0x1572afd50>)
In [442]:
from descartes import PolygonPatch
from shapely.geometry import Point, Polygon, MultiPoint, MultiPolygon
from shapely.prepared import prep
from matplotlib.collections import PatchCollection

# Set up a map dataframe
df_map = pd.DataFrame({
    'poly': [Polygon(xy) for xy in m.SF],
    'ward_name': [ward['name'] for ward in m.SF_info]})
df_map['area_m'] = df_map['poly'].map(lambda x: x.area)   # area in m^2 (projected coordinates)
df_map['area_km'] = df_map['area_m'] / 1000000            # m^2 -> km^2

def makePoints(dat):
    # Create Point objects in map coordinates from dataframe lon and lat values
    map_points = pd.Series([Point(m(mapped_x,mapped_y)) for mapped_x, mapped_y in zip(dat['longitude'],dat['latitude'])])
    plt_points = MultiPoint(list(map_points.values))
    hoods_polygon = prep(MultiPolygon(list(df_map['poly'].values)))
    pts = list(filter(hoods_polygon.contains,plt_points))  # list() so it survives repeated iteration
    return pts

sf_oldest_crack_points_todraw=makePoints(oldest_crack)
sf_old_crack_points_todraw=makePoints(old_crack)
sf_new_crack_points_todraw=makePoints(new_crack)
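The filtering inside makePoints keeps only points that fall inside some neighborhood polygon; shapely's prepared contains() is doing a point-in-polygon test. A self-contained ray-casting sketch of the same idea (no shapely required):

```python
# A minimal ray-casting point-in-polygon test, illustrating the filtering
# that makePoints delegates to shapely's prepared-geometry contains().
def point_in_polygon(x, y, poly):
    """poly: list of (x, y) vertices; returns True if (x, y) is inside."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does a rightward ray from (x, y) cross edge (x1, y1)-(x2, y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
pts = [(0.5, 0.5), (2.0, 2.0)]
kept = [p for p in pts if point_in_polygon(p[0], p[1], square)]
print(kept)  # [(0.5, 0.5)]
```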
In [443]:
# Draw neighborhoods with polygons
df_map['patches'] = df_map['poly'].map(lambda x: PolygonPatch(
    x,
    fc='#000000',
    ec='#ffffff', lw=.5, alpha=1,
    zorder=4))

plt.clf()
fig = plt.figure()
ax = fig.add_subplot(111, axisbg='w', frame_on=False)

# Now, we can overlay our points
dev = m.scatter(
    [geom.x for geom in sf_oldest_crack_points_todraw],
    [geom.y for geom in sf_oldest_crack_points_todraw],
    10, marker='o', lw=.25,
    facecolor='cyan', edgecolor='cyan',
    alpha=0.75, antialiased=True,
    label='Oldest crack', zorder=3)

dev = m.scatter(
    [geom.x for geom in sf_old_crack_points_todraw],
    [geom.y for geom in sf_old_crack_points_todraw],
    10, marker='o', lw=.25,
    facecolor='red', edgecolor='red',
    alpha=0.75, antialiased=True,
    label='Old crack', zorder=3)

dev = m.scatter(
    [geom.x for geom in sf_new_crack_points_todraw],
    [geom.y for geom in sf_new_crack_points_todraw],
    10, marker='o', lw=.25,
    facecolor='green', edgecolor='green',
    alpha=0.75, antialiased=True,
    label='New crack', zorder=3)

ax.add_collection(PatchCollection(df_map['patches'].values, match_original=True))
plt.tight_layout()
fig.set_size_inches(15,15)
<matplotlib.figure.Figure at 0x156e0a6d0>