
Data Exploration with Python and Jupyter - part 3¶

Basic usage of the Pandas library to download a dataset, explore its contents, clean up missing or invalid data, filter the data according to different criteria, and plot visualizations of the data.

  • Part 1: Python and Jupyter
  • Part 2: Pandas with toy data
  • Part 3: Pandas with real data


Let's download some real data¶

For some reason, the London Fire Brigade provides a public spreadsheet of all animal rescue incidents since 2009:

https://data.london.gov.uk/dataset/animal-rescue-incidents-attended-by-lfb

They provide a link to the dataset in Excel format.

In [1]:
# import the Pandas library & matplotlib for plotting

import pandas as pd
import matplotlib.pyplot as plt
In [2]:
# download an excel spreadsheet with some data and convert it to a DataFrame
url = "https://data.london.gov.uk/download/animal-rescue-incidents-attended-by-lfb/01007433-55c2-4b8a-b799-626d9e3bc284/Animal%20Rescue%20incidents%20attended%20by%20LFB%20from%20Jan%202009.csv.xlsx"
df = pd.read_excel(url)
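A small caveat the slides do not mention (an assumption about your environment): read_excel relies on an external engine to parse .xlsx files, typically openpyxl, so the cell above may fail if that package is not installed. A minimal sketch:

# hedged sketch: pd.read_excel parses .xlsx files with an external engine (usually openpyxl);
# if the cell above fails with an ImportError about a missing engine, install it first:
#   pip install openpyxl
# passing engine= explicitly makes the dependency visible (url is defined in the cell above)
df = pd.read_excel(url, engine="openpyxl")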

Display the DataFrame¶

In [3]:
df
Out[3]:
IncidentNumber DateTimeOfCall CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNotionalCost(£) IncidentNotionalCost(£) FinalDescription ... UPRN Street USRN PostcodeDistrict Easting_m Northing_m Easting_rounded Northing_rounded Latitude Longitude
0 139091 2009-01-01 03:01:00 2009 2008/09 Special Service 1.0 2.0 255.0 510.0 Redacted ... NaN Waddington Way 20500146.0 SE19 NaN NaN 532350 170050 NaN NaN
1 275091 2009-01-01 08:51:00 2009 2008/09 Special Service 1.0 1.0 255.0 255.0 Redacted ... NaN Grasmere Road NaN SE25 534785.0 167546.0 534750 167550 51.390954 -0.064167
2 2075091 2009-01-04 10:07:00 2009 2008/09 Special Service 1.0 1.0 255.0 255.0 Redacted ... NaN Mill Lane NaN SM5 528041.0 164923.0 528050 164950 51.368941 -0.161985
3 2872091 2009-01-05 12:27:00 2009 2008/09 Special Service 1.0 1.0 255.0 255.0 Redacted ... 1.000215e+11 Park Lane 21401484.0 UB9 504689.0 190685.0 504650 190650 51.605283 -0.489684
4 3553091 2009-01-06 15:23:00 2009 2008/09 Special Service 1.0 1.0 255.0 255.0 Redacted ... NaN Swindon Lane 21300122.0 RM3 NaN NaN 554650 192350 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12525 142176-31072025 2025-07-31 11:00:00 2025 2025/26 Special Service 1.0 1.0 NaN NaN ASSIST OWNER WITH NEW BORN KITTENS STRANDED IN... ... NaN CHARLTON ROAD 21201537.0 HA3 NaN NaN 517950 189350 NaN NaN
12526 142240-31072025 2025-07-31 13:10:00 2025 2025/26 Special Service 1.0 1.0 NaN NaN Redacted ... 2.000504e+08 GRANVILLE ROAD 20018840.0 N12 526369.0 191200.0 526350 191250 51.605471 -0.176594
12527 142378-31072025 2025-07-31 16:23:00 2025 2025/26 Special Service 1.0 1.0 NaN NaN PUPPY TRAPPED UNDER RADIATOR OF KITCHEN - IN D... ... NaN EMMANUEL ROAD 21900521.0 SW12 NaN NaN 529250 173050 NaN NaN
12528 142449-31072025 2025-07-31 18:33:00 2025 2025/26 Special Service 1.0 1.0 NaN NaN DOG STUCK IN POND - NEAR GO APE ... 1.002519e+10 ACCESS ROAD FROM COCKFOSTERS ROAD TO TRENT PARK 20707339.0 EN4 528290.0 196934.0 528250 196950 51.656569 -0.146759
12529 142475-31072025 2025-07-31 19:19:00 2025 2025/26 Special Service 1.0 2.0 NaN NaN CAT TRAPPED IN WAREHOUSE - CALLER STATES MULTI... ... 1.000236e+11 WOODFIELD PLACE 8400721.0 W9 524918.0 182017.0 524950 182050 51.523264 -0.200785

12530 rows × 31 columns

Column data types¶

In [4]:
df.dtypes
Out[4]:
IncidentNumber                        object
DateTimeOfCall                datetime64[ns]
CalYear                                int64
FinYear                               object
TypeOfIncident                        object
PumpCount                            float64
PumpHoursTotal                       float64
HourlyNotionalCost(£)                float64
IncidentNotionalCost(£)              float64
FinalDescription                      object
AnimalGroupParent                     object
OriginofCall                          object
PropertyType                          object
PropertyCategory                      object
SpecialServiceTypeCategory            object
SpecialServiceType                    object
WardCode                              object
Ward                                  object
BoroughCode                           object
Borough                               object
StnGroundName                         object
UPRN                                 float64
Street                                object
USRN                                 float64
PostcodeDistrict                      object
Easting_m                            float64
Northing_m                           float64
Easting_rounded                        int64
Northing_rounded                       int64
Latitude                             float64
Longitude                            float64
dtype: object

DateTimeOfCall¶

In [5]:
df["DateTimeOfCall"].head()
Out[5]:
0   2009-01-01 03:01:00
1   2009-01-01 08:51:00
2   2009-01-04 10:07:00
3   2009-01-05 12:27:00
4   2009-01-06 15:23:00
Name: DateTimeOfCall, dtype: datetime64[ns]
In [6]:
# this is already a datetime object, which is great
# a quick sanity check to see if it looks correct:
pd.to_datetime(df["DateTimeOfCall"]).plot()
# should be a single monotonically increasing line: looks good!
Out[6]:
<Axes: >
[Output plot: DateTimeOfCall values, a single monotonically increasing line]

Use datetime as the index¶

In [7]:
df.set_index("DateTimeOfCall", inplace=True)
In [8]:
df
Out[8]:
IncidentNumber CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNotionalCost(£) IncidentNotionalCost(£) FinalDescription AnimalGroupParent ... UPRN Street USRN PostcodeDistrict Easting_m Northing_m Easting_rounded Northing_rounded Latitude Longitude
DateTimeOfCall
2009-01-01 03:01:00 139091 2009 2008/09 Special Service 1.0 2.0 255.0 510.0 Redacted Dog ... NaN Waddington Way 20500146.0 SE19 NaN NaN 532350 170050 NaN NaN
2009-01-01 08:51:00 275091 2009 2008/09 Special Service 1.0 1.0 255.0 255.0 Redacted Fox ... NaN Grasmere Road NaN SE25 534785.0 167546.0 534750 167550 51.390954 -0.064167
2009-01-04 10:07:00 2075091 2009 2008/09 Special Service 1.0 1.0 255.0 255.0 Redacted Dog ... NaN Mill Lane NaN SM5 528041.0 164923.0 528050 164950 51.368941 -0.161985
2009-01-05 12:27:00 2872091 2009 2008/09 Special Service 1.0 1.0 255.0 255.0 Redacted Horse ... 1.000215e+11 Park Lane 21401484.0 UB9 504689.0 190685.0 504650 190650 51.605283 -0.489684
2009-01-06 15:23:00 3553091 2009 2008/09 Special Service 1.0 1.0 255.0 255.0 Redacted Rabbit ... NaN Swindon Lane 21300122.0 RM3 NaN NaN 554650 192350 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2025-07-31 11:00:00 142176-31072025 2025 2025/26 Special Service 1.0 1.0 NaN NaN ASSIST OWNER WITH NEW BORN KITTENS STRANDED IN... Cat ... NaN CHARLTON ROAD 21201537.0 HA3 NaN NaN 517950 189350 NaN NaN
2025-07-31 13:10:00 142240-31072025 2025 2025/26 Special Service 1.0 1.0 NaN NaN Redacted Dog ... 2.000504e+08 GRANVILLE ROAD 20018840.0 N12 526369.0 191200.0 526350 191250 51.605471 -0.176594
2025-07-31 16:23:00 142378-31072025 2025 2025/26 Special Service 1.0 1.0 NaN NaN PUPPY TRAPPED UNDER RADIATOR OF KITCHEN - IN D... Dog ... NaN EMMANUEL ROAD 21900521.0 SW12 NaN NaN 529250 173050 NaN NaN
2025-07-31 18:33:00 142449-31072025 2025 2025/26 Special Service 1.0 1.0 NaN NaN DOG STUCK IN POND - NEAR GO APE Dog ... 1.002519e+10 ACCESS ROAD FROM COCKFOSTERS ROAD TO TRENT PARK 20707339.0 EN4 528290.0 196934.0 528250 196950 51.656569 -0.146759
2025-07-31 19:19:00 142475-31072025 2025 2025/26 Special Service 1.0 2.0 NaN NaN CAT TRAPPED IN WAREHOUSE - CALLER STATES MULTI... Cat ... 1.000236e+11 WOODFIELD PLACE 8400721.0 W9 524918.0 182017.0 524950 182050 51.523264 -0.200785

12530 rows × 30 columns

In [9]:
# can now use datetime to select rows: here is jan 2021
df.loc["2021-01-01":"2021-01-31", "FinalDescription"]
Out[9]:
DateTimeOfCall
2021-01-01 12:09:00        KITTEN STUCK UP TREE  AL REQUESTED FROM SCENE
2021-01-01 14:06:00                                             Redacted
2021-01-03 18:40:00                CAT WITH LEG TRAPPED IN BATH PLUGHOLE
2021-01-04 13:39:00                                             Redacted
2021-01-06 10:22:00                                             Redacted
2021-01-06 13:09:00    CAT IN DISTRESS ON ROOF - ADDITIONAL APPLIANCE...
2021-01-06 20:35:00        DOG TRAPPED IN FOX HOLE  - MEET AT CLUB HOUSE
2021-01-07 23:50:00                   KITTEN STUCK BETWEEN WALL AND ROOF
2021-01-09 08:01:00                                  DOG STUCK IN TRENCH
2021-01-10 19:27:00                                             Redacted
2021-01-12 11:39:00                                             Redacted
2021-01-12 22:38:00                                 CAT TRAPPED IN DITCH
2021-01-16 18:05:00                          DOG TRAPPED IN PORTER CABIN
2021-01-17 16:09:00    DOG TRAPPED IN WAREHOUSE AREA - CALLER BELIEVE...
2021-01-17 17:09:00      BIRD TRAPPED IN NETTING    CALLER WILL MEET YOU
2021-01-18 15:17:00            CAT STUCK IN TREE BEING ATTACKED BY CROWS
2021-01-18 17:06:00    ASSIST RSPCA - SMALL ANIMAL RESUE - BIRD ENTAN...
2021-01-19 18:28:00                          CAT TRAPPED BEHIND CUPBOARD
2021-01-19 20:24:00                                             Redacted
2021-01-19 20:36:00                              RUNNING CALL AT ON ROOF
2021-01-20 09:35:00                      CAT STUCK BETWEEN TREE BRANCHES
2021-01-21 13:15:00                              SWAN TRAPPED IN NETTING
2021-01-21 18:23:00                               CAT TRAPPED IN CHIMNEY
2021-01-22 14:22:00                   CAT TRAPPED BETWEEN WALL AND FENCE
2021-01-23 10:18:00                               CAT TRAPPED IN CHIMNEY
2021-01-23 15:43:00                            CAT TRAPPED BETWEEN WALLS
2021-01-23 17:16:00                                             Redacted
2021-01-25 12:02:00             ASSIST RSPCA WITH FOX STUCK DOWN CULVERT
2021-01-26 13:42:00         DOG STUCK IN RAILINGS - CALLER WILL MEET YOU
2021-01-26 18:21:00                                             Redacted
2021-01-26 22:44:00    BIRDS TRAPPED IN BASKETBALL COURT CALLER IS ON...
2021-01-26 23:35:00             FOX TRAPPED IN FENCE IN ALLEYWAY NEXT TO
2021-01-27 09:18:00    CAT STUCK IN TREE - ATTENDED YESTERDAY AND ADV...
2021-01-27 10:12:00    BIRD TRAPPED BY LEG IN A TREE - RSPCA IN ATTEN...
2021-01-27 15:22:00                           CAT UP TREE   ASSIST RSPCA
2021-01-29 10:47:00                 TRAPPED FOX IN FENCE  IN REAR GARDEN
2021-01-30 14:53:00                                 CAT STUCK UNDER SHED
2021-01-30 15:28:00              BIRD CAUGHT IN NETTING - RSPCA ON SCENE
2021-01-30 17:54:00                                DOG TRAPPED UNDER CAR
2021-01-31 12:53:00                   CAT STUCK UP TREE - RSPCA ON SCENE
2021-01-31 13:48:00           INJURED CAT STUCK IN GREEN AREA AT REAR OF
Name: FinalDescription, dtype: object
In [10]:
# resample the timeseries by month and count incidents
df.resample("ME")["IncidentNumber"].count().plot(title="Monthly Calls")
# see https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases
plt.show()
[Output plot: "Monthly Calls" line chart of monthly incident counts]
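Not in the original slides, but since the monthly counts are quite noisy, one common trick is to overlay a rolling mean to make the trend easier to see. A minimal sketch, assuming the df with the DateTimeOfCall index from above:

# hedged sketch: overlay a 12-month rolling mean on the monthly call counts
monthly = df.resample("ME")["IncidentNumber"].count()
ax = monthly.plot(alpha=0.4, label="monthly", title="Monthly Calls")
monthly.rolling(12).mean().plot(ax=ax, label="12-month rolling mean")
ax.legend()
plt.show()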
In [11]:
# resample by year, sum total costs, average hourly costs
fig, axs = plt.subplots(figsize=(16, 4), ncols=2)
df.resample("YE")["IncidentNotionalCost(£)"].sum().plot(
    title="Year total cost", ax=axs[0]
)
df.resample("YE")["HourlyNotionalCost(£)"].mean().plot(
    title="Average hourly cost", ax=axs[1]
)
plt.show()
[Output plots: "Year total cost" and "Average hourly cost" line charts, side by side]

Missing data¶

Different strategies for dealing with missing data:

  • Ignore the issue
    • some things may break / not work as expected
  • Remove rows/columns with missing data
    • remove all rows with missing data: df.dropna(axis=0)
    • remove all columns with missing data: df.dropna(axis=1)
  • Guess (impute) missing data
    • replace all missing entries with a value: df.fillna(1)
    • replace missing entries with the column mean: df.fillna(df.mean(numeric_only=True))
    • replace each missing entry with the previous valid entry: df.ffill() (older pandas: df.fillna(method="pad"))
    • replace missing entries by interpolating between valid entries: df.interpolate() (see the worked example below)
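To make these options concrete, here is a small self-contained sketch (not from the original slides) that applies each strategy to a toy Series with missing values:

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])
print(s.dropna())          # drop the missing entries
print(s.fillna(1))         # replace missing entries with a fixed value
print(s.fillna(s.mean()))  # replace missing entries with the mean of the valid ones
print(s.ffill())           # carry the previous valid entry forward
print(s.interpolate())     # linearly interpolate between valid entries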
In [12]:
# count missing entries for each column
df.isna().sum()
Out[12]:
IncidentNumber                   0
CalYear                          0
FinYear                          0
TypeOfIncident                   0
PumpCount                       86
PumpHoursTotal                  88
HourlyNotionalCost(£)          599
IncidentNotionalCost(£)        681
FinalDescription                 6
AnimalGroupParent                0
OriginofCall                     0
PropertyType                     0
PropertyCategory                 0
SpecialServiceTypeCategory       0
SpecialServiceType               0
WardCode                        21
Ward                            21
BoroughCode                     14
Borough                         14
StnGroundName                    0
UPRN                          7785
Street                           0
USRN                          1156
PostcodeDistrict                 0
Easting_m                     6766
Northing_m                    6766
Easting_rounded                  0
Northing_rounded                 0
Latitude                      6766
Longitude                     6766
dtype: int64
In [13]:
# If PumpCount is missing, PumpHoursTotal usually is too
# 88 rows are missing at least one of these
pump_missing = df["PumpCount"].isna() | df["PumpHoursTotal"].isna()
print(pump_missing.sum())
88
In [14]:
# so we could choose to drop these rows
df1 = df.drop(df.loc[pump_missing].index)
# here we made a new dataset df1 with these rows dropped
# to drop the rows from the original dataset df, could do:
#
# df = df.drop(df.loc[pump_missing].index)
#
# or:
#
# df.drop(df.loc[pump_missing].index, inplace=True)
#
print(len(df1))
12442
In [15]:
# another equivalent way to do this
df2 = df.dropna(subset=["PumpCount", "PumpHoursTotal"])
print(len(df2))
12442
In [16]:
# but if we drop them, we lose valid data from other columns
# let's look at the distribution of values:
fig, axs = plt.subplots(1, 2, figsize=(14, 6))
df.plot.hist(y="PumpCount", ax=axs[0])
df.plot.hist(y="PumpHoursTotal", ax=axs[1])
plt.show()
[Output plots: histograms of PumpCount and PumpHoursTotal]
In [17]:
# looks like it would be better to replace missing PumpCount and PumpHoursTotal fields with 1
df.fillna({"PumpCount": 1, "PumpHoursTotal": 1}, inplace=True)
In [18]:
df.isna().sum()
Out[18]:
IncidentNumber                   0
CalYear                          0
FinYear                          0
TypeOfIncident                   0
PumpCount                        0
PumpHoursTotal                   0
HourlyNotionalCost(£)          599
IncidentNotionalCost(£)        681
FinalDescription                 6
AnimalGroupParent                0
OriginofCall                     0
PropertyType                     0
PropertyCategory                 0
SpecialServiceTypeCategory       0
SpecialServiceType               0
WardCode                        21
Ward                            21
BoroughCode                     14
Borough                         14
StnGroundName                    0
UPRN                          7785
Street                           0
USRN                          1156
PostcodeDistrict                 0
Easting_m                     6766
Northing_m                    6766
Easting_rounded                  0
Northing_rounded                 0
Latitude                      6766
Longitude                     6766
dtype: int64

Count the unique entries in each column¶

In [19]:
df.nunique().sort_values()
Out[19]:
TypeOfIncident                    1
PumpCount                         4
SpecialServiceTypeCategory        4
PropertyCategory                  7
OriginofCall                      8
PumpHoursTotal                   12
HourlyNotionalCost(£)            14
CalYear                          17
FinYear                          18
SpecialServiceType               24
AnimalGroupParent                29
BoroughCode                      37
Borough                          70
IncidentNotionalCost(£)          90
StnGroundName                   109
PropertyType                    196
PostcodeDistrict                284
Northing_rounded                428
Easting_rounded                 533
WardCode                        763
Ward                           1381
UPRN                           4509
Northing_m                     5140
Easting_m                      5263
Latitude                       5691
Longitude                      5691
FinalDescription               7565
USRN                           8112
Street                         8636
IncidentNumber                12530
dtype: int64
In [20]:
# "cat" and "Cat" are treated as different animals here:
df["AnimalGroupParent"].unique()
Out[20]:
array(['Dog', 'Fox', 'Horse', 'Rabbit',
       'Unknown - Heavy Livestock Animal', 'Squirrel', 'Cat', 'Bird',
       'Unknown - Domestic Animal Or Pet', 'Sheep', 'Deer',
       'Unknown - Wild Animal', 'Snake', 'Lizard', 'Hedgehog', 'cat',
       'Hamster', 'Lamb', 'Fish', 'Bull', 'Cow', 'Ferret', 'Budgie',
       'Unknown - Animal rescue from water - Farm animal', 'Pigeon',
       'Goat', 'Tortoise',
       'Unknown - Animal rescue from below ground - Farm animal', 'Rat'],
      dtype=object)
In [21]:
# select rows where AnimalGroupParent is "cat", replace with "Cat"
df.loc[df["AnimalGroupParent"] == "cat", "AnimalGroupParent"] = "Cat"
In [22]:
df["AnimalGroupParent"].unique()
Out[22]:
array(['Dog', 'Fox', 'Horse', 'Rabbit',
       'Unknown - Heavy Livestock Animal', 'Squirrel', 'Cat', 'Bird',
       'Unknown - Domestic Animal Or Pet', 'Sheep', 'Deer',
       'Unknown - Wild Animal', 'Snake', 'Lizard', 'Hedgehog', 'Hamster',
       'Lamb', 'Fish', 'Bull', 'Cow', 'Ferret', 'Budgie',
       'Unknown - Animal rescue from water - Farm animal', 'Pigeon',
       'Goat', 'Tortoise',
       'Unknown - Animal rescue from below ground - Farm animal', 'Rat'],
      dtype=object)
In [23]:
df.groupby("AnimalGroupParent")["IncidentNumber"].count().sort_values().plot.barh(
    logx=True
)
plt.show()
[Output plot: horizontal bar chart of incident counts per AnimalGroupParent, logarithmic x-axis]
In [24]:
# apparently different hourly costs
# does it depend on the type of event? or does it just increase over time?
df["HourlyNotionalCost(£)"].unique()
Out[24]:
array([255., 260., 290., 295., 298., 326., 328., 333., 339., 346., 352.,
       364., 388., 430.,  nan])
In [25]:
# just goes up over time
df["HourlyNotionalCost(£)"].plot.line()
Out[25]:
<Axes: xlabel='DateTimeOfCall'>
[Output plot: HourlyNotionalCost(£) against DateTimeOfCall, increasing in steps over time]
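The slides settle the question from the time plot alone; as a rough cross-check (not in the original), grouping the hourly rate by service type category shows whether it also differs by the kind of incident. A minimal sketch, assuming the same df as above:

# hedged sketch: mean hourly rate per service type category
print(df.groupby("SpecialServiceTypeCategory")["HourlyNotionalCost(£)"].mean().round(2))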
In [26]:
# Group incidents by fire station & count them
df.groupby("StnGroundName")["IncidentNumber"].count()
Out[26]:
StnGroundName
Acton           98
Addington       91
Barking        114
Barnet         124
Battersea      106
              ... 
Whitechapel     44
Willesden       99
Wimbledon       98
Woodford       115
Woodside       110
Name: IncidentNumber, Length: 109, dtype: int64
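As a small follow-up (not part of the original slides), the same groupby result can be sorted to list the busiest station grounds:

# hedged sketch: the ten station grounds with the most recorded incidents
counts = df.groupby("StnGroundName")["IncidentNumber"].count()
print(counts.sort_values(ascending=False).head(10))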

Plot location of calls on a map¶

  • note: this section uses some additional libraries; to install them:
  • pip install geopandas contextily
In [27]:
import geopandas

# drop missing longitude/latitude
df2 = df.dropna(subset=["Longitude", "Latitude"])
# also drop zero values
df2 = df2[df2["Latitude"] != 0]

# set crs to EPSG:4326 to specify WGS84 Latitude/Longitude
gdf = geopandas.GeoDataFrame(
    df2,
    geometry=geopandas.points_from_xy(df2["Longitude"], df2["Latitude"]),
    crs="EPSG:4326",
)
In [28]:
gdf.head()
Out[28]:
IncidentNumber CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNotionalCost(£) IncidentNotionalCost(£) FinalDescription AnimalGroupParent ... Street USRN PostcodeDistrict Easting_m Northing_m Easting_rounded Northing_rounded Latitude Longitude geometry
DateTimeOfCall
2009-01-01 08:51:00 275091 2009 2008/09 Special Service 1.0 1.0 255.0 255.0 Redacted Fox ... Grasmere Road NaN SE25 534785.0 167546.0 534750 167550 51.390954 -0.064167 POINT (-0.06417 51.39095)
2009-01-04 10:07:00 2075091 2009 2008/09 Special Service 1.0 1.0 255.0 255.0 Redacted Dog ... Mill Lane NaN SM5 528041.0 164923.0 528050 164950 51.368941 -0.161985 POINT (-0.16199 51.36894)
2009-01-05 12:27:00 2872091 2009 2008/09 Special Service 1.0 1.0 255.0 255.0 Redacted Horse ... Park Lane 21401484.0 UB9 504689.0 190685.0 504650 190650 51.605283 -0.489684 POINT (-0.48968 51.60528)
2009-01-07 06:29:00 4011091 2009 2008/09 Special Service 1.0 1.0 255.0 255.0 Redacted Dog ... Holloway Road NaN E11 539013.0 186162.0 539050 186150 51.557221 0.003880 POINT (0.00388 51.55722)
2009-01-07 11:55:00 4211091 2009 2008/09 Special Service 1.0 1.0 255.0 255.0 Redacted Dog ... Aldersbrook Road NaN E12 541327.0 186654.0 541350 186650 51.561067 0.037434 POINT (0.03743 51.56107)

5 rows × 31 columns

In [29]:
f, ax = plt.subplots(figsize=(16, 16))
# plot location of calls involving animals
gdf.plot(ax=ax, color="black", alpha=0.3)
plt.title("Call locations")
# plt.axis("off")
plt.show()
[Output plot: "Call locations" scatter of call coordinates]
In [30]:
import contextily as cx

f, ax = plt.subplots(figsize=(16, 16))
# plot location of calls involving animals
gdf.plot(ax=ax, color="black", alpha=0.3)
# add a basemap of the region using contextily
cx.add_basemap(ax, crs=gdf.crs)
plt.title("Call locations")
plt.axis("off")
plt.show()
[Output plot: "Call locations" points over a contextily basemap of London]
In [31]:
f, ax = plt.subplots(figsize=(16, 16))
# plot location of calls involving animals
for animal, colour in [
    ("Cow", "black"),
    ("Deer", "red"),
    ("Fox", "blue"),
    ("Snake", "yellow"),
]:
    gdf[gdf["AnimalGroupParent"] == animal].plot(
        ax=ax, color=colour, alpha=0.5, label=animal
    )
# add a basemap of the region using contextily
cx.add_basemap(ax, crs=gdf.crs)
plt.title("Call locations by animal")
plt.legend()
plt.axis("off")
plt.show()
[Output plot: "Call locations by animal" points coloured by animal over a basemap]

Suggested workflow / philosophy¶

1. you want to do something but not sure how¶

  • if you know / have a guess which function to use, look at its docstring: ?function_name (example below)
  • if you don't have any idea what to try, Google "how do I ... in pandas"
  • modern alternative: ask ChatGPT to write Python code using pandas to ...
  • if in doubt, just try something!
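For example (an illustration added here, not from the original slides), either of these displays the documentation for read_excel from a Jupyter cell:

# Jupyter / IPython: show the docstring with a leading or trailing question mark
pd.read_excel?
# plain Python equivalent, works anywhere:
help(pd.read_excel)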

Suggested workflow / philosophy¶

2. you try something and get an error message¶

  • copy & paste the last bit into Google (along with the function_name and/or pandas)
  • don't be intimidated by the long and apparently nonsensical error messages
  • almost certainly someone else has had this exact problem
  • almost certainly the solution is waiting for you

Suggested workflow / philosophy¶

3. look for a Stack Overflow answer with many up-votes¶

  • ignore the green tick: it just means the person asking the question liked the answer
  • typically an answer with many up-votes is a better option
  • more recent answers can also be better: sometimes a library has changed since an older answer was written

Next steps¶

  • experiment with your own datasets
  • read some pandas documentation
    • user guide
  • follow a tutorial
    • getting started tutorials
  • free interactive kaggle courses
    • pandas
    • data cleaning