
Data Exploration with Python and Jupyter - Part 3

Basic usage of the Pandas library to download a dataset, explore its contents, clean up missing or invalid data, filter the data according to different criteria, and plot visualizations of the data.

  • Part 1: Python and Jupyter
  • Part 2: Pandas with toy data
  • Part 3: Pandas with real data


Let's download some real data

For some reason, the London Fire Brigade provides a public spreadsheet of all animal rescue incidents since 2009:

https://data.london.gov.uk/dataset/animal-rescue-incidents-attended-by-lfb

They provide a link to the dataset in Excel format.

In [1]:
# import the Pandas library & matplotlib for plotting

import pandas as pd
import matplotlib.pyplot as plt
In [2]:
# download an excel spreadsheet with some data and convert it to a DataFrame
url = "https://data.london.gov.uk/download/animal-rescue-incidents-attended-by-lfb/01007433-55c2-4b8a-b799-626d9e3bc284/Animal%20Rescue%20incidents%20attended%20by%20LFB%20from%20Jan%202009.csv.xlsx"
df = pd.read_excel(url)

Display the DataFrame

In [3]:
df
Out[3]:
IncidentNumber DateTimeOfCall CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNotionalCost(£) IncidentNotionalCost(£) FinalDescription ... UPRN Street USRN PostcodeDistrict Easting_m Northing_m Easting_rounded Northing_rounded Latitude Longitude
0 139091 2009-01-01 03:01:00 2009 2008/09 Special Service 1.0 2.0 255 510.0 Redacted ... NaN Waddington Way 20500146.0 SE19 NaN NaN 532350 170050 NaN NaN
1 275091 2009-01-01 08:51:00 2009 2008/09 Special Service 1.0 1.0 255 255.0 Redacted ... NaN Grasmere Road NaN SE25 534785.0 167546.0 534750 167550 51.390954 -0.064167
2 2075091 2009-01-04 10:07:00 2009 2008/09 Special Service 1.0 1.0 255 255.0 Redacted ... NaN Mill Lane NaN SM5 528041.0 164923.0 528050 164950 51.368941 -0.161985
3 2872091 2009-01-05 12:27:00 2009 2008/09 Special Service 1.0 1.0 255 255.0 Redacted ... 1.000215e+11 Park Lane 21401484.0 UB9 504689.0 190685.0 504650 190650 51.605283 -0.489684
4 3553091 2009-01-06 15:23:00 2009 2008/09 Special Service 1.0 1.0 255 255.0 Redacted ... NaN Swindon Lane 21300122.0 RM3 NaN NaN 554650 192350 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11653 220675-29122024 2024-12-29 11:34:00 2024 2024/25 Special Service 1.0 1.0 430 430.0 CAT TRAPPED BEHIND WHEEL OF CAR ... 1.000213e+11 FONTWELL CLOSE 21201469.0 HA3 515218.0 191476.0 515250 191450 51.610342 -0.337450
11654 220734-29122024 2024-12-29 13:46:00 2024 2024/25 Special Service 1.0 1.0 430 430.0 CAT TRAPPED IN FLOOR PANEL IN CEILING OF LIFT ... ... NaN ALTASH WAY 20800055.0 SE9 NaN NaN 543150 172550 NaN NaN
11655 221162-30122024 2024-12-30 10:53:00 2024 2024/25 Special Service 1.0 1.0 430 430.0 TWO CATS TRAPPED BEHIND WALL ... NaN COLLINGHAM PLACE 21700113.0 SW5 NaN NaN 525750 178850 NaN NaN
11656 221334-30122024 2024-12-30 15:44:00 2024 2024/25 Special Service 1.0 1.0 430 430.0 Redacted ... 2.000012e+11 BRIGHTON ROAD 20501946.0 CR5 529718.0 158834.0 529750 158850 51.313841 -0.140121
11657 221970-31122024 2024-12-31 17:37:00 2024 2024/25 Special Service 1.0 1.0 430 430.0 Redacted ... NaN DALE VIEW CRESCENT 22830850.0 E4 NaN NaN 538350 193450 NaN NaN

11658 rows × 31 columns

Column data types

In [4]:
df.dtypes
Out[4]:
IncidentNumber                        object
DateTimeOfCall                datetime64[ns]
CalYear                                int64
FinYear                               object
TypeOfIncident                        object
PumpCount                            float64
PumpHoursTotal                       float64
HourlyNotionalCost(£)                  int64
IncidentNotionalCost(£)              float64
FinalDescription                      object
AnimalGroupParent                     object
OriginofCall                          object
PropertyType                          object
PropertyCategory                      object
SpecialServiceTypeCategory            object
SpecialServiceType                    object
WardCode                              object
Ward                                  object
BoroughCode                           object
Borough                               object
StnGroundName                         object
UPRN                                 float64
Street                                object
USRN                                 float64
PostcodeDistrict                      object
Easting_m                            float64
Northing_m                           float64
Easting_rounded                        int64
Northing_rounded                       int64
Latitude                             float64
Longitude                            float64
dtype: object

DateTimeOfCall

In [5]:
df["DateTimeOfCall"].head()
Out[5]:
0   2009-01-01 03:01:00
1   2009-01-01 08:51:00
2   2009-01-04 10:07:00
3   2009-01-05 12:27:00
4   2009-01-06 15:23:00
Name: DateTimeOfCall, dtype: datetime64[ns]
In [6]:
# this is already a datetime object, which is great
# a quick sanity check to see if it looks correct:
pd.to_datetime(df["DateTimeOfCall"]).plot()
# should be a single monotonically increasing line: looks good!
Out[6]:
<Axes: >
[Plot: DateTimeOfCall values form a single monotonically increasing line]
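An alternative, non-visual sanity check (a small sketch, not part of the original notebook) is to ask pandas directly whether the timestamps are already in order:

# is_monotonic_increasing is a standard Series property; True means the values are sorted
df["DateTimeOfCall"].is_monotonic_increasing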

Use datetime as the index

In [7]:
df.set_index("DateTimeOfCall", inplace=True)
In [8]:
df
Out[8]:
IncidentNumber CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNotionalCost(£) IncidentNotionalCost(£) FinalDescription AnimalGroupParent ... UPRN Street USRN PostcodeDistrict Easting_m Northing_m Easting_rounded Northing_rounded Latitude Longitude
DateTimeOfCall
2009-01-01 03:01:00 139091 2009 2008/09 Special Service 1.0 2.0 255 510.0 Redacted Dog ... NaN Waddington Way 20500146.0 SE19 NaN NaN 532350 170050 NaN NaN
2009-01-01 08:51:00 275091 2009 2008/09 Special Service 1.0 1.0 255 255.0 Redacted Fox ... NaN Grasmere Road NaN SE25 534785.0 167546.0 534750 167550 51.390954 -0.064167
2009-01-04 10:07:00 2075091 2009 2008/09 Special Service 1.0 1.0 255 255.0 Redacted Dog ... NaN Mill Lane NaN SM5 528041.0 164923.0 528050 164950 51.368941 -0.161985
2009-01-05 12:27:00 2872091 2009 2008/09 Special Service 1.0 1.0 255 255.0 Redacted Horse ... 1.000215e+11 Park Lane 21401484.0 UB9 504689.0 190685.0 504650 190650 51.605283 -0.489684
2009-01-06 15:23:00 3553091 2009 2008/09 Special Service 1.0 1.0 255 255.0 Redacted Rabbit ... NaN Swindon Lane 21300122.0 RM3 NaN NaN 554650 192350 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2024-12-29 11:34:00 220675-29122024 2024 2024/25 Special Service 1.0 1.0 430 430.0 CAT TRAPPED BEHIND WHEEL OF CAR Cat ... 1.000213e+11 FONTWELL CLOSE 21201469.0 HA3 515218.0 191476.0 515250 191450 51.610342 -0.337450
2024-12-29 13:46:00 220734-29122024 2024 2024/25 Special Service 1.0 1.0 430 430.0 CAT TRAPPED IN FLOOR PANEL IN CEILING OF LIFT ... Cat ... NaN ALTASH WAY 20800055.0 SE9 NaN NaN 543150 172550 NaN NaN
2024-12-30 10:53:00 221162-30122024 2024 2024/25 Special Service 1.0 1.0 430 430.0 TWO CATS TRAPPED BEHIND WALL cat ... NaN COLLINGHAM PLACE 21700113.0 SW5 NaN NaN 525750 178850 NaN NaN
2024-12-30 15:44:00 221334-30122024 2024 2024/25 Special Service 1.0 1.0 430 430.0 Redacted Cat ... 2.000012e+11 BRIGHTON ROAD 20501946.0 CR5 529718.0 158834.0 529750 158850 51.313841 -0.140121
2024-12-31 17:37:00 221970-31122024 2024 2024/25 Special Service 1.0 1.0 430 430.0 Redacted Cat ... NaN DALE VIEW CRESCENT 22830850.0 E4 NaN NaN 538350 193450 NaN NaN

11658 rows × 30 columns

In [9]:
# can now use datetime to select rows: here is jan 2021
df.loc["2021-01-01":"2021-01-31", "FinalDescription"]
Out[9]:
DateTimeOfCall
2021-01-01 12:09:00        KITTEN STUCK UP TREE  AL REQUESTED FROM SCENE
2021-01-01 14:06:00                                             Redacted
2021-01-03 18:40:00                CAT WITH LEG TRAPPED IN BATH PLUGHOLE
2021-01-04 13:39:00                                             Redacted
2021-01-06 10:22:00                                             Redacted
2021-01-06 13:09:00    CAT IN DISTRESS ON ROOF - ADDITIONAL APPLIANCE...
2021-01-06 20:35:00        DOG TRAPPED IN FOX HOLE  - MEET AT CLUB HOUSE
2021-01-07 23:50:00                   KITTEN STUCK BETWEEN WALL AND ROOF
2021-01-09 08:01:00                                  DOG STUCK IN TRENCH
2021-01-10 19:27:00                                             Redacted
2021-01-12 11:39:00                                             Redacted
2021-01-12 22:38:00                                 CAT TRAPPED IN DITCH
2021-01-16 18:05:00                          DOG TRAPPED IN PORTER CABIN
2021-01-17 16:09:00    DOG TRAPPED IN WAREHOUSE AREA - CALLER BELIEVE...
2021-01-17 17:09:00      BIRD TRAPPED IN NETTING    CALLER WILL MEET YOU
2021-01-18 15:17:00            CAT STUCK IN TREE BEING ATTACKED BY CROWS
2021-01-18 17:06:00    ASSIST RSPCA - SMALL ANIMAL RESUE - BIRD ENTAN...
2021-01-19 18:28:00                          CAT TRAPPED BEHIND CUPBOARD
2021-01-19 20:24:00                                             Redacted
2021-01-19 20:36:00                              RUNNING CALL AT ON ROOF
2021-01-20 09:35:00                      CAT STUCK BETWEEN TREE BRANCHES
2021-01-21 13:15:00                              SWAN TRAPPED IN NETTING
2021-01-21 18:23:00                               CAT TRAPPED IN CHIMNEY
2021-01-22 14:22:00                   CAT TRAPPED BETWEEN WALL AND FENCE
2021-01-23 10:18:00                               CAT TRAPPED IN CHIMNEY
2021-01-23 15:43:00                            CAT TRAPPED BETWEEN WALLS
2021-01-23 17:16:00                                             Redacted
2021-01-25 12:02:00             ASSIST RSPCA WITH FOX STUCK DOWN CULVERT
2021-01-26 13:42:00         DOG STUCK IN RAILINGS - CALLER WILL MEET YOU
2021-01-26 18:21:00                                             Redacted
2021-01-26 22:44:00    BIRDS TRAPPED IN BASKETBALL COURT CALLER IS ON...
2021-01-26 23:35:00             FOX TRAPPED IN FENCE IN ALLEYWAY NEXT TO
2021-01-27 09:18:00    CAT STUCK IN TREE - ATTENDED YESTERDAY AND ADV...
2021-01-27 10:12:00    BIRD TRAPPED BY LEG IN A TREE - RSPCA IN ATTEN...
2021-01-27 15:22:00                           CAT UP TREE   ASSIST RSPCA
2021-01-29 10:47:00                 TRAPPED FOX IN FENCE  IN REAR GARDEN
2021-01-30 14:53:00                                 CAT STUCK UNDER SHED
2021-01-30 15:28:00              BIRD CAUGHT IN NETTING - RSPCA ON SCENE
2021-01-30 17:54:00                                DOG TRAPPED UNDER CAR
2021-01-31 12:53:00                   CAT STUCK UP TREE - RSPCA ON SCENE
2021-01-31 13:48:00           INJURED CAT STUCK IN GREEN AREA AT REAR OF
Name: FinalDescription, dtype: object
In [10]:
# resample the timeseries by month and count incidents
df.resample("ME")["IncidentNumber"].count().plot(title="Monthly Calls")
# see https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases
plt.show()
[Line plot: Monthly Calls]
In [11]:
# resample by year, sum total costs, average hourly costs
fig, axs = plt.subplots(figsize=(16, 4), ncols=2)
df.resample("YE")["IncidentNotionalCost(£)"].sum().plot(
    title="Year total cost", ax=axs[0]
)
df.resample("YE")["HourlyNotionalCost(£)"].mean().plot(
    title="Average hourly cost", ax=axs[1]
)
plt.show()
[Line plots: Year total cost (left) and Average hourly cost (right)]

Missing data

Different strategies for dealing with missing data (a toy example follows this list):

  • Ignore the issue
    • some things may break / not work as expected
  • Remove rows/columns with missing data
    • remove all rows with missing data: df.dropna(axis=0)
    • remove all columns with missing data: df.dropna(axis=1)
  • Guess (impute) missing data
    • replace all missing entries with a value: df.fillna(1)
    • replace missing entries with the mean for that column: df.fillna(df.mean())
    • replace each missing entry with the previous valid entry: df.ffill() (the older df.fillna(method="pad") is deprecated)
    • replace missing by interpolating between valid entries: df.interpolate()
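To make these options concrete, here is a minimal sketch using a tiny made-up Series (hypothetical data, not the LFB dataset):

# toy Series with two missing values
s = pd.Series([1.0, None, 3.0, None, 5.0])
s.dropna()          # remove the missing entries
s.fillna(1)         # replace them with a fixed value
s.fillna(s.mean())  # replace them with the mean of the valid entries
s.ffill()           # carry the previous valid entry forward
s.interpolate()     # linearly interpolate between valid entries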
In [12]:
# count missing entries for each column
df.isna().sum()
Out[12]:
IncidentNumber                   0
CalYear                          0
FinYear                          0
TypeOfIncident                   0
PumpCount                       77
PumpHoursTotal                  79
HourlyNotionalCost(£)            0
IncidentNotionalCost(£)         79
FinalDescription                 5
AnimalGroupParent                0
OriginofCall                     0
PropertyType                     0
PropertyCategory                 0
SpecialServiceTypeCategory       0
SpecialServiceType               0
WardCode                        19
Ward                            19
BoroughCode                     14
Borough                         14
StnGroundName                    0
UPRN                          7265
Street                           0
USRN                          1156
PostcodeDistrict                 0
Easting_m                     6246
Northing_m                    6246
Easting_rounded                  0
Northing_rounded                 0
Latitude                      6246
Longitude                     6246
dtype: int64
In [13]:
# If PumpCount is missing, typically so is PumpHoursTotal
# 79 rows are missing at least one of these
pump_missing = df["PumpCount"].isna() | df["PumpHoursTotal"].isna()
print(pump_missing.sum())
79
In [14]:
# so we could choose to drop these rows
df1 = df.drop(df.loc[pump_missing].index)
# here we made a new dataset df1 with these rows dropped
# to drop the rows from the original dataset df, could do:
#
# df = df.drop(df.loc[pump_missing].index)
#
# or:
#
# df.drop(df.loc[pump_missing].index, inplace=True)
#
print(len(df1))
11579
In [15]:
# another equivalent way to do this
df2 = df.dropna(subset=["PumpCount", "PumpHoursTotal"])
print(len(df2))
11579
In [16]:
# but if we drop them, we lose valid data from other columns
# let's look at the distribution of values:
fig, axs = plt.subplots(1, 2, figsize=(14, 6))
df.plot.hist(y="PumpCount", ax=axs[0])
df.plot.hist(y="PumpHoursTotal", ax=axs[1])
plt.show()
[Histograms: PumpCount (left) and PumpHoursTotal (right)]
In [17]:
# looks like it would be better to replace missing PumpCount and PumpHoursTotal fields with 1
df.fillna({"PumpCount": 1, "PumpHoursTotal": 1}, inplace=True)
In [18]:
df.isna().sum()
Out[18]:
IncidentNumber                   0
CalYear                          0
FinYear                          0
TypeOfIncident                   0
PumpCount                        0
PumpHoursTotal                   0
HourlyNotionalCost(£)            0
IncidentNotionalCost(£)         79
FinalDescription                 5
AnimalGroupParent                0
OriginofCall                     0
PropertyType                     0
PropertyCategory                 0
SpecialServiceTypeCategory       0
SpecialServiceType               0
WardCode                        19
Ward                            19
BoroughCode                     14
Borough                         14
StnGroundName                    0
UPRN                          7265
Street                           0
USRN                          1156
PostcodeDistrict                 0
Easting_m                     6246
Northing_m                    6246
Easting_rounded                  0
Northing_rounded                 0
Latitude                      6246
Longitude                     6246
dtype: int64

Count the unique entries in each column

In [19]:
df.nunique().sort_values()
Out[19]:
TypeOfIncident                    1
PumpCount                         4
SpecialServiceTypeCategory        4
PropertyCategory                  7
OriginofCall                      8
PumpHoursTotal                   12
HourlyNotionalCost(£)            14
CalYear                          16
FinYear                          17
SpecialServiceType               24
AnimalGroupParent                29
BoroughCode                      37
Borough                          70
IncidentNotionalCost(£)          90
StnGroundName                   109
PropertyType                    194
PostcodeDistrict                283
Northing_rounded                428
Easting_rounded                 533
WardCode                        762
Ward                           1371
UPRN                           4184
Northing_m                     4848
Easting_m                      4949
Latitude                       5341
Longitude                      5341
FinalDescription               7054
USRN                           7629
Street                         8210
IncidentNumber                11658
dtype: int64
In [20]:
# "cat" and "Cat" are treated as different animals here:
df["AnimalGroupParent"].unique()
Out[20]:
array(['Dog', 'Fox', 'Horse', 'Rabbit',
       'Unknown - Heavy Livestock Animal', 'Squirrel', 'Cat', 'Bird',
       'Unknown - Domestic Animal Or Pet', 'Sheep', 'Deer',
       'Unknown - Wild Animal', 'Snake', 'Lizard', 'Hedgehog', 'cat',
       'Hamster', 'Lamb', 'Fish', 'Bull', 'Cow', 'Ferret', 'Budgie',
       'Unknown - Animal rescue from water - Farm animal', 'Pigeon',
       'Goat', 'Tortoise',
       'Unknown - Animal rescue from below ground - Farm animal', 'Rat'],
      dtype=object)
In [21]:
# select rows where AnimalGroupParent is "cat", replace with "Cat"
df.loc[df["AnimalGroupParent"] == "cat", "AnimalGroupParent"] = "Cat"
In [22]:
df["AnimalGroupParent"].unique()
Out[22]:
array(['Dog', 'Fox', 'Horse', 'Rabbit',
       'Unknown - Heavy Livestock Animal', 'Squirrel', 'Cat', 'Bird',
       'Unknown - Domestic Animal Or Pet', 'Sheep', 'Deer',
       'Unknown - Wild Animal', 'Snake', 'Lizard', 'Hedgehog', 'Hamster',
       'Lamb', 'Fish', 'Bull', 'Cow', 'Ferret', 'Budgie',
       'Unknown - Animal rescue from water - Farm animal', 'Pigeon',
       'Goat', 'Tortoise',
       'Unknown - Animal rescue from below ground - Farm animal', 'Rat'],
      dtype=object)
In [23]:
df.groupby("AnimalGroupParent")["IncidentNumber"].count().sort_values().plot.barh(
    logx=True
)
plt.show()
[Horizontal bar chart: incident counts by AnimalGroupParent, log-scaled x-axis]
In [24]:
# apparently different hourly costs
# does it depend on the type of event? or does it just increase over time?
df["HourlyNotionalCost(£)"].unique()
Out[24]:
array([255, 260, 290, 295, 298, 326, 328, 333, 339, 346, 352, 364, 388,
       430])
In [25]:
# just goes up over time
df["HourlyNotionalCost(£)"].plot.line()
Out[25]:
<Axes: xlabel='DateTimeOfCall'>
[Line plot: HourlyNotionalCost(£) against DateTimeOfCall]
In [26]:
# Group incidents by fire station & count them
df.groupby("StnGroundName")["IncidentNumber"].count()
Out[26]:
StnGroundName
Acton           93
Addington       86
Barking        108
Barnet         108
Battersea      100
              ... 
Whitechapel     39
Willesden       87
Wimbledon       95
Woodford       109
Woodside        98
Name: IncidentNumber, Length: 109, dtype: int64
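As a small extension (not in the original notebook), the same counts can be sorted to show the busiest station grounds:

# ten station grounds with the most animal rescue incidents
df.groupby("StnGroundName")["IncidentNumber"].count().sort_values(ascending=False).head(10)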

Plot location of calls on a map

  • note: this section uses some additional libraries; to install them:
  • pip install geopandas contextily
In [27]:
import geopandas

# drop missing longitude/latitude
df2 = df.dropna(subset=["Longitude", "Latitude"])
# also drop zero values
df2 = df2[df2["Latitude"] != 0]

# set crs to EPSG:4326 to specify WGS84 Latitude/Longitude
gdf = geopandas.GeoDataFrame(
    df2,
    geometry=geopandas.points_from_xy(df2["Longitude"], df2["Latitude"]),
    crs="EPSG:4326",
)
In [28]:
gdf.head()
Out[28]:
IncidentNumber CalYear FinYear TypeOfIncident PumpCount PumpHoursTotal HourlyNotionalCost(£) IncidentNotionalCost(£) FinalDescription AnimalGroupParent ... Street USRN PostcodeDistrict Easting_m Northing_m Easting_rounded Northing_rounded Latitude Longitude geometry
DateTimeOfCall
2009-01-01 08:51:00 275091 2009 2008/09 Special Service 1.0 1.0 255 255.0 Redacted Fox ... Grasmere Road NaN SE25 534785.0 167546.0 534750 167550 51.390954 -0.064167 POINT (-0.06417 51.39095)
2009-01-04 10:07:00 2075091 2009 2008/09 Special Service 1.0 1.0 255 255.0 Redacted Dog ... Mill Lane NaN SM5 528041.0 164923.0 528050 164950 51.368941 -0.161985 POINT (-0.16199 51.36894)
2009-01-05 12:27:00 2872091 2009 2008/09 Special Service 1.0 1.0 255 255.0 Redacted Horse ... Park Lane 21401484.0 UB9 504689.0 190685.0 504650 190650 51.605283 -0.489684 POINT (-0.48968 51.60528)
2009-01-07 06:29:00 4011091 2009 2008/09 Special Service 1.0 1.0 255 255.0 Redacted Dog ... Holloway Road NaN E11 539013.0 186162.0 539050 186150 51.557221 0.003880 POINT (0.00388 51.55722)
2009-01-07 11:55:00 4211091 2009 2008/09 Special Service 1.0 1.0 255 255.0 Redacted Dog ... Aldersbrook Road NaN E12 541327.0 186654.0 541350 186650 51.561067 0.037434 POINT (0.03743 51.56107)

5 rows × 31 columns

In [29]:
f, ax = plt.subplots(figsize=(16, 16))
# plot location of calls involving animals
gdf.plot(ax=ax, color="black", alpha=0.3)
plt.title("Call locations")
# plt.axis("off")
plt.show()
[Scatter plot: Call locations]
In [30]:
import contextily as cx

f, ax = plt.subplots(figsize=(16, 16))
# plot location of calls involving animals
gdf.plot(ax=ax, color="black", alpha=0.3)
# add a basemap of the region using contextily
cx.add_basemap(ax, crs=gdf.crs)
plt.title("Call locations")
plt.axis("off")
plt.show()
[Scatter plot: Call locations over a contextily basemap]
In [31]:
f, ax = plt.subplots(figsize=(16, 16))
# plot location of calls involving animals
for animal, colour in [
    ("Cow", "black"),
    ("Deer", "red"),
    ("Fox", "blue"),
    ("Snake", "yellow"),
]:
    gdf[gdf["AnimalGroupParent"] == animal].plot(
        ax=ax, color=colour, alpha=0.5, label=animal
    )
# add a basemap of the region using contextily
cx.add_basemap(ax, crs=gdf.crs)
plt.title("Call locations by animal")
plt.legend()
plt.axis("off")
plt.show()
[Scatter plot: Call locations by animal over a contextily basemap]

Suggested workflow / philosophy

1. you want to do something but are not sure how

  • if you know / have a guess which function to use, look at its docstring: ?function_name (see the example after this list)
  • if you don't have any idea what to try, Google how do I ... in pandas
  • modern alternative: ask ChatGPT to write Python code using pandas to ...
  • if in doubt, just try something!
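For example, to look at a docstring from a Jupyter cell (a quick illustration of the first bullet above):

# show the docstring for a function you think might help
?pd.read_excel
# or, using plain Python:
help(pd.read_excel)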

Suggested workflow / philosophy

2. you try something and get an error message

  • copy & paste the last bit into Google (along with the function_name and/or pandas)
  • don't be intimidated by the long and apparently nonsensical error messages
  • almost certainly someone else has had this exact problem
  • almost certainly the solution is waiting for you

Suggested workflow / philosophy

3. look for a Stack Overflow answer with many up-votes

  • ignore the green tick: it just means the person who asked the question accepted that answer
  • typically an answer with many up-votes is a better option
  • more recent answers can also be better: sometimes a library has changed since an older answer was written

Next steps

  • experiment with your own datasets
  • read some pandas documentation
    • user guide
  • follow a tutorial
    • getting started tutorials
  • free interactive kaggle courses
    • pandas
    • data cleaning