Pandas#

Create Dataframe#

import pandas as pd

data = {"col1": [1, 2], "col2": [3, 4]}
df = pd.DataFrame(data=data)

Load and Save CSV#

  • save to CSV: df.to_csv("path_or_buffer.csv")

  • save to CSV (without row names / index): df.to_csv("path_or_buffer.csv", index=False)

  • save to compressed CSV (without row names / index): df.to_csv("path_or_buffer.csv.gz", compression="gzip", index=False)

  • load from CSV:

import csv  # needed for the quoting constants

df = pd.read_csv(
    "path_or_buffer",
    sep=";",
    encoding="us-ascii",
    usecols=col_list,  # only read these columns
    nrows=number_of_rows_to_read,  # only read the first n rows
    low_memory=False,  # parse the whole file at once for consistent dtypes
    quoting=csv.QUOTE_NONE,  # never interpret quote characters
)
  • load CSV without a header row: df = pd.read_csv("path_or_buffer", names=["column_name_1", "column_name_2"], header=None)
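A minimal round-trip sketch of the calls above (the file name is just a placeholder):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# write without the index column, then read the file back
df.to_csv("example.csv", index=False)
df_loaded = pd.read_csv("example.csv")

print(df_loaded.equals(df))  # True
```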

Load and Save as Parquet#

Load and Save with Pickle#
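A minimal sketch (the file name is a placeholder):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# pickle preserves dtypes and the index exactly
df.to_pickle("example.pkl")
df_loaded = pd.read_pickle("example.pkl")

print(df_loaded.equals(df))  # True
```

Only unpickle files from sources you trust; loading a pickle can execute arbitrary code.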

Display Data#
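Common inspection calls, sketched on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

df.head(2)      # first 2 rows
df.tail(2)      # last 2 rows
df.info()       # column dtypes and memory usage
df.describe()   # summary statistics for numeric columns
```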

Delete Data#

  • delete column inline

    • df.drop("column_name", axis=1, inplace=True)

    • column_name can also be a list of str

  • remove rows on condition: df.drop(df[df["col_name"] == condition].index, inplace=True)

  • remove duplicates

    • keep first (inplace): df.drop_duplicates(inplace=True, keep="first")

    • only consider certain columns to identify duplicates, keep first (inplace): df.drop_duplicates(list_of_cols, inplace=True, keep="first")
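The drop calls above, sketched end to end on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 3], "b": ["x", "x", "y", "z"]})

# remove rows on condition: drop all rows where a == 3
df.drop(df[df["a"] == 3].index, inplace=True)

# remove duplicate rows, keeping the first occurrence
df.drop_duplicates(inplace=True, keep="first")

print(df["a"].tolist())  # [1, 2]
```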

Modify Data#

  • sort

    • low to high values: df.sort_values("column_name", inplace=True)

    • high to low values: df.sort_values("column_name", ascending=False, inplace=True)

    • high to low values & NaN values on top (inplace): df.sort_values("column_name", ascending=False, na_position="first", inplace=True)

  • shuffle: df = df.sample(frac=1).reset_index(drop=True)
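Sort and shuffle, sketched on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"score": [3, 1, 2]})

# high to low values
df.sort_values("score", ascending=False, inplace=True)
print(df["score"].tolist())  # [3, 2, 1]

# shuffle the rows and rebuild a clean 0..n-1 index
df = df.sample(frac=1).reset_index(drop=True)
print(df.index.tolist())  # [0, 1, 2]
```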

tqdm apply#

from tqdm import tqdm

tqdm.pandas()
# now use progress_apply instead of apply, e.g.:
# df["col"] = df["col"].progress_apply(some_function)

Combine Data#

Stack two Dataframes#

Never forget ignore_index=True, otherwise the result contains duplicate index values and bad things might happen later!

df = pd.concat([df_01, df_02], ignore_index=True)
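A quick check of what ignore_index=True buys you:

```python
import pandas as pd

df_01 = pd.DataFrame({"a": [1, 2]})
df_02 = pd.DataFrame({"a": [3, 4]})

# without ignore_index the result index would be [0, 1, 0, 1]
df = pd.concat([df_01, df_02], ignore_index=True)
print(df.index.tolist())  # [0, 1, 2, 3]
```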

Display Settings#

Examples for display settings:

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

# display long sentences in multiple rows (Jupyter)
pd.set_option("display.max_colwidth", None)

Filter nan Values#

NaN == NaN is always False, which is why == cannot be used to check for NaN values. Use pd.isnull(obj) (obj: scalar or array-like) or the .isnull() method instead. Examples:

df.loc[pd.isnull(df["col"])]
df[df["col"].isnull()]
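A self-contained version of the two selections above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1.0, np.nan, 3.0]})

# both lines select the single row where col is NaN
print(len(df.loc[pd.isnull(df["col"])]))  # 1
print(len(df[df["col"].isnull()]))        # 1
```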

Other#

  • rename columns: df.rename(columns={"a": "x"}, inplace=True)
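The rename call, sketched on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})
df.rename(columns={"a": "x"}, inplace=True)
print(df.columns.tolist())  # ['x', 'b']
```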