Python DataFrames
Creating a DataFrame
import pandas as pd
myDF = pd.DataFrame({"Name":["Bob","Jane","Tay"],"Age":[27,32,18],"Job":["Chef","HR","Retail"]})
- A dataframe stores data tables, including their values and (optional) labels for rows and columns.
- It allows for convenient data access and manipulation.
- The example above represents a table with 3 columns ("Name", "Age", and "Job"), and with 3 rows containing the specified values.
- Passing a dictionary is just one way of initiate a DataFrame, you can also pass an NDArray for example.
- You can add labels to the rows with the index attribute, for example myDF.index = ['row1', 'row2', 'row3']
Accessing values
myNum = myDF1["Age"][0] # Get the first value (index 0) in the "Age" column
myCol = myDF1["Name"] # Get the "Name" column (as a Series)
myDF2 = myDF1[["Name","Job"]] # Get both of the specified columns (as a new Dataframe)
myDF2 = myDF1[1:3] # Get rows from index 1 up to (but not including) 3
myItm = myDF1.iloc[0,1] # Retrieve by row index and column index
myDF2 = myDF1.loc["row1","Age"] # Get value at a row and column specified by name (or index if no name was specified)
myDF2 = myDF1[myDF["Name"]=="Bob"] # Get DataFrame containing rows that meet a specified condition
myArr = myDF1["Age"].unique() # Get the unique values (no repeats) in the specified column (as an ndarray)
...
- Simple access is similar to accessing data from a multidimensional array, but you can specify the column name and row name instead of the position indices.
- You can specify the row by its name rather than index if you set the names previously.
- A Series is like a list that stores a column or row of a table along with its name and data type.
Removing and replacing values
myDF = myDF.drop(columns=["Age"]) # Get new DataFrame without the specified columns
myDF = myDF.dropna(axis=0) # Remove rows with NaN values (for preparing data)
myDF = myDF.replace(1,0) # Replace 1 values with 0
myDF['C1'] = myDF['C1'].astype(int) # Convert boolean values in column 'C1' to 0 and 1
...
- If you want to modify the current dataframe (rather than creating a new one), use the parameter inplace=True
- You can replace any values with any other value of the same type.
File reading/writing:
myDF1 = pd.read_csv('myFile1.csv', header=None) # Read from a comma separated value file
myDF2 = pd.read_excel('myDF.xlsx') # Read from a spreadsheet file
...
myDF1.to_csv('myFile3.csv') # Write to a file in comma separated value format
myDF2.to_excel('myFile4.xlsx') # Write to a file as a spreadsheet
...
- Specify header=None only if the first line of the file doesn't have column labels.
- Some other formats include json, sql, and hdf.
- If the read data doesn't have column names, you can add them with df.columns = ['ColName1','ColName2',...]
Modification
myDF = myDF.sort_values(['Name']) # Arrange rows by order of values in a specified column
myDF['Age'] = myDF['Age']/12 # Divide all values in the column by 12
myDF['Age'] = myDF['Age'].transform('sqrt') # Take the square root of all values in the column
myDF['NewCol'] = [23,11,56,47] # Add a new column called 'NewCol' with the values
...
- You can pass ascending=False as a parameter to the sort_values function to arrange them from highest to lowest values.
- You can do other math operations on column values: +, -, *, ...
- The transform function supports custom lambda functions too.
Combination
myDF3 = pd.merge(myDF1, myDF2, on="Name") # Rows with the same "Name" value get combined into a new table
myDF3 = pd.merge(myDF1, myDF2, on=["Name","Birthdate"]) # Select on multiple features
myDF3 = pd.concat([myDF1, myDF2], ignore_index=True) # Get a concatenation of the specified dataframes
...
- You may want to select on multiple features if there are repeats of values in one of the features (e.g. 2 people named Bob, but with different birthdates).
- For a multiple-feature merge, both tables must have all the listed features (e.g. table 1 has "Name" and "Birthdate" features, and table 2 also has both).
- The result of the concat function will be a table that is the first table's rows with the second table's rows added below. For columns in one table that the other table doesn't have, NaN values will fill the empty cells.
Analysis
myDF.head(3) # Show the top 3 rows of the table
myDF.tail(4) # Show the last 4 rows of the table
myDF.info() # Get a summary of the DataFrame, including info such as data types and memory usage
myAvg = myDF['Age'].mean() # Get the average of values in a column
myCorr = myDF.corr(numeric_only=True) # Get contingency table showing correlation between columns with numeric features
mySum = myDF['Hours'].sum(axis=0) # Get the total of all the values in the column 'Hours'
myMin = myDF['Age'].min(axis=0) # Get the lowest value in the 'Age' column
myMax = myDF['Age'].max(axis=0) # Get the highest value in the 'Age' column
myShape = myDF.shape # Get the number of columns and rows
myDup = myDF.duplicated() # Get column indicating whether each row is a duplicate of a previous row
...
- For functions that use the axis parameter, set it to 1 to apply it across the values in a row.
- If you have a table with only numeric values, you can apply the sum/min/max function to the whole table (not only a column).
Looping through data
# Iterate through rows
for i,myRow in myDF.iterrows():
print(myRow['Age']) # Display the Age value in every row
# Iterate through columns
for myCol in myDF:
print(myDF[myCol][0]) # Display the first value of every column
- If you want to get the whole rows or whole columns, don't specify the last brackets, and they will be returned as Series type.
Plotting data
myDF1.plot.scatter(x="Age", y="Height") # Display a scatterplot between 2 columns
myDF2.plot.line(x="Time", y="Height") # Display a line plot
myDF3['City'].hist() # Display a histogram
...
- A scatter plot shows data points with features represented on the x and y axes.
- A histogram counts the values for each category found in the data.
Challenge
Create a dataframe that includes columns for age, weight, and height. Add 7 values for each column, age values should be 21, 22, or 23. Plot a histogram of the age values. Create a scatter plot of weight with height, and show correlation values (as contignency table) comparing the features. Display the minimum and maximum heights. Multiply the heights by a factor (as if to convert them to different units).
Completed