The following is adapted from Wikipedia: Pandas is a software library written for the Python programming language for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods. The central data structure introduced in Pandas is the data frame. This is a two-dimensional representation of tabular data--think spreadsheets.
This first exercise focuses on creating data frames. Pandas can convert a variety of data files into a data frame for analysis. A common computer file for holding data is a csv (comma separated values) file. Here is a link to the data file we will use: COVID data
This is a short program which begins with importing pandas as pd. The function pd.read_csv() reads a csv file and creates a DataFrame, here assigned to the variable df. The statement print(df) produces the output below. In an earlier exercise we imported the library csv to read comma separated data files. We need not do that here, since the Pandas library has its own function for reading these files: pd.read_csv().
This data set has an index in the first column (automatically assigned, not in the original data set). The 0th column in the data set contains the date (in date format). Column 1 is an integer representing the number of days since the beginning of the data set (Mar 10, 2020). Column 2 is the change in the number of cases since the preceding day. Columns 3 and 4 hold the numbers of negative and positive tests performed. NaN (Not a Number) marks a missing entry; in this case testing results are reported only after Mar 22.
# Ex 2.5.1
import pandas as pd # To read data
df = pd.read_csv(r'py2.5resources/ex2.5.1covid-data.csv') # load data set
print(df)
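Before moving on, it is worth a quick look at what was loaded. A minimal sketch, assuming the same file path as above: .head() shows the first five rows, and .info() lists each column with its dtype and its non-null count, which is a fast way to spot the NaN entries described above.
import pandas as pd
df = pd.read_csv(r'py2.5resources/ex2.5.1covid-data.csv')
print(df.head()) # first five rows of the frame
df.info() # column names, dtypes, and non-null counts reveal the NaNs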
Here we will plot data from a data frame using pandas.DataFrame.plot. The format is quite simple and is shown in line 5. The variable df is a Pandas dataframe, so we can simply write df.plot(), where plot() is a method defined for dataframe objects. It is not necessary to explicitly import matplotlib.
# Ex 2.5.2
import pandas as pd # To read data
df = pd.read_csv(r'py2.5resources/ex2.5.1covid-data.csv') # load data set
print(df)
df.plot(x='day', y='cases', kind='scatter')
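A side note: df.plot() returns a matplotlib Axes object, so the figure can still be customized even though we never call matplotlib directly. A small sketch (the title and axis label here are illustrative, not part of the exercise):
ax = df.plot(x='day', y='cases', kind='scatter')
ax.set_title('COVID cases by day') # customize via the returned Axes
ax.set_xlabel('days since Mar 10, 2020')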
We can select parts of the data for analysis. In this case I want to look only at the test results, by day number.
Lines 2-3 are identical to the last example.
Line 4 Here we define a new data frame df1 whose columns are given by the list ['day','test_neg','test_pos']. Note the pattern: new_dataframe = old_dataframe[list of columns from the old data frame to keep].
Line 5 The statement df2 = df1.dropna() removes any row with at least one NaN (Not a Number), meaning missing data, and assigns the result to df2. NOTE: since the missing test results here occupy exactly the first 12 rows, line 5 could alternatively drop those rows by index, e.g. df2 = df1.drop(index=range(12)); this is made concrete in the sketch after the code below.
NOTE: When the output gets large, as it can with dataframes, it will be displayed within a scroll-bar window.
# Ex 2.5.2
import pandas as pd # To read data
df = pd.read_csv(r'py2.5resources/ex2.5.1covid-data.csv') # load data set
df1 = df[['day','test_neg','test_pos']]
df2 = df1.dropna() # drops rows with NaN
print(df2)
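To make the alternative in the note above concrete, here is a sketch assuming (as in this data set) that the missing test results occupy exactly rows 0 through 11:
df2_alt = df1.drop(index=range(12)) # drop rows 0..11 by index label
print(df2_alt.equals(df2)) # True if the two approaches agree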
We can select rows and columns using the method .iloc[rows, columns], where rows and columns can each be a list of positions or a slice like 1:5. If we write df.iloc[0:5, 0:3], that means rows 0,1,2,3,4 and columns 0,1,2 (as with slices generally, the endpoint is excluded).
# Ex 2.5.3
import pandas as pd # To read data
df = pd.read_csv(r'py2.5resources/ex2.5.1covid-data.csv') # load data set
df1=df.iloc[0:5]
print('first five rows\ndf.iloc[0:5]\n\n',df1,'\n\n') # first five rows of dataframe
df2=df.iloc[:, 0:2]
print('first two columns of data frame with all rows\ndf.iloc[:, 0:2] \n\n',df2,'\n\n')
df3=df.iloc[[0,3,6,24],[0,2,4]]
print('rows 0,3,6,24 and columns 0,2,4\ndf.iloc[[0,3,6,24],[0,2,4]]\n\n',df3,'\n\n')
df4=df.iloc[0:5,0:3]
print('First 5 rows and the first 3 columns df.iloc[0:5,0:3]\n\n',df4,'\n\n') # first 5 rows and first 3 columns of the data frame
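.iloc selects by position. Its label-based counterpart, .loc, selects by column name and also accepts boolean conditions, which we will use later for the county data. A brief sketch (the threshold of 100 is arbitrary, for illustration only):
df5 = df.loc[:, ['day','cases']] # columns selected by name rather than position
df6 = df.loc[df['cases'] > 100] # rows selected by a boolean condition
print(df6.head())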
Line 9 We can use arithmetic operators on columns: df2 = df1['test_neg'] + df1['test_pos'] creates a new single-column object (a pandas Series) whose entries are the row-by-row sums.
Line 11 Here we place that result into df1 as a new column, df1['tot_tests'] = df2; the column gets its name, tot_tests, at this step.
Lines 13 and 15 repeat the operation to compute percent_pos, the percentage of all tests that were positive.
# Ex 2.5.4
import numpy as np
import matplotlib.pyplot as plt # To visualize
import pandas as pd # To read data
df = pd.read_csv(r'py2.5resources/ex2.5.1covid-data.csv') # load data set
df1=df.drop(columns=['cases','change']) #drops the two cols named
df1=df1.dropna() #drops rows with NaN
# define new data frame df2 as the sum of elements from df1
df2=df1['test_neg']+df1['test_pos'] # df2 is a Series: the row-by-row sum of the two test columns
# add a new column to df1 named 'tot_tests' from the single column in df2
df1['tot_tests']=df2 # add new column to df1 named tot_tests and assign it df2
# define new data frame df3 with a column 'percent_pos'
df3=df1['test_pos']/df1['tot_tests']*100
# add a new column to df1 named 'percent_pos' from df3
df1['percent_pos']=df3
print(df1)
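The intermediate objects df2 and df3 make each step visible, but the same columns can be built directly, since column arithmetic is applied row by row either way. A compact equivalent sketch:
df1['tot_tests'] = df1['test_neg'] + df1['test_pos'] # sum, named in one step
df1['percent_pos'] = df1['test_pos'] / df1['tot_tests'] * 100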
Here we introduce computing means and standard deviations of the data in the dataframe. We use the methods .mean() and .std() applied to the column 'percent_pos'. See lines 12-13
Note: these statistical moments are for all data in the dataframe.
It may well be that the statistics change over time; care must be taken to properly interpret these statistics. One quick check is sketched after the code below.
# Ex 2.5.5
import pandas as pd # To read data
df = pd.read_csv(r'py2.5resources/ex2.5.1covid-data.csv') # load data set
df1=df.drop(columns=['cases','change']) #drops the two cols named
df1=df1.dropna() #drops rows with NaN
df2=df1['test_neg']+df1['test_pos'] #create a new data frame df2 with column heading 'tot_tests'
df1['tot_tests']=df2 # add new column to df1 named tot_tests and assign it df2
df3=df1['test_pos']/df1['tot_tests']*100
df1['percent_pos']=df3
# plotting and computing simple statistics
df1.plot(x='day', y='percent_pos', kind='scatter')
print('mean percent positive tests = ',df1['percent_pos'].mean(),'%')
print('standard deviation positive tests = ',df1['percent_pos'].std())
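One rough way to check whether the statistics drift over time, as cautioned above, is to compare the two halves of the record. A sketch (splitting at the midpoint is an arbitrary choice):
half = len(df1) // 2
first, second = df1.iloc[:half], df1.iloc[half:]
print('first-half mean =', first['percent_pos'].mean())
print('second-half mean =', second['percent_pos'].mean())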
While there are outliers, there seems to be a downward trend in the percentage of positive tests since the beginning of April.
We can use the techniques we learned with scipy and matplotlib to enhance the analysis we do on data expressed as pandas dataframes. In this example, lines 1-8 create a dataframe df1 which holds data representing days (into the pandemic) and COVID cases. Rather than using the pandas plot function, we will use matplotlib directly. Our first task is to convert the data in the columns of the dataframe into numpy arrays. This is done in lines 9-10: we refer to the column df1['day'] and use the method .to_numpy().
Lines 12-14 define the model (a fourth-degree polynomial) we will fit to the data.
Line 19 applies curve_fit to find the best-fit parameters.
# Ex 2.5.6
import numpy as np
import matplotlib.pyplot as plt # To visualize
import pandas as pd # To read data
from scipy.optimize import curve_fit
df = pd.read_csv(r'py2.5resources/ex2.5.1covid-data.csv') # load data set
df1=df[['day','cases']] #df1 contains two columns
df1 = df1.dropna() # dropna returns a new frame, so the result must be assigned
x = df1['day'].to_numpy()
y = df1['cases'].to_numpy()
plt.plot(x, y, 'o', color='red', label='data')
def model(x,a,b,c,d,e):
    y = a + b*x + c*x**2 + d*x**3 + e*x**4
    return y
# curve_fit() takes the model function,
# the x-data and y-data as arguments, and returns
# the coefficients a through e in param and
# the estimated covariance of param in param_cov
param, param_cov = curve_fit(model, x, y)
a, b, c, d, e = param # unpack the five fitted coefficients
print('a=',a,'\nb=',b,'\nc=',c,'\nd=',d,'\ne=',e)
plt.plot(x, model(x,a,b,c,d,e), '--', color='blue', label='fitted data')
plt.legend()
plt.show()
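An aside: because this particular model is an ordinary polynomial, np.polyfit would recover the same coefficients without our defining a model function; curve_fit is the more general tool, since it fits any function of x and its parameters. A comparison sketch:
coeffs = np.polyfit(x, y, 4) # degree-4 polynomial fit; highest power first
print(np.poly1d(coeffs)) # readable form of the fitted polynomial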
As the pandemic began, the New York Times collected data reporting cases on a county-by-county basis across the United States. The data runs from 01/31/2020 through 05/13/2022. There are a total of 3,143 counties across the U.S.; many are represented in these data. The file size is nearly 105 MB, and it lists over 250 thousand individual daily records.
Lines 2-4 load the required libraries.
Lines 6-7 access and print the size of the file prior to downloading.
Line 9 reads the NYT data from the github website into the dataframe df. The next two lines of code print out the number of rows and the number of columns in the dataframe. df.shape returns a tuple in which the $0^{th}$ element is the number of rows and the $1^{st}$ element is the number of columns.
Line 13 selects a subset of the dataframe in which the county is listed as 'Santa Clara'. While the name Santa Clara is unique across the US, this is not so for other county names. In general, to select a different county you may need to extend the condition in line 13 with & (df['state'] == 'state name').
Line 14 prints the number of rows in the redefined dataframe. NOTE: df.shape[0] is the number of rows, and df.shape[1] would be the number of columns of df.
Line 16 inserts a numpy array running from 0 to len(df)-1 to represent the day since the start of data collection. Why len(df)-1? np.linspace includes both endpoints, so ending at len(df)-1 produces exactly len(df) values, one per row. (The length of a dataframe counts data rows only; the header is not included.)
NOTE: the method .to_string() applied to the dataframe will print out all of the 834 records for Santa Clara County. If you do not include .to_string(), you will get a representative sample of the data. When you print out all the data, it will be displayed in a scrolling window. We should scroll through the data to make sure that we do not have any missing values that should be scrubbed.
Each data record has a date, county, state, fips, cases, and deaths. On the far left is the index; this is the index from the original dataframe df. Note that data from a single county is not lumped into one sequential run of records. Rather, groups of records have been appended for the counties over time.
Line 18 We are going to save this data to a local file so we don't have to reread the large NYT file. But before we do that, let's get rid of the column marked fips. FIPS is a federal code that uniquely identifies each county; we will not be using it. To delete a column, simply use the method .drop().
Line 19 writes the dataframe df to a scrolling window.
Line 21 To write a dataframe to a csv file we simply use the method .to_csv applied to the dataframe. The first argument is the pathname; in this case, we are storing the data in the file 'ex2.5.6-SCdata.csv' in the directory py2.5resources, which is located in the current directory. index=False means the index on the far left (which starts at 36, left over from importing the entire NYT data set) is not written to the file. If we later read back the file we have just written, Pandas will create a new sequential index starting at 0.
Experiment
Open the file you have just written in directory py2.5resources, using any text editor. You can open it with a spreadsheet app; however, that app will impose its table format on the data. Incidentally, you can build reasonably sized data sets using a spreadsheet and export the data to a .csv file. I found that when creating data with a text editor (at least with mine), it would not let me save the file as .csv, but saved it as .rtf instead.
# Ex 2.5.7
import pandas as pd
import urllib.request
import numpy as np
# find the file size
file = urllib.request.urlopen('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')
print('File length in MB=',file.length/1.0e+06)
# read the file
df=pd.read_csv(r'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')
print('\n #rows=',df.shape[0])
print('\n #cols=',df.shape[1])
# define a new data frame in which only data from Santa Clara County is stored
df = df.loc[df['county'] == 'Santa Clara'].copy() # .copy() avoids a SettingWithCopyWarning when we add columns below
print('\n #rows of data Santa Clara=',df.shape[0])
# insert a new column representing the days since data began to be recorded.
df.insert(1,'day',np.linspace(0,len(df)-1,len(df)))
# deleting column headed by 'fips'
df=df.drop('fips', axis='columns')
print(df.to_string())
# writing to csv file
df.to_csv("py2.5resources/ex2.5.6-SCdata.csv", index=False)
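To see what index=False accomplished, read the file back: the Santa Clara rows now get a fresh sequential index starting at 0 instead of the leftover index from the full NYT frame. A quick sketch:
check = pd.read_csv('py2.5resources/ex2.5.6-SCdata.csv')
print(check.head()) # index runs 0, 1, 2, ... rather than starting at 36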
If you look at the data in the last example, you will notice that the numbers reported for cases are cumulative. To better understand the evolution of the pandemic, it would be convenient to look at the new cases on a given date. It is true that $\Delta$ cases will show more statistical variation, but it will reveal the cyclic nature of the pandemic.
Line 5 We read the modified NYT data which we stored locally in the last example. Since the 'day' column was inserted before the file was written, it is already present here.
Line 7 adds a new column to the dataframe with the header delta_cases (standing for $\Delta$ cases). The value in each row is the difference between that row's 'cases' entry and the preceding row's, computed by .diff(1).
Line 8 drops the NaN row that .diff() creates: the first row has no preceding row to subtract.
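The behavior of .diff() is easy to see on a toy Series: each entry is replaced by its difference from the entry above, and the first entry becomes NaN because it has no predecessor. A small sketch (the numbers are made up):
import pandas as pd
s = pd.Series([10, 12, 15, 15, 20]) # toy cumulative counts
print(s.diff(1)) # NaN, 2.0, 3.0, 0.0, 5.0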
LET'S LOOK AT THIS DATA IN SOME DETAIL. First, we begin to see the cyclic flow of the pandemic that we witnessed as it unfolded. The first peak, at around day 350, corresponds to January 2021, that winter's peak. The second major peak occurs about a year later. The smaller peak near the end of the graph is due to the emergence of new strains of the virus.
Notice that there are a number of days during the second peak on which the data points are zero. That is not because there were no new cases; it resulted from delays in reporting, which also caused even higher spikes in the data where two or more days were reported at once. In addition, there is an anomaly in the data somewhere around day 525, where the new cases are reported as negative. Remember, the actual data was recorded as cumulative; at that point there was evidently a downward correction of the cumulative count, reflecting an earlier miscount.
How might we deal with these issues? One way is to compute a moving average: on a given day we graph the mean of the preceding n days. Of course, the larger the window we choose for the average, the more filtering we do, which both smooths and delays the actual curve.
Lines 12-13 The last few lines of the code have been commented out. Uncomment lines 12-13; this will compute and display a scatter diagram of a seven-day moving average. Note that the cyclic nature is still evident, but the zero and negative values have been filtered out by the averaging process. Note also that the scale of the vertical axis is different when we plot the rolling average.
Line 14 If you uncomment this line, you can verify the actual computation of the rolling average from the data; a hand check is sketched after the code below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read the locally stored and updated dataframe.
df=pd.read_csv(r"py2.5resources/ex2.5.6-SCdata.csv")
# add a new column "delta cases". This is the actual new cases reported that day.
df["delta_cases"]=df['cases'].diff(1)
df=df.dropna()
# plot delta_cases vs. day
df.plot('day','delta_cases', kind='scatter', figsize=(15,8))
# compute 7-day moving average and plot
#df['7 day moving avg'] = df['delta_cases'].rolling(window=7).mean()
#df.plot('day','7 day moving avg', kind = 'scatter',figsize=(15,8), label="7 day moving average")
#print(df.to_string())
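To verify one rolling value by hand, as line 14 suggests, compare the mean of the first seven delta_cases values with the rolling-average entry on the seventh of those days. A sketch (lines 12-13 above must be uncommented first, so that the rolling column exists):
window = df['delta_cases'].iloc[0:7] # seven consecutive daily values
print('hand-computed mean =', window.mean())
print('rolling column value =', df['7 day moving avg'].iloc[6])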