pd.options.display.max_rows = -1
(-1 will print entire dataset or give number like 4000 to print 4000 lines in dataset)
2. Converting string to datatime object
from datetime import datetime
date_string = "2012-08-12"
current_date = datetime.strptime(date_string, "%Y-%m-%d")
3. Find delta days between 2 dates
delta = abs(current_date - old date)
delta.days -> give total days
4. Find min, max & mean(average) in dictionary key value pair.
Ex :
print d
min_len = min(d.items(), key=lambda x: x[1])[1]
max_len = max(d.items(), key=lambda x: x[1])[1]
avg_len = float(sum(d.values()))/len(d)
print min_len, max_len, avg_len
Output :
{19L: 4, 24581L: 633}
4 633 318.5
5. Merging dataframes and filling NaN values.
merged_df = pd.merge(df_1, df_1, on='key_name_to_be_given', how='outer')
merged_df.fillna(0, inplace=True)
outer implies union(default is inner -> intersection)
6. To find if string contains one of multiple substring in dataframe row element and return those rows which has one amount those.
bool_df = df["column_name"].str.contains("DIAG|LAB|DRUG")
final_df = df[bool_df == True]
Here, df is panda dataframe, and it search/finds rows which has substring DIAG or LAB or DRUG
bool_df has True or False.
final_df will have only rows which had substring DIAG or LAB or DRUG.
7. Grouping elements in dataframe
group = df.groupby(["column_name_1", "column_name_2"])
for p in group.groups:
print p, " has ", len(group.groups[p]), " entries"
Can add single or multiple column names.
8. Creating new panda dataframe with filtered columns,
Assume : old dataframe has A, B, C, D, E columns.
new = old.filter(['A','B','D'], axis=1)
Now, new will have A, B, D columns.
9. Converting datatime object to string
Ex:
print datatime_obj
print datatime_obj.strftime('%Y-%m-%d')
2012-12-31 00:00:00
2012-12-31
10. Appending rows to the dataframe.
Given dead_encounter is dictionary of key, value pair.
Create empty dataframe and then add the value from dead_encounter to dataframe df
df = pd.DataFrame(columns=('patient_id', 'indx_date'))
i = 0
for key in dead_encounter:
df.loc[i] = (key, dead_encounter[key].strftime('%Y-%m-%d'))
i = i+1
11. Reading from csv file
import pandas as pd
events = pd.read_csv(filepath + 'events.csv')
12. Writing to csv file
dataframe.to_csv(file_path + 'filename.csv', columns=['col_1', 'col_2'], index=False)
13. Appending to csv file
dataframe.to_csv(file_path + 'filename.csv', columns=['col_1', 'col_2'], index=False,mode='a', header=False)
14. from dataframe with having datatime column sub 30 days.
from datetime import datetime, timedelta
df = pd.read_csv(filepath + 'events.csv')
print df
df['indx_date'] = pd.to_datetime(df['indx_date'])
df['indx_date'] = df_deceased['indx_date']-timedelta(days=30)
print df
output:
patient_id indx_date
0 8193.0 2012-12-31
1 24579.0 2015-08-07
patient_id indx_date
0 8193.0 2012-12-01
1 24579.0 2015-07-08
15. Remove complete row if value in particular column is NaN
df = df[pd.notnull(df["column_name"])]
16. Open and write line to file.
f = open(deliverables_path + 'file_name.csv', 'w')
f.write("name_id,feature_id,feature_value\n") # python will convert \n to os.linesep
f.close()
17. column in pandas:
>>df = df.drop('column_name', 1)
0 for rows and 1 for columns.
In place drop without reassign df:
>>df.drop('column_name', axis=1, inplace=True)
To drop by column number instead of by column name:
df.drop(df.columns[[0, 1, 3]], axis=1)
deletes 1st, 2nd & 4th column.
18. change particular value in dataframe.
df.set_value(index, "column_name", value_to_update)
index -> is row index (can specify even name but should be key)
19. Convert dataframe to dictonary.
df_dict = df.set_index('column_id').T.to_dict('list')
output ex :
column_id feature_id feature_value
0 8193.0 3171.0 0.039215686274509803
{8193.0: [3171.0, 0.039215686274509803]}
df_dict = df.set_index('column_id').T.to_dict('records')
output ex :
column_id feature_id
0 8193.0 3171.0
{8193.0: 3171.0}
https://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.to_dict.html
20. Sorting list inside list with given index,
from operator import itemgetter
print list
list = sorted(list[key], key=itemgetter(0))
print list
Output:
[(2741.0, 1.0), (2751.0, 1.0), (2760.0, 1.0), (2841.0, 1.0), (2880.0, 1.0), (2914.0, 1.0), (2948.0, 1.0), (3008.0, 1.0), (3049.0, 1.0), (1193.0, 1.0), (1340.0, 1.0), (1658.0, 1.0), (1723.0, 1.0), (2341.0, 1.0), (2414.0, 1.0)]
[(1193.0, 1.0), (1340.0, 1.0), (1658.0, 1.0), (1723.0, 1.0), (2341.0, 1.0), (2414.0, 1.0), (2741.0, 1.0), (2751.0, 1.0), (2760.0, 1.0), (2841.0, 1.0), (2880.0, 1.0), (2914.0, 1.0), (2948.0, 1.0), (3008.0, 1.0), (3049.0, 1.0)]