Sunday, September 3, 2017

Python Tips

1. Not to truncate the pandas dataset while printing.
        pd.options.display.max_rows = -1
 (-1 will print entire dataset or give number like 4000 to print 4000 lines in dataset)
2. Converting string to datatime object
    from datetime import datetime
    date_string = "2012-08-12"
    current_date = datetime.strptime(date_string, "%Y-%m-%d")

3. Find delta days between 2 dates
  delta = abs(current_date - old date)
  delta.days -> give total days


4. Find min, max & mean(average) in dictionary key value pair.

Ex :

    print d

    min_len = min(d.items(), key=lambda x: x[1])[1]
    max_len = max(d.items(), key=lambda x: x[1])[1]
    avg_len = float(sum(d.values()))/len(d)

    print min_len, max_len, avg_len

Output :
              {19L: 4, 24581L: 633}
               4 633 318.5

5. Merging dataframes and filling NaN values.

    merged_df = pd.merge(df_1, df_1, on='key_name_to_be_given', how='outer')
    merged_df.fillna(0, inplace=True)

    outer implies union(default is inner -> intersection)


6. To find if string contains one of multiple substring in dataframe row element and return those rows which has one amount those.
   bool_df  = df["column_name"].str.contains("DIAG|LAB|DRUG")
   final_df = df[bool_df == True]

Here, df is panda dataframe, and it search/finds rows which has substring DIAG or LAB or DRUG
bool_df has True or False.
final_df will have only rows which had substring DIAG or LAB or DRUG.


7. Grouping elements in dataframe
    group = df.groupby(["column_name_1", "column_name_2"])
    for p in group.groups:
        print p, " has ", len(group.groups[p]), " entries"

Can add single or multiple column names.



8. Creating new panda dataframe with filtered columns,
Assume : old dataframe has A, B, C, D, E columns.
 
new = old.filter(['A','B','D'], axis=1)
 
Now, new will have A, B, D columns.


9. Converting datatime object to string

Ex:
print datatime_obj
print datatime_obj.strftime('%Y-%m-%d')

2012-12-31 00:00:00

2012-12-31

10. Appending rows to the dataframe.

Given dead_encounter is dictionary of key, value pair.
Create empty dataframe and then add the value from dead_encounter to dataframe df

    df = pd.DataFrame(columns=('patient_id', 'indx_date'))
    i = 0
    for key in dead_encounter:
         df.loc[i] = (key, dead_encounter[key].strftime('%Y-%m-%d'))
         i = i+1

11. Reading from csv file
import pandas as pd
events = pd.read_csv(filepath + 'events.csv')

12. Writing to csv file
dataframe.to_csv(file_path + 'filename.csv', columns=['col_1', 'col_2'], index=False)

13. Appending to csv file
dataframe.to_csv(file_path + 'filename.csv', columns=['col_1', 'col_2'], index=False,mode='a', header=False)


14. from dataframe with having datatime column sub 30 days.

from datetime import datetime, timedelta

    df = pd.read_csv(filepath + 'events.csv')
    print df
    df['indx_date'] = pd.to_datetime(df['indx_date'])
    df['indx_date'] = df_deceased['indx_date']-timedelta(days=30)
    print df

output:
     patient_id   indx_date
0        8193.0  2012-12-31
1       24579.0  2015-08-07

     patient_id  indx_date
0        8193.0 2012-12-01
1       24579.0 2015-07-08

15. Remove complete row if value in particular column is NaN

df = df[pd.notnull(df["column_name"])]

16. Open and write line to file.

    f = open(deliverables_path + 'file_name.csv', 'w')
    f.write("name_id,feature_id,feature_value\n")  # python will convert \n to os.linesep
    f.close()

17. column in pandas:

>>df = df.drop('column_name', 1)

0 for rows and 1 for columns.

In place drop without reassign df:

>>df.drop('column_name', axis=1, inplace=True)

To drop by column number instead of by column name:

df.drop(df.columns[[0, 1, 3]], axis=1)

deletes 1st, 2nd & 4th column.


18. change particular value in dataframe.
df.set_value(index, "column_name", value_to_update)

index -> is row index (can specify even name but should be key)

19. Convert dataframe to dictonary.

df_dict = df.set_index('column_id').T.to_dict('list')
output ex :
       column_id  feature_id  feature_value
0         8193.0   3171.0         0.039215686274509803
{8193.0: [3171.0, 0.039215686274509803]}

df_dict = df.set_index('column_id').T.to_dict('records')
output ex :
       column_id  feature_id
0         8193.0   3171.0      
{8193.0: 3171.0}



https://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.to_dict.html

20. Sorting list inside list with given index,




from operator import itemgetter

print list

list = sorted(list[key], key=itemgetter(0))
print list

Output:
 [(2741.0, 1.0), (2751.0, 1.0), (2760.0, 1.0), (2841.0, 1.0), (2880.0, 1.0), (2914.0, 1.0), (2948.0, 1.0), (3008.0, 1.0), (3049.0, 1.0), (1193.0, 1.0), (1340.0, 1.0), (1658.0, 1.0), (1723.0, 1.0), (2341.0, 1.0), (2414.0, 1.0)]

 [(1193.0, 1.0), (1340.0, 1.0), (1658.0, 1.0), (1723.0, 1.0), (2341.0, 1.0), (2414.0, 1.0), (2741.0, 1.0), (2751.0, 1.0), (2760.0, 1.0), (2841.0, 1.0), (2880.0, 1.0), (2914.0, 1.0), (2948.0, 1.0), (3008.0, 1.0), (3049.0, 1.0)]