
Date Processing and Feature Engineering in Python


Have a look at some code to streamline the parsing and processing of dates in Python, including the engineering of some useful and common features.


Figure: Photo by Sonja Langford on Unsplash

 

Maybe, like me, you deal with dates a lot when processing data in Python. Maybe, also like me, you get frustrated with dealing with dates in Python, and find you consult the documentation far too often to do the same things over and over again.

Like anyone who codes and finds themselves doing the same thing more than a handful of times, I wanted to make my life easier by automating some common date processing tasks, as well as some simple and frequent feature engineering, so that my common date parsing and processing tasks for a given date could be done with a single function call. I could then select which features I was interested in extracting at a given time afterwards.

This date processing is accomplished via a single Python function, which accepts a single date string formatted as 'YYYY-MM-DD' (because that’s how dates are formatted), and which returns a dictionary consisting of (currently) 18 key/value feature pairs. Some of these keys are very straightforward (e.g. the parsed four-digit year) while others are engineered (e.g. whether or not the date is a public holiday). If you find this code at all useful, you should be able to figure out how to alter or extend it to suit your needs. For ideas on additional date/time-related features you may want to generate, check out this article.

Most of the functionality is accomplished using the Python datetime module, much of which relies on the strftime() method. The real benefit, however, is that there is a standard, automated approach to the same repetitive queries.
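As a minimal sketch of that pattern, the snippet below parses a date string with strptime() and then pulls a few features out with strftime() directives. The directives mirror the ones used in the function in this article; the small features dictionary here is only for illustration.

```python
import datetime

# Parse a 'YYYY-MM-DD' string into a date object
d = datetime.datetime.strptime('2021-07-20', '%Y-%m-%d').date()

# Each strftime() directive returns a string, not a number
features = {
    'year': d.strftime('%Y'),          # four-digit year
    'month_text_s': d.strftime('%b'),  # abbreviated month name (locale-dependent)
    'doy': d.strftime('%j'),           # day of year, zero-padded to 3 digits
    'woy': d.strftime('%W'),           # week of year, weeks starting on Monday
}
print(features)
```

Note that every value comes back as a string, so cast with int() wherever a numeric feature is needed.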

The only non-standard library used is holidays, a “fast, efficient Python library for generating country, province and state specific sets of holidays on the fly.” While the library can accommodate a whole host of national and sub-national holidays, I have used the US national holidays for this example. With a quick glance at the project’s documentation and the code below, you will very easily determine how to change this if needed.

So, let’s first take a look at the process_date() function. The comments should provide insight into what is going on, should you need it.

import datetime, re, sys, holidays

def process_date(input_str: str) -> dict:
    """Processes and engineers simple features for date strings

    Parameters:
      input_str (str): Date string of format '2021-07-14'

    Returns:
      dict: Dictionary of processed date features
    """

    # Validate date string input (the whole string must match YYYY-MM-DD)
    regex = re.compile(r'\d{4}-\d{2}-\d{2}')
    if not re.fullmatch(regex, input_str):
        print("Invalid date format")
        sys.exit(1)

    # Process date features
    my_date = datetime.datetime.strptime(input_str, '%Y-%m-%d').date()
    now = datetime.datetime.now().date()
    date_feats = {}

    date_feats['date'] = input_str
    date_feats['year'] = my_date.strftime('%Y')
    date_feats['year_s'] = my_date.strftime('%y')
    date_feats['month_num'] = my_date.strftime('%m')
    date_feats['month_text_l'] = my_date.strftime('%B')
    date_feats['month_text_s'] = my_date.strftime('%b')
    date_feats['dom'] = my_date.strftime('%d')
    date_feats['doy'] = my_date.strftime('%j')
    date_feats['woy'] = my_date.strftime('%W')

    # Fixing day of week to start on Mon (1), end on Sun (7)
    # dow stays a string so the comparisons below also match Sunday
    dow = my_date.strftime('%w')
    if dow == '0': dow = '7'
    date_feats['dow_num'] = dow

    if dow == '1':
        date_feats['dow_text_l'] = 'Monday'
        date_feats['dow_text_s'] = 'Mon'
    if dow == '2':
        date_feats['dow_text_l'] = 'Tuesday'
        date_feats['dow_text_s'] = 'Tue'
    if dow == '3':
        date_feats['dow_text_l'] = 'Wednesday'
        date_feats['dow_text_s'] = 'Wed'
    if dow == '4':
        date_feats['dow_text_l'] = 'Thursday'
        date_feats['dow_text_s'] = 'Thu'
    if dow == '5':
        date_feats['dow_text_l'] = 'Friday'
        date_feats['dow_text_s'] = 'Fri'
    if dow == '6':
        date_feats['dow_text_l'] = 'Saturday'
        date_feats['dow_text_s'] = 'Sat'
    if dow == '7':
        date_feats['dow_text_l'] = 'Sunday'
        date_feats['dow_text_s'] = 'Sun'

    if int(dow) > 5:
        date_feats['is_weekday'] = False
        date_feats['is_weekend'] = True
    else:
        date_feats['is_weekday'] = True
        date_feats['is_weekend'] = False

    # Check date in relation to holidays
    us_holidays = holidays.UnitedStates()
    date_feats['is_holiday'] = input_str in us_holidays
    date_feats['is_day_before_holiday'] = (my_date + datetime.timedelta(days=1)) in us_holidays
    date_feats['is_day_after_holiday'] = (my_date - datetime.timedelta(days=1)) in us_holidays

    # Days from today
    date_feats['days_from_today'] = (my_date - now).days

    return date_feats


A few points to note:

  • The strftime('%w') directive numbers days of the week from Sunday (0) to Saturday (6); for me, and my processing, weeks start on Monday and end on Sunday, and I don’t need a day 0 (as opposed to starting the week on day 1), so the numbering was changed to run from Monday (1) to Sunday (7)
  • A weekday/weekend feature was easy to create
  • Holiday-related features were easy to engineer using the holidays library, and performing simple date addition and subtraction; again, substituting other national or sub-national holidays (or adding to the existing) would be easy to do
  • A days_from_today feature was created with another line or two of simple date math; negative numbers are the number of days a given date was before today, while positive numbers are days from today until the given date
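The day-numbering and date-math points above can be sketched with the standard library alone. This is an illustration rather than part of process_date(): it compares the '%w' directive with date.isoweekday(), which already uses the Monday (1) to Sunday (7) convention, and the days_from() helper is a hypothetical name that takes an explicit reference date so the result does not depend on when it is run.

```python
import datetime

sunday = datetime.date(2021, 7, 18)
tuesday = datetime.date(2021, 7, 20)

# '%w' counts from Sunday = 0; isoweekday() counts from Monday = 1 to Sunday = 7
print(sunday.strftime('%w'), sunday.isoweekday())
print(tuesday.strftime('%w'), tuesday.isoweekday())

def days_from(reference: datetime.date, target: datetime.date) -> int:
    """Signed number of days from reference to target (negative if target is earlier)."""
    return (target - reference).days

reference = datetime.date(2021, 7, 14)
print(days_from(reference, tuesday))                     # target is after the reference
print(days_from(reference, datetime.date(2021, 1, 1)))   # target is before the reference
```

If you prefer integers to strings for the day number, isoweekday() could replace the '%w' adjustment entirely.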

I don’t personally need, for example, an is_end_of_month feature, but you should be able to see how this could be added to the above code with relative ease at this point. Give some customization a try for yourself.
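One possible way to add such a feature, as a sketch only: the helper below is a hypothetical standalone function (not part of the original code) that uses calendar.monthrange() from the standard library to find the last day of the month.

```python
import calendar
import datetime

def is_end_of_month(d: datetime.date) -> bool:
    """Return True if d falls on the last day of its month."""
    # monthrange() returns (weekday of the 1st, number of days in the month)
    last_day = calendar.monthrange(d.year, d.month)[1]
    return d.day == last_day

print(is_end_of_month(datetime.date(2021, 7, 31)))
print(is_end_of_month(datetime.date(2021, 7, 20)))
print(is_end_of_month(datetime.date(2020, 2, 29)))  # leap year
```

Inside process_date() this would amount to one extra line, for example date_feats['is_end_of_month'] = is_end_of_month(my_date).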

Now let’s test it out. We will process one date and print out what is returned, the full dictionary of key-value feature pairs.

import pprint
my_date = process_date('2021-07-20')
pprint.pprint(my_date)


{'date': '2021-07-20',
 'days_from_today': 6,
 'dom': '20',
 'dow_num': '2',
 'dow_text_l': 'Tuesday',
 'dow_text_s': 'Tue',
 'doy': '201',
 'is_day_after_holiday': False,
 'is_day_before_holiday': False,
 'is_holiday': False,
 'is_weekday': True,
 'is_weekend': False,
 'month_num': '07',
 'month_text_l': 'July',
 'month_text_s': 'Jul',
 'woy': '29',
 'year': '2021',
 'year_s': '21'}


Here you can see the full list of feature keys, and corresponding values. Now, in a normal situation I won’t need to print out the entire dictionary, but instead get the values of a particular key or set of keys.
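Picking out a subset of keys might look like the sketch below. To keep it self-contained, date_feats is a hand-copied subset of the dictionary printed above rather than a live call to process_date(), and the wanted list is just an example selection.

```python
# A few of the key/value pairs from the printed dictionary, copied by hand
date_feats = {
    'date': '2021-07-20',
    'year': '2021',
    'month_text_s': 'Jul',
    'dow_text_l': 'Tuesday',
    'is_weekend': False,
    'is_holiday': False,
}

# Keep only the features of interest
wanted = ['year', 'month_text_s', 'is_weekend']
selected = {key: date_feats[key] for key in wanted}
print(selected)
```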

We can demonstrate how this might work practically with the below code. We will create a list of dates, and then process this list of dates one by one, ultimately creating a Pandas data frame of a selection of processed date features, printing it out to screen.

import pandas as pd

dates = ['2021-01-01', '2020-04-04', '1993-05-11', '2002-07-19', '2024-11-03', '2050-12-25']

# Feature keys to keep, mapped to the column names used in the data frame
columns = {'date': 'date',
           'year': 'year',
           'month_num': 'month_num',
           'month_text_s': 'month',
           'dom': 'day_of_month',
           'doy': 'day_of_year',
           'woy': 'week_of_year',
           'is_weekend': 'is_weekend',
           'is_holiday': 'is_holiday',
           'days_from_today': 'days_from_today'}

# DataFrame.append() was removed in pandas 2.0, so collect the rows in a list
# and build the data frame once (booleans and integers then keep their types
# instead of being cast to floats)
rows = []
for d in dates:
    my_date = process_date(d)
    rows.append([my_date[key] for key in columns])

df = pd.DataFrame(rows, columns=list(columns.values()))
df.set_index('date', inplace=True)
print(df)


            year  month_num  month  day_of_month  day_of_year  week_of_year  is_weekend  is_holiday  days_from_today
date
2021-01-01  2021         01    Jan            01          001            00         0.0         1.0           -194.0
2020-04-04  2020         04    Apr            04          095            13         1.0         0.0           -466.0
1993-05-11  1993         05    May            11          131            19         0.0         0.0         -10291.0
2002-07-19  2002         07    Jul            19          200            28         0.0         0.0          -6935.0
2024-11-03  2024         11    Nov            03          308            44         1.0         0.0           1208.0
2050-12-25  2050         12    Dec            25          359            51         1.0         1.0          10756.0


And this data frame hopefully gives you a better idea of how this functionality could be useful in practice.

Good luck, and happy data processing.

 

Source: https://www.kdnuggets.com/2021/07/date-pre-processing-feature-engineering-python.html
