Easily create and store features in Amazon SageMaker without code

Data scientists and machine learning (ML) engineers often prepare their data before building ML models. Data preparation typically includes data preprocessing and feature engineering. You preprocess data by transforming data into the right shape and quality for training, and you engineer features by selecting, transforming, and creating variables when building a predictive model.

Amazon SageMaker helps you perform these tasks by simplifying feature preparation with Amazon SageMaker Data Wrangler and feature storage and serving with Amazon SageMaker Feature Store. You can prepare your data and engineer features using over 300 built-in transformations with Data Wrangler. Then you can persist those features to a purpose-built feature store for ML with Feature Store. These services help you build automatic and repeatable processes to streamline your data preparation tasks, all without writing code.

We’re excited to announce a new capability that seamlessly integrates Data Wrangler with Feature Store. You can now easily create features with Data Wrangler and store those features in Feature Store with just a few clicks in Amazon SageMaker Studio.

In this post, we demonstrate creating features with Data Wrangler and persisting them in Feature Store using the hotel booking demand dataset. We focus on the data preparation and feature engineering tasks to show how easily you can create and store features in SageMaker without code using Data Wrangler. After the features are stored, they can be used for training and inference by multiple models and teams.

Solution overview

To demonstrate feature engineering and feature storage, we use a hotel booking demand dataset. You can download the dataset and view the full description of each variable. The dataset contains information such as when a hotel booking was made, the booking location, the length of stay, the number of parking spaces, and other features.

Our goal is to engineer features to predict if a user will cancel a booking.

We host the dataset in an Amazon Simple Storage Service (Amazon S3) bucket. We also use a Studio domain to work with the native Data Wrangler and Feature Store capabilities. We import the dataset into a Data Wrangler flow and define the data transformation steps we want to apply using the Data Wrangler user interface (UI). We then have SageMaker run our feature engineering steps and store the features in Feature Store.

The following diagram illustrates the solution workflow.

To demonstrate Data Wrangler’s feature engineering steps, we assume we’ve already conducted exploratory data analysis (EDA). EDA helps you understand your data by identifying patterns in it. For example, we might find that customers who book resort hotels tend to stay longer than those who book city hotels, or that customers who stay over the weekend purchase more meals. Because these patterns aren’t evident from tables of raw data, data scientists use visualization tools to help identify them. EDA is often a necessary step to determine which features to create, delete, and transform.
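
As a rough illustration of this kind of EDA outside of Data Wrangler, the short pandas sketch below compares the average length of stay by hotel type. The file name and column names (hotel, stays_in_week_nights, stays_in_weekend_nights, is_canceled) are assumptions based on the public version of the hotel booking demand dataset.

```python
import pandas as pd

# Assumed local copy of the hotel booking demand dataset.
df = pd.read_csv("hotel_bookings.csv")

# Average total nights per stay, split by hotel type (resort vs. city).
df["total_nights"] = df["stays_in_week_nights"] + df["stays_in_weekend_nights"]
print(df.groupby("hotel")["total_nights"].mean())

# Cancellation rate, the behavior we ultimately want to predict.
print(df["is_canceled"].mean())
```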

If you already have features ready to export to Feature Store, you can navigate to the Save features to Feature Store section to learn how you can easily save your prepared features to Feature Store.

Prerequisites

If you want to follow along with this post, you should have the following prerequisites:

Create features with Data Wrangler

To create features with Data Wrangler, complete the following steps:

  1. Enter your Studio domain.
  2. Choose Data Wrangler as your resource to view.
  3. Choose New flow.
  4. Choose Import and import your data.

You can see a preview of the data in the Data Wrangler UI when selecting your dataset. You can also choose a sampling method. Because our dataset is relatively small, we choose not to sample our data. The flow editor now shows two steps in the UI, representing the step you took to import the data and a data validation step Data Wrangler automatically completes for you.

  5. Choose the plus sign next to Data types and choose Add transform.

Assuming we’ve spent time in EDA, we can remove redundant columns that contribute to target leakage. Target leakage occurs when some data in a training dataset is strongly correlated with the target label, but isn’t available in real-world data. After we conduct a target leakage analysis, we determine we should drop redundant columns. Data Wrangler helped identify 10 columns to drop.

  6. Add a step and choose the Drop column transform step.

Additionally, we determine we can remove columns like agent and adults after a multicollinearity analysis. Multicollinearity is the presence of high correlations between two or more independent variables. We usually want to avoid variables that are correlated with each other because they can lead to misleading and inaccurate models.
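
One way to reproduce this kind of check outside of Data Wrangler is a pairwise correlation matrix over the numeric columns. This is a minimal sketch; the file name and the use of raw column names are assumptions.

```python
import pandas as pd

# Assumed local copy of the hotel booking demand dataset.
df = pd.read_csv("hotel_bookings.csv")

# Pairwise correlation between numeric columns; values close to +/-1
# flag candidate columns to drop for multicollinearity.
corr = df.select_dtypes(include="number").corr().abs()

# List the most strongly correlated pairs, excluding self-correlations.
pairs = corr.unstack().sort_values(ascending=False)
print(pairs[pairs < 1.0].head(10))
```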

We also want to drop duplicate rows. In our case, nearly 28% of all rows in our dataset are duplicates. Because duplicates may have undesirable effects on our model, we use a transform to remove them.

  7. Add a new transform and choose Manage rows from the list of available transforms.
  8. Choose Drop duplicates on the Transform drop-down menu.
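
Outside of the Data Wrangler UI, the equivalent duplicate check and removal looks roughly like the following sketch; the ~28% figure above comes from the same kind of check.

```python
import pandas as pd

# Assumed local copy of the hotel booking demand dataset.
df = pd.read_csv("hotel_bookings.csv")

# Fraction of fully duplicated rows; roughly 0.28 for this dataset.
print(df.duplicated().mean())

# Keep only the first occurrence of each duplicated row.
df = df.drop_duplicates().reset_index(drop=True)
```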

Next, we want to handle missing values. We find that many hotel guests didn’t travel with children, and have a blank value for the children column. We can replace this blank value with 0.

  9. Choose Handle missing as the transform step and Fill missing as the transform type.
  10. Add a transform to fill blank values with the value 0 by choosing children as the input column.

From our EDA, we see that there are many missing values for the country column. However, the data reveals most of the hotel guests are from Europe. We determine that missing country column values can be replaced with the most commonly occurring country—Portugal (PRT).

  11. Choose the Handle missing transform step and choose Fill missing as the transform type.
  12. Choose country as the input column, and enter PRT as the Fill value.
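
For reference, the two fill-missing steps above correspond roughly to the following pandas operations; this is a sketch that assumes the public dataset’s children and country columns.

```python
import pandas as pd

# Assumed local copy of the hotel booking demand dataset.
df = pd.read_csv("hotel_bookings.csv")

# Guests with no children recorded: treat the blank value as 0.
df["children"] = df["children"].fillna(0)

# Missing countries: fill with the most common value, Portugal (PRT).
df["country"] = df["country"].fillna("PRT")
```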

ML algorithms like linear regression, logistic regression, neural networks, and others that use gradient descent as an optimization technique require data to be scaled. Normalization (also known as min-max scaling) is a scaling technique that transforms values to be in the range of 0–1. Standardization is another scaling technique, where values are centered around the mean with unit standard deviation. In our case, we normalize the numeric feature columns to the [0, 1] range.

  13. Choose the Process numeric transform step and Scale values as the transform type.
  14. Choose Min-max scaler as the scaler and lead_time, booking_changes, adr, and others as the input columns.
  15. Leave 0 and 1 as the default min and max values.
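
Under the hood, min-max scaling maps each value to x' = (x - min) / (max - min), so every scaled column falls in [0, 1]. A minimal sketch of the same operation with scikit-learn follows; the column list is abbreviated and the file name is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Assumed local copy of the hotel booking demand dataset.
df = pd.read_csv("hotel_bookings.csv")

# A subset of the numeric columns chosen in the Data Wrangler step.
numeric_cols = ["lead_time", "booking_changes", "adr"]

# Min-max scaling: x' = (x - min) / (max - min), mapping each column to [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
```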

We also want to handle categorical data by representing it as numeric values. For example, if your categories are Dog and Cat, you may encode this information into two vectors: [1,0] to represent Dog, and [0,1] to represent Cat. For our dataset, we use one-hot encoding to turn each category within a column into its own binary (0 or 1) feature.

  16. Choose the One-hot encode transform type from the Encode categorical transform.
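
As a rough equivalent outside of Data Wrangler, one-hot encoding with pandas expands each categorical column into one binary column per category. The columns listed here are assumptions taken from the public dataset.

```python
import pandas as pd

# Assumed local copy of the hotel booking demand dataset.
df = pd.read_csv("hotel_bookings.csv")

# Each category becomes its own 0/1 column, e.g. meal -> meal_BB, meal_HB, ...
categorical_cols = ["hotel", "meal", "market_segment"]
df = pd.get_dummies(df, columns=categorical_cols)
```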

ML models are sensitive to the distribution and range of your feature values. Outliers can negatively impact model accuracy and lead to longer training times. For our dataset, we apply the standard deviation numeric outliers transform with a set of configuration values as shown in the following screenshot. We apply this transform on the numeric columns.

  17. Choose the Standard deviation numeric outliers transform type from the Handle outliers transform.
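
Conceptually, the standard deviation outlier transform treats values more than a chosen number of standard deviations from the column mean as outliers. The sketch below shows a clipping variant with an example threshold of 4 standard deviations; the threshold and column list are assumptions, not the exact configuration from the screenshot.

```python
import pandas as pd

# Assumed local copy of the hotel booking demand dataset.
df = pd.read_csv("hotel_bookings.csv")

# Clip values more than 4 standard deviations from the mean (example threshold).
for col in ["lead_time", "adr", "booking_changes"]:
    mean, std = df[col].mean(), df[col].std()
    df[col] = df[col].clip(lower=mean - 4 * std, upper=mean + 4 * std)
```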

Lastly, we want to balance the target variable for class imbalance. In Data Wrangler, we can handle class imbalance using three different techniques:

  • Random undersample
  • Random oversample
  • SMOTE
  18. In the Data Wrangler transform pane, choose Balance data as the group and choose Random oversample for the Transform field.

The ratio of positive to negative cases is around 0.38 before balancing.

After oversampling and balancing the dataset, the ratio equates to 1.
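
Random oversampling simply resamples the minority class with replacement until the classes are the same size. A minimal sketch, assuming is_canceled is the target column:

```python
import pandas as pd

# Assumed local copy of the hotel booking demand dataset.
df = pd.read_csv("hotel_bookings.csv")

pos = df[df["is_canceled"] == 1]   # minority class (cancellations)
neg = df[df["is_canceled"] == 0]   # majority class
print(len(pos) / len(neg))         # roughly 0.38 before balancing

# Resample the minority class with replacement until both classes match,
# then shuffle; each class is now 50% of the rows, a 1:1 ratio.
pos_up = pos.sample(n=len(neg), replace=True, random_state=42)
balanced = pd.concat([neg, pos_up]).sample(frac=1, random_state=42)
print(balanced["is_canceled"].value_counts(normalize=True))
```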

Now that we’ve completed our feature engineering tasks, we’re ready to export our features to Feature Store with one click.

Save features to Feature Store

You can easily export your generated features to SageMaker Feature Store by selecting it as the destination.

You can save the features into an existing feature group or create a new one. For this post, we create a new feature group. Studio directs you to a new tab where you can create a new feature group.

  1. Choose the plus sign, choose Export to, and choose SageMaker Feature Store.

  2. Choose Create Feature Group.

  3. Optionally, select Create “EventTime” column.
  4. Choose Next.

  5. Copy the JSON schema, then choose Create.

  6. Provide a feature group name and an optional description for your feature group.
  7. Select a storage configuration for the feature group: online, offline, or both.

Online stores serve features with low millisecond latency for real-time inference, whereas offline stores are ideal for retrieving your features for training models or for batch scoring. Additionally, you can run queries on your offline feature store by registering your features in an AWS Glue Data Catalog. For more information, see Query Feature Store with Athena and AWS Glue.
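
As an example of the offline-store path, the following sketch queries a feature group’s offline store with Athena through the SageMaker Python SDK. The feature group name and S3 output location are placeholders.

```python
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

# Placeholder feature group name; use the one you created in Studio.
feature_group = FeatureGroup(name="hotel-bookings-feature-group", sagemaker_session=session)

# Run an Athena query against the offline store and load the result as a DataFrame.
query = feature_group.athena_query()
query.run(
    query_string=f'SELECT * FROM "{query.table_name}" LIMIT 10',
    output_location="s3://YOUR_BUCKET/athena-query-results/",  # placeholder S3 path
)
query.wait()
df = query.as_dataframe()
```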

  8. Choose Continue.

Next, you specify the feature definitions. You specify the data type (string, integral, fractional) for each feature definition.

  9. Enter the JSON schema from the previous step to define your feature definitions.
  10. Choose Continue.

  11. Next, specify a record identifier name and an event time feature to uniquely identify a record within the feature group.

The record identifier name must refer to one of the names of a feature defined in the feature group’s feature definition. In our case, we use the existing identifier, distribution-channel, which was in our source dataset, and EventTime.
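
For context, the record identifier, event time feature, and feature definitions the UI collects map onto the CreateFeatureGroup API roughly as follows. Every value below is a placeholder, the feature list is truncated, and the Studio UI assembles an equivalent request for you.

```python
import boto3

sagemaker_client = boto3.client("sagemaker")

# Illustrative only: placeholder names, role, and S3 URI.
sagemaker_client.create_feature_group(
    FeatureGroupName="hotel-bookings-feature-group",
    RecordIdentifierFeatureName="distribution-channel",
    EventTimeFeatureName="EventTime",
    FeatureDefinitions=[
        {"FeatureName": "distribution-channel", "FeatureType": "String"},
        {"FeatureName": "lead_time", "FeatureType": "Fractional"},
        {"FeatureName": "EventTime", "FeatureType": "Fractional"},
        # ...one entry per feature in the copied JSON schema
    ],
    OnlineStoreConfig={"EnableOnlineStore": True},
    OfflineStoreConfig={"S3StorageConfig": {"S3Uri": "s3://YOUR_BUCKET/feature-store/"}},
    RoleArn="arn:aws:iam::123456789012:role/YOUR_SAGEMAKER_ROLE",
)
```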

  12. Choose Continue.

  13. Lastly, apply any relevant tags and review your feature group details.
  14. Choose Create feature group to complete the process.

  15. After the feature group is created, return to the Data Wrangler flow UI.
  16. Choose the plus sign, choose Add destination, and choose SageMaker Feature Store.

  17. Choose the destination feature group to ensure that the features you’re storing match the feature group schema.

If the newly created feature group doesn’t show up in the UI, refresh the list to reload the groups.

  18. Choose the message under the Validation column to have Data Wrangler validate the schema of the dataset against the schema of the feature group.

If you missed specifying the event time column, Data Wrangler will notify you of an error and request that you add one to your dataset.

Once validated, Data Wrangler informs you that the data frame matches the feature group schema.

  19. If you enabled both the online and offline stores for the feature group, you can optionally select Write to offline store only to ingest data into the offline store only.

This is helpful for historical data backfilling scenarios.

  20. Choose Add to add another step to our Data Wrangler flow.
  21. With all our steps defined, choose Create job to run our ML workflow from feature engineering to ingesting features into our feature group.

  22. Give the job a name, then provide the job specifications like the type and number of instances.
  23. Choose Run.

Congratulations! You’ve successfully engineered features using Data Wrangler and stored them in a persistent feature store without writing any code. You can easily explore features, see details of your feature group, and update the feature group schema when necessary.

Conclusion

In this post, we created features with Data Wrangler, and easily stored those features in Feature Store. We showed an example workflow for feature engineering in the Data Wrangler UI. Then we saved those features into Feature Store directly from Data Wrangler by creating a new feature group. Finally, we ran a processing job to ingest those features into Feature Store. These services helped us build automatic and repeatable processes to streamline our data preparation tasks, all without writing code.

With this new integration, you can accelerate your ML tasks with a more streamlined experience between feature engineering and feature ingestion. For more information, refer to Get started with Data Wrangler and Get started with Amazon SageMaker Feature Store.


About the Authors

Peter Chung is a Solutions Architect for AWS, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions in both the public and private sectors. He holds all AWS certifications as well as two GCP certifications. He enjoys coffee, cooking, staying active, and spending time with his family.

Patrick Lin is a Software Development Engineer with Amazon SageMaker Data Wrangler. He is committed to making Amazon SageMaker Data Wrangler the number one data preparation tool for productionized ML workflows. Outside of work, you can find him reading, listening to music, hanging out with friends, and serving at his church.

Ziyao Huang is a Software Development Engineer with Amazon SageMaker Data Wrangler. He is passionate about building great products that make ML easy for customers. Outside of work, Ziyao likes to read and hang out with his friends.
