Explore visualizations with AWS Glue interactive sessions | Amazon Web Services

AWS Glue interactive sessions offer a powerful way to iteratively explore datasets and fine-tune transformations using Jupyter-compatible notebooks. Interactive sessions enable you to work with a choice of popular integrated development environments (IDEs) in your local environment or with AWS Glue or Amazon SageMaker Studio notebooks on the AWS Management Console, all while seamlessly harnessing the power of a scalable, on-demand Apache Spark backend. This post is part of a series exploring the features of AWS Glue interactive sessions.

AWS Glue interactive sessions now include native support for the matplotlib visualization library (AWS Glue version 3.0 and later). In this post, we look at how we can use matplotlib and Seaborn to explore and visualize data using AWS Glue interactive sessions, facilitating rapid insights without complex infrastructure setup.

Solution overview

You can quickly provision new interactive sessions directly from your notebook without needing to interact with the AWS Command Line Interface (AWS CLI) or the console. You can use magic commands to provide configuration options for your session and install any additional Python modules that are needed.

In this post, we use the classic Iris and MNIST datasets to navigate through a few commonly used visualization techniques using matplotlib on AWS Glue interactive sessions.

Create visualizations using AWS Glue interactive sessions

We start by installing the scikit-learn and Seaborn libraries using the %additional_python_modules Jupyter magic command:

%additional_python_modules scikit-learn, seaborn

You can also upload Python wheel modules to Amazon Simple Storage Service (Amazon S3) and specify the full path as a parameter value to the additional_python_modules magic command.
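As a rough sketch, a first notebook cell that combines session configuration with module installation might look like the following. The values are illustrative only, and the bucket name and wheel file name are hypothetical placeholders, not taken from this post. Configuration magics like these need to run before the session starts to take effect.

```
# Illustrative values only; the S3 bucket and wheel file below are hypothetical
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2
%idle_timeout 30
%additional_python_modules scikit-learn,seaborn,s3://amzn-s3-demo-bucket/wheels/my_module-0.1.0-py3-none-any.whl
```

Listing PyPI package names and S3 wheel paths in the same comma-separated value keeps all dependencies in one place.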

Now, let’s run a few visualizations on the Iris and MNIST datasets.

  1. Create a pair plot using Seaborn to uncover patterns within sepal and petal measurements across the iris species:
    import seaborn as sns
    import matplotlib.pyplot as plt
    # Load the Iris dataset
    iris = sns.load_dataset("iris")
    # Create a pair plot
    sns.pairplot(iris, hue="species")
    %matplot plt

  2. Create a violin plot to reveal the distribution of the sepal width measure across the three species of iris flowers:
    # Create a violin plot of the Sepal Width measure
    plt.figure(figsize=(10, 6))
    sns.violinplot(x="species", y="sepal_width", data=iris)
    plt.title("Violin Plot of Sepal Width by Species")
    plt.show()
    %matplot plt

  3. Create a heat map to display correlations across the iris dataset variables:
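    # Calculate the correlation matrix over the numeric columns only
    # (the non-numeric "species" column makes corr() fail on pandas 2.x)
    correlation_matrix = iris.drop(columns="species").corr()
    # Create a heatmap using Seaborn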
    plt.figure(figsize=(8, 6))
    sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
    plt.title("Correlation Heatmap")
    %matplot plt

  4. Create a scatter plot on the MNIST dataset using PCA to visualize distributions among the handwritten digits:
    import matplotlib.pyplot as plt
    from sklearn.datasets import fetch_openml
    from sklearn.decomposition import PCA
    # Load the MNIST dataset
    mnist = fetch_openml('mnist_784', version=1)
    X, y = mnist['data'], mnist['target']
    # Apply PCA to reduce dimensions to 2 for visualization
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X)
    # Scatter plot of the reduced data
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y.astype(int), cmap='viridis', s=5)
    plt.xlabel("Principal Component 1")
    plt.ylabel("Principal Component 2")
    plt.title("PCA - MNIST Dataset")
    plt.colorbar(label="Digit Class")
    %matplot plt

  5. Create another visualization using matplotlib and the mplot3d toolkit:
    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    # Generate mock data
    x = np.linspace(-5, 5, 100)
    y = np.linspace(-5, 5, 100)
    x, y = np.meshgrid(x, y)
    z = np.sin(np.sqrt(x**2 + y**2))
    # Create a 3D plot
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')
    # Plot the surface
    surface = ax.plot_surface(x, y, z, cmap='viridis')
    # Add color bar to map values to colors
    fig.colorbar(surface, ax=ax, shrink=0.5, aspect=10)
    # Set labels and title
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.set_title('3D Surface Plot Example')
    %matplot plt

As illustrated by the preceding examples, you can use any compatible visualization library by installing the required modules and then using the %matplot magic command.
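To make the PCA step in the MNIST example more concrete, the following is a minimal NumPy-only sketch of the projection that PCA performs. It is not how scikit-learn implements PCA internally. The helper name `pca_2d` and the synthetic data are assumptions made for illustration, standing in for the MNIST pixel matrix.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components.

    Hypothetical helper for illustration; returns the projected points and
    the share of total variance captured by each of the two components.
    """
    # PCA operates on mean-centered data
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:2].T
    # Squared singular values are proportional to per-component variance
    explained_variance_ratio = (S ** 2) / np.sum(S ** 2)
    return projected, explained_variance_ratio[:2]


# Synthetic stand-in data (not MNIST): 200 samples, 5 features,
# scaled so that most of the variance sits in the first two features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

X_2d, ratio = pca_2d(X)
print(X_2d.shape)
print(ratio)
```

Checking the variance share before plotting is a quick way to judge how much structure a two-dimensional scatter plot can be expected to show.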

Conclusion

In this post, we discussed how extract, transform, and load (ETL) developers and data scientists can efficiently visualize patterns in their data using familiar libraries through AWS Glue interactive sessions. With this functionality, you can focus on extracting valuable insights from your data, while AWS Glue handles the infrastructure heavy lifting using a serverless compute model. To get started today, refer to Developing AWS Glue jobs with Notebooks and Interactive sessions.


About the authors

Annie Nelson is a Senior Solutions Architect at AWS. She is a data enthusiast who enjoys problem solving and tackling complex architectural challenges with customers.

Keerthi Chadalavada is a Senior Software Development Engineer at AWS Glue. She is passionate about designing and building end-to-end solutions to address customer data integration and analytic needs.

Zach Mitchell is a Sr. Big Data Architect. He works within the product team to enhance understanding between product engineers and their customers while guiding customers through their journey to develop their enterprise data architecture on AWS.

Gal Heyne is a Product Manager for AWS Glue with a strong focus on AI/ML, data engineering, and BI. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design easy-to-use data products.
