
Big Data

Essential Functionalities to Guide You While Using AWS Glue and PySpark!




In this post, I have penned down AWS Glue and PySpark functionalities that can be helpful when you are creating an AWS data pipeline and writing AWS Glue PySpark scripts.


AWS Glue is a fully managed extract, transform, and load (ETL) service to process large amounts of datasets from various sources for analytics and data processing.

While creating the AWS Glue job, you can select between Spark, Spark Streaming, and Python shell. These jobs can run a proposed script generated by AWS Glue, an existing script that you provide, or a new script authored by you. You can also select different monitoring options, job execution capacity, timeouts, delayed notification thresholds, and non-overridable and overridable parameters.


AWS recently launched Glue version 2.0, which features 10x faster Spark ETL job start times and reduces the billing minimum from 10 minutes to 1 minute.

With AWS Glue, you can create a development endpoint and configure a SageMaker or Zeppelin notebook to develop and test your Glue ETL scripts.


I create a SageMaker notebook connected to the dev endpoint to author and test the ETL scripts. You can spin up the notebook with whichever language you are comfortable with.


Now, let’s talk about some specific features and functionalities in AWS Glue and PySpark which can be helpful.

1. Spark DataFrames

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database. You can create a DataFrame from an RDD or from file formats like CSV, JSON, and Parquet.

With the SageMaker Sparkmagic (PySpark) kernel notebook, the Spark session is created automatically.


To create a DataFrame –

# from CSV files
S3_IN = "s3://mybucket/train/training.csv"
csv_df = (
    .format("org.apache.spark.csv")
    .option("header", True)
    .option("quote", '"')
    .option("escape", '"')
    .option("inferSchema", True)
    .option("ignoreLeadingWhiteSpace", True)
    .option("ignoreTrailingWhiteSpace", True)
    .csv(S3_IN, multiLine=False)
)

# from PARQUET files
S3_PARQUET = "s3://mybucket/folder1/dt=2020-08-24-19-28/"
df =

# from JSON files
S3_JSON = "s3://mybucket/jsonfolder/"
df =

# from a multiline JSON file
df =, multiLine=True)

2. GlueContext

GlueContext is the entry point for reading and writing DynamicFrames in AWS Glue. It wraps the Apache SparkSQL SQLContext object providing mechanisms for interacting with the Apache Spark platform.

from awsglue.job import Job
from awsglue.transforms import *
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())


3. DynamicFrame

AWS Glue DynamicFrames are similar to SparkSQL DataFrames: a DynamicFrame represents a distributed collection of data, but without requiring you to specify a schema up front. It can also be used to read and transform data that contains inconsistent values and types.

DynamicFrame can be created using the following options –

  • create_dynamic_frame_from_rdd — created from an Apache Spark Resilient Distributed Dataset (RDD)
  • create_dynamic_frame_from_catalog — created using a Glue catalog database and table name
  • create_dynamic_frame_from_options — created with the specified connection and format. Example — The connection type, such as Amazon S3, Amazon Redshift, and JDBC

DynamicFrames can be converted to and from DataFrames using .toDF() and fromDF(). Use the following syntax-

# create DynamicFrame from S3 parquet files
datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [S3_location]},
    format="parquet",
    transformation_ctx="datasource0")

# create DynamicFrame from the Glue catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="demo",
    table_name="testtable",
    transformation_ctx="datasource0")

# convert to a Spark DataFrame
df1 = datasource0.toDF()

# convert back to a Glue DynamicFrame
df2 = DynamicFrame.fromDF(df1, glueContext, "df2")

You can read more about this here.

4. AWS Glue Job Bookmark

AWS Glue job bookmarks help process incremental data when rerunning a job on a scheduled interval, preventing reprocessing of old data.

You can read more about this here. Also, you can read this.

5. Write out data

The DynamicFrame of the transformed dataset can be written out to S3 as non-partitioned (default) or partitioned data. The “partitionKeys” parameter can be specified in connection_options to write the data out to S3 partitioned. AWS Glue organizes these datasets in Hive-style partitions.

In the following code example, the AWS Glue DynamicFrame is partitioned by year, month, day, and hour, and written in parquet format in Hive-style partitions to S3.


S3_location = "s3://bucket_name/table_name"

datasink = glueContext.write_dynamic_frame_from_options(
    frame=data,
    connection_type="s3",
    connection_options={
        "path": S3_location,
        "partitionKeys": ["year", "month", "day", "hour"]},
    format="parquet",
    transformation_ctx="datasink")

You can read more about this here.

6. “glueparquet” format option

glueparquet is a performance-optimized Apache parquet writer type for writing DynamicFrames. It computes and modifies the schema dynamically.

datasink = glueContext.write_dynamic_frame_from_options(
    frame=dynamicframe,
    connection_type="s3",
    connection_options={
        "path": S3_location,
        "partitionKeys": ["year", "month", "day", "hour"]},
    format="glueparquet",
    format_options={"compression": "snappy"},
    transformation_ctx="datasink")

You can read more about this here.

7. S3 Lister and other options for optimizing memory management

AWS Glue provides an optimized mechanism to list files on S3 while reading data into a DynamicFrame. It can be enabled by setting the additional_options parameter “useS3ListImplementation” to True.

You can read more about this here.

8. Purge S3 path

purge_s3_path is a nice option available to delete files from a specified S3 path recursively, based on a retention period or other available filters. As an example, suppose you are running an AWS Glue job to fully refresh a table each day, writing the data to S3 with the naming convention s3://bucket-name/table-name/dt=<date-time>. Based on the defined retention period, you can delete the dt=<date-time> S3 folders from the Glue job itself. Another option is to set an S3 bucket lifecycle policy with the prefix.

# purge locations older than 3 days (retentionPeriod is specified in hours)
print("Attempting to purge S3 path with retention set to 3 days.")
glueContext.purge_s3_path(
    s3_path=output_loc,
    options={"retentionPeriod": 72})

You also have other options like purge_table, transition_table, and transition_s3_path available. The transition_table option transitions the storage class of the files stored on Amazon S3 for the specified catalog database and table.

You can read more about this here.

9. Relationalize Class

The Relationalize class can help flatten nested JSON to the outermost level.

You can read more about this here.

10. Unbox Class

The Unbox class helps unbox a string field in a DynamicFrame into the specified format type (optional).

You can read more about this here.

11. Unnest Class

The Unnest class flattens nested objects to top-level elements in a DynamicFrame.

|-- id: string
|-- type: string
|-- content: map
| |-- keyType: string
| |-- valueType: string

With the content attribute/column being a map type, we can use the Unnest class to promote each key element to a top-level column.

unnested = UnnestFrame.apply(frame=data_dynamic_dframe)
|-- id: string
|-- type: string
|-- content.dateLastUpdated: string
|-- content.creator: string
|-- content.dateCreated: string
|-- content.title: string
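To see what UnnestFrame is doing conceptually, here is a plain-Python sketch that promotes each key of a map-type column to a dotted top-level field (flatten_map is our own illustrative helper, not a Glue API, and the sample row is made up):

```python
def flatten_map(record, sep="."):
    # promote nested dict keys to dotted top-level keys,
    # mirroring what Glue's UnnestFrame does for map columns
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

row = {"id": "1", "type": "doc", "content": {"title": "Glue", "creator": "anand"}}
flatten_map(row)
# {'id': '1', 'type': 'doc', 'content.title': 'Glue', 'content.creator': 'anand'}
```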

12. printSchema()

To print the Spark or Glue DynamicFrame schema in tree format use printSchema().

|-- ID: int
|-- Name: string
|-- Identity: string
|-- Alignment: string
|-- EyeColor: string
|-- HairColor: string
|-- Gender: string
|-- Status: string
|-- Appearances: int
|-- FirstAppearance: choice
| |-- int
| |-- long
| |-- string
|-- Year: int
|-- Universe: string

13. Fields Selection

select_fields can be used to select fields from Glue DynamicFrame.

# From DynamicFrame
datasource0.select_fields(["Status", "HairColor"]).toDF().distinct().show()


To select fields from a Spark DataFrame, use “select” –

# From DataFrame["Status", "HairColor"]).distinct().show()


14. Timestamp

For instance, suppose an application writes data into DynamoDB and has a last_updated attribute/column. DynamoDB does not natively support a date/timestamp data type, so you could store it either as String or as Number. When stored as a number, it is usually epoch time, the number of seconds since 00:00:00 UTC on 1 January 1970. You could see something like “1598331963”, which is 2020-08-25T05:06:03+00:00 in ISO 8601.
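You can sanity-check that conversion with nothing more than the Python standard library (the value below is the one from the example above):

```python
from datetime import datetime, timezone

epoch = 1598331963  # epoch seconds, as stored in DynamoDB as a Number
iso = datetime.fromtimestamp(epoch, tz=timezone.utc).isoformat()
# '2020-08-25T05:06:03+00:00'
```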

You can read more about Timestamp here.

How can you convert it to a timestamp?

When you read the data using AWS Glue DynamicFrame and view the schema, it will show it as “long” data type.

|-- version: string
|-- item_id: string
|-- status: string
|-- event_type: string
|-- last_updated: long

To convert the last_updated long data type into timestamp data type, you can use the following code-

import pyspark.sql.functions as f
import pyspark.sql.types as t

new_df = (df
    # divide by 1000 only if the value is stored in epoch milliseconds
    .withColumn("last_updated",
                f.from_unixtime(f.col("last_updated")/1000).cast(t.TimestampType()))
)

15. Temporary View from Spark DataFrame

In case you want to store the Spark DataFrame as a table and query it with Spark SQL, you can convert the DataFrame into a temporary view, available only in that Spark session, using createOrReplaceTempView.

df = spark.createDataFrame(
    [
        (1, ['a', 'b', 'c'], 90.00),
        (2, ['x', 'y'], 99.99),
    ],
    ['id', 'event', 'score'])

df.printSchema()
root
 |-- id: long (nullable = true)
 |-- event: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- score: double (nullable = true)

spark.sql("select * from example").show()
+---+---------+-----+
| id|    event|score|
+---+---------+-----+
|  1|[a, b, c]| 90.0|
|  2|   [x, y]|99.99|
+---+---------+-----+

16. Extract element from ArrayType

Suppose from the above example, you want to create a new attribute/column to store only the last event. How would you do it?

You use the element_at function. It returns the element of the array at the given index if the column is an array, and it can also be used to extract the value for a given key if the column is a map.

from pyspark.sql.functions import element_at

newdf = df.withColumn("last_event", element_at("event", -1))
root
 |-- id: long (nullable = true)
 |-- event: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- score: double (nullable = true)
 |-- last_event: string (nullable = true)

+---+---------+-----+----------+
| id|    event|score|last_event|
+---+---------+-----+----------+
|  1|[a, b, c]| 90.0|         c|
|  2|   [x, y]|99.99|         y|
+---+---------+-----+----------+

17. explode

The explode function in PySpark is used to explode array or map columns in rows. For example, let’s try to explode “event” column from the above example-

from pyspark.sql.functions import explode

df1 =, explode(df.event))

root
 |-- id: long (nullable = true)
 |-- col: string (nullable = true)

+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
|  2|  x|
|  2|  y|
+---+---+
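For intuition, explode produces one output row per (id, array element) pair, which you can mimic in plain Python without a Spark session (the sample data mirrors the DataFrame above):

```python
rows = [(1, ["a", "b", "c"]), (2, ["x", "y"])]

# one output row per (id, array element) pair, just like explode
exploded = [(row_id, element) for row_id, events in rows for element in events]
# [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'x'), (2, 'y')]
```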

18. getField

For a struct type column, if you want to get a field by name, you can use “getField”. The following is its syntax –

import pyspark.sql.functions as f
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(attributes=Row(Name='scott', Height=6.0, Hair='black')),
    Row(attributes=Row(Name='kevin', Height=6.1, Hair='brown'))])

root
 |-- attributes: struct (nullable = true)
 |    |-- Hair: string (nullable = true)
 |    |-- Height: double (nullable = true)
 |    |-- Name: string (nullable = true)

+-------------------+
|         attributes|
+-------------------+
|[black, 6.0, scott]|
|[brown, 6.1, kevin]|
+-------------------+

df1 = (df
    .withColumn("name", f.col("attributes").getField("Name"))
    .withColumn("height", f.col("attributes").getField("Height"))
    .drop("attributes")
)

+-----+------+
| name|height|
+-----+------+
|scott|   6.0|
|kevin|   6.1|
+-----+------+

19. startswith

In case you want to find records based on a string match, you can use “startswith”.

In the following example, I am searching for all records where the value of the description column starts with “[{”.

import pyspark.sql.functions as f

df.filter(f.col("description").startswith("[{")).show()
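The column method behaves like Python's own str.startswith, which you can try without a Spark session (the sample strings here are ours):

```python
descriptions = ['[{"k": 1}]', "plain text", '[{"k": 2}]']

# keep only values that start with the literal prefix "[{"
matches = [d for d in descriptions if d.startswith("[{")]
# ['[{"k": 1}]', '[{"k": 2}]']
```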

20. Extract year, month, day, hour

One of the common use cases is to write the AWS Glue DynamicFrame or Spark DataFrame to S3 in Hive-style partitions. To do so, you can extract the year, month, day, and hour, and use them as partitionKeys when writing the DynamicFrame/DataFrame to S3.

import pyspark.sql.functions as f

df2 = (raw_df
    .withColumn('year', f.year(f.col('last_updated')))
    .withColumn('month', f.month(f.col('last_updated')))
    .withColumn('day', f.dayofmonth(f.col('last_updated')))
    .withColumn('hour', f.hour(f.col('last_updated')))
)

About the Author


Anand Prakash – 5x AWS Certified | 5x Oracle Certified

Avid learner of technology solutions around databases, big data, and machine learning.
Connect on Twitter @anandp86


Related Articles



Amazing Low-Code Machine Learning Capabilities with New Ludwig Update




Integration with Ray, MLflow and TabNet are among the top features of this release.

Image Credit: Ludwig


I recently started a new newsletter focused on AI education that already has over 50,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:


If you follow this blog, you know I am a fan of the Ludwig open source project. Initially incubated by Uber and now part of the Linux AI Foundation, Ludwig provides one of the best low-code machine learning (ML) stacks in the current market. Last week, Ludwig 0.4 was open sourced, and it includes a set of cool capabilities that could make it an even stronger fit for real-world ML solutions.

What is Uber Ludwig?

Functionally, Ludwig is a framework for simplifying the processes of selecting, training and evaluating machine learning models for a given scenario. Think about configuring rather than coding machine learning models. Ludwig provides a set of model architectures that can be combined together to create an end-to-end model optimized for a specific set of requirements. Conceptually, Ludwig was designed based on a series of principles:

  • No coding required: no coding skills are required to train a model and use it for obtaining predictions.
  • Generality: a new data type-based approach to deep learning model design that makes the tool usable across many different use cases.
  • Flexibility: experienced users have extensive control over model building and training, while newcomers will find it easy to use.
  • Extensibility: easy to add new model architecture and new feature data types.
  • Understandability: deep learning model internals are often considered black boxes, but we provide standard visualizations to understand their performance and compare their predictions.

A Declarative Experience for Everything ML

Ludwig’s trajectory is focused on enabling configuration-based, declarative models that interact with the top ML stacks in the current market. From that perspective, Ludwig adds a layer of simplicity and a consistent experience for data science teams looking to leverage best-of-breed ML frameworks in their solutions.

Image Credit: Ludwig


Ludwig 0.4

The focus of the new release of Ludwig has been to streamline declarative models for MLOps practices. From that perspective, Ludwig 0.4 includes a set of capabilities that could simplify the implementation of MLOps pipelines in real-world solutions. Let’s review a few:

1) Ludwig on Ray

By far my favorite feature of this release was the integration with the Ray platform. Ray is one of the most complete stacks for highly scalable ML training and optimization processes. In Ludwig 0.4, data scientists can scale training workloads from a single laptop to a large Ray cluster using a few lines of configuration code.

2) Hyperparameter Search with Ray Tune

Ray Tune is a component of the Ray platform that allows distributed hyperparameter search in large clusters of nodes. Ludwig 0.4 integrates Ray Tune, enabling distributed hyperparameter search algorithms such as Population-Based Training, Bayesian Optimization, and HyperBand, among others.

3) Declarative Tabular Models with TabNet

TabNet is one of the top deep learning stacks for tabular data, incorporating cutting-edge features such as attention architectures. The new release of Ludwig enables a declarative experience for tabular models by adding a new TabNet combiner, which also includes tabular feature transformation and attention mechanisms to achieve state-of-the-art performance.

4) Experiment Tracking and Model Serving with MLflow

MLflow is rapidly becoming one of the most popular platforms for ML experiment tracking and model serving. Ludwig 0.4 enables MLflow-based experiment tracking with a single command line. Additionally, the new version of Ludwig can deploy an ML model to the MLflow registry using a simple command line statement.

Original. Reposted with permission.




Analytics Engineering Everywhere




Many new roles have appeared in the data world ever since the rise of the Data Scientist took the spotlight several years ago. Now, there is a new core player ready to take center stage, and we may see in five years, nearly every organization will have an Analytics Engineering team.

By Jason Ganz, a senior data analyst at GoCanvas.

Analytics Engineering — An Introduction

There’s a quiet revolution happening in the world of data. For years we have been blasted with nonstop articles about “The Sexiest Job of the 21st Century”: the data scientist. A data scientist, we have been taught, is a figure of almost otherworldly intelligence who uses quasi-mystical arts to perform feats of data wizardry. But these days, if you talk to the people who watch the data space most closely, there’s a different data role that has them even more excited.

To be clear, there are some very real and very cool applications of data science that can allow organizations to do things with data that can completely transform how their organization operates. But for many orgs, particularly smaller organizations without millions of dollars to invest, data science initiatives tend to fall flat because of the lack of a solid data infrastructure to support them.

While everyone was focused on the rise of data science, another discipline has been quietly taking shape, driven not by glitzy articles in Harvard Business Review but by the people working in the trenches in data-intensive roles. They call it the analytics engineer.

An analytics engineer is someone who brings together the data savvy and domain knowledge of an analyst with software engineering tooling and best practices. Day to day, that means spending time in a suite of tools that is becoming known as “The Modern Data Stack”, and particularly dbt. These tools allow analytics engineers to centralize data and then model it for analysis in a way that is remarkably cheap and easy compared to how the ETL processes of traditional Business Intelligence teams operated.

While data science is seen by some as wizardry, the attitude of the analytics engineer is a little different. You’ll hear them refer to themselves as everything from “humble data plumbers” to “just a pissed off data analyst.” The work of an analytics engineer seems easy to understand, almost banal: they combine data sources, apply logic, and make sure there are clean and well-modeled materializations to analyze.

It turns out analytics engineering is a goddamn superpower. Anyone that has worked in, well, basically any organization knows that a tremendous amount of effort goes into standardizing data points that feel like they should be a no-brainer to pull, while more complex questions just sit unanswered for years. Analytics Engineering allows you to have data systems that just work.

A good analytics engineer is hugely impactful for an org, with each analytics engineer being able to help build a truly data-driven culture in ways that would be challenging for a team of people using legacy tools. While in the past there was tremendous repetitive work to do any simple analysis, Analytics Engineers can build complex data models using tools like dbt and have analysis-ready data tables built on any schedule. While before it was impossible to get anyone to agree on standard definitions of metrics, Analytics Engineers can simply build them into their codebase. And in the past, people struggled with incomplete and messy data, and Analytics Engineers… still struggle with incomplete and messy data. But at least we can have a suite of tests on our analytics systems to know when something has gone wrong!

The Rise of Analytics Engineering

You might think that this development would be scary for people working in data — if one analytics engineer is substantially more impactful than a data analyst, won’t our jobs be at risk? Could an org replace five data analysts with one Analytics Engineer and come out ahead?

But the fact of the matter is that no data analyst, anywhere, has ever come close to performing all of the analysis they think could be impactful at their organization — the opposite is far more likely to be the problem. Most data orgs are begging for additional headcount.

As analytics engineers increase the amount of insight organizations can find in data, it actually becomes more likely that these orgs will want to hire additional data talent (both analytics engineers and analysts). In his fantastic post The Reorganization of the Factory, Erik Bernhardsson makes the case that as the toolsets for software engineers have become ever more efficient, the demand for software engineers has counterintuitively grown, as there are more and more use cases where it now makes sense to build software rather than run a manual process. This point not only holds for data; I think it is actually more true for data.

While every organization needs software, not every organization needs software engineers. But every organization needs to learn from their data, and since the ways in which the data needs to be understood will be unique at every organization, they will all need analytics engineers. Software is commonly said to be eating the world — analytics engineering will be embedded in the world. As the incremental value of each data hire rises, there are substantial new areas where data insights and learnings could be applied that they aren’t today. And even if you aren’t interested in becoming an analytics engineer, having well modeled and accurate data makes data analysts and data scientists more effective. It’s a win across the board.

That does not necessarily mean that every analytics engineering role will be doing good for the world. Having more powerful data operations allows you to question, seek insights, and look for new strategies. It can also allow an organization new ways to monitor their employees, surveil, or discriminate. One needs only look at the myriad of public issues in the tech and data science industries right now to see the ways that powerful tech can be misused. It is important to recognize the potential dangers as well as the new opportunities.

If it feels like we’re at a real inflection point for Analytics Engineering — it’s because we are. What was very recently the domain of a few adventurous data teams is quickly becoming industry standard for tech organizations — and there’s every reason to think that other types of organizations will be following along shortly. The impact is just too high.

We’re about to see a huge expansion in the number of and types of places where you can find employment as an analytics engineer. The coming boom in opportunities for analytics engineers will take place across three rough domains, with each having different challenges and opportunities.

  • More and more large enterprises, both tech and non-tech organizations, are going to adapt to the modern data stack. As analytics engineering is brought into the most complex legacy data systems, we’ll begin to see what patterns develop to support analytics engineering at scale. If you are interested in really figuring out what the large-scale data systems of the future look like, this will be the place to go.
  • Just about every new company is going to be searching for an analytics engineer to lead their data initiatives. This will give them a step up against any competition that isn’t investing in their core data. Being an early analytics engineer at a fast-growing company is tremendously fun and exciting, as you are able to build up a data organization from scratch and see firsthand how analytics engineering can change the trajectory of an organization.
  • Finally, many organizations outside the tech business world are going to begin seeing the impact that analytics engineering can bring. You might not have quite the same tech budget, and you might have to learn to advocate for yourself a little more but it might be the area where analytics engineering has the most potential to do good for the world. City governments will use analytics engineering to monitor programs and ensure that government resources are being used effectively. Academic institutions will use analytics engineering to create datasets, many of them public, that will aid in scientific and technological development. The possibility space is wide open.

Analytics engineering is fundamentally a discipline that’s about making sense of the world around us. It’s about allowing everyone in an organization to see a little bit further in their impact on the org and how their work connects to it. Right now, analytics engineering is still a new discipline — pretty soon, it will be everywhere.

Original. Reposted with permission.


