

Prepare data from Snowflake for machine learning with Amazon SageMaker Data Wrangler



Data preparation remains a major challenge in the machine learning (ML) space. Data scientists and engineers need to write queries and code to get data from source data stores, and then write more queries to transform that data into features for model development and training. None of this pipeline development work focuses on building ML models; it focuses on building the data pipelines needed to make data available to those models. Amazon SageMaker Data Wrangler makes it easier for data scientists and engineers to prepare data in the early phase of developing ML applications by using a visual interface.

Data Wrangler simplifies the process of data preparation and feature engineering using a single visual interface. Data Wrangler comes with over 300 built-in data transformations to help normalize, transform, and combine features without writing any code. You can now use Snowflake as a data source in Data Wrangler to easily prepare data in Snowflake for ML.

In this post, we use a simulated dataset that represents loans from a financial services provider, which has been provided by Snowflake. This dataset contains lender data about loans granted to individuals. We use Data Wrangler to transform and prepare the data for later use in ML models, first building a data flow in Data Wrangler, then exporting it to Amazon SageMaker Pipelines. First, we walk through setting up Snowflake as the data source, then explore and transform the data using Data Wrangler.


This post assumes you have the following:

Set up permissions for Data Wrangler

In this section, we cover the permissions required to set up Snowflake as a data source for Data Wrangler. This section requires you to perform steps in both the AWS Management Console and Snowflake. The user in each environment should have permission to create policies, roles, and secrets in AWS, and the ability to create storage integrations in Snowflake.

All permissions for AWS resources are managed via your IAM role attached to your Amazon SageMaker Studio instance. Snowflake-specific permissions are managed by the Snowflake admin; they can grant granular permissions and privileges to each Snowflake user. This includes databases, schemas, tables, warehouses, and storage integration objects. Make sure that the correct permissions are set up outside of Data Wrangler.

AWS access requirements

Snowflake requires the following permissions on your output S3 bucket and prefix to be able to access objects in the prefix:

  • s3:GetObject
  • s3:GetObjectVersion
  • s3:ListBucket

You can add a bucket policy to ensure that Snowflake only communicates with your bucket over HTTPS. For instructions, see What S3 bucket policy should I use to comply with the AWS Config rule s3-bucket-ssl-requests-only?

Create an IAM policy allowing Amazon S3 access

In this section, we cover creating the policy required for Snowflake to access data in an S3 bucket of your choosing. If you already have a policy and role that allows access to the S3 bucket you plan to use for the Data Wrangler output, you can skip this section and the next section, and start creating your storage integration in Snowflake.

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. On the JSON tab, enter the following JSON snippet, substituting your bucket and prefix name for the placeholders:
Be sure to replace <bucket> and <prefix> (including the angle brackets) with your own bucket and prefix names (for example, MY-SAGEMAKER-BUCKET/MY-PREFIX):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion"
            ],
            "Resource": ["arn:aws:s3:::<bucket>/<prefix>/*"]
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::<bucket>"],
            "Condition": {
                "StringLike": {
                    "s3:prefix": ["<prefix>/*"]
                }
            }
        }
    ]
}

  4. Choose Next: Tags.
  5. Choose Next: Review.
  6. For Name, enter a name for your policy (for example, snowflake_datawrangler_s3_access).
  7. Choose Create policy.

Create an IAM role

In this section, we create an IAM role and attach it to the policy we created.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. Select Another AWS account as the trusted entity type.
  4. For the Account ID field, enter your own AWS account ID.

You modify the trusted relationship and grant access to Snowflake later.

  5. Select the Require external ID option.
  6. Enter a dummy ID such as your own account ID.

Later, we modify the trust relationship and specify the external ID for your Snowflake stage. An external ID is required to grant access to your AWS resources (such as Amazon S3) to a third party (Snowflake).

  7. Choose Next.
  8. Locate the policy you created previously for the S3 bucket and choose this policy.
  9. Choose Next.
  10. Enter a name and description for the role, then choose Create role.

You now have an IAM policy created for an IAM role, and the policy is attached to the role.

  11. Record the role ARN value located on the role summary page.

In the next step, you create a Snowflake integration that references this role.

Create a storage integration in Snowflake

A storage integration in Snowflake stores a generated IAM entity for external cloud storage (in this case, Amazon S3), along with an optional set of allowed or blocked storage locations. An AWS administrator in your organization grants permissions on the storage location to the generated IAM entity. With this feature, users don’t need to supply credentials when creating stages or when loading or unloading data.

Create the storage integration with the following code:
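The code block for this step is missing from this copy of the post. A sketch of what it would look like, based on Snowflake’s CREATE STORAGE INTEGRATION syntax and the integration name used later in this post (SAGEMAKE_DATAWRANGLER_INTEGRATION); the placeholders are the IAM role ARN you recorded earlier and your S3 bucket and prefix:

```sql
-- Sketch: create the storage integration pointing at the IAM role created earlier.
-- Placeholders: <iam_role_arn>, <bucket>, <prefix>.
CREATE STORAGE INTEGRATION SAGEMAKE_DATAWRANGLER_INTEGRATION
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = S3
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = '<iam_role_arn>'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket>/<prefix>/');
```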


Retrieve the IAM user for your Snowflake account

Run the following DESCRIBE INTEGRATION command to retrieve the ARN for the IAM user that was created automatically for your Snowflake account:
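The command itself is elided in this copy; assuming the integration name used elsewhere in this post, it would be:

```sql
-- Show the integration's properties, including the generated IAM user ARN and external ID
DESC INTEGRATION SAGEMAKE_DATAWRANGLER_INTEGRATION;
```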


Record the following values from the output:

  • STORAGE_AWS_IAM_USER_ARN – The IAM user created for your Snowflake account
  • STORAGE_AWS_EXTERNAL_ID – The external ID needed to establish a trust relationship

Update the IAM role trust policy

Now we update the trust policy.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose the role you created.
  3. On the Trust relationship tab, choose Edit trust relationship.
  4. Modify the policy document as shown in the following code with the DESC STORAGE INTEGRATION output values you recorded in the previous step:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": "<snowflake_user_arn>"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "<snowflake_external_id>"
                }
            }
        }
    ]
}

  5. Choose Update trust policy.

Create an external stage in Snowflake

We use an external stage within Snowflake for loading data from an S3 bucket in your own account into Snowflake. In this step, we create an external (Amazon S3) stage that references the storage integration you created. For more information, see Creating an S3 Stage.

This requires a role that has the CREATE STAGE privilege on the schema as well as the USAGE privilege on the storage integration. You can grant these privileges to the role as shown in the code in the next step.

Create the stage using the CREATE STAGE command with placeholders for the external stage and S3 bucket and prefix. The stage also references a named file format object called my_csv_format:

grant create stage on schema public to role <iam_role>;
grant usage on integration SAGEMAKE_DATAWRANGLER_INTEGRATION to role <iam_role>;
create stage <external_stage>
    storage_integration = SAGEMAKE_DATAWRANGLER_INTEGRATION
    url = '<s3_bucket>/<prefix>'
    file_format = my_csv_format;

Create a secret for Snowflake credentials (Optional)

Data Wrangler allows you to access Snowflake using either the ARN of an AWS Secrets Manager secret or a Snowflake account name, user name, and password. If you intend to use the account name, user name, and password option, skip to the next section, which covers adding the data source; with that option, Data Wrangler creates a Secrets Manager secret on your behalf by default.

To create a Secrets Manager secret manually, complete the following steps:

  1. On the Secrets Manager console, choose Store a new secret.
  2. For Select secret type, select Other types of secrets.
  3. Specify the details of your secret as key-value pairs.

The key names are case-sensitive and must be lowercase. If you enter any of these incorrectly, Data Wrangler raises an error.

If you prefer, you can use the plaintext option and enter the secret values as JSON:

{
    "username": "<snowflake username>",
    "password": "<snowflake password>",
    "accountid": "<snowflake account id>"
}

  4. Choose Next.
  5. For Secret name, add the prefix AmazonSageMaker (for example, our secret is AmazonSageMaker-DataWranglerSnowflakeCreds).
  6. In the Tags section, add a tag with the key SageMaker and value true.

  7. Choose Next.
  8. The rest of the fields are optional; choose Next until you have the option to choose Store to store the secret.

After you store the secret, you’re returned to the Secrets Manager console.

  9. Choose the secret you just created, then retrieve the secret ARN.
  10. Store this in the text editor of your choice for use later when you create the Data Wrangler data source.

Set up the data source in Data Wrangler

In this section, we cover setting up Snowflake as a data source in Data Wrangler. This post assumes that you have access to SageMaker, an instance of Studio, and a user for Studio. For more information about prerequisites, see Get Started with Data Wrangler.

Create a new data flow

To create your data flow, complete the following steps:

  1. On the SageMaker console, choose Amazon SageMaker Studio in the navigation pane.
  2. Choose Open Studio.
  3. In the Launcher, choose New data flow.

Alternatively, on the File drop-down, choose New, then choose Data Wrangler Flow.

Creating a new flow can take a few minutes. After the flow is created, you see the Import data page.

Add Snowflake as a data source in Data Wrangler

Next, we add Snowflake as a data source.

  1. On the Add data source menu, choose Snowflake.

  2. Add your Snowflake connection details.

Data Wrangler uses HTTPS to connect to Snowflake.

  3. If you created a Secrets Manager secret manually, choose the Authentication method drop-down menu and choose ARN.

  4. Choose Connect.

You’re redirected to the import menu.

Run a query

Now that Snowflake is set up as a data source, you can access your data in Snowflake directly from the Data Wrangler query editor. The query we write in the editor is what Data Wrangler uses to import data from Snowflake to start our data flow.

  1. On the drop-down menus, choose the data warehouse, database, and schema you want to use for your query.

For this post, our dataset is in the database FIN_LOANS, the schema is DEV, and the table is LOAN_INT_HV. Our data warehouse is called MOONMAXW_DEV_WH; depending on your setup, these values will likely differ.

Alternatively, you can specify the full path to the dataset in the query editor. Make sure you still choose the database and schema on the drop-down menus.

  2. In the query editor, enter a query and preview the results.

For this post, we retrieve all columns from 1,000 rows.
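The query itself isn’t reproduced in this copy. Using the database, schema, and table named above, a query of this shape returns all columns from 1,000 rows:

```sql
-- Preview the first 1,000 rows of the loans table
SELECT * FROM FIN_LOANS.DEV.LOAN_INT_HV LIMIT 1000;
```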

  3. Choose Import.

  4. Enter a dataset name when prompted (for this post, we use snowflake_loan_int_hv).
  5. Choose Add.

You’re taken to the Prepare page, where you can add transformations and analyses to the data.

Add transformations to the data

Data Wrangler has over 300 built-in transformations. In this section, we use some of these transformations to prepare the dataset for an ML model.

On the Data Wrangler flow page, make sure you have chosen the Prepare tab. If you’re following the steps in the post, you’re directed here automatically after adding your dataset.

Convert data types

The first step we want to perform is to check that the correct data type was inferred on ingest for each column.

  1. Next to Data types, choose the plus sign.
  2. Choose Edit data types.

Looking through the columns, we identify that MNTHS_SINCE_LAST_DELINQ and MNTHS_SINCE_LAST_RECORD should most likely be represented as a number type, rather than string.

  3. On the right-hand menu, scroll down until you find MNTHS_SINCE_LAST_DELINQ and MNTHS_SINCE_LAST_RECORD.
  4. On the drop-down menu, choose Float.

Looking through the dataset, we can confirm that the rest of the columns appear to have been correctly inferred.

  5. Choose Preview to preview the changes.
  6. Choose Apply to apply the changes.
  7. Choose Back to data flow to see the current state of the flow.

Manage columns

The dataset we’re using has several columns that likely aren’t beneficial to future models, so we start our transformation process by dropping the columns that aren’t useful.

  1. Next to Data types, choose the plus sign.
  2. Choose Add transformation.

The transformation console opens. Here you can preview your dataset, select from the available transformations, and preview the transformations.

Looking through the data, we can see that the fields EMP_TITLE, URL, DESCRIPTION, and TITLE will likely not provide value to our model in our use case, so we drop them.

  3. On the Transform menu, choose Manage columns.
  4. On the Transform drop-down menu, leave Drop column selected.
  5. Enter EMP_TITLE for Column to drop.
  6. Choose Preview to review the changes.
  7. Choose Add to add the step.
  8. If you want to see the step you added and previous steps, choose Previous steps on the Transform menu.

  9. Repeat these steps for the remaining columns (URL, DESCRIPTION, and TITLE).
  10. Choose Back to data flow to see the current state of the flow.

In the data flow view, we can see that this node in the flow has four steps, which represent the four columns we’re dropping for this part of the flow.
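For reference, the net effect of these four drop steps could be expressed in Snowflake SQL roughly as follows (an illustrative sketch, not code that Data Wrangler generates; SELECT * EXCLUDE requires a recent Snowflake release):

```sql
-- Keep every column except the four we dropped
SELECT * EXCLUDE (EMP_TITLE, URL, DESCRIPTION, TITLE)
FROM FIN_LOANS.DEV.LOAN_INT_HV;
```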

Format string

Next, we look for columns that are string data that can be formatted to be more beneficial to use later. Looking through our dataset, we can see that INT_RATE might be useful in a future model as float, but has a trailing character of %. Before we can use another built-in transformation (parse as type) to convert this to a float, we must strip the trailing character.

  1. Next to Steps, choose the plus sign.
  2. Choose Add transform.
  3. Choose Format string.
  4. On the Transform drop-down, choose Remove Symbols.
  5. On the Input column drop-down, choose the INT_RATE column.
  6. For Symbols, enter %.
  7. Optionally, in the Output field, enter the name of a column that this data is written to.

For this post, we keep the original column and set the output column to INT_RATE_PERCENTAGE to denote to future users of this data that this column is the interest rate as a percentage. Later, we convert this to a float.

  8. Choose Preview.

When Data Wrangler adds a new column, it’s automatically added as the rightmost column.

  9. Review the change to ensure accuracy.
  10. Choose Add.

Parse column as type

Continuing with the preceding example, we’ve identified that INT_RATE_PERCENTAGE should be converted to a float type.

  1. Next to Steps, choose the plus sign.
  2. Choose Add transform.
  3. Choose Parse Column as Type.
  4. On the Column drop-down, choose INT_RATE_PERCENTAGE.

The From field is automatically populated.

  5. On the To drop-down, choose Float.
  6. Choose Preview.
  7. Choose Add.
  8. Choose Back to data flow.

As you can see, we now have six steps in this portion of the flow, four that represent columns being dropped, one that represents string formatting, and one that represents parse column as type.
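For reference, the combined effect of the format-string and parse steps is roughly equivalent to this Snowflake expression (an illustrative sketch, not code generated by Data Wrangler):

```sql
-- Strip the trailing % and cast the result to a float, keeping the original column
SELECT INT_RATE,
       CAST(REPLACE(INT_RATE, '%', '') AS FLOAT) AS INT_RATE_PERCENTAGE
FROM FIN_LOANS.DEV.LOAN_INT_HV;
```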

Encode categorical data

Next, we want to look for categorical data in our dataset. Data Wrangler has a built-in functionality to encode categorical data using both ordinal and one-hot encodings. Looking at our dataset, we can see that the TERM, HOME_OWNERSHIP, and PURPOSE columns all appear to be categorical in nature.

  1. Next to Steps, choose the plus sign.
  2. Choose Add transform.

The first column in our list, TERM, has two possible values: 60 months and 36 months. Perhaps our future model would benefit from having these values one-hot encoded and placed into new columns.

  3. Choose Encode Categorical.
  4. On the Transform drop-down, choose One-hot encode.
  5. For Input column, choose TERM.
  6. On the Output style drop-down, choose Columns.
  7. Leave all other fields and check boxes as is.
  8. Choose Preview.

We can now see two columns, TERM_36 months and TERM_60 months, are one-hot encoded to represent the corresponding value in the TERM column.

  9. Choose Add.
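For reference, this one-hot step is roughly equivalent to the following Snowflake expression (an illustrative sketch; Data Wrangler performs the encoding for you):

```sql
-- Emit one indicator column per TERM value
SELECT TERM,
       IFF(TERM = '36 months', 1, 0) AS "TERM_36 months",
       IFF(TERM = '60 months', 1, 0) AS "TERM_60 months"
FROM FIN_LOANS.DEV.LOAN_INT_HV;
```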

The HOME_OWNERSHIP column has four possible values: RENT, MORTGAGE, OWN, and other.

  10. Repeat the preceding steps to apply a one-hot encoding approach on these values.

Lastly, the PURPOSE column has several possible values. For this data, we use a one-hot encoding approach as well, but we set the output to a vector, rather than columns.

  11. On the Transform drop-down, choose One-hot encode.
  12. For Input column, choose PURPOSE.
  13. On the Output style drop-down, choose Vector.
  14. For Output Column, we call this column PURPOSE_VCTR.

This keeps the original PURPOSE column, if we decide to use it later.

  15. Leave all other fields and check boxes as is.
  16. Choose Preview.

  17. Choose Add.
  18. Choose Back to data flow.

We can now see nine different transformations in this flow, and we still haven’t written a single line of code.

Handle outliers

As our last step in this flow, we want to handle outliers in our dataset. As part of the data exploration process, we can create an analysis (which we cover in the next section). In the following example scatter plot, we explored whether we could gain insights from the relationship between annual income, interest rate, and employment length by observing the dataset on a scatter plot. On the graph, we have the loan recipients’ INT_RATE_PERCENTAGE on the X axis, ANNUAL_INC on the Y axis, and the data is color-coded by EMP_LENGTH. The dataset has some outliers that might skew the results of our model later. To address this, we use Data Wrangler’s built-in transformation for handling outliers.

  1. Next to Steps, choose the plus sign.
  2. Choose Add transform.
  3. Choose Handle outliers.
  4. On the Transform drop-down, choose Standard deviation numeric outliers.
  5. For Input column, enter ANNUAL_INC.
  6. For Output column, enter ANNUAL_INC_NO_OUTLIERS.

This is optional, but it’s good practice to notate that a column has been transformed for later consumers.

  7. On the Fix method drop-down, leave Clip selected.

This option automatically clips values to the corresponding outlier detection bound, which we set next.

  8. For Standard deviations, leave the default of 4 to start.

This allows values within four standard deviations of the mean to be considered valid (and therefore not clipped). Values outside of this bound are clipped.
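As a sketch of what the clip does (illustrative Snowflake SQL, not the code Data Wrangler runs):

```sql
-- Clip ANNUAL_INC to within 4 standard deviations of its mean
SELECT ANNUAL_INC,
       LEAST(GREATEST(ANNUAL_INC,
                      AVG(ANNUAL_INC) OVER () - 4 * STDDEV(ANNUAL_INC) OVER ()),
             AVG(ANNUAL_INC) OVER () + 4 * STDDEV(ANNUAL_INC) OVER ()) AS ANNUAL_INC_NO_OUTLIERS
FROM FIN_LOANS.DEV.LOAN_INT_HV;
```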

  9. Choose Preview.
  10. Choose Add.

The output includes an object type. We need to convert this to a float for it to be valid within our dataset and visualization.

  11. Follow the same steps as when parsing a column as type, this time using the ANNUAL_INC_NO_OUTLIERS column.
  12. Choose Back to data flow to see the current state of the flow.

Add analyses to the data

In this section, we walk through adding analyses to the dataset. We focus on visualizations, but there are several other options, including detecting target leakage, generating a bias report, and adding your own custom visualizations using the Altair library.

Scatter plot

To create a scatter plot, complete the following steps:

  1. On the data flow page, next to Steps, choose the plus sign.
  2. Choose Add analysis.
  3. For Analysis type, choose Scatter plot.
  4. Using the preceding example, we name this analysis EmpLengthAnnualIncIntRate.
  5. For X Axis, enter INT_RATE_PERCENTAGE.
  6. For Y Axis, enter ANNUAL_INC_NO_OUTLIERS.
  7. For Color by, enter EMP_LENGTH.
  8. Choose Preview.

The following screenshot shows our scatter plot.

We can compare this to the earlier version, before the outliers were clipped.

So far this is looking good, but let’s add a facet to break out each category in the Grade column into its own graph.

  9. For Facet by, choose GRADE.
  10. Choose Preview.

The following screenshot has been trimmed down for display purposes. The Y axis still represents ANNUAL_INC. For faceted plots, this is displayed on the bottommost plot.

  11. Choose Save to save the analysis.

Export the data flow

Finally, we export this whole data flow as a pipeline, which creates a Jupyter notebook with the code pre-populated. With Data Wrangler, you can also export your data to a Jupyter notebook as a SageMaker processing job, SageMaker feature store, or export directly to Python code.

  1. On the Data Flow console, choose the Export tab.
  2. Choose the steps to export. For our use case, we choose each box that represents a step.

  3. Choose Export step, then choose Pipeline.

The pre-populated Jupyter notebook loads and opens automatically, displaying all the generated steps and code for your data flow. The following screenshot shows the input section that defines the data source.

Clean up

If your work with Data Wrangler is complete, shut down your Data Wrangler instance to avoid incurring additional fees.


In this post, we covered setting up Snowflake as a data source for Data Wrangler, adding transformations and analyses to a dataset, and then exporting the data flow for further use in a Jupyter notebook. We further improved our data flow after visualizing our dataset using Data Wrangler’s built-in analysis functionality. Most notably, we built a data preparation pipeline without having to write a single line of code.

To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler, and see the latest information on the Data Wrangler product page.

Data Wrangler makes it easy to ingest data and perform data preparation tasks such as exploratory data analysis, feature selection, and feature engineering. We’ve only covered a few of Data Wrangler’s capabilities in this post on data preparation; you can also use Data Wrangler for more advanced data analysis, such as feature importance, target leakage, and model explainability, using an easy and intuitive user interface.

About the Authors

Maxwell Moon is a Senior Solutions Architect at AWS working with Independent Software Vendors (ISVs) to design and scale their applications on AWS. Outside of work, Maxwell is a dad to two cats, is an avid supporter of the Wolverhampton Wanderers Football Club, and tries to spend as much time playing music as possible.

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS with over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped large technology companies design data analytics solutions, and has led engineering teams in designing and implementing data analytics platforms and data products.



AI-driven hedge fund rules out Bitcoin for lack of ‘fundamentals’



A Swedish hedge fund that returned roughly four times the industry average last year using artificial intelligence won’t touch Bitcoin, based on an assessment that the cryptocurrency doesn’t lend itself to sensible analysis.

Photo by Bloomberg Mercury

Patrik Safvenblad, the chief investment officer of Volt Capital Management AB, says the problem with Bitcoin and other crypto assets is that they “do not have accessible fundamentals that we could build a model on.”

“When there is a crisis, markets generally move toward fundamentals. Not the old fundamentals but new, different fundamentals,” he said in an interview. So if an asset doesn’t provide that basic parameter, “we stay away from that,” he said.

The role of Bitcoin in investment portfolios continues to split managers, as the world’s most popular cryptocurrency remains one of the most volatile asset classes. One coin traded at less than $40,000 on Friday, compared with an April peak of $63,410. This time last year, a single Bitcoin cost around $10,000.

Among Volt’s best-known investors is Bjorn Wahlroos, the former Nordea Bank Abp chairman. His son and former professional poker player, Thomas Wahlroos, is Volt’s board chairman. The fund currently manages assets worth just $73 million, on which it returned 41% in 2020, four times the industry average.

Bitcoin enthusiasts recently received a boost when hedge fund manager Paul Tudor Jones told CNBC he likes it “as a portfolio diversifier.” He went on to say that the “only thing” he’s “certain” about is that he wants “5% in gold, 5% in Bitcoin, 5% in cash, 5% in commodities.”

Meanwhile, Bank of America Corp. research shows that Bitcoin is about four times as volatile as the Brazilian real and Turkish lira. And the International Monetary Fund has warned that El Salvador’s decision to adopt Bitcoin as legal tender “raises a number of macroeconomic, financial and legal issues that require very careful analysis.”

Safvenblad says it’s more than just a matter of Bitcoin’s lack of fundamentals. He says he’s not ready to hold an asset that’s ultimately designed to dodge public scrutiny.

Volt would “much prefer to be in a regulated market with regulated trading,” he said. “And Bitcoin is not yet fully regulated.”

The hedge-fund manager has chosen 250 models it thinks will make money, and its AI program then allocates daily weightings. Volt’s investment horizon is relatively short, averaging about 10-12 trading days. It holds roughly 60 positions at any given time, and its current analysis points toward what Safvenblad calls a “nervous long.”

“In the past few weeks the program has turned more bearish,” he said. “We have some positions that anticipate a slowdown, for example long fixed-income, and the models have now trimmed our long positions in commodities. Today, the portfolio reflects a more balanced outlook.”

Safvenblad says the advantage to Volt’s AI model is that it’s unlikely to miss any signals. “We don’t say that we know where the world is heading. But we have a system that monitors everything that could mean something.”

— Jonas Cho Walsgard, Bloomberg Mercury




UK’s ICO warns over ‘big data’ surveillance threat of live facial recognition in public



The UK’s chief data protection regulator has warned over reckless and inappropriate use of live facial recognition (LFR) in public places.

Publishing an opinion today on the use of this biometric surveillance in public — setting out what are dubbed the “rules of engagement” — the information commissioner, Elizabeth Denham, also noted that a number of investigations already undertaken by her office into planned applications of the tech have found problems in all cases.

“I am deeply concerned about the potential for live facial recognition (LFR) technology to be used inappropriately, excessively or even recklessly. When sensitive personal data is collected on a mass scale without people’s knowledge, choice or control, the impacts could be significant,” she warned in a blog post.

“Uses we’ve seen included addressing public safety concerns and creating biometric profiles to target people with personalised advertising.

“It is telling that none of the organisations involved in our completed investigations were able to fully justify the processing and, of those systems that went live, none were fully compliant with the requirements of data protection law. All of the organisations chose to stop, or not proceed with, the use of LFR.”

“Unlike CCTV, LFR and its algorithms can automatically identify who you are and infer sensitive details about you. It can be used to instantly profile you to serve up personalised adverts or match your image against known shoplifters as you do your weekly grocery shop,” Denham added.

“In future, there’s the potential to overlay CCTV cameras with LFR, and even to combine it with social media data or other ‘big data’ systems — LFR is supercharged CCTV.”

The use of biometric technologies to identify individuals remotely sparks major human rights concerns, including around privacy and the risk of discrimination.

Across Europe there are campaigns — such as Reclaim your Face — calling for a ban on biometric mass surveillance.

In another targeted action, back in May, Privacy International and others filed legal challenges at the controversial US facial recognition company, Clearview AI, seeking to stop it from operating in Europe altogether. (Some regional police forces have been tapping in — including in Sweden where the force was fined by the national DPA earlier this year for unlawful use of the tech.)

But while there’s major public opposition to biometric surveillance in Europe, the region’s lawmakers have so far — at best — been fiddling around the edges of the controversial issue.

A pan-EU regulation the European Commission presented in April, which proposes a risk-based framework for applications of artificial intelligence, included only a partial prohibition on law enforcement’s use of biometric surveillance in public places — with wide ranging exemptions that have drawn plenty of criticism.

There have also been calls for a total ban on the use of technologies like live facial recognition in public from MEPs across the political spectrum. The EU’s chief data protection supervisor has also urged lawmakers to at least temporarily ban the use of biometric surveillance in public.

The EU’s planned AI Regulation won’t apply in the UK, in any case, as the country is now outside the bloc. And it remains to be seen whether the UK government will seek to weaken the national data protection regime.

A recent report it commissioned to examine how the UK could revise its regulatory regime, post-Brexit, has — for example — suggested replacing the UK GDPR with a new “UK framework” — proposing changes to “free up data for innovation and in the public interest”, as it puts it, and advocating for revisions for AI and “growth sectors”. So whether the UK’s data protection regime will be put to the torch in a post-Brexit bonfire of ‘red tape’ is a key concern for rights watchers.

(The Taskforce on Innovation, Growth and Regulatory Reform report advocates, for example, for the complete removal of Article 22 of the GDPR — which gives people rights not to be subject to decisions based solely on automated processing — suggesting it be replaced with “a focus” on “whether automated profiling meets a legitimate or public interest test”, with guidance on that envisaged as coming from the Information Commissioner’s Office (ICO). But it should also be noted that the government is in the process of hiring Denham’s successor; and the digital minister has said he wants her replacement to take “a bold new approach” that “no longer sees data as a threat, but as the great opportunity of our time”. So, er, bye-bye fairness, accountability and transparency then?)

For now, those seeking to implement LFR in the UK must comply with the UK’s Data Protection Act 2018 and the UK General Data Protection Regulation (aka its implementation of the EU GDPR, which was transposed into national law before Brexit), per the ICO opinion. That includes the data protection principles set out in UK GDPR Article 5: lawfulness, fairness, transparency, purpose limitation, data minimisation, storage limitation, security and accountability.

The opinion also said that controllers must enable individuals to exercise their rights.

“Organisations will need to demonstrate high standards of governance and accountability from the outset, including being able to justify that the use of LFR is fair, necessary and proportionate in each specific context in which it is deployed. They need to demonstrate that less intrusive techniques won’t work,” wrote Denham. “These are important standards that require robust assessment.

“Organisations will also need to understand and assess the risks of using a potentially intrusive technology and its impact on people’s privacy and their lives. For example, how issues around accuracy and bias could lead to misidentification and the damage or detriment that comes with that.”

The timing of the publication of the ICO’s opinion on LFR is interesting in light of wider concerns about the direction of UK travel on data protection and privacy.

If, for example, the government intends to recruit a new, ‘more pliant’ information commissioner — who will happily rip up the rulebook on data protection and AI, including in areas like biometric surveillance — it will at least be rather awkward for them to do so with an opinion from the prior commissioner on the public record that details the dangers of reckless and inappropriate use of LFR.

Certainly, the next information commissioner won’t be able to say they weren’t given clear warning that biometric data is particularly sensitive — and can be used to estimate or infer other characteristics, such as a person’s age, sex, gender or ethnicity.

Or that ‘Great British’ courts have previously concluded that “like fingerprints and DNA [a facial biometric template] is information of an ‘intrinsically private’ character”, as the ICO opinion notes, while underlining that LFR can cause this super sensitive data to be harvested without the person in question even being aware it’s happening. 

Denham’s opinion also hammers hard on the point about the need for public trust and confidence for any technology to succeed, warning that: “The public must have confidence that its use is lawful, fair, transparent and meets the other standards set out in data protection legislation.”

The ICO has previously published an opinion on the use of LFR by police forces — which, she said, also sets “a high threshold for its use”. (And a few UK police forces — including the Met in London — have been among the early adopters of facial recognition technology, which has in turn led some into legal hot water on issues like bias.)

Disappointingly, though, for human rights advocates, the ICO opinion shies away from recommending a total ban on the use of biometric surveillance in public by private companies or public organisations — with the commissioner arguing that while there are risks with use of the technology there could also be instances where it has high utility (such as in the search for a missing child).

“It is not my role to endorse or ban a technology but, while this technology is developing and not widely deployed, we have an opportunity to ensure it does not expand without due regard for data protection,” she wrote, saying instead that in her view “data protection and people’s privacy must be at the heart of any decisions to deploy LFR”.

Denham added that (current) UK law “sets a high bar to justify the use of LFR and its algorithms in places where we shop, socialise or gather”.

“With any new technology, building public trust and confidence in the way people’s information is used is crucial so the benefits derived from the technology can be fully realised,” she reiterated, noting how a lack of trust in the US has led to some cities banning the use of LFR in certain contexts and led to some companies pausing services until rules are clearer.

“Without trust, the benefits the technology may offer are lost,” she also warned.

There is one red line that the UK government may be forgetting in its unseemly haste to (potentially) gut the UK’s data protection regime in the name of specious ‘innovation’. Because if it tries to, er, ‘liberate’ national data protection rules from core EU principles (of lawfulness, fairness, proportionality, transparency, accountability and so on) — it risks falling out of regulatory alignment with the EU, which would then force the European Commission to tear up an EU-UK data adequacy arrangement (on which the ink is still drying).

The UK having a data adequacy agreement from the EU is dependent on the UK having essentially equivalent protections for people’s data. Without this coveted data adequacy status UK companies will immediately face far greater legal hurdles to processing the data of EU citizens (as the US now does, in the wake of the demise of Safe Harbor and Privacy Shield). There could even be situations where EU data protection agencies order EU-UK data flows to be suspended altogether…

Obviously such a scenario would be terrible for UK business and ‘innovation’ — even before you consider the wider issue of public trust in technologies and whether the Great British public itself wants to have its privacy rights torched.

Given all this, you really have to wonder whether anyone inside the UK government has thought this ‘regulatory reform’ stuff through. For now, the ICO is at least still capable of thinking for them.



A Brief Intro to NLP in the Media & Communication Industry




Gaurav


A technical writer with Cogito, who writes about AI. National basketball player. Photographer.

Natural Language Processing (NLP) has a massive influence on the media and communication industry. Its ability to track people’s preferences and filter out irrelevant information, combined with its speed and accuracy, makes this technology stand apart in the industry.

In this write-up, we will look at the role of NLP in the media industry, its impact, and how it can help resolve the issues that are hampering the industry’s overall growth.

But first, let’s have a look at the basics.

What is Natural Language Processing (NLP)?

Natural Language Processing, or NLP, is the automatic manipulation of natural language. It is a branch of Artificial Intelligence (AI) that allows systems to instantly recognize, manipulate, and correctly interpret the way human beings communicate — mainly in the form of speech or text.

As the technology has advanced, machines have become able to decipher human language and act on it.

Programmers once used punch cards to communicate with machines. Today, that place has been taken by Siri, Alexa, and other systems that can communicate fluently.
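To make the idea concrete, here is a minimal, hypothetical sketch in plain Python (no NLP library) of the first steps such systems take: tokenizing raw text, then crudely “interpreting” it by extracting keywords. The stopword list and helper names are illustrative, not taken from any particular toolkit.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split raw text into word tokens -- the first step
    # most NLP pipelines perform before any deeper analysis.
    return re.findall(r"[a-z']+", text.lower())

# A tiny illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}

def keywords(text, top_n=3):
    # Drop common filler words and count what remains -- a crude way
    # for a machine to "interpret" what a sentence is about.
    counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

print(keywords("The election results are in, and the election was close."))
# → ['election', 'results', 'are']
```

Real systems such as Siri and Alexa layer speech recognition, parsing, and intent models on top of steps like these, but the tokenize-then-analyze pattern is the common foundation.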

NLP in the Media & Communication Industry

The implementation of Natural Language Processing in the media industry has already started across the globe. As we all know, fake news, irrelevant content, and abusive comments go hand in hand with social media trolls.

From the user’s point of view, almost everyone has grappled with these issues, and it is an arduous task for the human mind to keep track of everything. NLP can play a substantial role in getting rid of these problems.

This will not only help eliminate the problems mentioned above but will also open the door to the overall development of the industry. Computer intelligence will allow automated searching for crucial information, parsing of relevant news, and analysis of the news according to predefined criteria or specific guidelines.

Interesting Trivia

A successful implementation of NLP came from the American news agency Associated Press (AP) in 2015. According to one report, as many as 3,000 articles were generated every 15 minutes. In 2016, the figure was around 2,000 posts per second.

Apart from AP, other media organizations such as The New York Times, The Guardian, Forbes, and the BBC have also implemented this technology.

Impact of NLP in the News Industry

Any advanced technology has its advantages and disadvantages; it is up to humans to decide to what extent they want to incorporate it.

When we talk about the use of NLP in the news industry, it is not only about understanding text or speech. It is crucial to develop algorithms that enable computers to perform the following actions:

1. Predicting the textual content

2. Summarizing the overall information

3. Analyzing suitable information

4. Filtering the news as per defined criteria

All of the above steps are possible only through both macro-understanding and micro-understanding of the textual content.
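As a hedged sketch of how these actions might look in code, consider a toy pipeline that filters articles against keyword criteria (a form of micro-understanding) and produces a crude length-based extractive summary (a stand-in for macro-understanding). The function names and scoring heuristic are illustrative only, not a production approach.

```python
def matches_criteria(article, required):
    # Micro-understanding, crudely: does the article mention any of
    # the required topic keywords?
    words = set(article.lower().split())
    return bool(required & words)

def summarize(article, max_sentences=1):
    # Macro-understanding, crudely: rank sentences by length as a
    # stand-in for importance; real summarizers use far richer signals.
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    ranked = sorted(sentences, key=len, reverse=True)
    return ". ".join(ranked[:max_sentences]) + "."

article = ("Markets fell sharply today. Analysts blamed rising fuel costs "
           "and supply problems. Sports scores were also announced.")

# Keep only articles matching our criteria, then summarize them.
if matches_criteria(article, {"markets", "economy"}):
    print(summarize(article))
# → Analysts blamed rising fuel costs and supply problems.
```

Production systems replace both heuristics with trained models (classifiers for filtering, neural summarizers for condensing), but the filter-then-summarize structure is the same.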

When we talk about integrating NLP into the journalism and media industry, the possibilities are immense. All momentous happenings, events, and other significant information are crucial to our daily lives.

It is challenging for humans to remain 100 percent correct and accurate every time. This gives rise to the need for machines that can think, understand, and interpret just as humans do.

For a journalist, NLP-enabled robots can play a significant role in raising overall industry standards. To enhance productivity, accuracy, and speed, a journalist can delegate the research work to a robot and manage the other aspects of the news vertical.

These robots can scan the internet for relevant and authentic information and create a news article or any other news piece. NLP robots can play a crucial role in delivering content that requires figures, statistics, and other technical details.

As we mentioned earlier, every technology has its advantages and disadvantages, and implementing NLP in journalism also poses a few challenges.

Compared to humans, machines are capable of handling multiple assigned tasks at any given time, and they can perform the work faster and more effectively. But there are certain aspects in which machines cannot match the intelligence of the human brain.

Machines also cannot handle factors such as freedom of speech, social and ethical issues, and public sentiment.

However, when a machine’s algorithms are fed quality data, the overall results improve, and machines can perform the majority of tasks equally well.

Training on any kind of data requires NLP annotation services, which identify the applicable words in a sentence and label them so that a machine learning model can understand them.
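As an illustration of what annotation output can look like, here is a hypothetical sketch: human labelers mark character spans in raw text with entity labels, and those labeled spans become the training examples a model learns from. The label names and offsets here are assumptions for this example, not any specific annotation service’s schema.

```python
# Raw text to be annotated for entity-recognition training.
text = "Associated Press generated thousands of articles in 2015."

# Each annotation is a (start offset, end offset, label) triple,
# exactly the shape many span-labeling datasets use.
annotations = [
    (0, 16, "ORG"),    # "Associated Press"
    (52, 56, "DATE"),  # "2015"
]

# Recover the labeled spans from the offsets -- this pairing of raw
# text and labels is what the model is trained on.
for start, end, label in annotations:
    print(f"{label}: {text[start:end]}")
# → ORG: Associated Press
# → DATE: 2015
```

At scale, thousands of such labeled spans teach a model which words in unseen sentences are organizations, dates, people, and so on.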



KeepTruckin raises $190 million to invest in AI products, double R&D team to 700



KeepTruckin, a hardware and software developer that helps trucking fleets manage vehicle, cargo and driver safety, has just raised $190 million in a Series E funding round, which puts the company’s valuation at $2 billion, according to CEO Shoaib Makani. 

G2 Venture Partners, which just raised a $500 million fund to help modernize existing industries, participated in the round, alongside existing backers like Greenoaks Capital, Index Ventures, IVP and Scale Venture Partners, which is managed by BlackRock. 

KeepTruckin intends to invest its new capital back into its AI-powered products like its GPS tracking, ELD compliance and dispatch and workflow, but it’s specifically interested in improving its smart dashcam, which instantly detects unsafe driving behaviors like cell phone distraction and close following and alerts the drivers in real time, according to Makani. 

The company says that one of its clients, Usher Transport, has seen a 32% annual reduction in accidents after implementing the Smart Dashcam, DRIVE risk score, and Safety Hub, products the company offers to increase safety.

“KeepTruckin’s special sauce is that we can build complex models (that other edge cameras can’t yet run) and make it run on the edge with low-power, low-memory and low-bandwidth constraints,” Makani told TechCrunch. “We have developed in-house IPs to solve this problem at different environmental conditions such as low-light, extreme weather, occluded subject and distortions.”

This kind of accuracy requires billions of ground truth data points that are trained and tested on KeepTruckin’s in-house machine learning platform, a process that is very resource-intensive. The platform includes smart annotation capabilities to automatically label the different data points so the neural network can play with millions of potential situations, achieving similar performance to the edge device that’s in the field with real-world environmental conditions, according to Makani.

A 2020 McKinsey study predicted the freight industry is not likely to see the kind of YOY growth it saw last year, which was 30% up from 2019, but noted that some industries would increase at higher rates than others. For example, commodities related to e-commerce and agricultural and food products will be the first to return to growth, whereas electronics and automotive might increase at a slower rate due to declining consumer demand for nonessentials. 

Since the pandemic, the company said it experienced 70% annualized growth, in large part due to expansion into new markets like construction, oil and gas, food and beverage, field services, moving and storage and agriculture. KeepTruckin expects this demand to increase and intends to use the fresh funds to scale rapidly and recruit more talent that will help progress its AI systems, doubling its R&D team to 700 people globally with a focus on engineering, machine vision, data science and other AI areas, says Makani. 

“We think packaging these products into operator-friendly user interfaces for people who are not deeply technical is critical, so front-end and full-stack engineers with experience building incredibly intuitive mobile and web applications are also high priority,” said Makani. 

Much of KeepTruckin’s tech will eventually power autonomous vehicles to make roads safer, says Makani, something that’s also becoming increasingly relevant as the demand for trucking continues to outpace supply of drivers.

“Level 4 and eventually level 5 autonomy will come to the trucking industry, but we are still many years away from broad deployment,” he said. “Our AI-powered dashcam is making drivers safer and helping prevent accidents today. While the promise of autonomy is real, we are working hard to help companies realize the value of this technology now.”
