Logo Zephyrnet

Jeki ipilẹ data nipa jijẹ pẹlu Amazon Athena ati AWS Glue lati mu yara awọn ibeere isale | Amazon Web Services

ọjọ:

Ni akoko ti data, awọn ajo ti n pọ si ni lilo awọn adagun data lati fipamọ ati ṣe itupalẹ awọn oye pupọ ti iṣeto ati data ti a ko ṣeto. Awọn adagun data n pese ibi-ipamọ aarin fun data lati awọn orisun oriṣiriṣi, ti n fun awọn ẹgbẹ laaye lati ṣii awọn oye ti o niyelori ati ṣiṣe ipinnu-iṣakoso data. Bibẹẹkọ, bi awọn iwọn data ti n tẹsiwaju lati dagba, iṣapeye iṣeto data ati iṣeto di pataki fun ibeere ati itupalẹ daradara.

Ọkan ninu awọn italaya bọtini ni awọn adagun data ni agbara fun iṣẹ ṣiṣe ibeere ti o lọra, ni pataki nigbati o ba n ṣe pẹlu awọn ipilẹ data nla. Eyi le jẹ ikasi si awọn okunfa bii ipilẹ data aiṣedeede, ti o mu abajade ọlọjẹ data ti o pọ ju ati lilo aiṣedeede awọn orisun iṣiro. Lati koju ipenija yii, awọn iṣe ti o wọpọ bii ipin ati bucketing le mu ilọsiwaju iṣẹ ṣiṣe pọ si ati dinku awọn idiyele iṣiro.

Ipele jẹ ilana ti o pin ipin data nla kan si kere, awọn ẹya iṣakoso diẹ sii ti o da lori awọn ibeere kan pato, gẹgẹbi ọjọ, agbegbe, tabi ẹka ọja. Nipa pipin data, awọn ibeere itupalẹ isale le fo awọn ipin ti ko ṣe pataki, dinku iye data ti o nilo lati ṣayẹwo ati ṣiṣẹ. O le lo awọn ọwọn ipin ni gbolohun WHERE ninu awọn ibeere lati ṣe ayẹwo awọn ipin kan pato ti ibeere rẹ nilo. Eyi le ja si awọn akoko ṣiṣe ibeere ni iyara ati lilo awọn orisun to munadoko diẹ sii. O ṣiṣẹ daradara daradara nigbati awọn ọwọn pẹlu kaadi kekere ti yan bi bọtini.

Kini ti o ba ni iwe kaadi kaadi giga ti o nilo nigbakan lati ṣe àlẹmọ nipasẹ awọn alabara VIP? Onibara kọọkan jẹ idanimọ nigbagbogbo pẹlu ID kan, eyiti o le jẹ awọn miliọnu. Pipin ko dara fun iru awọn ọwọn cardinality giga nitori pe o pari pẹlu awọn faili kekere, sisẹ ipin ti o lọra, ati giga Iṣẹ Ifipamọ Simple Amazon (Amazon S3) API iye owo (ọkan S3 ìpele ti wa ni da fun iye ti ipin iwe). Botilẹjẹpe o le lo ipinya pẹlu bọtini adayeba gẹgẹbi ilu tabi ipinlẹ lati dín akopọ data rẹ si iwọn diẹ, o tun jẹ dandan lati beere kọja awọn ipin ti o da lori ọjọ ti data rẹ ba jẹ jara akoko.

Eyi ni ibi ti bucketing wa sinu ere. Bucketing rii daju pe gbogbo awọn ori ila pẹlu awọn iye kanna ti ọkan tabi diẹ ẹ sii awọn ọwọn pari ni faili kanna. Dipo faili kan fun iye, bii ipinpin, iṣẹ hash ni a lo lati pin kaakiri awọn iye boṣeyẹ kọja nọmba ti o wa titi ti awọn faili. Nipa siseto data ni ọna yii, o le ṣe sisẹ daradara, nitori awọn buckets ti o yẹ nikan ni o nilo lati ni ilọsiwaju, dinku siwaju si oke iṣiro.

Awọn aṣayan pupọ wa fun imuse bucketing lori AWS. Ọna kan ni lati lo Athena Amazon Ṣẹda alaye tabili bi yiyan (CTAS), eyiti o fun ọ laaye lati ṣẹda tabili bucketed taara lati ibeere kan. Ni omiiran, o le lo AWS lẹ pọ fun Apache Spark, eyiti o pese atilẹyin ti a ṣe sinu fun awọn atunto bucketing lakoko ilana iyipada data. AWS Glue ngbanilaaye lati ṣalaye awọn aye ṣiṣe bucketing, gẹgẹbi nọmba awọn buckets ati awọn ọwọn si garawa lori, pese ipilẹ data iṣapeye fun ibeere daradara pẹlu Athena.

Ninu ifiweranṣẹ yii, a jiroro bi o ṣe le ṣe imuse bucketing lori awọn adagun data AWS, pẹlu lilo alaye Athena CTAS ati AWS Glue fun Apache Spark. A tun bo bucketing fun awọn tabili Apache Iceberg.

Apeere lilo apoti

Ninu ifiweranṣẹ yii, o lo dataset ti gbogbo eniyan, awọn NOAA Integrated dada aaye data. Awọn atunnkanka data ṣiṣe awọn ibeere akoko-ọkan fun data lakoko awọn ọdun 5 sẹhin nipasẹ Athena. Pupọ julọ awọn ibeere wa fun awọn ibudo kan pato pẹlu awọn iru ijabọ kan pato. Awọn ibeere nilo lati pari ni iṣẹju-aaya 10, ati pe idiyele naa nilo lati wa ni iṣapeye ni pẹkipẹki. Ninu oju iṣẹlẹ yii, o jẹ ẹlẹrọ data ti o ni iduro fun mimu iṣẹ ṣiṣe ibeere ati idiyele pọ si.

Fun apẹẹrẹ, ti oluyanju ba fẹ lati gba data pada fun ibudo kan pato (fun apẹẹrẹ, ID ibudo 123456) pẹlu iru ijabọ kan pato (fun apẹẹrẹ, CRN01), ibeere naa le dabi ibeere wọnyi:

SELECT station, report_type, columnA, columnB, ...
FROM table_name
WHERE
report_type = 'CRN01'
AND station = '123456'

Ninu ọran ti NOAA Integrated Surface Database, awọn station_id o ṣee ṣe ki ọwọn ni kadinality giga, pẹlu ọpọlọpọ awọn idamo ibudo alailẹgbẹ. Lori awọn miiran ọwọ, awọn report_type iwe le ni a jo kekere cardinality, pẹlu kan lopin ṣeto ti iroyin orisi. Fi fun oju iṣẹlẹ yii, yoo jẹ imọran ti o dara lati pin data naa nipasẹ report_type ati garawa o nipa station_id.

Pẹlu ilana ipin ati bucketing yii, Athena le kọkọ yọkuro awọn ipin fun awọn oriṣi ijabọ ti ko ṣe pataki, ati lẹhinna ṣe ọlọjẹ awọn buckets nikan laarin ipin ti o baamu ti ID ibudo ti a sọ, dinku iye data ti a ti ṣiṣẹ ni pataki ati isare awọn akoko ṣiṣe ibeere. Ọna yii kii ṣe ibamu pẹlu ibeere iṣẹ ṣiṣe nikan, ṣugbọn tun ṣe iranlọwọ lati mu awọn idiyele pọ si nipa didinku iye data ti ṣayẹwo ati idiyele fun ibeere kọọkan.

Ninu ifiweranṣẹ yii, a ṣe ayẹwo bii iṣẹ ṣiṣe ibeere ṣe ni ipa nipasẹ iṣeto data, ni pataki, bucketing. A tun ṣe afiwe awọn ọna oriṣiriṣi mẹta lati ṣaṣeyọri bucketing. Awọn wọnyi tabili duro awọn ipo fun awọn tabili lati wa ni da.

. noaa_remote_original athena_non_bucketed athena_bucketed glue_bucketed athena_bucketed_iceberg
kika CSV Ajẹyọ Ajẹyọ Ajẹyọ Ajẹyọ
funmorawon n / a Snappy Snappy Snappy Snappy
Ti a ṣẹda nipasẹ n / a Athena CTAS Athena CTAS Lẹ pọ ETL Athena CTAS pẹlu Iceberg
engine n / a Tẹtẹ Tẹtẹ Agbejade Afun Apache Iceberg
Ṣe ipin bi? Bẹẹni ṣugbọn pẹlu ọna ti o yatọ Bẹẹni Bẹẹni Bẹẹni Bẹẹni
Ti wa ni garawa? Rara Rara Bẹẹni Bẹẹni Bẹẹni

noaa_remote_original ti wa ni partitioned nipasẹ awọn year ọwọn, sugbon ko nipasẹ awọn report_type ọwọn. Yi kana duro ti o ba ti tabili ti wa ni partitioned nipasẹ awọn gangan ọwọn ti o ti wa ni lilo ninu awọn ibeere.

Ipilẹ tabili

Fun ifiweranṣẹ yii, o ṣẹda awọn tabili pupọ pẹlu awọn ipo oriṣiriṣi: diẹ ninu laisi bucketing ati diẹ ninu pẹlu bucketing, lati ṣafihan awọn abuda iṣẹ ti bucketing. Ni akọkọ, jẹ ki a ṣẹda tabili atilẹba nipa lilo data NOAA. Ni awọn igbesẹ ti o tẹle, o mu data wọle lati tabili yii lati ṣẹda awọn tabili idanwo.

Awọn ọna lọpọlọpọ lo wa lati ṣalaye asọye tabili kan: nṣiṣẹ DDL, AWS Glue crawler, AWS Glue Data Catalog API, ati bẹbẹ lọ. Ni igbesẹ yii, o nṣiṣẹ DDL nipasẹ console Athena.

Pari awọn igbesẹ wọnyi lati ṣẹda "bucketing_blog"."noaa_remote_original" tabili ni Data Catalog:

  1. Ṣii Athena console.
  2. Ninu olootu ibeere, ṣiṣẹ DDL atẹle lati ṣẹda aaye data Glue AWS tuntun kan:
    -- Create Glue database
    CREATE DATABASE bucketing_blog;

  3. fun database labẹ data, yan bucketing_blog lati ṣeto awọn ti isiyi database.
  4. Ṣiṣe awọn DDL wọnyi lati ṣẹda tabili atilẹba:
    -- Create original table
    CREATE EXTERNAL TABLE `bucketing_blog`.`noaa_remote_original`(
      `station` STRING, 
      `date` STRING, 
      `source` STRING, 
      `latitude` STRING, 
      `longitude` STRING, 
      `elevation` STRING, 
      `name` STRING, 
      `report_type` STRING, 
      `call_sign` STRING, 
      `quality_control` STRING, 
      `wnd` STRING, 
      `cig` STRING, 
      `vis` STRING, 
      `tmp` STRING, 
      `dew` STRING, 
      `slp` STRING, 
      `aj1` STRING, 
      `gf1` STRING, 
      `mw1` STRING)
    PARTITIONED BY (
        year STRING)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
    WITH SERDEPROPERTIES ( 
      'escapeChar'='',
      'quoteChar'='"',
      'separatorChar'=',') 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://noaa-global-hourly-pds/'
    TBLPROPERTIES (
      'skip.header.line.count'='1'
    )

Nitori data orisun ti sọ awọn aaye, a lo OpenCSVSerde dipo aiyipada LazySimpleSerde.

Awọn faili CSV wọnyi ni laini akọsori, eyiti a sọ fun Athena lati fo nipa fifi kun skip.header.line.count ati ṣeto iye si 1.

Fun awọn alaye diẹ sii, tọka si ṢiiCSVSerDe fun sisẹ CSV.

  1. Ṣiṣe DDL atẹle lati ṣafikun awọn ipin. A ṣafikun awọn ipin nikan fun awọn ọdun 5 ninu awọn ọdun 124 da lori ibeere ọran lilo:
    -- Load partitions
    ALTER TABLE `bucketing_blog`.`noaa_remote_original` ADD
      PARTITION (year = '2024') LOCATION 's3://noaa-global-hourly-pds/2024/'
      PARTITION (year = '2023') LOCATION 's3://noaa-global-hourly-pds/2023/'
      PARTITION (year = '2022') LOCATION 's3://noaa-global-hourly-pds/2022/'
      PARTITION (year = '2021') LOCATION 's3://noaa-global-hourly-pds/2021/'
      PARTITION (year = '2020') LOCATION 's3://noaa-global-hourly-pds/2020/';

  2. Ṣiṣe DML atẹle lati mọ daju ti o ba le ṣaṣeyọri beere data naa:
    -- Check data 
    SELECT * FROM "bucketing_blog"."noaa_remote_original" LIMIT 10;

Bayi o ti ṣetan lati bẹrẹ ibeere tabili atilẹba lati ṣayẹwo iṣẹ ṣiṣe ipilẹ.

  1. Ṣiṣe ibeere kan lodi si tabili atilẹba lati ṣe iṣiro iṣẹ ṣiṣe ibeere bi ipilẹ. Ibeere atẹle yii yan awọn igbasilẹ fun awọn ibudo kan pato marun pẹlu iru ijabọ CRN05:
    -- Baseline
    SELECT station, report_type, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, tmp
    FROM "bucketing_blog"."noaa_remote_original"
    WHERE
        report_type = 'CRN05'
        AND ( station = '99999904237'
            OR station = '99999953132'
            OR station = '99999903061'
            OR station = '99999963856'
            OR station = '99999994644'
        );

A ran ibeere yii ni igba mẹwa. Iwọn akoko ṣiṣe ibeere apapọ fun awọn ibeere 10 jẹ iṣẹju-aaya 10, eyiti o gun ju ibi-afẹde wa ti awọn aaya 27.6 lọ, ati pe data 10 GB ti ṣayẹwo lati da awọn igbasilẹ miliọnu 155.75 pada. Eyi ni iṣẹ ipilẹ ti tabili aise atilẹba. O to akoko lati bẹrẹ iṣapeye ipilẹ data lati ipilẹsẹ yii.

Nigbamii ti, o ṣẹda awọn tabili pẹlu awọn ipo oriṣiriṣi lati atilẹba: ọkan laisi bucketing ati ọkan pẹlu bucketing, ki o ṣe afiwe wọn.

Ṣe ilọsiwaju ipilẹ data nipa lilo Athena CTAS

Ni abala yii, a lo ibeere Athena CTAS lati mu iṣeto data dara si ati ọna kika rẹ.

Ni akọkọ, jẹ ki a ṣẹda tabili pẹlu ipin ṣugbọn laisi bucketing. Awọn titun tabili ti wa ni partitioned nipasẹ awọn iwe report_type nitori pupọ julọ awọn ibeere ti a nireti lo iwe yii ni gbolohun NIBI, ati awọn nkan ti wa ni ipamọ bi Parquet pẹlu funmorawon Snappy.

  1. Ṣii olootu ibeere Athena.
  2. Ṣiṣe ibeere atẹle, pese garawa S3 tirẹ ati ìpele:
    --CTAS, non-bucketed
    CREATE TABLE "bucketing_blog"."athena_non_bucketed"
    WITH (
        external_location = 's3://<your-s3-location>/athena-non-bucketed/',
        partitioned_by = ARRAY['report_type'],
        format = 'PARQUET',
        write_compression = 'SNAPPY'
    )
    AS
    SELECT
        station, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, vis, tmp, dew, slp, aj1, gf1, mw1, report_type
    FROM "bucketing_blog"."noaa_remote_original"
    ;

Data rẹ yẹ ki o dabi awọn sikirinisoti wọnyi.


Awọn faili 30 wa labẹ ipin.

Nigbamii ti, o ṣẹda tabili kan pẹlu bucketing ara Hive. Nọmba awọn buckets nilo lati farabalẹ ni aifwy nipasẹ awọn idanwo fun ọran lilo tirẹ. Ni gbogbogbo, awọn garawa diẹ sii ti o ni, kere si granularity, eyiti o le ja si iṣẹ ṣiṣe to dara julọ. Ni apa keji, ọpọlọpọ awọn faili kekere le ṣafihan ailagbara ni igbero ibeere ati sisẹ. Paapaa, bucketing ṣiṣẹ nikan ti o ba n beere awọn iye diẹ ti bọtini bucketing. Awọn iye diẹ sii ti o ṣafikun si ibeere rẹ, diẹ sii ni o ṣeeṣe pe iwọ yoo pari ni kika gbogbo awọn garawa.

Atẹle ni ibeere ipilẹ lati mu dara si:

-- Baseline
SELECT station, report_type, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, tmp
FROM "bucketing_blog"."noaa_remote_original"
WHERE
    report_type = 'CRN05'
    AND ( station = '99999904237'
        OR station = '99999953132'
        OR station = '99999903061'
        OR station = '99999963856'
        OR station = '99999994644'
    );

Ni apẹẹrẹ yii, tabili naa yoo wa ni bucket sinu awọn buckets 16 nipasẹ ọwọn kardinality giga kan (station), eyiti o yẹ ki o lo fun gbolohun NIBI ninu ibeere naa. Gbogbo awọn ipo miiran wa kanna. Ibeere ipilẹ ni awọn iye marun ni ID ibudo, ati pe o nireti awọn ibeere lati ni ni ayika nọmba yẹn ni pupọ julọ, eyiti o kere ju nọmba awọn buckets lọ, nitorinaa 16 yẹ ki o ṣiṣẹ daradara. O ṣee ṣe lati pato nọmba nla ti awọn garawa, ṣugbọn CTAS ko le ṣee lo ti apapọ nọmba awọn ipin ba kọja 100.

  1. Ṣiṣe ibeere wọnyi:
    -- CTAS, Hive-bucketed
    CREATE TABLE "bucketing_blog"."athena_bucketed"
    WITH (
        external_location = 's3://<your-s3-location>/athena-bucketed/',
        partitioned_by = ARRAY['report_type'],
        bucketed_by = ARRAY['station'],
        bucket_count = 16,
        format = 'PARQUET',
        write_compression = 'SNAPPY'
    )
    AS
    SELECT
        station, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, vis, tmp, dew, slp, aj1, gf1, mw1, report_type
    FROM "bucketing_blog"."noaa_remote_original"
    ;

Ibeere naa ṣẹda awọn nkan S3 ti a ṣeto bi o ṣe han ninu awọn sikirinisoti atẹle.


Ifilelẹ ipele-tabili wulẹ gangan kanna laarin athena_non_bucketed ati athena_bucketed: awọn ipin 13 wa ni tabili kọọkan. Iyatọ jẹ nọmba awọn nkan labẹ awọn ipin. Awọn nkan 16 wa (awọn garawa) fun ipin kan, ti aijọju 10-25 MB kọọkan ninu ọran yii. Nọmba awọn buckets jẹ igbagbogbo ni iye pàtó kan laibikita iye data, ṣugbọn iwọn garawa da lori iye data.

Bayi o ti ṣetan lati beere lodi si tabili kọọkan lati ṣe iṣiro iṣẹ ṣiṣe ibeere. Ibeere naa yoo yan awọn igbasilẹ pẹlu awọn ibudo kan pato marun ati iru ijabọ CRN05 fun ọdun 5 sẹhin. Botilẹjẹpe o ko le rii iru data ti ibudo kan pato ti o wa ninu garawa wo, o ti ṣe iṣiro ati pe o wa ni deede nipasẹ Athena.

  1. Beere tabili ti ko ni bucket pẹlu alaye atẹle:
    -- No bucketing 
    SELECT station, report_type, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, tmp
    FROM "bucketing_blog"."athena_non_bucketed"
    WHERE
        report_type = 'CRN05'
        AND ( station = '99999904237'
            OR station = '99999953132'
            OR station = '99999903061'
            OR station = '99999963856'
            OR station = '99999994644'
        );


A ran ibeere yii ni igba mẹwa. Iwọn akoko ṣiṣe ti awọn ibeere 10 jẹ iṣẹju-aaya 10, ati 10.95 MB ti data ti ṣayẹwo lati da awọn igbasilẹ 358 milionu pada. Mejeeji asiko asiko ati iwọn ọlọjẹ ti dinku ni pataki nitori pe o ti pin data naa, ati pe o le ka ipin kan nikan nibiti awọn ipin 2.21 ti 12 ti fo. Ni afikun, iye data ti ṣayẹwo ti lọ silẹ lati 13 GB si 206 MB, eyiti o jẹ idinku ti 360%. Eyi kii ṣe nitori ipin nikan, ṣugbọn tun nitori iyipada ọna kika rẹ si Parquet ati funmorawon pẹlu Snappy.

  1. Beere tabili ti o bukẹti pẹlu alaye atẹle:
    -- Hive bucketing
    SELECT station, report_type, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, tmp
    FROM "bucketing_blog"."athena_bucketed"
    WHERE
        report_type = 'CRN05'
        AND ( station = '99999904237'
            OR station = '99999953132'
            OR station = '99999903061'
            OR station = '99999963856'
            OR station = '99999994644'
        );


A ran ibeere yii ni igba mẹwa. Iwọn akoko ṣiṣe ti awọn ibeere 10 jẹ iṣẹju-aaya 10, ati 7.82 MB ti data ti ṣayẹwo lati da awọn igbasilẹ 69 milionu pada. Eyi tumọ si idinku ti apapọ asiko asiko lati 2.21 si 10.95 awọn aaya (-7.82%), ati idinku iyalẹnu ti data ti ṣayẹwo lati 29 MB si 358 MB (-69%) lati pada nọmba kanna ti awọn igbasilẹ ni akawe pẹlu tabili ti kii ṣe bucketed . Ni idi eyi, mejeeji akoko asiko isise ati data ti ṣayẹwo ni ilọsiwaju nipasẹ bucketing. Eyi tumọ si bucketing ṣe alabapin kii ṣe si iṣẹ nikan ṣugbọn tun si idinku idiyele.

riro

Gẹgẹbi a ti sọ tẹlẹ, iwọn garawa rẹ ni pẹkipẹki lati mu iṣẹ ṣiṣe ti ibeere rẹ pọ si. Bucketing ṣiṣẹ nikan ti o ba n beere awọn iye diẹ ti bọtini bucketing. Gbiyanju ṣiṣẹda awọn garawa diẹ sii ju nọmba awọn iye ti a reti ninu ibeere gangan.

Ni afikun, ibeere Athena CTAS kan ni opin lati ṣẹda to awọn ipin 100 ni akoko kan. Ti o ba nilo nọmba nla ti awọn ipin, o le fẹ lo AWS Glue jade, yipada, ati fifuye (ETL), botilẹjẹpe o wa workaround lati pin si ọpọ SQL gbólóhùn.

Jeki ipilẹ data nipa lilo AWS Glue ETL

Apache Spark jẹ ilana iṣelọpọ pinpin orisun ṣiṣi ti o jẹ ki ETL rọ pẹlu PySpark, Scala, ati Spark SQL. O gba ọ laaye lati pin ati garawa data rẹ da lori awọn ibeere rẹ. Spark ni ọpọlọpọ awọn aṣayan atunṣe lati mu yara awọn iṣẹ. O le ṣe adaṣe lainidi ati ṣe atẹle awọn iṣẹ Spark. Ni apakan yii, a lo awọn iṣẹ AWS Glue ETL lati ṣiṣẹ koodu Spark lati mu ipilẹ data dara si.

Ko dabi bucketing Athena, AWS Glue ETL nlo bucketing ti o da lori Spark bi algorithm bucketing kan. Gbogbo ohun ti o nilo lati ṣe ni ṣafikun ohun-ini tabili atẹle yii sori tabili: bucketing_format = 'spark'. Fun awọn alaye nipa ohun-ini tabili yii, wo Pipin ati bucketing ni Athena.

Pari awọn igbesẹ wọnyi lati ṣẹda tabili kan pẹlu bucketing nipasẹ AWS Glue ETL:

  1. Lori console AWS Glue, yan Awọn iṣẹ ETL ninu ohun elo lilọ kiri.
  2. yan Ṣẹda iṣẹ ati yan Oju ETL.
  3. labẹ Fi awọn apa kun, yan Katalogi data lẹ pọ AWS fun awọn orisun.
  4. fun database, yan bucketing_blog.
  5. fun Table, yan noaa_remote_original.
  6. labẹ Fi awọn apa kun, yan Yi Eto fun Awọn iyipada.
  7. labẹ Fi awọn apa kun, yan Aṣa Iyipada fun Awọn iyipada.
  8. fun Name, tẹ ToS3WithBucketing.
  9. fun Node obi, yan Yi Eto.
  10. fun Idina koodu, tẹ koodu snippet wọnyi sii:
    def ToS3WithBucketing (glueContext, dfc) -> DynamicFrameCollection:
        # Convert DynamicFrame to DataFrame
        df = dfc.select(list(dfc.keys())[0]).toDF()
        
        # Write to S3 with bucketing and partitioning
        df.repartition(1, "report_type") 
            .write.option("path", "s3://<your-s3-location>/glue-bucketed/") 
            .mode("overwrite") 
            .partitionBy("report_type") 
            .bucketBy(16, "station") 
            .format("parquet") 
            .option("compression", "snappy") 
            .saveAsTable("bucketing_blog.glue_bucketed")

Sikirinifoto atẹle yii fihan iṣẹ ti a ṣẹda nipa lilo AWS Glue Studio lati ṣe agbekalẹ tabili kan ati data.

Ipin kọọkan duro fun atẹle naa:

  • awọn Katalogi data lẹ pọ AWS ipade èyà awọn noaa_remote_original tabili lati Data Catalog
  • awọn Yi Eto ipade rii daju pe o gbe awọn ọwọn ti a forukọsilẹ ni Iwe akọọlẹ Data
  • awọn ToS3WithBucketing ipade kọ data si Amazon S3 pẹlu ipin mejeeji ati bucketing orisun sipaki

Iṣẹ naa ti kọ ni aṣeyọri ninu olootu wiwo.

  1. labẹ Awọn alaye iṣẹ, fun IAM ipa, yan tirẹ Idanimọ AWS ati Isakoso Wiwọle (IAM) ipa fun iṣẹ yii.
  2. fun Osise iru, yan G.8X.
  3. fun Nọmba ti o beere fun awọn oṣiṣẹ, wọle 5.
  4. yan Fipamọ, lẹhinna yan Run.

Lẹhin awọn igbesẹ wọnyi, tabili glue_bucketed. ti ṣẹda.

  1. yan tabili ninu iwe lilọ kiri, ko si yan tabili glue_bucketed.
  2. Lori išë akojọ, yan Ṣatunkọ tabili labẹ Ṣakoso awọn.
  3. ni awọn Table-ini apakan, yan fi.
  4. Ṣafikun bata bọtini kan pẹlu bọtini bucketing_format ati iye sipaki.
  5. yan Fipamọ.

Bayi o to akoko lati beere awọn tabili.

  1. Beere tabili ti o bukẹti pẹlu alaye atẹle:
    -- Spark bucketing
    SELECT station, report_type, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, tmp
    FROM "bucketing_blog"."glue_bucketed"
    WHERE
        report_type = 'CRN05'
        AND ( station = '99999904237'
            OR station = '99999953132'
            OR station = '99999903061'
            OR station = '99999963856'
            OR station = '99999994644'
        );


A ran ibeere naa ni igba mẹwa. Iwọn akoko ṣiṣe ti awọn ibeere 10 jẹ iṣẹju-aaya 10, ati 7.09 MB ti data ti ṣayẹwo lati da awọn igbasilẹ 88 milionu pada. Ni ọran yii, mejeeji akoko asiko-ṣiṣe ati data ti ṣayẹwo ni ilọsiwaju nipasẹ bucketing. Eyi tumọ si bucketing ṣe alabapin kii ṣe si iṣẹ nikan ṣugbọn tun si idinku idiyele.

Idi fun awọn baiti nla ti a ṣayẹwo ni akawe si apẹẹrẹ Athena CTAS ni pe awọn iye ti pin ni oriṣiriṣi ni tabili yii. Ninu tabili bucketed AWS Glue, awọn iye ti pin lori awọn faili marun. Ninu tabili bucketed Athena CTAS, awọn iye ti pin lori awọn faili mẹrin. Ranti pe awọn ori ila ti pin si awọn garawa nipa lilo iṣẹ hash. Algorithm bucketing Spark nlo iṣẹ hash ti o yatọ ju Hive, ati ninu ọran yii, o yorisi pinpin oriṣiriṣi kọja awọn faili naa.

riro

Lẹ pọ DynamicFrame ko ni atilẹyin bucketing abinibi. O nilo lati lo Spark DataFrame dipo DynamicFrame si awọn tabili garawa.

Fun alaye nipa ṣiṣe atunṣe AWS Glue ETL ti o dara, tọka si Awọn iṣe ti o dara julọ fun titunṣe iṣẹ ṣiṣe AWS Glue fun awọn iṣẹ Apache Spark.

Ṣe ilọsiwaju ipilẹ data Iceberg pẹlu ipin ti o farapamọ

Apache Iceberg jẹ ọna kika tabili ṣiṣi iṣẹ giga fun awọn tabili itupalẹ nla, mu igbẹkẹle ati ayedero ti awọn tabili SQL si data nla. Laipẹ, ibeere nla ti wa lati lo awọn tabili Apache Iceberg lati ṣaṣeyọri awọn agbara ilọsiwaju bii idunadura ACID, ibeere irin-ajo akoko, ati diẹ sii.

Ni Iceberg, bucketing ṣiṣẹ yatọ si ọna tabili Hive ti a ti rii titi di isisiyi. Ni Iceberg, bucketing jẹ ipin ti ipin, ati pe o le lo ni lilo iyipada ipin garawa. Ọna ti o lo ati abajade ipari jẹ iru si bucketing ni awọn tabili Hive. Fun alaye diẹ sii nipa awọn iyipada garawa Iceberg, tọka si Awọn alaye Iyipada garawa.

Pari awọn igbesẹ wọnyi:

  1. Ṣii olootu ibeere Athena.
  2. Ṣiṣe ibeere atẹle lati ṣẹda tabili Iceberg kan pẹlu ipin ti o farapamọ pẹlu bucketing:
    -- CTAS, Iceberg-bucketed
    CREATE TABLE "bucketing_blog"."athena_bucketed_iceberg"
    WITH (table_type = 'ICEBERG',
          location = 's3://<your-s3-location>/athena-bucketed-iceberg/', 
          is_external = false,
          partitioning = ARRAY['report_type', 'bucket(station, 16)'],
          format = 'PARQUET',
          write_compression = 'SNAPPY'
    ) 
    AS
    SELECT
        station, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, vis, tmp, dew, slp, aj1, gf1, mw1, report_type
    FROM "bucketing_blog"."noaa_remote_original"
    ;

Data rẹ yẹ ki o dabi sikirinifoto atẹle.

Awọn folda meji wa: data ati metadata. Lu si isalẹ lati data.

O ri ID ìpele labẹ awọn data folda. Yan ọkan akọkọ lati wo awọn alaye rẹ.

O ri awọn oke-ipele ipin da lori awọn report_type ọwọn. Lu si isalẹ lati awọn tókàn ipele.

O ri keji-ipele ipin, bucketed pẹlu awọn station iwe.

Awọn faili data Parquet wa labẹ awọn folda wọnyi.

  1. Beere tabili ti o bukẹti pẹlu alaye atẹle:
    -- Iceberg bucketing
    SELECT station, report_type, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, tmp
    FROM "bucketing_blog"."athena_bucketed_iceberg"
    WHERE
        report_type = 'CRN05'
        AND
        ( station = '99999904237'
            OR station = '99999953132'
            OR station = '99999903061'
            OR station = '99999963856'
            OR station = '99999994644'
        );


Pẹlu tabili bucketed Iceberg, apapọ akoko asiko ti awọn ibeere 10 jẹ iṣẹju-aaya 8.03, ati 148 MB ti data ti ṣayẹwo lati da awọn igbasilẹ 2.21 milionu pada. Eyi ko ṣiṣẹ daradara ju bucketing pẹlu AWS Glue tabi Athena, ṣugbọn ni imọran awọn anfani ti awọn ẹya oriṣiriṣi Iceberg, o wa laarin iwọn itẹwọgba.

awọn esi

Awọn wọnyi tabili akopọ gbogbo awọn esi.

. noaa_remote_original athena_non_bucketed athena_bucketed glue_bucketed athena_bucketed_iceberg
kika CSV Ajẹyọ Ajẹyọ Ajẹyọ Iceberg (Parquet)
funmorawon n / a Snappy Snappy Snappy Snappy
Ti a ṣẹda nipasẹ n / a Athena CTAS Athena CTAS Lẹ pọ ETL Athena CTAS pẹlu Iceberg
engine n / a Tẹtẹ Tẹtẹ Agbejade Afun Apache Iceberg
Iwọn tabili (GB) 155.8 5.0 5.0 5.8 5.0
Nọmba ti Awọn nkan S3 53360 376 192 192 195
Ṣe ipin bi? Bẹẹni ṣugbọn pẹlu ọna ti o yatọ Bẹẹni Bẹẹni Bẹẹni Bẹẹni
Ti wa ni garawa? Rara Rara Bẹẹni Bẹẹni Bẹẹni
Bucketing kika n / a n / a Agbon Spark tente
Nọmba ti garawa n / a n / a 16 16 16
Àkókò ìṣiṣẹ́ ìpíndọ́gba (iṣẹ́jú ìṣẹ́jú) 29.178 10.950 7.815 7.089 8.030
Iwọn ti a ṣayẹwo (MB) 206640.0 358.6 69.1 87.8 147.7

pẹlu athena_bucketed, glue_bucketed, Ati athena_bucketed_iceberg, o ni anfani lati pade ibi-afẹde lairi ti awọn aaya 10. Pẹlu bucketing, o rii idinku 25–40% ni asiko ṣiṣe ati idinku 60–85% ni iwọn ọlọjẹ, eyiti o le ṣe alabapin si lairi mejeeji ati iṣapeye idiyele.

Gẹgẹbi o ti le rii lati abajade, botilẹjẹpe ipin ṣe alabapin pataki lati dinku akoko asiko mejeeji ati iwọn ọlọjẹ, bucketing tun le ṣe alabapin lati dinku wọn siwaju.

Athena CTAS jẹ taara ati yara to lati pari ilana ṣiṣe bucketing. AWS Glue ETL rọ diẹ sii ati iwọn lati ṣaṣeyọri awọn ọran lilo ilọsiwaju. O le yan boya ọna ti o da lori ibeere rẹ ati ọran lilo, nitori o le lo anfani ti bucketing nipasẹ boya aṣayan.

ipari

Ninu ifiweranṣẹ yii, a ṣe afihan bii o ṣe le mu ki ipilẹ data tabili rẹ pọ si pẹlu ipin ati bucketing nipasẹ Athena CTAS ati AWS Glue ETL. A fihan pe bucketing ṣe alabapin si isare lairi ibeere ati idinku iwọn ọlọjẹ lati mu awọn idiyele siwaju sii. A tun jiroro bucketing fun awọn tabili Iceberg nipasẹ ipin farasin.

Bucketing kan kan ilana lati je ki ipilẹ data nipa didin data ọlọjẹ. Fun iṣapeye gbogbo ifilelẹ data rẹ, a ṣeduro gbigbe awọn aṣayan miiran bii ipinpin, lilo ọna kika faili columnar, ati funmorawon ni apapo pẹlu bucketing. Eyi le mu data rẹ ṣiṣẹ lati mu iṣẹ ṣiṣe ibeere siwaju sii.

Dun bucketing!


Nipa awọn onkọwe

Takeshi Nakatani jẹ Oludamọran Data Nla akọkọ lori ẹgbẹ Awọn iṣẹ Ọjọgbọn ni Tokyo. O ni awọn ọdun 26 ti iriri ni ile-iṣẹ IT, pẹlu oye ni awọn amayederun data ayaworan. Ni awọn ọjọ isinmi rẹ, o le jẹ onilu apata tabi alupupu.

Noritaka Sekiyama jẹ Onitumọ Data Big Principal lori ẹgbẹ AWS Glue. O jẹ iduro fun kikọ awọn ohun elo sọfitiwia lati ṣe iranlọwọ fun awọn alabara. Ni akoko apoju rẹ, o gbadun gigun kẹkẹ pẹlu keke opopona rẹ.

iranran_img

Titun oye

iranran_img