How Amazon optimized its high-volume financial reconciliation process with Amazon EMR for higher scalability and performance | Amazon Web Services

Account reconciliation is an important step in ensuring the completeness and accuracy of financial statements. Specifically, companies must reconcile balance sheet accounts that could contain significant or material misstatements. Accountants go through each account in the general ledger and verify that the listed balance is complete and accurate. When discrepancies are found, accountants investigate and take appropriate corrective action.

As part of Amazon's FinTech organization, we offer a software platform that lets internal accounting teams at Amazon conduct account reconciliations. To optimize the reconciliation process, these users need high-performance transformations that scale on demand, and the ability to process file sizes ranging from a few MBs to more than 100 GB. It's not always possible to fit the data onto a single machine or process it with a single program in a reasonable time frame. The computation has to run fast enough to provide a practical service in which the programming logic is separated from the underlying details (data distribution, fault tolerance, and scheduling).

Distributed data processing solutions let us run these computations simultaneously, applying the same function across groups of elements of a dataset on multiple machines or threads. This encouraged us to reinvent our reconciliation service on AWS services, including Amazon EMR and the Apache Spark distributed processing framework, which we use through PySpark. The service enables users to process files over 100 GB containing up to 100 million transactions in less than 30 minutes. The reconciliation service has become a powerhouse for data processing: users can now seamlessly perform a variety of operations, such as Pivot, JOIN (like an Excel VLOOKUP operation), arithmetic operations, and more, which gives them a versatile and efficient way to reconcile vast datasets. This improvement is a testament to the scalability and speed achieved by adopting distributed data processing.

In this post, we explain how we integrated Amazon EMR to build a highly available and scalable system that lets us run a high-volume financial reconciliation process.

Architecture before migration

The following diagram illustrates our previous architecture.

Our legacy service was built with Amazon Elastic Container Service (Amazon ECS) on AWS Fargate. We processed the data sequentially using Python. However, because it lacked parallel processing capability, we frequently had to scale the cluster vertically to support larger datasets. For context, 5 GB of data with 50 operations took around 3 hours to process. The service was configured to scale horizontally to five ECS instances that polled messages from Amazon Simple Queue Service (Amazon SQS), which fed the transformation requests. Each instance was configured with 4 vCPUs and 30 GB of memory. However, we couldn't improve performance by adding capacity, because the processing happened sequentially, picking up chunks of data from Amazon Simple Storage Service (Amazon S3). For example, a VLOOKUP operation that joins two files required both files to be read into memory chunk by chunk to produce the output. This became an obstacle for users, because they had to wait a long time for their datasets to be processed.
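To make the bottleneck concrete, here is a minimal sketch of a chunk-by-chunk VLOOKUP-style join in plain Python. It is not the legacy service's code: the function names, the chunk size, and the in-memory sample rows are assumptions made for illustration. The point is that every chunk of the left file triggers a full pass over the right file on a single thread, so runtime grows with the product of the two file sizes.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, str]


def read_in_chunks(rows: Iterable[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most chunk_size rows (hypothetical helper)."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def sequential_vlookup(left: List[Row], right: List[Row], key: str,
                       value: str, chunk_size: int = 2) -> List[Row]:
    """Left-join `value` from `right` onto `left`, one chunk at a time."""
    output: List[Row] = []
    for left_chunk in read_in_chunks(left, chunk_size):
        wanted = {row[key] for row in left_chunk}
        found: Dict[str, str] = {}
        # Full re-scan of the right side for every chunk of the left side.
        for right_chunk in read_in_chunks(right, chunk_size):
            for row in right_chunk:
                if row[key] in wanted and row[key] not in found:
                    found[row[key]] = row[value]  # first match wins, like VLOOKUP
        for row in left_chunk:
            output.append({**row, value: found.get(row[key], "#N/A")})
    return output


# Illustrative data only.
ledger = [{"account": "1000"}, {"account": "2000"}, {"account": "3000"}]
balances = [{"account": "2000", "balance": "15.00"},
            {"account": "1000", "balance": "42.50"}]
joined = sequential_vlookup(ledger, balances, key="account", value="balance")
```

A distributed engine such as Spark avoids the repeated scans by partitioning both sides on the join key and joining the partitions in parallel.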

As part of our re-architecture and modernization, we wanted to achieve the following:

  • High availability - The data processing clusters should be highly available, providing three 9s of availability (99.9%)
  • Throughput - The service should handle 1,500 runs per day
  • Latency - It should be able to process 100 GB of data within 30 minutes
  • Heterogeneity - The cluster should be able to support a wide variety of workloads, with files ranging from a few MBs to hundreds of GBs
  • Query concurrency - The implementation needs to support a minimum of 10 degrees of concurrency
  • Reliability of jobs and data consistency - Jobs need to run reliably and consistently to avoid breaking Service Level Agreements (SLAs)
  • Cost-effective and scalable - It must scale with the workload, which makes it cost-effective
  • Security and compliance - Given the sensitivity of the data, it must support fine-grained access control and appropriate security implementations
  • Monitoring - The solution must offer end-to-end monitoring of the clusters and jobs

Why Amazon EMR

Amazon EMR is the industry-leading cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning (ML) using open source frameworks such as Apache Spark, Apache Hive, and Presto. With these frameworks and related open source projects, you can process data for analytics purposes and BI workloads. Amazon EMR lets you transform and move large amounts of data in and out of other AWS data stores and databases, such as Amazon S3 and Amazon DynamoDB.

A notable advantage of Amazon EMR is its effective use of parallel processing with PySpark, a significant improvement over traditional sequential Python code. Amazon EMR streamlines the deployment and scaling of Apache Spark clusters, which allows efficient parallelization over large datasets. The distributed computing infrastructure not only improves performance, but also makes it possible to process vast amounts of data quickly. PySpark's libraries support Excel-like operations on DataFrames, and the higher-level DataFrame abstraction simplifies intricate data manipulations and reduces code complexity. Combined with automatic cluster provisioning, dynamic resource allocation, and integration with other AWS services, Amazon EMR is a versatile solution suitable for diverse workloads, from batch processing to ML. The inherent fault tolerance in PySpark and Amazon EMR promotes robustness, even when nodes fail, which makes it a scalable, cost-effective, and high-performance choice for parallel data processing on AWS.

Amazon EMR extends its capabilities beyond the basics, offering a variety of deployment options for different needs. Whether it's Amazon EMR on EC2, Amazon EMR on EKS, Amazon EMR Serverless, or Amazon EMR on AWS Outposts, you can tailor your approach to specific requirements. For those seeking a serverless environment for Spark jobs, AWS Glue is also a viable option. In addition to supporting various open source frameworks, including Spark, Amazon EMR provides flexibility in choosing deployment modes, Amazon Elastic Compute Cloud (Amazon EC2) instance types, scaling mechanisms, and numerous cost optimization techniques.

Amazon EMR stands as a dynamic force in the cloud for organizations seeking robust big data solutions. Its seamless integration, powerful features, and adaptability make it an indispensable tool for navigating the complexities of data analytics and ML on AWS.

Redesigned architecture

The following diagram illustrates our redesigned architecture.

The solution operates under an API contract: clients submit transformation configurations that define the set of operations along with the S3 location of the dataset to process. Each request is queued through Amazon SQS, then directed to Amazon EMR by a Lambda function. The function creates an Amazon EMR step that runs the Spark job on a dedicated EMR cluster. Although Amazon EMR accommodates an unlimited number of steps over the lifetime of a long-running cluster, only 256 steps can be running or pending at the same time. For optimal parallelization, the step concurrency is set to 10, which allows 10 steps to run concurrently. If a request fails, the Amazon SQS dead-letter queue (DLQ) retains the event. Spark processes the request, translating the Excel-like operations into PySpark code for an efficient query plan. Resilient DataFrames store input, output, and intermediate data in memory, which speeds up processing, reduces disk I/O cost, and improves workload performance. The final output is delivered to the specified Amazon S3 location.
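The hand-off from Amazon SQS to Amazon EMR could look roughly like the following Lambda handler. This is a hedged sketch, not the service's actual code: the cluster ID, the script location, the request fields (`request_id`, `input_s3_uri`, `operations`), and the `--input` and `--operations` script arguments are all hypothetical. The only AWS call used is the EMR `add_job_flow_steps` API, and the client is injectable so the logic can be exercised without AWS access.

```python
import json
from typing import Any, Dict, List

# Hypothetical values: replace with your own cluster ID and script location.
CLUSTER_ID = "j-EXAMPLECLUSTER"
SPARK_SCRIPT = "s3://example-bucket/scripts/reconcile.py"


def build_spark_step(request: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one transformation request into an EMR step definition."""
    return {
        "Name": f"reconcile-{request['request_id']}",
        # CONTINUE keeps the shared cluster running if this step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster", SPARK_SCRIPT,
                # Assumed script arguments, not part of any AWS API.
                "--input", request["input_s3_uri"],
                "--operations", json.dumps(request["operations"]),
            ],
        },
    }


def handler(event: Dict[str, Any], context: Any, emr_client: Any = None) -> List[str]:
    """Lambda entry point for an SQS-triggered invocation."""
    if emr_client is None:
        import boto3  # imported lazily so the sketch can be tested without AWS
        emr_client = boto3.client("emr")
    # Each SQS record carries one JSON-encoded transformation request.
    steps = [build_spark_step(json.loads(record["body"]))
             for record in event["Records"]]
    response = emr_client.add_job_flow_steps(JobFlowId=CLUSTER_ID, Steps=steps)
    return response["StepIds"]
```

If the handler raises, the message returns to the queue, and after the configured number of receives Amazon SQS moves it to the dead-letter queue, which matches the failure path described above.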

We define our SLA in two dimensions: latency and throughput. Latency is the time taken to perform one job for a given dataset size and number of operations on that dataset. Throughput is the maximum number of simultaneous jobs the service can perform without breaching the latency SLA of any one job. The overall scalability SLA of the service depends on the balance between horizontal scaling of elastic compute resources and vertical scaling of individual servers.

Because we had to run 1,500 processes per day with low latency and high performance, we chose the Amazon EMR on EC2 deployment mode with managed scaling enabled to support processing variable file sizes.
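A quick back-of-the-envelope check shows how the throughput and concurrency targets interact. The sketch below uses only figures stated in this post (1,500 runs per day, a step concurrency of 10, and the 30-minute latency target) and assumes, for simplicity, that requests arrive evenly across the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60

runs_per_day = 1_500             # throughput target
step_concurrency = 10            # steps allowed to run at the same time
latency_sla_seconds = 30 * 60    # 100 GB within 30 minutes

# Total step-seconds available per day across all concurrent slots.
slot_seconds_per_day = step_concurrency * SECONDS_PER_DAY

# Average runtime each run can take before the daily backlog starts to grow.
average_budget_seconds = slot_seconds_per_day / runs_per_day

# Runs per day that would fit if every run took the full latency SLA.
worst_case_runs_per_day = slot_seconds_per_day // latency_sla_seconds

print(f"average budget per run: {average_budget_seconds:.0f} s")
print(f"runs per day at the full SLA: {worst_case_runs_per_day}")
```

Under these assumptions the average run has a budget of 576 seconds, and only 480 runs per day would fit if every run took the full 30 minutes. The 30-minute figure is therefore a ceiling for the largest files, while the typical run has to finish much faster.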

The EMR cluster configuration offers many different choices:

  • EMR node types - Primary, core, or task nodes
  • Instance purchasing options - On-Demand Instances, Reserved Instances, or Spot Instances
  • Configuration options - EMR instance fleet or uniform instance group
  • Scaling options - Auto Scaling or Amazon EMR managed scaling

Based on our variable workload, we configured an EMR instance fleet (for best practices, see Reliability). We also decided to use Amazon EMR managed scaling to scale the core and task nodes (for scaling scenarios, refer to Node allocation scenarios). Lastly, we chose memory-optimized AWS Graviton instances, which provide up to 30% lower cost and up to 15% better performance for Spark workloads.

The following code provides a snapshot of our cluster configuration:

Concurrent steps:10

EMR Managed Scaling:
minimumCapacityUnits: 64
maximumCapacityUnits: 512
maximumOnDemandCapacityUnits: 512
maximumCoreCapacityUnits: 512

Master Instance Fleet:
r6g.xlarge
- 4 vCore, 30.5 GiB memory, EBS only storage
- EBS Storage:250 GiB
- Maximum Spot price: 100 % of On-demand price
- Each instance counts as 1 units
r6g.2xlarge
- 8 vCore, 61 GiB memory, EBS only storage
- EBS Storage:250 GiB
- Maximum Spot price: 100 % of On-demand price
- Each instance counts as 1 units

Core Instance Fleet:
r6g.2xlarge
- 8 vCore, 61 GiB memory, EBS only storage
- EBS Storage:100 GiB
- Maximum Spot price: 100 % of On-demand price
- Each instance counts as 8 units
r6g.4xlarge
- 16 vCore, 122 GiB memory, EBS only storage
- EBS Storage:100 GiB
- Maximum Spot price: 100 % of On-demand price
- Each instance counts as 16 units

Task Instances:
r6g.2xlarge
- 8 vCore, 61 GiB memory, EBS only storage
- EBS Storage:100 GiB
- Maximum Spot price: 100 % of On-demand price
- Each instance counts as 8 units
r6g.4xlarge
- 16 vCore, 122 GiB memory, EBS only storage
- EBS Storage:100 GiB
- Maximum Spot price: 100 % of On-demand price
- Each instance counts as 16 units
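To relate the capacity units in the snapshot above to actual instance counts, the following sketch expresses the managed scaling limits in the shape expected by the EMR `put_managed_scaling_policy` API and converts them into instances per type. The `UnitType` of `InstanceFleetUnits` is an assumption based on the cluster using instance fleets, and `instances_needed` is a hypothetical helper; the real fleet can mix both instance types.

```python
import math

# Managed scaling limits copied from the snapshot above.
managed_scaling_policy = {
    "ComputeLimits": {
        "UnitType": "InstanceFleetUnits",  # assumed, because fleets are used
        "MinimumCapacityUnits": 64,
        "MaximumCapacityUnits": 512,
        "MaximumOnDemandCapacityUnits": 512,
        "MaximumCoreCapacityUnits": 512,
    }
}

# Weighted capacity per instance type in the core and task fleets.
units_per_instance = {"r6g.2xlarge": 8, "r6g.4xlarge": 16}


def instances_needed(capacity_units: int, instance_type: str) -> int:
    """Instances of a single type required to cover the given capacity units."""
    return math.ceil(capacity_units / units_per_instance[instance_type])


limits = managed_scaling_policy["ComputeLimits"]
for instance_type in units_per_instance:
    low = instances_needed(limits["MinimumCapacityUnits"], instance_type)
    high = instances_needed(limits["MaximumCapacityUnits"], instance_type)
    print(f"{instance_type}: {low} to {high} instances")
```

With these weights, the cluster scales between 8 and 64 r6g.2xlarge instances, or between 4 and 32 r6g.4xlarge instances, if a single type is used. Because the On-Demand and core limits both equal the overall maximum, the policy allows the whole cluster to run on On-Demand core nodes.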

Performance

With our migration to Amazon EMR, we achieved a system that handles a wide variety of datasets, ranging from as small as 273 B to as large as 88.5 GB, with a p99 latency of 491 seconds (approximately 8 minutes).

The following figure illustrates the variety of file sizes processed.

The following figure shows our latency.

To compare against sequential processing, we took two datasets containing 53 million records each and ran a VLOOKUP operation between them, along with 49 other Excel-like operations. This took 26 minutes to process in the new service, compared to 5 days in the legacy service. That is an improvement of almost 300 times over the previous architecture in terms of performance.
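The speedup figure follows directly from the two runtimes quoted above, as this short calculation shows.

```python
legacy_minutes = 5 * 24 * 60   # 5 days in the legacy service
new_minutes = 26               # 26 minutes in the new service

speedup = legacy_minutes / new_minutes
print(f"about {speedup:.0f}x faster")
```

The ratio is roughly 277, which rounds to the "almost 300 times" quoted in the text.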

Considerations

Keep the following in mind when considering this solution:

  • Right-sizing clusters - Although Amazon EMR is resizable, it's important to right-size the clusters. An undersized cluster is slow, and an oversized cluster costs more. To anticipate these issues, you can calculate the number and type of nodes that the workloads will need.
  • Parallel steps - Running steps in parallel lets you run more advanced workloads, increase cluster resource utilization, and reduce the time taken to complete your workload. The number of steps allowed to run at one time is configurable and can be set when a cluster is launched and at any time after the cluster has started. You need to consider and optimize the CPU and memory usage per job when multiple jobs run in a single shared cluster.
  • Job-based transient EMR clusters - If applicable, we recommend a job-based transient EMR cluster, which delivers superior isolation, so that each task operates within its own dedicated environment. This approach optimizes resource utilization, helps prevent interference between jobs, and improves overall performance and reliability. The transient nature enables efficient scaling, providing a robust and isolated solution for diverse data processing needs.
  • EMR Serverless - EMR Serverless is the ideal choice if you prefer not to handle the management and operation of clusters. It lets you run applications using the open source frameworks available within EMR Serverless, offering a straightforward and hassle-free experience.
  • Amazon EMR on EKS - Amazon EMR on EKS offers distinct advantages, such as faster startup times and improved scalability that resolves compute capacity challenges, which is particularly beneficial for Graviton and Spot Instance users. Support for a broader range of compute types improves cost-efficiency and allows tailored resource allocation. Furthermore, Multi-AZ support provides increased availability. Together, these features provide a robust solution for managing big data workloads with improved performance, cost optimization, and reliability across various computing scenarios.
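For the right-sizing and parallel-steps considerations, a rough per-step resource estimate can be a useful starting point. The sketch below assumes one capacity unit corresponds to one vCore and 61/8 GiB of instance memory, which matches the r6g weights in the cluster snapshot, and it divides the cluster evenly across concurrent steps. It uses raw instance memory, not what YARN makes available to Spark, so treat the numbers as upper bounds. `per_step_share` is a hypothetical helper.

```python
from typing import Dict


def per_step_share(capacity_units: int, concurrent_steps: int,
                   vcores_per_unit: float = 1.0,
                   gib_per_unit: float = 61 / 8) -> Dict[str, float]:
    """Even split of cluster vCores and memory across concurrent steps."""
    total_vcores = capacity_units * vcores_per_unit
    total_gib = capacity_units * gib_per_unit
    return {
        "vcores": total_vcores / concurrent_steps,
        "memory_gib": total_gib / concurrent_steps,
    }


# Managed scaling bounds and step concurrency from the cluster snapshot.
at_minimum = per_step_share(capacity_units=64, concurrent_steps=10)
at_maximum = per_step_share(capacity_units=512, concurrent_steps=10)
print(at_minimum, at_maximum)
```

If the share at minimum capacity is too small for your largest job, that is a signal to raise the minimum capacity, lower the step concurrency, or move large jobs to a transient cluster.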

Conclusion

In this post, we explained how Amazon optimized its high-volume financial reconciliation process with Amazon EMR for higher scalability and performance. If you have a monolithic application that depends on vertical scaling to process additional requests or datasets, then migrating it to a distributed processing framework such as Apache Spark and choosing a managed service such as Amazon EMR for compute may help reduce the runtime and lower your delivery SLA, and may also help reduce the Total Cost of Ownership (TCO).

As we embrace Amazon EMR for this particular use case, we encourage you to explore further possibilities in your own data innovation journey. Consider evaluating AWS Glue, along with other Amazon EMR deployment options such as EMR Serverless or Amazon EMR on EKS, to find the AWS service best suited to your use case. The future of the data innovation journey holds exciting possibilities and advancements to explore.


About the Authors

Jeeshan Khetrapal is a Sr. Software Development Engineer at Amazon, where he develops fintech products based on cloud computing serverless architectures that are responsible for companies' IT general controls, financial reporting, and controllership for governance, risk, and compliance.

Sakti Mishra is a Principal Solutions Architect at AWS, where he helps customers modernize their data architecture and define their end-to-end data strategy, including data security, accessibility, governance, and more. He is also the author of the book Simplify Big Data Analytics with Amazon EMR. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family.
