
A Guide to Migrating from Databricks Delta Lake to Apache Iceberg


Introduction

In the fast-changing world of big data processing and analytics, the ability to manage large datasets effectively is a cornerstone for companies that want to make informed decisions, because it helps them extract useful insights from their data. A variety of solutions have emerged over the past few years, such as Databricks Delta Lake and Apache Iceberg. Both platforms were developed for data lake management, and both offer robust features and functionality. For organizations, however, it is necessary to understand the architectural, technical, and functional differences before migrating an existing platform. This article explores the complex process of transitioning from Databricks Delta Lake to Apache Iceberg.

Learning Objectives

  • Understand the features of Databricks Delta Lake and Apache Iceberg.
  • Learn to compare the architectural components of Databricks Delta Lake and Apache Iceberg.
  • Understand best practices for migrating a Delta Lake architecture to an open-source platform such as Iceberg.
  • Use other third-party tools as alternatives to the Delta Lake platform.

This article was published as a part of the Data Science Blogathon.


Understanding Databricks Delta Lake

Databricks Delta Lake is essentially a sophisticated storage layer built on top of the Apache Spark framework. It offers modern data functionality developed for seamless data management. Delta Lake has several features at its core:

  • ACID Transactions: Delta Lake guarantees the foundational principles of Atomicity, Consistency, Isolation, and Durability for all modifications to user data, ensuring robust and valid data operations.
  • Schema Evolution: Delta Lake supports schema evolution, which lets teams make schema changes without disturbing existing production data pipelines.
  • Time Travel: Like time travel in sci-fi movies, Delta Lake can query snapshots of the data as they existed at particular points in time. This lets users carry out comprehensive historical analysis of the data and gives them versioning capabilities.
  • Optimized File Management: Delta Lake supports robust techniques for organizing and managing data files and metadata, which results in better query performance and lower storage costs.
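
As a rough illustration of the time travel feature, the sketch below reads an earlier version of a Delta table. It assumes a `spark-shell` session (so `spark` is already defined) with the Delta Lake library on the classpath, and it reuses the `s3://testing_bucket/delta-table` path from the migration steps later in this article. Version `0` is only an example and has to exist in the table's transaction log.

```scala
// Assumption: run inside spark-shell with Delta Lake available, so `spark` exists.
val deltaPath = "s3://testing_bucket/delta-table" // path reused from the migration steps

// Current state of the table
val latest = spark.read.format("delta").load(deltaPath)

// State of the table as of its first commit (version 0 is an example value)
val firstVersion = spark.read
  .format("delta")
  .option("versionAsOf", 0L)
  .load(deltaPath)

latest.show()
firstVersion.show()
```

If the table has only ever been written once, both reads return the same rows; the difference becomes visible after a later overwrite or append.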

Features of Apache Iceberg

Apache Iceberg offers a competitive alternative for companies looking for an enhanced data lake management solution. As a table format layered over data files, it improves on working directly with traditional file formats such as Parquet or ORC. It has several distinct advantages:

  • Schema Evolution: Users can apply schema changes without an expensive table rewrite.
  • Snapshot Isolation: Iceberg supports snapshot isolation, which guarantees consistent reads and writes. It allows concurrent modifications to a table without compromising data integrity.
  • Metadata Management: Iceberg separates metadata from data files and stores it in a dedicated location distinct from the data files themselves, which improves performance and enables efficient metadata operations.
  • Partition Pruning: Using advanced pruning techniques, Iceberg improves query performance by reducing the amount of data scanned during query execution.
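
The partition pruning idea above can be sketched with a toy model in plain Scala. This is not the Iceberg API: `DataFile`, `filesToScan`, the `event_date` partition column, and the file paths are all hypothetical names chosen for illustration. The point is only that partition information kept per file in metadata lets a query planner discard files before reading them.

```scala
// Toy model of partition pruning (hypothetical names, not the Iceberg API).
// Metadata records which partition each data file belongs to,
// so a query can skip files without opening them.
final case class DataFile(path: String, eventDate: String, rowCount: Long)

// Keep only the files whose partition value matches the query predicate.
def filesToScan(files: Seq[DataFile], wantedDate: String): Seq[DataFile] =
  files.filter(_.eventDate == wantedDate)

val manifest = Seq(
  DataFile("data/event_date=2024-01-01/part-0.parquet", "2024-01-01", 1000L),
  DataFile("data/event_date=2024-01-02/part-0.parquet", "2024-01-02", 1500L),
  DataFile("data/event_date=2024-01-02/part-1.parquet", "2024-01-02", 500L)
)

// A query filtered on event_date = '2024-01-02' only needs two of the three files.
val selected = filesToScan(manifest, "2024-01-02")
val rowsScanned = selected.map(_.rowCount).sum
val rowsSkipped = manifest.map(_.rowCount).sum - rowsScanned

println(s"files scanned: ${selected.size}, rows scanned: $rowsScanned, rows skipped: $rowsSkipped")
```

The snippet can be pasted into a Scala REPL or `spark-shell`; it has no dependencies beyond the standard library.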

Comparative Analysis of Architectures

Let us take a closer look at how the two architectures compare:

Databricks Delta Lake Architecture

  • Storage Layer: Delta Lake uses cloud storage, for example Amazon S3 or Azure Blob Storage, as its underlying storage layer, which holds both the data files and the transaction logs.
  • Metadata Management: Metadata lives in the transaction log, which leads to efficient metadata operations and guarantees data consistency.
  • Optimization Techniques: Delta Lake uses a range of optimization techniques, including data skipping and Z-ordering, to substantially improve query performance and reduce the overhead of scanning data.

Apache Iceberg Architecture

  • Separation of Metadata: In contrast to Databricks Delta Lake, Iceberg separates metadata from data files and stores the metadata in a repository separate from the data files.
  • Transactional Support: To ensure data integrity and reliability, Iceberg has a robust transaction protocol that guarantees atomic and consistent table operations.
  • Compatibility: Engines such as Apache Spark, Flink, and Presto are readily compatible with Iceberg, so developers have the flexibility to use Iceberg with these real-time and batch processing frameworks.
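
With Spark, Iceberg tables are typically reached through a catalog registered in the session configuration. The fragment below is a minimal sketch, assuming a Hadoop-type Iceberg catalog named `test` (chosen to match the `test.db.iceberg_ctas` identifier used in the migration steps) and a placeholder warehouse path. The application name and the warehouse location are assumptions, and the Delta Lake and Iceberg Spark runtime jars must be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Assumptions: catalog name "test", Hadoop-type catalog, placeholder warehouse path.
val spark = SparkSession.builder()
  .appName("delta-to-iceberg-migration")
  // Enable both the Delta Lake and the Iceberg SQL extensions
  .config(
    "spark.sql.extensions",
    "io.delta.sql.DeltaSparkSessionExtension," +
      "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
  )
  // Let the default session catalog handle Delta tables
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  // Register an Iceberg catalog called "test"
  .config("spark.sql.catalog.test", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.test.type", "hadoop")
  .config("spark.sql.catalog.test.warehouse", "s3://testing_bucket/iceberg-warehouse")
  .getOrCreate()
```

In `spark-shell` the same keys can be passed as `--conf` options at launch instead, since the session already exists when the shell starts.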

Navigating the Migration Landscape: Considerations and Best Practices

Migrating from Databricks Delta Lake to Apache Iceberg requires considerable planning and careful execution. Several considerations apply:

  • Schema Evolution: Ensure full compatibility between the schema evolution features of Delta Lake and Iceberg so that consistency is preserved during schema changes.
  • Data Migration: Develop and put in place a strategy that accounts for factors such as data volume, downtime requirements, and data consistency.
  • Query Compatibility: Check query compatibility between Delta Lake and Iceberg. Doing so makes for a smooth transition and keeps existing query functionality intact after migration.
  • Performance Testing: Run extensive performance and regression tests to check query performance. Resource utilization should also be compared between Iceberg and Delta Lake, so that potential areas for optimization can be identified.
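
One way to make the schema consideration above concrete is a small pre-migration check that compares the column names and types of the source and target tables. The sketch below is plain Scala and purely illustrative: `SchemaDiff`, `compareSchemas`, and the sample column names and types are hypothetical, and it only covers flat schemas.

```scala
// Hypothetical helper: compares "column name -> type" maps taken from the
// source and target tables and reports anything that does not line up.
final case class SchemaDiff(
  missingInTarget: Set[String],
  extraInTarget: Set[String],
  typeMismatches: Map[String, (String, String)]
) {
  def isCompatible: Boolean =
    missingInTarget.isEmpty && extraInTarget.isEmpty && typeMismatches.isEmpty
}

def compareSchemas(source: Map[String, String], target: Map[String, String]): SchemaDiff = {
  val shared = source.keySet.intersect(target.keySet)
  // Columns present on both sides whose declared types differ: (sourceType, targetType)
  val mismatches = shared.collect {
    case col if source(col) != target(col) => (col, (source(col), target(col)))
  }.toMap
  SchemaDiff(source.keySet.diff(target.keySet), target.keySet.diff(source.keySet), mismatches)
}

// Sample schemas (made-up columns for illustration)
val deltaSchema = Map("id" -> "bigint", "name" -> "string", "amount" -> "double")
val icebergSchema = Map("id" -> "bigint", "name" -> "string", "amount" -> "decimal(10,2)")

val diff = compareSchemas(deltaSchema, icebergSchema)
println(diff)
```

In a Spark session, the input maps could be built from a DataFrame with something like `df.schema.fields.map(f => f.name -> f.dataType.simpleString).toMap`, though nested types would need a more careful comparison than this string match.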

For the migration, developers can take predefined code skeletons from the Iceberg and Databricks documentation and implement them. The steps are listed below, and the language used here is Scala:

Step 1: Create a Delta Lake Table

In the first step, make sure the S3 bucket is empty and verified before creating data in it. Once the data has been created, perform the following check:

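// Create a small DataFrame holding the ids 0 to 4
val data = spark.range(0, 5)
// Write it to S3 as a Delta table
data.write.format("delta").save("s3://testing_bucket/delta-table")

// Read the Delta table back to check that the write succeeded
spark.read.format("delta").load("s3://testing_bucket/delta-table")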

Adding optional vacuum code

// Optional: overwrite the table so that older files exist for a later vacuum
val updatedData = spark.range(5, 10)
updatedData.write.format("delta").mode("overwrite").save("s3://testing_bucket/delta-table")
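
The overwrite above only prepares the table for a vacuum; the vacuum itself is not shown. A minimal sketch of that step, assuming the same `spark-shell` session with Delta Lake on the classpath, could use the `DeltaTable` API as follows.

```scala
import io.delta.tables.DeltaTable

// Assumption: run in the same spark-shell session, so `spark` is already defined.
val deltaTable = DeltaTable.forPath(spark, "s3://testing_bucket/delta-table")

// Remove files that are no longer referenced by the table and are older than
// the default retention period.
deltaTable.vacuum()

// Inspect the commit history to see the write and overwrite operations.
deltaTable.history().show(false)
```

Because of the default retention period, files replaced by a very recent overwrite are normally kept, so a vacuum run straight after the overwrite may remove nothing. Once old files are removed, time travel to the versions that depended on them is no longer possible.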

Step 2: CTAS and Reading the Delta Lake Table

// Read the Delta Lake table
spark.read.format("delta").load("s3://testing_bucket/delta-table")

Step 3: Read the Delta Lake Table and Write to an Iceberg Table

// Read the Delta table into a DataFrame
val df_delta = spark.read.format("delta").load("s3://testing_bucket/delta-table")
// Create an Iceberg table from the DataFrame
df_delta.writeTo("test.db.iceberg_ctas").create()
// Read the new Iceberg table back
spark.read.format("iceberg").load("test.db.iceberg_ctas")

Verify the data written to the Iceberg table under S3.
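
Beyond inspecting the files in S3, the copy can also be checked from Spark. The sketch below is one possible approach rather than a prescribed procedure: it assumes the same session used in Steps 1 to 3, compares row counts, and uses `exceptAll` to look for rows that exist on one side only. The variable names are illustrative.

```scala
// Assumption: same spark-shell session as Steps 1 to 3, so `spark` is defined
// and the Iceberg catalog "test" is configured.
val source = spark.read.format("delta").load("s3://testing_bucket/delta-table")
val target = spark.table("test.db.iceberg_ctas")

val sourceCount = source.count()
val targetCount = target.count()

// Rows present on one side but not the other (exceptAll keeps duplicates in account).
val missingInIceberg = source.exceptAll(target).count()
val unexpectedInIceberg = target.exceptAll(source).count()

println(s"delta rows: $sourceCount, iceberg rows: $targetCount")
println(s"missing in Iceberg: $missingInIceberg, unexpected in Iceberg: $unexpectedInIceberg")
```

For a faithful copy, the two counts should match and both difference counts should be zero. On large tables a full `exceptAll` is expensive, so sampling or comparing aggregates per partition may be more practical.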


Next, compare third-party tools in terms of ease of use, performance, compatibility, and support. The two tools considered here, AWS Glue DataBrew and Snowflake, each come with their own set of functionality.

AWS Glue DataBrew

Migration Process:

  • Ease of Use: AWS Glue DataBrew is a product within the AWS cloud and offers a user-friendly experience for data cleaning and transformation tasks.
  • Integration: Glue DataBrew integrates with other Amazon cloud services, so organizations that already work with AWS can make use of it.

Feature Set:

  • Data Transformation: It comes with a large set of features for data transformation and exploratory data analysis (EDA), which can be useful during data migration.
  • Automated Profiling: As some open-source tools do, DataBrew profiles data automatically to detect inconsistencies and recommend transformation steps.

Performance and Compatibility:

  • Scalability: Glue DataBrew scales to handle the large datasets that may be encountered during the migration process.
  • Compatibility: It is compatible with a broad set of formats and data sources, which eases integration with various storage solutions.

Snowflake

Migration Process:

  • Ease of Migration: For simplicity, Snowflake has migration services that help end users move from existing data warehouses to the Snowflake platform.
  • Comprehensive Documentation: Snowflake provides extensive documentation and ample resources for getting started with the migration process.

Feature Set:

  • Data Warehousing Capabilities: It provides a broad set of warehousing features and supports semi-structured data, data sharing, and data governance.
  • Concurrency: The architecture allows high concurrency, which suits organizations with demanding data processing requirements.

Performance and Compatibility:

  • Performance: Snowflake also performs efficiently in terms of scalability, which lets end users process large data volumes with ease.
  • Compatibility: Snowflake also provides a variety of connectors for different data sources, which supports compatibility across varied data ecosystems.

Conclusion

To optimize data lake and warehouse management workflows and to drive business outcomes, this transition is vital for organizations. Enterprises can weigh both platforms in terms of capabilities and architectural and technical differences, and decide which one to choose in order to use the full potential of their datasets. This helps organizations in the long run as well. In a rapidly changing data landscape, innovative solutions can keep organizations ahead.

Key Takeaways

  • Apache Iceberg provides strong features such as snapshot isolation, efficient metadata management, and partition pruning, which lead to improved data lake management capabilities.
  • Migrating to Apache Iceberg calls for careful planning and execution. Organizations should consider factors such as schema evolution, data migration strategies, and query compatibility.
  • Databricks Delta Lake uses cloud storage as its underlying storage layer, holding both data files and transaction logs, while Iceberg separates metadata from data files, which improves performance and scalability.
  • Organizations should also consider financial implications such as storage costs, compute charges, licensing fees, and any ad hoc resources needed for the migration.

Frequently Asked Questions

Q1. How is the migration from Databricks Delta Lake to Apache Iceberg performed?

A. It involves exporting the data from Databricks Delta Lake, cleaning it if necessary, and then importing it into Apache Iceberg tables.

Q2. Are there any automated tools available to assist with the migration without manual intervention?

A. Organizations generally use custom Python or Scala scripts and ETL tools to build this workflow.

Q3. What challenges do organizations commonly face during the migration process?

A. Challenges that are likely to arise include data consistency, handling schema evolution differences, and optimizing performance after migration.

Q4. What is the difference between Apache Iceberg and other table formats like Parquet or ORC?

A. Apache Iceberg provides features such as schema evolution, snapshot isolation, and efficient metadata management, which distinguish it from Parquet and ORC.

Q5. Can we use Apache Iceberg with cloud-based storage solutions?

A. Definitely. Apache Iceberg is compatible with commonly used cloud-based storage solutions such as AWS S3, Azure Blob Storage, and Google Cloud Storage.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.
