DAY 27: started module 6 batch processing of data engineering zoomcamp. using pyspark from jupyter notebook to handle some "massive files".
check the videos:
www.youtube.com/watch?v=r_Sf...
#python #sql #dezoomcamp
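Spark itself needs a JVM and a SparkSession, but the core idea behind batch processing "massive files" — never holding the whole file in memory — can be sketched in plain Python. This is a stand-in illustration, not Spark code; `process_in_batches` is a hypothetical helper:

```python
import csv
import io

def process_in_batches(fileobj, batch_size=10_000):
    """Read a CSV in fixed-size batches instead of loading it all at once.
    Yields one list of row-dicts per batch."""
    reader = csv.DictReader(fileobj)
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# tiny demo with an in-memory "file"
data = io.StringIO("id,amount\n" + "\n".join(f"{i},{i * 2}" for i in range(5)))
batches = list(process_in_batches(data, batch_size=2))
# 5 rows with batch_size=2 -> batches of 2, 2, 1
```

Spark does this across many machines in parallel, which is why the same pattern scales to files that don't fit on one laptop.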
i don't wanna go back to X but it feels like a desert here. there are only a few people from the USA here.
#python #sql
DAY 26: still working on dlt. i'm trying to understand how much dlt makes the job easier by writing a manual python script: a rest api json parser. i will then try the same thing with dlt.
#python #sql #dlt
github.com/middaycoffee...
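The manual version of that chore can be sketched with nothing but the stdlib. The payload shape here is hypothetical — real APIs nest things differently, which is exactly the boilerplate dlt's schema inference removes:

```python
import json

# hypothetical payload shape for illustration
SAMPLE = json.dumps({
    "results": [
        {"id": 1, "user": {"name": "ada"}, "score": 9.5},
        {"id": 2, "user": {"name": "alan"}, "score": 7.0},
    ],
    "next_page": None,
})

def parse_records(raw: str) -> list[dict]:
    """Flatten nested API records into flat rows by hand."""
    payload = json.loads(raw)
    rows = []
    for rec in payload["results"]:
        rows.append({
            "id": rec["id"],
            "user_name": rec["user"]["name"],
            "score": rec["score"],
        })
    return rows

rows = parse_records(SAMPLE)
```

Every new endpoint means rewriting the flattening by hand; dlt infers it from the data instead.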
DAY 25: did some deep digging on dlt-hub, the story behind it and how it will change data-engineering practices, maybe more than ai. also marimo is a great python notebook tool that you can check here: marimo.io
DAY 24: finalized dlt and committed to github! check the code here:
github.com/middaycoffee...
DAY 23: Initiated the dlt pipeline via Claude MCP help!
It is quite a smooth operation for friendly APIs. Check my code here:
github.com/middaycoffee...
#python #sql #dezoomcamp #dlt
DAY 22: started dlt workshop, creating end-to-end pipelines just from a python script. dlt is really a good open-source project that helps all of us.
I'm now initiating dlt with MCP configurations.
#dlt #python #sql #dezoomcamp
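The extract-then-load shape of such an end-to-end pipeline can be sketched with the stdlib alone (sqlite standing in for the real destination; table and field names are made up). The schema creation and inserts below are precisely what dlt generates for you, along with typing and incremental state:

```python
import sqlite3

# toy extract step (a real pipeline would call an API here)
def extract():
    yield {"id": 1, "city": "nyc"}
    yield {"id": 2, "city": "izmir"}

def load(records, db_path=":memory:"):
    """Minimal load step: create the table and insert rows by hand."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS cities (id INTEGER, city TEXT)")
    conn.executemany("INSERT INTO cities VALUES (:id, :city)", records)
    conn.commit()
    return conn

conn = load(extract())
count = conn.execute("SELECT COUNT(*) FROM cities").fetchone()[0]
```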
DAY 21: Bruin module successfully completed!
I've learned to build end-to-end pipelines in this module via Bruin. Encountered several problems and fixed them. Also utilized Bruin Cloud & GCP.
#dezoomcamp #sql #python
DAY 20: Bruin Cloud implementation problems solved. A Bruin engineer helped from Slack and we found the issue was arising from the size of the repository. my repository included all modules as sub-folders and in total they exceeded 100MB, which was more than Bruin Cloud allowed.
#sql
DAY 19: I've encountered some problems in Bruin Cloud. It doesn't start when the GitHub repo is changed to something fuller than the given cloud template. I've asked the Slack community and am now waiting for some answers.
#dezoomcamp #python #sql
I'm just exploring Bruin Cloud but I was thinking of integrating it with BigQuery and using it for the end-to-end capstone project. Do you use Bruin?
DAY 18: Created an end-to-end pipeline with Bruin MCP + Claude. There seem to be some small problems (timezone, ingested day count and some numerical shifts) but overall it's quite easy. Cloud integration comes next. I'm trying it with GCP, otherwise I will shift to MotherDuck.
#python #sql #dezoomcamp
DAY 17: completed assets and created several successful pipelines in Bruin. Now we can pull NYC taxi data, transform it and query it with dbt (checking seeds) from one platform only!
tomorrow will dive into cloud integration.
#python #sql #dezoomcamp
DAY 16: continued the Bruin pipeline, created ingestion assets for the nyc dataset. Got help from claude for the table schema and sql queries. Set dependencies and requirements.
#dezoomcamp #sql #python
is this because the internet is now owned by big tech, or is there some other reason? i don't know, but even to get an internship you need a full-package name-brand online presence (github, linkedin, web page, social). that's disgustinggg
thank you. it's great indeed. from visualization to seeds and modeling, dbt gives full control. and jinja for sql is a powerful surprise.
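That jinja surprise is easy to show: dbt compiles models with Jinja, so a loop can generate repetitive SQL columns instead of hand-writing them. A minimal sketch (hypothetical `trips` table and column names, rendered with the jinja2 library directly rather than through dbt):

```python
from jinja2 import Template

# loop over payment types instead of hand-writing one sum(...) per type
sql = Template("""
select pickup_date,
{%- for p in payment_types %}
    sum(case when payment_type = {{ p }} then total_amount end) as amount_{{ p }}{{ "," if not loop.last }}
{%- endfor %}
from trips
group by 1
""").render(payment_types=[1, 2, 3])
```

Adding a fourth payment type is a one-element change to the list, not a copy-pasted SQL block.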
DAY 15: Met Bruin, which appears to be a fully in-house data platform where we can create end-to-end pipelines easily.
i'm not sure about scalability but for sure it's easier than gluing docker-kestra-python-gcp-looker together :)
#dezoomcamp #python #sql
i wish there were remote companies that hire interns globally. most companies recruit only from their own country, due to tax and social security stuff i guess.
#python #sql
day 14: addition to data engineering zoomcamp module 4. loaded fhv_tripdata directly from gcp command line with a python script. it takes longer than a kestra orchestration but a clean method if you don't want to deal with a docker-kestra setup all over again. github.com/middaycoffee... #dezoomcamp
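The per-month file URLs such a script loops over can be built with a tiny helper. The base URL below is an assumption for illustration — point it wherever the fhv files actually live:

```python
# base URL is an assumption; adjust to the real file host
BASE = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv"

def fhv_urls(year: int, months=range(1, 13)):
    """Build the per-month fhv_tripdata file URLs for one year."""
    return [f"{BASE}/fhv_tripdata_{year}-{m:02d}.csv.gz" for m in months]

urls = fhv_urls(2019, months=[1, 2])
```

From there the script just streams each URL down and pushes it to the bucket with `gsutil cp` or the GCS client.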
DAY 13: Module 4 of Data Engineering Zoomcamp done!
- Analytics Engineering with dbt
- Transformation models & tests
- Data lineage & dependencies
- NYC taxi revenue analysis
My solution: github.com/middaycoffee...
Free course by DataTalksClub: github.com/DataTalksClu...
DAY 12:
created marts, intermediate and other folders inside models, created unioned tables of green and yellow taxi data
here's my repo: github.com/middaycoffee...
#dezoomcamp
#python
#sql
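The union pattern behind that model is simple: tag each staging table with a service_type column, then UNION ALL. A sketch that generates the SQL in Python (staging table names assumed to match the zoomcamp convention):

```python
def union_sql(tables: dict) -> str:
    """Build a UNION ALL over staging tables, tagging each row
    with the service it came from."""
    parts = [
        f"select *, '{service}' as service_type from {table}"
        for service, table in tables.items()
    ]
    return "\nunion all\n".join(parts)

sql = union_sql({
    "green": "stg_green_tripdata",
    "yellow": "stg_yellow_tripdata",
})
```

In dbt itself the same thing is a Jinja loop over `ref()` calls, which is why the unioned model stays short even with many sources.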
DAY 11:
created dbt folders like models, snapshots etc. ran a query on staging with nyc green taxi records.
here's my repo: github.com/middaycoffee...
#dezoomcamp
#python
#sql
DAY 10:
initialised dbt, watched the reasoning behind dbt. created repo and downloaded green and yellow nyc taxi data for 2019 and 2020.
here's my repo: github.com/middaycoffee...
#dezoomcamp
#python
#sql
DAY 9: continued some work on BigQuery and got assistance from Gemini on how to improve query run times and costs. Tested partitioned and clustered queries and checked the differences. Followed relevant lessons from YouTube.
- GCP
- BigQuery
- Cloud Storage Buckets
- SQL
#dezoomcamp
DAY 8: Module 3 of Data Engineering Zoomcamp done!
- BigQuery & GCS
- External vs materialized tables
- Partitioning & clustering
- Query optimization
My solution: github.com/middaycoffee...
Free course by DataTalksClub: github.com/DataTalksClu...
DAY 7: Clustering with BigQuery!
today i've learned the logic behind clustering and partitioning and ran some example queries to see the results.
#dezoomcamp
#sql
check this free and beautiful course at: github.com/DataTalksClu...
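The logic compresses into one DDL statement: PARTITION BY prunes whole date partitions (cheaper scans), CLUSTER BY co-locates rows within each partition (faster filters). A sketch with the statement built as a Python string — table and column names here are hypothetical, but the keywords are standard BigQuery DDL:

```python
def partitioned_clustered_ddl(table: str) -> str:
    """Build a BigQuery CTAS that is both partitioned and clustered."""
    return (
        f"CREATE TABLE {table} "
        "PARTITION BY DATE(pickup_datetime) "  # prune by day -> cheaper scans
        "CLUSTER BY vendor_id "                # co-locate rows -> faster filters
        "AS SELECT * FROM source_trips"
    )

ddl = partitioned_clustered_ddl("taxi.trips_opt")
```

A query that filters on `pickup_datetime` then only bills for the partitions it touches, which is the cost difference the course has you measure.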
a colleague does this? people gone mad
DAY 6: BigQuery partitioning!
done:
- used GCS and extracted the tables to BigQuery (cheaper)
- used BigQuery to query and partition the tables (faster)
#dezoomcamp
#sql
check this free and beautiful course at: github.com/DataTalksClu...
DAY 5: AI usage with Kestra, Cloud implementation of Kestra
done:
- used GCS and BigQuery to upload Taxi data and transform it.
- implemented Kestra on GCP VM with Cloud SQL and GCS
#dezoomcamp
#sql