Data Code 101

@datacode101

Data / Software Engineering

05.10.2023
Joined
Latest posts by Data Code 101 @datacode101

Prompting is temporary.

Structure is permanent.

When your repo is organized this way, Claude stops behaving like a chatbot…

…and starts acting like a project-native engineer.

10.03.2026 11:58 👍 0 🔁 0 💬 0 📌 0

5️⃣ Local CLAUDE.md for risky modules

Put small files near sharp edges:

src/auth/CLAUDE.md
src/persistence/CLAUDE.md
infra/CLAUDE.md

Now Claude sees the gotchas exactly when it works there.

10.03.2026 11:58 👍 0 🔁 0 💬 1 📌 0
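
A minimal sketch of the layering idea: for any file being edited, list the CLAUDE.md locations from the repo root down to that file's directory. The helper name `claude_md_candidates` is hypothetical, and this only illustrates the layout. It is not Claude Code's actual loading logic.

```python
from pathlib import PurePosixPath


def claude_md_candidates(file_path, name="CLAUDE.md"):
    """Return possible CLAUDE.md paths from the repo root down to the file's directory.

    Hypothetical helper for illustration; paths are repo-relative, POSIX style.
    """
    directory = PurePosixPath(file_path).parent
    candidates = [name]  # the root-level file always applies
    current = PurePosixPath()
    for part in directory.parts:
        current = current / part
        candidates.append(str(current / name))
    return candidates


print(claude_md_candidates("src/auth/login.py"))
```

The deeper the file, the more specific the last entry, which is where module-level gotchas belong.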

4️⃣ docs/ = Progressive Context

Don’t bloat prompts.

Claude just needs to know where truth lives:

• architecture overview
• ADRs (engineering decisions)
• operational runbooks

10.03.2026 11:58 👍 0 🔁 0 💬 1 📌 0

3️⃣ .claude/hooks/ = Guardrails

Models forget.

Hooks don’t.

Use them for things that must be deterministic:

• run formatter after edits
• run tests on core changes
• block unsafe directories (auth, billing, migrations)

10.03.2026 11:58 👍 0 🔁 0 💬 1 📌 0
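
A rough sketch of the "block unsafe directories" guardrail as a hook script. Assumptions, to be checked against the current Claude Code hooks documentation: the hook receives the tool call as JSON on stdin with the target path under `tool_input.file_path`, and exit code 2 blocks the call. The function names and the directory set are illustrative.

```python
import json
import sys
from pathlib import PurePosixPath

# Directories named in the post; adjust to your repo.
PROTECTED_DIRS = {"auth", "billing", "migrations"}


def is_protected(file_path, protected=PROTECTED_DIRS):
    """True if any directory component of the path is a protected directory."""
    parts = PurePosixPath(file_path.replace("\\", "/")).parts
    return any(part in protected for part in parts[:-1])  # skip the file name


def decide(payload):
    """Map a hook payload to (exit_code, message). Exit code 2 is assumed to block."""
    file_path = payload.get("tool_input", {}).get("file_path", "")
    if file_path and is_protected(file_path):
        return 2, f"Blocked: {file_path} is in a protected directory"
    return 0, ""


def main(stdin=sys.stdin, stderr=sys.stderr):
    code, message = decide(json.load(stdin))
    if message:
        print(message, file=stderr)
    return code

# In a real hook script, end the file with: raise SystemExit(main())
```

Matching on whole path components means `src/authors/` is not blocked just because it starts with `auth`.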

2️⃣ .claude/skills/ = Reusable Expert Modes

Stop rewriting instructions.

Turn common workflows into skills:

• code review checklist
• refactor playbook
• release procedure
• debugging flow

Result:
Consistency across sessions and teammates.

10.03.2026 11:58 👍 1 🔁 0 💬 2 📌 0

1️⃣ CLAUDE.md = Repo Memory (keep it short)

This is the north star file.

Not a knowledge dump. Just:

• Purpose (WHY)
• Repo map (WHAT)
• Rules + commands (HOW)

If it gets too long, the model starts missing important context.

10.03.2026 11:58 👍 0 🔁 0 💬 1 📌 0

Claude needs 4 things at all times:

• the why → what the system does
• the map → where things live
• the rules → what’s allowed / not allowed
• the workflows → how work gets done

The Anatomy of a Claude Code Project 👇

10.03.2026 11:58 👍 0 🔁 0 💬 1 📌 0

Most people treat CLAUDE.md like a prompt file.

That’s the mistake.

If you want Claude Code to feel like a senior engineer living inside your repo, your project needs structure.

#Agentic #AI #Claude

10.03.2026 11:58 👍 1 🔁 0 💬 2 📌 0

If you enjoy system design, infrastructure, and data flow — engineering may suit you.
If you enjoy analysis, modeling, and problem-solving with algorithms — science may be your path.

10.02.2026 22:42 👍 0 🔁 0 💬 0 📌 0

A Data Scientist analyzes data, builds models, applies statistics, and translates patterns into actionable insights. They focus on prediction, experimentation, and business impact.

10.02.2026 22:42 👍 0 🔁 0 💬 1 📌 0

A Data Engineer designs pipelines, manages large-scale systems, ensures data reliability, and works heavily with cloud and distributed frameworks. They focus on performance, scalability, and architecture.

10.02.2026 22:42 👍 0 🔁 0 💬 1 📌 0

Data Engineer vs. Data Scientist: What’s the Difference?

One builds the data foundation.
The other turns data into intelligence.

10.02.2026 22:42 👍 0 🔁 0 💬 1 📌 0

- Using coding agents to build pipelines faster
- Breaking down data silos with lakehouse table formats like Iceberg and Delta Lake
- Getting the entire company to agree on business definitions

Data engineering is one of the few "safe" roles in the coming decade!

10.02.2026 20:28 👍 0 🔁 0 💬 0 📌 0

Data engineers in 2030 are:
- Able to handle all types of data: structured, semi-structured, and unstructured
- Integrating private data into AI in a privacy-compliant and efficient way using multi-tenant architectures

10.02.2026 20:28 👍 0 🔁 0 💬 1 📌 0

Things like Claude Code will make "building pipelines" easier, but data engineering is so much more than building pipelines!

10.02.2026 20:28 👍 0 🔁 0 💬 1 📌 0

Data engineering is projected to grow faster than AI engineering over the next decade, according to the World Economic Forum!

AI is not going to replace data engineering; it will make it increasingly valuable!

10.02.2026 20:28 👍 0 🔁 0 💬 1 📌 0

- Typically 30–60% fewer tokens than JSON
- Explicit lengths and fields enable validation
- Removes redundant punctuation (braces, brackets, most quotes)
- Indentation-based structure, like YAML, uses whitespace instead of braces
- Tabular arrays: declare keys once, stream data as rows

06.11.2025 06:01 👍 0 🔁 0 💬 0 📌 0

JSON:

{
  "users": [
    { "id": 1, "name": "Alice", "role": "admin" },
    { "id": 2, "name": "Bob", "role": "user" }
  ]
}

TOON:

users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user

06.11.2025 06:01 👍 0 🔁 0 💬 1 📌 0
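
A minimal sketch of the tabular-array idea in Python, not the official TOON library. It handles only a non-empty list of flat dicts with identical keys and simple values, and it refuses anything that would need TOON's quoting rules. The function names are made up for illustration.

```python
def _cell(value):
    """Render one simple value; reject anything that would need quoting."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str) and value and value == value.strip():
        looks_like_literal = value in {"true", "false", "null"}
        try:
            float(value)
            looks_like_literal = True
        except ValueError:
            pass
        if not looks_like_literal and not any(ch in value for ch in ',:"\\\n'):
            return value
    raise ValueError(f"value {value!r} needs quoting, which this sketch skips")


def to_toon_table(key, rows):
    """Encode a uniform list of flat dicts as a TOON-style tabular array."""
    if not rows:
        raise ValueError("this sketch only handles non-empty arrays")
    fields = list(rows[0])
    header = ",".join(fields)
    lines = [f"{key}[{len(rows)}]{{{header}}}:"]  # keys declared once
    for row in rows:
        if list(row) != fields:
            raise ValueError("all rows must share the same keys in the same order")
        lines.append("  " + ",".join(_cell(row[f]) for f in fields))
    return "\n".join(lines)


users = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
]
print(to_toon_table("users", users))
```

The header carries the row count and field names, so a reader (or a model) can check that every row has the declared shape.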

Token-Oriented Object Notation (TOON) is a compact, human-readable serialization format designed for passing structured data to Large Language Models with significantly reduced token usage. It's intended for LLM input as a lossless, drop-in representation of JSON data.

#dataengineering #llm

06.11.2025 06:01 👍 0 🔁 0 💬 1 📌 0

RAG is not just an integration problem. It’s a design problem. Each layer of this stack requires deliberate choices that impact latency, quality, explainability, and cost.

If you're serious about GenAI, it's time to think in terms of stacks—not just models.

27.10.2025 10:36 👍 0 🔁 0 💬 0 📌 0

Evaluation

Tools like Ragas, TruLens, and Giskard bring much-needed observability—measuring hallucinations, relevance, grounding, and model behavior under pressure.

27.10.2025 10:36 👍 0 🔁 0 💬 1 📌 0

Text Embeddings

The quality of retrieval starts here. Open-source models (Nomic, SBERT, BGE) are gaining ground, but proprietary offerings (OpenAI, Google, Cohere) still dominate enterprise use.

27.10.2025 10:36 👍 0 🔁 0 💬 1 📌 0

Open LLM Access

Platforms like Hugging Face, Ollama, Groq, and Together AI abstract away infra complexity and speed up experimentation across models.

27.10.2025 10:36 👍 0 🔁 0 💬 1 📌 0

Data Extraction (Web + Docs)

Whether you're crawling the web (Crawl4AI, Firecrawl) or parsing PDFs (LlamaParse, Docling), raw data access is non-negotiable. No context means no quality answers.

27.10.2025 10:36 👍 0 🔁 0 💬 1 📌 0

Vector Database

Chroma, Qdrant, Weaviate, Milvus, and others power the retrieval engine behind every RAG system. Low-latency search, hybrid scoring, and scalable indexing are key to relevance.

27.10.2025 10:36 👍 0 🔁 0 💬 1 📌 0
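
To make "retrieval engine" concrete, here is a toy sketch of what a vector search does at its core: rank stored embeddings by cosine similarity to a query embedding. The vectors and document ids are invented. Real vector databases add approximate indexing, filtering, and hybrid scoring on top of this.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
index = {
    "etl": [1.0, 0.0, 0.0],
    "rag": [0.0, 1.0, 0.2],
    "toon": [0.0, 0.9, 0.8],
}
print(top_k([0.0, 1.0, 0.1], index))
```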

Frameworks

LangChain, LlamaIndex, Haystack, and txtai are now essential for building orchestrated, multi-step AI workflows. These tools handle chaining, memory, routing, and tool-use logic behind the scenes.

27.10.2025 10:36 👍 0 🔁 0 💬 1 📌 0

LLMs (Open vs Closed)

Open models like Llama 3, Phi-4, and Mistral offer control and customization. Closed models (OpenAI's GPT models, Claude, Gemini) bring powerful performance with less overhead. Your tradeoff: flexibility vs convenience.

27.10.2025 10:36 👍 0 🔁 0 💬 1 📌 0

RAG Stack

Building with Retrieval-Augmented Generation (RAG) isn't just about choosing the right LLM. It's about assembling an entire stack—one that's modular, scalable, and future-proof.
#ai #rag #dataengineering

27.10.2025 10:36 👍 0 🔁 0 💬 1 📌 0

EtLT (Extract, transform, Load, Transform) (2/2)

Best for scenarios requiring strict data security/compliance (pre-load masking) while still benefiting from the speed and flexibility of cloud data warehouse transformations.

19.10.2025 04:45 👍 0 🔁 0 💬 0 📌 0

EtLT (Extract, transform, Load, Transform) (1/2)

EtLT attempts to balance the data governance of ETL with the speed and flexibility of ELT. A minimal transformation step is performed before loading. It covers essential tasks such as data cleaning, basic formatting, and masking sensitive data for immediate compliance.

19.10.2025 04:45 👍 0 🔁 0 💬 1 📌 0
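
A small sketch of the EtLT flow under stated assumptions: an in-memory SQLite database stands in for the cloud warehouse, the rows and table names are invented, and hashing is just one possible masking choice. The small "t" cleans and masks before load. The big "T" runs as SQL after load.

```python
import hashlib
import sqlite3


def mask_email(email):
    """Replace an email with a short, stable hash so the raw value is never loaded."""
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
    return digest[:12]


def small_t(row):
    """Pre-load step: trim, cast, and mask. No business logic here."""
    return {
        "customer": row["customer"].strip(),
        "email_hash": mask_email(row["email"]),
        "amount": float(row["amount"]),
    }


def run_etlt(source_rows):
    conn = sqlite3.connect(":memory:")  # stand-in for a warehouse
    # Load: only cleaned and masked rows reach storage.
    conn.execute("CREATE TABLE raw_orders (customer TEXT, email_hash TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:customer, :email_hash, :amount)",
        [small_t(row) for row in source_rows],
    )
    # Transform: heavier modeling happens in SQL, inside the warehouse.
    conn.execute(
        "CREATE TABLE customer_totals AS "
        "SELECT customer, email_hash, SUM(amount) AS total "
        "FROM raw_orders GROUP BY customer, email_hash"
    )
    return conn


rows = [
    {"customer": " Alice ", "email": "Alice@Example.com", "amount": "10"},
    {"customer": "Alice", "email": "alice@example.com", "amount": "20.5"},
    {"customer": "Bob", "email": "bob@example.com", "amount": "5"},
]
conn = run_etlt(rows)
print(conn.execute("SELECT customer, total FROM customer_totals ORDER BY customer").fetchall())
```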