
#postgresSQL

Latest posts tagged with #postgresSQL on Bluesky


Awakari App

Understanding Concurrency in PostgreSQL: How It Handles Multiple Transactions Efficiently Introduction Continue reading on Medium »

#sql #postgressql #design #software-architecture #software-development


OpenAI’s Postgres Architecture: A Brilliant Fix for a Billion-Dollar Mistake Author: Matthew Penaroza, Head of AI Solution Architecture at SurrealDB OpenAI’s write-up on scaling PostgreSQL for...

#database #openai #postgres #postgressql


SQL Joins & Window Functions JOINS Joins let us work with multiple tables and combine data across them.… The post SQL Joins & Window Functions appeared first on prod...

#Software #postgres #postgressql #prodsens #live #sql


SQL Joins & Window Functions JOINS Joins let us work with multiple tables and combine data across them. Joins happen when there is a primary and foreign key relationship - they al...

#sql #postgres #postgressql
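The joins-plus-window-functions idea the post summarizes can be shown in a small runnable sketch. This example is not from the article: the table and column names are invented, and it uses SQLite (which supports window functions since version 3.25) instead of PostgreSQL purely so it runs without a server.

```python
import sqlite3

# Rank each employee's salary within their department:
# a JOIN via a foreign key, plus a RANK() window function.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        salary INTEGER,
        dept_id INTEGER REFERENCES departments(id)
    );
    INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Sales');
    INSERT INTO employees VALUES
        (1, 'Ada', 120, 1), (2, 'Grace', 110, 1), (3, 'Linus', 90, 2);
""")

rows = conn.execute("""
    SELECT d.name, e.name, e.salary,
           RANK() OVER (PARTITION BY d.id ORDER BY e.salary DESC) AS rnk
    FROM employees e
    JOIN departments d ON d.id = e.dept_id   -- join via the foreign key
    ORDER BY d.name, rnk
""").fetchall()

for dept, emp, salary, rnk in rows:
    print(dept, emp, salary, rnk)
```

The window function computes a rank per department partition without collapsing the rows the way `GROUP BY` would.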


How I Built a System That Tracks 900,000 Real-Time Events Per Day A few months ago I started building a system to collect and analyze real-time event data. What began… The post How I Built a Syst...

#Software #backend #database #postgressql #prodsens #live #webdev

From the PostgreSQL community on Reddit: Scaling PostgreSQL to power 800 million ChatGPT users Posted by Ncell50 - 105 votes and 10 comments

www.reddit.com/r/PostgreSQL.... #PostgresSQL #OpenAI #Scaling #Database


🧱 Lesson 5 – Working with PostgreSQL (Multi-Database Setup) Series: From Code to Cloud: Building a Production-Ready .NET Application By: Farrukh Rehman - Senior .NET Full Stack Develope...

#Software #csharp #entityframework #postgres #postgressql #prodsens #live

I Built a Multi-Agent Code Review System That's 4x Faster (Thanks to Zero-Copy Database Forks!) "I didn't know you could do that!" - My reaction when I discovered Tiger Cloud's zero-copy...

A little over a week left in the Agentic Postgres Challenge with Tiger Database. Need some inspiration? Check out this project by Simran Shaikh where she runs multiple AI agents concurrently for super speedy code analysis.

dev.to/simranshaikh...

#agenticpostgreschallenge #postgressql #ai

How to Configure Nagios to Monitor a PostgreSQL 16 Database with Dashboards on Ubuntu 24.04 LTS #postgresql #postgres #nagios inchirags@gmail.com Chirag PostgreSQL DBA Tutorial https://www.chirags.in How to...


#postgres #postgressql #nagios #database


How to Design a PostgreSQL Schema Visually (Step-by-Step) 1. What is a Schema? In PostgreSQL, a schema is just a folder inside your database where you… The post How to Design a PostgreSQL Schema ...

#Software #database #postgres #postgressql #prodsens #live #sql


Using PgBouncer to improve performance and reduce the load on PostgreSQL inchirags@gmail.com Chirag's PostgreSQL DBA Tutorial https://www.chirags.in Using PgBouncer to improve performance and r...

#pgbouncer #postgres #postgressql #database
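For readers wanting a concrete starting point for the pattern this tutorial describes (clients connect to PgBouncer, which multiplexes them onto a small pool of PostgreSQL connections), here is a minimal `pgbouncer.ini` sketch. The database name, paths, and pool sizes are placeholder assumptions, not values from the tutorial.

```ini
[databases]
; clients connect to "mydb" on PgBouncer; it proxies to local PostgreSQL
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling gives the biggest connection savings, but server
; connections are shared between transactions, so session state
; (e.g. SET) does not carry over between transactions
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
```

Applications then point their connection string at port 6432 instead of 5432.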


Manage Replication and Failover on a PostgreSQL 16 Cluster Using repmgr in 02 Nodes on Ubuntu 24.04 LTS inchirags@gmail.com Chirag's PostgreSQL DBA Tutorial https://www.chirags.in Manage Replic...

#repmgr #postgres #database #postgressql

Building a smarter Web scraper: Vector embeddings for intelligent content retrieval and analysis

# Yet Another Scraper? This One's Actually Worth Your Time!

Hey there, fellow code wranglers! 👋 Alex here, and today I'm going to introduce you to my latest Python creation (and yes, I will do a Rust follow-up if this gets enough attention).

## Foreword

Whether you're just starting your coding journey or you've got a few years under your belt as a developer, this project is designed to be both educational and immediately useful. Entry-level devs will find the codebase approachable, with clear patterns to learn from, while mid-level engineers can appreciate the architecture choices and extend the functionality for their specific needs. It is meant to be a starting point for you to build upon.

## What Makes This One Special?

This isn't just any scraper. It's a full-stack solution that combines `FastAPI`, `PostgreSQL` with `pgvector`, and `Playwright` to create a powerful content scraping and similarity search system. Think of it as your personal web librarian that not only collects content but also helps you find related information using the magic of vector embeddings.

User: "Find me content similar to this article about AI ethics"
Scraper: *does vector magic* "Here are 5 related articles ranked by similarity"
User: *surprised Pikachu face*

## The stack

* **FastAPI**: Because life's too short for slow APIs
* **PostgreSQL + pgvector**: Vector similarity search that doesn't make your database cry
* **Playwright**: Headless browsing that actually works with modern websites
* **Sentence Transformers**: Turning words into math (vectors) so computers can understand content similarity

## Show Me The Code Already!

Here's how easy it is to scrape URLs:

```bash
# The regular endpoint supports multiple URLs too
curl -X POST http://localhost:8000/v1/scrape/ \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://dev.to/alexandrughinea", "https://dev.to/topics/python"]}'
```

And searching for similar content is just as simple:

```bash
curl -X GET "http://localhost:8000/v1/search?text=fastapi%20vector%20search&limit=5" \
  -H "X-API-Key: your_api_key"
```

For larger batches, the async batch endpoint provides better performance:

```bash
# The batch endpoint processes URLs asynchronously for better performance
curl -X POST http://localhost:8000/v1/scrape/batch/ \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com", "https://example.org", "https://example.net"]}'
```

And once you've scraped some content, you can query it directly:

```bash
# Get all scraped data (paginated)
curl -X GET "http://localhost:8000/v1/data/?limit=10" \
  -H "X-API-Key: your_api_key"

# Get a specific item by ID
curl -X GET "http://localhost:8000/v1/data/42" \
  -H "X-API-Key: your_api_key"
```

## The Secret Sauce: Vector Similarity

The real magic happens in the search functionality. When you scrape content, it gets converted into vector embeddings (fancy math arrays) that represent the semantic meaning of the text. When you search, your query text gets converted to the same format, and we find the closest matches using cosine similarity.

## Why I Built This

I got tired of scraping websites and then having to build separate systems for search and content analysis. This project combines everything into one cohesive system that:

1. Scrapes content efficiently using headless browsers
2. Stores content and vector embeddings in PostgreSQL
3. Provides a clean API for searching and retrieving similar content
4. Handles batch operations for processing multiple URLs (sync and async versions)
5. Avoids duplicate content through similarity detection
6. Respects robots.txt rules (configurable via environment variables)

## Getting Started in 30 Seconds

```bash
# Clone the repo
git clone https://github.com/alexandrughinea/python-fastapi-postgres-scraper

# Start with Docker
cd python-fastapi-postgres-scraper
docker-compose up -d

# Your API is now running at http://localhost:8000
```

## Final Thoughts

Is this yet another scraper? Technically, yes. But it's a scraper with superpowers. It doesn't just collect data; it understands it, organizes it, and helps you find connections between different pieces of content in your own database. The next time someone says "we need to scrape some websites," you can smugly pull out this repo instead of cobbling together a BeautifulSoup script that will break the moment the website changes its CSS classes.

## Getting Started

To make your life even easier, I've included Bruno API documentation in the `/docs` folder. Bruno is a great open-source API client that lets you test all the endpoints without writing a single line of code. Just install Bruno, open the collection, and start experimenting!

The complete project is available on GitHub at github.com/alexandrughinea/python-fastapi-postgres-scraper. Star it if you find it useful, and contributions are always welcome! Give it a try and let me know what you think!

And remember, with great scraping power comes great responsibility: don't hammer websites with requests, be respectful of robots.txt, and maybe consider asking for permission first, because being a good web citizen matters. You can toggle this behavior with the `RESPECT_ROBOTS_TXT` environment variable if you really need to.

Happy scraping! 🕸️🚀
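The cosine-similarity step the author describes can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the project's actual code (which uses Sentence Transformers embeddings and pgvector on the database side):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions
query = [0.1, 0.9, 0.0]
doc_a = [0.2, 0.8, 0.1]   # pointing roughly the same way as the query
doc_b = [0.9, 0.0, 0.4]   # pointing a different way

print(cosine_similarity(query, doc_a))  # close to 1.0
print(cosine_similarity(query, doc_b))  # much smaller
```

Because cosine similarity depends only on direction, a long and a short document about the same topic score as similar, which is why it is the usual choice for text search.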

Building a smarter Web scraper: Vector embeddings for intelligent content retrieval and analysis ...

dev.to/alexandrughinea/building...

#python #ai #postgressql #fastapi

How to Set Up PostgreSQL for Your Django App on a VPS Server

If you're hosting your first Django API on a server (like Contabo) and you’ve previously used PostgreSQL locally, you’ll need to configure it on your server too. This guide walks you through installing and configuring PostgreSQL on your VPS and connecting it to your Django app using your `.env` file.

## Prerequisites

* A running VPS (e.g., from Contabo) with Ubuntu or Debian.
* Domain/subdomain already set up.
* Django project already deployed to the server.
* `.env` file with your database credentials ready.
* SSH access to the server.

## Step 1: Install PostgreSQL on Your VPS

SSH into your server:

```bash
ssh your-user@your-server-ip
```

Then install PostgreSQL and its dependencies:

```bash
sudo apt update
sudo apt install postgresql postgresql-contrib libpq-dev
```

## Step 2: Create PostgreSQL Database and User

Switch to the `postgres` user:

```bash
sudo -u postgres psql
```

Run the following SQL commands to create a database and user (update placeholders to match your actual values from `.env`):

```sql
CREATE DATABASE mydb_name;
CREATE USER mydb_user WITH PASSWORD 'strongpassword';
ALTER ROLE mydb_user SET client_encoding TO 'utf8';
ALTER ROLE mydb_user SET default_transaction_isolation TO 'read committed';
ALTER ROLE mydb_user SET timezone TO 'UTC';
GRANT ALL PRIVILEGES ON DATABASE mydb_name TO mydb_user;
\q
```

## Step 3: Configure PostgreSQL for Remote Access (Optional)

Only do this **if** you need to connect to the DB remotely (e.g., from your local machine or external tools).

### Update `pg_hba.conf`:

```bash
sudo nano /etc/postgresql/*/main/pg_hba.conf
```

Add or update the line:

```
host    all    all    0.0.0.0/0    md5
```

### Update `postgresql.conf`:

```bash
sudo nano /etc/postgresql/*/main/postgresql.conf
```

Uncomment and edit:

```
listen_addresses = '*'
```

Restart PostgreSQL:

```bash
sudo systemctl restart postgresql
```

## Step 4: Configure Django to Use PostgreSQL

Update your `.env` file in the Django project root:

```
DB_NAME=mydb_name
DB_USER=mydb_user
DB_PASSWORD=strongpassword
DB_HOST=localhost
DB_PORT=5432
```

Update your `settings.py` to use `os.environ` (if not done already):

```python
import os

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': os.getenv('DB_NAME'),
        'USER': os.getenv('DB_USER'),
        'PASSWORD': os.getenv('DB_PASSWORD'),
        'HOST': os.getenv('DB_HOST'),
        'PORT': os.getenv('DB_PORT', '5432'),
    }
}
```

## Step 5: Install the PostgreSQL Client for Django

If you haven’t already installed it:

```bash
pip install psycopg2-binary
```

## Step 6: Run Django Migrations

Once the connection is set, apply migrations:

```bash
python manage.py makemigrations
python manage.py migrate
```

You can run the development server to confirm it works:

```bash
python manage.py runserver 0.0.0.0:8000
```

Then visit `http://your-server-ip:8000` to confirm everything runs as expected.

## Bonus: Secure PostgreSQL (Recommended)

If you're not using external tools to access the DB:

* Keep `DB_HOST=localhost`
* Avoid opening port `5432` to the world (skip the remote config steps)

## What’s Next?

Once PostgreSQL is working with Django:

* Set up **Gunicorn** and **Nginx** for production deployment if not already done.
* Use **HTTPS** with Let’s Encrypt.
* Set up **supervisor** to keep your Django app running.
Schemas in PostgreSQL: A Practical Guide for Developers

Schemas in PostgreSQL aren’t just for large systems: they’re for anyone who wants to keep their data structured. A schema is like a folder within your database where related objects (tables, views, etc.) are grouped together. This helps you separate concerns, organize logic, and secure your data more effectively. Here’s a practical look at how they work.

## Why You Should Use Schemas

* **Organization**: Separate business domains (like inventory and users).
* **Control**: Limit access to different parts of your app.
* **Maintenance**: Make backups and updates more targeted.
* **Cleaner Queries**: Avoid clutter in your namespace.

## Schema Types

### Public

Comes with every PostgreSQL database. If you don’t specify a schema, objects go here.

### Custom

Created manually to isolate logic or control access:

```sql
CREATE SCHEMA hr;
```

## How Data is Structured

```
PostgreSQL Cluster
└── Database
    └── Schema
        └── Tables, Views, Functions, etc.
```

This layered model helps manage growing projects without chaos.

## Real Example: E-commerce Separation

```sql
CREATE SCHEMA inventory;

CREATE TABLE inventory.products (
    product_id serial PRIMARY KEY,
    product_name VARCHAR(255),
    price DECIMAL,
    stock_quantity INT
);

CREATE TABLE public.users (
    user_id serial PRIMARY KEY,
    username TEXT,
    email TEXT
);
```

Using schemas lets you organize your app by responsibility, making it easier to evolve or scale parts of your system.

## FAQs

### Can I have multiple schemas in one DB?

Yes, and it’s a common best practice.

### What’s the difference between public and custom schemas?

Public is open by default; custom is restricted unless granted.

### Can schemas improve performance?

Not directly, but they help manage large systems more cleanly.

### Are schemas portable between environments?

Yes, using tools like `pg_dump` or schema migration scripts.

## Conclusion

PostgreSQL schemas give you better structure, access control, and scalability. They're an essential tool for organizing your data without complicating your architecture. For a more detailed guide, visit the schemas in PostgreSQL article.

[Boost] Implementing Distributed Caching with PostgreSQL in .NET: Sats.PostgresDistributedCache S...

https://dev.to/gentilpinto/-2n2c

#distributedcaching #postgressql #dotnet #caching


Supabase’s $200M Raise Signals Big Ambitions Supabase has announced a $200 million Series D fun...

www.bigdatawire.com/2025/04/24/supabases-200...

#News #in #Brief #backend #coatue #developer #Firebase #funding #open #source #PostgresSQL

Let's build event store in one hour! - Oskar Dudycz - NDC Oslo 2022 YouTube video by NDC Conferences

I’m probably also one to blame as I showed how to build #PostgresSQL event store in one hour: m.youtube.com/watch?v=gaoZ...

Tho, in the end I concluded with “Kids don't do this at home!” warning. But then we know how many people are watching something till the end 🙃

Using TF-IDF Vectors With PHP & PostgreSQL

Vectors in PostgreSQL are used to compare data to find similarities, outliers, groupings, classifications, and other things. pgvector is a popular extension that adds vector functionality to PostgreSQL.

## What is TF-IDF?

TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a way to measure the importance of a word in a document compared to a collection of documents.

## Term Frequency

Term frequency refers to how often a word is used within a document. In a 100-word document, if the word 'test' occurs 5 times, the term frequency would be 5/100 = 0.05.

## Inverse Document Frequency

Inverse Document Frequency measures how unique a word is across a group of documents. Common words like "the" or "and" appear in almost all documents, so they are assigned a low IDF score. Rare, specific words are assigned a higher IDF score.

The TF-IDF score is **TF * IDF**.

## Normalizing TF-IDF

A drawback of TF-IDF is that it unfairly advantages long documents over short documents. Longer documents can accumulate higher TF-IDF scores simply because they contain more words, not necessarily because a word is more relevant. This can be corrected by normalizing the score based on the document length: `TF-IDF score / total words in document`.

## PHP Implementation Guide

To create vectors in PHP, select all articles from a database and loop through them:

```php
foreach ($articles as $article) {
    $articleText = $article['description'];
    $words = $this->tokenizeArticle($articleText);
    $tokenizedDocuments[$article['id']] = $words;
    $this->updateDocumentFrequencies($documentFrequencies, $words);
}
```

Break up the document into an array of words. Additional word processing could be done here if required:

```php
protected function tokenizeArticle(string $text): array
{
    $text = strtolower($text);
    $text = preg_replace('/[^\w\s]/', '', $text);
    $words = preg_split('/\s+/', trim($text));
    return $words;
}
```

Create an array to keep track of the word frequency across all documents:

```php
protected function updateDocumentFrequencies(array &$documentFrequencies, array $words): void
{
    $uniqueWords = array_unique($words);
    foreach ($uniqueWords as $word) {
        if (!isset($documentFrequencies[$word])) {
            $documentFrequencies[$word] = 0;
        }
        $documentFrequencies[$word]++;
    }
}
```

Once the articles have been processed, create the embedding vector:

```php
protected function createEmbeddings(
    array $articles,
    array $tokenizedDocuments,
    array $documentFrequencies,
): void {
    $totalDocuments = count($articles);
    foreach ($articles as $article) {
        $articleId = $article['id'];
        $words = $tokenizedDocuments[$articleId];
        $embedding = $this->calculateEmbedding(
            $words,
            $documentFrequencies,
            $totalDocuments
        );
        // persist $embedding for $articleId here
    }
}
```

`calculateEmbedding()` is where the main TF-IDF calculation is done:

```php
protected function calculateEmbedding(
    array $words,
    array $documentFrequencies,
    int $totalDocuments
): array {
    $termFrequencies = array_count_values($words);
    $totalWords = count($words);
    $embedding = array_fill(0, $this->dimension, 0.0);
    foreach ($termFrequencies as $word => $count) {
        $tf = $count / $totalWords;
        $idf = log($totalDocuments / ($documentFrequencies[$word] + 1));
        $tfidf = $tf * $idf;
        $index = abs(crc32($word) % $this->dimension);
        $embedding[$index] += $tfidf;
    }
    return $this->normalizeVector($embedding);
}
```

_**Dimensions**_

The number chosen for dimensions is critical to good-quality TF-IDF. It should be large enough to hold the number of unique words in any of your documents; 768 or 1536 are good numbers for medium-sized documents. As a general rule, about 20-30% of the words in a document are unique, so 1536 equates to roughly a 20 to 30 page document.

_**Calculate TF**_

Divide the number of times a word occurs in a document by the total words in the document:

`$tf = $count / $totalWords;`

_**Calculate IDF**_

Since IDF is the inverse of the document frequency, we use log to calculate the score:

`$idf = log($totalDocuments / ($documentFrequencies[$word] + 1));`

_**Calculate TF-IDF**_

`$tfidf = $tf * $idf;`

_**TF-IDF array**_

TF-IDF arrays do not store values in order; instead, each value is stored at a calculated array key. This ensures that the same word always lands in the same array position across all documents. While it is possible for two words to hash to the same key, as long as the dimension size is appropriate for the size of the document, collisions are rare and generally represent a similar word. To calculate the position, use crc32 to generate an integer representation of the word, divide it by the dimension size, and use the remainder as the array key. This gives a good spread of positions filled with TF-IDF scores.

_**Normalizing**_

Earlier we talked about normalizing as word frequency / document length. While it can be calculated that way, normalization is more commonly done with the Euclidean norm formula:

√(x₁² + x₂² + ... + xₙ²)

The `normalizeVector` method is a PHP implementation of this formula:

```php
protected function normalizeVector(array $vector): array
{
    $magnitude = sqrt(array_sum(array_map(function ($x) {
        return $x * $x;
    }, $vector)));
    if ($magnitude > 0) {
        return array_map(function ($x) use ($magnitude) {
            return $x / $magnitude;
        }, $vector);
    }
    return $vector;
}
```

The final vector may look something like this:

```
[0.052876625,0,0,0,0,0,0,0,0,0,0,0,0.013156515,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-0.012633555,0,0,0,0,0.0065987236,0,0 ...]
```

This is known as a sparse vector: it has a lot of empty array positions, whereas a dense vector is much more filled in. Dense vectors can improve the quality of the vector. One way to densify is to include bi-grams in the vector along with the single words. For example, `This is known as a sparse vector` would include each word [this, is, known, as, a, sparse, vector]; adding bi-grams would also include [this_is, is_known, known_as, as_a, a_sparse, sparse_vector], which adds more context by taking into account the words around each word.

## Creating Queries in PostgreSQL

Once vectors have been generated for your documents, it's time to store them in PostgreSQL. Selecting the right dimension is also critical here: once you choose a dimension size, all vectors going into the column have to have the same dimension.

```sql
ALTER TABLE "articles" ADD COLUMN "embedding" vector(1536);
```

## Types of Comparisons

There are three types of comparisons in PostgreSQL:

* **Euclidean (L2) distance** (`<->`): Measures how far apart two vectors are. Smaller numbers mean vectors are more similar. Good for finding similar products, etc.
* **Cosine distance** (`<=>`): Measures the angle between vectors, ignoring their magnitude. Smaller numbers mean vectors are more similar in direction. Good for text similarity where length shouldn't matter.
* **Inner product** (`<#>`): Measures how much vectors "align" with each other. A larger inner product means more similar (the opposite of the distance operators!); note that pgvector's `<#>` actually returns the negative inner product, so ascending `ORDER BY` still puts the best matches first. Useful for normalized comparisons.

Try them all with your data to find the one that best suits your use case.

## Creating a recommendation system

One use case for vectors is a recommendation system, in this case finding articles related to the one you are currently reading. To do this, order the rows by the comparison to find the ones most relevant to the current article. In this query, the embedding of the current article is selected first, then other articles are compared against it. For a recommendation, it makes sense to filter the current article out of the results:

```sql
WITH search_article AS (
    SELECT embedding FROM articles WHERE id = 12
)
SELECT id, title
FROM articles
WHERE id <> 12
ORDER BY embedding <-> (SELECT embedding FROM search_article)
LIMIT 4;
```

## Creating a search engine

Vectors can be used to create a search engine for your documents, comparing articles with a user-entered question or keywords. The user-entered question needs to be converted into a vector using the term frequencies of your current articles (it's recommended to store these in the database so you are not recalculating them every time a search runs). The user vector must have the same dimension as the article vectors. Then create a query to compare the user vector to the stored vectors:

```sql
SELECT id, title
FROM articles
ORDER BY embedding <=> :embedding
LIMIT 4;
```

## Other use cases

_**Classifying Articles**_

A more complex use case for vectors is classifying documents to group similar articles together. You may not have specific tags/keywords to classify documents against, but articles can still be clustered into similar items, with similar articles ending up with the same cluster id.

_**Finding Anomalies**_

If users post articles about tech and suddenly someone posts an article about places to buy plushies, that would be an anomaly and might be worth checking against the site's requirements. To implement an anomaly checker, set a distance threshold and flag anything further away than the threshold for manual review:

```sql
WITH article_distances AS (
    SELECT id, title,
           embedding <-> (SELECT AVG(embedding)::vector(1536) FROM articles)
               AS distance_from_average
    FROM articles
)
SELECT id, title, distance_from_average
FROM article_distances
WHERE distance_from_average > 0.75
ORDER BY distance_from_average DESC;
```

This query calculates the "average" embedding across all articles (representing your typical content) and then finds articles that are significantly different from this average. Experiment with the threshold to find what is right for your use case.

## Conclusion

Vectors are both complex and powerful; well-planned vectors can help automate many use cases or add features to your website. TF-IDF is the method chosen here, but it's not the only way to generate vectors: OpenAI has its own models for generating vectors from text, as does Ollama. These may or may not be better for your use case. It's important to experiment with different approaches: test various dimension sizes, comparison methods, and even vector generation techniques to find what works best for your specific needs.
0 0 0 0
Preview
Using TF-IDF Vectors With PHP & PostgreSQL

Vectors in PostgreSQL are used to compare data to find similarities, outliers, groupings, classifications and other things. pgvector is a popular extension that adds vector functionality to PostgreSQL.

## What is TF-IDF?

TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a way to measure the importance of a word in a document relative to a collection of documents.

## Term Frequency

Term frequency refers to how often a word is used within a document. In a 100-word document, if the word 'test' occurs 5 times, the term frequency is 5/100 = 0.05.

## Inverse Document Frequency

Inverse document frequency measures how unique a word is across a group of documents. Common words like "the" or "and" appear in almost all documents, so they are assigned a low IDF score. Rare, specific words are assigned a higher IDF score.

The TF-IDF score is **TF * IDF**.

## Normalizing TF-IDF

A drawback of TF-IDF is that it unfairly advantages long documents over short ones. Longer documents can accumulate higher TF-IDF scores simply because they contain more words, not necessarily because a word is more relevant. This can be corrected by normalizing the score based on the document length: `TF-IDF score / total words in document`.

## PHP Implementation Guide

To create vectors in PHP, select all articles from a database and loop through them:

```php
foreach ($articles as $article) {
    $articleText = $article['description'];
    $words = $this->tokenizeArticle($articleText);
    $tokenizedDocuments[$article['id']] = $words;
    $this->updateDocumentFrequencies($documentFrequencies, $words);
}
```

Break each document up into an array of words. Additional word processing could be done here if required.
```php
protected function tokenizeArticle(string $text): array
{
    $text = strtolower($text);
    $text = preg_replace('/[^\w\s]/', '', $text);
    $words = preg_split('/\s+/', trim($text));

    return $words;
}
```

Create an array to keep track of the word frequency across all documents:

```php
protected function updateDocumentFrequencies(array &$documentFrequencies, array $words): void
{
    $uniqueWords = array_unique($words);
    foreach ($uniqueWords as $word) {
        if (!isset($documentFrequencies[$word])) {
            $documentFrequencies[$word] = 0;
        }
        $documentFrequencies[$word]++;
    }
}
```

Once the articles have been processed, create the embedding vector for each one:

```php
protected function createEmbeddings(
    array $articles,
    array $tokenizedDocuments,
    array $documentFrequencies,
): void {
    $totalDocuments = count($articles);
    foreach ($articles as $article) {
        $articleId = $article['id'];
        $words = $tokenizedDocuments[$articleId];
        $embedding = $this->calculateEmbedding(
            $words,
            $documentFrequencies,
            $totalDocuments
        );
        // Persist $embedding to the article's vector column here.
    }
}
```

calculateEmbedding() is where the main TF-IDF calculation is done:

```php
protected function calculateEmbedding(
    array $words,
    array $documentFrequencies,
    int $totalDocuments
): array {
    $termFrequencies = array_count_values($words);
    $totalWords = count($words);
    $embedding = array_fill(0, $this->dimension, 0.0);
    foreach ($termFrequencies as $word => $count) {
        $tf = $count / $totalWords;
        $idf = log($totalDocuments / ($documentFrequencies[$word] + 1));
        $tfidf = $tf * $idf;
        $index = abs(crc32($word) % $this->dimension);
        $embedding[$index] += $tfidf;
    }

    return $this->normalizeVector($embedding);
}
```

_**Dimensions**_

The number chosen for dimensions is critical to good-quality TF-IDF. It should be large enough to hold the number of unique words in any of your documents. 768 or 1536 are good numbers for medium-sized documents. As a general rule, about 20-30% of the words in a document are unique, so 1536 equates to roughly a 20 to 30 page document.
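To see these pieces working together, here is a condensed, standalone sketch of the same pipeline using plain functions instead of class methods. The function names, sample documents, and small dimension size are illustrative, not part of any library:

```php
<?php
// Standalone sketch of the TF-IDF pipeline: tokenize, count document
// frequencies, then hash each word's TF-IDF score into a fixed-size vector.

function tokenize(string $text): array {
    $text = strtolower($text);
    $text = preg_replace('/[^\w\s]/', '', $text);
    return preg_split('/\s+/', trim($text));
}

function embed(array $words, array $docFreqs, int $totalDocs, int $dim): array {
    $termFreqs = array_count_values($words);
    $totalWords = count($words);
    $embedding = array_fill(0, $dim, 0.0);
    foreach ($termFreqs as $word => $count) {
        $tf = $count / $totalWords;
        $idf = log($totalDocs / ($docFreqs[$word] + 1));
        // crc32 maps the word to a stable bucket in [0, $dim).
        $embedding[abs(crc32($word) % $dim)] += $tf * $idf;
    }
    return $embedding;
}

// Three tiny sample documents; a small dimension keeps the output readable.
$docs = [
    1 => 'PostgreSQL stores vectors',
    2 => 'PostgreSQL compares vectors quickly',
    3 => 'PHP tokenizes text',
];
$tokenized = [];
$docFreqs = [];
foreach ($docs as $id => $text) {
    $words = tokenize($text);
    $tokenized[$id] = $words;
    foreach (array_unique($words) as $word) {
        $docFreqs[$word] = ($docFreqs[$word] ?? 0) + 1;
    }
}

// Each word in document 3 appears in 1 of 3 documents, so idf = log(3/2) > 0
// and the resulting vector has a few positive buckets, the rest zero.
$vec = embed($tokenized[3], $docFreqs, count($docs), 16);
```

Note how with only 16 dimensions and a handful of words, most buckets stay at zero, which is exactly the sparse-vector shape discussed below.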
_**Calculate TF**_

Divide the number of times a word occurs in a document by the total words in the document:

`$tf = $count / $totalWords;`

**_Calculate IDF_**

Since IDF is the inverse of the document frequency, we use log to calculate the score:

`$idf = log($totalDocuments / ($documentFrequencies[$word] + 1));`

**_Calculate TF-IDF_**

`$tfidf = $tf * $idf;`

**_TF-IDF array_**

TF-IDF arrays do not store values in insertion order; instead, each value is stored at a calculated array key. This ensures that the same word will always appear in the same array position across all documents. While it is possible for two words to hash to the same array key, as long as the dimension size chosen is appropriate for the size of the document, collisions are rare and have little effect on the overall vector. To calculate the position, use crc32 to generate an integer representation of the word, divide it by the dimension size, and use the remainder as the array key. This gives a good spread of positions filled with TF-IDF scores.

**_Normalizing_**

Earlier we talked about normalizing as word frequency / document length. While it can be calculated that way, normalizing is more commonly done with the Euclidean norm formula: √(x₁² + x₂² + ... + xₙ²). The normalizeVector method is a PHP implementation of this formula:

```php
protected function normalizeVector(array $vector): array
{
    $magnitude = sqrt(array_sum(array_map(function ($x) {
        return $x * $x;
    }, $vector)));
    if ($magnitude > 0) {
        return array_map(function ($x) use ($magnitude) {
            return $x / $magnitude;
        }, $vector);
    }

    return $vector;
}
```

The final vector may look something like this:

`[0.052876625, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.013156515, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.012633555, 0, 0, 0, 0, 0.0065987236, 0, 0 ...]`

This is known as a sparse vector. A sparse vector has a lot of empty array keys, whereas a dense vector is much more filled in. Dense vectors can improve the quality of the vector.
One method of doing this is to include bi-grams in the vector along with the single words. For example, `This is known as a sparse vector` would include each word [this, is, known, as, a, sparse, vector]; adding bi-grams would also include [this_is, is_known, known_as, as_a, a_sparse, sparse_vector], which adds more context by taking the surrounding words into account.

## Creating Queries in PostgreSQL

Once vectors have been generated for your documents, it's time to store them in PostgreSQL. Selecting the right dimension is also critical here: once you choose a dimension size, all vectors going into the column have to be the same dimension.

```sql
ALTER TABLE "articles" ADD COLUMN "embedding" vector(1536);
```

## Types of Comparisons

pgvector provides three main comparison operators:

**Euclidean (L2) distance** : `<->` : Measures how far apart two vectors are. Smaller numbers mean vectors are more similar. Good for finding similar products, etc.

**Cosine distance** : `<=>` : Measures the angle between vectors, ignoring their magnitude. Smaller numbers mean vectors are more similar in direction. Good for text similarity where length shouldn't matter.

**Inner product** : `<#>` : Measures how much vectors "align" with each other. Larger inner products mean more similar vectors (the opposite of the others!), so pgvector's operator returns the negative inner product to keep `ORDER BY ... ASC` working. Useful for normalized vectors.

Try them all with your data to find the one that best suits your use case.

## Creating a recommendation system

One use case for vectors is a recommendation system: in this case, finding articles that are related in some way to the one you are currently reading. To do this, we order the rows by the comparison to find the ones most relevant to the current article. In this query, the embedding of the current article is selected first, and then other articles are compared to it to find the most relevant. For a recommendation, it makes sense to filter the current article out of the query.
```sql
WITH search_article AS (
    SELECT embedding FROM articles WHERE id = 12
)
SELECT id, title
FROM articles
WHERE id <> 12
ORDER BY embedding <-> (SELECT embedding FROM search_article)
LIMIT 4;
```

## Creating a search engine

Vectors can also be used to create a search engine for your documents, comparing articles with a user-entered question or keywords. To do this, the user-entered question needs to be converted into a vector using the term frequencies of your current articles (store these in the database so you are not recalculating them every time a search query runs). The user vector must be the same dimension size as the article vectors. Then create a query that compares the user vector to the stored vectors:

```sql
SELECT id, title FROM articles ORDER BY embedding <=> :embedding LIMIT 4;
```

## Other use cases

**_Classifying Articles_**

A more complex use case for vectors is classifying documents to group similar articles together. You may not have specific tags/keywords to classify documents against, but articles can still be clustered into similar items, with similar articles ending up with the same cluster id.

**_Finding Anomalies_**

If users post articles about tech, and suddenly someone posts an article about places to buy plushies, that would be an anomaly and might be worth checking to see if it fits the site's requirements. To implement an anomaly checker, set a distance threshold; anything further away than the threshold is flagged for manual review.

```sql
WITH article_distances AS (
    SELECT id, title,
           embedding <-> (SELECT AVG(embedding)::vector(1536) FROM articles) AS distance_from_average
    FROM articles
)
SELECT id, title, distance_from_average
FROM article_distances
WHERE distance_from_average > 0.75
ORDER BY distance_from_average DESC;
```

This query calculates the "average" embedding across all articles (representing your typical content) and then finds articles that are significantly different from this average.
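The same threshold check can also be sketched outside the database. Here is a minimal PHP version, assuming the embeddings have already been loaded into an array; the ids, 3-dimensional vectors, and 0.75 threshold are illustrative only:

```php
<?php
// Anomaly check in PHP: compute the element-wise average of a set of
// vectors, then flag anything whose Euclidean (L2) distance from that
// average exceeds a threshold.

function l2Distance(array $a, array $b): float {
    $sum = 0.0;
    foreach ($a as $i => $v) {
        $d = $v - $b[$i];
        $sum += $d * $d;
    }
    return sqrt($sum);
}

$vectors = [
    'article-1' => [0.1, 0.9, 0.0],
    'article-2' => [0.2, 0.8, 0.1],
    'article-3' => [0.9, 0.0, 0.9],  // deliberately far from the others
];

// Element-wise average, mirroring AVG(embedding) in the SQL version.
$dim = 3;
$avg = array_fill(0, $dim, 0.0);
foreach ($vectors as $vec) {
    foreach ($vec as $i => $v) {
        $avg[$i] += $v / count($vectors);
    }
}

$flagged = [];
foreach ($vectors as $id => $vec) {
    if (l2Distance($vec, $avg) > 0.75) {
        $flagged[] = $id;
    }
}
// $flagged now holds the ids that need manual review.
```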
Experiment with the threshold to find what is right for your use case.

## Conclusion

Vectors are both complex and powerful; well-planned vectors can help automate many use cases or add features to your website. TF-IDF, while it is the method I chose here, is not the only way to generate vectors. OpenAI has its own models for generating vectors from text, as does Ollama. These may or may not be better for your use case. It's important to experiment with different approaches: test various dimension sizes, comparison methods, and even vector generation techniques to find what works best for your specific needs.

As an engineer who is also a DJ, it's perfectly normal to misread #PostgresSQL in a document as #ProgressiveSQL....

#ProgressiveTrance


Self-hosted #AI starter kit, a #docker container with #Ollama for #LLMs, the #Qdrant vector database and #postgresSQL:
🚀🚀🚀 github.com/n8n-io/self-...
