How I Built Language-Aware Search in PostgreSQL JSON Data with Spring Boot

A few months ago, I was building a search feature for an Angular/Spring Boot app with localized content. We were under a tight deadline, and waiting for managers to choose an external search tool just wasn't an option. So, I decided to work with what we had. Here's how I pulled it off, and why the result worked even better than I expected.


In this article, I'll share the technical journey and the final solution.


The Challenge

The application was built with Angular, Spring Boot, and PostgreSQL. The database held around a thousand entries, each with a locale and its corresponding JSON data. The search endpoint receives the locale used on the frontend and the text to search for. The main challenge was comparing this search text against any value in the JSON object while also supporting similarity (fuzzy) matching.

We'll use a very simple database schema:

CREATE TABLE entries (
    id     int8,
    locale varchar(2),
    data   jsonb
);
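
On the Spring Boot side, the table maps to a simple entity. Here's a minimal sketch, assuming Hibernate 6, where @JdbcTypeCode(SqlTypes.JSON) handles the jsonb column without extra libraries (the class and field names are illustrative, not the project's actual code):

// Minimal JPA mapping for the entries table (sketch; assumes Hibernate 6+).
import jakarta.persistence.*;
import org.hibernate.annotations.JdbcTypeCode;
import org.hibernate.type.SqlTypes;
import java.util.Map;

@Entity
@Table(name = "entries")
public class Entry {

    @Id
    private Long id;

    @Column(length = 2)
    private String locale;

    // Maps the jsonb column to a Map; Hibernate serializes it as JSON.
    @JdbcTypeCode(SqlTypes.JSON)
    @Column(columnDefinition = "jsonb")
    private Map<String, Object> data;

    // getters and setters omitted
}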

I could have loaded all the entries as POJOs and searched through them in memory, but that would have performed poorly. Luckily, PostgreSQL has some special operators that are perfect for this kind of need.


Step 1: Exploring PostgreSQL's JSONB Operators

PostgreSQL gives us a rich set of operators for querying JSONB data. Since our data column holds the entire entry content, my first idea was: "Can I just search inside all of it without restructuring the schema?"

Here's what I found most useful:

The @> Operator — Containment

Checks if the JSON contains a given key-value pair.

SELECT * FROM entries
WHERE data @> '{"status": "published"}';

Best for: Filtering on specific fields without knowing the whole structure. Limitations: It's exact, no partial matches or fuzzy search.

The ? Operator — Key Exists

Checks if a key exists at the top level of the JSON.

SELECT * FROM entries
WHERE data ? 'title';

Great for: Quick existence checks before deeper processing. Limitations: Doesn't look into nested objects unless you access them explicitly.

The -> Operator — Extract JSON Value

Returns a JSON value that can be further processed or nested.

SELECT * FROM entries
WHERE data->'content'->'title' ? 'postgres';

Useful for: Nested JSON navigation and chaining. Limitations: Returns JSON type, not text, so you need additional operators for text search.

The ->> Operator — Extract Text Value

Returns the field value as text rather than JSON, so it can be used directly with text operators like ILIKE.

SELECT * FROM entries
WHERE data->'content'->>'title' ILIKE '%postgres%';

Ideal for: Combining nested navigation with text extraction. Limitations: Requires knowing the exact nested path structure.
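
From Spring Boot, these operators can be used in plain native queries. One gotcha: the bare ? operator collides with JDBC parameter placeholders, so the PostgreSQL JDBC driver lets you escape it as ??. Here's a minimal sketch using JdbcTemplate (class and method names are illustrative):

// Sketch: calling JSONB operators from Spring via JdbcTemplate.
import org.springframework.jdbc.core.JdbcTemplate;
import java.util.List;

public class EntryJsonbQueries {

    private final JdbcTemplate jdbcTemplate;

    public EntryJsonbQueries(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Entries whose nested title contains the given text (case-insensitive).
    public List<Long> findIdsByTitleContaining(String text) {
        return jdbcTemplate.queryForList(
            "SELECT id FROM entries WHERE data->'content'->>'title' ILIKE ?",
            Long.class,
            "%" + text + "%");
    }

    // Entries that have a top-level "title" key; "??" is the escaped ? operator.
    public List<Long> findIdsHavingTitle() {
        return jdbcTemplate.queryForList(
            "SELECT id FROM entries WHERE data ?? 'title'",
            Long.class);
    }
}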

At this point, I realized these operators are great for targeted lookups, but not enough for searching across multiple fields or the whole JSON.

📚 More info: PostgreSQL JSONB Operators, Neon Docs


Step 2: Trying websearch_to_tsquery for "Search Everywhere"

After JSONB operators, I needed a way to search the entire data JSON, not just a specific field. That's where PostgreSQL's full‑text search came in.

What these functions do:

  • to_tsvector(config, text) converts your text into a searchable format by breaking it into words and normalizing them (removing punctuation, converting to lowercase, etc.)
  • websearch_to_tsquery(config, text) converts a user's search query into a format that PostgreSQL can understand, supporting Google-like syntax with AND/OR operators

In simple terms: One function prepares your data for searching, the other prepares the search query.

Example:

SELECT *
FROM entries
WHERE to_tsvector('simple', data::text)
      @@ websearch_to_tsquery('simple', 'postgres OR json');

Key advantages:

  • Understands AND/OR and quoted phrases.
  • Removes stopwords (depending on config).

Limitations:

  1. Language-specific: the 'english' configuration stems and removes stopwords for English but mishandles French or Spanish, while 'simple' works for any language but removes no stopwords and applies no stemming.
  2. No fuzzy matching: "postres" won't match "postgres".
  3. JSON attribute names: casting data to text includes the keys, so a query can match an attribute name rather than actual content.

At this point, I had a working search that could find text across the entire JSON, but it still couldn't handle typos or partial matches. That's when I discovered PostgreSQL's pg_trgm extension for fuzzy matching.


Step 3: Adding pg_trgm for Fuzzy Matches

This is where things got interesting. I needed fuzzy matching to handle typos and partial searches.

What pg_trgm does:

  • pg_trgm breaks text into groups of three characters (trigrams) for similarity comparison
  • The similarity() function returns a score between 0 and 1, where 1 means perfect match
  • The % operator performs fuzzy matching using trigram similarity

You need to enable this extension in PostgreSQL:

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Verify installation
SELECT * FROM pg_extension WHERE extname = 'pg_trgm';
-- Should return a row if installed correctly

Here's how it works:

-- Test basic similarity comparison
SELECT similarity('postgres', 'postgresql');  -- Returns ~0.73
SELECT similarity('postgres', 'postgres');    -- Returns 1.0
SELECT similarity('postgres', 'mysql');       -- Returns ~0.18

-- Test fuzzy matching with %
SELECT * FROM products
WHERE name % 'postgres';  -- Finds 'postgresql', 'postgres', etc.

Here's how I used it:

// Repository method for fuzzy search (declared in the Spring Data repository interface)
List<SearchResultItem> entryContentSearch(
    @Param("locale") String locale,
    @Param("searchedValues") String[] searchedValues
);

The SQL:

SELECT
  e.id,
  MAX(similarity(value, sv)) AS score
FROM entries e,
     unnest(CAST(:searchedValues AS text[])) AS sv,
     jsonb_array_elements_text(
       jsonb_path_query_array(e.data, '$.** ? (@.type() == "string")')
     ) AS value
WHERE (e.locale = :locale OR e.locale = 'en')  -- Fallback to 'en'
  AND (value % sv OR value ILIKE CONCAT('%', sv, '%'))
GROUP BY e.id
ORDER BY score DESC;

Breaking down the complex parts:

  • unnest(CAST(:searchedValues AS text[])) converts our Java array into individual rows, so each search term becomes a separate row to compare against
  • jsonb_path_query_array(e.data, '$.** ? (@.type() == "string")') finds ALL string values in the JSON, no matter how deeply nested they are ($.** means "anywhere in the JSON"), but ignores JSON keys - it only searches the actual content values
  • jsonb_array_elements_text() converts the array of found strings into individual rows, so we can compare each string value against each search term
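
Here's roughly how this native query is wired into the Spring Data repository. This is a sketch: the SearchResultItem projection with getId() and getScore() is my assumption for how the two selected columns are mapped.

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;
import java.util.List;

public interface EntryRepository extends JpaRepository<Entry, Long> {

    // The native query from above, kept as a constant so it can be reused.
    String SEARCH_SQL = """
        SELECT e.id AS id, MAX(similarity(value, sv)) AS score
        FROM entries e,
             unnest(CAST(:searchedValues AS text[])) AS sv,
             jsonb_array_elements_text(
               jsonb_path_query_array(e.data, '$.** ? (@.type() == "string")')
             ) AS value
        WHERE (e.locale = :locale OR e.locale = 'en')
          AND (value % sv OR value ILIKE CONCAT('%', sv, '%'))
        GROUP BY e.id
        ORDER BY score DESC
        """;

    @Query(value = SEARCH_SQL, nativeQuery = true)
    List<SearchResultItem> entryContentSearch(
        @Param("locale") String locale,
        @Param("searchedValues") String[] searchedValues);
}

// In its own file: interface-based projection, where the column aliases
// (id, score) map to the getter names.
interface SearchResultItem {
    Long getId();
    Double getScore();
}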

What made this solution effective:

  • Searches all string values in JSON.
  • Accepts typos and partial matches (%).
  • Still supports strict contains (ILIKE).

Limitations:

  1. Requires the pg_trgm extension to be enabled in PostgreSQL.
  2. Can be slower than exact matches due to trigram calculations.
  3. Stopwords problem: Words like "the", "and", "or" still appear in results, making searches less relevant.

The stopwords issue:

When users searched for "the database", our fuzzy search would match entries containing "the" and "database" separately, but "the" is not meaningful for search results. This created noise in our results and made the search less useful.

What we achieved so far:

  • ✅ Fuzzy matching with typos and partial matches
  • ✅ Search across all JSON string values
  • ✅ Similarity scoring for result ranking
  • ❌ Still included irrelevant stopwords

At this point, I had fuzzy search working, but the results were polluted with stopwords and weren't language-aware. That's when I decided to add Lucene for proper language processing.


Step 4: Cleaning the Query with Lucene

What this still didn't do: understand language. I needed to remove stopwords, fold accents, normalize case, and keep it locale-aware.

What is Lucene?

Apache Lucene is a high-performance, full-featured text search engine library written in Java. It provides powerful text analysis capabilities including tokenization, stemming, stopword removal, and language-specific processing. It's the foundation for many search applications and is particularly effective for natural language processing tasks.

📚 More info: Apache Lucene Documentation

Language-specific processing explained:

Each language has unique characteristics that affect search quality:

  • French: Removes French stopwords ("le", "la", "de", "et"), handles accents (é → e), and applies French-specific stemming
  • English: Removes English stopwords ("the", "and", "or"), handles contractions, and applies English stemming rules
  • Spanish: Removes Spanish stopwords ("el", "la", "de", "y"), handles Spanish-specific word forms and accents
  • German: Handles German compound words, removes German stopwords ("der", "die", "das"), and applies German stemming
  • Italian: Removes Italian stopwords ("il", "la", "di", "e"), handles Italian-specific word forms and accents
  • Default (StandardAnalyzer): Basic tokenization without language-specific processing

Example of the difference:

// French query: "le système de base de données"
FrenchAnalyzer → ["systeme", "base", "donnees"]  // Removes "le", "de", handles accents

// English query: "the database system"
EnglishAnalyzer → ["databas", "system"]  // Removes "the", applies stemming

// Spanish query: "el sistema de base de datos"
SpanishAnalyzer → ["sistema", "base", "datos"]  // Removes "el", "de"

// Italian query: "il sistema di base di dati"
ItalianAnalyzer → ["sistema", "base", "dati"]  // Removes "il", "di"

This ensures that search queries are processed according to the user's language, significantly improving relevance and reducing noise from stopwords.

I mapped locale → analyzer:

Analyzer analyzer = switch (locale) {
    case "fr" -> new FrenchAnalyzer();
    case "en" -> new EnglishAnalyzer();
    case "de" -> new GermanAnalyzer();
    case "es" -> new SpanishAnalyzer();
    case "it" -> new ItalianAnalyzer();
    default -> new StandardAnalyzer();
};

Full version:

// Required imports
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.es.SpanishAnalyzer;
import org.apache.lucene.analysis.it.ItalianAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.io.StringReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public static List<String> getTokens(String[] inputs, String locale) {
    List<String> result = new ArrayList<>();

    try (Analyzer analyzer = switch (locale) {
      case "fr" -> new FrenchAnalyzer();
      case "en" -> new EnglishAnalyzer();
      case "de" -> new GermanAnalyzer();
      case "es" -> new SpanishAnalyzer();
      case "it" -> new ItalianAnalyzer();
      default -> new StandardAnalyzer();
    }) {
      for (String input : inputs) {
        try (TokenStream tokenStream = analyzer.tokenStream("", new StringReader(input))) {
          CharTermAttribute attr = tokenStream.addAttribute(CharTermAttribute.class);
          tokenStream.reset();
          while (tokenStream.incrementToken()) {
            result.add(attr.toString());
          }
          tokenStream.end();
        } catch (IOException e) {
          log.error("Error during Lucene processing: {}", e.getMessage());
        }
      }
    }

    return result;
  }

In the service:

public List<SearchResultItem> search(String locale, String userQuery) {
  List<String> tokens = getTokens(new String[]{userQuery}, locale);
  if (tokens.isEmpty()) return List.of();
  return entryRepository.entryContentSearch(locale, tokens.toArray(String[]::new));
}
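
To complete the picture, the service can be exposed through a small REST controller. A minimal sketch, assuming the search method above lives in a SearchService bean (the path and parameter names are illustrative, not the project's actual API):

import org.springframework.web.bind.annotation.*;
import java.util.List;

@RestController
@RequestMapping("/api/search")
public class SearchController {

    private final SearchService searchService;

    public SearchController(SearchService searchService) {
        this.searchService = searchService;
    }

    // GET /api/search?locale=fr&q=...
    @GetMapping
    public List<SearchResultItem> search(@RequestParam String locale,
                                         @RequestParam("q") String query) {
        return searchService.search(locale, query);
    }
}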

Why I didn't go back to websearch_to_tsquery:

  • Mixed locales are hard to handle well.
  • JSON keys show up as matches.
  • No fuzzy matching.

Lucene preprocessing + pg_trgm gave me clean, language-aware tokens with fuzzy and strict matching.

Performance benefit of this architecture:

Splitting the process between Spring (Lucene preprocessing) and PostgreSQL (pg_trgm search) had a positive impact on performance. Spring handles the lightweight text analysis and tokenization, while PostgreSQL focuses on the heavy lifting of fuzzy matching across the database. This separation of concerns allows each system to do what it does best, resulting in faster overall search performance compared to doing everything in one place.

Limitations of this solution:

Limited locale support: Lucene ships analyzers for only a few dozen languages. We implemented five of them; many other languages (especially non-European ones) lack a dedicated analyzer.


Step 5: Performance Optimization

Once the search was working, I needed to optimize performance for production use. This involved two key areas: pagination for handling large result sets and proper database indexing.

Pagination

Spring's Pageable interface provides built-in pagination support. I integrated it into the search method:

public Page<SearchResultItem> search(String locale, String userQuery, Pageable pageable) {
  List<String> tokens = getTokens(new String[]{userQuery}, locale);
  if (tokens.isEmpty()) return Page.empty(pageable);
  return entryRepository.entryContentSearch(locale, tokens.toArray(String[]::new), pageable);
}

This allows the frontend to request specific pages of results, reducing memory usage and improving response times for large result sets.
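
On the repository side, the method now accepts a Pageable and returns a Page. One thing to be aware of with native queries: Spring Data JPA needs an explicit countQuery to compute the total number of results. A sketch, reusing the SEARCH_SQL constant from the Step 3 snippet (the count query itself is my assumption):

// Inside the EntryRepository interface from Step 3.
// Spring Data appends LIMIT/OFFSET from the Pageable and runs countQuery
// to know how many results exist in total.
@Query(value = SEARCH_SQL,
       countQuery = """
           SELECT COUNT(DISTINCT e.id)
           FROM entries e,
                unnest(CAST(:searchedValues AS text[])) AS sv,
                jsonb_array_elements_text(
                  jsonb_path_query_array(e.data, '$.** ? (@.type() == "string")')
                ) AS value
           WHERE (e.locale = :locale OR e.locale = 'en')
             AND (value % sv OR value ILIKE CONCAT('%', sv, '%'))
           """,
       nativeQuery = true)
Page<SearchResultItem> entryContentSearch(
    @Param("locale") String locale,
    @Param("searchedValues") String[] searchedValues,
    Pageable pageable);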

Indexing

I didn't implement this, but if the database keeps growing, adding an index on locale could help.

-- Index on locale for fast filtering
CREATE INDEX idx_entries_locale ON entries(locale);

Why are the indexing benefits limited?

Since our search combines locale filtering (WHERE locale = 'en') with JSONB content search, the indexing options are limited:

  • Locale index: Helps with the WHERE locale = 'en' part
  • JSONB GIN index: Would help if we were searching only within JSONB, but since we filter by locale first, PostgreSQL often can't effectively use both indexes together
  • Composite GIN index: Available but limited effectiveness for our specific query pattern

The locale index provides the main performance benefit by quickly narrowing down the dataset before applying the JSONB search operations.


Conclusion

What started as a simple search requirement evolved into a sophisticated, language-aware search system. By combining PostgreSQL's JSONB capabilities with Lucene's text analysis and pg_trgm's fuzzy matching, I built a solution that exceeded my expectations.

Key takeaways:

  • PostgreSQL JSONB operators provide powerful tools for querying structured JSON data
  • pg_trgm extension enables fuzzy matching and similarity scoring for handling typos and partial matches
  • Lucene analyzers offer language-specific text processing that significantly improves search relevance
  • Separation of concerns between Spring (text analysis) and PostgreSQL (search) optimizes performance
  • Pagination and indexing ensure the solution scales with growing datasets

The final result: A temporary but effective, language-aware search that handles multiple locales, supports fuzzy matching, removes stopwords, and provides relevant results across complex JSONB structures.

While external search engines like Elasticsearch or Algolia might offer more features out-of-the-box, this PostgreSQL-based solution proved that with the right approach, you can build sophisticated search capabilities using your existing database infrastructure. It served as an excellent temporary solution while the team evaluated long-term search engine options.

Complete example: You can find a working implementation of this approach in the postgres-jsonb-search repository.

Salomon Nghukam

Software Engineer at CGI France