min park

NLP and Data Analytics

Fine-tuning open-source LLMs for Multi-entity Sentiment Analysis task using [mlx-lm]

April 18, 2026 · Project

Multi-entity Sentiment Analysis revisited

As first laid out in this post, Multi-entity Sentiment Analysis is one of the toughest tasks I have been pondering.

However intelligent the models were, most of them failed to extract a separate sentiment per entity.

Take the following sentence for example.

Apple shares shot up thanks to iPhone sales, while its peers struggled with the increased AI spending.

Unlike human readers, who can capture the essence of the text, many sentiment models (and simple bag-of-words approaches) try to determine the overall sentiment of the given text rather than the sentiment toward a specific entity, so juxtaposing contrasting polarities often confuses them.
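A toy comparison makes the gap concrete (the labels here are illustrative, not outputs of any real model): a document-level classifier must collapse the sentence into a single label, while the target of this task is one label per entity.

```python
# Toy illustration: document-level vs. entity-level sentiment.
# Labels are illustrative only, not real model outputs.

sentence = (
    "Apple shares shot up thanks to iPhone sales, "
    "while its peers struggled with the increased AI spending."
)

# A document-level model must pick ONE label for the whole sentence,
# even though it mixes positive (Apple) and negative (peers) signals.
document_level = {"text": sentence, "sentiment": "?"}  # +, -, or 0: ambiguous

# The multi-entity target assigns a polarity per entity instead.
entity_level = [
    {"entity": "Apple", "polarity": "+"},
    {"entity": "its peers", "polarity": "-"},
]

for item in entity_level:
    print(item["entity"], item["polarity"])
```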

(Original) Approaches

The main approach when I first floated the idea of multi-entity sentiment analysis was to simplify and formalize the sentence structure, so as to confuse the language model less.

To simplify the sentence, I had organized the following structure.

  • Entity: (Subject and object)
  • Polarity: + / - / 0
  • Direction of the polarity:
    • Posline (Positive/Positive): Both entities have the same directional polarity (positive)
    • Pos (Positive/Neutral): One of the entities has positive polarity
    • Over (Positive/Negative): The first entity is positive, while the latter is negative
    • Under (Negative/Positive): The first entity is negative, while the latter is positive
    • Negline (Negative/Negative): Both entities have the same directional polarity (negative)
    • Neg (Negative/Neutral): One of the entities has negative polarity
  • Category:
    • Investments: Buy, Sell, IPO, Privatization, Invest, Bid
    • Cooperation: Win-win situations (in-tandem)
    • Family / Ownership: Same line of business (e.g., Franchise)
    • Performance: Stock market performance
    • Legal: File, [Sued / Indicted / Subpoenaed / Alleged] (by), Win, Lose (Bidirectional)
    • News release: Launch, Patent, Authorization
    • Bankruptcy: Entered, Exited
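The structure above can be sketched as a small Python schema. The field names and types here are my own, purely for illustration of how the pieces fit together:

```python
from dataclasses import dataclass
from typing import Literal

# Polarity per entity: + / - / 0
Polarity = Literal["+", "-", "0"]

# Direction labels for an entity pair, as defined above.
Direction = Literal["Posline", "Pos", "Over", "Under", "Negline", "Neg"]

# Relation categories from the original definition.
Category = Literal[
    "Investments", "Cooperation", "Family/Ownership",
    "Performance", "Legal", "NewsRelease", "Bankruptcy",
]

@dataclass
class EntityPair:
    subject: str
    obj: str
    subject_polarity: Polarity
    object_polarity: Polarity
    direction: Direction
    category: Category

# The Apple-vs-peers example sentence would annotate as:
pair = EntityPair(
    subject="Apple", obj="its peers",
    subject_polarity="+", object_polarity="-",
    direction="Over", category="Performance",
)
print(pair.direction)
```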

Multi-entity sentiment analysis

As with the coref project, times have changed and LLMs can now help a lot with the most tedious tasks, i.e., generating train/validation datasets.

The above definition came in handy for Claude to refine and generate datasets needed for the training process.

  • Entity: (Subject and object)
  • Polarity: + / - / 0 / ~
  • Category: Legal / Business / Performance / Recruitment / NewsRelease / Bankruptcy

A sample record is as follows.

{"id": "eval-016", "text": "Apple launched its first generative AI features across iPhone and Mac, positioning Apple Intelligence as a privacy-first alternative to ChatGPT.", "extractions": [{"entity": "Apple", "polarity": "+", "category": "NewsRelease"}, {"entity": "Apple Intelligence", "polarity": "+", "category": "NewsRelease"}, {"entity": "ChatGPT", "polarity": "-", "category": "NewsRelease"}]}
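Records in this format are easy to sanity-check before training. A minimal validator (my own helper, not part of the repo) might look like:

```python
import json

# Allowed values from the simplified schema above.
VALID_POLARITIES = {"+", "-", "0", "~"}
VALID_CATEGORIES = {
    "Legal", "Business", "Performance",
    "Recruitment", "NewsRelease", "Bankruptcy",
}

def validate_record(line: str) -> dict:
    """Parse one JSONL line and check the extraction schema."""
    record = json.loads(line)
    for key in ("id", "text", "extractions"):
        if key not in record:
            raise ValueError(f"missing field: {key}")
    for ext in record["extractions"]:
        if ext["polarity"] not in VALID_POLARITIES:
            raise ValueError(f"bad polarity: {ext['polarity']}")
        if ext["category"] not in VALID_CATEGORIES:
            raise ValueError(f"bad category: {ext['category']}")
    return record

sample = ('{"id": "eval-016", "text": "...", "extractions": '
          '[{"entity": "Apple", "polarity": "+", "category": "NewsRelease"}]}')
rec = validate_record(sample)
print(len(rec["extractions"]))  # 1
```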

Fine-tuning open-source LLMs

Given the hardware constraints (MacBook Air M1, 16GB) and the goal of minimizing cloud inference usage, the candidate models were similar to those I went through in coref.
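For reference, the LoRA fine-tuning itself runs through mlx-lm's built-in recipe. A typical invocation looks roughly like the following; the model, data path, iteration count, and batch size are all placeholders, and the JSONL records above first need to be converted into the train/valid format mlx-lm expects under the data directory:

```shell
# Hypothetical invocation; adjust model, data path, and hyperparameters.
uv run mlx_lm.lora \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --train \
  --data data \
  --iters 600 \
  --batch-size 4 \
  --adapter-path adapters
```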

As before, my initial choice was the Gemma family (both Gemma3 and Gemma4), but unfortunately, it didn't live up to my expectations.

Rather surprisingly, the Llama family (Llama 3.1 and Llama 3.2) still outperformed for my use case. Moreover, Llama 3.2 3B was even better than Llama 3.1 8B, as shown in the summary below.

Eval Results (20-sample eval set)

| Metric | Gemma3 4B (Base) | Gemma3 4B (Fine-tuned) | Llama 3.1 8B (Base) | Llama 3.1 8B (Fine-tuned) | Llama 3.2 3B (Base) | Llama 3.2 3B (Fine-tuned) |
|---|---|---|---|---|---|---|
| Perfect records | 20% | ❌ broken JSON | 40% | 55% | N/A | 65% |
| Entity Precision | 69.1% | – | 70.0% | 86.4% | N/A | 97.4% |
| Entity Recall | 92.7% | – | 85.4% | 92.7% | N/A | 92.7% |
| Entity F1 | 79.2% | – | 76.9% | 89.4% | N/A | 95.0% |
| Polarity Accuracy | 81.6% | – | 68.6% | 81.6% | N/A | 89.5% |

While the sample was small, at least in terms of sentiment polarity for the expected entities, Llama 3.2 3B outperformed all the other models here.

Inference with the model

Using the locally saved fine-tuned model (adapter)

Using the scripts in the repo, inference can be run as below.

uv run python sentiment_inf.py --text "Microsoft to acquire Activision Blizzard for $68.7 billion." --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --adapter adapters
Loading model: mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
Fetching 6 files: 100%|███████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 26351.65it/s]
Extractions:
[
  {
    "entity": "Microsoft",
    "polarity": "+",
    "category": "Business"
  },
  {
    "entity": "Activision Blizzard",
    "polarity": "+",
    "category": "Business"
  }
]

Using the remote model

A fine-tuned model based on Llama-3.2-3B-Instruct is also available on Hugging Face.

from mlx_lm import load, generate

model, tokenizer = load("staedi/sentiment-llama-3.2")

prompt = (
    "You are a financial analyst specializing in directed sentiment extraction. "
    "Given a financial news text, identify all mentioned entities and determine "
    "the sentiment directed toward each one. Return your answer as a JSON array "
    "where each element has: \"entity\" (name), \"polarity\" (+ positive, - negative, "
    "0 neutral, ~ context-dependent), and \"category\" (one of: Legal, Business, "
    "Performance, Recruitment, NewsRelease, Bankruptcy).\n\n"
    "Valid polarities: \"+\", \"-\", \"0\", \"~\"\n"
)

text = "Microsoft to acquire Activision Blizzard for $68.7 billion."
user_content = f"Extract the directed financial sentiment from the following text:\n\n{text}"

if tokenizer.chat_template is not None:
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": user_content},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=False,
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
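The response is expected to be a JSON array in the same shape as the training targets, but in my experience it is worth parsing model output defensively. This fallback helper is my own addition, not part of the repo:

```python
import json
import re

def parse_extractions(response: str) -> list[dict]:
    """Parse the model response; fall back to the first JSON array found."""
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        # Models sometimes wrap the array in extra prose; grab the array itself.
        match = re.search(r"\[.*\]", response, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise

# Example with a response that has a stray prefix before the array.
demo = ('Extractions:\n'
        '[{"entity": "Microsoft", "polarity": "+", "category": "Business"}]')
print(parse_extractions(demo)[0]["entity"])  # Microsoft
```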