Fine-tuning open-source LLMs for Multi-entity Sentiment Analysis task using [mlx-lm]
April 18, 2026 · Project
Multi-entity Sentiment Analysis revisited
As first laid out in this post, Multi-entity Sentiment Analysis has been one of the toughest tasks I have pondered.
However intelligent the models were, most of them failed to extract a separate sentiment per entity.
Take the following sentence for example.
Apple shares shot up thanks to iPhone sales, while its peers struggled with the increased AI spending.
Unlike human readers, who can capture the essence of the text, many sentiment models (and simple bag-of-words approaches) try to determine the overall sentiment of the given text rather than that of a specific entity, so juxtaposed contrasting polarities often confuse them.
(Original) Approaches
The main approach when I first floated the idea of multi-entity sentiment analysis was to simplify and formalize the sentence structure, which might confuse the language model less.
To simplify the sentence, I organized the following structure.
- Entity: (Subject and object)
- Polarity: +/-/0
- Direction of the polarity:
  - Posline (Positive/Positive): Both entities have the same directional polarity (positive)
  - Pos (Positive/Neutral): One of the entities has the positive polarity
  - Over (Positive/Negative): The first entity is positive, while the latter is negative
  - Under (Negative/Positive): The first entity is negative, while the latter is positive
  - Negline (Negative/Negative): Both entities have the same directional polarity (negative)
  - Neg (Negative/Neutral): One of the entities has the negative polarity
- Category:
- Investments: Buy, Sell, IPO, Privatization, Invest, Bid
- Cooperation: Win-win situations (in-tandem)
- Family / Ownership: Same line of business (e.g., Franchise)
- Performance: Stock market performance
- Legal: File, [Sued / Indicted / Subpoenaed / Alleged] (by), Win, Lose (Bidirectional)
- News release: Launch, Patent, Authorization
- Bankruptcy: Entered, Exited
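The direction-of-polarity scheme above can be sketched as a small lookup. This is my own minimal encoding, assuming polarities are written as "+", "-", and "0"; the fallback label for unlisted pairs (e.g., neutral/neutral) is hypothetical.

```python
# Direction labels from the scheme above, keyed by (subject, object) polarity.
# Which of the two entities is neutral does not change the label, so both
# orderings are listed for Pos and Neg (an assumption on my part).
DIRECTION_LABELS = {
    ("+", "+"): "Posline",  # both entities positive
    ("+", "0"): "Pos",      # one entity positive, the other neutral
    ("0", "+"): "Pos",
    ("+", "-"): "Over",     # first positive, second negative
    ("-", "+"): "Under",    # first negative, second positive
    ("-", "-"): "Negline",  # both entities negative
    ("-", "0"): "Neg",      # one entity negative, the other neutral
    ("0", "-"): "Neg",
}

def direction(p1: str, p2: str) -> str:
    """Map a (subject, object) polarity pair to its direction label."""
    return DIRECTION_LABELS.get((p1, p2), "Neutral")
```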
Multi-entity sentiment analysis
As with the coref project, times have changed and LLMs can now help a lot with the most tedious tasks, i.e., generating train/validation datasets.
The above definition came in handy for Claude to refine and generate datasets needed for the training process.
- Entity: (Subject and object)
- Polarity: +/-/0/~
- Category: Legal / Business / Performance / Recruitment / NewsRelease / Bankruptcy
A sample dataset record looks as follows.
{"id": "eval-016", "text": "Apple launched its first generative AI features across iPhone and Mac, positioning Apple Intelligence as a privacy-first alternative to ChatGPT.", "extractions": [{"entity": "Apple", "polarity": "+", "category": "NewsRelease"}, {"entity": "Apple Intelligence", "polarity": "+", "category": "NewsRelease"}, {"entity": "ChatGPT", "polarity": "-", "category": "NewsRelease"}]}
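When generating records like this with an LLM, a quick schema check catches malformed rows before training. A minimal sketch: the field names follow the sample record above, but the validation logic (including the assumption that every extracted entity appears verbatim in the text) is my own.

```python
import json

# Allowed values per the schema above.
VALID_POLARITIES = {"+", "-", "0", "~"}
VALID_CATEGORIES = {"Legal", "Business", "Performance",
                    "Recruitment", "NewsRelease", "Bankruptcy"}

def validate_record(line: str) -> list[dict]:
    """Parse one JSONL line and check each extraction against the schema."""
    record = json.loads(line)
    for ex in record["extractions"]:
        # Assumption: entities are quoted verbatim from the source text.
        assert ex["entity"] in record["text"], f"entity not in text: {ex['entity']}"
        assert ex["polarity"] in VALID_POLARITIES, f"bad polarity: {ex['polarity']}"
        assert ex["category"] in VALID_CATEGORIES, f"bad category: {ex['category']}"
    return record["extractions"]
```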
Fine-tuning open-source LLMs
Given the hardware constraints (MacBook Air M1, 16GB) and the goal of minimizing cloud inference usage, the models considered were similar to those in the coref process.
As before, my initial choice was the Gemma family (both Gemma3 and Gemma4), but unfortunately it didn't live up to my expectations.
Rather surprisingly, the Llama family (Llama 3.1 and Llama 3.2) still outperformed it for my use case. Moreover, Llama 3.2 3B was even better than Llama 3.1 8B, as the summary below shows.
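For context, fine-tuning with mlx-lm runs through its LoRA tooling. A hedged sketch of the kind of invocation involved, not the exact command I ran: flag names have shifted between mlx-lm releases (e.g., `--adapter-path` vs. the older `--adapter-file`), and the `data/` directory (with `train.jsonl`/`valid.jsonl`) and iteration count are placeholders.

```shell
# LoRA fine-tuning sketch with mlx-lm; verify flags against your installed
# version with `python -m mlx_lm.lora --help`.
uv run python -m mlx_lm.lora \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --train \
  --data data \
  --batch-size 2 \
  --iters 600 \
  --adapter-path adapters
```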
Eval Results (20-sample eval set)
| Metric | Gemma3 4B (Base) | Gemma3 (Fine-tuned) | Llama 3.1 8B (Base) | Llama 3.1 8B (Fine-tuned) | Llama 3.2 3B (Base) | Llama 3.2 3B (Fine-tuned) |
|---|---|---|---|---|---|---|
| Perfect records | 20% | ❌ broken JSON | 40% | 55% | N/A | 65% |
| Entity Precision | 69.1% | ❌ | 70.0% | 86.4% | N/A | 97.4% |
| Entity Recall | 92.7% | ❌ | 85.4% | 92.7% | N/A | 92.7% |
| Entity F1 | 79.2% | ❌ | 76.9% | 89.4% | N/A | 95.0% |
| Polarity Accuracy | 81.6% | ❌ | 68.6% | 81.6% | N/A | 89.5% |
While the sample was small, at least in terms of sentiment polarity for the expected entities, Llama 3.2 3B overshadowed all the other models here.
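For reference, the entity precision/recall/F1 figures above can be computed roughly as below. This is a sketch of my understanding (set overlap between predicted and gold entities per record); the actual eval script may aggregate differently.

```python
def entity_prf(gold: list[str], pred: list[str]) -> tuple[float, float, float]:
    """Precision/recall/F1 over entity mentions for one record."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # entities both predicted and expected
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```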
Inference with the model
Using the locally saved fine-tuned model (adapter)
Using the scripts in the repo, inference can be run as below.
```shell
uv run python sentiment_inf.py --text "Microsoft to acquire Activision Blizzard for $68.7 billion." --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --adapter adapters
```

```
Loading model: mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
Fetching 6 files: 100%|███████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 26351.65it/s]
Extractions:
[
  {
    "entity": "Microsoft",
    "polarity": "+",
    "category": "Business"
  },
  {
    "entity": "Activision Blizzard",
    "polarity": "+",
    "category": "Business"
  }
]
```
Using the remote model
A fine-tuned model based on Llama-3.2-3B-Instruct is also available on Hugging Face.
```python
from mlx_lm import load, generate

model, tokenizer = load("staedi/sentiment-llama-3.2")

system_prompt = (
    "You are a financial analyst specializing in directed sentiment extraction. "
    "Given a financial news text, identify all mentioned entities and determine "
    "the sentiment directed toward each one. Return your answer as a JSON array "
    "where each element has: \"entity\" (name), \"polarity\" (+ positive, - negative, "
    "0 neutral, ~ context-dependent), and \"category\" (one of: Legal, Business, "
    "Performance, Recruitment, NewsRelease, Bankruptcy).\n\n"
    "Valid polarities: \"+\", \"-\", \"0\", \"~\"\n"
)

text = "Microsoft to acquire Activision Blizzard for $68.7 billion."
user_content = f"Extract the directed financial sentiment from the following text:\n\n{text}"

if tokenizer.chat_template is not None:
    # Render the system/user messages with the model's chat template.
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=False,
    )
else:
    # Fallback for tokenizers without a chat template.
    prompt = f"{system_prompt}\n\n{user_content}"

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
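The response should be a JSON array, but generated text can carry stray tokens around it (preambles, end-of-turn markers), so a defensive parse is useful. A small sketch; the bracket-slicing heuristic is my own, not something from the repo.

```python
import json

def parse_extractions(response: str) -> list[dict]:
    """Pull the first JSON array out of generated text, tolerating
    stray tokens before and after it."""
    start, end = response.find("["), response.rfind("]")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON array found in model output")
    return json.loads(response[start:end + 1])
```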