If you want full control over your AI agent with zero per-message API costs, open-source models are the way to go. Hermes 3 (built on Llama 3.1) is one of the best open-source models for conversational AI - it follows instructions well, supports tool use, and runs on your own hardware or through affordable inference providers.
This guide shows you how to build an Instagram DM agent using Hermes 3 with two deployment options: self-hosted via Ollama (completely free) or cloud-hosted via Together AI (pay-per-token but no hardware needed).
Why Open-Source for Instagram DMs?
| Advantage | Details |
|---|---|
| Zero API fees | Run on your own GPU with Ollama - no per-message cost |
| Full data privacy | Conversations never leave your server |
| No rate limits | You control the throughput |
| Customizable | Fine-tune on your own data for better brand voice |
| No vendor lock-in | Switch models anytime without changing your code |
| Commercial use | Hermes 3 inherits the Llama 3.1 Community License, which permits commercial use |
Trade-offs vs Claude/GPT
| Factor | Open-Source (Hermes 3) | Claude/GPT |
|---|---|---|
| Response quality | Very good for typical DM conversations | Best available |
| Setup complexity | Higher (need GPU or inference provider) | Simple API key |
| Running cost | Free (self-hosted) or very cheap | Pay per token |
| Speed | Depends on hardware | Consistently fast |
| Tool use | Supported (Hermes 3) | Native, polished |
Architecture
Option A: Self-Hosted with Ollama (Free)
Instagram User
|
v
InstantDM (webhook)
|
v
Your Server (Node.js/Python)
|
v
Ollama (running Hermes 3 locally)
|
v
Your Server (sends reply)
|
v
InstantDM API (delivers DM)
Option B: Together AI (Cloud, No GPU Needed)
Instagram User
|
v
InstantDM (webhook)
|
v
Your Server
|
v
Together AI API (Hermes 3 hosted)
|
v
Your Server (sends reply)
|
v
InstantDM API (delivers DM)
What You Need
| Requirement | Option A (Ollama) | Option B (Together AI) |
|---|---|---|
| InstantDM account | Trendsetter+ | Trendsetter+ |
| InstantDM API key | Settings > API | Settings > API |
| GPU server | 8GB+ VRAM for the 8B model, 40GB+ for 70B | Not needed |
| Together AI key | Not needed | From together.ai |
| Server | Node.js or Python | Node.js or Python |
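Both options consume the same `dm_received` webhook event. The shape below is inferred from the fields the handler code reads (`event`, `data.instagram_user_id`, `data.message_text`); treat it as illustrative and check InstantDM's documentation for the authoritative schema:

```python
# Illustrative dm_received payload; field names match what the handlers read
sample_event = {
    "event": "dm_received",
    "data": {
        "instagram_user_id": "1234567890",
        "message_text": "Hi! Do you ship internationally?",
    },
}
```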
Option A: Self-Hosted with Ollama
Install Ollama and Pull Hermes 3
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull Hermes 3 (8B parameter version - needs ~5GB VRAM)
ollama pull hermes3:8b
# Or the larger 70B version for better quality (needs ~40GB VRAM)
ollama pull hermes3:70b
# Test it
ollama run hermes3:8b "Hello, how can I help you today?"
Ollama runs a local API server on http://localhost:11434 that's compatible with the OpenAI API format.
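Because the endpoint speaks the OpenAI chat format, you can smoke-test it from Python with nothing but the standard library. A sketch, assuming Ollama is running locally with `hermes3:8b` already pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(prompt, model="hermes3:8b"):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt):
    """POST one chat turn to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Run `print(chat("What are your store hours?"))` once the server is up. The same request body works against any OpenAI-compatible endpoint, which is what makes switching providers later a one-line change.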
Node.js Agent with Ollama
const express = require('express');
const crypto = require('crypto');
const app = express();
app.use(express.json({
  // Capture the raw body so signature checks use exactly the bytes InstantDM signed
  verify: (req, _res, buf) => { req.rawBody = buf; },
}));
const INSTANTDM_API_KEY = process.env.INSTANTDM_API_KEY;
const OLLAMA_URL = 'http://localhost:11434/v1/chat/completions';
const conversations = new Map();
const SYSTEM_PROMPT = `You are the customer support assistant for [Your Brand].
Rules:
- Answer product questions accurately
- Keep responses under 280 characters
- Be friendly and conversational
- Never invent information
- No markdown formatting
- If unsure, offer to connect with a human
Product info:
[Your product details here]`;
function verifySignature(req) {
  const sig = req.headers['x-webhook-signature'];
  if (!sig) return false;
  const expected = crypto
    .createHmac('sha256', INSTANTDM_API_KEY)
    .update(req.rawBody ?? JSON.stringify(req.body))
    .digest('hex');
  const sigBuf = Buffer.from(sig);
  const expBuf = Buffer.from(expected);
  // timingSafeEqual throws if lengths differ, so compare lengths first
  return sigBuf.length === expBuf.length && crypto.timingSafeEqual(sigBuf, expBuf);
}
app.post('/webhook/instantdm', async (req, res) => {
  // Ack immediately so InstantDM doesn't retry, then verify and process
  res.status(200).json({ received: true });
  if (!verifySignature(req)) return;
  const { event, data } = req.body;
  if (event === 'dm_received') await handleDM(data);
});
async function handleDM(data) {
const { instagram_user_id, message_text } = data;
if (!conversations.has(instagram_user_id)) {
conversations.set(instagram_user_id, []);
}
const history = conversations.get(instagram_user_id);
history.push({ role: 'user', content: message_text });
if (history.length > 20) history.splice(0, history.length - 20);
try {
const response = await fetch(OLLAMA_URL, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
model: 'hermes3:8b',
messages: [{ role: 'system', content: SYSTEM_PROMPT }, ...history],
max_tokens: 250,
temperature: 0.7,
stream: false,
}),
});
const result = await response.json();
const reply = result.choices[0].message.content;
history.push({ role: 'assistant', content: reply });
await sendReply(instagram_user_id, reply);
} catch (err) {
console.error('Ollama error:', err);
await sendReply(instagram_user_id,
"Hey! Quick technical issue. A team member will follow up soon."
);
}
}
async function sendReply(recipientId, text) {
if (text.length > 1000) text = text.substring(0, 997) + '...';
text = text.replace(/[*_~`#]/g, '');
await fetch('https://api.instantdm.com/api-webhook', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${INSTANTDM_API_KEY}`,
},
body: JSON.stringify({
action: 'send_message',
type: 'dm',
recipient_id: recipientId,
message: { type: 'text', text: text.trim() },
}),
});
}
app.listen(3000, () => console.log('Hermes agent running on port 3000'));
Python Agent with Ollama
import os, hmac, hashlib, threading, requests
from flask import Flask, request, jsonify
app = Flask(__name__)
INSTANTDM_API_KEY = os.environ['INSTANTDM_API_KEY']
OLLAMA_URL = 'http://localhost:11434/v1/chat/completions'
conversations = {}
SYSTEM_PROMPT = """You are the customer support assistant for [Your Brand].
Rules:
- Answer product questions accurately
- Keep responses under 280 characters
- Be friendly and conversational
- Never invent information
- No markdown formatting
- If unsure, offer to connect with a human
Product info:
[Your product details here]"""
def verify_signature(req):
sig = req.headers.get('X-Webhook-Signature', '')
expected = hmac.new(
INSTANTDM_API_KEY.encode(), req.get_data(), hashlib.sha256
).hexdigest()
return hmac.compare_digest(sig, expected)
@app.route('/webhook/instantdm', methods=['POST'])
def webhook():
if not verify_signature(request):
return jsonify({'error': 'Unauthorized'}), 401
payload = request.get_json()
if payload.get('event') == 'dm_received':
threading.Thread(target=handle_dm, args=(payload['data'],)).start()
return jsonify({'received': True}), 200
def handle_dm(data):
user_id = data['instagram_user_id']
message = data['message_text']
if user_id not in conversations:
conversations[user_id] = []
history = conversations[user_id]
history.append({'role': 'user', 'content': message})
if len(history) > 20:
conversations[user_id] = history[-20:]
history = conversations[user_id]
try:
        resp = requests.post(OLLAMA_URL, timeout=120, json={  # generous timeout: local inference can be slow
            'model': 'hermes3:8b',
            'messages': [{'role': 'system', 'content': SYSTEM_PROMPT}] + history,
            'max_tokens': 250,
            'temperature': 0.7,
            'stream': False,
        })
        resp.raise_for_status()
result = resp.json()
reply = result['choices'][0]['message']['content']
history.append({'role': 'assistant', 'content': reply})
send_reply(user_id, reply)
except Exception as e:
print(f'Ollama error: {e}')
send_reply(user_id, "Hey! Quick technical issue. A team member will follow up soon.")
def send_reply(recipient_id, text):
if len(text) > 1000: text = text[:997] + '...'
for ch in ['*','_','~','`','#']: text = text.replace(ch, '')
    requests.post('https://api.instantdm.com/api-webhook', timeout=15,
        headers={'Content-Type': 'application/json', 'Authorization': f'Bearer {INSTANTDM_API_KEY}'},
        json={'action': 'send_message', 'type': 'dm', 'recipient_id': recipient_id,
              'message': {'type': 'text', 'text': text.strip()}})
if __name__ == '__main__':
app.run(port=3000)
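To exercise either webhook locally without real Instagram traffic, compute the same HMAC the handlers verify and POST a fake event. A sketch (the `test-secret` value and local URL are stand-ins for your actual key and deployment):

```python
import hashlib
import hmac
import json

def sign_payload(secret: str, raw_body: bytes) -> str:
    """Reproduce the X-Webhook-Signature header the handlers check."""
    return hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()

body = json.dumps({
    "event": "dm_received",
    "data": {"instagram_user_id": "1234567890", "message_text": "Hi there"},
}).encode()
signature = sign_payload("test-secret", body)

# Then, e.g. with requests (send data=body, not json=..., so the
# bytes on the wire are exactly the bytes you signed):
# requests.post("http://localhost:3000/webhook/instantdm", data=body,
#               headers={"Content-Type": "application/json",
#                        "X-Webhook-Signature": signature})
```

The signature must be computed over the exact bytes you send, which is why the request passes pre-serialized bytes rather than letting the HTTP library re-serialize the dict.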
Option B: Together AI (Cloud-Hosted, No GPU)
Together AI hosts open-source models with an OpenAI-compatible API. You get the benefits of open-source models without managing hardware.
Node.js with Together AI
const OpenAI = require('openai');
// Together AI uses OpenAI-compatible API
const together = new OpenAI({
apiKey: process.env.TOGETHER_API_KEY,
baseURL: 'https://api.together.xyz/v1',
});
async function handleDM(data) {
const { instagram_user_id, message_text } = data;
if (!conversations.has(instagram_user_id)) {
conversations.set(instagram_user_id, []);
}
const history = conversations.get(instagram_user_id);
history.push({ role: 'user', content: message_text });
if (history.length > 20) history.splice(0, history.length - 20);
try {
const completion = await together.chat.completions.create({
model: 'NousResearch/Hermes-3-Llama-3.1-8B',
messages: [{ role: 'system', content: SYSTEM_PROMPT }, ...history],
max_tokens: 250,
temperature: 0.7,
});
const reply = completion.choices[0].message.content;
history.push({ role: 'assistant', content: reply });
await sendReply(instagram_user_id, reply);
} catch (err) {
console.error('Together AI error:', err);
await sendReply(instagram_user_id,
"Hey! Quick technical issue. A team member will follow up soon."
);
}
}
Python with Together AI
from openai import OpenAI
# Together AI uses OpenAI-compatible API
together = OpenAI(
api_key=os.environ['TOGETHER_API_KEY'],
base_url='https://api.together.xyz/v1',
)
def handle_dm(data):
user_id = data['instagram_user_id']
message = data['message_text']
if user_id not in conversations:
conversations[user_id] = []
history = conversations[user_id]
history.append({'role': 'user', 'content': message})
if len(history) > 20:
conversations[user_id] = history[-20:]
history = conversations[user_id]
try:
completion = together.chat.completions.create(
model='NousResearch/Hermes-3-Llama-3.1-8B',
messages=[{'role': 'system', 'content': SYSTEM_PROMPT}] + history,
max_tokens=250,
temperature=0.7,
)
reply = completion.choices[0].message.content
history.append({'role': 'assistant', 'content': reply})
send_reply(user_id, reply)
except Exception as e:
print(f'Together AI error: {e}')
send_reply(user_id, "Hey! Quick technical issue. A team member will follow up soon.")
Other Open-Source Models You Can Use
The code above works with any model available on Ollama or Together AI. Just change the model name:
| Model | Parameters | Quality | Speed | Ollama Name | Together AI Name |
|---|---|---|---|---|---|
| Hermes 3 8B | 8B | Good for simple DMs | Fast | hermes3:8b | NousResearch/Hermes-3-Llama-3.1-8B |
| Hermes 3 70B | 70B | Excellent, near GPT-4 | Slower | hermes3:70b | NousResearch/Hermes-3-Llama-3.1-70B |
| Llama 3.1 8B | 8B | Good general purpose | Fast | llama3.1:8b | meta-llama/Meta-Llama-3.1-8B-Instruct |
| Llama 3.1 70B | 70B | Excellent | Medium | llama3.1:70b | meta-llama/Meta-Llama-3.1-70B-Instruct |
| Mistral 7B | 7B | Good, very fast | Very fast | mistral:7b | mistralai/Mistral-7B-Instruct-v0.3 |
| Qwen 2.5 7B | 7B | Strong multilingual | Fast | qwen2.5:7b | Qwen/Qwen2.5-7B-Instruct |
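Since every backend here speaks the OpenAI chat format, a model switch can be pure configuration. One way to wire that up (the registry below is an illustrative sketch, not an official API of either provider):

```python
import os

# Illustrative registry mapping each provider to its endpoint and model names.
# Ollama's OpenAI-compatible endpoint ignores the API key, so any value works.
PROVIDERS = {
    "ollama":   {"base_url": "http://localhost:11434/v1", "api_key": "ollama"},
    "together": {"base_url": "https://api.together.xyz/v1",
                 "api_key": os.environ.get("TOGETHER_API_KEY", "")},
}
MODELS = {
    ("ollama", "hermes3-8b"):   "hermes3:8b",
    ("together", "hermes3-8b"): "NousResearch/Hermes-3-Llama-3.1-8B",
    ("ollama", "qwen2.5-7b"):   "qwen2.5:7b",
    ("together", "qwen2.5-7b"): "Qwen/Qwen2.5-7B-Instruct",
}

def resolve(provider: str, alias: str):
    """Return (base_url, api_key, model) for an OpenAI-compatible client."""
    cfg = PROVIDERS[provider]
    return cfg["base_url"], cfg["api_key"], MODELS[(provider, alias)]
```

The returned tuple plugs straight into an OpenAI-compatible client constructor, so the agent code itself never hardcodes a provider.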
Recommendation
- Low volume (< 100 DMs/day): Hermes 3 8B on Ollama (free, runs on a decent GPU)
- Medium volume (100-1000 DMs/day): Hermes 3 8B on Together AI (~$0.20 per million tokens)
- High quality needed: Hermes 3 70B on Together AI (~$0.90 per million tokens)
- Multilingual DMs: Qwen 2.5 on Ollama or Together AI
Cost Comparison
| Setup | Monthly Cost (1,000 DMs/day) |
|---|---|
| Ollama on your GPU | $0 (electricity only) |
| Together AI (Hermes 3 8B) | ~$3/month |
| Together AI (Hermes 3 70B) | ~$15/month |
| OpenAI GPT-4o-mini | ~$5/month |
| Anthropic Claude Sonnet | ~$20/month |
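The table's figures follow from straightforward token arithmetic. Assuming roughly 500 tokens per exchange (system prompt plus reply; your real average may differ):

```python
def monthly_cost(dms_per_day: int, tokens_per_dm: int, usd_per_million: float) -> float:
    """Estimated monthly spend for a given DM volume and per-token price."""
    monthly_tokens = dms_per_day * 30 * tokens_per_dm
    return monthly_tokens / 1_000_000 * usd_per_million

# 1,000 DMs/day at ~500 tokens/exchange on Hermes 3 8B (~$0.20/M tokens)
print(monthly_cost(1000, 500, 0.20))  # → 3.0
```

Re-running with the 70B price (~$0.90/M tokens) gives $13.50, in line with the ~$15/month figure above once output-token pricing is included.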
Hermes 3 Tool Use (Function Calling)
Hermes 3 supports tool use natively. The format is similar to OpenAI's function calling:
async function handleDMWithTools(data) {
const { instagram_user_id, message_text } = data;
const history = await getConversation(instagram_user_id);
history.push({ role: 'user', content: message_text });
const tools = [
{
type: 'function',
function: {
name: 'search_products',
description: 'Search the product catalog',
parameters: {
type: 'object',
properties: { query: { type: 'string' } },
required: ['query'],
},
},
},
];
// Together AI supports tool use with Hermes 3
const completion = await together.chat.completions.create({
model: 'NousResearch/Hermes-3-Llama-3.1-8B',
messages: [{ role: 'system', content: SYSTEM_PROMPT }, ...history],
tools: tools,
max_tokens: 300,
});
const message = completion.choices[0].message;
if (message.tool_calls) {
// Execute tool and get result
const call = message.tool_calls[0];
const result = await executeTool(call.function.name, JSON.parse(call.function.arguments));
history.push(message);
history.push({ role: 'tool', tool_call_id: call.id, content: JSON.stringify(result) });
// Get final response with tool result
const final = await together.chat.completions.create({
model: 'NousResearch/Hermes-3-Llama-3.1-8B',
messages: [{ role: 'system', content: SYSTEM_PROMPT }, ...history],
max_tokens: 250,
});
const reply = final.choices[0].message.content;
await sendReply(instagram_user_id, reply);
} else {
await sendReply(instagram_user_id, message.content);
}
}
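`executeTool` is left undefined above. Here is a minimal dispatcher sketch, shown in Python for brevity; the catalog and tool table are hypothetical stand-ins for your real data:

```python
# Hypothetical product data standing in for a real catalog lookup
CATALOG = [
    {"name": "Classic Tee", "price": 25},
    {"name": "Zip Hoodie", "price": 60},
]

def search_products(query: str) -> list:
    """Case-insensitive substring match over the catalog."""
    q = query.lower()
    return [p for p in CATALOG if q in p["name"].lower()]

def execute_tool(name: str, arguments: dict):
    """Route a model-issued tool call to the matching Python function."""
    tools = {"search_products": lambda args: search_products(args["query"])}
    if name not in tools:
        return {"error": f"unknown tool: {name}"}
    return tools[name](arguments)
```

Serialize the return value (`json.dumps` / `JSON.stringify`) before appending it as the `role: 'tool'` message, as the Node example does.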
GPU Requirements for Self-Hosting
| Model | VRAM Needed | Example GPU |
|---|---|---|
| Hermes 3 8B (Q4) | ~5 GB | RTX 3060, RTX 4060 |
| Hermes 3 8B (FP16) | ~16 GB | RTX 4090, A10 |
| Hermes 3 70B (Q4) | ~40 GB | 2x RTX 4090, A100 |
For most Instagram DM use cases, the 8B model with Q4 quantization is more than sufficient and runs on consumer GPUs.
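The VRAM figures follow roughly from weight memory alone: parameter count times bytes per weight, with the KV cache and runtime overhead adding another gigabyte or more on top (a rule of thumb, not a precise sizing):

```python
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    """GB needed for model weights alone; KV cache and runtime add more on top."""
    return params_billion * bits_per_weight / 8

print(weight_vram_gb(8, 4))    # → 4.0   (~5 GB in practice with overhead)
print(weight_vram_gb(8, 16))   # → 16.0
print(weight_vram_gb(70, 4))   # → 35.0  (~40 GB in practice)
```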
Troubleshooting
| Issue | Fix |
|---|---|
| Ollama not responding | Check if Ollama is running: ollama list |
| Slow responses | Use the 8B model instead of 70B, or switch to Together AI |
| Poor response quality | Try Hermes 3 70B or switch to a larger model |
| Out of VRAM | Use a smaller quantization (Q4 instead of FP16) or switch to Together AI |
| Model not following instructions | Hermes 3 follows system prompts well - make sure your prompt is clear and specific |
What's Next
- Start with Ollama locally to test for free
- Move to Together AI when you need reliability and scale
- Add Redis for conversation persistence
- Connect to your CRM with HubSpot or Google Sheets
- Set up Slack notifications for human handoff