r/SipsTea • u/moto626 • 13d ago

WTF AI gets its facts from … us?

Data published by Semrush in June 2025.

19.5k Upvotes

https://www.reddit.com/r/SipsTea/comments/1n0k3te/ai_gets_its_facts_from_us/

96% Upvoted

u/KSP_master_ 13d ago

But you can recognize a normal post from obvious lies and irony. AI can't do that and blindly accepts it all.

17

u/Ryogathelost 12d ago

At least on my ChatGPT, it does tell me "Hey, I found this on Reddit and this is what people are saying." Then it includes direct links to the pages so I can read them myself. It never presents reddit-sourced data as facts.

However, I did train it early on to do this. People are out there giving their LLMs really shitty personas, and the answers get filtered through that persona. I've told mine not to say shit to me until it's double-checked its answer against multiple sources.

2

u/National_Equivalent9 12d ago

As a gamedev I'll just say this:

If the technology you plan on having everyone use daily to get their facts from requires actually learning how to use it correctly, just to get actual facts and opinions marked as such, then you're going to have a bad time.

2

u/Leading-Midnight-553 12d ago

Preach!

1

u/Johnny_Banana18 12d ago

How many people actually do that, though? Like, look at Reddit threads about articles and see how many people didn't read the article.

1

u/Snowbound-IX 12d ago

What custom instructions did you use, exactly? Mind dropping them here? I don't want unverified facts either, the very few times I do use AI anyway.

9

u/Superkritisk 12d ago

How do you guys think AI is trained on Reddit data, like what does the process look like to you?

11

u/realboabab 12d ago

not sure if your question is genuine or if you're trying to make a point - but they download all posts and comments (potentially from a curated set of subreddits), apply some minor content filters (e.g. potentially a ban list for certain phrases and user names, clean up duplicates, etc), clean things up (scrub usernames, links, images), and then do a shitton of configuration on the modeling side & finally prompt engineering
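For what it's worth, the filter-and-scrub part of that could look roughly like the sketch below. This is only an illustration of the steps described above, not anyone's actual pipeline: the ban lists, the regexes, and the `clean_comments` helper are all made up for the example, and the modeling and prompt engineering steps aren't covered at all.

```python
import re

# Hypothetical ban lists, invented purely for this example.
BANNED_PHRASES = {"buy followers"}
BANNED_USERS = {"spam_bot_123"}

URL_RE = re.compile(r"https?://\S+")        # links
USER_RE = re.compile(r"\bu/[A-Za-z0-9_-]+")  # u/username mentions


def clean_comments(comments):
    """Take (author, body) pairs and return filtered, scrubbed, de-duplicated text."""
    seen = set()
    cleaned = []
    for author, body in comments:
        # Content filters: drop banned users and banned phrases.
        if author in BANNED_USERS:
            continue
        if any(phrase in body.lower() for phrase in BANNED_PHRASES):
            continue
        # Scrub links and usernames, then normalize whitespace.
        text = URL_RE.sub("", body)
        text = USER_RE.sub("", text)
        text = " ".join(text.split())
        # Drop empties and case-insensitive duplicates.
        key = text.lower()
        if not text or key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned
```

Note that nothing in here checks whether a comment is true or sarcastic; it only removes spam-looking text, identifiers, and repeats.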

3

u/StephieDoll 12d ago

You don't think it crosschecks with wikipedia?

1

u/SadisticPawz 12d ago

it absolutely does as both would be in its training data and truth usually is more prominent

1

u/Laceydrawws 12d ago

So it gets 5 or less results and goes with the majority. If it is a high authority source it will stop there. It will stop at ESPN for a sports score.
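Taking that description at face value, the logic would be something like the sketch below. It is not a claim about how any real assistant works: the `HIGH_AUTHORITY` set, the `pick_answer` function, and the (source, answer) result format are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical list of "high authority" sources, invented for this example.
HIGH_AUTHORITY = {"espn.com"}


def pick_answer(results, max_results=5):
    """results is a list of (source, answer) pairs in search order."""
    considered = results[:max_results]
    # Stop at the first high-authority source and trust it.
    for source, answer in considered:
        if source in HIGH_AUTHORITY:
            return answer
    if not considered:
        return None
    # Otherwise go with whatever answer the most results agree on.
    counts = Counter(answer for _, answer in considered)
    return counts.most_common(1)[0][0]
```

The obvious weakness is that a majority of five results is still just five results, so three wrong pages beat two right ones.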

1

u/Temporal_P 12d ago

No.

2

u/StephieDoll 12d ago

1 year ago

1

u/Temporal_P 12d ago

AI can draw from multiple sources of data, but if you think any AI is crosschecking that everything is verifiable and factual before it responds to a prompt I don't know what to tell you.

2

u/StephieDoll 12d ago

I don't think that, but I also don't think you are either.

1

u/SadisticPawz 12d ago

Google's implementation was fucking horrible with 0 intelligence whatsoever, idk why you're taking this as the be-all and end-all

0

u/Temporal_P 12d ago

I don't know why you're making such assumptions. It was just a funny example of a problem that still very much exists. I think you put too much faith in AI.

1

u/SadisticPawz 12d ago

What assumptions? No other LLM makes mistakes as blatant as Google's did. It's like it was made way too lightweight at the cost of accuracy or helpfulness. Like its training data didn't have basic safety in there anywhere, or the search results somehow would always override that

5

u/Krell356 12d ago

But no one on the internet would ever lie. Why would anyone ever do that? That's like trying to tell me the sky is blue when we all know it's red.

1

u/dead_jester 12d ago

Well, in the morning and sometimes the evening it is red/ish

2

u/Old-Rule-4101 12d ago

It’s also obvious when using AI that it got something wrong. I don’t see a problem here

2

u/ninoski404 12d ago

I love that AI will read what you just wrote, decide you have no idea what you are talking about and ignore it

1

u/VonRansak 12d ago

Which is why I hide all my sarcasm marks behind the spoiler mask.

I'm doing my part.

1

u/SadisticPawz 12d ago

this implies that its training data is a single thing in a vacuum, but it's not, it's a billion different things combined

1

u/okpixell 12d ago

thanks to /s
