Couple of things. First, there is no way to prove they are using "pirated content." Web scrapers crawl the internet and collect everything: discussions, articles, blog posts, and video transcripts of people talking about copyrighted works. Because of this, the AI can give you a reasonable analysis of a book without ever having ingested the book itself.
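To make that concrete, here's a toy sketch of the kind of scraper that builds these corpora. This is purely illustrative, not anything a lab actually runs; it assumes the `requests` and `beautifulsoup4` packages, and the URL is a made-up placeholder:

```python
import requests
from bs4 import BeautifulSoup

def scrape_page_text(url: str) -> str:
    """Fetch a page and return its visible text, stripped of markup."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Drop script/style tags so only human-readable prose remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

# e.g. a blog post or forum thread discussing a copyrighted book
# (hypothetical URL):
print(scrape_page_text("https://example.com/some-book-review")[:500])
```

Run that across a few billion pages of reviews, summaries, and discussion threads and you get plenty of secondhand knowledge about books nobody ever "pirated."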
Everything posted openly online is public; you cannot force someone to pay to read your Reddit post. Second, I bet they have large databases of every book in the public domain, which is a very large corpus of text on its own.
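For what it's worth, pulling a public-domain corpus is trivial. Here's a rough sketch using Project Gutenberg as the example source; the URL pattern is my assumption based on their current plain-text layout, and you'd want to check their mirroring policy before pulling anything at scale:

```python
import requests

# Assumed plain-text URL pattern for Project Gutenberg books.
GUTENBERG_TXT = "https://www.gutenberg.org/cache/epub/{id}/pg{id}.txt"

def fetch_book(book_id: int) -> str | None:
    """Return the plain text of one public-domain book, or None on failure."""
    resp = requests.get(GUTENBERG_TXT.format(id=book_id), timeout=30)
    return resp.text if resp.ok else None

# Pull a few well-known IDs (1342 = Pride and Prejudice, 84 = Frankenstein,
# 2701 = Moby Dick) and count what we collected.
corpus = [t for i in (1342, 84, 2701) if (t := fetch_book(i)) is not None]
print(f"collected {sum(len(t) for t in corpus):,} characters of text")
```

Scale that loop up to the tens of thousands of public-domain titles and you have a serious training corpus before touching anything copyrighted.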
That public-domain corpus alone is probably enough to train their AI. Beyond that, they could presumably just buy books (textbooks, fiction, biographies, etc.) and pump them into the system.
If I were them, personally, I would probably just find large torrents of books, or write an automated script to pull from libgen. But there is no real way of proving how they did it, and no real way of proving what content the AI was actually trained on.