For context: I created a video search engine last year, then shut it down and put the data online. You can read about it here: https://www.bendangelo.me/2024/07/16/failed-attempt-at-creating-a-video-search-engine/

I put that project on hold because of scaling issues, but now I’m back with another idea. I’ve been frustrated with how AI slop is ruining the internet, and recently it’s been hitting YouTube pretty hard with AI videos. I’m brainstorming a tool for people to self-host:

- Self-hosted crawler: Pick which sites/videos to index (blogs, forums, YT channels, etc.).
- AI chat interface: Ask questions like “Show me Rust tutorials from 2023” or “Summarize recent posts about homelab backups.”
- Optional sharing: Pool indexes with trusted friends/communities.
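
To make that concrete, here’s a rough sketch of how the three pieces might fit together. Everything here (names, structure) is hypothetical, just to illustrate the flow:

```python
# Hypothetical sketch of the crawl -> index -> ask flow; none of these
# names are a real API, and the "index" is a plain in-memory list.

SOURCES = [
    {"kind": "blog", "url": "https://example-blog.dev"},
    {"kind": "forum", "url": "https://forum.example.org"},
    {"kind": "youtube", "channel": "ExampleChannel"},
]

# Stand-in for whatever the crawler would actually fetch and store.
index = [
    {"title": "Rust ownership tutorial (2023)", "text": "borrowing lifetimes rust"},
    {"title": "Homelab backup roundup", "text": "recent posts about restic and borg"},
]

def ask(question: str) -> list[dict]:
    """Naive keyword match standing in for the AI chat layer."""
    terms = question.lower().split()
    return [d for d in index
            if any(t in (d["title"] + " " + d["text"]).lower() for t in terms)]

for hit in ask("rust tutorials"):
    print(hit["title"])
```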

Why?

- No Google/YouTube spam: only content you choose.
- Works offline (archive forums, videos, docs).
- Local AI (Mistral) or cloud (paid) for smarter searches.

Would this be useful to you? What sites would you crawl? Any killer features I’m missing?

Prototype in progress—just testing interest!

  • CameronDev@programming.dev · 6 days ago

    I personally have zero interest in AI search, if you mean LLMs. The fact that it can make stuff up also means it can miss stuff. Neither is acceptable for a search engine.

    If you mean some kind of deterministic algorithm for indexing and searching, then maybe.

    Also, attempting to crawl sites locally sounds like a great way to get banned from those sites for looking like a bot.

    • T156@lemmy.world · 6 days ago

      I can’t imagine self-hosting an LLM-based search engine would be too viable. The hardware demands, even for a relatively small quantised model, are considerable. Doubly so if you don’t have a GPU to accelerate with.
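
      For a sense of scale: the usual route is llama.cpp with a 4-bit quantised model, which keeps a 7B model around 4-5 GB of RAM but is noticeably slow on CPU. A sketch, assuming llama-cpp-python and a GGUF file you’ve already downloaded (the filename is just an example):

      ```python
      # Sketch: run a small quantised model on CPU via llama-cpp-python.
      # Assumes a ~4 GB Q4 GGUF on disk; expect a few tokens/sec without a GPU.
      from llama_cpp import Llama

      llm = Llama(model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf", n_ctx=2048)
      out = llm("Summarize: restic vs borg for homelab backups.", max_tokens=128)
      print(out["choices"][0]["text"])
      ```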

      • CameronDev@programming.dev · 6 days ago

        Yeah, absolutely. And running a GPU 24/7 to occasionally search is just a waste of power. I’m not convinced that Google’s and Bing’s AI search makes financial sense either; Google dropped live search (where the results updated in real time as you typed) because it was too expensive, so how does LLM search end up cheaper than live search?!

        Edit: This is the live search thing: https://searchengineland.com/test-google-updating-search-results-as-you-type-49116 ~~Annoyingly hard to find, and I can’t find the articles on its cancellation, but from memory it was related to expense.~~

        Edit2: It was Google Instant Search, and its death was blamed on mobile and on wanting to unify the mobile/desktop experience. I do vaguely remember expense being an unofficial/rumored reason, but I can’t back that up.

        • Jakeroxs@sh.itjust.works · 5 days ago

          You realize the GPU sits idle when not actively being used, right?

          It’d be cheaper if you host it locally: essentially just your normal electricity bill, which is the entire point of what OP is saying lol.

          • CameronDev@programming.dev · 5 days ago

            Idle is low power, not zero power. And it won’t be idle when it’s scraping and parsing the sites, so depending on how much scraping it’s doing, it could be significant non-idle energy usage.

            • Jakeroxs@sh.itjust.works · 5 days ago

              The GPU is already running because it’s in the device. By this logic, I shouldn’t have a GPU in my homelab until I want to use it for something; RIP Jellyfin and Immich, I guess.

              I get the impression you don’t really understand how local LLMs work. You likely wouldn’t need a very large model to run basic scraping; it would really just depend on what OP has in mind, or what kind of schedule it runs on. You should also consider the difference between a megacorp’s server farm and some rando using this locally on consumer hardware (which seems to be OP’s intent).

              • CameronDev@programming.dev · 5 days ago

                I didn’t say you can’t have a GPU, but to me it’s wasteful. I keep my Jellyfin server off when not in use, and use WoL to start it when it’s needed.

                I have played with local LLMs, and the models I used were unimpressive, but without knowing what the OP has in mind, we can’t know how much power it will use. If it just spins up the GPU once a day for 20 minutes, probably okay; you won’t even notice it. But anyone like me who doesn’t already have a GPU in their lab will probably notice it quite clearly on their power bill.

                A megacorp’s server farm is huge, but it’s also amortised over millions of users; they probably don’t need a 1:1 ratio of GPUs to customers, so the efficiency isn’t necessarily bad. (Although at the moment, given megacorps are tripping over themselves to throw compute at LLM training, this may not be true.)

  • Zwuzelmaus@feddit.org · 6 days ago

    No. Never would I self-host a search engine.

    The crawler would eat up far more resources than I am ever willing to spend.

    • wise_pancake@lemmy.ca · 6 days ago

      Potentially that would be a good application of federation and distributed computing.

      An Internet Archive-like distributed tool that then feeds into local tokenization and indexing.

      Alternatively, a centralized service that generates indices which are then queried locally would save a lot of energy.
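
      A minimal sketch of that centralized-build/local-query split, assuming the service just publishes a prebuilt SQLite full-text index as a file (the URL and the docs table are made up):

      ```python
      # Sketch: download a centrally built index once a day, query it locally.
      # The URL and the FTS5 "docs" table schema are assumptions for illustration.
      import sqlite3
      import urllib.request

      urllib.request.urlretrieve("https://index.example.org/daily.sqlite", "daily.sqlite")

      con = sqlite3.connect("daily.sqlite")
      rows = con.execute(
          "SELECT title, url FROM docs WHERE docs MATCH ? LIMIT 10",
          ("homelab backups",),
      ).fetchall()
      for title, url in rows:
          print(title, url)
      ```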

  • rtxn@lemmy.world · 6 days ago

    No. I’m so bloody fed up with AI “search” solutions that return everything on the fucking planet except what I want. Text search has been a solved problem for a decade. All I want out of a search engine is for it to be deterministic, stable, and reliable, and to look in titles, descriptions, and keywords. Vibe processing is completely unnecessary and will only create issues.

    If you really want to iNnoVAte, then consider creating an index with transcripts and summaries that users can search by keywords.
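
    That deterministic version is cheap to build, for what it’s worth: SQLite’s FTS5 already covers it. A sketch (the table layout is my own invention):

    ```python
    # Deterministic keyword search over transcripts/summaries using SQLite FTS5.
    # No model involved: the same query returns the same results every time.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE VIRTUAL TABLE docs USING fts5(title, transcript, summary)")
    con.executemany(
        "INSERT INTO docs VALUES (?, ?, ?)",
        [
            ("Rust ownership explained", "today we cover borrowing and lifetimes", "Rust tutorial"),
            ("Homelab backup strategies", "comparing restic and borg snapshots", "backup overview"),
        ],
    )
    for (title,) in con.execute("SELECT title FROM docs WHERE docs MATCH 'restic OR borg'"):
        print(title)
    ```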

  • DarkSpectrum@lemmy.world · 5 days ago

    AI uses far more resources than standard search engines, and it comes at a time when the whole planet needs to slow down climate change.

  • A_norny_mousse@feddit.org · 6 days ago

    No.

    AI search offers me nothing that “normal” search doesn’t also offer.

    But it uses a thousand times more resources.

    10 years ago people were shocked by the size of Google’s server halls. Now imagine the increase in size/numbers through AI.

    Fuck this shit. The internet isn’t what’s driving the climate catastrophe; it’s how people use it.

  • Harlehatschi@lemmy.ml · 6 days ago

    Why would I need AI for that? We should really stop trying to slap AI on everything. Also no, I’m not that big of a fan of wasting energy on web crawlers.

  • solrize@lemmy.world · 6 days ago

    AI has become an abbreviation for “bad” and I wouldn’t want that, but yes, I’ve been interested for a while in building language models into search engines, to give queries more reach into the document semantics. Unfortunately, naive approaches like looking for matching vector embeddings instead of (or alongside) search terms seem near useless, and just clutter up the results.
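
    For what it’s worth, the usual mitigation is to blend the embedding score with a keyword score instead of ranking on embeddings alone. A sketch, assuming sentence-transformers for the vectors (the 50/50 weighting is arbitrary):

    ```python
    # Hybrid scoring sketch: cosine similarity blended with crude term overlap.
    # Assumes sentence-transformers is installed; alpha is an arbitrary knob.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    docs = ["intro to rust ownership", "homelab backups with restic", "sourdough starter tips"]
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = model.encode(docs, normalize_embeddings=True)

    def search(query: str, alpha: float = 0.5):
        q_vec = model.encode([query], normalize_embeddings=True)[0]
        q_terms = set(query.lower().split())
        for doc, vec in zip(docs, doc_vecs):
            semantic = float(np.dot(q_vec, vec))  # cosine, since vectors are normalized
            keyword = len(q_terms & set(doc.split())) / len(q_terms)
            yield alpha * semantic + (1 - alpha) * keyword, doc

    for score, doc in sorted(search("rust tutorial"), reverse=True):
        print(f"{score:.2f}  {doc}")
    ```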

    I’d be interested in knowing what approaches you’re using. FOSS, I hope.

  • Fedditor385@lemmy.world · 5 days ago

    People will pay for solutions to their problems, and what most people and companies don’t seem to want to hear is that AI being in everything is the problem.

    The next BIG THING will have a single marketing label: no AI inside.

    Actually, I need to update my GitHub repos with “No AI inside” labels, stickers, etc. Might bring in more visibility.

  • Jakeroxs@sh.itjust.works · 5 days ago

    Seems nifty. Bake in stuff like selecting your AI provider (support local Llama, a local OpenAI-compatible API, and third-party providers if you have to, I guess lol), and make sure it’s dockerized (or at least relatively easy to dockerize; bonus points for including a compose file).

    Oh, being able to hook into a self-hosted engine like SearXNG would be nice too; you can currently do that with the Oobabooga web search plugin, as an example.
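
    For reference, SearXNG can expose a JSON API if the instance enables the json format in its settings, so the hook can be tiny (instance URL is assumed):

    ```python
    # Query a self-hosted SearXNG instance; assumes "json" is in the instance's
    # enabled formats (settings.yml) and it's listening on localhost:8080.
    import requests

    resp = requests.get(
        "http://localhost:8080/search",
        params={"q": "rust tutorials 2023", "format": "json"},
        timeout=10,
    )
    for result in resp.json().get("results", [])[:5]:
        print(result["title"], result["url"])
    ```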

  • SoftestSapphic@lemmy.world · 5 days ago

    Web scrapers are all that’s needed.

    AI is worthless except for the few uses it has combing through medical data.

    AI should never be used to try to influence people.

  • Avid Amoeba@lemmy.ca · 6 days ago

    I think the really useful idea here is solving the scaling issue by limiting the source sites to a known good set. 95% of the time I am not looking for results from unknown sites. In fact I actively work to get information from the sites I trust.

    • lautan@lemmy.ca (OP) · 4 days ago

      Yeah, that’s the idea. Let people build their own lists and share them.

  • Ptsf@lemmy.world · 6 days ago

    Indexing websites adds significant traffic to those sites. It’s not good for the health of the internet for everyone to be indexing; maybe you should look for a precompiled index you can train the LLM on and distribute it daily. Or do the crawling yourself and distribute that index.

  • Sims@lemmy.ml · 6 days ago

    Absolutely. I’m not as AI-phobic, and I absolutely need both an information-gathering agent that searches, follows RSS feeds/channels, explores/researches topics, and summarizes swaths of daily events, and a filter between me and the corporate internet. Let the AI discard obvious slop, ads/other propaganda, and general informational noise.

    However, you should think about how to share search results so all our agents don’t flood small services/commoners. We really need to get more information out of corporate silos and into public search/knowledge systems. Perhaps you can integrate some distributed search/knowledge properties? (Now that you have AI to help build it.)

  • EarMaster@lemmy.world · 6 days ago

    While almost everyone here seems to hate AI (maybe for the wrong reasons, but who am I to judge), I like having AI, as it is able to provide answers a simple search engine cannot.

    What I don’t see is hosting something like this myself. Managing sources and indexing them would take too much energy: mine, my server’s, and that of the web servers being indexed (maybe I am wrong).

    There are already good solutions (OpenWebUI with Ollama) that can be tweaked to almost do what you’re describing, and the AI models get better every month, so I don’t think a custom AI search engine could keep up.
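
    For anyone wanting to try that route, Ollama’s local HTTP API is the piece you’d script against. A minimal example, assuming Ollama is running and a mistral model has been pulled:

    ```python
    # Minimal call against a local Ollama server (default port 11434).
    # Assumes `ollama pull mistral` has been run beforehand.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral",
              "prompt": "Summarize recent posts about homelab backups.",
              "stream": False},
        timeout=120,
    )
    print(resp.json()["response"])
    ```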