I mean it in the sense that I can upload a low-quality phone photo of a page from a Chinese cookbook and it will OCR it, translate it into English, and give me a summary of the ingredients.

I’ve been looking into vision models, but they seem daunting to set up, and the specs list things like a 384x384 image resolution, so it doesn’t seem like they would be able to do what I’m looking for. Am I even searching in the right direction?

  • afansfw@lemmynsfw.com (OP) · 9 days ago

    Looks like what I’m looking for, and llama.cpp has added support this year, so it should be easy to try. Thank you!