
https://doctly.ai

We're building Doctly.ai - PDF Extraction with AI.

We started out with document conversion to Markdown but quickly realized that most use cases call for JSON conversion. We recently launched our "Extractor Studio," where you can have AI analyze a few sample variations of your documents, generate a schema for you, and publish it to an API endpoint.

We've built a technique on top of AI models that dramatically improves run-to-run consistency of JSON output.
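For a concrete flavor of the idea, here is a minimal sketch of schema-constrained extraction with a consistency check. This is not Doctly's actual technique; the `Invoice` schema and the `call_model` helper are illustrative placeholders.

```
# Hypothetical sketch: enforce a fixed schema on model output and retry
# until two consecutive runs agree, one way to improve run-to-run consistency.
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):                     # stand-in for a generated schema
    invoice_number: str
    total_amount: float
    line_items: list[dict]

def extract(document_text: str, call_model, max_tries: int = 4) -> Invoice | None:
    previous = None
    for _ in range(max_tries):
        raw = call_model(document_text)       # placeholder LLM call returning a JSON string
        try:
            parsed = Invoice.model_validate_json(raw)
        except ValidationError:
            continue                          # schema violation: try again
        if previous is not None and parsed == previous:
            return parsed                     # two consecutive runs agree
        previous = parsed
    return previous
```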

Check out the blog post here: https://medium.com/@abasiri/introducing-doctlys-extractor-st...


Check it out at https://doctly.ai


This is approximately the approach we're also taking at https://doctly.ai. Add to that a "multiple experts" approach for analyzing the image (for our 'ultra' version), and we get really good results. And we're constantly making it better.


To be fair, they didn't include themselves at all in the graph.


They did. It’s in the #1 spot

Update: looks like they removed themselves from the graph since I saw it earlier today!


Yup, they did.

The beauty of version control: https://github.com/getomni-ai/benchmark/commit/0544e2a439423...


If you're limited to open-source models, that's very true. But for larger models, and depending on your document needs, we're definitely seeing very high accuracy (95%-99%) for direct-to-JSON extraction (no intermediate Markdown step) with our solution at https://doctly.ai.


In addition, Gemini 2.5 Pro does really well with bounding boxes, but yeah, not open source :(


Great list! I'll definitely run your benchmark against Doctly.ai (our PDF-to-Markdown service), especially as we publish our workflow service, to see how we stack up.

One thing I've noticed in many benchmarks, though, is the potential for bias. I'm actually working on a post about this issue, so it's top of mind for me. For example, in the Omni benchmark, the ground truth expected a specific order for header information, like the logo, phone number, and customer details. While this data was all located near the top of the document, the exact ordering felt subjective. Should the model prioritize horizontal or vertical scanning? Since the ground truth was created by the company running the benchmark, their model naturally scored highest for maintaining the same order as the ground truth.

However, this approach penalized other LLMs for not adhering to the "correct" order, even though the order itself was arguably arbitrary. This kind of bias can skew results and make it harder to evaluate models fairly. I’d love to see benchmarks that account for subjectivity or allow for multiple valid interpretations of document structure.

Did you run into this when looking at the benchmarks?

On a side note, Doctly.ai leverages multiple LLMs to evaluate documents, and runs a tournament with a judge for each page to get the best data (this is only on the Precision Ultra selection).


Hey, I wrote the Omni benchmark. I think you might be misreading the methodology on our side. Order on the page does not matter in our accuracy scoring. In fact, we only score on JSON extraction as a measurement of accuracy, which is order independent.

We chose this method for all the same reasons you highlight. Text-similarity-based measurements are very subject to bias and don't correlate super well with accuracy. I covered the same concepts in the "The case against text-similarity"[1] section of our writeup.

[1] https://getomni.ai/ocr-benchmark


I'll dig deeper into your code, but scanning your post does look like your are addressing this. That's great.

If I do find anything, I'll share with you for comments before I publish the post.


Bias wrt ordering is a great point. What we consider structured information in this benchmark should be directly comparable irrespective of its presentation (order, format, etc.), so the benchmark takes that into account.

For example, if you are only converting, say, an invoice into Markdown, you can introduce bias wrt ordering, etc. But if the task is to extract the invoice number, the total amount, and the number of line items with headers like price, amount, and description, then you can compare two outputs without a lot of bias. E.g., even if columns are interchanged, you will still get the same metric.


Exactly. You still have to be explicit in order to remove bias, either by sorting the keys or by looking up specific keys. For arrays, I would say order still matters. For example, when you capture a list of invoice items, you should maintain order.
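To make that concrete, here is a hypothetical comparison helper (not taken from any of the benchmarks discussed here) that ignores object key order but keeps array order significant:

```
# Hypothetical sketch: compare extracted JSON order-independently for object
# keys, but order-sensitively for arrays (e.g. invoice line items).
def json_equal(a, b) -> bool:
    if isinstance(a, dict) and isinstance(b, dict):
        # key order is irrelevant: compare key sets, then each value
        return a.keys() == b.keys() and all(json_equal(a[k], b[k]) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        # array order matters: line items must appear in the same sequence
        return len(a) == len(b) and all(json_equal(x, y) for x, y in zip(a, b))
    return a == b

# Same fields in a different key order still compare equal:
assert json_equal({"invoice_number": "42", "total": 10.0},
                  {"total": 10.0, "invoice_number": "42"})
```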


Looks to be API only for now. Documentation here: https://docs.mistral.ai/capabilities/document/


From my testing so far, it seems it's super fast and responds synchronously. But it decided that the entire page is an image and returned `![img-0.jpeg](img-0.jpeg)`, with coordinates in the metadata for the image, which is the entire page.

Our tool, doctly.ai, is much slower and async, but much more accurate, and gets you the content itself as Markdown.


I thought we stopped -ly company names ~8 years ago?


Haha for sure. Naming isn't just the hardest problem in computer science, it's always hard. But at some point you just have to pick something and move forward.


If you talk to people Gen X and older, you still need .com domains,

for all those people who aren't just clicking on a link in their social media feed, chat group, or targeted ad.


But doctr.ai was taken.


Co-founder of doctly.ai here (OCR tool)

I love mistral and what they do. I got really excited about this, but a little disappointed after my first few tests.

I tried a complex table that we use as a first test of any new model, and Mistral OCR decided the entire table should just be extracted as an 'image' and returned this markdown:

``` ![img-0.jpeg](img-0.jpeg) ```

I'll keep testing, but so far, very disappointing :(

This document is the entire reason we created Doctly to begin with. We needed an OCR tool for the regulatory documents we use, and nothing could really give us the right data.

Doctly uses a judge: it OCRs a document with multiple LLMs and decides which output to pick, and it will keep re-running the page until the judge's score exceeds a certain threshold.
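Very roughly, that loop might look like the sketch below; `ocr_with`, `judge_score`, and the model list are placeholders, not Doctly's actual code.

```
# Hypothetical sketch of a judge loop: generate candidates from several
# models, have a judge score each page, and stop once a score clears the bar.
def best_page(page_image, models, ocr_with, judge_score,
              threshold=0.9, max_rounds=3):
    best_text, best_score = "", float("-inf")
    for _ in range(max_rounds):
        for model in models:
            candidate = ocr_with(model, page_image)      # one OCR generation
            score = judge_score(page_image, candidate)   # judge rates fidelity
            if score > best_score:
                best_text, best_score = candidate, score
        if best_score >= threshold:                      # good enough, stop
            break
    return best_text
```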

I would have loved to add this into the judge list, but might have to skip it.


Where did you test it? At the end of the post they say:

> Mistral OCR capabilities are free to try on le Chat

but when asked, Le Chat responds:

> can you do ocr?

> I don't have the capability to perform Optical Character Recognition (OCR) directly. However, if you have an image with text that you need to extract, you can describe the text or provide details, and I can help you with any information or analysis related to that text. If you need OCR functionality, you might need to use a specialized tool or service designed for that purpose.

Edit: Tried anyway by attaching an image; it said it could do OCR and then output... completely random text that had absolutely nothing to do with the text in the image!... Concerning.

Tried again with a higher-definition image; it output only the first twenty words or so of the page.

Did you try using the API?


Yes I used the API. They have examples here:

https://docs.mistral.ai/capabilities/document/

I used a base64 encoding of an image of the PDF page. The output was an object that has the Markdown, plus coordinates for the images:

[OCRPageObject(index=0, markdown='![img-0.jpeg](img-0.jpeg)', images=[OCRImageObject(id='img-0.jpeg', top_left_x=140, top_left_y=65, bottom_right_x=2136, bottom_right_y=1635, image_base64=None)], dimensions=OCRPageDimensions(dpi=200, height=1778, width=2300))] model='mistral-ocr-2503-completion' usage_info=OCRUsageInfo(pages_processed=1, doc_size_bytes=634209)
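For reference, the call I made looked roughly like the snippet below, adapted from the examples in the linked docs; treat the exact client method and model name as assumptions and check the current documentation.

```
# Rough reconstruction of the call, based on the examples in the Mistral docs
# (method and model names may have changed; verify against current docs).
import base64
from mistralai import Mistral

client = Mistral(api_key="YOUR_API_KEY")
with open("page.jpeg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.ocr.process(
    model="mistral-ocr-latest",   # assumed alias; my response reported mistral-ocr-2503
    document={"type": "image_url",
              "image_url": f"data:image/jpeg;base64,{b64}"},
)
print(response.pages[0].markdown)  # returned '![img-0.jpeg](img-0.jpeg)' for my page
```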


Any luck with this? I'm trying to process photos of paperwork (.pdf, .png) and got the same results as you.

Feels like something is missing in the docs, or the API itself.

https://imgur.com/a/1J9bkml


Interestingly, I'm currently going through and scanning the hundreds of journal papers my grandfather authored in medicine, and thinking through what to do about graphs. I was expecting to do some form of multiphase, agent-based generation of LaTeX or SVG rather than a verbal summary of the graphs; at least in his generation of authorship, his papers already explained the graphs clearly. I was pretty excited to see your post, naturally, but when I looked at the examples, what I saw was, effectively, a more verbose form of

``` ![img-0.jpeg](img-0.jpeg) ```

I'm assuming this is partially because your use case is targeting RAG under various assumptions, but also partially because multimodal models aren't near what I would need to be successful with?


We need to update the examples on the front page. Currently, for things that are considered charts/graphs/figures, we convert them to a description. For things like logos or images, we do an image tag. You can also choose to exclude them.

The difference here is that it took the entire page as an image tag (it's just a table of text in my document), rather than being more selective.

I do like that they give you coordinates for the images, though; we need to do something like that.

Give the actual tool a try. Would love to get your feedback for that use case. It gives you 100 free credits initially, but if you email me (ali@doctly.ai) I can give you an extra 500 (goes for anyone else here, too).


If you have a judge system, and Mistral performs well on other tests, wouldn't you want to include it so that if it scores highest in your judge's ranking, it would select the most accurate result? Or are you saying that Mistral's image markdown would score higher on your judge score?


We'll definitely be doing more tests, but the results I got on the complex tests would produce a lower score and might not be worth the extra cost of the judging itself.

In our current setup Gemini wins most often. We enter multiple generations from each model into the 'tournament'; sometimes one generation from Gemini can be at the top while another is at the bottom of the same tournament.


Does doctly do handwritten forms like dates?

I have a lot of "This document filed and registered in the county of ______ on ______ of _____ 2023" sort of thing.


We've been getting great results with those as well. But of course there is always some chance of not getting it perfect, especially with different handwriting.

Give it a try, no credit card needed. If you email me (ali@doctly.ai) I can give you extra free credits for testing.


Just tried it. Got all the dates correct and even extracted signatures really well.

Now to figure out how many millions of pages I have.


How do you stay competitive at $2 per 100 pages when Mistral and others offer roughly 1,000 pages for $1?


Customers are willing to pay for accuracy compared to existing solutions out there. We started out needing an accurate solution for a RAG product we were building, but none of the solutions we tried provided the accuracy we needed.


Why pay more for doctly than an AWS Textract?


I haven't tried doctly, but in my case AWS Textract does not support Russian, so the output is completely useless.


Great question. The language models are definitely beating the old tools. Take a look at Gemini for example.

Doctly runs a tournament-style judge: it runs multiple generations across LLMs and picks the best one, outperforming any single generation from a single model.


Would love to see the test file.


would be glad to see benchmarking results


This is a good idea. We should publish a benchmark results/comparison.


Love the website design

