HTML as an Accessible Format for Papers (2023)
Posted by el3ctron 5 days ago
Comments
Comment by dginev 5 days ago
As a very brief update - we are pending a larger update.
You will spot many (many) issues with our current coverage and fidelity of the paper rendering. When they jump out at you, please report them to us. All reports from the last 2 years have landed on GitHub. We have made a bit of progress since, but there is (a lot) more low-hanging fruit to pick.
Project issues:
https://github.com/arXiv/html_feedback/issues/
The main bottleneck at the moment is developer time. And the main vehicle for improvements on the LaTeX side of things continues to be LaTeXML. Happy to field any questions.
Comment by istillwritecode 4 days ago
Comment by dginev 3 days ago
https://github.com/brucemiller/LaTeXML/issues
It's a pretty deep rabbit hole, but I wholeheartedly agree most standard package support incantations should be easy and few to use.
Comment by RandyOrion 5 days ago
Compared to PDF format, HTML format is much more accessible because of browsers. Basically I can reuse my browser extensions to do anything I like without hassle, like translation, note taking, sending texts to LLMs, and so on.
For now, arXiv offers two HTML services: the default one at https://arxiv.org/html/xxxx.xxxxx , and the alternative one at https://ar5iv.labs.arxiv.org/html/xxxx.xxxxx , where 'x' is a placeholder for a digit.
The most glaring problem with the default HTML service is paper coverage. Sometimes it just doesn't work, e.g., https://arxiv.org/html/2505.06708 . The solution may be to switch to the alternative HTML service, e.g., https://ar5iv.labs.arxiv.org/html/2505.06708 .
Note that the alternative HTML service also has coverage problems. Sometimes both HTML services fail, e.g. https://arxiv.org/abs/2511.22625 .
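The two URL patterns mentioned above can be sketched in a few lines of Python. The function name `html_urls` is made up for illustration; only the two URL prefixes come from the comment, and the sketch does not check whether either service actually has the paper.

```python
def html_urls(arxiv_id: str) -> dict[str, str]:
    """Return the candidate HTML URLs for an arXiv identifier such as '2505.06708'."""
    return {
        # Default service hosted on arxiv.org.
        "default": f"https://arxiv.org/html/{arxiv_id}",
        # Alternative service (ar5iv).
        "alternative": f"https://ar5iv.labs.arxiv.org/html/{arxiv_id}",
    }


if __name__ == "__main__":
    for name, url in html_urls("2505.06708").items():
        print(name, url)
```

Since either service may lack a given paper, a caller would presumably try the default URL first and fall back to the alternative.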
Comment by rhubarbtree 4 days ago
Comment by cxr 2 days ago
And to respond to your implied criticism: the stability/reliability/fidelity of PDFs is a myth. It would be hard to say how many dozens of PDFs I've come across in the last two years that don't look the same across devices/viewers (or sometimes just fail to render in their entirety). This played a significant part in a cascade of errors in one incident I know of that resulted in the payout of a claim more than $1,000 but less than $10,000—not to mention a lot of strife and anger for the persons involved over the course of multiple months before resolution.
(As I write this now, I realize I'd almost forgotten about the fact that almost every time I've taken something to FedEx or UPS to be printed at a self-service kiosk, the result has been unusable, so I've had to take it to the clerk to have them print it instead.)
HTML at least has the property that it's still trivial to access and extract the data if you run into either malformed inputs or ones that are valid but incompatible/unsupported by whatever viewer (browser) you happen to be using, which is a lot more than you can say for more opaque formats like Java, PDF, and Flash.
Comment by ComputerGuru 5 days ago
I’ve actually dug into this in the past and it was never lack of technical ability that prevented them from even adding just proper superscript/subscript support before, but rather their opinion that this didn’t belong in the symbolic layer. But since emoji abuse/rely on ZWJ and modifiers left and right to display in one of a myriad of variations, there’s really no good reason not to allow the same, because 2 and the squares symbol are not semantically the same (so it’s not a design choice).
An interesting (complete) tangent is that Gemini 3 Pro is the only model I’ve tested (I do a lot of math-related stuff with LLMs) that absolutely will not under any circumstances respect (system/user) prompt requests to avoid inline math mode (aka LaTeX) in the output, regardless of whether I asked for a blanket ban on TeX/MathJax/etc or when I insisted that it use extended Unicode code points to substitute all math formula rendering (I primarily use LLMs via the TUI where I don’t have MathJax support, and as familiar as I once was with raw TeX mathematical notations and symbols, it’s still quite easy to confuse unrendered raw output by missing something if you’re not careful). I shared my experiment and results here – Gemini 3 Pro would insist on even rendering single letter constants or variables as $k$ instead of just k (or k in markdown italics, etc) no matter how hard I asked it not to (which makes me think it may have been overfit against raw LaTeX papers, and is also an interesting argument in favor of the “VL LLMs are the more natural construct”): https://x.com/NeoSmart/status/1995582721327071367?s=20
Comment by crazygringo 5 days ago
At a fundamental level, Unicode is for characters, not layout. Unicode may abuse the ZWJ for emoji, but it still ultimately results in a single emoji character, not a layout of characters. So I don't really understand what you're asking for.
Comment by bsder 5 days ago
Why not? Things like Arabic ligatures already do that, no?
Comment by bruce343434 4 days ago
Comment by austinjp 5 days ago
Comment by bsder 5 days ago
That's the open source font shaping engine. It does a lot of work to handle font shaping and rendering for languages that can't really be reduced to characters.
Comment by lukan 5 days ago
Comment by raincole 5 days ago
Comment by SOTGO 5 days ago
Comment by hannahnowxyz 5 days ago
Comment by baby 5 days ago
Comment by toastal 4 days ago
Comment by yannis 5 days ago
Comment by franga2000 4 days ago
But authors still refuse. It's not real science if the layout isn't two-column, written in an old serif font, tables and figures float randomly disconnected from their reference points, code isn't syntax highlighted and has completely nonsensical line breaks... If the reader wants to read it on a phone, or needs to change the font to be larger or more legible, they're not a real scientist and don't deserve to read real papers.
Seriously, what the fuck?? Even the economists are laughing at us with their MS Word and third-party cloud-based bibliography plugin subscription.
Comment by gus_massa 4 days ago
In unofficial notes for the classes, most authors use single column, and try to remember the magic spell to keep the figures in place. Something like [H!] ???
Also most books are single column.
Comment by moelf 5 days ago
Comment by ForceBru 5 days ago
EDIT: indeed, it was introduced in 2023: https://blog.arxiv.org/2023/12/21/accessibility-update-arxiv...
Comment by Tagbert 5 days ago
Why "experimental" HTML?
Did you know that 90% of submissions to arXiv are in TeX format, mostly LaTeX? That poses a unique accessibility challenge: to accurately convert from TeX—a very extensible language used in myriad unique ways by authors—to HTML, a language that is much more accessible to screen readers and text-to-speech software, screen magnifiers, and mobile devices. In addition to the technical challenges, the conversion must be both rapid and automated in order to maintain arXiv’s core service of free and fast dissemination.
Comment by ForceBru 5 days ago
Comment by fooofw 5 days ago
> 1. TeX has many advantages that make it ideal as a format for the archives: It is plain text, it is compact, it is freely available for all platforms, it produces extremely high-quality output, and it retains contextual information.
> 2. It is thus more likely to be a good source from which to generate newer formats, e.g., HTML, MathML, various ePub formats, etc. [...]
Not that I disagree with the effort and it surely is a unique challenge to, at scale, convert the Turing complete macro language TeX to something other than PDF. And, at the same time, the task would be monumentally more difficult if only the generated PDFs were available. So both are right at the same time.
Comment by tosti 4 days ago
HTML has better separation of concerns than LaTeX. LaTeX does typesetting a lot better than HTML. HTML layout can differ wildly in the same document. LaTeX documents are easier to lay out in the first place.
...etc...
Comment by daemonologist 5 days ago
Comment by inglor 5 days ago
Comment by DominikPeters 5 days ago
Comment by xworld21 5 days ago
Comment by el3ctron 5 days ago
Comment by lalithaar 5 days ago
Comment by dginev 5 days ago
Comment by ekjhgkejhgk 5 days ago
Comment by mmooss 5 days ago
Is there an epub reader that can format text approximately as usably and beautifully as pdf? What I've seen makes it noticeably harder to read longer texts, though I haven't looked around much.
epub also lacks annotation, or at least annotation that will be readable across platforms and time.
Comment by hombre_fatal 5 days ago
Not really what you want researchers to waste their time doing.
But you can use any of the numerous html->epub packagers yourself.
Comment by pspeter3 5 days ago
Comment by ekjhgkejhgk 5 days ago
Comment by silon42 4 days ago
Comment by constantcrying 4 days ago
What really is needed is a markup language which natively can target both PDF and HTML. This is something typst is working on, but I am not aware of any other project, which either comes close to the features of LaTeX or supports both target formats.
To me this is the only reasonable way to address the accessibility and usability issues around papers. Have one markup, with sufficient accessibility features, which simultaneously targets HTML and PDF.
Comment by _dain_ 5 days ago
Comment by fsh 5 days ago
Comment by cxr 2 days ago
Comment by teddy-smith 4 days ago
Comment by Barbing 5 days ago
Challenging. Good work!
Comment by cubefox 5 days ago
It's not much of an "experiment" if you don't plan to use some experimental data to improve things somehow.
Comment by leobg 5 days ago
Comment by percentcer 5 days ago
Comment by bo1024 5 days ago
Comment by pwdisswordfishy 5 days ago
Comment by gbear605 5 days ago
Comment by fph 4 days ago
Comment by ErroneousBosh 5 days ago
Edit: Genuine question, not rhetorical - I don't know how well it would work but it sounds like it should.
Comment by fooofw 5 days ago
Comment by ErroneousBosh 3 days ago
Comment by fooofw 2 days ago
Comment by zipy124 4 days ago
Comment by cxr 2 days ago
Comment by sundarurfriend 5 days ago
Comment by sega_sai 5 days ago
Comment by sundarurfriend 5 days ago
> View any arXiv article URL [in HTML] by changing the X to a 5
The line
> Sources upto the end of November 2025.
sounds to me like this is indeed intended for older articles.
Comment by dginev 5 days ago
There used to be another showcase, called arxiv-vanity. They captured what happened pretty well with their farewell post on their homepage:
Comment by jas39 5 days ago
Comment by stephenlf 5 days ago
Comment by nateroling 5 days ago
Comment by qart 5 days ago
Comment by s0rce 5 days ago
Comment by nateroling 5 days ago
Comment by DANmode 5 days ago
Truth in general, if we aren't careful.
Comment by doc_ick 5 days ago
Comment by sansseriff 5 days ago
Comment by JadeNB 5 days ago
Well, that's terrifying. I mean, I knew it about undergrads, but I sure hoped people going into grad school would be aware of the dangers of making your main contact with research, where subtle details are important, through a known-distorting filter.
(I mean, I'd still be kinda terrified if you said that grad students first encounter papers through LLMs. But if it is the front end for all knowledge they consume? Absolutely dystopian.)
Comment by sansseriff 5 days ago
In some ways I’m scared too. But that’s the way things are going because younger people far prefer the interface of chat and question answering to flipping through a textbook.
Even if AI makes more mistakes or is more misaligned with the reader’s intentions than a random human reviewer (which is debatable in certain fields since the latest models came out), the behavior of young people requires us to improve the reputability of these systems. (Make sure they use citations, make sure they don’t hallucinate, etc). I think the technology is so much more user friendly that fixing the engineering bugs will be easier than forcing new generations to use the older systems.
Comment by notorandit 4 days ago
LaTeX and TeX are the de facto standard for this context and converting all existing documents is a lot of work and energy to be spent for basically little gain, if any.
Comment by ashleyn 5 days ago
Comment by jrk 5 days ago
Comment by billconan 5 days ago
the actual paper content format should be separated from its rendering.
i.e. it should contain abstract, sections, equations, figures, citations etc. but it shouldn't have font sizes, layout etc.
the viewer platforms then should be able to style the content differently.
Comment by cluckindan 5 days ago
They are converting to HTML to make the content more accessible. Accessibility in this context means a11y, in effect ”more accessible” equates to ”more compatible with screen readers”.
While PDF documents can be made accessible, it is way easier to do it in HTML, where browsers build an actual AOM (accessibility object model) tree and expose it to screen readers.
>it should contain abstract, sections, equations, figures, citations etc.
So <article>, <section>, <math>, <figure>, <cite>, etc.
Comment by o11c 5 days ago
Comment by cluckindan 4 days ago
The <i> HTML element represents a range of text that is set off from the normal text for some reason, such as idiomatic text, technical terms, taxonomical designations, among others. Historically, these have been presented using italicized type, which is the original source of the <i> naming of this element.
The <em> element is for words that have a stressed emphasis compared to surrounding text, which is often limited to a word or words of a sentence and affects the meaning of the sentence itself.
Typically this element is displayed in italic type. However, it should not be used to apply italic styling; use the CSS font-style property for that purpose. Use the <cite> element to mark the title of a work (book, play, song, etc.). Use the <i> element to mark text that is in an alternate tone or mood, which covers many common situations for italics such as scientific names or words in other languages.
Comment by pwdisswordfishy 1 day ago
Unfortunately, a lot of people missed the point entirely.
(We can, however, still disagree with the commenter that this "killed" semantic HTML. Fond of overstating things a bit?)
Comment by benatkin 5 days ago
Comment by cluckindan 5 days ago
HTML was explicitly designed to semantically represent scientific documents. [1]
”HTML documents represent a media-independent description of interactive content. HTML documents might be rendered to a screen, or through a speech synthesizer, or on a braille display. To influence exactly how such rendering takes place, authors can use a styling language such as CSS.” [2]
1: https://html.spec.whatwg.org/multipage/introduction.html#bac...
2: https://html.spec.whatwg.org/multipage/introduction.html#:~:...
Comment by Theodores 5 days ago
I would be delighted if they could do better than that, with figcaptions as well as figures, and sections 'scoped' with just one <h2-6> heading per section. They could specify how it really should be done, the HTML way, with a well defined way of doing the abstract and getting the cited sources to be in semantic markup yet not in some massive footer at the back.
There should also be a print stylesheet so that the paper prints out elegantly on A4 paper. Yes, I know you can 'print to PDF' but you can get all the typesetting needed in modern CSS stylesheets.
Furthermore, they need to write a whole new HTML editor that discards WYSIWYG in favour of semantic markup. WYSIWYG has held us back by decades as it is useless for creating a semantic document. We haven't moved on from typewriters and the conventions needed to get those antiques to work, with word processors just emulating what people were used to at the time. What we really need is a means to evolve the written word, so that our thinking is 'semantic' when we come to put together documents, with a 'document structure first' approach.
LaTeX is great, however, last time I used it was many decades ago, when the tools were 'vi' (so not even vim) and GhostScript, running on a Sun workstation with mono screen. Since then I have done a few different jobs and never have I had the need to do anything in LaTeX or even open a LaTeX file. In the wild, LaTeX is rarer than hen's teeth. Yet we all read scientific papers from time to time, and arXiv was founded on the availability of TeX files.
The lack of widespread adoption of semantic markup has been a huge bonus to Google and other gatekeepers that have the money to develop their own heuristics to make sense of 'seas of divs'. As it happens, Google have also been somewhat helpful with Chrome and advancing the web, even if it is for their gatekeeping purposes.
The whole world of gatekeeping is also atrocious in academia. Knowledge wants to be free, but it is also big business to the likes of Springer, who are already losing badly to open publishing.
As you say, in this instance, accessibility means screen readers, however, I hope that we can do better than that, to get back to the OG Tim Berners Lee vision of what the web should be like, as far as structuring information is concerned.
Comment by dginev 5 days ago
Comment by dimal 5 days ago
And as another commenter has pointed out, HTML does exactly what you ask for. If it’s done correctly, it doesn’t contain font sizes or layout. Users can style HTML differently with custom CSS.
Comment by billconan 5 days ago
HTML was a digital format, but it wanted to be a generic format for all document types, not just papers, so it contains a lot of extras that a paper format doesn't need.
for research papers, since they share the same structure, we can further separate content from rendering.
for example, if you want to later connect a paper with an AI, do you want to send <div class="abstract"> ... ?
or do some nasty heuristic to extract the abstract? like document.getElementsByClassName("abstract")[0] ?
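To make the class-name heuristic above concrete outside a browser, here is a minimal Python sketch using only the standard library's `html.parser`. The names `AbstractExtractor` and `extract_abstract` are invented for this example, and it assumes the markup is well nested (every non-void tag inside the abstract has a closing tag) and that a class named `abstract` is present at all, which is exactly the fragility the comment is pointing at.

```python
from html.parser import HTMLParser


class AbstractExtractor(HTMLParser):
    """Collect the text of the first element whose class list includes 'abstract'."""

    # Void elements never get a closing tag, so they must not affect the depth count.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the matched element (0 = outside)
        self.done = False   # True once the first match has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
        elif not self.done:
            classes = (dict(attrs).get("class") or "").split()
            if "abstract" in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_abstract(html: str) -> str:
    """Return the abstract text with runs of whitespace collapsed, or '' if none is found."""
    parser = AbstractExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

For example, `extract_abstract('<div class="abstract"><p>We study <i>X</i>.</p></div>')` yields the plain text of the abstract, while a page that labels its abstract differently yields an empty string.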
Comment by simonw 5 days ago
Comment by m-schuetz 5 days ago
Comment by bob1029 5 days ago
I disagree. PDF is the most desirable format for printed media and its analogues. Any time I plan to seriously entertain a paper from Arxiv, I print it out first. I prefer to have the author's original intent in hand. Arbitrary page breaks and layout shifts that are a result of my specific hardware/software configuration are not desirable to me in this context of use.
Comment by ACCount37 5 days ago
In research and in embedded hardware both, I've met some people who had entire stacks of papers printed out - research papers or datasheets or application notes - but also people who had 3 monitors and 64GB of RAM and all the papers open as browser tabs.
I'm far closer to the latter myself. Is this a "generational split" thing?
Comment by pfortuny 5 days ago
Comment by s0rce 5 days ago
Comment by afavour 5 days ago
Comment by billconan 5 days ago
<div class="abstract-container">
<div class="abstract">
<pre><code> abstract text ... </code></pre>
</div>
<div class="author-list">
<ol>
<li>author one</li>
<li>author two</li>
</ol>
</div>
should be just:
[abstract]
abstract text
[authors]
author one | email | affiliation
author two | email | affiliation
Comment by afavour 5 days ago
But you could still use HTML. Elements with a dash in are reserved for custom elements (that is, a new standardised element will never take that name) so you could do:
<paper-author-list>
<paper-author />
</paper-author-list>
And it would be valid HTML. Then you’d style it with CSS, with
paper-author {
display: list-item;
}
And so on.
Comment by bawolff 5 days ago
Comment by afavour 5 days ago
Comment by bawolff 5 days ago
Comment by afavour 5 days ago
If you distribute the paper as XML with an XSLT transform you need to run something that’ll perform that transform before you can read the paper. No matter whether that transform happens on the server or on the client it’s still an extra complication in the flow of sharing information.
Comment by xworld21 5 days ago
Comment by panzi 5 days ago
Comment by kevindamm 5 days ago
Comment by chr15m 5 days ago
Comment by teddy-smith 5 days ago
All papers should be in HTML/CSS or TeX then just simply converted to PDF.
Why are we even talking about this?
Comment by tefkah 5 days ago
The problem is having the submissions be in TeX and converting that to HTML, when the only output has been PDF for so long.
The problem isn’t converting HTML to PDF, it’s making available a giant portion of TeX/pdf only papers in HTML.
If you’re arguing that maybe TeX then shouldn’t be the source format for papers then I agree, but other than Typst (which also isn’t perfect about HTML output yet) there aren’t that many widely accepted/used authoring formats for physics/math papers, which is what arXiv primarily hosts.
Comment by teddy-smith 5 days ago
Comment by crazygringo 5 days ago
HTML doesn't support the necessary features. Citations in various formats, footnotes, references to automatically numbered figures and tables, I could go on and on.
HTML could certainly be extended to support those, but it hasn't been. That's why we're talking about this.
Comment by teddy-smith 5 days ago
Comment by crazygringo 5 days ago
It doesn't really matter if HTML/CSS is more powerful at a hundred other layout things, if it doesn't provide the absolute necessary features for papers.
Comment by teddy-smith 4 days ago
> https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/...
> https://codepen.io/tag/citation
footnotes
>https://codepen.io/SitePoint/pen/QbMgvY
references to automatically numbered figures and tables
> https://stackoverflow.com/questions/25869906/table-auto-numb...
Comment by crazygringo 2 days ago
Citations need to generate reference lists. Footnotes require automatic placement at the bottom of each page. Your examples of numbered tables are numbering the rows, not the tables. And figure numbers need to be referenced in the text.
None of what you're pointing to does what academic papers need. Why are you trying to push this agenda?
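The figure-numbering requirement described above (number figures automatically, then let the text refer to those numbers, including before the figure appears) amounts to a two-pass resolution. The following Python sketch illustrates the idea only; the `[fig:label]` and `[ref:label]` markers are a hypothetical syntax invented for this example, not something HTML, CSS, or LaTeX defines.

```python
import re


def resolve_figure_refs(text: str) -> str:
    """Number [fig:label] markers in order of appearance and resolve [ref:label] markers."""
    numbers = {}

    def define(match):
        label = match.group(1)
        # The first definition of a label fixes its number; repeats reuse it.
        numbers.setdefault(label, len(numbers) + 1)
        return f"Figure {numbers[label]}"

    # Pass 1: assign numbers to figure definitions in document order.
    numbered = re.sub(r"\[fig:([\w-]+)\]", define, text)

    def refer(match):
        label = match.group(1)
        # Unknown labels are rendered as '??' rather than raising an error.
        return f"Figure {numbers[label]}" if label in numbers else "Figure ??"

    # Pass 2: resolve references, which may point forward to later figures.
    return re.sub(r"\[ref:([\w-]+)\]", refer, numbered)
```

The second pass is what makes forward references work: a reference that appears before its figure can only be resolved once every figure has been numbered.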
Comment by ekjhgkejhgk 5 days ago
Comment by teddy-smith 5 days ago
Comment by benatkin 5 days ago
Either way it gets shoehorned.
Comment by carlosjobim 5 days ago
Comment by teddy-smith 5 days ago
Comment by carlosjobim 5 days ago
Comment by teddy-smith 4 days ago
Literally part of Mozilla's docs.
Comment by carlosjobim 4 days ago
Edit to clarify: The break-after property works with the worthless print dialogues, but doesn't function with "Export to PDF", which is what most people will want to use.
Comment by nkrisc 5 days ago
Comment by teddy-smith 5 days ago
Comment by nkrisc 4 days ago
Comment by lalithaar 5 days ago
Comment by rootnod3 5 days ago
Comment by xigoi 5 days ago
Comment by doc_ick 5 days ago
I also haven’t had good luck with images/graphs/custom tables in anything but Typst/LaTeX.
Comment by vatsachak 5 days ago
HTML rendering requires you to be connected to the internet, or setting up the images and MathJax locally. A PDF just works.
HTML obviously supports dynamic embedding, such as programs, much better but people just usually post a github.io page with the paper.
Comment by devnull3 5 days ago
Not really. One can always generate a self-contained html. Both CSS and JS (if needed) can be inline.
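A minimal sketch of what "self-contained" can mean in practice, in Python with only the standard library: the stylesheet goes into a `<style>` element and the image into a `data:` URI, so the page references no external files. The function name `self_contained_html` and its parameters are assumptions for illustration, and the sketch handles a single PNG image and no math rendering.

```python
import base64
import html


def self_contained_html(title: str, body_html: str, css: str, png_bytes: bytes) -> str:
    """Build one HTML document with its CSS and one PNG image embedded inline."""
    # Encode the image as a data URI so no separate file is fetched.
    image_uri = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title>\n'
        f"<style>{css}</style></head>\n"
        f"<body>{body_html}\n"
        f'<img alt="figure" src="{image_uri}"></body></html>\n'
    )


if __name__ == "__main__":
    page = self_contained_html("Demo", "<p>Hello</p>", "p { color: navy; }", b"\x89PNG")
    print(page)
```

The trade-off is file size: base64 encoding makes each embedded image roughly a third larger than the original binary.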
Comment by vatsachak 5 days ago
Comment by mmooss 5 days ago
Comment by nine_k 5 days ago
Comment by vatsachak 5 days ago
Comment by recursive 5 days ago