Displaying PDFs on the web is a common requirement, yet the community has long lacked a library that makes this easy for developers. pdf.js can display PDFs well enough, but it falls short when it has to be integrated into business systems. That kind of integration is a very common requirement in China, in both custom government systems and commercial applications.
Here are some common problems when using pdf.js in application systems:
- It is easy to integrate pdf.js into the web, but it is not very convenient to make pdf.js interact with the backend.
- The basic functions of the PDF reader are available, but it is difficult for developers to control these functions through code.
- The source code is relatively complex and difficult to read, modify and extend, and customization based on the source code is even more difficult.
- Documentation support is weak: the official documentation is not clear enough, and community documentation lacks depth and breadth (at least on the Chinese-language web).
As time goes on, solving these problems is becoming more pressing. I have helped several customers customize pdf.js: loading PDF fragments, reading and storing annotations through a database, building special-purpose annotations, and fixing bugs they hit during integration.
Over time, I talked with many of my clients and realized that most of the problems they face are shared ones. From a technical point of view, these can be solved through code reuse. The few remaining client-specific problems can be solved through plug-ins and custom code.
So I decided to develop an advanced version based on pdf.js, and I set two basic goals. The first goal is to strengthen the API and documentation so that developers can smoothly integrate pdf.js into the systems they are building. Front-end developers can use the API to operate the PDF reader, and back-end developers can persist information from the PDF reader and return it when the front end requests it. The second goal is to build features that the market needs but that are missing or incomplete, such as a richer annotation system, PDF permission management, image and text extraction, and custom watermarks.
But when I tried, I found the work both very difficult and very inefficient. I scanned the pdf.js source with a line-counting tool and found far more code than I expected: about 200,000 lines including blank lines and comments, and still more than 100,000 lines without them. I spent the first two to three weeks building on the existing code and ran into many more problems than I had anticipated. Even basic goals were hard to reach on that foundation. For example, when I tried to change the reader from continuous vertical scrolling to a page-turning style, I found that a lot of code had to be modified and that the changes affected other parts and introduced potential bugs. And even if I spent a long time building the features I wanted, I might still be unable to offer developers freely customizable interfaces and APIs. Even something as basic as the parameter types of functions is hard to convey to people who are willing to read the source.
After this short attempt, I concluded that these seemingly simple problems could not be solved with simple methods. In fact, if minor improvements could solve them all, I believe the job would have been done long ago, if not by the pdf.js team itself, then by other developers active in the community.
So I prepared to make some major changes. I planned to rewrite pdf.js in TypeScript, keeping its core logic for processing PDF documents and refactoring the code that displays PDFs on the web. The tool chain and development style should also keep up with the times, so I decided to drop npm and gulp and replace them with pnpm and vite. I also gave up the traditional single-module layout in favor of a multi-module monorepo.
I used to focus on the back end and only occasionally wrote front-end code, so I had no deep understanding of front-end development and knew very little about its ecosystem. But I do have a lot of experience refactoring medium and large projects. I have refactored business systems such as warehouse management systems, and big data systems such as knowledge graph systems. Refactoring is always challenging, and it comes with a painful period that is neither short nor easy. From the start of the refactoring until the system runs smoothly again, there is almost no output of new features for a long time, the bugs can be strange, and the difficulties come one after another. But once this period is over, the whole project begins to see the light. Bugs that could not be located before can now be located. Features that could not be built before can now be built. Development efficiency, once low, improves greatly. Mainstream components that could not be connected before can now be connected easily. Therefore, if we want to fully tap the potential of pdf.js and let developers integrate it cleanly into their daily work, a thorough refactoring is unavoidable.
I think pdf.js does not integrate well with business systems mainly because it was not designed specifically for the Chinese market. From a product perspective, it is positioned not as a library for developers to call, but as a tool for users to view PDFs. As a PDF viewing tool, pdf.js does a very good job. It has been integrated into many tools, such as VSCode, IDEA, and Firefox, so users of these tools can view PDFs easily. Ordinary computer users can open a PDF directly in Firefox, whose built-in viewer is pdf.js, without installing any other viewer. However, every tool has its strengths and weaknesses, and pdf.js cannot serve both end users and developers perfectly. Developers get the short end: it is possible to integrate pdf.js, but the result is not very satisfying.
When I refactored pdf.js, the first thing I had to do was redefine its positioning and goals: it serves developers, not end users directly. Developers should be able to open a PDF with simple code like the following and customize their PDF reader through configuration, rather than having to embed it through an iframe.
// Create the PDF reader.
const viewer = WebSerenViewer.init('app', {
  viewerScale: 0.7
});

// Open a PDF file, then wire up the paging buttons.
viewer.open({
  url: 'compressed.tracemonkey-pldi-09.pdf',
  verbosity: VerbosityLevel.WARNINGS
}).then(() => {
  const controller = viewer.getViewController();
  bindEvents(controller);
});

// Bind the previous-page and next-page APIs to the buttons.
function bindEvents(controller: WebViewerController) {
  document.getElementById("pdf-page-up")?.addEventListener("click", () => {
    controller.pageUp();
  });
  document.getElementById("pdf-page-down")?.addEventListener("click", () => {
    controller.pageDown();
  });
}
For any given feature, what I want to provide is not a specific button or icon, but the API, callbacks, and events that developers can work with.
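To make the idea of "events instead of buttons" concrete, here is a minimal sketch of a controller that reports page changes through typed events. Everything in it is an assumption for illustration: `SketchController`, the `ViewerEvents` map, and the event names are hypothetical and are not the actual seren-pdf API.

```typescript
// Hypothetical event names and payload shapes, for illustration only.
interface ViewerEvents {
  pagechange: { pageNumber: number };
  scalechange: { scale: number };
}

type Listener<T> = (payload: T) => void;

class SketchController {
  private readonly listeners = new Map<keyof ViewerEvents, Set<Listener<unknown>>>();
  private readonly pagesCount: number;
  private pageNumber = 1;

  constructor(pagesCount: number) {
    this.pagesCount = pagesCount;
  }

  // Subscribe to an event. Returns a function that removes the listener.
  on<K extends keyof ViewerEvents>(event: K, listener: Listener<ViewerEvents[K]>): () => void {
    // The public signature is typed per event; the payload type is erased only internally.
    const wrapped: Listener<unknown> = payload => listener(payload as ViewerEvents[K]);
    const registered = this.listeners.get(event) ?? new Set<Listener<unknown>>();
    registered.add(wrapped);
    this.listeners.set(event, registered);
    return () => {
      registered.delete(wrapped);
    };
  }

  pageDown(): void {
    this.goTo(this.pageNumber + 1);
  }

  pageUp(): void {
    this.goTo(this.pageNumber - 1);
  }

  private emit<K extends keyof ViewerEvents>(event: K, payload: ViewerEvents[K]): void {
    this.listeners.get(event)?.forEach(listener => listener(payload));
  }

  // Clamp to the valid page range and notify listeners only on a real change.
  private goTo(target: number): void {
    const clamped = Math.min(Math.max(target, 1), this.pagesCount);
    if (clamped === this.pageNumber) {
      return;
    }
    this.pageNumber = clamped;
    this.emit("pagechange", { pageNumber: clamped });
  }
}

// Usage: the application decides what a page change means for its own UI.
const sketchController = new SketchController(3);
const seen: number[] = [];
const off = sketchController.on("pagechange", e => seen.push(e.pageNumber));
sketchController.pageDown(); // page 2
sketchController.pageDown(); // page 3
sketchController.pageDown(); // already on the last page, no event
off();
sketchController.pageUp();   // listener removed, nothing recorded
```

With this shape, the library ships no page indicator of its own; a developer subscribes to the event and updates whatever UI the host application uses.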
For developers who want to debug, customize, or contribute to the source code, I should provide a mainstream tool chain, clearly typed parameters, and clear call relationships, so that they can understand, debug, and modify the relevant code at low cost.
After more than 100 days of work, I have taken several key steps in this direction. The main language and tools are now TypeScript, pnpm, and vite, and the code is organized as a multi-module monorepo.
The parameter types of all functions are now explicit, and hard-coded values have been eliminated step by step. The overall code quality has improved to a certain extent, and the most basic demo now runs.
In this demo, I used the code shown earlier: the 'init' method creates a PDF reader, the 'open' method opens a PDF file, and the previous-page and next-page APIs are bound to the corresponding buttons.
The whole process was full of challenges. When I wrote a bash script that crudely renamed every file ending in js to ts, the compiler immediately reported more than 20,000 errors. For nearly two months, I fixed these errors one by one, like Yugong, the old man of the fable, moving mountains. Fixing one error could make several new ones appear. For example, after I declared the parameter types of a function, previously hidden errors such as "The object has no xx attribute" and "The object may be null" surfaced as well. So the actual number of errors I fixed is more than 20,000.
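The cascade described above can be shown with a small sketch. The `PageInfo` shape and the `describePage` function are hypothetical and are not taken from the pdf.js source; the compiler messages in the comments are paraphrased rather than quoted.

```typescript
// Hypothetical shape for illustration; not taken from the pdf.js source.
interface PageInfo {
  rotate: number;
  label: string | null; // a page may have no label
}

// As untyped JavaScript, the following was accepted without complaint:
//
//   function describePage(page) {
//     return page.label.toUpperCase() + "@" + page.rotation;
//   }
//
// Once the parameter is declared as `page: PageInfo`, two hidden problems surface:
//   - the property `rotation` does not exist on PageInfo (the field is `rotate`)
//   - `page.label` may be null, so calling a method on it is unsafe
//
// The typed version has to deal with both explicitly.
function describePage(page: PageInfo): string {
  const label = page.label?.toUpperCase() ?? "(unlabelled)";
  return `${label}@${page.rotate}`;
}

const labelled = describePage({ rotate: 90, label: "iv" });
const unlabelled = describePage({ rotate: 0, label: null });
```

One type annotation produced two new errors here, which is why the total count kept growing while I was fixing the first 20,000.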
While fixing the errors, I also found many problems such as sloppy code and tight coupling. For example, some code passed arguments whose types did not match the JSDoc description, purely for convenience; reflection was used freely, which broke the reference chain between pieces of code; values were hard-coded; call relationships were confusing. I changed all of these. Where the code was written in a procedural style, I also changed it to an object-oriented one. Here is the code before the rewrite:
handler.on("GetPage", function (data) {
  return pdfManager.getPage(data.pageIndex).then(function (page) {
    return Promise.all([
      pdfManager.ensure(page, "rotate"),
      pdfManager.ensure(page, "ref"),
      pdfManager.ensure(page, "userUnit"),
      pdfManager.ensure(page, "view"),
    ]).then(function ([rotate, ref, userUnit, view]) {
      return {
        rotate,
        ref,
        refStr: ref?.toString() ?? null,
        userUnit,
        view,
      };
    });
  });
});
This code listens for the "GetPage" event, but the type of the event's parameter (data in function(data)) is unknown, and working it out takes a long time. When the event arrives, pdfManager reads the rotate getter on page through a hard-coded string and reflection. Code like this is hard to modify and prone to problems. For example, if someone renames rotate in page, this call site is easily missed and then fails at runtime.
After the move to TypeScript, every parameter is typed, the calls are explicit, and the link between page and rotate is checked by the compiler:
handler.onGetPage(async data => {
  return pdfManager!.getPage(data.pageIndex).then(async page => {
    return Promise.all([
      pdfManager!.ensure(page, page => page.rotate),
      pdfManager!.ensure(page, page => page.ref),
      pdfManager!.ensure(page, page => page.userUnit),
      pdfManager!.ensure(page, page => page.view),
    ]).then(([rotate, ref, userUnit, view]) => {
      return { rotate, ref, refStr: ref?.toString() ?? null, userUnit, view };
    });
  });
});
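The difference between the two styles can be reduced to a small, self-contained sketch. The signatures below are assumptions for illustration: `SketchPage`, `ensureByName`, `ensure`, and `SketchHandler` are hypothetical stand-ins, and the real pdf.js and seren-pdf implementations do more than this.

```typescript
// All names below are hypothetical stand-ins, not the real pdf.js or seren-pdf types.
interface SketchPage {
  rotate: number;
  userUnit: number;
}

interface GetPageRequest {
  pageIndex: number;
}

// String-keyed lookup: "rotate" is only data, so renaming the property on
// SketchPage would not cause a compile error at this call site.
async function ensureByName(target: object, prop: string): Promise<unknown> {
  return (target as Record<string, unknown>)[prop];
}

// Accessor-based lookup: the lambda is type-checked against T and its return
// type flows into the promise, so a rename breaks the build instead of the app.
async function ensure<T, R>(target: T, accessor: (value: T) => R): Promise<R> {
  return accessor(target);
}

// A named registration method fixes the payload type at the call site,
// instead of leaving `data` untyped behind a string event name.
class SketchHandler {
  private getPage?: (data: GetPageRequest) => Promise<unknown>;

  onGetPage(callback: (data: GetPageRequest) => Promise<unknown>): void {
    this.getPage = callback;
  }

  // Simulates an incoming "GetPage" message.
  dispatchGetPage(data: GetPageRequest): Promise<unknown> {
    if (!this.getPage) {
      return Promise.reject(new Error("no GetPage handler registered"));
    }
    return this.getPage(data);
  }
}

const firstPage: SketchPage = { rotate: 90, userUnit: 1 };
const sketchPages: SketchPage[] = [firstPage];

const sketchHandler = new SketchHandler();
sketchHandler.onGetPage(async data => {
  const page = sketchPages[data.pageIndex];
  if (!page) {
    throw new Error(`page ${data.pageIndex} does not exist`);
  }
  const [rotate, userUnit] = await Promise.all([
    ensure(page, p => p.rotate),   // Promise<number>
    ensure(page, p => p.userUnit), // Promise<number>
  ]);
  return { rotate, userUnit };
});

// A misspelled name still compiles in the string-keyed style and resolves to undefined.
const misspelled = ensureByName(firstPage, "rotation");
// The accessor style rejects the same mistake at compile time:
//   ensure(firstPage, p => p.rotation);
```

The design choice is the same in both helpers: replace a string that the compiler cannot see through with a function that it can.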
There are many more changes like this. I will discuss them in detail and share my experience in later posts.
After I completed the basic refactoring, the code compiled but could not run: as soon as it started, it reported many runtime errors. My earlier work had included splitting up some large files and methods, converting Record to Map in many places, and unifying null and undefined across the project, and all of this introduced a lot of bugs. Several of them took more than a day each to locate and fix. After fixing them, my refactored code finally ran. But that was not enough. To show readers the most basic case, the page-turning example above, I did a lot of work to extract parameters and APIs, so that a developer using this library can control the PDF reader directly through configuration and the API.
Although the code now runs, it is still a long way from being usable in production. I will spend more time testing the new code thoroughly to make sure that all the PDFs in the pdf.js test set behave the same on the old and new code. The rewrite of the APIs and callback events is not yet complete, and these are essential for developers to control readers created with this library. I may not finish every item at once, but the key ones must be done well. Once all this work is complete, I will release the first version together with examples, code, and documentation, and make sure the library is efficient and easy to use.
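Two typical ways a Record-to-Map conversion can compile and still misbehave at runtime are sketched below. The data is illustrative only and is not taken from the project.

```typescript
// Illustrative data only; the keys and values are hypothetical.
const recordLabels: Record<number, string> = { 1: "cover" };
const mapLabels = new Map<number, string>([[1, "cover"]]);

// A key that is a string at runtime. JSON.parse returns `any`, so the
// compiler cannot flag the lookups below.
const rawKey = JSON.parse('"1"');

// Object property keys are coerced to strings, so the Record lookup hits.
const viaRecord = recordLabels[rawKey];            // "cover"
// Map keys keep their type: the string "1" is not the number 1.
const viaMap = mapLabels.get(rawKey);              // undefined
const viaMapFixed = mapLabels.get(Number(rawKey)); // "cover"

// Code that counted entries with Object.keys silently reports 0 for a Map,
// because a Map has no own enumerable properties.
const recordCount = Object.keys(recordLabels).length; // 1
const staleCount = Object.keys(mapLabels).length;     // 0
const mapCount = mapLabels.size;                      // 1

// Map.get returns undefined for a missing key. If the project standardizes
// on null, the conversion has to be made explicit at the boundary.
function labelOrNull(pageNumber: number): string | null {
  return mapLabels.get(pageNumber) ?? null;
}
```

Neither problem is reported by the compiler when the key comes from an untyped source, which is consistent with bugs that only show up once the code runs.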
I have hosted the code for this project on GitHub:
https://github.com/xingshen24/seren-pdf
Debugging this library is relatively simple. Run pnpm recursive install to install the dependencies, then go to the packages-private/seren-viewer-develop/ directory and start it with pnpm run dev.