专栏名称: Java爱好者

分享android开发编程知识和相关技术应用

Excel大批量数据的导入和导出，如何做优化？

Java爱好者 · 公众号 · · 2021-01-27 09:48

正文

作者：Alben

链接：https://albenw.github.io/posts/d093ca4

概要

Java对Excel的操作一般都是用POI，但是数据量大的话可能会导致频繁的FGC或OOM，这篇文章跟大家说下如果避免踩POI的坑，以及分别对于xls和xlsx文件怎么优化大批量数据的导入和导出。

一次线上问题

这是一次线上的问题，因为一个大数据量的Excel导出功能，而导致服务器频繁FGC，具体如图所示

可以看出POI的对象以及相关的XML对象占用了绝大部分的内存消耗，频繁FGC说明这些对象一直存活，没有被回收。

原因是由于导出的数据比较大量，大概有10w行 * 50列，由于后台直接用XSSFWorkbook导出，在导出结束前内存有大量的Row，Cell，Style等，以及基于XLSX底层存储的XML对象没有被释放。

Excel的存储格式

下面的优化内容涉及Excel的底层存储格式，所以要先跟大家讲一下。

XLS

03版的XLS采用的是一种名为BIFF8(Binary-Interchange-File-Format)，基于OLE2规范的二进制文件格式。大概就是一种结构很复杂的二进制文件，具体细节我也不是很清楚，大家也没必要去了解它，已经被淘汰了。想了解的话可以看看Excel XLS文件格式

XLSX

07版的XLSX则是采用OOXML(Office Open Xml)的格式存储数据。简单来说就是一堆xml文件用zip打包之后文件。这个对于大家来说就熟悉了，把xlsx文件后缀名改为zip后，再解压出来就可以看到文件结构

打开sheet1.xml，可以看到是描述第一个sheet的内容

导出优化

事例源码基于POI3.17版本

XLSX

由于xlsx底层使用xml存储，占用内存会比较大，官方也意识到这个问题，在3.8版本之后，提供了SXSSFWorkbook来优化写性能。

官方说明

https://poi.apache.org/components/spreadsheet/how-to.html#sxssf

使用

SXSSFWorkbook使用起来特别的简单，只需要改一行代码就OK了。

原来你的代码可能是长这样的

Workbook workbook = new XSSFWorkbook(inputStream);

那么你只需要改成这样子，就可以用上SXSSFWorkbook了

Workbook workbook = new SXSSFWorkbook(new XSSFWorkbook(inputStream));

其原理是可以定义一个window size（默认100），生成Excel期间只在内存维持window size那么多的行数Row，超时window size时会把之前行Row写到一个临时文件并且remove释放掉，这样就可以达到释放内存的效果。

SXSSFSheet在创建Row时会判断并刷盘、释放超过window size的Row。

@Override
    public SXSSFRow createRow(int rownum)
    {
        int maxrow = SpreadsheetVersion.EXCEL2007.getLastRowIndex();
        if (rownum 0 || rownum > maxrow) {
            throw new IllegalArgumentException("Invalid row number (" + rownum
                    + ") outside allowable range (0.." + maxrow + ")");
        }

        // attempt to overwrite a row that is already flushed to disk
        if(rownum <= _writer.getLastFlushedRow() ) {
            throw new IllegalArgumentException(
                    "Attempting to write a row["+rownum+"] " +
                    "in the range [0," + _writer.getLastFlushedRow() + "] that is already written to disk.");
        }

        // attempt to overwrite a existing row in the input template
        if(_sh.getPhysicalNumberOfRows() > 0 && rownum <= _sh.getLastRowNum() ) {
            throw new IllegalArgumentException(
                    "Attempting to write a row["+rownum+"] " +
                            "in the range [0," + _sh.getLastRowNum() + "] that is already written to disk.");
        }

        SXSSFRow newRow=new SXSSFRow(this);
        _rows.put(rownum,newRow);
        allFlushed = false;
        //如果大于窗口的size，就会flush
        if(_randomAccessWindowSize>=0&&_rows.size()>_randomAccessWindowSize)
        {
            try
            {
               flushRows(_randomAccessWindowSize);
            }
            catch (IOException ioe)
            {
                throw new RuntimeException(ioe);
            }
        }
        return newRow;
    }

    public void flushRows(int remaining) throws IOException
    {
        //flush每一个row
        while(_rows.size() > remaining) {
            flushOneRow();
        }
        if (remaining == 0) {
            allFlushed = true;
        }
    }

    private void flushOneRow() throws IOException
    {
        Integer firstRowNum = _rows.firstKey();
        if (firstRowNum!=null) {
            int rowIndex = firstRowNum.intValue();
            SXSSFRow row = _rows.get(firstRowNum);
            // Update the best fit column widths for auto-sizing just before the rows are flushed
            _autoSizeColumnTracker.updateColumnWidths(row);
            //写盘
            _writer.writeRow(rowIndex, row);
            //然后把row remove掉，这里的_rows是一个TreeMap结构
            _rows.remove(firstRowNum);
            lastFlushedRowNumber = rowIndex;
        }
    }

我们再看看刷盘的具体操作

SXSSFSheet在创建的时候，都会创建一个SheetDataWriter，刷盘动作正是由这个类完成的

看下SheetDataWriter的初始化

public SheetDataWriter() throws IOException {
    //创建临时文件
    _fd = createTempFile();
    //拿到文件的BufferedWriter
    _out = createWriter(_fd);
}
//在本地创建了一个临时文件前缀为poi-sxssf-sheet，后缀为.xml
public File createTempFile() throws IOException {
    return TempFile.createTempFile("poi-sxssf-sheet", ".xml");
}

public static




    
 File createTempFile(String prefix, String suffix) throws IOException {
    //用一个策略去创建文件
    return strategy.createTempFile(prefix, suffix);
}

//这个策略就是在执行路径先创建一个目录（如果不存在的话），然后再在里面创建一个随机唯一命名的文件
public File createTempFile(String prefix, String suffix) throws IOException {
    // Identify and create our temp dir, if needed
    createPOIFilesDirectory();
    
    // Generate a unique new filename 
    File newFile = File.createTempFile(prefix, suffix, dir);

    // Set the delete on exit flag, unless explicitly disabled
    if (System.getProperty(KEEP_FILES) == null) {
        newFile.deleteOnExit();
    }

    // All done
    return newFile;
}

POI就是把超过window size的Row刷到临时文件里，然后再把临时文件转为正常的xlsx文件格式输出。

我们看看刷盘时写了什么，SheetDataWriter的writeRow方法

public void writeRow(int rownum, SXSSFRow row) throws IOException {
    if (_numberOfFlushedRows == 0)
        _lowestIndexOfFlushedRows = rownum;
    _numberLastFlushedRow = Math.max(rownum, _numberLastFlushedRow);
    _numberOfCellsOfLastFlushedRow = row.getLastCellNum();
    _numberOfFlushedRows++;
    beginRow(rownum, row);
    Iterator cells = row.allCellsIterator();
    int columnIndex = 0;
    while (cells.hasNext()) {
        writeCell(columnIndex++, cells.next());
    }
    endRow();
}

void beginRow(int rownum, SXSSFRow row) throws IOException {
    _out.write(");
    writeAttribute("r", Integer.toString(rownum + 1));
    if (row.hasCustomHeight()) {
        writeAttribute("customHeight", "true");
        writeAttribute("ht", Float.toString(row.getHeightInPoints()));
    }
    if (row.getZeroHeight()) {
        writeAttribute("hidden", "true");
    }
    if (row.isFormatted()) {
        writeAttribute("s", Integer.toString(row.getRowStyleIndex()));
        writeAttribute("customFormat", "1");
    }
    if (row.getOutlineLevel() != 0) {
        writeAttribute("outlineLevel", Integer.toString(row.getOutlineLevel()));
    }
    if(row.getHidden() != null) {
        writeAttribute("hidden", row.getHidden() ? "1" : "0");
    }
    if(row.getCollapsed() != null) {
        writeAttribute("collapsed", row.getCollapsed() ? "1" : "0");
    }
    
    _out.write(">\n");
    this._rownum = rownum;
}

void endRow() throws IOException {
    _out.write("\n");
}

public void writeCell(int columnIndex, Cell cell) throws IOException {
    if (cell == null) {
        return;
    }
    String ref = new CellReference(_rownum, columnIndex).formatAsString();
    _out.write(");
    writeAttribute("r", ref);
    CellStyle cellStyle = cell.getCellStyle();
    if (cellStyle.getIndex() != 0) {
        // need to convert the short to unsigned short as the indexes can be up to 64k
        // ideally we would use int for this index, but that would need changes to some more 
        // APIs
        writeAttribute("s", Integer.toString(cellStyle.getIndex() & 0xffff));
    }
    CellType cellType = cell.getCellTypeEnum();
    switch (cellType) {
        case BLANK: {
            _out.write('>');
            break;
        }
        case FORMULA: {
            _out.write(">");
            outputQuotedString(cell.getCellFormula());
            _out.write("");
            switch (cell.getCachedFormulaResultTypeEnum()) {
                case NUMERIC:
                    double nval = cell.getNumericCellValue();
                    if (!Double.isNaN(nval)) {
                        _out.write("");
                        _out.write(Double.toString(nval));
                        _out.write("");
                    }
                    break;
                default:
                    break;
            }
            break;
        }
        case STRING: {
            if (_sharedStringSource != null) {
                XSSFRichTextString rt = new XSSFRichTextString(cell.getStringCellValue());
                int sRef = _sharedStringSource.addEntry(rt.getCTRst());

                writeAttribute("t", STCellType.S.toString());
                _out.write(">");
                _out.write(String.valueOf(sRef));
                _out.write("");
            } else {
                writeAttribute("t", "inlineStr");
                _out.write(">);
                if (hasLeadingTrailingSpaces(cell.getStringCellValue())) {
                    writeAttribute("xml:space", "preserve");
                }
                _out.write(">");
                outputQuotedString(cell.getStringCellValue());
                _out.write("");
            }
            break;
        }
        case NUMERIC: {
            writeAttribute("t", "n");
            _out.write(">");
            _out.write(Double.toString(cell.getNumericCellValue()));
            _out.write("");
            break;
        }
        case BOOLEAN: {
            writeAttribute("t", "b");
            _out.write(">");
            _out.write(cell.getBooleanCellValue() ? "1" : "0");
            _out.write("");
            break;
        }
        case ERROR: {
            FormulaError error = FormulaError.forInt(cell.getErrorCellValue());

            writeAttribute("t", "e");
            _out.write(">");
            _out.write(error.getString());
            _out.write("");
            break;
        }
        default: {
            throw new IllegalStateException("Invalid cell type: " + cellType);
        }
    }
    _out.write("");
}

可以看到临时文件里内容跟xlsx的文件格式是保持一致的。

测试

本地测试使用SXSSFWorkbook导出30w行 * 10列内存使用情况

可以看出内存有被回收的情况，比较平稳。

XLS

POI没有像XLSX那样对XLS的写做出性能的优化，原因是：

官方认为XLS的不像XLSX那样占内存
XLS一个Sheet最多也只能有65535行数据

导入优化

POI对导入分为3种模式，用户模式User Model，事件模式Event Model，还有Event User Model。

用户模式

用户模式（User Model）就类似于dom方式的解析，是一种high level api，给人快速、方便开发用的。缺点是一次性将文件读入内存，构建一颗Dom树。并且在POI对Excel的抽象中，每一行，每一个单元格都是一个对象。当文件大，数据量多的时候对内存的占用可想而知。

用户模式就是类似用 WorkbookFactory.create(inputStream)，poi 会把整个文件一次性解析，生成全部的Sheet，Row，Cell以及对象，如果导入文件数据量大的话，也很可能会导致OOM。

本地测试用户模式读取XLSX文件，数据量10w行 * 50列，内存使用如下

事件模式

事件模式（Event Model）就是SAX解析。Event Model使用的方式是边读取边解析，并且不会将这些数据封装成Row，Cell这样的对象。而都只是普通的数字或者是字符串。并且这些解析出来的对象是不需要一直驻留在内存中，而是解析完使用后就可以回收。所以相比于User Model，Event Model更节省内存，效率也更。

但是作为代价，相比User Model功能更少，门槛也要高一些。我们需要去学习Excel存储数据的各个Xml中每个标签，标签中的属性的含义，然后对解析代码进行设计。

User Event Model

User Event Model也是采用流式解析，但是不同于Event Model，POI基于Event Model为我们封装了一层。我们不再面对Element的事件编程,而是面向StartRow，EndRow，Cell等事件编程。而提供的数据，也不再像之前是原始数据，而是全部格式化好，方便开发者开箱即用。大大简化了我们的开发效率。

XLSX

POI对XLSX支持Event Model和Event User Model

XLSX的Event Model

使用

官网例子：

http://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/examples/xssf/eventusermodel/FromHowTo.java

简单来说就是需要继承DefaultHandler，覆盖其startElement，endElement方法。然后方法里获取你想要的数据。

Excel大批量数据的导入和导出，如何做优化？

正文

概要

一次线上问题

Excel的存储格式

XLS

XLSX

导出优化

XLSX

使用

测试

XLS

导入优化

用户模式

事件模式

User Event Model

XLSX

XLSX的Event Model

使用

请到「今天看啥」查看全文