Question accessing FileBlob

Dileep D

I have a scheduled job that creates a file using IBlobFactory's CreateBlob method and then saves it under media documents using ContentRepository.Save() with SaveAction.Default.

Now I want to access this file to transfer it to a data lake system. Below is the code I am trying. The line that checks whether BinaryData is a FileBlob fails with an object reference error. This works locally but fails in the DXC environment. Is there anything I am missing in the process of accessing the file, or is this not the correct way to get a handle on the physical file to transfer it?

var file = this._contentLoader.Get<DataLakeDataFile>(this.DataFileContentReference);

if (file != null)
{
    if (file.BinaryData is FileBlob fileBlob)
    {
        var filePath = fileBlob.FilePath;
        DataLakeHelper.UploadFile(DataLakeDocumentReportETLSettings.DocumentReportBucketName, filePath);
        DataLakeReportsHelper.BroadcastAndLogInformation(this._loggingService, this._statusBroadcast, "File has been transferred.");
    }
    else
    {
        throw new Exception($"Error: data file was retrieved but it can't be used as a FileBlob.");
    }
}

#219256

Apr 01, 2020 0:13

Quan Mai

On DXC, assets are not FileBlob but AzureBlob, which is why your code doesn't run properly there. I suggest working against the abstraction (Blob) instead. I'm not familiar with DataLakeHelper, but it might have a method that takes a stream; you can then use blob.OpenRead() to supply that stream for the upload.
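
A minimal sketch of the stream-based approach, assuming the upload side can accept a Stream. The BlobTransfer class and the upload callback are hypothetical names introduced for illustration; only Blob.OpenRead() comes from the Episerver API.

```csharp
using System;
using System.IO;
using EPiServer.Framework.Blobs;

public static class BlobTransfer
{
    // Streams a blob to an upload callback without assuming a local file path.
    // 'upload' is a hypothetical stand-in for a stream-based upload method.
    public static void Upload(Blob blob, Action<Stream> upload)
    {
        if (blob == null) throw new ArgumentNullException(nameof(blob));
        if (upload == null) throw new ArgumentNullException(nameof(upload));

        // OpenRead() is declared on the Blob base class, so the same code
        // handles a FileBlob locally and an Azure-backed blob on DXC.
        using (Stream stream = blob.OpenRead())
        {
            upload(stream);
        }
    }
}
```

In the job above this could be called as BlobTransfer.Upload(file.BinaryData, s => DataLakeHelper.UploadStream(bucketName, s)), where UploadStream is an assumed overload that may not exist in your helper. The null check also surfaces a missing BinaryData explicitly instead of as an object reference error.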

#219668

Apr 01, 2020 7:32

valdis

And see if you can do it (upload to Data Lake) async.

#219698

Apr 01, 2020 19:41

Dileep D

Let me try these options. On the more basic question, could someone tell me whether the approach I am taking is sound?

I use a scheduled job which in turn uses Episerver Find to get content (around 25,000 items), writes it to a file in Azure blob storage, and then reads this file to transfer it to an AWS data lake using the Amazon SDK from NuGet.

I am not sure whether there is a preferred way to handle this kind of requirement. Episerver Find sometimes chokes when pulling that amount of content, and accessing the file from Azure blob storage is proving tricky.

#219720

Apr 02, 2020 2:56

valdis

I have no idea whether your approach is the optimal one, but I would reverse the responsibilities a bit and implement the following workflow:

the Episerver job accesses the Find index to get the items
then (depending on the size of your blobs, because queues have size limitations) I would create a new Azure Storage Queue (or Service Bus Queue) item describing the work item
then on the other side I would have some sort of trigger mechanism that reacts to the new queue item and does the processing -> Data Lake upload. This could be AWS Lambda functions, for example

It of course depends on the size of your blobs and whether you can embed the blob content in the queue item (queues usually have quite small item size limits). You might instead need to add just a reference to the blob using SAS tokens or a similar access option, or use the Queue Attachment plugin (https://www.nuget.org/packages/ServiceBus.AttachmentPlugin/), for example (if you are on .NET on the other side).

Why would I split this workflow? Because you get a few benefits out of the box:

there is a clean separation of responsibilities -> one side is the producer of work items, the other is the subscriber (the actual worker that performs actions on the item)
you get retry out of the box
you can scale by adding more workers on the other side (if the queue gets long enough). This could even be done automatically by auto-scaling Lambdas
you get nice statistics on the processing time of each item
you do not put the workload on the Episerver hardware, so the site can keep functioning properly (if the job runs on the same server)
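
The producer side of the workflow above might look roughly like the sketch below. It assumes the Azure.Storage.Queues and System.Text.Json NuGet packages; ExportWorkItem, ExportQueue and their members are hypothetical names, and the message carries only a reference to the blob because Azure Storage Queue messages are limited to 64 KB.

```csharp
using System.Text.Json;
using System.Threading.Tasks;
using Azure.Storage.Queues;

// Hypothetical work item: a reference to the blob, not its content.
public sealed class ExportWorkItem
{
    // Assumed to be a URL the worker can read, e.g. one carrying a read-only SAS token.
    public string BlobUri { get; set; }

    // Assumed name of the target bucket on the data lake side.
    public string TargetBucket { get; set; }
}

public static class ExportQueue
{
    // Serializes the work item to JSON and places it on the queue.
    public static async Task EnqueueAsync(string connectionString, string queueName, ExportWorkItem item)
    {
        var queue = new QueueClient(connectionString, queueName);

        // Creates the queue on first use; does nothing if it already exists.
        await queue.CreateIfNotExistsAsync();

        string body = JsonSerializer.Serialize(item);
        await queue.SendMessageAsync(body);
    }
}
```

The worker on the other side would deserialize the message, download the blob through the URI, and perform the upload. That part depends on the platform you choose, so it is left out here.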

#220544

Apr 02, 2020 9:41

- Apr 03, 2020 17:35

I second this approach, having just done a real-time data export project based on ServiceBus and Azure Functions. Almost the same flow, for the same reasons.